Because multicore architectures have become the industry standard, programming abstractions for concurrent programming are of key importance. Stream programming languages facilitate application domains characterized by regular sequences of data, such as multimedia, graphics, signal processing and networking. With stream programs, computations are expressed through independent actors that interact through FIFO data channels. A major challenge with stream programs is to load-balance actors among available processing cores. The workload of a stream program is determined by actor execution times and the communication overhead induced by data channels. Estimating communication costs on cache-coherent shared-memory multiprocessors is difficult, because data movements are abstracted away by the cache coherence protocol. Standard execution time profiling techniques cannot separate actor execution times from communication costs, because communication costs manifest in terms of execution time overhead. In this work we present a unified Integer Linear Programming (ILP) formulation that balances the workload of stream programs on cache-coherent multicore architectures. For estimating the communication costs of data channels, we devise a novel profiling scheme that minimizes the number of profiling steps. We conduct experiments across a range of StreamIt benchmarks and show that our method achieves a speedup of up to 4.02x on 6 processors. The number of profiling steps is on average only 17% of an exhaustive profiling run over all data channels of a stream program.