"Parallel computing" is no longer a buzzword, it is synonymous with high-performance computing, it is practical. Parallel computers are here to stay. By interconnecting hundreds and thousands of the world's most advanced processors, teraflop computers will soon be a reality and they will be all out to confront the most complex problems and the grandest challenges. A notable example is the project by the U.S. Department of Energy (DOE) on building the world's first trillion-operations/second computer, which will be powered by more than 9000 Intel's Pentium Pro processors. Another project of similar scale, also involving the DOE, will use a vast number of IBM's RS/6000 processors to achieve a comparable performance. But parallel computing is not limited to massively parallel processing (MPP). Symmetric multiprocessing (SMP) is now a common trend in the servers market. There is the likelihood that before too long multiprocessing will reach even the desktop. Recent advances in high speed communication networks have enabled parallel computing on clusters of workstations.
The raw power of computers has kept on increasing by leaps and bounds, but human's ability to harness that power does not seem to be keeping up. We perhaps are too accustomed to solving problems sequentially, especially when using the computer. The gap must be bridged by advanced software. A huge amount of effort has been devoted by researchers worldwide to the development of software techniques for parallel computing. These researchers all share the common goal of making the use of parallel computers much less formidable and enabling the user to fully exploit the power of the parallel computer. One such essential software technique is {\em load balancing}, which is the subject of this book. Load balancing aims at improving the performance of parallel computers by equalizing the workloads of processors automatically during the execution of parallel programs.
This book is about load balancing in distributed memory message-passing parallel computers, also called multicomputers. Each processor has its own address space and has to communicate with other processors by message passing. In general, a direct, point-to-point interconnection network is used for the communications. Many commercial parallel computers are of this class, including Intel Paragon, TMC CM-5, and IBM SP2. This book presents a comprehensive treatment of the subject using rigorous mathematical analyses and practical implementations. Focus is on nearest-neighbor load balancing methods in which every processor at every step is restricted to balancing its workload with its direct neighbors only. Nearest-neighbor methods are iterative in nature because a global balanced state could be reached through processors' successive local operations. Since nearest-neighbor methods have a relatively relaxed requirement on the spread of local load information around the system, they are flexible in terms of allowing one to control the balancing quality, effective for preserving communication locality, and can be easily scaled in parallel computers with a direct communication network.
In the design and analysis of nearest-neighbor load balancing algorithms, two most important performance metrics are stability and efficiency. Stability measures the ability of the algorithm to coerce any initial workload distribution into a global balanced state in the static workload model and the ability to bound the variance of processors' workload in the dynamic workload model. Efficiency measures the time cost for arriving at the global balanced state or for reducing the variance to a certain level. The objective of this work is to try to design nearest-neighbor algorithms that have good stability and efficiency characteristics.
Two of the most well-known nearest-neighbor load balancing algorithms are the dimension exchange and the diffusion methods. With the dimension exchange method, a processor goes around the table, balancing workload with its nearest neighbors one at a time. With the diffusion method, a processor communicates simultaneously with all its nearest neighbors in order to reach a local balance. These two methods are rigorously analyzed in this book, resulting in optimal tunings of the methods for a number of popular interconnection networks. On the practical side, these two methods are implemented on multicomputers with different characteristics and evaluated in applications with different behaviors. The methods are shown to effective and efficient.
For other popular networks, the ring, the chain, the mesh, the torus and the k-ary n-cube, we derive the optimal exchange parameters in closed form and establish several important relationships between the efficiencies of these structures using circulant matrix theory. Based on these relationships, we conclude that the dimension exchange method favors high dimensional networks.
With the diffusion method, a processor balances its workload with those of its nearest neighbors all at the same time rather than one by one as in the dimension exchange method. Its efficiency is dependent on a diffusion parameter, which characterizes the behavior of a local balance operation. We analyze the diffusion method using circulant matrix theory and derive the optimal values for the diffusion parameter for the k-ary n-cube and its variants. Through statistical simulation, we show significant improvements due to the optimal exchange and the diffusion parameters.
Furthermore, we analyze the dimension exchange and the diffusion method in different workload models and system characteristics. We show that the optimally-tuned dimension exchange algorithm outperforms the diffusion method in both one-port and all-port communication models in achieving a global balanced state. The strength of the diffusion method is in load sharing (i.e., keeping all processors busy but not necessarily balancing their loads) in the all-port communication model.
The last application is parallel combinatorial optimizations. We experiment with the dimension exchange and the diffusion methods for distributing dynamically generated workloads at run-time. Their performance is evaluated in the solution of set partitioning problems on two distributed memory parallel computers. It is found that both methods lead to an almost linear speedup in a system with 32 processors and a speedup of 146.8 in a system with 256 processors. These two methods give the best results among all the methods we tried.
The authors are very grateful to Burkhard Monien, Ralf Diekmann, Reinhard Lueling and Stefan Tschoeke for their valuable comments and contributions to both the theoretical and the experimental aspects of this research. Thanks also go to Erich Koester and other associates of AG-Monien group for their considerate arrangements in both academic and non-academic affairs while the first author was in Germany.
Many other people have contributed to this book. We thank Professor Kai Hwang for the foreword, and Professors Dimitri P. Bertsekas, Tony Chan, Vipin Chaudhary, Henry Cheung, Francis Chin, Andrew Choi, Georege Cybenko, F. Meyer auf der Heide, David Nassimi, and Loren Schwiebert for their valuable inputs at various stages of this project. Special thanks go to our families who had suffered through many long nights of being neglected. Their love, patience and support had meant a lot to us. To jiwen who had assisted us wholeheartedly from beginning to end, we are greatly indebted.