Speaker: Miaoqing Huang, Computer Science and Engineering, University of Arkansas.
Date: Monday, April 23rd, 2018, 5:00PM-6:00PM
Location: SCEN 322

Title: Performance Optimization on Intel Many-Integrated-Core Architectures Through Load Balancing

Abstract: Coprocessors based on the Intel Many-Integrated-Core (MIC) Architecture (branded as Xeon Phi) have been adopted in many high-performance computer clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC coprocessors to achieve parallelism. A supercomputer may provide a heterogeneous environment that includes both host Xeon CPUs and Xeon Phi coprocessors. Such an environment can suffer from load imbalance when the workload is distributed evenly between the host CPUs and the coprocessors even though they run at different speeds. MPI programs may synchronize frequently, so when the workload is evenly distributed, the faster ranks idle at each synchronization point while waiting for the slowest rank to finish. To achieve good performance, load imbalance has to be minimized in the heterogeneous environment. In this paper, we conduct a detailed study of the performance of different programming models on heterogeneous environments such as the Beacon computer cluster. Our findings are as follows.
(1) The native MPI programming model on the MIC coprocessors avoids the complexity of heterogeneous programming by running only on the Xeon Phi coprocessors. However, it does not take advantage of the computing capacity of the host Xeon CPUs.
(2) The symmetric programming model decomposes the workload into even partitions and distributes them among the host processors and the coprocessors. A load imbalance problem occurs when the faster ranks have to wait at the synchronization point for the slowest one.
(3) On top of the symmetric programming model, a hybrid model that launches multiple threads inside each MPI process on the MIC coprocessors can partially solve the load imbalance problem. Carefully adjusting the extent of multithreading inside those MPI processes improves the performance of parallel applications.
(4) The proposed smart data distribution model uses accurate performance profiling as the parameter for allocating input data to the different types of processors and coprocessors based on their computing capabilities, thereby achieving load balance.
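
The idea behind finding (4) can be illustrated with a minimal sketch. It is not the implementation from the talk. It assumes that each rank's computing capability has been profiled as a throughput (work items per second) and that input is allocated in proportion to that throughput. The function names and the sample throughput values are hypothetical.

```python
def proportional_split(total_items, throughputs):
    """Split total_items across ranks in proportion to profiled throughput.

    throughputs: hypothetical measured rates (items per second), one per rank.
    Returns a list of integer item counts that sums to total_items.
    """
    if total_items < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need a non-negative item count and positive throughputs")

    total_rate = sum(throughputs)
    # Ideal (fractional) share for each rank.
    shares = [total_items * t / total_rate for t in throughputs]
    counts = [int(s) for s in shares]

    # Hand out the items lost to rounding, largest fractional part first.
    leftover = total_items - sum(counts)
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def time_to_sync(counts, throughputs):
    """Time until every rank reaches the synchronization point.

    The slowest rank determines when all ranks can proceed.
    """
    return max(c / t for c, t in zip(counts, throughputs))


if __name__ == "__main__":
    # Hypothetical profile: one host rank three times as fast as one coprocessor rank.
    rates = [3.0, 1.0]
    even = [500, 500]
    balanced = proportional_split(1000, rates)
    print("even split:    ", even, time_to_sync(even, rates))
    print("balanced split:", balanced, time_to_sync(balanced, rates))
```

With the even split, the faster rank finishes early and idles until the slower rank arrives. With the proportional split, both ranks reach the synchronization point at about the same time. In an MPI program, counts of this kind could be passed to a variable-count scatter such as MPI_Scatterv.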
