Beowulf Distributed Computing and Applications Performance
Martyn F. Guest, Barry G. Searle, and Steve J. Andrews Quantum Chemistry Group, Daresbury Laboratory, Warrington WA4 4AD p.sherwood@dl.ac.uk, and h.j.j.vandam@dl.ac.uk
ContentsDaresbury Laboratory Beowulf Projects * SPEC Benchmarks * Computational Chemistry Benchmarks * Communications Performance Benchmarks * Metrics used in the Performance Evaluation of Beowulf Clusters * Performance Analysis Tools * Applications Performance: NWChem * Applications Performance: GAMESS-UK DFT, MP2 and 2nd Derivatives * Applications Performance: GAMESS-UK DFT Coulomb Module * Applications Performance: DL_POLY * Applications Performance: CHARMM * Applications Performance: CASTEP, CPMD and CRYSTAL * Applications Performance: ANGUS * Applications Performance: FLITE3D * Applications Performance: SUMMARY * Daresbury Laboratory Beowulf ProjectsS.J. Andrews, M.F. Guest and B.G. SearleNetworks of personal computers (so called Beowulf systems) composed of fast PCs configured with large quantities of RAM and hard disk, and running the Linux operating system are becoming more and more attractive as cheap and efficient platforms for distributed applications [1]. The main drawback of a standard Beowulf architecture is the poor performance of the conventional inter-process communication mechanisms based on RPC, sockets, TCP/IP, Ethernet. Such standard mechanisms are thought to perform poorly both in terms of throughput and message latency. Nevertheless, there is increasing interest in the use of commodity ‘off-the-shelf’ components as building blocks for high-performance computing. This is evident in many areas. An example within the UK academic community is provided by the latest round of the Joint Research Equipment Initiative (JREI'2001). This included equipment requests for more commodity-based clusters than proprietary SMP-based solutions (e.g. Origin 3400), in stark contrast to previous rounds of such Initiatives. To investigate the potential of this class of system, a number of these "Beowulf-class" prototype systems have been assembled and evaluated at Daresbury [2]. The programme of evaluation is focused on both system software and hardware, and on assessing the delivered performance across a broad spectrum of end-applications. The programme aims to inform the community over the wide variety of available options, from choice of CPU (Alphas, single- and dual-Pentiums, caches, Xeon or old memory subsystems) to choice of interconnect (Ethernet, Myrinet, Scali SCI, QsNet etc.). This report provides an update on progress, with a particular focus on application performance. As our assessment also aims to position Beowulf systems on the HPC landscape, we also take the opportunity to present application results on the following proprietary high-end solutions:
Following an overview of the commodity-based systems currently under assessment, we consider in subsequent articles the results of a number of benchmarks designed to assess the building blocks of any cluster, namely (i) serial node CPU performance, and (ii) communications benchmarks across the variety of possible cluster interconnects. The major focus of these subsequent reports is, however, on applications and we present performance comparisons between Beowulf systems, the CSAR Cray T3E/1200E, and the proprietary systems outlined above. Applications considered include those from computational chemistry (NWChem [3], GAMESS-UK [4], DL_POLY [5] and CHARMM [6]), computational materials (CRYSTAL [7], and the Car Parrinello codes, CASTEP [8] and CPMD [9]), and computational engineering (ANGUS [10] and FLITE3D). Evaluation has focused on both PC- and Alpha EV6-based platforms. The former includes Intel's Pentium III, Pentium 4 and Itanium CPUs, and AMD's Athlon processors. The Alpha-based platforms include DS10, DS20, ES40, XP-1000 and DPC264 machines, from 466 MHz to industry-leading 1000 MHz EV68 models. Various system and resource management packages (Lobosq, Sychron, Beowulf, Quadrics RMS and PBS) have been investigated, while several flavours of message passing software (MPICH, LAM6.2, LAM6.3, MPI-VIA and Shmem), compilers (Compaq, Absoft, PGI and GNU/g77) and numerical libraries (ATLAS, NASA, Intel MKL) have been tested. Implementations of the Global Array (GA) tools and parallel eigensolvers from PNNL have also been completed. As well as the choice of processor, a variety of networking options including fast Ethernet, Myrinet and more recently the low-latency, high bandwidth solutions, SCI from Dolphin and QsNet from Quadrics, have also been evaluated. The assessment of a variety of prototype commodity-based systems (CS) is now producing a wealth of data. Seven such systems have been used in the present study (CS0-CS6), four in-house (CS0-CS3), the three others though collaborative links to vendors and other Beowulf sites (CS4-CS6). The in-house Pentium-based machines (CS0 and CS1) have been used to benchmark a host of applications (see below), while CS2 (a Linux Alpha EV67-based UP2000/Quadrics system) and CS3 (an AMD Athlon-based Myrinet system) have both been installed during the past twelve months.
In addition to the in-house systems above, the following machines have been evaluated and used to benchmark applications: References [1] D. Ridge, D. Becker, P. Merkey, T. Sterling and P. Merkey, Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proceedings, IEEE Aerospace (1997). [2] Additional information on the Daresbury Beowulf Systems is available via World Wide Web URL http://www.cse.clrc.ac.uk/disco. [3] Additional information on NWChem and the Environmental Molecular Sciences Laboratory, is available via the Pacific Northwest Home Page at http://www.pnl.gov:2080/ [4] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick, K. Schoffel and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders and A.J. Stone. (http://www.dl.ac.uk/CFS) [5] see, http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/ [6] CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983), by B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. [7] CRYSTAL was jointly developed by the Theoretical Chemistry Group at the University of Torino and the Computational Materials Science group in CLRC. The program computes the electronic structure of periodic materials within Hartree Fock, density functional or various hybrid approximations (http://www.cse.clrc.ac.uk/Activity/CRYSTAL/). [8] CASTEP (CAmbridge Serial Total Energy Package) is the ab initio code for the solution of the electronic ground state of periodic systems with the wavefunctions expanded in a plane wave (PW) basis, using techniques based on density functional theory. M.C. Payne et. al., Rev. Mod. Phys. (1992) 64 p. 1045 (http://www.cse.clrc.ac.uk/Activity/UKCP/). [9] CPMD Version 3.3: Hutter, Alavi, Deutsh, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999). [10] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from an engineering perspective, Parallel Computing (1996). SPEC BenchmarksM.F. Guest The prototype commodity-based systems described in the preceding article are based on the Pentium, Athlon and Alpha CPUs as the node building block. The choice of optimal CPU (performance vs. cost) has been informed by an analysis of the serial performance of a wide variety of processors across a number of computational chemistry benchmarks. Part of the Distributed Computing support programme at Daresbury Laboratory involves an on-going comparison of a variety of different computer systems across a variety of application areas. We summarise below the results of such a comparison for the SPEC benchmark; benchmark comparisons in the area of computational chemistry are presented in the following article. One of the most useful indicator of CPU performance is provided by the SPEC (``Standard Performance Evaluation Corporation'') benchmarks. This benchmark suite [1] contains non-tuned application-based code to measure processor speed for both integer (SPECint) and floating point (SPECfp) arithmetic. SPECfp95 and SPECint95, and their successors, SPECfp2000 and SPECint2000, have become industry standards in measuring primarily the performance of a system's processor, memory architecture, operating system and compiler. The next generation of SPEC benchmarks, SPEC CPU2000 [2], has recently replaced SPEC95. CFP2000 is derived from the results of fourteen floating-point benchmarks compiled with aggressive optimization. It is the geometric mean of fourteen normalised ratios (one for each floating-point benchmark). CINT2000 is derived from the results of twelve integer benchmarks compiled with aggressive optimization. It is the geometric mean of twelve normalised ratios (one for each integer benchmark). Note that the level of optimisation is not mandated. While highly aggressive optimisation is permitted, results derived from benchmarks compiled with conservative optimisation (SPECfp_base2000) can be submitted. Table 1: SPEC CPU2000 - SPECfp and SPECint Values and Values Relative to the Compaq AlphaServer ES40/6-833.
A subset of the SPECfp2000 and SPECint2000 results for many of the leading CPUs are given in Table 1 (with the baseline system the Ultra 10 333MHz). In each case we have normalised the values relative to those of the Compaq AlphaServer ES40/6-833. An examination of the SPECfp2000 values of Table 1 shows that the leading five machines are from Compaq, with the A21264B 833 MHz CPU in the DS20E and ES40 Model 6/833, and the A21264C 1001 MHz CPU exhibiting ratings of 784, 777 and 756 respectively. Seventeen of the next nineteen entries feature either the Pentium 4 or Itanium/800 MHz CPUs. SPECfp figures for Intel's Pentium 4 range from 590 (1.4 GHz) to 714 (2.0 GHz), while the Itanium-based systems exhibit values from 703 (Dell Poweredge 7150, 800 MHz with 4MB L3 cache) to 623 (HP i2000, 733 MHz, with 2 MB L3 cache). Note that both Itanium values are in fact SPECfp2000_base and not SPECfp2000. The only non-Intel based CPU in this range is the 833MHz Alpha A21264An (a rating of 644 in the API/UP2000 6/833). All other CPUs exhibit SPECfp2000 ratings of less than 600 i.e. are a factor of at least 1.3 slower than the Compaq ES40 Model 6/833 and a factor of 1.2 slower than the 2.0 GHz Pentium 4. Leading machines from other vendors exhibit the following values:
Based on the normalised ratings of Table 1, we might expect the Compaq AlphaServer ES40/833 (100%) to marginally outperform the 2.0 GHz Pentium 4 (92%) and Itanium 800 (4MB L3, 90%; 2MB L3, 84%). These CPUs are followed by the API UP2000/833 (A21264A, 83%). Normalised ratings for the leading CPUs from other vendors outlined above are as follows 75% for the HP 9000 Model J6700/PA8700-750 and Compaq AlphaServer DS20E Model 6/667; 62% for the AMD 1.2 GHz Athlon, Tyan Thunder K7 and Sun Blade 1000 Model 1900/900MHz; 60% for the SGI Origin 3200/R14k-500 and 52% for the Origin 3200/R12k-400; 53% for IBM/s RS/6000 44P-170 (450 MHz); and 44% for the Dell Prec.WorkSt. 420 (1.0 GHz Pentium III). In terms of leading CPUS from each vendor, the poorest performer would appear to be the current offerings from IBM. Thus the 450 MHz CPU in the IBM RS/6000 44P-170 is a factor of 1.9 times slower than the AlphaServer ES40/833. Clearly the awaited power4 developments from IBM, with a likely SPECfp2000 figure of 1000+ for the initial 1.3 GHz CPU offering will address this situation. The leading 25 entries of Table 1 feature 7 systems from Compaq / API and five from Hewlett Packard (only 1 PA-RISC system). Eleven systems feature Intel's Pentium 4 CPU and six Intel's Itanium CPU. The leading 62 systems in the full Table achieve 50% of the performance of the Compaq AlphaServer ES40/6-833, with 27 of these featuring the commodity CPUs from Intel and AMD. The 2.0 GHz Pentium 4 CPU is seen to outperform the leading Pentium III-based solution by a factor of 2.1.
An examination of the SPECfp2000_base values [2] shows a somewhat different picture. The leading machines are no longer from Compaq, with the A21264B 833 MHz CPU in the DS20E 68/833 (SPECfp2000_base, 643) and ES40 Model 6/833 (621) now outperformed by a number of Pentium 4 (1.7 - 2.0 GHz) and Itanium/800 Mhz CPUs. SPECfp2000_base figures for Intel's Pentium 4 range from 581 (1.4 GHz) to 704 (2.0 GHz), while the Itanium-based systems exhibit values from 703 (Dell Poweredge 7150, 800 MHz with 4MB L3 cache) to 623 (HP i2000, 733 MHz, with 2 MB L3 cache). The only systems that do not feature an Intel-based CPU appearing in top 30 SPECfp2000_base entries include the alpha systems from Compaq and API (Compaq AlphaServer DS20E Model 68/833, ES40 Model 6/833, GS80, GS160 and GS320 Model 68/1001, DS20E Model 6/667 and API UP2000 833 MHz), plus the HP 9000 Model J6700 / PA8700-750 from Hewlett Packard. References [1] A SPEC FAQ describing the SPEC benchmark suite and the SPEC consortium is periodically posted to comp.benchmarks, and can be found on the WWW at: www.specbench.org/spec/faq. An excellent summary of the SPEC benchmarks that is periodically updated is available via anonymous ftp from: ftp.cs.toronto.edu in the file /pub/spectable. [2] http://www.specbench.org/osg/cpu2000/results/cpu2000.html Computational Chemistry BenchmarksM.F. Guest In the previous article we have discussed CPU performance on the general SPEC benchmarks. We summarise below performance in the area of computational chemistry, focusing on a benchmark suite that includes a set of twelve quantum chemistry calculations using the GAMESS-UK electronic structure program [1]. The comparison involves approximately one hundred and twenty computers, ranging from supercomputers to scientific workstations and Pentium, Athlon and Itanium-based PCs. Vector supercomputers used in this report include the NEC SX-5 and SX-4. A large number of workstations and workstation servers have been benchmarked, including the recent offerings from:
Note that the present results are taken from a more detailed report on computational chemistry benchmarks [2]. The GAMESS-UK Benchmark is designed to represent the typical range of calculations commonly performed by the ab initio quantum chemist. It includes 12 calculations that feature conventional- and direct-SCF, CASSCF and MCSCF, CI calculations (both direct-CI and conventional table-driven MRD-CI), MP2, and both SCF and MP2 analytic 2nd derivatives. Table 1: The GAMESS-UK Serial Benchmark: total CPU time (user and system), elapsed time (minutes), efficiency (%) and relative performance from both SPECfp2000 and GAMESS-UK.
(+) using the portland group compiler, pgf77 The data presented in Table 1 is collected under control of the UNIX time command, and includes CPU time (both user and system summed over all 12 calculations), total elapsed time and efficiency (measured as CPU versus elapsed). Also shown is the relative performance of each machine based on SPECfp2000 values and on the GAMESS-UK total CPU times, normalised to values for Compaq's ES40/833. These suggest that the dominant CPUs are the Alpha EV68 and EV67, together with the Athlon AMD K7. Five of the leading 10 machines feature the Alpha processor, and two the Athlon CPU. The fastest machine is the Compaq ES45/1000. The AlphaServer ES45 (3.7 mins.) is seen to outperform the Alpha ES40/833 (4.4 mins,) the AMD K7/1400 (5.4 mins.) and the API UP2000 6/833 (5.6 mins) by factors of 1.19, 1.46 and 1.52 respectively. The UP2000 exhibits similar user CPU times to the Alpha DS20E/667 (5.5mins.) and ES40/667 (5.6 mins.). These six machines are followed by the HP PA-9000/J6000-552 (6.1 mins.), the Pentium 4/1500 (6.2 mins.), the AMD K7/1200 and Compaq PW XP1000/667 (6.3 mins) and the API UP2000 6/667 (6.9 mins.). The power3-based IBM SP/WH2/375 (7.2 mins) and RS/6000 44P-270 (7.4 mins) follow, somewhat ahead of the AMD Athlon K7/1000 (pgf77, 7.8 mins) and PA8500-based HP PA-9000/J5000-440 (7.9 mins.). The latter shows comparable performance to the EV6/500; of the next ten machines with user CPU timings between 8-9 mins, six feature the EV6 processor. Non-alpha machines in this range include the SGI O3800/R14k-500 (8.0 mins.), Pentium III/1000 (8.2 mins.), and the Pentium 4/1400 and HP PA-9000/C3000-400 (8.4 mins.). The leading 20 machines are from Compaq (8), API (2), IBM (2), HP(2), and SGI (1), plus the leading CPUs from Intel (2) and AMD (3). The fastest machine from SUN (the SUN Blade 1000/M1750, 9.5 mins) is a factor of 2.2 times slower than the Compaq ES40/6-833. The 750 MHz UltraSPARC III is outperformed by the AMD K7/1000 (7.8 mins.), Pentium 4/1400 (8.4 mins.) and Pentium III/866 (9.3 mins.). The availability of Intel's MKL libraries on the Pentium 4/1500 (not on the 1400 MHz) accounts for the significant decrease in user time, from 8.4 to 6.2 mins. Twelve machines are found to lie within a factor of two of the fastest. A somewhat modified picture emerges when considering the system CPU and elapsed times. With the exception of the Alpha, and to a lesser extent the SGI R10k, most machines exhibit a system CPU time of the order of 10-15% of the user time; this percentage increases significantly, particularly on the AXP, to between 20-40% for the systems shown in the Table. The API UP/2000-833 shows an alarming increase, to a figure of 66%. Based on the summed CPU times, the Compaq ES45/1000 remains the optimum CPU (4.6 mins.).The ES40/6-833 (6.3) is now only marginally faster than the Pentium 4/1500 and the AMD K7/1400 (6.5), and only a factor of 1.1 times faster than the ES40/667 and HP J6000-552 (6.8 mins.). These five machines are followed by the AMD K7/1200 (7.2 mins.) and Compaq's DS20E/667 (7.4 mins.) and PW XP1000/667 (7.7 mins.), IBM's SP/WH2-375 and 44P-270 (8.2-8.4 mins.), and the AMD Athlon K7/1000, HP PA-9000/J5000 and API UP2000 6/833 (9.0-9.2 mins). Finally we note the performance of the Pentium- and AMD-based hardware; using the pgf77 compiler resulted in identical CPU times for the P4/1500 and AMD K7/1400 (6.5 mins); the 1 GHz Pentium III exhibits a CPU time of 9.4 minutes. While the AMD Athlon K7/1400 is 1.42 times slower than the AlphaServer ES45/1000, we note that SPECfp2000-based prediction would have led to a higher factor of 1.83. Use of the Portland Group pgf77 compiler produces a consistent level of performance improvement compared to g77, typically by a factor of between 1.13-1.17 for the Pentium III and AMD Athlon CPUs, while somewhat higher for the Pentium 4/1500 (a factor of 1.25). The relative performance figures of Table 1 suggest that the GAMESS-UK results are broadly in line with the SPECfp ratings, with the EV68- and EV67-based Compaq machines providing the optimum CPUs. However, this superiority is not as pronounced as might be expected from just a consideration of the SPECfp2000 rankings. These results, together with known costs, strongly suggest that the EV67 and EV68 Alpha processors from API / Compaq, together with the AMD Athlon and Pentium 4 CPUs provide the most likely building blocks for Beowulf systems. References [1] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick, K. Schoeffel and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, and A.J. Stone. The package is derived from the original GAMESS code due to M. Dupuis, D. Spangler and J. Wendoloski, NRCC Software Catalog, Vol. 1, Program No. QG01 (GAMESS), 1980. [2] M.F. Guest, Performance of Various Computers in Computational Chemistry, in Proceedings of the Daresbury Machine Evaluation Workshop, CLRC Daresbury Laboratory, November 2000. The associated MS powerpoint presentation is also available.
S.J. Andrews, M.F. Guest and B.G. Searle There are an increasing number of available options for the network connections in a Beowulf cluster. The proprietary options include Myrinet from Myricom Inc., SCI from Dolphin and QsNet from Quadrics Supercomputer World Ltd. The other choices are Fast Ethernet and Gigabit Ethernet which are available from most companies that produce network products. An example of the relative performance of possible combinations is given in Table 1 below. Here, we take the peak bandwidth and latencies we have measured on several clusters for various Fast Ethernet configurations and compare these with the interconnect performance of;
Table 1: Network interconnect performance; Latency (micro sec) and Bandwidth (MB/sec). Such point-to-point measurements are of course of limited value and provide no more than an indication of performance likely to be encountered in parallel applications. To provide a more quantitative assessment, we have adopted a number of benchmarks designed to provide a systematic evaluation of both point-to-point and collective operations. The PMB BenchmarkWe have used the MPI Parallel Communications benchmarks from Pallas (PMB, Pallas MPI Benchmarks [1]) across a variety of parallel hardware. PMB considers a wide variety of point-to-point communications e.g. PingPong, Sendrecv and Exchange) plus a number of MPI Collective Operations (Allreduce, Reduce, Reduce_scatter, Allgather, Allgatherv, Alltoall, Bcast and Barrier). Results for the former are reported in MBytes/sec, results for the latter as time (microsec) to complete. Each operation is run for a variety of message lengths (0 to 4194304 Bytes), with the collective operations performed for various combinations of the number of CPUs available. Systems evaluated to date include:
Full details of these benchmarks has been presented elsewhere [2]. To provide an example of the output, we show below a plot for the performance of MPI_allreduce on 16 CPUs of the machines identified above.
An Effective Bandwidth Benchmark, EFF_BWAs a spin off from PMB, a benchmark "EFF_BW" has been developed by Pallas to calculate the "effective bandwidth". For a given machine and processor number, one integral number is calculated which includes the performance for small and for large messages under participation of all available processors (see [1]). EFF_BW uses only the PMB PingPong Benchmark for measuring startup and throughput. We have adopted this benchmark [3], and present in Table 2 below the EFF_BW figures for 8 and 16 CPUs (together with reported Latency figures) measured on both High-end and the commodity-based hardware. References [1] http://www.pallas.de/pages/pmb.htm; PMB, a comprehensive set of MPI benchmarks written by Pallas, targeted at measuring important MPI functions: point-to-point message-passing, global data movement and computation routines, one-sided communications and file-I/O, [2] see http://www.dl.ac.uk/CFS/benchmarks/pmb [3] see http://www.hlrs.de/organization/par/services/models/mpi/b_eff
Table 2: Effective Bandwidth figures (MBytes/sec) and Ping-Pong Latency (micro sec) for a number of commodity systems and high-end parallel machines.
Metrics used in the Performance Evaluation of Beowulf ClustersS.J. Andrews and M.F. Guest Clusters of commodity processors, or Beowulf-class machines, are often considered as a serious alternative to classical HPC systems. While such clusters are substantially cheaper than the latter when compared on their peak performance, the sustained performance still needs to be demonstrated for large and communication intensive applications. As part of the Beowulf Project at Daresbury, we have undertaken a collaborative study with Pallas GmbH to provide greater insight and a more quantitative approach to the often qualitative debate over the role of commodity clusters as potential alternatives to proprietary high-end platforms. Approach: This has focused to date on an evaluation of the differences between a Pentium cluster with Fast Ethernet (the CS1 Pentium III/450 Cluster at Daresbury) and the Cray T3E/ 1200E, and has been extended to include the CS2 Alpha Linux Cluster with Quadrics interconnect. An advantage of including the Cray is that it gives the ability, through the pat tool, to count MFlops directly. In comparing the two classes of machine a number of criteria can be established. These include descriptive ones such as:
An understanding of these basic quantities then allows for performance prediction and cost modelling. Parallel efficiency losses can arise from high numbers of messages or global operations and / or high data volumes passed in these operations. To quantify this an intensity can be defined as a measure of frequency and volume of transfers. Thus, for any code run on a number of processes, a certain number of messages and global operations are performed and certain data volumes are processed in these operations. These then become machine independent code properties. Examples include message (I_msg = Nmess / MFlops), global operations (I_glo = Ng_ops / MFlops) and transfer (I_trans = overall_message_vol / MFlops) intensities. Sensitivity of a machine to these intensities is dependent on the efficiency factorisation. This involves breaking down the total time into several components (reflecting the different sources of the overheads) which may be achieved using Vampir [1]. By using a calibration program we have determined whether such intensities are performance critical or not and thus classify actual application codes as critical or non-critical on a particular architecture. Results: A cross-section of application codes have been considered including:
From the Beff benchmark [6] we find, not surprisingly, that the Cray bandwidth is almost 30 times better than the Intel/Ethernet combination (2.6 vs 68 MB/sec per CPU) with the Alpha / Qsnet a factor of 10 better (25.9 MB/s). In comparing the Intel and the Cray, we find that the former is well suited to these codes in terms of network losses and, in the main, CPU losses also. LU, however, is CPU-critical and benefits from well-tuned commercial libraries on the Cray. Considering the intensity sensitivities we find, for example, that when compared to both the Cray and Alpha machines the Intel clusters are highly sensitive to I_glo and I_trans but not to I_msg. What this means in practice is that the number of messages (as a proportion of the MFlops achieved) is generally not a critical factor in the performance of the Intel cluster: We can now use the intensity analysis to predict performance of these codes without explicitly running them on the target machine, estimating a suitable number of processors for that code and also the impact of, say, upgraded components such as CPU or network. We can use these predictions with a simple cost model that combines the pure performance of the code with the machine and project (e.g. personnel) costs incurred when a code is executed. We find that, in cost terms:
To take the most extreme example (LM), for between 4 and 32 processors, we can say that for a high priority case:
and for low priority:
We intend to expand on this work to study the performance of further codes and other platforms. A full paper detailing the methodologies used here, and further examples, may be found at http://www.cse.clrc.ac.uk/Activity/DISCO. References [1] http://www.pallas.com/pages/vampir.htm [4] http://www.netlib.org/scalapack/index.html [5] http://geomfem.tokyo.rist.or.jp [6] http://www.hlrs.de/organization/par/services/models/mpi/b_eff Performance Analysis ToolsM.F. Guest Considerable progress has been made in targeting end-user application software, involving the installation, benchmarking, and understanding of performance issues across a wide variety of parallel applications. With a focus on computational chemistry, applications include those from CCP1 (GAMESS-UK, CRYSTAL and CPMD, the base code for the new flagship project), NWChem, and REALC (chemical reaction dynamics software from CCP6). Other Car Parrinello codes include VASP and CASTEP, while molecular simulation software benchmarked includes DL_POLY (from CCP5) and CHARMM. Codes from other disciplines include ANGUS and FLITE3D (computational engineering) and the UK Met. Office's Unified Model code (climate modelling). In each case we have attempted to generate performance data on the prototype and evaluation commodity-based systems (CSx), and compared this with results from the CSAR Cray T3E/1200E and a variety of more recent high-end solutions. The latter includes the IBM SP/WH2-375, the Compaq AlphaServer SC (with both 667 and 833 MHz CPUs), the SGI Origin 3800 (both R12k-400 and R14k-500 CPUs) and a prototype of Cray's EV68-based Linux Alpha Cluster. In trying to quantify this comparison, we consider for each benchmark the percentage of a 32-node partition of the Cray T3E/1200E delivered by the range of commodity based systems (i.e. T32-nodeCray T3E / T32-nodeCSx). Prior to presenting these results in subsequent articles, we outline below use of the VAMPIR MPI instrumentation tool that has proved extremely useful in understanding a variety of performance issues associated with the benchmarks. It is much harder to debug and tune parallel programs than sequential ones with only one instruction stream given that the much larger state space and the necessary communication between processes greatly complicate the task of analysing the behaviour of such applications. Usually, a tracefile may be generated that uses up much of the available disk space but contains the answer to the performance problem somewhere within it. VAMPIR, the MPI Instrumentation Tool from Pallas GmbH, provides a mechanism for the programmer to obtain an overview about an execution trace quickly, allowing focus on important parts of the program execution. It converts the trace information into a variety of graphical views at finer and finer detail and in doing so can separate the algorithmic performance drawbacks from those due to the (network) architecture. The initial phase of a collaboration between CSE Department and Pallas GmbH involves the instrumentation of several MPI codes to explore the performance potential and limits of Beowulf clusters. It is intended to study both kernels and full applications including, (i) communication kernels (OCCOM, PMB), (ii) kernels (NAS), and, of course (iii) the end applications themselves (see previous article). We have previously illustrated the value of such traces in highlighting the difference in communication patterns and load, and hence scalability, in both macromolecular and Ewald simulations using the DL_POLY package (see subsequent article). A more recent, and more complex, application has been to apply VAMPIR to understand a number of performance issues in the GAMESS-UK code (see subsequent article on the application performance of the GAMESS-UK DFT Coulomb Module).
Figure 1. The Vampir trace of a single cycle of the SCF process in an 8-CPU DFT calculation of the Si8O25H18 zeolite cluster calculation showing the communication behaviour and state of each process as a function of time.
The trace of figure 1 is from the calculation of the DFT energy of the Si8O25H18 zeolite cluster (in a basis of 617 functions and 1444 fitting functions) run on 8 CPUs (two four-way Compaq Alpha ES40 nodes) connected by a quadrics switch. The first feature of the trace to note commences at the depicted timeline of around 2:20 and ends at 2:37 minutes. This corresponds to the quadrature in the Kohn-Sham energy and matrix evaluation, and would be expected to dominate the SCF cycle. The black lines denote the messages required to administer dynamic load balancing. The most notable feature however consists of the red and blue bars around two minutes and forty seconds. This actually corresponds to the first of four calls to the GAMESS-UK routine, GAMULT2, that is used to perform a similarity transformation (Q†HQ). The red bars represent MPI_Barrier calls waiting for the other processors to complete their work. The blue bars represent GA_Get operations waiting for data transfers to complete. The black lines represent the flow of data between processors. The trace shows that half of the processors are waiting for an MPI barrier to complete while the other half are mostly waiting for single-sided get operations to complete. The other interesting feature is that there are multiple instances where all get operations try to access data from a single processor at the same time. The key point to note here is that there are in fact 3 other calls to the same routine, all of which require far less time to complete, all contained under the solid black bars to the right of the trace. Although we have still to quantify the cause of this behaviour, the trace shown above illustrates how the instrumented GA can be used to identify performance bottlenecks and how it does highlight communication patterns that may cause such bottlenecks. This ability to quantify the causes of performance differentials on various machines continues to be vital in code optimisation and in u nderstanding the key metrics in particular application / platform combinations
Applications Performance: NWChemM.F. Guest In a previous report we have described the functionality within the NWChem package [1] and illustrated the scaling of the DFT module. This involved calculations of a number of different fragment models of a zeolite cluster conducted on the 512 processor IBM SP/P2SC-120 at the EMSL's Molecular Sciences Computing Facility (MSCF). These fragments ranged in size from Si8O7H18, with 347 basis functions and 832 CD fitting functions, to Si28O67H30, with 1687 basis functions and 3928 fitting functions. With the assistance of PNNL’s Jarek Nieplocha and Eduoardo Apra, we have recently completed an initial port of the NWChem package to both the Pentium and Alpha Beowulf Clusters, and benchmarked these systems using the same Zeolite fragments. Total elapsed times on the Cray T3E/1200E, Compaq AlphaServer SC, and commodity clusters are given in Table 1. The super-linear speed-ups observed at high processor counts for the larger fragments on both the Cray T3E/1200E and Alphaserver SC arises from the increased availability of memory. In such cases, the 3c2e-integrals are held entirely in memory; at lower processor count a fraction of these integrals must be re-computed on each iterative cycle of the SCF. Considering the total times to solution on 32 CPUs, we see that the Pentium cluster is delivering 51% (Si8O7H18) and 40% (Si8O25H18) of the Cray T3E/1200E in the DFT calculations. Table 1: Total Elapsed times (secs) using the NWChem DFT module in calculations on a variety of Zeolite fragments on the Cray T3E/1200E, and commodity-based systems.
As expected, the more powerful CPUs of the Alpha Linux Cluster lead to much higher percentage delivery. Thus the 32-CPU Alpha cluster delivers 220% (Si8O7H18) and 215% (Si8O25H18) of the Cray T3E. The much higher figure (464%) for Si26O37H36 is attributed to the increased memory available on the cluster (see above). On 32-CPUs the Alpha Linux Cluster is seen to deliver between 95-118% of the Compaq AlphaServer SC EV67/667. It is of some interest to compare the present Alpha timings for the larger fragments with those originally reported on the IBM SP/P2SC-120. The timings of Table 2 suggests that the 32-CPU Alpha Linux Cluster is, for the larger fragments, delivering ca. 50% of the performance of the 256 node IBM/P2SC-120 at the EMSL's Molecular Sciences Computing Facility (MSCF). This is a creditable level of performance even allowing for the dated nature of the IBM hardware. Table 2. Total Elapsed times (secs) using the NWChem DFT module in calculations on a variety of Zeolite fragments on the IBM SP/P2SC-120.
References [1] Additional information on NWChem and the Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Battelle Memorial Institute, and the U.S. Department of Energy is available via the Pacific Northwest Home Page at http://www.pnl.gov:2080/ Applications Performance: GAMESS-UK DFT, MP2 and 2nd DerivativesM.F. Guest The basis for much of the current parallel SCF functionality within GAMESS-UK was undertaken within the Esprit-funded EUROPORT 2 project, IMMP (Interactive Modelling through Parallelism). The primary goals of EUROPORT were to provide exemplars of commercial codes that demonstrated the potential of parallel processing to industry and to establish a broad spectrum of standards-based medium-scale parallel application software, rather than targeting the highest levels of scalability and performance. In contrast to NWChem, both SCF and DFT modules are parallelised in a replicated data fashion, with each node maintaining a copy of all data structures present in the serial version. While this structure limits the treatment of molecular systems beyond a certain size, experience suggests that it is possible on machines with 256 MByte nodes to handle systems of up to 2,000 basis functions. The main source of parallelism in the SCF module is the computation of the one- and two-electron integrals and their summation into the Fock matrix, with the more costly two-electron quantities allocated dynamically using a shared global counter. The result of parallelism implemented at this level is a code scalable to a modest number of processors (around 32), at which point the cost of other components of the SCF procedure starts to become significant. The first of these addressed was the diagonalisation, which is now based on the PeIGS module from NWChem. Once the capability for GA [1] is added, some distribution of the linear algebra becomes trivial. As an example, the SCF convergence acceleration algorithm (DIIS - direct inversion in the iterative subspace) is distributed using GA storage for all matrices, and parallel matrix multiply and dot-product functions. This not only reduces the time to perform the step, but the use of distributed memory storage (instead of disk) reduces the need for I/O during the SCF process. Substantial modifications were required to enable the MP2 gradient [2] and SCF 2nd derivatives to be computed in parallel. In both cases the conventional integral transformation step has been omitted, with the SCF step performed in direct fashion and the MO integrals, generated by re-computation of the AO integrals, and stored in the global memory of the parallel machine. This storage and subsequent access is managed by the GA tools. The basic principle by which the subsequent steps are parallelised involves each node computing a contribution to the current term from MO integrals resident on that node. For some steps, however, more substantial changes to the algorithms are required. For the MP2 gradient, the construction of the Lagrangian (the right-hand side of the coupled Hartree-Fock (CPHF) equations) requires MO integrals with three virtual orbital indices. Given the size of this class of integrals, they are not stored, the required terms of the Lagrangian being constructed directly from AO integrals. A second departure from the serial algorithm concerns the MP2 2-particle density matrix. This quantity, which is required in the AO basis, is of a similar size to the 2-electron integrals and is stored on disk in the conventional algorithm, but generated as required during the derivative integral generation from intermediates stored in the GAs. Table 1: Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient modules in calculations on Morphine, Cyclosporin, di(tri-fluoromethyl)-biphenyl and Mn(CO)5H on the Cray T3E/1200E and IBM, SGI, Compaq and Cray High-end Systems.
In the SCF 2nd derivative module the coupled Hartree-Fock (CPHF) step and construction of perturbed Fock matrices are again parallelised according to the distribution of the MO integrals. The most costly step in the serial 2nd derivative algorithm is the computation of the 2nd derivative two-electron integrals. This step is trivially parallelised through a similar approach to that adopted in the direct SCF scheme - using dynamic load-balancing based on a shared global counter. In contrast to the serial code, the construction of the perturbed Fock matrices dominates the parallel computation. It seems almost certain that these matrices would be more efficiently computed in the AO basis, rather than from the MO integrals as in the current implementation, thus enabling more effective use of sparsity when dealing with systems comprising more than 25 atoms. The performance of the DFT, MP2 and 2nd Derivative modules on the Cray T3E/1200E and the IBM, SGI, Compaq and Cray High-end Systems are shown in Table 1. Corresponding timings on a variety of commodity-based systems are shown in Table 2. The DFT calculations on morphine used a 6-31G** basis of 410 functions, those on cyclosporin a 6-31G basis of 1000 functions, both using the B3LYP hybrid functional. Note that the DFT calculations did not exploit CD fitting, but evaluated the coulomb matrix explicitly. Considering the DFT results on the high-end systems, speedups of 99 and 107 are obtained on 128 Cray T3E nodes for the morphine and cyclosporin calculation, respectively. The 32 CPU timings for cyclosporin show that the T3E is outperformed by the IBM SP/WH2, the SGI Origin 3800/R14k-500, the Compaq AlphaServer SC and the Cray Linux Supercluster by factors of 2.07, 3.03, 2.99 and 3.65 respectively. The fastest machine is evidently that with the fastest CPU, the EV67-833 based Cray Linux Supercluster, while the SGI Origin/R14K and Compaq AlphaServer exhibit almost identical run times up to 64 CPUs. Considering the higher node counts, it is clear that all machines exhibit inferior scalability compared to the Cray T3E. Thus the AlphaServer / Cray ratio performance ratio of 3.23 found at 16 CPUs decreases to just 1.96 on 128 CPUs. Table 2: Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient benchmark calculations on a variety of commodity-based systems.
CS4 AMD K7/700 + FE, *CS4 AMD K7/1200 + FE CS6 PIII/800 + FE Turning to the cluster results of Table 2, and the total times to solution on 32 CPUs, we see that the CS1 Pentium III/450 cluster is delivering 72% (morphine) and 62% (cyclosporin) of the Cray T3E/1200E in the DFT B3LYP calculations. Increasing the CPU speed while leaving the interconnect effectively unchanged leads to a predictable impact on performance. Thus the corresponding delivery figures for the fast ethernet connected CS6 Pentium III/800 and the CS4 Athlon AMD/1200 clusters in the cyclosporin calculation are 130% and 163%, with the AMD/1200-based cluster a factor of 2.34 times faster than CS1. While both clusters comfortably outperform the T3E, higher factors might have been expected based solely on single node performance. The more powerful CPUs together with enhanced interconnect of the CS2 Alpha Linux Cluster predictably lead to much higher percentage delivery. Thus the Linux Alpha Cluster, with corresponding figures of 300% (morphine) and 301% (cyclosporin), outperforms both IBM SP and Origin 3800/R12k-400. At 32 CPUs, the Alpha Cluster is performing on a par with the Compaq AlphaServer SC and SGI Origin 3800/R14k-500. In both benchmarks we find the elapsed times on the 32-CPU Alpha cluster fall between the 64- and 128-node T3E timings, lying closer to the 128-node T3E time in each case. The 64-CPU Alpha Cluster is outperforming the 256-node Cray T3E/1200E in the cyclosporin calculation. Considering the performance data for the MP2 gradient and SCF analytic 2nd derivative modules, we see that the MP2 geometry optimisation of the Mn(CO)5H molecule (with 217 basis functions) shows a speedup of 93 achieved using 128 T3E/1200 processors to perform the complete optimisation (involving 5 energy and 5 gradient calculations). A corresponding speedup of 86 is found when calculating the frequencies of 2,2'-di(tri-fluoromethyl)-biphenyl using a 6-31G basis of 196 functions. The greater dependency on interconnect in both SCF 2nd Derivative and MP2 calculations compared to the DFT module leads to less marked performance enhancements on all high-end platforms relative to the T3E (Table 1). Thus at 32 CPUs, the performance advantage of the Compaq AlphaServer SC over the Cray is reduced to factors of 2.2 (MP2) and 1.9 (2nd Derivatives) compared to the figure of 3.0 found in the cyclosporin calculations. Indeed at 64 CPUs the AlphaServer SC is now outperformed by the R14k-based SGI Origin 3800 in the MP2 calculation, where the AlphaServer shows little performance improvement on moving from 64 to 128 CPUs. This degradation in scalability of the AlphaServer is such that the 128 CPU time is only some 10% better than that found on the Cray. The minimum time to solution on 64 CPUs is given by the SGI Origin 3800 in the MP2 calculation, and by the Cray Linux Supercluster in the 2nd Derivatives calculation. This greater dependence on the global arrays (GAs) in both MP2 gradient and analytic 2nd derivative applications produces the expected impact in performance of the commodity clusters. Considering the total times to solution on 32 CPUs, we see that the CS1 Pentium III cluster is delivering a much reduced percentage of the Cray T3E (44%) in the MP2 gradient calculation, with only a modest reduction in elapsed time between 16 (11,499 seconds) and 32 CPUs (8,113 seconds). The significant increase in node CPU capability associated with the CS4 AMD-based cluster is seen to have little impact in this benchmark, with the solution time only a factor of 1.16 better than that of CS1. A similar degradation in performance was originally noted on the Alpha Linux Cluster, caused not by limited communications but by problems in the effective utilisation of shared memory on the dual CPUs of the UP2000. Revisions in release 3.1 of the GAs have largely addressed this, with the 32-CPU Alpha now delivering 228% of 32-node Cray performance and outperforming the IBM SP/WH2-375 (1550 vs. 2129 seconds). Somewhat surprisingly the Pentium and AMD-based clusters perform far more effectively in the SCF 2nd derivative benchmark. CS1 achieves 81% of Cray T3E/1200E performance, while both the CS6 Pentium III/800 and CS4 AMD/1200 clusters outperform the Cray at 32 CPUs (CS6, 127%, CS4, 135%). An initial performance analysis reveals load-balancing problems in the Fock matrix construction, which may explain this effect. Neither IBM/SP nor the Alpha Cluster perform that effectively on this benchmark; while the revised GAs have improved the Alpha performance, the 32-CPU Linux Cluster delivers only 154% of T3E performance, one of the lowest such figures recorded in these benchmarks. References [1] J. Nieplocha, R.J. Harrison and R.J. Littlefield, Global arrays; A portable shared memory programming model for distributed memory computers, in: Supercomputing '94, IEEE Computer Society Press, Washington, D.C. (1994). [2] G.D. Fletcher, A.P. Rendell and P. Sherwood, A parallel second-order Moller-Plesset gradient, Molec. Phys. 91:431-38 (1997). Applications Performance: GAMESS-UK DFT Coulomb ModuleM.F. Guest, P. Sherwood and H.J.J. van Dam Recent work has focused on optimising and extending the fitted coulomb module of the CCP1 density functional theory (DFT) code within GAMESS-UK for use on MPP machines. In order to reduce the cost of evaluating the Coulomb repulsion energy in medium sized molecules the charge density can be fitted to an auxiliary basis as proposed by Dunlap et al [1]:
where the fitting coefficients C can be obtained from:
In this equation
For efficiency only the maximal values of The parallelisation is performed trivially by distributing the evaluation of integrals with the same set of wavefunction basis functions Timings for a number of DFT calculations on the Morphine and Valinomycin molecules, conducted on the variety of high-end proprietary and commodity hardware under consideration are shown in Table 1. Calculations on morphine used a DZVP_A2 Dgauss basis of 410 functions, those on valinomycin a DZV_A2 basis of 882 functions, both using the HCTH functional. Timings are reported for calculations in which the coulomb matrix was evaluated explicitly (J-explicit) and for those that used CD fitting (J-fit). The latter employed an A2_DFT auxiliary fitting basis for morphine (1171 functions), and an A1_DFT fitting basis (3012 functions) for valinomycin. It can be seen that the current approach provides significant benefit, with scalability on the T3E greatly enhanced over that reported previously. Speedups of 105 and 110 are obtained on 128 nodes Cray T3E/1200 for the morphine and valinomycin calculations when evaluating the coulomb matrix explicitly. Corresponding speedups when using the coulomb fit are 75 and 100 respectively. Overall times to solution on 128 T3E nodes when using the fitted Coulomb approach are reduced by factors of 2.9 (morphine) and 2.1 (valinomycin). Considering the total times to solution on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the morphine calculation when evaluating the coulomb matrix explicitly show the following ordering: Cray Supercluster (325) < Compaq AlphaServer (373) < SGI O3800/R14k (505) < IBM SP (589) with the Cray SuperCluster outperforming the Cray T3E/1200 by a factor of 4.6. A similar ordering is found in the larger valinomycin benchmark, with a somewhat reduced factor of 4.3. The timings of Table 1 point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E. Thus the 32-CPU valinomycin improvement factors of 4.3 (Cray Supercluster) and 3.7 (AlphaServer) with explicit treatment of the Coulomb matrix are reduced to 3.6 (Supercluster) and 3.0 (AlphaServer) based on the 128-CPU timings. All machines show a significant reduction in time to solution when using the fitted Coulomb matrix compared to explicit treatment of the Coulomb term. The 32-CPU timings for the morphine calculation when using the fitted coulomb matrix show the following order: Cray Supercluster (112) ~ SGI O3800/R14k (113) < Compaq AlphaServer SC (129) < SGI O3800/R12k (145) Note that all 3c-2e integrals are held in memory for this 32-CPU morphine calculation. A comparison with the explicit-J timings shows the greater dependency of the fitted approach on interconnect. The SGI Origin 3800/R14k is now performing on a par with the Cray Supercluster, while the Origin 3800/R12k outperforms the IBM SP. The R12k-based Origin is now only a factor of 1.12 slower than the Compaq AlphaServer SC, compared to the figure of 1.71 found with explicit treatment of the coulomb matrix. Improvement factors when using the fitted approach versus explicit J in the larger valinomycin benchmark reflect this dependency on interconnect, particularly at higher processor count. Total times for the 32CPU Coulomb fit calculations are as follows: SGI O3800/R14k (724) ~ Cray Supercluster (726) < Compaq AlphaServer SC (881) < SGI O3800/R12k (897) The 32-CPU performance delivery figures for the Compaq AlphaServer of 399% and 375% with explicit treatment of the Coulomb matrix are reduced to 251% (morphine) and 300% (valinomycin) based on the 128-CPU timings. In similar fashion the 32-CPU figures for the Cray Supercluster of 427% is reduced to 358% in the 128 CPU valinomycin calculation. Table 1: Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on Morphine and Valinomycin on both High-end and Commodity-based Systems.
Considering the total times to solution on the commodity hardware, we see that the 32 CPU Pentium III/450 CS1 cluster is delivering 85% (morphine) and 83% (valinomycin) of the Cray T3E/1200E in the DFT calculations with explicit treatment of the Coulomb matrix. These factors are reduced somewhat (to 59% and 77% respectively) when using the fitted Coulomb matrix. The more powerful CPUs of the other clusters of Table 2 lead to much higher percentage delivery, particularly when evaluating the Coulomb matrix explicitly. The 32 CPU explicit coulomb figures for valinomycin of 178% (CS6 Pentium III/800) and 253% (CS4 AMD/1200) for the ethernet-interconnected machines are reduced to 131% and 145% respectively in the corresponding fitted coulomb calculations. Enhancing both interconnect and CPU speed results in much higher figures, particularly in the fitted calculations. Thus the 32 CPU Linux Alpha CS2 Cluster exhibits delivery figures in the valinomycin calculation of 361% (explicit coulomb) and 379% (fitted coulomb). The CS2 cluster is seen to outperform the IBM SP/WH2-375 and the Origin 3800/R12k-400 in both explicit- and fitted-coulomb 32 CPU valinomycin calculations, and the Compaq AlphaServer SC/667 in the fitted calculation. The Alpha Cluster is seen to deliver between 96% of the Compaq AlphaServer SC when evaluating the Coulomb matrix explicitly, and between 89% of the Origin 3800/R14k when using the Coulomb fit. Indeed 32-CPUs of the CS2 cluster outperforms 128 nodes of the Cray T3E in both explicit and fitted coulomb calculations. Table 2: Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on the Cray T3E/1200 and High-end systems from IBM, SGI, Compaq and Cray (see text).
+Zeolite, Basis (AOs/CD) A further demonstration of the Coulomb Fit DFT code is given in Tables 2 and 3. Here we present timings for complete DFT calculations on the same series of Zeolite fragments used previously in demonstrating the NWChem software, conducted on the variety of high-end proprietary (Table 2) and commodity hardware (Table 3) under consideration. Note that the 833 MHz EV67 Compaq AlphaServer SC is now included in the hardware under evaluation. While limited speedups are observed for the smaller fragments on the Cray T3E/1200E (46 and 54 for Si8O7H18 and Si8O25H18 respectively on 128 nodes), the higher value of 93 found for the largest fragment, Si28O67H30 is associated with the need for re-computation of the 3-centre integrals. Considering the total times to solution for the larger fragments on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the Si26O37H36 calculation show the following ordering: Compaq Alpha SC/833 (695) < SGI O3800/R14k (748) < Cray Supercluster (770) ~ Compaq Alpha SC/667 (783) with the Compaq AlphaServer/833 outperforming the Cray T3E/1200 by a factor of 2.8. The performance of the Origin 3800/R14k is far stronger than might have been expected based solely on a consideration of node performance (e.g. SPECfp2000). The timings of Table 2 again point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E. Thus the 32-CPU improvement factors for Si26O37H36 of 2.8 (Compaq AlphaServer/833) is reduced to 2.1 based on the 128-CPU timings. Similar conclusions arise from a consideration of the timings for the largest fragment (Si28O67H30). The improvement factor for the Compaq AlphaServer/833 against the Cray T3E at 32 CPUs of 280% is reduced to 171% based on the 128-CPU timings. Table 3: Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on a number of commodity-based systems.
+CS1 PIII/450 + FE, CS2 QSNet Alpha Linux EV67/667 $ Basis (AOs/CD) Considering the total times to solution on 32 CPUs of the commodity hardware (Table 3), we see that the Pentium III/450 CS1 cluster is delivering between 43-52% of the Cray T3E/1200E in the fitted coulomb DFT calculations. The more powerful CPUs of the other clusters of Table 3 lead to higher percentage delivery, although these do not reflect the individual CPU performance for the ethernet-interconnected machines. Thus we find 32 CPU figures of 69% (CS6 Pentium III/800) and 77% (CS4 AMD/1200) for Si26O37H36, and 71% (CS6 Pentium III/800) and 82% (CS4 AMD/1200) for Si28O67H30. Enhancing both interconnect and CPU speed results in much higher figures. Thus the 32 CPU Linux Alpha CS2 Cluster exhibits delivery figures of 241% (Si26O37H36) and 219% (Si28O67H30). The CS2 cluster is seen to outperform the IBM SP/WH2-375 and the Origin 3800/R12k-400 in all 32 CPU fragment calculations; 32-CPUs of the cluster outperform 128 nodes of the T3E on all but the largest fragment. For the fragments under consideration, the Alpha Cluster is seen to deliver between 78-87% of the performance of the Compaq AlphaServer SC/833 and between 89-100% of the Origin 3800/R14k-500 in all 32 CPU calculations. References [1] B.J. Dunlap, W.D. Connolly, J.R. Sabin, On some approximations in applications of Xα theory, Journal of Chemical Physics 71, 3396-3402. [2] G. Fann and R.J. Littlefield, Parallel inverse iteration with reorthogonalisation, in: Sixth SIAM Conference on Parallel Processing for Scientific Computing (SIAM), pp409-13 (1993). Applications Performance: DL_POLYM.F. Guest DL_POLY [1] is the parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and T.R. Forester for CCP5 (the Collaborative Computational Project for the Computer Simulation of Condensed Phases). The parallel implementation was initially based on a replicated data (RD) strategy, and was designed at the outset for machines with up to 64 processors and systems of up to 30,000 atoms, although it has since found use on much larger architectures. Implicit in the RD approach is a dependence on fast global summations, which are not available on all machines. The performance scaling varies according to the kind of simulation being undertaken - systems possessing complex molecular topologies and constraint bonds typically scale less well than ones requiring simple atomic descriptions, as they require a higher communication overhead. If constraint bonds are present, as they usually are in bio-molecular or polymer systems, then significant deviations from ideal behaviour are to be expected. The four benchmarks described below are those described on the CCP5 web site [2], with the same benchmark-numbering scheme adopted here: Benchmark 4: A straightforward simulation of sodium chloride at 500K, using the standard Ewald summation method to handle the electrostatic forces. A multiple time-step algorithm is used to increase performance, which requires recalculating the reciprocal space forces only twice in every five time steps. The electrostatic cut-off is set at 24 Angstrom in real space, with a primary cut-off of 12 Angstrom for the multiple time-step algorithm. The Van der Waals terms are calculated with a cut-off of 12 Angstrom. The simulation is for 200 steps with a time step of 1 fs in the Berendsen NVT ensemble. The system size is 27,000 ions. Benchmark 5: This simulation is of 8,640 atoms of an alkali disilicate glass at 1000 K. The electrostatics are again handled by the Ewald sum, with the interaction potential including a three-body valence angle term, which requires a link-cell scheme to locate atom triplets. The electrostatic cut-off is 12 Angstrom and the Van der Waals cut-off is 7.6 Angstrom; 3-body forces are cut off at 3.45 Angstrom. The simulation is for 300 steps in the Hoover NVT ensemble, with a timestep of 1 fs. Benchmark 3: This simulation is of the enzyme transferrin in a solution comprised of 8102 TIP3P water molecules. A total of 27,593 atoms are in the system. The electrostatic forces are handled by a combination of neutral groups with the Coulombic potential. All force cut-offs are set at 8 Angstrom. The simulation is for 250 steps with a time step of .1 fs, in the NVE ensemble. The water molecules are treated as rigid bodies and the transferrin is maintained by bond constraints using SHAKE. Valence angles and dihedral potentials are present in the transferrin model. Benchmark 7: This system is comprised of 13,390 atoms, including 4012 TIP3P water molecules solvating the gramicidin A protein molecule at 300K. Both the protein and water molecules are defined with rigid bonds and maintained by the SHAKE algorithm. The water is held completely rigid, while the protein has angular and dihedral potential terms. Electrostatic interactions are handled by the neutral group method with a Coulombic potential truncated at 12 Angstrom. The Van der Waals interactions are truncated at 8 Angstrom. The simulation is for 500 time steps in the NVE ensemble with a 1 fs time step. Performance scaling on the Cray T3E for Benchmarks 4 & 5 has been shown to be extremely good and is almost linear over the entire range of processor numbers. This reflects the high parallel efficiency of the Ewald sum implementation. Significantly inferior scaling is found in the two macromolecular benchmarks, Benchmarks 3 & 7. This may be attributed to the difficulty in apportioning the neutral group calculations across processors, and the use of SHAKE for the bond constraints. Somewhat better scaling is found in 7 which uses a larger cut-off in the electrostatic calculations and hence a lower communication/computation ratio. The performance of the four DL_POLY benchmarks are shown in Table 1 (on the Cray T3E/1200E and IBM, SGI, Compaq and Cray High-end Systems) and Table 2 (on a number of commodity based systems). Initial modifications made to the DL_POLY implementation on the commodity clusters included replacing the MPI_ALLREDUCE routines from both LAM and MPICH libraries with a Daresbury rewritten hypercube-based version. Considering the Ewald-based benchmarks, we again find excellent scalability on the Cray T3E/1200E, with speedups of 135 (super-linear) and 98 obtained on 128 nodes Cray for benchmark 4 and 5 respectively. This excellent scaling on the T3E is put into perspective when considering the total 32 CPU times to solution for the high-end systems of Table 1. The weakness of the Cray EV56 CPU is clearly apparent, with the slowest of these systems, the IBM/SP-WH2-375, delivering 369% (Benchmark 4) and 242% (benchmark 5) of the Cray T3E on 32 CPUs. There is clearly little difference in performance between the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray SuperCluster EV67/833, with the Supercluster the leading 32 CPU machine in benchmark 5 (433%), and the Alphaserver marginally faster in benchmark 4 (597%). There is some evidence that the optimum scalability is shown by the Origin 3800/R14k, although this is inferior to that found on the Cray T3E. Turning to the commodity systems of Table 2, the weakness of the Cray EV56 CPU is again clearly apparent. Even the CS1 Pentium III/450 cluster is outperforming the Cray T3E/1200E in the NaCl simulation (352 vs. 376 seconds), and is only marginally slower than the T3E (238 vs. 225 seconds) in the NaK silicate simulation. These percentage delivery figures of 107% and 95% on the 450 MHz Pentium cluster increase substantially on the more powerful CPUs of the AMD Athlon and Alpha Clusters. In Benchmark 4 we find delivery figures of 184% and 257% for the CS6 Pentium III/800 and CS4 K7/1200 clusters respectively: the Benchmark 5 percentages are 152% and 192%. Corresponding 16-node figures for the CS3 AMD Athlon cluster are 233% (benchmark 4) and 255% (benchmark 5). It is clear however that providing just fast ethernet as interconnect is not sustainable much beyond 32 CPUs. The 64 CPU performance of the CS6 Pentium III/800 cluster is only marginally superior to that at 32 CPUs in Benchmark 4, while in Benchmark 5 the 64 CPU timing is actually slower. The faster EV67 CPU of the Linux Alpha cluster, together with its enhanced QSNet interconnect, results in much higher delivery figures. The Alpha CS2 cluster outperforms the IBM SP and Origin 3800/R12k, with corresponding figures of 470% (benchmark 4) and 363% (benchmark 5) at 32 CPUs. The potential of the Beowulf systems in these simulations is striking; the 32-CPU Linux Alpha Cluster is outperforming 128 nodes of the Cray T3E in both benchmarks. A quite different picture of performance is revealed when considering the two macromolecular simulations, benchmarks 3 and 7. Now the scalability on the T3E/1200E is far more limited, with speedups of just 27 (benchmark 3) and 61 (benchmark 7) on 128 nodes of the Cray. The improvement in performance of the high-end systems over the T3E/1200E is also less apparent compared to the Ewald-based simulations. The slowest of these systems, the IBM/SP-WH2-375, is no faster than the Cray on benchmark 3, and is only a factor of 1.4 times faster on benchmark 7 on 32 CPUs (cf. factors of 3.7, benchmark 4 and 2.4, benchmark 5). While there remains little to choose in performance between the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray Alpha Linux cluster, the Origin 3800/R14k is now the leading 32 CPU machine in both benchmark 3 and 7. The optimum scalability is clearly shown by the Origin 3800/R14k, although this significantly inferior to that found on the Cray T3E. Table 1: Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on the Cray T3E/1200 and IBM, SGI, Compaq and Cray High-end Systems.
This lack of scalability has a predictable effect on the performance of the commodity clusters, which now deliver significantly lower percentage delivery figures compared to those found in the Ewald-based simulations. Considering the CS1 PentiumIII/450 cluster, we find 32-node delivery figures of just 34% and 55% for benchmarks 3 and 7 respectively. While these figures increase significantly on the more powerful CPUs, they are far from impressive. Focusing on benchmark 7, we see only modest increases in delivery on CS6 PentiumIII/800 (69%) and CS4 K7/1200 clusters (95%). These figures do increase substantially with improvements in interconnect. Thus the 16 CPU CS5 Pentium/930 cluster with SCI interconnect leads to the machine outperforming the T3E/1200 (561 vs. 688 seconds) in benchmark 7. The performance of the CS3 AMD K7/850 Cluster is worth noting here. The 16-CPU Myrinet connected cluster outperforms the ethernet connected CS4 K7/1200 cluster and performs largely on a par with 16 CPUs of the IBM SP/WH2-375, delivering about 50% of the corresponding partition of the Linux Alpha Cluster. The 32-CPU Linux Alpha Cluster remains, however, the optimum cluster of those considered. It delivers 152% (benchmark 3) and 260% (benchmark 7) of the Cray T3E. A significant performance limitation on the Alpha Cluster arose from the way DL_POLY handled both co-ordinate and forces arrays. The x-, y- and z-coordinates and corresponding forces were stored as separate linear arrays, x(mxatms), y(mxatms) etc., coding that led to exceedingly poor cache re-usage on the UP2000 processor. Re-writing the code to use, in hopefully obvious notation, xyz(3,mxatms) and fxyz(3,mxatms) improved overall performance on the Alpha cluster by a factor of 2.5 (although it had little effect on, for example, the IBM/SP-WH2 with its larger 8 MByte cache). Having made these changes, the 32-CPU Linux Alpha CS2 Cluster again outperforms 128 nodes of the Cray T3E in both benchmarks. Table 2: Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on a variety of commodity-based systems.
+CS1 PIII/450 + FE: LAM/MPI, CS2 QSNet Alpha Linux EV67/667 An additional feature exemplified by these benchmarks is the impact of the underlying MPI libraries on performance. While little effect was found in the Ewald-based simulations, a much greater impact was apparent on benchmarks 3 and 7. Thus the reduced latency associated with LAM MPI as against MPICH reduced the 32-node benchmark 3 timing on the CS1 Pentium III/450 cluster from 583 (MPICH) to 391 seconds (LAM). Finally, it is perhaps worth questioning the value and cost effectiveness of the Cray T3E/1200E in running molecular simulations using the DL_POLY software, a question that is reinforced by considering the CHARMM benchmarks presented below. In both classes of DL_POLY benchmark considered, those that scale well on the Cray (benchmarks 4 and 5) and those that scale badly (the macromolecular simulations), we see that 128-node Cray T3E performance is matched or exceeded by 32 CPUs of the Linux Alpha Cluster. While the latter scales less effectively than the Cray, the total times to solution are less. Given that DL_POLY itself does not scale effectively beyond 128 Cray CPUs, it is difficult to justify using the Cray at all, given the implicit cost differential involved against the clusters considered in this report. References [1] see, http://www.dl.ac.uk/TCSC/Software/DL_POLY/main.html [2] see, http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/ Applications Performance: CHARMMM.F. Guest and P. Sherwood CHARMM "Chemistry at HARvard Macromolecular Mechanics" (version c26b2) is the general-purpose molecular mechanics, molecular dynamics and vibrational analysis package for modelling and simulation of the structure and behaviour of molecular systems. The benchmark is the standard CHARMM parallel benchmark involving an MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules (14026 atoms, 1000 steps (1 ps), 12-14 Angstrom shift). Although a macromolecular simulation, this MD benchmark shows many of the attributes demonstrated by the Ewald-based DL_POLY simulations. Again, while excellent scalability is shown on the Cray T3E/1200E (a speedup of 96 on 128 nodes), the EV56 node appears to be considerably slower than the 450 MHz Pentium III. Thus the CS1 Pentium III cluster is outperforming the Cray at small node counts, and exhibits comparable performance on 32 nodes, with a percentage delivery figure of 96%. Table 1. Time in Wall Clock Seconds for the CHARMM Carboxy Myoglobin parallel benchmark.
This figure increases substantially on the more powerful EV67 CPU of the Alpha Cluster, with 32-CPUs of the CS2 cluster exhibiting a corresponding figure of 404%. The potential of the commodity-based systems in this simulation is again striking; the 32-CPU Alpha Linux Cluster is outperforming 128 nodes of the Cray T3E/1200E. References [1] CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983), by B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. Applications Performance: CASTEP, CPMD and CRYSTALM.F. Guest CASTEP (CAmbridge Serial Total Energy Package, [1]) is the ab initio code for the solution of the electronic ground state of periodic systems with the wavefunctions expanded in a plane wave (PW) basis using techniques based on density functional theory. CASTEP calculates the total energy, forces and stresses in a 3D periodic system, and is continuously under development within the UK Car-Parrinello Consortium (UKCP) collaboration. The code uses FFTs on a 3-dimensional grid decomposed in fat pencils to allow the use of a large number of processors. Use is made of portable Temperton General Prime Factor Algorithm for multi-radix multi sequence FFTs (GPFA). For production work the vendor’s proprietary FFTs would be used. A second part of the code performs orthogonalisation of the wavefunctions using an explicit Gram-Schmidt procedure. The CASTEP 4.2b benchmark comprises a single k point total energy calculation of Chabazite, Si11 O24 Al H, using the Vanderbilt ulatrasoft pseudo-potential. With 15045 plane waves and a 3D FFT grid size of 54x54x54, convergence was achieved in 17 SCF cycles using the Pulay density mixing minimiser scheme. Benchmark timings on the Cray T3E/1200E, IBM SP/WH2-375, SGI Origin R14k/500 and a number of commodity based systems are reported in Table 1. Note that the Pentium III/450 cluster timings (using PGI, MPICH) featured channel bonding. Table 1: Time in Wall Clock Seconds for the CASTEP Chabazite parallel benchmark on the Cray T3E/1200E, IBM SP/WH2-375, SGI Origin 3800/R14k-500 and a number of commodity clusters.
+CS1 PIII/450 + FE: LAM/MPI, CS2 QSNet Alpha Linux EV67/667 While excellent scalability is found on the T3E, the reverse is seen on both the IBM SP/WH2-375 and the Pentium-based ethernet connected CS1 and CS6 clusters; this poor scalability is caused in large part by the MPI_ALLTOALLV routine used in the matrix transpose part of the 3D FFT. An analysis of the 32 CPU timings of 271 and 989 seconds on the IBM SP and Pentium CS1 cluster reveals that 157 and 660 seconds respectively are spent in the associated communication routines. Thus the percentage delivery figure of 33% of the Cray T3E on 32 nodes of the Pentium Cluster is, along with the DL_POLY bond-constraint benchmarks, the lowest figure encountered in this benchmarking report. Effectively doubling the CPU clock speed in the CS6 Pentium III/800 cluster leads to only a small improvement in time to solution, with the CS1 32-CPU timing of 989 seconds reduced to 774 seconds (of which 600 seconds are spent in communications). Enhancing the interconnect on the clusters does lead to a significant improvement in timings, with the 16 processor CS5 Pentium III/930 SCI-connected cluster achieving 65% of the T3E performance, although one might have expected better. The optimum cluster performance is derived from the QSNet connected CS2 Linux Alpha Cluster, which at 32 CPUs outperforms the Cray T3E by a factor of 1.66, and achieves 78% of the performance of the SGI Origin 3800/R14k-500. Clearly optimisation of the MPI_ALLTOALLV in the CS2 Cluster is still not optimal, with 111 seconds of the total wall time of 195 seconds spent in communications. The corresponding figure in the Origin 3800/R14k is 71 seconds (in a total wall time of 153 seconds). On the IBM SP it would clearly be beneficial to call the IBM PESSLP2 3D FFT routine instead of going through GPFA to obviate the transpose altogether. CPMD is the Car-Parrinello code developed by Michele Parrinello and his group at the MPI in Stuttgart and a number of users worldwide [2]. The code has been ported to the CS1 Pentium III/450 cluster, and subsequently benchmarked, by Sprik and Vuilleumier (Cambridge). Note that CPMD is to act as the base code for the new CCP1 flagship, and that further optimisation for Beowulf-class systems is planned during the course of the project. The comparison of Cray and Beowulf hardware shown in Table 2 centres around a Liquid Water benchmark. The simulation comprises 32 water molecules, in a simple cubic periodic box of length 9.86 Angstrom at a temperature of 300K, with a time step of 7 au i.e. 0.169 fs, and a test run of 200 steps (34 fs). The calculation used the BLYP functional and Trouillier and Martins pseudo-potential, with a reciprocal space cut-off of 70 Ry (952 eV). In stark contrast to the CASTEP benchmark, the Pentium Cluster is seen to be performing well in comparison to the Cray T3E/1200E. This may be attributed to the significantly longer iteration times associated with CPMD, given the efficient usage of short-range pseudopotentials within CASTEP, and the smaller impact that the MPI_ALLTOALLV routine has on the total elapsed times. Good scalability is shown on the Cray T3E/1200E (a speedup of 49 on 64 nodes), although the EV56 node appears to be only marginally faster than the 450 MHz Pentium III. Thus the Pentium cluster achieves a percentage delivery figure of 62% of the Cray T3E/1200E on 32 nodes. Table 2: Time in Wall Clock Seconds for the CPMD Liquid Water parallel benchmark on the Cray T3E/1200E and the Pentium III/450 CS1 Cluster.
CRYSTAL [3] was jointly developed by the Theoretical Chemistry Group at the University of Torino and the Computational Materials Science group in CLRC. The program computes the electronic structure of periodic materials within Hartree Fock, density functional or various hybrid approximations. The Bloch functions of the periodic systems are expanded as linear combinations of atom-centred Gaussian functions. Powerful screening techniques are used to exploit real space locality.
Table 3: Time in Wall Clock Seconds for the CRYSTAL98 parallel benchmarks on the Cray T3E/1200E and the CS1 PIII/450 and CS2 QSNet Linux Alpha Clusters.
The code may be used to perform consistent studies of the physical, electronic and magnetic structure of molecules, polymers, surfaces and crystalline solids. Benchmark timings of CRYSTAL 98 on the Cray T3E/1200E, Pentium and Alpha Beowulf systems are reported in Table 3. Two representative RHF calculations are included, (i) TiO2 bulk crystal, with 75 k-points, 6 atoms/cell, and a modest basis set comprising 126 GTOs, and (ii) a much larger calculation of MgO, with 36 k points, 64 atoms/cell and 576 GTOs. Parallelisation in each case is conducted over both integrals and k-points, the benchmarks being based on the replicated data parallel version of the code. These benchmarks again provide compelling evidence as to the value of the Beowulf clusters, and to the limited performance of the Cray EV56 node. The CS1 Pentium III/450 cluster outperforms the Cray T3E/1200E at all node counts in both benchmarks, showing a percentage delivery figure of 145% of the Cray T3E on 32 nodes. This figure increases substantially on the more powerful CPUs of the Alpha Cluster, with the Linux CS2 cluster outperforming the 32-node Cray T3E by a factor of 3.49. It should be noted that the size of basis set in these benchmarks still enables the linear algebra to be treated in serial fashion. A more stringent test of the clusters will be apparent once the distributed data MPP version of CRYSTAL is implemented, this requiring an efficient cluster implementation of SCALAPACK (work currently in progress). References [1] M.C. Payne et. al., Rev. Mod. Phys. (1992) 64 p. 1045, see http://www.cse.clrc.ac.uk/Activity/UKCP/ [2] CPMD, Version 3.3: Hutter, Alavi, Deutsh, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999). [3] see, http://www.cse.clrc.ac.uk/Activity/CRYSTAL/ Applications Performance: ANGUSM.F. Guest ANGUS [1] performs direct numerical simulation (DNS) of turbulent premixed combustion in order to generate statistical data in support of modelling. The equations to be solved are the Navier-Stokes equations for fluid flow, augmented by two additional equations each describing the transport of a single scalar variable and together specifying the thermochemical state of the system in the presence of differential diffusion effects. Thus, in total there are six partial differential equations to be solved. A grid partitioning strategy is employed for ANGUS which is quite typical of many domain decomposition techniques used in parallel CFD. Due to the finite difference stencil it is necessary to introduce ‘halo’ or ‘ghost’ cells at the interface boundaries. These are used as a message cache and allow derivatives to be determined in these regions with only local variables. The halo cells are then updated as required. Discretisation of the equations is carried out using standard second-order central differences on a three-dimensional grid. The velocity nodes are located at the face-centres of each cell, giving a staggered-grid arrangement that conserves kinetic energy as well as mass and momentum. The pressure solver utilises a conjugate gradient method with a Modified Incomplete LU (MILU) preconditioner [2]. As with many CFD algorithms, the resulting matrix is both sparse and symmetric. In this case, it is heptadiagonal and the periodic boundary conditions also mean that the matrix is singular. A Multi-grid solver has also been provided. Level-1 BLAS are used heavily in both solvers, with the overall computational work expected to be roughly proportional to n3. The current ANGUS CG-ILU benchmark utilises a grid size of 1443. Benchmark timings on the Cray T3E/1200E, IBM SP/WH2-375, plus a number of commodity-based systems reported in Table 1 are for one hundred iterations of the conjugate gradient solver.Note that the timings reported for the IBM/SP, the Alpha Linux and Pentium III/930 SCALI Cluster refer to CPU configurations in which all CPUs on a given node are involved in the computation. These timings show several distinct features. Both the Cray T3E/1200E and IBM SP/WH2-375 appear to exhibit super-linear speedups, although the SP is only marginally faster than the Cray for a given node count. While the Alpha Linux cluster outperforms both Cray and IBM up to 16 nodes, this advantage is effectively lost at 32 CPUs when the machine exhibits almost identical timings as the IBM (751 vs. 776 seconds respectively). The degradation in performance on 32 CPUs of the Alpha Cluster leads to a delivery figure of 145% of the Cray T3E/1200E. Considering the Pentium clusters, the PIII/450 CS1 cluster again performs well, exhibiting close to linear scaling with the 32-node time some 70% of that recorded on the T3E. The improvement in performance with increased processor speed is modest, with the CS6 PIII/800 cluster outperforming CS1 by factors well below the MHz ratio (a factor of 1.3 on 8 CPUS, decreasing to just 1.1 on 32 CPUs). Table 1: Time in Wall Clock Seconds for the ANGUS CG-ILU Benchmark (1443)
+CS1 PIII/450 + FE: LAM/MPI, CS2 QSNet Alpha Linux EV67/667 What is confusing is the performance of the SCALI cluster, CS5. With PIII/930 CPUs and the SCALI interconnect it is apparently outperformed at all CPU counts by the CS1 Cluster with more its modest PIII/450 CPUs and fast ether interconnect. Additional insight into these results can be gained by varying the distribution of processors over the available nodes (see Table 2). Now, for example, a 16-processor job on the IBM SP/WH2-375 is run on either 4 or 8 nodes given the configuration available (with all CPUs used in the former case, and only 2 CPUs/node in the latter). On both the Linux Alpha and SCALI Pentium cluster the same 16-processor job may be run on either 8 nodes (employing both of the dual processor CPUs) or on 16 nodes, where just a single processor would be involved. Table 2: Time in Wall Clock Seconds on the IBM SP/WH2-375, Alpha Linux and SCALI Clusters as a Function of Processor distribution for the ANGUS CG-ILU Benchmark (1443);
The strong correlation between elapsed time and node occupancy points to the driving influence of memory bandwidth on this benchmark. Thus performing an 8 CPU run on the IBM SP/WH2-375 realises elapsed times that vary by a factor of 2.3 depending on processor distribution (from 4394 seconds on 2 nodes to 1899 seconds on 8 nodes). Similarly the 16 CPU benchmark on the Alpha Linux Cluster requires 1635 seconds on 8 dual processor nodes, and 936 seconds when using a single CPU of each of the available 16 nodes. On the SCALI Pentium cluster, the 8 CPU benchmark requires 7037 seconds on 4 dual processor nodes, and 4556 seconds when using a single CPU of each of the available 8 nodes. These performance attributes are completely consistent with the STREAM memory bandwidth benchmark [3] on the nodes of each machine. The TRIAD bandwidth of 900 Mbytes/sec measured on a dedicated single SP/WH2 node is reduced to some 225 Mbytes/sec when running the same benchmark on all 4 CPUs of the node. Equally, the UP2000 6/667 figure of 1 GByte/sec measured on a dedicated dual processor node is reduced to 500 Mbytes/sec when both CPUs are involved in the same benchmark. References [1] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from and engineering perspective, Parallel Computing (1996). [2] T.F. Chan and C-C.J. Kuo, Parallel Elliptic Preconditioners: Fourier Analysis and Performance on the Connection Machine, Computer, Physics Communications, Vol. 53, 1989, pp 237-252. [3] The STREAM Memory Bandwidth benchmark, see http://www.cs.virginia.edu/stream. M.F. Guest FLITE3D is a finite-element code for solving the Euler equations governing airflow over whole aircraft. Parallelisation of FLITE3D for shared and distributed memory parallel systems has been undertaken as part of a collaboration between the Computational Engineering Group at Daresbury and the Sowerby Research Centre at British Aerospace. The code comprises a suite of modules for obtaining Euler solutions of the flow over complex configurations. Table 1: Time in Wall Clock Seconds for the FLITE3D benchmarks on the Cray T3E/1200E, IBM SP/WH2-375 and CS1 Pentium III/450 and CS2 Linux Alpha Commodity Systems.
Work has been carried out on the parallelisation of the steady Euler flow solver using standard techniques of mesh-partitioning for a single-program multiple-data (SPMD) programming model implemented in Fortran 77 and C with message passing using a choice (selectable at compile time) of MPI or PVM. The flow solver now reads in the partitioned mesh and performs the necessary communications at the boundaries between sub-domains. Fields are gathered onto the master processor for output so that no changes are necessary in the post-processing stages. This also enables the flow solver to be stopped and restarted using a different number of processors, if necessary. Table 1 shows timings on the Cray T3E/1200E, IBM-SP/WH2-375 and both the Pentium III/450 CS1 cluster and the QSNet Linux Alpha cluster, CS2 for two MPI-based FLITE3D benchmark studies, (i) a modest wing body benchmark using 298,244 elements, and (ii) the more demanding F18 benchmark using 3,444,350 elements. These benchmarks provide further compelling evidence as to the value of the commodity clusters, and to the limited performance of the Cray EV56 node. Focusing on just the largest F18 benchmark, we see that although the Cray is scaling well (a speedup of 81 on 128 nodes), the Pentium III/450 CS1 cluster outperforms the Cray T3E/1200E at all node counts. The Pentium III cluster shows a percentage delivery figure of 145% of the Cray T3E on 32 nodes. This figure increases substantially on the more powerful CPUs of the IBM SP/WH2-375 and Alpha Linux cluster. The CS2 cluster outperforms the 32-node Cray T3E by a factor of 4.8, with the 32-node Alpha significantly faster than 128 nodes of the Cray. The relative performance of the IBM SP/WH2 is also impressive. While slower than the Alpha cluster, the 32-CPU SP timing is again significantly faster than that recorded on 128 nodes of the Cray. Although the code was originally developed for the Cray, these results strongly suggest that the individual node performance of the T3E is far from optimal. Applications Performance: SUMMARYM.F. Guest We summarise In Table 1 the conclusions of the benchmarking exercise on applications reported in a number of previous articles, by showing
Table 1: Application Performance: Percentage of a 32-node partition of (i) the Cray T3E/1200E achieved by the 32-node CS1 Pentium /450 and 32-node CS2 QSNet Linux Alpha Cluster, and (ii) the SGI Origin 3800 R14k achieved by the CS2 QSNet Linux Alpha Cluster.
(§) Outperforms 128 nodes of the Cray T3E/1200E These figures suggest the following:
The collection of results presented in this appendix provides compelling evidence that suitably-configured Beowulf systems can provide not only highly cost-effective departmental, mid-range solutions, but can match the levels of performance associated with a significant fraction of a high-end MPP machine, again for a small fraction of the cost. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||