Application Performance on High-end and Commodity-class Systems
M.F. Guest, P. Sherwood, W. Smith, I.J. Bush, and H.J.J. van Dam
Introduction and Background
M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD
m.f.guest@dl.ac.uk
As part of our assessment of the current High-end Computing (HEC) landscape, and to assist in positioning commodity-based systems on this landscape, we continue to implement, benchmark and assess the performance of a number of key applications on proprietary high-end systems from IBM, SGI, Compaq and Cray. Applications considered below include those from computational chemistry (GAMESS-UK, DL_POLY and CHARMM), computational materials (CPMD), and computational engineering (ANGUS and FLITE3D). The HEC hardware involved in this work is listed below. Those accessed for the first time in the current reporting period include the Compaq AlphaServer SC ES45/1000 (at Pittsburgh Supercomputing Centre, PSC), and a variety of power4-based systems from IBM (the 8-way IBM pSeries 690 Turbo, 16-way Regatta HPC and 32-way Regatta H node):
The commodity-based systems used in our studies have been described elsewhere. We merely provide a list of these here in Table 1, noting the presence of three new additions to the list, CS7, CS8 and CS9. CS7 is the SCALI/SCI interconnected dual AMD K7/1000 MP "ukcp" cluster, CS8 the Itanium-based "Titan" system at NCSA, consisting of 160 dual-processor IBM IntelliStation Z Pro servers, and CS9 the dual Pentium 4/2000 Xeon "dirac" system at Bristol University. Both CS8 and CS9 feature the Myrinet 2000 interconnect.
Performance Metrics
In previous SLA reports we have summarised the conclusions of related benchmarking exercises of applications on commodity-based systems by showing the effective delivery of such systems against corresponding high-end hardware such as the Cray T3E/1200E and SGI Origin 3800, i.e. those high-end machines available to the UK’s HPC community. These comparisons have highlighted the inappropriate use of the latter systems for delivering capacity computing solutions, based on the simplest of cost-effectiveness arguments. As a starting point for the present analysis, we show in Table 2 below a somewhat updated version of the summary table from the SLA 2000/2001 report. This has been modified to include the Pentium III/800 CS6 cluster, rather than the now outdated Pentium III/450 CS1 cluster, and shows
These figures suggested the following:
Table 2. Application performance: percentage of a 32-node partition of (i) the Cray T3E/1200E achieved by the 32 processors of the CS6 Pentium/800 and CS2 QSNet Alpha Linux Clusters, and (ii) the SGI Origin 3800 R14k achieved by the CS2 Cluster.
(§) Outperforms 128 nodes of the Cray T3E/1200E
In the articles below we attempt to update these comparisons, for there is now little point in taking the Cray T3E/1200E as the standard, or in considering the performance of commodity systems based on Pentium III processors. We now position the Compaq AlphaServer SC ES45/1000 (at Pittsburgh) as the standard, and consider the relative performance of the CS9 Pentium 4/2000 with Myrinet interconnect as representative of today's typical commodity-based offering. Our interest will centre on whether such suitably-configured Beowulf systems can still provide not only highly cost-effective departmental, mid-range solutions, but can also match the levels of performance associated with a significant fraction of a high-end machine, again for a small fraction of the cost. Before moving to the applications, however, we present initially a summary of the work undertaken over the past 12 months in evaluating systems based on the two major arrivals into the HEC market place, the power4 processor from IBM, and Intel's IA64 commodity-based Itanium processors.
Applications Performance: The Parallel Implementation of GAMESS-UK
M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD. m.f.guest@dl.ac.uk
In GAMESS-UK both SCF and DFT modules are essentially parallelised in a replicated data fashion, with each node maintaining a copy of all data structures present in the serial version. While this structure limits the treatment of molecular systems beyond a certain size, experience suggests that it is possible on machines with 256 MByte nodes to handle systems of up to 2,000 basis functions. The main source of parallelism in the SCF module is the computation of the one- and two-electron integrals and their summation into the Fock matrix, with the more costly two-electron quantities allocated dynamically using a shared global counter.
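The dynamic allocation described above can be illustrated with a minimal sketch. This is not the GAMESS-UK implementation: threads in a single Python process stand in for the nodes, a lock-protected integer stands in for the shared global counter, and the names `parallel_fock_sum` and `task_costs` are hypothetical. Each worker repeatedly claims the next unclaimed batch of work, accumulates its contribution privately, and the private results are summed at the end.

```python
import threading


def parallel_fock_sum(task_costs, n_workers=4):
    """Sum per-task contributions, handing out tasks via a shared counter.

    task_costs[i] is a stand-in for the contribution of integral batch i.
    Returns the global sum and the list of tasks each worker claimed.
    """
    counter = {"next": 0}            # shared global task counter
    lock = threading.Lock()          # makes "read and increment" atomic
    partial = [0.0] * n_workers      # one private accumulator per worker
    claimed = [[] for _ in range(n_workers)]

    def next_task():
        # Atomically fetch the current counter value and advance it.
        with lock:
            task = counter["next"]
            counter["next"] += 1
        return task

    def worker(rank):
        while True:
            task = next_task()
            if task >= len(task_costs):
                break                # no work left
            claimed[rank].append(task)
            partial[rank] += task_costs[task]

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Final global summation of the per-worker partial results.
    return sum(partial), claimed


costs = [float(i % 7 + 1) for i in range(100)]
total, claimed = parallel_fock_sum(costs, n_workers=4)
print(total == sum(costs))
```

Because a worker only asks for more work when it has finished its current batch, uneven batch costs are balanced automatically, at the price of one counter access per batch.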
The result of parallelism implemented at this level is a code scalable to a modest number of processors (around 32), at which point the cost of other components of the SCF procedure starts to become significant. The first of these addressed was the diagonalisation, which is now based on the PeIGS module from NWChem. Once Global Array (GA) capability [1] is added, some distribution of the linear algebra becomes trivial. As an example, the SCF convergence acceleration algorithm (DIIS - direct inversion in the iterative subspace) is distributed using GA storage for all matrices, and parallel matrix multiply and dot-product functions. This not only reduces the time to perform the step, but the use of distributed memory storage (instead of disk) also reduces the need for I/O during the SCF process. Substantial modifications were required to enable the MP2 gradient [2] and SCF 2nd derivatives to be computed in parallel. In both cases the conventional integral transformation step has been omitted, with the SCF step performed in direct fashion and the MO integrals generated by re-computation of the AO integrals and stored in the global memory of the parallel machine. The GA tools manage this storage and subsequent access. The basic principle by which the subsequent steps are parallelised involves each node computing a contribution to the current term from MO integrals resident on that node. For some steps, however, more substantial changes to the algorithms are required. For the MP2 gradient, the construction of the Lagrangian (the right-hand side of the coupled Hartree-Fock (CPHF) equations) requires MO integrals with three virtual orbital indices. Given the size of this class of integrals, they are not stored, the required terms of the Lagrangian being constructed directly from AO integrals. A second departure from the serial algorithm concerns the MP2 2-particle density matrix.
This quantity, which is required in the AO basis, is of a similar size to the 2-electron integrals and is stored on disk in the conventional algorithm, but is now generated as required during the derivative integral generation from intermediates stored in the GAs. In the SCF 2nd derivative module the coupled Hartree-Fock (CPHF) step and construction of perturbed Fock matrices are again parallelised according to the distribution of the MO integrals. The most costly step in the serial 2nd derivative algorithm is the computation of the 2nd derivative two-electron integrals. This step is trivially parallelised through a similar approach to that adopted in the direct SCF scheme, using dynamic load balancing based on a shared global counter. In contrast to the serial code, the construction of the perturbed Fock matrices dominates the parallel computation. It seems almost certain that these matrices would be more efficiently computed in the AO basis, rather than from the MO integrals as in the current implementation, thus enabling more effective use of sparsity when dealing with systems comprising more than 25 atoms. The performance of the DFT, MP2 and 2nd Derivative modules on the Cray T3E/1200E and the high-end systems from IBM, SGI, Compaq and Cray is shown in Table 1. Corresponding timings on a variety of commodity-based systems are shown in Table 2. The DFT calculations on morphine used a 6-31G** basis of 410 functions, those on cyclosporin a 6-31G basis of 1000 functions, both using the B3LYP hybrid functional. Note that the DFT calculations did not exploit CD fitting, but evaluated the Coulomb matrix explicitly. Considering the DFT results on the high-end systems, speedups of 99 and 107 are obtained on 128 Cray T3E nodes for the morphine and cyclosporin calculations, respectively.
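As a small worked example, the quoted speedups can be converted into parallel efficiencies. The sketch below assumes the usual definition, efficiency = speedup / number of processors; the baseline against which the speedups were measured is not stated in the text, and the function name is illustrative only.

```python
def parallel_efficiency(speedup, n_procs):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / n_procs


# Speedups quoted for 128 Cray T3E nodes in the DFT benchmarks.
for molecule, speedup in (("morphine", 99), ("cyclosporin", 107)):
    eff = parallel_efficiency(speedup, 128)
    print(f"{molecule}: {100 * eff:.1f}% efficiency on 128 nodes")
# morphine: 77.3% efficiency on 128 nodes
# cyclosporin: 83.6% efficiency on 128 nodes
```

The larger cyclosporin calculation retains more of the ideal speedup, as expected when the parallel integral work grows faster than the serial overheads.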
The 32-CPU timings for cyclosporin show that the fastest machine is evidently that with the fastest CPU, with the AlphaServer SC ES45/1000 outperforming the IBM SP/WH2, the SGI Origin 3800/R14k-500, the AlphaServer SC ES40/667 and the Cray Linux Supercluster by factors of 2.44, 1.67, 1.70 and 1.43 respectively. Note that the SGI Origin/R14k and AlphaServer SC ES40/667 exhibit almost identical run times up to 64 CPUs. Considering the higher node counts it is clear that all machines exhibit inferior scalability compared to the Cray T3E. Thus for cyclosporin, the AlphaServer SC ES45/1000 / Cray performance ratio of 5.55 found at 16 CPUs decreases to just 3.35 on 128 CPUs; corresponding figures for the smaller morphine calculation are 5.44 and 3.27.
Table 1. Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient modules in calculations on Morphine, Cyclosporin, di(tri-fluoromethyl)-biphenyl and Mn(CO)5H on the Compaq AlphaServer SC ES45/1000 and IBM, SGI, Compaq and Cray High-end Systems.
Turning to the cluster results of Table 2, and the total times to solution on 32 CPUs, we see that even the fast ethernet connected CS6 Pentium III/800 cluster is outperforming the Cray T3E/1200E in the DFT B3LYP calculations, delivering 117% (morphine) and 130% (cyclosporin) of Cray performance. Increasing the CPU speed while leaving the interconnect effectively unchanged leads to a predictable impact on performance. Thus the corresponding delivery figure for the fast ethernet CS4 Athlon AMD/1200 cluster in the cyclosporin calculation is 163%, with the AMD/1200-based cluster a factor of 1.25 times faster than CS6. While comfortably outperforming the T3E, a higher factor might have been expected based solely on single node performance. Coupling these more powerful CPUs with enhanced interconnect, as in the CS2 Alpha Linux and CS9 Pentium 4/2000 Myrinet-based Clusters, predictably leads to much higher percentage delivery. Thus the Myrinet-based Pentium 4 Cluster, with 32 CPU T3E-delivery figures of 323% (morphine) and 355% (cyclosporin), outperforms both the Origin 3800/R14k-500 and the CS2 Alpha Linux Cluster. In both benchmarks we find the 32-CPU elapsed times on the Pentium 4 cluster to be almost identical to those of the 128-node Cray T3E/1200E. With 64 CPUs the CS9 Cluster is faster than both the Origin 3800/R14k and Cray Alpha Linux SC in the cyclosporin calculation, and is outperforming the 256-node Cray T3E/1200E. Compared to the optimal high-end system, the AlphaServer SC ES45/1000, we find the CS9 Cluster to be delivering ca. 70% of the AlphaServer performance in both morphine and cyclosporin calculations, while delivery from the CS2 cluster is somewhat less (62 and 70%). Also worth noting is the relatively poor performance of the SCI/SCALI-connected CS7 cluster, slower by almost a factor of two than the Myrinet-connected CS9 cluster in both 32-CPU calculations. 
Again this stems not from any inherent inadequacy of the SCI interconnect, but from the non-tuned implementation of the Global Arrays on the SCALI platform. Table 2. Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient benchmark calculations on a variety of commodity-based systems.
† CS1: PIII/450 + FE; CS2: QSNet Alpha Linux EV67/667; CS9: P4/1200 + Myrinet; CS4: AMD K7/700 + FE; CS4‡: AMD K7/1200 + FE; CS6: PIII/800 + FE; CS7: AMD K7/1000 + SCI.
≠ Single CPU per dual processor node.
Considering the performance data for the MP2 gradient and SCF analytic 2nd derivative modules, we see that the MP2 geometry optimisation of the Mn(CO)5H molecule (with 217 basis functions) shows a speedup of 93 achieved using 128 T3E/1200E processors to perform the complete optimisation (involving 5 energy and 5 gradient calculations). A corresponding speedup of 86 is found when calculating the frequencies of 2,2'-di(tri-fluoromethyl)-biphenyl using a 6-31G basis of 196 functions. The greater reliance on the Global Arrays (GAs) in both SCF 2nd Derivative and MP2 calculations, and hence dependency on efficient interconnect, compared to the DFT module leads to less marked performance enhancements on all high-end platforms relative to the T3E (Table 1). Thus at 32 CPUs, the performance advantage of the AlphaServer SC ES45/1000 over the Cray is reduced to factors of 3.5 (MP2) and 2.9 (2nd Derivatives) compared to the figure of 5.1 found in the cyclosporin DFT calculations. The AlphaServer SC ES45/1000 remains the optimum high-end platform, outperforming the SGI Origin/R14k by a factor of 1.33 in the 32-CPU MP2 calculation, and the ES40/667-based AlphaServer SC by a factor of 1.49 in the corresponding 2nd Derivatives calculation. Again we note the relative degradation in scalability of the high-end platforms. In the MP2 calculation, the ES45/1000 performance advantage of 3.5 found at 32 CPUs decreases to 2.2 in the 128-CPU calculation; corresponding figures of 2.9 and 2.0 are found in the SCF 2nd Derivative calculation.
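The shrinking advantage can be restated as a statement about relative scaling. If R(p) = T_Cray(p) / T_ES45(p) is the advantage at p CPUs, then R(128) / R(32) equals the ES45 speedup from 32 to 128 CPUs divided by the Cray speedup over the same range. The following sketch applies that identity to the factors quoted above; the function name is hypothetical and no timings beyond the quoted ratios are assumed.

```python
def relative_scaling(advantage_small, advantage_large):
    """Speedup of the faster machine between two CPU counts, expressed
    as a fraction of the reference machine's speedup over the same range.

    With R(p) = T_ref(p) / T_fast(p):
        R(p2) / R(p1) = [T_fast(p1) / T_fast(p2)] / [T_ref(p1) / T_ref(p2)]
    """
    return advantage_large / advantage_small


# ES45/1000 advantage over the Cray T3E at 32 and 128 CPUs.
mp2 = relative_scaling(3.5, 2.2)
second_derivatives = relative_scaling(2.9, 2.0)
print(f"MP2 gradient:    {mp2:.2f}")                 # 0.63
print(f"2nd derivatives: {second_derivatives:.2f}")  # 0.69
```

On these figures the ES45/1000 gains roughly 63% (MP2) and 69% (2nd derivatives) of the speedup that the T3E obtains on going from 32 to 128 CPUs.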
This decline in scalability with faster CPU is particularly noticeable in the AlphaServer SC ES40/667; at 64 CPUs the AlphaServer is outperformed by the R14k-based SGI Origin 3800 in the MP2 calculation, with little performance improvement on moving from 64 to 128 CPUs. This effect is such that the 128 CPU AlphaServer is only some 10% better than the Cray T3E. The performance of the Cray Linux Supercluster in the 2nd Derivatives calculation is worth noting, only a factor of 1.2 slower than the AlphaServer ES45/1000 and faster than the SGI Origin in the 64-CPU calculation. This more central role of the GAs in both MP2 gradient and analytic 2nd derivative applications produces the expected impact in performance on the commodity clusters. Considering the total times to solution on 32 CPUs, we see that the CS6 Pentium III cluster is delivering a much reduced percentage of the Cray T3E (73%) in the MP2 gradient calculation, with only a modest reduction in elapsed time between 32 (4,847 seconds) and 48 CPUs (4,024 seconds). The significant increase in node CPU capability associated with the CS4 AMD-based cluster is seen to have no impact in this benchmark, with the solution time significantly slower than the CS6 Pentium III/800 cluster. It would appear that latency effects are crucial in this benchmark, with the Myrinet-connected CS9 cluster a factor of 1.8 times slower on 64 CPUs than the Quadrics based CS2 Alpha Linux Cluster. The impact of the non-tuned GA libraries on the CS7 Athlon Cluster is also apparent, with the 32-CPU performance some 2.8 times slower than that of the Pentium 4/2000 CS9 Cluster. A significant degradation in performance was originally noted on CS2, caused not by limited communications but by problems in the effective utilisation of shared memory on the dual CPUs of the UP2000. 
Revisions in release 3.1 of the GAs have largely addressed this, with the 32-CPU Alpha timing of 1550 seconds representing 228% of 32-node Cray performance, the cluster outperforming the Origin 3800/R12k-400 (1714 seconds) and AlphaServer SC ES40/667 (1603 seconds). With 64 CPUs the CS2 cluster (883 seconds) continues to outperform the SGI Origin 3800/R12k (1082 seconds) and AlphaServer SC ES40/667 (1078 seconds), and approaches the 256-node Cray T3E timing of 792 seconds. 32 CPUs of the CS2 and CS9 clusters deliver 65% and 59% respectively of the AlphaServer ES45/1000 in the MP2 benchmark. Somewhat surprisingly the Pentium and AMD-based clusters perform far more effectively in the SCF 2nd derivative benchmark. It would certainly appear that, in marked contrast to the MP2 calculation, this benchmark is relatively insensitive to latency effects. Both the CS6 Pentium III/800 and CS4 AMD/1200 clusters outperform the Cray at 32 CPUs (CS6, 127%; CS4, 135%). Neither the IBM SP nor the Alpha Cluster performs particularly effectively on this benchmark; while the revised GAs have improved the Alpha performance, the 32-CPU Linux Cluster delivers only 154% of T3E performance, one of the lowest such figures recorded in these benchmarks. An initial performance analysis reveals load-balancing problems in the Fock matrix construction, which may explain this effect. The 64-CPU timings do suggest however that the CS2 Linux Cluster (512 seconds) is performing on a par with the AlphaServer SC ES40/667 (488 seconds). In contrast the CS9 Pentium 4 cluster performs exceptionally well, with the 64-CPU timing of 356 seconds matching that of the AlphaServer SC ES45/1000; CS9 is a factor of 1.4 times faster than the CS2 Linux Cluster, and outperforms 128 nodes of the Cray T3E/1200E (499 seconds). 32 CPUs of the CS2 and CS9 clusters deliver 54% and 92% respectively of the AlphaServer ES45/1000 in this benchmark.
References
[1] J. Nieplocha, R.J. Harrison and R.J.
Littlefield, Global arrays; A portable shared memory programming model for distributed memory computers, in: Supercomputing '94, IEEE Computer Society Press, Washington, D.C. (1994). [2] G.D. Fletcher, A.P. Rendell and P. Sherwood, A parallel second-order Moller-Plesset gradient, Molec. Phys. 91:431-38 (1997).
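As a worked check on the percentage delivery figures quoted in the MP2 gradient discussion above, the sketch below assumes that "percentage delivery" means 100 × (reference elapsed time) / (system elapsed time) at equal CPU count. The 32-node Cray T3E/1200E time is not quoted in the text, so it is inferred here from two independent statements (CS6: 4,847 seconds at 73%; CS2: 1550 seconds at 228%); the function names are illustrative only.

```python
def delivery_percent(t_reference, t_system):
    """Performance of a system as a percentage of a reference machine,
    from elapsed times measured on the same number of CPUs."""
    return 100.0 * t_reference / t_system


def implied_reference_time(t_system, percent):
    """Invert delivery_percent to recover the reference elapsed time."""
    return t_system * percent / 100.0


# Inferred (not quoted) 32-node T3E time for the MP2 gradient benchmark.
t3e_from_cs6 = implied_reference_time(4847, 73)    # about 3538 s
t3e_from_cs2 = implied_reference_time(1550, 228)   # 3534 s
print(f"T3E time implied by CS6: {t3e_from_cs6:.0f} s")
print(f"T3E time implied by CS2: {t3e_from_cs2:.0f} s")

# 64-CPU MP2 timings: CS2 cluster against two high-end systems.
print(f"CS2 vs Origin 3800/R12k: {delivery_percent(1082, 883):.0f}%")
print(f"CS2 vs ES40/667:         {delivery_percent(1078, 883):.0f}%")
```

The two inferred T3E times agree to within rounding of the quoted percentages, which suggests the figures in the text are mutually consistent under this definition.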
Applications Performance: The DFT Coulomb Module of GAMESS-UK
M.F. Guest, P. Sherwood and H.J.J. van Dam
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD. m.f.guest@dl.ac.uk, p.sherwood@dl.ac.uk, and h.j.j.vandam@dl.ac.uk
Recent work has continued to focus on optimising and extending the fitted Coulomb module of the CCP1 density functional theory (DFT) code within GAMESS-UK for use on MPP machines. In order to reduce the cost of evaluating the Coulomb repulsion energy in medium sized molecules the charge density can be fitted to an auxiliary basis as proposed by Dunlap et al [1]:
In this equation
For efficiency only the maximal values of
The parallelisation is performed trivially by distributing the evaluation of integrals with the same set of wavefunction basis functions
Timings for a number of DFT calculations on the Morphine and Valinomycin molecules, conducted on the variety of high-end proprietary and commodity hardware under consideration, are shown in Tables 1 and 2. Calculations on morphine used a DZVP_A2 Dgauss basis of 410 functions, those on valinomycin a DZV_A2 basis of 882 functions, both using the HCTH functional. Timings are reported for calculations in which the Coulomb matrix was evaluated explicitly (J-explicit) and for those that used CD fitting (J-fit). The latter employed an A2_DFT auxiliary fitting basis for morphine (1171 functions), and an A1_DFT fitting basis (3012 functions) for valinomycin.
Table 1. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on Morphine and Valinomycin on High-end systems from IBM, SGI, Compaq and Cray.
It can be seen that the current implementation of the fitted Coulomb modules provides significant benefit, with scalability on the T3E greatly enhanced over that reported previously. Speedups of 105 and 110 are obtained on 128 nodes of the Cray T3E/1200E for the morphine and valinomycin calculations when evaluating the coulomb matrix explicitly. Corresponding speedups when using the coulomb fit are 75 and 100 respectively. Overall times to solution on 128 T3E nodes when using the fitted Coulomb approach are reduced by factors of 2.9 (morphine) and 2.1 (valinomycin). Considering the total times to solution on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the morphine calculation when evaluating the coulomb matrix explicitly show the following ordering: AlphaServer ES45/1000 (243) < Cray Supercluster (325) < AlphaServer ES40/667 (373) < SGI O3800/R14k (505) < IBM SP (589) with the AlphaServer ES45/1000 outperforming the Cray T3E/1200E by a factor of 6.1. A similar ordering is found in the larger valinomycin benchmark, with a somewhat reduced factor of 5.9. The timings of Table 1 do point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E. Thus the 32-CPU valinomycin improvement factors of 5.9 (ES45/1000) and 4.3 (Cray Supercluster) with explicit treatment of the Coulomb matrix are reduced to 5.2 (ES45/1000) and 3.6 (Supercluster) based on the 128-CPU timings. All machines show a significant reduction in time to solution when using the fitted Coulomb matrix compared to explicit treatment of the Coulomb term. 
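The 32-CPU explicit-J morphine timings quoted above can be turned into relative performance factors with a few lines of arithmetic. The sketch below uses only the five elapsed times given in the text; the dictionary and function names are illustrative.

```python
# 32-CPU elapsed times (seconds), morphine, explicit Coulomb matrix.
timings_32cpu = {
    "AlphaServer ES45/1000": 243,
    "Cray Supercluster": 325,
    "AlphaServer ES40/667": 373,
    "SGI O3800/R14k": 505,
    "IBM SP": 589,
}


def slowdown_vs_fastest(timings):
    """Return (name, factor) pairs ordered from fastest to slowest,
    where factor = elapsed time / fastest elapsed time."""
    best = min(timings.values())
    ordered = sorted(timings.items(), key=lambda item: item[1])
    return [(name, t / best) for name, t in ordered]


for name, factor in slowdown_vs_fastest(timings_32cpu):
    print(f"{name:24s} {factor:.2f}")
# AlphaServer ES45/1000    1.00
# Cray Supercluster        1.34
# AlphaServer ES40/667     1.53
# SGI O3800/R14k           2.08
# IBM SP                   2.42
```

A factor of f relative to a reference machine corresponds to a percentage delivery of 100 × f for the faster system, so the factor of 6.1 quoted against the T3E is the same statement as a delivery figure of roughly 610%.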
The 32-CPU J-fit timings for the morphine calculation show the following order: AlphaServer ES45/1000 (72) < Cray Supercluster (112) ~ O3800/R14k (113) < AlphaServer ES40/667 (129) Note that all 3c-2e integrals are held in memory for this 32-CPU morphine calculation. A comparison with the explicit-J timings shows the greater dependency of the fitted approach on interconnect. Table 2. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on Morphine and Valinomycin on a number of Commodity-based Systems (see text).
The SGI Origin 3800/R14k is now performing on a par with the Cray Supercluster, while the Origin 3800/R12k outperforms the IBM SP. The R12k-based Origin is now only a factor of 1.12 slower than the AlphaServer ES40/667, compared to the figure of 1.71 found with explicit treatment of the Coulomb matrix. Improvement factors when using the fitted approach versus explicit J in the larger valinomycin benchmark reflect this dependency on interconnect, particularly at higher processor count. Total times for the 32-CPU J-fit calculations are as follows: AlphaServer ES45/1000 (477) < SGI O3800/R14k (724) ~ Cray Supercluster (726) < AlphaServer ES40/667 (881) < SGI O3800/R12k (897). The 32-CPU T3E-performance delivery figures for the Compaq AlphaServer of 612% and 586% with explicit treatment of the Coulomb matrix are reduced to 518% (morphine) and 515% (valinomycin) based on the 128-CPU timings. In similar fashion the 32-CPU figure for the Cray Supercluster of 427% is reduced to 358% in the 128-CPU valinomycin calculation. Considering the total times to solution on the commodity hardware (Table 2), we see that the 32-CPU Pentium III/800 CS6 cluster is delivering 173% (morphine) and 178% (valinomycin) of the Cray T3E/1200E in the DFT calculations with explicit treatment of the Coulomb matrix. These factors show a significant reduction (to 93% and 131% respectively) when using the fitted Coulomb matrix. The more powerful CPUs of the other clusters of Table 2 lead to higher percentage delivery, particularly when evaluating the Coulomb matrix explicitly. The 32-CPU explicit Coulomb figure for valinomycin of 253% for the ethernet-interconnected CS4 AMD/1200 is reduced to 145% in the corresponding J-fit calculation. Enhancing both interconnect and CPU speed results in much higher figures, particularly in the fitted calculations.
Thus the 32-CPU Pentium4-based CS9 Cluster exhibits T3E-delivery figures in the valinomycin calculation of 425% (explicit coulomb) and 393% (fitted coulomb). The CS9 cluster is seen to outperform the AlphaServer SC/667 and the Origin 3800/R12k-400 in both explicit- and fitted-coulomb 32 CPU valinomycin calculations, and the Origin 3800/R14k-500 in the explicit calculation. The Cluster is seen to deliver 73% of the AlphaServer ES45/1000 when evaluating the Coulomb matrix explicitly, and 61% of the ES45/1000 when using the Coulomb fit. Indeed 32-CPUs of the CS9 cluster comfortably outperform 128 nodes of the Cray T3E in both explicit and J-fit calculations. Table 3. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on the Compaq AlphaServer SC ES45/1000 and High-end systems from IBM, SGI, Compaq and Cray (see text).
† Zeolite, Basis (AOs/CD)
The 64-CPU timings for the explicit calculation suggest that the CS9 cluster (971 seconds) is performing on a par with the Cray Supercluster (978 seconds). While CS9 outperforms the Linux Alpha CS2 Cluster at lower node counts, the Quadrics interconnect on the latter results in almost identical J-fit run times at 64 CPUs. At this node count both machines are somewhat slower than the Origin 3800/R14k (443 seconds). A further demonstration of the Coulomb Fit DFT code is given in Tables 3 and 4. Here we present timings for complete DFT calculations on a series of Zeolite fragments, conducted on the variety of high-end proprietary (Table 3) and commodity hardware (Table 4) under consideration. Note that the 833 MHz EV67 Compaq AlphaServer SC is now included in the proprietary hardware. While limited speedups are observed for the smaller fragments on the Cray T3E/1200E (46 and 54 for Si8O7H18 and Si8O25H18 respectively on 128 nodes), the higher value of 93 found for the largest fragment, Si28O67H30, is associated with the need for re-computation of the 3-centre integrals. Considering the total times to solution for the larger fragments on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the Si26O37H36 calculation show the following ordering: AlphaServer ES45/1000 (504) < AlphaServer ES40/833 (695) < SGI O3800/R14k (748) < Cray Supercluster (770), with the AlphaServer ES45/1000 outperforming the Cray T3E/1200E by a factor of 3.8. The performance of the Origin 3800/R14k is far stronger than might have been expected based solely on a consideration of CPU performance (e.g. SPECfp2000). The timings of Table 3 again point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E.
Thus the 32-CPU improvement factor for Si26O37H36 of 3.8 (AlphaServer SC ES45/1000) is reduced to 2.6 based on the 128-CPU timings. Similar conclusions arise from a consideration of the timings for the largest fragment (Si28O67H30). The improvement factor for the AlphaServer ES45/1000 against the Cray T3E at 32 CPUs of 425% is reduced to 209% based on the 128-CPU timings.
Table 4. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on a number of commodity-based systems.
† CS1: PIII/450 + FE; CS2: QSNet Alpha Linux EV67/667; CS9: P4/1200 + Myrinet; CS4: AMD K7/1200 + FE; CS6: PIII/800 + FE; CS7: AMD/K7-1000 + SCI.
‡ Basis (AOs/CD).
≠ Figures in parentheses indicate the 1 CPU/node timings.
Considering the total times to solution on 32 CPUs of the commodity hardware (Table 4), we see that the Pentium III/800 CS6 cluster is delivering between 60% and 73% of the Cray T3E/1200E in the J-fit calculations. Interestingly there are no problems encountered when trying to run the same calculations on the fast-ethernet clusters because of the small demand on interconnect imposed by the replicated-data characteristics of GAMESS-UK. The more powerful CPUs of the other clusters of Table 4 lead to higher percentage delivery, although these do not reflect the individual CPU performance for the ethernet-interconnected machines. Thus we find 32-CPU figures of 77% for Si26O37H36 and 82% for Si28O67H30 (CS4 AMD/1200). Enhancing both interconnect and CPU speed results in much higher figures. Thus the 32-CPU Linux Alpha CS2 Cluster exhibits delivery figures of 241% (Si26O37H36) and 219% (Si28O67H30), the CS9 Pentium 4/2000 cluster figures of 269% (Si26O37H36) and 265% (Si28O67H30). The CS2 cluster is seen to outperform the IBM SP/WH2-375 and the Origin 3800/R12k-400 in all 32-CPU fragment calculations; 32 CPUs of the cluster outperform 128 nodes of the T3E on all but the largest fragment. The CS9 cluster is faster still on the larger fragments, outperforming the Origin 3800/R14k-500 and Cray Supercluster in the 32-CPU Si26O37H36 and Si28O67H30 calculations; 32 CPUs of the CS9 cluster again outperform 128 nodes of the T3E on all but the largest fragment. Excluding the smallest fragment from those under consideration, the Alpha CS2 Cluster is seen to deliver between 52% and 63% of the performance of the AlphaServer SC ES45/1000 and between 90% and 94% of the Origin 3800/R14k-500 in all 32-CPU calculations.
Increased figures are found on the Pentium 4/2000 CS9 Cluster; 62-70% of the AlphaServer SC ES45/1000 and between 97-110% of the Origin 3800/R14k-500. Timings for the 64-CPU calculations reveal the following ordering for the CS9 cluster against the proprietary hardware:
Si26O37H36: AlphaServer ES45/1000 (379) < AlphaServer ES40/833 (484) < P4/2000 CS9 Cluster (499) < SGI O3800/R14k (515) < SGI O3800/R12k (559)
Si28O67H30: AlphaServer ES45/1000 (739) < AlphaServer ES40/833 (928) < P4/2000 CS9 Cluster (954) < SGI O3800/R14k (994) < SGI O3800/R12k (1247)
References
[1] B.J. Dunlap, W.D. Connolly and J.R. Sabin, On some approximations in applications of Xα theory, Journal of Chemical Physics 71, 3396-3402 (1979).
[2] G. Fann and R.J. Littlefield, Parallel inverse iteration with reorthogonalisation, in: Sixth SIAM Conference on Parallel Processing for Scientific Computing (SIAM), pp409-13 (1993).
Applications Performance: Release 3.5 of CPMD
M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD. m.f.guest@dl.ac.uk
The CPMD code is a plane wave / pseudopotential implementation of Density Functional Theory, particularly designed for ab-initio molecular dynamics. The first version was developed by Jurg Hutter at IBM Zurich Research Laboratory starting from the original Car-Parrinello codes. Over the years many people from diverse organisations have contributed to the development of the code and of its pseudopotential library [1]. The current version, 3.5, is copyrighted jointly by IBM Corp and by Max Planck Institute, Stuttgart, and is distributed free of charge to non-profit organisations. CPMD runs on many different computer architectures and is well parallelized (MPI and Mixed MPI/SMP). Its main characteristics are:
Initially Version 3.3a of the code [4] was ported to the CS1 Pentium III/450 cluster, and subsequently benchmarked, by Sprik and Vuilleumier (Cambridge). Note that CPMD is acting as the base code for the new CCP1 flagship project, and that further optimisation for Beowulf-class systems is planned during the course of this work. The initial comparison of Cray T3E and Beowulf hardware shown in Table 1 centres around a Liquid Water benchmark. The simulation comprises 32 water molecules, in a simple cubic periodic box of length 9.86 Å at a temperature of 300K, with a time step of 7 au, i.e. 0.169 fs, and a test run of 200 steps (34 fs). The calculation used the BLYP functional and Troullier and Martins pseudo-potential, with a reciprocal space cut-off of 70 Ry (952 eV).
Table 1. Time in Wall Clock Seconds for the CPMD Liquid Water parallel benchmark on the Cray T3E/1200E and the Pentium III/450 CS1 Cluster.
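The unit conversions quoted for the Liquid Water benchmark can be checked directly. The sketch below uses standard values for the atomic unit of time, the Rydberg, the molar mass of water and Avogadro's number; the constant and variable names are illustrative. The density estimate is an added sanity check on the box size and is not a figure taken from the text.

```python
AU_TIME_FS = 0.024188843     # 1 atomic unit of time in femtoseconds
RYDBERG_EV = 13.605693       # 1 Rydberg in electronvolts
AVOGADRO = 6.02214076e23     # molecules per mole
WATER_MOLAR_MASS = 18.015    # grams per mole

time_step_fs = 7 * AU_TIME_FS        # time step of 7 au
run_length_fs = 200 * time_step_fs   # test run of 200 steps
cutoff_ev = 70 * RYDBERG_EV          # plane-wave cut-off of 70 Ry

# Mass density of 32 water molecules in a cubic box of side 9.86 Angstrom.
box_volume_cm3 = (9.86e-8) ** 3      # 1 Angstrom = 1e-8 cm
mass_g = 32 * WATER_MOLAR_MASS / AVOGADRO
density_g_cm3 = mass_g / box_volume_cm3

print(f"time step:  {time_step_fs:.3f} fs")
print(f"run length: {run_length_fs:.0f} fs")
print(f"cut-off:    {cutoff_ev:.0f} eV")
print(f"density:    {density_g_cm3:.3f} g/cm^3")
```

The conversions reproduce the quoted 0.169 fs, 34 fs and 952 eV, and the box size corresponds to a density close to 1 g/cm^3, as expected for liquid water at ambient conditions.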
Note that the Pentium Cluster is seen to be performing well in comparison to the Cray T3E/1200E. This may be attributed to the relatively long iteration times associated with CPMD, and the small impact that the MPI_ALLTOALL routine has on the total elapsed times (compared to, for example, the more demanding MPI_ALLTOALLV). Good scalability is shown on the Cray T3E/1200E (a speedup of 49 on 64 nodes), although the EV56 node appears to be only marginally faster than the 450 MHz Pentium III. Thus the Pentium cluster achieves a percentage delivery figure of 62% of the Cray T3E/1200E on 32 nodes. We have recently implemented the latest version of the code on a number of additional platforms. These include the IBM SP/WH2-375 and Regatta-HPC node, the SGI Origin 3800/R14k and Compaq AlphaServer SC ES45/1000, plus three commodity systems: the CS2 Alpha Linux Cluster, the CS7 dual-Athlon K7/1000 "ukcp" cluster with SCALI interconnect, and the CS9 Pentium 4/2000 Cluster with Myrinet interconnect. Using the same cluster of 32 liquid water molecules, we report in Table 2 the time for performing a single point energy calculation, the calculation converging in 22 iterations. The AlphaServer SC ES45/1000 is clearly the optimal machine at 32 CPUs, outperforming the SGI Origin 3800 and IBM SP/WH2-375 by factors of 1.8 and 1.9 respectively. At 16 CPUs however the power4-based Regatta-HPC node outperforms the AlphaServer ES45/1000 by a factor of 1.25. Considering the three clusters, the CS2 Alpha Linux Cluster is optimal, almost twice the speed of the Myrinet-based CS9 Cluster and 2.4 times faster than the SCALI-based CS7 cluster. While this is in part due to the superior (lower) latency of QSNet compared with Myrinet, it probably also reflects a non-optimal implementation of the MPI_ALLTOALL collective on both machines.
This certainly contributes to the lack of scalability found on both clusters, and to a 32-CPU percentage delivery figure for the CS9 Cluster of just 30% against the AlphaServer ES45/1000, the lowest such figure in all the benchmarks described in this report.
Table 2. Time in Wall Clock Seconds for the CPMD Liquid Water parallel benchmark on both High-end and Commodity-based Systems.
References
[1] Michele Parrinello, Jurg Hutter, D. Marx, P. Focher, M. Tuckerman, W. Andreoni, A. Curioni, E. Fois, U. Roethlisberger, P. Giannozzi, T. Deutsch, A. Alavi, D. Sebastiani, A. Laio, J. VandeVondele, A. Seitsonen, S. Billeter and others.
[2] D. Marx and J. Hutter, "Ab-initio Molecular Dynamics: Theory and Implementation", Modern Methods and Algorithms in Quantum Chemistry, Forschungszentrum Juelich, NIC Series, vol. 1, (2000).
[4] CPMD, Version 3.3: Hutter, Alavi, Deutsch, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999).
Applications Performance: DL_POLY - Version 2
M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD. m.f.guest@dl.ac.uk
DL_POLY [1] is the parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and T.R. Forester for CCP5 (the Collaborative Computational Project for the Computer Simulation of Condensed Phases). The parallel implementation within Version 2 of the code is based on a replicated data (RD) strategy, and was designed at the outset for machines with up to 64 processors and systems of up to 30,000 atoms, although it has since found use on much larger architectures. Implicit in the RD approach is a dependence on fast global summations, a potential bottleneck on clusters with commodity interconnects. The performance scaling varies according to the kind of simulation being undertaken: systems possessing complex molecular topologies and constraint bonds typically scale less well than those requiring simple atomic descriptions, as they lead to a higher communication overhead. If constraint bonds are present, as they usually are in bio-molecular or polymer systems, then significant deviations from ideal behaviour are to be expected. The four benchmarks outlined below are those described on the CCP5 web site [2], with the same benchmark-numbering scheme adopted here:
Benchmark 4: A straightforward simulation of sodium chloride at 500 K, using the standard Ewald summation method to handle the electrostatic forces. A multiple time-step algorithm is used to increase performance, which requires recalculating the reciprocal space forces only twice in every five time steps. The electrostatic cut-off is set at 24 Å in real space, with a primary cut-off of 12 Å for the multiple time-step algorithm. The Van der Waals terms are calculated with a cut-off of 12 Å. The simulation is for 200 steps with a time step of 1 fs in the Berendsen NVT ensemble. The system size is 27,000 ions.
Benchmark 5: This simulation is of 8,640 atoms of an alkali disilicate glass at 1000 K. The electrostatics are again handled by the Ewald sum, with the interaction potential including a three-body valence angle term, which requires a link-cell scheme to locate atom triplets. The electrostatic cut-off is 12 Å and the Van der Waals cut-off is 7.6 Å; 3-body forces are cut off at 3.45 Å. The simulation is for 300 steps in the Hoover NVT ensemble, with a time step of 1 fs.
Benchmark 3: This simulation is of the enzyme transferrin in a solution comprising 8,102 TIP3P water molecules. A total of 27,593 atoms are in the system. The electrostatic forces are handled by a combination of neutral groups with the Coulombic potential. All force cut-offs are set at 8 Å. The simulation is for 250 steps with a time step of 0.1 fs, in the NVE ensemble. The water molecules are treated as rigid bodies and the transferrin is maintained by bond constraints using SHAKE. Valence angles and dihedral potentials are present in the transferrin model.
Benchmark 7: This system comprises 13,390 atoms, including 4,012 TIP3P water molecules solvating the gramicidin A protein molecule at 300 K. Both the protein and water molecules are defined with rigid bonds and maintained by the SHAKE algorithm. The water is held completely rigid, while the protein has angular and dihedral potential terms. Electrostatic interactions are handled by the neutral group method with a Coulombic potential truncated at 12 Å. The Van der Waals interactions are truncated at 8 Å. The simulation is for 500 time steps in the NVE ensemble with a 1 fs time step.
Performance scaling on the Cray T3E for Benchmarks 4 & 5 has been shown to be extremely good and is almost linear over the entire range of processor numbers. This reflects the high parallel efficiency of the Ewald sum implementation. Significantly inferior scaling is found in the two macromolecular benchmarks, Benchmarks 3 & 7.
This may be attributed to the difficulty in apportioning the neutral group calculations across processors, and the use of SHAKE for the bond constraints. Somewhat better scaling is found in Benchmark 7, which uses a larger cut-off in the electrostatic calculations and hence has a lower communication/computation ratio. The performance of the four DL_POLY benchmarks is shown in Table 1 (on the Cray T3E/1200E and IBM, SGI, Compaq and Cray High-end Systems) and Table 2 (on a number of commodity-based systems). Initial modifications made to the DL_POLY implementation on the commodity clusters included replacing the MPI_ALLREDUCE routines from both the LAM and MPICH libraries with a hypercube-based version rewritten at Daresbury. Considering the Ewald-based benchmarks, we would again point to the excellent scalability on the Cray T3E/1200E, with speedups of 135 (super-linear) and 98 obtained on 128 nodes of the Cray for benchmarks 4 and 5 respectively. This excellent scaling on the T3E is put into perspective when comparing the total 32 CPU times to solution against the high-end systems of Table 1. These suggest comparable run times for the IBM Regatta-H and AlphaServer SC ES45/1000 on each benchmark, with the AlphaServer SC delivering 874% (Benchmark 4) and 625% (Benchmark 5) of the Cray T3E on 32 CPUs. The weakness of the Cray EV56 CPU is clearly apparent. These factors decrease substantially on 128 CPUs, with the AlphaServer SC delivering 552% and 394% on Benchmarks 4 and 5 respectively. It is arguable, however, that these benchmarks do not provide a realistic assessment of the CPU capability of current high-end systems, given the limited size of the simulations under investigation and the extremely short run times involved.
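The percentage delivery figures used throughout this report can be read as a ratio of elapsed times. The short sketch below assumes the definition 100 × T(reference) / T(system); this assumption reproduces the 184% figure that accompanies the NaCl timings quoted in this section for the CS6 cluster and the Cray T3E/1200E (204 vs. 376 seconds). The function names are illustrative only.

```python
def percentage_delivery(t_reference, t_system):
    """Delivery of a system relative to a reference machine, in percent.

    Assumed definition: 100 * (reference elapsed time) / (system elapsed time),
    so values above 100 mean the system is faster than the reference.
    """
    return 100.0 * t_reference / t_system


def speed_factor(delivery_percent):
    """Convert a percentage delivery figure into a 'times faster' factor."""
    return delivery_percent / 100.0


# Benchmark 4 (NaCl) timings quoted in the text, in seconds.
t3e_seconds = 376.0   # Cray T3E/1200E
cs6_seconds = 204.0   # CS6 Pentium III/800 cluster

cs6_delivery = percentage_delivery(t3e_seconds, cs6_seconds)
print(f"CS6 delivery vs T3E: {cs6_delivery:.0f}%")

# The 874% quoted for the AlphaServer SC corresponds to a factor of about 8.7.
print(f"AlphaServer SC factor: {speed_factor(874.0):.2f}")
```

Read this way, a delivery figure below 100% simply means that the system takes longer than the reference machine at the same processor count.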
Considering again the 32 CPU performance, there is evidently little difference in performance between the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray SuperCluster EV67/833, all three being between 1.4 and 1.5 times slower than the Regatta H and AlphaServer ES45/1000. The timings do suggest that the optimum scalability is shown by the Origin 3800/R14k, although this is inferior to that found on the Cray T3E. Turning to the commodity systems of Table 2, the weakness of the Cray EV56 CPU is again apparent. Even the CS6 Pentium III/800 cluster is comfortably outperforming the Cray T3E/1200E in both the NaCl simulation (204 vs. 376 seconds) and the NaK silicate simulation. These percentage delivery figures of 184% and 151% on the 800 MHz Pentium cluster increase substantially on the more powerful CPUs of the AMD Athlon and Alpha Clusters. In Benchmark 4 we find a delivery figure of 257% for the CS4 K7/1200 cluster; the Benchmark 5 percentage is 192%. Corresponding 16-node figures for the CS3 AMD Athlon cluster are 233% (benchmark 4) and 255% (benchmark 5). It is clear, however, that providing just Fast Ethernet as interconnect is not sustainable much beyond 32 CPUs. The 64 CPU performance of the CS6 Pentium III/800 cluster is only marginally superior to that at 32 CPUs in Benchmark 4, while in Benchmark 5 the 64 CPU timing is actually slower.
Table 1. Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on the Cray T3E/1200E and IBM, SGI, Compaq and Cray High-end Systems.
The faster Alpha EV67 (CS2), Pentium 4/2000 (CS9) and Itanium/800 (CS8) CPUs, together with their enhanced QSNet and Myrinet interconnects, result in much higher delivery figures. The Myrinet-connected Itanium-based CS8 cluster performs exceptionally well on the NaCl benchmark, with 32 CPUs delivering 78% of the AlphaServer SC ES45/1000, outperforming the CS2 Linux Alpha and CS9 Pentium 4 clusters by factors of 1.45 and 2.0 respectively. The performance advantage of the Itanium-based CS8 cluster is not apparent in Benchmark 5, however, where all three clusters show comparable performance. The 32-CPU CS2 Linux Alpha Cluster now appears to be somewhat faster than CS8 and the Pentium 4-based CS9 cluster. The performance of the Pentium 4-based cluster is generally less impressive in the DL_POLY benchmarks compared to the electronic structure results presented previously (e.g., GAMESS-UK). This would appear to be a single CPU optimisation issue with DL_POLY itself on the Pentium 4, for it is not evident when analysing the related CHARMM benchmarks. The Alpha CS2 cluster outperforms the IBM SP and Origin 3800/R12k, with corresponding figures of 470% (benchmark 4) and 363% (benchmark 5) at 32 CPUs. The potential of the commodity-based systems in these simulations is striking; the 32-CPU Linux Alpha Cluster is outperforming 128 nodes of the Cray T3E in both benchmarks. A quite different picture of performance is revealed when considering the two macromolecular simulations, benchmarks 3 and 7. Now the scalability on the T3E/1200E is far more limited, with speedups of just 29 (benchmark 3) and 60 (benchmark 7) on 128 nodes of the Cray. The improvement in performance of the high-end systems over the T3E/1200E is also less apparent compared to the Ewald-based simulations. The fastest of these systems, the AlphaServer SC ES45/1000, is only a factor of 4.9 times faster on benchmark 7 on 32 CPUs (cf. factors of 8.7, benchmark 4 and 6.3, benchmark 5).
The 32-CPU AlphaServer SC/ES45 is now seen to marginally outperform the IBM Regatta-H (by a factor of 1.2), with both machines some way ahead of the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray Alpha Linux cluster. Comparable scalability is shown by the Origin 3800/R14k and AlphaServer SC/ES45, although this is significantly inferior to that found on the Cray T3E. Table 2. Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on a variety of commodity-based systems.
† CS1 PIII/450 + FE: LAM/MPI; CS2 QSNet Alpha Linux EV67/667; CS3 AMD K7/850 + Myrinet; CS4 AMD K7/1200 + FE: LAM/MPI; CS5 dual PIII/930 + SCALI; CS6 PIII/800 + FE: LAM/MPI; CS7 AMD K7/1000 + SCALI; CS8 dual Itanium/800 + Myrinet 2k; CS9 dual P4/2000 + Myrinet 2k (IFC).
This lack of scalability has a predictable effect on the performance of the commodity clusters, which now deliver significantly lower percentage delivery figures compared to those found in the Ewald-based simulations. Considering the CS6 Pentium III/800 cluster, we find 32-node T3E delivery figures of just 42% and 69% for benchmarks 3 and 7 respectively. While these figures increase significantly on the more powerful CPUs, they are far from impressive. Focusing on benchmark 7, we see only modest increases in delivery on the CS4 K7/1200 cluster (95%). These figures do increase substantially with improvements in interconnect. The advantage of enhanced interconnect is clear when comparing the performance of the CS4 Athlon K7/1200 and CS7 Athlon K7/1000 clusters, machines with comparable CPU performance. While the SCALI/SCI interconnect of the latter has no impact on the Ewald-based benchmark 4, it leads to the CS7 cluster outperforming CS4 by a factor of 1.7 in the macromolecular benchmark. Of the three leading clusters, the Myrinet-connected Itanium-based CS8 cluster is again competitive, 32 CPUs outperforming the CS2 Linux Alpha and CS9 Pentium 4 clusters by factors of 1.2 and 1.6 respectively. The 32 CPU elapsed time is identical to that of the SGI Origin 3800/R14k, AlphaServer ES40/667 and Cray Supercluster, delivering 64% of the AlphaServer ES45/1000. Worth noting here are the initial performance limitations on the Linux Alpha Cluster that arose from the way DL_POLY handled both co-ordinate and force arrays.
The x-, y- and z-co-ordinates and corresponding forces were stored as separate linear arrays, x(mxatms), y(mxatms) etc., coding that led to exceedingly poor cache re-use on the UP2000 processor. Re-writing the code to use, in hopefully obvious notation, xyz(3,mxatms) and fxyz(3,mxatms) improved overall performance on the Alpha cluster by a factor of 2.5 (although it had little effect on, for example, the IBM SP/WH2 with its larger 8 MByte cache). Having made these changes, the 32-CPU Linux Alpha CS2 Cluster again outperforms 128 nodes of the Cray T3E in both benchmarks. An additional feature exemplified by these benchmarks is the impact of the underlying MPI libraries on performance. While little effect was found in the Ewald-based simulations, a much greater impact was apparent on benchmarks 3 and 7. Thus the reduced latency associated with LAM MPI as against MPICH reduced the 32-node benchmark 3 timing on the CS1 Pentium III/450 cluster from 583 (MPICH) to 391 seconds (LAM). Finally, it is perhaps worth questioning the value and cost effectiveness of the Cray T3E/1200E in running molecular simulations using the DL_POLY software, a question that is reinforced by considering the CHARMM benchmarks presented below. In both classes of DL_POLY benchmark considered, those that scale well on the Cray (benchmarks 4 and 5) and those that scale badly (the macromolecular simulations), we see that 128-node Cray T3E performance is matched or exceeded by 32 CPUs of the Linux Alpha Cluster. While the latter scales less effectively than the Cray, the total times to solution are less. Given that the replicated data implementation within Version 2 of DL_POLY itself does not scale effectively beyond 128 Cray CPUs, it is difficult to justify using the Cray at all, given the implicit cost differential involved against the clusters considered in this report.
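The effect of the array re-ordering described above, from separate x(mxatms), y(mxatms), z(mxatms) arrays to a single xyz(3,mxatms) array, can be illustrated by counting the cache lines touched when the co-ordinates of a few scattered atoms are read. This is a simplified model rather than a reproduction of the measured factor of 2.5: the 64-byte cache line, the end-to-end placement of the three separate arrays and the chosen atom indices are all assumptions made for illustration.

```python
CACHE_LINE = 64   # bytes per cache line (assumed, not the UP2000 value)
WORD = 8          # bytes per double-precision co-ordinate


def lines_separate(atoms, mxatms):
    """Cache lines touched reading x(i), y(i), z(i) from three separate arrays.

    The three arrays are assumed to be laid out end to end in memory.
    """
    lines = set()
    for i in atoms:
        for comp in range(3):
            byte = (comp * mxatms + i) * WORD
            lines.add(byte // CACHE_LINE)
    return len(lines)


def lines_interleaved(atoms):
    """Cache lines touched reading xyz(1:3, i) from one array xyz(3, mxatms).

    Fortran column-major storage keeps the three components of an atom
    contiguous, so each atom touches at most two cache lines.
    """
    lines = set()
    for i in atoms:
        for comp in range(3):
            byte = (3 * i + comp) * WORD
            lines.add(byte // CACHE_LINE)
    return len(lines)


mxatms = 30000                    # illustrative array dimension
atoms = [0, 1000, 2000, 3000]     # scattered atoms, e.g. from a neighbour list

separate = lines_separate(atoms, mxatms)
interleaved = lines_interleaved(atoms)
print(f"separate arrays: {separate} lines, xyz(3,mxatms): {interleaved} lines")
```

In this model each scattered atom costs three cache-line fetches with separate arrays but only one with the interleaved layout, which is consistent with the observation that machines with large caches were far less sensitive to the change.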
Considerable effort has now been invested in the distributed data version of the code, DL_POLY 3 (see other articles in this report). This significantly extends the size of system amenable to study and, with major algorithmic enhancements such as the use of the Particle Mesh Ewald scheme for the Coulombic energy, exhibits far better scalability than the replicated data code discussed above. This code will certainly require high-end resources in the pursuit of 10⁶+ particle simulations and, based on our initial findings, will scale well on the 256+ CPUs characterising these machines.
References
[1] See http://www.dl.ac.uk/TCSC/Software/DL_POLY/main.html
[2] See http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/
Applications Performance: DL_POLY - Version 3
W. Smith, I.J. Bush, M.F. Guest and P. Sherwood
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD. W.Smith@dl.ac.uk, I.J.Bush@dl.ac.uk, M.F.Guest@dl.ac.uk, and P.Sherwood@dl.ac.uk
The previous section provided a comprehensive benchmarking of the replicated data (RD) version (Version 2.11) of DL_POLY [1], the parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and T.R. Forester for CCP5. These results clearly revealed the limitations inherent in the RD strategy, with restrictions in the size of system amenable to study, and limited scalability on current high-end platforms typified by the Compaq AlphaServer SC and Origin 3800. These limitations apply not only to systems possessing complex molecular topologies and constraint bonds, but also to systems requiring simple atomic descriptions, systems that historically exhibited excellent scaling on the Cray T3E/1200E. Other articles in this report have described the significant extensions to the code made possible by the development of the distributed data (or domain decomposition) version of the code (DL_POLY 3), developments that have been accelerated in light of the impending arrival of the HPC(X) system. In the present article we present recent results obtained on the Compaq AlphaServer ES45/1000, the SGI Origin 3800 and the Pentium 4/2000-based CS9 Cluster, results which highlight the drastic improvements in both system size and performance made possible through recent developments.
Table 1. Time in Wall Clock Seconds for four DL_POLY 3 benchmark calculations on the Compaq AlphaServer SC ES45/1000, SGI Origin 3800 and CS9 Pentium 4/2000 Cluster.
The four benchmarks reported in Table 1 include two Coulombic-based simulations of NaCl, one with 27,000 ions, the second with 216,000 ions. Both simulations involve use of the Particle Mesh Ewald scheme, with the associated FFT treated by an algorithm due to Bush that is designed to reduce communications cost. This circumvents use of the traditional all-to-all communications through a scheme (see separate article) that relies on column-wise communications only. The reported timings are for 500 time steps in the smaller calculation, and 200 time steps in the larger simulation. The other two benchmarks are macromolecular simulations based on Gramicidin-A; the first includes a total of 99,120 atoms and 100 time steps. The second, much larger simulation is for a system of eight Gramicidin-A species (792,960 atoms), with the timings reported for just 10 time steps. In terms of time to solution, we see that the AlphaServer SC outperforms the Origin 3800 at all processor counts in all four benchmarks; both 256 CPU runs for the larger NaCl and Gramicidin-A simulations suggest a factor of 1.6. These results show a marked improvement in performance compared to the replicated data version of the code, with the gratifying characteristic of enhanced scalability with increasing size of simulation, both in the ionic and macromolecular simulations. Considering the NaCl simulations, we find speedups of 139 and 122 respectively on 256 processors of the Origin 3800 and AlphaServer SC in the 27,000-ion simulation. These figures increase to 186 and 171 respectively in the larger simulation featuring 216,000 ions. A more compelling improvement with system size is found in the macromolecular Gramicidin-A simulations. In the distributed data implementation, both SHAKE and short-range forces require only nearest neighbour communications, suggesting that communications should scale linearly with the number of nodes, in marked contrast to the replicated data implementation.
This is borne out in practice. In the larger simulation (with 792,960 atoms) we find speedups of 208 and 175 on 256 processors of the Origin 3800 and AlphaServer SC respectively. This level of scalability provides a significant advance over the performance exhibited by both DL_POLY 2 and CHARMM (see next article), and represents a major step forward towards the goal of effective exploitation of the HPC(X) system in the field of molecular simulation.
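The speedups quoted above can be turned into parallel efficiencies to make the trend with system size easier to see. The sketch below assumes that each quoted speedup is measured relative to a single processor, so that efficiency is simply speedup divided by 256; the source does not state the baseline, and the dictionary labels are illustrative.

```python
NPROC = 256

# Speedups on 256 processors quoted in the text: (Origin 3800, AlphaServer SC).
speedups = {
    "NaCl, 27,000 ions": (139, 122),
    "NaCl, 216,000 ions": (186, 171),
    "Gramicidin-A, 792,960 atoms": (208, 175),
}


def parallel_efficiency(speedup, nproc):
    """Fraction of ideal (linear) speedup achieved on nproc processors."""
    return speedup / nproc


efficiency = {
    case: tuple(parallel_efficiency(s, NPROC) for s in pair)
    for case, pair in speedups.items()
}

for case, (origin, alphaserver) in efficiency.items():
    print(f"{case}: Origin 3800 {origin:.0%}, AlphaServer SC {alphaserver:.0%}")
```

Under that assumption the efficiency on the Origin 3800 rises from roughly 54% for the small NaCl case to about 81% for the largest Gramicidin-A system.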
Applications Performance: CHARMM
M.F. Guest and P. Sherwood
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD. M.F.Guest@dl.ac.uk, and P.Sherwood@dl.ac.uk
CHARMM (‘Chemistry at HARvard Macromolecular Mechanics’, version c26b2) is a general-purpose molecular mechanics, molecular dynamics and vibrational analysis package for modelling and simulation of the structure and behaviour of molecular systems. The benchmark is the standard CHARMM parallel benchmark involving an MD calculation of Carboxy Myoglobin (MbCO) with 3830 water molecules (14026 atoms, 1000 steps (1 ps), 12-14 Å shift). Although a macromolecular simulation, this MD benchmark shows many of the performance attributes demonstrated by the Ewald-based DL_POLY simulations. The performance of this benchmark is shown in Table 1 (on the Cray T3E/1200E, SGI Origin 3800/R14k-500 and High-end Systems from Compaq) and Table 2 (on a number of commodity-based systems).
Table 1. Time in Wall Clock Seconds for the CHARMM Carboxy Myoglobin parallel benchmark on the Cray T3E/1200E, SGI Origin 3800/R14k-500 and Compaq AlphaServer SC ES40/667 and ES45/1000.
We would again point to the excellent scalability on the Cray T3E/1200E, with a speedup of 96 obtained on 128 nodes of the Cray. This good scaling on the T3E is put into perspective when comparing the total 32 CPU times to solution against the high-end systems of Table 1. These suggest that the AlphaServer SC ES45/1000 is delivering 562% of the Cray T3E on 32 CPUs; the weakness of the Cray EV56 CPU is clearly apparent. This factor decreases substantially at higher node count, with neither the AlphaServer SC nor the Origin 3800/R14k-500 scaling beyond 32 CPUs. As with DL_POLY, it is arguable that these benchmarks do not provide a realistic assessment of the CPU capability of current high-end systems, given the limited size of the simulation under investigation and the extremely short run times involved. The timings do suggest that the optimum scalability is shown by the Origin 3800/R14k, although this is far inferior to that found on the Cray T3E. Although this lack of scalability has a predictable effect on the performance of the commodity clusters, the results of Table 2 suggest that CHARMM is delivering significantly higher percentage delivery figures compared to the corresponding macromolecular simulations using the replicated data version of DL_POLY. Considering the CS6 Pentium III/800 cluster, we find a 32-node T3E delivery figure of 172%, a figure close to the Ewald-based simulations using DL_POLY. Of particular note is the considerable advantage afforded by use of LAM-MPI rather than the more popular MPICH. This improves the 32-CPU timing by a factor of over two, a sure pointer to the latency-sensitive nature of these simulations. This delivery figure increases significantly on the more powerful CPUs with enhanced interconnect. Of the three leading clusters, the Myrinet-connected Pentium 4-based CS9 is optimal, 32 CPUs outperforming the CS7 AMD K7/1000 SCI and CS2 Linux Alpha clusters by factors of 1.2 and 1.3 respectively.
The 32 CPU elapsed time is almost identical to that of the SGI Origin 3800/R14k, delivering 95% of the AlphaServer ES45/1000. This latter percentage is the highest delivered by the CS9 cluster throughout all the applications considered. Table 2. Time in Wall Clock Seconds for the CHARMM Carboxy Myoglobin parallel benchmark on a number of commodity-based systems.
† CS1 PIII/450 + FE: LAM/MPI; CS2 QSNet Alpha Linux EV67/667; CS6 PIII/800 + FE: LAM/MPI; CS7 AMD K7/1000 + SCALI; CS9 dual P4/2000 + Myrinet.
The potential of the commodity-based systems in this simulation is again striking; the 16-CPU Pentium 4 CS9 Cluster is outperforming 128 nodes of the Cray T3E/1200E.
References
[1] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, "CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations", J. Comp. Chem. 4 (1983) pp. 187-217.
Applications Performance: QM/MM Coupling Approaches with CHARMM/GAMESS-UK
M.F. Guest and P. Sherwood
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD. M.F.Guest@dl.ac.uk, and P.Sherwood@dl.ac.uk
While developments in computer performance and QM algorithms are bringing increasingly complex systems within the scope of quantum mechanical calculations, many important chemical systems remain too large for pure quantum simulation. This is especially true if energies for many configurations are required, as in molecular dynamics studies. Parameterised classical energy methods remain important, and the simulation code CHARMM [1, see previous article] is one of the most widely used packages for the study of macromolecules such as proteins, nucleic acids and lipids. It supports energy minimisation and molecular dynamics approaches using a classical parameterised force field. In order to permit studies of reacting species it is useful to be able to incorporate the quantum mechanical energy of a part of the system into the forcefield, and over recent years a number of interfaces to quantum mechanical programs have been developed. Initially these were based on semi-empirical wavefunctions. More recently computational and hardware developments have led to increased interest in ab initio QM/MM schemes and interfaces to the GAMESS (US) and CADPAC packages have been implemented. The coupling between CHARMM and GAMESS-UK has been developed in collaboration with the groups of Bernie Brooks and Milan Hodoscek and follows a similar approach to these [2]. In the CHARMM QM/MM model the standard CHARMM forcefield is used for the classical partition and the QM/MM van der Waals interactions. The QM/MM electrostatics are handled by including point charges at the MM positions in the Hamiltonian. The energy and forces from the QM calculation, including electrostatic forces acting on the classical centres, are added to those computed by CHARMM.
The QM/MM approach involves introducing additional hydrogen (link) atoms to the edges of the QM cluster to terminate the quantum mechanical calculation. The forces on the link atoms can be handled by CHARMM using the same methods developed for treating explicit models of lone pairs. GAMESS-UK incorporates a DFT module in which an auxiliary basis fit of the charge density is used to provide an approximation to the Coulomb energy (see above). We have used these elements of GAMESS-UK to implement an alternative model in which the charge density of the classical system is included in the QM Hamiltonian not as a set of point charges but as a continuous charge distribution represented as a sum of Gaussian terms. This allows greater overlap between the QM and MM charge distributions without the introduction of major artefacts and thereby permits the exploration of a number of QM/MM schemes. Full details of QM/MM models based on this functionality will be published elsewhere [3]. The CHARMM package is parallelised using a variety of message passing protocols; we have chosen to base the parallel GAMESS-UK/CHARMM implementation on MPI. We can couple this with either the MPI- or GA-based parallel implementations of GAMESS-UK. When using the GAs we configure them to use MPI as the underlying communication protocol which simplifies the maintenance of the merged parallel code and also allows us to take advantage of optimised MPI implementations when provided by the vendors for specific networking hardware. We report some sample timings of the GA version on the Compaq AlphaServer SC ES45/1000 and SGI Origin 3800, plus two commodity clusters (CS7, the dual AMD K7/1000 MP with SCALI interconnect, and CS9, the dual P4/2000 Xeon with Myrinet). Ports to other parallel platforms are in progress, and details of the current status of the CHARMM/GAMESS-UK project may be found on the web [4]. 
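The contrast drawn above between point charges and a continuous Gaussian representation of the MM charge density can be illustrated with the electrostatic potential of a single charge. The sketch below is not the GAMESS-UK auxiliary-basis fit; it only compares the potential q/r of a point charge with that of one normalised Gaussian charge distribution, q·erf(√α·r)/r in atomic units, where the exponent `alpha` is a hypothetical value chosen for illustration.

```python
import math


def point_charge_potential(q, r):
    """Potential of a point charge q at distance r (atomic units); diverges as r -> 0."""
    return q / r


def gaussian_charge_potential(q, alpha, r):
    """Potential of a normalised Gaussian charge q*(alpha/pi)**1.5 * exp(-alpha*r**2).

    The closed form is q * erf(sqrt(alpha) * r) / r, with the finite
    limit 2*q*sqrt(alpha/pi) at r = 0.
    """
    if r == 0.0:
        return 2.0 * q * math.sqrt(alpha / math.pi)
    return q * math.erf(math.sqrt(alpha) * r) / r


q, alpha = 1.0, 1.0   # illustrative charge and Gaussian exponent
for r in (0.1, 0.5, 1.0, 2.0, 5.0):
    vp = point_charge_potential(q, r)
    vg = gaussian_charge_potential(q, alpha, r)
    print(f"r = {r:4.1f}  point: {vp:8.4f}  gaussian: {vg:8.4f}")
```

At large separation the two potentials coincide, while at short range the Gaussian form stays finite, which is the property that permits closer overlap between QM and MM charge distributions.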
The timings of Table 1 refer to a single energy and force calculation on the enzyme Triosephosphate Isomerase (TIM). This is one structure in a pathway that has been studied in considerable detail [5]. The system comprises a total of 4180 atoms, of which 35 are treated quantum mechanically, with the addition of 2 link atoms. A DFT calculation, using the B3LYP functional and the Ahlrichs DZP basis set (424 GTOs), is used for the QM region. With this balance of QM and MM calculations the time is dominated by the QM calculation, the only impact of the MM region being the increased number of 1-electron integrals required, which have necessitated additional parallelisation with respect to the previous GAMESS-UK implementation. These timings are consistent with the previously reported DFT timings, with the CS9 P4/2000 Xeon cluster significantly faster than both the Origin and the AMD-based cluster. The performance of the latter is again far from ideal given the non-optimised version of the underlying Global Array (GA) tools. Performance is limited primarily by the communication costs associated with the linear algebra steps in the SCF, which include diagonalisation, matrix multiply etc. As an example, the parallel diagonalisation (PeIGS) is slightly slower on 128 processors of the SGI Origin than on 64 (1.07 vs 0.99 s per iteration). Although inefficient, the parallelisation of these steps is nevertheless important when running on larger processor counts; as an illustration, the same calculation using serial matrix algebra takes 374 s on 64 processors. Perhaps the most promising way to extend the efficiency of parallel QM/MM calculations is to run a number of QM calculations simultaneously. This approach has been implemented in the GAMESS-UK/CHARMM interface and is being explored in the studies of reaction pathways using the replica path method [6].
References
[1] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, "CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations", J. Comp. Chem. 4 (1983) pp. 187-217.
[2] P.D. Lyne, M. Hodoscek and M. Karplus, "A Hybrid QM-MM Potential employing Hartree-Fock or Density Functional Methods in the Quantum Region", J. Phys. Chem. A 103 (1999) 3462.
[3] D. Das, K.P. Eurenius, E.M. Billings, P. Sherwood, D.C. Chatfield, M. Hodoscek, and B.R. Brooks, "Optimization of Quantum Mechanical/Molecular Mechanical Partitioning Schemes: Gaussian Delocalization of MM Charges and the Double Link Atom Method", in preparation.
[4] http://www.cse.clrc.ac.uk/Activity/CHMGUK
[5] C. Lennartz, A. Schäfer, F. Terstegen and W. Thiel, "Enzymatic Reactions of Triosephosphate Isomerase: A Theoretical Calibration Study", J. Phys. Chem. A, in press.
[6] H. L. Woodcock, M. Hodoscek, P. Sherwood, Y.S. Lee, H.F. Schaefer III, and B.R. Brooks, "Exploring the QM/MM Replica Path Method: A Pathway Optimisation of the Chorismate to Prephenate Claisen Rearrangement Catalyzed by Chorismate Mutase", Theor. Chem. Accts., in press.
Applications Performance: ANGUS
M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD. M.F.Guest@dl.ac.uk
ANGUS [1] performs direct numerical simulation (DNS) of turbulent premixed combustion in order to generate statistical data in support of modelling. The equations to be solved are the Navier-Stokes equations for fluid flow, augmented by two additional equations each describing the transport of a single scalar variable and together specifying the thermochemical state of the system in the presence of differential diffusion effects. Thus, in total there are six partial differential equations to be solved. A grid partitioning strategy is employed for ANGUS which is quite typical of many domain decomposition techniques used in parallel CFD. Due to the finite difference stencil it is necessary to introduce ‘halo’ or ‘ghost’ cells at the interface boundaries. These are used as a message cache and allow derivatives to be determined in these regions with only local variables. The halo cells are then updated as required. Discretisation of the equations is carried out using standard second-order central differences on a three-dimensional grid. The velocity nodes are located at the face-centres of each cell, giving a staggered-grid arrangement that conserves kinetic energy as well as mass and momentum. The pressure solver utilises a conjugate gradient method with a Modified Incomplete LU (MILU) preconditioner [2]. As with many CFD algorithms, the resulting matrix is both sparse and symmetric. In this case, it is heptadiagonal and the periodic boundary conditions also mean that the matrix is singular. A Multi-grid solver has also been provided. Level-1 BLAS are used heavily in both solvers, with the overall computational work expected to be roughly proportional to n³. The initial ANGUS CG-ILU benchmark considered below utilises a grid size of 144³.
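The role of the halo cells described above can be sketched in one dimension. The example below is not taken from ANGUS: it splits a periodic 1-D grid into equal subdomains, pads each with one halo cell per side, and checks that second-order central differences computed from local data alone match those of the undivided grid. The halo values are filled by direct copying, standing in for the message passing a parallel code would use, and all names are illustrative.

```python
import math


def central_diff(u_with_halo, h):
    """Second-order central difference on the interior of a halo-padded array."""
    return [(u_with_halo[i + 1] - u_with_halo[i - 1]) / (2.0 * h)
            for i in range(1, len(u_with_halo) - 1)]


def split_with_halos(u, nparts):
    """Split a periodic 1-D field into nparts subdomains, each with one halo
    cell on either side. Assumes len(u) is divisible by nparts."""
    n = len(u)
    size = n // nparts
    parts = []
    for p in range(nparts):
        lo, hi = p * size, (p + 1) * size
        left_halo = u[(lo - 1) % n]    # copy of the neighbour's last cell
        right_halo = u[hi % n]         # copy of the neighbour's first cell
        parts.append([left_halo] + u[lo:hi] + [right_halo])
    return parts


n = 16
h = 2.0 * math.pi / n
u = [math.sin(i * h) for i in range(n)]

# Reference: derivative on the whole periodic grid.
global_du = central_diff([u[-1]] + u + [u[0]], h)

# Domain-decomposed: each subdomain uses only its own cells plus halos.
local_du = []
for part in split_with_halos(u, 4):
    local_du.extend(central_diff(part, h))

print("max difference:", max(abs(a - b) for a, b in zip(global_du, local_du)))
```

In a three-dimensional code the same idea applies face by face, with the halo update becoming the principal communication step between neighbouring subdomains.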
Benchmark timings for one hundred iterations of the conjugate gradient solver on the Cray T3E/1200E, SGI Origin 3800/R14k-500, IBM SP/WH2-375 and Regatta-H and the Compaq AlphaServer SC (ES40/667 and ES45/1000) are reported in Tables 1 and 2. Corresponding timings on a variety of commodity-based systems are given in Tables 3 and 4.
Table 1. Time in Wall Clock Seconds on the Compaq AlphaServer SC (ES40/667 and ES45/1000), SGI Origin 3800/R14k-500 and IBM SP/WH2-375 and Regatta-H for the ANGUS CG-ILU Benchmark (144³).
Note that the timings reported for the IBM SP/WH2-375 and AlphaServer SC refer to CPU configurations in which all CPUs on a given 4-way node are involved in the computation. These timings show several distinct features. All machines, with the exception of the IBM Regatta-H, appear to exhibit super-linear speed-ups, although the IBM SP/WH2-375 is only marginally faster than the Cray T3E for a given node count. The best-performing machine appears to be the Compaq AlphaServer SC ES45/1000, with an effective speed-up of 12.8 on moving from 8 to 64 CPUs. The Origin 3800/R14k-500 also performs well; while a factor of 1.6 slower than the ES45/1000 at 64 CPUs, it also shows a super-linear speed-up factor of 15.8 on moving from 8 to 64 CPUs.
In stark contrast, the IBM power4-based Regatta-H scales extremely badly from 8 to 32 CPUs. It is the fastest machine by almost a factor of two with 8 CPUs, but is outperformed by the SGI Origin 3800/R14k-500 and both the AlphaServer SC ES45/1000 and ES40/667 at 32 CPUs. Its performance advantage over the Cray T3E/1200E, a factor of 6.8 at 8 CPUs, is reduced to just 3.0 when using 32 processors. While this behaviour is at first sight confusing, it may be rationalised by considering the factor that dominates this benchmark, namely memory bandwidth.
Additional insight can be gained by varying the distribution of processors over the available nodes (see Table 2). Now, for example, a 16-processor job on the IBM SP/WH2-375 or AlphaServer SC is run on either 4 or 8 nodes given the configuration available (with all CPUs used in the former case, and only 2 CPUs/node in the latter).
Table 2. Time in Wall Clock Seconds on the IBM SP/WH2-375 and Compaq AlphaServer ES40/EV67-667 as a Function of Processor Distribution for the ANGUS CG-ILU Benchmark (144³).
The strong correlation between elapsed time and node occupancy in the above timings points to the driving influence of memory bandwidth on this benchmark. Thus an 8 CPU run on the IBM SP/WH2-375 realises elapsed times that vary by a factor of 2.3 depending on processor distribution (from 4394 seconds on 2 nodes to 1899 seconds on 8 nodes). Similarly, the 16 CPU benchmark on the AlphaServer SC requires 676 seconds on four 4-way processor nodes, and 511 seconds when using a single CPU of each of the available 16 nodes. The better memory bandwidth of the AlphaServer node accounts for the somewhat smaller variation in timings for a given node occupancy (1935 seconds for an 8 CPU run on 2 nodes, 1174 seconds for 8 CPUs on 8 nodes).
These performance attributes are completely consistent with the STREAM memory bandwidth benchmark [3] on the nodes of each machine. The TRIAD bandwidth of 900 MBytes/sec measured on a dedicated single SP/WH2 node is reduced to some 225 MBytes/sec when running the same benchmark on all 4 CPUs of the node. While the performance of the Regatta-H is indeed impressive at small processor counts, this advantage is rapidly lost as more CPUs have to compete for memory bandwidth on the same shared-memory node. While the complex cache hierarchy of the Regatta-H is designed to minimise this effect (a single-processor job is actually running in an environment comprising the total cache associated with an 8-way MCM, i.e. 128 MByte), it is clear that once CPUs are forced to access main memory, performance rapidly degrades. A TRIAD bandwidth of 4.2 GBytes/sec for a single process on a 16-way Regatta-HPC is reduced to 1.6 GBytes/sec when running the same benchmark on all 16 CPUs of the node.
The super-linear speed-up noted at 64+ CPUs on both the AlphaServer SC ES45 and Origin 3800 R14k is almost certainly caused by cache effects. At this point the 8 MByte cache on these CPUs is certainly alleviating the memory bandwidth problems encountered at smaller node counts.
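The factors quoted above follow directly from the timings and bandwidths given in the text, and the arithmetic can be checked with a short sketch. The numbers are copied from the preceding paragraphs; the function and variable names are illustrative only, and "relative efficiency" is simply the reported speed-up divided by the increase in CPU count, so that a value above 1 corresponds to super-linear behaviour.

```python
def ratio(slow, fast):
    """Return how many times larger `slow` is than `fast`."""
    return slow / fast


def relative_efficiency(speedup, cpu_ratio):
    """Speed-up divided by the increase in CPU count; > 1 means super-linear."""
    return speedup / cpu_ratio


# Elapsed times (seconds) quoted in the text: (fully occupied nodes, spread out).
occupancy_timings = {
    "IBM SP/WH2-375, 8 CPUs (2 vs 8 nodes)": (4394.0, 1899.0),
    "AlphaServer SC, 16 CPUs (4 vs 16 nodes)": (676.0, 511.0),
    "AlphaServer SC, 8 CPUs (2 vs 8 nodes)": (1935.0, 1174.0),
}

# STREAM TRIAD bandwidths quoted in the text: (single process, all CPUs busy).
triad_bandwidth = {
    "SP/WH2 4-way node (MBytes/sec)": (900.0, 225.0),
    "Regatta-HPC 16-way node (GBytes/sec)": (4.2, 1.6),
}

occupancy_penalty = {k: ratio(*v) for k, v in occupancy_timings.items()}
bandwidth_loss = {k: ratio(*v) for k, v in triad_bandwidth.items()}

# Speed-ups reported on moving from 8 to 64 CPUs for the 144^3 benchmark.
efficiency_es45 = relative_efficiency(12.8, 64 / 8)
efficiency_origin = relative_efficiency(15.8, 64 / 8)

for label, value in {**occupancy_penalty, **bandwidth_loss}.items():
    print(f"{label}: factor {value:.2f}")
print(f"ES45/1000 relative efficiency, 8 to 64 CPUs: {efficiency_es45:.2f}")
print(f"Origin 3800 relative efficiency, 8 to 64 CPUs: {efficiency_origin:.2f}")
```

Laid out this way, the occupancy penalty on each machine can be compared directly with the drop in TRIAD bandwidth between a lightly loaded and a fully loaded node.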
Turning to the commodity-based timings of Table 3, note again that figures for the CS2, CS5 and CS7-CS9 clusters refer to CPU configurations in which all CPUs on a given node are involved in the computation. The best cluster performance is obtained from the Myrinet-connected CS9 Pentium 4/2000 cluster, which at 32 CPUs outperforms the Cray T3E by a modest factor of 1.68, and achieves only 35% of the performance of the AlphaServer SC ES45/1000. This benchmark thus provides an example of one of the shortcomings of commodity-based systems, with their reliance on "cheap" memory sub-systems. The CS9 cluster is seen to outperform the dual Itanium/800-based CS8 cluster and the Alpha Linux CS2 cluster by factors of ca. 1.2. We see that while the Alpha Linux cluster outperforms both the Cray T3E/1200E and IBM SP/WH2-375 up to 16 nodes, this advantage is effectively lost at 32 CPUs, where the machine exhibits timings almost identical to those of the IBM (751 vs. 776 seconds respectively).
Table 3. Time in Wall Clock Seconds on a variety of commodity-based systems (CS1-CS9) for the ANGUS CG-ILU Benchmark (144³).
† CS1 PIII/450 + FE: LAM/MPI; CS2 QSNet Alpha Linux EV67/667; CS5 dual PIII/930 + SCALI; CS6 PIII/800 + FE: LAM/MPI; CS7 dual K7/1000 + SCALI; CS8 dual Itanium/800 + Myrinet; CS9 dual P4/2000 + Myrinet.
Considering the slower Pentium III clusters, it is clear that enhancing the CPU speed while leaving the memory subsystem unaltered produces at best a modest increase in performance. The CS6 PIII/800 cluster outperforms the CS1 PIII/450 cluster by a factor well below the MHz ratio (a factor of 1.3 on 8 CPUs, decreasing to just 1.1 on 32 CPUs). Equally, moving to dual-processor nodes, with the effective halving of memory bandwidth, leads to a major performance hit; thus the CS5 SCALI-based cluster with dual-processor PIII/930 CPUs is outperformed at all CPU counts by the CS1 cluster with its more modest PIII/450 CPUs and Fast Ethernet interconnect.
Additional insight into the findings above can again be gained by varying the distribution of processors over the available dual-processor nodes (see Table 4).
Table 4. Time in Wall Clock Seconds on a Variety of Commodity Clusters as a Function of Processor Distribution for the ANGUS CG-ILU Benchmark (144³).
The strong correlation between elapsed time and node occupancy again points to the driving influence of memory bandwidth. Thus a 16 CPU run on the Alpha Linux cluster requires 1635 seconds on 8 dual-processor nodes, and 936 seconds when using a single CPU of each of the available 16 nodes, i.e. a factor of 1.7 difference in performance. Similar factors are found for the 16 CPU runs on the CS9 Pentium 4 (1.81) and CS5 dual PIII/930 SCALI cluster (1.54). The timings above would suggest that the memory subsystem on the CS8 Itanium/800 cluster, and to a lesser extent that on the CS7 AMD K7/1000 MP cluster, is significantly better than that on the Pentium and Alpha systems. While the performance gain from using both, rather than just one, of the processors on 16 nodes is modest at best on the latter systems, that on CS8 and CS7 shows improvement factors of 2.2 and 1.4 respectively, in line with the increased number of CPUs.
These performance attributes are again completely consistent with the STREAM memory bandwidth benchmark [3] on the nodes of each machine. Thus the TRIAD bandwidth of 1 GByte/sec measured on a dedicated dual-processor UP2000 6/667 node is reduced to some 500 MBytes/sec when running the same benchmark on both CPUs of the node.
Finally, we have increased the grid size from the rather modest value of 144³ above in two further series of calculations, and present the timings for just ten iterations on a variety of hardware in Tables 5 and 6 (196³) and Tables 7 and 8 (288³).
Table 5. Time in Wall Clock Seconds for Ten Iterations of the ANGUS CG-ILU Benchmark (196³) on the SGI Origin 3800/R14k-500, Compaq AlphaServer SC ES40/667 and the IBM SP/WH2-375 and Regatta-HPC.
Table 6. Time in Wall Clock Seconds for Ten Iterations of the ANGUS CG-ILU Benchmark (196³) on a number of commodity clusters.
Table 7. Time in Wall Clock Seconds on the Compaq AlphaServer SC (ES40/667 and ES45/1000), SGI Origin 3800/R14k-500 and IBM SP/WH2-375 and Regatta-H for Ten Iterations of the ANGUS CG-ILU Benchmark (288³).
Table 8. Time in Wall Clock Seconds for Ten Iterations of the ANGUS CG-ILU Benchmark (288³) on a variety of commodity systems. The SGI Origin 3800/R14k-500 is included for comparison.
References
[1] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from an engineering perspective, Parallel Computing (1996).
[2] T.F. Chan and C-C.J. Kuo, Parallel Elliptic Preconditioners: Fourier Analysis and Performance on the Connection Machine, Computer Physics Communications, Vol. 53, 1989, pp 237-252.
[3] The STREAM Memory Bandwidth benchmark, see http://www.cs.virginia.edu/stream.
Applications Performance: Summary
M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD. M.F.Guest@dl.ac.uk
In the reports above we have presented a number of benchmarking results intended to update work described in the previous SLA reports, in which the performance of commodity systems based on Pentium III processors was judged against the Cray T3E/1200E. There is now little point in taking the Cray T3E/1200E as the standard, or in considering the performance of outdated Pentium III-based commodity systems. We now position the Compaq AlphaServer SC ES45/1000 (the TCS1 system at PSC) as the standard high-end resource, and consider the relative performance of the CS9 Pentium 4/2000 with Myrinet interconnect as representative of today's typical commodity-based offering. We summarise in Table 1 the conclusions of the benchmarking exercise on the reported applications, by showing
(§) Outperforms 128 nodes of the Cray T3E/1200E.
Table 1. Application Performance: Percentage of a 32-processor partition of (i) the Compaq AlphaServer SC ES45/1000 and SGI Origin 3800 R14k achieved by 32 processors of the CS9 Pentium 4/2000-based Cluster, and (ii) the SGI Origin 3800 R14k achieved by the CS2 QSNet Alpha Linux Cluster.
These figures suggest the following:
In summary, the collection of results presented in this report provides compelling evidence in support of commodity-based clusters. Suitably configured Beowulf systems not only provide highly cost-effective departmental, mid-range solutions, but can also match the levels of performance associated with a significant fraction of a high-end MPP machine, again for a small fraction of the cost.