the newsletter of
collaborative
computational project 1

Application Performance on High-end and Commodity-class Systems

M.F. Guest, P. Sherwood, W. Smith, I.J. Bush, and H.J.J. van Dam
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD
m.f.guest@dl.ac.uk, p.sherwood@dl.ac.uk, w.smith@dl.ac.uk, i.j.bush@dl.ac.uk, and h.j.j.vandam@dl.ac.uk

Introduction and Background

M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD
m.f.guest@dl.ac.uk

As part of our assessment of the current High-end Computing (HEC) landscape, and to assist in positioning commodity-based systems on this landscape, we continue to implement, benchmark and assess the performance of a number of key applications on proprietary high-end systems from IBM, SGI, Compaq and Cray. Applications considered below include those from computational chemistry (GAMESS-UK, DL_POLY and CHARMM), computational materials (CPMD), and computational engineering (ANGUS and FLITE3D).

The HEC hardware involved in this work is listed below. Those accessed for the first time in the current reporting period include the Compaq AlphaServer SC ES45/1000 (at Pittsburgh Supercomputing Centre, PSC), and a variety of power4-based systems from IBM (the 8-way IBM pSeries 690Turbo, 16-way Regatta HPC and 32-way Regatta H node):

  • IBM SP/WH2-375: The 32 CPU (8-node) IBM SP system at Daresbury features 4-way Winterhawk2 SMP "thin nodes" (375 MHz Power3-II processors with 8 MB L2 cache), 1.6 GB/sec node memory bandwidth (single bus), and interconnect switch with 150 MB/sec unidirectional, 200 MB/s bi-directional bandwidth.
  • IBM Regatta Systems: The latest addition to the IBM power series is the power4; clocked at 1.3 GHz, this CPU features in a number of new systems from IBM, including the 8-way pSeries 690Turbo, the 16-way Regatta HPC and the 32-way Regatta H nodes. Initial benchmark results have been obtained on all three machines.
  • SGI Origin 3800: 'TERAS', the supercomputer installed at Sara, the Netherlands, is a 1024-CPU system consisting of two 512-CPU SGI Origin 3800 systems. This machine has a peak performance of 1 TFlops (10^12 floating point operations per second). The machine has recently been fitted with 500 MHz R14k CPUs organized in 256 4-CPU nodes and now possesses 1 TByte of memory in total. Benchmarks were conducted on both the interim solution, comprising 400 MHz R12k CPUs, and on the final R14k-based system.
  • Compaq AlphaServer SC: The Compaq AlphaServer SC features 4-way ES40 SMP nodes (with 667 MHz Alpha 21264A CPUs with 8 MB L2 cache), 5.2 GB/sec node memory bandwidth (dual bus) and the Quadrics "fat tree" interconnect, QsNet (5 µsec latency, 210 MB/sec bandwidth). Benchmarks were conducted on a 64-node system at the APAC National Facility (Australian Partnership for Advanced Computing at ANU). Results have also been obtained on a second 64-node AlphaServer SC featuring 833 MHz Alpha 21264A CPUs. The most recent addition is the TCS1 system at PSC. Comprising 750 4-way ES45-based nodes (with 1 GHz EV68 Alpha 21264C CPUs), and dual-rail Quadrics interconnect, this system is the most powerful of those accessed.
  • Cray Alpha Linux Supercluster: The configuration of the Cluster prototype system "cougar" included a single OS and I/O node and 96 application nodes. Each application node had two 833 MHz Alpha CPUs (64 nodes having 2 GB memory, and 32 nodes having 1 GB memory) and a local disk with 14.5 GB of scratch space. The cluster featured the Myrinet interconnect, and was running Red Hat Linux release 6.2 (kernel 2.2.20 build40 on a 2-processor alpha). This system was of particular interest given its role as one of the first proprietary high-end solutions based on commodity components. However, it proved to be short-lived, with Cray discontinuing work around the prototype after API's announcement that it was to terminate development of any future Alpha-based products.

The commodity-based systems used in our studies have been described elsewhere. We merely provide a list of these here in Table 1, noting the presence of three new additions to the list, CS7, CS8 and CS9. CS7 is the SCALI/SCI interconnected dual AMD K7/1000 MP "ukcp" cluster, CS8 the Itanium-based "Titan" system at NCSA, consisting of 160 dual-processor IBM IntelliStation Z Pro server machines, and CS9 the dual Pentium 4/2000 Xeon "dirac" system at Bristol University. Both CS8 and CS9 feature the Myrinet 2000 interconnect.

Performance Metrics

In previous SLA reports we have summarised the conclusions of related benchmarking exercises of applications on commodity-based systems by showing the effective delivery of such systems against corresponding high-end hardware such as the Cray T3E/1200E and SGI Origin 3800, i.e. those high-end machines available to the UK’s HPC community. These comparisons have highlighted the inappropriate use of the latter systems for delivering capacity computing solutions, based on the simplest of cost-effective arguments. As a starting point for the present analysis, we show in Table 2 below a somewhat updated version of the summary table from the SLA 2000/2001 report. This has been modified to include the Pentium III/800 CS6 cluster, rather than the now outdated Pentium III/450 CS1 cluster, and shows

  • the percentage of a 32-node partition of the Cray T3E/1200E delivered by the Pentium III/800 CS6 and QSNet Alpha Linux CS2 systems (i.e. T(32-node Cray T3E) / T(32-CPU CSx)), and
  • the percentage of a 32-processor partition of the SGI Origin R14k/500 delivered by the QSNet Alpha Linux CS2 system (i.e. T(32-CPU SGI Origin 3800-R14k) / T(32-CPU CSx)).

These figures suggested the following:

  1. In many of the applications the inexpensive Pentium-based systems with a simple Fast Ethernet connection deliver a significant fraction of Cray T3E performance. While applications with extensive communication demands clearly exhibit inferior performance and scalability on the IA32-based system (e.g. DL_POLY with bond constraints, direct-MP2 gradient calculations using GAMESS-UK), the delivered performance of the Pentium III/800-based cluster is at worst 42% of the Cray T3E/1200E. Many of the other applications show a much higher delivered level of performance; in many cases the Beowulf cluster equals or actually exceeds Cray performance (e.g. GAMESS-UK, Ewald-based DL_POLY, CHARMM, FLITE3D). In these cases it made little or no sense to be using the T3E for 32-node runs when equivalent performance is achieved by a solution that costs a tiny fraction of that associated with using the high-end machine.
  2. There are a number of performance issues associated with the CS2 QSNet Alpha Linux Cluster that act to constrain performance, most notably the limited memory bandwidth of the UP2000, and the effective utilisation of L2 cache - the so-called issue of "page colouring" under Linux. Allowing for these, results from the CS2 Alpha cluster are most encouraging. In all benchmarks, the 32-CPU cluster exceeds the performance of 64 nodes of the Cray T3E/1200E (and that associated with the 32-CPU IBM SP/WH2). In optimal cases (those marked with a § in the Table) the Alpha Cluster is outperforming 128 nodes of the Cray T3E/1200E. The CS2 QSNet Alpha Linux cluster is seen to be competitive in performance with these newer machines, achieving for example between 78-106% of SGI Origin R14k/500 performance across a wide range of processor counts.
  3. Our previous consideration of performance on high-end systems has included the IBM SP/WH2-375, the Compaq AlphaServer SC (with both 667 and 833 MHz CPUs), the SGI Origin 3800 (both R12k-400 and R14k-500 CPUs) and a prototype of Cray's EV68-based Linux Alpha Cluster. These studies have proved revealing. While these more recent machines predictably outperform the Cray T3E/1200E, typically by factors of 4-6 at modest node count (e.g. 32 CPUs), there is clear evidence of an increasing imbalance between CPU and interconnect performance. This manifests itself by a marked lack of scalability with increased processor count for these systems compared to the Cray T3E. A significant reduction in this factor of 4-6 is found at higher node counts (>64) on current high-end systems.

Table 2. Application performance: percentage of a 32-node partition of (i) the Cray T3E/1200E achieved by the 32 processors of the CS6 Pentium/800 and CS2 QSNet Alpha Linux Clusters, and (ii) the SGI Origin 3800 R14k achieved by the CS2 Cluster.

Columns 2 and 3: T(32-node T3E) / T(32-CPU CSx). Column 4: T(32-CPU Origin 3800) / T(32-CPU CS2 QSNet Alpha Linux Cluster).

| Application Code      | CS6 Pentium-III/800 + FE Cluster (%) | CS2 QSNet Alpha Linux Cluster (%) | Origin 3800 vs CS2 QSNet Alpha Linux Cluster (%) |
|-----------------------|--------------------------------------|-----------------------------------|--------------------------------------------------|
| GAMESS-UK             |                                      |                                   |                                                  |
|   SCF                 | 96%                                  | 256%                              | 99%                                              |
|   DFT                 | 130-178%                             | 301-361% (§)                      | 99%                                              |
|   DFT (Jfit)          | 65-131%                              | 219-379%                          | 89-100%                                          |
|   DFT Gradient        | 130%                                 | 289% (§)                          | 89%                                              |
|   MP2 Gradient        | 73%                                  | 228%                              | 87%                                              |
|   SCF Force constants | 127%                                 | 154%                              | 86%                                              |
| DL_POLY               |                                      |                                   |                                                  |
|   Ewald-based         | 151-184%                             | 363-470% (§)                      | 95%                                              |
|   Bond constraints    | 69%                                  | 143-260%                          | 82%                                              |
| CHARMM                | 172%                                 | 404% (§)                          | 78%                                              |
| CPMD                  | -                                    | -                                 | 106%                                             |
| ANGUS                 | 68%                                  | 145%                              | 94%                                              |
| FLITE3D               | -                                    | 480% (§)                          |                                                  |

(§) Outperforms 128 nodes of the Cray T3E/1200E

In the articles below we attempt to update these comparisons, for there is now little point in taking the Cray T3E/1200E as the standard, or in considering the performance of commodity systems based on Pentium III processors. We now position the Compaq AlphaServer SC ES45/1000 (at Pittsburgh) as the standard, and consider the relative performance of the CS9 Pentium 4/2000 with Myrinet interconnect as representative of today's typical commodity based offering. Our interest will centre on whether such suitably-configured Beowulf systems can still provide not only highly cost-effective departmental, mid-range solutions, but can match the levels of performance associated with a significant fraction of a high-end machine, again for a small fraction of the cost.

Before moving to the applications, however, we present initially a summary of the work undertaken over the past 12 months in evaluating systems based on the two major arrivals into the HEC market place, the power4 processor from IBM, and Intel's IA64 commodity-based Itanium processors.

Applications Performance: The Parallel Implementation of GAMESS-UK

M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD.
m.f.guest@dl.ac.uk

In GAMESS-UK both SCF and DFT modules are essentially parallelised in a replicated data fashion, with each node maintaining a copy of all data structures present in the serial version. While this structure limits the treatment of molecular systems beyond a certain size, experience suggests that it is possible on machines with 256 MByte nodes to handle systems of up to 2,000 basis functions. The main source of parallelism in the SCF module is the computation of the one- and two-electron integrals and their summation into the Fock matrix, with the more costly two-electron quantities allocated dynamically using a shared global counter. The result of parallelism implemented at this level is a code scalable to a modest number of processors (around 32), at which point the cost of other components of the SCF procedure starts to become significant. The first of these addressed was the diagonalisation, which is now based on the PeIGS module from NWChem.
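The dynamic allocation of work through a shared global counter can be illustrated with a minimal sketch. This is not the GAMESS-UK implementation: the Python threads stand in for nodes, the lock-protected `SharedCounter` is a hypothetical stand-in for the shared counter, and `toy_integral_batch` is an invented placeholder for the evaluation of one batch of two-electron integrals.

```python
import threading


class SharedCounter:
    """Fetch-and-increment counter (hypothetical stand-in for the shared global counter)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        # Atomically return the current task index and advance the counter.
        with self._lock:
            value = self._value
            self._value += 1
            return value


def toy_integral_batch(task_id):
    # Placeholder for the cost and result of one integral batch (assumption).
    return task_id * task_id


def run_dynamic(n_tasks, n_workers):
    counter = SharedCounter()
    partial = [0] * n_workers                 # each worker's own partial sum
    claimed = [[] for _ in range(n_workers)]  # record of which tasks each worker took

    def worker(rank):
        while True:
            task = counter.next()             # claim the next unassigned batch
            if task >= n_tasks:
                break
            partial[rank] += toy_integral_batch(task)
            claimed[rank].append(task)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The final summation mimics the global sum of replicated contributions.
    return sum(partial), claimed
```

Whichever worker finishes a batch first simply claims the next one, so uneven batch costs are balanced without any prior knowledge of those costs; the price is one access to the shared counter per batch.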

Once the capability for GA [1] is added, some distribution of the linear algebra becomes trivial. As an example, the SCF convergence acceleration algorithm (DIIS - direct inversion in the iterative subspace) is distributed using GA storage for all matrices, and parallel matrix multiply and dot-product functions. This not only reduces the time to perform the step, but the use of distributed memory storage (instead of disk) reduces the need for I/O during the SCF process.

Substantial modifications were required to enable the MP2 gradient [2] and SCF 2nd derivatives to be computed in parallel. In both cases the conventional integral transformation step has been omitted, with the SCF step performed in direct fashion and the MO integrals generated by re-computation of the AO integrals and stored in the global memory of the parallel machine. The GA tools manage this storage and subsequent access. The basic principle by which the subsequent steps are parallelised involves each node computing a contribution to the current term from MO integrals resident on that node. For some steps, however, more substantial changes to the algorithms are required. For the MP2 gradient, the construction of the Lagrangian (the right-hand side of the coupled Hartree-Fock (CPHF) equations) requires MO integrals with three virtual orbital indices. Given the size of this class of integrals, they are not stored, the required terms of the Lagrangian being constructed directly from AO integrals. A second departure from the serial algorithm concerns the MP2 2-particle density matrix. This quantity, which is required in the AO basis, is of a similar size to the 2-electron integrals and is stored on disk in the conventional algorithm, but is now generated as required during the derivative integral generation from intermediates stored in the GAs.

In the SCF 2nd derivative module the coupled Hartree-Fock (CPHF) step and construction of perturbed Fock matrices are again parallelised according to the distribution of the MO integrals. The most costly step in the serial 2nd derivative algorithm is the computation of the 2nd derivative two-electron integrals. This step is trivially parallelised through a similar approach to that adopted in the direct SCF scheme - using dynamic load balancing based on a shared global counter. In contrast to the serial code, the construction of the perturbed Fock matrices dominates the parallel computation. It seems almost certain that these matrices would be more efficiently computed in the AO basis, rather than from the MO integrals as in the current implementation, thus enabling more effective use of sparsity when dealing with systems comprising more than 25 atoms.

The performance of the DFT, MP2 and 2nd Derivative modules on the Cray T3E/1200E and the high-end systems from IBM, SGI, Compaq and Cray is shown in Table 1. Corresponding timings on a variety of commodity-based systems are shown in Table 2. The DFT calculations on morphine used a 6-31G** basis of 410 functions, those on cyclosporin a 6-31G basis of 1000 functions, both using the B3LYP hybrid functional. Note that the DFT calculations did not exploit CD fitting, but evaluated the Coulomb matrix explicitly.

Considering the DFT results on the high-end systems, speedups of 99 and 107 are obtained on 128 Cray T3E nodes for the morphine and cyclosporin calculations, respectively. The 32-CPU timings for cyclosporin show that the fastest machine is evidently that with the fastest CPU, with the AlphaServer SC ES45/1000 outperforming the IBM SP/WH2, the SGI Origin 3800/R14k-500, the AlphaServer SC ES40/667 and the Cray Linux Supercluster by factors of 2.44, 1.67, 1.70 and 1.43 respectively. Note that the SGI Origin/R14k and AlphaServer SC ES40/667 exhibit almost identical run times up to 64 CPUs. Considering the higher node counts it is clear that all machines exhibit inferior scalability compared to the Cray T3E. Thus for cyclosporin, the AlphaServer SC ES45/1000 / Cray performance ratio of 5.55 found at 16 CPUs decreases to just 3.35 on 128 CPUs; corresponding figures for the smaller morphine calculation are 5.44 and 3.27.
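The loss of scalability described above can be checked directly from the cyclosporin timings of Table 1. The sketch below is illustrative only; the helper names are our own, and the relative efficiency is measured against the 16-CPU run because no single-CPU timings are quoted in the table.

```python
# Elapsed times (seconds) for the cyclosporin DFT benchmark, copied from Table 1.
T3E = {16: 6927, 128: 1039}    # Cray T3E/1200E
ES45 = {16: 1249, 128: 310}    # Compaq AlphaServer SC ES45/1000


def performance_ratio(t_reference, t_machine):
    """How many times faster the machine is than the reference."""
    return t_reference / t_machine


def relative_efficiency(times, p_small, p_large):
    """Measured speedup from p_small to p_large CPUs as a fraction of the ideal."""
    speedup = times[p_small] / times[p_large]
    ideal = p_large / p_small
    return speedup / ideal


ratio_16 = performance_ratio(T3E[16], ES45[16])
ratio_128 = performance_ratio(T3E[128], ES45[128])
eff_t3e = relative_efficiency(T3E, 16, 128)
eff_es45 = relative_efficiency(ES45, 16, 128)
```

On these figures the T3E retains roughly 0.83 of the ideal eightfold speedup between 16 and 128 CPUs, against roughly 0.50 for the ES45/1000. That is the same effect as the fall in the performance ratio from 5.55 to 3.35.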

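Table 1. Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient modules in calculations on Morphine, Cyclosporin, di(tri-fluoromethyl)-biphenyl and Mn(CO)5H on the Compaq AlphaServer SC ES45/1000 and IBM, SGI, Compaq and Cray High-end Systems.

| Benchmark                                | CPUs | Cray T3E/1200E | IBM SP/WH2-375 | SGI Origin 3800/R12k-400 | SGI Origin 3800/R14k-500 | Compaq Alpha SC ES40/667 | Compaq Alpha SC ES45/1000 | Cray Alpha Linux SC EV67/833 |
|------------------------------------------|------|----------------|----------------|--------------------------|--------------------------|--------------------------|---------------------------|------------------------------|
| Morphine DFT/B3LYP (6-31G**)             | 16   | 990            | 399            | 437                      | 355                      |                          | 182                       | 239                          |
|                                          | 32   | 515            | 249            | 248                      | 192                      |                          | 106                       | 143                          |
|                                          | 64   | 278            |                | 174                      | 116                      |                          | 66                        | 91                           |
|                                          | 128  | 160            |                |                          |                          |                          | 49                        |                              |
| Cyclosporin DFT/B3LYP (6-31G)            | 16   | 6927           | 2970           | 2696                     | 2208                     | 2144                     | 1249                      | 1760                         |
|                                          | 32   | 3612           | 1741           | 1503                     | 1191                     | 1210                     | 713                       | 990                          |
|                                          | 64   | 2003           |                | 961                      | 704                      | 722                      | 424                       | 606                          |
|                                          | 128  | 1039           |                |                          |                          | 531                      | 310                       |                              |
|                                          | 256  | 721            |                |                          |                          |                          |                           |                              |
| (C6H4(CF3))2 SCF 2nd Derivatives (6-31G) | 16   | 2687           | 1845           | 1574                     | 1490                     | 1080                     |                           | 1313                         |
|                                          | 32   | 1439           | 1085           | 985                      | 803                      | 746                      | 501                       |                              |
|                                          | 64   | 803            |                | 626                      | 494                      | 488                      | 360                       | 429                          |
|                                          | 128  | 499            |                |                          |                          | 402                      | 246                       |                              |
| Mn(CO)5H MP2 Gradient (TZVP)             | 16   | 6713           | 3446           | 3105                     | 2946                     | 2602                     | 1539                      |                              |
|                                          | 32   | 3530           | 2123           | 1714                     | 1346                     | 1603                     | 1012                      |                              |
|                                          | 64   | 1923           |                | 1082                     | 836                      | 1078                     | 634                       |                              |
|                                          | 128  | 1158           |                |                          |                          | 1006                     | 529                       |                              |
|                                          | 256  | 792            |                |                          |                          |                          |                           |                              |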

Turning to the cluster results of Table 2, and the total times to solution on 32 CPUs, we see that even the fast ethernet connected CS6 Pentium III/800 cluster is outperforming the Cray T3E/1200E in the DFT B3LYP calculations, delivering 117% (morphine) and 130% (cyclosporin) of Cray performance. Increasing the CPU speed while leaving the interconnect effectively unchanged leads to a predictable impact on performance. Thus the corresponding delivery figure for the fast ethernet CS4 Athlon AMD/1200 cluster in the cyclosporin calculation is 163%, with the AMD/1200-based cluster a factor of 1.25 times faster than CS6. While comfortably outperforming the T3E, a higher factor might have been expected based solely on single node performance.
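The delivery percentages quoted above are simply ratios of elapsed times. A minimal sketch, using the 32-CPU timings from Tables 1 and 2 of this article (the function and dictionary names are our own):

```python
def percent_delivered(t_reference, t_cluster):
    """Percentage of the reference machine's performance delivered by the cluster."""
    return 100.0 * t_reference / t_cluster


# 32-CPU elapsed times in seconds for the DFT B3LYP benchmarks.
T3E_32 = {"morphine": 515, "cyclosporin": 3612}      # Cray T3E/1200E
CS6_32 = {"morphine": 440, "cyclosporin": 2774}      # Pentium III/800 + Fast Ethernet
CS4_1200_32 = {"cyclosporin": 2213}                  # AMD K7/1200 + Fast Ethernet

cs6_morphine = percent_delivered(T3E_32["morphine"], CS6_32["morphine"])
cs6_cyclosporin = percent_delivered(T3E_32["cyclosporin"], CS6_32["cyclosporin"])
cs4_cyclosporin = percent_delivered(T3E_32["cyclosporin"], CS4_1200_32["cyclosporin"])
```

A value above 100 means that the cluster completes the benchmark in less time than the same number of T3E nodes.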

Coupling these more powerful CPUs with enhanced interconnect, as in the CS2 Alpha Linux and CS9 Pentium 4/2000 Myrinet-based Clusters, predictably leads to much higher percentage delivery. Thus the Myrinet-based Pentium 4 Cluster, with 32 CPU T3E-delivery figures of 323% (morphine) and 355% (cyclosporin), outperforms both the Origin 3800/R14k-500 and the CS2 Alpha Linux Cluster. In both benchmarks we find the 32-CPU elapsed times on the Pentium 4 cluster to be almost identical to those of the 128-node Cray T3E/1200E. With 64 CPUs the CS9 Cluster is faster than both the Origin 3800/R14k and Cray Alpha Linux SC in the cyclosporin calculation, and is outperforming the 256-node Cray T3E/1200E. Compared to the optimal high-end system, the AlphaServer SC ES45/1000, we find the CS9 Cluster to be delivering ca. 70% of the AlphaServer performance in both morphine and cyclosporin calculations, while delivery from the CS2 cluster is somewhat less (62 and 70%). Also worth noting is the relatively poor performance of the SCI/SCALI-connected CS7 cluster, slower by almost a factor of two than the Myrinet-connected CS9 cluster in both 32-CPU calculations. Again this stems not from any inherent inadequacy of the SCI interconnect, but from the non-tuned implementation of the Global Arrays on the SCALI platform.

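Table 2. Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient benchmark calculations on a variety of commodity-based systems.

| Benchmark    | CPUs | CS1   | CS2  | CS4 (K7/700) | CS4 (K7/1200) | CS6  | CS7  | CS9  |
|--------------|------|-------|------|--------------|---------------|------|------|------|
| Morphine     | 8    |       |      | 1062         | 777           |      |      | 550  |
|              | 16   | 1183  | 312  | 673          | 472           | 676  | 473  | 288  |
|              | 32   | 719   | 171  |              | 351           | 440  | 319  | 157  |
|              | 48   |       | 126  |              |               | 380  |      | 114  |
|              | 64   |       | 95   |              |               |      |      | 93   |
| Cyclosporin  | 8    |       |      | 7131         | 5166          |      |      | 3546 |
|              | 16   | 9156  | 2014 | 3952         | 3095          | 4399 | 3009 | 1859 |
|              | 32   | 5182  | 1200 |              | 2213          | 2774 | 1799 | 1018 |
|              | 48   |       | 856  |              |               | 2215 |      | 734  |
|              | 64   |       | 648  |              |               |      | 1151 | 585  |
| (C6H4(CF3))2 | 16   | 2977  | 1450 |              |               |      |      | 949  |
|              | 32   | 1809  | 933  |              | 1067          | 1133 | 912  | 543  |
|              | 48   |       | 634  |              |               |      |      | 400  |
|              | 64   |       | 512  |              |               |      |      | 356  |
| Mn(CO)5H     | 16   | 11499 | 2722 | 14578        | 13920         | 7487 | 5233 | 2521 |
|              | 32   | 8113  | 1550 |              | 6989          | 4847 | 3790 | 1725 |
|              | 48   |       | 1112 |              |               | 4024 |      |      |
|              | 64   |       | 883  |              |               |      |      | 1590 |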

CS1: PIII/450 + FE; CS2: QSNet Alpha Linux EV67/667; CS4: AMD K7/700 + FE and AMD K7/1200 + FE; CS6: PIII/800 + FE; CS7: AMD K7/1000 + SCI; CS9: P4/2000 + Myrinet.

Single CPU per dual processor node

Considering the performance data for the MP2 gradient and SCF analytic 2nd derivative modules, we see that the MP2 geometry optimisation of the Mn(CO)5H molecule (with 217 basis functions) shows a speedup of 93 achieved using 128 T3E/1200E processors to perform the complete optimisation (involving 5 energy and 5 gradient calculations). A corresponding speedup of 86 is found when calculating the frequencies of 2,2'-di(tri-fluoromethyl)-biphenyl using a 6-31G basis of 196 functions. The greater reliance on the Global Arrays (GAs) in both SCF 2nd Derivative and MP2 calculations, and hence dependency on efficient interconnect, compared to the DFT module leads to less marked performance enhancements on all high-end platforms relative to the T3E (Table 1). Thus at 32 CPUs, the performance advantage of the AlphaServer SC ES45/1000 over the Cray is reduced to factors of 3.5 (MP2) and 2.9 (2nd Derivatives) compared to the figure of 5.1 found in the cyclosporin DFT calculations. The AlphaServer SC ES45/1000 remains the optimum high-end platform, outperforming the SGI Origin/R14k by a factor of 1.33 in the 32-CPU MP2 calculation, and the ES40/667-based AlphaServer SC by a factor of 1.49 in the corresponding 2nd Derivatives calculation. Again we note the relative degradation in scalability of the high-end platforms. In the MP2 calculation, the ES45/1000 performance advantage of 3.5 found at 32 CPUs decreases to 2.2 in the 128-CPU calculation; corresponding figures of 2.9 and 2.0 are found in the SCF 2nd Derivative calculation. This decline in scalability with faster CPU is particularly noticeable in the AlphaServer SC ES40/667; at 64 CPUs the AlphaServer is outperformed by the R14k-based SGI Origin 3800 in the MP2 calculation, with little performance improvement on moving from 64 to 128 CPUs. This effect is such that the 128 CPU AlphaServer is only some 10% better than the Cray T3E. 
The performance of the Cray Linux Supercluster in the 2nd Derivatives calculation is worth noting, only a factor of 1.2 slower than the AlphaServer ES45/1000 and faster than the SGI Origin in the 64-CPU calculation.

This more central role of the GAs in both MP2 gradient and analytic 2nd derivative applications produces the expected impact in performance on the commodity clusters. Considering the total times to solution on 32 CPUs, we see that the CS6 Pentium III cluster is delivering a much reduced percentage of the Cray T3E (73%) in the MP2 gradient calculation, with only a modest reduction in elapsed time between 32 (4,847 seconds) and 48 CPUs (4,024 seconds). The significant increase in node CPU capability associated with the CS4 AMD-based cluster is seen to have no impact in this benchmark, with the solution time significantly slower than the CS6 Pentium III/800 cluster. It would appear that latency effects are crucial in this benchmark, with the Myrinet-connected CS9 cluster a factor of 1.8 times slower on 64 CPUs than the Quadrics-based CS2 Alpha Linux Cluster. The impact of the non-tuned GA libraries on the CS7 Athlon Cluster is also apparent, with the 32-CPU performance some 2.8 times slower than that of the Pentium 4/2000 CS9 Cluster. A significant degradation in performance was originally noted on CS2, caused not by limited communications but by problems in the effective utilisation of shared memory on the dual CPUs of the UP2000. Revisions in release 3.1 of the GAs have largely addressed this, with the 32-CPU Alpha timing of 1550 seconds representing 228% of 32-node Cray performance, the cluster outperforming the Origin 3800/R12k-400 (1714 seconds) and AlphaServer SC ES40/667 (1603 seconds). With 64 CPUs the CS2 cluster (883 seconds) continues to outperform the SGI Origin 3800/R12k (1082) and AlphaServer SC ES40/667 (1078), and approaches the 256-node Cray T3E timing of 792 seconds. 32 CPUs of the CS2 and CS9 clusters deliver 65% and 59% respectively of the AlphaServer ES45/1000 in the MP2 benchmark.

Somewhat surprisingly, the Pentium and AMD-based clusters perform far more effectively in the SCF 2nd derivative benchmark. It would certainly appear that, in marked contrast to the MP2 calculation, this benchmark is relatively insensitive to latency effects. Both the CS6 Pentium III/800 and CS4 AMD/1200 clusters outperform the Cray at 32 CPUs (CS6, 127%; CS4, 135%). Neither the IBM SP nor the Alpha Cluster performs particularly effectively on this benchmark; while the revised GAs have improved the Alpha performance, the 32-CPU Linux Cluster delivers only 154% of T3E performance, one of the lowest such figures recorded in these benchmarks. An initial performance analysis reveals load-balancing problems in the Fock matrix construction, which may explain this effect. The 64-CPU timings do suggest, however, that the CS2 Linux Cluster (512 seconds) is performing on a par with the AlphaServer SC ES40/667 (488). In contrast the CS9 Pentium 4 cluster performs exceptionally well, with the 64-CPU timing of 356 seconds matching that of the AlphaServer SC ES45/1000; CS9 is a factor of 1.4 times faster than the CS2 Linux Cluster, and outperforms 128 nodes of the Cray T3E/1200E (499 seconds). 32 CPUs of the CS2 and CS9 clusters deliver 54% and 92% respectively of the AlphaServer ES45/1000 in this benchmark.

References

[1] J. Nieplocha, R.J. Harrison and R.J. Littlefield, Global arrays; A portable shared memory programming model for distributed memory computers, in: Supercomputing '94, IEEE Computer Society Press, Washington, D.C. (1994).

[2] G.D. Fletcher, A.P. Rendell and P. Sherwood, A parallel second-order Møller-Plesset gradient, Molec. Phys. 91:431-38 (1997).

 

Applications Performance: The DFT Coulomb Module of GAMESS-UK

M.F. Guest, P. Sherwood and H.J.J. van Dam
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD.
m.f.guest@dl.ac.uk, p.sherwood@dl.ac.uk, and h.j.j.vandam@dl.ac.uk

Recent work has continued to focus on optimising and extending the fitted Coulomb module of the CCP1 density functional theory (DFT) code within GAMESS-UK for use on MPP machines. In order to reduce the cost of evaluating the Coulomb repulsion energy in medium sized molecules the charge density can be fitted to an auxiliary basis as proposed by Dunlap et al [1]:

$$\rho(\mathbf{r}) \approx \tilde{\rho}(\mathbf{r}) = \sum_{u} c_u f_u(\mathbf{r})$$

where the fitting coefficients can be obtained from:

$$c_u = \sum_{v} [V^{-1}]_{uv} \sum_{ij} D_{ij}\,(v|ij)$$

In this equation $V_{uv} = (u|v)$ are the electron repulsion integrals in the charge density basis, $(v|ij)$ are the three centre electron repulsion integrals between the wavefunction basis set and the charge density basis, and $D_{ij}$ is the density matrix. This technique enables the evaluation of 4-centre 2-electron integrals to be reduced to using at most 3-centre 2-electron integrals, clearly moving the formal scaling of the computational cost from 4th order to 3rd order. We can reduce the computational cost further by implementing a mechanism to store the 3-centre 2-electron integrals in main memory in a distributed fashion on parallel computers, thereby removing the need to compute these integrals on each iterative cycle of the direct-SCF process. This approach is facilitated using a Schwarz inequality to avoid evaluating and storing small integrals. The Schwarz inequality for 3-centre 2-electron integrals is (in analogy to the 4-centre 2-electron integral inequality):

$$|(ij|u)| \le \sqrt{(ij|ij)}\,\sqrt{(u|u)}$$

For efficiency only the maximal values of $\sqrt{(ij|ij)}$ or $\sqrt{(u|u)}$ for a given set of shells are stored. We have previously established that for dense clusters of water molecules using a Schwarz inequality screening tolerance of $10^{-5}$ yields a stable relative error of $10^{-4}$ in the energy. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored. Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core. Although the formal $n^3$ scaling of the inversion rules out the straightforward Dunlap scheme for very large systems, it is particularly valuable for systems of intermediate size, where the cost is dominated by the integral computation and Fock build steps. Previous studies of dense clusters of water molecules have established that the application of a Schwarz inequality screening tolerance of $10^{-5}$ reduces the number of integrals evaluated by about a factor of 5.
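The move from 4th-order to 3rd-order formal scaling can be made concrete by counting integrals. The sketch below counts only permutationally unique integrals before any screening, which is a simplifying assumption of ours. It uses the morphine basis sizes quoted in this article (410 wavefunction basis functions and 1171 fitting functions), and the function names are illustrative.

```python
def n_pairs(n_basis):
    """Number of unique (i, j) pairs with i >= j."""
    return n_basis * (n_basis + 1) // 2


def n_four_centre(n_basis):
    """Unique 4-centre integrals (ij|kl), counted as unique pairs of pairs."""
    p = n_pairs(n_basis)
    return p * (p + 1) // 2


def n_three_centre(n_basis, n_aux):
    """Unique 3-centre integrals (ij|u): one per basis pair and fitting function."""
    return n_pairs(n_basis) * n_aux


four = n_four_centre(410)
three = n_three_centre(410, 1171)
reduction = four / three
```

For these sizes the unscreened 3-centre count is smaller than the 4-centre count by a factor of about 36, and the gap widens as the basis grows, because one count rises with the fourth power of the basis size and the other with the third.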

The parallelisation is performed trivially by distributing the evaluation of integrals with the same set of wavefunction basis functions over the processors. Because the integrals are stored and read when needed, a static load balancing scheme has to be applied to avoid communication. If the amount of memory available is not sufficient to store all of the integrals then those that were not stored are recalculated. This combined in-core/direct approach can be optimised further by storing the largest integrals and recalculating the small ones. This way, as with direct-SCF, the Schwarz inequality can be tightened by including the density factors, reducing the number of integrals further. The inversion of the matrix was implemented using the PeIGS matrix diagonalisation package [2], with the matrix distributed column-wise, to maximise data locality in the evaluation of the fitting coefficients.
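The combination of static distribution, Schwarz screening and in-core/direct storage described above might be sketched as follows. This is a toy model under stated assumptions, not the GAMESS-UK code: the round-robin assignment of wavefunction pairs to processors, the `capacity` limit expressed as a number of integrals, and all names are hypothetical. The bound for each integral is taken as the product of the stored Schwarz factor for the wavefunction pair and that for the fitting function.

```python
def screen_and_partition(q_pair, q_fit, tol, n_procs, rank, capacity):
    """Split one rank's surviving 3-centre integrals into stored and recomputed sets.

    q_pair[p]: Schwarz factor for wavefunction pair p (assumed precomputed)
    q_fit[u]:  Schwarz factor for fitting function u (assumed precomputed)
    capacity:  number of integrals this rank can hold in memory (assumption)
    """
    survivors = []
    for p, qp in enumerate(q_pair):
        if p % n_procs != rank:          # static round-robin over wavefunction pairs
            continue
        for u, qu in enumerate(q_fit):
            bound = qp * qu              # Schwarz upper bound on the integral
            if bound >= tol:             # screened-out integrals are never computed
                survivors.append((bound, p, u))
    survivors.sort(reverse=True)         # largest bounds first
    stored = [(p, u) for _, p, u in survivors[:capacity]]
    recomputed = [(p, u) for _, p, u in survivors[capacity:]]
    return stored, recomputed
```

Because the assignment depends only on the pair index, each processor reads back exactly the integrals it stored and no communication is needed. Keeping the largest integrals in memory leaves the smallest ones to be recomputed, where a tighter density-weighted bound can discard more of them.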

Timings for a number of DFT calculations on the Morphine and Valinomycin molecules, conducted on the variety of high-end proprietary and commodity hardware under consideration are shown in Tables 1 and 2. Calculations on morphine used a DZVP_A2 Dgauss basis of 410 functions, those on valinomycin a DZV_A2 basis of 882 functions, both using the HCTH functional. Timings are reported for calculations in which the coulomb matrix was evaluated explicitly (J-explicit) and for those that used CD fitting (J-fit). The latter employed an A2_DFT auxiliary fitting basis for morphine (1171 functions), and an A1_DFT fitting basis (3012 functions) for valinomycin.

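Table 1. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on Morphine and Valinomycin on High-end systems from IBM, SGI, Compaq and Cray.

| Machine                                          | CPUs | Morphine (410 GTOs) DFT/HCTH J-explicit | Morphine (410 GTOs) DFT/HCTH J-fit | Valinomycin (882 GTOs) DFT/HCTH J-explicit | Valinomycin (882 GTOs) DFT/HCTH J-fit |
|--------------------------------------------------|------|-----------------------------------------|------------------------------------|--------------------------------------------|---------------------------------------|
| Cray T3E/1200E                                   | 16   | 3031                                    | 728                                | 15100                                      | 6226                                  |
|                                                  | 32   | 1488                                    | 391                                | 7617                                       | 3063                                  |
|                                                  | 64   | 817                                     | 237                                | 4300                                       | 1573                                  |
|                                                  | 128  | 440                                     | 150                                | 2139                                       | 995                                   |
| IBM SP/WH2-375                                   | 8    | 2067                                    | 456                                |                                            |                                       |
|                                                  | 16   | 1072                                    | 271                                | 6039                                       | 2204                                  |
|                                                  | 32   | 589                                     | 192                                | 3236                                       | 1198                                  |
| SGI Origin 3800/R12k-400                         | 8    | 2343                                    | 431                                |                                            |                                       |
|                                                  | 16   | 1200                                    | 235                                |                                            | 2007                                  |
|                                                  | 32   | 638                                     | 145                                | 2882                                       | 897                                   |
|                                                  | 64   | 357                                     | 120                                | 1680                                       | 654                                   |
| SGI Origin 3800/R14k-500                         | 8    | 1911                                    | 349                                |                                            |                                       |
|                                                  | 16   | 974                                     | 187                                | 4453                                       | 1632                                  |
|                                                  | 32   | 505                                     | 113                                | 2306                                       | 724                                   |
|                                                  | 64   | 267                                     | 82                                 | 1228                                       | 443                                   |
| Compaq AlphaServer SC ES40/667 - QsNet           | 16   | 711                                     | 193                                | 3910                                       | 1725                                  |
|                                                  | 32   | 373                                     | 129                                | 2033                                       | 881                                   |
|                                                  | 64   | 228                                     | 91                                 | 1123                                       | 539                                   |
|                                                  | 128  | 175                                     | 83                                 | 713                                        | 419                                   |
| Compaq AlphaServer SC ES45/1000 - QsNet          | 16   | 451                                     | 114                                | 2418                                       | 933                                   |
|                                                  | 32   | 243                                     | 72                                 | 1301                                       | 477                                   |
|                                                  | 64   | 133                                     | 52                                 | 705                                        | 304                                   |
|                                                  | 128  | 85                                      |                                    | 415                                        | 235                                   |
| Cray Supercluster Alpha Linux EV67-833 - Myrinet | 16   | 612                                     | 189                                | 3386                                       | 1488                                  |
|                                                  | 32   | 325                                     | 112                                | 1783                                       | 726                                   |
|                                                  | 64   | 209                                     | 81                                 | 978                                        | 476                                   |
|                                                  | 128  |                                         |                                    | 598                                        | 364                                   |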

It can be seen that the current implementation of the fitted Coulomb modules provides significant benefit, with scalability on the T3E greatly enhanced over that reported previously. Speedups of 105 and 110 are obtained on 128 nodes of the Cray T3E/1200E for the morphine and valinomycin calculations when evaluating the coulomb matrix explicitly. Corresponding speedups when using the coulomb fit are 75 and 100 respectively. Overall times to solution on 128 T3E nodes when using the fitted Coulomb approach are reduced by factors of 2.9 (morphine) and 2.1 (valinomycin).
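The reduction factors quoted above, and their dependence on processor count, follow directly from the Cray T3E/1200E rows of Table 1. A short sketch with names of our own choosing:

```python
# Cray T3E/1200E elapsed times in seconds from Table 1: CPUs -> (J-explicit, J-fit).
T3E_MORPHINE = {16: (3031, 728), 32: (1488, 391), 64: (817, 237), 128: (440, 150)}
T3E_VALINOMYCIN = {16: (15100, 6226), 32: (7617, 3063), 64: (4300, 1573), 128: (2139, 995)}


def fit_gain(times):
    """Ratio of J-explicit to J-fit elapsed time at each CPU count."""
    return {cpus: explicit / fitted for cpus, (explicit, fitted) in times.items()}


morphine_gain = fit_gain(T3E_MORPHINE)
valinomycin_gain = fit_gain(T3E_VALINOMYCIN)
```

For morphine the gain falls steadily as the processor count rises, from a little over 4 at 16 nodes to 2.9 at 128 nodes. This is consistent with the lower speedup of the fitted calculation.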

Considering the total times to solution on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the morphine calculation when evaluating the coulomb matrix explicitly show the following ordering:

AlphaServer ES45/1000 (243) < Cray Supercluster (325) < AlphaServer ES40/667 (373) < SGI O3800/R14k (505) < IBM SP (589)

with the AlphaServer ES45/1000 outperforming the Cray T3E/1200E by a factor of 6.1. A similar ordering is found in the larger valinomycin benchmark, with a somewhat reduced factor of 5.9. The timings of Table 1 do point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E. Thus the 32-CPU valinomycin improvement factors of 5.9 (ES45/1000) and 4.3 (Cray Supercluster) with explicit treatment of the Coulomb matrix are reduced to 5.2 (ES45/1000) and 3.6 (Supercluster) based on the 128-CPU timings. All machines show a significant reduction in time to solution when using the fitted Coulomb matrix compared to explicit treatment of the Coulomb term. The 32-CPU J-fit timings for the morphine calculation show the following order:

AlphaServer ES45/1000 (72) < Cray Supercluster (112) ~ O3800/R14k (113) < AlphaServer ES40/667 (129)

Note that all 3c-2e integrals are held in memory for this 32-CPU morphine calculation. A comparison with the explicit-J timings shows the greater dependency of the fitted approach on interconnect.

Table 2. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on Morphine and Valinomycin on a number of Commodity-based Systems (see text).

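| Machine | CPUs | Morphine (410 GTOs), DFT/HCTH J-explicit | Morphine (410 GTOs), DFT/HCTH J-fit | Valinomycin (882 GTOs), DFT/HCTH J-explicit | Valinomycin (882 GTOs), DFT/HCTH J-fit |
|---|---|---|---|---|---|
| CS1 PIII/450 + FE | 8 | 5956 | 1638 | | |
| | 16 | 3126 | 973 | 16965 | 6692 |
| | 32 | 1746 | 661 | 9201 | 3955 |
| CS2 EV67/667 Alpha Linux QsNet | 8 | 1489 | 379 | | |
| | 16 | 774 | 207 | 4057 | 1739 |
| | 32 | 404 | 124 | 2107 | 809 |
| | 48 | 283 | 98 | 1513 | 599 |
| | 64 | 201 | 78 | 1109 | 471 |
| CS4 AMD/700 + FE | 8 | 2459 | 823 | | |
| | 16 | 1309 | 522 | 6937 | 3572 |
| CS4 AMD/1200 + FE | 8 | 1634 | 722 | 8971 | 5037 |
| | 16 | 899 | 459 | 4921 | 3016 |
| | 32 | 574 | 356 | 3009 | 2115 |
| CS6 PIII/800 + FE | 8 | | 1024 | | |
| | 16 | 1447 | 629 | 7638 | 3910 |
| | 32 | 858 | 419 | 4281 | 2333 |
| | 48 | 623 | 365 | 3182 | 1878 |
| | 64 | 524 | | 2634 | 1733 |
| CS7 AMD/K7-1000 + SCALI/SCAMPI | 8 | 1854 | 790 | | |
| | 16 | 999 | 439 | 5096 | 2753 |
| | 32 | 581 | 312 | 2808 | 1542 |
| | 64 | | | 1649 | 1035 |
| CS9 P4/2000 + Myrinet 2k | 8 | 1354 | 433 | | |
| | 16 | 686 | 229 | 3451 | 1567 |
| | 32 | 361 | 137 | 1791 | 780 |
| | 48 | | 92 | 1237 | 579 |
| | 64 | 201 | | 971 | 476 |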

The SGI Origin 3800/R14k is now performing on a par with the Cray Supercluster, while the Origin 3800/R12k outperforms the IBM SP. The R12k-based Origin is now only a factor of 1.12 slower than the AlphaServer ES40/667, compared to the figure of 1.71 found with explicit treatment of the Coulomb matrix. Improvement factors when using the fitted approach versus explicit J in the larger valinomycin benchmark reflect this dependency on interconnect, particularly at higher processor count. Total times for the 32-CPU J-fit calculations are as follows:

AlphaServer ES45/1000 (477) < SGI O3800/R14k (724) ~ Cray Supercluster (726) < AlphaServer ES40/667 (881) < SGI O3800/R12k (897)

The 32-CPU T3E-performance delivery figures for the Compaq AlphaServer of 612% and 586% with explicit treatment of the Coulomb matrix are reduced to 518% (morphine) and 515% (valinomycin) based on the 128-CPU timings. In similar fashion, the 32-CPU figure for the Cray Supercluster of 427% is reduced to 358% in the 128-CPU valinomycin calculation.

Considering the total times to solution on the commodity hardware (Table 2), we see that the 32-CPU Pentium III/800 CS6 cluster is delivering 173% (morphine) and 178% (valinomycin) of the Cray T3E/1200E in the DFT calculations with explicit treatment of the Coulomb matrix. These factors show a significant reduction (to 93% and 131% respectively) when using the fitted Coulomb matrix. The more powerful CPUs of the other clusters of Table 2 lead to higher percentage delivery, particularly when evaluating the Coulomb matrix explicitly. The 32-CPU explicit Coulomb figure for valinomycin of 253% for the Ethernet-interconnected CS4 AMD/1200 is reduced to 145% in the corresponding J-fit calculation. Enhancing both interconnect and CPU speed results in much higher figures, particularly in the fitted calculations. Thus the 32-CPU Pentium 4-based CS9 cluster exhibits T3E-delivery figures in the valinomycin calculation of 425% (explicit Coulomb) and 393% (fitted Coulomb). The CS9 cluster is seen to outperform the AlphaServer SC/667 and the Origin 3800/R12k-400 in both explicit- and fitted-Coulomb 32-CPU valinomycin calculations, and the Origin 3800/R14k-500 in the explicit calculation. The cluster is seen to deliver 73% of the AlphaServer ES45/1000 when evaluating the Coulomb matrix explicitly, and 61% of the ES45/1000 when using the Coulomb fit. Indeed, 32 CPUs of the CS9 cluster comfortably outperform 128 nodes of the Cray T3E in both explicit and J-fit calculations.

Table 3. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on the Compaq AlphaServer SC ES45/1000 and High-end systems from IBM, SGI, Compaq and Cray (see text).

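| Zeolite fragment, basis (AOs/CD) | CPUs | Cray T3E/1200E | IBM SP/WH2-375 | SGI Origin 3800/R12k-400 | SGI Origin 3800/R14k-500 | Compaq Alpha SC ES40/833 | Compaq Alpha SC ES45/1000 | Cray Alpha Linux SC EV67/833 |
|---|---|---|---|---|---|---|---|---|
| Si8O7H18 (347/832) | 16 | 219 | 96 | 92 | 76 | 54 | 42 | 68 |
| | 32 | 135 | 77 | 60 | 49 | 39 | 32 | 46 |
| | 64 | 95 | | | | 38 | | |
| | 128 | 76 | | | | 37 | | |
| Si8O25H18 (617/1444) | 8 | | | | | | 209 | |
| | 16 | 596 | 301 | 254 | 218 | 182 | 137 | 229 |
| | 32 | 352 | 252 | 161 | 139 | 124 | 98 | 145 |
| | 64 | 232 | | 126 | 110 | 98 | | 120 |
| | 128 | 175 | | | | 90 | | |
| Si26O37H36 (1199/2818) | 16 | 3889 | 1787 | 1631 | 1396 | 1100 | 745 | 1290 |
| | 32 | 1931 | 1205 | 855 | 748 | 695 | 504 | 770 |
| | 64 | 1118 | | 559 | 515 | 484 | 379 | 564 |
| | 128 | 822 | | | | 386 | 315 | |
| Si28O67H30 (1687/3928) | 16 | | 3669 | 3203 | 2629 | 1983 | | |
| | 32 | 3751 | 2620 | 1781 | 1551 | 1338 | 883 | |
| | 64 | 2257 | | 1247 | 994 | 928 | 739 | |
| | 128 | 1287 | | | | 751 | 615 | |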

The 64-CPU timings for the explicit calculation suggest that the CS9 cluster (971 seconds) is performing on a par with the Cray Supercluster (978 seconds). While CS9 outperforms the Linux Alpha CS2 cluster at lower node counts, the Quadrics interconnect on the latter results in almost identical J-fit run times at 64 CPUs. At this node count both machines are somewhat slower than the Origin 3800/R14k (443 seconds).

A further demonstration of the Coulomb Fit DFT code is given in Tables 3 and 4. Here we present timings for complete DFT calculations on a series of Zeolite fragments, conducted on the variety of high-end proprietary (Table 3) and commodity hardware (Table 4) under consideration. Note that the 833 MHz EV67 Compaq AlphaServer SC is now included in the proprietary hardware. While limited speedups are observed for the smaller fragments on the Cray T3E/1200E (46 and 54 for Si8O7H18 and Si8O25H18 respectively on 128 nodes), the higher value of 93 found for the largest fragment, Si28O67H30 is associated with the need for re-computation of the 3-centre integrals. Considering the total times to solution for the larger fragments on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the Si26O37H36 calculation show the following ordering:

AlphaServer ES45/1000 (504) < AlphaServer ES40/833 (695) < SGI O3800/R14k (748) < Cray Supercluster (770)

with the AlphaServer ES45/1000 outperforming the Cray T3E/1200E by a factor of 3.8. The performance of the Origin 3800/R14k is far stronger than might have been expected based solely on a consideration of CPU performance (e.g. SPECfp2000). The timings of Table 3 again point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E. Thus the 32-CPU improvement factor for Si26O37H36 of 3.8 (AlphaServer SC ES45/1000) is reduced to 2.6 based on the 128-CPU timings. Similar conclusions arise from a consideration of the timings for the largest fragment (Si28O67H30): the delivery figure for the AlphaServer ES45/1000 against the Cray T3E of 425% at 32 CPUs is reduced to 209% based on the 128-CPU timings.
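The T3E speedups quoted above can be reproduced from the Table 3 timings. The text does not state how they were derived; the sketch below assumes ideal scaling up to the smallest processor count measured for each fragment, an assumption that happens to recover the quoted values of 46, 54 and 93. The function name is illustrative only.

```python
def extrapolated_speedup(base_cpus, base_time, time):
    """Speedup relative to one CPU, assuming ideal scaling up to base_cpus."""
    return base_cpus * base_time / time

# Cray T3E/1200E timings from Table 3:
# (smallest CPU count measured, its time in seconds, 128-CPU time in seconds)
t3e_runs = {
    "Si8O7H18": (16, 219, 76),
    "Si8O25H18": (16, 596, 175),
    "Si28O67H30": (32, 3751, 1287),
}

speedups = {
    name: extrapolated_speedup(base_cpus, base_time, t128)
    for name, (base_cpus, base_time, t128) in t3e_runs.items()
}

for name, value in speedups.items():
    print(f"{name}: speedup on 128 nodes ~ {value:.0f}")
```

Because the baseline is itself a parallel run, these figures are upper bounds on the true speedup relative to a single CPU.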

Table 4. Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on a number of commodity-based systems.

 

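| Zeolite fragment, basis (AOs/CD) | CPUs | CS1 | CS2 | CS4 | CS6 | CS7 | CS9 |
|---|---|---|---|---|---|---|---|
| Si8O7H18 (347/832) | 8 | | 118 | 231 | 327 | 218 (200) | 126 |
| | 16 | 314 | 68 | 178 | 225 | 156 (128) | 76 |
| | 32 | 290 | 49 | 184 | 186 | 168 (118) | 54 |
| | 48 | | 39 | | | | |
| | 64 | | 36 | | | | |
| Si8O25H18 (617/1444) | 8 | | | 899 | 1205 | 762 | 393 |
| | 16 | 1011 | 219 | 640 | 781 | 790 (458) | 228 |
| | 32 | 797 | 156 | 551 | 591 | 393 (324) | 144 |
| | 48 | | 119 | | 540 | | |
| | 64 | | 106 | | | | |
| Si26O37H36 (1199/2818) | 8 | | | 4793 | | | |
| | 16 | 5322 | 1419 | 3213 | 4224 | 2560 | 1276 |
| | 32 | 3478 | 800 | 2501 | 2793 | 1770 | 718 |
| | 48 | | 603 | | 2428 | 1429 | 542 |
| | 64 | | 533 | | | 1344 | 499 |
| Si28O67H30 (1687/3928) | 8 | | | 8530 | | | |
| | 16 | | 2774 | 5864 | 7535 | 4637 | 2393 |
| | 32 | 6809 | 1712 | 4593 | 5251 | 2853 | 1415 |
| | 48 | | 1366 | | 4670 | | 1077 |
| | 64 | | 1123 | | 4362 | 2409 | 954 |

Commodity systems: CS1 PIII/450 + FE; CS2 QSNet Alpha Linux EV67/667; CS4 AMD K7/1200 + FE; CS6 PIII/800 + FE; CS7 AMD/K7-1000 + SCI; CS9 P4/2000 + Myrinet.

Figures in parentheses indicate the 1 CPU/node timings.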

Considering the total times to solution on 32 CPUs of the commodity hardware (Table 4), we see that the Pentium III/800 CS6 cluster is delivering between 60% and 73% of the Cray T3E/1200E in the J-fit calculations. Interestingly, no problems are encountered when running the same calculations on the fast-Ethernet clusters, because of the small demand on interconnect imposed by the replicated-data characteristics of GAMESS-UK. The more powerful CPUs of the other clusters of Table 4 lead to higher percentage delivery, although for the Ethernet-interconnected machines these do not reflect the individual CPU performance. Thus we find 32-CPU figures of 77% for Si26O37H36 and 82% for Si28O67H30 (CS4 AMD/1200). Enhancing both interconnect and CPU speed results in much higher figures. Thus the 32-CPU Linux Alpha CS2 cluster exhibits delivery figures of 241% (Si26O37H36) and 219% (Si28O67H30), and the CS9 Pentium 4/2000 cluster figures of 269% (Si26O37H36) and 265% (Si28O67H30). The CS2 cluster is seen to outperform the IBM SP/WH2-375 and the Origin 3800/R12k-400 in all 32-CPU fragment calculations; 32 CPUs of the cluster outperform 128 nodes of the T3E on all but the largest fragment. The CS9 cluster is faster still on the larger fragments, outperforming the Origin 3800/R14k-500 and Cray Supercluster in the 32-CPU Si26O37H36 and Si28O67H30 calculations; 32 CPUs of the CS9 cluster again outperform 128 nodes of the T3E on all but the largest fragment. Excluding the smallest fragment from those under consideration, the Alpha CS2 cluster is seen to deliver between 52% and 63% of the performance of the AlphaServer SC ES45/1000 and between 90% and 94% of the Origin 3800/R14k-500 in all 32-CPU calculations. Increased figures are found on the Pentium 4/2000 CS9 cluster: 62% to 70% of the AlphaServer SC ES45/1000 and 97% to 110% of the Origin 3800/R14k-500.
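The percentage delivery figures used throughout are simply the ratio of the reference machine's elapsed time to that of the machine in question, at equal CPU count. A minimal sketch, using the 32-CPU timings of Tables 3 and 4 (the function and dictionary names are illustrative, not part of any benchmark code):

```python
def delivery_percent(reference_time, machine_time):
    """Performance delivered relative to a reference machine, in percent."""
    return 100.0 * reference_time / machine_time

# 32-CPU elapsed times in seconds
t3e_32 = {"Si26O37H36": 1931, "Si28O67H30": 3751}          # Table 3, Cray T3E/1200E
clusters_32 = {                                            # Table 4
    "CS2": {"Si26O37H36": 800, "Si28O67H30": 1712},
    "CS4": {"Si26O37H36": 2501, "Si28O67H30": 4593},
    "CS9": {"Si26O37H36": 718, "Si28O67H30": 1415},
}

delivery = {
    cluster: {frag: delivery_percent(t3e_32[frag], t) for frag, t in times.items()}
    for cluster, times in clusters_32.items()
}

for cluster, figures in delivery.items():
    for frag, pct in figures.items():
        print(f"{cluster} {frag}: {pct:.0f}% of the T3E")
```

A figure above 100% means the cluster is faster than the same number of T3E nodes.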

Timings for the 64-CPU calculations reveal the following ordering for the CS9 cluster against the proprietary hardware:

Si26O37H36:

AlphaServer ES45/1000 (379) < AlphaServer ES40/833 (484) < P4/2000 CS9 Cluster (499) < SGI O3800/R14k (515) < SGI O3800/R12k (559)

Si28O67H30:

AlphaServer ES45/1000 (739) < AlphaServer ES40/833 (928) < P4/2000 CS9 Cluster (954) < SGI O3800/R14k (994) < SGI O3800/R12k (1247)

References

[1] B.I. Dunlap, J.W.D. Connolly, J.R. Sabin, On some approximations in applications of Xα theory, Journal of Chemical Physics 71, 3396-3402 (1979).

[2] G. Fann and R.J. Littlefield, Parallel inverse iteration with reorthogonalisation, in: Sixth SIAM Conference on Parallel Processing for Scientific Computing (SIAM), pp409-13 (1993).

Applications Performance: Release 3.5 of CPMD

M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD.
m.f.guest@dl.ac.uk

The CPMD code is a plane wave / pseudopotential implementation of Density Functional Theory, particularly designed for ab-initio molecular dynamics. The first version was developed by Jurg Hutter at IBM Zurich Research Laboratory starting from the original Car-Parrinello codes. Over the years many people from diverse organisations have contributed to the development of the code and of its pseudopotential library [1]. The current version, 3.5, is copyrighted jointly by IBM Corp and by Max Planck Institute, Stuttgart, and is distributed free of charge to non-profit organisations. CPMD runs on many different computer architectures and is well parallelized (MPI and Mixed MPI/SMP). Its main characteristics are:

  • works with norm conserving or ultrasoft pseudopotentials;
  • LDA, LSD and the most popular gradient correction schemes; free energy density functional implementation;
  • isolated systems and systems with periodic boundary conditions; k-points;
  • molecular and crystal symmetry;
  • wavefunction optimization: direct minimization and diagonalization;
  • geometry optimization: local optimization and simulated annealing;
  • molecular dynamics: constant energy, constant temperature and constant pressure;
  • path integral MD;
  • response functions;
  • excited states; and,
  • many electronic properties.

Initially, Version 3.3a of the code [4] was ported to the CS1 Pentium III/450 cluster, and subsequently benchmarked, by Sprik and Vuilleumier (Cambridge). Note that CPMD is acting as the base code for the new CCP1 flagship project, and that further optimisation for Beowulf-class systems is planned during the course of this work. The initial comparison of Cray T3E and Beowulf hardware shown in Table 1 centres around a liquid water benchmark. The simulation comprises 32 water molecules in a simple cubic periodic box of length 9.86 Å at a temperature of 300 K, with a time step of 7 au (i.e. 0.169 fs) and a test run of 200 steps (34 fs). The calculation used the BLYP functional and Troullier-Martins pseudopotential, with a reciprocal space cut-off of 70 Ry (952 eV).
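The unit conversions quoted for the benchmark can be checked directly. The sketch below uses standard values for the atomic unit of time and the Rydberg energy; the variable names are illustrative.

```python
# Standard conversion factors (rounded CODATA values)
AU_TIME_FS = 0.024188843   # one atomic unit of time in femtoseconds
RYDBERG_EV = 13.605693     # one Rydberg in electron volts

dt_fs = 7 * AU_TIME_FS         # time step of 7 au
run_fs = 200 * dt_fs           # 200-step test run
cutoff_ev = 70 * RYDBERG_EV    # 70 Ry plane-wave cut-off

print(f"time step  : {dt_fs:.3f} fs")
print(f"run length : {run_fs:.0f} fs")
print(f"cut-off    : {cutoff_ev:.0f} eV")
```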

Table 1. Time in Wall Clock Seconds for the CPMD Liquid Water parallel benchmark on the Cray T3E/1200E and the Pentium III/450 CS1 Cluster.

| Processors | Cray T3E/1200E | CS1 PIII/450 + FE (MPICH) |
|---|---|---|
| 4 | 15680 | 22458 |
| 8 | 7627 | 11441 |
| 16 | 4873 | 7627 |
| 32 | 2225 | 3602 |
| 64 | 1271 | |

Note that the Pentium Cluster is seen to be performing well in comparison to the Cray T3E/1200E. This may be attributed to the relatively long iteration times associated with CPMD, and the small impact that the MPI_ALLTOALL routine has on the total elapsed times (compared to for example the more demanding MPI_ALLTOALLV). Good scalability is shown on the Cray T3E/1200E (a speedup of 49 on 64 nodes), although the EV56 node appears to be only marginally faster than the 450 MHz Pentium III. Thus the Pentium cluster achieves a percentage delivery figure of 62% of the Cray T3E/1200E on 32 nodes.
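For readers unfamiliar with the collective named above, the following is a plain-Python model of the data movement performed by an all-to-all exchange. It does not use an MPI library and says nothing about how CPMD itself invokes the routine; it only illustrates why the operation stresses the interconnect, since every rank exchanges a block with every other rank.

```python
def alltoall(send):
    """Model of an all-to-all exchange.

    send[src][dst] is the block that rank `src` sends to rank `dst`;
    the result satisfies recv[dst][src] == send[src][dst].
    """
    nranks = len(send)
    return [[send[src][dst] for src in range(nranks)] for dst in range(nranks)]

def message_count(nranks):
    """Point-to-point messages implied by one exchange (self-copies excluded)."""
    return nranks * (nranks - 1)

# Four ranks, each labelling the block it sends to every destination
send = [[f"{src}->{dst}" for dst in range(4)] for src in range(4)]
recv = alltoall(send)

print(recv[2])             # blocks received by rank 2, ordered by source rank
print(message_count(32))   # messages implied on 32 ranks
```

The number of messages grows quadratically with the number of ranks, which is why the quality of the library's implementation and the latency of the interconnect matter more as the processor count rises.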

We have recently implemented the latest version of the code on a number of additional platforms. These include the IBM SP/WH2-375 and Regatta-HPC node, the SGI Origin 3800/R14k and Compaq AlphaServer SC/ES45 1000, plus three commodity systems, the CS2 Alpha Linux Cluster, "ukcp", the CS7 dual-Athlon K7/1000 with SCALI interconnect, and the CS9 Pentium 4/2000 Cluster with Myrinet interconnect. Using the same cluster of 32 liquid water molecules, we report in Table 2 the time for performing a single point energy calculation, the calculation converging in 22 iterations. The AlphaServer SC ES45/1000 is clearly the optimal machine at 32 CPUs, outperforming the SGI Origin 3800 and IBM SP/WH2-375 by factors of 1.8 and 1.9 respectively. At 16 CPUs, however, the power4-based Regatta-HPC node outperforms the AlphaServer ES45/1000 by a factor of 1.25. Considering the three clusters, the CS2 Alpha Linux Cluster is optimal, almost twice the speed of the Myrinet-based CS9 Cluster and 2.4 times faster than the SCALI-based CS7 cluster. While this is in part due to the lower latency of QsNet compared with Myrinet, it probably also reflects a non-optimal implementation of the MPI_ALLTOALL collective on the CS7 and CS9 machines. This certainly contributes to the lack of scalability found on both clusters, and to a 32-CPU percentage delivery figure for the CS9 Cluster of just 30% against the AlphaServer ES45/1000, the lowest such figure in all the benchmarks described in this report.

Table 2. Time in Wall Clock Seconds for the CPMD Liquid Water parallel benchmark on both High-end and Commodity-based Systems.

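| CPUs | IBM SP/WH2-375 | IBM SP/Regatta-HPC | SGI Origin 3800/R14k-500 | Compaq AlphaServer SC/ES45 1000 | CS2 Alpha Linux Cluster EV67-667, QsNet | CS7 dual-Athlon K7/1000 (SCAMPI) | CS9 dual-P4/2000 Xeon, Myrinet 2k |
|---|---|---|---|---|---|---|---|
| 4 | 1147 | 236 | 856 | | 543 | 1463 | 1038 |
| 8 | 467 | 137 | 427 | 165 | 383 | 716 | 487 |
| 16 | 197 | 75 | 221 | 99 | 195 | 390 | 259 |
| 32 | 122 | | 111 | 63 | 105 | 253 | 208 |
| 48 | | | 88 | | 77 | | |
| 64 | | | 70 | 40 | | | |
| 128 | | | 55 | | | | |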

References

[1] Michele Parrinello, Jurg Hutter, D. Marx, P. Focher, M. Tuckerman, W. Andreoni, A. Curioni, E. Fois, U. Roetlisberger, P. Giannozzi, T. Deutsch, A. Alavi, D. Sebastiani, A. Laio, J. VandeVondele, A. Seitsonen, S. Billeter and others.

[2] D. Marx and J. Hutter, "Ab-initio Molecular Dynamics: Theory and Implementation", Modern Methods and Algorithms in Quantum Chemistry, Forschungzentrum Juelich, NIC Series, vol. 1, (2000).
[3] W. Andreoni and A. Curioni, "New Advances in Chemistry and Material Science with CPMD and Parallel Computing", Parallel Computing 26 (2000) 819.

[4] CPMD, Version 3.3: Hutter, Alavi, Deutsch, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999).

 

Applications Performance: DL_POLY - Version 2

M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD.
m.f.guest@dl.ac.uk

DL_POLY [1] is the parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and T.R. Forester for CCP5 (the Collaborative Computational Project for the Computer Simulation of Condensed Phases). The parallel implementation within Version 2 of the code is based on a replicated data (RD) strategy, and was designed at the outset for machines with up to 64 processors and systems of up to 30,000 atoms, although it has since found use on much larger architectures. Implicit in the RD approach is a dependence on fast global summations, a potential bottleneck on clusters with commodity interconnects. The performance scaling varies according to the kind of simulation being undertaken: systems possessing complex molecular topologies and constraint bonds typically scale less well than those requiring simple atomic descriptions, as they lead to a higher communication overhead. If constraint bonds are present, as they usually are in bio-molecular or polymer systems, then significant deviations from ideal behaviour are to be expected. The four benchmarks outlined below are those described on the CCP5 web site [2], with the same benchmark-numbering scheme adopted here:

Benchmark 4: A straightforward simulation of sodium chloride at 500 K, using the standard Ewald summation method to handle the electrostatic forces. A multiple time-step algorithm is used to increase performance, which requires recalculating the reciprocal space forces only twice in every five time steps. The electrostatic cut-off is set at 24 Å in real space, with a primary cut-off of 12 Å for the multiple time-step algorithm. The Van der Waals terms are calculated with a cut-off of 12 Å. The simulation is for 200 steps with a time step of 1 fs in the Berendsen NVT ensemble. The system size is 27,000 ions.

Benchmark 5: This simulation is of 8,640 atoms of an alkali disilicate glass at 1000 K. The electrostatics are again handled by the Ewald sum, with the interaction potential including a three-body valence angle term, which requires a link-cell scheme to locate atom triplets. The electrostatic cut-off is 12 Å and the Van der Waals cut-off is 7.6 Å; 3-body forces are cut off at 3.45 Å. The simulation is for 300 steps in the Hoover NVT ensemble, with a time step of 1 fs.

Benchmark 3: This simulation is of the enzyme transferrin in a solution comprising 8102 TIP3P water molecules. A total of 27,593 atoms are in the system. The electrostatic forces are handled by a combination of neutral groups with the Coulombic potential. All force cut-offs are set at 8 Å. The simulation is for 250 steps with a time step of 0.1 fs, in the NVE ensemble. The water molecules are treated as rigid bodies and the transferrin is maintained by bond constraints using SHAKE. Valence angles and dihedral potentials are present in the transferrin model.

Benchmark 7: This system comprises 13,390 atoms, including 4012 TIP3P water molecules solvating the gramicidin A protein molecule at 300 K. Both the protein and water molecules are defined with rigid bonds and maintained by the SHAKE algorithm. The water is held completely rigid, while the protein has angular and dihedral potential terms. Electrostatic interactions are handled by the neutral group method with a Coulombic potential truncated at 12 Å. The Van der Waals interactions are truncated at 8 Å. The simulation is for 500 time steps in the NVE ensemble with a 1 fs time step.
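Both macromolecular benchmarks rely on SHAKE to maintain bond constraints. As a rough illustration of what that involves, the sketch below applies the textbook SHAKE iteration to a single bond between two particles. It is not the DL_POLY implementation, which has to handle many coupled constraints distributed across processors; all names and the example geometry are hypothetical.

```python
def shake_bond(r_i_old, r_j_old, r_i_new, r_j_new, m_i, m_j, d,
               tol=1e-10, max_iter=100):
    """Correct unconstrained positions so that |r_i - r_j| = d.

    Corrections are applied along the bond vector from the previous
    (constrained) step, weighted by inverse mass.
    """
    r_i, r_j = list(r_i_new), list(r_j_new)
    old = [a - b for a, b in zip(r_i_old, r_j_old)]   # previous bond vector
    for _ in range(max_iter):
        s = [a - b for a, b in zip(r_i, r_j)]         # current bond vector
        diff = sum(c * c for c in s) - d * d          # constraint violation
        if abs(diff) < tol:
            break
        s_dot_old = sum(a * b for a, b in zip(s, old))
        g = diff / (2.0 * (1.0 / m_i + 1.0 / m_j) * s_dot_old)
        r_i = [a - g / m_i * o for a, o in zip(r_i, old)]
        r_j = [b + g / m_j * o for b, o in zip(r_j, old)]
    return r_i, r_j

# Hypothetical O-H like pair: bond length 1.0, masses 16 and 1
r_i, r_j = shake_bond([0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                      [-0.1, 0.05, 0.0], [1.1, -0.02, 0.0],
                      m_i=16.0, m_j=1.0, d=1.0)
bond = sum((a - b) ** 2 for a, b in zip(r_i, r_j)) ** 0.5
print(f"constrained bond length: {bond:.6f}")
```

Each pass needs the current positions of both atoms in a bond, so when bonded atoms are handled by different processors the iteration implies repeated communication, consistent with the poorer scaling reported for Benchmarks 3 and 7.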

Performance scaling on the Cray T3E for Benchmarks 4 & 5 has been shown to be extremely good and is almost linear over the entire range of processor numbers. This reflects the high parallel efficiency of the Ewald sum implementation. Significantly inferior scaling is found in the two macromolecular benchmarks, Benchmarks 3 & 7. This may be attributed to the difficulty in apportioning the neutral group calculations across processors, and the use of SHAKE for the bond constraints. Somewhat better scaling is found in Benchmark 7, which uses a larger cut-off in the electrostatic calculations and hence has a lower communication/computation ratio. The performance of the four DL_POLY benchmarks is shown in Table 1 (on the Cray T3E/1200E and IBM, SGI, Compaq and Cray high-end systems) and Table 2 (on a number of commodity-based systems). Initial modifications made to the DL_POLY implementation on the commodity clusters included replacing the MPI_ALLREDUCE routines from both the LAM and MPICH libraries with a hypercube-based version written at Daresbury.
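The Daresbury routine itself is not reproduced in this report. As an indication of what a hypercube-based global sum looks like, the sketch below simulates the standard recursive-doubling scheme in plain Python for a power-of-two number of ranks; it is an illustration of the general technique under that assumption, not the code used in DL_POLY.

```python
def hypercube_allreduce(values):
    """Simulate a global sum over P = 2**d ranks by recursive doubling.

    values[rank] is the local contribution of each rank. Returns the list of
    per-rank results (all equal to the global sum) and the number of
    exchange steps, which is log2(P).
    """
    nranks = len(values)
    if nranks == 0 or nranks & (nranks - 1):
        raise ValueError("number of ranks must be a power of two")
    partial = list(values)
    steps = 0
    bit = 1
    while bit < nranks:
        # Each rank swaps its partial sum with the neighbour whose rank
        # differs in exactly this bit, then adds the two together.
        partial = [partial[rank] + partial[rank ^ bit] for rank in range(nranks)]
        bit <<= 1
        steps += 1
    return partial, steps

totals, steps = hypercube_allreduce(list(range(8)))
print(totals, steps)
```

Every rank finishes with the full sum after log2(P) pairwise exchanges, which keeps the number of latency-bound messages small on commodity interconnects.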

Considering the Ewald-based benchmarks, we would again point to the excellent scalability on the Cray T3E/1200E, with speedups of 135 (super-linear) and 98 obtained on 128 nodes of the Cray for Benchmarks 4 and 5 respectively. This excellent scaling on the T3E is put into perspective when comparing the total 32-CPU times to solution against the high-end systems of Table 1. These suggest comparable run times for the IBM Regatta-H and AlphaServer SC ES45/1000 on each benchmark, with the AlphaServer SC delivering 874% (Benchmark 4) and 625% (Benchmark 5) of the Cray T3E on 32 CPUs. The weakness of the Cray EV56 CPU is clearly apparent. These factors decrease substantially on 128 CPUs, with the AlphaServer SC delivering 552% and 394% on Benchmarks 4 and 5 respectively. It is arguable, however, that these benchmarks are not providing a realistic assessment of the high CPU capability of current high-end systems, given the limited size of the simulations under investigation and the extremely short run times involved. Considering again the 32-CPU performance, there is evidently little difference in performance between the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray SuperCluster EV67/833, all three being between 1.4 and 1.5 times slower than the Regatta H and AlphaServer ES45/1000. The timings do suggest that the optimum scalability is shown by the Origin 3800/R14k, although this is inferior to that found on the Cray T3E.

Turning to the commodity systems of Table 2, the weakness of the Cray EV56 CPU is again apparent. Even the CS6 Pentium III/800 cluster is comfortably outperforming the Cray T3E/1200E in both the NaCl simulation (204 vs. 376 seconds) and NaK silicate simulation. These percentage delivery figures of 184% and 151% on the 800 MHz Pentium cluster increase substantially on the more powerful CPUs of the AMD Athlon and Alpha Clusters. In Benchmark 4 we find a delivery figure of 257% for the CS4 K7/1200 cluster: the Benchmark 5 percentage is 192%. Corresponding 16-node figures for the CS3 AMD Athlon cluster are 233% (benchmark 4) and 255% (benchmark 5). It is clear however that providing just fast ethernet as interconnect is not sustainable much beyond 32 CPUs. The 64 CPU performance of the CS6 Pentium III/800 cluster is only marginally superior to that at 32 CPUs in Benchmark 4, while in Benchmark 5 the 64 CPU timing is actually slower.

Table 1. Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on the Cray T3E/1200E and IBM, SGI, Compaq and Cray High-end Systems.

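| Benchmark | CPUs | Cray T3E/1200E | IBM SP/WH2-375 | IBM SP/Regatta H | SGI Origin 3800/R12k-400 | SGI Origin 3800/R14k-500 | Compaq Alpha SC ES40/667 | Cray Alpha Linux SC EV67/833 | Compaq Alpha ES45/1000 |
|---|---|---|---|---|---|---|---|---|---|
| Benchmark 4: NaCl | 8 | 1588 | | 179 | 343 | 298 | 218 | 215 | |
| | 16 | 817 | 193 | 90 | 179 | 154 | 115 | 114 | 79 |
| | 32 | 376 | 102 | 48 | 85 | 70 | 63 | 65 | 43 |
| | 64 | 179 | | | | 39 | | 37 | 25 |
| | 128 | 94 | | | | | | | 17 |
| Benchmark 5: NaK Disilicate Glass | 8 | 918 | | 119 | 244 | 204 | 178 | 173 | 121 |
| | 16 | 479 | 154 | 62 | 121 | 107 | 94 | 88 | 62 |
| | 32 | 225 | 93 | 35 | 66 | 57 | 55 | 52 | 36 |
| | 64 | 121 | | | | 33 | | 36 | 23 |
| | 128 | 75 | | | | | | 28 | 19 |
| Benchmark 3: Transferrin | 8 | | | 66 | 139 | 118 | | 105 | 77 |
| | 16 | 191 | 155 | 51 | 96 | 84 | | 87 | 65 |
| | 32 | 132 | 136 | | 88 | 77 | | 81 | |
| | 64 | 115 | | | | | | 83 | |
| | 128 | 104 | | | | | | | |
| Benchmark 7: Gramicidin A | 8 | 1243 | | 290 | 372 | 316 | | 288 | 198 |
| | 16 | 688 | 418 | 153 | 212 | 182 | 178 | 171 | 116 |
| | 32 | 382 | 273 | 91 | 135 | 121 | 124 | 122 | 77 |
| | 64 | 232 | | | | 91 | | 104 | 60 |
| | 128 | 166 | | | | | | 102 | 56 |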

The faster Alpha EV67 (CS2), Pentium 4/2000 (CS9) and Itanium/800 (CS8) CPUs, together with their enhanced QSNet and Myrinet interconnects, result in much higher delivery figures. The Myrinet-connected Itanium-based CS8 cluster performs exceptionally well on the NaCl benchmark, with 32 CPUs delivering 78% of the AlphaServer SC ES45/1000, outperforming the CS2 Linux Alpha and CS9 Pentium 4 clusters by factors of 1.45 and 2.0 respectively. The performance advantage of the Itanium-based CS8 cluster is not apparent in Benchmark 5, however, where all three clusters show comparable performance. The 32-CPU CS2 Linux Alpha cluster now appears to be somewhat faster than CS8 and the Pentium 4-based CS9 cluster. The performance of the Pentium 4-based cluster is generally less impressive in the DL_POLY benchmarks compared to the electronic structure results presented previously (e.g., GAMESS-UK). This would appear to be a single-CPU optimisation issue with DL_POLY itself on the Pentium 4, for it is not evident when analysing the related CHARMM benchmarks.

The Alpha CS2 cluster outperforms the IBM SP and Origin 3800/R12k, with corresponding figures of 470% (benchmark 4) and 363% (benchmark 5) at 32 CPUs. The potential of the commodity-based systems in these simulations is striking; the 32-CPU Linux Alpha Cluster is outperforming 128 nodes of the Cray T3E in both benchmarks.

A quite different picture of performance is revealed when considering the two macromolecular simulations, benchmarks 3 and 7. Now the scalability on the T3E/1200E is far more limited, with speedups of just 29 (benchmark 3) and 60 (benchmark 7) on 128 nodes of the Cray. The improvement in performance of the high-end systems over the T3E/1200E is also less apparent compared to the Ewald-based simulations. The fastest of these systems, the AlphaServer SC/ES45 1000, is only a factor of 4.9 times faster on benchmark 7 on 32 CPUs (cf. factors of 8.7, benchmark 4 and 6.3, benchmark 5). The 32-CPU AlphaServer SC/ES45 is now seen to marginally outperform the IBM Regatta-H (by a factor of 1.2), with both machines some way ahead of the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray Alpha Linux cluster. Comparable scalability is shown by the Origin 3800/R14k and AlphaServer SC/ES45, although this is significantly inferior to that found on the Cray T3E.

Table 2. Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on a variety of commodity-based systems.

   

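| Benchmark | CPUs | CS1 | CS2 | CS3 | CS4 | CS5 | CS6 | CS7 | CS8 | CS9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 4: NaCl | 8 | | 337 | | 591 | 880 | 911 | 645 | 245 | 461 |
| | 16 | 670 | 167 | 351 | 269 | 351 | 385 | 328 | 135 | 191 |
| | 32 | 352 | 80 | | 146 | | 204 | 160 | 55 | 98 |
| | 64 | | | | | | 169 | 83 | 33 | 61 |
| 5: NaK disilicate glass | 8 | | 235 | | 278 | 366 | 434 | 307 | 205 | 224 |
| | 16 | 412 | 110 | 188 | 161 | 179 | 228 | 167 | 109 | 121 |
| | 32 | 238 | 62 | | 117 | | 149 | 89 | 69 | 68 |
| | 64 | | | | | | 153 | 53 | 55 | 44 |
| 3: Transferrin | 8 | | 97 | | 253 | 202 | 316 | 142 | 124 | 107 |
| | 16 | 410 | 74 | 163 | 255 | 165 | 290 | 119 | 86 | 83 |
| | 32 | 391 | 87 | | | | 297 | 121 | 75 | 73 |
| 7: Gramicidin A | 8 | | 348 | | 658 | 983 | 1131 | 617 | 335 | 523 |
| | 16 | 1059 | 190 | 442 | 458 | 561 | 733 | 348 | 191 | 306 |
| | 32 | 681 | 147 | | 402 | | 552 | 238 | 120 | 190 |
| | 64 | | | | | | 611 | | | |

† Commodity systems (CSx): CS1 PIII/450 + FE: LAM/MPI; CS2 QSNet Alpha Linux EV67/667; CS3 AMD K7/850 + Myrinet; CS4 AMD K7/1200 + FE: LAM/MPI; CS5 dual PIII/930 + SCALI; CS6 PIII/800 + FE: LAM/MPI; CS7 AMD K7/1000 + SCALI; CS8 dual Itanium/800 + Myrinet 2k; CS9 dual P4/2000 + Myrinet 2k (IFC).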

This lack of scalability has a predictable effect on the performance of the commodity clusters, which now deliver significantly lower percentage delivery figures compared to those found in the Ewald-based simulations. Considering the CS6 Pentium III/800 cluster, we find 32-node T3E delivery figures of just 42% and 69% for benchmarks 3 and 7 respectively. While these figures increase significantly on the more powerful CPUs, they are far from impressive. Focusing on benchmark 7, we see only modest increases in delivery on the CS4 K7/1200 cluster (95%). These figures do increase substantially with improvements in interconnect. The advantage of enhanced interconnect is clear when comparing the performance of the CS4 Athlon K7/1200 and CS7 Athlon K7/1000 clusters, machines with comparable CPU performance. While the SCALI/SCI interconnect on the latter has no impact on the Ewald-based benchmark 4, it leads to the CS7 cluster outperforming CS4 by a factor of 1.7 in the macromolecular benchmark.

Of the three leading clusters, the Myrinet-connected Itanium-based CS8 cluster is again competitive, 32 CPUs outperforming the CS2 Linux Alpha and CS9 Pentium/4 clusters by factors of 1.2 and 1.6 respectively. The 32 CPU elapsed time is identical to that of the SGI Origin 3800/R14k, AlphaServer ES40/667 and Cray Supercluster, delivering 64% of the AlphaServer ES45/1000.

Worth noting here is the initial performance limitations on the Linux Alpha Cluster that arose from the way DL_POLY handled both co-ordinate and forces arrays. The x-, y- and z-co-ordinates and corresponding forces were stored as separate linear arrays, x(mxatms), y(mxatms) etc., coding that led to exceedingly poor cache re-usage on the UP2000 processor. Re-writing the code to use, in hopefully obvious notation, xyz(3,mxatms) and fxyz(3,mxatms) improved overall performance on the Alpha cluster by a factor of 2.5 (although it had little effect on, for example, the IBM/SP-WH2 with its larger 8 MByte cache). Having made these changes, the 32-CPU Linux Alpha CS2 Cluster again outperforms 128 nodes of the Cray T3E in both benchmarks.
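The effect of the two storage schemes on cache usage can be illustrated by counting the cache lines touched when the coordinates of a few widely separated atoms are read, as happens when walking a neighbour list. The sketch below is a simple model, not a measurement: it assumes a 64-byte cache line holding eight 8-byte reals (the line size of the UP2000 processor is not given in the text) and assumes that the separate x, y and z arrays are stored back to back.

```python
DOUBLES_PER_LINE = 8   # assumption: 64-byte cache line, 8-byte reals

def lines_separate(atoms, mxatms):
    """Cache lines touched with x(mxatms), y(mxatms), z(mxatms) stored back to back."""
    return {(axis * mxatms + i) // DOUBLES_PER_LINE
            for i in atoms for axis in range(3)}

def lines_interleaved(atoms):
    """Cache lines touched with xyz(3,mxatms): the three components of an atom are adjacent."""
    return {(3 * i + axis) // DOUBLES_PER_LINE
            for i in atoms for axis in range(3)}

mxatms = 4096
neighbours = [0, 1000, 2000, 3000]   # hypothetical, widely separated atom indices

separate = len(lines_separate(neighbours, mxatms))
interleaved = len(lines_interleaved(neighbours))
print(f"separate arrays : {separate} cache lines")
print(f"xyz(3,mxatms)   : {interleaved} cache lines")
```

In this model the separate arrays touch three lines per atom against one for the interleaved layout, which is the kind of difference that matters most on processors with small caches.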

An additional feature exemplified by these benchmarks is the impact of the underlying MPI libraries on performance. While little effect was found in the Ewald-based simulations, a much greater impact was apparent on benchmarks 3 and 7. Thus the reduced latency associated with LAM MPI as against MPICH reduced the 32-node benchmark 3 timing on the CS1 Pentium III/450 cluster from 583 (MPICH) to 391 seconds (LAM).

Finally, it is perhaps worth questioning the value and cost effectiveness of the Cray T3E/1200E in running molecular simulations using the DL_POLY software, a question that is reinforced by considering the CHARMM benchmarks presented below. In both classes of DL_POLY benchmark considered, those that scale well on the Cray (benchmarks 4 and 5) and those that scale badly (the macromolecular simulations), we see that 128-node Cray T3E performance is matched or exceeded by 32 CPUs of the Linux Alpha Cluster. While the latter scales less effectively than the Cray, the total times to solution are less. Given that the replicated data implementation within Version 2 of DL_POLY itself does not scale effectively beyond 128 Cray CPUs, it is difficult to justify using the Cray at all, given the implicit cost differential involved against the clusters considered in this report.

Considerable effort has now been invested in the distributed data version of the code, DL_POLY 3 (see other articles in this report). This significantly extends the size of system amenable to study and, with major algorithmic enhancements such as the use of the Particle Mesh Ewald Scheme for the Coulombic energy, exhibits far better scalability than the replicated data code discussed above. This code will certainly require high-end resources in the pursuit of simulations of 10⁶ or more particles and, based on our initial findings, will scale well on the 256+ CPUs characterising these machines.

References

[1] see, http://www.dl.ac.uk/TCSC/Software/DL_POLY/main.html

[2] see, http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/

 

Applications Performance: DL_POLY - Version 3

W. Smith, I.J. Bush, M.F. Guest and P. Sherwood
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD.
W.Smith@dl.ac.uk, I.J.Bush@dl.ac.uk, M.F.Guest@dl.ac.uk, and P.Sherwood@dl.ac.uk

The previous section provided a comprehensive benchmarking of the replicated data (RD) version (Version 2.11) of DL_POLY [1], the parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and T.R. Forester for CCP5. These results clearly revealed the limitations inherent in the RD strategy, with restrictions in the size of system amenable to study, and limited scalability on current high-end platforms typified by the Compaq AlphaServer SC and Origin 3800. These limitations apply not only to systems possessing complex molecular topologies and constraint bonds, but also to systems requiring simple atomic descriptions, systems that historically exhibited excellent scaling on the Cray T3E/1200E. Other articles in this report have described the significant extensions to the code made possible by the development of the distributed data (or domain decomposition) version of the code (DL_POLY 3), developments that have been accelerated in light of the impending arrival of the HPC(X) system. In the present article we present recent results obtained on the Compaq AlphaServer SC ES45/1000, the SGI Origin 3800 and the Pentium 4/2000-based CS9 Cluster, results which highlight the drastic improvements in both system size and performance made possible through recent developments.

Table 1. Time in Wall Clock Seconds for four DL_POLY 3 benchmark calculations on the Compaq AlphaServer SC ES45/1000, SGI Origin 3800 and CS9 Pentium 4/2000 Cluster.

        SGI Origin       Compaq Alpha   CS9 Pentium 4/2000
CPUs    3800/R14k-500    ES45/1000      + Myrinet 2k

NaCl; 27,000 ions, 200 time steps
8       605              183            514
16      313              103            256
32      168              57             128
64      92               37             74
128     53               24             -
256     36               -              -

NaCl; 216,000 ions, 100 time steps
16      977              576            973
32      495              326            487
64      254              168            265
128     143              91             -
256     84               54             -

Gramicidin A; 99,120 atoms, 100 time steps
8       537              292            463
16      282              173            247
32      167              109            137
64      100              75             98

Gramicidin A; 792,960 atoms, 10 time steps
32      749              396            502
64      312              200            309
128     176              116            -
256     115              73             -

The four benchmarks reported in Table 1 include two Coulombic-based simulations of NaCl, one with 27,000 ions, the second with 216,000 ions. Both simulations involve use of the Particle Mesh Ewald Scheme, with the associated FFT treated by an algorithm due to Bush that is designed to reduce communications cost. This circumvents use of the traditional all-to-all communications through a scheme (see separate article) that relies on column-wise communications only. The reported timings are for 500 time steps in the smaller calculation, and 200 time steps in the larger simulation.

The other two benchmarks are macromolecular simulations based on Gramicidin-A; the first includes a total of 99,120 atoms and 100 time steps. The second, much larger simulation, is for a system of eight Gramicidin-A species (792,960 atoms), with the timings reported for just 10 time steps. In terms of time to solution, we see that the AlphaServer SC outperforms the Origin 3800 at all processor counts in all four benchmarks; both 256 CPU runs for the larger NaCl and Gramicidin-A simulations suggest a factor of 1.6.

These results show a marked improvement in performance compared to the replicated data version of the code, with the gratifying characteristic of enhanced scalability with increasing size of simulation, both in the ionic and macromolecular simulations. Considering the NaCl simulations, we find speedups of 139 and 122 respectively on 256 processors of the Origin 3800 and AlphaServer SC in the 27,000-ion simulation. These figures increase to 186 and 171 respectively in the larger simulation featuring 216,000 ions. A more compelling improvement with system size is found in the macromolecular Gramicidin-A simulations. In the distributed data implementation, both SHAKE and short-range forces require only nearest neighbour communications, suggesting that communications should scale linearly with the number of nodes, in marked contrast to the replicated data implementation. This is borne out in practice. In the larger simulation (with 792,960 atoms) we find speedups of 208 and 175 on 256 processors of the Origin 3800 and AlphaServer SC respectively. This level of scalability provides a significant advance over the performance exhibited by both DL_POLY 2 and CHARMM (see next article), and represents a major step forward towards the goal of effective exploitation of the HPC(X) system in the field of molecular simulation.
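The scaling behaviour can also be read directly from Table 1 by comparing two CPU counts. The short Python sketch below does this for the larger Gramicidin-A simulation; the function name is hypothetical, and because the baseline used for the speedups quoted above is not stated in this article, the sketch reports only scaling relative to the 32-CPU timing rather than attempting to reproduce those figures.

```python
# Wall-clock seconds from Table 1 (Gramicidin A, 792,960 atoms), keyed by CPU count.
origin_3800 = {32: 749, 64: 312, 128: 176, 256: 115}
alphaserver_sc = {32: 396, 64: 200, 128: 116, 256: 73}


def relative_scaling(times, base, target):
    """Return (observed speed-up, ideal speed-up, efficiency) between two CPU counts."""
    observed = times[base] / times[target]
    ideal = target / base
    return observed, ideal, observed / ideal


if __name__ == "__main__":
    for name, times in (("Origin 3800", origin_3800),
                        ("AlphaServer SC", alphaserver_sc)):
        observed, ideal, efficiency = relative_scaling(times, 32, 256)
        print(f"{name}: x{observed:.2f} of an ideal x{ideal:.0f} "
              f"({100 * efficiency:.0f}% efficiency, 32 to 256 CPUs)")
```

Measured this way, the Origin 3800 retains roughly four fifths of ideal scaling between 32 and 256 CPUs, and the AlphaServer SC roughly two thirds.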

 

Applications Performance: CHARMM

M.F. Guest and P. Sherwood
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD.
M.F.Guest@dl.ac.uk, and P.Sherwood@dl.ac.uk

CHARMM ‘Chemistry at HARvard Macromolecular Mechanics’ (version c26b2) is a general-purpose molecular mechanics, molecular dynamics and vibrational analysis package for modelling and simulation of the structure and behaviour of molecular systems. The benchmark is the standard CHARMM parallel benchmark involving an MD calculation of Carboxy Myoglobin (MbCO) with 3830 water molecules (14026 atoms, 1000 steps (1 ps), 12-14 Å shift). Although a macromolecular simulation, this MD benchmark shows many of the performance attributes demonstrated by the Ewald-based DL_POLY simulations. The performance of this benchmark is shown in Table 1 (on the Cray T3E/1200E, SGI Origin 3800/R14k-500 and high-end systems from Compaq) and Table 2 (on a number of commodity-based systems).

Table 1. Time in Wall Clock Seconds for the CHARMM Carboxy Myoglobin parallel benchmark on the Cray T3E/1200E, SGI Origin 3800/R14k-500 and Compaq AlphaServer SC ES40/667 and ES45/1000.

        Cray        Compaq AlphaServer   Compaq AlphaServer   SGI Origin
CPUs    T3E/1200E   SC ES40/667          SC ES45/1000         3800/R14k-500
8       1286        252                  165                  215
16      665         148                  89                   114
32      343         146                  61                   66
64      183         -                    73                   64
128     106         -                    -                    -

We would again point to the excellent scalability on the Cray T3E/1200E, with a speedup of 96 obtained on 128 nodes of the Cray. This good scaling on the T3E is put into perspective when comparing the total 32 CPU times to solution against the high-end systems of Table 1. These suggest that the AlphaServer SC ES45/1000 is delivering 562% of the Cray T3E on 32 CPUs; the weakness of the Cray EV56 CPU is clearly apparent. This factor decreases substantially at higher node count, with neither the AlphaServer SC nor the Origin 3800/R14k-500 scaling beyond 32 CPUs. As with DL_POLY, it is arguable that these benchmarks are not providing a realistic assessment of the high CPU-count capability of current high-end systems, given the limited size of the simulation under investigation and the extremely short run times involved. The timings do suggest that the optimum scalability is shown by the Origin 3800/R14k, although this is far inferior to that found on the Cray T3E.
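The percentage delivery figures used here and in the following sections are simply ratios of elapsed times at equal CPU count. As a minimal sketch (the function name is an assumption introduced for illustration), the 32-CPU figure quoted above follows from the Table 1 timings:

```python
def percent_delivered(t_reference, t_system):
    """Performance of a system as a percentage of a reference machine,
    both measured as wall-clock times at the same CPU count."""
    return 100.0 * t_reference / t_system


# 32-CPU wall-clock seconds from Table 1 (CHARMM MbCO benchmark).
CRAY_T3E_32 = 343
ES45_1000_32 = 61
ORIGIN_3800_32 = 66

if __name__ == "__main__":
    print(f"ES45/1000 vs T3E  : {percent_delivered(CRAY_T3E_32, ES45_1000_32):.0f}%")
    print(f"Origin 3800 vs T3E: {percent_delivered(CRAY_T3E_32, ORIGIN_3800_32):.0f}%")
```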

Although this lack of scalability has a predictable effect on the performance of the commodity clusters, the results of Table 2 suggest that CHARMM is delivering significantly higher percentage delivery figures compared to the corresponding macromolecular simulations using the replicated data version of DL_POLY. Considering the CS6 Pentium III/800 cluster, we find a 32-node T3E delivery figure of 172%, a figure close to that of the Ewald-based simulations using DL_POLY. Of particular note is the considerable advantage afforded by use of LAM-MPI rather than the more popular MPICH. This improves the 32-CPU timing by a factor of over two, a sure pointer to the latency-sensitive nature of these simulations. This delivery figure increases significantly on the more powerful CPUs with enhanced interconnect. Of the three leading clusters, the Myrinet-connected Pentium 4-based CS9 is optimal, 32 CPUs outperforming the CS7 AMD K7/1000 SCI and CS2 Linux Alpha clusters by factors of 1.2 and 1.3 respectively. The 32 CPU elapsed time is almost identical to that of the SGI Origin 3800/R14k, delivering 95% of the AlphaServer ES45/1000. This latter percentage is the highest delivered by the CS9 cluster throughout all the applications considered.

Table 2. Time in Wall Clock Seconds for the CHARMM Carboxy Myoglobin parallel benchmark on a number of commodity-based systems.

        SGI Origin       CS1 PIII/450     CS6 PIII/800     CS6 PIII/800     CS7 AMD K7/1000  CS9 P4/2000      CS2 Alpha Linux
CPUs    3800/R14k-500    + FE             + FE             + FE             MP + SCI         + Myrinet        EV67/667
                         (LAM)            (LAM)            (MPICH)          SCAMPI           (MPICH)          (QsNet)
8       215              880              349              399              206              179              275
16      114              518              231              335              113              104              145
32      66               359              199              440              79               64               85
64      64               -                -                -                -                51               62
128     -                -                -                -                -                -                -

CS1 PIII/450 + FE: LAM/MPI, CS2 QSNet Alpha Linux EV67/667

CS6 PIII/800 + FE: LAM/MPI, CS7 AMD K7/1000 + SCALI

CS9 dual P4/2000 + Myrinet

The potential of the commodity-based systems in this simulation is again striking; the 16-CPU Pentium 4 CS9 Cluster is outperforming 128 nodes of the Cray T3E/1200E.

References

[1] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, "CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations", J. Comp. Chem. 4 (1983) pp. 187-217.

 

Applications Performance: QM/MM Coupling Approaches with CHARMM/GAMESS-UK

M.F. Guest and P. Sherwood
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD.
M.F.Guest@dl.ac.uk, and P.Sherwood@dl.ac.uk

While developments in computer performance and QM algorithms are bringing increasingly complex systems within the scope of quantum mechanical calculations, many important chemical systems remain too large for pure quantum simulation. This is especially true if energies for many configurations are required, as in molecular dynamics studies. Parameterised classical energy methods remain important, and the simulation code CHARMM [1, see previous article] is one of the most widely used packages for the study of macromolecules such as proteins, nucleic acids and lipids. It supports energy minimisation and molecular dynamics approaches using a classical parameterised force field. In order to permit studies of reacting species it is useful to be able to incorporate the quantum mechanical energy of a part of the system into the forcefield, and over recent years a number of interfaces to quantum mechanical programs have been developed. Initially these were based on semi-empirical wavefunctions. More recently computational and hardware developments have led to increased interest in ab initio QM/MM schemes and interfaces to the GAMESS (US) and CADPAC packages have been implemented. The coupling between CHARMM and GAMESS-UK has been developed in collaboration with the groups of Bernie Brooks and Milan Hodoscek and follows a similar approach to these [2].

In the CHARMM QM/MM model the standard CHARMM forcefield is used for the classical partition and the QM/MM van der Waals interactions. The QM/MM electrostatics are handled by including point charges at the MM positions in the Hamiltonian. The energy and forces from the QM calculation, including electrostatic forces acting on the classical centres, are added to those computed by CHARMM. The QM/MM approach involves introducing additional hydrogen (link) atoms to the edges of the QM cluster to terminate the quantum mechanical calculation. The forces on the link atoms can be handled by CHARMM using the same methods developed for treating explicit models of lone pairs.

GAMESS-UK incorporates a DFT module in which an auxiliary basis fit of the charge density is used to provide an approximation to the Coulomb energy (see above). We have used these elements of GAMESS-UK to implement an alternative model in which the charge density of the classical system is included in the QM Hamiltonian not as a set of point charges but as a continuous charge distribution represented as a sum of Gaussian terms. This allows greater overlap between the QM and MM charge distributions without the introduction of major artefacts and thereby permits the exploration of a number of QM/MM schemes. Full details of QM/MM models based on this functionality will be published elsewhere [3].

The CHARMM package is parallelised using a variety of message passing protocols; we have chosen to base the parallel GAMESS-UK/CHARMM implementation on MPI. We can couple this with either the MPI- or GA-based parallel implementations of GAMESS-UK. When using the GAs we configure them to use MPI as the underlying communication protocol which simplifies the maintenance of the merged parallel code and also allows us to take advantage of optimised MPI implementations when provided by the vendors for specific networking hardware. We report some sample timings of the GA version on the Compaq AlphaServer SC ES45/1000 and SGI Origin 3800, plus two commodity clusters (CS7, the dual AMD K7/1000 MP with SCALI interconnect, and CS9, the dual P4/2000 Xeon with Myrinet). Ports to other parallel platforms are in progress, and details of the current status of the CHARMM/GAMESS-UK project may be found on the web [4].

The timings of Table 1 refer to a single energy and force calculation on the enzyme Triosephosphate Isomerase (TIM). This is one structure in a pathway that has been studied in considerable detail [5]. The system comprises a total of 4180 atoms, of which 35 are treated quantum mechanically, with the addition of 2 link atoms. A DFT calculation, using the B3LYP functional and the Ahlrichs DZP basis set (424 GTOs), is used for the QM region. With this balance of QM and MM calculations the time is dominated by the QM calculation, the only impact of the MM region being the increased number of 1-electron integrals required, which has necessitated additional parallelisation with respect to the previous GAMESS-UK implementation.

The timings above are consistent with the previously reported DFT timings, with the CS9 P4/2000 Xeon cluster significantly faster than both the Origin and the AMD-based cluster. The performance of the latter is again far from ideal given the non-optimised version of the underlying Global Array (GA) tools. Performance is limited primarily by the communication costs associated with the linear algebra steps in the SCF, which include diagonalisation, matrix multiply etc. As an example, the parallel diagonalisation (PeIGS) is slightly slower on 128 processors of the SGI Origin than on 64 (1.07 vs 0.99s per iteration). Although inefficient, the parallelisation of these steps is nevertheless important when running on larger processor counts; as an illustration, the same calculation using serial matrix algebra takes 374s on 64 processors. Perhaps the most promising way to extend the efficiency of parallel QM/MM calculations is to run a number of QM calculations simultaneously. This approach has been implemented in the GAMESS-UK/CHARMM interface and is being explored in the studies of reaction pathways using the replica path method [6].

References

[1] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus, "CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations", J. Comp. Chem. 4 (1983) pp. 187-217.

[2] P.D. Lyne, M. Hodoscek and M. Karplus. A Hybrid QM-MM Potential employing Hartree-Fock or Density Functional Methods in the Quantum Region, J. Phys. Chem. A., 103 (1999) 3462.

[3] Optimization of Quantum Mechanical/Molecular Mechanical Partitioning Schemes: Gaussian Delocalization of MM Charges and the Double Link Atom Method, D. Das, K.P. Eurenius, E.M. Billings, P. Sherwood, D.C. Chatfield, M. Hodoscek, and B R. Brooks, in preparation.

[4] http://www.cse.clrc.ac.uk/Activity/CHMGUK

[5] C. Lennartz, A. Schäfer, F. Terstegen and W. Thiel Enzymatic Reactions of Triosephosphate Isomerase: A Theoretical Calibration Study, J. Phys. Chem. A., in press.

[6] H. L. Woodcock, M. Hodoscek, P. Sherwood, Y.S. Lee, H.F. Schaefer III, and B.R. Brooks, Exploring the QM/MM Replica Path Method: A Pathway Optimisation of the Chorismate to Prephenate Claisen Rearrangement Catalyzed by Chorismate Mutase, Theor. Chem. Accts. in press.

 

 

Applications Performance: ANGUS

M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD.
M.F.Guest@dl.ac.uk

ANGUS [1] performs direct numerical simulation (DNS) of turbulent premixed combustion in order to generate statistical data in support of modelling. The equations to be solved are the Navier-Stokes equations for fluid flow, augmented by two additional equations each describing the transport of a single scalar variable and together specifying the thermochemical state of the system in the presence of differential diffusion effects. Thus, in total there are six partial differential equations to be solved. A grid partitioning strategy is employed for ANGUS which is quite typical of many domain decomposition techniques used in parallel CFD. Due to the finite difference stencil it is necessary to introduce ‘halo’ or ‘ghost’ cells at the interface boundaries. These are used as a message cache and allow derivatives to be determined in these regions with only local variables. The halo cells are then updated as required. Discretisation of the equations is carried out using standard second-order central differences on a three-dimensional grid. The velocity nodes are located at the face-centres of each cell, giving a staggered-grid arrangement that conserves kinetic energy as well as mass and momentum. The pressure solver utilises a conjugate gradient method with a Modified Incomplete LU (MILU) preconditioner [2].
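The role of the halo cells can be illustrated in one dimension. The Python sketch below is not taken from ANGUS: it assumes a periodic 1-D field whose length divides evenly among the partitions, uses hypothetical function names, and mimics the message exchange with simple list copies. It shows that once each partition has received one ghost value from each neighbour, the second-order central difference can be formed from local data alone and agrees with the undecomposed result.

```python
def central_diff(padded, h):
    """Second-order central difference on the interior points of a padded block."""
    return [(padded[i + 1] - padded[i - 1]) / (2.0 * h)
            for i in range(1, len(padded) - 1)]


def with_halo(local, left_neighbour, right_neighbour):
    """Pad a block with one ghost cell from each neighbour (stands in for a message)."""
    return [left_neighbour[-1]] + local + [right_neighbour[0]]


def derivative_decomposed(u, h, nparts):
    """Split a periodic field into nparts equal blocks, fill halos, differentiate locally."""
    size = len(u) // nparts  # assumes len(u) is divisible by nparts
    blocks = [u[p * size:(p + 1) * size] for p in range(nparts)]
    result = []
    for p, block in enumerate(blocks):
        left = blocks[(p - 1) % nparts]
        right = blocks[(p + 1) % nparts]
        result.extend(central_diff(with_halo(block, left, right), h))
    return result


def derivative_global(u, h):
    """Reference result on the undecomposed periodic field."""
    n = len(u)
    return [(u[(i + 1) % n] - u[i - 1]) / (2.0 * h) for i in range(n)]


if __name__ == "__main__":
    field = [float((i * i) % 7) for i in range(16)]
    print(derivative_decomposed(field, 0.1, 4) == derivative_global(field, 0.1))
```

In a real three-dimensional code the ghost values are whole planes of cells exchanged between processors, but the principle is the same.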

As with many CFD algorithms, the resulting matrix is both sparse and symmetric. In this case, it is heptadiagonal and the periodic boundary conditions also mean that the matrix is singular. A Multi-grid solver has also been provided. Level-1 BLAS are used heavily in both solvers, with the overall computational work expected to be roughly proportional to n³. The initial ANGUS CG-ILU benchmark considered below utilises a grid size of 144³. Benchmark timings for one hundred iterations of the conjugate gradient solver on the Cray T3E/1200E, SGI Origin 3800/R14k-500, IBM SP/WH2-375 and Regatta-H and the Compaq AlphaServer SC (ES40/667 and ES45/1000) are reported in Tables 1 and 2. Corresponding timings on a variety of commodity-based systems are given in Tables 3 and 4.

Table 1. Time in Wall Clock Seconds on the Cray T3E/1200E, Compaq AlphaServer SC (ES40/667 and ES45/1000), SGI Origin 3800/R14k-500 and IBM SP/WH2-375 and Regatta-H for the ANGUS CG-ILU Benchmark (144³).

        Cray        SGI             IBM          IBM            Compaq AlphaServer   Compaq AlphaServer
CPUs    T3E/1200E   3800/R14k-500   SP/WH2-375   SP/Regatta H   SC ES40/667          SC ES45/1000
8       4580        2336            4394         671            1935                 1213
16      2380        879             1864         512            676                  531
32      1090        330             776          364            292                  228
64      480         148             -            -              158                  95

Note that the timings reported for the IBM/SP WH2-375 and AlphaServer SC refer to CPU configurations in which all CPUs on a given 4-way node are involved in the computation. These timings show several distinct features. All machines, with the exception of the IBM Regatta-H, appear to exhibit super-linear speedups, although the IBM SP/WH2-375 is only marginally faster than the Cray T3E for a given node count. The optimal machine would appear to be the Compaq AlphaServer SC ES45/1000, with an effective speed up of 12.8 on moving from 8 to 64 CPUs. The Origin 3800 R14k/500 also performs well; while a factor of 1.6 slower than the ES45/1000 at 64 CPUs, it also shows a super-linear speed up factor of 15.8 on moving from 8 to 64 CPUs. In stark contrast the IBM power4-based Regatta H scales extremely badly from 8 to 32 CPUs. It is the fastest machine by almost a factor of two with 8 CPUs, but is outperformed by the SGI Origin 3800/R14k-500 and both AlphaServer SC ES45/1000 and ES40/667 at 32 CPUs. The performance advantage over the Cray T3E/1200E, a factor of 6.8 at 8 CPUs, is reduced to just 3.0 when using 32 processors. While this behaviour is at first sight confusing, it may be rationalised from a consideration of the driving force behind this benchmark, namely memory bandwidth. Additional insight can be gained by varying the distribution of processors over the available nodes (see Table 2). Now, for example, a 16-processor job on the IBM SP/WH2-375 or AlphaServer SC is run on either 4 or 8 nodes given the configuration available (with all CPUs used in the former case, and only 2 CPUs/node in the latter).
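The super-linear behaviour is easily checked against Table 1: moving from 8 to 64 CPUs gives an ideal factor of 8, so any larger ratio of elapsed times is super-linear. The following Python sketch (the helper name is an assumption) applies this test to the machines with both 8- and 64-CPU timings.

```python
# (8-CPU, 64-CPU) wall-clock seconds from Table 1, ANGUS CG-ILU benchmark.
TIMINGS = {
    "Cray T3E/1200E": (4580, 480),
    "SGI Origin 3800/R14k-500": (2336, 148),
    "AlphaServer SC ES40/667": (1935, 158),
    "AlphaServer SC ES45/1000": (1213, 95),
}
IDEAL_FACTOR = 64 / 8


def speedup_factor(t_small, t_large):
    """Ratio of elapsed times between the smaller and larger CPU counts."""
    return t_small / t_large


if __name__ == "__main__":
    for machine, (t8, t64) in TIMINGS.items():
        factor = speedup_factor(t8, t64)
        label = "super-linear" if factor > IDEAL_FACTOR else "sub-linear"
        print(f"{machine}: x{factor:.1f} ({label})")
```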

Table 2. Time in Wall Clock Seconds on the IBM SP/WH2-375 and Compaq AlphaServer ES40/EV67-667 as a Function of Processor distribution for the ANGUS CG-ILU Benchmark (144³).

                  Number of Nodes
Number of CPUs    1        2        4        8        16

IBM SP/WH2-375
4                 10520    5880     4060     -        -
8                 -        4394     2560     1899     -
16                -        -        1864     1136     -
32                -        -        -        776      -

Compaq AlphaServer ES40/EV67-667
4                 4312     3078     2511     -        -
8                 -        1935     1414     1174     -
16                -        -        676      569      511
32                -        -        -        292      249
64                -        -        -        -        158

The strong correlation between elapsed time and node occupancy in the above timings points to the driving influence of memory bandwidth on this benchmark. Thus performing an 8 CPU run on the IBM SP/WH2-375 realises elapsed times that vary by a factor of 2.3 depending on processor distribution (from 4394 seconds on 2 nodes to 1899 seconds on 8 nodes). Similarly the 16 CPU benchmark on the AlphaServer SC requires 676 seconds on four 4-way processor nodes, and 511 seconds when using a single CPU of each of the available 16 nodes. The better memory bandwidth of the AlphaServer node accounts for the somewhat smaller variation in timings for a given node occupancy (1935 secs. for an 8 CPU run on 2 nodes, 1174 secs. for 8 CPUs on 8 nodes). These performance attributes are completely consistent with the STREAM memory bandwidth benchmark [3] on the nodes of each machine. The TRIAD bandwidth of 900 MBytes/sec measured on a dedicated single SP/WH2 node is reduced to some 225 MBytes/sec when running the same benchmark on all 4 CPUs of the node.
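For readers unfamiliar with STREAM, the TRIAD kernel and its bandwidth accounting can be sketched in a few lines. The Python below is illustrative only: the function names are assumptions, and an interpreted loop measures interpreter overhead rather than hardware memory bandwidth, so the number it prints should not be compared with the figures quoted above. What it does show is the kernel itself and the convention of counting three 8-byte words (two reads and one write) per element.

```python
import time


def triad(b, c, q):
    """STREAM TRIAD kernel: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def triad_bytes(n, word_bytes=8):
    """Bytes counted per TRIAD pass: two arrays read and one array written."""
    return 3 * n * word_bytes


def measured_mbytes_per_sec(n=200_000, q=3.0):
    """Time one TRIAD pass and convert to MBytes/sec (10**6 bytes per second)."""
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    a = triad(b, c, q)
    elapsed = time.perf_counter() - start
    assert a[0] == 7.0  # 1.0 + 3.0 * 2.0 with the default q
    return triad_bytes(n) / elapsed / 1.0e6


if __name__ == "__main__":
    print(f"TRIAD (interpreted Python): {measured_mbytes_per_sec():.1f} MBytes/sec")
```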

While the performance of the Regatta-H is indeed impressive at small processor count, this advantage is rapidly lost as more CPUs have to compete for memory bandwidth on the same shared-memory node. While the complex cache hierarchy of the Regatta-H is designed to minimize this effect (a single processor job is actually running in an environment comprising the total cache associated with an 8-way MCM, i.e. 128 MByte), it is clear that once CPUs are forced to access main memory, performance rapidly degenerates. A TRIAD bandwidth of 4.2 GB/sec for a single process on a 16-way Regatta-HPC is reduced to 1.6 GB/sec when running the same benchmark on all 16 CPUs of the node.

The super-linear speed-up noted at 64+ CPUs on both AlphaServer SC ES45 and Origin 3800 R14k is almost certainly caused by cache effects. At this point the 8MByte cache on CPUs is certainly alleviating the memory bandwidth problems encountered at smaller node counts.

Turning to the commodity-based timings of Table 3, note again that figures for the CS2, CS5 and CS7-CS9 clusters refer to CPU configurations in which all CPUs on a given node are involved in the computation. The optimum cluster performance is derived from the Myrinet-connected CS9 Pentium 4/2000 Cluster, which at 32 CPUs outperforms the Cray T3E by a modest factor of 1.68, and achieves only 35% of the performance of the AlphaServer SC ES45/1000. This benchmark does provide an example of one of the shortcomings of commodity-based systems with their reliance on "cheap" memory sub-systems. The CS9 cluster is seen to outperform the dual Itanium/800-based CS8 cluster and the Alpha Linux CS2 cluster by factors of ca. 1.2. We see that while the Alpha Linux cluster outperforms both the Cray T3E/1200E and the IBM SP/WH2-375 up to 16 nodes, this advantage is effectively lost at 32 CPUs, when the machine exhibits almost identical timings to the IBM (751 vs. 776 seconds respectively).

Table 3. Time in Wall Clock Seconds on a variety of commodity-based systems (CS1-CS9) for the ANGUS CG-ILU Benchmark (144³)

        Commodity Systems, CSx
CPUs    CS1     CS2     CS3     CS5     CS6     CS7     CS8     CS9
8       6540    1943    3685    7080    5130    3024    3674    2221
16      3610    936     2063    3930    2870    1895    1887    1218
32      1830    751     -       -       1600    995     800     647
64      -       -       -       -       780     450     440     302

 

†CS1 PIII/450 + FE: LAM/MPI   CS2 QSNet Alpha Linux EV67/667   CS5 dual PIII/930 + SCALI

CS6 PIII/800 + FE: LAM/MPI   CS7 dual K7/1000 + SCALI   CS8 dual Itanium/800 + myrinet

CS9 dual P4/2000 + myrinet

Considering the slower Pentium III clusters, it is clear that enhancing the CPU speed while leaving the memory subsystem unaltered produces at best a modest increase in performance. The CS6 PIII/800 cluster outperforms the CS1 PIII/450 cluster by a factor well below the MHz ratio (a factor of 1.3 on 8 CPUs, decreasing to just 1.1 on 32 CPUs). Equally moving to dual processor nodes, with the effective halving of memory bandwidth leads to a major performance hit; thus the CS5 SCALI-based cluster with dual-processor PIII/930 CPUs is outperformed at all CPU counts by the CS1 Cluster with its more modest PIII/450 CPUs and fast ether interconnect. Additional insight into the findings above can again be seen by varying the distribution of processors over the available dual processor nodes (see Table 4).

Table 4. Time in Wall Clock Seconds on a Variety of Commodity Clusters as a Function of Processor distribution for the ANGUS CG-ILU Benchmark (144³).

                  Number of Nodes
Number of CPUs    1        2        4        8        16

CS2 Alpha Linux Cluster
4                 -        6030     3378     -        -
8                 -        -        3238     1943     -
16                -        -        -        1635     936
32                -        -        -        -        751

CS5 Dual PIII / 930 SCALI Cluster
4                 -        -        8122     -        -
8                 -        -        7037     4556     -
16                -        -        -        3927     -

CS7 AMD K7/1000 MP + SCALI/SCI
4                 -        5973     4771     -        -
8                 -        -        3024     2411     -
16                -        -        -        1895     1347
32                -        -        -        -        995

CS8 Itanium/800 + Myrinet
4                 -        7414     6711     -        -
8                 -        -        3674     3329     -
16                -        -        -        1886     1789
32                -        -        -        -        800

CS9 Pentium 4 /2000 + Myrinet
4                 -        4170     2478     -        -
8                 -        -        2221     1351     -
16                -        -        -        1218     674
32                -        -        -        -        647

The strong correlation between elapsed time and node occupancy again points to the driving influence of memory bandwidth. Thus performing a 16 CPU run on the Alpha Linux Cluster requires 1635 seconds on 8 dual processor nodes, and 936 seconds when using a single CPU of each of the available 16 nodes, i.e. a factor of 1.7 difference in performance. Similar factors are found for the 16 CPU run on the CS9 Pentium 4 (1.81) and the 8 CPU run on the CS5 Dual PIII / 930 SCALI Cluster (1.54). The timings above would suggest that the memory subsystem on the CS8 Itanium/800 cluster, and to a lesser extent that on the CS7 AMD K7/1000 MP cluster, is significantly better than that on the Pentium and Alpha systems. While the performance gain from using both, rather than just 1, of the processors on 16 nodes is modest at best on the latter systems, that on CS8 and CS7 shows an improvement factor of 2.2 and 1.4, in line with the increased number of CPUs. These performance attributes are again completely consistent with the STREAM memory bandwidth benchmark [3] on the nodes of each machine. Thus the TRIAD bandwidth of 1 GByte/sec measured on a dedicated dual processor UP2000 6/667 node is reduced to some 500 MBytes/sec when running the same benchmark on both CPUs of the node.
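The two ratios used in this discussion can be recomputed from Table 4. In the Python sketch below (function and variable names are assumptions made for the example), the occupancy penalty compares a 16-CPU run on 8 dual-processor nodes with the same run on 16 nodes, while the second-CPU gain compares 16 and 32 CPUs on the same 16 nodes.

```python
# Wall-clock seconds from Table 4:
# (16 CPUs on 8 nodes, 16 CPUs on 16 nodes, 32 CPUs on 16 nodes)
TABLE4 = {
    "CS2 Alpha Linux": (1635, 936, 751),
    "CS7 AMD K7/1000 MP": (1895, 1347, 995),
    "CS8 Itanium/800": (1886, 1789, 800),
    "CS9 Pentium 4/2000": (1218, 674, 647),
}


def occupancy_penalty(t_two_per_node, t_one_per_node):
    """Slow-down from packing the same CPU count onto half as many nodes."""
    return t_two_per_node / t_one_per_node


def second_cpu_gain(t_one_per_node, t_two_per_node):
    """Speed-up from enabling the second CPU on every node already in use."""
    return t_one_per_node / t_two_per_node


if __name__ == "__main__":
    for cluster, (t16_on_8, t16_on_16, t32_on_16) in TABLE4.items():
        print(f"{cluster}: penalty x{occupancy_penalty(t16_on_8, t16_on_16):.2f}, "
              f"second-CPU gain x{second_cpu_gain(t16_on_16, t32_on_16):.2f}")
```

A second-CPU gain close to 2 indicates that the node's memory system can feed both processors; a gain close to 1 indicates that the second processor contributes little.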

Finally, we have increased the grid size from the rather modest value of 144³ above in two further series of calculations, and present the timings for just ten iterations on a variety of hardware in Tables 5 and 6 (196³) and Tables 7 and 8 (288³).
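Given the earlier expectation that the computational work is roughly proportional to n³, the relative cost per iteration of the larger grids can be estimated with a one-line calculation. The Python sketch below uses a hypothetical helper name and ignores cache and communication effects, so it gives only the nominal growth in work, not a prediction of elapsed time.

```python
def relative_work(n, n_ref=144):
    """Work relative to the reference grid, assuming cost proportional to n**3."""
    return (n / n_ref) ** 3


if __name__ == "__main__":
    for n in (144, 196, 288):
        print(f"{n}^3 grid: x{relative_work(n):.2f} the work of 144^3 per iteration")
```

Note that the larger-grid timings that follow are for ten iterations rather than one hundred, so they are not directly comparable with Tables 1 to 4 without allowing for that factor.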

Table 5. Time in Wall Clock Seconds for Ten iterations of the ANGUS CG-ILU Benchmark (196³) on the SGI Origin 3800/R14k-500, Compaq AlphaServer SC (ES40/667 and ES45/1000) and the IBM SP/WH2-375 and Regatta-HPC.

        IBM          IBM               SGI Origin      Compaq AlphaServer   Compaq AlphaServer
CPUs    SP/WH2-375   SP/Regatta-HPC    3800/R14k-500   SC ES40/667          SC ES45/1000
8       1794         228               1756            764                  512
16      819          157               502             354                  244
32      331          -                 180             125                  101
64      -            -                 45              39                   31

 

Table 6. Time in Wall Clock Seconds for Ten iterations of the ANGUS CG-ILU Benchmark (196³) on a number of commodity clusters.

        CS2 Alpha Linux     CS7 AMD dual-K7/1000   CS8 dual-Itanium/800   CS9 P4/2000 Xeon
CPUs    Cluster / EV67      MP + SCALI             + Myrinet              + Myrinet
8       1048                1128                   1390                   761
16      603                 587                    668                    394
32      268                 314                    320                    196
64      -                   -                      117                    88

Table 7. Time in Wall Clock Seconds on the Compaq AlphaServer SC (ES40/667 and ES45/1000), SGI Origin 3800/R14k-500 and IBM SP/WH2-375 and Regatta-H for Ten iterations of the ANGUS CG-ILU Benchmark (288³).

        IBM          SGI Origin      Compaq AlphaServer   Compaq AlphaServer   IBM SP / Regatta-H
CPUs    SP/WH2-375   3800/R14k-500   SC ES40/667          SC ES45/1000         1.3 GHz
16      4098         2602            1710                 1169                 884
32      1961         1232            800                  563                  661
64      -            399             302                  204                  -
128     -            -               110                  81                   -

Table 8. Time in Wall Clock Seconds for Ten iterations of the ANGUS CG-ILU Benchmark (288³) on a variety of commodity systems. The SGI Origin 3800/R14k-500 is included for comparison.

        SGI Origin      CS2 Alpha Linux     CS7 AMD dual-K7/1000   CS8 dual-Itanium/800   CS9 P4/2000 Xeon
CPUs    3800/R14k-500   Cluster / EV67      MP + SCALI             + Myrinet              + Myrinet
16      2602            2348                2373                   2656                   1595
32      1232            1161                1144                   1331                   819
64      399             -                   509                    548                    360
128     -               -                   -                      -                      -

 

References

[1] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from an engineering perspective, Parallel Computing (1996).

[2] T.F. Chan and C-C.J. Kuo, Parallel Elliptic Preconditioners: Fourier Analysis and Performance on the Connection Machine, Computer Physics Communications, Vol. 53, 1989, pp 237-252.

[3] The STREAM Memory Bandwidth benchmark, see http://www.cs.virginia.edu/stream.

 

Applications Performance: Summary

M.F. Guest
CCLRC Daresbury Laboratory, Daresbury, Warrington, WA4 4AD.
M.F.Guest@dl.ac.uk

In the reports above we have presented a number of benchmarking results intended to update work described in the previous SLA reports, in which the performance of commodity systems based on Pentium III processors was judged against the Cray T3E/1200E. There is now little point in taking the Cray T3E/1200E as the standard, or in considering the performance of outdated Pentium III-based commodity systems. We now position the Compaq AlphaServer SC ES45/1000 (the TCS1 system at PSC) as the standard high-end resource, and consider the relative performance of the CS9 Pentium 4/2000 with Myrinet interconnect as representative of today's typical commodity-based offering. We summarise in Table 1 the conclusions of the benchmarking exercise on the reported applications, by showing

  • The percentage of a 32-CPU partition of the Compaq AlphaServer SC ES45/1000 and SGI Origin 3800/R14k-500 delivered by the Pentium 4/2000 CS9 system (i.e. T32-CPU (high-end) / T32-CPU (CS9)), and
  • The percentage of a 32-CPU partition of the SGI Origin 3800/R14k-500 delivered by the QSNet Alpha Linux CS2 system (i.e. T32-CPU (SGI Origin 3800-R14k) / T32-CPU (CS2)).

                    T32-CPU (high-end) / T32-CPU (CS9)                    T32-CPU (Origin 3800-R14k) /
Application Code    AlphaServer SC ES45/1000   SGI Origin 3800/R14k-500   T32-CPU (Alpha Linux Cluster)
                    (%)                        (%)                        (%)

GAMESS-UK
  SCF               95%                        135%                       99%
  DFT               70-73%                     117-161%                   99-137%
  DFT (Jfit)        59-70%                     93-111%                    89-122%
  DFT Gradient      69%                        114% (§)                   89%
  MP2 Gradient      59%                        78%                        87%
  SCF Forces        92%                        148%                       86%
DL_POLY
  Ewald-based       44-53%                     71-84%                     88-95%
  bond constraints  41%                        64%                        82%
CHARMM              95%                        103%                       78%
CPMD                30%                        53%                        106%
ANGUS               35-69%                     51-147%                    44-106%

(§) Outperforms 128 nodes of the Cray T3E/1200E

Table 1. Application Performance: Percentage of a 32-processor partition of (i) the Compaq AlphaServer SC ES45/1000 and SGI Origin 3800 R14k achieved by 32 processors of the CS9 Pentium 4/2000-based Cluster, and (ii) the SGI Origin 3800 R14k achieved by the CS2 QSNet Alpha Linux Cluster.
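
The delivery percentages in Table 1 are ratios of elapsed times measured on equal processor counts. As a minimal sketch of that calculation, assuming elapsed times in seconds (the function name and the timings below are hypothetical and are not taken from the benchmarks):

```python
def percent_delivered(t_reference, t_cluster):
    """Percentage of the reference system's performance delivered by the cluster.

    Both arguments are elapsed times for the same job on the same number of CPUs.
    """
    if t_reference <= 0 or t_cluster <= 0:
        raise ValueError("elapsed times must be positive")
    return 100.0 * t_reference / t_cluster


# Hypothetical timings (seconds), for illustration only.
print(percent_delivered(300.0, 400.0))  # cluster slower than reference: 75.0
print(percent_delivered(300.0, 250.0))  # cluster faster than reference: 120.0
```

A figure above 100% therefore means that the cluster completes the job sooner than the reference system.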

These figures suggest the following:

  1. Suitably-configured commodity-based systems still not only provide highly cost-effective departmental, mid-range solutions, but can also match the levels of performance associated with a significant fraction of a high-end machine, again for a small fraction of the cost. Averaged over 25 different data sets spanning 8 applications, we find that 32 CPUs of the Pentium 4/2000 Myrinet-connected CS9 cluster deliver 66% of the performance of the AlphaServer SC ES45/1000. The weakest performance is found in applications limited either by poorly tuned collective operations (e.g. 30% in CPMD, due to the MPI_ALLTOALL collective) or by the memory subsystem (35% in ANGUS). The CS9 cluster is found on average to outperform the SGI Origin 3800/R14k-500, with a 32-CPU delivery figure of 109%.
  2. We have compared throughout this report the performance of a variety of clusters with a number of recent proprietary high-end offerings from IBM, SGI, Compaq and Cray. Systems from IBM include the older power3-based SP/WH2-375 together with a variety of power4-based nodes: the 32-way Regatta-H, the 16-way Regatta-HPC and the 8-way pSeries 690Turbo. Compaq systems include the AlphaServer SC ES40 (with both 667 and 833 MHz CPUs) and ES45 (with 1 GHz CPUs). Origin 3800 systems from SGI include both R12k-400 and R14k-500 CPUs, while the systems from Cray include the Cray T3E/1200E and a prototype of Cray's EV68-based Linux Alpha Cluster. Although all recent high-end machines predictably outperform the Cray T3E, typically by factors of 3-8 at modest node counts (e.g. 32 CPUs), there is clear evidence of an increasing imbalance between CPU and interconnect performance. This manifests itself in a marked lack of scalability with increased processor count for these systems compared to the Cray T3E. Thus the average percentage delivery figures for 16 (546%), 32 (474%), 64 (403%) and 128 (319%) CPUs of the AlphaServer SC ES45/1000 against the T3E/1200E are seen to decrease with increasing number of CPUs, at least across the applications in this study. The corresponding figures for the SGI Origin 3800/R14k-500 do suggest a more balanced system, with the 16-CPU figure of 323% decreasing to 240% at 128 CPUs.
  3. Our previous analysis suggested that in many applications, inexpensive Pentium-based systems with a simple Fast Ethernet connection deliver a significant fraction of Cray T3E performance. While applications with extensive communication demands clearly exhibit inferior performance and scalability on the IA32-based system (e.g. DL_POLY with bond constraints, direct-MP2 gradient calculations using GAMESS-UK), the delivered performance of the Pentium III/800-based CS6 cluster is at worst 42% of the Cray T3E/1200E. Many of the other applications show a much higher delivered level of performance. Relying on Fast Ethernet with the current generation of commodity IA32 and IA64 CPUs provides, however, too great an imbalance for such systems to be competitive with high-end solutions such as the AlphaServer SC, at least in the application space considered here. Coupling these CPUs with an enhanced interconnect such as Myrinet remedies this imbalance; it then makes little or no sense to use the high-end machines for 32-node runs when competitive performance is achieved by a solution that costs a small fraction of that associated with the proprietary hardware.
  4. Our previous analysis suggested that there are a number of performance issues associated with the CS2 QSNet Alpha Linux Cluster that act to constrain performance, most notably the limited memory bandwidth of the UP2000 and the effective utilisation of L2 cache, the so-called issue of "page colouring" under Linux. Allowing for these, results from the CS2 Alpha cluster remain encouraging. In all benchmarks, the 32-CPU cluster exceeds the performance of 64 nodes of the Cray T3E/1200E (and that associated with the 32-CPU IBM SP/WH2). In optimal cases (those marked with a § in the Table) the Alpha Cluster outperforms 128 nodes of the Cray T3E/1200E. With the exception of the smaller ANGUS benchmark, the CS2 cluster is competitive in performance with the newer machines, achieving for example between 78% and 130% of SGI Origin R14k/500 performance across a wide range of processor counts. Averaged over all the present applications, the CS2 Cluster is only marginally slower than the Pentium 4/2000-based CS9 Cluster, and outperforms the CS7 Scali-connected Athlon system by a factor of two. The latter cluster suffers in the present comparison through the absence of a tuned implementation of the Global Array tools.
  5. Initial results from the Itanium-based Titan Cluster (CS8) at NCSA do not provide compelling evidence in support of Intel's IA64 architecture. While these results are confined to just three application codes (including DL_POLY and ANGUS), there is apparently little overall performance advantage over the IA32-based Pentium 4 CS9 Cluster to justify what is undoubtedly a more expensive solution. At least part of the reason for this lies in the limited optimisation possible using Intel's efc FORTRAN compiler. Our initial benchmarking of the successor processor from Intel, the Itanium 2 or McKinley, coupled with the far more effective f90 compiler from HP, suggests this position is likely to change quite dramatically in the near future.
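
The scalability argument in point 2 can be made concrete by asking how much of the 16-CPU advantage over the Cray T3E/1200E survives at 128 CPUs. A minimal sketch using only the averages quoted in that point (the variable and function names are illustrative):

```python
# Average percentage delivery relative to the Cray T3E/1200E, keyed by CPU count,
# as quoted in point 2 of the summary.
alphaserver_sc_es45 = {16: 546, 32: 474, 64: 403, 128: 319}
origin_3800_r14k = {16: 323, 128: 240}


def retained_fraction(delivery):
    """Delivery at the largest CPU count divided by delivery at the smallest."""
    smallest, largest = min(delivery), max(delivery)
    return delivery[largest] / delivery[smallest]


for name, figures in [("AlphaServer SC ES45/1000", alphaserver_sc_es45),
                      ("SGI Origin 3800/R14k-500", origin_3800_r14k)]:
    print(f"{name}: {retained_fraction(figures):.2f}")
```

On these figures the AlphaServer SC retains roughly 58% of its 16-CPU advantage at 128 CPUs, against roughly 74% for the Origin 3800, which is the sense in which the latter appears the more balanced system.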

In summary, the collection of results presented in this report provides compelling evidence in support of commodity-based clusters. Suitably-configured Beowulf systems not only provide highly cost-effective departmental, mid-range solutions, but can also match the levels of performance associated with a significant fraction of a high-end MPP machine, again for a small fraction of the cost.

 

design by CCP1, March 2003