the newsletter of
collaborative
computational project 1

Beowulf Distributed Computing and Applications Performance

Martyn F. Guest, Barry G. Searle, and Steve J. Andrews
Distributed Computing Group, Daresbury Laboratory, Warrington WA4 4AD
m.f.guest@dl.ac.uk, s.j.andrews@dl.ac.uk, and b.g.searle@dl.ac.uk

Paul Sherwood, and Huub J.J. van Dam
Quantum Chemistry Group, Daresbury Laboratory, Warrington WA4 4AD
p.sherwood@dl.ac.uk, and h.j.j.vandam@dl.ac.uk


Contents

Daresbury Laboratory Beowulf Projects *

SPEC Benchmarks *

Computational Chemistry Benchmarks *

Communications Performance Benchmarks *

Metrics used in the Performance Evaluation of Beowulf Clusters *

Performance Analysis Tools *

Applications Performance: NWChem *

Applications Performance: GAMESS-UK DFT, MP2 and 2nd Derivatives *

Applications Performance: GAMESS-UK DFT Coulomb Module *

Applications Performance: DL_POLY *

Applications Performance: CHARMM *

Applications Performance: CASTEP, CPMD and CRYSTAL *

Applications Performance: ANGUS *

Applications Performance: FLITE3D *

Applications Performance: SUMMARY *


Daresbury Laboratory Beowulf Projects

S.J. Andrews, M.F. Guest and B.G. Searle

Networks of personal computers (so called Beowulf systems) composed of fast PCs configured with large quantities of RAM and hard disk, and running the Linux operating system are becoming more and more attractive as cheap and efficient platforms for distributed applications [1]. The main drawback of a standard Beowulf architecture is the poor performance of the conventional inter-process communication mechanisms based on RPC, sockets, TCP/IP, Ethernet. Such standard mechanisms are thought to perform poorly both in terms of throughput and message latency. Nevertheless, there is increasing interest in the use of commodity ‘off-the-shelf’ components as building blocks for high-performance computing. This is evident in many areas. An example within the UK academic community is provided by the latest round of the Joint Research Equipment Initiative (JREI'2001). This included equipment requests for more commodity-based clusters than proprietary SMP-based solutions (e.g. Origin 3400), in stark contrast to previous rounds of such Initiatives.

To investigate the potential of this class of system, a number of these "Beowulf-class" prototype systems have been assembled and evaluated at Daresbury [2]. The programme of evaluation is focused on both system software and hardware, and on assessing the delivered performance across a broad spectrum of end-applications. The programme aims to inform the community over the wide variety of available options, from choice of CPU (Alphas, single- and dual-Pentiums, caches, Xeon or old memory subsystems) to choice of interconnect (Ethernet, Myrinet, Scali SCI, QsNet etc.). This report provides an update on progress, with a particular focus on application performance. As our assessment also aims to position Beowulf systems on the HPC landscape, we also take the opportunity to present application results on the following proprietary high-end solutions:

  1. IBM SP/WH2-375: The 32 CPU (8-node) IBM SP system at Daresbury features 4-way Winterhawk2 SMP "thin nodes" (375 MHz Power3-II processors with 8 MB L2 cache), 1.6GB/sec node memory bandwidth (single bus), and interconnect switch with 150 MB/sec unidirectional, 200 MB/s bi-directional bandwidth.
  2. SGI Origin3800: 'TERAS', the supercomputer installed at Sara, the Netherlands is a 1024-CPU system consisting of two 512-CPU SGI Origin 3800 systems. This machine will have a peak performance of 1 TFlops (1012 floating point operations) per second. The machine has recently been fitted with 500MHz R14k CPUs organized in 256 4-CPU nodes and will possess 1 TByte of memory in total. Benchmarks were conducted on both the interim solution, comprising 400 MHz R12k CPUs, and on the final R14k-based system.
  3. Compaq AlphaServer SC: The Compaq AlphaServer SC features 4-way ES40 SMP nodes (with 667 MHz Alpha 21264A CPUs with 8 MB L2 cache), 5.2 GB/sec node memory bandwidth (dual bus) and the Quadrics "fat tree" interconnect, QsNet (5usec latency, 210 MB/sec bandwidth). Benchmarks were conducted on a 64-node system. Results have also been obtained on a second 64-node AlphaServer SC featuring 833 MHz Alpha 21264A CPUs.
  4. Cray Alpha Linux Supercluster: The configuration of the Cluster prototype system "cougar" included a single OS and I/O node and 96 Application nodes. Each application node has two 833-Mhz alpha CPUs (64 nodes having 2 GB memory, and 32 nodes having 1 GB memory). All application nodes have a local disk with 14.5 GB of scratch space. The cluster featured the myrinet interconnect, and was running Red Hat Linux release 6.2 (kernel 2.2.20 build40 on a 2-processor alpha).

Following an overview of the commodity-based systems currently under assessment, we consider in subsequent articles the results of a number of benchmarks designed to assess the building blocks of any cluster, namely (i) serial node CPU performance, and (ii) communications benchmarks across the variety of possible cluster interconnects. The major focus of these subsequent reports is, however, on applications and we present performance comparisons between Beowulf systems, the CSAR Cray T3E/1200E, and the proprietary systems outlined above. Applications considered include those from computational chemistry (NWChem [3], GAMESS-UK [4], DL_POLY [5] and CHARMM [6]), computational materials (CRYSTAL [7], and the Car Parrinello codes, CASTEP [8] and CPMD [9]), and computational engineering (ANGUS [10] and FLITE3D).

Evaluation has focused on both PC- and Alpha EV6-based platforms. The former includes Intel's Pentium III, Pentium 4 and Itanium CPUs, and AMD's Athlon processors. The Alpha-based platforms include DS10, DS20, ES40, XP-1000 and DPC264 machines, from 466 MHz to industry-leading 1000 MHz EV68 models. Various system and resource management packages (Lobosq, Sychron, Beowulf, Quadrics RMS and PBS) have been investigated, while several flavours of message passing software (MPICH, LAM6.2, LAM6.3, MPI-VIA and Shmem), compilers (Compaq, Absoft, PGI and GNU/g77) and numerical libraries (ATLAS, NASA, Intel MKL) have been tested. Implementations of the Global Array (GA) tools and parallel eigensolvers from PNNL have also been completed. As well as the choice of processor, a variety of networking options including fast Ethernet, Myrinet and more recently the low-latency, high bandwidth solutions, SCI from Dolphin and QsNet from Quadrics, have also been evaluated.

The assessment of a variety of prototype commodity-based systems (CS) is now producing a wealth of data. Seven such systems have been used in the present study (CS0-CS6), four in-house (CS0-CS3), the three others though collaborative links to vendors and other Beowulf sites (CS4-CS6). The in-house Pentium-based machines (CS0 and CS1) have been used to benchmark a host of applications (see below), while CS2 (a Linux Alpha EV67-based UP2000/Quadrics system) and CS3 (an AMD Athlon-based Myrinet system) have both been installed during the past twelve months.

  1. CS0 and CS1 - An initial 10-processor 266 MHz Pentium II system (CS0) and its successor, a 32-processor 450MHz Pentium III Linux system (CS1), have acted as the primary test vehicles. CS1 has dual Ethernet networks and has been running both the LOBOSQ (from NIH, USA) and the PBS (Portable Batch system) job scheduling software. The beowulf1 system has been fully operational since mid-September '99, and is still being used successfully as a testbed installation, providing benchmark results on a number of parallel applications (see subsequent articles).
  2. CS2 - Installation of a fully integrated and configured 32 processor EV67 Alpha Linux-based system with high-performance Quadrics interconnect (5 usec latency and 210 MBytes/s bandwidth) was completed in May '00. Numerous applications have been implemented on this system, with associated benchmark results presented in later articles. Note that the loki system was recently upgraded to 64 CPUs, with an additional 8 dual UP2000 nodes plus 8 dual CS20/EV67-833 nodes.
  3. CS3 - Installation of an AMD Athlon-based / Myrinet system comprising 16 X 850MHz AMD K7 machines was completed in June '00, with the system now producing benchmark figures.
  4. In addition to the in-house systems above, the following machines have been evaluated and used to benchmark applications:

  5. CS4 - The 128 CPU Athlon-based cluster at Sara (Amsterdam). Partitioned into a number of community-based partitions, we present results on up to 16 of the 700 MHz AMD processors over a Fast Ethernet network, and on up to 32 processors of the recently upgraded cluster with 1.2 GHz AMD CPUs (CS4).
  6. CS5 - A prototype 16 processor machine (8 dual Pentium III/933 MHz processor nodes) with the high performance SCALI interconnect has been kindly made available by Workstations UK.
  7. CS6 - Access to the 528 processor CliC (Chemnitzer Linux Cluster) has enabled a number of benchmarks with higher processor count to be performed. The nodes are single processor Pentium III/800 with both fast ethernet and gigabit ethernet communication systems.

References

[1] D. Ridge, D. Becker, P. Merkey, T. Sterling and P. Merkey, Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proceedings, IEEE Aerospace (1997).

[2] Additional information on the Daresbury Beowulf Systems is available via World Wide Web URL http://www.cse.clrc.ac.uk/disco.

[3] Additional information on NWChem and the Environmental Molecular Sciences Laboratory, is available via the Pacific Northwest Home Page at http://www.pnl.gov:2080/

[4] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick, K. Schoffel and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders and A.J. Stone. (http://www.dl.ac.uk/CFS)

[5] see, http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/

[6] CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983), by B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus.

[7] CRYSTAL was jointly developed by the Theoretical Chemistry Group at the University of Torino and the Computational Materials Science group in CLRC. The program computes the electronic structure of periodic materials within Hartree Fock, density functional or various hybrid approximations (http://www.cse.clrc.ac.uk/Activity/CRYSTAL/).

[8] CASTEP (CAmbridge Serial Total Energy Package) is the ab initio code for the solution of the electronic ground state of periodic systems with the wavefunctions expanded in a plane wave (PW) basis, using techniques based on density functional theory. M.C. Payne et. al., Rev. Mod. Phys. (1992) 64 p. 1045 (http://www.cse.clrc.ac.uk/Activity/UKCP/).

[9] CPMD Version 3.3: Hutter, Alavi, Deutsh, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999).

[10] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from an engineering perspective, Parallel Computing (1996).


SPEC Benchmarks

M.F. Guest

The prototype commodity-based systems described in the preceding article are based on the Pentium, Athlon and Alpha CPUs as the node building block. The choice of optimal CPU (performance vs. cost) has been informed by an analysis of the serial performance of a wide variety of processors across a number of computational chemistry benchmarks. Part of the Distributed Computing support programme at Daresbury Laboratory involves an on-going comparison of a variety of different computer systems across a variety of application areas. We summarise below the results of such a comparison for the SPEC benchmark; benchmark comparisons in the area of computational chemistry are presented in the following article.

One of the most useful indicator of CPU performance is provided by the SPEC (``Standard Performance Evaluation Corporation'') benchmarks. This benchmark suite [1] contains non-tuned application-based code to measure processor speed for both integer (SPECint) and floating point (SPECfp) arithmetic. SPECfp95 and SPECint95, and their successors, SPECfp2000 and SPECint2000, have become industry standards in measuring primarily the performance of a system's processor, memory architecture, operating system and compiler. The next generation of SPEC benchmarks, SPEC CPU2000 [2], has recently replaced SPEC95. CFP2000 is derived from the results of fourteen floating-point benchmarks compiled with aggressive optimization. It is the geometric mean of fourteen normalised ratios (one for each floating-point benchmark). CINT2000 is derived from the results of twelve integer benchmarks compiled with aggressive optimization. It is the geometric mean of twelve normalised ratios (one for each integer benchmark). Note that the level of optimisation is not mandated. While highly aggressive optimisation is permitted, results derived from benchmarks compiled with conservative optimisation (SPECfp_base2000) can be submitted.

Table 1: SPEC CPU2000 - SPECfp and SPECint Values and Values Relative to the Compaq AlphaServer ES40/6-833.

Machine

SPECfp2000

SPECint2000

Relative values (%)

     

SPECfp2000

SPECint2000

Compaq Alpha ES45 68/1000

842

655

108%

116%

Compaq Alpha DS20E 68/833

784

571

101%

101%

Compaq Alpha ES40 6/833

777

565

100%

100%

Compaq Alpha GS160 68/1001

756

621

97%

110%

Intel D850GB (2.0 GHz P4)

714

656

92%

116%

HP RX4610 Itanium/800 4MB L3

701

379

90%

67%

Intel D850GB (1.9 GHz P4)

696

634

90%

112%

Intel D850GB (1.8 GHz P4)

678

613

87%

108%

Intel D850GB (1.7 GHz P4)

659

587

85%

104%

HP i2000 Itanium/800 2MB L3

655

-

84%

-

API UP2000 833 MHz

644

533

83%

94%

Intel D850GB (1.6 GHz P4)

637

565

82%

100%

HP RX4610 Itanium/733 2MB L3

623

-

80%

-

Intel D850GB (1.5 GHz P4)

615

539

79%

95%

Dell 530/1.7GHz P4 Xeon

609

593

78%

105%

Intel D850GB (1.4 GHz P4)

590

512

76%

91%

Compaq Alpha DS20E 6/667

582

455

75%

81%

HP 9000 J6700 / PA8700-750

581

603

75%

107%

Dell 530/1.5GHz P4 Xeon

566

543

73%

96%

Intel D850GB (1.3 GHz P4)

565

486

73%

86%

Compaq Alpha ES40 6/667

562

433

72%

77%

Dell 530/1.4GHz P4 Xeon

540

516

69%

91%

Compaq W6000/1.4GHz P4Xe

538

509

69%

90%

API UP2000 750 MHz

533

456

69%

81%

Compaq AlphaXP1000 6/667

532

403

68%

71%

HP RP7400/PA8700-750

524

551

67%

98%

Dell 330/1.30GHz P4

516

490

66%

87%

Sun Blade 1000 1900 / 900MHz

482

467

62%

83%

AMD Thunder K7 1.2GHz

481

522

62%

92%

SGI Origin 3200 500MHz R14k

463

427

60%

76%

AMD Gigabyte GA-7DX 1.4GHz

458

554

59%

98%

Fujitsu PP800 675MHz

456

475

59%

84%

AMD Gigabyte GA-7DX 1.33GHz

445

539

57%

95%

Compaq Alpha GS160 6/731

444

397

57%

70%

HP 9000 J6000 / PA8600-552

433

441

56%

78%

Compaq Alpha DS20 6/500

422

313

54%

55%

Sun Blade 1000 1750 / 750MHz

421

395

54%

70%

AMD Gigabyte GA-7DX 1.2GHz

417

496

54%

88%

IBM RS/6000 44P-170 (450 MHz)

409

316

53%

56%

SGI Origin 3200 400MHz R12k

407

353.

52%

62%

Dell 420 / 1.0GHz P3

340

462

44%

82%

IBM RS/6000 SP-375MHz High N

337

252

43%

45%

Intel OR840 (1 GHz Pentium III)

335

442

43%

78%

Ultra 10 333MHz

126

133

16%

24%

A subset of the SPECfp2000 and SPECint2000 results for many of the leading CPUs are given in Table 1 (with the baseline system the Ultra 10 333MHz). In each case we have normalised the values relative to those of the Compaq AlphaServer ES40/6-833. An examination of the SPECfp2000 values of Table 1 shows that the leading five machines are from Compaq, with the A21264B 833 MHz CPU in the DS20E and ES40 Model 6/833, and the A21264C 1001 MHz CPU exhibiting ratings of 784, 777 and 756 respectively. Seventeen of the next nineteen entries feature either the Pentium 4 or Itanium/800 MHz CPUs. SPECfp figures for Intel's Pentium 4 range from 590 (1.4 GHz) to 714 (2.0 GHz), while the Itanium-based systems exhibit values from 703 (Dell Poweredge 7150, 800 MHz with 4MB L3 cache) to 623 (HP i2000, 733 MHz, with 2 MB L3 cache). Note that both Itanium values are in fact SPECfp2000_base and not SPECfp2000.

The only non-Intel based CPU in this range is the 833MHz Alpha A21264An (a rating of 644 in the API/UP2000 6/833). All other CPUs exhibit SPECfp2000 ratings of less than 600 i.e. are a factor of at least 1.3 slower than the Compaq ES40 Model 6/833 and a factor of 1.2 slower than the 2.0 GHz Pentium 4. Leading machines from other vendors exhibit the following values:

  • 581 for the 750 MHz PA8700 in Hewlett Packards HP 9000 Model J6700.
  • 481 for the 1.2 GHz AMD Athlon CPU (Tyan Thunder K7 motherboard) and 458 for the 1.4 GHz CPU (Gigabyte GA-7DX Motherboard).
  • 463 for the MIPS R14k/500 MHz CPU and 407 for the R12k/400 MHz CPU, both in SGI's Origin 3200.
  • 482 for the 900MHz UltraSPARC III CPU in the Sun Blade 1000 Model 1900. Based on this figure, the 900 MHz CPU is outperformed by the Alpha ES40/6-833 by a factor of 1.6, but still presents a significant performance enhancement over the older UltraSPARC-II based systems. A comparison with the Sun Enterprise 450/480MHz (291) reveals a factor (1.66) less than the clock speed ratio (1.88).
  • 409 for the 450 MHz RS/6000 power3 CPU in IBM's RS/6000 44P-170, and 382 for the 375 MHz CPU in the IBM RS/6000 SP-375MHz T/W.
  • 340 and 335 for the 1 GHz Pentium III in Dell's Precision WorkStation 420 and in Intel's OR840 motherboard, respectively. Clearly the Pentium III is no longer competitive with its Pentium 4 successor, a factor of 2.1 slower than the 2.0 GHz Pentium 4.

Based on the normalised ratings of Table 1, we might expect the Compaq AlphaServer ES40/833 (100%) to marginally outperform the 2.0 GHz Pentium 4 (92%) and Itanium 800 (4MB L3, 90%; 2MB L3, 84%). These CPUs are followed by the API UP2000/833 (A21264A, 83%). Normalised ratings for the leading CPUs from other vendors outlined above are as follows 75% for the HP 9000 Model J6700/PA8700-750 and Compaq AlphaServer DS20E Model 6/667; 62% for the AMD 1.2 GHz Athlon, Tyan Thunder K7 and Sun Blade 1000 Model 1900/900MHz; 60% for the SGI Origin 3200/R14k-500 and 52% for the Origin 3200/R12k-400; 53% for IBM/s RS/6000 44P-170 (450 MHz); and 44% for the Dell Prec.WorkSt. 420 (1.0 GHz Pentium III).

In terms of leading CPUS from each vendor, the poorest performer would appear to be the current offerings from IBM. Thus the 450 MHz CPU in the IBM RS/6000 44P-170 is a factor of 1.9 times slower than the AlphaServer ES40/833. Clearly the awaited power4 developments from IBM, with a likely SPECfp2000 figure of 1000+ for the initial 1.3 GHz CPU offering will address this situation. The leading 25 entries of Table 1 feature 7 systems from Compaq / API and five from Hewlett Packard (only 1 PA-RISC system). Eleven systems feature Intel's Pentium 4 CPU and six Intel's Itanium CPU. The leading 62 systems in the full Table achieve 50% of the performance of the Compaq AlphaServer ES40/6-833, with 27 of these featuring the commodity CPUs from Intel and AMD. The 2.0 GHz Pentium 4 CPU is seen to outperform the leading Pentium III-based solution by a factor of 2.1.

An examination of the SPECfp2000_base values [2] shows a somewhat different picture. The leading machines are no longer from Compaq, with the A21264B 833 MHz CPU in the DS20E 68/833 (SPECfp2000_base, 643) and ES40 Model 6/833 (621) now outperformed by a number of Pentium 4 (1.7 - 2.0 GHz) and Itanium/800 Mhz CPUs. SPECfp2000_base figures for Intel's Pentium 4 range from 581 (1.4 GHz) to 704 (2.0 GHz), while the Itanium-based systems exhibit values from 703 (Dell Poweredge 7150, 800 MHz with 4MB L3 cache) to 623 (HP i2000, 733 MHz, with 2 MB L3 cache). The only systems that do not feature an Intel-based CPU appearing in top 30 SPECfp2000_base entries include the alpha systems from Compaq and API (Compaq AlphaServer DS20E Model 68/833, ES40 Model 6/833, GS80, GS160 and GS320 Model 68/1001, DS20E Model 6/667 and API UP2000 833 MHz), plus the HP 9000 Model J6700 / PA8700-750 from Hewlett Packard.

References

[1] A SPEC FAQ describing the SPEC benchmark suite and the SPEC consortium is periodically posted to comp.benchmarks, and can be found on the WWW at: www.specbench.org/spec/faq. An excellent summary of the SPEC benchmarks that is periodically updated is available via anonymous ftp from: ftp.cs.toronto.edu in the file /pub/spectable.

[2] http://www.specbench.org/osg/cpu2000/results/cpu2000.html


Computational Chemistry Benchmarks

M.F. Guest

In the previous article we have discussed CPU performance on the general SPEC benchmarks. We summarise below performance in the area of computational chemistry, focusing on a benchmark suite that includes a set of twelve quantum chemistry calculations using the GAMESS-UK electronic structure program [1]. The comparison involves approximately one hundred and twenty computers, ranging from supercomputers to scientific workstations and Pentium, Athlon and Itanium-based PCs.

Vector supercomputers used in this report include the NEC SX-5 and SX-4. A large number of workstations and workstation servers have been benchmarked, including the recent offerings from:

  1. IBM – the Power3 RS6000-based CPU in the IBM RS/6000 44P-270dual and in the quad-processor Winterhawk II (WH2) CPU, both clocked at 375 MHz;
  2. Hewlett Packard – the PA-RISC PA8500 and PA8600 CPUs, with the PA8500 in the HP 9000/C3000 (400 MHz), HP 9000/J5000 and HP9000/N4000 (440 MHz) models, and the PA8600 in the HP PA-9000/J6000 (552MHz).
  3. Compaq – A variety of systems housing the A21264 EV6 CPU clocked at 500, 667 and 833 MHz. The 667 MHz EV6.7 A21264A CPU features in the Linux-based UP2000 dual-processor CPU from API, and in Compaq's PW XP1000, DS20E and ES40 Alphaservers (with 8 MByte L2 cache). The recent 833 MHz EV67 CPU appears in the Compaq ES40 and in API UP2000 (here with 4 MByte L2 DDR cache). The most recent offering is the 1 GHz A21264C EV68 CPU featuring in the ES45.
  4. Silicon Graphics – the R12k- and R14k-based machines. The R12k-based machines include the Origin 3800 (400 MHz), Origin 2000 (300 and 400 MHz), the 270 MHz Octane R12k and O2 R12k, and the 400 MHz R12k Octane 2. The most recent machine benchmarked is the R14k-based 500 MHz SGI Origin 3800.
  5. SUN – the UltraSPARC-II and UltraSPARC III-based machines. UltraSPARC-IIi based-systems include the HPC4500 (with 400 MHz CPU) and Ultra80 (with 450 MHz CPU). Both processors include 4 MByte L2 caches. The recently released UltraSPARC-III CPU, clocked at 600, 750 and 900 MHz, features in the Sun Blade 1000 series; included here is the Model 1750, with 750 MHz UltraSPARC-III and 8 MByte L2 cache.

Note that the present results are taken from a more detailed report on computational chemistry benchmarks [2]. The GAMESS-UK Benchmark is designed to represent the typical range of calculations commonly performed by the ab initio quantum chemist. It includes 12 calculations that feature conventional- and direct-SCF, CASSCF and MCSCF, CI calculations (both direct-CI and conventional table-driven MRD-CI), MP2, and both SCF and MP2 analytic 2nd derivatives.

Table 1: The GAMESS-UK Serial Benchmark: total CPU time (user and system), elapsed time (minutes), efficiency (%) and relative performance from both SPECfp2000 and GAMESS-UK.

 

Machine

GAMESS-UK

Relative Performance
(%)

CPU Time (mins)

Wall Time (mins)

Efficiency
(%)

GAMESS-UK (%)

SPEC-fp 2000 (%)

user

system

total

Compaq Alpha ES45/1000

3.7

0.9

4.6

4.7

98

138

108

Compaq Alpha ES40/833

4.4

1.9

6.3

6.4

99

100

100

AMD Athlon K7/1400 (+)

5.4

1.1

6.5

6.6

99

97

59

Pentium 4/1500 (+)

6.2

0.3

6.5

6.5

100

97

79

Compaq Alpha ES40/667

5.6

1.2

6.8

7.1

95

93

72

HP PA-9000/J6000-552

6.1

0.7

6.8

7.3

93

93

56

AMD Athlon K7/1200 (+)

6.3

0.9

7.2

10.9

66

88

54

Compaq Alpha DS20E/667

5.5

1.9

7.4

7.7

96

85

75

Compaq PW XP1000/667

6.3

1.4

7.7

9.0

85

82

57

AMD Athlon K7/1400

6.7

1.1

7.7

7.8

99

82

59

Pentium 4/1500

7.8

0.3

8.1

8.1

100

78

79

IBM RS/6000-SP/375

7.2

1.0

8.2

9.2

89

77

49

IBM RS/6000 44P-270

7.4

1.0

8.4

9.6

87

75

49

AMD Athlon K7/1200

7.7

0.8

8.5

12.4

68

75

54

AMD Athlon K7/1000 (+)

7.8

1.2

9.0

9.8

92

70

41

HP PA-9000/J5000-440

7.9

1.2

9.1

10.1

90

70

47

API UP2000 6/833

5.6

3.7

9.2

9.4

98

68

83

Pentium III /1000-CM (+)

8.2

1.2

9.4

9.4

100

67

40

HP PA-9000/C3000-400

8.4

1.1

9.5

11.1

85

67

46

Compaq Alpha ES40/500

8.0

1.9

9.9

11.2

89

64

54

SGI O3800/R14k-500

8.0

1.9

10.0

12.4

80

63

60

API UP2000 6/667

6.9

3.1

10.0

10.4

96

63

47

SUN Blade 1000/M1750

9.5

0.8

10.2

10.4

99

62

54

Compaq Alpha DS20/500

8.0

2.3

10.3

13.3

77

61

54

Pentium 4/1400 (+)

8.4

2.0

10.3

12.1

85

61

76

AMD Athlon K7/1000

9.3

1.1

10.4

11.1

94

61

41

(+) using the portland group compiler, pgf77

The data presented in Table 1 is collected under control of the UNIX time command, and includes CPU time (both user and system summed over all 12 calculations), total elapsed time and efficiency (measured as CPU versus elapsed). Also shown is the relative performance of each machine based on SPECfp2000 values and on the GAMESS-UK total CPU times, normalised to values for Compaq's ES40/833. These suggest that the dominant CPUs are the Alpha EV68 and EV67, together with the Athlon AMD K7. Five of the leading 10 machines feature the Alpha processor, and two the Athlon CPU. The fastest machine is the Compaq ES45/1000. The AlphaServer ES45 (3.7 mins.) is seen to outperform the Alpha ES40/833 (4.4 mins,) the AMD K7/1400 (5.4 mins.) and the API UP2000 6/833 (5.6 mins) by factors of 1.19, 1.46 and 1.52 respectively. The UP2000 exhibits similar user CPU times to the Alpha DS20E/667 (5.5mins.) and ES40/667 (5.6 mins.).

These six machines are followed by the HP PA-9000/J6000-552 (6.1 mins.), the Pentium 4/1500 (6.2 mins.), the AMD K7/1200 and Compaq PW XP1000/667 (6.3 mins) and the API UP2000 6/667 (6.9 mins.). The power3-based IBM SP/WH2/375 (7.2 mins) and RS/6000 44P-270 (7.4 mins) follow, somewhat ahead of the AMD Athlon K7/1000 (pgf77, 7.8 mins) and PA8500-based HP PA-9000/J5000-440 (7.9 mins.). The latter shows comparable performance to the EV6/500; of the next ten machines with user CPU timings between 8-9 mins, six feature the EV6 processor. Non-alpha machines in this range include the SGI O3800/R14k-500 (8.0 mins.), Pentium III/1000 (8.2 mins.), and the Pentium 4/1400 and HP PA-9000/C3000-400 (8.4 mins.). The leading 20 machines are from Compaq (8), API (2), IBM (2), HP(2), and SGI (1), plus the leading CPUs from Intel (2) and AMD (3). The fastest machine from SUN (the SUN Blade 1000/M1750, 9.5 mins) is a factor of 2.2 times slower than the Compaq ES40/6-833. The 750 MHz UltraSPARC III is outperformed by the AMD K7/1000 (7.8 mins.), Pentium 4/1400 (8.4 mins.) and Pentium III/866 (9.3 mins.). The availability of Intel's MKL libraries on the Pentium 4/1500 (not on the 1400 MHz) accounts for the significant decrease in user time, from 8.4 to 6.2 mins. Twelve machines are found to lie within a factor of two of the fastest.

A somewhat modified picture emerges when considering the system CPU and elapsed times. With the exception of the Alpha, and to a lesser extent the SGI R10k, most machines exhibit a system CPU time of the order of 10-15% of the user time; this percentage increases significantly, particularly on the AXP, to between 20-40% for the systems shown in the Table. The API UP/2000-833 shows an alarming increase, to a figure of 66%. Based on the summed CPU times, the Compaq ES45/1000 remains the optimum CPU (4.6 mins.).The ES40/6-833 (6.3) is now only marginally faster than the Pentium 4/1500 and the AMD K7/1400 (6.5), and only a factor of 1.1 times faster than the ES40/667 and HP J6000-552 (6.8 mins.). These five machines are followed by the AMD K7/1200 (7.2 mins.) and Compaq's DS20E/667 (7.4 mins.) and PW XP1000/667 (7.7 mins.), IBM's SP/WH2-375 and 44P-270 (8.2-8.4 mins.), and the AMD Athlon K7/1000, HP PA-9000/J5000 and API UP2000 6/833 (9.0-9.2 mins).

Finally we note the performance of the Pentium- and AMD-based hardware; using the pgf77 compiler resulted in identical CPU times for the P4/1500 and AMD K7/1400 (6.5 mins); the 1 GHz Pentium III exhibits a CPU time of 9.4 minutes. While the AMD Athlon K7/1400 is 1.42 times slower than the AlphaServer ES45/1000, we note that SPECfp2000-based prediction would have led to a higher factor of 1.83. Use of the Portland Group pgf77 compiler produces a consistent level of performance improvement compared to g77, typically by a factor of between 1.13-1.17 for the Pentium III and AMD Athlon CPUs, while somewhat higher for the Pentium 4/1500 (a factor of 1.25).

The relative performance figures of Table 1 suggest that the GAMESS-UK results are broadly in line with the SPECfp ratings, with the EV68- and EV67-based Compaq machines providing the optimum CPUs. However, this superiority is not as pronounced as might be expected from just a consideration of the SPECfp2000 rankings. These results, together with known costs, strongly suggest that the EV67 and EV68 Alpha processors from API / Compaq, together with the AMD Athlon and Pentium 4 CPUs provide the most likely building blocks for Beowulf systems.

References

[1] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick, K. Schoeffel and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, and A.J. Stone. The package is derived from the original GAMESS code due to M. Dupuis, D. Spangler and J. Wendoloski, NRCC Software Catalog, Vol. 1, Program No. QG01 (GAMESS), 1980.

[2] M.F. Guest, Performance of Various Computers in Computational Chemistry, in Proceedings of the Daresbury Machine Evaluation Workshop, CLRC Daresbury Laboratory, November 2000. The associated MS powerpoint presentation is also available.


Communications Performance Benchmarks

S.J. Andrews, M.F. Guest and B.G. Searle

There are an increasing number of available options for the network connections in a Beowulf cluster. The proprietary options include Myrinet from Myricom Inc., SCI from Dolphin and QsNet from Quadrics Supercomputer World Ltd. The other choices are Fast Ethernet and Gigabit Ethernet which are available from most companies that produce network products. An example of the relative performance of possible combinations is given in Table 1 below. Here, we take the peak bandwidth and latencies we have measured on several clusters for various Fast Ethernet configurations and compare these with the interconnect performance of;

  • the Cray T3E/ 1200E,
  • the IBM SP/WH2-375,
  • API UP2000-based clusters (with Myrinet, Fast Ethernet and QsNet), and
  • a dual Pentium III cluster with Dolphin SCI interconnect from Workstations UK.

 

 

Latency (micro sec)

Peak bandwidth (MBytes/sec)

PIII Fast Ethernet + MPICH

135

10

API UP2000 + Ethernet + MPICH

140

11

PIII Fast Ethernet + LAM MPI

83

9

2 x Fast Ethernet + LAM MPI

84

22

Cray T3E 1200E

13

169

IBM SP2 Winterhawk II

19

138

API UP2000 + Myrinet + MPICH

18

88

API UP2000 + QsNet + MPICH

5

200

Dual PIII + SCI + ScaMPI (intranode)

2

276

Dual PIII + SCI + ScaMPI (internode, 33MHz PCI)

5

100

Dual PIII + SCI + ScaMPI (internode, 66MHz PCI)

5

155

Table 1: Network interconnect performance; Latency (micro sec) and Bandwidth (MB/sec).

Such point-to-point measurements are of course of limited value and provide no more than an indication of performance likely to be encountered in parallel applications. To provide a more quantitative assessment, we have adopted a number of benchmarks designed to provide a systematic evaluation of both point-to-point and collective operations.

The PMB Benchmark

We have used the MPI Parallel Communications benchmarks from Pallas (PMB, Pallas MPI Benchmarks [1]) across a variety of parallel hardware. PMB considers a wide variety of point-to-point communications e.g. PingPong, Sendrecv and Exchange) plus a number of MPI Collective Operations (Allreduce, Reduce, Reduce_scatter, Allgather, Allgatherv, Alltoall, Bcast and Barrier). Results for the former are reported in MBytes/sec, results for the latter as time (microsec) to complete. Each operation is run for a variety of message lengths (0 to 4194304 Bytes), with the collective operations performed for various combinations of the number of CPUs available. Systems evaluated to date include:

  • The Cray T3E/1200E and a prototype of Cray's Linux Supercluster (dual 833 MHz EV68 nodes with myrinet interconnect);
  • The SGI Origin 3800/R14k-500 with Numaflex interconnect;
  • IBM SP/WH2-375 - using both 2- & 4- CPUs per quad-SMP node;
  • The CS1 Pentium III/450 cluster - with both MPICH and LAM;
  • The CS3 AMD Athlon/850 MHz with Myrinet (using MPICH);
  • The Alpha Linux CS2 Cluster with QSNet interconnect - with both single and dual CPUs per UP2000 node, and
  • The CS5 dual Pentium III/930 cluster with SCI / SCALI interconnect (using SCAMPI).

Full details of these benchmarks has been presented elsewhere [2]. To provide an example of the output, we show below a plot for the performance of MPI_allreduce on 16 CPUs of the machines identified above.

 

 

An Effective Bandwidth Benchmark, EFF_BW

As a spin off from PMB, a benchmark "EFF_BW" has been developed by Pallas to calculate the "effective bandwidth". For a given machine and processor number, one integral number is calculated which includes the performance for small and for large messages under participation of all available processors (see [1]). EFF_BW uses only the PMB PingPong Benchmark for measuring startup and throughput.

We have adopted this benchmark [3], and present in Table 2 below the EFF_BW figures for 8 and 16 CPUs (together with reported Latency figures) measured on both High-end and the commodity-based hardware.

References

[1] http://www.pallas.de/pages/pmb.htm; PMB, a comprehensive set of MPI benchmarks written by Pallas, targeted at measuring important MPI functions: point-to-point message-passing, global data movement and computation routines, one-sided communications and file-I/O,

[2] see http://www.dl.ac.uk/CFS/benchmarks/pmb

[3] see http://www.hlrs.de/organization/par/services/models/mpi/b_eff

 

Table 2: Effective Bandwidth figures (MBytes/sec) and Ping-Pong Latency (micro sec) for a number of commodity systems and high-end parallel machines.

 

Machine

Interconnect

Effective Bandwidth (MBytes/sec)

Ping-Pong Latency (micro sec)

   

8 CPUs

16 CPUs

8 CPUs

16 CPUs

CS1 PIII/450 - LAM 6.5.2

Ethernet

33.5

65.0

84

82

CS1 PIII/450 - MPICH 1.2.0

Ethernet

21.5

39.8

141

149

CS6 PIII/800 - LAM 6.3.2

Ethernet

42.0

77.9

62

63

CS6 PIII/800 - MPICH 1.2.0

Ethernet

26.0

44.2

94

97

CS4 AMD K7/1200 - LAM 6.3.2

Ethernet

43.7

84.2

67

67

CS3 AMD K7/850 - MPICH 1.2.3

Myrinet

131.1

255.2

15.6

15.7

CS5 PIII/930 - SCAMPI (2 CPU / node)

Scali / SCI

194.0

358.0

2.0+

2.0

CS5 PIII/930 - SCAMPI (1 CPU / node)

Scali /SCI

218.4

-

4.1*

-

CS2 QSNet Alpha Cluster / 667 (2 CPU/ UP2000 node)

QSNet

240.3

455.5

5.8

6.0

CS2 QSNet Alpha Cluster / 667 (1 CPU/ UP2000 node)

QSNet

360.5

697.8

5.4

5.4

Cray T3E/1200E

CrayLink

639.4

1216.9

4.1

4.1

IBM SP/WH2-375

 

153.7

263.9

13.9

13.9

SGI Origin 3800/R14k-500

NumaFlex

245.5

529.8

4.4

4.7

+ intra-node, * inter-node


Metrics used in the Performance Evaluation of Beowulf Clusters

S.J. Andrews and M.F. Guest

Clusters of commodity processors, or Beowulf-class machines, are often considered as a serious alternative to classical HPC systems. While such clusters are substantially cheaper than the latter when compared on their peak performance, the sustained performance still needs to be demonstrated for large and communication intensive applications. As part of the Beowulf Project at Daresbury, we have undertaken a collaborative study with Pallas GmbH to provide greater insight and a more quantitative approach to the often qualitative debate over the role of commodity clusters as potential alternatives to proprietary high-end platforms.

Approach: This has focused to date on an evaluation of the differences between a Pentium cluster with Fast Ethernet (the CS1 Pentium III/450 Cluster at Daresbury) and the Cray T3E/ 1200E, and has been extended to include the CS2 Alpha Linux Cluster with Quadrics interconnect. An advantage of including the Cray is that it gives the ability, through the pat tool, to count MFlops directly. In comparing the two classes of machine a number of criteria can be established. These include descriptive ones such as:

  • effective bandwidth, Beff, measuring the network performance as an average over a variety of message-passing patterns (see above);
  • intensity, measuring several overhead operations as a function of MFlop rating (and the corresponding sensitivity of a machine to each intensity), and
  • a consideration of these in benchmark application codes.

An understanding of these basic quantities then allows for performance prediction and cost modelling.

Parallel efficiency losses can arise from high numbers of messages or global operations and / or high data volumes passed in these operations. To quantify this an intensity can be defined as a measure of frequency and volume of transfers. Thus, for any code run on a number of processes, a certain number of messages and global operations are performed and certain data volumes are processed in these operations. These then become machine independent code properties. Examples include message (I_msg = Nmess / MFlops), global operations (I_glo = Ng_ops / MFlops) and transfer (I_trans = overall_message_vol / MFlops) intensities.

Sensitivity of a machine to these intensities is dependent on the efficiency factorisation. This involves breaking down the total time into several components (reflecting the different sources of the overheads) which may be achieved using Vampir [1]. By using a calibration program we have determined whether such intensities are performance critical or not and thus classify actual application codes as critical or non-critical on a particular architecture.

Results: A cross-section of application codes have been considered including:

  • GAMESS-UK - quantum chemistry
  • DM - used by the German weather service and its successor, LM
  • CRASH [2] - crash simulation
  • FIRE [3]- 3D flow analysis
  • LU [4] - ScaLAPACK linear algebra
  • GeoFEM [5] - solid earth field phenomena

From the Beff benchmark [6] we find, not surprisingly, that the Cray bandwidth is almost 30 times better than the Intel/Ethernet combination (2.6 vs 68 MB/sec per CPU) with the Alpha / Qsnet a factor of 10 better (25.9 MB/s).

In comparing the Intel and the Cray, we find that the former is well suited to these codes in terms of network losses and, in the main, CPU losses also. LU, however, is CPU-critical and benefits from well-tuned commercial libraries on the Cray.

Considering the intensity sensitivities we find, for example, that when compared to both the Cray and Alpha machines the Intel clusters are highly sensitive to I_glo and I_trans but not to I_msg. What this means in practice is that the number of messages (as a proportion of the MFlops achieved) is generally not a critical factor in the performance of the Intel cluster:

We can now use the intensity analysis to predict performance of these codes without explicitly running them on the target machine, estimating a suitable number of processors for that code and also the impact of, say, upgraded components such as CPU or network. We can use these predictions with a simple cost model that combines the pure performance of the code with the machine and project (e.g. personnel) costs incurred when a code is executed. We find that, in cost terms:

  • for LU the Cray offers the most cost effective solution
  • of the others, LM is the most critical case and here the Intel only wins for low priority jobs and > 8 CPUs - otherwise, for such low priority jobs, the Intel is generally better
  • in most cases, for Intel, 16 processors are preferable to 32.

To take the most extreme example (LM), for between 4 and 32 processors, we can say that for a high priority case:

  • one should use the proprietary machine otherwise project costs are too dominant
  • on both Cray and Intel, one should use 32 CPUs
  • a "what-if" analysis suggests that:
    • doubling the number of Intel processors (to 64) will pay off
    • increased network performance on 32 CPUs will not
    • upgrading the 32 processors' speed and network together becomes more cost effective than the Cray

and for low priority:

  • Intel is less expensive than Cray for 16 or more processors
  • on Intel 16 processors are slightly more cost effective than 32.

We intend to expand on this work to study the performance of further codes and other platforms. A full paper detailing the methodologies used here, and further examples, may be found at http://www.cse.clrc.ac.uk/Activity/DISCO.

References

[1] http://www.pallas.com/pages/vampir.htm

[2] http://www.esi.fr

[3] http://www.avl.com

[4] http://www.netlib.org/scalapack/index.html

[5] http://geomfem.tokyo.rist.or.jp

[6] http://www.hlrs.de/organization/par/services/models/mpi/b_eff


Performance Analysis Tools

M.F. Guest

Considerable progress has been made in targeting end-user application software, involving the installation, benchmarking, and understanding of performance issues across a wide variety of parallel applications. With a focus on computational chemistry, applications include those from CCP1 (GAMESS-UK, CRYSTAL and CPMD, the base code for the new flagship project), NWChem, and REALC (chemical reaction dynamics software from CCP6). Other Car Parrinello codes include VASP and CASTEP, while molecular simulation software benchmarked includes DL_POLY (from CCP5) and CHARMM. Codes from other disciplines include ANGUS and FLITE3D (computational engineering) and the UK Met. Office's Unified Model code (climate modelling). In each case we have attempted to generate performance data on the prototype and evaluation commodity-based systems (CSx), and compared this with results from the CSAR Cray T3E/1200E and a variety of more recent high-end solutions. The latter includes the IBM SP/WH2-375, the Compaq AlphaServer SC (with both 667 and 833 MHz CPUs), the SGI Origin 3800 (both R12k-400 and R14k-500 CPUs) and a prototype of Cray's EV68-based Linux Alpha Cluster. In trying to quantify this comparison, we consider for each benchmark the percentage of a 32-node partition of the Cray T3E/1200E delivered by the range of commodity based systems (i.e. T32-nodeCray T3E / T32-nodeCSx). Prior to presenting these results in subsequent articles, we outline below use of the VAMPIR MPI instrumentation tool that has proved extremely useful in understanding a variety of performance issues associated with the benchmarks.

It is much harder to debug and tune parallel programs than sequential ones with only one instruction stream given that the much larger state space and the necessary communication between processes greatly complicate the task of analysing the behaviour of such applications. Usually, a tracefile may be generated that uses up much of the available disk space but contains the answer to the performance problem somewhere within it. VAMPIR, the MPI Instrumentation Tool from Pallas GmbH, provides a mechanism for the programmer to obtain an overview about an execution trace quickly, allowing focus on important parts of the program execution. It converts the trace information into a variety of graphical views at finer and finer detail and in doing so can separate the algorithmic performance drawbacks from those due to the (network) architecture. The initial phase of a collaboration between CSE Department and Pallas GmbH involves the instrumentation of several MPI codes to explore the performance potential and limits of Beowulf clusters. It is intended to study both kernels and full applications including, (i) communication kernels (OCCOM, PMB), (ii) kernels (NAS), and, of course (iii) the end applications themselves (see previous article).

We have previously illustrated the value of such traces in highlighting the difference in communication patterns and load, and hence scalability, in both macromolecular and Ewald simulations using the DL_POLY package (see subsequent article). A more recent, and more complex, application has been to apply VAMPIR to understand a number of performance issues in the GAMESS-UK code (see subsequent article on the application performance of the GAMESS-UK DFT Coulomb Module).

 

Figure 1. The Vampir trace of a single cycle of the SCF process in an 8-CPU DFT calculation of the Si8O25H18 zeolite cluster calculation showing the communication behaviour and state of each process as a function of time.

The trace of figure 1 is from the calculation of the DFT energy of the Si8O25H18 zeolite cluster (in a basis of 617 functions and 1444 fitting functions) run on 8 CPUs (two four-way Compaq Alpha ES40 nodes) connected by a quadrics switch. The first feature of the trace to note commences at the depicted timeline of around 2:20 and ends at 2:37 minutes. This corresponds to the quadrature in the Kohn-Sham energy and matrix evaluation, and would be expected to dominate the SCF cycle. The black lines denote the messages required to administer dynamic load balancing. The most notable feature however consists of the red and blue bars around two minutes and forty seconds. This actually corresponds to the first of four calls to the GAMESS-UK routine, GAMULT2, that is used to perform a similarity transformation (QHQ). The red bars represent MPI_Barrier calls waiting for the other processors to complete their work. The blue bars represent GA_Get operations waiting for data transfers to complete. The black lines represent the flow of data between processors. The trace shows that half of the processors are waiting for an MPI barrier to complete while the other half are mostly waiting for single-sided get operations to complete. The other interesting feature is that there are multiple instances where all get operations try to access data from a single processor at the same time. The key point to note here is that there are in fact 3 other calls to the same routine, all of which require far less time to complete, all contained under the solid black bars to the right of the trace. Although we have still to quantify the cause of this behaviour, the trace shown above illustrates how the instrumented GA can be used to identify performance bottlenecks and how it does highlight communication patterns that may cause such bottlenecks. This ability to quantify the causes of performance differentials on various machines continues to be vital in code optimisation and in u nderstanding the key metrics in particular application / platform combinations

 


Applications Performance: NWChem

M.F. Guest

In a previous report we have described the functionality within the NWChem package [1] and illustrated the scaling of the DFT module. This involved calculations of a number of different fragment models of a zeolite cluster conducted on the 512 processor IBM SP/P2SC-120 at the EMSL's Molecular Sciences Computing Facility (MSCF). These fragments ranged in size from Si8O7H18, with 347 basis functions and 832 CD fitting functions, to Si28O67H30, with 1687 basis functions and 3928 fitting functions.

With the assistance of PNNL’s Jarek Nieplocha and Eduoardo Apra, we have recently completed an initial port of the NWChem package to both the Pentium and Alpha Beowulf Clusters, and benchmarked these systems using the same Zeolite fragments. Total elapsed times on the Cray T3E/1200E, Compaq AlphaServer SC, and commodity clusters are given in Table 1. The super-linear speed-ups observed at high processor counts for the larger fragments on both the Cray T3E/1200E and Alphaserver SC arises from the increased availability of memory. In such cases, the 3c2e-integrals are held entirely in memory; at lower processor count a fraction of these integrals must be re-computed on each iterative cycle of the SCF. Considering the total times to solution on 32 CPUs, we see that the Pentium cluster is delivering 51% (Si8O7H18) and 40% (Si8O25H18) of the Cray T3E/1200E in the DFT calculations.

Table 1: Total Elapsed times (secs) using the NWChem DFT module in calculations on a variety of Zeolite fragments on the Cray T3E/1200E, and commodity-based systems.

Zeolite

Basis (AOs/CD)

CPUs

Cray

T3E / 1200E

Compaq

AlphaServer SC EV67/667

CS1
PIII/450 + FE

CS2
Alpha Linux Cluster

       

QsNet

LAM/MPI

QsNet

Si8O7H18

347/832

4

   

1720

 
   

8

   

1018

193

   

16

349

108

598

107

   

32

209

78

408

70

   

64

120

61

   
   

128

97

53

   

Si8O25H18

617/1444

16

955

282

1726

280

   

32

530

202

1309

171

   

48

     

136

   

64

279

142

 

109

   

128

198

117

   

Si26O37H36

1199/2818

16

 

3025

 

3681

   

32

8309

1756

 

1792

   

48

     

621

   

64

1301

602

 

483

   

128

887

450

   

Si28O67H30

1687/3928

16

     

9854

   

32

 

4461

 

4684

   

48

     

3318

   

64

10800

2731

 

2376

   

128

1382

844

   

 

As expected, the more powerful CPUs of the Alpha Linux Cluster lead to much higher percentage delivery. Thus the 32-CPU Alpha cluster delivers 220% (Si8O7H18) and 215% (Si8O25H18) of the Cray T3E. The much higher figure (464%) for Si26O37H36 is attributed to the increased memory available on the cluster (see above). On 32-CPUs the Alpha Linux Cluster is seen to deliver between 95-118% of the Compaq AlphaServer SC EV67/667. It is of some interest to compare the present Alpha timings for the larger fragments with those originally reported on the IBM SP/P2SC-120. The timings of Table 2 suggests that the 32-CPU Alpha Linux Cluster is, for the larger fragments, delivering ca. 50% of the performance of the 256 node IBM/P2SC-120 at the EMSL's Molecular Sciences Computing Facility (MSCF). This is a creditable level of performance even allowing for the dated nature of the IBM hardware.

Table 2. Total Elapsed times (secs) using the NWChem DFT module in calculations on a variety of Zeolite fragments on the IBM SP/P2SC-120.

Zeolite Fragment

Basis (AO/CD)

Number of P2SC/120 Nodes

Wall Time (seconds)

Si8O7H18

347/832

64

238

Si8O25H18

617/1444

128

364

Si26O37H36

1199/2818

256

1137

Si28O67H30

1687/3928

256

2766

References

[1] Additional information on NWChem and the Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Battelle Memorial Institute, and the U.S. Department of Energy is available via the Pacific Northwest Home Page at http://www.pnl.gov:2080/


Applications Performance: GAMESS-UK DFT, MP2 and 2nd Derivatives

M.F. Guest

The basis for much of the current parallel SCF functionality within GAMESS-UK was undertaken within the Esprit-funded EUROPORT 2 project, IMMP (Interactive Modelling through Parallelism). The primary goals of EUROPORT were to provide exemplars of commercial codes that demonstrated the potential of parallel processing to industry and to establish a broad spectrum of standards-based medium-scale parallel application software, rather than targeting the highest levels of scalability and performance.

In contrast to NWChem, both SCF and DFT modules are parallelised in a replicated data fashion, with each node maintaining a copy of all data structures present in the serial version. While this structure limits the treatment of molecular systems beyond a certain size, experience suggests that it is possible on machines with 256 MByte nodes to handle systems of up to 2,000 basis functions. The main source of parallelism in the SCF module is the computation of the one- and two-electron integrals and their summation into the Fock matrix, with the more costly two-electron quantities allocated dynamically using a shared global counter. The result of parallelism implemented at this level is a code scalable to a modest number of processors (around 32), at which point the cost of other components of the SCF procedure starts to become significant. The first of these addressed was the diagonalisation, which is now based on the PeIGS module from NWChem.

Once the capability for GA [1] is added, some distribution of the linear algebra becomes trivial. As an example, the SCF convergence acceleration algorithm (DIIS - direct inversion in the iterative subspace) is distributed using GA storage for all matrices, and parallel matrix multiply and dot-product functions. This not only reduces the time to perform the step, but the use of distributed memory storage (instead of disk) reduces the need for I/O during the SCF process.

Substantial modifications were required to enable the MP2 gradient [2] and SCF 2nd derivatives to be computed in parallel. In both cases the conventional integral transformation step has been omitted, with the SCF step performed in direct fashion and the MO integrals, generated by re-computation of the AO integrals, and stored in the global memory of the parallel machine. This storage and subsequent access is managed by the GA tools. The basic principle by which the subsequent steps are parallelised involves each node computing a contribution to the current term from MO integrals resident on that node. For some steps, however, more substantial changes to the algorithms are required. For the MP2 gradient, the construction of the Lagrangian (the right-hand side of the coupled Hartree-Fock (CPHF) equations) requires MO integrals with three virtual orbital indices. Given the size of this class of integrals, they are not stored, the required terms of the Lagrangian being constructed directly from AO integrals. A second departure from the serial algorithm concerns the MP2 2-particle density matrix. This quantity, which is required in the AO basis, is of a similar size to the 2-electron integrals and is stored on disk in the conventional algorithm, but generated as required during the derivative integral generation from intermediates stored in the GAs.

Table 1: Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient modules in calculations on Morphine, Cyclosporin, di(tri-fluoromethyl)-biphenyl and Mn(CO)5H on the Cray T3E/1200E and IBM, SGI, Compaq and Cray High-end Systems.

CPUS

Cray

T3E/1200E

IBM

SP/WH2-375

SGI Origin 3800/R12k-400

SGI Origin 3800/R14k-500

Compaq Alpha SC EV67/667

Cray Alpha Linux SC EV67/833

Morphine DFT/B3LYP (6-31G**)

16

990

399

437

355

 

239

32

515

249

248

192

 

143

64

278

 

174

116

 

91

128

160

         

Cyclosporin DFT/B3LYP (6-31G)

8

           

16

6927

2972

2696

2208

2144

1760

32

3612

1749

1503

1191

1210

990

64

2003

 

961

704

722

606

128

1039

     

531

 

256

721

         

(C6H4(CF3))2 SCF 2nd Derviatives (6-31G)

16

2687

1845

1574

1490

1080

1313

32

1439

1085

985

803

746

 

64

803

 

626

494

488

429

128

499

     

402

 

Mn(CO)5H MP2 Gradient (TZVP)

16

6713

3456

3105

2946

2602

 

32

3530

2129

1714

1346

1603

 

64

1923

 

1082

836

1078

 

128

1158

     

1006

 

256

792

         

In the SCF 2nd derivative module the coupled Hartree-Fock (CPHF) step and construction of perturbed Fock matrices are again parallelised according to the distribution of the MO integrals. The most costly step in the serial 2nd derivative algorithm is the computation of the 2nd derivative two-electron integrals. This step is trivially parallelised through a similar approach to that adopted in the direct SCF scheme - using dynamic load-balancing based on a shared global counter. In contrast to the serial code, the construction of the perturbed Fock matrices dominates the parallel computation. It seems almost certain that these matrices would be more efficiently computed in the AO basis, rather than from the MO integrals as in the current implementation, thus enabling more effective use of sparsity when dealing with systems comprising more than 25 atoms.

The performance of the DFT, MP2 and 2nd Derivative modules on the Cray T3E/1200E and the IBM, SGI, Compaq and Cray High-end Systems are shown in Table 1. Corresponding timings on a variety of commodity-based systems are shown in Table 2. The DFT calculations on morphine used a 6-31G** basis of 410 functions, those on cyclosporin a 6-31G basis of 1000 functions, both using the B3LYP hybrid functional. Note that the DFT calculations did not exploit CD fitting, but evaluated the coulomb matrix explicitly.

Considering the DFT results on the high-end systems, speedups of 99 and 107 are obtained on 128 Cray T3E nodes for the morphine and cyclosporin calculation, respectively. The 32 CPU timings for cyclosporin show that the T3E is outperformed by the IBM SP/WH2, the SGI Origin 3800/R14k-500, the Compaq AlphaServer SC and the Cray Linux Supercluster by factors of 2.07, 3.03, 2.99 and 3.65 respectively. The fastest machine is evidently that with the fastest CPU, the EV67-833 based Cray Linux Supercluster, while the SGI Origin/R14K and Compaq AlphaServer exhibit almost identical run times up to 64 CPUs. Considering the higher node counts, it is clear that all machines exhibit inferior scalability compared to the Cray T3E. Thus the AlphaServer / Cray ratio performance ratio of 3.23 found at 16 CPUs decreases to just 1.96 on 128 CPUs.

Table 2: Total Elapsed times (seconds) using the GAMESS-UK DFT, SCF 2nd derivatives and MP2 gradient benchmark calculations on a variety of commodity-based systems.

 

CPUs

Commodity Systems, CSx+

Benchmark

 

CS1

CS2

CS4

CS4 *

CS6

 

8

   

1062

777

 

Morphine

16

1183

312

673

472

676

 

32

719

171

 

351

440

 

48

 

126

   

380

 

64

 

95

     

Cyclosporin

8

   

7131

5166

 
 

16

9156

2014

3952

3095

4399

 

32

5182

1200

 

2213

2774

 

48

 

856

   

2215

 

64

 

648

     

C6H4(CF3))2

16

2977

1450

     
 

32

1809

933

 

1067

1133

 

48

 

634

     
 

64

 

512

     

Mn(CO)5H

16

11499

2722

14578

13920

7487

 

32

8113

1550

 

6989

4847

 

48

 

1112

   

4024

 

64

 

883

     

+CS1 PIII/450 + FE,   CS2 QSNet Alpha Linux EV67/667
CS4 AMD K7/700 + FE,  
*CS4 AMD K7/1200 + FE
CS6 PIII/800 + FE

Turning to the cluster results of Table 2, and the total times to solution on 32 CPUs, we see that the CS1 Pentium III/450 cluster is delivering 72% (morphine) and 62% (cyclosporin) of the Cray T3E/1200E in the DFT B3LYP calculations. Increasing the CPU speed while leaving the interconnect effectively unchanged leads to a predictable impact on performance. Thus the corresponding delivery figures for the fast ethernet connected CS6 Pentium III/800 and the CS4 Athlon AMD/1200 clusters in the cyclosporin calculation are 130% and 163%, with the AMD/1200-based cluster a factor of 2.34 times faster than CS1. While both clusters comfortably outperform the T3E, higher factors might have been expected based solely on single node performance.

The more powerful CPUs together with enhanced interconnect of the CS2 Alpha Linux Cluster predictably lead to much higher percentage delivery. Thus the Linux Alpha Cluster, with corresponding figures of 300% (morphine) and 301% (cyclosporin), outperforms both IBM SP and Origin 3800/R12k-400. At 32 CPUs, the Alpha Cluster is performing on a par with the Compaq AlphaServer SC and SGI Origin 3800/R14k-500. In both benchmarks we find the elapsed times on the 32-CPU Alpha cluster fall between the 64- and 128-node T3E timings, lying closer to the 128-node T3E time in each case. The 64-CPU Alpha Cluster is outperforming the 256-node Cray T3E/1200E in the cyclosporin calculation.

Considering the performance data for the MP2 gradient and SCF analytic 2nd derivative modules, we see that the MP2 geometry optimisation of the Mn(CO)5H molecule (with 217 basis functions) shows a speedup of 93 achieved using 128 T3E/1200 processors to perform the complete optimisation (involving 5 energy and 5 gradient calculations). A corresponding speedup of 86 is found when calculating the frequencies of 2,2'-di(tri-fluoromethyl)-biphenyl using a 6-31G basis of 196 functions. The greater dependency on interconnect in both SCF 2nd Derivative and MP2 calculations compared to the DFT module leads to less marked performance enhancements on all high-end platforms relative to the T3E (Table 1). Thus at 32 CPUs, the performance advantage of the Compaq AlphaServer SC over the Cray is reduced to factors of 2.2 (MP2) and 1.9 (2nd Derivatives) compared to the figure of 3.0 found in the cyclosporin calculations. Indeed at 64 CPUs the AlphaServer SC is now outperformed by the R14k-based SGI Origin 3800 in the MP2 calculation, where the AlphaServer shows little performance improvement on moving from 64 to 128 CPUs. This degradation in scalability of the AlphaServer is such that the 128 CPU time is only some 10% better than that found on the Cray. The minimum time to solution on 64 CPUs is given by the SGI Origin 3800 in the MP2 calculation, and by the Cray Linux Supercluster in the 2nd Derivatives calculation.

This greater dependence on the global arrays (GAs) in both MP2 gradient and analytic 2nd derivative applications produces the expected impact in performance of the commodity clusters. Considering the total times to solution on 32 CPUs, we see that the CS1 Pentium III cluster is delivering a much reduced percentage of the Cray T3E (44%) in the MP2 gradient calculation, with only a modest reduction in elapsed time between 16 (11,499 seconds) and 32 CPUs (8,113 seconds). The significant increase in node CPU capability associated with the CS4 AMD-based cluster is seen to have little impact in this benchmark, with the solution time only a factor of 1.16 better than that of CS1. A similar degradation in performance was originally noted on the Alpha Linux Cluster, caused not by limited communications but by problems in the effective utilisation of shared memory on the dual CPUs of the UP2000. Revisions in release 3.1 of the GAs have largely addressed this, with the 32-CPU Alpha now delivering 228% of 32-node Cray performance and outperforming the IBM SP/WH2-375 (1550 vs. 2129 seconds).

Somewhat surprisingly the Pentium and AMD-based clusters perform far more effectively in the SCF 2nd derivative benchmark. CS1 achieves 81% of Cray T3E/1200E performance, while both the CS6 Pentium III/800 and CS4 AMD/1200 clusters outperform the Cray at 32 CPUs (CS6, 127%, CS4, 135%). An initial performance analysis reveals load-balancing problems in the Fock matrix construction, which may explain this effect. Neither IBM/SP nor the Alpha Cluster perform that effectively on this benchmark; while the revised GAs have improved the Alpha performance, the 32-CPU Linux Cluster delivers only 154% of T3E performance, one of the lowest such figures recorded in these benchmarks.

References

[1] J. Nieplocha, R.J. Harrison and R.J. Littlefield, Global arrays; A portable shared memory programming model for distributed memory computers, in: Supercomputing '94, IEEE Computer Society Press, Washington, D.C. (1994).

[2] G.D. Fletcher, A.P. Rendell and P. Sherwood, A parallel second-order Moller-Plesset gradient, Molec. Phys. 91:431-38 (1997).


Applications Performance: GAMESS-UK DFT Coulomb Module

M.F. Guest, P. Sherwood and H.J.J. van Dam

Recent work has focused on optimising and extending the fitted coulomb module of the CCP1 density functional theory (DFT) code within GAMESS-UK for use on MPP machines. In order to reduce the cost of evaluating the Coulomb repulsion energy in medium sized molecules the charge density can be fitted to an auxiliary basis as proposed by Dunlap et al [1]:

where the fitting coefficients C can be obtained from:

In this equation are the electron repulsion integrals in the charge density basis and are the three centre electron repulsion integrals between the wavefunction basis set and the charge density basis. This way the evaluation of 4-centre 2-electron integrals can be circumvented using at most 3-centre 2-electron integrals clearly reducing the formal scaling of the computational cost from 4th order to 3rd order. We can reduce the computational cost further by implementing a mechanism to store the 3-centre 2-electron integrals in main memory in a distributed fashion on parallel computers. This approach is facilitated using a Schwarz inequality to avoid evaluating and storing small integrals. The Schwarz inequality for 3-centre 2-electron integrals is (in analogy to the 4-centre 2-electron integral inequality:

For efficiency only the maximal values of or for a given set of shells are stored. We have established that for dense clusters of water molecules using a Schwarz inequality screening tolerance of 10-5 yields a stable relative error of 10-4 in the energy. Since screening is applied on a shell basis, the maximal integrals for each shell quartet are stored. Using this screening, and exploiting the aggregate memory of a parallel machine, it is possible to hold a significant fraction of the 3-centre integrals in core. The formal n3 scaling of the inversion rules out the straightforward Dunlap scheme for very large systems. It is particularly valuable for systems of intermediate size, where the cost is dominated by the integral computation and Fock build steps. We have established that for dense clusters of water molecules, and the application of a Schwarz inequality screening tolerance of 10-5, the number of integrals evaluated is reduced by about a factor of 5.

The parallelisation is performed trivially by distributing the evaluation of integrals with the same set of wavefunction basis functions over the processors. Because the integrals are stored and read when needed, a static load balancing scheme has to be applied to avoid communication. If the amount of memory available is not sufficient to store all of the integrals then the ones that were not stored can be recalculated. This combined in-core/direct approach can be optimised further by storing the largest integrals and recalculating the small ones. This way, as with direct-SCF, the Schwarz inequality can be tightened by including the density factors, reducing the number of integrals further. The inversion of the matrix was implemented using the PeIGS matrix diagonalisation package [2], with the matrix distributed column-wise, to maximise data locality in the evaluation of the fitting coefficients.

Timings for a number of DFT calculations on the Morphine and Valinomycin molecules, conducted on the variety of high-end proprietary and commodity hardware under consideration are shown in Table 1. Calculations on morphine used a DZVP_A2 Dgauss basis of 410 functions, those on valinomycin a DZV_A2 basis of 882 functions, both using the HCTH functional. Timings are reported for calculations in which the coulomb matrix was evaluated explicitly (J-explicit) and for those that used CD fitting (J-fit). The latter employed an A2_DFT auxiliary fitting basis for morphine (1171 functions), and an A1_DFT fitting basis (3012 functions) for valinomycin.

It can be seen that the current approach provides significant benefit, with scalability on the T3E greatly enhanced over that reported previously. Speedups of 105 and 110 are obtained on 128 nodes Cray T3E/1200 for the morphine and valinomycin calculations when evaluating the coulomb matrix explicitly. Corresponding speedups when using the coulomb fit are 75 and 100 respectively. Overall times to solution on 128 T3E nodes when using the fitted Coulomb approach are reduced by factors of 2.9 (morphine) and 2.1 (valinomycin).

Considering the total times to solution on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the morphine calculation when evaluating the coulomb matrix explicitly show the following ordering:

Cray Supercluster (325) < Compaq AlphaServer (373) < SGI O3800/R14k (505) < IBM SP (589)

with the Cray SuperCluster outperforming the Cray T3E/1200 by a factor of 4.6. A similar ordering is found in the larger valinomycin benchmark, with a somewhat reduced factor of 4.3. The timings of Table 1 point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E. Thus the 32-CPU valinomycin improvement factors of 4.3 (Cray Supercluster) and 3.7 (AlphaServer) with explicit treatment of the Coulomb matrix are reduced to 3.6 (Supercluster) and 3.0 (AlphaServer) based on the 128-CPU timings.

All machines show a significant reduction in time to solution when using the fitted Coulomb matrix compared to explicit treatment of the Coulomb term. The 32-CPU timings for the morphine calculation when using the fitted coulomb matrix show the following order:

Cray Supercluster (112) ~ SGI O3800/R14k (113) < Compaq AlphaServer SC (129) < SGI O3800/R12k (145)

Note that all 3c-2e integrals are held in memory for this 32-CPU morphine calculation. A comparison with the explicit-J timings shows the greater dependency of the fitted approach on interconnect. The SGI Origin 3800/R14k is now performing on a par with the Cray Supercluster, while the Origin 3800/R12k outperforms the IBM SP. The R12k-based Origin is now only a factor of 1.12 slower than the Compaq AlphaServer SC, compared to the figure of 1.71 found with explicit treatment of the coulomb matrix. Improvement factors when using the fitted approach versus explicit J in the larger valinomycin benchmark reflect this dependency on interconnect, particularly at higher processor count. Total times for the 32CPU Coulomb fit calculations are as follows:

SGI O3800/R14k (724) ~ Cray Supercluster (726) < Compaq AlphaServer SC (881) < SGI O3800/R12k (897)

The 32-CPU performance delivery figures for the Compaq AlphaServer of 399% and 375% with explicit treatment of the Coulomb matrix are reduced to 251% (morphine) and 300% (valinomycin) based on the 128-CPU timings. In similar fashion the 32-CPU figures for the Cray Supercluster of 427% is reduced to 358% in the 128 CPU valinomycin calculation.

Table 1: Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on Morphine and Valinomycin on both High-end and Commodity-based Systems.

Machine

CPUs

Morphine (410 GTOs)

Valinomycin (882 GTOs)

   

DFT/HCTH

J-explicit

DFT/HCTH
J-fit

DFT/HCTH

J-explicit

DFT/HCTH
J-fit

Cray T3E/1200

16

3031

728

15100

6226

 

32

1488

391

7617

3063

 

64

817

237

4300

1573

 

128

440

150

2139

995

IBM SP/WH2-375

8

2067

456

   
 

16

1072

271

6023

2257

 

32

589

192

3231

1213

SGI Origin 3800/

8

2343

431

   

R12k-400

16

1200

235

 

2007

 

32

638

145

2882

897

 

64

357

120

1680

654

SGI Origin 3800/

8

1911

349

   

R14k-500

16

974

187

 

1632

 

32

505

113

 

724

 

64

267

82

1228

443

Compaq

16

711

193

3910

1725

AlphaServer SC

32

373

129

2033

881

EV67/667 - QsNet

64

228

91

1123

539

 

128

175

83

713

419

Cray Supercluster

16

612

189

3386

1488

Alpha Linux

32

325

112

1783

726

EV67-833 - Myrinet

64

209

81

978

476

 

128

   

598

364

CS1 PIII/450 + FE

8

5956

1638

   
 

16

3126

973

16965

6692

 

32

1746

661

9201

3955

CS2 EV67/667

8

1489

379

   

Alpha Linux

16

774

207

4057

1739

QsNet

32

404

124

2107

809

 

48

283

98

1513

599

 

64

201

78

1109

471

CS4 AMD/700 + FE

8

2459

823

   
 

16

1309

522

6937

3572

CS4 AMD/1200 + FE

8

1634

722

8971

5037

 

16

899

459

4921

3016

 

32

574

356

3009

2115

CS6 PIII/800 + FE

8

 

1024

   
 

16

1447

629

7638

3910

 

32

858

419

4281

2333

 

48

623

365

3182

1878

 

64

524

 

2634

1733

Considering the total times to solution on the commodity hardware, we see that the 32 CPU Pentium III/450 CS1 cluster is delivering 85% (morphine) and 83% (valinomycin) of the Cray T3E/1200E in the DFT calculations with explicit treatment of the Coulomb matrix. These factors are reduced somewhat (to 59% and 77% respectively) when using the fitted Coulomb matrix. The more powerful CPUs of the other clusters of Table 2 lead to much higher percentage delivery, particularly when evaluating the Coulomb matrix explicitly. The 32 CPU explicit coulomb figures for valinomycin of 178% (CS6 Pentium III/800) and 253% (CS4 AMD/1200) for the ethernet-interconnected machines are reduced to 131% and 145% respectively in the corresponding fitted coulomb calculations. Enhancing both interconnect and CPU speed results in much higher figures, particularly in the fitted calculations. Thus the 32 CPU Linux Alpha CS2 Cluster exhibits delivery figures in the valinomycin calculation of 361% (explicit coulomb) and 379% (fitted coulomb). The CS2 cluster is seen to outperform the IBM SP/WH2-375 and the Origin 3800/R12k-400 in both explicit- and fitted-coulomb 32 CPU valinomycin calculations, and the Compaq AlphaServer SC/667 in the fitted calculation. The Alpha Cluster is seen to deliver between 96% of the Compaq AlphaServer SC when evaluating the Coulomb matrix explicitly, and between 89% of the Origin 3800/R14k when using the Coulomb fit. Indeed 32-CPUs of the CS2 cluster outperforms 128 nodes of the Cray T3E in both explicit and fitted coulomb calculations.

Table 2: Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on the Cray T3E/1200 and High-end systems from IBM, SGI, Compaq and Cray (see text).

CPUS

Cray T3E / 1200E

IBM SP / WH2-375

SGI Origin 3800 / R12k-400

SGI Origin 3800 / R14k-500

Compaq Alpha SC EV67/667

Compaq Alpha SC EV67/883

Cray Alpha Linux SC EV67/833

Si8O7H18 (347/832)+

16

219

96

92

76

63

54

68

32

135

77

60

49

43

39

46

64

95

       

38

 

128

76

       

37

 

Si8O25H18 (617/1444)+

8

             

16

596

301

254

218

206

182

229

32

352

252

161

139

143

124

145

64

232

 

126

110

 

98

120

128

175

       

90

 

Si26O37H36 (1199/2818)+

16

3889

1787

1631

1396

1267

1100

1290

32

1931

1205

855

748

783

695

770

64

1118

 

559

515

 

484

564

128

822

       

386

 

Si28O67H30 (1687/3928)+

16

 

3753

3203

2629

2393

1983

 

32

3751

2701

1781

1551

1569

1338

 

64

2257

 

1247

994

 

928

 

128

1287

       

751

 

+Zeolite, Basis (AOs/CD)

A further demonstration of the Coulomb Fit DFT code is given in Tables 2 and 3. Here we present timings for complete DFT calculations on the same series of Zeolite fragments used previously in demonstrating the NWChem software, conducted on the variety of high-end proprietary (Table 2) and commodity hardware (Table 3) under consideration. Note that the 833 MHz EV67 Compaq AlphaServer SC is now included in the hardware under evaluation. While limited speedups are observed for the smaller fragments on the Cray T3E/1200E (46 and 54 for Si8O7H18 and Si8O25H18 respectively on 128 nodes), the higher value of 93 found for the largest fragment, Si28O67H30 is associated with the need for re-computation of the 3-centre integrals.

Considering the total times to solution for the larger fragments on the proprietary hardware, we find that the more powerful CPUs associated with the IBM SP, Compaq AlphaServer SC, Origin 3000 and Cray Supercluster lead to significantly reduced run times compared to the T3E. The 32-CPU timings for the Si26O37H36 calculation show the following ordering:

Compaq Alpha SC/833 (695) < SGI O3800/R14k (748) < Cray Supercluster (770) ~ Compaq Alpha SC/667 (783)

with the Compaq AlphaServer/833 outperforming the Cray T3E/1200 by a factor of 2.8. The performance of the Origin 3800/R14k is far stronger than might have been expected based solely on a consideration of node performance (e.g. SPECfp2000). The timings of Table 2 again point to the poorer scalability of more recent proprietary hardware at higher processor counts compared to the Cray T3E. Thus the 32-CPU improvement factors for Si26O37H36 of 2.8 (Compaq AlphaServer/833) is reduced to 2.1 based on the 128-CPU timings. Similar conclusions arise from a consideration of the timings for the largest fragment (Si28O67H30). The improvement factor for the Compaq AlphaServer/833 against the Cray T3E at 32 CPUs of 280% is reduced to 171% based on the 128-CPU timings.

Table 3: Total Elapsed times (seconds) using the GAMESS-UK DFT Fitted Coulomb module in calculations on a variety of Zeolite fragments on a number of commodity-based systems.

 

CPUs

Commodity Systems, CSx+

Zeolite

 

CS1

CS2

CS4

CS4 *

CS6

Si8O7H18

8

 

118

 

231

327

347/832$

16

314

68

217

178

225

 

32

290

49

 

184

186

 

48

 

39

     
 

64

 

36

     

Si8O25H18

8

     

899

1205

617/1444$

16

1011

219

741

640

781

 

32

797

156

 

551

591

 

48

 

119

   

540

 

64

 

106

     

Si26O37H36

8

     

4793

 
 

16

5322

1419

 

3213

4224

1199/2818$

32

3478

800

 

2501

2793

 

48

 

603

   

2428

 

64

 

533

     

Si28O67H30

8

     

8530

 
 

16

 

2774

 

5864

7535

1687/3928$

32

8455

1712

 

4593

5251

 

48

 

1366

   

4670

 

64

 

1123

   

4362

+CS1 PIII/450 + FE,   CS2 QSNet Alpha Linux EV67/667
CS4 AMD K7/700 + FE,   *CS4 AMD K7/1200 + FE
CS6 PIII/800 + FE

$ Basis (AOs/CD)

Considering the total times to solution on 32 CPUs of the commodity hardware (Table 3), we see that the Pentium III/450 CS1 cluster is delivering between 43-52% of the Cray T3E/1200E in the fitted coulomb DFT calculations. The more powerful CPUs of the other clusters of Table 3 lead to higher percentage delivery, although these do not reflect the individual CPU performance for the ethernet-interconnected machines. Thus we find 32 CPU figures of 69% (CS6 Pentium III/800) and 77% (CS4 AMD/1200) for Si26O37H36, and 71% (CS6 Pentium III/800) and 82% (CS4 AMD/1200) for Si28O67H30. Enhancing both interconnect and CPU speed results in much higher figures. Thus the 32 CPU Linux Alpha CS2 Cluster exhibits delivery figures of 241% (Si26O37H36) and 219% (Si28O67H30). The CS2 cluster is seen to outperform the IBM SP/WH2-375 and the Origin 3800/R12k-400 in all 32 CPU fragment calculations; 32-CPUs of the cluster outperform 128 nodes of the T3E on all but the largest fragment. For the fragments under consideration, the Alpha Cluster is seen to deliver between 78-87% of the performance of the Compaq AlphaServer SC/833 and between 89-100% of the Origin 3800/R14k-500 in all 32 CPU calculations.

References

[1] B.J. Dunlap, W.D. Connolly, J.R. Sabin, On some approximations in applications of Xα theory, Journal of Chemical Physics 71, 3396-3402.

[2] G. Fann and R.J. Littlefield, Parallel inverse iteration with reorthogonalisation, in: Sixth SIAM Conference on Parallel Processing for Scientific Computing (SIAM), pp409-13 (1993).


Applications Performance: DL_POLY

M.F. Guest

DL_POLY [1] is the parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and T.R. Forester for CCP5 (the Collaborative Computational Project for the Computer Simulation of Condensed Phases). The parallel implementation was initially based on a replicated data (RD) strategy, and was designed at the outset for machines with up to 64 processors and systems of up to 30,000 atoms, although it has since found use on much larger architectures. Implicit in the RD approach is a dependence on fast global summations, which are not available on all machines. The performance scaling varies according to the kind of simulation being undertaken - systems possessing complex molecular topologies and constraint bonds typically scale less well than ones requiring simple atomic descriptions, as they require a higher communication overhead. If constraint bonds are present, as they usually are in bio-molecular or polymer systems, then significant deviations from ideal behaviour are to be expected. The four benchmarks described below are those described on the CCP5 web site [2], with the same benchmark-numbering scheme adopted here:

Benchmark 4: A straightforward simulation of sodium chloride at 500K, using the standard Ewald summation method to handle the electrostatic forces. A multiple time-step algorithm is used to increase performance, which requires recalculating the reciprocal space forces only twice in every five time steps. The electrostatic cut-off is set at 24 Angstrom in real space, with a primary cut-off of 12 Angstrom for the multiple time-step algorithm. The Van der Waals terms are calculated with a cut-off of 12 Angstrom. The simulation is for 200 steps with a time step of 1 fs in the Berendsen NVT ensemble. The system size is 27,000 ions.

Benchmark 5: This simulation is of 8,640 atoms of an alkali disilicate glass at 1000 K. The electrostatics are again handled by the Ewald sum, with the interaction potential including a three-body valence angle term, which requires a link-cell scheme to locate atom triplets. The electrostatic cut-off is 12 Angstrom and the Van der Waals cut-off is 7.6 Angstrom; 3-body forces are cut off at 3.45 Angstrom. The simulation is for 300 steps in the Hoover NVT ensemble, with a timestep of 1 fs.

Benchmark 3: This simulation is of the enzyme transferrin in a solution comprised of 8102 TIP3P water molecules. A total of 27,593 atoms are in the system. The electrostatic forces are handled by a combination of neutral groups with the Coulombic potential. All force cut-offs are set at 8 Angstrom. The simulation is for 250 steps with a time step of .1 fs, in the NVE ensemble. The water molecules are treated as rigid bodies and the transferrin is maintained by bond constraints using SHAKE. Valence angles and dihedral potentials are present in the transferrin model.

Benchmark 7: This system is comprised of 13,390 atoms, including 4012 TIP3P water molecules solvating the gramicidin A protein molecule at 300K. Both the protein and water molecules are defined with rigid bonds and maintained by the SHAKE algorithm. The water is held completely rigid, while the protein has angular and dihedral potential terms. Electrostatic interactions are handled by the neutral group method with a Coulombic potential truncated at 12 Angstrom. The Van der Waals interactions are truncated at 8 Angstrom. The simulation is for 500 time steps in the NVE ensemble with a 1 fs time step.

Performance scaling on the Cray T3E for Benchmarks 4 & 5 has been shown to be extremely good and is almost linear over the entire range of processor numbers. This reflects the high parallel efficiency of the Ewald sum implementation. Significantly inferior scaling is found in the two macromolecular benchmarks, Benchmarks 3 & 7. This may be attributed to the difficulty in apportioning the neutral group calculations across processors, and the use of SHAKE for the bond constraints. Somewhat better scaling is found in 7 which uses a larger cut-off in the electrostatic calculations and hence a lower communication/computation ratio. The performance of the four DL_POLY benchmarks are shown in Table 1 (on the Cray T3E/1200E and IBM, SGI, Compaq and Cray High-end Systems) and Table 2 (on a number of commodity based systems). Initial modifications made to the DL_POLY implementation on the commodity clusters included replacing the MPI_ALLREDUCE routines from both LAM and MPICH libraries with a Daresbury rewritten hypercube-based version.

Considering the Ewald-based benchmarks, we again find excellent scalability on the Cray T3E/1200E, with speedups of 135 (super-linear) and 98 obtained on 128 nodes Cray for benchmark 4 and 5 respectively. This excellent scaling on the T3E is put into perspective when considering the total 32 CPU times to solution for the high-end systems of Table 1. The weakness of the Cray EV56 CPU is clearly apparent, with the slowest of these systems, the IBM/SP-WH2-375, delivering 369% (Benchmark 4) and 242% (benchmark 5) of the Cray T3E on 32 CPUs. There is clearly little difference in performance between the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray SuperCluster EV67/833, with the Supercluster the leading 32 CPU machine in benchmark 5 (433%), and the Alphaserver marginally faster in benchmark 4 (597%). There is some evidence that the optimum scalability is shown by the Origin 3800/R14k, although this is inferior to that found on the Cray T3E.

Turning to the commodity systems of Table 2, the weakness of the Cray EV56 CPU is again clearly apparent. Even the CS1 Pentium III/450 cluster is outperforming the Cray T3E/1200E in the NaCl simulation (352 vs. 376 seconds), and is only marginally slower than the T3E (238 vs. 225 seconds) in the NaK silicate simulation. These percentage delivery figures of 107% and 95% on the 450 MHz Pentium cluster increase substantially on the more powerful CPUs of the AMD Athlon and Alpha Clusters. In Benchmark 4 we find delivery figures of 184% and 257% for the CS6 Pentium III/800 and CS4 K7/1200 clusters respectively: the Benchmark 5 percentages are 152% and 192%. Corresponding 16-node figures for the CS3 AMD Athlon cluster are 233% (benchmark 4) and 255% (benchmark 5). It is clear however that providing just fast ethernet as interconnect is not sustainable much beyond 32 CPUs. The 64 CPU performance of the CS6 Pentium III/800 cluster is only marginally superior to that at 32 CPUs in Benchmark 4, while in Benchmark 5 the 64 CPU timing is actually slower. The faster EV67 CPU of the Linux Alpha cluster, together with its enhanced QSNet interconnect, results in much higher delivery figures. The Alpha CS2 cluster outperforms the IBM SP and Origin 3800/R12k, with corresponding figures of 470% (benchmark 4) and 363% (benchmark 5) at 32 CPUs. The potential of the Beowulf systems in these simulations is striking; the 32-CPU Linux Alpha Cluster is outperforming 128 nodes of the Cray T3E in both benchmarks.

A quite different picture of performance is revealed when considering the two macromolecular simulations, benchmarks 3 and 7. Now the scalability on the T3E/1200E is far more limited, with speedups of just 27 (benchmark 3) and 61 (benchmark 7) on 128 nodes of the Cray. The improvement in performance of the high-end systems over the T3E/1200E is also less apparent compared to the Ewald-based simulations. The slowest of these systems, the IBM/SP-WH2-375, is no faster than the Cray on benchmark 3, and is only a factor of 1.4 times faster on benchmark 7 on 32 CPUs (cf. factors of 3.7, benchmark 4 and 2.4, benchmark 5). While there remains little to choose in performance between the SGI Origin 3800/R14k-500, Compaq Alpha SC EV67/667 and the Cray Alpha Linux cluster, the Origin 3800/R14k is now the leading 32 CPU machine in both benchmark 3 and 7. The optimum scalability is clearly shown by the Origin 3800/R14k, although this significantly inferior to that found on the Cray T3E.

Table 1: Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on the Cray T3E/1200 and IBM, SGI, Compaq and Cray High-end Systems.

CPUs

Cray

T3E/1200E

IBM

SP/WH2-375

SGI Origin 3800/R12k-400

SGI Origin 3800/R14k-500

Compaq Alpha SC EV67/667

Cray Alpha Linux SC EV67/833

Benchmark 4: NaCl

8

1588

 

343

298

218

215

16

817

193

179

154

115

114

32

376

102

85

70

63

65

64

179

   

39

 

37

128

94

         

Benchmark 5: NaK Disilicate Glass

8

918

 

244

204

178

173

16

479

154

121

107

94

88

32

225

93

66

57

55

52

64

121

   

33

 

36

128

75

       

28

Benchmark 3: Transferrin

8

   

139

118

 

105

16

191

155

96

84

 

87

32

132

136

88

77

 

81

64

115

       

83

128

104

         

Benchmark 7: Gramicidin A

8

1243

 

372

316

 

288

16

688

418

212

182

178

171

32

382

273

135

121

124

122

64

232

   

91

 

104

128

166

       

102

This lack of scalability has a predictable effect on the performance of the commodity clusters, which now deliver significantly lower percentage delivery figures compared to those found in the Ewald-based simulations. Considering the CS1 PentiumIII/450 cluster, we find 32-node delivery figures of just 34% and 55% for benchmarks 3 and 7 respectively. While these figures increase significantly on the more powerful CPUs, they are far from impressive. Focusing on benchmark 7, we see only modest increases in delivery on CS6 PentiumIII/800 (69%) and CS4 K7/1200 clusters (95%). These figures do increase substantially with improvements in interconnect. Thus the 16 CPU CS5 Pentium/930 cluster with SCI interconnect leads to the machine outperforming the T3E/1200 (561 vs. 688 seconds) in benchmark 7. The performance of the CS3 AMD K7/850 Cluster is worth noting here. The 16-CPU Myrinet connected cluster outperforms the ethernet connected CS4 K7/1200 cluster and performs largely on a par with 16 CPUs of the IBM SP/WH2-375, delivering about 50% of the corresponding partition of the Linux Alpha Cluster. The 32-CPU Linux Alpha Cluster remains, however, the optimum cluster of those considered. It delivers 152% (benchmark 3) and 260% (benchmark 7) of the Cray T3E. A significant performance limitation on the Alpha Cluster arose from the way DL_POLY handled both co-ordinate and forces arrays. The x-, y- and z-coordinates and corresponding forces were stored as separate linear arrays, x(mxatms), y(mxatms) etc., coding that led to exceedingly poor cache re-usage on the UP2000 processor. Re-writing the code to use, in hopefully obvious notation, xyz(3,mxatms) and fxyz(3,mxatms) improved overall performance on the Alpha cluster by a factor of 2.5 (although it had little effect on, for example, the IBM/SP-WH2 with its larger 8 MByte cache). Having made these changes, the 32-CPU Linux Alpha CS2 Cluster again outperforms 128 nodes of the Cray T3E in both benchmarks.

Table 2: Time in Wall Clock Seconds for the four DL_POLY benchmark calculations on a variety of commodity-based systems.

   

Commodity Systems, CSx+

Benchmark

CPUs

CS1

CS2

CS3

CS4

CS5

CS6

 

8

 

337

 

591

880

911

4

16

670

167

351

269

351

385

NaCl

32

352

80

 

146

 

204

 

64

         

169

               

5

8

 

235

 

278

366

434

NaK

16

412

110

188

161

179

228

disilicate

32

238

62

 

117

 

149

glass

64

         

153

               

3

8

 

97

 

253

202

316

Transferrin

16

410

74

163

255

165

290

 

32

391

87

     

297

               

7

8

 

348

 

658

983

1131

Gramicidin

16

1059

190

442

458

561

733

A

32

700

147

 

402

 

552

 

64

         

611

+CS1 PIII/450 + FE: LAM/MPI,   CS2 QSNet Alpha Linux EV67/667
CS3 AMD K7/850 + Myrinet,   CS4 AMD K7/1200 + FE: LAM/MPI
CS5 dual PIII/930 + SCALI,   CS6 PIII/800 + FE: LAM/MPI

An additional feature exemplified by these benchmarks is the impact of the underlying MPI libraries on performance. While little effect was found in the Ewald-based simulations, a much greater impact was apparent on benchmarks 3 and 7. Thus the reduced latency associated with LAM MPI as against MPICH reduced the 32-node benchmark 3 timing on the CS1 Pentium III/450 cluster from 583 (MPICH) to 391 seconds (LAM).

Finally, it is perhaps worth questioning the value and cost effectiveness of the Cray T3E/1200E in running molecular simulations using the DL_POLY software, a question that is reinforced by considering the CHARMM benchmarks presented below. In both classes of DL_POLY benchmark considered, those that scale well on the Cray (benchmarks 4 and 5) and those that scale badly (the macromolecular simulations), we see that 128-node Cray T3E performance is matched or exceeded by 32 CPUs of the Linux Alpha Cluster. While the latter scales less effectively than the Cray, the total times to solution are less. Given that DL_POLY itself does not scale effectively beyond 128 Cray CPUs, it is difficult to justify using the Cray at all, given the implicit cost differential involved against the clusters considered in this report.

References

[1] see, http://www.dl.ac.uk/TCSC/Software/DL_POLY/main.html

[2] see, http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/


Applications Performance: CHARMM

M.F. Guest and P. Sherwood

CHARMM "Chemistry at HARvard Macromolecular Mechanics" (version c26b2) is the general-purpose molecular mechanics, molecular dynamics and vibrational analysis package for modelling and simulation of the structure and behaviour of molecular systems. The benchmark is the standard CHARMM parallel benchmark involving an MD Calculation of Carboxy Myoglobin (MbCO) with 3830 Water Molecules (14026 atoms, 1000 steps (1 ps), 12-14 Angstrom shift). Although a macromolecular simulation, this MD benchmark shows many of the attributes demonstrated by the Ewald-based DL_POLY simulations. Again, while excellent scalability is shown on the Cray T3E/1200E (a speedup of 96 on 128 nodes), the EV56 node appears to be considerably slower than the 450 MHz Pentium III. Thus the CS1 Pentium III cluster is outperforming the Cray at small node counts, and exhibits comparable performance on 32 nodes, with a percentage delivery figure of 96%.

Table 1. Time in Wall Clock Seconds for the CHARMM Carboxy Myoglobin parallel benchmark.

CPUs

Cray

T3E / 1200E

CS1 PIII/450 + FE

CS2 Alpha Linux EV67/667

   

(LAM/MPI)

(QsNet)

2

 

2906

984

4

2547

1526

499

8

1286

880

275

16

665

518

145

32

343

359

85

64

183

 

62

128

106

   

 

This figure increases substantially on the more powerful EV67 CPU of the Alpha Cluster, with 32-CPUs of the CS2 cluster exhibiting a corresponding figure of 404%. The potential of the commodity-based systems in this simulation is again striking; the 32-CPU Alpha Linux Cluster is outperforming 128 nodes of the Cray T3E/1200E.

References

[1] CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983), by B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus.


Applications Performance: CASTEP, CPMD and CRYSTAL

M.F. Guest

CASTEP (CAmbridge Serial Total Energy Package, [1]) is the ab initio code for the solution of the electronic ground state of periodic systems with the wavefunctions expanded in a plane wave (PW) basis using techniques based on density functional theory. CASTEP calculates the total energy, forces and stresses in a 3D periodic system, and is continuously under development within the UK Car-Parrinello Consortium (UKCP) collaboration. The code uses FFTs on a 3-dimensional grid decomposed in fat pencils to allow the use of a large number of processors. Use is made of portable Temperton General Prime Factor Algorithm for multi-radix multi sequence FFTs (GPFA). For production work the vendor’s proprietary FFTs would be used. A second part of the code performs orthogonalisation of the wavefunctions using an explicit Gram-Schmidt procedure.

The CASTEP 4.2b benchmark comprises a single k point total energy calculation of Chabazite, Si11 O24 Al H, using the Vanderbilt ulatrasoft pseudo-potential. With 15045 plane waves and a 3D FFT grid size of 54x54x54, convergence was achieved in 17 SCF cycles using the Pulay density mixing minimiser scheme. Benchmark timings on the Cray T3E/1200E, IBM SP/WH2-375, SGI Origin R14k/500 and a number of commodity based systems are reported in Table 1. Note that the Pentium III/450 cluster timings (using PGI, MPICH) featured channel bonding.

Table 1: Time in Wall Clock Seconds for the CASTEP Chabazite parallel benchmark on the Cray T3E/1200E, IBM SP/WH2-375, SGI Origin 3800/R14k-500 and a number of commodity clusters.

CPUs

Cray

T3E/1200E

IBM

SP/WH2-375

SGI Origin 3800/R14k-500

Commodity Systems, CSx+

       

CS1

CS2

CS5

CS6

8

1291

678

468

3380

727

1957

2664

16

637

422

235

1612

367

983

1228

32

324

271

153

989

195

 

774

64

216

   

-

     

+CS1 PIII/450 + FE: LAM/MPI,   CS2 QSNet Alpha Linux EV67/667
CS5 dual PIII/930 + SCALI   CS6 PIII/800 + FE: LAM/MPI

While excellent scalability is found on the T3E, the reverse is seen on both the IBM SP/WH2-375 and the Pentium-based ethernet connected CS1 and CS6 clusters; this poor scalability is caused in large part by the MPI_ALLTOALLV routine used in the matrix transpose part of the 3D FFT. An analysis of the 32 CPU timings of 271 and 989 seconds on the IBM SP and Pentium CS1 cluster reveals that 157 and 660 seconds respectively are spent in the associated communication routines. Thus the percentage delivery figure of 33% of the Cray T3E on 32 nodes of the Pentium Cluster is, along with the DL_POLY bond-constraint benchmarks, the lowest figure encountered in this benchmarking report. Effectively doubling the CPU clock speed in the CS6 Pentium III/800 cluster leads to only a small improvement in time to solution, with the CS1 32-CPU timing of 989 seconds reduced to 774 seconds (of which 600 seconds are spent in communications). Enhancing the interconnect on the clusters does lead to a significant improvement in timings, with the 16 processor CS5 Pentium III/930 SCI-connected cluster achieving 65% of the T3E performance, although one might have expected better. The optimum cluster performance is derived from the QSNet connected CS2 Linux Alpha Cluster, which at 32 CPUs outperforms the Cray T3E by a factor of 1.66, and achieves 78% of the performance of the SGI Origin 3800/R14k-500. Clearly optimisation of the MPI_ALLTOALLV in the CS2 Cluster is still not optimal, with 111 seconds of the total wall time of 195 seconds spent in communications. The corresponding figure in the Origin 3800/R14k is 71 seconds (in a total wall time of 153 seconds).

On the IBM SP it would clearly be beneficial to call the IBM PESSLP2 3D FFT routine instead of going through GPFA to obviate the transpose altogether.

CPMD is the Car-Parrinello code developed by Michele Parrinello and his group at the MPI in Stuttgart and a number of users worldwide [2]. The code has been ported to the CS1 Pentium III/450 cluster, and subsequently benchmarked, by Sprik and Vuilleumier (Cambridge). Note that CPMD is to act as the base code for the new CCP1 flagship, and that further optimisation for Beowulf-class systems is planned during the course of the project. The comparison of Cray and Beowulf hardware shown in Table 2 centres around a Liquid Water benchmark. The simulation comprises 32 water molecules, in a simple cubic periodic box of length 9.86 Angstrom at a temperature of 300K, with a time step of 7 au i.e. 0.169 fs, and a test run of 200 steps (34 fs). The calculation used the BLYP functional and Trouillier and Martins pseudo-potential, with a reciprocal space cut-off of 70 Ry (952 eV).

In stark contrast to the CASTEP benchmark, the Pentium Cluster is seen to be performing well in comparison to the Cray T3E/1200E. This may be attributed to the significantly longer iteration times associated with CPMD, given the efficient usage of short-range pseudopotentials within CASTEP, and the smaller impact that the MPI_ALLTOALLV routine has on the total elapsed times. Good scalability is shown on the Cray T3E/1200E (a speedup of 49 on 64 nodes), although the EV56 node appears to be only marginally faster than the 450 MHz Pentium III. Thus the Pentium cluster achieves a percentage delivery figure of 62% of the Cray T3E/1200E on 32 nodes.

Table 2: Time in Wall Clock Seconds for the CPMD Liquid Water parallel benchmark on the Cray T3E/1200E and the Pentium III/450 CS1 Cluster.

Processors

Cray

T3E / 1200E

CS1 PIII/450 + FE

   

(MPICH)

4

15680

22458

8

7627

11441

16

4873

7627

32

2225

3602

64

1271

 

CRYSTAL [3] was jointly developed by the Theoretical Chemistry Group at the University of Torino and the Computational Materials Science group in CLRC. The program computes the electronic structure of periodic materials within Hartree Fock, density functional or various hybrid approximations. The Bloch functions of the periodic systems are expanded as linear combinations of atom-centred Gaussian functions. Powerful screening techniques are used to exploit real space locality.

Table 3: Time in Wall Clock Seconds for the CRYSTAL98 parallel benchmarks on the Cray T3E/1200E and the CS1 PIII/450 and CS2 QSNet Linux Alpha Clusters.

 

CPUs

TiO2

MgO

Machine

 

75 k-points, 126 GTOs

36 k-points, 576 GTOs

Cray T3E/1200E

2

4270

 
 

4

2262

10510

 

8

1251

5528

 

16

851

3544

 

32

-

1617

CS1 PIII/450 + FE

2

3364

 
 

4

1796

7841

 

8

987

4053

 

16

739

2107

 

32

523

1117

CS2 QSNet Alpha

4

 

2564

Linux Cluster / EV67

8

 

1289

 

16

 

723

 

32

 

463

The code may be used to perform consistent studies of the physical, electronic and magnetic structure of molecules, polymers, surfaces and crystalline solids. Benchmark timings of CRYSTAL 98 on the Cray T3E/1200E, Pentium and Alpha Beowulf systems are reported in Table 3. Two representative RHF calculations are included, (i) TiO2 bulk crystal, with 75 k-points, 6 atoms/cell, and a modest basis set comprising 126 GTOs, and (ii) a much larger calculation of MgO, with 36 k points, 64 atoms/cell and 576 GTOs. Parallelisation in each case is conducted over both integrals and k-points, the benchmarks being based on the replicated data parallel version of the code.

These benchmarks again provide compelling evidence as to the value of the Beowulf clusters, and to the limited performance of the Cray EV56 node. The CS1 Pentium III/450 cluster outperforms the Cray T3E/1200E at all node counts in both benchmarks, showing a percentage delivery figure of 145% of the Cray T3E on 32 nodes. This figure increases substantially on the more powerful CPUs of the Alpha Cluster, with the Linux CS2 cluster outperforming the 32-node Cray T3E by a factor of 3.49. It should be noted that the size of basis set in these benchmarks still enables the linear algebra to be treated in serial fashion. A more stringent test of the clusters will be apparent once the distributed data MPP version of CRYSTAL is implemented, this requiring an efficient cluster implementation of SCALAPACK (work currently in progress).

References

[1] M.C. Payne et. al., Rev. Mod. Phys. (1992) 64 p. 1045, see http://www.cse.clrc.ac.uk/Activity/UKCP/

[2] CPMD, Version 3.3: Hutter, Alavi, Deutsh, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999).

[3] see, http://www.cse.clrc.ac.uk/Activity/CRYSTAL/


Applications Performance: ANGUS

M.F. Guest

ANGUS [1] performs direct numerical simulation (DNS) of turbulent premixed combustion in order to generate statistical data in support of modelling. The equations to be solved are the Navier-Stokes equations for fluid flow, augmented by two additional equations each describing the transport of a single scalar variable and together specifying the thermochemical state of the system in the presence of differential diffusion effects. Thus, in total there are six partial differential equations to be solved. A grid partitioning strategy is employed for ANGUS which is quite typical of many domain decomposition techniques used in parallel CFD. Due to the finite difference stencil it is necessary to introduce ‘halo’ or ‘ghost’ cells at the interface boundaries. These are used as a message cache and allow derivatives to be determined in these regions with only local variables. The halo cells are then updated as required. Discretisation of the equations is carried out using standard second-order central differences on a three-dimensional grid. The velocity nodes are located at the face-centres of each cell, giving a staggered-grid arrangement that conserves kinetic energy as well as mass and momentum. The pressure solver utilises a conjugate gradient method with a Modified Incomplete LU (MILU) preconditioner [2].

As with many CFD algorithms, the resulting matrix is both sparse and symmetric. In this case, it is heptadiagonal and the periodic boundary conditions also mean that the matrix is singular. A Multi-grid solver has also been provided. Level-1 BLAS are used heavily in both solvers, with the overall computational work expected to be roughly proportional to n3.

The current ANGUS CG-ILU benchmark utilises a grid size of 1443. Benchmark timings on the Cray T3E/1200E, IBM SP/WH2-375, plus a number of commodity-based systems reported in Table 1 are for one hundred iterations of the conjugate gradient solver.Note that the timings reported for the IBM/SP, the Alpha Linux and Pentium III/930 SCALI Cluster refer to CPU configurations in which all CPUs on a given node are involved in the computation. These timings show several distinct features. Both the Cray T3E/1200E and IBM SP/WH2-375 appear to exhibit super-linear speedups, although the SP is only marginally faster than the Cray for a given node count. While the Alpha Linux cluster outperforms both Cray and IBM up to 16 nodes, this advantage is effectively lost at 32 CPUs when the machine exhibits almost identical timings as the IBM (751 vs. 776 seconds respectively). The degradation in performance on 32 CPUs of the Alpha Cluster leads to a delivery figure of 145% of the Cray T3E/1200E. Considering the Pentium clusters, the PIII/450 CS1 cluster again performs well, exhibiting close to linear scaling with the 32-node time some 70% of that recorded on the T3E. The improvement in performance with increased processor speed is modest, with the CS6 PIII/800 cluster outperforming CS1 by factors well below the MHz ratio (a factor of 1.3 on 8 CPUS, decreasing to just 1.1 on 32 CPUs).

Table 1: Time in Wall Clock Seconds for the ANGUS CG-ILU Benchmark (1443)

CPUS

Cray

T3E/1200E

IBM

SP/WH2-375

Commodity Systems, CSx+

     

CS1

CS2

CS3

CS5

CS6

8

4580

4394

6540

1943

3685

7080

5130

16

2380

1864

3610

936

2063

3930

2870

32

1090

776

1830

751

   

1600

64

480

 

-

     

780

+CS1 PIII/450 + FE: LAM/MPI,   CS2 QSNet Alpha Linux EV67/667
CS3 AMD K7/850 + Myrinet,   CS5 dual PIII/930 + SCALI
CS6 PIII/800 + FE: LAM/MPI

What is confusing is the performance of the SCALI cluster, CS5. With PIII/930 CPUs and the SCALI interconnect it is apparently outperformed at all CPU counts by the CS1 Cluster with more its modest PIII/450 CPUs and fast ether interconnect. Additional insight into these results can be gained by varying the distribution of processors over the available nodes (see Table 2). Now, for example, a 16-processor job on the IBM SP/WH2-375 is run on either 4 or 8 nodes given the configuration available (with all CPUs used in the former case, and only 2 CPUs/node in the latter). On both the Linux Alpha and SCALI Pentium cluster the same 16-processor job may be run on either 8 nodes (employing both of the dual processor CPUs) or on 16 nodes, where just a single processor would be involved.

Table 2: Time in Wall Clock Seconds on the IBM SP/WH2-375, Alpha Linux and SCALI Clusters as a Function of Processor distribution for the ANGUS CG-ILU Benchmark (1443);

Number of Nodes

Number of

1

2

4

8

16

CPUS

         

IBM SP/WH2-375

4

10520

5880

4060

   

8

 

4394

2560

1899

 

16

   

1864

1136

 

32

     

776

 

CS2 Alpha Linux Cluster

4

 

6030

3378

   

8

   

3238

1943

 

16

     

1635

936

32

       

751

CS5 Dual PIII / 930 SCALI Cluster

4

   

8122

   

8

   

7037

4556

 

16

     

3927

 

The strong correlation between elapsed time and node occupancy points to the driving influence of memory bandwidth on this benchmark. Thus performing an 8 CPU run on the IBM SP/WH2-375 realises elapsed times that vary by a factor of 2.3 depending on processor distribution (from 4394 seconds on 2 nodes to 1899 seconds on 8 nodes). Similarly the 16 CPU benchmark on the Alpha Linux Cluster requires 1635 seconds on 8 dual processor nodes, and 936 seconds when using a single CPU of each of the available 16 nodes. On the SCALI Pentium cluster, the 8 CPU benchmark requires 7037 seconds on 4 dual processor nodes, and 4556 seconds when using a single CPU of each of the available 8 nodes. These performance attributes are completely consistent with the STREAM memory bandwidth benchmark [3] on the nodes of each machine. The TRIAD bandwidth of 900 Mbytes/sec measured on a dedicated single SP/WH2 node is reduced to some 225 Mbytes/sec when running the same benchmark on all 4 CPUs of the node. Equally, the UP2000 6/667 figure of 1 GByte/sec measured on a dedicated dual processor node is reduced to 500 Mbytes/sec when both CPUs are involved in the same benchmark.

References

[1] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from and engineering perspective, Parallel Computing (1996).

[2] T.F. Chan and C-C.J. Kuo, Parallel Elliptic Preconditioners: Fourier Analysis and Performance on the Connection Machine, Computer, Physics Communications, Vol. 53, 1989, pp 237-252.

[3] The STREAM Memory Bandwidth benchmark, see http://www.cs.virginia.edu/stream.


Applications Performance: FLITE3D

M.F. Guest

FLITE3D is a finite-element code for solving the Euler equations governing airflow over whole aircraft. Parallelisation of FLITE3D for shared and distributed memory parallel systems has been undertaken as part of a collaboration between the Computational Engineering Group at Daresbury and the Sowerby Research Centre at British Aerospace. The code comprises a suite of modules for obtaining Euler solutions of the flow over complex configurations.

Table 1: Time in Wall Clock Seconds for the FLITE3D benchmarks on the Cray T3E/1200E, IBM SP/WH2-375 and CS1 Pentium III/450 and CS2 Linux Alpha Commodity Systems.

CPUs

Cray

T3E / 1200E

IBM

SP/WH2-375

CS1 PIII/450 + FE

CS2 Alpha Linux /EV67-667

     

(LAM/MPI)

(QsNet)

wing body benchmark (298,244 elements)

4

510

120.6

468

95

8

280

68.1

247

50

16

160

43.4

136

31

32

100

30.9

79

20

F18 benchmark (3,444,350 elements)

4

5040

1177

4630

950

8

2620

581

2360

478

16

1350

306

1260

263

32

720

172

690

150

64

415

     

128

250

     

Work has been carried out on the parallelisation of the steady Euler flow solver using standard techniques of mesh-partitioning for a single-program multiple-data (SPMD) programming model implemented in Fortran 77 and C with message passing using a choice (selectable at compile time) of MPI or PVM. The flow solver now reads in the partitioned mesh and performs the necessary communications at the boundaries between sub-domains. Fields are gathered onto the master processor for output so that no changes are necessary in the post-processing stages. This also enables the flow solver to be stopped and restarted using a different number of processors, if necessary. Table 1 shows timings on the Cray T3E/1200E, IBM-SP/WH2-375 and both the Pentium III/450 CS1 cluster and the QSNet Linux Alpha cluster, CS2 for two MPI-based FLITE3D benchmark studies, (i) a modest wing body benchmark using 298,244 elements, and (ii) the more demanding F18 benchmark using 3,444,350 elements.

These benchmarks provide further compelling evidence as to the value of the commodity clusters, and to the limited performance of the Cray EV56 node. Focusing on just the largest F18 benchmark, we see that although the Cray is scaling well (a speedup of 81 on 128 nodes), the Pentium III/450 CS1 cluster outperforms the Cray T3E/1200E at all node counts. The Pentium III cluster shows a percentage delivery figure of 145% of the Cray T3E on 32 nodes. This figure increases substantially on the more powerful CPUs of the IBM SP/WH2-375 and Alpha Linux cluster. The CS2 cluster outperforms the 32-node Cray T3E by a factor of 4.8, with the 32-node Alpha significantly faster than 128 nodes of the Cray. The relative performance of the IBM SP/WH2 is also impressive. While slower than the Alpha cluster, the 32-CPU SP timing is again significantly faster than that recorded on 128 nodes of the Cray. Although the code was originally developed for the Cray, these results strongly suggest that the individual node performance of the T3E is far from optimal.


Applications Performance: SUMMARY

M.F. Guest

We summarise In Table 1 the conclusions of the benchmarking exercise on applications reported in a number of previous articles, by showing

  • the percentage of a 32-node partition of the Cray T3E/1200E delivered by both the Pentium III/450 CS1 and QSNet Alpha Linux CS2 systems (i.e. T32-nodeCray T3E / T32-nodeCSx), and
  • the percentage of a 32-node partition of the SGI Origin R14k/500 delivered by the QSNet Alpha Linux CS2 system (i.e. T32-node SGI Origin 3800-R14k / T32-nodeCSx).

 

Table 1: Application Performance: Percentage of a 32-node partition of (i) the Cray T3E/1200E achieved by the 32-node CS1 Pentium /450 and 32-node CS2 QSNet Linux Alpha Cluster, and (ii) the SGI Origin 3800 R14k achieved by the CS2 QSNet Linux Alpha Cluster.

Application Code

T32-nodeT3E / T32-nodeCSx

T32-node Origin 3800 / R14k / T32-node Linux Alpha Cluster

CS1 Pentium-III / 450 + FE

Cluster

CS2 QSNet Linux Alpha EV67/667

Cluster

 

(%)

(%)

(%)

GAMESS-UK

     

SCF

53-69%

256%

99%

DFT

65-85%

301-361% (§)

99%

DFT (Jfit)

44-77%

219-379%

89-100%

DFT Gradient

90%

289% (§)

89%

MP2 Gradient

44%

228%

87%

SCF Forces

80%

154%

86%

       

CRYSTAL

145%

349% (§)

-

       

NWChem (DFT Jfit)

40-51%

299-464% (§)

-

       

REALC

67%

 

-

       

DL_POLY

     

Ewald-based

95-107%

363-470% (§)

95%

bond constraints

34-56%

143-260%

82%

       

CHARMM

96%

404% (§)

-

       

CASTEP

33%

166%

78%

CPMD

62%

 

-

       

ANGUS

60%

145%

-

FLITE3D

104%

480% (§)

 

(§) Outperforms 128 nodes of the Cray T3E/1200E

These figures suggest the following:

  1. In many of the applications the inexpensive Pentium-based systems with simple fast ethernet connection delivers a significant fraction of Cray/T3E performance. While applications with extensive communication demands clearly exhibit inferior performance and scalability on the PC-based system (e.g. CASTEP, DL_POLY with bond constraints, direct-MP2 gradient calculations using GAMESS-UK), the delivered performance of the Pentium III/450-based cluster is at worst 34% of the Cray T3E/1200E. Many of the other applications show a much higher delivered level of performance; in a few cases the Beowulf cluster actually equals or exceeds Cray performance (e.g. CRYSTAL, Ewald-based DL_POLY, CHARMM, FLITE3D). In these cases it makes little or no sense to be using the T3E for 32-node runs when equivalent performance is achieved by a solution that costs a tiny fraction of that associated with using the high-end machine.
  2. There are a number of performance issues associated with the CS2 QSNet Linux Alpha Cluster that act to constrain performance, most notably the limited memory bandwidth of the UP2000, and the effective utilisation of L2 cache - the so-called issue of "page colouring" under Linux. Allowing for these, results from the CS2 Alpha cluster are most encouraging. In all benchmarks, the 32-CPU cluster exceeds the performance of 64-nodes of the Cray T3E/1200E (and that associated with the 32-CPU IBM/SP WH2). In optimal cases (those marked with a § in the Table) the Alpha Cluster is outperforming 128-nodes of the Cray T3E/1200E.
  3. We have compared throughout this report the performance of a variety of clusters with a number of recent propietary high-end offerings. The latter includes the IBM SP/WH2-375, the Compaq AlphaServer SC (with both 667 and 833 MHz CPUs), the SGI Origin 3800 (both R12k-400 and R14k-500 CPUs) and a prototype of Cray's EV68-based Linux Alpha Cluster. While these more recent machines predictably outperform the Cray T3E, typically by factors of 4-6 at modest node count e.g. 32 CPUs, there is clear evidence of an increasing imbalance between CPU and interconnect performance. This manifests itself by a marked lack of scalability with increased processor count for these systems compared to the Cray T3E.
  4. The CS2 QSNet Linux Alpha cluster is competitive in performance with these newer machines, achieving for example between 78-100% of SGI Origin R14k/500 performance across a wide range of processor counts.

The collection of results presented in this appendix provides compelling evidence that suitably-configured Beowulf systems can provide not only highly cost-effective departmental, mid-range solutions, but can match the levels of performance associated with a significant fraction of a high-end MPP machine, again for a small fraction of the cost.


previous contents forward
design by CCP1, March 2001