the newsletter of
collaborative
computational project 1

The Daresbury Laboratory Beowulf Project

S.J. Andrews, M.F. Guest and B.G. Searle
CCLRC Daresbury Laboratory, Daresbury, Warrington WA4 4AD
s.j.andrews@dl.ac.uk, m.f.guest@dl.ac.uk, and b.g.searle@dl.ac.uk

Introduction

Networks of personal computers (so-called Beowulf systems) composed of fast PCs configured with large quantities of RAM and hard disk, and running the Linux operating system, are becoming more and more attractive as cheap and efficient platforms for distributed applications [1]. The main drawback of a standard Beowulf architecture is the poor performance of the conventional inter-process communication mechanisms based on RPC, sockets, TCP/IP and Ethernet. Such standard mechanisms are thought to perform poorly in terms of both throughput and message latency. Nevertheless, there is increasing interest in the use of commodity "off-the-shelf" components as building blocks for high-performance computing. This is evident in many areas, as witnessed by the high proportion of such machines appearing in funding requests to EPSRC and other sources, and by the filling of this particular niche by specialist integrators. An example within the UK academic community is provided by the latest round of the Joint Research Equipment Initiative (JREI'2001). This included equipment requests for more commodity-based clusters than proprietary SMP-based solutions (e.g. Origin 3400), in stark contrast to previous rounds of such Initiatives.

To investigate the potential of this class of system, a number of "Beowulf-class" systems have been assembled and evaluated at Daresbury [2]. The programme of evaluation is focused on both system software and hardware, and on assessing the delivered performance across a broad spectrum of end-applications. The programme aims to inform the community about the wide variety of available options, from choice of CPU (Alphas, single- and dual-Pentiums, caches, Xeon or old memory subsystems) to choice of interconnect (Ethernet, Myrinet, Scali SCI, QsNet etc.). This report provides an update on progress.

Following an overview of the commodity-based systems currently under assessment, we consider in subsequent articles the results of a number of benchmarks designed to assess the building blocks of any cluster, namely (i) serial node CPU performance, and (ii) communications benchmarks across the variety of possible cluster interconnects. The major focus of these subsequent reports is, however, on applications and we present performance comparisons between commodity-based systems, the CSAR Cray T3E/1200E, and a variety of high-end proprietary systems. Applications considered include those from computational chemistry (GAMESS-UK [3], DL_POLY [4] and CHARMM [5]), computational materials (CPMD [6]), and computational engineering (ANGUS [7] and FLITE3D).

Evaluation has focused on commodity CPUs from Intel and AMD, and Alpha EV6-based platforms. The former includes Intel's IA-32 (Pentium III and Pentium 4) and AMD's Athlon CPUs, and the more recent IA64 processors from Intel (Itanium and Itanium 2). The Alpha-based platforms include DS10, DS20, ES40, ES45, XP-1000 and DPC264 machines, from 466 MHz EV6 to industry-leading 1000 MHz EV68 models. Various system and resource management packages (Lobosq, Sychron, Beowulf, Quadrics RMS and PBS) have been investigated, while several flavours of message passing software (MPICH, LAM6.2, LAM6.3, MPI-VIA, NCSA's VMI and Shmem), compilers (from Compaq, Intel, Absoft, PGI and GNU/g77) and numerical libraries (ATLAS, NASA, Intel's MKL) have been tested. Implementations of the Global Array (GA) tools and parallel eigensolvers from PNNL have also been completed. As well as the choice of processor, a variety of networking options including fast Ethernet, Myrinet and more recently the low-latency, high bandwidth solutions, SCI from Dolphin and QsNet from Quadrics, have also been evaluated.

The assessment of a variety of prototype commodity-based systems (CS) continues to produce a wealth of data. Ten such systems have been used in the present study (CS0-CS9), five in-house (CS0-CS3, and CS5), the five others through collaborative links to other Beowulf sites (CS4, CS6-CS9). Although no longer in use, the in-house Pentium-based machines (CS0 and CS1) have been used to benchmark a host of applications. Their role as development systems has now been taken over by CS2 (a Linux Alpha EV67-based UP2000/Quadrics system) and CS3 (an AMD Athlon-based Myrinet system).

  • CS0 and CS1 - An initial 10-processor 266 MHz Pentium II system (CS0) and its successor, a 32-processor 450MHz Pentium III Linux system (CS1), have acted as the primary test vehicles. CS1 has dual Ethernet networks and has been running both the LOBOSQ (from NIH, USA) and the PBS (Portable Batch system) job scheduling software. The beowulf1 system has been fully operational since mid-September '99, and until recently has successfully acted as a testbed installation, providing benchmark results on a number of parallel applications (see subsequent articles).

  • CS2 - Installation of a fully integrated and configured 32 processor EV67 Alpha Linux-based system with high-performance Quadrics interconnect (5 usec latency and 210 MBytes/s bandwidth) was completed in May '00. Numerous applications have been implemented on this system, with associated benchmark results presented in later articles. Note that the loki system was recently upgraded to 64 CPUs, with an additional 8 dual UP2000 nodes plus 8 dual CS20/EV67-833 nodes. More details on developments around the CS1 and CS2 systems are provided in the next article.

  • CS3 - Installation of an AMD Athlon-based / Myrinet system comprising 16 X 850MHz AMD K7 machines was completed in June '00.

  • CS5 - A prototype 16 processor machine (8 dual Pentium III/933 MHz processor nodes) with the high performance SCALI interconnect was kindly made available by Workstations UK between March – November, 2001.
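To put the interconnect figures quoted for CS2 in context, the cost of a message is often approximated by a simple linear model, time = latency + size / bandwidth. The sketch below applies that model to the 5 usec latency and 210 MBytes/s bandwidth quoted above. The model and the function names are illustrative assumptions, not measurements from this study.

```python
def transfer_time_us(message_bytes, latency_us=5.0, bandwidth_mbytes_s=210.0):
    """Estimated one-way message time in microseconds (assumed linear model)."""
    # Taking 1 MByte/s as 1e6 bytes/s, i.e. 1 byte per microsecond per MByte/s,
    # bytes divided by MBytes/s yields microseconds directly.
    return latency_us + message_bytes / bandwidth_mbytes_s


def half_bandwidth_size_bytes(latency_us=5.0, bandwidth_mbytes_s=210.0):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return latency_us * bandwidth_mbytes_s


if __name__ == "__main__":
    for size in (8, 1024, 65536, 1048576):
        t = transfer_time_us(size)
        print(f"{size:>8d} bytes: {t:10.2f} us, {size / t:7.2f} MBytes/s effective")
    print(f"half-bandwidth message size: {half_bandwidth_size_bytes():.0f} bytes")
```

Under this model, messages much smaller than about 1 KByte are dominated by latency rather than bandwidth, which is why latency matters so much for fine-grained parallel codes.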

In addition to the in-house systems above, the following machines have been evaluated and used to benchmark applications:

  • CS4 - The 128 CPU Athlon-based cluster at Sara (Amsterdam). The cluster is divided into a number of community-based partitions; we present results on up to 16 of the 700 MHz AMD processors over a Fast Ethernet network, and on up to 32 processors of the recently upgraded cluster with 1.2 GHz AMD CPUs.
  • CS6 - Access to the 528 processor CliC (Chemnitzer Linux Cluster) has enabled a number of benchmarks with higher processor count to be performed. The nodes are single processor Pentium III/800 with both fast ethernet and gigabit ethernet communication systems.
  • CS7 - Following the successful experiments with the prototype CS5 system above, the UKCP consortium has installed a 32-node system at Daresbury based on the SCALI interconnect, again from Workstations UK. Each node however now comprises the significantly more powerful dual-processor AMD Athlon 1 GHz CPUs.
  • CS8 - The Itanium-based Titan system at NCSA, consisting of 160 dual-processor IBM IntelliStation Z Pro servers. Each IntelliStation server features two 800MHz Intel Itanium processors, running Red Hat Linux and Myricom's Myrinet cluster interconnect network. Titan has a peak performance of 1 teraflop/s and joins Platinum, NCSA's Pentium® III Linux cluster, as the second teraflop cluster at the centre. Both Titan and Platinum will later be incorporated into the new TeraGrid computing system [8].
  • CS9 - The most recent addition to the list comprises the 48 node "dirac" system at Bristol University. Based on dual processor 2 GHz Pentium 4 CPUs, the nodes are interconnected with both Myrinet 2000 and Fast Ethernet.

References

[1] D. Ridge, D. Becker, P. Merkey and T. Sterling, Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proceedings, IEEE Aerospace (1997).

[2] Additional information on the Daresbury Beowulf Systems is available via World Wide Web URL http://www.cse.clrc.ac.uk/disco/dl-beowulfs.shtml.

[3] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, H.J.J. van Dam, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, A.J. Stone and D.J. Tozer. (http://www.dl.ac.uk/CFS)

[4] see, http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/

[5] CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983), by B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus.

[6] CPMD Version 3.3: Hutter, Alavi, Deutsch, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999).

[7] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from an engineering perspective, Parallel Computing (1996).

[8] For details of the hardware at NCSA, see http://www.ncsa.uiuc.edu/UserInfo/Resources/

 

DisCo Hardware

The background to the major pieces of hardware associated with the DisCo programme has been provided in the preceding article. The major cluster hardware resources available at Daresbury over this period included:

Beowulf Systems:

32 processor Pentium system:
  • Master node: Intel PentiumII-266 CPU, 256MB SDRAM, 2 PCI Fast Ethernet interconnects, 4GB system and 9GB user ultrawide SCSI disks
  • 32 Processor nodes: each with Intel PentiumIII-450 CPU, 256 MB SDRAM, 2 PCI Fast Ethernet interconnects, 10GB U-DMA disk
  • 2 Fast Ethernet switches
  • 8 nodes with Myrinet interconnect using 4 32-bit and 4 64-bit PCI cards
  • 2 nodes with Gigabit Ethernet
64 processor Alpha system:
  • Master Node: dual Alpha 21264A EV67/667MHz, 1 GB ECC SDRAM; 16 GB UW-SCSI user disk
  • 32 compute nodes (dual Alpha 21264A CPUs 48 x 667MHz, 16 x 833MHz), 512 MB ECC SDRAM, 9 GB SCSI disk, connected to the master via an Extreme Summit48 Fast Ethernet switch
  • High Performance Interconnect via Quadrics Elan III NICs and 128-way switch chassis
  • QSW RMS Resource Management System running on Linux kernel

Miscellaneous desktop systems:

PCs and IBM, SGI, DEC-AXP, and SUN desktop workstations are used for code development, testing etc and support of the DisCo programme. Various loan machines are made available by vendors from time to time and these have included systems from Intel (dual Pentium Xeon (2GHz, 2GB RDRAM), quad Itanium (800MHz, 4MB L2 cache, 4MB RAM)), HP (dual 750MHz PA8700 J6700, X4000 Xeon workstation) and Workstations UK (10xdual PIII 933MHz + SCI interconnect).

Until recently three processors were likely to become dominant in this field, namely the Pentium, IA64 (currently Itanium) and Alpha although, with the pending Compaq/HP merger, the future of the latter is now in doubt beyond the EV7. Several interconnect options remain in play, including Myrinet, SCI/SCALI, QsNet and Gigabit Ethernet, and we aim to evaluate appropriate combinations of these elements for the community. Recognising the cost ramifications involved in targeting these options, our aim was to leverage capital investment from EPSRC with other funding routes to yield a significant asset base for realistic evaluations. It is unfortunate that this strategy has been affected by factors beyond the team's control, resulting in the enforced continuation of the obsolete Pentium cluster service, prior to its decommissioning, and a limitation on the expansion of the Alpha machine to 64 processors. One factor here was the impact of the complete refurbishment of the computer hall at Daresbury, which necessitated a moratorium on new, space-filling equipment until September 2002. Given this slower than anticipated expansion of the in-house systems, considerable effort has been invested in gaining access to other commodity-based systems (CS4-CS9, see preceding article) with the goal of benchmarking applications. The presence of a purpose-built machine room will relieve the pressure on floor space and services and allow us to concentrate on refreshing the Daresbury Cluster Computing capability.

The upgraded Alpha machine passed its acceptance tests in September 2001. These included:

  • Serial performance tests - Multiple copies of the molecular electronic structure code GAMESS-UK were run on the cluster in farming mode. No significant degradation in performance was observed between running 1 copy on 1 node, 16 copies on 16 nodes (i.e. 1 per CPU) and 64 copies on 64 CPUs (32 nodes).
  • Communication Tests - the Pallas MPI Benchmarks (PMB) were run to address performance of the switch. The target was 10 successful runs of the benchmarks, with global communications performed on 8, 16, 32 and 64 CPUs. Consistent results were obtained with MPI latencies of ca. 5.5μs and inter-node ping-pong bandwidths of 200MB/s.
  • Robustness Tests - the target here was for the machine to demonstrate sustained execution for a period of 12 hours across a mixture of parallel jobs using variable node counts. The following job sequence was employed for DL_POLY using a case with Ewald-based electrostatics. Assuming the number of time steps in the 64 CPU job is adjusted so that the job completes in 15 minutes, then the following sequence of jobs was run as a script loop in ca. 4 hours:

Number of Runs   Nodes   CPUs   Estimated Total Elapsed Time (mins)
1                32      64     15
1                32      32     30
2                16      32     60
2                16      16     120
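The elapsed times in this job sequence are consistent with a simple estimate in which the run time doubles each time the CPU count is halved and repeated runs execute back to back. The sketch below reproduces the arithmetic under that assumption; the ideal-scaling rule and the function names are illustrative and are not taken from the acceptance test script itself.

```python
BASE_CPUS = 64        # reference job size
BASE_MINUTES = 15.0   # the 64-CPU job is tuned to complete in 15 minutes

# (number of runs, nodes, CPUs) for each step of the script loop
SEQUENCE = [(1, 32, 64), (1, 32, 32), (2, 16, 32), (2, 16, 16)]


def estimated_minutes(runs, cpus):
    """Elapsed minutes for `runs` back-to-back jobs on `cpus` CPUs,
    assuming (hypothetically) ideal inverse scaling with CPU count."""
    return runs * BASE_MINUTES * BASE_CPUS / cpus


def loop_minutes(sequence=SEQUENCE):
    """Total elapsed minutes for one pass through the job sequence."""
    return sum(estimated_minutes(runs, cpus) for runs, _nodes, cpus in sequence)


if __name__ == "__main__":
    for runs, nodes, cpus in SEQUENCE:
        print(f"{runs} run(s) on {nodes} nodes / {cpus} CPUs: "
              f"{estimated_minutes(runs, cpus):.0f} mins")
    total = loop_minutes()
    print(f"one pass: {total:.0f} mins ({total / 60:.2f} hours)")
    print(f"passes needed to fill 12 hours: {12 * 60 / total:.1f}")
```

The four steps sum to 225 minutes (3.75 hours), in line with the "ca. 4 hours" quoted for one pass of the loop.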

The pre-upgrade machine was used heavily by the Quasi consortium during the first half of 2001 and, notwithstanding the conclusion of the funding, their usage has continued at a reduced level since then. Total usage of the machine from September 2001 is around 50% of the available node-hours of which more than 25% has been by external users including those from the RI and Cambridge.

At the time the Alpha processor was a leading contender within the then recently announced HPC(x) procurement but this is no longer the case and support for this processor, particularly under Linux, is waning. It is likely, therefore, that our machine will be updated to use Intel processors under the Quadrics interconnect when the current system becomes uncompetitive.

After nearly three years the 32-node Pentium cluster was decommissioned in March 2002 due to space restrictions imposed by a major refurbishment of the Daresbury computer hall. During this time it proved to be a robust platform and served admirably in demonstrating the potential of commodity cluster computing to the academic community. Latterly, it was fully integrated into the emerging GRID-based environments that promise to dominate the next generation of distributed computing.

SPEC Benchmarks

The prototype commodity-based systems described in the preceding article are based on the Pentium, Itanium, Athlon and Alpha CPUs as the node building block. The choice of optimal CPU (performance vs. cost) has been informed by an analysis of the serial performance of a wide variety of processors across a number of computational chemistry benchmarks. We summarise below the results of such a comparison for the SPEC benchmark; comparisons in the area of computational chemistry are presented in the following article.

One of the most useful indicators of CPU performance is provided by the SPEC ("Standard Performance Evaluation Corporation") benchmarks. This benchmark suite [1] contains non-tuned application-based code to measure processor speed for both integer (SPECint) and floating point (SPECfp) arithmetic. SPECfp95 and SPECint95, and their successors, SPECfp2000 and SPECint2000, have become industry standards in measuring primarily the performance of a system's processor, memory architecture, operating system and compiler. The next generation of SPEC benchmarks, SPEC CPU2000 [2], has recently replaced SPEC95. CFP2000 is derived from the results of 14 floating-point benchmarks compiled with aggressive optimization, and is the geometric mean of 14 normalised ratios (one for each benchmark). CINT2000 is derived from the results of 12 integer benchmarks compiled with aggressive optimization, and represents the geometric mean of 12 normalised ratios (one for each benchmark). Note that the level of optimisation is not mandated. While highly aggressive optimisation is permitted, results derived from benchmarks compiled with conservative optimisation (SPECfp_base2000) can be submitted.
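As an illustration of how a single figure is obtained from the per-benchmark ratios, the sketch below computes a geometric mean and compares it with the arithmetic mean. The ratios used are hypothetical values chosen for illustration only; they are not SPEC results, and the function name is our own.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of a sequence of positive normalised ratios."""
    if not ratios:
        raise ValueError("at least one ratio is required")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    # Averaging the logarithms avoids overflow when many ratios are multiplied.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical per-benchmark ratios (NOT real SPEC data): thirteen
    # benchmarks at 500 and a single outlier at 2000.
    hypothetical_ratios = [500.0] * 13 + [2000.0]
    arithmetic = sum(hypothetical_ratios) / len(hypothetical_ratios)
    print(f"geometric mean:  {geometric_mean(hypothetical_ratios):.0f}")
    print(f"arithmetic mean: {arithmetic:.0f}")
```

The geometric mean is pulled less strongly by a single unusually fast benchmark than the arithmetic mean would be, which is one reason it suits a composite rating of this kind.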

A subset of the SPECfp2000 and SPECint2000 results for many of the leading CPUs is given in Table 1 (with the baseline system the Ultra 10 333MHz). In each case we have normalised the values relative to those of the Compaq AlphaServer ES45/68-1000. An examination of the SPECfp2000 values shows that the two leading machines, the IBM pSeries 690Turbo and Compaq AlphaServer ES45/68-1000, lie comfortably ahead of the third-placed system, the Sun Blade Model 2050/1050 MHz. While the performance of the 690Turbo is indeed impressive, outperforming the ES45/68-1000 by a factor of 1.2, it is worth trying to put this performance in context by considering the potential shortcomings of using single processor benchmarks on such a system. The complex cache hierarchy of the 690Turbo means in practice that with 7 of the 8 CPUs disabled (according to the SPECfp2000 rules), the single processor benchmark job is actually running in an environment comprising the total cache associated with the 8-way MCM, i.e. 128 MByte. Since many of the SPECfp2000 benchmarks have a total memory requirement of this order of magnitude, the recorded level of performance probably bears little resemblance to what might be seen if all 8 CPUs were running the same job. It is not easy to quantify the impact here as IBM have yet to release the SPECfp2000_rate figures for the 690Turbo.

Considering the two leading systems, the IBM pSeries 690Turbo outperforms the power3-based IBM 375 MHz CPU in the IBM RS/6000 SP (382) by a factor of 2.5, a factor that has been widely quoted in early benchmarks of the 690Turbo. The AlphaServer ES45 Model 68/1000 outperforms the 833 MHz Compaq ES40 (777), by a factor of 1.24, and the 667 MHz based ES40 (562) by a factor of 1.70. These factors are somewhat greater than the clock speed ratios (1.20 and 1.50). While some 50 systems in the current list exhibit greater than 50% of the performance of the 690Turbo (i.e. SPECfp2000 figures higher than 580), only 6 distinct CPUs are involved in these systems. These include Intel's IA32 Pentium 4 and IA64 Itanium processors, AMD's Athlon CPUs, the Alpha A21264C and A21264A, Sun's UltraSPARC III Cu processor and finally the PA-RISC PA8700 from Hewlett Packard.
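The comparison between SPECfp2000 ratios and clock speed ratios for the Alpha systems can be checked directly from the figures quoted in this paragraph. The sketch below is a minimal illustration; the dictionary layout and function name are our own.

```python
# (clock in MHz, SPECfp2000) as quoted in the text for three Alpha systems
SYSTEMS = {
    "ES45 68/1000": (1000, 960),
    "ES40 6/833": (833, 777),
    "ES40 6/667": (667, 562),
}


def ratios(fast, slow, systems=SYSTEMS):
    """Return (SPECfp2000 ratio, clock ratio) of `fast` relative to `slow`."""
    fast_clock, fast_spec = systems[fast]
    slow_clock, slow_spec = systems[slow]
    return fast_spec / slow_spec, fast_clock / slow_clock


if __name__ == "__main__":
    for slow in ("ES40 6/833", "ES40 6/667"):
        spec_ratio, clock_ratio = ratios("ES45 68/1000", slow)
        print(f"ES45 68/1000 vs {slow}: "
              f"SPECfp2000 x{spec_ratio:.2f}, clock x{clock_ratio:.2f}")
```

In both cases the SPECfp2000 ratio exceeds the clock ratio, so the gain of the ES45 over the ES40 is not explained by clock speed alone.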

Table 1. SPEC CPU2000 - SPECfp and SPECint Values and Values Relative to the Compaq AlphaServer ES45/68-1000.

                                                          Relative Values (%)
Machine                                   SPECfp  SPECint    SPECfp    SPECint
IBM pSeries 690Turbo pwr4/1.3 GHz           1169      814      122%       120%
Compaq AlphaServer ES45 Model 68/1000        960      679      100%       100%
Sun Blade Model 2050 /1050 MHz               827      610       86%        90%
Dell PW 530 / 2.2 GHz P4 Xeon                802      810       84%       119%
Compaq AlphaServer DS20E Model 68/833        784      571       82%        84%
Dell PW 530 / 2.0 GHz P4 Xeon                765      757       80%       111%
Compaq AlphaServer GS320 M32 68/1001         756      621       79%        91%
Sun Blade 1000 Model 900 / 900 MHz Cu        731      533       76%        78%
HP SERVER RX4610 Itanium/800 4MB L3          701      379       73%        56%
Sun Fire 280R / 900 MHz Cu                   700      529       73%        78%
Dell PW 340/1.8 GHz P4                       696      620       73%        91%
HP i2000 Itanium/800 2MB L3                  655        -       68%          -
API UP2000 833 MHz                           644      533       67%        78%
AMD Epox 8KHA+ XP2000+ / 1667 MHz            642      724       67%       107%
Dell PW 530/1.5 GHz P4 Xeon                  629      545       66%        80%
HP i2000 Itanium/733 2MB L3                  623        -       65%          -
AMD Epox 8KHA+ XP1800 / 1533 MHz             615      671       64%        99%
AMD Asus A7M266-D MP2000+ / 1667 MHz         596      662       62%        97%
Compaq AlphaServer DS20E Model 6/667         582      455       61%        67%
HP 9000 Model J6700 / PA8700-750             581      603       61%        89%
Compaq AlphaStation XP1000 Model 6/667       532      403       55%        59%
SGI Origin 3200 1X 600MHz R14k               529      500       55%        74%
AMD Tyan Thunder K7 MP1500+/1333 MHz         516      554       54%        82%
Fujitsu PrimePwr650 (SPARC64 675MHz)         509      478       53%        70%
SGI Origin 3200 1X 500MHz R14k               463      427       48%        63%
Dell PowerEdge 1500SC/1.4 GHz PIII           456      664       48%        98%
HP 9000 Model B2600 / PA8600-500             440      403       46%        59%
IBM RS/6000 44P-270 (450MHz, 8MBL2)          433      334       45%        49%
Compaq AlphaServer DS20 Model 6/500          422      313       44%        46%
Sun Blade 1000 Model 1900 / 900MHz           410      466       43%        69%
SGI Origin 3200 400MHz R12k                  407      353       42%        52%
Dell PowerEdge 2550/1.13GHz PIII             402      568       42%        84%
IBM RS/6000 SP-375MHz T/W (1 CPU)            382      260       40%        38%
SGI Origin 300 1X 500MHz R14k                378      379       39%        56%
Sun Fire V880 / 750 MHz                      378      390       39%        57%
HP 9000 Model N4000 / PA8600-552 MHz         369      379       38%        56%
AMD GA-7ZM 1.2 GHz                           342        -       36%          -
Dell PW 420/1.0 GHz PIII                     340      462       35%        68%
AMD ASUS A7V 1.0 GHz Athlon                  321        -       33%          -
Dell PowerEdge 6400/PIII Xeon 700 MHz        294      438       31%        65%
Sun Enterprise 450 / UltraSPARC-II/480       291      234       30%        34%
Dell PW 420/733MHz P3                        290      374       30%        55%
Sun Fire 6800 / 750 MHz                      278      360       29%        53%
Intel D815EEA2 (1.1 GHz Pentium III)         268      427       28%        63%
Sun Enterprise 420R (UltraSPARC-II/450)      265      214       28%        32%
Sun Enterprise 3500/4500 - 400 MHz           261      212       27%        31%
Intel SE440BX-2 (800 MHz Pentium III)        237      344       25%        51%
Sun Blade 100 (UltraSPARC-IIe/500)           182      174       19%        26%
Intel SE440BX-2 (450 MHz Pentium III)        178      213       19%        31%
Compaq DIGITAL PW 500au                      158      161       16%        24%
Ultra 10 333MHz                              126      133       13%        20%

(+) SPECfp_base2000 value

The 12 systems following the Sun Blade Model 2050/1050 MHz include 6 based on Intel's Pentium 4 Xeon, at 2.2 GHz (SPECfp2000, 777-782) and 2.0 GHz (SPECfp2000, 743-764), and 5 based on the Alpha A21264 CPU. The A21264B 833 MHz CPU in the DS20E and ES40 Model 6/833 exhibits SPECfp2000 figures of 784 and 777, whilst the A21264C 1001 MHz CPU (in the GS80, GS160 and GS320) exhibits SPECfp2000 ratings of 784, 777 and 756 respectively. The final system here is the 900MHz UltraSPARC III Cu CPU in the Sun Blade 1000 Model 1900 (SPECfp2000, 731).

28 of the next 33 entries feature either the Pentium 4 or Itanium/800 MHz CPUs. SPECfp2000 figures for Intel's Pentium 4 in this range vary from 607 (1.7 GHz) to 714 (2.0 GHz), while the Itanium-based systems exhibit values from 703 (800 MHz with 4MB L3 cache) to 623 (733 MHz, with 2 MB L3 cache). Note that both Itanium values are in fact SPECfp2000_base and not SPECfp2000. The only non-Intel CPUs in this range are the 833MHz Alpha A21264A (a rating of 644 in the API UP2000 6/833), AMD's Athlon (Epox 8KHA+ XP2000+) at 1667, 1600 and 1533 MHz (SPECfp2000 ratings of 642, 634 and 615 respectively) and the 900 MHz Cu UltraSPARC III (SPECfp2000 rating of 700 in the Sun Fire 280R). All other CPUs exhibit SPECfp2000 ratings of less than 600, i.e. are at least a factor of two slower than the IBM pSeries 690Turbo.

Based on the normalised ratings of Table 1, we would expect the IBM pSeries 690Turbo (122%) to comfortably outperform the Compaq Alpha ES45/1000 (100%), the Sun Blade Model 2050/1050 MHz (86%), the 2.2 GHz Pentium 4 Xeon (84%) and the Compaq AlphaServer ES40/833 (81%). Normalised ratings for the leading CPUs from other vendors outlined above are as follows:

  • 76% for the Sun Blade 1000 Model 1900/900MHz Cu;
  • 73% for the 800 MHz Itanium (4MB L3, 68% with 2MB L3);
  • 67% for the AMD 1.667 GHz XP2000+ and API UP2000/833 (A21264A);
  • 61% for the HP 9000 Model J6700/PA8700-750 and Compaq AlphaServer DS20E Model 6/667;
  • 55% for the SGI Origin 3200/R14k-600 and 48% for the Origin 3200/R14k-500;and,
  • 48% for the Dell PowerEdge 1500SC (1.4 GHz Pentium III).
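The normalised ratings quoted here follow directly from the SPECfp2000 values by dividing by the ES45/1000 figure of 960. A minimal sketch, using values quoted in Table 1 and in the text; the function and variable names are our own:

```python
REFERENCE_SPECFP = 960  # Compaq AlphaServer ES45 Model 68/1000

# SPECfp2000 values as quoted in Table 1 and in the surrounding text
SPECFP2000 = {
    "IBM pSeries 690Turbo": 1169,
    "Sun Blade Model 2050/1050 MHz": 827,
    "Dell PW 530 / 2.2 GHz P4 Xeon": 802,
    "Compaq AlphaServer ES40/833": 777,
    "HP i2000 Itanium/800 2MB L3": 655,
}


def normalised_rating(value, reference=REFERENCE_SPECFP):
    """Rating as a whole-number percentage of the reference system."""
    return round(100.0 * value / reference)


if __name__ == "__main__":
    for machine, value in SPECFP2000.items():
        print(f"{machine:32s} {value:5d} {normalised_rating(value):4d}%")
```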

In terms of leading CPUs from each vendor, the poorest performers would appear to be the current offerings from SGI. Thus the 600 MHz CPU in the SGI O3800/R14k-600 is a factor of 2.2 times slower than the IBM pSeries 690Turbo, and the 500 MHz CPU a factor of 2.5 times slower.

An examination of the SPECfp2000_base values of [2] shows a somewhat different picture. While the IBM pSeries 690Turbo remains the leading CPU (SPECfp2000_base, 1098), the Compaq AlphaServer ES45/1000 now exhibits effectively the same rating (776) as the 2 GHz Pentium 4 Xeon systems from Dell (SPECfp2000_base, 779). Also apparent is the decline in position of the Sun Blade Model 2050/1050 MHz, from 3rd in the SPECfp2000 ratings to 13th in the SPECfp2000_base table (with a value of 701). In similar fashion the Compaq AlphaServer DS20E 68/833, which lies 6th in the SPECfp2000 ratings, now occupies a much lower position, being outperformed by a number of Pentium 4 (1.7 - 2.0 GHz) and Itanium/800 MHz CPUs.

SPECfp2000_base figures for Intel's Pentium 4 range from 581 (1.4 GHz) to 779 (2.2 GHz), while the Itanium-based systems exhibit values from 703 (Dell Poweredge 7150, 800 MHz with 4MB L3 cache) to 623 (HP i2000, 733 MHz, with 2 MB L3 cache). The only systems in the top 30 SPECfp2000_base entries that do not feature an Intel-based CPU are the IBM pSeries 690Turbo, the Alpha systems from Compaq (AlphaServer ES45 68/1000, DS20E 68/833 and ES40 6/833), plus the Sun Blade Model 2050/1050 MHz and Model 900/900 MHz Cu.

References

[1] A SPEC FAQ describing the SPEC benchmark suite and the SPEC consortium is periodically posted to comp.benchmarks, and can be found on the WWW at: www.specbench.org/spec/faq. An excellent summary of the SPEC benchmarks that is periodically updated is available via anonymous ftp from: ftp.cs.toronto.edu in the file /pub/spectable.

[2] http://www.specbench.org/osg/cpu2000/results/cpu2000.html

Computational Chemistry Benchmarks

In the previous article we have discussed CPU performance on the general SPEC benchmarks. We summarise below performance in the area of computational chemistry, focusing on a benchmark suite that includes a set of twelve quantum chemistry calculations using the GAMESS-UK electronic structure program [1]. The comparison involves approximately one hundred and twenty computers, ranging from supercomputers to scientific workstations and Pentium, Athlon and Itanium-based PCs.

Vector supercomputers used in this report include the NEC SX-5. A large number of workstations and workstation servers have been benchmarked, including the recent offerings from:

  • IBM - the Power3 RS6000-based CPU in the IBM RS/6000 44P-270 dual and in the quad-processor Winterhawk 2 (WH2) CPU, both clocked at 375 MHz. The latest addition to the IBM power series is the power4; clocked at 1.3 GHz, this CPU features in the 8-way IBM pSeries 690Turbo and in the 32-way Regatta H node. Initial benchmark results have been obtained on both machines.
  • Hewlett Packard - the PA-RISC PA8500, PA8600 and PA8700 CPUs. The latter two CPUs feature in the HP PA-9000/J6000 (552MHz) and HP PA-9000/J6700 (750MHz) respectively.
  • Compaq - A variety of systems housing the A21264 EV6 CPU clocked at 500, 667 and 833 MHz. The 667 MHz EV6.7 A21264A CPU features in the Linux-based UP2000 dual-processor CPU from API, and in Compaq's PW XP1000, DS20E and ES40 AlphaServers (with 8 MByte L2 cache). The 833 MHz EV67 CPU appears in the Compaq ES40 and in API UP2000 (here with 4 MByte L2 DDR cache). The most recent offering is the 1 GHz EV68 CPU featuring in the ES45.
  • Silicon Graphics - the MIPS R12k- and R14k-based machines. The R12k-based machines include the Origin 3800 (400 MHz), Origin 2000 (300 and 400 MHz), the 270 MHz Octane R12k and O2 R12k, and the 400 MHz R12k Octane 2. The most recent R14k-based machines include the 500 MHz SGI Origin 3800 and dual processor Origin 300.
  • SUN - the UltraSPARC-II and UltraSPARC III-based machines. The latter CPU, clocked at 600, 750 and 900 MHz, features in the Sun Blade 1000 series; included here is the Model 1750, with 750 MHz UltraSPARC-III and 8 MByte L2 cache. Also featured is the 900 MHz Cu UltraSPARC-III, as benchmarked in the SUN Fire 6800/900-Cu.

Note that the present results are taken from a more detailed report on computational chemistry benchmarks [2]. The GAMESS-UK Benchmark is designed to represent the typical range of calculations commonly performed by the ab initio quantum chemist. It includes 12 calculations that feature conventional- and direct-SCF, CASSCF and MCSCF, CI calculations (both direct-CI and conventional table-driven MRD-CI), MP2, and both SCF and MP2 analytic 2nd derivatives.

The data presented in Table 1 is collected under control of the UNIX time command, and includes CPU time (both user and system summed over all 12 calculations), total elapsed time and efficiency (measured as CPU versus elapsed). Also shown is the relative performance of each machine based on SPECfp2000 values and on the GAMESS-UK total CPU times, normalised to values for Compaq's ES45/1000. These suggest that the IBM pSeries 690Turbo is dominant, comfortably outperforming the Compaq Alpha ES45/1000. Apart from the 690Turbo, five of the leading 11 entries feature the Alpha EV68/EV67 processor, four the IA32 CPUs from AMD and Intel, with a single PA-RISC system from HP. The pSeries 690Turbo (3.1 mins.) is seen to outperform the AlphaServer ES45/1000 (3.7 mins.) by a factor of 1.2, and the Pentium 4/2000 (ifc), HP PA-9000/J6700-750 and the Alpha ES40/833 by a factor of 1.4. These 5 machines are followed by the AMD MP1800+/1533 (pgf77), the Pentium 4/2000 (pgf77), the AMD K7/1400 (pgf77), and the 667 MHz Compaq Alpha machines (DS20E and ES40) and API UP2000 6/833 (4.8-5.6 mins).

Table 1. The GAMESS-UK Serial Benchmark: total CPU time (user and system), elapsed time (minutes), efficiency (%) and relative performance from both SPECfp2000 and GAMESS-UK.

 

                            CPU Time (mins)    Wall Time  Efficiency  Relative Performance (%)
Machine                   user  system  total     (mins)         (%)  GAMESS-UK  SPECfp2000
IBM pSeries 690Turbo       3.1     0.5    3.6        3.8          97        126         122
Compaq Alpha ES45/1000     3.7     0.9    4.6        4.7          98        100         100
Pentium 4/2000 (*)         4.4     0.4    4.8        4.8          99         96          74
HP PA-9000/J6700-750       4.4     0.7    5.1        5.7          89         90          61
Pentium 4/2000 (+)         5.0     0.4    5.4        5.5          99         85          74
AMD MP1800+/1533 (+)       4.8     0.7    5.5        5.5         100         84          57
Compaq Alpha ES40/833      4.4     1.9    6.3        6.4          99         73          81
AMD Athlon K7/1400 (+)     5.4     1.1    6.5        6.6          99         71          48
Pentium 4/1500 (pgi) (+)   6.2     0.3    6.5        6.5         100         70          64
Pentium 4/2000             6.3     0.4    6.7        6.7         100         69          74
AMD MP1800+/1533           6.0     0.7    6.7        6.7         100         68          57
Compaq Alpha ES40/667      5.6     1.2    6.8        7.1          95         68          59
HP PA-9000/J6000-552       6.1     0.7    6.8        7.3          93         67          45
AMD Athlon K7/1200 (+)     6.3     0.9    7.2       10.9          66         64          43
Compaq Alpha DS20E/667     5.5     1.9    7.4        7.7          96         62          61
Compaq PW XP1000/667       6.3     1.4    7.7        9.0          85         60          46
AMD Athlon K7/1400         6.7     1.1    7.7        7.8          99         59          48
SUN Fire 6800/900-Cu       6.8     0.9    7.8        8.0          97         59          73
Pentium 4/1500             7.8     0.3    8.1        8.1         100         56          64
IBM RS/6000-SP/375         7.2     1.0    8.2        9.2          89         56          40
IBM RS/6000 44P-270        7.4     1.0    8.4        9.6          87         55          39
AMD Athlon K7/1200         7.7     0.8    8.5       12.4          68         54          43
AMD Athlon K7/1000 (+)     7.8     1.2    9.0        9.8          92         51          33
HP PA-9000/J5000-440       7.9     1.2    9.1       10.1          90         51          38
API UP2000 6/833           5.6     3.7    9.2        9.4          98         50          67
Pentium III /1000-CM (+)   8.2     1.2    9.4        9.4         100         49          33
SGI O300/R14k-500 (+)      8.1     1.3    9.5        9.5         100         49          39
HP PA-9000/C3000-400       8.4     1.1    9.5       11.1          85         48          37
Compaq Alpha ES40/500      8.0     1.9    9.9       11.2          89         46          44
SGI O3800/R14k-500         8.0     1.9   10.0       12.4          80         46          48
API UP2000 6/667           6.9     3.1   10.0       10.4          96         46          38
SUN Blade 1000/M1750       9.5     0.8   10.2       10.4          99         45          44
Compaq Alpha DS20/500      8.0     2.3   10.3       13.3          77         45          44

(+) using the Portland Group compiler, pgf77; (*) using the Intel compiler, ifc
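The efficiency and relative performance columns of this table can be derived from the timing columns: efficiency is total CPU time divided by wall time, and relative performance is the ES45/1000 total CPU time divided by that of the machine in question. The sketch below illustrates both for three rows of the table. The function names are our own, and because the tabulated times are rounded to one decimal place, recomputed values need not match every tabulated percentage exactly.

```python
REFERENCE_TOTAL_CPU = 4.6  # Compaq Alpha ES45/1000, total CPU time in minutes

# (total CPU mins, wall mins) for three rows of the table
ROWS = {
    "Compaq Alpha ES45/1000": (4.6, 4.7),
    "HP PA-9000/J6700-750": (5.1, 5.7),
    "AMD Athlon K7/1200 (+)": (7.2, 10.9),
}


def efficiency_percent(total_cpu_mins, wall_mins):
    """CPU time as a percentage of elapsed (wall) time."""
    return 100.0 * total_cpu_mins / wall_mins


def relative_performance_percent(total_cpu_mins, reference=REFERENCE_TOTAL_CPU):
    """Performance relative to the reference machine; shorter times score higher."""
    return 100.0 * reference / total_cpu_mins


if __name__ == "__main__":
    for machine, (cpu, wall) in ROWS.items():
        print(f"{machine:24s} efficiency {efficiency_percent(cpu, wall):5.1f}%  "
              f"relative {relative_performance_percent(cpu):5.1f}%")
```

A low efficiency, as for the Athlon K7/1200, indicates that a large share of the elapsed time was spent waiting rather than computing, which points to I/O or disk configuration rather than to the CPU.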

Nine machines exhibit user CPU timings of between 6 and 7 minutes. These include the HP PA-9000/J6000-552, the Pentium 4/1500 and AMD K7/1200 (pgf77), Compaq PW XP1000/667, the SUN Fire 6800/900-Cu and the API UP2000 6/667. The HP Itanium/733-2M L3 and IBM's SP/WH2/375 and RS/6000 44P-270 follow (7.1-7.4 mins), somewhat ahead of the Intel Itanium/800-4M L3 (7.6 mins.) and AMD's Athlon K7/1000 (pgf77) and the HP PA-9000/J5000-440 (7.8 and 7.9 mins). The latter shows comparable performance to the EV6/500; of the next eleven machines with user CPU timings between 8 and 9 mins., six feature the EV6 processor. We see that the Compaq ES40/6-500 and DS20/6-500 are marginally superior to the GS140 and Compaq XP1000/500, which in turn lie ahead of the Compaq DS10/466 and AlphaPC 264DP-500. Non-Alpha machines in this range include SGI's O3800/R14k-500 and O300/R14k-500, the Pentium III/1000 and Pentium 4/1400 (pgf77), and HP PA-9000/C3000-400 (8.1-8.4 mins). The leading 20 machines are from Compaq (5), API (2), IBM (1), HP (2), and Sun (1), plus the leading CPUs from Intel (4) and AMD (5). The fastest machine from SUN (the SunFire 6800/900-Cu) is a factor of 2.2 times slower than the IBM pSeries 690Turbo. We note that the performance of the Pentium 4 improves significantly with optimal maths libraries and disk configuration. The availability of Intel's MKL libraries on the Pentium 4/1500 (and not on the 1400 MHz) accounts for the significant decrease in user time, from 8.4 to 6.2 mins. Twelve machines are seen to lie within a factor of two of the fastest.

A somewhat modified picture emerges when considering the system CPU and elapsed times. With the exception of the Alpha-based systems, most machines exhibit a system CPU time of the order of 10-15% of the user time; this percentage increases significantly on the Alpha, to between 20 and 40% for the systems shown in the Table. The API UP/2000-833 shows an alarming increase, to a figure of 66%. Based on the summed CPU times, the IBM pSeries 690Turbo's position as the optimum CPU is strengthened (3.6 mins.), while the Compaq Alpha ES45/1000 is now only marginally faster than the Pentium 4/2000 (ifc) (4.6 vs. 4.8 mins.). These three systems are followed by the HP PA-9000/J6700-750, Pentium 4/2000 (pgf77) and AMD MP1800+/1533 (5.1-5.5 mins.).
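The system-to-user percentages discussed above can be checked against the tabulated timings. The following is a minimal sketch, assuming that the first two numeric columns of the table are the user and system CPU times in minutes; the function name `system_overhead_percent` is ours and is purely illustrative.

```python
# (user, system) CPU times in minutes, copied from three rows of the table.
# Reading the first two numeric columns this way is an assumption.
timings = {
    "HP PA-9000/C3000-400": (8.4, 1.1),
    "Compaq Alpha ES40/500": (8.0, 1.9),
    "Compaq Alpha DS20/500": (8.0, 2.3),
}


def system_overhead_percent(user_min, system_min):
    """System CPU time expressed as a percentage of the user CPU time."""
    return 100.0 * system_min / user_min


overhead = {
    name: system_overhead_percent(user, system)
    for name, (user, system) in timings.items()
}

for name, percent in overhead.items():
    print(f"{name:24s} {percent:5.1f}%")
```

On these three rows the HP system falls in the 10-15% band, while the two Alpha systems fall in the 20-40% band.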

Considering the elapsed times and associated efficiencies for the most recent machines, efficiencies in excess of 96% are only seen on the IBM pSeries 690Turbo, the Compaq and API hardware (ES45/6-1000, ES40/6-833 and UP2000/6-833), the SGI R14k-based O300, the SUN Fire 6800/900-Cu and SUN Blade 1000/M1750, plus AMD's MP1800+/1533 and K7/1400, and Intel's Pentium 4/2000 and P4/1500. Particularly noticeable are the significant improvements in these efficiency ratios on the more recent SUN and Compaq/Digital hardware compared to the figures recorded in the past.
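The efficiency figures can be reproduced approximately from the tabulated timings. The sketch below assumes that efficiency is the summed (user plus system) CPU time divided by the elapsed time, expressed as a percentage; this definition and the column reading are our assumptions. Because the tabulated times are rounded to one decimal place, agreement is only expected to within about one percentage point.

```python
# (summed CPU minutes, elapsed minutes, tabulated efficiency %) per machine,
# copied from the table rows. The column interpretation is an assumption.
rows = {
    "HP PA-9000/C3000-400": (9.5, 11.1, 85),
    "Compaq Alpha ES40/500": (9.9, 11.2, 89),
    "SGI O3800/R14k-500": (10.0, 12.4, 80),
    "API UP2000 6/667": (10.0, 10.4, 96),
    "SUN Blade 1000/M1750": (10.2, 10.4, 99),
    "Compaq Alpha DS20/500": (10.3, 13.3, 77),
}


def efficiency_percent(cpu_min, elapsed_min):
    """Assumed definition: summed CPU time as a percentage of elapsed time."""
    return 100.0 * cpu_min / elapsed_min


# Difference between the recomputed and the tabulated efficiency.
deviation = {
    name: efficiency_percent(cpu, elapsed) - tabulated
    for name, (cpu, elapsed, tabulated) in rows.items()
}

for name, delta in deviation.items():
    print(f"{name:24s} recomputed - tabulated = {delta:+.2f}")
```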

We now consider the performance of the Pentium- and AMD-based hardware. Using the GNU g77 compiler resulted in summed CPU times of 6.7, 8.1 and 12.6 mins. for the Pentium 4/2000, P4/1500, and P4/1400, respectively. Corresponding timings on the Pentium III systems range from 11.3 mins. (Pentium III/1000) to 19.4 mins. (PII/550). The AMD Athlon is arguably more impressive, with CPU times ranging from 6.7 mins. on the MP1800+/1533 to 18.2 mins. on the K7/500. Use of the Portland Group pgf77 compiler produces a consistent level of performance improvement compared to g77, typically by a factor of between 1.13 and 1.22 for the Pentium III and AMD Athlon CPUs, and somewhat higher for the Pentium 4/2000 and /1500 (a factor of 1.25). Optimum performance on the Pentium 4 systems arises from use of Intel's ifc Fortran compiler; an ifc/g77 performance ratio of 1.40 is found on the Pentium 4/2000, compared to the pgf77 figure of 1.25.
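The compiler performance ratios follow directly from the summed CPU times. As a brief illustration, the sketch below recomputes the ifc/g77 ratio on the Pentium 4/2000 from the 6.7 mins. (g77) and 4.8 mins. (ifc) figures quoted in the text, and estimates the pgf77 time implied by the factor of 1.25. The helper name `performance_ratio` is ours.

```python
def performance_ratio(baseline_min, improved_min):
    """How many times faster the improved run is than the baseline run."""
    return baseline_min / improved_min


g77_p4_2000 = 6.7   # summed CPU time (mins.), Pentium 4/2000 with g77
ifc_p4_2000 = 4.8   # summed CPU time (mins.), Pentium 4/2000 with ifc

ifc_over_g77 = performance_ratio(g77_p4_2000, ifc_p4_2000)

# Summed CPU time implied by the quoted pgf77/g77 factor of 1.25.
pgf77_estimate = g77_p4_2000 / 1.25

print(f"ifc/g77 ratio: {ifc_over_g77:.2f}")
print(f"implied pgf77 time: {pgf77_estimate:.2f} mins.")
```

The implied pgf77 time lies within the 5.1-5.5 mins. range quoted earlier for the Pentium 4/2000 (pgf77).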

While the AMD Athlon MP1800+/1533 is 1.53 times slower than the IBM pSeries 690Turbo, we note that SPECfp-based predictions based on the ratings of Table 1 would have led to much higher factors, 2.14 (SPECfp2000) and 2.18 (SPECfp_base2000). Finally, we note the disappointing performance of the Itanium-based IA64 systems. The Intel Itanium/800-4M L3 and HP Itanium/733-2M L3 exhibit comparable summed CPU times of 9.0 and 9.5 mins., i.e. only around 50% of the performance of the Pentium 4/2000, and little better than the IA32 1 GHz Pentium III.

The relative performance figures of Table 1 suggest that the GAMESS-UK results are broadly in line with the SPECfp ratings, with the IBM pSeries 690Turbo representing the optimum CPU. However, this superiority is not as pronounced as might be expected from just a consideration of the SPECfp2000 rankings. These results, together with known costs, strongly suggest that the AMD Athlon and Pentium 4 CPUs provide the most likely building blocks for Beowulf systems. The previous inclusion of the EV67 and EV68 processors in this category must now be in doubt given the recent developments within Compaq and API.

References

[1] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick, and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, H.J.J. van Dam, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, A.J. Stone and D.J. Tozer. The package is derived from the original GAMESS code due to M. Dupuis, D. Spangler and J. Wendoloski, NRCC Software Catalog, Vol. 1, Program No. QG01 (GAMESS), 1980.

[2] M.F. Guest, Performance of Various Computers in Computational Chemistry, in Proceedings of the Daresbury Machine Evaluation Workshop, CLRC Daresbury Laboratory, November 2001. The associated MS PowerPoint presentation is also available.

Communications Performance Benchmarks

There is an increasing number of available options for the network connections in a Beowulf cluster. The proprietary options include Myrinet from Myricom Inc., SCI from Dolphin and QsNet from Quadrics Supercomputer World Ltd. The other choices are Fast Ethernet and Gigabit Ethernet, which are available from most companies that produce network products. While we have measured point-to-point bandwidth and latency on these networks, such measurements are of limited value and provide no more than an indication of the network performance likely to be encountered in parallel applications. To provide a more quantitative assessment, we have adopted a number of benchmarks designed to provide a systematic evaluation of both point-to-point and collective operations.

The PMB Benchmark

We have continued to use the MPI Parallel Communications benchmarks from Pallas (PMB, Pallas MPI Benchmarks [1]) across a variety of parallel hardware. PMB considers a number of point-to-point communications (e.g. PingPong, Sendrecv and Exchange) plus a selection of MPI Collective Operations (Allreduce, Reduce, Reduce_scatter, Allgather, Allgatherv, Alltoall, Bcast and Barrier). Results for the former are reported in MBytes/sec, and results for the latter as time (μsec) to complete. Each operation is run for a variety of message lengths (0 to 4194304 Bytes), with the collective operations performed for various combinations of the number of CPUs available. Systems previously reported included the Cray T3E/1200E, SGI Origin 3800/R14k-500, IBM SP/WH2-375 and the commodity clusters, CS1, CS2, CS3 and CS5. High-end systems evaluated during the current reporting period include:

  • Cray's Linux Supercluster (dual 833 MHz EV68 nodes with Myrinet interconnect);
  • The IBM SP/NH2-375, using both 8 and 16 CPUs per 16-way SMP node, together with the recently available POWER4-based 8-way pSeries 690 Turbo and 32-way Regatta-H; and,
  • The Compaq AlphaServer SC ES45/1000 with dual-rail Quadrics interconnect.

The PMB benchmarks have also been run on a number of commodity-based systems, including:

  • The CS7 cluster, with dual-AMD Athlon/1000 MHz nodes and SCALI interconnect (using SCAMPI);
  • The Pentium 4/2000 CS9 Cluster with Myrinet 2000 interconnect - with both single and dual CPUs per dual processor node; and,
  • The CS8 dual Itanium/800 cluster, "titan" at NCSA (featuring 160 Intellistation nodes from IBM) with Myrinet 2000 interconnect (using NCSA's VMI).
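To make the reported quantities concrete, the sketch below converts a PingPong timing into the two figures discussed here: a one-way time in μsec and a throughput in MBytes/sec. It rests on several assumptions that are not stated in this report: that the message lengths are 0 followed by powers of two up to 4194304 bytes, that the one-way time is half the measured round-trip time, and that 1 MByte is taken as 2^20 bytes. The function names and the example timing are hypothetical.

```python
BYTES_PER_MBYTE = 2 ** 20  # assumed convention for "MBytes"


def message_lengths(max_exponent=22):
    """0 bytes, then powers of two up to 2**max_exponent (4194304 for 22)."""
    return [0] + [2 ** k for k in range(max_exponent + 1)]


def one_way_time_usec(round_trip_usec):
    """Assumed PingPong convention: half of the measured round-trip time."""
    return round_trip_usec / 2.0


def bandwidth_mbytes_per_sec(nbytes, round_trip_usec):
    """Throughput for one message of nbytes, from a round-trip time in usec."""
    seconds = one_way_time_usec(round_trip_usec) * 1e-6
    return nbytes / BYTES_PER_MBYTE / seconds


lengths = message_lengths()

# Hypothetical measurement: a 1 MByte message with a 20 ms round trip.
example_bw = bandwidth_mbytes_per_sec(1048576, 20000.0)
print(f"{len(lengths)} message lengths, largest {lengths[-1]} bytes")
print(f"example throughput: {example_bw:.1f} MBytes/sec")
```

The latency is conventionally taken from the shortest messages, where the transfer time is dominated by startup cost rather than by the message length.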

Full details of these benchmarks have been presented elsewhere [2]. To provide an example of the output, we show below a plot of the performance of MPI_Allreduce on 16 CPUs of the machines identified above.

An Effective Bandwidth Benchmark, EFF_BW

As a spin-off from PMB, Pallas has developed a benchmark, "EFF_BW", to calculate the "effective bandwidth". For a given machine and number of processors, a single aggregate figure is calculated that reflects the performance for both small and large messages, with all available processors participating (see [1]). EFF_BW uses only the PMB PingPong benchmark for measuring startup and throughput. We have adopted this benchmark [3], and present in Table 1 below the EFF_BW figures for 16 CPUs (together with the reported latency figures) measured on both the high-end and commodity-based hardware under consideration.

Table 1. Effective Bandwidth figures (MBytes/sec) and Ping-Pong Latency (μsec) for a number of commodity systems and high-end parallel machines.

Machine                                  Interconnect   Effective Bandwidth     Ping-Pong Latency
                                                        (MBytes/sec, 16 CPUs)   (μsec)

CS1 PIII/450 - LAM 6.5.2                 Ethernet       65.0                    82
CS1 PIII/450 - MPICH 1.2.0               Ethernet       39.8                    149
CS6 PIII/800 - LAM 6.3.2                 Ethernet       77.9                    63
CS6 PIII/800 - MPICH 1.2.0               Ethernet       44.2                    97
CS4 AMD K7/1200 - LAM 6.3.2              Ethernet       84.2                    67
CS3 AMD K7/850 - MPICH 1.2.3             Myrinet        255                     15.7
CS7 AMD K7/1000 MP - SCALI (2 CPU)       Scali/SCI      367                     2.3
CS7 AMD K7/1000 MP - SCALI (1 CPU)       Scali/SCI      481                     4.8
CS8 Titan Itanium/800 - VMI (2 CPU)      Myrinet 2000   516                     6.8
CS8 Titan Itanium/800 - VMI (1 CPU)      Myrinet 2000   667                     15.7
CS9 Pentium 4/2000 + MPICH (2 CPU)       Myrinet 2000   471                     1.5
CS9 Pentium 4/2000 + MPICH (1 CPU)       Myrinet 2000   631                     10.9
CS2 QSNet Alpha Cluster / 667 (2 CPU)    QSNet          456                     6.0
CS2 QSNet Alpha Cluster / 667 (1 CPU)    QSNet          698                     5.4
Cray T3E/1200E                           CrayLink       1217                    4.1
IBM SP/WH2-375                                          263.9                   13.9
IBM SP/NH2-375 (16 - 8+8)                               362                     16.5
IBM SP/Regatta-H                         Intra-node     2148
AlphaServer SC ES45/1000 (4CPU)                         507                     5.6
AlphaServer SC ES45/1000 (1CPU)                         964                     4.8
SGI Origin 3800/R14k-500                 NumaFlex       530                     4.7

‡ intra-node, inter-node
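One way to read Table 1 is to compare, for each cluster, the effective bandwidth obtained with one CPU per node against that obtained with two CPUs per node. The short sketch below computes this ratio from the tabulated values. Interpreting "(1 CPU)" and "(2 CPU)" as the number of CPUs used per node is stated explicitly in the text only for CS9 and is assumed here for the other clusters.

```python
# Effective bandwidth (MBytes/sec, 16 CPUs) from Table 1, keyed by cluster:
# (one CPU per node, two CPUs per node). The per-node reading of the
# "(1 CPU)" / "(2 CPU)" labels is an assumption for CS2, CS7 and CS8.
eff_bw = {
    "CS7 (Scali/SCI)": (481.0, 367.0),
    "CS8 (Myrinet 2000)": (667.0, 516.0),
    "CS9 (Myrinet 2000)": (631.0, 471.0),
    "CS2 (QSNet)": (698.0, 456.0),
}

# Ratio > 1 means the figure is higher when only one CPU per node is used.
ratios = {
    cluster: single / dual
    for cluster, (single, dual) in eff_bw.items()
}

for cluster, ratio in ratios.items():
    print(f"{cluster:20s} 1-CPU / 2-CPU bandwidth ratio = {ratio:.2f}")
```

In every case listed, the effective bandwidth is higher when a single CPU per node is used.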

References

[1] http://www.pallas.de/pages/pmb.htm; PMB, a comprehensive set of MPI benchmarks written by Pallas, targeted at measuring important MPI functions: point-to-point message-passing, global data movement and computation routines, one-sided communications and file-I/O.

[2] see http://www.dl.ac.uk/CFS/benchmarks/pmb

[3] see http://www.hlrs.de/organization/par/services/models/mpi/b_eff

 

design by CCP1, March 2003