The Daresbury Laboratory Beowulf Project
S.J. Andrews, M.F. Guest and B.G. Searle
Introduction
Networks of personal computers (so-called Beowulf systems) composed of fast PCs configured with large quantities of RAM and hard disk, and running the Linux operating system, are becoming more and more attractive as cheap and efficient platforms for distributed applications [1]. The main drawback of a standard Beowulf architecture is the poor performance of the conventional inter-process communication mechanisms based on RPC, sockets, TCP/IP and Ethernet. Such standard mechanisms are thought to perform poorly both in terms of throughput and message latency.
Nevertheless, there is increasing interest in the use of commodity "off-the-shelf" components as building blocks for high-performance computing. This is evident in many areas, as witnessed by the high proportion of such machines appearing in funding requests to EPSRC and other sources, and by the filling of this particular niche by specialist integrators. An example within the UK academic community is provided by the latest round of the Joint Research Equipment Initiative (JREI'2001). This included equipment requests for more commodity-based clusters than proprietary SMP-based solutions (e.g. Origin 3400), in stark contrast to previous rounds of such Initiatives.
To investigate the potential of this class of system, a number of "Beowulf-class" systems have been assembled and evaluated at Daresbury [2]. The programme of evaluation is focused on both system software and hardware, and on assessing the delivered performance across a broad spectrum of end-applications. The programme aims to inform the community about the wide variety of available options, from choice of CPU (Alphas, single- and dual-Pentiums, caches, Xeon or old memory subsystems) to choice of interconnect (Ethernet, Myrinet, Scali SCI, QsNet etc.). This report provides an update on progress.
Following an overview of the commodity-based systems currently under assessment, we consider in subsequent articles the results of a number of benchmarks designed to assess the building blocks of any cluster, namely (i) serial node CPU performance, and (ii) communications benchmarks across the variety of possible cluster interconnects. The major focus of these subsequent reports is, however, on applications and we present performance comparisons between commodity-based systems, the CSAR Cray T3E/1200E, and a variety of high-end proprietary systems. Applications considered include those from computational chemistry (GAMESS-UK [3], DL_POLY [4] and CHARMM [5]), computational materials (CPMD [6]), and computational engineering (ANGUS [7] and FLITE3D). Evaluation has focused on commodity CPUs from Intel and AMD, and Alpha EV6-based platforms. The former includes Intel's IA-32 (Pentium III and Pentium 4) and AMD's Athlon CPUs, and the more recent IA64 processors from Intel (Itanium and Itanium 2). The Alpha-based platforms include DS10, DS20, ES40, ES45, XP-1000 and DPC264 machines, from 466 MHz EV6 to industry-leading 1000 MHz EV68 models. Various system and resource management packages (Lobosq, Sychron, Beowulf, Quadrics RMS and PBS) have been investigated, while several flavours of message passing software (MPICH, LAM6.2, LAM6.3, MPI-VIA, NCSA's VMI and Shmem), compilers (from Compaq, Intel, Absoft, PGI and GNU/g77) and numerical libraries (ATLAS, NASA, Intel's MKL) have been tested. Implementations of the Global Array (GA) tools and parallel eigensolvers from PNNL have also been completed. As well as the choice of processor, a variety of networking options including fast Ethernet, Myrinet and more recently the low-latency, high bandwidth solutions, SCI from Dolphin and QsNet from Quadrics, have also been evaluated. The assessment of a variety of prototype commodity-based systems (CS) continues to produce a wealth of data. 
Ten such systems have been used in the present study (CS0-CS9), five in-house (CS0-CS3, and CS5), the five others through collaborative links to other Beowulf sites (CS4, CS6-CS9). Although no longer in use, the in-house Pentium-based machines (CS0 and CS1) have been used to benchmark a host of applications. Their role as development systems has now been taken over by CS2 (a Linux Alpha EV67-based UP2000/Quadrics system) and CS3 (an AMD Athlon-based Myrinet system).
In addition to the in-house systems above, the following machines have been evaluated and used to benchmark applications:
References
[1] D. Ridge, D. Becker, P. Merkey and T. Sterling, Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proceedings, IEEE Aerospace (1997).
[2] Additional information on the Daresbury Beowulf Systems is available via World Wide Web URL http://www.cse.clrc.ac.uk/disco/dl-beowulfs.shtml.
[3] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, H.J.J. van Dam, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, A.J. Stone and D.J. Tozer (http://www.dl.ac.uk/CFS).
[4] See http://www.dl.ac.uk/TCS/Software/DL_POLY/dl_poly.t3e.htm/
[5] B.R. Brooks, R.E. Bruccoleri, B.D. Olafson, D.J. States, S. Swaminathan and M. Karplus, CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217 (1983).
[6] CPMD Version 3.3: Hutter, Alavi, Deutsch, Bernasconi, St. Goedecker, Marx, Tuckerman and Parrinello (1995-1999).
[7] D.R. Emerson and R.S. Cant, Direct simulation of turbulent combustion on the Cray T3D - initial thoughts and impressions from an engineering perspective, Parallel Computing (1996).
[8] For details of the hardware at NCSA, see http://www.ncsa.uiuc.edu/UserInfo/Resources/
DisCo Hardware
The background to the major pieces of hardware associated with the DisCo programme has been provided in the preceding article. The major cluster hardware resources available at Daresbury over this period included:
Beowulf Systems:
32 processor Pentium system:
64 processor Alpha system:
Miscellaneous desktop systems:
Until recently three processors were likely to become dominant in this field, namely the Pentium, IA64 (currently Itanium) and Alpha although, with the pending Compaq/HP merger, the future of the latter is now in doubt beyond the EV7. Several interconnect options remain in play including Myrinet, SCI/SCALI, QsNet and Gigabit Ethernet, and we aim to evaluate for the community appropriate combinations of these elements. Recognising the cost ramifications involved in targeting these options, our aim was to leverage capital investment from EPSRC with other funding routes to yield a significant asset base for realistic evaluations.
It is unfortunate that this strategy has been affected by factors beyond the team's control, resulting in the enforced continuation of the obsolete Pentium cluster service, prior to its decommissioning, and a limitation on the expansion of the Alpha machine to 64 processors. One factor here was the impact of the complete refurbishment of the computer hall at Daresbury, which necessitated a moratorium on new, space-filling equipment until September 2002. Given this slower than anticipated expansion of the in-house systems, considerable effort has been invested in gaining access to other commodity-based systems (CS4-CS9, see preceding article) with the goal of benchmarking applications. The presence of a purpose-built machine room will relieve the pressure on floor space and services and allow us to concentrate on refreshing the Daresbury Cluster Computing capability.
The upgraded Alpha machine passed its acceptance tests in September 2001. These included:
The pre-upgrade machine was used heavily by the Quasi consortium during the first half of 2001 and, notwithstanding the conclusion of the funding, their usage has continued at a reduced level since then. Total usage of the machine from September 2001 is around 50% of the available node-hours, of which more than 25% has been by external users including those from the RI and Cambridge. At the time the Alpha processor was a leading contender within the then recently announced HPC(x) procurement, but this is no longer the case and support for this processor, particularly under Linux, is waning. It is likely, therefore, that our machine will be updated to use Intel processors under the Quadrics interconnect when the current system becomes uncompetitive.
After nearly three years the 32-node Pentium cluster was decommissioned in March 2002 due to space restrictions imposed by a major refurbishment of the Daresbury computer hall. During this time it proved to be a robust platform and served admirably in demonstrating the potential of commodity cluster computing to the academic community. Latterly, it was fully integrated into the emerging GRID-based environments that promise to dominate the next generation of distributed computing.
SPEC Benchmarks
The prototype commodity-based systems described in the preceding article are based on the Pentium, Itanium, Athlon and Alpha CPUs as the node building block. The choice of optimal CPU (performance vs. cost) has been informed by an analysis of the serial performance of a wide variety of processors across a number of computational chemistry benchmarks. We summarise below the results of such a comparison for the SPEC benchmark; comparisons in the area of computational chemistry are presented in the following article. One of the most useful indicators of CPU performance is provided by the SPEC ("Standard Performance Evaluation Corporation") benchmarks.
This benchmark suite [1] contains non-tuned application-based code to measure processor speed for both integer (SPECint) and floating point (SPECfp) arithmetic. SPECfp95 and SPECint95, and their successors, SPECfp2000 and SPECint2000, have become industry standards in measuring primarily the performance of a system's processor, memory architecture, operating system and compiler. The next generation of SPEC benchmarks, SPEC CPU2000 [2], has recently replaced SPEC95. CFP2000 is derived from the results of 14 floating-point benchmarks compiled with aggressive optimization, and is the geometric mean of 14 normalised ratios (one for each benchmark). CINT2000 is derived from the results of 12 integer benchmarks compiled with aggressive optimization, and represents the geometric mean of 12 normalised ratios (one for each benchmark). Note that the level of optimisation is not mandated. While highly aggressive optimisation is permitted, results derived from benchmarks compiled with conservative optimisation (SPECfp_base2000) can be submitted.
A subset of the SPECfp2000 and SPECint2000 results for many of the leading CPUs is given in Table 1 (with the baseline system the Ultra 10 333MHz). In each case we have normalised the values relative to those of the Compaq AlphaServer ES45/68-1000. An examination of the SPECfp2000 values shows that the two leading machines, the IBM pSeries 690Turbo and Compaq AlphaServer ES45/68-1000, lie comfortably ahead of the third placed system, the Sun Blade Model 2050/1050 MHz. While the performance of the 690Turbo is indeed impressive, outperforming the ES45/68-1000 by a factor of 1.2, it is worth trying to put this performance in context by considering the potential shortcomings of using single processor benchmarks on such a system.
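As an aside on how composite figures of this kind are formed, the geometric mean of normalised ratios described above can be sketched as follows. This is a minimal illustration rather than SPEC's own tooling: it assumes each per-benchmark ratio is the reference-machine time divided by the measured time, scaled so that the reference machine scores 100, and the function names and timings are hypothetical and are not SPEC data.

```python
import math


def composite_rating(reference_times, measured_times, scale=100.0):
    """Geometric mean of normalised ratios, one ratio per benchmark.

    Assumption: ratio = scale * reference_time / measured_time, so the
    reference machine itself scores `scale` on every benchmark.
    """
    if len(reference_times) != len(measured_times) or not reference_times:
        raise ValueError("need one reference and one measured time per benchmark")
    ratios = [scale * ref / run for ref, run in zip(reference_times, measured_times)]
    # Average the logarithms, then exponentiate, to form the geometric mean.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def relative_rating(rating, baseline_rating):
    """Express a rating as a percentage of a chosen baseline machine."""
    return 100.0 * rating / baseline_rating


# Hypothetical timings in seconds for three benchmarks (not SPEC data).
reference = [1400.0, 1800.0, 1100.0]
machine_a = [700.0, 450.0, 550.0]   # per-benchmark ratios 200, 400, 200
machine_b = [350.0, 450.0, 275.0]   # per-benchmark ratios 400, 400, 400

rating_a = composite_rating(reference, machine_a)
rating_b = composite_rating(reference, machine_b)
print(round(rating_a, 1), round(rating_b, 1))
print(round(relative_rating(rating_a, rating_b), 1))
```

Because a geometric mean is used, a single very fast benchmark lifts the composite less than it would lift an arithmetic mean. This is one reason why a composite rating need not predict the behaviour of any particular application.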
The complex cache hierarchy of the 690Turbo means in practice that with 7 of the 8 CPUs disabled (according to the SPECfp2000 rules), the single processor benchmark job is actually running in an environment comprising the total cache associated with the 8-way MCM, i.e. 128 MByte. Since many of the SPECfp2000 benchmarks have a total memory requirement of this order of magnitude, the recorded level of performance probably bears little resemblance to what might be seen if all 8 CPUs were running the same job. It is not easy to quantify the impact here as IBM have yet to release the SPECfp2000_rate figures for the 690Turbo.
Considering the two leading systems, the IBM pSeries 690Turbo outperforms the Power3-based IBM 375 MHz CPU in the IBM RS/6000 SP (382) by a factor of 2.5, a factor that has been widely quoted in early benchmarks of the 690Turbo. The AlphaServer ES45 Model 68/1000 outperforms the 833 MHz Compaq ES40 (777) by a factor of 1.24, and the 667 MHz based ES40 (562) by a factor of 1.70. These factors are somewhat greater than the clock speed ratios (1.20 and 1.50).
While some 50 systems in the current list exhibit greater than 50% of the performance of the 690Turbo (i.e. SPECfp2000 figures higher than 580), only 6 distinct CPUs are involved in these systems. These include Intel's IA32 Pentium 4 and IA64 Itanium processors, AMD's Athlon CPUs, the Alpha A21264C and A21264A, Sun's UltraSPARC III Cu processor and finally the PA-RISC PA8700 from Hewlett Packard.
Table 1. SPEC CPU2000 - SPECfp and SPECint Values and Values Relative to the Compaq AlphaServer ES45/68-1000.
(+) SPECfp_base2000 value
The 12 systems following the Sun Blade Model 2050/1050 MHz include 6 based on Intel's Pentium 4 Xeon, at 2.2 GHz (SPECfp2000, 777-782) and 2.0 GHz (SPECfp2000, 743-764), and 5 based on the Alpha A21264 CPU. The A21264B 833 MHz CPU in the DS20E and ES40 Model 6/833 exhibits SPECfp2000 figures of 784 and 777, whilst the A21264C 1001 MHz CPU (in the GS80, GS160 and GS320) exhibits SPECfp2000 ratings of 784, 777 and 756 respectively. The final system here is the 900 MHz UltraSPARC III Cu CPU in the Sun Blade 1000 Model 1900 (SPECfp2000, 731).
Of the next 33 entries, 28 feature either the Pentium 4 or Itanium/800 MHz CPUs. SPECfp2000 figures for Intel's Pentium 4 in this range vary from 607 (1.7 GHz) to 714 (2.0 GHz), while the Itanium-based systems exhibit values from 703 (800 MHz with 4 MB L3 cache) to 623 (733 MHz, with 2 MB L3 cache). Note that both Itanium values are in fact SPECfp2000_base and not SPECfp2000. The only non-Intel based CPUs in this range include the 833 MHz Alpha A21264A (a rating of 644 in the API UP2000 6/833), AMD's Athlon (Epox 8KHA+ XP2000+) at 1667, 1600 and 1533 MHz (SPECfp2000 ratings of 642, 634 and 615 respectively) and the 900 MHz Cu UltraSPARC III (SPECfp2000 rating of 700 in the Sun Fire 280R). All other CPUs exhibit SPECfp2000 ratings of less than 600, i.e. are at least a factor of two slower than the IBM pSeries 690Turbo.
Based on the normalised ratings of Table 1, we would expect the IBM pSeries 690Turbo (122%) to comfortably outperform the Compaq Alpha ES45/1000 (100%), the Sun Blade Model 2050/1050 MHz (86%), the 2.2 GHz Pentium 4 Xeon (84%) and the Compaq AlphaServer ES40/833 (81%). Normalised ratings for the leading CPUs from other vendors outlined above are as follows:
In terms of leading CPUs from each vendor, the poorest performers would appear to be the current offerings from SGI. Thus the 600 MHz CPU in the SGI O3800/R14k-600 is a factor of 2.2 times slower than the IBM pSeries 690Turbo, the 500 MHz CPU a factor of 2.5 times slower.
An examination of the SPECfp2000_base values of [2] shows a somewhat different picture. While the IBM pSeries 690Turbo remains the leading CPU (SPECfp2000_base, 1098), the Compaq AlphaServer ES45/1000 now exhibits effectively the same rating (776) as the 2 GHz Pentium 4 Xeon systems from Dell (SPECfp2000_base, 779). Also apparent is the decline in position of the Sun Blade Model 2050/1050 MHz, from 3rd in the SPECfp2000 ratings to 13th in the SPECfp2000_base table (with a value of 701). In similar fashion the Compaq AlphaServer DS20E 68/833, which lies 6th in the SPECfp2000 ratings, now occupies a much lower position, being outperformed by a number of Pentium 4 (1.7 - 2.0 GHz) and Itanium/800 MHz CPUs. SPECfp2000_base figures for Intel's Pentium 4 range from 581 (1.4 GHz) to 779 (2.2 GHz), while the Itanium-based systems exhibit values from 703 (Dell Poweredge 7150, 800 MHz with 4 MB L3 cache) to 623 (HP i2000, 733 MHz, with 2 MB L3 cache). The only systems in the top 30 SPECfp2000_base entries that do not feature an Intel-based CPU are the IBM pSeries 690Turbo, the Alpha systems from Compaq (AlphaServer ES45 68/1000, DS20E 68/833 and ES40 6/833), plus the Sun Blade Model 2050/1050 MHz and Model 900/900 MHz Cu.
References
[1] A SPEC FAQ describing the SPEC benchmark suite and the SPEC consortium is periodically posted to comp.benchmarks, and can be found on the WWW at: www.specbench.org/spec/faq. An excellent summary of the SPEC benchmarks that is periodically updated is available via anonymous ftp from: ftp.cs.toronto.edu in the file /pub/spectable.
[2] http://www.specbench.org/osg/cpu2000/results/cpu2000.html
Computational Chemistry Benchmarks
In the previous article we have discussed CPU performance on the general SPEC benchmarks. We summarise below performance in the area of computational chemistry, focusing on a benchmark suite that includes a set of twelve quantum chemistry calculations using the GAMESS-UK electronic structure program [1]. The comparison involves approximately one hundred and twenty computers, ranging from supercomputers to scientific workstations and Pentium, Athlon and Itanium-based PCs. Vector supercomputers used in this report include the NEC SX-5. A large number of workstations and workstation servers have been benchmarked, including the recent offerings from:
Note that the present results are taken from a more detailed report on computational chemistry benchmarks [2]. The GAMESS-UK Benchmark is designed to represent the typical range of calculations commonly performed by the ab initio quantum chemist. It includes 12 calculations that feature conventional- and direct-SCF, CASSCF and MCSCF, CI calculations (both direct-CI and conventional table-driven MRD-CI), MP2, and both SCF and MP2 analytic 2nd derivatives.
The data presented in Table 1 is collected under control of the UNIX time command, and includes CPU time (both user and system summed over all 12 calculations), total elapsed time and efficiency (measured as CPU versus elapsed). Also shown is the relative performance of each machine based on SPECfp2000 values and on the GAMESS-UK total CPU times, normalised to values for Compaq's ES45/1000. These suggest that the IBM pSeries 690Turbo is dominant, comfortably outperforming the Compaq Alpha ES45/1000. Apart from the 690Turbo, five of the leading 11 entries feature the Alpha EV68/EV67 processor, four the IA32 CPUs from AMD and Intel, with a single PA-RISC system from HP.
The pSeries 690Turbo (3.1 mins.) is seen to outperform the AlphaServer ES45/1000 (3.7 mins.) by a factor of 1.2, and the Pentium 4/2000 (ifc), HP PA-9000/J6700-750 and the Alpha ES40/833 by a factor of 1.4. These 5 machines are followed by the AMD MP1800+/1533 (pgf77), the Pentium 4/2000 (pgf77), the AMD K7/1400 (pgf77), and the 667 MHz Compaq Alpha machines (DS20E and ES40) and API UP2000 6/833 (4.8-5.6 mins).
Table 1. The GAMESS-UK Serial Benchmark: total CPU time (user and system), elapsed time (minutes), efficiency (%) and relative performance from both SPECfp2000 and GAMESS-UK.
(+) using the Portland Group compiler, pgf77; (*) using the Intel compiler, ifc
Nine machines exhibit user CPU timings of between 6 and 7 minutes. These include the HP PA-9000/J6000-552, the Pentium 4/1500 and AMD K7/1200 (pgf77), Compaq PW XP1000/667, the SUN Fire 6800/900-Cu and the API UP2000 6/667. The HP Itanium/733-2M L3 and IBM's SP/WH2/375 and RS/6000 44P-270 follow (7.1-7.4 mins), somewhat ahead of the Intel Itanium/800-4M L3 (7.6 mins.) and AMD's Athlon K7/1000 (pgf77) and the HP PA-9000/J5000-440 (7.8 and 7.9 mins). The latter shows comparable performance to the EV6/500; of the next eleven machines with user CPU timings between 8 and 9 mins., six feature the EV6 processor. We see that the Compaq ES40/6-500 and DS20/6-500 are marginally superior to the GS140 and Compaq XP1000/500, which in turn lie ahead of the Compaq DS10/466 and AlphaPC 264DP-500. Non-Alpha machines in this range include SGI's O3800/R14k-500 and O300/R14k-500, the Pentium III/1000 and Pentium 4/1400 (pgf77), and HP PA-9000/C3000-400 (8.1-8.4 mins).
The leading 20 machines are from Compaq (5), API (2), IBM (1), HP (2), and Sun (1), plus the leading CPUs from Intel (4) and AMD (5). The fastest machine from SUN (the SunFire 6800/900-Cu) is a factor of 2.2 times slower than the IBM pSeries 690Turbo. We note that the performance of the Pentium 4 improves significantly with optimal maths libraries and disk configuration. The availability of Intel's MKL libraries on the Pentium 4/1500 (and not on the 1400 MHz) accounts for the significant decrease in user time, from 8.4 to 6.2 mins. Twelve machines are seen to lie within a factor of two of the fastest.
A somewhat modified picture emerges when considering the system CPU and elapsed times. With the exception of the Alpha-based systems, most machines exhibit a system CPU time of the order of 10-15% of the user time; this percentage increases significantly on the Alpha, to between 20-40% for the systems shown in the Table.
The API UP/2000-833 shows an alarming increase, to a figure of 66%. Based on the summed CPU times, the IBM pSeries 690Turbo's position as the optimum CPU is strengthened (3.6 mins.), while the Compaq Alpha ES45/1000 is now only marginally faster than the Pentium 4/2000 (ifc) (4.6 vs. 4.8 mins.). These three systems are followed by the HP PA-9000/J6700-750, Pentium 4/2000 (pgf77) and AMD MP1800+/1533 (5.1-5.5 mins).
Considering the elapsed times and associated efficiencies for the most recent machines, efficiencies in excess of 96% are only seen on the IBM pSeries 690Turbo, the Compaq and API hardware (ES45/6-1000, ES40/6-833 and UP2000/6-833), the SGI R14k-based O300, the SUN Fire 6800/900-Cu and Sun Blade 1000/M1750, plus AMD's MP1800+/1533 and K7/1400, and Intel's Pentium 4/2000 and P4/1500. What is noticeable is the significant improvement in these efficiency ratios on the more recent SUN and Compaq/Digital hardware compared to the figures recorded in the past.
We now consider the performance of the Pentium- and AMD-based hardware; using the GNU g77 compiler resulted in summed CPU times of 6.7, 8.1 and 12.6 mins. for the Pentium 4/2000, P4/1500, and P4/1400, respectively. Corresponding timings on the Pentium III systems range from 11.3 mins. (Pentium III/1000) to 19.4 mins. (PII/550). The AMD Athlon is arguably more impressive, with CPU times ranging from 6.7 mins. on the MP1800+/1533 to 18.2 mins. on the K7/500. Use of the Portland Group pgf77 compiler produces a consistent level of performance improvement compared to g77, typically by a factor of between 1.13 and 1.22 for the Pentium III and AMD Athlon CPUs, while somewhat higher for the Pentium 4/2000 and /1500 (a factor of 1.25). Optimum performance on the Pentium 4 systems arises from use of Intel's ifc Fortran compiler; an ifc/g77 performance ratio of 1.40 is found on the Pentium 4/2000, compared to the pgf77 figure of 1.25.
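As a rough consistency check on the compiler ratios just quoted, the sketch below divides the g77 summed CPU time for the Pentium 4/2000 by each ratio. It assumes that the quoted performance ratios are ratios of summed CPU times, which the text does not state explicitly; the variable and function names are illustrative only.

```python
# Summed CPU time (minutes) with GNU g77 on the Pentium 4/2000, as quoted in the text.
G77_P4_2000_MIN = 6.7

# Performance ratios relative to g77 quoted in the text for the Pentium 4/2000.
RATIO_VS_G77 = {"pgf77": 1.25, "ifc": 1.40}


def implied_time(g77_minutes, ratio):
    """Time implied by a speed-up ratio, assuming ratio = g77 time / compiler time."""
    return g77_minutes / ratio


# Round to one decimal place, the precision used for timings in the text.
implied = {
    compiler: round(implied_time(G77_P4_2000_MIN, ratio), 1)
    for compiler, ratio in RATIO_VS_G77.items()
}

for compiler, minutes in implied.items():
    print(f"{compiler}: {minutes} mins")
```

Under that assumption the ifc figure comes out at about 4.8 mins., matching the summed CPU time quoted above for the Pentium 4/2000 (ifc), while the pgf77 figure of about 5.4 mins. falls within the 5.1-5.5 mins. range quoted for the group that includes the Pentium 4/2000 (pgf77).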
While the AMD Athlon MP1800+/1533 is 1.53 times slower than the IBM pSeries 690Turbo, we note that SPECfp-based predictions based on the ratings of Table 1 would have led to much higher factors, 2.14 (SPECfp2000) and 2.18 (SPECfp_base2000). Finally, we note the disappointing performance of the Itanium-based IA64 systems. The Intel Itanium/800-4M L3 and HP Itanium/733-2M L3 exhibit comparable summed CPU times of 9.0 and 9.5 mins., i.e. only around 50% of the performance of the Pentium 4/2000, and little better than the IA32 1 GHz Pentium III.
The relative performance figures of Table 1 suggest that the GAMESS-UK results are broadly in line with the SPECfp ratings, with the IBM pSeries 690Turbo representing the optimum CPU. However, this superiority is not as pronounced as might be expected from just a consideration of the SPECfp2000 rankings. These results, together with known costs, strongly suggest that the AMD Athlon and Pentium 4 CPUs provide the most likely building blocks for Beowulf systems. The previous inclusion of the EV67 and EV68 processors in this category must now be in doubt given the recent developments within Compaq and API.
References
[1] GAMESS-UK is a package of ab initio programs written by M.F. Guest, J.H. van Lenthe, J. Kendrick and P. Sherwood, with contributions from R.D. Amos, R.J. Buenker, H.J.J. van Dam, M. Dupuis, N.C. Handy, I.H. Hillier, P.J. Knowles, V. Bonacic-Koutecky, W. von Niessen, R.J. Harrison, A.P. Rendell, V.R. Saunders, A.J. Stone and D.J. Tozer. The package is derived from the original GAMESS code due to M. Dupuis, D. Spangler and J. Wendoloski, NRCC Software Catalog, Vol. 1, Program No. QG01 (GAMESS), 1980.
[2] M.F. Guest, Performance of Various Computers in Computational Chemistry, in Proceedings of the Daresbury Machine Evaluation Workshop, CLRC Daresbury Laboratory, November 2001. The associated MS PowerPoint presentation is also available.
Communications Performance Benchmarks
There are an increasing number of available options for the network connections in a Beowulf cluster. The proprietary options include Myrinet from Myricom Inc., SCI from Dolphin and QsNet from Quadrics Supercomputer World Ltd. The other choices are Fast Ethernet and Gigabit Ethernet, which are available from most companies that produce network products. While we have measured point-to-point bandwidth and latency on these networks, these are of limited value and provide no more than an indication of network performance likely to be encountered in parallel applications. To provide a more quantitative assessment, we have adopted a number of benchmarks designed to provide a systematic evaluation of both point-to-point and collective operations.
The PMB Benchmark
We have continued to use the MPI Parallel Communications benchmarks from Pallas (PMB, Pallas MPI Benchmarks [1]) across a variety of parallel hardware. PMB considers a number of point-to-point communications (e.g. PingPong, Sendrecv and Exchange) plus a selection of MPI Collective Operations (Allreduce, Reduce, Reduce_scatter, Allgather, Allgatherv, Alltoall, Bcast and Barrier). Results for the former are reported in MBytes/sec, results for the latter as time (μsec) to complete. Each operation is run for a variety of message lengths (0 to 4194304 Bytes), with the collective operations performed for various combinations of the number of CPUs available. Systems previously reported included the Cray T3E/1200E, SGI Origin 3800/R14k-500, IBM SP/WH2-375 and the commodity clusters, CS1, CS2, CS3 and CS5. High-end systems evaluated during the current reporting period include:
The PMB benchmarks have also been run on a number of commodity-based systems, including:
Full details of these benchmarks have been presented elsewhere [2]. To provide an example of the output, we show below a plot for the performance of MPI_Allreduce on 16 CPUs of the machines identified above.
An Effective Bandwidth Benchmark, EFF_BW
As a spin-off from PMB, a benchmark "EFF_BW" has been developed by Pallas to calculate the "effective bandwidth". For a given machine and processor number, one integral number is calculated which includes the performance for small and for large messages under participation of all available processors (see [1]). EFF_BW uses only the PMB PingPong Benchmark for measuring startup and throughput. We have adopted this benchmark [3], and present in Table 1 below the EFF_BW figures for 16 CPUs (together with reported Latency figures) measured on both the high-end and commodity-based hardware under consideration.
Table 1. Effective Bandwidth figures (MBytes/sec) and Ping-Pong Latency (μsec) for a number of commodity systems and high-end parallel machines.
‡ intra-node, † inter-node
References
[1] http://www.pallas.de/pages/pmb.htm; PMB, a comprehensive set of MPI benchmarks written by Pallas, targeted at measuring important MPI functions: point-to-point message-passing, global data movement and computation routines, one-sided communications and file-I/O.
[2] See http://www.dl.ac.uk/CFS/benchmarks/pmb
[3] See http://www.hlrs.de/organization/par/services/models/mpi/b_eff