

Barcelona Supercomputing Center Centro Nacional de Supercomputación

# "Supercomputing for the Future, Supercomputing from the Past"

Onassis Foundation Lectures on Computer Science, Heraklion, Crete, July 21-25, 2008

Heraklion, Crete, July 25th, 2008

Prof. Mateo Valero Director

## **Talk outline**



- Supercomputing from the past
  - Architecture evolution
  - Applications and algorithms
- Supercomputing for the future
  - Technology trends
  - Multidisciplinary top-down approach

- Conclusions





# **30th List: The TOP10**

| #  | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Year | #Cores |
|----|--------------|----------|-------------|-------------------|---------|------|--------|
| 1  | IBM  | BlueGene/L<br>eServer Blue Gene | | DOE/NNSA/LLNL | USA | 2007 | 212,992 |
| 2  | IBM  | JUGENE<br>BlueGene/P Solution | 167.3 | Forschungszentrum<br>Juelich | Germany | 2007 | 65,536 |
| 3  | SGI  | SGI Altix ICE 8200 | | New Mexico Computing<br>Applications Center | USA | 2007 | 14,336 |
| 4  | HP   | Cluster Platform 3000 BL460c | 117.9 | Computational Research<br>Laboratories, TATA SONS | India | 2007 | 14,240 |
| 5  | | | | | | | |
| 6  | | | | | | | |
| 7  | Cray | Jaguar<br>Cray XT3/XT4 | 101.7 | DOE/ORNL | USA | 2007 | 23,016 |
| 8  | IBM  | BGW<br>eServer Blue Gene | 91.29 | IBM Thomas Watson | USA | 2005 | 40,960 |
| 9  | Cray | Franklin<br>Cray XT4 | 85.37 | NERSC/LBNL | USA | 2007 | 19,320 |
| 10 | IBM  | New York Blue<br>eServer Blue Gene | 82.16 | Stony Brook/BNL | USA | 2007 | 36,864 |

Plenty of room for research!

30th List / November 2007 www.top500.org




# **31st List: The TOP10**

| #  | Manufacturer | Computer                                 | Rmax<br>[TF/s] | Installation Site                                 | Country | Power<br>[MW] | #Cores  |
|----|--------------|------------------------------------------|----------------|---------------------------------------------------|---------|---------------|---------|
| 1  | IBM          | Roadrunner<br>BladeCenter QS22/LS21      | 1,026          | DOE/NNSA/LANL                                     | USA     | 2.35          | 122,400 |
| 2  | IBM          | BlueGene/L<br>eServer Blue Gene Solution | 478.2          | DOE/NNSA/LLNL                                     | USA     | 2.33          | 212,992 |
| 3  | IBM          | Intrepid<br>Blue Gene/P Solution         | 450.3          | DOE/ANL                                           | USA     | 1.26          | 163,840 |
| 4  | Sun          | Ranger<br>SunBlade x6420                 | 326            | TACC                                              | USA     | 2.00          | 62,976  |
| 5  | Cray         | Jaguar<br>Cray XT4 QuadCore              | 205            | DOE/ORNL                                          | USA     | 1.58          | 30,976  |
| 6  | IBM          | JUGENE<br>Blue Gene/P Solution           | 180            | Forschungszentrum<br>Juelich (FZJ)                | Germany | 0.50          | 65,536  |
| 7  | SGI          | Encanto<br>SGI Altix ICE 8200            | 133.2          | New Mexico Computing<br>Applications Center       | USA     | 0.86          | 14,336  |
| 8  | HP           | EKA<br>Cluster Platform 3000 BL460c      | 132.8          | Computational Research<br>Laboratories, TATA SONS | India   | 1.60          | 14,384  |
| 9  | IBM          | Blue Gene/P Solution                     | 112.5          | IDRIS                                             | France  | 0.32          | 40,960  |
| 10 | SGI          | SGI Altix ICE 8200EX                     | 106.1          | Total Exploration<br>Production                   | France  | 0.44          | 10,240  |


31st List / June 2008




IBM continues to lead the TOP20 with 10 systems. There was a great deal of activity in the TOP20, with 14 new, upgraded or improved benchmark entries.

| #  | Vendor | Rmax<br>TFlops | Installation                                              |
|----|--------|----------------|-----------------------------------------------------------|
| 1  | IBM    | 1026           | DOE/NNSA/LANL<br>(QS22/LS21), New                         |
| 2  | IBM    | 478.2          | DOE/NNSA/LLNL<br>(104 racks BlueGene/L)                   |
| 3  | IBM    | 450.3          | Argonne Natl Lab<br>(40 racks Blue Gene/P)                |
| 4  | Sun    | 326            | Texas Adv Comp Center<br>(QC Opteron), New                |
| 5  | Cray   | 205            | Oak Ridge NL<br>(XT4 QC Opteron)                          |
| 6  | IBM    | 180            | FZJ Juelich<br>(16 racks Blue Gene/P), Better Bmk         |
| 7  | SGI    | 133.2          | New Mexico CAC<br>(Altix Clovertown), Better Bmk          |
| 8  | HP     | 132.8          | TATA Research Lab<br>(Clovertown), Better Bmk             |
| 9  | IBM    | 112.5          | IDRIS<br>(10 racks Blue Gene/P), New                      |
| 10 | SGI    | 106.1          | Total Exploration<br>(Altix Quad Core Xeon), New          |

| #  | Vendor  | Rmax<br>TFlops | Installation                                |     |
|----|---------|----------------|---------------------------------------------|-----|
| 11 | HP      | 102.8          | Swedish Govt<br>(Clovertown)                |     |
| 12 | Cray    | 102.2          | Sandia – Red Storm<br>(XT3 Opteron)         |     |
| 13 | IBM     | 92.96          | EDF R&D<br>(8 racks Blue Gene/P)            |     |
| 14 | IBM     | 91.29          | BlueGene at Watson<br>(20 racks BlueGene/L) |     |
| 15 | Cray    | 85.368         | NERSC/LBNL<br>(XT4 Opteron)                 |     |
| 16 | Hitachi | 82.984         | T2K Open SC-Japan<br>(QC Opteron)           | New |
| 17 | IBM     | 82.16          | Stony Brook / BNL<br>(18 racks BlueGene/L)  |     |
| 18 | IBM     | 80.32          | ECMWF<br>(Power 575, p6)                    | New |
| 19 | IBM     | 80.32          | RZG/Max Planck/IPP<br>(Power 575, p6)       | New |
| 20 | Appro   | 76.46          | Univ of Tsukuba<br>(QC Opteron)             | New |

Source: www.top500.org



## Hybrid SMP-cluster parallel systems

 Most modern high-performance computing systems are clusters of SMP nodes (performance/cost trade-off)



- Programming models let the programmer specify:
  - How computation is distributed
  - How data is distributed and how it is accessed
  - How to avoid data races

Per Stenström
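As a rough illustration of the first question (how computation is distributed on a cluster of SMP nodes), the sketch below splits an iteration range first across nodes and then across the threads of each node. The function names, the block distribution policy, and the example sizes are assumptions made for this sketch; the slide does not prescribe any particular scheme.

```python
def block_range(n, parts, i):
    """Half-open range [lo, hi) of the i-th of `parts` near-equal blocks of n items."""
    base, extra = divmod(n, parts)
    lo = i * base + min(i, extra)
    hi = lo + base + (1 if i < extra else 0)
    return lo, hi


def hybrid_partition(n, nodes, threads_per_node):
    """Two-level block distribution: across nodes first, then across threads in a node."""
    plan = {}
    for node in range(nodes):
        node_lo, node_hi = block_range(n, nodes, node)
        for thread in range(threads_per_node):
            lo, hi = block_range(node_hi - node_lo, threads_per_node, thread)
            plan[(node, thread)] = (node_lo + lo, node_lo + hi)
    return plan


# Hypothetical sizes: 10 iterations, 2 nodes, 2 threads per node.
plan = hybrid_partition(10, nodes=2, threads_per_node=2)
for owner, (lo, hi) in sorted(plan.items()):
    print(owner, list(range(lo, hi)))
```

In a real hybrid code the outer level would typically map to message passing between nodes and the inner level to shared-memory threads, which is where the data access and data race questions arise.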



## IBM breaks 1 Petaflop barrier with hybrid configuration at Los Alamos



#### System Highlights ...

- ✓ 1st to break the Petaflop barrier
- ✓ Fastest machine in USA
- ✓ Largest contributor to Top500 aggregate performance with 1.026 of 11.7 Petaflops (8.7%)
- ✓ Third most power efficient system (QS22s at Fraunhofer and IBM Germany are #1 and #2)

## Site: DOE/NNSA/LANL

System Name: QS22/LS21

**System Configuration**: IBM BladeCenter cluster of 17 Connected Units (CUs), for a total of 3060 LS21 blade nodes (dual-socket, dual-core 1.8 GHz Opteron) plus 6120 QS22 blade nodes (dual-socket 3.2 GHz PowerXCell 8i, 8 SPU + 1 PPU cores each). InfiniBand interconnect. 280 racks total.

- Cores: 122,400
- Rmax: 1,026,000 GF
- Nmax: 2236927
- Rpeak: 1,375,776 GF
- Power: 2345 kW
- Power efficiency: 437 Mflops/W
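The headline figures on this slide can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names are ours, and the "Linpack efficiency" ratio (Rmax over Rpeak) is a derived quantity that the slide itself does not state.

```python
# Figures quoted on the slide for the LANL QS22/LS21 system.
opteron_nodes, cell_nodes = 3060, 6120   # LS21 and QS22 blade nodes
sockets_per_node = 2
cores_per_opteron = 2                    # dual-core Opteron
cores_per_cell = 8 + 1                   # 8 SPU + 1 PPU

rmax_gf = 1_026_000
rpeak_gf = 1_375_776
power_kw = 2345
top500_total_pf = 11.7

# Total core count from the node configuration.
cores = sockets_per_node * (
    opteron_nodes * cores_per_opteron + cell_nodes * cores_per_cell
)

linpack_efficiency = rmax_gf / rpeak_gf                 # derived, not on the slide
mflops_per_watt = (rmax_gf * 1000) / (power_kw * 1000)  # Mflops divided by Watts
top500_share = (rmax_gf / 1e6) / top500_total_pf        # Petaflops over Petaflops

print(f"cores:              {cores}")
print(f"Rmax / Rpeak:       {linpack_efficiency:.1%}")
print(f"Mflops per Watt:    {mflops_per_watt:.1f}")
print(f"share of Top500:    {top500_share:.2%}")
```

The core count and the power efficiency reproduce the slide's 122,400 and 437 Mflops/W; the computed share is close to the 8.7% quoted.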

Source: www.top500.org





## **BlueGene/P**



# **Columbia configuration**





## Front End

- 128p Altix 3700 (RTF)

#### Networking

- 10GigE switch 32-port
- 10GigE cards (1 per 512p)
- InfiniBand switch (288 port)
- InfiniBand cards (6 per 512p)
- Altix 3700 2BX 2048 Numalink Kits

## **Compute Node (single sys image)**

- Altix 3700 (A) 12x512p
- Altix 3700 BX2 (T) 8x512p

## Storage Area Network

- Brocade switch 2x128 port

## Storage (440 TB)

- FC RAID 8x20 TB (8 racks)
- SATA RAID 8x35 TB (8 racks)



## **Processors, Blades, BladeCenters and Racks**






































## **Faith-based Computing**

The Ultimate Answer to Life, the Universe, and Everything

"Forty-two!" yelled Loonquawl. "Is that all you've got to show for seven and a half million years' work?" "I checked it very thoroughly," said the computer, "and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is."


The Ultimate Answer from Deep Thought in "The Hitchhiker's Guide to the Galaxy"

## Red Española de Supercomputación







[Figure: map of the network sites; labels include Altamira, MareNostrum, Picasso and Tirant]

### MareNostrum

- Processor: 10240 PowerPC 970 2.3 GHz
- Memory: 20 TBytes
- Disk: 280 + 90 TBytes
- Network: Myrinet, Gigabit, 10/100
- System: Linux

### UPM

- Processor: 2408 PowerPC 970 2.2 GHz
- Memory: 4.7 TBytes
- Disk: 63 + 47 TBytes
- Network: Myrinet, Gigabit, 10/100
- System: Linux

### IAC, UMA, UC, UZ, UV

- Processor: 512 PowerPC 970 2.2 GHz
- Memory: 1 TByte
- Disk: 14 + 10 TBytes
- Network: Myrinet, Gigabit, 10/100
- System: Linux

## **Performance development**







# A growth-factor of a billion in performance in a career




Alliant, American Supercomputer, Ametek, AMT, Astronautics, BBN Supercomputer, Biin, CDC, Chen Systems, CHOPP, Cogent, Convex (now HP), Culler, Cray Computers, Cydrome, Denelcor, Elexsi, ETA, E & S Supercomputers, Flexible, Floating Point Systems, Gould/SEL, IPM, Key, KSR, MasPar, Multiflow, Myrias, Ncube, Pixar, Prisma, SAXPY, SCS, SDSA, Supertek (now Cray), Suprenum, Stardent (Ardent+Stellar), Supercomputer Systems Inc., Synapse, Thinking Machines, Vitec, Vitesse, Wavetracer.

PACT'98 Gordon Bell



# **Talk outline**



- **Supercomputing from the past**
  - Architecture evolution
  - Applications and algorithms
- Supercomputing for the future
  - Technology trends
  - Multidisciplinary top-down approach
- Conclusions



# **Grand challenge problems**

- Systems biology
  - Model & simulation leading to predictive models with clinical or environmental impact
- Sustainable Systems
  - Taking into account multi-scale nature
  - Models are linked to experimental data, providing corroboration of experiments
- Turbulence & Chaos
  - Characterize boundary layer effects and their impact on global solution and stability
- Environmental
  - Global Warming/Climate Change
  - Energy
  - Water
  - Biodiversity and land use
  - Chemicals, toxics and heavy metals
  - Air pollution
  - Waste management
  - Stratospheric ozone depletion
  - **Oceans & fisheries**
  - Deforestation

[Figure: Multi-Scale Patient-Specific Data and Modeling; panels include Genetic Variability, Gene Expression Profiling and Protein Expression Profiling]

# **ITER design**

- Supercomputing is mandatory for ITER design
- The most compute-demanding problems for ITER design
  - Plasma turbulent transport (gyro-kinetic codes)
  - Plasma-wall interaction (DFT+MD+MC+DDD+FE codes)
- Problems generally amenable to parallelisation
  - Gyro-kinetic codes tested on up to 10<sup>4</sup> processors
- With a 100 TFlops state-of-the-art machine
  - Gyro-kinetic modelling of the JET reactor (tokamak) takes days
  - Stellarators are more challenging, but could be simulated
  - ITER needs at least a 10+ PFlops machine







## Airbus 380 Design



## BlueGene/L supports solidification understanding

- Nucleation is initiated at multiple independent sites in each sample cell
- Growth of solid grains initiates independently, but soon leads to grain boundaries which span the simulation cell
- The ddcMD team is currently using 131,072 CPUs of BG/L for unprecedented five hundred million atom MGPT simulations

2005 Gordon Bell Prize WINNER

Lawrence Livermore National Laboratory Blue Gene/L Simulation Results Using ddcMD Code

[Figure: results plotted against the number of tasks (log scale); series: 2M atoms (16384 processors), 16M atoms (32768 processors)]

Contact: Fred Streitz



# **Talk outline**



- Supercomputing from the past
  - Architecture evolution
  - Applications and algorithms

- **Supercomputing for the future**
  - Technology trends
  - Multidisciplinary top-down approach
- Conclusions



# **Technology Outlook**



| High Volume<br>Manufacturing | 2004  | 2006             | 2008 | 2010                          | 2012             | 2014      | 2016 | 2018 |
|------------------------------|-------|------------------|------|-------------------------------|------------------|-----------|------|------|
| Technology Node (nm)         | 90    | 65               | 45   | 32                            | 22               | 16        | 11   | 8    |
| Integration Capacity<br>(BT) | 2     | 4                | 8    | 16                            | 32               | 64        | 128  | 256  |
| Delay = CV/I scaling         | 0.7   | ~0.7             | >0.7 | Delay scaling will slow down  |                  |           |      |      |
| Energy/Logic Op<br>scaling   | >0.35 | >0.5             | >0.5 | Energy scaling will slow down |                  |           |      |      |
| Bulk Planar CMOS             |       | High Probability |      | Low Probability               |                  |           |      |      |
| Alternate, 3G etc            | Low Probability |        |      |                               | High Probability |           |      |      |
| Variability                  | Medium |                 |      | High                          |                  | Very High |      |      |
| ILD (K)                      | ~3    | <3               |      | Reduce slowly towards 2-2.5   |                  |           |      |      |
| RC Delay                     | 1     | 1                | 1    | 1                             | 1                | 1         | 1    | 1    |
| Metal Layers                 | 6-7   | 7-8              | 8-9  | 0.5 to 1 layer per generation |                  |           |      |      |
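The first two rows of the table follow simple geometric progressions, which the sketch below makes explicit. The doubling of integration capacity per two-year generation can be read directly off the table; the 0.7x linear shrink per generation is the classic rule of thumb and is an assumption here, not a figure taken from the slide.

```python
# Values copied from the table, one entry per two-year generation.
years = list(range(2004, 2020, 2))
slide_nodes_nm = [90, 65, 45, 32, 22, 16, 11, 8]
slide_capacity_bt = [2, 4, 8, 16, 32, 64, 128, 256]

# Integration capacity doubles every generation, starting from 2 BT in 2004.
model_capacity_bt = [2 * 2 ** g for g in range(len(years))]

# Assumed rule of thumb: feature size shrinks by about 0.7x per generation,
# which halves the area per transistor (0.7 * 0.7 is roughly 0.5).
model_nodes_nm = [90 * 0.7 ** g for g in range(len(years))]

for year, node, model_node, cap in zip(
    years, slide_nodes_nm, model_nodes_nm, model_capacity_bt
):
    print(f"{year}: node {node:>2} nm (0.7x model {model_node:5.1f} nm), {cap:>3} BT")
```

The named technology nodes are rounded marketing values, so the model tracks them only approximately.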

Shekhar Borkar, Micro37, P

#### Increasing CPU performance: a delicate balancing act

Increasing the number of gates into a tight knot and decreasing the cycle time of the processor





We have seen an increasing number of gates on a chip and increasing clock speeds.

Heat is becoming an unmanageable problem: Intel processors exceed 100 Watts.

We will not see the dramatic increases in clock speeds in the future.

However, the number of gates on a chip will continue to increase.



#### **Talk outline**



- Supercomputing from the past
  - Architecture evolution
  - Applications and algorithms

- **Supercomputing for the future**
  - Technology trends
  - Multidisciplinary top-down approach
- Conclusions



#### Multidisciplinary top-down approach






## Intel's Petaflop chip



The key technologies of this first Tera-scale Research Prototype are a mesh interconnect (left) and support for 3D stacked memory (above).

- 80 processors in a die of 300 square mm.
- Terabytes per second of memory bandwidth
- Note: Intel broke the Teraflops barrier in 1996 using 10,000 Pentium Pro processors contained in more than 85 cabinets occupying 200 square meters
- This will be possible in 3 years from now



Example Mesh




### AMD's Next Generation Processor Technology



- Bit Manipulation extensions (LZCNT/POPCNT)
- SSE extensions (EXTRQ/INSERTQ, MOVNTSD/MOVNTSS)

- Enhanced power management and RAS

The AMD Opteron™ CMP NorthBridge Architecture, Now and in the Future

#### **Ranger System Summary**

- Compute power 504 Teraflops
  - 3,936 Sun four-socket blades
  - 15,744 AMD Opteron "Barcelona" processors
    - Quad-core, 2.0 GHz, four flops/cycle (dual pipelines)
- Memory 125 Terabytes
  - 2 GB/core, 32 GB/node
  - 132 GB/s aggregate bandwidth
- Disk subsystem 1.7 Petabytes
  - 72 Sun x4500 "Thumper" I/O servers, 24TB each
  - ~72 GB/sec total aggregate bandwidth
  - 1 PB in largest /work filesystem
- Interconnect 10 Gbps / 2.3 μsec latency
  - Sun InfiniBand-based switches (2) with 3456 ports each
  - Full non-blocking 7-stage Clos fabric
  - Mellanox ConnectX IB cards
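The summary figures above are mutually consistent, as a short calculation shows. The sketch uses only numbers from the slide; the variable names are ours, and decimal units (1 TB = 1000 GB) are assumed.

```python
blades = 3936
sockets_per_blade = 4
cores_per_socket = 4          # quad-core "Barcelona"
clock_ghz = 2.0
flops_per_cycle = 4
gb_per_core = 2
io_servers, tb_per_server = 72, 24

processors = blades * sockets_per_blade
cores = processors * cores_per_socket
peak_tflops = cores * clock_ghz * flops_per_cycle / 1000   # GFlops to TFlops
memory_gb = cores * gb_per_core
gb_per_node = sockets_per_blade * cores_per_socket * gb_per_core
disk_pb = io_servers * tb_per_server / 1000                # TB to PB

print(f"processors:  {processors}")
print(f"cores:       {cores}")
print(f"peak:        {peak_tflops:.1f} TFlops")
print(f"memory:      {memory_gb} GB total, {gb_per_node} GB per node")
print(f"disk:        {disk_pb:.2f} PB")
```

The core count also matches the 62,976 cores listed for Ranger in the 31st TOP500 table.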





## Ranger: All Racks & Power In Place





#### **Kilo-Instruction Processors: hitting the memory wall**



4-way, out-of-order processor - SpecFP 2000 benchmarks, from [Cri00]



#### **Kilo-Instruction Multiprocessors**





# You will see.... in 400 years from now people will get crazy



We have parallel systems today (Servers, HPC), but can we replace the "Big cores" with many small cores that will run in parallel?

Dr. Avi Mendelson. Keynote at ISC-2007




#### **GeForce 8800 GPU Computing**

- Up to 12,288 active threads
- 86.4 GB/s DRAM BW
- 16 Streaming MP, 367 GFLOPS
- 768 MB DRAM
- 8 GB/s PCIe
- Resources allocated at per-block granularity

[Figure: GeForce 8800 block diagram: Host, Input Assembler, Thread Execution Manager, Parallel Data Caches, Texture units, Load/store units, Global Memory]


## The Evolution of Programmable Logic



DAC, UPC, Dec 2007


#### **Heterogeneous Architectures Emerging**

- New integrated architectures for HPC (Cray XT5h, SGI Altix 350/4700, SRC MAP, etc.)
- Socket plug-in modules (HyperTransport, FSB)
- Which system architecture to choose?



#### The CELL/B.E. chip



235 Mtransistors 235 mm<sup>2</sup>



Los Alamos National Laboratory





All future dates and specifications are estimations only; subject to change without notice. Dashed outlines indicate concept designs.



#### Multidisciplinary top-down approach



- Importance of the different networks in a Supercomputer
- Communication patterns from the applications
- Latency and bandwidth
- Overlapping Communication and Computation
- Multipath routing
- Optical interconnects



#### **Network integration**

- Between nodes
  - Infiniband, Myrinet, ...
  - 3D Torus





- Inside a node
  - Buses to memory



- Network on Chip
  - Buses: CellBE
  - Direct topologies: Intel's 80 core Polaris





#### **Supercomputer networks**

- In the November 2007 Top10 list
  - 4 BlueGenes with 3D Torus Networks
  - 3 Cray XT4 also with 3D Torus Networks
  - 3 Xeon platforms with Infiniband
- 5 independent networks in BlueGene
  - 3D torus: point-to-point
  - Collective network: global operations
  - Global barriers and interrupts
  - Gbit ethernet: file I/O and host interface
  - Control network: boot, monitoring and diagnostics
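To make the 3D torus topology concrete, the sketch below computes the neighbors of a node and the minimum hop count between two nodes when every dimension wraps around. The torus dimensions are hypothetical and are not the geometry of any machine mentioned in the talk.

```python
def torus_neighbors(node, dims):
    """Direct neighbors of `node`: one step in each direction of each dimension,
    with wraparound. Gives six distinct neighbors when every dimension is >= 3."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % size
            neighbors.append(tuple(coords))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum number of hops between nodes a and b, going either way around each ring."""
    return sum(
        min(abs(x - y), size - abs(x - y)) for x, y, size in zip(a, b, dims)
    )


dims = (8, 8, 16)                      # hypothetical torus geometry
diameter = sum(size // 2 for size in dims)

print(torus_neighbors((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (7, 4, 15), dims))
print(diameter)
```

The wraparound links are what keep the worst-case distance at half the ring length per dimension, which is one reason communication patterns and task mapping matter on such networks.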





#### Scientific workloads and network parameters



- Low impact of latency (5-10%), compared to bandwidth (-50% to 20%)
- Amber execution, 64 tasks; simulations with different bw and latency



#### Scientific workloads and network parameters (II)



Latency by bw (CPMD 256 processors)

- No impact of latency, only bandwidth is relevant
- CPMD execution, 256 tasks; simulations with different bw and latency
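One way to build intuition for why bandwidth can dominate is the usual first-order message cost model, time = latency + size / bandwidth. This model is an assumption brought in for illustration; it is not the simulator behind the results above and does not reproduce them. The latency and bandwidth values are borrowed from the Ranger interconnect figures quoted earlier in the talk, purely as plausible inputs.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of one message: fixed latency plus serialization time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


latency = 2.3e-6          # seconds (illustrative)
bandwidth = 10e9 / 8      # 10 Gbps expressed in bytes per second (illustrative)

# Message size at which the latency term equals the bandwidth term.
half_size = latency * bandwidth

for size in (64, 4096, 1 << 20):
    total = transfer_time(size, latency, bandwidth)
    print(
        f"{size:>8} bytes: {total * 1e6:8.2f} us, "
        f"latency share {latency / total:.1%}"
    )
print(f"break-even message size: {half_size:.0f} bytes")
```

Under this model, applications that exchange messages well above the break-even size are governed almost entirely by bandwidth, which is consistent in spirit with the observation on the slide.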



#### **Speculative dataflow**





#### **Effects on bandwidth**











#### Better routes, better mapping








## Evolution of Optical interconnects

Time of Commercial Deployment (Copper Displacement):

|                       | WAN, MAN<br>metro, long-haul | LAN<br>campus, enterprise | System                  | Board<br>module-module | Module<br>chip-chip | Chip<br>on-chip buses |
|-----------------------|------------------------------|---------------------------|-------------------------|------------------------|---------------------|-----------------------|
| Time of deployment    | 1980's                       | 1990's                    | 2000's                  | > 2010                 |                     |                       |
|                       |                              |                           |                         |                        | Terabus Program     |                       |
| Distance              | 10's – 100's km              | 10m – 2km                 | <10 intra<br><100 inter | < 1 m                  | < 10 cm             | < 20 mm               |
| # of lines            | singles                      | tens                      | 100's-1000's            | 1000's                 | 10000's             | 100,000's             |
| Cost<br>(\$/Gb/s)     | 1000                         |                           |                         | 1                      |                     | 10.6                  |
| Power<br>(mW/Gb/s)    | 500                          |                           |                         | 5                      |                     | 0.5                   |
| Density<br>(Gb/s/mm²) | 10<sup>-3</sup>              |                           |                         | 10                     |                     | 1000                  |

Source: F. Doany, IBM Research, 57th Electronic Components and Technology Conference, May 29 – June 1, 2007

#### Multidisciplinary top-down approach





#### **Back to Babel?**



#### **Book of Genesis**

"Now the whole earth had one language and the same words" ...

..."Come, let us make bricks, and burn them thoroughly. "...

..."Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves"...

And the LORD said, "Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech."



#### The computer age

#### Fortran & MPI







### A simple case ☺: the Cell/B.E.

- Libraries
  - libSPE, DaCS, ALF, ...
  - Complete modification of your code
- Follow the standards (e.g. OpenMP)
  - Software cache (runtime/compiler)
  - Tiling and prefetching (compiler)
  - What about performance?
- New programming models
  - CellSs
  - Proof-of-concept implementations that may influence standards





### A scaled view of architectures and programming models



Cell/Grid/SMP Superscalar (StarSs):

- Standard sequential programming: "easy"
- "Decent" performance
- Portable: one language, multiple runtimes

#### Propose new programming models: CellSs

- Simple programming model for the Cell/B.E. ...
  - allows easy porting of applications
  - oriented towards the exploitation of functional parallelism from a sequential application with annotated functions
- ... and a runtime system
  - dynamically exploits functional parallelism (true dependences)
  - removes false dependences (renaming)
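The two runtime bullets can be illustrated with a small dependence tracker. This is a sketch of the general idea only, not the CellSs runtime or its annotation syntax: the task tuples, the function name `build_dependences`, and the example task list are all hypothetical. Tasks are given in sequential program order with the data they read and write; with renaming, every write creates a fresh version of the datum, so only true (read-after-write) dependences remain.

```python
def build_dependences(tasks, renaming=True):
    """tasks: list of (name, inputs, outputs) in sequential program order.
    Returns the set of (earlier_task, later_task) ordering constraints."""
    last_writer = {}   # datum -> task that produced its current version
    readers = {}       # datum -> tasks that read the current version
    edges = set()
    for name, inputs, outputs in tasks:
        for datum in inputs:
            if datum in last_writer:
                edges.add((last_writer[datum], name))      # true dependence (RAW)
            readers.setdefault(datum, []).append(name)
        for datum in outputs:
            if not renaming:
                # Without renaming the write reuses the old storage, so it must
                # wait for earlier readers (WAR) and the earlier writer (WAW).
                for reader in readers.get(datum, []):
                    if reader != name:
                        edges.add((reader, name))
                if datum in last_writer:
                    edges.add((last_writer[datum], name))
            last_writer[datum] = name
            readers[datum] = []
    return edges


# Hypothetical annotated tasks: t3 overwrites "a" while t2 still needs the old value.
tasks = [
    ("t1", [], ["a"]),
    ("t2", ["a"], ["b"]),
    ("t3", [], ["a"]),
    ("t4", ["a", "b"], ["c"]),
]

print(sorted(build_dependences(tasks, renaming=True)))
print(sorted(build_dependences(tasks, renaming=False)))
```

With renaming, `t3` has no predecessors and can run concurrently with `t1` and `t2`; without it, `t3` is serialized behind both.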





#### Multidisciplinary top-down approach







#### **Algorithm kernels**

- Traditional Numerical Kernels continue
  - Sparse Linear algebra
  - Dense Linear Algebra
    - BLAS
    - Linear systems
    - Eigenvalues
  - Discretization methods (FD, FE, FV, BE)
  - FFT and other transforms
  - Random number generation
- Algorithm improvement in the last 20 years similar to Moore's Law
- Emphasis on
  - Memory bandwidth, QoS,...
  - Asynchronism, data flow

| Method       | Storage        | Flops                  |
|--------------|----------------|------------------------|
| GE (banded)  | n<sup>5</sup>  | n<sup>7</sup>          |
| Gauss-Seidel | n<sup>3</sup>  | n<sup>5</sup> log n    |
| Optimal SOR  | n<sup>3</sup>  | n<sup>4</sup> log n    |
| CG           | n<sup>3</sup>  | n<sup>3.5</sup> log n  |
| Full MG      | n<sup>3</sup>  | n<sup>3</sup>          |
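To see how large the algorithmic gains in the table are, the sketch below evaluates the flop estimates for one problem size and compares each method with banded Gaussian elimination. Several things here are assumptions: the slide does not say what n denotes (a plausible reading is grid points per dimension of a 3D problem), the base of the logarithm is unspecified (base 2 is used), constant factors are ignored, and n = 64 is an arbitrary choice.

```python
import math

# Asymptotic flop estimates from the table, constants dropped.
flops = {
    "GE (banded)":  lambda n: n ** 7,
    "Gauss-Seidel": lambda n: n ** 5 * math.log2(n),
    "Optimal SOR":  lambda n: n ** 4 * math.log2(n),
    "CG":           lambda n: n ** 3.5 * math.log2(n),
    "Full MG":      lambda n: n ** 3,
}

n = 64  # arbitrary illustrative size
baseline = flops["GE (banded)"](n)
speedup = {method: baseline / estimate(n) for method, estimate in flops.items()}

for method, factor in speedup.items():
    print(f"{method:<13} {factor:14.1f}x fewer flops than banded GE")

# Number of doublings needed to match the full multigrid gain.
print(math.log2(speedup["Full MG"]))
```

At this size the gap between the first and last rows corresponds to 24 doublings, which is the sense in which algorithmic improvement is comparable to Moore's Law.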





#### **Multidisciplinary top-down approach**




#### The need for performance analysis tools: who

- Users, application developers
  - To confirm assumed behavior (reality is very often different from what was preconceived)
  - To provide expectations of impact, to be used for decision support
    - New machines
    - Tuning efforts  $\rightarrow$  potential rewards
- Operations
  - To plan and ensure proper resource utilization



load imbalance - L2 misses



- System developers
  - To understand global impact of proposed features





#### **CEPBA-Tools environment**





#### **Performance analysis tools: issues**

- Scalability
  - Dynamic range: from long-term behavior @ 10K cores to detailed impact of cache or core microarchitecture
  - Handling huge amounts of data
- Intelligence
  - Summarizing / data mining → useful information (leading to the right decisions)
  - Models



#### Multidisciplinary top-down approach



#### Specfem3D: a "true" story



- Should I introduce asynchronous communication?






#### Load balance? Instructions and cache misses @ 192 processors

[Figure: trace windows for Specfem3D\_192.chop1.prv.gz: Instructions per processor, and L3 data cache misses per 1000 instructions]





### Sources of load imbalance



- Intrinsic to the algorithm:
  - Non-perfect partitioning
  - Some applications need dynamic load balancing: (e.g. molecular dynamics)
  - Several computational phases
  - Data-dependent access pattern
- Caused by resources:
  - Cache misses
  - Processor heterogeneity in a chip/board
  - OS noise/user daemons: in some computing nodes the OS or user daemons could delay the running process
  - Network topology and contention
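Whatever its source, imbalance is often quantified as the ratio of the average to the maximum per-process compute time. The sketch below uses that common metric; it is not necessarily the one used by the tools in this talk, and the timing values are hypothetical.

```python
def load_balance(times):
    """Average over maximum compute time; 1.0 means perfectly balanced."""
    return sum(times) / (len(times) * max(times))


# Hypothetical per-process compute times in seconds: one slow process.
compute_times = [10.0, 6.0, 6.0, 6.0]

balance = load_balance(compute_times)
idle_fraction = 1 - balance   # share of processor time spent waiting at the barrier

print(f"load balance:  {balance:.2f}")
print(f"idle fraction: {idle_fraction:.2f}")
```

Because every process waits for the slowest one at each synchronization point, the idle fraction is a direct estimate of the processor time lost to imbalance.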



## **SMT** priorities and load balancing

- Increase the priority of the threads that execute longer
- Assume a 4-process MPI application running on a POWER5
  - Further assume that P1 computes longer than P2, P3, P4
  - P1, P2 run on one core and P3, P4 on the other core
- Increasing throughput is not the solution to imbalance
- By increasing P1's priority, the application execution time decreases
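A toy model shows why shifting core resources towards the longer-running thread can shorten the run. Everything in it is an assumption made for illustration: the core's combined SMT throughput is normalized to 1, the long thread receives a fixed share of it while both threads run, and a thread running alone progresses at a lower `solo_rate`. It does not model how POWER5 actually implements thread priorities.

```python
def finish_time(work_long, work_short, share_long, solo_rate):
    """Time until both threads sharing one SMT core have finished.

    share_long: fraction of core throughput given to the long thread (0 < share < 1).
    solo_rate:  progress rate of a thread once it runs alone on the core.
    """
    t_long = work_long / share_long
    t_short = work_short / (1 - share_long)
    t_first = min(t_long, t_short)            # moment the first thread finishes
    remaining_long = work_long - share_long * t_first
    remaining_short = work_short - (1 - share_long) * t_first
    return t_first + max(remaining_long, remaining_short) / solo_rate


# Hypothetical workload: P1 has three times the work of P2 on the same core.
for share in (0.5, 0.75, 0.9):
    print(share, round(finish_time(3.0, 1.0, share, solo_rate=0.8), 3))
```

In this model the finish time is lowest when the shares match the work ratio, so both threads finish together; giving P1 too much priority simply moves the imbalance to P2.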



#### **QoS through shared resource management**

- Balance thread progress by managing the shared L2 cache



#### **Multidisciplinary top-down approach**









### **Power and money**






### **Talk outline**



- Supercomputing from the past
  - Architecture evolution
  - Applications and algorithms
- Supercomputing for the future
  - Technology trends
  - Multidisciplinary top-down approach

- Conclusions



#### **Key issues**





# **Education for Parallel Programming**



# Supercomputing and e-Science Consolider program



## **BSC-Microsoft Many-core Project**

- Programming models for future many-core architectures
- Architectural support to programming models
  - OpenMP+TM
  - HW acceleration for Haskell
- Many-core architecture
- Power-aware






#### An overall picture of the IBM MareIncognito project

- Our 10-100 Petaflop research project for BSC (2010)
- Port/develop applications to reduce time-to-production once installed
- Programming models (MPI, OpenMP, CellSuperScalar)
- Tools for application development and to support previous evaluations
- Evaluate node architecture (heavily multicored)
- Evaluate interconnect options



#### **Staff Evolution**

As of October 2007, BSC-CNS had 195 members from 23 different countries (Argentina, Belgium, Brazil, Bulgaria, Canada, Colombia, China, Cuba, France, Germany, India, Ireland, Italy, Jordan, Lebanon, Mexico, Poland, Russia, Serbia, Turkey, the United Kingdom, the United States and Spain).









# Thank you!


