### Arm in HPC

### Jonathan Beard, Staff Research Engineer

© 2019 Arm Limited

### Arm HPC Ecosystem

- Arm IP
  - Neoverse IP roadmap
  - SVE: Scalable Vector ISA Extension
  - Arm v8.x ISA


# With a growing software ecosystem


### Arm HPC Ecosystem

| Arm IP | Si Partners | Integrators | Deployments |
|--------|-------------|-------------|-------------|
| Neoverse IP roadmap<br>SVE (Scalable Vector ISA Extension)<br>Arm v8.x ISA | | Hewlett Packard Enterprise<br>Atos | Sandia National Laboratories<br>University of Bristol<br>Stony Brook University |



### **Recent Announcements**

"Per aspera ad astra"



### Vanguard Astra by HPE: World's Most Powerful Arm Supercomputer

### Arm Neoverse IP




### Edge to data center




Infrastructure Will Be Designed from the Edge In

#### **arm** Neoverse

# Potential long-term research directions

(The more fun stuff, for me at least)










- Each of these likely has a core capable of running a container
- Sometimes similar security concerns (can we abstract this?)
- Can we make processing look the same in each of these devices? Or at least as easy to use?
- The goal is the same in most cases: reduce data movement.

### The magic 8-ball says

Adapted/modified from original figure courtesy of Dilip Vasudevan (LBL)



### Making extreme heterogeneity a reality





1: Software adoption (or the lack of it) kills most novel accelerators

 Standardizing on the interface layer would be useful. Can we do this?

2: Efficiency of data movement

logic is cheap(-ish), data movement is

3: Coherence scales only so far. Dataflow graph execution works well; can we virtualize it to be transport-medium agnostic?

4: Virtualization and translation for accelerators are an afterthought at the moment; can we do more?
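As a rough illustration of point 1, a standardized interface layer lets application code stay unchanged while the device behind it changes. The sketch below is hypothetical: the `Accelerator` protocol, the `scale2` kernel name, and both backend classes are invented for illustration and do not correspond to any Arm or vendor API.

```python
from typing import Protocol, Sequence


class Accelerator(Protocol):
    """Hypothetical common interface layer (an assumption, not a real API)."""

    name: str

    def run(self, kernel: str, data: Sequence[float]) -> list[float]:
        ...


class CpuFallback:
    """Reference backend: executes the kernel on the host CPU."""

    name = "cpu"

    def run(self, kernel: str, data: Sequence[float]) -> list[float]:
        if kernel == "scale2":
            return [2.0 * x for x in data]
        raise ValueError(f"unknown kernel: {kernel}")


class FakeVectorUnit:
    """Stand-in for a novel accelerator: same contract, different internals."""

    name = "fake-vector-unit"

    def run(self, kernel: str, data: Sequence[float]) -> list[float]:
        if kernel == "scale2":
            return [x + x for x in data]
        raise ValueError(f"unknown kernel: {kernel}")


def offload(device: Accelerator, kernel: str, data: Sequence[float]) -> list[float]:
    """Application code depends only on the interface, never on the device."""
    return device.run(kernel, data)
```

Swapping `CpuFallback()` for `FakeVectorUnit()` in a call to `offload` changes nothing on the software side, which is the property the slide asks for.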


## Enabling an ecosystem

Unlock innovation on both sides of the interface! Minimize software disruption, maximize innovation pace.




- Reduce latency to initiate a heterogeneous-parallel task
- Decrease communications overhead (reduce state transference)
- Increase data locality
- Reduce programmer effort for heterogeneous systems
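To make the first two goals concrete, here is a minimal back-of-the-envelope cost model. It is a sketch under stated assumptions, not a model from the talk: it assumes offload time is simply initiation latency plus transfer time plus accelerator compute time, with no overlap between them, and all parameter names are hypothetical.

```python
def offload_time_s(init_latency_s: float,
                   bytes_moved: float,
                   bandwidth_bytes_per_s: float,
                   accel_compute_s: float) -> float:
    """Assumed additive model: initiate the task, move the data, then compute."""
    return init_latency_s + bytes_moved / bandwidth_bytes_per_s + accel_compute_s


def offload_pays_off(cpu_compute_s: float,
                     init_latency_s: float,
                     bytes_moved: float,
                     bandwidth_bytes_per_s: float,
                     accel_compute_s: float) -> bool:
    """Offloading only helps if the total offload time beats staying on the CPU."""
    total = offload_time_s(init_latency_s, bytes_moved,
                           bandwidth_bytes_per_s, accel_compute_s)
    return total < cpu_compute_s
```

With made-up numbers (10 µs initiation, 1 MiB moved at 16 GB/s, 20 µs on the accelerator) the offload costs about 95.5 µs, so it beats a 200 µs CPU kernel but loses to a 50 µs one. Lowering initiation latency or the bytes moved shifts that break-even point, which is why both appear in the list above.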



### Communications Cost (latency)



### Enabling scalability and heterogeneity



- Data in cache lines is often underutilized (0-80%) before eviction; median ~42%.
- Synchronization takes extra cycles (~30 ns between cores)
- Presented differently based on where you are: OoO Core, GPGPU, FPGA, etc.




- Dense transfers of packed cache lines
- Keep transfers inside coherence bus, instead of memory write-back
- Can we virtualize the underlying mechanism?
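The benefit of packing can be estimated with a small calculation. The sketch below assumes a 64-byte cache line and non-overlapping accesses; the function names are hypothetical and the numbers are illustrative, not measurements from the talk.

```python
LINE_BYTES = 64  # assumed cache-line size in bytes


def lines_touched(offsets, size):
    """Return the set of cache-line indices touched by `size`-byte accesses."""
    lines = set()
    for off in offsets:
        first = off // LINE_BYTES
        last = (off + size - 1) // LINE_BYTES
        lines.update(range(first, last + 1))
    return lines


def utilization(offsets, size):
    """Useful bytes divided by bytes moved (assumes accesses do not overlap)."""
    useful = len(offsets) * size
    moved = len(lines_touched(offsets, size)) * LINE_BYTES
    return useful / moved


def packed_offsets(count, size):
    """Offsets of the same elements after packing them back to back."""
    return [i * size for i in range(count)]


# Sixteen 8-byte fields, one from each 32-byte record.
strided = [i * 32 for i in range(16)]
packed = packed_offsets(16, 8)
```

For this example `utilization(strided, 8)` is 0.25 across 8 lines, while `utilization(packed, 8)` is 1.0 across 2 lines, so a dense transfer moves a quarter of the data for the same useful payload.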




[Figure: queueing between CPUs and accelerators (Acc 1, Acc 2, Acc 3) with accelerator push and stash/pop through the L1 cache; devices labeled CMOS and NVM]


### Accelerator Rich Systems



- Applications of accelerators:
  - Data-layout transform (reduce data movement)
  - Matrix-ops
  - Neural
  - Near-data
  - Edge
  - Etc.
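As one concrete instance of a data-layout transform, the sketch below converts an array of structures (one tuple per record) into a structure of arrays (one list per field) and back. This is a generic software illustration with hypothetical function names; the talk does not specify how an accelerator would implement the transform.

```python
def aos_to_soa(records, fields):
    """Array of structures to structure of arrays: one contiguous list per field."""
    return {name: [rec[i] for rec in records] for i, name in enumerate(fields)}


def soa_to_aos(columns, fields):
    """Inverse transform: zip the per-field lists back into records."""
    return list(zip(*(columns[name] for name in fields)))


fields = ("x", "y", "z")
records = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
columns = aos_to_soa(records, fields)
```

After the transform, a kernel that only needs `x` reads `columns["x"]` contiguously instead of striding over every record, which is where the reduction in data movement comes from.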

[Figure: programmable gather/scatter unit with accelerators Acc 1 and Acc 2 (CMOS)]

### DAG Execution

Programmer / Runtime / Compiler Provided Kernels
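A minimal sketch of DAG execution, assuming kernels are plain Python callables and dependencies are given as a mapping from each kernel to its ordered predecessors. The `run_dag` helper and the kernel names are hypothetical. Scheduling uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+), and a real runtime would dispatch ready kernels to different devices rather than run them sequentially.

```python
from graphlib import TopologicalSorter


def run_dag(kernels, deps):
    """Run each kernel after its predecessors, passing their results as arguments.

    kernels: name -> callable
    deps:    name -> ordered list of predecessor names (missing means no inputs)
    """
    graph = {name: deps.get(name, ()) for name in kernels}
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        args = [results[d] for d in deps.get(name, ())]
        results[name] = kernels[name](*args)
    return results


# Hypothetical kernels: one source, two independent transforms, one reduction.
kernels = {
    "load": lambda: [1, 2, 3, 4],
    "square": lambda xs: [x * x for x in xs],
    "negate": lambda xs: [-x for x in xs],
    "sum": lambda a, b: sum(a) + sum(b),
}
deps = {"square": ["load"], "negate": ["load"], "sum": ["square", "negate"]}
results = run_dag(kernels, deps)
```

Here `square` and `negate` depend only on `load`, so a runtime is free to place them on different accelerators. The graph, not the program order, carries that information.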






### **Productive Accelerator Rich Systems**



### Productivity from Edge to HPC




## Research summary / parting thoughts

- Scientists should be able to focus on the science and programmers on the algorithm, yet this is often at odds with reducing time to solution.
- Is *edge* really just an extended form of near-data processing?
- The *edge* complicates an already complicated world for application programmers.
- Same old problems, still no "sticky" solutions...yet.

# **Closing Comments**




-----

# Arm HPC Community – Arm.com/hpc

**Communication Portals**

- Arm.com HPC resources
- developer.arm.com/HPC (HPC Ecosystem Landing page)
- community.arm.com/tools/HPC (HPC Blogs, Forum)

**Arm HPC User Group Community**

- Gitlab HPC Packages Wiki (software ecosystem)
- Arm-HPC @ Groups.IO (<=NEW)


Supporting Arm HPC Community end-users and developers.



### Who you gonna call? Arm Professional Services!





- Application performance engineering!
  - Compiler not vectorizing?
  - Performance not what it should be?
- New system tuning!
  - Do you have the right SMT mode on your ThunderX2?
  - Does your InfiniBand need tweaking?
- Hackathons and tutorials!
  - Does a team need a mentor at your hackathon?
  - Looking for a jumpstart with Arm HPC?





# **arm NEOVERSE**: The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices

