# **Deep Learning and HPC**

Bill Dally, Chief Scientist and SVP of Research

January 17, 2017



# A Decade of Scientific Computing with GPUs



#### **GPUs Enable Science**

# TITAN

- 18,688 NVIDIA Tesla K20X GPUs
- 27 Petaflops Peak: 90% of Performance from GPUs
- 17.59 Petaflops Sustained Performance on Linpack



# U.S. to Build Two Flagship Supercomputers

Pre-Exascale Systems Powered by the Tesla Platform



Summit & Sierra Supercomputers

- 100-300 PFLOPS Peak
- IBM POWER9 CPU + NVIDIA Volta GPU
- NVLink High Speed Interconnect
- 40 TFLOPS per Node, >3,400 Nodes
- 2017

#### **DGX SATURNV** World's Most Efficient AI Supercomputer





Fastest AI Supercomputer in TOP500

- 4.9 Petaflops Peak FP64 Performance
- 19.6 Petaflops DL FP16 Performance
- 124 NVIDIA DGX-1 Server Nodes



Most Energy Efficient Supercomputer

- #1 on Green500 List
- 9.5 GFLOPS per Watt
- 2x More Efficient than Xeon Phi System

#### 13 DGX-1 Servers in Top500

FACTOIDS

38 DGX-1 Servers for Petascale supercomputer

55x fewer servers, 12x less power vs CPU-only supercomputer of similar performance

#### **EXASCALE APPLICATIONS ON SATURNV**



Number of CPU Servers to Match Performance of SATURNV



S3D: Discovering New Fuel for Engines





SPECFEM3D: Simulating Earthquakes

QUDA version 0.9beta, using double-half mixed precision

DDalphaAMG using double-single

### Exascale System Sketch











### **GPUs Enable Deep Learning**

#### **GPUs + Data + DNNs**



# THE STAGE IS SET FOR THE AI REVOLUTION



2012: Deep Learning researchers worldwide discover GPUs

2015: ImageNet – Deep Learning achieves superhuman image recognition

2016: Microsoft's Deep Learning system achieves new milestone in speech recognition

### A New era of computing









#### AI & INTELLIGENT DEVICES

PC INTERNET

### Deep Learning Explodes at Google

- Android apps
- Drug discovery
- Gmail
- Image understanding
- Maps
- Natural language understanding
- Photos
- Robotics research
- Speech
- Translation
- YouTube



Jeff Dean's talk at TiECon, May 7, 2016

### **Deep Learning Everywhere**



#### INTERNET & CLOUD

Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation

#### MEDICINE & BIOLOGY

Cancer Cell Detection, Diabetic Grading, Drug Discovery

#### MEDIA & ENTERTAINMENT

Video Captioning, Video Search, Real Time Translation

#### SECURITY & DEFENSE

Face Detection, Video Surveillance, Satellite Imagery

#### AUTONOMOUS MACHINES

Pedestrian Detection, Lane Tracking, Recognize Traffic Sign


# Now "Superhuman" at Many Tasks

Speech recognition

Image classification and detection

Face recognition

Playing Atari games

Playing Go

### **Deep Learning Enables Science**

# Deep learning enables SCIENCE

#### NASA AMES



Classify Satellite Images for Carbon Monitoring





Determine Drug Treatments to Increase Child's Chance of Survival





Analyze Obituaries on the Web for Cancer-related Discoveries

ML Filters "events" from the ATLAS detector at the LHC

600M events/sec

Cranmer - NIPS 2016 Keynote

#### Using ML to Approximate Fluid Dynamics

"... Implementation led to a speed-up of one to three orders of magnitude compared to the state-of-the-art position-based fluid solver and runs in real-time for systems with up to 2 million particles"

"Data-driven Fluid Simulations using Regression Forests" http://people.inf.ethz.ch/ladickyl/fluid\_sigasia15.pdf

#### Fluid Simulation with CNNs



Tompson et al. "Accelerating Eulerian Fluid Simulation With Convolutional Networks," arXiv preprint, 2016

#### Using ML to Approximate Schrödinger Equation

"For larger training sets, N >= 1000, the accuracy of the ML model becomes competitive with mean-field electronic structure theory—at a fraction of the computational cost."

"Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning", Rupp et al., Physical Review Letters, 2012

### Deep Learning has an insatiable demand for computing performance

#### **GPUs enabled Deep Learning**



### **GPUs now Gate DL Progress**



### Pascal "5 Miracles" Boost Deep Learning 65X



# Pascal GP100



- 10 TeraFLOPS FP32
- 20 TeraFLOPS FP16
- 16GB HBM 750GB/s
- 300W TDP
- 67GFLOPS/W (FP16)
- 16nm process
- 160GB/s NVLink
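The efficiency bullet can be checked from the other bullets. The sketch below recomputes FP16 GFLOPS per watt from the quoted peak rate and TDP. It also derives a FLOPs-per-byte ratio from the FP32 peak and the HBM bandwidth; that ratio is not on the slide and is included only as an illustration. All variable and function names are hypothetical.

```python
# Figures copied from the GP100 bullets above; names are illustrative only.
fp32_flops = 10e12         # 10 TeraFLOPS FP32
fp16_flops = 20e12         # 20 TeraFLOPS FP16
hbm_bytes_per_s = 750e9    # 750 GB/s HBM bandwidth
tdp_watts = 300.0          # 300 W TDP


def gflops_per_watt(flops: float, watts: float) -> float:
    """Peak throughput divided by power, expressed in GFLOPS/W."""
    return flops / watts / 1e9


def flops_per_byte(flops: float, bytes_per_s: float) -> float:
    """Peak FLOPs available per byte of memory traffic (derived, not on the slide)."""
    return flops / bytes_per_s


fp16_efficiency = gflops_per_watt(fp16_flops, tdp_watts)
fp32_balance = flops_per_byte(fp32_flops, hbm_bytes_per_s)

print(f"FP16 efficiency: {fp16_efficiency:.1f} GFLOPS/W")
print(f"FP32 FLOPs per HBM byte: {fp32_balance:.1f}")
```

The first figure rounds to the 67 GFLOPS/W quoted in the list.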



#### **TESLA P4 & P40**

#### INFERENCING ACCELERATORS

Pascal Architecture | INT8

- P4: 50W | 40X Energy Efficient versus CPU
- P40: 250W | 40X Performance versus CPU



#### TensorRT

#### PERFORMANCE OPTIMIZING INFERENCING ENGINE

FP32, FP16, INT8 | Vertical & Horizontal Fusion | Auto-Tuning

VGG, GoogLeNet, ResNet, AlexNet & Custom Layers

Available Today: developer.nvidia.com/tensorrt

### **NVLINK enables scalability**

#### NVLINK - Enables Fast Interconnect, PGAS Memory







#### NVIDIA DGX-1 WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER



**170 TFLOPS**

- 8x Tesla P100 16GB
- NVLink Hybrid Cube Mesh
- **Optimized Deep Learning Software**
- Dual Xeon
- 7 TB SSD Deep Learning Cache
- Dual 10GbE, Quad IB 100Gb
- 3RU - 3200W



### "Billions of INTELLIGENT devices"



"Billions of intelligent devices will take advantage of DNNs to provide personalization and localization as GPUs become faster and faster over the next several years."

– Tractica



#### JETSON TX1 EMBEDDED AI SUPERCOMPUTER

10W | 1 TF FP16 | >20 images/sec/W



#### INTRODUCING XAVIER AI SUPERCOMPUTER SOC

- 7 Billion Transistors 16nm FF
- 8 Core Custom ARM64 CPU
- 512 Core Volta GPU
- New Computer Vision Accelerator
- Dual 8K HDR Video Processors
- Designed for ASIL C Functional Safety

20 TOPS DL | 160 SPECINT | 20W

# AI TRANSPORTATION – \$10T INDUSTRY



### **NVIDIA DRIVE PX 2**

#### AutoCruise to Full Autonomy – One Architecture



#### ANNOUNCING Driveworks alpha 1 OS FOR SELF-DRIVING CARS



### **NVIDIA BB8 AI CAR**



### NVIDIA AI Self-Driving Cars in Development



Baidu

nuTonomy

Volvo

TomTom

WEpods



#### AI Pioneers Pushing State-of-the-Art



Reasoning, Attention, Memory - Long-term memory for NN

End-to-end training for autonomous flight and driving

Generic agents - Understand and predict behavior

RNN for long-term dependencies & multiple time scales

Unsupervised Learning – Generative Models

Deep reinforcement learning for autonomous AI agents

Reinforcement learning - Hierarchical and multi-agent

Semantic 3D reconstruction



Yasuo Kuniyoshi

Professor, School of Info Sci & Tech

Director, AI Center (Next Generation Intelligence Science Research Center)

The University of Tokyo

### Challenge: Provide Continued Performance Improvement

### But Moore's Law is Over



Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted line extrapolations by C. Moore.

C. Moore, Data Processing in Exascale-Class Computer Systems, Salishan, April 2011


#### It's not about the FLOPs

DFMA: 0.01mm<sup>2</sup>, 10pJ/OP, 2GFLOPS

A chip with 10<sup>4</sup> FPUs: 100mm<sup>2</sup>, 200W, 20TFLOPS

Pack 50,000 of these in racks: 1EFLOPS, 10MW

16nm chip, 10mm on a side, 200W
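The arithmetic behind this slide can be reproduced in a few lines. The sketch assumes the 10 pJ figure is charged per floating-point operation counted in the 2 GFLOPS rate, which is the reading that makes the slide's 200 W and 10 MW totals come out. Variable names are hypothetical.

```python
import math

# Per-unit figures from the slide (one double-precision FMA unit).
dfma_area_mm2 = 0.01      # silicon area of one DFMA unit
dfma_energy_j = 10e-12    # 10 pJ per operation (assumed: per FLOP counted below)
dfma_rate_flops = 2e9     # 2 GFLOPS per unit

fpus_per_chip = 10**4
chips_in_system = 50_000

# Scale one unit up to a chip.
chip_area_mm2 = fpus_per_chip * dfma_area_mm2    # arithmetic area only
chip_flops = fpus_per_chip * dfma_rate_flops
chip_watts = chip_flops * dfma_energy_j          # energy/op * ops/s = watts

# Scale one chip up to a system.
system_flops = chips_in_system * chip_flops
system_watts = chips_in_system * chip_watts

print(f"chip:   {chip_area_mm2:.0f} mm^2, {chip_flops / 1e12:.0f} TFLOPS, {chip_watts:.0f} W")
print(f"system: {system_flops / 1e18:.0f} EFLOPS, {system_watts / 1e6:.0f} MW")
```

These totals count arithmetic alone, which is the point of the slide title: raw FLOPs fit within the budget, so the remaining cost has to be looked for elsewhere.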



### **Overhead**

Locality



**CPU** 126 pJ/flop (SP)

Optimized for Latency

Deep Cache Hierarchy

Broadwell E5 v4, 14 nm

**GPU** 28 pJ/flop (SP)

Optimized for Throughput

Explicit Management of On-chip Memory

Pascal, 16 nm



#### Fixed-Function Logic is Even More Efficient

|                    | Energy/Op |
|--------------------|-----------|
| CPU (scalar)       | 1.7nJ     |
| GPU                | 30pJ      |
| **Fixed-Function** | 3pJ       |
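To make the table concrete, the sketch below converts each energy-per-operation figure into the power needed to sustain a fixed operation rate, and into a ratio against fixed-function logic. The rate of 10^12 operations per second is a hypothetical value chosen for illustration; it does not appear on the slide.

```python
# Energy per operation from the table above, in joules.
energy_per_op = {
    "CPU (scalar)": 1.7e-9,
    "GPU": 30e-12,
    "Fixed-Function": 3e-12,
}

ops_per_second = 1e12  # hypothetical sustained rate, for illustration only


def watts(joules_per_op: float, rate: float) -> float:
    """Power drawn when performing `rate` operations per second."""
    return joules_per_op * rate


power = {name: watts(e, ops_per_second) for name, e in energy_per_op.items()}

baseline = energy_per_op["Fixed-Function"]
relative = {name: e / baseline for name, e in energy_per_op.items()}

for name in energy_per_op:
    print(f"{name:15s} {power[name]:8.1f} W  {relative[name]:6.1f}x fixed-function")
```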



### How is Power Spent in a CPU?



Dally [2008] (Embedded in-order CPU)

Natarajan [2003] (Alpha 21264)












# Simpler Cores = Energy Efficiency





#### Payload Arithmetic 15pJ

#### Overhead 15pJ



#### **Communication Dominates Arithmetic**





# **Energy Shopping List**

| Processor Technology         | 40 nm  | 10 nm  |
|------------------------------|--------|--------|
| Vdd (nominal)                | 0.9 V  | 0.7 V  |
| DFMA energy                  | 50 pJ  | 7.6 pJ |
| 64b 8 KB SRAM Rd             | 14 pJ  | 2.1 pJ |
| Wire energy (256 bits, 10mm) | 310 pJ | 174 pJ |

| Memory Technology            | 45 nm        | 16 nm      |
|------------------------------|--------------|------------|
| DRAM interface pin bandwidth | 4 Gbps       | 50 Gbps    |
| DRAM interface energy        | 20-30 pJ/bit | 2 pJ/bit   |
| DRAM access energy           | 8-15 pJ/bit  | 2.5 pJ/bit |



Keckler [Micro 2011], Vogelsang [Micro 2010]
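The processor table can be read as a ratio: how many DFMA operations cost the same energy as moving 256 bits across 10mm of wire. The sketch below computes that ratio for both process nodes, along with how much each quantity shrinks between nodes. The dictionary layout and function names are hypothetical; the numbers are the table's.

```python
# Energies in picojoules, copied from the processor technology table.
energy_pj = {
    "40nm": {"dfma": 50.0, "sram_read": 14.0, "wire_256b_10mm": 310.0},
    "10nm": {"dfma": 7.6, "sram_read": 2.1, "wire_256b_10mm": 174.0},
}


def wire_to_dfma_ratio(node: str) -> float:
    """DFMA operations that cost as much as one 256-bit, 10mm wire transfer."""
    entry = energy_pj[node]
    return entry["wire_256b_10mm"] / entry["dfma"]


def scaling(quantity: str) -> float:
    """Factor by which a quantity's energy shrinks from 40nm to 10nm."""
    return energy_pj["40nm"][quantity] / energy_pj["10nm"][quantity]


ratio_40 = wire_to_dfma_ratio("40nm")
ratio_10 = wire_to_dfma_ratio("10nm")

print(f"40nm: one wire transfer = {ratio_40:.1f} DFMAs")
print(f"10nm: one wire transfer = {ratio_10:.1f} DFMAs")
print(f"DFMA energy shrinks {scaling('dfma'):.1f}x, wire energy {scaling('wire_256b_10mm'):.1f}x")
```

Arithmetic energy falls faster than wire energy between the two nodes, so the relative cost of communication grows.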







### **GRS Test Chips**







Eye Diagram from Probe



**Test Chip #2 fabricated on production GPU** 

Poulton et al. ISSCC 2013, JSSC Dec 2013



#### **Efficient Machines**

- Are Highly Parallel
- Have Deep Storage Hierarchies
- Have Heterogeneous Processors

# **Target Independent Programming**

#### Programmers, tools, and architecture need to play their positions





# Legion Programming Model

Separating program logic from machine mapping



# The Legion Data Model: Logical Regions

Main idea: logical regions

- Describe data abstractly
- Relational data model
- No implied layout
- No implied placement

Sophisticated partitioning mechanism

- Multiple views onto data

Capture important data properties

- Locality
- Independence/aliasing
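The ideas in this list can be illustrated with a toy model. The sketch below is not the Legion API: `Region`, `partition`, and `is_disjoint` are hypothetical names invented here. It shows a region described only by row indices and field names (no layout, no placement), two different partitions acting as views onto the same data, and a check that distinguishes independent subregions from aliased ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Toy logical region: which rows and fields exist, not where they live."""
    rows: frozenset
    fields: frozenset


def partition(region: Region, pieces) -> list:
    """Build subregions (views) from iterables of row indices."""
    return [Region(frozenset(p) & region.rows, region.fields) for p in pieces]


def is_disjoint(subregions) -> bool:
    """True if no row belongs to two subregions, i.e. they are independent."""
    seen = set()
    for sub in subregions:
        if seen & sub.rows:
            return False
        seen |= sub.rows
    return True


cells = Region(frozenset(range(8)), frozenset({"temperature", "pressure"}))

# Two views onto the same data: one without overlap, one with overlap.
owned = partition(cells, [range(0, 4), range(4, 8)])
with_halo = partition(cells, [range(0, 5), range(3, 8)])

print("owned partition independent:", is_disjoint(owned))
print("halo partition independent: ", is_disjoint(with_halo))
```

A runtime that knows a partition is disjoint can let work on its pieces proceed concurrently; the overlapping view signals aliasing that has to be ordered.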



# The Legion Programming Model

Computations expressed as tasks

- Declare logical region usage
- Declare field usage
- Describe privileges: read-only, read-write, reduce

Tasks specified in sequential order

Legion infers implicit parallelism

Programs are machine-independent

- Tasks decouple computation
- Logical regions decouple data
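How declared privileges let a runtime infer parallelism from a sequential task list can be sketched with a toy dependence analysis. This is not the Legion API or its actual algorithm: `Use`, `Task`, `conflicts`, and `dependences` are hypothetical names, regions are plain strings with no partitioning or aliasing, and two `reduce` uses are assumed to apply the same reduction operator.

```python
from dataclasses import dataclass

READ_ONLY, READ_WRITE, REDUCE = "read-only", "read-write", "reduce"


@dataclass(frozen=True)
class Use:
    region: str      # logical region name (toy: no subregions)
    field: str       # field within the region
    privilege: str   # one of the three privileges above


@dataclass(frozen=True)
class Task:
    name: str
    uses: tuple


def conflicts(a: Use, b: Use) -> bool:
    """Two uses must be ordered unless they touch different data or are compatible."""
    if (a.region, a.field) != (b.region, b.field):
        return False
    if a.privilege == READ_ONLY and b.privilege == READ_ONLY:
        return False
    if a.privilege == REDUCE and b.privilege == REDUCE:
        return False  # assumption: both apply the same reduction operator
    return True


def dependences(tasks) -> dict:
    """For each task, the earlier tasks (in program order) it must wait for."""
    deps = {t.name: set() for t in tasks}
    for i, later in enumerate(tasks):
        for earlier in tasks[:i]:
            if any(conflicts(u, v) for u in earlier.uses for v in later.uses):
                deps[later.name].add(earlier.name)
    return deps


# Tasks are listed in sequential order, as on the slide.
program = [
    Task("init_a", (Use("A", "x", READ_WRITE),)),
    Task("init_b", (Use("B", "x", READ_WRITE),)),
    Task("sum_ab", (Use("A", "x", READ_ONLY), Use("B", "x", READ_ONLY),
                    Use("C", "x", READ_WRITE))),
    Task("norm_a", (Use("A", "x", READ_ONLY),)),
]

deps = dependences(program)
for name, waits_for in deps.items():
    print(f"{name} waits for {sorted(waits_for)}")
```

In this toy program `init_a` and `init_b` share no data and can run concurrently, and `norm_a` does not need to wait for `sum_ab` because both only read region `A`.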





# Legion Runtime System



# Evaluation with a Real App: S3D

- Evaluation with a production-grade combustion simulation
- Ported more than 100K lines of MPI Fortran to Legion C++
- Legion enabled new chemistry: Primary Reference Fuel (PRF) mechanism
- Ran on two of the world's top 10 supercomputers for 1 month
  - Titan (#2) and Piz-Daint (#10)



# Performance Results: Original S3D

Weak scaling compared to vectorized MPI Fortran version of S3D



Achieved up to 6X speedup

Titan

Piz-Daint

# Performance Results: OpenACC S3D

Also compared against experimental MPI+OpenACC version

Achieved 1.73 - 2.85X speedup on Titan

Why? Humans are really bad at scheduling complicated applications





# HPC <-> Deep Learning

- HPC has enabled Deep Learning
  - Concepts developed in the 1980s; GPUs provided the needed performance
  - Superhuman performance on many tasks: classification, Go, ...
  - Enabling intelligent devices including cars
- Deep Learning enables HPC
  - Extracting meaning from data
  - Replacing models with recognition
- HPC and Deep Learning both need more performance, but Moore's Law is over
  - Reduced overhead
  - Efficient communication
- Resulting machines are parallel with deep memory hierarchies
  - Target-Independent Programming

