### Bridging the Gap Between Deep Learning Algorithms and Systems

**ABHINAV VISHNU** 

AUGUST 28<sup>TH</sup>, 2019

### A QUICK INTRODUCTION TO MACHINE LEARNING/DEEP LEARNING



2 SMOKY MOUNTAINS COMPUTATIONAL SCIENCES AND ENGINEERING CONFERENCE | AUGUST 28TH, 2019

#### TRENDS IN LABELED DATA AND MODEL LEARNING (TRAINING) TIME




#### TRAINING DEEP LEARNING ALGORITHMS ON HPC SYSTEMS



No communication during the feedforward (error calculation) step



Layer-wise all-to-all reduction (all-reduce) during the back-propagation (model learning) step



The ring algorithm is a popular way to implement this reduction in DL training
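
The ring algorithm named above can be sketched as a small in-process simulation. This is a minimal illustration, not the implementation of any particular library: the function name `ring_allreduce` and the list-of-lists representation of per-rank gradients are assumptions made for the example. It runs the two standard phases, a reduce-scatter followed by an all-gather, each taking `p - 1` steps for `p` ranks.

```python
def ring_allreduce(worker_grads):
    """Simulate a ring all-reduce (sum) over equal-length gradient lists.

    worker_grads[r] is the local gradient vector of rank r (hypothetical
    representation). Returns one fully reduced copy per rank.
    """
    p = len(worker_grads)
    n = len(worker_grads[0])
    # Work on copies so the caller's data is left untouched.
    bufs = [list(g) for g in worker_grads]
    # Split indices 0..n-1 into p contiguous chunks.
    bounds = [(c * n) // p for c in range(p + 1)]

    def chunk(c):
        c %= p
        return range(bounds[c], bounds[c + 1])

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) to
    # rank r + 1, which adds it to its own copy of that chunk.
    for s in range(p - 1):
        # Snapshot outgoing data first: all sends in a step are simultaneous.
        sends = [[bufs[r][i] for i in chunk(r - s)] for r in range(p)]
        for r in range(p):
            dst = (r + 1) % p
            for i, v in zip(chunk(r - s), sends[r]):
                bufs[dst][i] += v
    # Now rank r holds the complete sum for chunk (r + 1).

    # Phase 2: all-gather. In step s, rank r forwards the completed chunk
    # (r + 1 - s) to rank r + 1, which overwrites its stale copy.
    for s in range(p - 1):
        sends = [[bufs[r][i] for i in chunk(r + 1 - s)] for r in range(p)]
        for r in range(p):
            dst = (r + 1) % p
            for i, v in zip(chunk(r + 1 - s), sends[r]):
                bufs[dst][i] = v
    return bufs
```

Each rank sends `2(p - 1)` messages of roughly `n / p` elements, so the per-rank traffic is about `2n(p - 1)/p` elements and stays nearly constant as ranks are added, while the number of communication steps grows linearly with `p`.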


### THE MISMATCH BETWEEN SYSTEM DESIGNERS AND DATA SCIENTISTS



Hyperparameter optimizations

4. Work on algorithmic (hyperparameter) optimizations to improve accuracy

Epochs (number of passes over training set)

1. Wants large mini-batch size

2. Primary objective is higher compute efficiency given convergence constraints

3. Find the interplay between maximum mini-batch size and accuracy for the use cases; generalizability is not the focus

4. Work on holistic system design to enable the maximum mini-batch size

Machine learning models and system architecture/software need to be co-designed to help bridge the gap between data scientists and system designers

#### POTENTIAL SOLUTION: ADAPTIVE MINI-BATCHING



Three datasets, four networks

The error magnitude is computed by summing the errors of the samples in the training dataset

- However, only a handful of samples contribute to the error
  - For example, consider ResNet50 (after one epoch) in the adjacent figure
- Few samples have very high error, most samples have low error
  - The error curve becomes flatter with epochs
  - Low error samples contribute less to model learning
- A combination of large and small mini-batches may be created by epoch-wise analysis of the error/loss. An example is shown on the left
- Communication overhead is also reduced with adaptive mini-batching
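
The bullets above can be sketched in a few lines. This is a hedged illustration only: the function name `adaptive_minibatches`, the parameter `loss_threshold`, and the batch sizes are hypothetical and not taken from the talk. It assumes per-sample losses recorded during the previous epoch are available, puts high-error samples into small mini-batches, and puts low-error samples into large ones.

```python
def adaptive_minibatches(losses, loss_threshold=1.0,
                         small_batch=2, large_batch=8):
    """Group sample indices into mini-batches based on per-sample loss.

    losses[i] is the loss of sample i from the previous epoch. All
    parameter names and default values are hypothetical.
    """
    # Sort sample indices by non-increasing loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    # Assumed rule: a sample is "high error" if its loss reaches the threshold.
    high = [i for i in order if losses[i] >= loss_threshold]
    low = [i for i in order if losses[i] < loss_threshold]

    def split(indices, size):
        # Cut a list of indices into consecutive batches of at most `size`.
        return [indices[k:k + size] for k in range(0, len(indices), size)]

    # High-error samples get small mini-batches, low-error samples large ones.
    return split(high, small_batch) + split(low, large_batch)
```

With ten samples of which three are high-error, `small_batch=2` and `large_batch=4` yield four mini-batches instead of the five produced by a fixed batch size of 2. If each mini-batch triggers one round of gradient reduction, fewer mini-batches means fewer reductions per epoch.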

#### ACCELERATION USING ADAPTIVE PRECISION

- Split the samples into multiple buckets of different precision
- The buckets may be defined by sorting the samples by non-increasing error
  - A flatter loss implies that fewer bits may be enough to encode the weight updates in that bucket
  - Loss becomes flatter with increasing epochs
- Reset the precision if validation loss increases
  - Reduce the precision adaptively after the reset
  - This self-corrects problems caused by an overly aggressive reduction in precision
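
A minimal sketch of the bucketing and reset logic described above, under stated assumptions: the function names, the bit widths (32, 16, 8), the halving schedule, and the choice to keep the highest-error bucket at its current precision are all hypothetical and are not specified in the talk.

```python
def assign_bucket_bits(losses, bits_per_bucket):
    """Sort samples by non-increasing loss and split them into buckets.

    Returns a list of (sample_indices, bits) pairs, one per bucket, with
    the highest-error samples in the first bucket.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    k = len(bits_per_bucket)
    # Nearly equal-sized contiguous buckets over the sorted order.
    bounds = [(b * len(order)) // k for b in range(k + 1)]
    return [(order[bounds[b]:bounds[b + 1]], bits_per_bucket[b])
            for b in range(k)]


def next_precision(bits, val_loss, prev_val_loss, full_bits=32, min_bits=8):
    """Choose the per-bucket bit widths for the next epoch.

    Assumed policy: if validation loss increased, reset every bucket to
    full precision; otherwise halve every bucket except the first
    (highest-error) one, never going below min_bits.
    """
    if prev_val_loss is not None and val_loss > prev_val_loss:
        return [full_bits] * len(bits)
    return [bits[0]] + [max(min_bits, b // 2) for b in bits[1:]]
```

Starting from `[32, 16, 8]`, an epoch with improving validation loss gives `[32, 8, 8]`; a later increase in validation loss resets the schedule to `[32, 32, 32]`, after which the reduction resumes.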



#### **REDUCING COMMUNICATION CARDINALITY**





#### CONCLUSIONS

- Deep Learning (DL) algorithms are becoming popular as they leverage complex representations (such as raw input with images) in addition to extracted features
- HPC systems play an important role in reducing the time-to-solution for DL algorithms
- There is a widening gap in primary metrics of concern between a data scientist and a system designer
- We proposed approaches to bridge the gap:
  - Adaptive mini-batching: small mini-batches for high-error samples, and large mini-batches for low-error samples under the memory and compute constraints of the system
  - Adaptive precision (high precision for high-error samples), which matches well with the compute capabilities of today's systems
  - Reduced communication cardinality, to address the limitations of all-to-all reduction
- We hope to work with the scientific community to enhance these solutions and present results through publications and open source software

#### THANKS FOR LISTENING!! QUESTIONS?



## RADEON TECHNOLOGIES GROUP

Contact: **Abhinav Vishnu,** Abhinav.Vishnu@amd.com



#### **DISCLAIMER & ATTRIBUTION**

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

#### **ATTRIBUTION**

© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.