Mosharaf Chowdhury receives Google Research Scholar award for research on resilient deep learning

Chowdhury is working to develop new fault-tolerant techniques to enable deep neural networks to continue training even when failures occur

Mosharaf Chowdhury, associate professor of computer science and engineering, has received an award from the Google Research Scholar Program in support of his project, “Fault-Tolerant Distributed Execution of Large DNNs.” Through this project, he aims to develop novel fault-tolerant parallelism strategies that allow the training of deep neural network (DNN) models to continue even when failures occur.

DNNs, which are loosely modeled on the way the human brain processes information, have become a fixture of the machine learning world. As DNN models see increasing use in data-driven applications, they have grown larger and more elaborate, and they are now trained on large datasets across distributed clusters of graphics processing units (GPUs). With a larger number of GPUs, however, comes an increased risk of failure. Moreover, the synchronous structure of distributed training means that when one GPU fails, all other GPUs in the cluster sit idle until the failure is resolved, wasting precious time and resources.
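
To give a rough sense of why cluster size matters, the sketch below estimates the chance that at least one GPU fails during a training window, and the GPU-hours lost while the rest of the cluster waits. It is an illustration only, not part of Chowdhury's project: the function names are hypothetical, and it assumes that GPUs fail independently with the same probability and that every healthy GPU idles for the full recovery time.

```python
def prob_any_failure(num_gpus: int, per_gpu_failure_prob: float) -> float:
    """Chance that at least one of num_gpus fails in a given window.

    Assumption: failures are independent and equally likely on every GPU.
    """
    return 1.0 - (1.0 - per_gpu_failure_prob) ** num_gpus


def idle_gpu_hours(num_gpus: int, recovery_hours: float) -> float:
    """GPU-hours lost while the healthy GPUs wait for one failed GPU.

    Assumption: synchronous training, so all other GPUs idle until recovery.
    """
    return (num_gpus - 1) * recovery_hours


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    for n in (10, 100, 1000):
        print(n, prob_any_failure(n, 0.001), idle_gpu_hours(n, 0.5))
```

Under these assumptions, a per-GPU failure chance of 0.1% gives roughly a 63% chance of at least one failure across 1,000 GPUs, and the idle time grows in proportion to the cluster size.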

The Google Research Scholar Program aims to support world-class, cutting-edge research by professors in computer science and related fields. With the support of this award, Chowdhury’s research aims to improve the speed, robustness, and efficiency of DNN training, allowing these models to continue learning with strong scaling despite isolated GPU failures. This will enable DNNs to be applied to larger datasets and more diverse use cases, broadening the scope of their application in data analysis and beyond.