ECE Masters Advisory Council Presents: CEREBRAS SYSTEMS Tech Talk

Career Presentation

Learning Language is Hard: How hard? What does it mean for future applications and hardware?

Language applications are emerging as the next major frontier of deep learning, but they require large systems to train state-of-the-art models. Why? I'll describe our prior large-scale empirical studies at Baidu [1]: as training set size increases, DL model generalization error and model size scale as particular power-law relationships (not entirely consistent with theoretical results). As model size grows, training time remains roughly constant: larger models require fewer steps to converge to the same accuracy. Given these scaling relationships, we can accurately predict the expected accuracy and training time for models trained on larger data sets [2]. Language applications, in particular, are hard to scale and will require upwards of 100x more compute than we currently use.
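To make the extrapolation idea concrete, here is a minimal sketch of how a power-law scaling fit works. The data points below are hypothetical, invented only for illustration; they are not from the studies cited in the abstract. The sketch fits generalization error as a power law of training set size (a linear fit in log-log space) and extrapolates to a larger dataset.

```python
import numpy as np

# Hypothetical (training set size, validation error) pairs, invented to
# follow a power-law trend like the one described in the abstract.
sizes = np.array([1e6, 2e6, 4e6, 8e6, 16e6])
errors = np.array([0.200, 0.165, 0.137, 0.113, 0.094])

# Fit error ~ a * size^(-b) via linear regression in log-log space.
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
a, b = np.exp(intercept), -slope

# Extrapolate the expected error for a 10x larger training set.
predicted = a * (160e6) ** (-b)
print(f"power-law exponent b = {b:.3f}")
print(f"predicted error at 160M examples = {predicted:.3f}")
```

On trends like this, each order of magnitude of additional data buys a predictable fractional reduction in error, which is what makes the compute requirements for future language models forecastable.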

Why are language applications so hard? Language is often a minimum description of structures/concepts, and these descriptions are usually context-sensitive! Current state-of-the-art models find “low-dimensional” relationships between words or concepts and compose them together, requiring deep hierarchy. Emerging applications require significantly longer prediction lengths, encouraging models that don’t just predict but also simulate.

[1] Hestness et al., "Deep Learning Scaling is Predictable, Empirically," arXiv 2017.
[2] Hestness, Ardalani, Diamos, "Beyond Human-Level Accuracy: Computational Challenges in Deep Learning," PPoPP 2019.

Joel Hestness is a Research Scientist at Cerebras Systems, an AI-focused hardware startup building the largest semiconductor ever made, with 1.2 trillion transistors. Joel helps formulate strategy to support machine learning researchers and practitioners in using the hardware, and leads some Natural Language Understanding research. Previously, Joel was a Research Scientist at Baidu's Silicon Valley AI Lab (SVAIL), where he worked on techniques to understand and scale out deep learning speech and language model training. Joel holds a PhD in computer architecture from the University of Wisconsin-Madison and has worked on applications in numerical methods, graph analytics, and machine/deep learning.

About the Company: Cerebras Systems is revolutionizing compute for Deep Learning. The Cerebras Wafer Scale Engine (WSE) is the world's largest chip: 56x larger than the biggest GPU ever made, with 78x more cores, 3,000x more on-chip memory, and 33,000x more bandwidth. Cerebras is building a team of pioneering software engineers, hardware engineers, architects, and deep learning researchers.


For More Information, Please Contact: