The 'AI Broadcast' by Arya.ai is a fortnightly event series where we explore the latest advancements, trends, and implications of AI. Each session focuses on a specific AI topic - Machine Learning, Generative AI, NLP, responsible AI, and much more!
In the second session of the series, we engage in a discussion with Vinay Kumar, Founder and CEO of Arya.ai, on key factors influencing the growth trend in AI hardware, the growth of GPUs, companies and products in the market and the future of AI hardware.
In May 2023, the Washington Post dubbed NVIDIA the 'kingmaker thanks to the AI boom', noting that it had become one of the most valuable companies in the world, with a market value surpassing $1 trillion. 'AI hardware' is now being seen as the beginning of a new market, with multiple startups joining the race.
Why is computing infrastructure important for AI?
If you look under the hood of ML models, it's essentially math, either solving a complex problem or trying to figure out new patterns. To execute that mathematics, you need compute, whether a CPU, a GPU, or custom hardware. As the calculations become more complex, you need more compute, and this also depends on the data: the more data you have and the more mathematics you have to do on that data, the more compute you need. For these models to work, you need highly efficient infrastructure; sometimes this is specially designed for AI, and sometimes it is general-purpose hardware repurposed for building these models. So, if data and algorithms are fundamental to AI, then hardware is the fundamental requirement for these models to run.
The Graphics Processing Unit (GPU) market has grown tremendously in recent years and is expected to exceed USD 200.78 billion by 2029. How did GPUs become important to AI?
People have been building and deploying models on traditional CPUs for a long time. GPUs gained traction primarily when researchers found that they could train complex neural network models on them. Nvidia introduced GPUs around 1999 as parallel computing systems primarily designed for gaming. A lot of parallel processing happens in gaming, which traditional CPUs cannot handle efficiently, and Nvidia's GPU architectures served that use case well. Around 2009, the early era of deep learning started when Rajat Raina, Anand Madhavan and Andrew Ng from Stanford published a paper in which they used GPUs to train a 4-layer neural network with 100 million parameters. They showed that training the model on GPUs was much faster than training it on CPUs. Around the same time, in 2010, CNNs were being used on the MNIST dataset.
In 2011, Yann LeCun and other researchers at New York University and Yale University benchmarked a Xilinx Virtex 6 FPGA against GPUs for computer vision tasks. They built a custom FPGA design for parallel processing and ran CNNs on it for computer vision tasks. The application was object segmentation in video, and deploying CNN models on these FPGAs proved much more effective than using a CPU alone. In 2012, Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton built a large CNN for the ImageNet image classification competition, famously known as the 'AlexNet' architecture. With this large model, they were able to scale the number of neurons in the CNN. When they trained the model on the ImageNet data, they beat all the benchmarks that year, and the approach kept winning in the following years, sparking a lot of interest in combining GPUs with deep learning. This marked the beginning of the deep learning era, and GPUs gained much traction for ‘AI’ use cases. It's fascinating because CNNs were first introduced by Yann LeCun in the late 1990s, and GPUs were introduced around the same time, but it took almost 10-12 years, until 2010-2012, for the two to come together and for CNNs to be deployable on GPUs.
GPUs have become the backbone of many state-of-the-art models today, including the recent GPT-3.5 and GPT-4, all of which are built on Nvidia hardware. They became the backbone of the industry, creating a new market for GPUs as well.
Which companies are working on AI hardware, and how are they building it?
Tech players with any kind of business around AI either already have, or are at least planning to build, their own custom silicon. This trend is not new. Around 2012, when there was a big boom around deep learning, many tech companies were continuously building larger models and using them for use cases beyond image classification, like speech recognition, spam prediction, and recommendation algorithms. The large tech players in particular realized that training and hosting these models would be quite expensive, which sparked their interest in building custom silicon. For example, Facebook (now Meta) open-sourced a hardware design called Big Sur around 2015 that uses Nvidia GPUs to speed up model training. While that was primarily about the hardware architecture, they are now investing in developing custom silicon under the name Meta Training and Inference Accelerator (MTIA).
Microsoft, being a cloud and search provider, has even more to gain from building custom silicon. They have been investing in custom hardware since around 2010. Around 2018, Microsoft launched hardware under Project Brainwave/Catapult, in which they deployed FPGA-accelerated deep neural networks (DNNs) for search and image processing. They claimed that DPUs (Deep learning Processing Units), which were modified FPGAs, gave them much better performance than GPUs. Because of the huge computing requirements of their LLMs, they are now looking at building custom hardware more seriously. This project is called 'Athena', and it focuses on building custom chips along with multiple industry partners like AMD, ARM etc.
AWS, being the leader in cloud computing, has been trying to build custom silicon for years. In 2019, they launched custom silicon for inferencing called AWS Inferentia, followed by AWS Trainium in 2020 for training. These are trying to compete with GPUs directly. But model support on this hardware is not yet complete; see the documentation on model support for AWS Trainium and AWS Inferentia2 (NeuronCore-v2). Some architectures, particularly LLMs, are supported, but many other models are still on the roadmap, which means the hardware can only support a limited number of models today. This is the journey of AWS building their custom hardware.
Among other tech players, Google has had more notable success with custom AI accelerators. Since they own TensorFlow, they have optimized their TPU (Tensor Processing Unit) architecture to support that framework. They started using TPUs internally in 2015 and released them to the public around 2018. Since then, there has been pretty good adoption of TPUs, with Google themselves being the biggest adopters; they are surely one of the biggest consumers of AI hardware today. Given the kind of services they offer, for example search, video processing, or images, it makes a lot of sense for them to continue improving and optimizing the hardware. TPUv4 is coming to GCP data centres soon. The biggest drawback of TPUs is the lack of retail availability.
Besides these, other tech players like Tesla, Apple etc. are building their custom silicon. Apple developed the M1 and M2 chips, moving away from typical Intel and NVIDIA hardware. In June this year, Apple introduced the M2 Ultra with a massive unified memory of up to 192GB shared by the CPU and GPU. Theoretically, this means M2 machines can train or run inference on larger models, but the training time is not comparable even with an RTX 4090.
Dedicated hardware manufacturers like Nvidia, Intel, AMD, ARM etc. are all trying either to catch up or to keep improving their architectures. Nvidia's recently announced H100 is a superhit; at least on paper, it's one of the most optimized pieces of hardware. If I'm not wrong, it is 4x better than the A100, which is already one of the best pieces of hardware for training large deep learning models. Last week, i.e., the second week of August 2023, at SIGGRAPH, they introduced the GH200 with a massive 282GB of memory running on the HBM3e standard. These are specialised for generative ‘AI’ requirements.
With all the hype around Gen ‘AI’, there is so much demand in the market for H100s. Their MSRP was supposed to be around $20,000, but they are selling for $40,000 to $50,000. This reminds me of the mayhem we had during the crypto-mining boom, when GPU prices were inflated by demand; after Ethereum moved from proof of work to proof of stake, GPU prices dropped by more than 50%. A similar demand-driven price surge is essentially what we are seeing now.
AMD has a respectable market share in retail, data centres and enterprises. They have released the MI250 and are coming up with the MI300 and MI300X next year, which are comparable with the H100.
The prominent tech players and hardware manufacturers are all getting into building AI-specific hardware. But the clear leader at this point is Nvidia, and they have been at it for a long time. The biggest differentiator for Nvidia is software support: nothing competitors like AMD, Cerebras, etc. offer is as mature as Nvidia CUDA. There are enough open-source algorithms available for it that we can quickly pick them up, optimize them, and run models at scale.
While we talked about these large organizations, there is also scope for startups to develop revolutionary architectures. For example, Cerebras, which started in 2015 and has raised close to $720 million, aims to build custom AI accelerators. They are building one of the biggest supercomputers for AI, 'CG-1', which links 64 Cerebras CS-2 systems together into a single, easy-to-use AI supercomputer with an AI training capacity of 4 exaFLOPs. Graphcore is another startup that has been building AI-specific hardware, or AI accelerators, for quite some time. But last year they ran into challenges when Microsoft dropped their IPUs.
The prominent tech players, hardware manufacturers, and startups are all trying to figure out new approaches because there is enormous demand for AI hardware today. This is clearly evident in the rise of Nvidia's stock price and in Nvidia becoming a trillion-dollar business. It has been quite a journey, driven by the substantial demand for these AI accelerators.
According to you, what's going to happen in the AI hardware space?
So far, we have seen a lot of innovation around the architecture itself, how the transistors are placed, and how they are built. To understand what's going to happen, let's try to understand the current challenges.
Moore's Law has been invalidated when it comes to GPU compute. These graphs show the comparison between the compute and the bandwidth:
While the compute has been increasing aggressively, the problem that the industry is facing is particularly around the bandwidth. Let's take LLMs as an example:
Typically, to serve a production customer, I should be able to serve the model at a rate of at least 33 tokens per second. To host a 1 trillion parameter model, even 8 H100s are not enough to produce that many tokens, because memory bandwidth has not grown anywhere near as fast as compute power. This is the biggest challenge today: as models become larger, it becomes quite expensive, and almost impossible, to host them for inferencing. There is much scope to see how this could be improved.
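A rough way to see why bandwidth is the limit: when generating one token at a time, a dense model has to read every weight from GPU memory for each token, so throughput is capped by aggregate memory bandwidth divided by model size in bytes. The sketch below is a back-of-the-envelope estimate, not a benchmark. It assumes FP16 weights (2 bytes per parameter), an even split of the model across GPUs, and a per-GPU bandwidth of about 3.35 TB/s for the H100 (an assumed spec figure); the function name is hypothetical, and the estimate ignores the KV cache and interconnect overhead.

```python
def max_tokens_per_second(num_params, bytes_per_param, num_gpus, bandwidth_bytes_per_s):
    """Upper bound on batch-size-1 decode speed for a dense model.

    Assumes every weight is read from GPU memory once per generated token
    and that the weights are sharded evenly across all GPUs.
    """
    model_bytes = num_params * bytes_per_param
    aggregate_bandwidth = num_gpus * bandwidth_bytes_per_s
    return aggregate_bandwidth / model_bytes


# Assumed inputs: 1 trillion parameters, FP16 weights, 8 GPUs at ~3.35 TB/s each.
bound = max_tokens_per_second(
    num_params=1e12,
    bytes_per_param=2,
    num_gpus=8,
    bandwidth_bytes_per_s=3.35e12,
)
print(f"Upper bound: {bound:.1f} tokens/s (target: 33 tokens/s)")
```

Under these assumptions the bound comes out well below the 33 tokens per second target, before even accounting for the fact that 2 TB of FP16 weights would not fit in 8 x 80GB of GPU memory.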
Even Cerebras' ~$2,500,000 wafer-scale chips only have 40GB of SRAM on the chip, so imagine how few models could fit into memory on these chips. Comparing Nvidia's 2016 P100 GPU to their 2022 H100 GPU, which is just starting to ship, there is a 5x increase in memory capacity (16GB -> 80GB) but a 46x increase in FP16 performance (21.2 TFLOPS -> 989.5 TFLOPS). This mismatch between compute growth and memory growth is the biggest problem today in terms of bandwidth and storage.
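The P100 and H100 figures quoted above can be turned into growth ratios with a few lines of arithmetic. This is only an illustration using the numbers from the text; the "TFLOPS per GB" ratio is a crude gauge introduced here for comparison, not an industry metric.

```python
# Figures quoted above for Nvidia's P100 (2016) and H100 (2022).
p100 = {"memory_gb": 16, "fp16_tflops": 21.2}
h100 = {"memory_gb": 80, "fp16_tflops": 989.5}

memory_growth = h100["memory_gb"] / p100["memory_gb"]
compute_growth = h100["fp16_tflops"] / p100["fp16_tflops"]

# Compute available per GB of on-board memory (illustrative gauge only).
p100_tflops_per_gb = p100["fp16_tflops"] / p100["memory_gb"]
h100_tflops_per_gb = h100["fp16_tflops"] / h100["memory_gb"]

print(f"Memory capacity grew {memory_growth:.1f}x")
print(f"FP16 compute grew {compute_growth:.1f}x")
print(f"TFLOPS per GB: {p100_tflops_per_gb:.2f} -> {h100_tflops_per_gb:.2f}")
```

Compute per gigabyte of memory grew roughly ninefold between the two generations, which is the imbalance the paragraph describes.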
Training cost and efficiency
Let's take a look at the training cost of LLMs:
Consider just the cost of training these models. Take MosaicML's GPT-30B model, with 30 billion parameters trained on 610 billion tokens: using the A100 as the reference GPU, it costs around $325,855 to train. Google LaMDA costs $368,846 and Google PaLM costs $6,750,000.
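To get a feel for where numbers of this size come from, a common rule of thumb is that training a dense transformer takes roughly 6 x parameters x tokens floating-point operations. The sketch below applies that rule to a 30 billion parameter model trained on 610 billion tokens. It is a minimal estimate under stated assumptions and is not meant to reproduce the exact figures above: the A100 peak of 312 TFLOPS (dense BF16) is a spec figure, while the 50% utilization and the $2 per GPU-hour price are hypothetical placeholders, as is the function name.

```python
def estimate_training_cost(num_params, num_tokens, peak_flops, mfu, usd_per_gpu_hour):
    """Estimate GPU-hours and cost with the ~6 * N * D FLOPs rule of thumb."""
    total_flops = 6 * num_params * num_tokens
    effective_flops_per_gpu = peak_flops * mfu  # sustained FLOP/s per GPU
    gpu_hours = total_flops / effective_flops_per_gpu / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour


gpu_hours, cost = estimate_training_cost(
    num_params=30e9,
    num_tokens=610e9,
    peak_flops=312e12,     # A100 peak dense BF16 throughput
    mfu=0.5,               # hypothetical utilization
    usd_per_gpu_hour=2.0,  # hypothetical cloud price
)
print(f"~{gpu_hours:,.0f} A100-hours, ~${cost:,.0f}")
```

The result lands in the same few-hundred-thousand-dollar range as the quoted figure, and it shows how sensitive the cost is to utilization and GPU pricing.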
Scaling these models:
Training a 1 trillion parameter model costs around $300 million and typically takes nearly three months. Likewise, a 10 trillion parameter model takes almost two years to train. This means there is an intrinsic limit on the model size that can realistically be trained today, probably somewhere between 1 trillion and 10 trillion parameters. Again, this is only possible for large, highly profitable companies.
We're not saying that the only way forward is to increase the capacity of the hardware. There are also inefficiencies in how the current hardware is used.
For example, MosaicML's stack can achieve over 70% hardware FLOPS utilization (HFU) and 53.3% model FLOPS utilization (MFU) on Nvidia's A100 GPUs for LLMs without requiring custom CUDA kernels. Nvidia's Megatron-LM stack only achieved 52.8% HFU and 51.4% MFU on a 175B parameter model. Likewise, Google's stack for the PaLM model on TPUv4 only achieved 57.8% HFU and 46.2% MFU. This is the biggest differentiator, and it is one of the reasons why MosaicML was acquired for more than a billion dollars. While the large players are focusing on building new hardware, there is also a need for a scalable and optimized software stack, and this is where a lot of differentiation is possible.
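For readers unfamiliar with the two utilization metrics, here is a minimal sketch of how they are commonly computed. MFU compares the FLOPs the model mathematically requires (about 6 x parameters per token) with the hardware's peak; HFU additionally counts FLOPs that are re-executed, for example when activations are recomputed to save memory. The function names, the throughput figure and the cluster size below are hypothetical; only the 312 TFLOPS A100 peak is a spec figure.

```python
def model_flops_utilization(tokens_per_second, num_params, num_gpus, peak_flops_per_gpu):
    """MFU: useful model FLOP/s divided by the cluster's peak FLOP/s."""
    model_flops_per_token = 6 * num_params  # forward + backward rule of thumb
    achieved_flops = tokens_per_second * model_flops_per_token
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


def hardware_flops_utilization(mfu, recompute_overhead):
    """HFU: like MFU, but also counts FLOPs that are executed again.

    recompute_overhead is the extra work as a fraction of model FLOPs,
    e.g. 1/3 when the whole forward pass is recomputed (8N instead of 6N).
    """
    return mfu * (1 + recompute_overhead)


# Hypothetical run: 30B parameters on 8 A100s at 7,000 tokens/s overall.
mfu = model_flops_utilization(7_000, 30e9, 8, 312e12)
print(f"MFU: {mfu:.1%}")
print(f"HFU with full recomputation: {hardware_flops_utilization(mfu, 1 / 3):.1%}")
```

As a sanity check on the relationship, an MFU of 53.3% with full activation recomputation corresponds to an HFU of roughly 71%, which is in line with the MosaicML figures quoted above.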
A recent blog post published by MosaicML describes the optimizations done on the AMD MI250: they were able to train models with zero code changes in PyTorch, completely out of the box, when moving from CUDA. This is interesting because it creates a lot of opportunity and scope to increase hardware efficiency while increasing the diversity of hardware options.
This will be important going forward: while the hardware players continue to introduce more custom AI accelerators, there is a lot of opportunity in optimizing the software stack. Besides MosaicML, OpenAI has a project called Triton, which aims at a similar optimization. Triton takes in Python directly or is fed through the PyTorch Inductor stack. Triton then converts the input to an LLVM intermediate representation and generates code. In the case of Nvidia GPUs, it directly generates PTX code, skipping Nvidia's closed-source CUDA libraries. This means an algorithm can be run with just a few simple lines of code. Currently, Triton only supports GPUs, but it is said that it might support other hardware as well. If that happens, players like MosaicML or OpenAI could open the door for other hardware players to come in, establish good benchmarks in the industry, and democratize AI hardware, which is currently concentrated in one player.
While this is the enterprise side of things, imagine mobile or edge use cases. Numerous opportunities exist to make these complex, large models deployable in small-scale utility products, like mobiles or cameras. These areas present substantial growth opportunities; the industry is surely working toward this, and a lot of demand can be generated from it.
How carbon positive are these?
That's the next debate that we are seeing these days.
Rough estimates suggest that training a GPT-3 model emits around 200 tonnes of CO2 when using data centres in developing markets, since they are not fully carbon positive from a data centre and energy maintenance perspective, compared to developed markets like Europe, where the data centres are highly energy efficient and carbon positive. This is equivalent to:
Today, multiple entities are building more complex LLMs with more tokens and parameters, which means they consume more than this benchmark to build a single model. This is a challenge because it has to be balanced not only by optimizing the hardware but also by how the data centres are maintained and how the CO2 emissions are offset. There will surely be discussions about the challenges this creates.
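The dependence of emissions on where a model is trained can be sketched with a simple formula: emissions = training energy x data centre overhead (PUE) x carbon intensity of the local grid. All inputs below are hypothetical placeholders chosen only to show the shape of the calculation; they are not measurements for GPT-3 or for any real data centre, and the function name is made up for this example.

```python
def training_emissions_tonnes(energy_mwh, pue, grid_kg_co2_per_kwh):
    """Estimate CO2 emissions (tonnes) for a training run.

    energy_mwh: energy drawn by the training hardware
    pue: power usage effectiveness of the data centre (>= 1.0)
    grid_kg_co2_per_kwh: carbon intensity of the electricity supply
    """
    facility_kwh = energy_mwh * 1000 * pue
    return facility_kwh * grid_kg_co2_per_kwh / 1000  # kg -> tonnes


# Hypothetical comparison: same training run, two different data centres.
efficient_site = training_emissions_tonnes(1000, pue=1.1, grid_kg_co2_per_kwh=0.2)
less_efficient_site = training_emissions_tonnes(1000, pue=1.6, grid_kg_co2_per_kwh=0.7)
print(f"Efficient, low-carbon site: {efficient_site:.0f} t CO2")
print(f"Less efficient, high-carbon site: {less_efficient_site:.0f} t CO2")
```

The same training run can differ severalfold in emissions depending on the site, which is why data centre maintenance and energy sourcing matter as much as hardware efficiency.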
What are your closing observations on AI accelerators?
The AI training requirement is only going to grow more aggressively. The paper 'Compute Trends Across Three Eras of Machine Learning' divides machine learning into three eras: the pre-deep learning era, the deep learning era, and the large-scale era. Training compute tracked Moore's Law in the pre-deep learning era, but that is entirely out of the picture once the deep learning era comes into play because it's compute-heavy.
Training compute trendline
In the deep learning era, which started around 2010, training compute requirements doubled every 5.7 months. For large-scale models, the doubling time is around 9.9 months, and over the coming decade this will probably go back to about 5 to 6 months. So the training compute requirement at least doubles every year, meaning people are looking to build more complex, larger models on larger data sets. Because we are producing more data than ever before in the history of humanity, we need more complex, larger models, and usage is also growing rapidly. So, computing is only going to grow massively.
It will be quite interesting to see the trends over the next 10-20 years, particularly from an AI hardware standpoint. As mentioned, enterprise and data centre hardware is just one dimension; IoT and mobile form a new dimension with huge potential demand. It will be interesting to see how the industry copes with this much demand and scales AI hardware while keeping everything in mind, from efficiency and responsibility to the environment and society to democratizing the hardware itself. It has to be democratized beyond one player, unlike what we see today; only then will we see equal participation and equal opportunity for everyone. These are the trends that I believe are going to happen. And, of course, environment-friendly hardware and data centres will become another key topic.
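The doubling times quoted above are easier to compare once converted into growth per year: a quantity that doubles every m months grows by a factor of 2^(12/m) annually. A small sketch of that conversion (the function name is hypothetical):

```python
def annual_growth_factor(doubling_time_months):
    """Growth per year for a quantity that doubles every given number of months."""
    return 2 ** (12 / doubling_time_months)


for era, months in [("Deep learning era", 5.7), ("Large-scale era", 9.9)]:
    factor = annual_growth_factor(months)
    print(f"{era}: doubles every {months} months -> about {factor:.1f}x per year")
```

A 5.7-month doubling time means training compute grows by more than 4x per year, and even the slower 9.9-month trend is more than 2x per year.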
- Large-scale Deep Unsupervised Learning using Graphics Processors: http://robotics.stanford.edu/~ang/papers/icml09-LargeScaleUnsupervisedDeepLearningGPU.pdf
- An FPGA-Based Stream Processor for Embedded Real-Time Vision with Convolutional Networks: http://yann.lecun.com/exdb/publis/pdf/farabet-ecv-09.pdf
- ImageNet Classification with Deep Convolutional Neural Networks: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
- Big Sur: A Closer Look at the Engine Powering Facebook’s AI: https://www.datacenterfrontier.com/cloud/article/11431168/big-sur-a-closer-look-at-the-engine-powering-facebook8217s-ai
- Project Catapult: https://www.microsoft.com/en-us/research/project/project-catapult/
- AWS launches its custom Inferentia inferencing chips: link
- AWS launches Trainium, its new custom ML training chip: https://techcrunch.com/2020/12/01/aws-launches-trainium-its-new-custom-ml-training-chip/
- Model Architecture Fit Guidelines: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html#model-architecture-fit
- Google AI Infrastructure Supremacy: Systems Matter More Than Microarchitecture: https://www.semianalysis.com/p/google-ai-infrastructure-supremacy
- Apple introduces M2 Ultra: https://www.apple.com/in/newsroom/2023/06/apple-introduces-m2-ultra/
- NVIDIA Unveils Next-Generation GH200 Grace Hopper Superchip Platform for Era of Accelerated Computing and Generative AI: https://nvidianews.nvidia.com/news/gh200-grace-hopper-superchip-with-hbm3e-memory
- AMD Hops On The Generative AI Bandwagon With Instinct MI300X: https://www.forbes.com/sites/tiriasresearch/2023/06/19/amd-hops-on-the-generative-ai-bandwagon-with-instinct-mi300x/?sh=77dc1d474d40
- Cerebras and G42 Unveil World's Largest Supercomputer for AI Training with 4 exaFLOPs to Fuel a New Era of Innovation: https://www.aninews.in/news/business/business/cerebras-and-g42-unveil-worlds-largest-supercomputer-for-ai-training-with-4-exaflops-to-fuel-a-new-era-of-innovation20230721103845/
- How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0: https://www.semianalysis.com/p/nvidiaopenaitritonpytorch
- The AI Brick Wall – A Practical Limit For Scaling Dense Transformer Models, and How GPT 4 Will Break Past It: https://www.semianalysis.com/p/the-ai-brick-wall-a-practical-limit
- A PyTorch Library for Efficient Neural Network Training: https://github.com/mosaicml/composer
- Training LLMs with AMD MI250 GPUs and MosaicML: https://www.mosaicml.com/blog/amd-mi250
- Carbon Footprint Of Training GPT-3 And Large Language Models: https://shrinkthatfootprint.com/carbon-footprint-of-training-gpt-3-and-large-language-models/
- Compute Trends Across Three Eras of Machine Learning: https://arxiv.org/abs/2202.05924