Artificial intelligence (AI) workloads differ from those the enterprise has seen before. They range from compute-intensive training to day-to-day inferencing and retrieval-augmented generation (RAG) lookups that barely tickle CPU and storage input/output (I/O).
So, across the various types of AI workload, the I/O profile and the impact on storage can vary dramatically.
In this second of a two-part series, we talk to Nvidia vice-president and general manager of DGX Systems Charlie Boyle about the demands of checkpointing in AI, the roles of storage performance markers such as throughput and access speed in AI work, and the storage attributes required for different types of AI workload.
We pick up the discussion following the chat in the first article about the key challenges in data for AI projects, practical tips for customers setting out on AI, and differences across AI workload types such as training, fine-tuning, inference, RAG and checkpointing.
Antony Adshead: Is there a kind of standard ratio of checkpoint writes to the volume of the training model?
Charlie Boyle: There is. As we engage with customers on their own models and training, we do have averages. Because we’ll know how long it should take for the size of a model and the number of compute elements that you have. And then we talk to customers about risk tolerance.
Some of our researchers checkpoint every hour. Some checkpoint once a day. It depends on what they expect and the amount of time that it takes to checkpoint.
And there is the amount of time it takes to recover from a checkpoint as well. Because you could say, ‘OK, I’ve been checkpointing once a day. And somewhere between day four and day five, I had a problem.’
You may not know you had a problem until day six because the job didn’t die, but you’re looking at the results and something’s weird. And so you actually have to go back a couple of days to that point.
Then it’s about, ‘How quickly do I notice there is a problem versus how far do I want to go back in a checkpoint?’ But we’ve got data because we do these massive training runs – everything from a training run that lasts a few minutes to something that lasts almost a year.
We’ve got all that data and can help customers hit that right balance. There are emerging technologies we’re working on with our storage partners to figure out ways to execute the write, but also still keep compute running while I/O is getting distributed back to the storage systems. There’s a lot of emerging technology in that space.
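The rollback arithmetic in Boyle's example can be sketched in a few lines. This is an illustration only, not Nvidia's method: the function name `rollback_cost`, the assumption that checkpoints land at exact multiples of the interval, and the assumption that a checkpoint written at or before the fault is clean are all ours. It shows that a fault between day four and day five, noticed on day six, costs two days of work with daily checkpoints.

```python
import math

def rollback_cost(checkpoint_interval_h, fault_time_h, detect_time_h):
    """Return (last_good_checkpoint_h, lost_hours) for a silent fault.

    Hypothetical model: checkpoints are written at exact multiples of the
    interval, and any checkpoint written at or before the fault is clean.
    """
    if detect_time_h < fault_time_h:
        raise ValueError("detection cannot precede the fault")
    # Most recent checkpoint written at or before the fault occurred.
    last_good = math.floor(fault_time_h / checkpoint_interval_h) * checkpoint_interval_h
    # All work since that checkpoint, up to the moment of detection, is redone.
    lost = detect_time_h - last_good
    return last_good, lost

# Fault halfway through day four-to-five (hour 108), noticed on day six (hour 144).
daily = rollback_cost(24, fault_time_h=4.5 * 24, detect_time_h=6 * 24)
hourly = rollback_cost(1, fault_time_h=4.5 * 24, detect_time_h=6 * 24)
print("daily checkpoints: ", daily)   # roll back to hour 96, lose 48 hours
print("hourly checkpoints:", hourly)  # roll back to hour 108, lose 36 hours
```

Under these assumptions, more frequent checkpoints only recover the gap between the last checkpoint and the fault. The delay in noticing the problem is lost either way, which is why detection speed matters as much as checkpoint frequency.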
Adshead: We’ve talked about training and you’ve talked about needing fast storage. What’s the role of throughput alongside speed?
Boyle: So throughput and speed on the training side are tightly related because you’ve got to be able to load quickly. Throughput and overall read performance are almost the same metric for us.
There is also latency, which can stack up depending on what you’re trying to do. If I need to retrieve one element from my data store, then my latency is just that.
But with modern AI, especially with RAG, if you’re asking a model a question and it understands your question but it doesn’t inherently have the data to answer the question, it has to get it. The question could be the weather or stock quote or something. So, it knows how to answer a stock quote and knows the source of truth for the stock quote is SEC data or NASDAQ. But in an enterprise sense, it could be the phone number for the Las Vegas technical support office.
That should be a very quick transaction. But is that piece of data in a document? Is it on a website? Is it stored as a data cell?
It should be able to go, boom, super fast, and with latency that is super low. But if it’s a more complex answer, then the latency stacks because it’s got to retrieve that document, parse the document, and then send it back. It’s a small piece of information, but it could have a high latency. It could have two or three layers of latency in there.
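The way latency "stacks" can be shown with a simple sum over retrieval stages. The stage names and millisecond figures below are invented for illustration and are not measurements from the interview; the point is only that a small answer can sit behind several layers of delay.

```python
def stacked_latency_ms(stages):
    """Total latency when each stage must finish before the next starts."""
    return sum(ms for _name, ms in stages)

# Hypothetical figures: a value stored as a data cell versus one buried in a document.
direct_lookup = [("read data cell", 2)]
document_lookup = [
    ("retrieve document", 40),
    ("parse document", 120),
    ("send answer back", 10),
]

print(stacked_latency_ms(direct_lookup))    # 2
print(stacked_latency_ms(document_lookup))  # 170
```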
That’s why, for GenAI, the latency piece really comes down to what you expect to get out of it. Am I asking a very complex question and I’m OK waiting a second for it? Am I asking something I think should be simple? If I wait too long, then I wonder, is the AI model working? Do I need to hit refresh? Those types of things.
And then related to latency is the mode of AI that you’re going for. If I ask it a question with my voice and I expect a voice response, it’s got to interpret my voice, turn that into text, turn that into a query, find the information, turn that information back into text and have text-to-speech read it to me. If it’s a short answer, like, ‘What’s the temperature in Vegas?’, I don’t want to wait half a second.
But if I asked a more complex question that I’m expecting a couple of sentences out of, I may be willing to wait half a second for it to start talking to me. And then it’s a question of whether my latency can keep up, so that it’s sending enough text to the text-to-speech that it sounds like a natural answer.
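Whether text generation "keeps up" with text-to-speech can be framed as a small buffer check. The sketch below is a simplified model of our own, with hypothetical rates: text arrives at a steady number of words per second, speech starts after a short head start, and the voice stalls if it ever needs a word that has not yet been generated.

```python
def speech_stalls(gen_wps, speak_wps, total_words, head_start_s):
    """Return True if the voice would run out of text mid-answer.

    Hypothetical model: word k is generated at k / gen_wps seconds and is
    needed by the speaker at head_start_s + (k - 1) / speak_wps seconds.
    """
    for k in range(1, total_words + 1):
        available_at = k / gen_wps
        needed_at = head_start_s + (k - 1) / speak_wps
        if available_at > needed_at:
            return True
    return False

# A 40-word answer spoken at 2.5 words per second after a half-second wait.
print(speech_stalls(gen_wps=5.0, speak_wps=2.5, total_words=40, head_start_s=0.5))  # False
print(speech_stalls(gen_wps=2.0, speak_wps=2.5, total_words=40, head_start_s=0.5))  # True
```

In this model, a generation rate below the speaking rate eventually stalls on any sufficiently long answer, and a longer head start only postpones the gap.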
Adshead: What’s the difference in terms of storage I/O between training and inference?
Boyle: If you’re building a new storage system, they’re very similar. If you’re doing an AI training system, you need a modern fast storage appliance or some system. You need high throughput, low latency and high energy efficiency.
On the inference side, you need that same structure for the first part of the inference. But you also need to make sure you’re connecting quickly into your enterprise data stores to be able to retrieve that piece of information.
So, is that storage fast enough? And just as important, is that storage connected fast enough? Because that storage may be connected very quickly to its closest IT system, but that could be in a different datacentre, a different colo from my inference system.
A customer could say, ‘I’ve got the fastest storage here, and I bought the fastest storage for my AI system.’ Then they realise they’re in two different buildings and IT has a one gig pipe between them that’s also doing Exchange and everything else.
So, the network is almost as important as the storage, to make sure your system is engineered so that you can actually get the information. And that may mean data movement, data copying, investing in new technologies, but also investing in making sure your network is there.
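A back-of-the-envelope calculation shows why the link between buildings matters. The dataset size, link speeds and the share of the link left over by other traffic are assumptions chosen for illustration; the function ignores protocol overhead and uses decimal gigabytes.

```python
def transfer_seconds(size_gb, link_gbps, usable_fraction=1.0):
    """Idealised time to move size_gb gigabytes over a link_gbps link.

    usable_fraction is the hypothetical share of the link not consumed by
    other traffic, such as email. Protocol overhead is ignored.
    """
    bits = size_gb * 8e9  # decimal gigabytes to bits
    return bits / (link_gbps * 1e9 * usable_fraction)

# Hypothetical 100 GB of enterprise data needed by the inference system.
print(transfer_seconds(100, link_gbps=1))                        # about 800 seconds
print(transfer_seconds(100, link_gbps=1, usable_fraction=0.25))  # about 3,200 seconds
print(transfer_seconds(100, link_gbps=100))                      # about 8 seconds
```

On these figures, a shared one-gigabit pipe turns a transfer of seconds into one of nearly an hour, regardless of how fast the storage at either end is.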