We talk to Jeff Whitaker, vice-president for product marketing at scale-out NAS maker Panasas, about why storage is the key bottleneck in artificial intelligence (AI) processing.
In this podcast, we look at the key requirements of AI processing and how paying attention to the storage it requires can bring benefits.
Whitaker talks about the need to get lots of data into AI compute resources quickly, and how some are tempted to throw compute at the problem. Instead, he argues, attention should be paid to storage that has the right throughput and latency performance profiles to deal with lots of small files, as found in AI workloads.
Antony Adshead: What are the challenges organisations face when it comes to storage for high-performance applications?
Jeff Whitaker: When it comes to high-performance applications…the application is trying to get to results fast. It’s trying to get to a decision, trying to get information back for the environment that is using the application.
There’s often a heavy reliance on the compute side of this, and sometimes an over-reliance. A lot of times that can be resolved by [asking], what does a typical application environment look like? It’s compute, it’s network and it’s storage.
And I say storage third because often storage is the last thing that’s thought about when trying to get performance out of an application environment.
One of the things we like to look at is, when it comes to an application, what are the data needs? What kind of throughput is required, what kind of latencies are required, what is it going to take for that application to run as efficiently as possible?
And often, customers and partners have looked at solving the challenge by throwing more compute at the applications to make them faster, but really the bottleneck comes down to storage.
It’s important for people to understand that, when it comes to their environment, they should look at the data needs before they go and try to solve the problem with just compute.
So, it’s really a matter of trying to build an efficient environment to get the results they need. They need to look at what type of storage environment can solve the challenges of their application.
Adshead: What are the key trends you are seeing, particularly around the convergence of high-performance computing (HPC) with high-end enterprise storage, artificial intelligence and machine learning?
Whitaker: HPC has traditionally been an application environment that needs a lot of data. And a lot of times, the storage environment needs to be something special that can scale and address the throughput so that the compute doesn’t just sit there idle. It needs a lot of data coming in there.
What we’ve started to see in the AI world, once you get beyond just development and coming up with ideas, is that these are essentially applications. An AI environment is trying to process a lot of data and get to a result; especially during the training process, there’s tonnes of data being pumped into compute. So, in this case it’s often GPUs [graphics processing units] that are used, and those are expensive and no one wants to have them sitting there idle.
So, how fast you can pump the data into an AI environment is critical to how fast the application, or the AI training, can run. If you look at it, it’s almost on a par with what an HPC environment typically looks like, where you’re ingesting a tonne of data trying to get a result. So you really need to look at what those data needs are for that training process, or for the different types of HPC workloads, and try to solve the challenge from there.
The one difference that we see here is that in an HPC world, we often see very large files being pumped into the compute, whereas on the AI side, we see tonnes of smaller files being pumped into the compute.
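As an illustration of that contrast (not from the podcast), here is a minimal Python sketch of the two access patterns: many small reads, where per-file latency dominates, versus one large streaming read, where sequential throughput dominates. The paths and chunk size are hypothetical placeholders.

```python
# Illustrative sketch: AI-style I/O (many small files) vs HPC-style I/O
# (one large file streamed in big chunks). Paths below are hypothetical.
import os
import time

SMALL_FILE_DIR = "/mnt/training-data/images"        # hypothetical: many small samples
LARGE_FILE_PATH = "/mnt/simulation/checkpoint.dat"  # hypothetical: one big file


def read_many_small_files(paths):
    """AI-style access: one open/read per small sample, so per-file
    latency (metadata lookups, round trips) dominates total time."""
    total = 0
    start = time.perf_counter()
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    return total, time.perf_counter() - start


def read_one_large_file(path, chunk_size=8 * 1024 * 1024):
    """HPC-style access: stream one large file in big chunks, so
    sequential throughput dominates total time."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    small_paths = [os.path.join(SMALL_FILE_DIR, n) for n in os.listdir(SMALL_FILE_DIR)]
    for label, (nbytes, secs) in (
        ("many small files", read_many_small_files(small_paths)),
        ("one large file", read_one_large_file(LARGE_FILE_PATH)),
    ):
        print(f"{label}: {nbytes / secs / 1e6:.1f} MB/s")
```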
And really, the bottleneck becomes how fast you can get that data into the compute so you can get to a result.
And really, the question becomes: can a traditional enterprise storage environment solve that need for you?
It’s latency, it’s throughput. Traditional environments have the ability to deliver low latency, but getting very scalable throughput is very challenging. That’s when we start to look at a different type of architecture, like parallel solutions that can scale consistently depending on how much performance you need, really solving that challenge of ingesting tonnes of data into those compute environments.
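A minimal sketch (again, not from the podcast) of the idea behind parallel access: issue reads over several concurrent streams so aggregate throughput grows with parallelism, which is the same principle a parallel file system applies across its storage nodes. The mount point and worker counts are hypothetical.

```python
# Illustrative sketch: scaling ingest throughput by reading with multiple
# concurrent streams. Mount point and worker counts are hypothetical.
import os
import time
from concurrent.futures import ThreadPoolExecutor

DATASET_DIR = "/mnt/training-data/images"  # hypothetical mount point


def read_file(path):
    # File I/O releases the GIL, so threads overlap their waits on storage.
    with open(path, "rb") as f:
        return len(f.read())


def ingest_throughput(paths, workers):
    """Read every file with `workers` concurrent streams and return MB/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_bytes = sum(pool.map(read_file, paths))
    return total_bytes / (time.perf_counter() - start) / 1e6


if __name__ == "__main__":
    paths = [os.path.join(DATASET_DIR, n) for n in os.listdir(DATASET_DIR)]
    for workers in (1, 4, 16):
        print(f"{workers:>2} concurrent streams: {ingest_throughput(paths, workers):.1f} MB/s")
```

On a scale-out, parallel store, each concurrent stream can land on a different storage node, so aggregate throughput keeps climbing as streams are added; on a single-controller array it tends to flatten out much sooner.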