Storage is key to AI projects that succeed

The hyperscaler cloud providers plan to spend $1tn on hardware optimised for artificial intelligence (AI) by 2028, according to market researcher Dell’Oro.

Meanwhile, enterprises are spending big on AI, with plans for AI projects fuelling record spending on datacentre hardware in 2024. In Asia, IDC found the region’s top 100 companies plan to spend 50% of their IT budget on AI. 

Despite all that, it’s not just a case of throwing money at AI – many AI projects fail.

Gartner, for example, has reported that nearly a third of AI projects get dropped after failing to achieve any business value – and has even gloomier predictions for agentic AI.

So, how do organisations ensure the best possible chance of success for AI projects, and how do they evaluate the storage needed to support AI?

What does AI processing demand from storage?

Let’s first look at AI and the demands it places on compute and storage.

Broadly speaking, AI processing falls into two categories.

These are training, in which a model learns patterns from a training dataset, with varying degrees of human supervision; and inference, in which the trained model is put to work on real-world data.

The components of a successful AI project start well before training, however.

Here, we’re talking about data collection and preparation, with datasets that can vary hugely in nature. They can include backups, unstructured data, structured data and data curated into a data warehouse. Data might be held for long periods and prepared for AI training in a lengthy and considered process, or it could be needed at short notice for requirements no one anticipated.

In other words, data for AI can take many forms and produce unpredictable requirements in terms of access.

The upshot is that AI is very hungry for resources.

The voraciousness of graphics processing units (GPUs) is well known, but it’s worth recapping. When Meta trained its open source Llama 3.1 large language model (LLM), for example, it reportedly took around 40 million GPU hours on 16,000 GPUs. We’ll come back to what that means for storage below.
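
A rough back-of-envelope calculation shows what those reported figures imply for wall-clock training time; the utilisation figure below is an assumption for illustration, not a number from Meta.

    # Wall-clock time implied by the reported Llama 3.1 training figures.
    # The utilisation value is an assumption for illustration only.
    gpu_hours = 40_000_000        # reported total GPU hours
    gpus = 16_000                 # reported GPU count
    utilisation = 0.9             # assumed average cluster utilisation

    wall_clock_hours = gpu_hours / (gpus * utilisation)
    print(f"~{wall_clock_hours:,.0f} hours, or ~{wall_clock_hours / 24:.0f} days")
    # ~2,778 hours, or ~116 days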

A large part of that demand stems from AI’s use of vectorised data. Put simply, when training a model, the attributes of the dataset being trained on are translated into vectorised – high-dimensional – data.

That means data – say the numerous characteristics of an image dataset – is converted to an ordered set of datapoints on multiple axes so they can be compared, their proximity to each other calculated, and their similarity or otherwise determined.
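
As a minimal sketch of that idea, the Python snippet below compares two made-up embedding vectors by cosine similarity; real models use hundreds or thousands of dimensions, and the values here are purely illustrative.

    import numpy as np

    # Two made-up embeddings; real embeddings have far more dimensions.
    image_a = np.array([0.12, 0.85, 0.33, 0.05])
    image_b = np.array([0.10, 0.80, 0.40, 0.07])

    # Cosine similarity: 1.0 means the vectors point in the same direction.
    similarity = np.dot(image_a, image_b) / (np.linalg.norm(image_a) * np.linalg.norm(image_b))
    print(f"cosine similarity: {similarity:.3f}")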

The result is that vector databases often see significant growth in dataset size compared with the source data – expansion of as much as 10 times is possible. That all has to be stored somewhere.
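
The sketch below gives a rough sense of where that expansion can come from; the chunk size, embedding dimension and overhead factor are assumptions, not figures from the article.

    # Rough estimate of vector store size versus the source data it indexes.
    # Every parameter below is an illustrative assumption.
    source_gb = 100              # raw source corpus
    chunk_chars = 1_000          # characters per indexed chunk
    embedding_dims = 1_536       # dimensions per embedding vector
    bytes_per_dim = 4            # float32
    overhead_factor = 1.5        # assumed overhead for IDs, indexes, stored text

    chunks = source_gb * 1_000_000_000 / chunk_chars
    vector_gb = chunks * embedding_dims * bytes_per_dim * overhead_factor / 1e9
    print(f"~{vector_gb:.0f} GB of vector store for {source_gb} GB of source data")
    # ~922 GB, roughly nine times the source data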

Then there is frequent checkpointing to allow for recovery from failures, to be able to roll back to previous versions of a model should results need tuning, and to be able to demonstrate transparency in training for compliance purposes. Checkpoint size can vary according to model size and the number of checkpoints required, but it is likely to add significant data volume to storage capacity requirements.
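
As a hedged illustration of how checkpointing adds up, the model size, bytes per parameter and retention count below are assumptions rather than figures from any particular deployment.

    # Rough checkpoint capacity estimate; every figure here is an assumption.
    params = 70e9                # assumed model size in parameters
    bytes_per_param = 14         # assumed: bf16 weights plus fp32 optimiser state
    checkpoints_retained = 20    # assumed copies kept for rollback and compliance

    per_checkpoint_tb = params * bytes_per_param / 1e12
    print(f"~{per_checkpoint_tb:.1f} TB per checkpoint, "
          f"~{per_checkpoint_tb * checkpoints_retained:.0f} TB retained on storage")
    # ~1.0 TB per checkpoint, ~20 TB retained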

Then there is retrieval augmented generation (RAG), which augments the model with internal data from the organisation, relevant to a specific industry vertical or academic specialisation, for example. Here again, RAG data depends on vectorising the dataset to allow it to be integrated into the overall architecture. 
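
A minimal sketch of the retrieval step is shown below, with a placeholder embed() function standing in for a real embedding model and an in-memory array standing in for a real vector database; with placeholder embeddings the match is arbitrary, so this only illustrates the shape of the pipeline.

    import numpy as np

    # Placeholder embedding function; a real embedding model is needed for
    # meaningful similarity. This one just derives a pseudo-random unit vector
    # from the text's hash.
    def embed(text: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.normal(size=384)
        return v / np.linalg.norm(v)

    documents = [
        "Q3 maintenance schedule for turbine line 4",
        "Internal policy on customer data retention",
        "Spec sheet for the new pump assembly",
    ]
    index = np.stack([embed(d) for d in documents])   # vectorised internal data

    query = "When is the next turbine maintenance window?"
    scores = index @ embed(query)                      # dot product of unit vectors
    retrieved = documents[int(np.argmax(scores))]

    # The retrieved passage is prepended to the prompt sent to the model.
    prompt = f"Context: {retrieved}\n\nQuestion: {query}"
    print(prompt)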

All this comes before AI models are used in production.

Next comes inference, which is the production end of AI when the model uses data it hasn’t seen before to draw conclusions or provide insights.

Inference is much less resource-hungry, especially in terms of processing, but its results must still be stored.

Beyond the data that must be retained for training and inference, we also have to consider the power usage profile of AI workloads.

And that profile is significant. Some sources have it that AI processing takes north of 30 times more energy to run than traditional task-oriented software, and that datacentre energy requirements are set to more than double by 2030.

Down at rack level, reports indicate that per-rack kilowatt (kW) usage has leapt from single figures or the teens to as much as 100kW. That’s a massive leap, and it is down to the power-hungry nature of GPUs during training.

The implication here is that every watt allocated to storage reduces the number of GPUs that can be powered in the AI cluster. 
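
As a hedged illustration of that trade-off, the rack budget, per-GPU draw and storage allocations below are all assumptions.

    # Illustrative rack power budget; all figures are assumptions.
    rack_budget_kw = 100
    gpu_kw = 1.0                      # assumed per-GPU draw, including host overhead
    storage_options_kw = [2, 5, 10]   # assumed power allocated to storage in the rack

    for storage_kw in storage_options_kw:
        gpus = int((rack_budget_kw - storage_kw) // gpu_kw)
        print(f"{storage_kw} kW to storage leaves room for {gpus} GPUs")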

What kind of storage does AI require?

The task of data storage in AI is to maintain the supply of data to GPUs to ensure they are used optimally. Storage must also have the capacity to retain large volumes of data that can be accessed rapidly. Rapid access is a requirement to feed GPUs, but also to ensure the organisation can rapidly interrogate new datasets.
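
One rough way to size the "feed the GPUs" requirement is to work back from the GPU count; the per-GPU read bandwidth figure below is an assumption.

    # Back-of-envelope aggregate read bandwidth to keep a training cluster fed.
    # Both figures are assumptions for illustration.
    gpus = 512
    gb_per_sec_per_gpu = 0.5     # assumed sustained read requirement per GPU

    print(f"~{gpus * gb_per_sec_per_gpu:.0f} GB/s aggregate read bandwidth needed")
    # ~256 GB/s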

That more than likely means flash storage for rapid access and low latency. Capacity will obviously vary according to the scale of workload, but hundreds of terabytes, or even petabytes, are possible.

High-density quad-level cell (QLC) flash has emerged as a strong contender for general-purpose storage, including, in some cases, for datasets that might be considered “secondary”, such as backup data. Use of QLC means customers can store data on flash at a lower cost – not quite as low as spinning disk, but with the ability to access data much more rapidly for AI workloads.

In some cases, storage suppliers offer AI infrastructure bundles certified to work with Nvidia compute, and these come with storage optimised for AI workloads as well as RAG pipelines that use Nvidia microservices.

The cloud is also often used for AI workloads, so a storage supplier’s integration with cloud storage should also be evaluated. Holding data in the cloud also brings an element of portability, with data able to be moved closer to its processing location.

AI projects often start in the cloud because of the ability to make use of processing resources on tap. Later, a project started on-site may need to burst to the cloud, so look for providers that can offer seamless connections and homogeneity of environment between datacentre and cloud storage.

AI success needs the right infrastructure

We can conclude that succeeding in AI at the enterprise level takes more than just having the right skills and datacentre resources.

AI is extremely hungry for data storage capacity and energy. So, to maximise the chances of success, organisations need to ensure they have the capacity to store the data needed for AI training and the outputs that result from it, but also that storage is optimised so that energy goes to data processing rather than to keeping data in storage arrays.

As we’ve seen, often it will be flash storage – and QLC flash in particular – that offers the rapid access, density and energy-efficiency needed to provide the best chances of success.
