Podcast: AI data needs scalable flash, but also needs to be FAIR

Source is ComputerWeekly.com

In this podcast, we talk to Quantum’s enterprise products and solutions manager, Tim Sherbak, about the impacts of artificial intelligence (AI) on data storage, and in particular about the difficulties of data storage over long periods and with very large volumes of data.

We talk about the technical requirements AI places on storage, which could include the need for all-flash in a highly scalable architecture and the need for both high aggregate throughput across multiple streams and strong single-stream performance.

We also talk about the reality of “forever growth” and the need for “forever retention”, and how organisations might optimise storage to cope with such demands.

In particular, Sherbak mentions the use of FAIR principles – findability, accessibility, interoperability and reusability – as a way of handling data in an open way that has been pioneered in the scientific community.

Finally, we talk about how storage suppliers can leverage AI to help manage those vast quantities of data across vast and diverse data stores.



What impacts does AI processing bring to data storage?

AI processing places huge demands on the underlying data storage. Neural networks are hugely computationally intensive. They take a large amount of data.

The basic challenge is feeding the beast. We’ve got massively powerful and expensive computer clusters that are based on these data-hungry GPUs [graphics processing units]. And so the basic challenge is, how do we feed them data fast enough that they’re running at full capacity all the time, given the enormous amount of computational analysis that’s required. It’s all about high throughput and low latency.

First off, that means that we need NVMe [non-volatile memory express] and all-flash solutions. Second, these solutions tend to have a scale-out architecture so they can grow comfortably and keep performing at scale, as these clusters can be very large as well. You need seamless access to all the data in a flat namespace such that all of the compute clusters have visibility of all of the data.

In the current timeframe, there’s a lot of focus on the RDMA capability – remote direct memory access – such that all the servers and storage nodes in this cluster have direct access and visibility into the storage resources. This, too, can optimise storage access across the cluster. Then lastly, it’s not just aggregate throughput that’s desirable, but also single-stream performance that is very important.

And so there are new architectures that have parallel data path clients that allow you to not only aggregate multiple streams, but also optimise each of those individual streams by leveraging multiple data paths to get the data to the GPUs.
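As a rough illustration of the idea of speeding up a single stream by splitting it across several data paths, here is a minimal Python sketch. It is not how any particular storage client works: the function names are hypothetical, and ordinary threads reading byte ranges of a local file stand in for the multiple network paths a real parallel client would use.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length):
    """Read `length` bytes starting at `offset` (one hypothetical 'data path')."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def parallel_read(path, num_paths=4):
    """Read one file as several byte ranges in parallel, then reassemble it in order."""
    size = os.path.getsize(path)
    chunk = max(1, -(-size // num_paths))  # ceiling division, at least 1 byte
    ranges = [(off, min(chunk, size - off)) for off in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=num_paths) as pool:
        # pool.map returns results in the same order as `ranges`
        parts = pool.map(lambda r: read_range(path, *r), ranges)
        return b"".join(parts)
```

Running several such reads at once for different files would correspond to aggregating multiple streams, while the range splitting inside `parallel_read` corresponds to optimising an individual stream.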

How can organisations manage storage more effectively, given the likely impacts of AI on data, data retention, etc?

With AI these days, there are two really clear problems.

One is that we’ve got forever data growth, and we’ve got forever retention of the data that we’re architecting into these solutions. And so there are enormous amounts of data above and beyond what’s being calculated in the context of any individual run in a GPU cluster.

That data needs to be preserved over the long term at a reasonable cost.

There are solutions on the market that are effectively a mix of flash, disk and tape, so that you can optimise both the cost and the performance of the solution by having different levels and quantities across those three media. By doing that, you can right-size the performance and the cost-effectiveness of the solution you’re using to store all this data over the long term.
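The right-sizing arithmetic can be sketched in a few lines of Python. The tier names come from the discussion above, but the capacity fractions and per-terabyte prices below are made-up placeholders chosen only to show the calculation, not real market figures.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    fraction: float           # share of total capacity held on this tier
    cost_per_tb_month: float  # hypothetical price, arbitrary currency units


def blended_cost_per_tb(tiers):
    """Capacity-weighted average cost per TB per month across all tiers."""
    total = sum(t.fraction for t in tiers)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("tier fractions must sum to 1.0")
    return sum(t.fraction * t.cost_per_tb_month for t in tiers)


# Illustrative placeholder mix: a small hot tier, a larger warm tier, a big cold tier.
example_mix = [
    Tier("flash", 0.1, 20.0),
    Tier("disk", 0.3, 8.0),
    Tier("tape", 0.6, 2.0),
]
print(blended_cost_per_tb(example_mix))
```

Changing the fractions shows how moving rarely accessed data to the cheaper tiers lowers the blended cost while the hot tier stays on flash.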

The other thing I recommend to organisations looking at how to solve this problem of forever-retained and forever-growing data is to look into the concept of FAIR data management. This concept has been around for six or eight years. It comes from the research side of the house in organisations that are looking at how to curate all their research, but it also has real potential to help people as they look at their AI datasets.

FAIR is an acronym for findable, accessible, interoperable and reusable. This is really a set of principles [that allow] you [to] measure your data management environment to make sure that as you evolve the data management infrastructure, you’re measuring it against these principles [and] doing the best job you can at curating all this data. It’s kind of like taking a little bit from library science and applying it to the digital age.
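To make the idea of measuring against the principles concrete, here is a minimal Python sketch of a checklist applied to one dataset's metadata record. The mapping of metadata fields to principles, and the field names themselves, are assumptions made for illustration; the FAIR principles do not prescribe this particular schema.

```python
# Hypothetical mapping from each FAIR principle to metadata fields that support it.
FAIR_CHECKS = {
    "findable": ("identifier", "description"),
    "accessible": ("access_url",),
    "interoperable": ("format",),
    "reusable": ("licence", "provenance"),
}


def fair_report(record):
    """Return, per principle, the metadata fields that are missing or empty."""
    return {
        principle: [field for field in fields if not record.get(field)]
        for principle, fields in FAIR_CHECKS.items()
    }


# Example record with illustrative values only.
dataset = {
    "identifier": "dataset-0001",
    "description": "Training images, batch 1",
    "access_url": "https://example.org/datasets/0001",
    "format": "parquet",
    "provenance": "Collected by the imaging team",
}
print(fair_report(dataset))
```

A report with an empty list for every principle would mean the record passes this simple checklist; any field names listed show where curation still has gaps.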

How can AI help with data storage for AI?

That’s a really interesting question.

I think that there are some basic scenarios where, as storage vendors collect data from their customers, they can optimise the operations and the supportability of the infrastructure on a worldwide basis by aggregating experience and patterns of usage, and then use advanced algorithms to support customers more effectively.

But I think probably the most powerful application of AI in data storage is this concept of self-aware storage or, likely more appropriately, self-aware data management. The idea is that we can catalogue rich metadata, which is data about the data we’re storing, and we can use AI to do that cataloguing and pattern mapping.

As we grow these larger and larger datasets, AI will be able to auto-classify and self-document the datasets in a variety of different ways. That will benefit organisations by letting them more quickly leverage the datasets that are at their disposal.

Just think in terms of an example like sports and how AI might be able to easily document a team’s or a player’s career just by reviewing all the player’s film, articles and other information that AI can have access to. Today, without AI, when a great player retires or passes on, it can be kind of a mad scramble for a league or a team to gather all that great footage and player history for the nightly news or for the documentary that they’re doing. With AI, we have more opportunity to gain quicker access to that data.
