In this podcast, we talk to Pure Storage about the challenges presented to data storage by the huge growth in unstructured data and the need to gain working insight from it.
We speak with Amy Fowler, VP of strategy and product marketing for FlashBlade at Pure Storage, and FlashBlade technical evangelist Justin Emerson about the nature of unstructured data, its immense growth in volume and its diversity across data types, as well as the storage requirements needed to meet that challenge.
Antony Adshead: What are the major challenges faced by enterprises with unstructured data in terms of management, use and analysis?
Amy Fowler: First of all, everyone has unstructured data these days, so I think that’s a good jumping-off point. I think the most recent metric I saw is that 80% of enterprise data will be unstructured by 2025, so it certainly represents something significant to grapple with.
And while we’ve been talking about data growth for as long as I can remember – and I’ve been in storage more than half my life – it’s no longer just about how many terabytes or petabytes, but also about the potential sources of data that are contributing to that growth, of course.
It used to be that the critical data was primarily in transactional databases that fed data into a data warehouse, and that was pretty straightforward. But these days, if you’re a retailer or a financial services organisation or healthcare organisation, you’re probably getting valuable data from super-diverse sources: from images to tweets to IoT [internet of things] and log data.
And everyone’s telling you that your most valuable asset is your data. So, you know that ideally you don’t want to throw anything out, but at the same time you don’t want to store everything for ever, both for efficiency and for regulatory reasons.
So, the first thing is that unstructured data management becomes a huge challenge: What do I have? What do I want to keep? And, most importantly, what insights can I glean from it?
And one important element of this is the metadata – the data about the data – which helps you make those decisions.
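To make that metadata point concrete, here is a minimal sketch of tagging objects with “data about the data” so that retention decisions can be automated. It assumes an S3-compatible object store (FlashBlade exposes S3, but the endpoint, bucket, key and tag names here are hypothetical) and uses standard boto3 object-tagging calls.

```python
# A minimal sketch, assuming an S3-compatible object store (FlashBlade exposes
# S3, but the endpoint, bucket and tag names below are hypothetical).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Tag an object with metadata: where it came from and how long to keep it.
s3.put_object_tagging(
    Bucket="sensor-archive",                 # hypothetical bucket
    Key="plant-7/2024-06-01.log",
    Tagging={"TagSet": [
        {"Key": "source", "Value": "iot"},
        {"Key": "retention", "Value": "90-days"},
    ]},
)

# A housekeeping job can later read those tags to decide what to keep.
tags = s3.get_object_tagging(Bucket="sensor-archive", Key="plant-7/2024-06-01.log")
retention = {t["Key"]: t["Value"] for t in tags["TagSet"]}.get("retention")
print(f"Retention class: {retention}")
```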
The second big thing is that enterprises also know that they can do more with the data, whether that’s drawing connections and conclusions from disparate data sources to optimise profits or for threat detection or, in the case of healthcare image data, to accelerate diagnosis or patient care.
To do this effectively, to connect the dots between disparate sources to gain those insights, you really need to be able to triangulate the data. It can’t be in dozens of physical silos.
The third thing I would connect back to the declining human attention span, which is between eight and 15 seconds, depending on which study you have the attention span to read on Google. But the users of your data now expect to be able to get insights from it super-fast.
So, just knowing what you have from a data management perspective and saving it all in one place isn’t enough. You need to have it living in infrastructure that delivers a level of performance so that you actually can rapidly analyse it. And that’s a lot, and very different to what organisations were dealing with even just five or six years ago.
Adshead: What technical storage challenges does unstructured data present and what storage technologies are required to overcome these?
Justin Emerson: I think all the major problems relate in some form to scale.
So, whether that’s scale in terms of the number of files or objects, which can drive complexity around how you architect applications, the protocols you need, the performance you need, and increasingly in terms of how you address power consumption requirements of these things at scale.
Amy talked about how you used to analyse data about things that had already happened, in a data warehouse. Now you’re trying to analyse data in real time. The next wave is how do you analyse or predict things going into the future?
To do that, the amount of data you need increases tremendously. The amount of performance you need to analyse that data also goes up, and then all of those things create pressure on the environmentals, or the constraints, of your datacentre.
So, in the largest cases, you end up sizing an entire infrastructure for the size of a datacentre or the size of a power footprint. And that’s driving decisions for a lot of customers at the largest end of the scale.
How you deal with these things is that you have to start thinking about scalability from the beginning and at every level of the stack.
You have to build scalable applications – which is why so many different kinds of applications are being refactored or rebuilt for scalable, cloud-like or consumable infrastructure. You need those applications to consume a scalable amount of data – data that potentially spans multiple namespaces, multiple datacentres and lots of different types of data – and, lastly, you need to build on foundational platforms that provide you with that level of scale.
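As a hedged illustration of what consuming data at that scale looks like in practice, the sketch below walks one prefix of a large object namespace page by page instead of loading a full listing into memory. It again assumes an S3-compatible object store; the endpoint, bucket and prefix names are hypothetical.

```python
# A minimal sketch of an application consuming data at scale from an
# S3-compatible object store (endpoint, bucket and prefix are hypothetical).
# Pagination matters once a namespace holds millions of objects.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objectstore.example.com")

paginator = s3.get_paginator("list_objects_v2")
total_objects = 0
total_bytes = 0

# Walk the namespace one page at a time rather than listing it all at once.
for page in paginator.paginate(Bucket="image-lake", Prefix="radiology/2024/"):
    for obj in page.get("Contents", []):
        total_objects += 1
        total_bytes += obj["Size"]

print(f"{total_objects} objects, {total_bytes / 1e12:.2f} TB under prefix")
```

The design point is simply that the application never assumes the dataset fits in one request or one silo; the platform underneath is expected to present a single namespace, however large it grows.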
Because all the problems tend to stem from the fact that the amount of data is growing, the amount of computing power required to process that data is growing, and so that growth drives all of these scale problems.
And how you encounter those scale problems at various different levels of scale is actually quite interesting.