IDC estimates that upwards of 80% of business information will be unstructured data by 2025.
And while “unstructured” can be something of a misnomer – all files carry some metadata by which they can be searched and ordered, for example – businesses hold huge volumes of such data.
In this article, we look at what’s particular to working with unstructured data and the storage – usually file or object – that it needs.
In the past, images, voice recordings, videos, chat logs and documents of varying kinds were largely just a storage liability, and a headache for anyone who needed to manage, organise and secure them.
But now unstructured data is seen as a valuable source of business information. With analytics processing, value can be gained from it – for example, it’s possible to run AI/ML against sets of advertisement images and map what site visitors see to click behaviour. Analysis of unstructured image data can create structured fields that can drive editorial decision-making.
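As a rough illustration of that idea, the sketch below turns image content into structured fields and joins them to click data. It is purely illustrative: tag_image() is a hypothetical stand-in for whatever image-tagging model is used, and the file names and figures are invented.

```python
# Purely illustrative: tag_image() stands in for a real ML model.
import pandas as pd

def tag_image(path: str) -> dict:
    # A real implementation would run an image classifier here.
    return {"image": path, "contains_person": path.endswith("_001.jpg")}

# Structured fields derived from unstructured images...
ads = pd.DataFrame([tag_image(p) for p in ["ad_001.jpg", "ad_002.jpg"]])

# ...joined to behavioural data from the website.
clicks = pd.DataFrame({"image": ["ad_001.jpg", "ad_002.jpg"],
                       "click_rate": [0.031, 0.012]})

print(ads.merge(clicks, on="image")
         .groupby("contains_person")["click_rate"].mean())
```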
Elsewhere, backups – long consigned to dusty and hard-to-access tape archives – are now viewed as a potential data source for analytics processing. And with the threat of ransomware high on the agenda, the need for backups to recover from is more pertinent than ever.
Structured, unstructured, semi-structured
Unstructured data, broadly speaking, is data and information that does not conform to a predefined data model – in other words, information that is created and lives outside a relational database.
Business information generated by systems is most likely to be structured, with customer and product details, order numbers, stock levels and shipment information created by a sales system and stored in its underlying database being typical examples.
Those are more than likely SQL databases, configured with a table-based schema and data held in rows and columns that allow for very rapid writes and querying of the data, with very good transactional integrity. SQL databases are at the heart of the most performant and mission-critical applications in use.
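For contrast, here is a minimal sketch of the structured case, using Python’s built-in SQLite module. The table and column names are invented; the point is that every value lives in a predefined row/column position and writes are transactional.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    product  TEXT NOT NULL,
    quantity INTEGER NOT NULL
)""")

# The context manager wraps the write in a transaction, giving the
# integrity guarantees described above.
with conn:
    conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 (1001, "Acme Ltd", "Widget", 25))

# The fixed schema makes queries simple and fast.
for row in conn.execute(
        "SELECT product, quantity FROM orders WHERE customer = ?",
        ("Acme Ltd",)):
    print(row)
```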
Unstructured/semi-structured
Unstructured data is often created by people, and it includes email, social media posts, voice recordings, images, video, notes, and documents such as PDFs.
As mentioned, most unstructured data is actually what you’d call semi-structured: though not held in a database – although that is possible – it carries some structure in its metadata. For example, an image of a delivered item is superficially unstructured, but the metadata embedded by the camera makes it semi-structured.
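That camera metadata is straightforward to read programmatically. A small sketch using the Pillow library (the file name is hypothetical):

```python
from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("delivery_photo.jpg")  # hypothetical example file
exif = img.getexif()

# Map numeric EXIF tag IDs to readable names such as DateTime or Model -
# the "semi-structure" hiding inside an otherwise unstructured image.
for tag_id, value in exif.items():
    print(TAGS.get(tag_id, tag_id), value)
```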
And then there are backup files, in which all an organisation’s data is copied, compressed, encrypted and packaged into the (usually proprietary) format of the backup vendor.
The fact that backups bundle together all types of data makes them an unstructured data challenge, and one that has possibly more relevance than ever with the rise of the ransomware threat.
Unstructured and semi-structured storage needs
As we’ve seen, unstructured data is more or less defined by the fact it is not created by use of a database. It may be the case that more structure is applied to unstructured data later in its life, but then it becomes something else.
What we’ll look at here are the key requirements for storage infrastructure for unstructured data. These are:
- Volume: Usually there is a lot of unstructured data, so capacity is a key requirement.
- File and/or object storage: Block storage is for databases, and as we’ve seen that’s just not a requirement for unstructured data use cases. File-based (NAS) and object storage fulfil the need here.
- Performance: Historically this wouldn’t have been on the agenda, but with the need for analytics closer to real time and for rapid recovery from cyber attack, it’s now more of a consideration.
Cloud and unstructured data
With these requirements in mind, cloud storage would appear to fit the bill well as a site to store unstructured data. There are potentially a few things that work against it, however.
Cloud storage provides object (overwhelmingly, in terms of volume) and file-access storage, so it is potentially well suited in that regard.
Cloud storage can also provide capacity, and it may well be the case that data can be stored at volume in the cloud extremely cost-effectively. But it is usually the case that costs can be kept very low only when data is not accessed, so that’s the first potential drawback of cloud storage.
So, the cloud is very good for cold data but any kind of I/O starts to push up costs. That may be acceptable depending on the size and access requirements of your workload, however. Small datasets, or those that require infrequent access, would be ideal.
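In practice, keeping cloud costs low for cold data is often done with lifecycle rules that push untouched objects into archive tiers. A minimal sketch using boto3 against a hypothetical bucket – the tier and timings are illustrative, not recommendations:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-unstructured-archive",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-cold-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            # After 30 days, objects move to a cheaper archive tier;
            # it's retrieving them later that pushes costs back up.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```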
On-site object and file storage
Clustered NAS and object storage are both well-suited to very large volumes of unstructured data. If anything, object storage is even better-suited to large amounts of data due to its superior ability to scale.
File-based storage is based on a file system and a tree-like hierarchical structure. This can lead to performance overheads as the file system is traversed. Object storage, by contrast, is based on a flat structure with objects/files possessing a unique ID that facilitates access.
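The difference in access models is easy to see in code. In the sketch below, the file path encodes a directory tree for the file system to walk, while the object is addressed by a single flat key – the paths, bucket and key are invented, though boto3’s get_object call is real:

```python
from pathlib import Path
import boto3

# File storage: the path is a hierarchy the file system traverses
# directory by directory.
data = Path("/mnt/nas/projects/2023/q4/report.pdf").read_bytes()

# Object storage: one flat key identifies the object directly,
# with no tree to walk.
obj = boto3.client("s3").get_object(
    Bucket="example-bucket", Key="projects-2023-q4-report.pdf")
data = obj["Body"].read()
```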
On-site storage can allay concerns about security of data and its availability, and can potentially work out less costly than putting data in the cloud.
Either set of protocols – file and object – is well-suited to unstructured data storage.
Add flash for fast access
It’s quite possible to build adequately performing file and object storage on-site using spinning disk. At the capacities needed, HDD is often the most economic option.
But advances in flash manufacturing have led to high-capacity solid state storage becoming available, and storage array makers have started to use it in file and object storage-capable hardware.
This is QLC – quad-level cell – flash, which stores four bits in each flash cell to provide higher storage density, and so lower cost per GB, than any other commercially usable flash currently.
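As a back-of-envelope illustration of where the density gain comes from: each extra bit per cell doubles the number of charge states the cell must distinguish.

```python
# Illustrative arithmetic only.
for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    states = 2 ** bits  # distinguishable charge levels per cell
    print(f"{name}: {bits} bit(s)/cell, {states} states, "
          f"{bits}x SLC density per cell")
```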
The trade-off that comes with QLC, however, is reduced write endurance, so it’s better suited to large-capacity, less frequently accessed data.
But the speed of flash is particularly well-suited to unstructured use cases, such as analytics, where rapid processing – and therefore rapid I/O – is needed, and where customers may want to restore large datasets from backups after a ransomware attack, for example.
Storage hardware providers that sell QLC-based arrays suited to file and in some cases object storage include:
- Dell EMC, with PowerScale, which includes EMC’s Isilon scale-out NAS (partially) rebranded, with S3 object storage access. Its all-flash NVMe QLC options (hybrid flash is also available) come in a range of capacities that scale to tens of PB.
- NetApp, which recently launched a new QLC flash storage array family – the C-series – aimed at higher-capacity use cases that also need the speed of SSD. The C-series starts with three options – the C250, C400 and C800 – which scale to 35PB, 71PB and 106PB respectively. Object storage access via the S3 protocol is possible, but limited, in NetApp’s Ontap OS.
- Pure Storage, whose FlashArray//C provides all-QLC NVMe-connected flash in two models – the //C40 and //C60 – with capacities into the PB range. Meanwhile, Pure’s FlashBlade//S family is explicitly marketed as “fast file and object”, with NVMe QLC in its proprietary modules across two models: the S200 emphasises capacity, with data reduction, while the S500 goes for performance.