Since the 1990s, organisations have gathered, processed and analysed business information in data warehouses.
The term “data warehouse” was introduced to the IT mainstream by American computer scientist Bill Inmon in 1992, and the concept itself dates back further, with the founding of Teradata in 1979 and work carried out by IBM in the early 1980s.
The goal was to allow enterprises to analyse business data to improve decision making, without the need to interrogate perhaps dozens of different business databases.
Since then, the technology has evolved, allowing organisations to process data at greater scale, speed and precision.
But some commentators now believe the data warehouse has reached the end of its useful life.
Ever greater volumes of data, along with the need to process and analyse information more quickly, including potentially in real time, are putting stress on conventional data warehouse architectures.
And data warehouse suppliers face competition from the cloud. An on-premises data warehouse can cost millions of dollars, take months to implement, and, critically, more months to reconfigure for new queries and new data types. CIOs are looking at the cloud as a more flexible home for analytics tools.
Exponential growth in business data
Conventional data warehouses are struggling with exponential growth in business data, says Richard Berkley, a data and analytics expert at business advisory firm PA Consulting.
“The cloud now provides much more scalability and agility than conventional data warehouses,” he says.
“Cloud technologies can scale dynamically, pulling in the processing power needed to complete queries quickly just for the processing time. You’re no longer paying for infrastructure that sits idle and you can get far better performance as the processing for individual queries is scaled far beyond what is feasible in on-premise services.”
Nor are data volumes the only challenge facing the data warehouse. Organisations want to avoid being locked into one database or data warehouse technology.
Increasingly, businesses want to draw insights from data streams – from social media, e-commerce, or sensors and the internet of things (IoT). Data warehouses, with their carefully crafted data schemas and extract, transform and load (ETL) processes, are not nimble enough to handle this type of query.
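That rigidity is easiest to see in the "transform" step itself. The sketch below is purely illustrative, with a hypothetical "orders" schema: rows must be validated and coerced to a fixed shape before loading, so any new source field, an IoT sensor reading, say, means reworking both the schema and the pipeline.

```python
# A minimal schema-on-write ETL transform, assuming a hypothetical
# "orders" table. Rows must match the warehouse schema before loading.

ORDERS_SCHEMA = {"order_id": int, "customer": str, "amount": float}

def transform(row):
    """Validate and coerce a raw row to the fixed warehouse schema."""
    clean = {}
    for column, column_type in ORDERS_SCHEMA.items():
        if column not in row:
            raise ValueError(f"missing column: {column}")
        # Coerce each value to the declared type, e.g. "99.5" -> 99.5
        clean[column] = column_type(row[column])
    return clean

print(transform({"order_id": "17", "customer": "acme", "amount": "99.5"}))
# {'order_id': 17, 'customer': 'acme', 'amount': 99.5}
```

A row with an unexpected shape is rejected outright, which is exactly the discipline that makes warehouses fast on clean data and slow to adapt to new sources.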
“The market has evolved,” says Alex McMullan, chief technology officer for Europe, the Middle East and Africa at storage supplier Pure.
“It is no longer about an overnight batch report which you then give to the CEO as a colour printout. People are doing real-time analytics and making money in the space.” Applications, he says, run from “black box” financial trading to security monitoring.
Lakeside view
At one point, data lakes appeared set to take over from data warehouses. In a data lake, information is stored in its raw form, on object storage, mostly in the cloud.
Data lakes are quicker to set up and operate, as there is no prior processing or data cleansing, and the lake can hold structured and unstructured data. The processing, and ETL, takes place when an analyst runs a query.
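This "schema-on-read" approach can be sketched in a few lines of Python. The raw events below are hypothetical; the point is that records of any shape land in the lake untouched, and structure is imposed only when a query runs.

```python
import json

# Raw events land in the "lake" as-is: no upfront schema or cleansing.
raw_events = [
    '{"user": "alice", "action": "purchase", "amount": 42.0}',
    '{"user": "bob", "action": "view"}',   # missing fields are fine
    '{"sensor": "t1", "reading": 21.5}',   # a different shape entirely
]

def query_purchase_total(raw):
    """Apply structure only at query time (schema-on-read)."""
    total = 0.0
    for line in raw:
        record = json.loads(line)
        # Records that don't fit the question are simply skipped.
        if record.get("action") == "purchase":
            total += record.get("amount", 0.0)
    return total

print(query_purchase_total(raw_events))  # 42.0
```

The flexibility is obvious, but so is the cost: every query pays for parsing and filtering that a warehouse would have done once, at load time.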
Data lakes are increasingly used outside traditional business intelligence, in areas such as artificial intelligence and machine learning. Because they move away from the rigid structure of the data warehouse, they are sometimes cited as democratising business intelligence.
They do, however, have their own drawbacks. Data warehouses rely on their structure for performance, and that discipline can be lost with a data lake.
“Organisations can accumulate more data than they know what to do with,” says Tony Baer, analyst at dbInsight. “They don’t have that discipline of an enterprise architecture approach. We gather more data than we need, and it is not being fully utilised.”
To deal with this, enterprises throw more resources at the problem – all too easy to do with the cloud – and end up with performance “almost as good as a data warehouse, through brute force”, he says.
Controlling queries and costs
This can be inefficient, and costly. Baer points out that cloud analytics suppliers such as Snowflake are building in more “guardrails” to control queries and costs. “They are moving in that direction, but it is still easy to keep adding VMs [virtual machines],” he says.
Data warehouses and data lakes also exist to support different enterprise requirements. The data warehouse is good for repeatable and repeated queries using high-quality, cleaned data, often run as a batch. The data lake supports a more ad-hoc – even speculative – approach to interrogating business information.
“If you are doing ‘what if’ queries, we are seeing data lakes or document management systems being used,” says Pure’s McMullan. He describes this as “hunter gatherer” analytics, while data warehouses are used for “farming” analytics. “Hunter gatherer analytics is looking for the questions to ask, rather than repeating the same question,” he says.
The goal for the industry, though, is to combine elasticity, speed and the ability to handle streamed data, and efficient query processing, all in one platform.
New architectures
This points to a number of new and emerging categories: the data lakehouse (the approach taken by Databricks), Snowflake’s cloud-based, multi-cluster architecture, and Amazon’s Redshift Spectrum, which connects the supplier’s Redshift data warehouse to its S3 storage.
And, although the industry has largely moved away from trying to build data lakes around Hadoop, other open-source tools, such as Apache Spark, are gaining traction in the market.
Change is being prompted less by technology than by changes in businesses’ analytics needs.
“Data requirements differ from those of five or 10 years ago,” says Noel Yuhanna, an analyst covering data management and data warehousing at Forrester. “People are looking at customer intelligence, change analysis and IoT analytics.
“There is a new generation of data sources, including sensor and IoT data, and data warehouses have evolved to address this, [by handling] semi-structured and unstructured data.”
The cloud adds elasticity and scale, says Yuhanna, along with cost savings of at least 20%, and of 50% or even 70% in some situations. However, he cautions that few companies genuinely operate their analytics systems at petabyte scale: Forrester calculates that fewer than 3% do.
Those that do are mostly in manufacturing and other highly instrumented businesses. They might turn to edge processing and machine learning to cut down data flows and speed decision making.
The other change is the move towards real-time processing, with “click stream” data in e-commerce, entertainment and social media producing constant flows of information that needs immediate analysis, but has limited longer-term value. Organisations, for their part, will only invest in stream analytics if the business can react to the information, which in turn requires high levels of automation.
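One common pattern for this kind of stream analytics is windowed aggregation: raw click events are collapsed into per-window counts that the business can act on, and the events themselves can then be discarded. The sketch below uses invented example events and a simple tumbling (fixed-size) time window.

```python
from collections import Counter

# Hypothetical click-stream events: (timestamp_seconds, page) pairs.
events = [(0, "home"), (2, "cart"), (4, "home"), (11, "home"), (13, "cart")]

def tumbling_window_counts(stream, window_seconds=10):
    """Aggregate clicks per page into fixed (tumbling) time windows.

    Only the aggregates are kept; the raw events, which have limited
    longer-term value, can be dropped once each window closes.
    """
    windows = {}
    for ts, page in stream:
        # Map each event to the start of its window, e.g. ts=13 -> 10.
        window_start = (ts // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[page] += 1
    return windows

print(tumbling_window_counts(events))
# {0: Counter({'home': 2, 'cart': 1}), 10: Counter({'home': 1, 'cart': 1})}
```

In production this logic would run continuously in a stream processor rather than over a list, but the principle is the same: react within the window, keep only the summary.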
This is prompting suppliers to claim they can straddle both markets, combining the flexibility of the data lake with the structured processing of the data warehouse. Databricks, for example, says it can enable “business intelligence and machine learning on all data” in its data lakehouse, removing the need for its customers to run duplicated data warehouse and data lake architectures.
Whether that means the demise of the conventional data warehouse, though, is unclear.
“Without this lakehouse, the world is divided into two different parts,” says Ali Ghodsi, CEO of Databricks. “There are warehouses, which are mostly about the past, and you can ask questions about ‘what was my revenue last quarter?’ On the other side is AI and machine learning, which is all about the future. ‘Which of my customers is going to disappear? Is this engine going to break down?’ These are much more interesting questions.
“I think the lakehouse will be the way of the future, and 10 years from now, you won’t really see data warehouses being used like this anymore,” he says. “They will be around just like mainframes are around, but I think the lakehouse category is going to subsume the warehouse.”
Back to the future
By no means everyone believes the data warehouse has had its day, however. As Databricks’ Ghodsi concedes, some systems will carry on as long as they are useful. And there are risks inherent with moving to new platforms, however great their promise. “Data lakes, and new infrastructure models, can be too simplistic and do not fix the real complexity challenge of managing and integrating data,” says PA Consulting’s Berkley.
Much will depend on the insights organisations need from their data. “Data warehouses and [data lakes] are very complementary,” says Jonathan Ellis, chief technology officer of Datastax. “We don’t serve Twitter or Netflix out of a data warehouse, but we don’t serve a BI dashboard out of Cassandra. [We] run live applications out of Cassandra and do analytics in the data warehouse. What is exciting in the industry is the conjunction of streaming technology and the data warehouse.
“Databases are sticky and although everybody in the data warehousing space broadly supports SQL, the devil is in the detail,” he says. “How you design schemas for optimum performance differs from supplier to supplier.”
He predicts a hybrid model, comprising on-premise and cloud, open source and proprietary software, to create a “deconstructed data warehouse” that is more flexible than conventional offerings, and more able to handle real-time data.
Others in the industry agree. We are likely to see a more diverse market, rather than one technology replacing all others, even if this poses a challenge for CIOs.
The data warehouse is likely to carry on, for some time at least, as the “gold copy” of enterprise data.
Pure Storage’s McMullan predicts that organisations will use warehouses, lakes and hubs to view different sets of data through different lenses. “It will be a lot harder than it used to be, with modern data sets and the requirements to go with it,” he says. “It is no longer about what you can do in your 42U, 19-inch rack.”