A funny thing happened on the way to the AI-promised land: People realized they needed data. In fact, they realized they needed large quantities of a wide variety of data and that it would be better if it was fresh, trusted, and accurate. In other words, people realized they had a big data problem.
It may seem as though the world has moved beyond the “three Vs” of big data: volume, variety, and velocity (although with value, veracity, and variability, you’re already up to six). We have (thankfully) moved on from having to read about the three (or six) Vs of data in every other article about modern data management.
To be sure, we have made tremendous progress on the technical front. Breakthroughs in hardware and software have helped us blow through old barriers that kept us from getting where we wanted: ultra-fast solid-state drives (SSDs), widespread 100GbE (and faster) networks, and, most important of all, infinitely scalable cloud compute and storage.
Amazon S3 and similar BLOB storage services have no theoretical limit to the amount of data they can store. And you can process all that data to your heart’s content with the huge assortment of cloud compute engines on Amazon EC2 and other services. The only limit there is your wallet.
Today’s infrastructure software is also much better. One of the most popular big data software setups today is Apache Spark. The open-source framework, which rose to fame as a replacement for MapReduce in Hadoop clusters, has been deployed innumerable times for a variety of big data tasks, whether it’s building and running batch ETL pipelines, executing SQL queries, or processing vast streams of real-time data.
Databricks, the company started by Apache Spark’s creators, has been at the forefront of the lakehouse movement, which blends the scalability and flexibility of Hadoop-style data lakes with the accuracy and trustworthiness of traditional data warehouses.
Databricks senior vice president of products Adam Conway turned some heads with a LinkedIn article this week titled “Big Data Is Back and Is More Important Than AI.” While big data has passed the baton of hype to AI, Conway argues that people should focus on big data.
“The reality is big data is everywhere, and it is BIGGER than ever,” Conway writes. “Big data is thriving within enterprises and enabling them to innovate with AI and analytics in ways that were impossible just a few years ago.”
Today’s data sets certainly are big. During the early days of big data, circa 2010, having 1 petabyte of data across the entire organization was considered big. Today, there are companies with 1PB of data in a single table, Conway writes. The typical enterprise today has a data estate in the 10PB to 100PB range, he says, and some companies are storing more than 1 exabyte of data.
Databricks processes 9EB of data per day on behalf of its clients. That certainly is a large amount of data, but if you consider all of the companies storing and processing data in cloud data lakes and on-prem Spark and Hadoop clusters, it’s just a drop in the bucket. The sheer volume of data is growing every year, as is the rate of data generation.
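To put that daily figure in perspective, a little back-of-the-envelope arithmetic helps. The sketch below is only an illustration: it assumes decimal units (1EB = 10^18 bytes) and a perfectly even load around the clock, neither of which the article specifies, and the function name is hypothetical.

```python
# Back-of-the-envelope: what sustained rate does "9EB per day" imply?
# Assumptions (not from the article): decimal units (1 EB = 10**18 bytes)
# and a perfectly even load across all 86,400 seconds of the day.

BYTES_PER_EB = 10**18
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def average_tb_per_second(exabytes_per_day: float) -> float:
    """Convert a daily volume in exabytes to an average rate in TB/s."""
    bytes_per_second = exabytes_per_day * BYTES_PER_EB / SECONDS_PER_DAY
    return bytes_per_second / BYTES_PER_TB


if __name__ == "__main__":
    rate = average_tb_per_second(9)
    print(f"9 EB/day is roughly {rate:.0f} TB/s on average")
```

Under those assumptions, the average works out to roughly 104TB per second. At that rate, the 1PB that counted as “big” for an entire organization circa 2010 would pass through in under 10 seconds.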
But how did we get here, and where are we going? The rise of Web 2.0 and social media kicked off the initial big data revolution. Giant tech companies such as Facebook, Twitter, Yahoo, and LinkedIn developed a wide range of distributed frameworks (Hadoop, Hive, Storm, Presto, etc.) designed to enable users to crunch massive amounts of new data types on industry-standard servers, while other frameworks, including Spark and Flink, came out of academia.
The digital exhaust flowing from online interactions (click streams, logs) provided new ways of monetizing what people see and do on screens. That spawned new approaches for dealing with other big data sets, such as IoT, telemetry, and genomic data, spurring more product usage and, hence, more data. These distributed frameworks were open-sourced to accelerate their development, and soon enough, the big data community was born.
Companies do a variety of things with all this big data. Data scientists analyze it for patterns using SQL analytics and classical machine learning algorithms, then train predictive models to turn fresh data into insight. Big data is used to create “gold” data sets in data lakehouses, Conway says. And finally, they use big data to build data products, and ultimately to train AI models.
As the world turns its attention to generative AI, it’s tempting to think that the age of big data is behind us and that we will bravely move on to tackling the next big barrier in computing. In fact, the opposite is true. The rise of GenAI has shown enterprises that data management in the era of big data is both difficult and necessary.
“Many of the most important revenue generating or cost saving AI workloads depend on massive data sets,” Conway writes. “In many cases, there is no AI without big data.”
The reality is that the companies that have done the hard work of getting their data houses in order (i.e., those that have implemented the systems and processes to be able to transform large amounts of raw data into useful and trusted data sets) have been the ones most readily able to take advantage of the new capabilities that GenAI has provided us.
That old mantra, “garbage in, garbage out,” has never been more apropos. Without good data, the odds of building a good AI model are somewhere between slim and none. To build trusted AI models, one must have a functional data governance program in place that can track the data’s lineage and ensure that the data hasn’t been tampered with, that it’s secured from hackers and unauthorized access, that private data is kept that way, and that the data is accurate.
As data grows in volume, velocity, and all the other Vs, it becomes harder and harder to ensure good data management and governance practices are in place. There are paths available, as we cover daily on these pages. But there are no shortcuts or easy buttons, as many companies are learning.
So, while the future of AI is certainly bright, the AI of the future will only be as good as the data that the AI is trained on or as good as the data that’s gathered and sent to the AI model as a prompt. AI is useless without good data. Ultimately, that will be big data’s enduring legacy.