Shining the spotlight on dark data in the era of generative AI

By Dael Williamson, EMEA CTO, Databricks.

  • Monday, 4th September 2023. Posted by Phil Alsop.

All organisations have a wealth of invaluable data insights hiding just out of sight, waiting to be uncovered. Whilst many might fear the unknown, in this case what is unseen could actually be a hidden source of knowledge which, if tapped into, could propel a business forward in unexplored ways. These are known as ‘no regrets’ use cases - cases where data that holds untold value is locked away, hidden or forgotten.

Organisations collect huge volumes of data each day, and that volume keeps growing. In fact, the World Economic Forum estimated that by 2025, more than 463 exabytes of data will be collected each day globally. However, a vast amount of that data is never used in any meaningful way. This is known as ‘dark data’, and it consists of millions of unstructured data points, such as unused customer information or geolocation data that is stored but never analysed. Dark data could also consist of legacy code with the potential to be reverse-engineered into business knowledge, or documents written years ago that have been stored on a file server and forgotten.

Before the explosion of generative AI, this unexplored data could be thought of as a simple missed opportunity. But now, organisations cannot afford to throw out the knowledge presented by dark data. Organisations seeking to leverage AI in their operations must have a strong handle on their big data and their dark data, as well as on both the strong and weak signals hidden away in these unexplored data sources, to understand the needs of their business moving forward.

The state of data for businesses

This year, business leaders have been faced with two extremes - the economic downturn that has resulted in higher costs, and the surge in popularity of large language models (LLMs). This has led many organisations to consider whether or not they are taking advantage of the full range of benefits offered by LLMs – such as increased efficiency and reduced costs. According to a recent report, titled ‘2023 State of Data + AI’, organisations are putting substantially more models into production - natural language processing (NLP) and LLMs have grown 411% year on year since November 2022, when ChatGPT was released. Simultaneously, machine learning (ML) experimentation has increased 54%. This indicates that businesses are seeking to stay ahead of the curve, making use of novel solutions to alleviate some of the challenges they are facing and to free themselves to focus more on innovation.

However, this all comes back to data. Data is the fuel that powers AI and, as such, businesses require a diverse, reliable set of data to train their models. ChatGPT, for example, was trained using text from a large body of data sources on the internet before September 2021 – this includes books, news articles, academic journals and more. Enterprises that are collecting huge volumes of proprietary data, which no one else can access, have an enormous opportunity to leverage this data to create their own powerful, customised LLMs similar to ChatGPT, but tailored to their specific business needs.

To ensure the data that is being used to train these models is reliable, organisations first need to look at their data and how it is being handled within the business. Organisations operating on legacy data architectures might not have the resources at their disposal to pull insights from the volumes of data they create. Instead, organisations should look to build strong, modern data foundations to power their AI – such as a data lakehouse, which removes much of the complexity typically associated with legacy data architectures and enables the timely flow of accurate data. Only when data is managed effectively can organisations hope to discover the knowledge that could be hiding within their dark data.

Risk and reward

So, what are the risks posed if organisations fail to understand their dark data? Let’s look at customer service as an example. Traditionally, organisations store call logs and detailed records of customer conversations for future use, yet this data is often never analysed. However, when classified and anonymised, even these records can reveal a great deal about customer sentiment that can then be used to make business decisions. A company that leaves these records untouched is losing out on opportunities to improve its services based on the anonymous insights pulled from the dark data, which over time could make a significant difference to its bottom line.

However, businesses should also be aware that not all dark data will be useful. There are significant risks in relying on data that you cannot see to train models. When AI is trained on untrustworthy data that has not been checked for quality, accuracy or bias, it could lead to disastrous results that may damage the reputation of a business. The process of mining through large datasets in order to uncover the really valuable insights is called ‘looking for weak signals’ - which is most useful when done in conjunction with classical ML and other techniques before generative AI production. This is why having a reliable data platform with the power to mine through and identify the golden pieces of unexplored business knowledge is essential for all businesses that are scaling AI use cases across their organisations, as well as those looking to strengthen potential weaknesses in their operations.

A bright future ahead for enterprise AI

AI has the power to make the impossible possible, creating greater efficiencies within businesses and increasing their capacity for innovation. But before organisations can take full advantage of what the future could bring, they have to take a peek behind the curtain to discover what insights their dark data could be hiding.

As the volume of dark data that is created and stored increases with each day, data management should always be a priority for organisations that want to ensure their AI is trained on reliable and accurate datasets. Otherwise, they risk missing out on the vast sea of valuable business insights that are just waiting to be illuminated.