The Art of Data Curation – It ain’t what you do, it’s the way you do it

Srini Srinivasan, CTO at Aerospike, says it is not about how much data you have, but what you do with it that matters, and curating your data is the key to scaling well.

  • Sunday, 19th November 2023, posted by Phil Alsop

The amount of data created, captured, copied and consumed is set to double over the next two years, reaching a mind-boggling 181 zettabytes (a zettabyte is 10^21 bytes, if you were unsure) by 2025. It’s a volume of data that’s hard to comprehend, and is it all useful? I’ve been trying to make sense of this scale of data for some time, and of course the answer to the question is that it depends on what you need the data for. One person’s data ‘gold’ is another’s wasted storage!

When we try to ask the same question of the data that exists in every business or data domain, it’s harder to answer. Not only is that data in a constant state of flux and growth but, as business needs change, we want to be able to take advantage of new or different data sources to achieve a business objective.

Customer experience counts

Data is also a critical component of customer experience. Creating those experiences could involve recommending items to purchase on a website, providing access to a medical records database, or serving an advert to a gamer on their mobile device. In these cases and more, database architects must understand criteria such as response times and the need for access to historical data or third-party services, so the database scales and remains performant.

For many applications it’s possible to be very precise about the size of the data domain and how it needs to perform. For example, an analysis exercise on 20,000 medical records or a set of retail sales works on a fixed dataset. The challenge for today’s database architects is that, in many business applications, data is continually ingested 24 hours a day from Internet of Things devices, for example, or from third-party services. How do you design to scale for unlimited data?

Artificial Intelligence and Machine Learning are also placing pressure on businesses to keep the data they ingest and create. The whole point of these technologies is for them to learn from a variety of data sources so they can identify patterns and relationships between data points, then automate tasks based on the algorithms. Ultimately this leads teams to feel they can’t throw any data out as that may mean losing a competitive insight that they don’t currently know they need.

The change that’s needed

Today, we’re seeing the impact of reactive data curation, where databases and solutions are simply expanded instead of being reviewed holistically to decide whether a new approach is needed. It’s only when a system is creaking under the weight of the data it holds, and throwing more processing power or storage at it doesn’t help, that organisations are forced to face that reality. Even if a system was designed to scale for users, it was not necessarily designed to scale for growing data.

Bolting on ‘muscle’ as an approach to managing growing databases is no longer viable, which means that organisations need to take a more methodical approach to data curation. Shelves of books have been written on this topic but, in today’s environment, most business systems should be designed with a few assumed principles:

· Design with more in mind – Assume you’ll need to consume more and more data to make better decisions going forward. This will force you to think about how that data will scale and whether you need to change your underlying technology, such as moving to a graph database, to boost performance and long-term scaling.

· Not all data is equal – Some data must be processed in milliseconds as part of complex queries, while some of it isn’t time critical at all. Understanding when your data points are used, by how many users, and the timeliness within which they must be delivered, is critical to ensuring the right customer experience.

· Focus on the customer experience – Whoever the customer may be, they must remain at the forefront of the design process, with data teams understanding their needs, and how big the audience is, at all times. It’s very easy to fall into the trap of architecting databases rather than curating data. Focus on customer satisfaction and data curation, and the right solution will emerge.
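
To make the ‘not all data is equal’ principle a little more concrete, here is a minimal Python sketch of how a team might start sorting datasets by how quickly and how widely they are needed. Everything in it is an assumption made for illustration: the `DataSet` fields, the ‘hot’, ‘warm’ and ‘cold’ labels, the thresholds and the example catalogue are hypothetical, not figures from any real system or product.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """Hypothetical description of one dataset and how it is used."""
    name: str
    max_latency_ms: float   # slowest acceptable response time for a read
    reads_per_day: int      # how often the data is actually used
    concurrent_users: int   # how many users may need it at the same time


def assign_tier(ds: DataSet) -> str:
    """Classify a dataset as 'hot', 'warm' or 'cold'.

    The thresholds are illustrative assumptions; a real team would
    derive them from its own customer-experience requirements.
    """
    # Needed within milliseconds by a large audience: keep it on the fastest tier.
    if ds.max_latency_ms <= 10 and ds.concurrent_users >= 1000:
        return "hot"
    # Needed within about a second, or read regularly: keep it readily available.
    if ds.max_latency_ms <= 1000 or ds.reads_per_day >= 10:
        return "warm"
    # Rarely read and not time critical: a candidate for cheaper storage.
    return "cold"


# A made-up catalogue covering the three kinds of data described above.
catalogue = [
    DataSet("product_recommendations", max_latency_ms=5,
            reads_per_day=2_000_000, concurrent_users=50_000),
    DataSet("monthly_sales_report", max_latency_ms=60_000,
            reads_per_day=30, concurrent_users=20),
    DataSet("device_logs_2019", max_latency_ms=3_600_000,
            reads_per_day=0, concurrent_users=1),
]

tiers = {ds.name: assign_tier(ds) for ds in catalogue}

for name, tier in tiers.items():
    print(f"{name}: {tier}")
```

The value of an exercise like this lies less in the code than in the questions it forces: who uses each dataset, how often, and how fast the answer has to arrive.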

The line in the sand

Deciding to draw a line in the sand and take a new approach to curating data and assessing the suitability of the databases that underpin your customer experience is not easy. It can, however, have a profound and positive impact on the long-term future of a business and the way it uses its data for competitive advantage through new customer experiences, automation and business insights. For big data users, it can also lower costs and reduce the CO2 impact of operations.

For systems that have not been designed to scale and ingest data at the rates expected in today’s organisations, the clock is ticking. It’s a problem that will continue to grow, casting a longer and longer shadow unless your data is under better control.