The ABCs of Data Quality: Accessibility, Balance, Centralisation

By David Talaga, Product Marketing Director, Dataiku.

Thursday, 7th September 2023

In the age of data-driven decision-making, organisations must recognise the importance of good quality data. However, the unfortunate truth is that there is no such thing as “clean and centralised data.” There is only data good enough to support given applications.

So, where should an organisation start when improving the quality of its data? Here are the ABCs of data quality:

Accessibility 

Most organisations have their data spread across a variety of systems. This inevitably creates problems when you want to access data from a different department or understand your datasets on a wider scale. You might be asking yourself over and over, “How can I quickly find common issues in my datasets? How can I dig deeper into data analysis to understand the underlying problems? How can I stop wasting so much time fixing datasets and instead start reusing trusted ones?”

Accessibility involves streamlining data collection, storage, and retrieval processes, empowering your stakeholders to quickly access relevant and accurate information. With greater accessibility comes greater data quality. An end-to-end platform can collate all of your data, allowing any user to assess data quality across a whole dataset and share the results with their team or organisation. This gives you a visual, persistent view of data quality issues, which makes it far easier to spot trends affecting the standard of your data.

These ‘all-in-one’ data platforms can also provide exploratory data analysis to proactively identify and rectify data flaws, which leads to improved ML models, reliable algorithms, and informed business strategies. After all, in order to make better decisions, you need to have access to the data. 
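To make this concrete, here is a minimal sketch of the kind of exploratory data-quality checks described above, written in pandas. The column names and the domain rule (amounts must be positive) are hypothetical stand-ins for your own datasets, not a prescription.

```python
# A minimal sketch of basic exploratory data-quality checks using pandas.
# The column names ("order_id", "amount", "country") are hypothetical.
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Summarise common data-quality issues in a single dataframe."""
    return {
        "row_count": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values_per_column": df.isna().sum().to_dict(),
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=True) <= 1],
    }

if __name__ == "__main__":
    df = pd.DataFrame({
        "order_id": [1, 2, 2, None],           # one missing identifier
        "amount": [19.99, 5.00, 5.00, -3.50],  # one negative amount
        "country": ["UK", "UK", "UK", "UK"],   # constant column
    })
    print(basic_quality_report(df))
    # Flag rows that break a simple domain rule: amounts must be positive.
    print(df[df["amount"] <= 0])
```

Even a lightweight report like this, run whenever a dataset lands, surfaces the common issues (duplicates, gaps, out-of-range values) before they reach models or dashboards.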

Balance

In business or in life, striving for balance is the most sustainable and effective way of doing things. From a data perspective, whatever data you clean will most likely be obsolete by the time the effort is completed. So, you need to balance your expectations around achieving ‘perfect’ data quality.

When it comes to keeping your data clean at scale, the balance to strike is finding a solution that can both simplify combining data from multiple sources and also maintain high performance as the volume and variety of data grows. 

Keeping your data clean involves comparing it with other data, which may be held in a separate system. So, you could transfer all of the data you need into a single location. However, duplicating your data and building and maintaining new pipelines can add unnecessary costs to your data quality initiatives.

Instead, the better approach could be to leave the data in its systems and instead utilise federated queries for your metrics and checks. This avoids duplication and saves on time and resources. But despite query federation becoming a more common feature in query engines, few engines have the features needed to scale effectively. 
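As a rough illustration, the sketch below shows how data-quality metrics and checks might be expressed as queries pushed down to a federated engine, rather than run against copied data. It assumes a query engine such as Trino exposing several source systems as catalogs behind a standard Python DB-API connection; the catalog, schema, and table names are hypothetical.

```python
# A hedged sketch: data-quality checks expressed as SQL pushed down to a
# federated query engine, so the data never has to be duplicated.
# Catalog/schema/table names below are hypothetical examples.
from typing import Any, Dict

CHECKS: Dict[str, str] = {
    # Orders with no customer identifier, straight from the source database.
    "orders_missing_customer_id":
        "SELECT COUNT(*) FROM postgres.sales.orders WHERE customer_id IS NULL",
    # Orders whose customer does not exist in the CRM system, joined across systems.
    "orders_without_crm_customer": """
        SELECT COUNT(*)
        FROM postgres.sales.orders o
        LEFT JOIN crm.public.customers c ON o.customer_id = c.id
        WHERE c.id IS NULL
    """,
}

def run_checks(connection: Any) -> Dict[str, int]:
    """Run each check through the federated engine and collect the metric values."""
    cursor = connection.cursor()
    results = {}
    for name, sql in CHECKS.items():
        cursor.execute(sql)
        results[name] = cursor.fetchone()[0]
    return results

# Usage, given a DB-API connection to the engine:
#   failing = {name: count for name, count in run_checks(conn).items() if count > 0}
```

Because each check runs where the data lives, the same definitions keep working as volumes grow, provided the engine itself can federate and scale.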

As such, transitioning your data lake (data stored in its original format) into a ‘data lakehouse’ (a management architecture that brings warehouse-like features to your data lake) can promote consistency and accuracy in your data and help you scale your data quality efforts.

That being said, a data lakehouse, just like its lake counterpart, is only as good as its query engine. It is about finding a balance between these systems and methods that provides you with a data quality process best suited to your use case and organisation.

Centralisation 

Centralised data creates a single source of trust for an organisation’s data. A Data Catalog, for example, offers a central location for data scientists, engineers, analysts and other collaborators (such as domain experts) to share and search for datasets across the organisation. This includes searching for data collections, indexed tables and server connections. 
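To illustrate the idea (and only the idea; this is not the API of any particular catalog product, and every name below is hypothetical), a data catalog is essentially a shared, searchable index of dataset entries:

```python
# A toy sketch of the concept behind a central data catalog: a shared index
# of dataset entries that any collaborator can register to and search.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class CatalogEntry:
    name: str          # e.g. an indexed table or dataset
    location: str      # e.g. a server connection or path
    owner: str
    tags: Set[str] = field(default_factory=set)
    description: str = ""

class DataCatalog:
    def __init__(self) -> None:
        self._entries: List[CatalogEntry] = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, query: str) -> List[CatalogEntry]:
        """Return entries whose name, description, or tags match the query."""
        q = query.lower()
        return [
            e for e in self._entries
            if q in e.name.lower()
            or q in e.description.lower()
            or any(q in tag.lower() for tag in e.tags)
        ]

catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="sales.orders",
    location="postgres://warehouse/sales",
    owner="data-engineering",
    tags={"sales", "orders", "validated"},
    description="Orders table, refreshed nightly",
))
print([entry.name for entry in catalog.search("orders")])
```

A real catalog adds permissions, lineage, and indexing on top, but the core value is the same: one place where everyone can find the trusted version of a dataset.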

By bringing together technical and business teams, equipping data and domain experts with the same capabilities, and having everyone work off the same platform, you are democratising AI and analytics in your company. This goes a long way towards safeguarding data quality and producing more effective models. Centralised data makes data quality a team sport.

With this centralised data, you can see a visual representation of a project’s data pipeline, allowing all collaborators to view and analyse the data. This makes it far easier to add to, amend and transform datasets alongside building predictive models. 

The key to a successful platform is visual simplicity and easy-to-use interfaces that allow users to join, aggregate and clean data — amongst a host of other abilities — in a few clicks. However, centralising data is not an end game — it is just one of the many ways to open broader use of data.

The ABCs of Data Quality 

Without good quality data, organisations are partially blindfolded in their decision-making. And if your data is not fit to scale and work across systems, you will face more issues and unnecessary costs than you need to. But with so much data being produced, taking on data quality problems can feel like a never-ending burden.

Therefore, if you can give company-wide accessibility to data, find a balanced approach to how you view and clean it, and provide a centralised source of data, you will begin to see great improvements in your data quality. And by judging data quality on a case-by-case basis, aiming for data that is good enough for the application at hand rather than striving for ‘perfect’ data, the work suddenly becomes a much more rewarding and enjoyable process.