Improving data quality with a more intelligent approach to data management

By Bob Eve, Senior Data Management Strategist, TIBCO.

  • Friday, 7th August 2020, posted by Phil Alsop

Data is critically important to business success, and yet its quality remains elusive as evolving definitions, syntax, structures, sources, and uses conspire to limit its efficacy. The sheer amount of data available to organisations and its complexity can often feel overwhelming. However, next-generation data management advances, used in conjunction with a more pragmatic approach, can help alleviate these concerns while improving data quality.


Integrate data management silos with next-generation data management

Ensuring data quality cannot be done using a standalone application. Instead, it requires a combination of several applications that encompass metadata management, master data management, reference data management, data cataloguing, data governance, and data integration.


Traditionally, these systems were independent, requiring metadata and data to be coordinated across a myriad of distinct tools. Furthermore, first-generation data management offerings were designed for technical users, who often lack the business domain expertise needed to make data quality efforts succeed.


Companies therefore require a next-generation data management solution capable of integrating these once-disparate components in an environment where business and technology experts can collaborate across the entire data quality lifecycle.


Use data virtualisation to stop making so many copies of data

According to the International Data Corporation (IDC), enterprise data doubles every three years*. While this volume adds to the complexity of ensuring data quality, 85% of this data can be considered copies of the original data.


This can be attributed to how traditional data warehouse-based integration works: original data from transactional systems is copied once into staging and then again into the warehouse. Data marts built on the warehouse create even more copies, and adding data lakes to the mix makes the proliferation worse still.


With so many copies spread across these different locations, maintaining accuracy and consistency becomes a considerable challenge, and data quality suffers as a result.


The straightforward answer is for companies to stop making so many copies of their data. For example, by using data virtualisation to access upstream data directly from its original source, an organisation can retire many of its data marts, improving quality while reducing costs. It also means that everyone consuming that data shares common data definitions and gains a consistent view of its quality and provenance.
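
To make this concrete, the sketch below shows the data virtualisation idea in miniature: a "virtual view" that federates two sources at query time instead of persisting yet another copy. The in-memory SQLite databases, table names, and sample rows are hypothetical stand-ins for source systems, not the API of any particular virtualisation product.

```python
# A minimal sketch of federating sources at query time, assuming two
# hypothetical systems (a CRM and a billing system) backed here by
# in-memory SQLite databases. No staging copy, warehouse, or mart is built.
import sqlite3

def make_source(schema_sql, rows_sql):
    """Create an in-memory database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    conn.executescript(rows_sql)
    return conn

crm = make_source(
    "CREATE TABLE customers (id INTEGER, name TEXT);",
    "INSERT INTO customers VALUES (1, 'Steve Smith'), (2, 'Ann Jones');",
)
billing = make_source(
    "CREATE TABLE invoices (customer_id INTEGER, amount REAL);",
    "INSERT INTO invoices VALUES (1, 120.0), (1, 80.0), (2, 45.5);",
)

def customer_spend_view():
    """A virtual view: pull from both sources on demand and join in flight."""
    customers = crm.execute("SELECT id, name FROM customers").fetchall()
    totals = dict(
        billing.execute(
            "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"
        ).fetchall()
    )
    return [(name, totals.get(cid, 0.0)) for cid, name in customers]

print(customer_spend_view())  # [('Steve Smith', 200.0), ('Ann Jones', 45.5)]
```

Because the view reads the sources directly, there is only one place for a definition or a value to go wrong, which is the consistency benefit described above.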


Take advantage of artificial intelligence to automatically identify and resolve quality issues

Beyond adopting integrated data management solutions and making fewer copies of data, companies should also consider using artificial intelligence (AI) to identify and resolve quality issues automatically.


IDC has identified five models for how the AI and machine learning capabilities currently available from data management vendors can improve data quality: human-led; human-led and machine-supported; machine-led and human-supported; machine-led and human-governed; and machine-led and machine-governed.
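
As a rough illustration of the "machine-led and human-governed" point on that spectrum, the Python sketch below flags suspect values with a simple robust-statistics rule, applies the corrections it is most confident about automatically, and queues the rest for a data steward. The thresholds, the toy confidence score, and the replace-with-median correction are illustrative assumptions, not a description of any vendor's capability.

```python
# A hedged sketch of machine-led, human-governed data quality checking.
# Suspect values are detected with a median absolute deviation (MAD) rule;
# high-confidence corrections are applied, the rest go to a review queue.
from statistics import median

def screen_values(values, z_threshold=3.5, auto_fix_confidence=0.9):
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # guard against zero
    auto_fixed, needs_review = [], []
    for v in values:
        z = 0.6745 * abs(v - med) / mad  # robust z-score
        if z <= z_threshold:
            continue  # looks normal, leave it alone
        # Toy confidence: the further past the threshold, the surer we are.
        confidence = min(1.0, (z - z_threshold) / (10 * z_threshold))
        if confidence >= auto_fix_confidence:
            auto_fixed.append((v, med))                     # machine corrects
        else:
            needs_review.append((v, round(confidence, 2)))  # human governs
    return auto_fixed, needs_review

# Order quantities with one obvious data-entry error and one borderline value.
fixed, review = screen_values([3, 2, 4, 5, 3, 4, 2, 15, 4000])
print(fixed)   # [(4000, 4)] -> corrected automatically
print(review)  # [(15, 0.11)] -> routed to a data steward
```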


The market research company found that more than 65% of organisations surveyed now use AI to automatically highlight data quality issues, and that 55% of those apply the AI-recommended correction, showing that the technology has delivered the practical advances required to improve data quality. Perhaps more telling, companies rate their trust in the AI recommendations at over 90%, and approximately 35% of those recommendations are nearly always accepted.


Be pragmatic and solve the data quality problem at hand

The final piece of the puzzle is approaching data quality from the perspective of the problem the company is trying to solve. Even though attaining perfect data quality is a noble pursuit, data quality that is good enough to meet the business need is often sufficient.


For example, if the organisation is trying to improve the customer experience, it would want to know everything about how it engages with its customers. Typically, this involves customer data from multiple systems, each of which might use a different identifier for the same person, such as:

·  Steve Smith in the sales force automation system

·  S Smith in the service management system

·  S.E. Smith in the marketing system

·  Steven Smith in the order entry and billing system


These non-matching primary keys make it difficult to match records and build a complete view of the customer: data quality, from an integration point of view, is not good enough for the problem at hand. Nor does the issue stop at customer identifiers; the integration challenge hits every key master data entity, including suppliers, partners, products, locations, and more.
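
A minimal sketch of what such matching can look like, using only the Python standard library and the four identifiers listed above; the system names and the 0.6 threshold are assumptions for illustration, and a production master data management tool would use far richer features such as addresses, emails, and phonetic keys.

```python
# A hedged sketch of fuzzy record matching across hypothetical systems.
from difflib import SequenceMatcher

records = {
    "sales_force_automation": "Steve Smith",
    "service_management": "S Smith",
    "marketing": "S.E. Smith",
    "order_entry_billing": "Steven Smith",
}

def similarity(a, b):
    """Cheap string similarity on normalised names, from 0.0 to 1.0."""
    norm = lambda s: s.lower().replace(".", "").strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

anchor = records["order_entry_billing"]  # pick one system as the reference
for system, name in records.items():
    score = similarity(anchor, name)
    verdict = "probable match" if score >= 0.6 else "needs human review"
    print(f"{system:24s} {name:14s} score={score:.2f}  {verdict}")
```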


A more intelligent master data management system is required, one that can automatically detect and resolve these mismatches so the company can maintain a ‘golden’ record free of such primary key anomalies. With this taken care of, data virtualisation can query all the details and provide the 360-degree view required to improve the customer experience.
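
Continuing the sketch, once the records are judged to belong to the same customer, a survivorship rule can assemble the golden record. The rule below, keep the most complete non-null value for each field, is deliberately simplistic and purely illustrative; real survivorship rules are usually source- and field-specific.

```python
# A hedged sketch of golden-record survivorship over matched records.
matched_records = [
    {"name": "Steve Smith", "email": None, "phone": "555-0101"},
    {"name": "S Smith", "email": "s.smith@example.com", "phone": None},
    {"name": "Steven Smith", "email": None, "phone": "555-0101"},
]

def golden_record(records):
    fields = {key for rec in records for key in rec}
    golden = {}
    for field in fields:
        candidates = [rec[field] for rec in records if rec.get(field)]
        # Survivorship rule: prefer the longest (most complete) non-null value.
        golden[field] = max(candidates, key=len) if candidates else None
    return golden

print(golden_record(matched_records))
# e.g. {'name': 'Steven Smith', 'email': 's.smith@example.com', 'phone': '555-0101'}
```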


Additionally, businesses might be looking to increase their cross-sell revenue opportunities. In this instance, a data scientist might be attempting to build a next-best-offer recommendation engine based on historical sales data in the hopes of uncovering popular product combinations.


When the data scientist starts examining this sales data and its distribution, some of it might correlate well, but there will likely also be outliers.


If the business is just getting started on the model, it will likely focus on the data that represents most of the customers and the highest-revenue products. By not bothering with the outlier data, it can build and implement the model faster and realise its benefits sooner. Later, the data scientist can go back and investigate the outliers.
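
As a rough illustration of that first pass, the sketch below counts which popular products sell together in an invented order history, sets aside rarely bought (outlier) products for later, and recommends the most frequent partner; the order data, the popularity cut-off, and the product names are all assumptions.

```python
# A minimal next-best-offer sketch: co-purchase counts over popular products.
from collections import Counter
from itertools import combinations

orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "dock"},
    {"laptop", "dock"},
    {"mouse", "keyboard"},
    {"laptop", "mouse"},
    {"antique_typewriter"},  # outlier product, set aside for a later pass
]

product_counts = Counter(p for order in orders for p in order)
popular = {p for p, n in product_counts.items() if n >= 2}  # drop the long tail

pair_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(order & popular), 2):
        pair_counts[(a, b)] += 1

def next_best_offer(product):
    """Most frequently co-purchased popular product, if any."""
    partners = Counter()
    for (a, b), n in pair_counts.items():
        if product == a:
            partners[b] += n
        elif product == b:
            partners[a] += n
    return partners.most_common(1)[0][0] if partners else None

print(next_best_offer("laptop"))  # 'mouse'
```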


Alternatively, the company might decide that a better use of the data scientist's time is to stay focused on the majority of customers and the highest-revenue products, continuously refining and improving the original model while leaving the outliers unresolved.


Ultimately, it comes down to a business staying focused on improving the quality of its data, putting it in a position to make more informed decisions that will help drive the success of the organisation.


* Source: IDC Global DataSphere 2020