Automate Data Validation in Astera Centerprise

November 29th, 2019

Enterprise information changes constantly through data updates, modifications, and deletions, making validation a necessity rather than an exception. For an organization to establish trust in its data, validating changes is essential to ensure consistency and accuracy in reporting and analyses.

Insights based on invalid data not only affect the business bottom line but may also result in lost opportunities, customer attrition, or reduced revenues. Similar to the internet ‘filter bubble’ that can lead you down a path of false information, inaccurate or invalid data can lead you to make wrong decisions that incur expenses in the long run.

A survey conducted by Convertr, a customer acquisition platform, found that 1 in 4 leads that undergo processing are categorized as invalid: 27 percent have fake names, 28 percent have an invalid email address, and 30 percent have incorrect phone numbers.

Catching and fixing invalid data points early in the data journey can save significant processing time and improve overall performance. This is where data validation comes in. A form of data cleansing, it checks data quality and accuracy before the data is processed and loaded. The primary purpose of data validation is to ensure that the data is:

  • Complete, i.e., contains no null values
  • Unique and free of duplication
  • Compliant with the business requirements
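The three checks above can be sketched in a few lines of plain Python. This is a minimal illustration and not how Astera Centerprise implements validation; the record layout, the `validation_summary` helper, and the sample business rule (age between 0 and 120) are assumptions made for the example.

```python
def validation_summary(records, key_field, rule):
    """Count records that fail each check: completeness, uniqueness, compliance."""
    seen_keys = set()
    summary = {"incomplete": 0, "duplicate": 0, "non_compliant": 0}
    for record in records:
        # Completeness: no field may be None or a blank string.
        if any(v is None or str(v).strip() == "" for v in record.values()):
            summary["incomplete"] += 1
        # Uniqueness: the key field must not repeat.
        key = record.get(key_field)
        if key in seen_keys:
            summary["duplicate"] += 1
        seen_keys.add(key)
        # Compliance: the caller-supplied business rule must hold.
        if not rule(record):
            summary["non_compliant"] += 1
    return summary


# Hypothetical sample data and business rule (age must be between 0 and 120).
customers = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "", "age": 51},
    {"id": 2, "name": "Bob", "age": 130},
]
summary = validation_summary(
    customers, "id", lambda r: r["age"] is not None and 0 <= r["age"] <= 120
)
print(summary)
```

Passing the business rule in as a function keeps the completeness and uniqueness checks generic while letting each dataset supply its own compliance logic.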

Data validation is an essential requirement for various data processes, such as ETL, ELT, and data warehousing, where the end goal is to ensure the accuracy of the results. Working with reliable data gives businesses the confidence to make timely decisions.

Issues Affecting Data Validity

Here are some of the issues that affect the validity of data:

  • Invalid values: If a field has a known set of values, such as ‘M’ for male and ‘F’ for female, any value outside that set renders the data invalid.
  • Missing values: The presence of null or blank values in the dataset.
  • Duplication: Repetition of data is a common occurrence in organizations where data is collected from multiple channels in several stages.
  • Attribute dependency: Inaccuracy that arises because the value of one field depends on another. For instance, the accuracy of product data depends on the information related to suppliers, so errors in supplier data are reflected in product data as well.
  • Inadequate data recovery: Poorly recovered data can make it difficult for people to search for the records they need.
  • Format discrepancies: Data may be entered in a format that differs from the rest of the business data.
  • Misspellings: Values that are spelled incorrectly.

Figure 1: Factors leading to invalid data (source: QuantDare)
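To make a few of these issues concrete, the sketch below flags missing values, invalid values, and format discrepancies in a single record. The field names (`gender`, `signup_date`, `email`), the `find_issues` helper, and the assumed house date format (YYYY-MM-DD) are hypothetical and chosen only for illustration.

```python
import re

ALLOWED_GENDER = {"M", "F"}                  # the known values from the example above
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed house format: YYYY-MM-DD


def find_issues(record):
    """Return a list of (field, issue) pairs found in one record."""
    issues = []
    # Missing values: any field that is None or blank.
    for field, value in record.items():
        if value is None or str(value).strip() == "":
            issues.append((field, "missing value"))
    # Invalid values: gender must be one of the known codes.
    gender = record.get("gender")
    if gender and gender not in ALLOWED_GENDER:
        issues.append(("gender", "invalid value"))
    # Format discrepancies: the date must follow the assumed house format.
    signup = record.get("signup_date")
    if signup and not ISO_DATE.fullmatch(signup):
        issues.append(("signup_date", "format discrepancy"))
    return issues


record = {"name": "Ann", "gender": "Female", "signup_date": "29/11/2019", "email": ""}
issues = find_issues(record)
print(issues)
```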

Simplify Data Validation through Astera Centerprise

Astera Centerprise is data management software that supports data validation through built-in data profiling, data quality, and data cleansing transformations. Using out-of-the-box connectors in a graphical UI, you can integrate, transform, and validate data from over 40 sources.

The software helps automate data validation tasks, freeing employees from the repetitive manual effort of identifying and fixing incorrect records and standardizing data to make it useful.

Let’s consider a simple scenario in which a company, ABC, consolidates its customer data in an Excel file to streamline its marketing efforts and revenue channels. However, the gathered data has several errors, so the company decides to validate it using Astera Centerprise. This is done with three transformations: Data Profile, Data Cleanse, and Data Quality Rules.

Fig. 2 shows the dataflow, which takes an Excel source as input, profiles the source data for analysis, cleanses it to remove invalid records, and applies data quality rules to identify errors in the cleansed data before writing it to the destination delimited file.

Fig. 2: A simple dataflow for validating data in Astera Centerprise

The result of the Data Profile transformation shows the field-level details of data. This enables the organization to understand the data and ensure:

  • The credibility of data: Once the data has been analyzed, anomalies and duplications can be eliminated to ensure the reliability of data. This further helps the organization identify quality issues and determine actionable information to streamline business processes.
  • Faster decision-making: It creates an accurate picture of the source data, enabling the organization to reach decisions faster.
  • Hands-on crisis management: Profiled data can prevent small mistakes from turning into critical issues.

Fig. 3: Profiling source data
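For readers who want a feel for what field-level profiling produces, here is a rough Python sketch. It is not the Data Profile transformation itself; the `profile` helper and the statistics it reports (row count, blank count, distinct values, maximum length) are assumptions chosen for illustration.

```python
def profile(records):
    """Build per-field statistics: row count, blanks, distinct values, max length."""
    stats = {}
    for record in records:
        for field, value in record.items():
            s = stats.setdefault(
                field, {"count": 0, "blank": 0, "values": set(), "max_len": 0}
            )
            s["count"] += 1
            text = "" if value is None else str(value)
            if text.strip() == "":
                s["blank"] += 1
            else:
                s["values"].add(text)
                s["max_len"] = max(s["max_len"], len(text))
    # Replace the raw value sets with a distinct count for a compact report.
    return {
        field: {
            "count": s["count"],
            "blank": s["blank"],
            "distinct": len(s["values"]),
            "max_len": s["max_len"],
        }
        for field, s in stats.items()
    }


# Hypothetical customer rows.
rows = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": None},
    {"name": "Ann", "email": "ann@example.com"},
]
report = profile(rows)
for field, field_stats in report.items():
    print(field, field_stats)
```

A distinct count lower than the row count hints at duplication, and a non-zero blank count points to missing values that need attention before loading.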

The Data Cleanse transformation is used to fix two issues in the source data:

  1. It removes trailing and leading spaces from the records.
  2. It identifies records containing ‘.co’ and replaces it with ‘.com’, which fixes erroneous records in the Email Address field.

Fig. 4: Applying conditions to cleanse data

The cleansed data, with extra spaces removed and the email address format corrected, can be seen in the right half of Fig. 5.
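The two cleansing steps can be approximated in Python as follows. This is a sketch under stated assumptions rather than the Data Cleanse transformation's actual logic: the `cleanse_email` helper is hypothetical, and the pattern only rewrites a ‘.co’ that appears at the very end of the value, so that addresses already ending in ‘.com’ are not turned into ‘.comm’.

```python
import re

# Matches ".co" only at the very end of the string, so "x@y.com" and
# "x@y.co.uk" are left unchanged.
TRAILING_DOT_CO = re.compile(r"\.co$")


def cleanse_email(value):
    """Trim leading/trailing spaces, then turn a trailing '.co' into '.com'."""
    trimmed = value.strip()
    return TRAILING_DOT_CO.sub(".com", trimmed)


# Hypothetical raw values.
for raw in ["  ann@example.co ", "bob@example.com", "carol@example.co.uk"]:
    print(repr(raw), "->", repr(cleanse_email(raw)))
```

Note that ‘.co’ is also a legitimate top-level domain, so whether to rewrite it is a business decision that depends on what the source data is known to contain.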

Using this clean data, the organization can:

  • Improve email marketing efforts: By creating a clean and error-free version of its customer data, the organization ensures the data can be utilized to get maximum returns on email marketing.
  • Increase revenue: Using correct email addresses can raise response rates, which in turn increases conversions and the chances of a sale.

Fig. 5: Comparison of erroneous source data with cleansed data

Next, Data Quality Rules are applied to the cleansed data to identify records in the Email Address field that have an invalid format.

Fig. 7: Flagging incorrect records in the Email Address field
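A data quality rule of this kind flags records rather than changing them. The Python sketch below shows the idea with a deliberately simple email pattern; the pattern, the `flag_invalid_emails` helper, and the `error` field are assumptions for illustration and do not reflect the rule expression used in the product.

```python
import re

# Deliberately simple pattern (an assumption for this sketch): exactly one "@",
# no spaces, and at least one dot in the domain part. Not a full RFC 5322 check.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def flag_invalid_emails(records, field="email"):
    """Split records into (valid, flagged); flagged records carry an error message."""
    valid, flagged = [], []
    for record in records:
        value = record.get(field) or ""
        if EMAIL_PATTERN.fullmatch(value):
            valid.append(record)
        else:
            flagged.append({**record, "error": f"invalid {field} format"})
    return valid, flagged


# Hypothetical cleansed records.
records = [
    {"email": "ann@example.com"},
    {"email": "bob.example.com"},
    {"email": "carol@example"},
]
valid, flagged = flag_invalid_emails(records)
print(len(valid), "valid;", len(flagged), "flagged")
```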

The result can be seen in the next screenshot. Applying Data Quality Rules enables the organization to:

  • Get consistent data: By correcting email addresses, the organization ensures that all the departments have access to consistent and correct information.
  • Facilitate scalability: With a sound quality infrastructure in place, the organization can easily scale up without worrying about the trustworthiness and reliability of its data.

The errors identified by the Data Quality Rules are written to a log file, whereas the cleansed data is written to a delimited file.
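The same split between a delimited output and an error log can be sketched with Python's standard `csv` module. The `write_outputs` helper and the log line format are hypothetical; in-memory buffers stand in for real files so the example is self-contained.

```python
import csv
import io


def write_outputs(records, errors, data_out, log_out, fieldnames):
    """Write clean records as delimited text and errors as one log line each."""
    writer = csv.DictWriter(data_out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    for row_number, message in errors:
        log_out.write(f"row {row_number}: {message}\n")


# In-memory buffers stand in for the destination file and the log file.
data_buffer, log_buffer = io.StringIO(), io.StringIO()
write_outputs(
    records=[{"name": "Ann", "email": "ann@example.com"}],
    errors=[(2, "invalid email format")],
    data_out=data_buffer,
    log_out=log_buffer,
    fieldnames=["name", "email"],
)
print(data_buffer.getvalue())
print(log_buffer.getvalue())
```

With real files, the delimited output would typically be opened with `open(path, "w", newline="")`, as the `csv` module documentation recommends.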

In the modern enterprise world, where important decisions are derived from data, automating data validation can save significant time and streamline business processes. The code-free environment of Astera Centerprise enables you to run data validation as part of a dataflow or workflow. Further, data updates can be made conditional on the success of validation tests to ensure the trustworthiness of your enterprise data.
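The idea of making an update conditional on validation can be expressed as a simple gate. The sketch below is a generic Python illustration, not a Centerprise workflow; `update_if_valid`, the check names, and the in-memory `destination` list are all hypothetical.

```python
def update_if_valid(records, checks, apply_update):
    """Run every named check on every record; apply the update only if all pass.

    Returns the names of the failed checks (an empty list means the update ran).
    """
    failed = [
        name
        for name, check in checks.items()
        if not all(check(record) for record in records)
    ]
    if not failed:
        apply_update(records)
    return failed


# Hypothetical checks and an in-memory stand-in for the real destination.
checks = {
    "email present": lambda r: bool(r.get("email")),
    "id is positive": lambda r: r.get("id", 0) > 0,
}
destination = []

good_batch = [{"id": 1, "email": "ann@example.com"}]
bad_batch = [{"id": 2, "email": ""}]

print(update_if_valid(good_batch, checks, destination.extend))  # update applied
print(update_if_valid(bad_batch, checks, destination.extend))   # update skipped
```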

To find out how you can validate data in Astera Centerprise, download the trial version.