Applying Data Virtualization in the Enterprise: Staging Tables

August 7th, 2019

In the age of the data-driven enterprise, organizations at every scale deal with a constant stream of structured, semi-structured, and unstructured inputs drawn from both legacy systems and cloud-connected APIs. With such a vast variety and volume of data to store and process, IT teams have found themselves scrambling to develop infrastructures that are capable of meeting their rapidly evolving requirements.

Unfortunately, these setups rarely follow any sort of planned data strategy. Instead, many organizations rely on a makeshift assortment of repositories, platforms, and reporting systems to solve their data challenges. Such implementations generally require significant amounts of data to be consolidated and replicated through ETL processes, which must be designed and executed consistently to ensure the timeliness of enterprise data. However, if a business’s input sources or analytics systems are updated at any point, its dataflow pipelines will need to be adjusted accordingly, adding to the time and cost of infrastructure development.

At a time when C-suite executives and frontline employees alike depend on immediate insights to drive their daily decision-making, this inflexible architecture often fails to provide end-users with access to the data they need.

Solving Enterprise-Wide Data Challenges with Data Virtualization

Data virtualization offers a streamlined and cost-effective solution to many of these issues. This technology allows data from disparate sources to be mapped to a seamlessly integrated virtualization layer without any physical movement or duplication. That layer can then feed reporting and analytics systems in much the same way as a physical database.
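To make the idea more concrete, here is a minimal Python sketch of how a virtualization layer can expose several sources through a single query interface without copying any data. The class and source names are hypothetical, and the sketch is not a depiction of any particular product’s internals:

```python
# Minimal illustration of a virtualization layer: each source keeps its data
# in place, and the virtual view only pulls records when a consumer queries it.
# All class and source names here are hypothetical.
from typing import Callable, Dict, Iterable, Iterator


class VirtualView:
    """Exposes many sources as one logical dataset, with no physical copy."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[dict]]] = {}

    def register_source(self, name: str, fetch: Callable[[], Iterable[dict]]) -> None:
        # 'fetch' reads from the underlying system (API, database, file) on demand.
        self._sources[name] = fetch

    def query(self, predicate: Callable[[dict], bool] = lambda r: True) -> Iterator[dict]:
        # Records are streamed straight from the sources at query time.
        for name, fetch in self._sources.items():
            for record in fetch():
                if predicate(record):
                    yield {"source": name, **record}


# Consumers see a single dataset, even though the data never moved.
view = VirtualView()
view.register_source("crm", lambda: [{"customer": "Acme", "region": "EU"}])
view.register_source("erp", lambda: [{"customer": "Globex", "region": "US"}])
print(list(view.query(lambda r: r["region"] == "EU")))
```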

For organizations attempting to gain deeper value from their digital touchpoints, transactional systems, and external data streams, the benefits of virtualization seem clear. Fittingly, Forrester research shows marked growth in the adoption of enterprise data virtualization (EDV) across the globe, with many key decision-makers indicating a switch to EDV in the near future. As part of this trend, the research shows the technology being embraced by a broad range of industries, including financial services, telecommunications, oil and gas, insurance, and eCommerce. Meanwhile, Gartner predicts that up to 50% of organizations will employ some form of EDV in their data architecture by 2020.

These statistics clearly demonstrate the applicability of virtualization in a range of settings, from optimizing the delivery of customer analytics to integrating social media and IoT insights into the enterprise data warehouse (EDW). Astera’s Data Virtualization module is designed to facilitate these deployments. Our EDV tool combines intuitive drag-and-drop data mapping and traditional ETL functionality with a comprehensive virtual data model builder that connects to over 40 different sources, including web applications, flat files, documents, and a variety of physical and cloud databases.

In this series of articles, we’ll take you through a number of effective use cases for Astera Data Virtualization.

Staging Tables

Why Introduce Staging Tables to Your Data Architecture?

Staging tables are a common feature in any traditional data architecture. Here, they provide storage for the records and tables extracted from various source systems across the enterprise. Once moved into the staging area, these datasets are combined and further transformed into a standardized format that aligns with the structure of the destination database, thereby optimizing ETL loading.
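As a rough illustration of this pattern (with made-up table and column names, and SQLite standing in for the staging database), the classic extract, standardize, and load cycle looks something like this:

```python
# Sketch of a conventional staging-table load using SQLite as a stand-in
# for the staging database; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Extract: land raw records from a source system in a staging table as-is.
conn.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, region TEXT)")
source_rows = [("A-1", "100.50", "emea"), ("A-2", "75.00", "apac")]
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", source_rows)

# 2. Transform: standardize types and formats to match the destination schema.
conn.execute("CREATE TABLE dw_orders (order_id TEXT, amount REAL, region TEXT)")
conn.execute(
    """
    INSERT INTO dw_orders
    SELECT order_id, CAST(amount AS REAL), UPPER(region)
    FROM stg_orders
    """
)

# 3. Load complete: the staging table can now be truncated until the next run.
conn.execute("DELETE FROM stg_orders")
print(conn.execute("SELECT * FROM dw_orders").fetchall())
```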

Data Decoupling

Data loading can be a time-consuming process that occupies the production database for hours at a time. Not only does this place a substantial burden on source systems, but it also means that the loading process may be interrupted at any time due to server unavailability. Unless these disruptions are actively monitored, end-users will end up with incomplete data at the destination.

Staging tables remove this dependency by acting as a repository for already-retrieved data while other source systems are brought online. Once all sources are available, the destination tables can be loaded with all relevant operational data.

Security

Production databases and other operational systems can contain a variety of sensitive data that should only be accessible to authorized end-users. With staging tables, data architects can ensure that only relevant records are made available to these stakeholders for querying and analysis.

Streamlined Architecture

Without a staging table in place, the data warehouse would need to include intermediary tables to handle the transformations required to prepare data for publishing. These tables would add complexity to your reporting and analytics platforms and could affect query performance.

Improved Data Quality

Scrubbing data beforehand in the staging table helps ensure that inconsistencies and redundancies are minimized when data is moved to its destination. This is especially important because multiple versions of one record may exist across different sources.
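For example, a scrubbing step in the staging area might collapse several versions of the same customer record into one before anything reaches the warehouse. The merge rule and field names below are invented for the illustration:

```python
# Hypothetical scrubbing step: several sources supply their own version of
# the same customer, and the staging layer keeps the most recently updated one.
from datetime import date

staged_customers = [
    {"customer_id": 7, "email": "a@example.com", "updated": date(2019, 3, 1)},
    {"customer_id": 7, "email": "a@EXAMPLE.com ", "updated": date(2019, 6, 1)},
    {"customer_id": 9, "email": "b@example.com", "updated": date(2019, 5, 2)},
]

latest = {}
for record in staged_customers:
    record = {**record, "email": record["email"].strip().lower()}  # normalize
    current = latest.get(record["customer_id"])
    if current is None or record["updated"] > current["updated"]:
        latest[record["customer_id"]] = record  # keep the newest version only

cleaned = list(latest.values())
print(cleaned)  # one row per customer, in a consistent format
```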

Independent Scheduling

While some source systems house relatively static data that changes infrequently, others are far more dynamic and may be updated on a weekly or even daily basis. Staging tables allow for individualized loading schedules that are tailored to each source system. This ensures that relevant tables in the data warehouse are kept as current as possible.
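In practice, this often amounts to nothing more than a per-source refresh interval. The sketch below, with invented source names and intervals, shows the idea:

```python
# Hypothetical per-source loading schedule: static reference data is refreshed
# rarely, while fast-changing transactional sources are reloaded daily.
from datetime import datetime, timedelta

load_schedule = {
    "product_catalog": timedelta(days=30),    # relatively static
    "exchange_rates": timedelta(days=7),
    "pos_transactions": timedelta(days=1),    # highly dynamic
}

last_loaded = {name: datetime(2019, 8, 1) for name in load_schedule}


def sources_due(now: datetime) -> list:
    """Return the sources whose staging tables should be reloaded now."""
    return [
        name for name, interval in load_schedule.items()
        if now - last_loaded[name] >= interval
    ]


print(sources_due(datetime(2019, 8, 5)))  # only the daily source is due
```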

Applying Data Virtualization to Staging Tables

In a virtualized staging table, metadata from internal and external systems is mirrored in a separate semantic layer, creating an abstracted view of enterprise data sources. Through Astera Data Virtualization, these views can be further joined and enriched with business logic. As a result, end-users can connect their analytics and reporting systems to disparate data sources through a single repository, even though no actual data movement has taken place.
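Continuing the earlier sketch, the joining and enrichment can itself remain virtual. The snippet below, again with invented names and not a depiction of Astera’s internals, layers a simple business rule on top of two source views without materializing anything:

```python
# Hypothetical virtual staging layer: two source views are joined and enriched
# with business logic at query time; nothing is copied into a physical table.
def invoices_view():
    # In a real deployment this would read from the live invoicing system.
    yield {"invoice_id": 1, "customer_id": 7, "amount": 120.0}
    yield {"invoice_id": 2, "customer_id": 9, "amount": 80.0}


def customers_view():
    yield {"customer_id": 7, "segment": "enterprise"}
    yield {"customer_id": 9, "segment": "smb"}


def enriched_invoices():
    """A 'virtual staging table' that consumers can query like a normal dataset."""
    segments = {c["customer_id"]: c["segment"] for c in customers_view()}
    for inv in invoices_view():
        yield {
            **inv,
            "segment": segments.get(inv["customer_id"], "unknown"),
            "is_large_order": inv["amount"] >= 100.0,  # simple business rule
        }


print(list(enriched_invoices()))
```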

When compared to ETL-based staging tables, the virtualized staging area shows some clear advantages.

  • Ideally, staging tables should be kept separate from persistent source systems to reflect the fact that these repositories are temporary in nature. However, placing staging tables on a separate database server makes data loading less efficient. Virtualization introduces a logical barrier between the staging area and operational databases and does away with the need for data movement altogether.

  • As a best practice, physical staging tables should be truncated before and after each load to reduce the time taken for each new update as well as the storage space required by the staging table. These maintenance and storage costs are minimized with data virtualization.

  • Data virtualization provides a near real-time view of source systems, ensuring that staging table data is always as fresh and accurate as possible.

  • Additional data sources can be added through a simple connection in the abstraction layer. No additional ETL pipelines are required.

  • Data replication is kept to a minimum, so the chances of theft or unauthorized access are reduced. This also helps keep organizations on the right side of data management regulations.

  • Virtualization hides source table schemas from data consumers, instead presenting all data in a single simplified view that allows for efficient loading into analytics/reporting systems.

Staging Tables in Astera Data Virtualization

In Astera Data Virtualization, we support over 40 connectors across a wide array of platforms, systems, and file types, from Excel workbooks to relational databases and data warehousing solutions such as Teradata, as well as legacy architectures based on COBOL and other programming languages.

The driving concept behind our technology is that it enables enterprises to bring together heterogeneous inputs in a unified layer in which the original source systems are completely invisible to data consumers. This is made possible through our virtual model designer. Using its drag-and-drop functionality, users can import relevant source objects into the designer window and set up connections to the appropriate source systems.

Fig. 1: A Virtual Model Consisting of 158 Entities

In the diagram above, we have built a large virtual model that provides real-time views of a wide variety of source systems. Once the model has been verified and deployed, all of these tables will be accessible to consumers as a single virtual database.

However, in certain cases, data publishers may need to apply additional transformations to their source datasets to allow for more effective analysis and reporting at the destination. In such cases, they can set up a staging area for their source data in a reusable entity known as a subflow.

Subflows contain mapping and complex decision logic (validation rules, transformations, expressions). They can be connected to any preferred source or destination in much the same way as a dataflow object.

In the model above, we have gathered invoice tables from various regional sales departments across the organization. However, some of these departments maintain customers in countries outside of their selling areas, which creates a distorted picture of the profits derived from certain regions. To address this issue, we can stage the invoice data in a subflow and apply an expression to filter out irrelevant values. We have demonstrated such a process in the subflow below.

Fig. 2: Expression Transformation Applied to Invoices Table in Subflow
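Conceptually, the expression in the subflow applies a filter along the lines of the Python sketch below. The field names and region mapping are hypothetical, and in Astera Data Virtualization the logic would be written with the built-in Expression transformation rather than code:

```python
# Conceptual equivalent of the expression transformation in the subflow:
# keep only invoices whose customer country belongs to the department's
# selling region. Field names and the region mapping are hypothetical.
selling_regions = {
    "EMEA": {"Germany", "France", "UK"},
    "APAC": {"Japan", "Australia"},
}


def in_selling_region(invoice: dict) -> bool:
    allowed = selling_regions.get(invoice["department_region"], set())
    return invoice["customer_country"] in allowed


staged_invoices = [
    {"invoice_id": 1, "department_region": "EMEA", "customer_country": "France"},
    {"invoice_id": 2, "department_region": "EMEA", "customer_country": "Japan"},  # out of region
]

filtered = [inv for inv in staged_invoices if in_selling_region(inv)]
print(filtered)  # only invoice 1 survives the staging filter
```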

We can then integrate the subflow into the virtual model by dragging and dropping a subflow entity from the toolbox and pointing it towards the filtered dataset.

Fig. 3: Subflow Transformation in a Virtual Model

Now the model is ready to be published and consumed in much the same way as any physical database. In this way, we can obtain all the functionalities of a traditional staging table within Astera Data Virtualization.

Building the Modern Operational Data Store with ADV

As data architectures continue to evolve, so do the possible applications of data virtualization within the enterprise. In upcoming articles in this series, we will discuss how our technology can enable business users to extract real-time reports and transactional intelligence from their day-to-day operational sources.

If you have a more specific use case that you would like to explore, you can always contact our sales and support team for a private consultation. Or, try out the product yourself by downloading the trial version of Astera Data Virtualization.