Effective data warehousing requires a network of well-maintained data pipelines to deliver high-quality, real-time data for reporting and advanced analytics.
But manual ETL processes do not provide the scalability, agility, and speed required to keep up with modern enterprise requirements. This blog explains some approaches to building a metadata-driven data pipeline architecture that can unleash the true power of your BI system.
While there are many benefits to using data pipeline tools and implementing a data warehouse, the real value of an enterprise data warehouse (EDW) becomes apparent when you’re finally able to visualize data in beautiful, consolidated dashboards using platforms like Tableau or Power BI.
In a well-tuned data architecture, these frontend tools serve as the single source of truth for all of your decision-making, enabling everything from KPI monitoring to historical trend analyses and even advanced predictive analytics. But, of course, before you unlock that transformative potential, you need to ensure that the BI layer is stocked with relevant, accurate, and standardized data.
Enter ETL, a term that has been synonymous with data warehousing since the technology’s inception.
Back then, the primary use case involved transforming large batches of data from a transactional database, enhancing their quality or aligning them with a data model using joins, aggregates, and similar functions. These batches were then loaded into an analytics-ready repository where users could easily query them.
But the world of business intelligence is advancing at lightning speed, and enterprises have had to evolve their ETL processes and ETL tools alongside. Let’s take a look at some of these developments.
From Coding to a Metadata-Driven Data Pipeline Architecture
Modern organizations deal with a lot more than just a single transactional database. With various sources generating significant volumes of data, their data pipeline tools need to support all of these systems and consumption methods.
Traditionally, developers have manually built ETL processes for each of these systems. These pipelines are tested and optimized after deployment in a production environment and then recoded to account for any changes to the source system schema.
https://imgs.xkcd.com/comics/data_pipeline.png
With a metadata-driven ETL process, many of these steps can be automated. In this approach, metadata for all source and target tables is maintained in a central repository, including database table schemas, primary/foreign key relationships, and data types for individual fields.
Similarly, any business logic and data transformations used to maintain the quality and relevance of source data are abstracted away from the transactional database. The data mappings used to move data between source system tables and the data warehouse are also stored in this repository.
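To make this concrete, here is a minimal sketch of what such a repository might hold. The table names, columns, mappings, and rules below are purely hypothetical and only illustrate the idea of keeping pipeline definitions as data rather than code:

```python
# A minimal, hypothetical sketch of a central metadata repository.
# Table names, columns, mappings, and rules are illustrative only.
PIPELINE_METADATA = {
    "sources": {
        "crm.customers": {
            "primary_key": "customer_id",
            "columns": {"customer_id": "INT", "full_name": "VARCHAR(200)", "income_band": "VARCHAR(20)"},
        },
    },
    "targets": {
        "dw.dim_customer": {
            "primary_key": "customer_key",
            "columns": {"customer_key": "INT", "full_name": "VARCHAR(200)", "income_band": "VARCHAR(20)"},
        },
    },
    "mappings": [
        {
            "source": "crm.customers",
            "target": "dw.dim_customer",
            "column_map": {"customer_id": "customer_key", "full_name": "full_name", "income_band": "income_band"},
            # Business rules live here as declarative checks, not hard-coded in each job.
            "rules": [{"column": "income_band", "check": "not_null"}],
        }
    ],
}
```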
Most data engineers agree that there are some major advantages to doing things this way.
First, you can design flows at the logical level and, in doing so, minimize any variability present in hard-coded pipelines built over the years by multiple developers.
This means that consolidating data sources and building data pipelines can be simplified to dropping pre-packaged source connectors into a data mapping template, applying necessary transformations, and then mapping to a destination table within your data warehouse.
The templates can be easily maintained and executed immediately to deliver fresh data to your data warehouse as required, ensuring the timeliness of subsequent reports and analytics.
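As a rough illustration of what executing such a template can look like, the sketch below walks the metadata repository from the previous example. The in-memory connector, the sample rows, and the print-based load step are stand-ins, not a real product API:

```python
# Hypothetical sketch: executing a mapping template from the metadata repository above.
# The connector and the load step are in-memory stand-ins, not a real API.

def get_connector(source_name):
    # Stand-in for a pre-packaged source connector; returns sample rows.
    sample = {"crm.customers": [
        {"customer_id": 1, "full_name": "Ada Lovelace", "income_band": "high"},
        {"customer_id": 2, "full_name": "Alan Turing", "income_band": None},
    ]}
    return lambda: sample.get(source_name, [])

def passes_rules(row, rules):
    # Only the "not_null" check is sketched; real rule engines are richer.
    return all(row.get(r["column"]) is not None
               for r in rules if r["check"] == "not_null")

def run_pipeline(mapping, metadata):
    rows = get_connector(mapping["source"])()
    valid = [r for r in rows if passes_rules(r, mapping["rules"])]
    shaped = [{mapping["column_map"][k]: v for k, v in r.items()} for r in valid]
    print(f"Loading {len(shaped)} row(s) into {mapping['target']}")  # stand-in for the load

for m in PIPELINE_METADATA["mappings"]:
    run_pipeline(m, PIPELINE_METADATA)
```

Because the pipeline logic only reads the repository, onboarding a new source becomes mostly a matter of adding metadata rather than writing and testing new code.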
Now you’ve got scalable ETL jobs that can be quickly designed from scratch, updated, or even reused as new sources are integrated into your architecture. These advantages dovetail perfectly with a modern agile methodology, where development takes place in multiple iterative cycles.
Data Latency – How Low Can You Go?
Is this the ideal state for your data architecture?
We touched on it a little earlier, but timeliness is one of the primary signifiers of high-quality business intelligence.
Say you’re looking to create a remarketing campaign to engage dropped leads with relevant products from your portfolio based on past customer data. The more current your intelligence is, the more attractive these offers will be.
On the other hand, if you’re relying on months or even year-old purchasing data to guide your efforts, then the campaign’s efficacy will be diminished. From fraud detection to supply chain optimization, there is a slew of business activities that are similarly reliant on up-to-date information.
Amidst these operating conditions, it’s no surprise that many enterprises are attempting to make data sets available in their data warehouse as quickly as possible.
A couple of years ago, data streaming was pitched as the answer. In this approach, data would become available for querying in the data warehouse as soon as it arrived in the source database. Now, that sounds great in theory, but there are a few details to consider.
First, your database needs to allow data to be recorded and retrieved concurrently, which involves significant resource expenditure. The second factor is data quality: in a real-time BI system, there is an increased possibility of duplicate or erroneous data showing up in final reports, in which case you may end up having to perform additional transformations on this (essentially) raw data before it’s analysis-ready.
By comparison, micro-batching provides a more cost-effective alternative with minimal data latency and, ultimately, better-quality data. With this kind of near real-time ETL process, data is loaded into the data warehouse at specific intervals based on the requirements of each data pipeline.
These triggers could be set on a minute-by-minute basis in the case of, say, a fraud detection system or daily for ongoing sales reports.
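A bare-bones version of that trigger logic might look like the following standard-library sketch. The pipeline names and intervals are made up, and run_pipeline() stands in for the metadata-driven load shown earlier:

```python
# A minimal sketch of interval-based micro-batching using only the standard library.
# Pipeline names and intervals are illustrative placeholders.
import time

MICRO_BATCH_SCHEDULE = {
    "fraud_events_to_dw": 60,         # seconds: near real-time for fraud detection
    "sales_orders_to_dw": 24 * 3600,  # daily refresh for ongoing sales reports
}

def run_pipeline(name):
    print(f"Running micro-batch load: {name}")  # placeholder for the real load step

def micro_batch_loop():
    last_run = {name: 0.0 for name in MICRO_BATCH_SCHEDULE}
    while True:
        now = time.time()
        for name, interval in MICRO_BATCH_SCHEDULE.items():
            if now - last_run[name] >= interval:
                run_pipeline(name)
                last_run[name] = now
        time.sleep(1)  # check the triggers once per second

# micro_batch_loop()  # uncomment to start the scheduler loop
```

In production, this role is usually handled by a workflow orchestrator or job scheduler rather than a hand-rolled loop, but the principle of per-pipeline trigger intervals is the same.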
Data scientists believe that regular batch processing can still provide all the functionality you need for broader-level historical analyses. However, it’s important to note that you don’t want to adopt one approach at the expense of another because different latencies will be required based on the needs of BI users.
Introducing the Data Refinery
It Takes Work to Get From Raw Data to Business Intelligence
It’s no secret that the majority of time invested in ETL pipelines is spent on safeguarding the quality, accuracy, and relevancy of data that ends up in the data warehouse. Add in a tightening regulatory environment, with policies like GDPR, HIPAA, and SOC 2 in place across major industries, and proper data management becomes an even clearer objective.
Enterprises need to examine this issue from a few different angles throughout the data pipeline.
- Data Availability: At the extraction stage, the right data needs to be collected and made available to end-users.
- Data Consistency: Data should be reliable and formatted for sharing across different interest groups.
- Data Redundancy: The data pipeline tools should have processes in place for removing obsolete, trivial, or duplicated data.
- Data Reporting: Only relevant data should be retrieved for analytics and reporting in the BI layer.
When you’re talking about making the right data available, the key is really to have a strong system of record in place. So, let’s say you’re constructing a customer dimension table from various source tables in your database.
You might have data on their income levels coming in from two separate tables, one with sales-specific focus and one that provides a more in-depth marketing-level overview. You’d want to build your pipeline in such a way that it automatically prioritizes data from the latter table since it’s likely to be more accurate and up to date.
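A simplified version of that prioritization logic might look like the sketch below. The table names, fields, and sample records are hypothetical:

```python
# Hypothetical sketch: choosing a system of record when two sources overlap.
# Table names, fields, and sample values are illustrative only.
SOURCE_PRIORITY = ["marketing.customer_profile", "sales.customer_contacts"]  # highest priority first

def resolve_income(records_by_source):
    """Pick income_band from the highest-priority source that actually has a value."""
    for source in SOURCE_PRIORITY:
        record = records_by_source.get(source)
        if record and record.get("income_band") is not None:
            return record["income_band"], source
    return None, None

records = {
    "sales.customer_contacts":    {"customer_id": 42, "income_band": "medium"},
    "marketing.customer_profile": {"customer_id": 42, "income_band": "high"},
}

income, source_of_record = resolve_income(records)
print(income, source_of_record)  # -> "high" from the marketing-level table
```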
Data consistency is all about the correct application of transformations. There are two functions that are particularly important in this case. First, the ability to create custom expressions to validate input datasets based on internal business logic.
Second, a broader data cleanse capability that allows you to deduplicate and correct records. Ideally, you’d also want a data profiling option so you can drill into your data to find out how many records are failing standard checks due to missing, duplicate, or incomplete information.
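As a rough sketch of these capabilities, the example below uses pandas to apply a custom validation expression, profile failing records, and deduplicate. The columns, sample data, and rules are invented for illustration:

```python
# Hypothetical data quality sketch using pandas; columns, data, and rules are illustrative.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", None, None, "c@example.com"],
    "income_band": ["high", "medium", "medium", None],
})

# Custom validation expression based on internal business logic.
valid_mask = customers["email"].notna() & customers["income_band"].notna()

# Simple profile: how many records fail each standard check.
profile = {
    "missing_email": int(customers["email"].isna().sum()),
    "missing_income_band": int(customers["income_band"].isna().sum()),
    "duplicate_customer_id": int(customers.duplicated(subset="customer_id").sum()),
}
print(profile)

# Cleanse: keep only records that pass validation, then deduplicate.
cleansed = customers[valid_mask].drop_duplicates(subset="customer_id")
```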
Now, a large part of the reporting process is defined by how you actually structure your data warehouse. If you’re following a Kimball-style methodology, then specific data marts can be exposed to user groups based on their level of authorization.
However, when it comes to building the actual data pipelines, role-based access controls ensure that access to the all-important data preparation layer is restricted to those who are leading the data warehouse design process.
ELT – Leveraging the Power of Data Pipeline Tools
With Cloud Data Warehousing and ELT, The Sky’s the Limit
You might be thinking: I thought we were arguing in favor of ETL here? But the idea that the two approaches are somehow diametrically opposed is just plain wrong. Whether you opt for ETL or ELT depends entirely on how your data architecture is built and the particulars of each data pipeline, which brings us to cloud data warehouses and data lakes.
So, back when on-premises databases were the only game in town, developers had to be careful about how they allocated storage and compute, because these infrastructures were expensive to set up and expand.
In this case, pushing down transformations to the database server required careful consideration. The upshot was that enterprises preferred to process these operations in the ETL layer, which was easier to scale up.
That equation has changed with the introduction of cloud data warehousing. Look at platforms like Amazon Redshift or Microsoft Azure Synapse. They offer everything you need for modern data warehousing, including a powerful MPP engine for fast query execution, columnar data storage for optimized analytics, and in-memory caching, all at a more cost-effective price point.
Suddenly, you have an analytics-ready database that you can deploy in minutes and scale as needed to match the volume of your incoming data.
Now, with this kind of architecture in place, it makes sense to move to an ELT approach: transfer data directly from source systems to the data warehouse, then leverage the inherent computing capacity of the database engine to perform data transformations. Quicker time-to-value is a big benefit here, because data is made available at the destination for reporting and analytics at a faster rate, with transformations executed in parallel with data loading.
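In practice, the load-then-transform pattern can be as simple as the sketch below: stage the raw data, then hand a SQL transformation to the warehouse engine to run in place. The connection details, schemas, and SQL are hypothetical, and psycopg2 is used only because Redshift-style warehouses speak the PostgreSQL wire protocol:

```python
# A rough ELT sketch: load raw data into a staging table, then push the
# transformation down to the warehouse engine as SQL.
# Connection details, schemas, and SQL below are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(host="example-warehouse", dbname="analytics",
                        user="etl_user", password="***")

with conn, conn.cursor() as cur:
    # Step 1 (Load): bulk-load raw extracts into a staging table, typically
    # with the warehouse's native bulk-load command (omitted here).

    # Step 2 (Transform): let the MPP engine do the heavy lifting in place.
    cur.execute("""
        INSERT INTO dw.dim_customer (customer_key, full_name, income_band)
        SELECT customer_id, full_name, income_band
        FROM staging.customers
        WHERE income_band IS NOT NULL;
    """)
```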
On the other hand, ETL still has a significant part to play in enterprises where data governance is paramount and you want to ensure that only properly cleansed and integrated data makes it into the data warehouse.
Another thing you want to be careful about with ELT is that the code you’re developing to perform transformations in the target data warehouse doesn’t lock you into a particular platform. Ideally, you want platform-agnostic data pipelines that can function across several source and destination systems.
Astera DW Builder: Best-In-Class ETL Capabilities for Data Warehouse Development
Astera DW Builder is built on an industrial-strength ETL/ELT engine. It comes packed with a slew of functionalities that make it ideal for modern data warehousing.
These include the backend architecture, which features a parallel processing engine that can handle high-volume, complex data processing with ease, and DW Builder’s pushdown optimization mode, which empowers users to execute data pipelines in ELT mode at the push of a button.
The product also offers support for over 40 sources and destinations, ranging from flat-file systems and industry-leading on-premises databases such as SQL Server and Oracle Database to cloud-based data warehouses such as Amazon Redshift, Snowflake, and Microsoft Azure, as well as multiple BI tools.
Add in over 500 out-of-the-box transformations, including data quality and data validation options, and you have all the tools you need to construct fast, accurate, and error-free data pipelines for the data warehouse without writing a single line of code.
These features are supported by comprehensive automation capabilities via a built-in workflow orchestration and job scheduling component.
For a first-look trial of Astera DW Builder, click here.