The vast majority of enterprises rely on transactional systems of record to drive their business processes. Online transaction processing systems (OLTP) deliver current, predefined data to users, and as such only support a limited number of data processing functions involving updates, deletions, and insertion requests, which are generally applied at the record level.
When it comes to gathering strategic insight and intelligence for decision-making, data mining, and planning purposes, a more comprehensive analytical processing system is required. This is the role that data warehouses serve. These systems collect current, historical, and archival data from disparate transactional sources and support a range of complex, ever-changing query operations, most of which are are applied to large volumes of transactions.
Engineering Your EDW in Centerprise 8.0
However, normalized warehouses are rarely employed in practice, as they require dozens if not hundreds of tables to be joined together to provide a comprehensible view of a business process. Such an architecture significantly impacts query performance and is thus unsuitable for any data-driven enterprise.
As a result, most modern data warehouses store data in the form of denormalized tables that show the facts (measurable values) and dimensions (context) of each business process in a dimensional model. The dimensional model groups together business information into different categories, facilitating data retrieval and thus query performance.
In Centerprise 8.0, we have introduced a range of features that enable users to extract and load data from their transactional systems into an optimized data warehouse architecture. In this blog, we’ll take you through these capabilities step by step and show you how you can integrate them into your operations.
Reverse Engineering a Source System
These first step in building a data warehouse architecture is creating an OLTP model which contains all relevant transactional systems. In Astera Centerprise, this is accomplished through a reverse engineering feature which is used to automatically create a normalized view of all selected tables in a source database.
Denormalizing the OLTP
After transactional sources are set up, they can then be denormalized to improve data retrieval and ultimately query performance in the OLAP (online analytical processing system). To do so, we will drag and drop tables with matching keys onto one another.
In the screenshot below, the CustomerCategories table is merged with the Customers tables based on the common CustomerCategoryID key. The red outline indicates that a merge is possible.
Drag & Drop Merging in the OLTP
After denormalization is completed, we can start building the OLAP model which will serve as the destination for this transactional data.
Completed Normalized Relational Data Model (OLTP)
Building the OLAP
To start building an OLAP, navigate to the Project Explorer window and add a new data model to the data warehouse.
Again, use the reverse engineer function to set up the structure for the OLAP. In this case, we have selected the Stock_Item, Employee, and Customer table for importing into the data model. We have also created a new Sales table using the entity builder in the toolbox. Each entity must also be defined as either a fact table, aggregate (additive) fact table, or dimension table, as shown below.
Dimensional Model Showing Sales Process (OLAP)
Next, surrogate keys (uniquely identifies each version of record) and business keys (an identifying value that is assigned in the transactional systems usually based on internal business logic) will be assigned to appropriate fields in the layout builder for each entity. Surrogate keys are generally small integer values that are far simpler to scan than business keys which can include characters or even a mix of different data types. As a result, these numeric values contribute to faster query performance.
Stock Item Dimension Layout Builder
Implementing Slowly Changing Dimensions
At this point, users will need to identify how updates, deletions, and additions are treated in their dimension tables.
Presently, all data within the dimension tables is current in nature, so there will be no record of changes made to key attributes within entities. If historical information is required for these attributes, an appropriate slowly changing dimension (SCD) type must be assigned to relevant fields in the layout builder.
Centerprise supports multiple SCD-handling techniques, as shown in the table below.
|SCD Type 1||Shows only the most current value for each attribute. Historical data is overwritten.|
|SCD Type 2||Any change to the source attribute is recorded as a new row in the destination table. New fields for effective start date and expiry date are also added to the table. This ensures that historical data for each time period is available for reference.|
|SCD Type 3||Preserves current value for each attribute as well as well as the previous value in an additional field. Type three tables may also include an additional field to show the date at which the current value was assigned.|
|SCD Type 6||A new row is added for each change to the source attribute. Additional rows will include fields for effective date, expiry date, and a flag to show whether a value is current or expired.|
Forward Engineering the OLAP Structure
Now, all changes made to entities in the OLAP must be updated in the OLTP to ensure that data can flow seamlessly between the two databases. The forward engineer function is used to accomplish this task.
With this option, Centerprise will convert the OLTP schema and present entity relationships into a script when applied generates and drops these tables into the OLAP.
Forward Engineering Tables to OLAP
Populating Fact and Dimension Tables in the OLAP
With the OLAP structure in place, the next step is to populate the dimension and fact tables in the data warehouse. To do so, each entity in the OLAP will need to be mapped to a corresponding transactional source in the OLTP. This is done through the load settings window.
Load Settings Window in Astera Centerprise
In this case, the Customer table in the OLTP is used as the source entity for the Customer table in the OLAP. Each field is then mapped to an attribute in the OLTP table. If a matching field is available, as with the Customer_ID or Customer fields, the direct mapping option can be used. However, in some cases, multiple fields in the source table may need to be joined and aggregated to produce relevant outputs for the OLAP. These transformations can be applied using the expression mapping option. In the above screenshot, the Bill_To_Customer field in the OLAP is mapped to a consolidated table showing customer names for every billing ID.
After the configuration of a dimension is completed, the user can verify that all the mapping done is correct by clicking the green verify button at the top of the load settings screen.
Once the required fields have been mapped, load files are generated for each entity. A load file is essentially a dataflow that contains mapping logic for a dimension. When these dataflows are executed, data is taken from the OLTP and processed through the necessary transformations before being loaded into the relevant dimension table in the OLAP. Here, appropriate surrogate keys, business keys, and SCD types will be assigned to input data fields as defined during the modelling stage.
Automatically Generated Dataflow for Customer Dimension
Pushdown Optimization in the Data Warehouse
By their nature, data warehousing projects involve transfer of large volumes of data. As such, simply moving this data from a source database to an intermediary server and then a destination can be an extremely resource-intensive process. In this case, we have added multiple ETL transformations as well, which will add significant load to the intermediary server.
When existing data within the OLTP is modified and records need to be updated in the OLAP, these dataflows will need to be executed again. Depending on the scale of changes made this could require thousands of records to be reconciled or overwritten, while new rows are created for SCD2 and SCD6 marked fields. Again, all of these operations will place a substantial load on the ETL server.
In pushdown optimization mode, Centerprise can leverage the inherent processing power of an RDBMS to perform all the extraction, loading, and transformations required in a data warehouse architecture. When pushdown is deployed, these tasks will be converted into database-native SQL code that is executed at the source and target. Additionally, any SCD reconciliation, updates, or insertions will be converted into consolidated SQL statements that can be performed far more efficiently in-database.
To enable pushdown mode across your load setting dataflows, check the appropriate option in the right-click menu of the dataflows folder.
As a result, dataflow runtimes will improve considerably while the associated memory and processing requirements for the data warehouse will be significantly reduced.
Once all dimension tables have been populated and updated, and surrogate keys have been assigned to all records within the dimension, the fact table can finally be loaded.
Fact tables consist of measurements and business keys drawn from different parts of a business process (in this case Sales). Whenever the context for a fact needs to be retrieved, these business keys are matched against their related surrogate keys in the dimension tables. This is carried out through a dimension lookup transformation.
Dimension Lookup for Sales Fact Table
As you can see from the screenshot above, fact table loading involves multiple joins between dimension tables. When you consider that fact tables generally contain millions of records, the high cost of performing this operation is clear. Of course, the entire process will also need to be repeated on a consistent basis as transactional data is constantly being updated.
Each time the fact table is populated, the lookup transformation will also need to cross-reference every business key against the relevant SCD table and convert it into a surrogate key. If the dimension table is particularly large, or several changes have been made to source records, the lookup can become particularly time-consuming and resource-intensive. While caching can reduce load time for the transformation, it will place additional load on the server’s resources.
With this in mind, running the fact table load in pushdown optimization mode can be extremely advantageous as all of these processes will be converted into a single SQL statement that extracts and transforms data in the source database before loading it into the OLAP.
Applying Aggregate Fact Tables to Complex EDW Architectures
The granularity of a fact is based on the level of detail it presents about a business process. For example, the fact table in the example above shows delivery dates on a daily basis rather than total deliveries aggregated over a month or quarter.
While this level of detail is sufficient for a transactional system, where users generally only need to retrieve data on a single customer or examine business information across a single day, in a data warehouse environment, queries are far more complex and require calculations to be performed on hundreds or thousands of record sets. If these queries are applied to low-grain fact tables, they will take significant amounts of time to execute, especially if the ETL server is already under high utilization.
An aggregate fact table optimizes query runtimes by providing pre-calculated summaries of lowest-grain facts derived from the fact table. In the sales table above, we can aggregate individual sales invoices for each date into a total sales excluding tax figure that will return far fewer records.
However, this is a relatively basic example. As the number of dimensions in the data warehouse increases, so too will the number of possible aggregate tables that can be deployed. In addition, as the level of granularity increases, more rows will be occupied for each query (sparsity) which will further increase the storage requirements on the ETL server.
The following data model shows a more complex data warehouse that features multiple dimensions linked to a Sale fact table.
Complex OLAP Model
In this case, a SalesAggregate fact table has been created to provide a summary view of the sales function as derived from Invoice processing system. The calculated view shows a running tally of total sales (including and excluding sales tax), as well as a moving average of dried and frozen items currently in stock.
Aggregated Total Sales and Stock Items
Layout Builder for Aggregate Fact Table
This aggregated data is relevant to a variety of calculations carried out during reporting and analytics. By running aggregations at the source in pushdown optimization mode, users can reduce the load on the ETL server and ensure that query runtimes are fully optimized.
Learn More About Data Warehousing in Centerprise
The upcoming release of Centerprise’s Data Warehouse Builder module will provide enterprise users with a range of tools for planning, engineering, and populating their own organization-wide data architectures. By integrating pushdown optimization with these features, we aim to ensure that our clients get the highest possible performance during these high-volume data operations.
For a more in-depth look at the performance benefits of ELT, watch out for our upcoming performance bench marking blogs, or schedule an in-person consultation with our sales team.