At Astera, we’re constantly looking to expand our data integration capabilities to better serve the needs of business users and developers in a variety of environments.
As part of this constant evolution, we’ve introduced Extract-Load-Transform (ELT) functionalities to our flagship product, Astera Centerprise. This “pushdown mode” introduces improved scalability, performance, and ease of management to even the most complex data infrastructures.
Unlike the traditional three stage extract-transform-load (ELT) process, ELT does not require data to be processed through a centralized staging server before it is dispatched to its destination. Instead, ELT leverages the inherent computing capacity of a high-performance RDBMS to perform all or some of the data processing functions within the target database system.
Fig.1: Extracting Payroll Data from Sales and Employee Tables
Fig.2 SQL Generated in Normal ETL Dataflow
Figures 1 and 2 represent ETL in a normal dataflow. In this case, tables are extracted from an organization’s Sales and Employees databases. These tables are then joined to an Invoicing table and processed through various transformations to ascertain which salespeople are eligible for bonuses based on their performance over a two-month period. The resulting data is then loaded into a payroll-specific database.
As you can see, SQL query is only generated to pull relevant columns from these two tables, while the actual transformations are applied automatically within the ETL engine.
ELT in Centerprise
In Centerprise’s pushdown mode, all relevant data is loaded directly into these systems, and the necessary transformations are applied at either the source or target database in the form of SQL queries.
Centerprise currently supports pushdown optimization for MSSQL, PostgreSQL, Oracle, Db2, and MySQL database systems. All queries are generated automatically based on the existing transformation logic that is transformed into native SQL code according to the target database platform in use. These queries are executed in parallel with the loading process, which means that throughput rates can be maximized, while wait times between data transformation and delivery to the end-user are improved considerably.
With ELT, data size becomes a less pressing concern. While larger datasets will produce longer running queries, query processing times are still improved considerably. As a result, everything from social media insights and real-time performance metrics to streams of information derived from cloud-based applications and APIs can be loaded and transformed with utmost efficiency.
This integration process is ideally suited to modern reporting systems that supply increasing volumes of semi-structured and unstructured data to the enterprise.
Fig.3 SQL Generated in ELT Dataflow
For example, if we take the same transformations from Fig.1, and run them in pushdown optimization mode, we will see far more complex code generated. Fig.3 illustrates how the transformations that were previously processed in Centerprise have now been converted into SQL code which can be executed directly in the target database.
ETL vs. ELT
Pushdown optimization speeds up a number of common ETL tasks such as calculations, lookups, and aggregations.
|Calculations||Row-level calculations take place in the ETL engine. Once the transformation is applied, the existing column will be overwritten, or a new column will be added to your source.||Raw data is read from the database before transformation. Calculations are performed on specific columns which are then simply added to the existing tables using database-native SQL.|
|Lookups||ETL engine will map values from a reference table to a data source. A successful lookup requires all reference data to be available. If large datasets are used, lookup efficiency is reduced. You may need to implement partial or full caching to resolve these performance issues.||Databases lookups are commonly achieved through left outer joins in the lookup table. Users can also optimize these processes by applying indexes on lookup columns. As complete datasets are loaded into the database, the availability of sources is not a concern. All of these features provide significant performance benefits over ETL processing.|
|Aggregations||Aggregate transformations can be complex, especially when large quantities of granular and aggregated data are moved into multiple tables.|
Modern database engines are equipped to process and move large volumes of disparate data and can perform aggregation transformations extremely quickly.
Full Pushdown vs. Partial Pushdown in Centerprise
In the pushdown optimization mode, Centerprise will analyze the mapping between sources and targets to determine how much of the transformation can be pushed down to the database level at either end of the integration process.
If the dataflow uses a non-database source or destination, or the target database is located on a different server altogether, then only a partial pushdown will be completed. For example, if we run the same transformations in Fig. 1 but load the resulting data to an Excel destination then the SQL code will be pushed down to the source database alone.
Fig.4 Error Message in Job Progress Window
Fig. 5: SQL Generated When Non-Database Destination is Employed
The job progress window pictured in Fig. 4 illustrates that this particular transformation can only be pushed to the source-side database. As you can see from Fig. 5, the Insert Into statement has also been omitted from the SQL code that was generated for this dataflow.
If a specific transformation cannot be converted into database-native SQL code, Centerprise will automatically process this step of the transformation through the ETL server. The latest versions of Centerprise support a wide variety of transformations including aggregate, sort, and join alongside various complex expressions involving name and address parsing as well as concatenation. In these instances, partial pushdown mode or traditional ETL processing will be deployed.
As this transformation cannot be written into SQL logic, only partial pushdown optimization will be applied. The success of the ELT deployment can be checked by clicking the Verify Pushdown button on the toolbar.
Verify Pushdown Button
This will open a new window that shows any potential errors and warnings that occurred during the pushdown process.
Fig. 6: Unsupported ELT Transformation in Dataflow
Fig. 7: Error Message for Unsupported ELT Transformation
Here, we’ve added a mathematical function that allows us to round sales value figures to the nearest whole number. As this functionality isn’t supported by the target database, the action is processed in the ETL engine instead. The Verify Pushdown immediately identifies this limitation.
That being said, Centerprise does provide users with the ability to implement modifications that enable them to work around some of these limitations.
For example, certain unsupported transformations can be substituted for compatible functions that create the same dataflow output. This can be applied to the Distinct transformation which can only be run in pushdown optimization mode when single fields or entire field layouts are used. In this case, the Distinct transformation can be replaced with a Filter transformation that is set to deliver values within a specified range to the target file or database.
Similarly, technical users with a more in-depth understanding of database-native SQL can override Centerprise-generated statements by using the PushdownSQL() function to pass pre-defined SQL code to the database server during pushdown.
Fig. 8: Manual SQL Pushdown Setting in Function Properties Window
While partial pushdown still offers substantial performance improvements over traditional ETL, the full benefits of pushdown optimization are realized when all transformation logic can be processed within the source and target database.
However, larger ELT transactions may lock the database for longer periods of times, which could prevent users from accessing them for business purposes. These transformations may also be more vulnerable to unexpected errors. You can circumvent these issues by mapping your process for partial pushdown optimization instead.
In Centerprise 7.5, we introduced a new feature that allows users to omit certain objects from the pushdown optimization mode. This capability can help delegate bottleneck processes to the ETL server rather than the database.
Fig. 9: ‘Exclude From Pushdown” Menu Option
Although the ETL engine can handle these transformations with ease, it will also introduce additional load to your network. In the example provided above, pushdown optimization reduced the dataflow processing time considerably.
Dataflow Processing Time in ETL Server
Dataflow Processing Time in Pushdown Optimization Mode
These performance benefits only become more apparent as the volume and complexity of your dataflows increase. As the following use case demonstrates, in pushdown optimization mode, Centerprise can load and transform thousands of records in seconds using native database functionalities with minimal impact on your servers.
Pushdown Optimization in Action
We were recently able to showcase the full performance benefits of pushdown optimization at one of the largest P&C insurers in the United States. The client holds a diverse underwriting portfolio spanning more than 70 countries, multiple industries, and a range of specialized products for small and medium-sized businesses.
However, as their requirements have grown the company has struggled to bring structure to a vast data infrastructure that encompasses detailed policy information drawn from various subsidiaries, actuarial tables, along with real-time feedback on fluctuating premium rates and agent commissions.
Fig. 10: High-Volume, Complex Dataflow
Before pushdown optimization was deployed, the organization was spending up to four hours at a time cleaning, enriching, and transforming this data through the staging server before migrating it to delivery systems. This lengthy process also took up a significant amount of network resources.
When the company approached us to resolve these issues, we recognized that their concerns could be effectively resolved by switching to ELT. Our service team used the verify pushdown feature to assess any changes that were needed to bring their dataflows in alignment for pushdown optimization.
Full pushdown could not be deployed in this instance, because the company was looking to load their data into delimited targets rather than an RDBMS . However, we were still able to perform several tweaks to ensure that most of the transformation logic ran in-database rather than through the Centerprise engine.
After implementing these modifications, we were able to decrease runtime for this dataflow down to just 50 seconds in pushdown optimization mode.
New Pushdown Optimization Features are Coming to Centerprise
As part of the Centerprise 8.0 roll-out, we have added a number of new functionalities to pushdown optimization mode that allow users to take even more control of their ELT integration processes. Here’s a brief rundown of what you can expect.
In Centerprise 8.0, we have added a new preview pane that provides users with a full view of their dataflows running in pushdown optimization mode, as well as the related SQL statement(s) that have been generated and pushed to the source and/or target database. This comprehensive view also enables technical users to modify SQL code for their pushdown dataflows.
Choose from Source, Destination, or Full Pushdown Optimization
Earlier versions of Centerprise support both source and destination side optimization. There are use cases where full pushdown optimization deployment may be more beneficial.
For instance, dimension lookups are a key part of Astera’s Data Warehouse Builder. They provide users the ability to select appropriate dimension rows for a given fact.
However, before fact table data can be loaded into a data warehouse, several joins may need to be performed on the source database in pushdown optimization mode. As this data is moved to the target, many dimension lookups will also need to be performed in the ETL server to allow the data to be written to its destination. In Centerprise 7, only source-side pushdown optimization would be deployed for this transformation.
In version 8.0, users can now choose to run these transformations in full pushdown instead. When this option is selected, Centerprise will create a staging table in the target database which can be used to perform dimension lookups concurrently with source-side joins, so that data can be written to the target in a far more efficient manner. These composite queries can create major performance improvements for your “dataflow.”
Most modern RDBMS come equipped with in-built optimizers that automatically find the most resource-efficient execution plan for each query. However, these plans can sometimes fail to produce the expected performance during dataflow execution. In this case, users can employ query hints to override the optimizer and take direct control over how data is processed in-database.
In upcoming versions of Centerprise, users will be able to apply hints to various transformations including aggregations, unions, and joins in pushdown optimization mode. They can also be used to control how tables are scanned, the order of operations for a certain function, or the allocation of system resources for a specific job.
Generate Pushdown Dataflows
With Centerprise 8, you can generate pushdown dataflows to review and understand SQL Statements generated in pushdown optimization mode. This visual representation will help users to optimize their dataflow for performance and scalability.
In Centerprise 8.0, you can select any action in a dataflow and choose not to push it down. This option is useful in situations where our ETL engine executes a transformation differently to how it would be performed in-database.
Support for Databases
Centerprise 8.0 will add support for several major MPP data warehouse solutions such as:
Custom Function Maps
Centerprise provides a number of in-built functions that enable users to process incoming data according to a variety of logical rules. When Centerprise pushes these functions down to the database, it uses a table to map them to the corresponding database functions. Frequently used ETL database function tables are included in all versions of our solution.
However, if a function is not mapped, then the related transformation cannot be pushed down. This will, in turn, affect all succeeding transformations in the dataflow as well. In Centerprise 8.0, users can manually add mapping for these functions to enable pushdown of related transformations in-database.
Find Out More About Pushdown Mode in Astera Centerprise 8.0
Centerprise’s advanced pushdown capabilities are ideally suited to any organization that wants to bring the performance benefits of ELT integration to its data architecture. To find out more about the range of data processing capabilities supported in our latest release, stay tuned for our upcoming whitepaper which will cover these features in-depth. If you are looking for a more use-case specific introduction to our product then you can also schedule a private consultation with our sales and support team.