home
 

Data Integration Jobs

Overview

Centerprise Data Integrator is designed to provide a hyper-parallel multithreaded integration platform for handling large-scale data integration tasks. The engine takes advantage of the multiple processor architecture of today’s computers by running several steps of data integration tasks in parallel. Additionally, the product provides a number of features to enable major performance gains in data integration tasks.

Integration performance depends on a number of variables, and achieving optimal performance requires understanding and controlling these variables. Centerprise provides a number of options for tuning and optimizing data integration jobs.

This document discusses features and techniques that can be used to improve the performance of your data integration tasks and specifically applies to situations where your destination is a database. Centerprise supports multiple databases including Microsoft SQL Server, Oracle, Sybase, MySql and DB2. Some of the features and options discussed in this document do not apply to all supported databases. Please refer to the product user documentation to ascertain if the feature is supported for your database.

Understanding Your Data

Developing a detailed understanding of your data is a key part in building an optimized data integration job. Centerprise provides a number of options to help you optimize your integration jobs and knowing your data will assist you in developing an optimal integration process. Here are some aspects of the data that you need to consider as part of designing your integration job:

Record Size

If your data record is very large, it may be desirable to use dynamic and cached update statements instead of a single prepared update statement. On the other hand, if the data record is relatively short, a prepared update may be more efficient than a dynamic one.

Number of Records

The larger the job, the more parameters you need to control and tune.

Data Modifications

If data contains a large subset of records that require no updating or only updating of a few fields, data synchronization and cached update statements can provide a substantial boost. On the other hand, if updates do not follow any discernable pattern, a prepared update statement may be faster.

Database Constraints

If a large subset of incoming data will pass database constraints—such as unique constraints, foreign key constraints, etc., you should use batched updates and bulk inserts.

Eliminate Intermediate Steps

It is a common practice to create a text file that is then transported to a different group within the organization and loaded to the database. For large data integration situations, this process can create significant latency. This intermediate step can be eliminated by using Centerprise to directly transfer data between the two databases. In cases where one source feeds multiple destinations, you can use Centerprise APIs to create multi-destination data synchronizers.

Data Synchronization

Data Synchronization is a key feature in Centerprise and can be used to achieve substantial performance increases in situations where a database is periodically updated from a source and a small subset of data has changed since the previous transfer. More discussion on Data Synchronization is provided here.

Parallel Writes

Parallel writes enable you to specify multiple database writers. When you specify multiple database writers, Centerprise allocates a separate thread for each of these writers. These writers are then used by the multithreaded Centerprise engine to perform parallel writes to database. Based on the configuration of your Centerprise server and database hardware, you can choose an appropriate number of parallel writers.

Bulk Inserts

Bulk and batch inserts, supported for selected databases, provides enormous performance gains compared to individual inserts. When available, use this feature to boost the performance of your jobs.

Database Updates

Generally, updates are the slowest of database write operations. Centerprise provides multiple options to boost update performance. Depending on the nature of your data, your LAN/WAN configuration, size of database record, and the frequency of updates, you can choose the appropriate options to speed up performance.

Batch Update

Batch update enables you to perform a set of update operations in a single call to the database. This option provides a substantial performance boost over individual updates by reducing the number of trips to the database. If the batch update fails—which happens when any database constraints fail, Centerprise attempts to write each of the rows individually to isolate records with errors. Therefore, if a large subset of your data fails the database constraints, this option can actually slow down updates. In most cases, however, you will see considerable performance improvements when using this option.

Cached update commands

In some instances, database records are very wide. However, relatively few fields need to be updated for each record. For these situations, you can use dynamic update commands. Dynamic update commands incur the cost of compiling and optimizing update statements. However, they reduce network traffic by updating only the fields that are different between the source and the destination. One way to minimize the creation of dynamic update statement is to cache the update statements. This option enables you to specify the size of the update statement cache. For example…..Centerprise uses a least-recently-used caching mechanism to manage update statement cache. On every update, Centerprise looks into the cache to see if an update statement exists that can be used for the fields being updated for the current record. If it is, Centerprise uses that statement. Otherwise, a new update statement is prepared and executed.

Server Location

Locating a server closer to the database server can also boost performance by reducing network latency and therefore the cost of roundtrips. Integration jobs will run significantly slower on a wide area network. The location of the source and destination files also affects the performance.

Database Options

Dropping and recreating indexes

For large tasks involving bulk inserts, dropping indexes can significantly boost task performance. For these types of jobs, it is faster to drop indexes, transfer data, and recreate indexes.

Indexes on Keys

When using the ‘Check for Existing’ option while writing to a database, having an index on the key fields can provide major performance improvements. This is also the case for Data Synchronization jobs.

Transaction Processing

For large jobs, transaction processing can slow down database writes and will adversely affect performance. Centerprise currently does not support transaction processing for parallel writes.

Log Writer Options

Centerprise provides the capability to generate detailed logs of transfers. You can log every single record transferred with source and destination values as well as any errors that occur transfers. This, however, comes at a significant performance cost. For large transfers, you should avoid logging every single record and limit the logging to errors only.

Custom Integrators

In some instances, the complexity of an integration task may require you to develop your own custom integrator. Centerprise provides extensive APIs to enable you to create custom integrators. Astera has made available a series of articles titled ‘Creating Custom Integrators Using Centerprise APIs’ to assist you with creating custom integrators. Source code for the samples described in articles is also available for download at Astera’s website.