Overview
Centerprise Data Integrator is designed to provide a hyper-parallel
multithreaded integration platform for handling large-scale data integration
tasks. The engine takes advantage of the multiple processor architecture of
today’s computers by running several steps of data integration tasks in
parallel. Additionally, the product provides a number of features to enable
major performance gains in data integration tasks.
Integration performance depends on a number of variables, and achieving optimal
performance requires understanding and controlling these variables. Centerprise
provides a number of options for tuning and optimizing data integration jobs.
This document discusses features and techniques that can be used to improve the
performance of your data integration tasks and specifically applies to
situations where your destination is a database. Centerprise supports multiple
databases including Microsoft SQL Server, Oracle, Sybase, MySql and DB2. Some of
the features and options discussed in this document do not apply to all
supported databases. Please refer to the
product user documentation to ascertain if the feature is supported for your
database.
Understanding Your Data
Developing a detailed understanding of your data is a key part in building an
optimized data integration job. Centerprise provides a number of options to help
you optimize your integration jobs and knowing your data will assist you in
developing an optimal integration process. Here are some aspects of the data
that you need to consider as part of designing your integration job:
Record Size
If your data record is very large, it may be desirable to use dynamic and cached
update statements instead of a single prepared update statement. On the other
hand, if the data record is relatively short, a prepared update may be more
efficient than a dynamic one.
Number of Records
The larger the job, the more parameters you need to control and tune.
Data Modifications
If data contains a large subset of records that require no updating or only
updating of a few fields, data synchronization and cached update statements can
provide a substantial boost. On the other hand, if updates do not follow any
discernable pattern, a prepared update statement may be faster.
Database Constraints
If a large subset of incoming data will pass database constraints—such as unique
constraints, foreign key constraints, etc., you should use batched updates and
bulk inserts.
Eliminate Intermediate Steps
It is a common practice to create a text file that is then transported to a
different group within the organization and loaded to the database. For large
data integration situations, this process can create significant latency. This
intermediate step can be eliminated by using Centerprise to directly transfer
data between the two databases. In cases where one source feeds multiple
destinations, you can use Centerprise APIs to create multi-destination data
synchronizers.
Data Synchronization
Data Synchronization is a key feature in Centerprise and can be used to achieve
substantial performance increases in situations where a database is periodically
updated from a source and a small subset of data has changed since the previous
transfer. More discussion on Data Synchronization is provided here.
Parallel Writes
Parallel writes enable you to specify multiple database writers. When you
specify multiple database writers, Centerprise allocates a separate thread for
each of these writers. These writers are then used by the multithreaded
Centerprise engine to perform parallel writes to database. Based on the
configuration of your Centerprise server and database hardware, you can choose
an appropriate number of parallel writers.
Bulk Inserts
Bulk and batch inserts, supported for selected databases, provides enormous
performance gains compared to individual inserts. When available, use this
feature to boost the performance of your jobs.
Database Updates
Generally, updates are the slowest of database write operations. Centerprise
provides multiple options to boost update performance. Depending on the nature
of your data, your LAN/WAN configuration, size of database record, and the
frequency of updates, you can choose the appropriate options to speed up
performance.
Batch Update
Batch update enables you to perform a set of update operations in a single call
to the database. This option provides a substantial performance boost over
individual updates by reducing the number of trips to the database. If the batch
update fails—which happens when any database constraints fail, Centerprise
attempts to write each of the rows individually to isolate records with errors.
Therefore, if a large subset of your data fails the database constraints, this
option can actually slow down updates. In most cases, however, you will see
considerable performance improvements when using this option.
Cached update commands
In some instances, database records are very wide. However, relatively few
fields need to be updated for each record. For these situations, you can use
dynamic update commands. Dynamic update commands incur the cost of compiling and
optimizing update statements. However, they reduce network traffic by updating
only the fields that are different between the source and the destination. One
way to minimize the creation of dynamic update statement is to cache the update
statements. This option enables you to specify the size of the update statement
cache. For example…..Centerprise uses a least-recently-used caching mechanism to
manage update statement cache. On every update, Centerprise looks into the cache
to see if an update statement exists that can be used for the fields being
updated for the current record. If it is, Centerprise uses that statement.
Otherwise, a new update statement is prepared and executed.
Server Location
Locating a server closer to the database server can also boost performance by
reducing network latency and therefore the cost of roundtrips. Integration jobs
will run significantly slower on a wide area network. The location of the source
and destination files also affects the performance.
Database Options
Dropping and recreating indexes
For large tasks involving bulk inserts, dropping indexes can significantly boost
task performance. For these types of jobs, it is faster to drop indexes,
transfer data, and recreate indexes.
Indexes on Keys
When using the ‘Check for Existing’ option while writing to a database, having
an index on the key fields can provide major performance improvements. This is
also the case for Data Synchronization jobs.
Transaction Processing
For large jobs, transaction processing can slow down database writes and will
adversely affect performance. Centerprise currently does not support transaction
processing for parallel writes.
Log Writer Options
Centerprise provides the capability to generate detailed logs of transfers. You
can log every single record transferred with source and destination values as
well as any errors that occur transfers. This, however, comes at a significant
performance cost. For large transfers, you should avoid logging every single
record and limit the logging to errors only.
Custom Integrators
In some instances, the complexity of an integration task may require you to
develop your own custom integrator. Centerprise provides extensive APIs to
enable you to create custom integrators. Astera has made available a series of
articles titled ‘Creating Custom Integrators Using Centerprise APIs’ to assist
you with creating custom integrators. Source code for the samples described in
articles is also available for download at Astera’s website.