Wednesday, 29 May 2013

Considerations – Loading Oracle Data Using SSIS

With the changing dynamics of technology, extracting data from heterogeneous databases has become an enterprise need. Microsoft SSIS is one of the better ETL tools for extracting data from multiple, heterogeneous sources. Below are some considerations and best practices to keep in mind during the pull.


Pulling Data from Oracle

There are several methods to pull data from Oracle. This link provides detailed documentation for creating the ETL and planning an SSIS solution using the Oracle OLE DB Provider. Below are some additional points to consider while pulling data from Oracle.


Choosing Provider

The choice of provider can change the speed of the data pull several fold, which makes it one of the most critical parts of planning the ETL. The options currently available are listed below.

  1. The Microsoft OLE DB Provider for Oracle. It is available only as a 32-bit version and is deprecated.
  2. The Oracle OLE DB provider, which can be downloaded from here (a sample connection string appears after this list). It is widely used, very stable in terms of performance and maturity, and causes the fewest issues.
  3. The fastest providers ship as the Microsoft Connectors by Attunity for Oracle, which support both 32-bit and 64-bit versions; the SQL Server 2012 version can be downloaded from here. These providers tend to be very fast, almost 20 percent faster than the Oracle providers.
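
As an illustration, an OLE DB connection manager that uses the Oracle OLE DB provider takes a connection string along the following lines; the TNS alias and credentials here are placeholders, not values from this article.

    Provider=OraOLEDB.Oracle;Data Source=MyTnsAlias;
    User ID=etl_user;Password=<password>;

The Data Source value must resolve through the Oracle client's tnsnames.ora (or an EZCONNECT string), so the matching 32-bit or 64-bit Oracle client has to be installed on the machine running SSIS.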

In-Memory Transformations & Minimized/Required Column Approach

SSIS is an in-memory pipeline. It is very fast as long as operations stay in memory, but once memory is full the transformations spill to disk and performance starts to degrade. The following methods can be considered to keep the pipeline in memory.

  1. Ensure that only the required, minimal set of columns is pulled from the Oracle source, either through views or by using a SQL command that selects only the needed columns (see the sketch after this list). Minimizing the columns keeps the data size per row to a minimum.
  2. Prefer SQL queries over the table drop-down option. The drop-down option uses a different set of commands to prepare the SQL before executing it; because of this, SSIS can sometimes appear to hang while the same query returns very quickly in Management Studio. The same behavior shows up when selecting a table from the drop-down, which can take a long time just to list the tables.
  3. Whether transformations are staying in memory can be verified with the “Buffers spooled” perfmon counter for SSIS. Here are the details about the SSIS performance counters. The initial value of this counter is zero; any value above zero indicates that the SSIS pipeline is swapping buffers to disk.
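
For point 1, a minimal sketch of a source query follows. The SALES table and its columns are hypothetical; the idea is simply to name the columns the ETL needs instead of selecting everything.

    -- SQL command for the OLE DB source: pull only the required
    -- columns and rows (table and column names are illustrative).
    SELECT SALE_ID,
           SALE_DATE,
           AMOUNT
    FROM   SALES
    WHERE  SALE_DATE >= TO_DATE('2013-05-01', 'YYYY-MM-DD')

Pushing the column list and the filter into the source query keeps each pipeline row as narrow as possible, which directly reduces the buffer memory used per row.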

Data Type Transformations

Using efficient data types ensures the best performance with the minimum resources. Consider a big integer value such as ‘19876234987’: it can be stored either as a number or as characters. Numbers use different algorithms than characters for indexing, comparison, and so on, so storing it as characters gives slower performance, takes more space to store, and consumes more memory (RAM) during the ETL. The best practice is to use the optimal length for each data type and make it as narrow as possible. For monetary values, the MONEY data type is compact and fast, while NUMERIC and FLOAT carry larger precision, which can lead to performance issues.
Below is the default mapping of the data types from Oracle to SSIS.


Oracle Data Type                                  Integration Services Data Type
------------------------------------------------  ------------------------------------------
VARCHAR2                                          DT_STR
NVARCHAR2                                         DT_WSTR
NUMBER                                            DT_R8
NUMBER(P, S)                                      When the scale is 0: DT_I1, DT_I2, DT_I4,
                                                  or DT_NUMERIC(P, 0), depending on the
                                                  precision (P); otherwise DT_NUMERIC(P, S)
LONG                                              Not supported
DATE                                              DT_DBTIMESTAMP
RAW                                               DT_BYTES
LONG RAW                                          Not supported
ROWID                                             Not supported
CHAR                                              DT_STR
User-defined type (object type, VARRAY,           Not supported
nested table)
REF                                               Not supported
CLOB, NCLOB, BLOB, BFILE                          Not supported
TIMESTAMP, TIMESTAMP WITH TIME ZONE,              DT_STR
INTERVAL YEAR TO MONTH, INTERVAL DAY TO
SECOND, TIMESTAMP WITH LOCAL TIME ZONE
UROWID                                            Not supported

A Data Conversion transformation can be used to convert between data types, and the package might look like the image below. Type casting is important, but too much casting degrades the performance of the pull as well; staging tables play a vital role in working around these type issues.
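
One way to limit casting inside the pipeline is to push it into the Oracle source query, so SSIS receives a narrow type from the start. A small sketch, with a hypothetical SALES table whose AMOUNT column is a plain NUMBER:

    -- Casting in the source query: NUMBER(10, 2) arrives in SSIS as
    -- DT_NUMERIC(10, 2) instead of NUMBER mapping to DT_R8
    -- (table and column names are illustrative).
    SELECT CAST(AMOUNT AS NUMBER(10, 2)) AS AMOUNT
    FROM   SALES

This reserves the Data Conversion task for the cases that genuinely need it.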

Using Bulk Operations and Minimal Logging

When inserting data into SQL Server, minimally logged operations provide the maximum performance. Below are some of the options that can be used.
  1. In the data flow task, use the fast load option to load data into the target tables.
  2. For fast ETL, foreign keys can be removed so that inserts/updates run in parallel. If the keys are required, the best practice is to drop all the foreign keys and recreate them once the data pull is complete. When a large share of the target table is changing, dropping primary keys and other indexes is also a best practice.
  3. Use TABLOCK if the target table is not in frequent use, and consider disabling check constraints.
  4. Set a proper value for “Maximum insert commit size” to ensure batch-wise inserts.
  5. Prefer NOLOCK for lookups and other read tasks to minimize the locking impact.
  6. SQL Server partitioning and the SWITCH statement can increase performance several fold when working with large data (see the sketch after this list).
  7. Minimize blocking and slow-performing SSIS operations, e.g., Sort and aggregations such as SUM, MIN, and MAX.
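
A rough T-SQL sketch of points 3 and 6 follows. All object names are hypothetical, and the SWITCH variant assumes the staging table lives on the same filegroup as the target partition, with an identical schema and a check constraint matching the partition boundary.

    -- Minimally logged load into the target (assumes the database is in
    -- the SIMPLE or BULK_LOGGED recovery model and the table can be locked).
    INSERT INTO dbo.FactSales WITH (TABLOCK)
    SELECT SaleID, SaleDate, Amount
    FROM   dbo.FactSales_Stage;

    -- Alternatively, with a partitioned target, switch the fully loaded
    -- staging table in as a metadata-only operation (pfSaleDate is an
    -- assumed partition function on the date key).
    ALTER TABLE dbo.FactSales_Stage
        SWITCH TO dbo.FactSales PARTITION $PARTITION.pfSaleDate(20130501);

Because SWITCH only updates metadata, it completes in roughly constant time regardless of how many rows the staging table holds.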

Revisit ETL Design

Every ETL tool has its own pros and cons. To get optimal performance out of SSIS, the design considerations below can be taken into account; most of them apply to large data sets.
  1. Redesign the organization of the data using staging tables in such a way that TRUNCATE can be used instead of DELETE.
  2. To UPDATE large amounts of data from a data flow task, prefer loading a staging table and running a set-based UPDATE statement in SQL Server rather than row-by-row OLE DB calls (see the sketch after this list); for small dimensions, row-by-row calls are fine.
  3. Set-based operations and transformations give the best performance in SQL Server, since it is designed for set-based work, whereas row-by-row transformations are best handled by SSIS.
  4. Date and time conversions tend to be faster in SQL Server than in SSIS, so it is better to perform them in SQL Server.
  5. A delta (incremental) data load should be evaluated well before implementation. If a large share of the data changes, a full load gives better performance than delta-detection logic.
  6. The ideal pull from a heterogeneous source looks like the image below. Different parts of the ETL can be removed or added as per the need.
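
A compact sketch of points 1 and 2, with all table and column names hypothetical: the staging table is reset with TRUNCATE, loaded by the data flow, and then applied to the target with one set-based UPDATE.

    -- Cheap, minimally logged reset before each run (TRUNCATE, not DELETE).
    TRUNCATE TABLE dbo.DimCustomer_Stage;

    -- ...the SSIS data flow bulk-loads dbo.DimCustomer_Stage here...

    -- One set-based UPDATE instead of an OLE DB Command firing per row.
    UPDATE c
    SET    c.CustomerName = s.CustomerName,
           c.Email        = s.Email
    FROM   dbo.DimCustomer       AS c
    JOIN   dbo.DimCustomer_Stage AS s
           ON s.CustomerID = c.CustomerID;

This turns thousands of singleton round trips into a single set-based statement, which is exactly the workload SQL Server is optimized for.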
