Tags: ETL, SQL Server Integration Services
Our customer has been supporting an Oracle-based online transaction processing (OLTP) system. They decided to take advantage of the modern data analysis capabilities of SQL Server's data warehouse features, including columnstore indexes, for greater value than an Oracle-based data warehouse solution would provide. We designed a comprehensive solution for extracting and transforming their source data and loading it into the target database. As a result, our customer obtained a highly standardized, easily manageable solution within a short period of time.
Our customer has been running their OLTP workload on an Oracle database, with a large amount of data added on a regular basis. They decided to add a data warehouse to their system in order to analyze their data and facilitate future planning based on that information. They needed a solution to extract the operational data from the source Oracle database, transform it into a historical format, and finally load it into the target SQL Server database. For that task they considered SQL Server Integration Services (SSIS) the best fit. SQL Server Integration Services is Microsoft's data integration platform that provides data extraction, transformation, and loading (ETL) processes for SQL Server data warehouses. To implement these processes, we created SSIS packages, each holding a variety of elements that define a workflow.
Let's review the implemented ETL strategy. We chose the project deployment model to deploy the SSIS packages to a dedicated Integration Services server. That allowed us to centralize package management and to monitor and log package execution performance. Another advantage is better overall performance, since the roles of the SQL Server database and SQL Server Integration Services are separated.
Data extraction and mapping
Before starting the actual ETL procedures, we had to convert the Oracle PL/SQL scripts used for data extraction. For this purpose, we manually rewrote the PL/SQL code to the SQL Server T-SQL standard. Once the conversion was complete, we specified the converted scripts as a data source for the SSIS packages.
Since the source and target databases have different data models, we performed data mapping to define how data is transformed during the integration process. That allowed us to eliminate potential data inconsistency issues during the subsequent ETL steps. We captured all of these transformations in a data mapping document.
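To give a feel for what one entry of such a mapping looks like, here is a minimal sketch in Python. The column names and transformations are purely illustrative assumptions, not taken from the actual project, where the mapping lived in a document and was implemented inside the SSIS data flow.

```python
# Hypothetical sketch of a data-mapping document entry: each source column
# maps to a target column plus an optional transformation applied during
# integration. Column names are illustrative only.
COLUMN_MAP = {
    "CUST_NAME": ("CustomerName", str.strip),          # trim whitespace
    "CREATED_DT": ("CreatedDate", lambda v: v[:10]),   # keep the date part only
    "STATUS_CD": ("StatusCode", str.upper),            # normalize casing
}

def map_record(source_row: dict) -> dict:
    """Apply the mapping to one source record, producing a target record."""
    target = {}
    for src_col, (tgt_col, transform) in COLUMN_MAP.items():
        value = source_row.get(src_col)
        target[tgt_col] = transform(value) if value is not None else None
    return target

row = {"CUST_NAME": "  Acme Corp ", "CREATED_DT": "2015-03-01 10:22:00", "STATUS_CD": "active"}
print(map_record(row))
```

Keeping the mapping in one declarative structure like this mirrors the role of the mapping document: every transformation is recorded in a single place that both developers and reviewers can inspect.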
Data profiling and cleansing
Once the data was extracted from the source database, we had to transform it according to the data warehouse model. This stage also included data cleansing and validation.
We profiled each extracted data set to verify that it met the following criteria:
- For each source data set, we checked that the source file or query returned the correct record count.
- We checked any given source records for missing and duplicated values.
- We performed testing to discover referential integrity errors.
- We identified any source schema changes, such as data type changes.
- We flagged suspicious data values that may have been caused by data-entry errors.
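The first three checks in the list can be sketched as follows. This is a hypothetical Python illustration of the logic, not the actual SSIS implementation; the field names, the `expected_count` input, and the shape of the result are all assumptions.

```python
from collections import Counter

def profile(rows, key_field, expected_count, parent_keys, fk_field):
    """Run basic profiling checks on one extracted data set.

    rows           -- extracted records (list of dicts)
    key_field      -- the record's key column, checked for missing/duplicates
    expected_count -- record count reported by the source query
    parent_keys    -- valid key set of the parent table (referential check)
    fk_field       -- foreign-key column to validate against parent_keys
    """
    findings = {}
    # 1. Record count matches what the source reported.
    findings["count_ok"] = (len(rows) == expected_count)
    # 2. Missing and duplicated key values.
    keys = [r.get(key_field) for r in rows]
    findings["missing_keys"] = keys.count(None)
    findings["duplicate_keys"] = [k for k, n in Counter(keys).items()
                                  if k is not None and n > 1]
    # 3. Referential integrity: foreign keys must exist in the parent set.
    findings["orphan_fks"] = [r[fk_field] for r in rows
                              if r[fk_field] not in parent_keys]
    return findings

rows = [
    {"id": 1, "dept": 10},
    {"id": 2, "dept": 99},   # orphan foreign key
    {"id": 2, "dept": 10},   # duplicate id
]
print(profile(rows, "id", 3, {10, 20}, "dept"))
```

Schema-change detection and suspicious-value flagging would follow the same pattern: each check contributes one entry to the findings, which then drive the cleansing step.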
With these findings, we performed data cleansing to eliminate the detected issues before data integration started, avoiding troubleshooting problems later on.
Logging and auditing
For auditing and handling of possible errors, we implemented a logging approach on the Integration Services server. Instead of standard SSIS logging, we used custom logging built around parent packages that control the execution of the child packages. We embedded logging inside each parent package to record a number of execution parameters, such as start and stop times.
That allowed us to audit the state of all packages in the project by keeping logs for the parent packages only. In case of execution errors, we could easily check the logs of the problematic package and pinpoint the error.
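The parent-package pattern can be sketched like this. It is a hedged Python illustration of the control flow only; in the real solution the parent is itself an SSIS package and the log is written to tables on the Integration Services server, and the package names below are made up.

```python
import time

def run_child_package(name):
    """Stand-in for executing a child SSIS package (hypothetical)."""
    return "Success"

def run_parent(child_packages):
    """Parent package: execute each child and log start/stop times and
    status, so only the parent's log needs to be inspected for auditing."""
    log = []
    for name in child_packages:
        start = time.time()
        try:
            status = run_child_package(name)
        except Exception as exc:            # a failed child is logged, not lost
            status = f"Failed: {exc}"
        log.append({"package": name,
                    "start": start,
                    "stop": time.time(),
                    "status": status})
    return log

for entry in run_parent(["ExtractCustomers", "TransformOrders"]):
    print(entry["package"], entry["status"])
```

Because every child execution passes through the parent, one log per parent package is enough to reconstruct what ran, when, and with what outcome.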
Data loading
The final stage of the ETL process was the data loading procedure. At this step, we paid particular attention to the initial data load, because it is a long-running and memory-intensive process. We specified appropriate load intervals to control the amount of data loaded at a time and avoid running into out-of-memory exceptions.
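The idea behind load intervals can be shown with a small Python sketch: rows are pushed to the target in fixed-size batches so that only one batch is held in memory at a time. This is an illustration under assumptions; `insert_batch` is a hypothetical stand-in for the actual bulk insert into SQL Server.

```python
def load_in_batches(rows, batch_size, insert_batch):
    """Load rows in fixed-size intervals; insert_batch is a stand-in for
    the target's bulk insert. Returns the total number of rows loaded."""
    batch = []
    loaded = 0
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            insert_batch(batch)      # flush a full batch to the target
            loaded += len(batch)
            batch = []               # free the memory held by the batch
    if batch:                        # flush the final, partial batch
        insert_batch(batch)
        loaded += len(batch)
    return loaded

batches = []
total = load_in_batches(range(10), 4, batches.append)
print(total, [len(b) for b in batches])   # 10 [4, 4, 2]
```

Choosing the batch size is the trade-off the text describes: larger intervals mean fewer round trips but more memory per flush, which matters most during the initial load.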
We developed this project in accordance with strict Microsoft SSIS standards, keeping all created objects under version control in TFS. Additionally, we prepared documentation describing how each delivered SSIS package works, noting specific exceptions where relevant.
The benefits for our customer included:
- Advanced package deployment model
- Enhanced performance tracking and troubleshooting by utilizing customized logging
- Accurate data transformation with data profiling and mapping approaches
- Improved performance by running data transformation on the dedicated server