"Extract, Transform and Load (ETL)"
Extract, transform and load (ETL) is a process used in data warehousing that involves
- extracting data from outside data sources,
- transforming it to fit business needs and
- loading it into a data warehouse
ETL is a key part of building a data warehouse because it is the way data is pulled from multiple, disparate data sources and combined and loaded into the warehouse.
Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Extraction converts the data into a format for transformation processing.
Transform
The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. However, in other cases any combination of the following transformations types may be required:
- Specifying which columns to load
- Normalizing free-form values (e.g. mapping “Female” and “W” and “Ms.” into one value)
- Summarizing multiple rows of data (e.g. total cost per region)
- Translating coded values (e.g. converting source “M” and “F” to warehouse’s 1 and 0)
- Deriving calculated values
Load
The load phase actually loads the data into the destination data warehouse. The actual requirements of this step vary widely based on each business’ needs. Some data warehouses overwrite old data with new while others maintain a history and audit trail.
Tools
A good ETL tool must be able to communicate with many different relational databases to access the data it will process. The best ETL tools will be both powerful and easy-to-use with a logical interface. In addition, data quality tools that validate and standardize addresses, remove duplicates, assign proper casing etc. will greatly improve the quality of the data in the data warehouse.