ETL

An acronym standing for Extract, Transform and Load. Sometimes extended with a C for Cleanse (ECTL).

Usually used in the context of Data Warehousing to refer to a software tool that is employed to move large amounts of data from one database to another, such as from a DB2 database on a mainframe to an Oracle database on a midrange server. The tool's use is not limited to data warehousing only.

ETL popped up about 1988 with Prism III, a COBOL code generator with interfaces to several popular database engines of the time. Back then, they were known as Data Integration Tools. Later, Informatica, Ab Initio, Datastage, and other tools became available.

The basic functions required of an ETL tool are: a database interface, a mapping facility, and a transformation facility.

The earliest ETL tools (now called First-Generation) were code generators, that is, the ETL developer would enter information about the source of the data, the transformations required, the mapping of the source columns to the target columns, and then the tool would take that info and create programs that would then be executed to do the work.

The newest tools divide between doing some generation of code, and simply entering configurations for a standard engine. Informatica takes the standard engine approach and saves all of its configurations in a database, such as Oracle. AbInitio takes more of a generation approach, and generates Korn shell script that strings modules of code together, saving the conigurations in a proprietary format.

The transformation part of the equation is usually accomplished through the input of some kind of code by the ETL programmer. For Informatica, the code consists of SQL extended by procedural commands and calls to functions. For Ab Initio, the code is much more like the programming language "C".

Mapping consists of making logical connections between physical data, usually between columns in files, or columns in database tables. Most tools allow you to click and drag to draw lines between data entities. ETI Extract worked along these lines, like a HIPO or Hierarchical Input Process Output diagram, with a list of input columns along the left side of the screen, and output columns along the right. The lines across the middle were clicked on and transformation logic could be entered at that point. Informatica takes a more ERD or Entity Relationship Diagram approach, where the connections are between boxes on the screen, and the columns are represented as ports on the boxes. The transformations occur inside some of the boxes.

Finally, Extraction is most often accomplished through SQL or Structured Query Language, the interface usually provided to most modern Databases.