Администрирование
HyperSQL, IBM DB2, IBM DB2, Interbase, Firebird, Yaffil , MSSQL, MySQL, ORACLE, PostgreSQL, SYBASE, Другие, Курсы, ... |
Безопасность
HyperSQL, MSSQL, Oracle, Phpmyadmin, Инъекция SQL, Книги, Патчи, Программы, Скрипты, СУБД, ... |
Проектирование БД
Business intelligence, Data Warehouse, Designer 2000 (Oracle), ErWin, ETL, OLAP, Sybase PowerDesigner , Другие CASE средства, Статьи, Теория БД, ... |
Производительность
DB2, HyperSQL, IBM DB2, Interbase, Firebird, Yaffil, MSSQL, MySQL, ORACLE, PostgreSQL, SQL, SYBASE, Другие, ... |
Установка
HyperSQL, IBM DB2, Interbase, Firebird, Yaffil , MSSQL, MySQL, ORACLE, PostgreSQL, SYBASE, Другие, ... |
|
|
Главная ETL Generate surrogate key (ETL)
Generate surrogate key (ETL)
|
|
|
|

Goal Fill in a data warehouse dimension table with data which comes from different source systems and assign a unique record identifier (surrogate key) to each record. Scenario overview and details To illustrate this example, we will use two made up sources of information to provide data about customers dimension. Each extract contains customer records with a business key (natural key) assigned to it. In order to isolate the data warehouse from source systems, we will introduce a technical surrogate key instead of re-using the source system's natural (business) key.
A unique and common surrogate key is a one-field numeric key which is shorter, easier to maintain and understand, and independent from changes in source system than using a business key. Also, if a surrogate key generation process is implemented correctly, adding a new source system to the data warehouse processing will not require major efforts.
Surrogate key generation mechanism may vary depending on the requirements, however the inputs and outputs usually fit into the design shown below: Inputs: - an input respresented by an extract from the source system - datawarehouse table reference for identifying the existing records - maximum key lookup
Outputs: - output table or file with newly assigned surrogate keys - new maximum key - updated reference table with new record Proposed solution Assumptions: - The surrogate key field for our made up example is WH_CUST_NO. - To make the example clearer, we will use SCD 1 to handle changing dimensions. This means that new records overwrite the existing data. The ETL process implementation requires several inputs and outputs. Input data: - customers_extract.csv - first source system extract - customers2.txt - second source system extract - CUST_REF - a lookup table which contains mapping between natural keys and surrogate keys - MAX_KEY - a sequence number which represents last key assignment
Output data: - D_CUSTOMER - table with new records and correctly associated surrogate keys - CUST_REF - new mappings added - MAX_KEY sequence increased
The design of an ETL process for generating surrogate keys will be as follows: The loading process will be executed twice - once for each of the input files Check if the lookup reference data is correct and available: - PROD_REF table - max_key sequence Read the extract and first check if a record already exists. If it does, assign an existing surrogate key to it and update the desciptive data in the main dimension table. If it is a new record, then: - populate a new surrogate key and assign it to the record. The new key will be populated by incrementing the old maximum key by 1. - insert a new record into the products table - insert a new record into the mapping table (which stores business and surrogate keys mapping) - update the new maximum key |
ETL process to load extract and assign surrogate keys The following spoon Transformation loads the customers extract into a data warehouse and assigns a surrogate (technical) key to all records. As a final step the transform updates the maximum key dictionary table. ETL process design The process has been designed in Pentaho Data Integration - Kettle Spoon 3.02. There are probably many other ways to achieve the same goal, however our intention is also to show how various Spoon components work. For instance, most of the objects might be replaced by a Java Script component to do the calculations and descrease the number of objects.
Inputs: - CSV input textfile, an extract from the source system - Datawarehouse table (WH_PARAMS) reference for lookup of the maximum surrogate key
Outputs: - Output table with newly loaded extract and assigned surrogate keys (D_CUSTOMER) - New maximum key which updates the reference table (WH_PARAMS)
Please refer to the screenshot below for the transform design:
Surrogate key generation transform design in Kettle Spoon: 
Step by step ETL process flow Below the description of all ETL process steps used in the transformation with the specification of used Spoon objects: CSV file input (CSV Input component) - reads the csv extract with customers data WH surrogate key field (Add constant component) - assigns a constant value to the wh_key_field. The assigned value is wh_cust_no and will be used in the next lookup. Lookup Max Key (Stream Value lookup and Table Input) - those two components get the maximum surrogate key value already stored in the database (wh_key_maxval field). NewCustSeq (Add Sequence) - this step generates a sequence number which starts from 1 and is incremented by every record that passes through it (new_cust_cnt). CalcSurrogateKey (Calculator) - the calculator adds new_cust_cnt to wh_key_maxval and creates a new surrogate key maximum (wh_cust_no) Select Values and D_CUSTOMER (Select Rename Fields and Table Output) - those steps actually feed the customers dimension table in the data warehouse Select Rename Fields and GetNewMaxKey (Select Rename Fields and Group By component) - the newly loaded records are grouped and we select the maximum surrogate key (wh_key_maxval). UpdateKeyLookup (Update component) - updates the maximum surrogate key in the database Design of the most significant components in the transform LkpMaxKey - Maximum surrogate (technical) key lookup: 
UpdateKeyLookup - update of the maximum surrogate key value in the data warehouse: 
| |
|
Последние добавленные статьи |
|
|
|
|