|
Страница 1 из 3 
The Date dimension is a well known construct in general data warehousing. In many cases, the data for a date dimension is generated using a database stored procedure or shell-script. Another approach to obtain the data for a date dimension is to generate it using an ETL tool like Pentaho Data Integration, a.k.a. Kettle. I think this approach makes sense for a number of reasons: - When you tend to use a particular ETL tool, you will be able to reuse the date dimension generator over an over, and on different database platforms.
- You won’t need special database privileges beyond the ones you need already. Privileges for creating tables and to perform DML will usually be available, whereas you might need to convince a DBA that you require extra privileges to create and execute stored procedures.
In addition to these general considerations, you can pull a neat little trick with Kettle to localize the data and format of the date attributes. I wouldn’t go as far as to say that this feature is Kettle specific: rather, it relies on the localization support built into the java platform and the way you can put that to use in Kettle transformations. Prerequisites In this tip, the steps to create a date dimension are described using Kettle 2.5.1 (Generally available Release) and MySQL 5.1.20 (Beta). You will be able to follow through the example using earlier (and later) versions of both products though - I am not using any functionality that is specific to these particular version of the products. The recipe does not really require that you understand anything about data warehouses or date dimensions, but you will probably appreciate it better if you do ;) Overview The transformation to generate the data for the date dimension follows a pretty straightforward design. The graphical representation of the transformation is shown below: 
First, the dimension table is created (Prepare). After that, rows are generated to fill it (Input). However, the generated rows are almost empty and barren - we still need to derive and add data to fill the attributes of the date dimension (Transformation). Finally, the data is stored in the date dimension table (Output). Step-by-Step The remainder of this article describes in detail how to build this transformation. The majority of steps is probably not very interesting to moderately experienced Kettle users, but may be of use to beginning users. Note for users that are completely new to Kettle - it is advisable to review the first few chapters of the Spoon user guide (Spoon is the name of Kettle tool you use to design the ETL process). It explains how to start up the tool, create a new transformation, add and connect steps etc. You can find it in the docs/English directory beneath the Kettle home directory. MySQL JDBC driver: setting the characterEncoding property to UTF8 You need to create a (JDBC) connection to MySQL in the usual, straightforward way: 
In addition, you need to set the characterEncoding property of the JDBC driver: 
This ensures MySQL will be able to understand the utf8 encoded data that we may produce to generate a date dimension in the, say, Chinese language. Note that you cannot just use a statement like SET NAMES utf8 to do this. This is not specific to Kettle, but has to do with the way the MySQL JDBC driver (Connector/J) handles character sets. Please refer to the “Using character sets and unicode” section of the Connector/J documentation for more information on this topic. Creating the date dimension table In this particular case, it seemed convenient to create the dimension table as part of the transformation. This is done using the “Execute SQL Script” step shown below: 
The “Execute SQL Script” step executes the following script to create the date dimension table: DROP TABLE IF EXISTS dim_date ; CREATE TABLE IF NOT EXISTS dim_date ( date_key smallint unsigned NOT NULL, date date NOT NULL, date_short char(12) NOT NULL, date_medium char(16) NOT NULL, date_long char(24) NOT NULL, date_full char(32) NOT NULL, day_in_year smallint unsigned NOT NULL, day_in_month tinyint unsigned NOT NULL, is_first_day_in_month char(10) NOT NULL, is_last_day_in_month char(10) NOT NULL, day_abbreviation char(3) NOT NULL, day_name char(12) NOT NULL, week_in_year tinyint unsigned NOT NULL, week_in_month tinyint unsigned NOT NULL, is_first_day_in_week char(10) NOT NULL, is_last_day_in_week char(10) NOT NULL, month_number tinyint unsigned NOT NULL, month_abbreviation char(3) NOT NULL, month_name char(12) NOT NULL, year2 char(2) NOT NULL, year4 year NOT NULL, quarter_name char(2) NOT NULL, quarter_number tinyint NOT NULL, year_quarter char(7) NOT NULL, year_month_number char(7) NOT NULL, year_month_abbreviation char(8) NOT NULL, PRIMARY KEY(date_key), UNIQUE(date) ) ENGINE=MyISAM DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci This is by no means a complete date dimension. The most important limitation is that it only contains attributes that are immediately derivable from the calendar. So, attributes to denote business specific periods like the fiscal year and holidays are not included. Generating 10 years worth of days The grain of the date dimension is days - a row in the date dimension represents a single day. In this case, the “Generate Rows” is configured to generate 3660 rows, which roughly corresponds with enough days to last 10 years: 
In the example, this step is also used to provide parameters to generate the date dimension data. As we’ll see in a moment, the inital_date field effectively specifies the first date that goes into the date dimension. The language_code and country_code fields are used to localize the textual attributes of the date dimension, and the local_yes and local_no fields are used for boolean attributes. There are other ways to get these parameters into our transformation. For example, we could have used an “Add Constants” step with a similar result. Another possibility would be to get this data from the environment using a “Get Variables” step, and this would allow the parameters to be specified at transformation run-time.
|