Добавить в избранное   Сделать стартовой   Главная   E-mail   Форум   Мой блог 
   
Cертификации

Errors

ETL

FAQ (по темам)

GIS

Web

wiki

Администрирование

Безопасность

Книги
Oracle, ...

Новости

ОС

Программирование

Проектирование БД

Производительность

Скачать

Советы

Тестирование

Установка

FAQ - по базам данных
FAQ - по базам данных
Установка СУБД
Oracle
Sybase
MySQL
PostgreSQL
MS SQL Server
Interbase, Firebird
Другие DB
Администрирование
Oracle
MySQL
Sybase
PostgreSQL
MS SQL Server
Interbase, Firebird
IBM DB2
Другие DB
Проектирование БД
Статьи
ETL
Теория БД
ErWin
Designer 2000
PowerDesigner
Хранилища данных
CASE средства
OLAP
Бизнес - анализ (BI)
Производительность
Oracle
MSSQL
Interbase, Firebird
IBM DB2
MySQL
PostgreSQL
SYBASE
Безопасность БД
Oracle
MS SQL Server
Инъекция SQL
Программирование
Transact-SQL
PL/SQL
C++
XML
SQL
PostgreSQL
MDX
Java
VBA Excel
Книги по базам
Oracle
Заказ книг
ОС
Установка и настройка
UBUNTU
ОС
Установка и настройка
UBUNTU
FAQ
FAQ - по базам данных
Главная arrow ETL arrow ETL: Practical Example of Data Transformation Using Kettle

ETL: Practical Example of Data Transformation Using Kettle

Печать E-mail
Оглавление
ETL: Practical Example of Data Transformation Using Kettle
Страница 2

ETL: Practical Example of Data Transformation Using Kettle

 I’ve written about Kettle before. I personally think it is a great tool, and its easy to tell that this was written by someone who works with annoying data formats on a consistent basis. Well, Kettle is now known as the Pentaho Data Integration Project, and its now up to version 3. For brevity’s sake, I will just refer to it as Kettle going forward. Of course, none of this changes the functionality of the tool. In this article I am going to show a practical application of how I have been using Kettle to assist in the generation and transformation of annoying data formats.

 

I recently had to work with a feed from an entertainment vendor who distributes DVD, music CDs, and such. To the stores that use these files, they provide a rather confusing set of flat, tab delimited files, in a very un-userfriendly format. Since the store I was working for has a predefined format they want vendor files in to work with their search and navigation backend, we needed to transform these files to that format. This is where Kettle comes in.

The first file I need to deal with from them is their Products file. The file contains roughly 28 fields with various numbers that make no sense to me what so ever. The only fields I need to concern myself with are the Product ID, Product Name, Internal ID (used for mapping with the other files), a Category ID, the Artist, Price, and Availability Date. The other fields I can ignore. Since I am dealing with Categories I also need their Category file. Categories are going to be handled in a special way. I don’t need this in my actual file, but in a separate file, or files rather, that will be appended to a separate set of files only once. And the final file I need out of their set is their Attributes file, which will contain information about products such as if the product is Widescreen, Dubbed, Subtitled, etc. These are handled in a special way in the search backend, so I just need to provide them.

 

Figure 1. Conceptual Model of Data

 

What I need to do is transform this data into the format in Figure 2.

 

Figure 2. The Data Feed Format

 

The Artist field will go into Search Field 1, and the first two attributes I come across will go into search fields 2 and 3. Everything else will be a simple 1 to 1 mapping.

 

To get started, the first thing I need to do is start Kettle.exe, and create a new transformation. If you are using a repository, great, otherwise, choose No Repository at the startup screen. When you are in Kettle, go to File/New/Transformation.

 

Figure 3. New Transform

 

With the new file created, drag over 3 new Text File Inputs, located under the Core Objects/Input section, to the transformation area.

 

Figure 4. New Text Inputs

 

With the three text file inputs in the transformation, I need to set them up to read my data files. The first thing is to add the file to the input file list. Do this by clicking on Browse, then when you select your file, click on add. With the file selected, I now need to set up the delimiting information. These files are all tab delimited, with no text qualifiers (meaning no quotes around Strings), and no header row. So I click on the content and set the appropriate options. Since this is a tab delimited file, I need to click on the Insert Tab button to add in the tab.

 

Figure 5. Delimeter Options

 

Now I need to select the fields. To do that, since I have my file, all I need to do is click on the Get Fields button under the Fields tab. For developments sake, I will just name the fields that I need, and leave the field names alone for the remaining fields. Also, since I am just using these fields as description fields, I change the Date fields back to Strings.

 

Figure 6. Field selection

 

That’s it for Products, I do the same for the Categories and Attributes Tables. Now that the text inputs are set up, I need to do the transformations. The hardest part will be to denormalize and join the attributes into my input stream to feed into my output text file. The first step I need to take care of is sorting my data for the field row delimeter. So, I drag over a Sort Rows object from under the Transformation section. I need to connect the Attribute text data source to the Sort Rows object in order to edit it correctly. In order to do the connection, I need to hold down the Shift key on my keyboard, and drag my mouse from the Attibutes object to the Sort Rows object. This will indicate to the transformation that a “hop” in steps needs to occur between these two objects. Now, I edit my sort to sort based on the Product ID.

 

Figure 7. Sort Row Options

 

Next, I drag over a Select Values object and connect the Sort Rows object to it. In the Select/Alter tab, I click on Get Fields and leave everything default. Since I wont be using the non-named fields, I go over to the Remove tab, and select those fields.

 

Figure 8. Remove Fields

 

That was the easy part. Now I need to denormalize the data. What I want to do is have the first 3 attributes for each product to show up in consecutive columns. I tried using the Denormlizer here, with no success. So I ended up using the row flattener. The way the row flattener works is you define a single field that will contain the consecutive data. You then define additional columns. The flattener will then copy to each column in the order it receives data. So for example, lets say you have the following data defined in a field in your incoming data stream:

 

-Wide Screen

-Sub Titled

-Spanish

 

And in your row flattener, you defined the following target fields:

 

-FieldOne

-FieldTwo

-FieldThree

 

The flattener would assign the values like so:

 

-FieldOne = Wide screen

-FieldTwo = Sub Titles

-FieldThree = Spanish

 

It also seems that once unique values have been exhausted, it will just finish filling out the columns with the last unique value it encountered. For my purposes this is just fine. I define my field flattener with my values for Attribute Name in the following figure.


 
 
« Пред.   След. »
Взаимосвязанные статьи
     

Последние добавленные статьи
Поиск
Ссылки
Главная
Скачать
Курсы
Роль АБД (SYSDBA)
Карта сайта
Автостекла
Контакты
Войти на сайт
Популярные статьи
Online - тесты
1Z0-042
Rambler's Top100 МЕТА - Украина. Рейтинг сайтов хостинг от freehost.com.ua

Все права защищены.SYSDBA 2010 | Если у Вас есть хороший материал пришлите его нам.