Pentaho Data Integration (PDI/Kettle)
Pentaho Data Integration (PDI/Kettle) is a data integration tool responsible for extract, transform, and load (ETL) processes.
It can be used as an independent application or as part of the Pentaho Suite. Since it is an ETL tool, its functionality is generally focused on data warehouse development. Nevertheless, it can also be used for the following tasks:
- Migration of data between applications and databases
- Export of data from databases to files
- Data cleaning
- Integration of applications
- Process automation
In this chapter we focus on creating a project in PDI, using a step-by-step approach with screenshots and plain language.
Installation and Configuration
Currently, the latest stable version is 5.4.0.1-130.
File: pdi-ce-5.4.0.1-130.zip
Size: 695 MB
Download: http://sourceforge.net/projects/pentaho/files/Data%20Integration/5.4/pdi-ce-5.4.0.1-130.zip/download
Installation process:
- Download PDI: http://sourceforge.net/projects/pentaho/files/Data%20Integration/5.4/pdi-ce-5.4.0.1-130.zip/download
- Unzip pdi-ce-5.4.0.1-130.zip file
- Run spoon.bat (Windows) or spoon.sh (Linux)
- The following screen should appear:
Repository configuration
When PDI opens, the repository configuration screen is presented.
Connecting to a repository allows transformations and jobs to be saved in a preferred location.
If specifying a repository is not needed, close the window; jobs and transformations will then be stored on the local machine.
To add a repository, click on (+)
There are two ways to configure your repository:
1st: Store transformations and jobs inside a database (“Kettle database repository”)
2nd: Define a folder and use it as a repository (“Kettle file repository”)
If the initial screen does not open, follow these steps to access it:
Tools-> Repository -> Connect to repository
Select one of the two options and click on Ok.
Click on New to configure a new connection to the server
The software allows connections to the main databases: Oracle, SQL Server, MySQL, PostgreSQL, Firebird, IBM DB2, and others.
Select the database and complete the following fields:
- Connection name
- Host name
- Port number
- Username
- Password *
*It is recommended to write down the password in order to edit the connection in the future
Click on OK to proceed with the configuration
Complete ID and Name fields
Click on Create or upgrade to create the necessary structure in the database
Execute -> OK -> Cancel -> Ok
Organizing folders
Right click on New Folder -> Folder name -> Ok
Select the option shown below to configure the repository in text mode (.ktr and .kjb files). Click on OK
Exporting/Importing Repository
Click on Tools -> Repository -> Export Repository... Save in a convenient folder
Click on Tools -> Repository -> Import Repository
After selecting the file to be imported, click on OK to inform the repository storage directory
Transformations
A transformation is a routine composed of a collection of connected steps.
The first step represents the data source and the last one the data output.
Jobs
A job is a routine that allows, among many other actions, running transformations or other jobs.
Step
A step is the minimum unit inside a transformation.
There are three basic types of steps:
- Input
- Transformation
- Output
Steps are easily created using drag and drop.
In the tool area, a list groups the steps into many categories.
Hop
A hop is a graphical representation of the data flow between two steps.
The connection between two steps can be created by clicking on the source step, holding the Shift key, and dragging to the destination step.
Transformations and main steps
Formula
This step allows the creation of formulas such as calculated fields, fields with constant values, logical conditions, string formatting, comparison operators, and others.
Table input
Create your connection -> insert step name -> insert your SQL query -> Click on preview to verify the query result -> Ok
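As an illustration, the SQL query field could hold a statement such as the sketch below (the clients table and its columns are hypothetical; adjust them to your own schema):
-- hypothetical source table and columns
SELECT id_client,
       name,
       city
FROM   clients
WHERE  active = true;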
Table Output
Truncate: Before data insertion, the existing data in the table is deleted. Leave this option unchecked if an incremental load needs to be done.
If the table does not exist, click on SQL to create it
Insert / Update
Configure the connection -> inform the table that will be updated.
Don’t perform any update:
If enabled, existing rows in the database are never updated; only inserts are performed.
Inform the primary key of the source table and the target table;
Inform the name of the source field and the target field.
Select Y or N to configure the fields that will be updated.
Database Lookup
Configure the connection, target table, key and the condition used to run a lookup on the database.
Get fields: Returns a list of available fields from the input flow
Get lookup fields: Returns a list of available fields that can be added to the output flow.
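Conceptually, for each input row the step behaves like the lookup query sketched below (table and field names are hypothetical):
-- executed, conceptually, once per input row
SELECT product_name, price
FROM   products
WHERE  id_product = ?;  -- key value taken from the input stream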
Switch / Case
Replace in String
In stream field: base field to search
Use RegEx: use regular expressions and group references
Search: Value that will be searched or replaced
Replace with: Value that will overwrite the Search value
Whole word: Put Y to replace only whole words, or N to replace a matching part of a word
Case sensitive: Put Y or N to inform whether the search is case sensitive
Java Script
Calculator
New field: Field name;
Field A: first value;
Field B: second value;
Field C: third value;
Value type: Data type of the calculation result
Precision: Number of decimal places
Call Procedure
Value Mapper
Fieldname to use: field that will be used as mapping source
Target field name: Field that will be mapped
Default upon non-matching: Defines a default value for situations where the source value does not match any mapping entry
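The step is conceptually similar to a SQL CASE expression, as in the sketch below (the status_code field, its values and the source table are hypothetical):
SELECT CASE status_code
         WHEN 'A' THEN 'Active'
         WHEN 'I' THEN 'Inactive'
         ELSE 'Unknown'          -- default upon non-matching
       END AS status_description
FROM   clients;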
Mapping (sub-transformation)
Mapping transformation: References the transformation used for mapping;
Parameters: Allows the definition of variables in the mapping;
Add Input: Each input is one mapping step
Add Output : Each output is one mapping step.
Mapping (Input)
Input source step name: Name of the input step in the parent transformation (not the name of the mapping);
Mapping target step name: Name of the mapping input step (data line to be mapped);
Is this the main data path: Select when the fields above are left empty;
Ask these values to be renamed back on: Renames the fields before they are sent to the mapping;
Step mapping description: Mapping description.
Mapping Output
Mapping source step name: Name of input step at the transformation where data will be read.
Output target step name: Name of input step at the current transformation;
Is this the main data path: Select when the fields above are left empty;
Ask these values to be renamed back on: Renames the fields before they are sent back from the mapping;
Step mapping description: Mapping description.
Excel Input
On the Content tab, select the format of the file to be imported (xls, xlsx or ods)
File or directory: Inform the location of the file that will be used as input
-> Go to the Sheets tab
Get sheetname(s): Click to inform the sheet where the data is.
-> Go to the Fields tab
TXT Input
TXT Output
Separator: Choose the file delimiter, such as TAB, comma (,), semicolon (;) and pipe (|).
Get fields: Click to indicate fields that will be exported, types of fields and formatting
-> Click Ok
CSV Input
Filename: Informs the location of the input file;
Select Values
Get fields to select: Select data automatically
Select & Alter: Alters names and specifies the exact order in which the fields should appear in the output;
Remove: Specifies fields that will not be present in the output
Meta-data: Alters field name, type, length and precision.
Jobs
Start
Repeat: Select if you want to repeat the scheduling automatically;
Type: Choose the scheduling type (Daily, Days of week, Day of month);
Interval in seconds: informs the interval in seconds to repeat the process;
Interval in minutes: informs the interval in minutes to repeat the process;
Time of day: Informs time to run the processes;
Day of week: informs the day of the week to run the processes;
Day of month: informs the day of the month to run the processes.
Transformation
Transformation filename: Informs the location of the transformation that will be executed.
Job
Job filename: Informs the location of the job that will be executed
File Exists
Useful step to check if a file exists
- Inform the path of the file;
- Click on OK.
Table exists
Useful step to check the existence of tables in a data source.
- Configure the database connection;
- Inform the schema where the table belongs;
- Inform the table;
- Click on OK.
SQL
Connection: inform the connection;
SQL from file: Select this option to run a script from a file
SQL script: Inform the query that will be executed
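For example, the SQL script field could hold a statement such as the sketch below, run before the transformations start (the stg_clients staging table is hypothetical):
-- clear the hypothetical staging table before each load
TRUNCATE TABLE stg_clients;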
Extracting data from multiple sources
1st Data Source: Postgres
Extract information from PostgreSQL
2nd Data Source: CSV
3rd Data Source: Excel
Organizing data
Click on Get fields to select the desired fields
Data Warehouse
Concept
A Data Warehouse is where the company’s important data are stored, catalogued in an organised and structured way.
Very often, a multi-dimensional modelling is used to build a data warehouse.
According to Inmon (1997), creator of the concept, Data Warehouse “is a collection of integrated data, subject-oriented, variable over the time and non-volatile, used to support the management decision-making process.”
According to Kimball (1997), the process of building a data warehouse has nine stages.
They are:
- Select the business process to be modeled;
- Define the granularity (level of detail) of the business process;
- Choose the dimensions for the fact table;
- Identify numeric facts (metrics);
- Store pre-calculated data in the fact table;
- Load the dimension tables;
- Prepare the dimensions to support changes;
- Define the time span of the database history;
- Define the period over which data are extracted and loaded into the DW.
Creating a Data Warehouse
- Creating Database tables
Creating Dimensions
It is recommended that each transformation deals with only one dimension, fact table or variable manager.
This pattern helps with job creation.
Dimension Lookup / Update
Update the dimension: Updates the dimension based on the input flow;
Target table: Name of the dimension table;
Commit size: Commit after every X insertions or updates.
Technical key field: The dimension's technical (surrogate) primary key;
Use table maximum +1: A new technical key is generated as the current table maximum plus one
Date range start field: Date from which the dimension record version is valid
Table daterange end: Date until which the dimension record version is valid
Click on SQL -> Run to create dimension
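The generated DDL typically resembles the sketch below (a hypothetical dim_client dimension in PostgreSQL; column names follow the fields configured above):
CREATE TABLE dim_client (
  sk_client  BIGINT NOT NULL PRIMARY KEY,  -- technical (surrogate) key
  version    INTEGER,                      -- record version
  date_from  TIMESTAMP,                    -- date range start field
  date_to    TIMESTAMP,                    -- table date range end
  id_client  INTEGER,                      -- natural/business key
  name       VARCHAR(100),
  city       VARCHAR(100)
);
CREATE INDEX idx_dim_client_lookup ON dim_client (id_client);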
Use this pattern to create other dimensions
Create a process to run all the dimension transformations sequentially
Creating a Data Stage
Click on SQL -> Run to create aux table
Creating Fact Table
- Connect to the Data Stage and select only the primary keys;
- Create a lookup for each dimension;
- Select only the ID fields and remove the others using the Select Values step;
- Load the data into the fact table (a DDL sketch is shown below).
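As a sketch, the fact table could look like the DDL below (a hypothetical fact_sales table whose keys reference the dimension technical keys):
CREATE TABLE fact_sales (
  sk_client  BIGINT NOT NULL REFERENCES dim_client (sk_client),
  sk_time    BIGINT NOT NULL,   -- technical key of a hypothetical dim_time dimension
  quantity   INTEGER,
  amount     NUMERIC(15,2)      -- leave room for integer digits (see Common Problems)
);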
How to use Slowly Changing Dimension on PDI/Kettle
Slowly Changing Dimension on PDI
SCD Type 1:
- It does not keep historical information; this implementation does not trace historic data
- Every time a change is made on a record, the information is updated.
Loading Client Dimension
Loading Client dimension – Keys tab
Component:
Select:
Comparative fields:
Technical key:
The remaining fields at Keys tab are set as Default
The Type of dimension update column must be defined as Update
Loading Client Dimension at fact table
Component:
Properties of Database Lookup component:
SCD Type 2
- Keeps data history;
- Inserts a new row for each change to a record, saving the version of the altered record;
- There is no limit on the number of changes; this implementation keeps the full history of changes.
Loading Client Dimension - Keys tab
Component:
Select:
Comparative Fields:
Technical Fields:
The remaining fields at Keys tab are set as Default
The Type of dimension update column must be defined as Insert
Loading Client Dimension at Fact Table
Component:
Only the following fields must be set:
Table Output component:
Truncate table option must not be selected:
HTTP Client - Calling HTTP services
- Call a web service through HTTP
- The URL allows dynamic adjustment of parameters
Documentation – HTTP Client Component: http://wiki.pentaho.com/display/EA/HTTP+Client
ETL and Kettle documentation - CookBook
Installation – Kettle CookBook
Installation requirements
- PDI 5.4.0.1-130
- Link Kettle CookBook project: https://code.google.com/p/kettle-cookbook/
Usage (Linux):
sh kitchen.sh -file:/parking/kettle-cookbook/pdi/document-all.kjb -param:INPUT_DIR=/project/dwh/kettle/ -param:OUTPUT_DIR=/tmp/output
Kettle CookBook – IT4biz – Example:
http://www.it4biz.com.br/cursos/kettle-cookbook/Samples/
Transformations and Jobs Examples
With PDI/Kettle it is possible to use Web Services
Example created for the CONSEGI 2010 event as a Web Service demonstration
http://www.it4biz.com.br/cursos/consegi2010/exemplos_kettle_webservices.tar.gz
Logs
Displaying Logs
Configuring Logs Level
Log Screen
In case of error during the process, PDI shows where the error is, highlighting the step and the error rows in red.
Error Handling
The error handling feature prevents the transformation from stopping when data consistency errors occur
The advantage of using this feature is that inconsistent data can be recorded in another place so the ETL process can complete
Practice:
Create a PDI transformation that reads client data from an Excel file and records it in a PostgreSQL database.
Insert the following components:
- Excel Input
- Table Output
- String Operations
- Text file Output
- Connect the hops as in the image above
- Configure the Excel Input component to read the client data
- After selecting a sheet, define the fields to be read
- Configure the Table Output component to record the client data
It must include: (from PDI5.3.0.0.213-2015-02-16.pdf)
- DDL to create the client table (a hypothetical sketch is shown below):
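A minimal sketch, assuming a simple client table in PostgreSQL (the actual DDL is provided with the training material; names and types here are assumptions):
CREATE TABLE client (
  code  INTEGER PRIMARY KEY,  -- numeric code; an oversized value triggers the error-handling example
  name  VARCHAR(100),
  city  VARCHAR(100)
);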
Note: The sheet that will be loaded is in the examples directory.
- Right click on the Table Output component and then click on the error handling option.
- At the error handling configuration screen, insert the following properties:
- Target step: String operations
- Enable the error handling?: Enable
- Error descriptions fieldname: DS_ERRO
- Open the String Operations component and insert the DS_ERRO field
- In the same row as the DS_ERRO field, in the Remove special character column, select the carriage return & line feed option.
This configuration ensures that possible errors are recorded on separate rows
- Open the Text file output component and insert the following properties on the File tab:
- Open the Text file output component and insert the following properties on the Fields tab:
Note: Change the CODE field type to String. If errors occur, the content of the field will be presented as a String.
- Run transformation to test data recording
- This example will record all data successfully, because the Excel data are 100% consistent
Produce an inconsistency in the sheet. To do so, insert the code 12345678956345200000 for the REDE FELDER client
Save the sheet and perform another load
- As soon as the transformation runs, an error file is created in the rejected directory:
Scheduling, job execution and monitoring from the command line
Kitchen (Text)
Kitchen is a simple way to run and schedule your jobs. It is simple to configure and consumes little memory.
Windows:
- Open notepad;
Enter the command below:
kitchen.bat /file:C:\Treinamento\ETL\update\Warehouse.kjb /level:Basic
- Save file with .bat extension
Linux:
Enter the command below:
kitchen.sh -file=/Treinamento/ETL/meuJob.kjb -level=Basic
Kitchen (Text and Log)
Windows:
- Open notepad;
Enter the command below:
kitchen.bat /file:C:\Treinamento\ETL\update\Warehouse.kjb /level:Basic > C:\LOG\trans.log
- Save file with .bat extension
Linux:
Enter the command below:
kitchen.sh -file=/Treinamento/ETL/meuJob.kjb -level=Basic >> /LOG/trans.log
Kitchen (Database repository)
Windows:
- Open notepad;
Enter the command below:
kitchen.bat /rep etl_desenvolvimento /dir Contabil /user admin /pass admin /job job01Contabil
- Save file with .bat extension
Linux:
Enter the command below:
sh kitchen.sh /rep repository_etl /dir 7_FALTAS /user admin /pass admin /job CARGA_FALTAS
Pan
Pan, just like Kitchen, is simple to configure and consumes little memory; it runs transformations instead of jobs.
Windows:
- Open notepad;
Enter the command below:
pan.bat /file:C:\Treinamento\ETL\updateWarehouse.ktr /level:Basic
- Save file with .bat extension
Linux:
Enter the command below:
pan.sh -file=/Treinamento/ETL/updateWarehouse.ktr -level=Minimal
Schedule
To schedule jobs or transformations, it is necessary to use the Windows Task Scheduler or cron (on Linux).
Cluster
Carte
Carte is a simple web server that allows the execution of jobs and transformations remotely.
It accepts XML (through a small servlet) containing the transformation to run and its execution configuration.
It also allows remote monitoring and the starting or stopping of transformations and jobs on the Carte server. A server running Carte is called a slave server in PDI terminology.
Starting Carte
In order to start the Carte server, look for carte.sh (Linux) or carte.bat (Windows) in the PDI folder.
The Carte service can be started from the terminal:
sh carte.sh server_ip 8081
Example:
sh carte.sh 192.168.10.105 8081
Carte Log
Below we have an example of Carte Log
sudo sh carte.sh 10.105.186.128 8081
2009/07/24 15:09:25:942 BRT [INFO] DefaultFileReplicator - Using "/tmp/vfs_cache" as temporary files store
2009-07-24 15:09:26.063::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
INFO 24-07 15:09:26,092 - org.pentaho.di.www.WebServer@3efc3efc - Created listener for webserver @ address : 10.105.186.128:8081
2009-07-24 15:09:26.101::INFO: jetty-6.1.9
2009-07-24 15:09:26.484::INFO: Started SocketConnector@10.105.186.128:8081
Slave Server
In order to create a slave server it is necessary to configure PDI as presented below:
Default user and password:
User: cluster Password: cluster
There is also the possibility of encrypting the password using encr.sh
When the remote server is created, select the master = true option for the first server
Metadata Injection (XML, Tables and CSV files)
The objective of working with Metadata Injection is to define an ETL routine that will be used as a model to dynamically load metadata.
Instead of creating a job to load each XML, text file, table or CSV, it is possible to define a job that will be responsible for loading data dynamically.
Documentation
http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection
Common Problems
Forgetting to add a hop;
Forgetting to set the technical key field;
When using PostgreSQL as the DW, using numeric(17,17) leads to field overflow (see the example after this list);
Forgetting to run the SQL when creating dimensions;
Forgetting to define the target table;
Not defining the database connection properly;
Not inserting fields on the key fields tab;
Not inserting fields on the fields tab;
Using a technical key format the database does not accept. Example: MySQL uses auto increment, while PostgreSQL only accepts sequence and table maximum + 1;
Not selecting the Update the dimension checkbox, which alters the component behavior;
Kitchen and Pan are .bat or .sh files located in the data-integration folder, so do not forget to include the data-integration path, for example:
C:\Treinamento\PDI\tools\data-integration\kitchen.bat /file:C:\Treinamento\ETL\updateWarehouse.kjb /level:Basic
Use the Windows Scheduler to run an automatic process, or crontab on Linux;
Forgetting to map the connection properly;
Not running the SQL script to create the dim_time table.
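The numeric(17,17) overflow mentioned above can be reproduced with the sketch below (PostgreSQL): a type with precision 17 and scale 17 leaves no room for integer digits, so any value greater than or equal to 1 fails.
-- fails with "numeric field overflow": all 17 digits are reserved for decimals
SELECT CAST(123.45 AS numeric(17,17));
-- works: numeric(17,2) keeps 15 integer digits and 2 decimals
SELECT CAST(123.45 AS numeric(17,2));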