ETL process to load the time dimension. An example using the Talend ETL tool.


Having identified the data sources, we can proceed to build the physical tables of our model and to develop the processes that fill them. We begin with the time dimension. As stated, this dimension does not depend on our ERP or any other external system; we build it from the calendar. We will generate all the records this table needs for a period of about 20 years, running from January 1, 2000 (so that earlier historical data can also be loaded into our DW) through December 31, 2020.

The physical definition of the table has not changed after the analysis of the data sources, and is as follows:

Physical design of the Time Dimension table in MySql

The processes that we will implement using Talend will follow these transformations (as we saw in a previous blog entry):


Transformations to create the Time Dimension

We will use Talend with the Java code-generation option (we could also have used Perl). Using Java means that all the processes and transformations we define are internally "translated" into Java. Although it is not strictly necessary to know Java to work with Talend, knowing it greatly facilitates working with the tool and lets us define our own code for transformations or processes that are not included as standard.

Talend is based on the Eclipse development environment. It is a feature-rich graphical environment where transformations are defined in a very intuitive way, by clicking and dragging components and establishing relationships between them. It includes a debug environment to analyze processes and their errors, as well as the possibility of adding traces to monitor processes once we have developed and validated them.

The following image is an example of a job created with Talend:


Graphical environment of Talend tool

The most important components within the graphical environment are:


Talend Objects Repository

Repository: includes all the objects that can be defined using Talend, organized in a tree as follows:

  • Business Models: Talend has a simple graphical tool for defining business models. In this folder we keep the various business models we have "drawn" with Talend. The tool contains the most common graphical elements.
  • Job Designs: a data transformation or integration project consists of multiple processes or jobs, which can be defined and organized in a folder structure to classify them. In this section we see the different jobs we have defined and how we have classified them.
  • Contexts: execution contexts for processes. In them we can define constants, or parameters that we are asked for when a process runs, and that can be used by the various components of a job. Contexts can also be loaded at runtime from files (see the sketch after this list).
  • Routines: the place where we can see the code routines provided by Talend (which can then be used in transformations) and where we can add our own routines to perform operations, calculations or transformations for which there is no standard method. Routines are written in Java (or Perl, if that language was selected).
  • SQL Templates: predefined SQL statement templates that we can use or customize.
  • Metadata: the place where we define the project's metadata. These are definitions of components that can then be reused across all the transformation processes we design. For example, in the metadata we can define database connections, retrieve and document the schema of a database (tables, views, etc.), define SQL statements, describe the layouts of different file types, and so on. This lets us define elements in one place and reuse them throughout the project. The metadata repository centralizes all project information and ensures consistency across all integration processes. The metadata of the source and target systems of the integration processes is easily loaded into the repository through the analysis tools and wizards provided for databases and files. The characteristics defined in the metadata are inherited by the processes that use them.
  • Documentation: we can load the documentation files from our analysis and development into the project, classifying them into folders. A file can be linked by reference or loaded directly into the repository. This utility lets us centralize all the elements of a data integration project in one place.
  • Recycle Bin: deleted objects go to the recycle bin, and can be recovered from there if necessary.
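Since contexts are central to parameterizing jobs, the sketch below mimics the mechanism in plain Java: parameters are read from a file at runtime and exposed to the rest of the job. This is an illustrative analogy, not Talend's actual generated code, and the file and parameter names are hypothetical.

import java.io.FileInputStream;
import java.util.Properties;

public class ContextDemo {
    public static void main(String[] args) throws Exception {
        // Load context parameters from a file at runtime, much as Talend
        // does when a context file is supplied to a job.
        Properties context = new Properties();
        try (FileInputStream in = new FileInputStream("context_prod.properties")) {
            context.load(in); // hypothetical file with key=value pairs
        }

        // Job components can then read the parameters, with defaults.
        String dbHost = context.getProperty("db_host", "localhost");
        int startYear = Integer.parseInt(context.getProperty("start_year", "2000"));
        System.out.println("Connecting to " + dbHost + "; loading from " + startYear);
    }
}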

Job Designer: the tool with which we manipulate the various components that make up a job, establishing the relationships between the different elements.


Job Designer

When working with a job, at the bottom we have a set of tabs from which we can perform various actions, such as setting a context for the job, setting its properties, running the job and choosing the execution mode, modifying the properties of the job's components, scheduling jobs, establishing hierarchies between jobs, etc.

Finally, on the right side of the application we have the Component Palette, which contains the different controls that Talend provides for use in our jobs. They are classified by function.


Component Palette

Some of the component groups available in Talend are:

- Business Intelligence: a group of connectors that cover reading and writing needs for multidimensional or OLAP databases, output to Jasper reports, change management for slowly changing dimensions, etc. (all of them related to Business Intelligence).

- Business: connectors for reading from and writing to CRM-type systems (Centric, Microsoft CRM, Salesforce, Sugar, Vtiger) or SAP systems. They also allow working with the Alfresco document management system.

- Custom Code: components to define our own custom code and use it integrated with the other components in Talend. We can write components in Java or Perl, as well as load libraries or run Groovy commands.

- Data Quality: components for data quality management, such as filtering, CRC calculation, fuzzy-logic searches, value replacement, schema validation against metadata, duplicate cleansing, etc.

- Databases: connectors for reading from and writing to the supported database systems (including the MySql components we use later in this entry).

- ELT: components for working with databases in ELT mode (with the changes and processing typical of such systems).

- File: controls for file management (existence checks, copy, delete, list, properties), and for reading and writing files of different formats (text, Excel, delimited, XML, email, etc.).

- Internet: components for accessing content stored online, such as web services, RSS feeds, SCP, MOM, email, FTP servers and the like.

- Logs & Errors: controls for managing logs and errors in the definition of processes.

- Miscellaneous: various components such as message windows, checks that servers are up, record generation, management of context variables, etc.

- Orchestration: components to sequence and orchestrate the tasks, processes, jobs and subjobs defined in our transformations (loops, execution of pre- or post-jobs, processes that wait for files or data, etc.).

- Processing: components for processing data flows, such as aggregation, mapping, transformation, filtering, denormalization, etc.

- System: components for interacting with the operating system (command execution, environment variables, etc.).

- XML: components for working with XML data structures: parsing, validation or creation of structures.

To get an idea of how to work with Talend, it is worth watching the 5-minute demo on Talend's web site (accessible from this link). You can also watch the following video demonstrating how to generate test data in a MySQL table.

Also, to go deeper, you can download the tool's User Manual and the Components Reference Guide (both in English) at this link.

Talend job to fill the Time Dimension

Now that we know a bit about what Talend is, let us look at a practical example. We need to generate a stream of dates beginning on 01/01/2000 and ending on 31/12/2020 (the dates that make up the roughly 20 years held in the dimension).

To build the process, we define the following steps:

1) A loop that runs 10,000 times, with a counter going from 0 to 9999 (using the tLoop component from the Orchestration group).

2) The loop calls a record generator control, which generates a record with the fixed date 01/01/2000 (using the tRowGenerator component from the Miscellaneous group).

3) The tRowGenerator passes the date to a transformation (MAP), which adds the step-1 counter, in days, to the date (with this we shift the starting date forward one day at a time, generating all the necessary dates). We use the tMap component from the Processing group.
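Taken together, steps 1 to 3 amount to the following plain-Java logic (an illustrative sketch of what the components compute, not the code Talend actually generates):

import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

public class GenerateDatesDemo {
    public static void main(String[] args) {
        // Step 1: loop with a counter from 0 to 9999, like the tLoop.
        for (int i = 0; i <= 9999; i++) {
            // Step 2: a record with the fixed date 01/01/2000, like the tRowGenerator.
            Calendar c = new GregorianCalendar(2000, Calendar.JANUARY, 1);
            // Step 3: shift the date forward by the counter, in days, like the tMap.
            c.add(Calendar.DAY_OF_MONTH, i);
            Date generated = c.getTime();
            // ... the following steps consume "generated" ...
        }
    }
}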

4) The generated dates pass through another transformation (MAP), where for each date we generate all the attributes of the time dimension indicated earlier in the transformation table (month, year, day, day of week, quarter, semester, etc.). We use the tMap component from the Processing group.

We had to define a Java routine to generate the correct week-of-year number for each date, and to generate the quarter, semester, holiday and weekend attributes. For example, for the generation of weeks we wrote the following Java code:

// Template routine Java
package routines;

import java.util.Calendar;
import java.util.Date;

public class dates {

    // Returns the week of the year of a date as a two-digit string,
    // with weeks starting on Monday.
    public static String semana_del_anyo(Date date1) {
        Calendar c1 = Calendar.getInstance();
        c1.setFirstDayOfWeek(Calendar.MONDAY);
        c1.setMinimalDaysInFirstWeek(1);
        c1.setTime(date1);
        int week = c1.get(Calendar.WEEK_OF_YEAR);
        if (week < 10) {
            return "0" + Integer.toString(week); // pad with a leading zero
        } else {
            return Integer.toString(week);
        }
    }
}
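In the tMap of step 4 the routine is invoked like any other Java expression; assuming an input column named row2.fecha (a hypothetical name), the week attribute would be computed as dates.semana_del_anyo(row2.fecha). The remaining attributes of step 4 can be derived with simple calendar arithmetic, along these lines (an illustrative sketch, not the literal tMap expressions):

import java.util.Calendar;
import java.util.Date;

public class DateAttributesDemo {
    // Derives some of the step-4 attributes from a date.
    public static String describe(Date d) {
        Calendar c = Calendar.getInstance();
        c.setTime(d);
        int year = c.get(Calendar.YEAR);
        int month = c.get(Calendar.MONTH) + 1;   // 1..12
        int day = c.get(Calendar.DAY_OF_MONTH);
        int quarter = ((month - 1) / 3) + 1;     // 1..4
        int semester = ((month - 1) / 6) + 1;    // 1..2
        boolean weekend = c.get(Calendar.DAY_OF_WEEK) == Calendar.SATURDAY
                || c.get(Calendar.DAY_OF_WEEK) == Calendar.SUNDAY;
        return year + "-" + month + "-" + day + " Q" + quarter
                + " S" + semester + (weekend ? " weekend" : " weekday");
    }

    public static void main(String[] args) {
        System.out.println(describe(new Date()));
    }
}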

5) We filter out the records with dates later than 31/12/2020, since we do not want to load those into the database. We use the tFilterRow component from the Processing group.
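The condition the tFilterRow applies is just a date comparison; in plain Java it looks roughly like this (a sketch, with a hypothetical column name):

import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

public class FilterDatesDemo {
    // Keep only dates up to and including 31/12/2020.
    public static boolean keep(Date fecha) {
        Date limit = new GregorianCalendar(2020, Calendar.DECEMBER, 31).getTime();
        return !fecha.after(limit);
    }

    public static void main(String[] args) {
        Date tooLate = new GregorianCalendar(2021, Calendar.JANUARY, 1).getTime();
        System.out.println(keep(tooLate)); // false: the record is discarded
    }
}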

6) We insert the records into the DWD_TIEMPO table of the Enobia database, using the tMysqlOutput component (Databases > MySql group). If records already exist in the database, they are updated.
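The "insert or update" behavior of step 6 corresponds, in MySQL terms, to an INSERT ... ON DUPLICATE KEY UPDATE. A minimal JDBC sketch of the idea follows (the connection details and column names are hypothetical, and tMysqlOutput takes care of all of this for us):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection to the Enobia database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/enobia", "user", "password")) {
            // Insert a row into DWD_TIEMPO, updating it if the key already
            // exists, which is what "insert or update" mode achieves.
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO DWD_TIEMPO (fecha, anyo, mes) VALUES (?, ?, ?) "
                  + "ON DUPLICATE KEY UPDATE anyo = VALUES(anyo), mes = VALUES(mes)");
            ps.setDate(1, java.sql.Date.valueOf("2000-01-01"));
            ps.setInt(2, 2000);
            ps.setInt(3, 1);
            ps.executeUpdate();
        }
    }
}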

The full outline of the job is as follows:


Job to fill the Time Dimension

This was our first contact with an ETL tool. Without programming (or almost: we had to write a Java routine for the treatment of weeks and other date attributes), we have filled the first table/dimension of our model with real data.