Product dimension ETL load. More Talend examples: using logs, metrics and statistics.

The following ETL process addresses the load of the Product dimension. This dimension will hold, as we saw, the different attributes associated with the product for use in our BI system. After reviewing the data sources, the physical structure of the table is as follows:


Physical design of the Product dimension table in MySQL

The processes that we will implement using Talend are the following (as we saw in a previous blog entry):


Transformations for the creation of the Product Dimension

To develop the process with Talend, we will go one step further, defining log information and recording statistics and metrics. This will let us get to know other Talend functionality and orchestrate the execution of the jobs (for example, recovering the last execution date of the job in order to look for changes in the material master since the last run of the transfer process), monitor execution and report errors by email if a problem occurs.

The full outline of the job with all its steps is as follows (each step is described in detail below):


Talend ETL process in full for the Product dimension

The steps of the ETL process to fill the Product dimension are:

1) Execution of a prejob to retrieve the last execution date of the process and to record the start of the job in the log. In addition, it launches a logCatcher (to catch Java exceptions or errors in the process), which will send a warning email if a problem occurs at any step of the job.

2) Using the last execution date, we read the material master in the ERP, retrieving all products created or modified in the table since that date.

3) We substitute erroneous or blank values in the selected records according to established criteria.

4) We complete the mapping of the Product dimension, filling the remaining fields from other tables in the database (lookup fields) with the corresponding SQL queries.

5) The mapping may detect erroneous materials that should be checked and corrected in the source system (because they contain wrong values). The list of those materials is written to an Excel file. We also count the number of correct records read from the source system and store the figure in the metrics tables.

6) We verify whether there are actual modifications with respect to the data already in the DW database, and we update only the records that have changed (or are new). For now we will not manage Slowly Changing Dimensions; we will always keep a snapshot of the data as it stands in the master files at that moment (later in this series we will manage a dimension this way, to learn the components Talend provides for this purpose).

7) When the process finishes, we write the corresponding log message indicating successful completion.

Before going deeper into each of the steps, let us see how to activate the management of logs, statistics and metrics in Talend. The meaning of each of these elements is as follows:

  • Logs: the record of error messages generated by the running processes (Java exceptions) or of messages generated by ourselves with the tDie or tWarn components (which produce different types of messages).
  • Stats (statistics): the recording of statistical information about the different steps and components that make up a job. For each step you can turn statistics logging on by selecting tStatCatcher Statistics; it stores, for each step, when it starts, when it finishes, its execution status, its duration, etc.
  • Meter, Volumetrics (metrics): the metrics log (number of records in a flow) that we can insert into the processes using the tFlowMeter component.

In Talend we can work with all these records in two ways (which can be combined): we can manage the generated logs, stats and meters ourselves, collecting them at runtime with the tLogCatcher, tStatCatcher or tFlowMeterCatcher components in each job and processing them as we wish, or we can activate automatic management at the project or individual job level, choosing where we want the generated logs to be saved (with three options: console display, plain text files, or database). With this last option, all the generated information is registered in tables and will be easier to exploit for reviewing executed processes, orchestrating processes, detecting bottlenecks, fixing bugs, etc.
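As an example of why the database option is convenient, once the information is in tables we can review the executions of a job with a simple query, for instance in the Query field of a tMySqlInput. This is only a sketch: the table and column names below are illustrative and depend on what is configured in Stats & Logs.

// Query field of a hypothetical tMySqlInput used to review past executions of the job.
// Table and column names are illustrative; they depend on the Stats & Logs configuration.
"SELECT moment, job, origin, duration " +
"FROM   talend_stats " +
"WHERE  job = 'DimProduct' " +
"ORDER  BY moment DESC"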

To activate the management at project level we select the menu option File -> Edit Project Properties, then Job Settings, Stats & Logs (as we can see in the picture):


Statistics and logs options at the project level

These properties apply by default to all the jobs of the project, although they can be adjusted in each job as needed (there may be jobs for which we do not want to store any information). To do this we modify the properties of each job in its Stats & Logs tab. By default it inherits the properties set at project level, but exceptions can be defined for a particular job (sending the logs to a different location, or not generating them at all):


Statistics and logs options at the job level

After this clarification, let us see each step of the ETL in detail:

1) Execution of a prejob to retrieve the last execution date of the process and to record the start of the job in the log. In addition, it launches a logCatcher (to catch Java exceptions or errors in the process), which will send a warning email if a problem occurs at any step of the job. The components used in this initial step are:

  • Prejob launcher (tPrejob component): used to launch a pre-job, before the main process.
  • Last Execution Date (tMySqlInput component): retrieves from the logs table the last date on which the job executed successfully. This date will serve as the reference date to find the products created or modified in the ERP since the last execution. If there is no previous execution of the job, we build a reference date suitable for the initial load of the Product dimension (see the sketch after this list).
  • Start Log Message (tWarn component): generates a log message indicating that the job has started execution.
  • Set Variable Date (tSetGlobalVar component): the last execution date retrieved in the previous step is stored in a global variable so that it can be used later in any step of the job.
  • Error Control (tLogCatcher component): we activate this component, which "listens" for the whole duration of the job, waiting for any kind of error. When one occurs, it retrieves the error and passes it to the next component, which sends the email notification.
  • tFlowToIterate: converts the flow of log records into an iteration in order to send the email.
  • Email notification (tSendMail component): sends an e-mail notifying the errors, including the step where the process stopped and the error message. It is a way of being told that something went wrong in the process.
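As an illustration of how the prejob steps fit together (the real queries are in the linked documentation; table, column, message and variable names here are made up for the example), the Last Execution Date and Set Variable Date steps work roughly like this:

// Query field of the "Last Execution Date" tMySqlInput (table and column names are illustrative):
"SELECT MAX(moment) AS last_exec " +
"FROM   talend_logs " +
"WHERE  job = 'DimProduct' AND message = 'Job DimProduct finished OK'"

// The "Set Variable Date" tSetGlobalVar stores that value under a key (for example "FECHA_ULT_EJECUCION"),
// and any later component can read it back from the globalMap, e.g.:
(java.util.Date) globalMap.get("FECHA_ULT_EJECUCION")   // the cast depends on the type stored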

To see in detail how each component is defined, you can click on the link for each component or view the full HTML documentation generated by Talend by clicking here.

2) Using the last execution date, we read the material master in the ERP, retrieving all products created or modified in the table since that date.


Job DimProduct - Reading the SAP product master

To do this we use the LECT_MAESTRO_PRODUCTO component, of type tOracleInput (see details here), which allows us to access the SAP database, which in this case is Oracle. The reading could also have been done with the SAP components tSAPInput and tSAPConnection. Observe the SQL statement executed in the component and see how it combines literal text with the use of a global variable and Talend functions (which could also be Java).
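The actual statement is in the linked documentation; as a rough idea of what such a Query field looks like (schema, field and variable names are invented for the example, and we assume the global variable holds a java.util.Date), it combines literal SQL text, the global variable set in the prejob and a Talend date routine:

// Sketch of the Query expression of the tOracleInput; everything between quotes is plain SQL text,
// concatenated in Java with the global variable and a TalendDate routine.
"SELECT MATNR, MAKTX, MATKL, MEINS " +
"FROM   SAP_SCHEMA.MARA " +
"WHERE  AEDAT >= '" +
    TalendDate.formatDate("yyyyMMdd", (java.util.Date) globalMap.get("FECHA_ULT_EJECUCION")) +
"'"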

3) We substitute erroneous or blank values in the selected records according to established criteria. We use the AJUSTE_CAMPOS component, of type tReplace (see details here), with which we can search for values present in the retrieved records (using plain strings or regular expressions) and indicate a replacement string. It can be useful for any transformation to be carried out on certain values: normalizing, correcting erroneous records, etc. In our case we will use it to replace records with no value (we could also have chosen to reject such records). A conceptual sketch is shown after the figure below.


Job DimProduct - Set Fields
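Conceptually (the actual configuration is done in the tReplace grid, not in code), what AJUSTE_CAMPOS does for each record is equivalent to something like the following Java, with hypothetical field names and replacement values:

// Equivalent logic of the AJUSTE_CAMPOS (tReplace) step expressed as plain Java.
// Field names and replacement values are illustrative, not the real configuration.
if (row2.familia_desc == null || row2.familia_desc.trim().isEmpty()) {
    row2.familia_desc = "SIN VALOR";   // blank description replaced by a default text
}
if (row2.um_id == null || row2.um_id.trim().isEmpty()) {
    row2.um_id = "ND";                 // missing unit of measure replaced by a placeholder
}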

4) We complete the mapping of the Product dimension, filling the remaining fields from other tables in the database (lookup fields) with the corresponding SQL queries. We use the MAPEO_PRODUCTO component, of type tMap (see details here).


Job DimProduct - Mapping Product

With this component we join, in a single step, the material master records we have read and the additional values found in other ERP tables that allow us to complete each record (such as descriptions, units of measure, the calculation of equivalent units of measure, etc.).

In this mapping we use Talend functionality we had not seen before: selective filtering of records. This allows the step to generate two data streams, one with the correct records and another with the records that do not meet certain conditions. In our case, records with a NULL in the umren field will be rejected and passed to a subsequent step, where they will be written to an Excel file for review.
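The filter on each tMap output is just a Java boolean expression; as an illustration (the row name depends on how the flows are called in the job), the two output flows could be defined roughly like this:

// Filter expression of the "correct" output flow of the tMap (row name is illustrative):
row4.umren != null

// Filter expression of the rejected output flow that feeds the Excel file of erroneous materials:
row4.umren == null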

5) The mapping may detect erroneous materials that should be checked by a user and fixed in the source system (because they contain wrong values). The list of these materials is written to an Excel file via the REGISTROS_ERRONEOS control, of type tFileOutputExcel (see here). Similarly, using the CUENTA_LEIDOS_SAP component, of type tFlowMeter (see here), we count the number of correct records read from the source system and store it in the metrics tables (to later obtain statistical information about the executed processes).

6) We verify whether there are actual modifications with respect to the data already in the DW database, and we update only the records that have changed (or are new). To do this, we use the VERIF_MODIFICACIONES component, of type tMap (see here), which receives as its main flow the records read from SAP, fully completed. On the other hand, it receives as a lookup flow (see here) the records already loaded in the MySQL table DWD_PRODUCTO (the product dimension table). With this information we make the comparison in the mapping, applying a selective filter so that only the records with modifications (a difference in any of the fields) pass to the next component.


Job DimProduct - Changes Verification

The code used in the filter expression (the filter condition) is as follows:

!row11.material_desc.equals(row15.material_desc) ||
row11.familia_id != row15.familia_id ||
!row11.familia_desc.equals(row15.familia_desc) ||
!row11.denom_id.equals(row15.denom_id) ||
!row11.variet_id.equals(row15.variet_id) ||
!row11.formato_id.equals(row15.formato_id) ||
!row11.um_id.equals(row15.um_id) ||
row11.litros_id != row15.litros_id ||
row11.linprod_id != row15.linprod_id ||
!row11.linprod_desc.equals(row15.linprod_desc) ||
!row11.target_id.equals(row15.target_id)

Note how we compare the two streams field by field to determine whether there are any changes between the data already loaded and the data coming from the ERP.
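This expression assumes the compared fields are never null on either side; if that cannot be guaranteed, a small helper placed in a Talend routine can make the comparison null-safe. A hypothetical sketch, not part of the original job:

// Hypothetical routine method for a null-safe comparison between the DW value and the ERP value.
public static boolean changed(Object dwValue, Object erpValue) {
    if (dwValue == null) {
        return erpValue != null;        // one side null and the other not -> the record changed
    }
    return !dwValue.equals(erpValue);   // both present -> compare the values
}

// The filter condition could then be written, for example, as:
// changed(row11.material_desc, row15.material_desc) || changed(row11.familia_id, row15.familia_id) || ...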

As mentioned, for now we will not manage Slowly Changing Dimensions; we will always keep a snapshot of the data as it stands in the master files at that moment (later in this series we will manage a dimension this way, to learn the components Talend provides for this purpose). The records that pass the mapping are written to the MySQL database using the ACTUALIZA_DWD_PRODUCTO control, of type tMySqlOutput (see detail here), into the DWD_PRODUCTO table already mentioned.

The tFlowMeter component, in the CUENTA_MODIFICADOS step (see details here), records in the metrics tables the number of records that will generate an update.

7) When the process finishes, we write the corresponding log message indicating successful completion, using the MENSAJE_LOG_FIN component, of type tWarn (see details here).

To see in detail how each component is defined, you can access the full HTML documentation generated by Talend here. You can download the zip file containing the documentation here.

Conclusions

We have built our first complex ETL job using Talend. It was a little tricky to understand how the tool works, how the various components are linked and chained together, how information and records are passed between them, how errors are handled, etc. But once the basics are covered, we can see the great potential of Talend (which can also be extended using the Java language everywhere). If you want to get to know the tool better, I recommend reading its User Manual and the Components Reference Guide (both in English) at this link, and visiting the TalendForge project website, where you can find tutorials, forums, bug management, etc.

I have also seen how important a detailed analysis of the data sources is, along with all the transformations to be performed, the exceptions, and the criteria by which substitutions are made. Perhaps I should have done this step in more depth, since situations appeared while building the ETL process that we had not anticipated. Something to keep in mind for our next project.

We have also seen that it is important to establish a mechanism to manage all the logs, statistics and metrics of the running processes, since it will later be useful for setting up the automated processes, monitoring and debugging needed to put the project into production and maintain it over time.

And of course, the documentation. Documenting and explaining each component used in Talend as we go (naming the steps, adding explanatory text to each component, filling in metadata, etc.) has allowed us, using Talend's standard HTML generation, to have a repository where the processes can be seen as they were built, in a very complete and detailed view: useful for third parties to understand how the processes are assembled, and for later editing or maintenance.

In the next blog entry we will detail the ETL processes for loading the other dimensions of our project. Later, we will tackle the ETL process to load the fact table, which will be more complex and for which an orchestration mechanism will have to be established to automate it. That entry will also include a study of table partitioning in MySQL, which we will see in depth.