Friday, 11 January 2013

Understanding tNormalize and tDenormalize components (Talend Open Studio)


I have seen couple of forum posts where people are asking about usage of tNormalize and tDenormalize components of Talend Open Studio. These columns are very important components and can be very useful in situations where a single column values needs to be split to multiple records or combining/grouping multiple records into one. This is the reason, I have thought of putting up a post to demonstrate usage of tNormalize and tDenormalize components.

Today, I will create two following jobs:
1. To demonstrate tNormalize : A Job to split a record having comma separated values for a employee skill column into multiple records where each output record contain only one skill.
2. To demonstrate tDenormalize : A Job to combine multiple records into one with employee output skill column containing comma separated values from the grouped records.

Using tNormalize to split records:

Lets look at the Input file. In the file we have employee details with their Skills. Now skill are separated by comma. We need to read this file and output one row for each skill for an employee.


For Example:

For following Input row:

101;Avinash Gupta;java,C++,sql

Our output should look like this:

101;Avinash Gupta;java
101;Avinash Gupta;C++
101;Avinash Gupta;sql


To achieve our objective, lets create a new job and perform following steps.

1. Create metadata for our Input delimited employee file. You can take reference of the steps to create metadata for delimited from
here. Your metadata should look like following:




2. Drag the metadata created in step 1 to Job designer pane and select tFileInputDelimited component from the Popup. Open the component properties enter the following details:

I. Path of the input file
II. Field Separator
III. Header (Our input file also contain top most row as header)


3. Drag tNormalize component onto Job Designer from Palette pane. This component will be used to spit the record to multiple records based on skills column.

4. Right click tFileInputDelimited, select Row > Main and connect it to tNormalize component.

5. Open the component properties of the tNormalize component. Click on Sync columns to propagate the metadata of tFileInputDelimited. Select the column emp_skills in the Column to normalize dropdown. Here we have selected emp_skills as we want to normalize the values from this column. Also provide the Item separator as “,”. (In our input file, skills are separated by comma e.g. “101;Avinash Gupta;java,C++,sql


6. Drag tLogRow component onto Job designer from Palette pane. It will be used to display the output from tNormalize component in the Run Console.

7. Right click tNormalize, select Row > Main and connect it to tLogRow component.

With this step our Job to split the records input file to multiple records has been completed. It should look like following.



8. It time to run and check the output in Run console.



From the screenshot, you can see that every record has been split to multiple records based on employee skills. If the employee has 3 skills then the employee record will be split to 3 records each containing single record, similarly If the employee has 4 skills then the employee record will be split to 4 records.

Using tDenormalize to group multiple records:

Now we have seen, How to split the record to multiple records using tNormalize component. Its time to learn its reverse action i.e. To group multiple records into single record based on a key.



For Example:

For following records:

101;Avinash Gupta;java
101;Avinash Gupta;C++
101;Avinash Gupta;sql

Our output should contain only single record and look like this:

101;Avinash Gupta;java,C++,sql

We can use denormalize component to group multiple records into single record. Lets first save the output from the above Job to a output file tNormalize_output.txt. File should look like following.





















To achieve our objective, lets create a new job and perform following steps.

1.  Drag the
metadata created in step 1 (tNormalize example above) to Job designer pane and select tFileInputDelimited component from the Popup. Open the component properties enter the following details:

I. Path of the input file (Enter the path of output file from above example)
II. Field Separator


2. Drag tDenormalize component onto Job Designer from Palette pane. This component will be used to group the multiple records to single records accumulating the values into skills column.

4. Right click tFileInputDelimited, select Row > Main and connect it to tDenormalize component.

5. Open the component properties of the tDenormalize component. Click on Sync columns to propagate the metadata of tFileInputDelimited. Click on  and Select the column emp_skills in the To denormalize section. Here we have selected emp_skills as we want to denormalize and group the values from multiple rows to this column. Also provide the delimiter as “,”. 




6. Drag tLogRow component onto Job designer from Palette pane. It will be used to display the output from tDenormalize component in the Run Console.

7. Right click tDenormalize, select Row > Main and connect it to tLogRow component.

With this step our Job to group multiple records from input file to single record has been completed. It should look like following.



8. It time to run and check the output in Run console.


From the screenshot, you can see that multiple records has been grouped to single record based on employee number. e.g. If the employee has 3 records with different skills then the output will have single employee record containing multiple skills as comma separated.

This article is written by +Vikram Takkar  and published on www.vikramtakkar.com, please let me know, if you see this article on any other website/blog. 

No comments:

Post a Comment