Tuesday, 18 December 2012

Performing Lookup Operation in Talend Open Studio



Today I am going to create a Talend job to perform Lookup operation. Most of time all the information is not provided in single file, it may be stored in multiple files. E.g. All the master data or definition to the dimensions are stored in different files. In these situation we need to use the lookup operation to fetch corresponding information from Lookup file.

Below is the snapshot of the Talend job to perform Lookup operation.

Here is the Sample Input File

empid;empname;salary;dept_id
101;Anuj Mittal;10000;10
102;Vinay Gupta;15000;11
103;Akshay Arora;25000;12
104;Sharukh Khan;125000;13
105;Katrina Kaif;55000;11
106;Amir Khan;120000;12

Lookup File

department_id,department_name
10,finance
11,human resources
12,admin
13,information technology
14,outsourcing

Let's work through the steps of creating job:

1. First create the metadata of the main input delimited employee file by following the steps  mentioned in post “Generating Metadata for CVS (Delimited) using Talend Open Studio”. Metadata will be available under the Metadata section in the Repository pane.

2. Drag the metadata created in the Step 1 to the Job designer and select component tFileInputdelimited component.

3. Drag component tFileInputdelimited to the job designer from Palette pane. Double click the component the edit the properties.

4. Enter the path of the Lookup delimited file in the File name/Stream text box. Also provide other details like Row Separator ("\n") and Field Separators (",") as per specifications of the Lookup files.



5. Click on the Edit Schema button (highlighted in blue in above screenshot) and enter schema details of the lookup file as below:

6. Now, its time to add tJoin and tLogRow components from the Palette pane to the job designer. tJoin component will help to join (perform lookup) two data stream defined above.
tLogRow will be used to display the output stream.

7.  Right-click on the main Input delimited file component, select Row > Main, and drop the connector onto the tJoin component.

8.  Right-click on the Lookup file component, select Row > Main, and again drop this onto the tJoin component. You'll notice that, even though we selected Row > Main in both cases, the resulting connector is slightly different. The first connector we made in step 7 is noted as Main, while the second is called Lookup. The order in which we connect the components to the tJoin is important. Join the primary data stream first, and one or more lookup components subsequently.

If you join connections in the wrong order or wish to change which is the Main data flow and which is the Lookup data flow, you can do this by right-clicking on one of the connectors and selecting Set this connection as Lookup or Set this connection as Main.


9. Right-click on the tJoin component, select Row > Main, and drop the connector onto the tLogRow component.

10. Now click on the tJoin component. In its Basic settings, click on the Edit Schema button.

11. We need to copy some of the input fields into the tJoin component. You can see in the left-hand side window that both of the input schemas are shown in separate tabs. Holding down the Ctrl key, click on empid and empname our main delimited input file. Click on the top-arrow
button in the middle bar of the window to copy these over to the tJoin_1 schema. (Clicking on the two-arrow button will copy all the fields from the left-hand side window to the right-hand side window.)

12. Now click on the second tab in the left-hand side window showing the schema for the second delimited file input. Select the dept_name column and copy this over to the right-hand side window, using the top-arrow button.


13. Click on OK to save the changes, and when prompted to propagate the schema changes, click on Yes. This will copy the tJoin schema over to the delimited output component.

14. In the Basic settings tab of the tJoin component, click on the checkbox Include lookup columns in output.

15. In the Column mapping box, add a row and select dept_name from the Output column drop-down menu. In the Lookup column section, select row3.dept_name.

14. In the Key definition table, add a row and select empdept as the Input key attribute field, and row3.dept_id as the Lookup key attribute field.


Basically, What we have done is that In the Key definition table, we've noted that the empdept field in the main input file maps to the dept_name field in the lookup file. We've also configured in the Column mapping table, that in the Output column section, the dept_name field should be
populated with values from the lookup file dept_name field.

15. Now Run the Job and check the output.




In the output stream, We can see that Department name has been populated corresponding to dept id from the Lookup file.

We can also perform the Lookup operation using the tMap component instead of tJoin. You guys can go ahead and try to create a job performing lookup using tMap component. let me know, if you face any issues.

You may also like to read..

2 comments:

  1. Thnx for the post!

    I would like to know your opinion: Which method do you think is better for this op, tJoin or tMap?

    ReplyDelete
  2. tJoin and tMap are quite different components, though tMap can be used to perform the same functionality as tJoin.

    However, tMap is more powerful in terms of functionality. I am listing some of them as below:

    1. tMap can have many outputs links as compared to tJoin which can have a main and reject links.
    2. With tMap we can use expression on the columns while providing the joining condition. I think it is not possible for tJoin, Only exact match between the keys is possible.
    3. In tMap we have option to store the intermediate data in the disc.
    4. In tMap, we can enable the option to reload the lookup for every record.

    Hence tMap is quite powerful component as compared to tJoin which is basic. Being a powerful component, it may take more space and time to load in the memory while running than tJoin. Hence, in our situation tJoin is recommended.

    Please let me know, your thoughts.

    ReplyDelete