Today I am going to create a Talend job to perform Lookup operation. Most of time all the information is not provided in single file, it may be stored in multiple files. E.g. All the master data or definition to the dimensions are stored in different files. In these situation we need to use the lookup operation to fetch corresponding information from Lookup file.
Below is the snapshot of the Talend job to perform Lookup operation.
Here is the Sample Input File
Let's work through the steps of creating job:
1. First create the metadata of the main input delimited employee file by following the steps mentioned in post “Generating Metadata for CVS (Delimited) using Talend Open Studio”. Metadata will be available under the Metadata section in the Repository pane.
2. Drag the metadata created in the Step 1 to the Job designer and select component tFileInputdelimited component.
3. Drag component tFileInputdelimited to the job designer from Palette pane. Double click the component the edit the properties.
4. Enter the path of the Lookup delimited file in the File name/Stream text box. Also provide other details like Row Separator ("\n") and Field Separators (",") as per specifications of the Lookup files.
5. Click on the Edit Schema button (highlighted in blue in above screenshot) and enter schema details of the lookup file as below:
6. Now, its time to add tJoin and tLogRow components from the Palette pane to the job designer. tJoin component will help to join (perform lookup) two data stream defined above.
tLogRow will be used to display the output stream.
7. Right-click on the main Input delimited file component, select Row > Main, and drop the connector onto the tJoin component.
8. Right-click on the Lookup file component, select Row > Main, and again drop this onto the tJoin component. You'll notice that, even though we selected Row > Main in both cases, the resulting connector is slightly different. The first connector we made in step 7 is noted as Main, while the second is called Lookup. The order in which we connect the components to the tJoin is important. Join the primary data stream first, and one or more lookup components subsequently.
If you join connections in the wrong order or wish to change which is the Main data flow and which is the Lookup data flow, you can do this by right-clicking on one of the connectors and selecting Set this connection as Lookup or Set this connection as Main.
9. Right-click on the tJoin component, select Row > Main, and drop the connector onto the tLogRow component.
10. Now click on the tJoin component. In its Basic settings, click on the Edit Schema button.
11. We need to copy some of the input fields into the tJoin component. You can see in the left-hand side window that both of the input schemas are shown in separate tabs. Holding down the Ctrl key, click on empid and empname our main delimited input file. Click on the top-arrow
button in the middle bar of the window to copy these over to the tJoin_1 schema. (Clicking on the two-arrow button will copy all the fields from the left-hand side window to the right-hand side window.)
12. Now click on the second tab in the left-hand side window showing the schema for the second delimited file input. Select the dept_name column and copy this over to the right-hand side window, using the top-arrow button.
13. Click on OK to save the changes, and when prompted to propagate the schema changes, click on Yes. This will copy the tJoin schema over to the delimited output component.
14. In the Basic settings tab of the tJoin component, click on the checkbox Include lookup columns in output.
15. In the Column mapping box, add a row and select dept_name from the Output column drop-down menu. In the Lookup column section, select row3.dept_name.
14. In the Key definition table, add a row and select empdept as the Input key attribute field, and row3.dept_id as the Lookup key attribute field.
Basically, What we have done is that In the Key definition table, we've noted that the empdept field in the main input file maps to the dept_name field in the lookup file. We've also configured in the Column mapping table, that in the Output column section, the dept_name field should be
populated with values from the lookup file dept_name field.
15. Now Run the Job and check the output.
In the output stream, We can see that Department name has been populated corresponding to dept id from the Lookup file.
We can also perform the Lookup operation using the tMap component instead of tJoin. You guys can go ahead and try to create a job performing lookup using tMap component. let me know, if you face any issues.
You may also like to read..