Monday, 5 May 2014

Talend Job Design - Performance Optimization Tips


I am going to share a few of the performance tuning tips that I follow while designing Talend Jobs. Let me know your comments, and also let me know if there are any other performance optimization methods that you follow and find helpful.

Here we go..


1. Remove unnecessary fields/columns ASAP using the tFilterColumns component.



It is very important to remove data that is not required from the Job flow as early as possible. For example, suppose we have a huge lookup file with more than 20 fields, but we only need two of them (Key, Value) for the lookup operation. If we do not filter the columns before the join, the whole file is read into memory for the lookup and occupies space unnecessarily. If we filter the fields and keep only the two required columns, the lookup data occupies much less memory: in this example roughly 10 times less, assuming the fields are of similar size.
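Since Talend Jobs are compiled to Java, the effect can be illustrated with a small plain-Java sketch. This is not the code that tFilterColumns generates; the class name, the column positions and the sample rows are all made up for illustration. It only shows that a lookup built from two columns never has to hold the other fields.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the row layout and column positions are assumptions,
// not Talend-generated code.
class LookupColumnFilterSketch {

    // Build the in-memory lookup from the key and value columns only,
    // so the remaining columns of each wide row are never retained.
    static Map<String, String> buildLookup(List<String[]> wideRows, int keyCol, int valueCol) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] row : wideRows) {
            lookup.put(row[keyCol], row[valueCol]);
        }
        return lookup;
    }

    public static void main(String[] args) {
        // Two sample rows with extra columns that the lookup does not need.
        List<String[]> rows = List.of(
            new String[] {"K1", "V1", "unused", "unused"},
            new String[] {"K2", "V2", "unused", "unused"});
        Map<String, String> lookup = buildLookup(rows, 0, 1);
        System.out.println(lookup.get("K2"));
    }
}
```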


2. Remove unnecessary data/records ASAP using the tFilterRow component.



Similarly, it is necessary to remove records that are not required from the Job flow as early as possible. The less data there is in the flow, the less work every downstream component has to do, so the Job generally performs better.


3. Use a Select query to retrieve data from the database - When retrieving data from a database, it is recommended to use a select query in the t<DB>Input component (e.g. tMysqlInput, tOracleInput etc.) to select only the required data. In the select query itself you can list the fields to fetch and add a where condition so that only the required rows are returned. This way only the required data enters the Job flow, rather than a complete table unload.
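As a rough illustration, the query is written as a Java string expression. The sketch below uses a hypothetical customers table with made-up column names; adapt it to your own schema.

```java
// Hypothetical table and column names, for illustration only.
class SelectQuerySketch {

    // Names only the two columns the Job needs and filters rows in the
    // database, instead of using SELECT * and filtering later in the Job.
    static String requiredDataQuery() {
        return "SELECT customer_id, customer_name "
             + "FROM customers "
             + "WHERE country = 'IN'";
    }

    public static void main(String[] args) {
        System.out.println(requiredDataQuery());
    }
}
```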




4. Use database Bulk components - When loading huge datasets into a database from a Talend Job, it is recommended to use the Bulk components that Talend provides for almost all databases. For more details and a demonstration of performance optimization using Bulk components click here.






5. Store on Disk option - There can be several reasons for low Job performance. The most common ones include:


  • Running a Job which contains a number of buffer components such as tSortRow, tFilterRow, tMap, tAggregateRow or tHashOutput.
  • Running a Job which processes a very large amount of data.


In Jobs that contain buffer components such as tSortRow and tMap, you can change the basic configuration to store temporary data on disk rather than in memory. For example, in tMap, select the Store on disk option so that lookup data is stored at a defined path. The whole dataset then does not have to be held in memory, which keeps memory available for other operations; the temporary data is fetched from disk when needed.




6. Allocating more memory to the Job - If you cannot optimize the Job design, you can at least allocate more memory to the Job. Allocating more memory lets the Job hold more data in memory and can help it perform better.


-Xms specifies the initial heap size of the Job's JVM.
-Xmx specifies the maximum size to which the heap can grow (the maximum memory allocated to the Job).
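To check which heap limits a JVM actually received, a small standalone Java class like the one below can help. It is only a sketch: the class name is made up, and it would be launched with arguments such as -Xms256M -Xmx1024M, values chosen purely for illustration.

```java
// Standalone sketch: prints the heap limits of the JVM it runs in.
// Launch with JVM arguments such as -Xms256M -Xmx1024M (illustrative values).
class HeapSettingsSketch {

    // Convert a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit; totalMemory() is the heap currently reserved.
        System.out.println("Max heap (MB): " + toMegabytes(runtime.maxMemory()));
        System.out.println("Current heap (MB): " + toMegabytes(runtime.totalMemory()));
    }
}
```

The reported maximum is often slightly below the -Xmx value, depending on the JVM and garbage collector in use.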




7. Parallelism - Most of the time we need to run a few Jobs/Subjobs in parallel to maximize performance and reduce overall Job execution time. However, Talend doesn’t automatically execute Subjobs in parallel. E.g. if we have a Job which loads two different tables from two different files and there is no dependency between the two loads, Talend will not automatically execute them in parallel. Talend will execute one of the Subjobs (chosen arbitrarily), and only when it has finished will it start executing the second Subjob. You can achieve parallelization in the following two ways:


  • Using the tParallelize component of Talend. (only available in Talend Integration Suite)
  • Running Subjobs in parallel by using multithreaded execution. This option is also available in Talend Open Studio, but it is disabled by default. You can enable it from the Job view. Visit the article “Parallel Execution Sub Jobs in Talend Open Studio” for more details and a demonstration of parallel execution of Subjobs in Talend Open Studio.
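The difference between sequential and parallel execution of independent Subjobs can be sketched in plain Java. This is not how Talend implements either option; the class, the method names and the two "table loads" are hypothetical stand-ins. It only shows two unrelated units of work submitted to a thread pool so that neither waits for the other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: each Callable stands in for one independent Subjob.
class ParallelSubjobsSketch {

    // Hypothetical stand-in for a Subjob that loads one table from one file.
    static Callable<String> load(String tableName) {
        return () -> "loaded " + tableName;
    }

    // Run all Subjobs concurrently, one thread each, and collect their
    // results in the order the Subjobs were given.
    static List<String> runInParallel(List<Callable<String>> subjobs) {
        ExecutorService pool = Executors.newFixedThreadPool(subjobs.size());
        try {
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every Subjob has completed.
            for (Future<String> done : pool.invokeAll(subjobs)) {
                results.add(done.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for subjobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A subjob failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> subjobs = List.of(load("table_a"), load("table_b"));
        System.out.println(runInParallel(subjobs));
    }
}
```

Parallel execution only pays off when the Subjobs are truly independent, as in the two-table example above, and when the machine and the database can handle the extra concurrent load.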


8. Use Talend ELT components when required - ELT components are very handy and help to optimize the performance of a Job when we need to transform data within a single database. There are a couple of scenarios where we can use ELT components, e.g. performing a join between data in different tables of the same database. The benefit of using ELT components is that the data is not unloaded from the database tables into the Job flow for the transformations. Instead, Talend automatically creates Insert/Select statements which run directly on the DB server. So if the database tables are indexed properly and the data is huge, the ELT method can prove to be a much better option in terms of Job performance. For more details on ELT components click here.
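To give an idea of what such a pushed-down statement looks like, here is a sketch of an Insert/Select held as a Java string. The tables and columns are hypothetical, and this is not the exact SQL that the ELT components generate; it only shows a join and a load that happen entirely inside the database.

```java
// Hypothetical schema: table and column names are assumptions for illustration.
class EltPushdownSketch {

    // An INSERT ... SELECT that joins two tables and writes the result to a
    // third, all on the database server, so no rows pass through the Job.
    static String insertSelect() {
        return "INSERT INTO order_summary (order_id, customer_name, amount) "
             + "SELECT o.order_id, c.customer_name, o.amount "
             + "FROM orders o "
             + "JOIN customers c ON c.customer_id = o.customer_id";
    }

    public static void main(String[] args) {
        System.out.println(insertSelect());
    }
}
```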





9. Use the SAX parser over Dom4J whenever possible - When parsing huge XML files, try using the SAX parser as the Generation mode in the Advanced settings of the tFileInputXML component. However, the SAX parser comes with a few downsides, e.g. we can only use basic XPath expressions and cannot use expressions such as last() or array selection of data with [ ]. But if your requirement can be met with the SAX parser, you should prefer it over Dom4J.


Visit the article “Handling Huge XML files in Talend” for a demonstration of performance optimization with the SAX parser.
Visit the article “Difference between Dom4J and SAX parser in Talend” for the detailed differences between the Dom4J and SAX parsers.
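SAX suits huge files because it streams the document as a sequence of events instead of building the whole tree in memory. The plain-Java sketch below illustrates this using the JDK's built-in SAX API. It is not the code Talend generates for tFileInputXML, and the class name and sample XML are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: counts elements while streaming, without building a document tree.
class SaxStreamingSketch {

    static int countElements(String xml, String elementName) {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // Called once per opening tag as the parser streams through the input.
                if (qName.equals(elementName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/></orders>";
        System.out.println(countElements(xml, "order"));
    }
}
```

Because the handler only sees one event at a time, anything that needs the whole document (such as selecting the last element) is not available, which matches the XPath limitations mentioned above.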


10. Index database table columns - When updating data in a table through a Talend Job, it is recommended to index the database table on the same columns that are defined as Key in the Talend database output component. With an index on the key, the database can locate the rows to update without scanning the whole table, so the Job runs much faster than with non-indexed keys.






11. Split a Talend Job into smaller Subjobs - Whenever possible, one should split a complex Talend Job into smaller Subjobs. Talend uses pipeline parallelism, i.e. after processing a few records a component passes them to the downstream components even if it has not yet finished processing all records. Hence, if we design a Job with a large number of complex operations in a single Subjob, the performance of the Job will drop. It is advisable to break a complex Talend Job into smaller Subjobs and then control the flow of the Job using Triggers in Talend.

Thanks Guys for reading this post. I am looking forward to your expert comments.

This article was written by Vikram Takkar and published on www.vikramtakkar.com. Please let me know if you see this article on any other website/blog.
