This Talend Open Studio component provides a complete environment for executing code written for the popular statistical platform R and retrieving the results. It’s built around the JRI interface of the rJava package. Although this package offers 100% compatibility in code execution, its I/O features are quite rudimentary: it can retrieve only String, Int and Double arrays. This limitation is imposed by the inner architecture of R and doesn’t depend on the component itself. Some features of the component:
- Low level connection to an existing R installation
- Support for external .R file load and inline R code
- Two logging possibilities (Verbose/Silent)
- Results mapping to convert R symbols to standard ‘row’ Talend connection
- Log redirection to tLogCatcher elements, if available
- Autocast of output, if possible
- Written in true OOP using the robust Talend Bridge framework
How to use the component
To install the component, some prerequisites must be met:
- A valid installation of R must be present on the system. The component was tested with R > 2.15 and R > 3.0.1, but it may work with other versions as well
- The R package rJava must be installed. Follow this guide if you don’t know how to do it; this thread may offer further assistance.
- These environment variables must be set with the proper values: R_HOME, JAVA_HOME
- The path to the R.dll library (libR.so under *nix), the path to the R executable and the path to jri.dll (libjri.so) must be present in your system Path. Please note that these libraries usually come in two flavours (32/64 bit), so set the paths according to your system.
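The prerequisites above can be checked before running a job. What follows is only a minimal sketch, not part of the component: the class `RPrerequisites` and its methods are hypothetical names, and it assumes, as the list above says, that the native libraries are looked up through the system Path.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check; it is not shipped with the component.
class RPrerequisites {

    // Returns the required variables that are absent or blank in the given environment.
    static List<String> missingVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {"R_HOME", "JAVA_HOME"}) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }

    // True if any directory listed in pathValue contains a regular file named fileName.
    static boolean onPath(String pathValue, String fileName) {
        if (pathValue == null) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (!dir.isEmpty() && new File(dir, fileName).isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Missing variables: " + missingVariables(System.getenv()));

        // Library names as given in the prerequisites list: R.dll/jri.dll or libR.so/libjri.so.
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        String rLib = windows ? "R.dll" : "libR.so";
        String jriLib = windows ? "jri.dll" : "libjri.so";

        String path = System.getenv("PATH");
        System.out.println(rLib + " on Path: " + onPath(path, rLib));
        System.out.println(jriLib + " on Path: " + onPath(path, jriLib));
    }
}
```

Run it with the same user and the same JVM (32 or 64 bit) that will run the Talend job, otherwise the result may not reflect what the component actually sees.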
Please note that this is a starting/generating component, not a processing component. This means that incoming data connections are not allowed here.
To use it, just write the code you want R to execute in the box. Please note this must be correctly quoted and escaped. Alternatively, you can source the code from an external .R file. This is by far the fastest route, especially if you have a lot of code, since it doesn’t need to be quoted/escaped. On the other hand, inline code can be manipulated (parametrized) at runtime directly on the Talend side, while external .R scripts cannot. So the final choice depends on your needs. In either case, it’s always possible to pass command-line parameters to R using the proper String (which must be quoted/escaped too). These parameters apply to both execution scenarios just described.
At the very end of the tab, there is a parameter that lets you choose whether log messages from the component are sent to tLogCatcher instances. If disabled, they are printed to stdout/stderr only.
In the advanced parameters, you can choose between two output redirection strategies. On Standard clients, output coming from R is redirected to the Talend logging facilities, while on Silent clients the R console is muted. Irrespective of the choice, R autoprinting is disabled, so you must explicitly put a “print” statement in your R code to output something to the console.
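To give an idea of what quoting and escaping means for inline code, here is a small sketch. It assumes the code box expects a Java String expression, as Talend fields usually do; the helper `RCodeQuoting.toJavaLiteral` is a hypothetical name used only for illustration.

```java
// Hypothetical helper; it only illustrates the escaping rules for inline R code.
class RCodeQuoting {

    // Turns raw R source into a Java string literal:
    // backslashes and double quotes are escaped, line breaks become \n.
    static String toJavaLiteral(String rCode) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : rCode.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': break; // carriage returns are dropped
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // Two lines of R code, with explicit print() since autoprinting is disabled.
        String rCode = "x <- c(\"a\", \"b\")\nprint(x)";
        System.out.println(toJavaLiteral(rCode));
    }
}
```

The printed literal is what you would paste into the box. Because it is an ordinary Java String, you can also concatenate values to it at runtime, which is the parametrization that an external .R script cannot offer.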
Advanced use scenario: get results from R evaluations
Although sometimes you just need to execute R code, more often you need to get the computation results back on the Talend side. Within the limitations imposed by JRI, this is done from the advanced parameters tab.
For each column of the output schema, you can map an R expression returning an array of a fixed type (only String, Double and Integer are supported). If the expression doesn’t return an array, or the array cannot be cast to String, Double or Integer on the R side, you’ll probably get a NullPointerException somewhere. As the problem originates in R, not Talend, you’ll probably need to look at your R code and expressions to fix it (for example, factors must be converted to character/String, booleans to integers and so on).
Talend schema column conversion, on the other hand, is automatic where allowed. For example, a Double array coming from R can be stored in a Float column of the output schema. This is perfectly allowed and doesn’t cause any error.
In a very typical scenario, you’ll almost always need to explode your complex R output (e.g. a data frame, a matrix, a list…) into a primitive array of one of the supported types. This is not elegant, but it’s unavoidable given the intrinsic limits of JRI. The good news is that you don’t need to clutter your R code, because you can put the extraction code directly inside the mapping expression field. As every good R programmer should know, it’s always possible to do a one-line data manipulation using R accessors.
This is possible because the main R code (the inline code or the external .R script) is always evaluated first, while the output expressions are evaluated afterwards.
The number of output rows is set by the longest of the arrays returned by the mapped expressions (for the shorter arrays, the missing values are set to null by default). An optional “Loop expression” parameter can be set to drive the loop on a different basis. This R expression, as usual, must return an array of one of the allowed types and must be correctly quoted/escaped. For example, the expression “1:10” will force 10 rows in output, regardless of the size of the other expressions.
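As an illustration of such one-line mapping expressions, the sketch below lists a few of them as Java strings, one per output column. Everything in it is hypothetical: the column names, and the R objects `df` (a data frame), `m` (a matrix) and `res` (a list), which are assumed to have been created by the main R code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical examples of mapping expressions; df, m and res are assumed
// to exist in the R session because the main R code is evaluated first.
class MappingExamples {

    // Output column name -> R expression returning a String, Double or Integer array.
    static Map<String, String> mappings() {
        Map<String, String> map = new LinkedHashMap<>();
        // A factor column must be converted to character before it can be retrieved.
        map.put("species", "as.character(df$species)");
        // A logical (boolean) column is converted to integers (TRUE -> 1, FALSE -> 0).
        map.put("isValid", "as.integer(df$is_valid)");
        // The first column of a matrix, as a numeric (double) vector.
        map.put("firstColumn", "as.numeric(m[, 1])");
        // A list element flattened into a plain numeric vector.
        map.put("scores", "as.numeric(unlist(res$scores))");
        return map;
    }

    public static void main(String[] args) {
        mappings().forEach((column, expr) -> System.out.println(column + " <- " + expr));
    }
}
```

Each expression does its own extraction and conversion, so the main R code can keep working with data frames, matrices and lists as usual.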
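The row-count rule can be sketched in a few lines of Java. This is not the component’s actual code, just an illustration under stated assumptions: `RowAlignment.toRows` is a hypothetical name, a negative `loopLength` stands for “no Loop expression set”, and arrays longer than a forced loop are assumed to be simply cut off.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how arrays of different lengths become output rows.
class RowAlignment {

    // columns: one array per mapped expression.
    // loopLength: length of the Loop expression result, or a negative value if not set.
    static List<Object[]> toRows(List<Object[]> columns, int loopLength) {
        int rowCount = loopLength;
        if (rowCount < 0) {
            // No Loop expression: the longest array decides the number of rows.
            rowCount = 0;
            for (Object[] column : columns) {
                rowCount = Math.max(rowCount, column.length);
            }
        }
        List<Object[]> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            Object[] row = new Object[columns.size()];
            for (int c = 0; c < columns.size(); c++) {
                Object[] column = columns.get(c);
                // Shorter arrays are padded with null.
                row[c] = r < column.length ? column[r] : null;
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Object[]> columns = new ArrayList<>();
        columns.add(new Object[] {"a", "b", "c"});
        columns.add(new Object[] {1.0});

        System.out.println(toRows(columns, -1).size()); // rows driven by the longest array
        System.out.println(toRows(columns, 10).size()); // rows forced by a loop of length 10
    }
}
```

With the two arrays above and no loop, the second column holds 1.0 in the first row and null in the others.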
Please remember one final word:
This is because of the inner limitations of R itself. So be really, really, really careful when you plan to use the component in a multi-thread Talend job execution!
- Main output – if some output expressions are present, the component allows for 1-to-N standard data flow connections