tExecuteRScript

This Talend Open Studio component provides a complete environment for executing code written for the popular statistical platform R and retrieving the results. It’s built around the JRI interface of the rJava package. Although this package offers 100% compatibility when executing code, its I/O features are quite rudimentary: it can only retrieve String, Int and Double arrays. This limitation is imposed by the inner architecture of R and doesn’t depend on the component itself. Some features of the component:

  • Low level connection to an existing R installation
  • Support for external .R file load and inline R code
  • Two logging possibilities (Verbose/Silent)
  • Results mapping to convert R symbols to a standard ‘row’ Talend connection
  • Log redirection to tLogCatcher elements, if available
  • Autocast of output, if possible
  • Written in true OOP using the robust Talend Bridge framework

Download tExecuteRScript for Talend Open Studio

How to use the component

To install the component, some prerequisites must be met:

  • A valid installation of R must be present on the system. The component was tested with R > 2.15 and R > 3.0.1, but it may work with other versions as well
  • The R package rJava must be installed. Follow this guide if you don’t know how to do it, and consider looking at this thread if you need further assistance.
  • These environment variables must be set with the proper values: R_HOME, JAVA_HOME
  • The path to the R.dll library (libR.so under *nix), the path to the R executable and the path to jri.dll (libjri.so) must be present in your system Path. Please note that these libraries usually come in two flavours (32/64 bit), so set the proper paths according to your system.
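
The environment part of these prerequisites can be checked before running a job. The following is only a minimal sketch, not something shipped with the component: the class RPrerequisites and its method are hypothetical names, and it only verifies that R_HOME and JAVA_HOME are set, then prints the java.library.path the JVM will search for native libraries such as jri.dll or libjri.so.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight check; not part of tExecuteRScript. */
final class RPrerequisites {

    /** Returns the names of the required variables that are unset or blank. */
    static List<String> missingVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {"R_HOME", "JAVA_HOME"}) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingVariables(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("R_HOME and JAVA_HOME are set.");
        } else {
            System.out.println("Missing environment variables: " + missing);
        }
        // Directories the JVM searches for native libraries (jri.dll / libjri.so).
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
    }
}
```

If the printed java.library.path does not contain the folder holding the JRI library, the system Path settings above are the first thing to review.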

Please note that this is a starting/generating component, not a processing component. This means that incoming data connections are not allowed here.

To use it, just write the code you want R to execute in the box. Please note that it must be correctly quoted and escaped. Alternatively, you can source the code from an external .R file. This is by far the fastest route, especially if you have lots of code, since it doesn’t need to be quoted or escaped. On the other hand, inline code can be manipulated (parametrized) at runtime directly on the Talend side, while external .R scripts cannot, so the final choice depends on your needs. In both cases, you can pass command-line parameters to R using the proper String (which must be quoted/escaped too).
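
To make the quoting and escaping concrete, here is a small sketch of R code written as Java String literals, which is what the code box expects. The class InlineRCode, its methods and the parameter n are hypothetical and only serve the illustration; the R snippets themselves are plain base R.

```java
/** Illustration only: R code written as Java String literals. */
final class InlineRCode {

    /** Builds R code at runtime; n stands for any value available on the Talend side. */
    static String headOfIris(int n) {
        return "print(head(iris, " + n + "))";
    }

    /** Double quotes inside the R code must be escaped; \n separates R statements. */
    static String greeting() {
        return "msg <- \"hello from R\"\nprint(msg)";
    }

    public static void main(String[] args) {
        System.out.println(headOfIris(3));
        System.out.println(greeting());
    }
}
```

The first method shows the kind of runtime parametrization that inline code allows and an external .R script does not.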

At the very end of the tab, you’ll find a parameter that lets you choose whether log messages from the component are sent to tLogCatcher instances. If it is disabled, they are only printed to stdout/stderr.

In the advanced parameters, you can choose between two output redirection strategies. With Standard clients, output coming from R is redirected to the Talend logging facilities, while with Silent clients, the R console is muted. Irrespective of the choice, R autoprinting is disabled, so you must explicitly put a “print” statement in your R code to output something to the console.

Advanced use scenario: get results from R evaluations

Although sometimes you just need to execute R code, more often you need to get the computation results back on the Talend side. Within the limitations imposed by JRI, this is done from the advanced parameters tab.

For each column of the output schema, you can map an R expression returning an array of a given type (only String, Double and Integer are supported). If the expression doesn’t return an array, or the array cannot be cast to String, Double or Integer on the R side, you’ll probably get a NullPointerException somewhere. As the problem originates in R, not Talend, you’ll need to look at your R code and expressions to fix it (for example, factors must be converted to character strings, Booleans to integers, and so on).

Conversion to the Talend schema column type is automatic, where allowed. For example, a Double array coming from R can be stored in a Float column of the output schema without causing any error.

In a very typical scenario, you’ll almost always need to explode your complex R output (i.e. a data frame, a matrix, a list…) into a primitive array of one of the supported types. This is not elegant, but it’s unavoidable given the intrinsic limits of JRI. The good news is that you don’t need to clutter your R code, because you can put the extraction code right inside the mapping expression field. As every good R programmer should know, it’s always possible to do a one-line data manipulation using R accessors.

This is made possible because the main R code (the inline code/the external .R script) is always evaluated first while the output expressions are evaluated later.
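
A few one-line mapping expressions in that spirit, written as Java String literals. This is only a sketch against R’s built-in iris data set: the column names (sepalLength, species, isWide) are hypothetical, and the map is just a convenient way to list the expressions here, not the way the component stores them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: candidate mapping expressions for an output schema. */
final class MappingExpressions {

    /** Hypothetical output column -> R expression returning a supported array type. */
    static Map<String, String> irisColumns() {
        Map<String, String> columns = new LinkedHashMap<>();
        // A numeric data frame column is already a vector of doubles.
        columns.put("sepalLength", "iris$Sepal.Length");
        // Factors must be converted to character strings first.
        columns.put("species", "as.character(iris$Species)");
        // Logical values must be converted to integers first.
        columns.put("isWide", "as.integer(iris$Sepal.Width > 3)");
        return columns;
    }

    public static void main(String[] args) {
        irisColumns().forEach((column, expr) -> System.out.println(column + " <- " + expr));
    }
}
```

Each expression pulls a single vector out of a data frame and casts it where needed, so the main R script can stay untouched.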

The number of output rows is set by the longest of the arrays returned by the output expressions; shorter arrays are padded with null values by default. An optional “Loop expression” parameter can be set to drive the loop on a different basis. This R expression, as usual, must return an array of one of the allowed types and must be correctly quoted/escaped. For example, the expression “1:10” will force 10 output rows, regardless of the size of the other expressions.
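
The default row-count rule can be modelled in a few lines of Java. This is a sketch of the behaviour described above, not the component’s actual code; the class RowAlignment and its methods are hypothetical, and the Loop expression case is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: aligning column arrays of different lengths into rows. */
final class RowAlignment {

    /** The number of rows is the length of the longest column. */
    static int rowCount(List<Object[]> columns) {
        int rows = 0;
        for (Object[] column : columns) {
            rows = Math.max(rows, column.length);
        }
        return rows;
    }

    /** Turns column arrays into rows, using null where a column is too short. */
    static List<Object[]> toRows(List<Object[]> columns) {
        int rows = rowCount(columns);
        List<Object[]> result = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            Object[] row = new Object[columns.size()];
            for (int c = 0; c < columns.size(); c++) {
                Object[] column = columns.get(c);
                row[c] = r < column.length ? column[r] : null;
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Object[]> columns = new ArrayList<>();
        columns.add(new Object[] {"a", "b", "c"});
        columns.add(new Object[] {1.5, 2.5});
        for (Object[] row : toRows(columns)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With a three-element String column and a two-element Double column, the third row carries a null in the second position.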

Please remember one final word:

The component is not thread-safe

This is because of the inner limitations of R itself. So be really, really, really careful when you plan to use the component in a multi-threaded Talend job!
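
If you ever call into R from your own Java code running on several threads, one common way to stay safe is to let only one thread at a time reach R. The sketch below shows that general pattern with a single JVM-wide lock. It is an assumption-laden illustration: the class RGate is hypothetical, the component does not ship it, and the place where R would be called is only marked by a comment.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical guard that serializes every call into R. */
final class RGate {
    private static final Object LOCK = new Object();
    static final AtomicInteger active = new AtomicInteger();
    static final AtomicInteger maxActive = new AtomicInteger();

    /** Runs the task while holding the single lock, so tasks never overlap. */
    static void runExclusively(Runnable task) {
        synchronized (LOCK) {
            int now = active.incrementAndGet();
            // Track the highest number of tasks seen inside the guarded section.
            maxActive.accumulateAndGet(now, Math::max);
            try {
                task.run();
            } finally {
                active.decrementAndGet();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> runExclusively(() -> {
                // The call into R would go here.
            }));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("max concurrent R sections: " + maxActive.get());
    }
}
```

The counters are there only to show that the guarded section is never entered by two threads at once; a real guard would need just the lock.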

Available Output

  • Main output – if some output expressions are present, the component allows for 1-to-N standard data flow connections

12 Comments

  1. Sebastiao

    Hello,

    I’m trying to follow your tutorial, but I’m running into the following error:
    Exception in thread “main” java.lang.NoSuchMethodError: org.gabrielebaldassarre.customcode.jri.tJRIClient.eval(Ljava/lang/String;)Lorg/gabrielebaldassarre/customcode/jri/tJRIFlowBehaviour;
    [statistics] disconnected
    at sanofi.rscript_0_1.rscript.tExecuteRScript_1Process(rscript.java:533)
    at sanofi.rscript_0_1.rscript.runJobInTOS(rscript.java:945)
    at sanofi.rscript_0_1.rscript.main(rscript.java:810)

    Any idea?

    • Gabriele

      Hello,

      This is most probably a literal quoting problem. Don’t forget that R code and calls are Java literals, so you have to enclose the strings in double quotes!

  2. Guest

    hello,

    Thanks for this but :

    – i have installed rjava (I don’t know why I had to remove JAVA_HOME first..)
    I try a texecuteRjob and get :

    Cannot find JRI native library!
    Please make sure that the JRI native library is in a directory listed in java.library.path.

    Do I need to add the jri.jar somewhere in talend ?

    • Gabriele

      Hello…ehm…Guest 🙂

      rJava can be a real pain in the neck to install and get working. I’m not very competent with it, so you should ask for help in a more focused community, like Stack Overflow. Anyway, it definitely looks like a system path problem, not a classpath problem. You must add the path to the LD_LIBRARY_PATH environment variable. This could be of help: http://stackoverflow.com/questions/4894002/question-about-jri-error

      regards,
      Gabriele

    • natus

      Hello, Guest speaking,

      I solved it by adding “-Djava.library.path=/usr/lib/R/site-library/rJava/jri/” to the JVM settings arguments (in the execute job > advanced parameters)

      It looks a little dirty, but it works for now.
      Thanks

  3. natus

    Hello,

    talend/R/jri seem to be installed.

    Then I moved your component into a new job and tried it with the default value: “print(head(iris))”
    I get a console output that looks like the R output (but with strange symbols).
    “[0;38;05;40m  Sepal.Length Sepal.Width”…

    Anyway, I would like to know how to export a data frame to a CSV. I can’t get it working.
    Could you give me a little help on how to do it?

    Thanks a lot.

  4. Anton

    Hi Gabriele,

    This tool seems particularly useful, however, I have been trying to follow your tutorial:

    https://gabrielebaldassarre.com/2014/01/25/build-predictive-model-using-talend-open-studio-r/

    and have the following two errors

    1. “Type mismatch: cannot convert from TalendFlowBehaviour to tJRIFlowBehaviour”
    2. “The method loopVariable() is undefined for the type tJRIFlowBehaviour”

    I am using Talend 6.1.0 – do you know whether this is due to my setup of rJava etc., or possibly to the fact that I am using a different version of Talend?

    Thanks in advance!

    • Gabriele Baldassarre

      Hi Anton,

      can you pls provide me a more detailed error? I.e. the exact line of code that throws the exception?

      I’m not able to reproduce the bug, atm…

      tnx,
      gabriele

  5. Metin Kahraman

    Hello Gabriele,

    first of all, thanks for such a component. It facilitates the interaction with R a lot.

    However in my scenario it seems as if it is only possible to execute one tExecuteRScript component per job.

    Am I right or am I just doing something wrong?

    My (simple) scenario is:

    tExecuteRScript1 –row1–> tLogRow1
    |
    on SubjobOK
    |
    ˅
    tExecuteRScript2 –row2–> tLogRow2

    The first script executes as expected, but the second one gets internally stuck at:

    “org.gabrielebaldassarre.customcode.jri.tJRIClient client_tExecuteRScript_2 =
    factory_tExecuteRScript_2.getClient();”

    Is there anything else to consider to get this working ?
    It would be great if you could help me with this.

    Regards

    Metin

    • Gabriele Baldassarre

      Hi Metin,

      tnx for using my component! I love to hear it’s useful, although it’s still so damned unstable, due to the very buggy R-to-Java interface, that I’m starting to think it’s better to retire it from service 🙂 Anyway, I cannot directly reproduce the bug: which kind of exception did you get exactly?

  6. Chen

    Hi Gabrielle,

    I’ve followed the setup steps:
    installed R, rJava,
    set R_HOME to be install directory (in my case C:\Temp\devel\R-3.2.3),

    Then I created a Talend job having only the tExecuteRScript ( with default R code “print(head(iris))” ) and tLogRow.

    And I got following error messages after running the job:
    Type mismatch: cannot convert from TalendFlowBehaviour to tJRIFlowBehaviour
    symbols_tExecuteRScrpt_1 cannot be resolved to a type
    Syntax error, insert “VariableDeclarators” to complete LocalVraibleDeclaration
    The method loopVaraible() is undefined for the type tJRIFlowBehaviour

    What did I do wrong?
    Can you give me a hand?

    Thanks.

    Chen

  7. chen

    Hi Gabriele,

    I’ve followed the installation steps:
    installed R (in my case under C:\Temp\devel\R-3.2.3)
    installed rJava (in my case C:\Temp\devel\R-3.2.3\library\rJava\)
    set JAVA_HOME and R_HOME (in my case C:\Temp\devel\R-3.2.3\)

    Then I created a Talend job with tExecuteRScript and tLogRow:
    in tExecuteRScript I kept the default R code “print(head(iris))” and create a dummy column called “id” of string type.
    In the JVM arguments, I added -Djava.library.path=C:/Temp/devel/R-3.2.3/library/rJava/jri/x64

    But if I run the job I got following error messages:
    Type mismatch: cannot convert from TalendFlowBehaviour to tJRIFlowBehaviour
    The constructor tJRISymbol(tJRIOutputType) is undefined
    The method loopVariable() is undefined for the type tJRIFlowBehaviour
    Syntax error on token “,”, delete this token.

    What did I do wrong?
    Can you give me a hand?

    Thanks.
    BR,
    Zhengdao
