How to use prepared statements for efficiency and security in Talend queries


In the majority of Talend tutorials covering database operations, I found little or no use of RDBMS prepared statements. To build or parametrize queries, most Talend users and developers seem to prefer a pure string concatenation approach. This is a bad habit: it opens the door to some serious security flaws and makes no use of the caching mechanisms of modern RDBMS. Although the folks at Talend don't make your life easier because of some choices in the I/O DB components, it's still possible to design a job that makes full use of PreparedStatements. In this tutorial I'm going to introduce a technique for some common use cases, hardening security and improving debugging speed at the same time.

What is a prepared statement and why you should use it

When sending a parametrized query to an RDBMS, no matter whether it's a SELECT, an INSERT, an UPDATE or whatever, the most common thing a Talend developer does is something like this inside a t***Input component:

"SELECT foo, bar FROM database.table WHERE stuff = '" + globalMap.get("var") + "';"

Although very easy to do, this is by far the worst way to run a parametrized query on an RDBMS, for several reasons.

First of all, the query is not safe, as string concatenation makes it vulnerable to SQL injection. I don't mean that your tightly controlled ETL is exposed to malware here and there. But even the most innocent scenario can carry a side effect. What if “var” has a single quote inside? If you're unlucky, you'll corrupt your data; at the very least, your query will break, and so will the entire ETL job.
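
To make the risk concrete: suppose “var” holds the (hypothetical) commenter name O'Brien. The concatenation above would produce:

SELECT foo, bar FROM database.table WHERE stuff = 'O'Brien';

The quote inside the name closes the string literal early, so MySQL chokes on the leftover Brien'; a deliberately crafted value could just as easily append arbitrary SQL instead of merely breaking the statement.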

Second, the query is going to be very inefficient, as it won't use the caching mechanism that every modern RDBMS has. If you're going to issue just one query, this probably won't hurt you too much. But what if your query is triggered by a tFlowToIterate component? What about several queries with a different value for “var” each time? Since the queries will all differ from each other, the RDBMS won't be able to resolve the query plan in advance, optimize the result set, or cache the outputs. At the end of the day, you're going to suffer a serious performance bottleneck.

Third, string concatenation is slow and hard to debug. If every query were as simple as the example above, you would probably not believe me. But queries tend to be much more complicated than this, and concatenating Java strings here and there quickly becomes a pain in the neck: slow to write and absurdly hard to debug.

A prepared statement is nothing more than a normal query with one or more placeholders. It's easier to show than to explain:

SELECT foo, bar FROM database.table WHERE stuff = ?

This way you’re going to tell the RDBMS how the query will look like and which conditions have to be met. This is done before assigning a value to parameters and execute the query itself. The major advantage is the query plan can be solved at this earlier stage and it can be cached for subsequent queries of the same kind but with different values for the parameters set. This will give you a huge performance improvement.

Aside from that, don't forget you'll have a more robust query that cannot be injected, and a more convenient way to build conditions, easier to code and to debug.
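
For reference, this is what the same pattern looks like in plain JDBC, the API Talend's DB components build upon. This is just a minimal sketch to illustrate the lifecycle; the connection URL, credentials and values are placeholders, not part of the tutorial's job:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedStatementSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/database", "user", "password");
             // The statement is prepared (and its plan resolved) only once...
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT foo, bar FROM database.table WHERE stuff = ?")) {
            // ...then executed many times with different parameter values.
            for (String value : new String[] { "first value", "O'Brien" }) {
                ps.setString(1, value); // escaping is handled by the driver
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("foo") + " | " + rs.getString("bar"));
                    }
                }
            }
        }
    }
}

Note that a value like O'Brien is now perfectly safe: the driver sends it as data, not as SQL text.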

The range of possible use cases is really wide. For the purpose of this Talend tutorial, I'm going to use a PreparedStatement to filter a table.

How to use PreparedStatements in Talend

For this example, I'll use a MySQL table which stores comments from my wife's blog. Let's say we're interested in content written by commenters whose names are stored somewhere else (another table, a CSV file, a Java List… it doesn't matter). This means I have some filtering conditions that I know at runtime only.

Using tFlowToIterate to trigger multiple queries (one query for each name, concatenating the resulting context variables inside the WHERE condition) would bring us all the flaws I described before. The idea remains the same, but we will use PreparedStatements to do it in a smarter way:

  1. A starting component feeds a schema with all the relevant parameter conditions. We put them into a regular TOS data flow.
  2. A statement is prepared to tell MySQL which query we're going to execute and which parameters it has. At this stage we just define them, without sending actual values.
  3. For each parameter set (commenter's name), we execute the query and get the results. This is the stage where we actually send MySQL the parameter values and trigger the query.

First of all, let's import the table schema into the metadata repository, as it's always a good habit to do. You can refer to the official Talend manual if you don't know how to do that: it's really a 101 operation in TOS, by the way.

Retrieve DB schema in Talend

Then, we’re going to design the job. As you may see in the screenshot on top, it’s really trivial, but I want to point you out three particularities:

  1. I used a component from the family of t***Row components instead of the t***Input ones. This is because Input components don't support prepared statements at the moment, while the more generic Row ones do. This gives you a bit less flexibility and speed in job design (e.g. there is no Guess Schema button), but nothing you can't manage with some good habits.
  2. t***Row components don't have a structured output schema, but you can get the raw ResultSet to parse. This is easily done using a very powerful yet seldom-used component, tParseRecordSet.
  3. I used an external (persistent) tMySQLConnection to exploit the caching mechanism to its full extent.

For the sake of simplicity, I hardcoded the filtering conditions (the four commenters I'm interested in) inside a tFixedFlowInput instance. Each line stores a name, which I put into a column called mittente (Italian for sender).

Delimited fixed flow in Talend

Now, let's create a tMySqlRow instance with the schema we defined in the repository and sync it to propagate the changes. As I said, only t***Row components support PreparedStatement at the moment. Adjust the query, adding your filtering conditions using PreparedStatement placeholders (‘?’). Here's my example:

"SELECT
 `SK_RICETTA`,
 `CD_CATEGORIA`,
 `DES_URL`,
 `DES_MITTENTE`,
 `DES_TESTO`,
 `CD_SESSO`,
 `FLG_DOMANDA`
FROM `T_COMMENTI`
WHERE DES_MITTENTE = ?"

Tip: a fast way to create a component which makes use of a generic schema is to drag your schema definition icon from the repository onto an empty space in the job design canvas and select which component type to create. The relevant parameters will be set automatically.

Unfortunately, t***Row components are quite crude and don't support structured output schemas. You can, however, propagate the entire ResultSet into a single column:

  1. Switch the schema to built-in.
  2. Add an Object column to the schema. I called it RESULT_SET.
  3. Check Propagate Query's record set in Advanced Settings and select your RESULT_SET column. This column will store the entire dataset coming from the DB in raw form.
  4. Remove the other columns if you like, as they won't be fed anyway. However, you can keep any columns coming from the incoming connection, if you have one, as they will be propagated (this proves to be great for fast-lookup scenarios).

Tip: if you propagate the schema by mistake, just drag and drop the table definition schema from the repository onto the tParseRecordSet instance to restore it.

Now it’s time to tell the RDBMS where to find parameters’ values. This is simple done with positional indexing, relating to question mark placeholders positions. Check Use PreparedStatement in Advanced Settings from tMySqlRow and fill the table with position, type and value for each parameter you defined inside the query. The Value column can host a define, for simple query, or a field coming from an incoming connection. In this case, we’ve just one parameter column stored in incoming row3.mittente.

PreparedStatement parameters mapping
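
In JDBC terms, and this is just my sketch of what the component presumably does behind the scenes for each incoming row, every line of that table boils down to a positional bind call:

class BindSketch {
    // Hypothetical sketch of the bind step performed for each incoming row,
    // given the already-prepared statement and the value of row3.mittente.
    static void bindParameters(java.sql.PreparedStatement ps, String mittente)
            throws java.sql.SQLException {
        ps.setString(1, mittente); // position 1 = the first '?' in the query, type String
    }
}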

Now, let’s go to tParseRecordSet. You just need to tell the component which column holds the record set and how to store the outgoing values. Be careful to Value column in the mapping table: elements are double quoted, here, as they represents the fields as coming from the database query.

Parse RecordSet in Talend

Finalize your subjob. For this tutorial, I simply ended it with a tLogRow instance and, of course, a tMySqlClose.
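
Conceptually, and again this is my own illustrative sketch rather than the component's actual source, tParseRecordSet walks the raw ResultSet stored in the Object column and maps each database field, referenced by its double-quoted name, onto the matching output column:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ParseRecordSetSketch {

    // Minimal stand-in for the structured Talend output row (hypothetical)
    static class OutputRow {
        String desMittente;
        String desTesto;
    }

    // Split the raw record set into structured rows, looking fields up by
    // their database names - hence the double quotes in the Value column.
    static List<OutputRow> parse(ResultSet rs) throws SQLException {
        List<OutputRow> rows = new ArrayList<>();
        while (rs.next()) {
            OutputRow row = new OutputRow();
            row.desMittente = rs.getString("DES_MITTENTE");
            row.desTesto = rs.getString("DES_TESTO");
            rows.add(row);
        }
        return rows;
    }
}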

Execution

When you execute the subjob, the query is prepared first. As I said, this is done only once and then cached.

Then, for each row in the tFixedFlow, a query is triggered by tMySqlRow with the current parameter values. The result is then parsed and split into Talend columns for further use. As you can see in my simulation above, I trigger four queries (one for each commenter's name I'm interested in), which yielded a total of 842 rows when parsed.

Conclusions

This technique has proven to be extremely useful when you need to filter a DB query by a set of parameters you have at runtime only, as the only other ways are a time-consuming tFlowToIterate design or a very memory-hungry tMap with inner joins. This is especially true when the filtering parameters are properly indexed, as the query planners of RDBMS are usually very effective in such cases.

Since t***Row is a processing component, you can use it for fast lookups against a table. You're not limited to sending it just the parameter fields (i.e. the sender, in our case): you can feed it all the data from an incoming dataset, trigger the queries you need, and propagate the whole set of incoming data plus the ResultSet to the outgoing connection, where tParseRecordSet will split and eventually normalize the rows in the right place. This is particularly useful, for example, when you have a relatively small set of elements to look up but a very large lookup table, since you won't need to load the entire lookup table into memory, as Talend usually does, wasting a huge amount of memory.
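
As a rough sketch of why this saves memory (plain JDBC again, with the tutorial's table and column names and a made-up list of keys): instead of loading the whole lookup table into the heap, each incoming key is resolved on the fly through the cached prepared statement, so only one result set at a time lives in memory:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;

public class StreamingLookupSketch {

    // Enrich each incoming key through the database instead of holding
    // the whole lookup table in the JVM heap.
    static void lookupAll(Connection conn, List<String> incomingKeys) throws Exception {
        // Prepared once; the RDBMS reuses the cached plan for every iteration.
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT DES_TESTO FROM T_COMMENTI WHERE DES_MITTENTE = ?")) {
            for (String key : incomingKeys) {
                ps.setString(1, key);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(key + " -> " + rs.getString("DES_TESTO"));
                    }
                }
            }
        }
    }
}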

Finally, don’t underestimate the use on even simple one-shot parametrized queries. You’ll find real value building query without string concat, which is really error-prone on complex SQL statements and unsafe in most situations. Regards!

  • natus

    Thanks, very useful. Do you think it would be possible to easily create a single component that replaces the three you use (***flow/t***row/tParse)?
    Because, as you said, it's exhausting and error-prone to manually write names in the ***flow, or in double quotes in tParse…

    • Gabriele (http://gabrielebaldassarre.com/)

      It’s possible, but probably it’s not a great deal because of lacking flexibility. Let’s say, for example, you switch your DB from MySql to Oracle. And it would be a very complex component, anyway. You probably would try to bake a child job for the purpose, instead.

      Anyway, sticking to existing best practices is the best way to avoid errors in such a case. The wise use of generic schema metadata, which you can drag and drop onto a component to sync it, is possibly the best example of good complexity isolation. Strange to say, the most useful usability tips in TOS are the most undocumented ones.

      • natus

        > a child job instead
        I understand the prepared query could be a context variable for the child job, but how would you configure the tParse/tFlow from the parent job (say, all the double quotes in tParse)?
        > you switch from MySql to Oracle
        I guess I would change from tPreparedMysql to tPreparedOracle. It's the same problem today with tInputMysql versus tInputOracle, no?
        > generic schemas metadata you can drag and drop

        that’s right, thanks again, I discover this great feature

        • Gabriele (http://gabrielebaldassarre.com/)

          Yes ofc you’ll suffer a lot of limitations, while the switch from MySql to Oracle comes to be critical if you would to boxing everything into a custom component (you’ll end up having a family of components, at the end of the day).

          Talend is a nice tool, but because of its inner design it's quite difficult to manipulate metadata at runtime. That's a very typical example of it.

          • natus

            My previous approach to getting a subset of a large table “A” was to first load the keys into a dedicated table “B” (t***output), then inner join “A” and “B” on the key to get the result. This was faster than the prepared SELECT you show here.
            Maybe the inner join strategy is better when you can create a dedicated table?

          • Gabriele (http://gabrielebaldassarre.com/)

            You’re definitively right: a prepared statement is most probably not the fastest solution for handling such kind of problems. But it’s more robust.
            Let’s say, for example, that your “B” table suffers a huge, sudden, increasing in size. Then, your job, thus your ETL, will sooner or later tend to fail because of exhausted heap memory, while doing a disk-based join will dramatically slow down the execution. Although slower, an inline lookup using a prepared statement will keep you safe, as it’s fully scalable and let you focus on RDBMS performance optimizations rather than fixing JVM memory issues instead.

          • natus

            I guess it’s a misunderstanding.
            In the strategy I explained, the join is done by the database (with the “B” table I truncate/insert before each input (equivalent to your tFixedFlow)). The jvm get the same little subset, but this is faster