In the majority of Talend tutorials related to database operations I found there’s no or little use of RDBMS prepared statements. To build or parametrize queries, the most part of Talend users and developers seems to prefer a pure string concatenation approach. But this is absolutely a bad habit, since it offers the side to some important security flaws and doesn’t make use of the caching mechanisms of modern RDBMS. Although the guys at Talend really don’t make your life easier because of some choices in I/O DB components, It’s still possible to design a job which make use of PreparedStatements at full extent. In this tutorial I’m going to introduce a technique for some common use cases, while hardening the security and improving the debugging speed at the same time.
What is a prepared statement and why you should use it
When triggering a parametrized query to a RDBMS, no matters if it’s a SELECT or an INSERT or UPDATE or whatever, the most common way a Talend developer does is something like this inside a t***Input component:
"SELECT foo, bar FROM database.table WHERE stuff = '" + globalMap.get("var") + "';"
Although very easy to do, this is by far the worst way to trig a parametrized query on a RDBMS. This is for many reasons.
First of all, the query is not safe, as string concatenation makes it vulnerable to SQL Injection. I don’t mean that your tightened ETL is opened to malware here and there. But even the most innocent scenario may carry a side effect. What about if “var” has a single quote inside? If you’re unlucky, you’ll mess your data. At least, your query will break up and so will do the entire ETL job.
Second, the query is going to be very inefficient as it won’t use the caching mechanism every modern RDBMS actually has. In case you’re going to issue just one query, this won’t probably hurt you too much. But what about your query is triggered by a tFlowToIterate component? What about several query with different value for “var”? Since the queries will be all different from each others, the RDBMS won’t be able to resolve the query plan in advance, optimize the resultset, cache the outputs. At the end of the day, you’re going to suffer a serious performance bottleneck.
Third, string concatenating is slow and hard to debug. If the query is so simple as the one I used for the example above, you would probably not believe me. But queries tend to be much more complicated than this, and concat java strings here and there is going to become a pain in the neck. Slow to write and absurdly hard to debug.
A prepared statement is no more a normal query with one or more placeholders. It’s easier to show than to explain:
SELECT foo, bar FROM database.table WHERE stuff = ?
This way you’re going to tell the RDBMS how the query will look like and which conditions have to be met. This is done before assigning a value to parameters and execute the query itself. The major advantage is the query plan can be solved at this earlier stage and it can be cached for subsequent queries of the same kind but with different values for the parameters set. This will give you a huge performance improvement.
Aside that, don’t forget you’ll have a more robust query that cannot be injected and a more suitable way to build conditions, easier to code and to debug.
Use cases can be really wide. For the purpose of this Talend tutorial, I’m going to use a PreparedStatement to filter a table.
How to use PreparedStatements in Talend
For this example, I’ll use a MySQL table which stores comments from my wife’s blog. Let’s say we’re interested in content made by commenters whose names are stored somewhere (another table, a CSV file, a java List…it doesn’t matter). This means I’ve some filtering conditions I know at runtime only.
Using tFlowToIterate to trig multiple queries (one query for each name, concatenating the resulting context vars inside the WHERE condition) will bring us to all those flaws I described before. The idea remains the same, but we will use PreparedStatements to do that in a smarter way:
- A starting component will feed a schema with all relevant parameter conditions. We put them into a regular TOS data flow.
- A statement is prepared to tell MySQL the query we’re going to execute and which parameters we have. At this stage we’re going to just define them, not sending actual values.
- For each parameter set (commenter’s name) we’ll execute a query and get the results. This is the stage when we actually send to MySQL the values for parameters and trig the query.
First of all, let’s import the table schema in the metadata repository, as It’s always a good habit to do. You can refer to Talend official manual if you don’t know how to do that: it’s really a 101 operation on TOS, by the way.
Then, we’re going to design the job. As you may see in the screenshot on top, it’s really trivial, but I want to point you out three particularities:
- I used a component from the family of t***Row components instead of t***Input ones. This because Input components don’t support prepared statements at the moment, while the more generic Row ones actually do. This gives you a bit less flexibility and speed in job design (ie you haven’t a Guess Schema button), but nothing you cannot manage to do with some good habits.
- t***Row elements don’t have a structured output schema, but you can get the raw ResultSet to parse. This is easily done using a very powerful yet not so common-used component tParseRecordSet.
- I’ve used an external (persistent) tMySQLConnection to use caching mechanism at full extent.
For the sake of simplicity, I hardcoded the filtering conditions (the four commenters I’m interested in) inside a tFixedFlowInput instance. Each line stores a name which i put into a column called mittente (italian for sender).
Now, let’s create a tMySqlRow instance with the schema we defined in the repository and sync to propagate changes. As I said, only t***Row components support PreparedStatement at the time. Adjust the query adding your filtering conditions using PreparedStatement placemarks (‘?’) . Here’s my example:
"SELECT `SK_RICETTA`, `CD_CATEGORIA`, `DES_URL`, `DES_MITTENTE`, `DES_TESTO`, `CD_SESSO`, `FLG_DOMANDA` FROM `T_COMMENTI` WHERE DES_MITTENTE = ?"
Unfortunately, t***Row components are very rude and don’t support structured output schemas. You can however propagate the entire ResultSet into a single column:
- Switch the schema to built-in
- Add an Object column to the schema. I called it RESULT_SET
- Check Propagate Query’s record set from Advanced Settings and select your RESULT_SET column. This column will store the entire dataset coming from the DB in a raw form.
- Remove other columns if you like, as they won’t be fed anyway. However, you can leave some columns eventually coming from the incoming connection if you have, as they will be propagated (this is proven to be great for fast-lookup scenarios)
Now it’s time to tell the RDBMS where to find parameters’ values. This is simple done with positional indexing, relating to question mark placeholders positions. Check Use PreparedStatement in Advanced Settings from tMySqlRow and fill the table with position, type and value for each parameter you defined inside the query. The Value column can host a define, for simple query, or a field coming from an incoming connection. In this case, we’ve just one parameter column stored in incoming row3.mittente.
Now, let’s go to tParseRecordSet. You just need to tell the component which column holds the record set and how to store the outgoing values. Be careful to Value column in the mapping table: elements are double quoted, here, as they represents the fields as coming from the database query.
When you execute the subjob, the query is prepared at first. As I said, it’s done only one time and then cached.
Then, for each row in tFixedFlow, a query is triggered by tMySqlRow with the current parameters value configuration. Then, the result is parsed and splitted in Talend columns for further use. As you may see in my simulation above, I trigger four queries (the commenters’ names I’m interested in) which held to a total of 842 rows when parsed.
This technique has proven to be extremely useful when you need to filter a DB query from a set of parameters you’ve at runtime only, as the only other way is to use a time-consuming tFlowToIterate design or a very memory-consuming tMap with inner joins. This is especially true when filtering parameters are correctly indexed, as the query plan solvers of RDBMS are usually very brave in such cases.
Since t***Row is a processing component, you can use it to fast lookup against a table. You’re not limited to send it just the parameters’ fields (ie the sender, in our case) but you can feed it with all your data from an incoming dataset, trig the queries you need and propagate the whole set of incoming data plus the ResultSet to the outgoing connection where tParseRecordSet will split and eventually normalize the rows in the right place. This is particularly useful, for example, when you’ve a relatively small set of elements to look-up but a very large lookup table, since you won’t need to load the entire lookup table into memory, as Talend usually does with a huge amount of memory wasted.
Finally, don’t underestimate the use on even simple one-shot parametrized queries. You’ll find real value building query without string concat, which is really error-prone on complex SQL statements and unsafe in most situations. Regards!