In this step-by-step tutorial I’ll show you how to use Talend Open Studio and the Twitter Components Pack to connect to Twitter, run a simple REST query and build a trivial relevance report on top of it. There are tons of similar Talend tutorials out there, but none of them focuses on my Twitter Components Pack, which lets you run queries and parse the results without writing a single line of custom code. So let’s dive into this 101 crash course on how to download tweets and build a real-world analysis on them.
Register a Twitter Application
The first step is to sign up for the Twitter Developer Network using your Twitter account. This step is mandatory, as you won’t be able to use the API anonymously. Just point your browser to TwitterDev and follow the easy instructions. After you log in, you’ll be presented with a dashboard like the one below. Click on “My Applications” in your profile menu in the top-right corner of the page:
Then, click on Create New Application and fill in the fields with some descriptive content. Leave Callback URL empty, as we’re not going to use direct-to-token authentication (more below). Confirm by clicking the Create your Twitter Application submit button.
Now, we’re going to add some technical flavour to make our Twitter App actually work. First of all, click on the Permissions tab and select Read and Write. This is not strictly required for the scope of this tutorial, but we’re going to set up a more general App, so let’s be more permissive here. Click on Update settings before leaving the tab.
Now, click on the API Keys tab and take note of the fields named API key and API secret under Application settings. Then, click on Create Access Token and take note of Access token and Access token secret as well. These four weird alphanumeric strings are the ones used for the OAuth connection inside Talend, so keep them secret and do not share them with anyone (remember that we’ve granted R/W permissions!). You may need a page refresh or two before you can see all these fields, but in the end the page will look something like this:
If you like, you can click on the Settings tab and add some bells and whistles, like changing the icon or the Organization website. Nothing really mandatory, but it’s nice to set up. That’s it for the Twitter part; now let’s switch to Talend and do the real job.
Create the Talend job for getting the tweets
To perform a query and get the results back into Talend we’re going to use the Twitter Components Pack. That said, to keep this tutorial simple, we’re going to use the Search API to perform a single query, although the pack supports both the REST and the Streaming API. The pack is not installed by default in Talend, so you need to download it and install it manually. Don’t be afraid: it’s no different from installing any custom Talend component, something you’re probably well acquainted with already. Now, restart Talend (or refresh your components palette) and create a new job. Drag one instance of tTwitterOAuth, one of tTwitterInput and one of tTwitterOAuthClose onto the canvas and connect them using an OnSubJobOk connection like this:
Now, select the tTwitterOAuth component and fill in the fields with the strings you got from your Twitter App API keys page. In this context, API and Consumer are synonyms. Under Connection Type leave the default selection, Twitter API. Don’t forget to double quote each string!
Select the tTwitterInput component. You’ll see a lot of possible options on it, but they’re beyond the scope of this tutorial. Have a look at the component documentation for a detailed description of how they work. For now, just click on Edit Schema. Since we’re interested in hashtag relevance analysis, just feed it with one String column named hashtags. Make it 220 characters long, just to be sure.
Now, you need to tell the component which operation to perform to fill the output column. Do it using the Output mapping table in the component’s Basic settings pane. In this case it’s trivial, since we have only one output column. When you have more, however, you’ll be able to assign a parsing operation to each column one by one. Here, just assign the Hashtags list operation to hashtags. Then, set up the relevant keywords. Let’s say we want tweets containing both the words Obama and Michelle (that’s why the AND condition). The parameters pane will look more or less like this:
Again, you have plenty of possibilities in this filtering wizard. Just keep experimenting with it until you find the query condition you really need. Now, switch to the Advanced settings pane and uncheck Prefix entities with reserved character. This strips off characters like # or @. Since we’re going to save only the hashtags, their role is obvious, so we can safely strip off the hash character, get a clean list and save time. Keep playing with these parameters as well, but don’t raise the Limit to value too much. It’s useless, since the Twitter Rate Limit applies anyway. As a general rule, don’t expect to fully understand how the rate limiter actually works. It’s really a mess that I suspect not even Twitter engineers understand completely. Leave the comma as the separator character between hashtags.
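To make the effect of these settings more concrete, here is a minimal Python sketch of what a “hashtags list” operation could produce for a single tweet. It is not the component’s actual code: it only assumes the REST API v1.1 JSON layout, where hashtags live under entities.hashtags and their text field comes without the leading #. The function name hashtags_list, its parameters and the sample tweet are all made up for illustration.

```python
def hashtags_list(tweet, prefix=False, separator=","):
    """Join the hashtags of one tweet into a single string.

    Hypothetical helper: mimics the idea of the "Hashtags list"
    operation, not the component's real implementation.
    """
    entities = tweet.get("entities", {})
    tags = [h["text"] for h in entities.get("hashtags", [])]
    if prefix:
        # Equivalent to leaving the "prefix entities" option checked.
        tags = ["#" + t for t in tags]
    return separator.join(tags)


# Hypothetical, heavily trimmed tweet in the v1.1 JSON layout.
sample_tweet = {
    "text": "Michelle and Barack Obama at the #WhiteHouse #FLOTUS",
    "entities": {"hashtags": [{"text": "WhiteHouse"}, {"text": "FLOTUS"}]},
}

print(hashtags_list(sample_tweet))               # WhiteHouse,FLOTUS
print(hashtags_list(sample_tweet, prefix=True))  # #WhiteHouse,#FLOTUS
```

A tweet without hashtags yields an empty string here, which is exactly the case the job has to filter out later.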
Now, complete the job with the following components:
- The output connection from tTwitterInput is of Structured type. The component supports two kinds of output connectors: one is used to get the raw tweet in its native JSON format, while the Structured type is used to output a vector of fields as you previously mapped them.
- The filter component rejects tweets without hashtags. This is needed because the component returns every tweet, regardless of whether it has hashtags or not. Just check the length of the hashtags column and keep the row if it’s greater than zero.
- The normalize component splits each hashtag list on a separator character. The output of the component is one hashtag per line. If you didn’t change the separator character inside tTwitterInput, you don’t need to change the default parameters here.
- Finally, save the list into a txt file.
After you execute the job, each line of the file represents one occurrence of a hashtag.
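If you want to double-check what the filter and normalize steps do to the data, the same logic can be sketched in a few lines of Python. This is only an illustration of the data flow under the assumptions above (one hashtags column, comma as separator); the function names and the sample rows are invented, and the real work in the job is done by the Talend components.

```python
def filter_rows(rows):
    """Filter step: keep only rows whose hashtags column is non-empty."""
    return [r for r in rows if r.get("hashtags")]


def normalize(rows, separator=","):
    """Normalize step: emit one row per hashtag."""
    out = []
    for r in rows:
        for tag in r["hashtags"].split(separator):
            tag = tag.strip()
            if tag:
                out.append({"hashtags": tag})
    return out


# Hypothetical rows, shaped like the Structured output of the input step.
rows = [
    {"hashtags": "Obama,FLOTUS"},
    {"hashtags": ""},          # tweet without hashtags: rejected
    {"hashtags": "Obama"},
]

lines = [r["hashtags"] for r in normalize(filter_rows(rows))]
print("\n".join(lines))
```

The three sample rows end up as three output lines (Obama, FLOTUS, Obama), one hashtag occurrence per line, which is the layout the text file should have.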
Now, let’s build a nice 10-second relevance report on it. You have endless visualization possibilities. I decided to build a word cloud using a free online app called Wordle. Open the file Talend generated with Notepad and copy its content to the clipboard, then paste it into Wordle, tweak the settings a bit and you’ll come up with a nice word cloud.
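If you’d rather have numbers than a picture, the same file can be turned into a simple frequency table. The sketch below is just one possible approach, not part of the job: it assumes one hashtag per line, folds case on the assumption that #Obama and #obama count as the same tag, and uses a hypothetical file name in the comment at the end.

```python
from collections import Counter


def top_hashtags(lines, n=10):
    """Count hashtag occurrences, ignoring case and blank lines."""
    counts = Counter(
        line.strip().lower() for line in lines if line.strip()
    )
    return counts.most_common(n)


# Hypothetical sample, shaped like the file produced by the job.
sample_lines = ["Obama", "FLOTUS", "obama", "", "WhiteHouse", "Obama"]

for tag, count in top_hashtags(sample_lines, n=3):
    print(f"{tag}\t{count}")

# With the real output (the file name is an assumption):
# with open("hashtags.txt", encoding="utf-8") as f:
#     print(top_hashtags(f))
```

The more often a hashtag appears in the file, the higher it ranks, which is the same signal the word cloud shows through font size.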
This is just the tip of the iceberg. While in Talend, you have endless possibilities of data analysis and visualization. So…keep on experimenting!