Build GML graphs for Social Network Analysis in Talend

I’m attending a Coursera-powered class in Social Network Analysis from the University of Michigan. While the course is really stunning, I had to realize that it’s kind of difficult to obtain enough example datasets to study, especially if you don’t have a solid programming background. Since the course is targeted not at programmers but at data analysts, it sounds really weird to me that my classmates have to wait for someone to extract, clean up and eventually share these datasets. However, network data is usually well structured and low-dimensional, so I think that data integration and manipulation software could be an easier way to prepare these datasets. I chose to do this using Talend Open Studio, as it’s probably the best free data integration platform available nowadays. It doesn’t ship with a component to build datasets in a format readable by, for example, Gephi or Pajek, but Talend is Eclipse-based, so it was easy for me to build a custom component to write .GML files.

In this tutorial I will guide you through using this component to write graph files, then I’ll explain a 120-second method to build an endless set of valid example network datasets, useful for learning, testing, simulations and so on.

Here’s the screenshot of the Talend job that does the magic. Please remember that you need to download my component and copy the unpacked folder into the Talend Open Studio plugins folder before starting!

How it works

A typical Graph Modeling Language file is generally made of two parts:

  • A node list, describing every node of the network. Each node must have at least an ID and a label, and can have zero or more node attributes (usually literals or numbers). If an attribute is missing, it must be included in the graph anyway with a NULL value. This structure maps perfectly onto a tabular data frame, and we propagate it in the first branch (the one labeled row3/out1)
  • An edge list, describing relations between nodes. While this is not technically mandatory, a list of unconnected nodes is quite useless for SNA studies, so it makes sense to always provide it as a reference (or lookup) connection. In the screenshot, it’s the branch named row4/row5. The edge definition vector must have at least two columns describing the source and target nodes, but can have weight coefficients and labels, too.
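
To make the two parts more concrete, here is a minimal sketch in plain Java of how a node list and an edge list could be rendered as GML text. It is not the code of my component: the class GmlSketch, its helper methods and the "weight" key are just names I’m assuming for illustration, and NULL handling is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GmlSketch {

    // Builds an ordered attribute map from alternating key/value arguments.
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> attributes = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            attributes.put((String) keyValues[i], keyValues[i + 1]);
        }
        return attributes;
    }

    // Numbers are written bare, everything else as a quoted literal.
    static String attribute(String key, Object value) {
        if (value instanceof Number) {
            return key + " " + value;
        }
        return key + " \"" + String.valueOf(value).replace("\"", "'") + "\"";
    }

    // Renders one "node [ ... ]" or "edge [ ... ]" block.
    static String block(String name, Map<String, Object> attributes) {
        StringBuilder sb = new StringBuilder("  " + name + " [\n");
        for (Map.Entry<String, Object> entry : attributes.entrySet()) {
            sb.append("    ").append(attribute(entry.getKey(), entry.getValue())).append("\n");
        }
        return sb.append("  ]\n").toString();
    }

    // Writes the graph header, then every node block, then every edge block.
    static String toGml(boolean directed, List<Map<String, Object>> nodes,
                        List<Map<String, Object>> edges) {
        StringBuilder sb = new StringBuilder("graph [\n");
        sb.append("  directed ").append(directed ? 1 : 0).append("\n");
        for (Map<String, Object> node : nodes) {
            sb.append(block("node", node));
        }
        for (Map<String, Object> edge : edges) {
            sb.append(block("edge", edge));
        }
        return sb.append("]\n").toString();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> nodes = Arrays.asList(
                row("id", 1, "label", "Ann", "city", "Boston", "age", 34),
                row("id", 2, "label", "Bob", "city", "Denver", "age", 41));
        List<Map<String, Object>> edges = Arrays.asList(
                row("source", 1, "target", 2, "weight", 3));
        System.out.print(toGml(false, nodes, edges));
    }
}
```

Running main prints a small undirected graph with two node blocks followed by one edge block, which is the same nodes-then-edges layout described above.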

(Please remember that Talend subjobs can have as many lookup connections as you desire, but just one main connection that sets the pace. I decided to design the component using the nodes data flow as the main one.)

It’s easy to see that the tFileOutputGML instance is the heart of the job, right around the middle of the canvas.

First of all, let’s have a look at the tFileOutputGML parameters, since this is the main purpose of this tutorial:

The first parameter lets you set the node attributes that will be included in the GML file. Even complex Java types can be used as node attribute columns; however, they will be flattened to a primitive type (mostly literals or numbers), with perhaps unexpected results. So it’s better to convert them to a primitive type that fits your needs before they reach the tFileOutputGML block. You can also define the graph direction, node properties (where to look for IDs and labels) and edge properties. Here you can activate or deactivate them and define the location of the 2 to 4 columns that describe them.
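
As an illustration of converting complex types yourself, here is a small sketch in plain Java of the kind of helper you could call from a tMap expression before the GML block. The class FlattenSketch and its method names are hypothetical, and the date pattern and UTC time zone are just assumptions; pick whatever representation suits your analysis.

```java
import java.math.BigDecimal;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

class FlattenSketch {

    // Date -> "yyyy-MM-dd" literal in UTC, so the written value is predictable.
    static String dateToLiteral(Date value) {
        if (value == null) {
            return null;
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(value);
    }

    // BigDecimal -> Double, a plain numeric attribute.
    static Double decimalToNumber(BigDecimal value) {
        return value == null ? null : value.doubleValue();
    }

    // List of strings -> one comma-separated literal.
    static String listToLiteral(List<String> values) {
        return values == null ? null : String.join(",", values);
    }

    public static void main(String[] args) {
        System.out.println(dateToLiteral(new Date(0L)));
        System.out.println(decimalToNumber(new BigDecimal("12.50")));
    }
}
```

Doing the conversion explicitly means you decide how a date or a list ends up in the file, instead of relying on the default flattening.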

You can also set some file and graph properties in the Advanced parameters, but they are fully optional and the default values are fine in most cases.

That’s all you need to save a GML file for your social network analysis. But if you need to build an example network for testing and educational purposes, keep reading.

Build example networks for Social Network Analysis

In Talend we have a very nice set of functions to generate example datasets in about 20 seconds. They are very basic, but enough for our needs. We will use them to generate a fictional network of friends. The screenshot below shows the complete job.

Generating nodes

First of all we’ll generate a set of nodes representing persons; it’s the upper branch of our job, which starts with the tRowGenerator_1 block. Let’s generate a dataset of, say, 100 nodes with the following three columns (the node ID is added in a later step):

  • label, using the getFirstName() function (it returns a random American English first name, no parameters needed). It will become the node label
  • city, using the getUsCity() function (it returns a random town in the United States, no parameters needed). It will become the literal node attribute “city”
  • age, using the random number generation function with two parameters (minimum value, maximum value). It will become the numeric node attribute “age”

The following screenshot should give you a clear idea of what we’re doing.

As you can see, I didn’t define the node ID here. For the sake of clarity, I usually prefer to generate numeric sequences in a separate tMap component. And that’s what I did when I defined the ID column using the following snippet of code.


Numeric.sequence("node_id", 1, 1)
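
If you want to see the whole node branch outside Talend, here is a minimal sketch in plain Java that mimics it under some assumptions: the sample names and cities are hypothetical stand-ins for Talend’s generators, and the class NodeGeneratorSketch is my own illustration, not generated job code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class NodeGeneratorSketch {

    // Hypothetical sample values standing in for getFirstName() and getUsCity().
    static final String[] FIRST_NAMES = {"Ann", "Bob", "Carol", "Dave", "Eve"};
    static final String[] CITIES = {"Boston", "Denver", "Austin", "Seattle"};

    static class Node {
        final int id;
        final String label;
        final String city;
        final int age;

        Node(int id, String label, String city, int age) {
            this.id = id;
            this.label = label;
            this.city = city;
            this.age = age;
        }
    }

    // Generates `count` nodes with ages between minAge and maxAge (both inclusive).
    static List<Node> generate(int count, int minAge, int maxAge, long seed) {
        Random random = new Random(seed);
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int id = i + 1; // same effect as a sequence starting at 1 with step 1
            String label = FIRST_NAMES[random.nextInt(FIRST_NAMES.length)];
            String city = CITIES[random.nextInt(CITIES.length)];
            int age = minAge + random.nextInt(maxAge - minAge + 1);
            nodes.add(new Node(id, label, city, age));
        }
        return nodes;
    }

    public static void main(String[] args) {
        for (Node node : generate(5, 18, 80, 42L)) {
            System.out.println(node.id + " " + node.label + " " + node.city + " " + node.age);
        }
    }
}
```

The fixed seed is only there to make runs repeatable; drop it if you want a different network every time.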

You may add other attributes if you need them, for example US state ID, sex (using a random integer that can only be zero or one), phone numbers and so on. This is, however, beyond the scope of this tutorial, so let’s stay focused and link this to a tFileOutputGML block, setting basic and advanced parameters as I explained earlier. Don’t forget to fill in the output schema definition, as it represents the node structure you’ll see in your GML file. This means that if you don’t add a column to the output schema, the corresponding attribute will not be saved in the output file. The following screenshot should give you a better idea of what I’m talking about:

That’s all about node definitions. Let’s fill in the edge definitions now.

Generating edges

For this part, we’ll use another tRowGenerator instance, with just two integer columns, source and target, both filled with random integers ranging from 1 to 100. They will simulate the connections between our nodes. To obtain a meaningful network for SNA, I suggest generating at least two or three times as many edges as nodes. So, since we set the number of nodes to 100, 300 is a nice number to set here. A lower number will give you a smaller giant component and more dangling nodes around the graph, while a higher number will give you a much more densely connected graph. Just play a bit with these values to get the result that best fits your needs.
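
Here is what this step boils down to, as a minimal sketch in plain Java. The class EdgeGeneratorSketch and its parameters are hypothetical names for illustration; in the job itself the tRowGenerator does this work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class EdgeGeneratorSketch {

    // Returns edgeCount rows of {source, target}, each drawn uniformly from 1..nodeCount.
    static List<int[]> generate(int edgeCount, int nodeCount, long seed) {
        Random random = new Random(seed);
        List<int[]> edges = new ArrayList<>();
        for (int i = 0; i < edgeCount; i++) {
            int source = 1 + random.nextInt(nodeCount);
            int target = 1 + random.nextInt(nodeCount);
            edges.add(new int[] {source, target});
        }
        return edges;
    }

    public static void main(String[] args) {
        // 300 raw edges for 100 nodes, i.e. three times the number of nodes.
        List<int[]> edges = generate(300, 100, 7L);
        System.out.println(edges.size() + " raw edges generated");
    }
}
```

Note that the raw rows can still contain self-loops and repeated pairs, which is why the next steps are needed.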

Since we’re describing friendship relations, we don’t want self-referring relations (nobody can be their own friend!). So we need to filter out these unwanted rows using a filter and the following simple condition:


input_row.source != input_row.target
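
One Java detail is worth a sketch here. The condition above is fine when both columns are primitive int values. If your schema types them as nullable Integer objects instead, != compares object references rather than values, so a null-safe equals() check is the more defensive choice. The class SelfLoopFilterSketch below is a hypothetical illustration of that variant, not code produced by Talend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SelfLoopFilterSketch {

    // Keeps a row only when both ends are present and differ by value.
    static boolean keep(Integer source, Integer target) {
        return source != null && target != null && !source.equals(target);
    }

    // Applies the condition to a list of {source, target} rows.
    static List<Integer[]> filter(List<Integer[]> edges) {
        List<Integer[]> kept = new ArrayList<>();
        for (Integer[] edge : edges) {
            if (keep(edge[0], edge[1])) {
                kept.add(edge);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer[]> edges = Arrays.asList(
                new Integer[] {1, 2}, new Integer[] {3, 3}, new Integer[] {2, 1});
        System.out.println(filter(edges).size() + " edges kept");
    }
}
```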

Finally, we need to remove duplicate source/target pairs using an aggregation block. In this step, if you need to, you can define a weight column fed by a count of the grouped rows. It will be useful if you want to run some tests on weighted networks. The edges data flow is now ready and you can link it to our file writer block using a lookup connection.
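
The aggregation can be pictured with this minimal sketch in plain Java: group the rows by source/target pair and count how many times each pair occurs. The class EdgeAggregationSketch is a hypothetical illustration; it treats (1, 2) and (2, 1) as different pairs, which is an assumption you may want to change for undirected graphs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EdgeAggregationSketch {

    // Groups rows by (source, target) and counts occurrences as the weight.
    static Map<List<Integer>, Integer> aggregate(List<int[]> edges) {
        Map<List<Integer>, Integer> weights = new LinkedHashMap<>();
        for (int[] edge : edges) {
            weights.merge(Arrays.asList(edge[0], edge[1]), 1, Integer::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<int[]> edges = new ArrayList<>();
        edges.add(new int[] {1, 2});
        edges.add(new int[] {1, 2});
        edges.add(new int[] {2, 3});
        for (Map.Entry<List<Integer>, Integer> entry : aggregate(edges).entrySet()) {
            List<Integer> pair = entry.getKey();
            System.out.println(pair.get(0) + " -> " + pair.get(1) + " weight " + entry.getValue());
        }
    }
}
```

Each key of the resulting map is one unique edge, and its value is the weight you would feed to the weight column.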

That’s almost it; I just added a print-to-console block (tLogRow) after the output stage to demonstrate that a file writer doesn’t have to be the final block of the job. Just remember that only the main connection (node definitions) is propagated.

You can finally run the job and explore the generated network. You can of course extend the dimensionality of your network by adding more node attributes. You can easily extend Talend’s data generation functions to do so, as they are simple Java static functions; you can find them in the project repository, under system routines.
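
As an example of such an extension, here is a sketch of what a couple of extra generator functions could look like. The class CustomGeneratorSketch and both method names are hypothetical; I’m assuming you would put something similar into a routine of your own rather than editing the system ones.

```java
import java.util.Random;

class CustomGeneratorSketch {

    private static final Random RANDOM = new Random();

    // Returns 0 or 1, e.g. for a binary "sex" node attribute.
    public static int getBinaryFlag() {
        return RANDOM.nextInt(2);
    }

    // Returns one of the supplied values at random, e.g. a US state code.
    public static String pickOne(String... values) {
        return values[RANDOM.nextInt(values.length)];
    }

    public static void main(String[] args) {
        System.out.println(getBinaryFlag());
        System.out.println(pickOne("MI", "CA", "NY"));
    }
}
```

Once a function like this is available, you would call it from a generator column the same way you call the built-in ones.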

While this method is probably not suitable for more sophisticated analysis (i.e., social networks rarely have a fully random distribution of connections), I think it could still be useful. I hope this tutorial will help you get the most from Talend from this point of view. If you need any kind of support or help, or want to share opinions, critiques or considerations, feel free to use the comment form or to contact me directly. See you next time and have a nice day!

  • http://www.linkedin.com/pub/robert-quinn/0/38/372 Robert Quinn

    Very cool. A lot of people are using neo4j, and I was looking for a Talend component to read legacy data and write it directly to neo4j; this looks close.

    Any interest in converting this to write to neo4j?

    I did search for neo4j GML importer but haven’t come across it yet.

    Thanks

    • http://gabrielebaldassarre.com Gabriele Baldassarre

      Good morning Robert,

      I have never used neo4j, but it sounds nice to add a Talend component for that purpose. I’m not sure I’ll be able to do so in the near future, but I’ll at least start reading up on it a bit.

      tnx for your comment!

    • http://twitter.com/vstoyak vstoyak (@vstoyak)

      Sorry, see my comment below RE: Importing GML also you can use Talend tREST component to transform data into neo4j using Talend DI.

  • http://twitter.com/vstoyak vstoyak (@vstoyak)

    You can use Gremlin GraphMLReader

    g.loadGraphML(‘file:/jenkins/jenkins/workspace/neo4j-master/gremlin-plugin/target/test-classes/graphml