Original article posted on SandTable.com

Here we take a look at how to analyze large amounts of unstructured data as in Twitter by using Graph Databases with Neo4j

As part of a larger project we’ve been working on to build a simulation of social influence (more on that in a later post) we’ve been exploring ways of ingesting and analysing Twitter data.

There are a lot of moving parts to this task, of course, but one of the more interesting ones is how best to store and query the data. Given that Twitter is a part of the social graph, it made sense to us to take a look at what graph databases, such as Neo4J, have to offer.

Graph databases are particularly well suited to capturing social data because they store data as nodes and edges. In our Neo4J Twitter database we have four kinds of node: Twitter users, tweets, retweets and mentions.

The coding rubric goes like this:

  • If a user tweets a new tweet, this is stored as a user node, a tweet node, and a ‘tweeting’ relationship between the nodes
  • If someone retweets the original tweet, the retweet and the person tweeting it are added as new nodes, and the retweeting user is connected via the tweet relationship to the retweet, and the retweet is connected via the retweet relationship to the original tweet
  • If a user mentions another user, the mention tweet is connected to both the user that tweets it and the user mentioned

This gives us a very powerful and economical way of harvesting the social graph. This becomes clear when you visualise what’s in the database (which we’ve done below using Gephi):

This graph represents Twitter activity around the phrase ‘Future is Here’ (inspired by the exhibition at the Design Museum supported by the Technology Strategy Board), collected via the Twitter API between 22 October and 13 November 2013.

  • Green nodes (with names) are Twitter Users
  • Red nodes are tweets
  • Purple are retweets
  • Blue are mentions (or replies)

The size of a node is simply a function of how many relationships it has (it’s not an intrinsic property of the node). There are 13,631 nodes here (users and tweets of various kinds) and 25,966 edges.

Looking at the graph as a whole, it’s striking how much Twitter activity around ‘Future is Here’ takes place in isolation. (Of course, followers may be reading ‘Future is Here’ tweets, but this doesn’t leave a trace for us to pick up.) Then, as we zoom in, we start to see some distinctive patterns of connected activity. We’ll take a closer look at just 6, and then talk briefly about how Neo4j helps can help us with finding and quantifying such patterns.

Read the Full Article Here.