As Social Data Grows, Researchers Uncover Secrets with Graph Databases
Currently under development at Georgia Tech is a new project: STINGER (Spatio-Temporal Interaction Networks and Graphs Extensible Representation).
It is a graph-processing engine that project lead David Bader says is bigger, faster and more flexible than anything currently in use for analyzing social media connections. You provide a shared-memory computing system, and it provides an open-source tool that can help detect relationships between billions of people, places and things as those relationships change over time — even in real time.
Someone using Facebook data, for example, might write an algorithm in which people or pages are the vertices and actions (likes, shares, wall posts, etc.) are the graph’s edges. One relatively easy application, Bader explained, would be to analyze how activity around particular people is increasing, decreasing or changing, which can indicate shifts in their importance or the growth of new communities.
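To make that model concrete, here is a minimal Python sketch that treats people and pages as vertices and timestamped actions as edges, then compares how much activity touches each vertex in two adjacent time windows. It is only an illustration, not STINGER code: the function names, the (source, target, timestamp) tuple layout and the sample data are all assumptions invented for this example.

```python
from collections import Counter


def activity_by_vertex(edges, start, end):
    """Count edges touching each vertex with a timestamp in [start, end)."""
    counts = Counter()
    for src, dst, ts in edges:
        if start <= ts < end:
            counts[src] += 1
            counts[dst] += 1
    return counts


def activity_change(edges, split, horizon):
    """Activity in [split, split + horizon) minus activity in [split - horizon, split)."""
    before = activity_by_vertex(edges, split - horizon, split)
    after = activity_by_vertex(edges, split, split + horizon)
    # Counter returns 0 for vertices missing from one of the windows.
    return {v: after[v] - before[v] for v in set(before) | set(after)}


# Hypothetical sample data: (source, target, timestamp) tuples.
edges = [
    ("alice", "page_x", 1),
    ("bob", "page_x", 2),
    ("alice", "bob", 11),
    ("carol", "bob", 12),
    ("dave", "bob", 13),
]

change = activity_change(edges, split=10, horizon=10)
for vertex, delta in sorted(change.items()):
    print(vertex, delta)
```

A positive number means more actions touched that vertex in the later window, and a negative number means fewer. In this toy data, activity around "bob" rises while activity around "page_x" falls.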
Writing an algorithm to perform that kind of analysis isn’t really the problem, though. The hard part is writing one that can scale into the billions of vertices and edges and still perform quickly enough to be useful. An algorithm that generates one false positive in a million isn’t so bad when you’re dealing with tens of thousands of items, Bader explained, but it gets to be a big problem when you’re running it against billions of items.
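The arithmetic behind that point is simple to check. The sketch below assumes the one-in-a-million rate from Bader's example and two illustrative data set sizes (50,000 and 5 billion items) that were picked for this example and do not come from his lab.

```python
def expected_false_positives(items, one_in=1_000_000):
    """Expected number of false positives at a rate of one per `one_in` items."""
    return items / one_in


# Illustrative sizes only: tens of thousands of items versus billions.
small = expected_false_positives(50_000)
large = expected_false_positives(5_000_000_000)

print(small)
print(large)
```

At the smaller size the expected count is well below one, so most runs would produce no false positives at all. At the larger size the same error rate yields thousands of them.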
There are plenty of graph databases available, including popular offerings such as Neo4j and InfiniteGraph, but Bader said, “Our lab focuses on algorithms that run fast on massive data sets and that are more accurate than what is traditionally done in social media.”
Bader’s team recently presented a paper detailing a social media algorithm running atop STINGER that ran 100 times faster than some previous approaches because the system stores the graph’s previous state and only performs the minimal amount of processing necessary as new edges are inserted. This is in contrast to traditional approaches that re-process the entire graph every time there’s a change.
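The difference between the two approaches can be sketched in a few lines of Python. Triangle counting is used here purely as a stand-in metric, since the article does not say which algorithm the paper covered, and the class and function names are hypothetical rather than part of STINGER. The incremental version keeps the previous count and, when an edge arrives, only inspects the neighbors of its two endpoints. The from-scratch version walks the whole graph every time.

```python
from itertools import combinations


class IncrementalTriangleCounter:
    """Keeps a running triangle count for an undirected graph as edges arrive."""

    def __init__(self):
        self.adj = {}        # vertex -> set of neighboring vertices
        self.triangles = 0   # state carried over from earlier insertions

    def insert_edge(self, u, v):
        if u == v:
            return  # ignore self-loops
        nu = self.adj.setdefault(u, set())
        nv = self.adj.setdefault(v, set())
        if v in nu:
            return  # edge already present, nothing changes
        # Each neighbor shared by u and v closes exactly one new triangle.
        self.triangles += len(nu & nv)
        nu.add(v)
        nv.add(u)


def count_triangles_from_scratch(adj):
    """Recount every triangle by scanning the entire graph."""
    total = 0
    for u, neighbors in adj.items():
        for v, w in combinations(neighbors, 2):
            if w in adj[v]:
                total += 1
    return total // 3  # each triangle is seen once from each of its 3 vertices


counter = IncrementalTriangleCounter()
for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "d")]:
    counter.insert_edge(u, v)

print(counter.triangles)
print(count_triangles_from_scratch(counter.adj))
```

Both approaches agree on the final count, but the incremental update touches only the two endpoints' neighbor sets per insertion, which is the kind of saving the article attributes to storing the graph's previous state.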
That being said, Georgia Tech isn’t entirely alone in analyzing massive amounts of social data with graph databases. Google’s Pregel had already scaled to billions of vertices and edges as of 2009, and Facebook is currently analyzing more than a billion edges using Apache Giraph (an open source, Hadoop-based Pregel implementation). But those cases, at two companies loaded with smart engineers, data scientists and powerful infrastructure, just underscore the importance of what researchers like Bader are building and releasing as open source.