Matching and De-duplication in a Graph Database
Philip Howard, Research Director – Data Management, Bloor Research, explores matching and de-duplication in a graph database, a topic discussed at The Data Warehousing Institute (TDWI) London Symposium.
If Bill Clinton and William Clinton (this was the example posed during the session) have the same relationships, they must surely be the same person, though given the nature of some of the ex-president’s relationships it would perhaps be better to talk about following the edges of the graph rather than the relationships they represent. In fact, if you are using a graph database to look at terrorist or criminal networks, this is precisely one of the things you would be doing, because you want to understand which aliases equate to which real individuals.
First of all, I should say that I am not aware of any graph vendor packaging up any special facilities to support matching and de-duplication, but I imagine that there are things they could do to make this process easier. However, the concept is quite cool. It would mean that you don’t need to license such capabilities from the likes of Trillium or Informatica. Of course, there are other data cleansing requirements beyond matching, but matching does tend to be the bedrock for all such environments. So could a graph database be a real competitor?
Of course the big advantage is that there is no additional license fee.
What I don’t really know is how performance would compare. Vendors in the data quality field are apt to claim that their matching engine can outperform anybody else’s: something that is inherently impossible to prove one way or the other, because you can’t compare match accuracy across platforms.
Nevertheless, my guess is that a graph database could seriously outperform a conventional matching engine. That’s because graph databases have been explicitly designed to explore relationships, and that’s precisely what you do when matching: you have two similar but non-identical names, and they each have relationships with an address, a mother’s name, a phone number, an email address and so on. Instead of searching through table after table, you just follow the edges of the graph.
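To make the idea concrete, here is a minimal sketch in Python of what matching by following edges might look like. It is not any vendor's implementation: the graph is a plain in-memory dictionary rather than a real graph database, the attribute values are invented placeholders, and the function names, the thresholds and the choice of Jaccard overlap as the measure of shared relationships are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy graph: each person node maps to the set of attribute
# nodes it has an edge to. All values are invented placeholders.
edges = {
    "Bill Clinton":    {"address:A1", "mother:M1", "phone:P1", "email:E1"},
    "William Clinton": {"address:A1", "mother:M1", "phone:P1", "email:E2"},
    "Bill Clanton":    {"address:A9", "mother:M9", "phone:P9", "email:E9"},
}


def name_similarity(a: str, b: str) -> float:
    """Rough string similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def neighbour_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the attribute nodes that two person nodes point to."""
    na, nb = edges[a], edges[b]
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def candidate_matches(name_threshold: float = 0.5,
                      overlap_threshold: float = 0.5) -> list[tuple[str, str]]:
    """Return pairs whose names are similar AND who share most of their edges."""
    matches = []
    for a, b in combinations(sorted(edges), 2):
        if (name_similarity(a, b) >= name_threshold
                and neighbour_overlap(a, b) >= overlap_threshold):
            matches.append((a, b))
    return matches


if __name__ == "__main__":
    print(candidate_matches())
```

With this toy data, only the pair Bill Clinton and William Clinton is returned, because they share three of their five distinct attribute nodes. Bill Clanton has a very similar name but no edges in common, so he is rejected. The sketch compares every pair of nodes, which would not scale. The hope with a real graph database is that it would start from a shared attribute node and walk outwards instead.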