Big Data Tradeoffs: What Agencies Need To Know
Peter Mell, senior computer scientist at the National Institute of Standards and Technology (NIST), speaks with AOL Government about the limits of relational databases and the challenges government agencies face in managing big data. In particular, there are tradeoffs to consider with respect to the CAP theorem:
In the [RDBMS] database world, they can give you perfect consistency, but that limits your availability or scalability. It’s interesting, you are actually allowed to relax the consistency just a little bit, not a lot, to achieve greater scalability.
Well, the big data vendors took this to a whole new extreme. They just went to the other side of the Venn diagram and they said we are going to offer amazing availability or scalability, knowing that the data is going to be consistent eventually, usually. That was great for many things.
And then people realized, let’s not be extremist; let’s not be too far one way or the other; maybe we need a balance. So we started developing systems where you have knobs and dials that you can turn, and you can decide how much data consistency do I want, how much availability do I want. You can dynamically tune that, and make that tradeoff per the application, per your need. So some of the big data technologies allow you to explicitly tune it how you want it.
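One common form such a dial takes is quorum replication. Mell does not name a specific mechanism, so the following is only a sketch under that assumption: data is copied to N replicas, a write waits for W acknowledgements, and a read consults R replicas. When R + W > N, every read set overlaps every write set, so a read always reaches at least one replica holding the latest write. The function name and the sample settings are illustrative.

```python
def read_sees_latest_write(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """Return True when every read quorum must overlap every write quorum."""
    for quorum in (write_quorum, read_quorum):
        if not 1 <= quorum <= n_replicas:
            raise ValueError("quorums must be between 1 and the replica count")
    # Overlap is guaranteed exactly when the two quorums cannot both fit
    # into the replica set without sharing a member.
    return read_quorum + write_quorum > n_replicas


# Three replicas, three hypothetical positions of the "dial".
for w, r in [(1, 1), (2, 2), (3, 1)]:
    if read_sees_latest_write(3, w, r):
        label = "reads see the latest write"
    else:
        label = "reads may be stale (eventual consistency)"
    print(f"N=3 W={w} R={r}: {label}")
```

Smaller quorums answer faster and tolerate more unavailable replicas; larger quorums trade some of that availability for consistency, which is the per-application choice described above.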
Some big data technologies will claim to give you perfect data consistency and scalability. And for such solutions you should start asking yourself, are they really resistant to component failure, because that’s the third circle in that CAP theorem.
For most of our high value data, though, we are going to still keep it in relational databases.
But when those don’t effectively work with the structure of the data that we are processing, or we have too much data, or we have a variety of data, then in order to process it we are going to have to make some tradeoffs. We are going to have to be willing to relax some of the data consistency to get that scalability.
and understanding the various big data solutions to maximize unstructured data.
I think people are also understanding that there is a tradeoff between relational databases and big data technology. They are understanding why they would make that tradeoff – that big data technology isn’t just one thing; there are different choices they have to make.
For example, there are different approaches to tackling big data: graph databases, key-value stores, document stores, and column-oriented databases, among others. Executives or program managers need to understand that they are really going to have to do their homework before they move forward in implementing one of these.
…there are graph databases, where you represent your data using dots with lines between them. You label the dots and you label the lines to show relationships. This is very powerful when you want to model relationships among the population. The intelligence community makes great use of graph databases.
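The labeled dots and lines can be pictured with a small in-memory structure. This is a minimal sketch, not how any particular graph database is implemented; the class name, the people, and the relationship labels are all hypothetical.

```python
from collections import defaultdict


class LabeledGraph:
    """Toy graph: labeled nodes ("dots") joined by labeled edges ("lines")."""

    def __init__(self):
        self.node_labels = {}
        # node id -> list of (edge label, target node id)
        self.edges = defaultdict(list)

    def add_node(self, node_id, label):
        self.node_labels[node_id] = label

    def add_edge(self, source, edge_label, target):
        self.edges[source].append((edge_label, target))

    def neighbors(self, source, edge_label):
        """Follow only the lines carrying the given label."""
        return [t for (lbl, t) in self.edges.get(source, []) if lbl == edge_label]


graph = LabeledGraph()
graph.add_node("alice", "Person")   # hypothetical data
graph.add_node("bob", "Person")
graph.add_node("acme", "Company")
graph.add_edge("alice", "KNOWS", "bob")
graph.add_edge("alice", "WORKS_AT", "acme")

print(graph.neighbors("alice", "KNOWS"))
```

Relationship questions such as "who does this person know" become a walk along labeled lines rather than a join across tables.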
There are key-value stores, where you have some data you are going to store and you put a tag on it like a name. So I might store data with the name John Doe and the value might be he wears glasses. There is no schema, no table, no query language. Because it’s so simple it’s really, really fast and it can scale dramatically.
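In code, the whole interface of a key-value store comes down to two operations. The sketch below keeps everything in one Python dictionary; a real store would spread keys across many machines, and the class and method names here are assumptions for illustration only.

```python
class KeyValueStore:
    """Toy key-value store: no schema, no tables, no query language."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is opaque to the store; it is never parsed or indexed.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


store = KeyValueStore()
store.put("John Doe", "wears glasses")
print(store.get("John Doe"))
```

Because the only lookup is by exact key, the store never has to inspect the value, which is part of why this design is fast and easy to scale. The same simplicity means a question like "who wears glasses" has no direct answer without scanning everything.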
A regular database or relational database stores its data sequentially by record. So if we have a record that says, a person, their age, their hair color, their eye color, it stores Peter Mell, 40 years old, brown hair, brown eyes in that order. If you want to then look up Peter Mell, you get my hair color quite readily. But if you want to do statistics on everybody’s hair color, it’s not very efficient because you have all this other data interspersed between where you store the hair color on disk.
The column-oriented database stores all the values of a particular attribute together, so all the hair colors together, and all the eye colors together, so it makes it very efficient for certain kinds of analysis, certain kinds of queries.
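The difference between the two layouts can be sketched with plain Python lists. This only illustrates the idea and says nothing about how a real engine arranges bytes on disk; every record other than the Peter Mell example from the interview is made up.

```python
from collections import Counter

# Row-oriented: each record's fields sit next to each other.
rows = [
    ("Peter Mell", 40, "brown", "brown"),
    ("Person B", 35, "black", "green"),   # hypothetical record
    ("Person C", 52, "brown", "blue"),    # hypothetical record
]

# Column-oriented: each attribute's values sit next to each other.
columns = {
    "name": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "hair": [r[2] for r in rows],
    "eyes": [r[3] for r in rows],
}

# Looking up one person suits the row layout: one record holds everything.
peter = next(r for r in rows if r[0] == "Peter Mell")

# Statistics on one attribute suit the column layout: read a single list
# and never touch names, ages, or eye colors.
hair_counts = Counter(columns["hair"])

print(peter)
print(hair_counts)
```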
I should mention that relational databases in some cases can solve big data problems, when the problem is what’s called embarrassingly parallel. Then you can use a relational database and scale it almost linearly.
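A problem is embarrassingly parallel when it splits into pieces that need no communication with each other while they run. As a rough sketch of that case, the code below treats each shard as its own small relational database (in-memory SQLite stands in for a real server), counts hair colors on every shard independently, and only combines the partial counts at the end. The shard contents, table name, and function names are hypothetical.

```python
import sqlite3
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data, already split across two independent shards.
SHARDS = [
    [("Peter Mell", "brown"), ("Person B", "black")],
    [("Person C", "brown"), ("Person D", "red")],
]


def count_hair_on_shard(people):
    """Run the aggregate on one shard without touching any other shard."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE person (name TEXT, hair TEXT)")
        conn.executemany("INSERT INTO person VALUES (?, ?)", people)
        rows = conn.execute(
            "SELECT hair, COUNT(*) FROM person GROUP BY hair"
        ).fetchall()
    finally:
        conn.close()
    return Counter(dict(rows))


def count_hair(shards):
    """Query every shard in parallel, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(count_hair_on_shard, shards)
        return sum(partials, Counter())


print(count_hair(SHARDS))
```

Since no shard waits on another, adding shards adds capacity in roughly equal measure, which is the near-linear scaling Mell describes. Queries that must relate rows living on different shards do not split this cleanly.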