Understanding Big Data Persistence
Xinyu Liu discusses Relational Databases, NOSQL and NewSQL, including Neo4j!
Relational databases have been the dominant data persistence technology for many years, and a feature-rich, mature ecosystem has developed around them. Effectively, relational databases are the default option for enterprise applications as well as data warehouse projects. This holds true even today, with the proliferation of NOSQL techniques. That being said, relational databases may become the bottleneck of an enterprise application with constantly growing data sizes and extreme performance requirements. Scalability is sacrificed for strong consistency, transaction support, a normalized data model, ad hoc queries, and table-join capabilities. NOSQL (with exceptions) as well as NewSQL, with automated (or transparent) sharding, re-sharding, and failover, was invented for easy scale-out, but gives up some of these features as the price. The CAP theorem dictates that it is practically impossible to have all of these merits in one distributed data storage system.
At first glance, the CAP theorem seems simple. According to Wikipedia, it is impossible to simultaneously provide all three guarantees of C-Consistency, A-Availability, and P-Partition Tolerance in a clustered system. However, it turns out that the definitions of the three guarantees are intricate and inconsistent across different sources. Cloudera made a good effort at describing the CAP theorem, but the author believes that the best interpretation comes from the book “NoSQL Distilled”.
Consistency means all nodes in a cluster see the same data at the same time. Availability has a particular meaning in the context of CAP: every request received by a non-failing node in the system must result in a response. If the response is undetermined for a running node, that node is not available. A failed, unresponsive node doesn’t imply a lack of CAP availability. Partition tolerance means that the cluster can survive communication breakages that separate the cluster into multiple partitions unable to communicate with each other. At least one partition should remain functioning and accessible to clients. Bear in mind that such a network partition is accidental, destructive, and undesired.
The CAP theorem states that only two of the three guarantees can be achieved at any moment. To give better insight, the author quotes a small section from the book, blended with interpretations from his own perspective.
First consider a CA (consistency and availability) system. A single-server system is a CA system with consistency and availability but not partition tolerance: obviously, a single machine cannot partition. If it’s up, it’s available; when it is down, it is simply a failed node, which doesn’t imply a lack of CAP availability. Relational databases are in general CA systems.
It is theoretically possible to have a CA cluster. However, this would mean that in the event of a network partition, all the nodes in the cluster must go down so that no client can talk to a node. Why? Servers in different partitions can’t communicate with each other, so copies of the same data in different partitions would fall out of sync if the servers kept serving read/write requests. That breaks consistency. If you instead make all nodes read-only, which preserves consistency, you impair availability because the nodes reject all write requests. As a result, we have to bring all the nodes offline. Again, a failed, unresponsive node doesn’t imply a lack of CAP availability. But since the entire cluster goes out of service due to a network partition, the cluster has no partition tolerance, which indeed defeats the purpose of having a cluster.
Evidently, partition tolerance is a must for a clustered system. The choice is narrowed down to CP and AP, meaning that you trade between consistency and availability. An example should make this clear. Martin and Pramod are both trying to book the last hotel room on a system that uses peer-to-peer distribution with two nodes, London for Martin and Mumbai for Pramod, assuming a multiple-data-center topology. (The two nodes could also reside in the same data center, say in New York.) If we want to ensure consistency (CP), when Martin tries to book his room on the London node, that node must communicate with the Mumbai node before confirming the booking. Essentially, both nodes must agree on the serialization of their requests. But should the network link break between the two data centers (or just between the two nodes in the New York data center), then neither system can book any hotel room, sacrificing availability. The status of the room may still be readable as long as the record remains in sync in both partitions – impaired availability. In a master-slave topology, the master node can still book the room, while the slave either is taken offline (weakening partition tolerance) or rejects requests (impairing availability), since its copy of the data is no longer up-to-date until the network connection is restored.
If we choose availability (AP), both partitions would allow their user to book the last hotel room in the peer-to-peer topology. That leads to data inconsistency: copies of the same record under different network partitions fall out of sync. This is described as a write-write conflict, to be resolved once the network connection is restored. A common solution is optimistic locking, in which the first write commits and the second fails, determined by comparing a timestamp value (or version count) on the different copies of the same record. In reality, some hotels always keep a few rooms clear even when they are fully booked, in order to be able to swap a guest out of a room with problems or to accommodate a double-booking. Another example is the Amazon shopping cart. In this case you are always allowed to write to your shopping cart, even if network failures mean you end up with multiple shopping carts under different partitions. The checkout process can merge the multiple shopping carts by putting the union of the items from the carts into a single cart and returning that. In the meantime, a master-slave topology favoring AP remains available with a writable master node and read-only slave nodes. The master node is always up-to-date, while the slaves may return obsolete data, sacrificing consistency. Eventually, of course, the updates will propagate across the network and the data on different nodes will converge, generally referred to as Eventual Consistency.
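The two AP conflict-resolution strategies above can be sketched in a few lines of Python. This is an illustrative model only, not any database's actual API: `try_commit` shows version-count optimistic locking for the hotel-room write-write conflict, and `merge_carts` shows the Amazon-style union merge of divergent shopping carts.

```python
class Record:
    """A replicated record carrying a version count for optimistic locking."""
    def __init__(self, value, version=0):
        self.value = value
        self.version = version

def try_commit(record, new_value, base_version):
    """Commit only if nobody else has written since we read the record."""
    if record.version != base_version:
        return False           # conflict: an earlier write already committed
    record.value = new_value
    record.version += 1
    return True

room = Record("available")

# Martin and Pramod both read version 0 during the partition, then both
# try to commit once the network is restored: first write wins.
assert try_commit(room, "booked-by-martin", base_version=0) is True
assert try_commit(room, "booked-by-pramod", base_version=0) is False

def merge_carts(*carts):
    """Amazon-style merge: the union of items across divergent cart copies."""
    merged = set()
    for cart in carts:
        merged |= set(cart)
    return merged

assert merge_carts(["book"], ["book", "pen"]) == {"book", "pen"}
```

Note the two strategies embody different business decisions: optimistic locking rejects one of the conflicting writes, while the cart merge accepts both and reconciles them later.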
There is no definitive interpretation of the CAP theorem so far; after all, it is still somewhat debatable. According to the description in this article, Apache Cassandra has a peer-to-peer topology and is highly tunable from CP to AP through configurable READ and WRITE consistency levels on each database operation. MongoDB has a master-slave architecture, and both reads and writes go to the master by default, which suggests a CP system. When secondaries (slaves) are allowed to serve reads, it becomes AP, as the secondaries may return out-of-date results. Apache HBase is practically a peer-to-peer architecture with the region servers as peers, considering that the HBase Master does very little concrete work in the cluster. HBase is believed to be a CP system in that each data region is served by a single region server for both reads and writes, which guarantees consistency. On the other hand, it used to bear a weak form of partition tolerance due to the NameNode being a single point of failure (SPOF): a partition losing contact with the NameNode is out of service. HA (high availability) NameNodes, introduced in Hadoop 2.0.0, mitigate the situation by running multiple NameNodes in the same cluster. Availability falls short in that should a region server fail, clients cannot access that region's data until the region comes up on some other server; this failover process is not instantaneous.
A general principle in database scaling is that replication helps read performance, as all the replicas can serve read requests, but helps little with write performance (writes typically go through a single node to avoid write-write conflicts). Sharding/partitioning, on the other hand, improves both read and write scaling, since reads and writes are spread out across the nodes hosting different data shards.
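Why sharding spreads both reads and writes can be seen in a minimal hash-sharding sketch. This is a toy illustration, not how any particular database computes placement: each key hashes to exactly one shard, so that shard serves both the writes and the reads for that key, and load spreads across nodes as keys do.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]
stores = {node: {} for node in NODES}   # each node hosts one shard

def shard_for(key):
    """Deterministically map a key to the node owning its shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def put(key, value):
    stores[shard_for(key)][key] = value         # write goes to one shard only

def get(key):
    return stores[shard_for(key)].get(key)      # read hits the same shard

put("user:1", "Martin")
put("user:2", "Pramod")
assert get("user:1") == "Martin"
assert get("user:2") == "Pramod"
```

A real system would use consistent hashing or range partitioning so that adding a node moves only a fraction of the keys; plain modulo hashing, as here, would reshuffle almost everything on re-sharding.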
As we’ve discussed, relational databases alone are not intended to run in a cluster. A client-side sharding approach to scaling is very involved and complex in terms of programming, configuration, and maintenance. Besides, we lose table joins, referential integrity (foreign keys), and transactions across shards. This motivated people to design brand-new, innovative database systems. NOSQL databases were introduced to tackle the areas not well addressed by relational databases, for instance extreme capacity and performance, unstructured or semi-structured data, continuously evolving schemas, and complex relationship graphs.
There are four types of NOSQL databases, namely column family-oriented databases, key-value databases, document databases, and graph databases, among which column-family databases are best tuned for big data because of their automated and efficient sharding/re-sharding mechanisms.
Column Family-Oriented NOSQL
Apache HBase is a column family-oriented, schema-free database modeled after Google’s BigTable, the database behind the Google web search engine. HBase is in fact the Hadoop database, running on top of HDFS (Hadoop Distributed File System). Schema-free means no database constraints and no foreign key relationships (referential integrity); rows are no longer lined up in a table. Think of a column family as a table in SQL databases: different rows under the same table can have a different number of columns, as well as different column names. Note that schema-free doesn’t imply disorganized data; rather, the data model is elastic to accommodate ongoing changes. Disorganized data only leads to chaos. Multi-row transactions are not supported, but an operation on any single row is atomic. There are two styles of storing data in HBase: tall tables versus wide tables. For instance, when storing customer orders in a tall table, you save each order as a new row; in a wide table, you save each customer as a row, and the orders become columns in that row. Obviously wide tables are highly de-normalized, as opposed to tall tables. Data belonging to the same column family is partitioned into regions and saved to region servers in the HBase cluster. HBase supports indexed search by row key, row-key range queries with filters, and full table scans with filters. Note that filters do not improve query performance, as they are applied after the fact. There is no table-join support in HBase, which implies that joins have to be done on the client side. In wide tables, HBase does not index column names, in contrast to Cassandra, which keeps column names sorted. To enable ad-hoc type queries, secondary indexes have to be created on columns at the client side, either through MapReduce jobs (batch) or through HBase Observer Coprocessors (real-time). The most important thing in designing HBase tables is the row key.
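The tall versus wide designs for customer orders can be sketched with plain Python dicts standing in for HBase rows (row key mapping to a dict of `family:qualifier` columns). The table names, row keys, and column names here are hypothetical, and this is not the HBase API, only the data-modeling idea.

```python
# Tall table: one row per order. Leading the row key with the customer
# id keeps a customer's orders contiguous, so a row-key range scan
# (HBase's efficient access path) retrieves them all.
tall = {
    "cust42#order001": {"info:item": "book", "info:qty": "1"},
    "cust42#order002": {"info:item": "pen",  "info:qty": "3"},
    "cust99#order003": {"info:item": "lamp", "info:qty": "1"},
}

# Wide table: one row per customer; each order becomes a new column,
# so different rows naturally hold different numbers of columns.
wide = {
    "cust42": {"orders:order001": "book x1", "orders:order002": "pen x3"},
    "cust99": {"orders:order003": "lamp x1"},
}

# Simulated row-key range scan on the tall table by key prefix:
cust42_orders = {k: v for k, v in tall.items() if k.startswith("cust42#")}
assert len(cust42_orders) == 2

# The wide table answers the same question with a single-row get:
assert len(wide["cust42"]) == 2
```

The sketch also shows why the row key matters most: in the tall design the query is answered by a key-range scan, in the wide design by a single row lookup, and both hinge entirely on how the key was chosen.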
HBase works well if you prepare the data for the queries you need – a query-driven table design approach – but not for general search. HBase has the best integration with Hadoop, since both run on HDFS and take full advantage of data locality. For example, HBase enables large map-side joins for Hadoop.
The Google web search engine is a use case of column-family databases. Google runs web crawlers to collect all the web pages on the internet and store their data in BigTable. A MapReduce process uses BigTable as the data source to produce the search indexes. When a user submits a search request, a web search application queries the search indexes and retrieves the matching documents from BigTable to display to the user. The open-source counterpart, the Apache Nutch project, has a storage abstraction via Apache Gora for big data stores such as HBase, Cassandra, and Apache Accumulo (a distributed key-value store), for Apache Avro, as well as for various SQL stores.
Apache Cassandra, on the other hand, is modeled after Amazon’s Dynamo. With a decentralized, peer-to-peer topology, Cassandra is highly scalable and available. Internally, Cassandra is implemented as a staged event-driven architecture (SEDA, as in Spring Integration), in which each stage of a unit of work can scale up or down independently. The tunable consistency level on individual queries gives developers a choice between performance and consistency. A high consistency level makes Cassandra behave more like a relational database, while a low consistency level boosts read (or write) performance and throughput, but may leave you some data conflicts to resolve. The latest release of Cassandra provides native support for secondary indexes. The primary index in Cassandra allows fast lookup of rows by their row key. Each node in a Cassandra cluster knows the range of keys it manages, so requested rows can be efficiently located by scanning the row indexes only on the relevant nodes. In HBase, this job is taken care of by ZooKeeper. Cassandra automatically writes a row to the proper node depending on the key, regardless of which node a client connects to. In addition to the query types supported by HBase, Cassandra offers pageable slice queries: given a wide row, users can retrieve a subset of the columns falling within a given column-name range (column names being sorted within a column family). Apart from their many differences, Cassandra and HBase share the same design principle – carefully plan your queries before designing the database.
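The performance-versus-consistency trade behind tunable consistency levels reduces to simple quorum arithmetic. With a replication factor N, read level R, and write level W, a read is guaranteed to observe the latest write whenever R + W > N, because the replica sets touched by the read and the write must then overlap. A minimal sketch (the function name is ours, not a Cassandra API):

```python
def is_strongly_consistent(n, r, w):
    """True when every read must overlap the replicas of the last write."""
    return r + w > n

# Common settings for a replication factor of N = 3:
assert is_strongly_consistent(3, r=2, w=2)       # QUORUM reads and writes
assert is_strongly_consistent(3, r=1, w=3)       # reads of ONE, writes of ALL
assert not is_strongly_consistent(3, r=1, w=1)   # fastest, eventually consistent
```

Choosing R = W = QUORUM behaves "more like a relational database", as the text puts it, while R = W = 1 trades that guarantee for latency and throughput.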
A few new projects have stemmed from Cassandra, for instance Brisk (running Hadoop on the Cassandra File System) and Lucandra (using Cassandra as the underlying data storage for Lucene). With Cages – a distributed synchronization library in Java – multi-row transactions can be simulated in Cassandra.
Amazon SimpleDB, Apache Accumulo, Riak, Redis, and Kyoto Cabinet are typical key-value stores with extremely high performance. The primary data access pattern is by key, though search by indexed metadata is often supported. They do not in general support joins or ad-hoc queries.
Document databases are essentially key-value stores in which the value is an XML or JSON style document. In a regular key-value store, the value is just a piece of binary data, which isn’t queryable. By contrast, the rich, hierarchical documents in a document database are visible to, and queryable by, the database. MongoDB, CouchDB, and RavenDB are popular document databases.
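The difference can be made concrete with a toy illustration (plain Python, not any document database's API): because the store can see inside the document values, it can filter on fields, which an opaque binary blob would never permit.

```python
# A "collection" of documents keyed by id; the values are structured,
# so the store itself can evaluate field-level criteria.
docs = {
    "1": {"name": "Martin", "city": "London"},
    "2": {"name": "Pramod", "city": "Mumbai"},
    "3": {"name": "Sadalage", "city": "Mumbai"},
}

def find(collection, **criteria):
    """Return documents whose fields match every given criterion."""
    return [doc for doc in collection.values()
            if all(doc.get(k) == v for k, v in criteria.items())]

assert find(docs, city="Mumbai") == [
    {"name": "Pramod", "city": "Mumbai"},
    {"name": "Sadalage", "city": "Mumbai"},
]
assert find(docs, name="Martin", city="London") != []
```

A real document database adds secondary indexes over such fields so that these lookups do not scan the whole collection, but the principle is the same: the query language reaches inside the value.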
MongoDB is fully capable of querying and manipulating JSON/BSON (Binary JSON) documents. Its ad-hoc query and indexing features make you feel you are working with a regular SQL database, even without join or transaction support. Replica sets in a MongoDB cluster provide data redundancy and automated failover, and can distribute read load across machines. MongoDB has been designed to scale horizontally with automated sharding and re-sharding, although re-sharding is not yet well optimized. A non-blocking, asynchronous write mode improves write performance if you can afford occasional data loss on less-valued data sets like server logs. MongoDB runs blazingly fast when it works like an in-memory database. At minimum, you need to make sure that all indexes fit in RAM; ideally, both the indexes and the working set fit in RAM. A working set is the subset of the total data that is commonly queried and updated. When you can’t fit the indexes in the physical RAM of one machine, that is the moment you need to shard the database. With optimized configurations, MongoDB can outperform relational databases by orders of magnitude on similar hardware.
Graph databases like Neo4j and InfiniteGraph offer a normalized data model of nodes (vertices), edges (relationships), and properties that makes them better at handling data with complex relationships, for instance user relationships in a social networking application. Query performance in a relational database degrades significantly with an increasing number of joins; a normalized relational schema trades query efficiency for data efficiency. Graph databases, by contrast, are built to serve joins (hops in a graph): each hop executes at a constant cost irrespective of the total size of the graph, so query performance depends only on the number of hops even as the database grows. The relationship-oriented data structure makes Neo4j difficult to shard and scale out; on the other hand, it enables full JTA transaction support. Lucene is the default indexing engine for Neo4j, and in addition Lucene empowers Neo4j with full-text search. Backed by Spring Data, Neo4j resembles Hibernate in various aspects: a POJO data model with a few annotations abstracts the native Neo4j API, and full-text search resembling Hibernate Search takes Lucene query strings. The Open Session In View pattern is also available for on-demand, lazy data access. Apache Giraph (modeled on Google’s Pregel graph processing framework), along with Gremlin (a Groovy-based graph traversal language), enables complex graph analysis on Neo4j-like graph databases.
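The constant-cost-per-hop claim can be sketched with an adjacency-list graph in plain Python (a toy model, not the Neo4j API): following a relationship is a direct lookup on the node's own edge list, so its cost does not depend on how large the rest of the graph is, and a traversal's work grows with the number of hops, not the database size.

```python
# A tiny social graph as adjacency lists: node -> list of friends.
friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice"],
    "dave":  ["bob"],
}

def within_hops(graph, start, hops):
    """Everyone reachable from `start` in at most `hops` relationships."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        # Each step only inspects the current frontier's edge lists --
        # cost per hop is independent of the total graph size.
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen - {start}

assert within_hops(friends, "alice", 1) == {"bob", "carol"}
assert within_hops(friends, "alice", 2) == {"bob", "carol", "dave"}
```

The relational equivalent of each extra hop is another self-join on a relationship table, whose cost grows with the table size; the adjacency-list lookup is the structural reason graph databases win on deeply connected queries.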