
Monday, March 17, 2014

Back to the Future for Data Storage

Building a massive, distributed datastore that can service requests at extremely high throughput is something that we've focused on at Google. We created something called Bigtable that underlies the datastore in App Engine. The design for Bigtable focused on scalability across a distributed system, so it may operate a bit differently than databases you've worked with before; for example, it does not support joins. This isn't an accident: when you build a system that can scale to the size that Bigtable can, there's no way to do a general-purpose join on data sets that size and still have it be performant.

Google isn't alone in offering a non-relational datastore to enable scaling. For example, Amazon has SimpleDB:
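To make the "no joins" point concrete, here is a minimal sketch of how an application might work around it. This is not the Bigtable or App Engine API; the `authors` and `posts` dictionaries and both function names are hypothetical in-memory stand-ins for two keyed tables. It shows two common workarounds: doing the join in application code with keyed lookups, and denormalizing so that a later read needs only one lookup.

```python
# Hypothetical in-memory stand-ins for two keyed "tables".
# A real datastore would be remote; only get-by-key is assumed here.
authors = {
    "a1": {"name": "Ada"},
    "a2": {"name": "Grace"},
}
posts = {
    "p1": {"title": "On Engines", "author_id": "a1"},
    "p2": {"title": "On Compilers", "author_id": "a2"},
    "p3": {"title": "More Engines", "author_id": "a1"},
}


def posts_with_author_names(post_keys):
    """Application-side join: one keyed lookup per record, no server-side join."""
    rows = []
    for key in post_keys:
        post = posts[key]
        author = authors[post["author_id"]]
        rows.append((post["title"], author["name"]))
    return rows


def denormalize(post_key):
    """Copy the author's name into the post so later reads need a single lookup."""
    post = posts[post_key]
    post["author_name"] = authors[post["author_id"]]["name"]
    return post
```

The trade-off is that denormalized copies have to be kept in sync by the application when the author record changes, which is work a relational database would otherwise do for you at query time.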

A traditional, clustered relational database requires a sizable upfront capital outlay, is complex to design, and often requires a DBA to maintain and administer. Amazon SimpleDB is dramatically simpler, requiring no schema, automatically indexing your data and providing a simple API for storage and access.

There is also a range of non-relational open source datastores now available, such as CouchDB and Hypertable. Those are just two examples; there are many more.

While you might think this is all new, it's actually a bit of a return to the past. You see, there was a time when "RDBMS" wasn't always the answer regardless of what the question was. When Codd published his paper, "A Relational Model of Data for Large Shared Data Banks," in 1970, there were many different approaches to datastores. It was only in the 1980s that relational databases won the majority of the mindshare. Having settled on a single metaphor, the industry has developed many tools and techniques to make developing on a relational database easier.

Unfortunately, that majority mindshare is also a problem: while RDBMSs are useful in many situations, they are not useful in all situations. Their dominance means that useful alternatives aren't used, and huge amounts of time and money can be wasted trying to force non-relational problems into a relational model.

We are in the middle of a renaissance in data storage with the application of many new ideas and techniques; there's huge potential for breaking out of thinking about data storage in just one way. Michael Stonebraker pointed out in his paper, "One Size Fits All": An Idea Whose Time Has Come and Gone, that there are common datastore use cases, such as data warehousing and stream processing, that are not well served by a general-purpose RDBMS, and that abandoning the general-purpose RDBMS for those cases can give you a performance increase of one or two orders of magnitude.

It's an exciting time, and the takeaway here isn't to abandon the relational database, which is a very mature technology that works great in its domain, but instead to be willing to look outside the RDBMS box when looking for storage solutions.


