mindstorms: Quick Reference to Alternative data storages

Make sure you check out myNoSQL, a NoSQL blog featuring the best daily NoSQL news, articles, and links, covering all major NoSQL projects and closely following everything related to the NoSQL ecosystem. Everything you need and want to know about NoSQL.

Collaborative effort: please help me fill in the gaps in the tables below by providing missing data, references to interesting articles, metrics, etc. Feel free to suggest new criteria to include.

This is a work in progress.

While it is probably not exhaustive, my intention is to provide a quick reference to BASE systems (Basically Available, Soft state, Eventually consistent, as opposed to ACID: Atomicity, Consistency, Isolation, Durability) that offers newcomers an overview of the existing projects in the field.

So far, I have been collecting information about the following characteristics:

Data model
Partitioning
Persistence
Rebalancing (elasticity)
Replication (clustering)

I have also included notes about the implementation language and the protocols that can be used with each solution.

If you think I should include other criteria please do let me know.

The projects included in the list so far: Cassandra, CloudBase, CouchDB, Dynomite, HBase, Hypertable, Kai, LightCloud, LucidDB, Memcached, MemcacheDB, MonetDB, MongoDB, Neptune, Redis, Ringo, Scalaris, ThruDB, Tokyo Cabinet + Tyrant, Voldemort.

Alternative Data Storages

Project	Data model	Partitioning	Persistence	Rebalancing	Replication
Cassandra	Column-family (BigTable^[5], Dynamo^[6])	Y^[n4]	disk	Y	Y
CloudBase	HDFS/Hadoop^[n3]	Y	disk	Y	Y
CouchDB	Doc-oriented	?^[n2]	disk	?^[n2]	?^[n2]
Dynomite	Blob (Dynamo^[6])	Y	pluggable	Y	Y
HBase	Column-family (BigTable^[5])	Y	disk	Y	Y
Hypertable	Column-family (BigTable^[5])	Y	DFS (HDFS)	?	Y
Kai	Blob	?	disk	?	?
LightCloud	check Tokyo Tyrant^[n5]
LucidDB	Column-based	?	disk	?	N
Memcached ^[n1]	Blob	Y	RAM	Y	N
MemcacheDB	Blob	?	BerkeleyDB	?	Y
MonetDB
MongoDB	Doc-oriented	Y			Y
Neptune
Redis
Ringo	Blob	Y	disk	Y	Y
Scalaris	Blob	Y	RAM		Y
ThruDB	Doc-oriented
Tokyo Cabinet + Tyrant
Voldemort	Structured / Blob / Text	Y	pluggable	N	Y
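To make the table easier to query, here is a minimal Python sketch that encodes a handful of its rows and filters them by criteria. The `Store` class and the `matching` helper are names made up for this illustration; unknown cells ("?" or blank) are represented as `None`, which is an assumption about how to treat missing data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Store:
    name: str
    data_model: str
    partitioning: Optional[bool]   # None means unknown ("?" or blank in the table)
    persistence: Optional[str]
    rebalancing: Optional[bool]
    replication: Optional[bool]


# A few rows copied from the table above (not the full list).
STORES = [
    Store("Cassandra", "Column-family", True, "disk", True, True),
    Store("Memcached", "Blob", True, "RAM", True, False),
    Store("Voldemort", "Structured / Blob / Text", True, "pluggable", False, True),
    Store("LucidDB", "Column-based", None, "disk", None, False),
    Store("Ringo", "Blob", True, "disk", True, True),
]


def matching(stores, **criteria):
    """Return the names of stores whose fields equal every given criterion."""
    return [
        s.name
        for s in stores
        if all(getattr(s, field) == wanted for field, wanted in criteria.items())
    ]


if __name__ == "__main__":
    # Partitioned, replicated, disk-backed projects among the sample rows.
    print(matching(STORES, partitioning=True, replication=True, persistence="disk"))
```

Because unknown values are `None`, a project with a "?" in a column never matches a filter on that column, which errs on the side of excluding projects rather than guessing.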

Notes

[n1] Memcached: a distributed memory object caching system
[n2] CouchDB partitioning and replication: according to a 2009 Summer of Code proposal, "While distributed deployments have been achieved with the help of proxies and smart external scripting, the core of CouchDB itself does not currently support distributing the database across multiple machines." More references about CouchDB clustering:
- CouchDB Cluster
- couchdb-lounge: Clustering framework for CouchDB
[n3] All other criteria for CloudBase have been deduced based on the HDFS/Hadoop capabilities
[n4] Cassandra: Consistent hashing vs order-preserving partitioning in distributed databases
[n5] LightCloud seems to be a set of management scripts (Python) for Tokyo Tyrant
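
Note [n4] contrasts two partitioning strategies. As a rough illustration only (this is not code from Cassandra or any other project in the list), the Python sketch below shows a consistent-hash ring next to a simple order-preserving range partitioner. The class and function names, the use of MD5, and the number of virtual nodes are all assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: nodes and keys are hashed onto the same ring."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes      # virtual points per node, smooths the distribution
        self._ring = []           # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(text):
        digest = hashlib.md5(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]   # wrap around the ring


def range_partition(key, split_points):
    """Order-preserving partitioning: sorted split points define key ranges.

    Returns the index of the partition holding `key`; neighbouring keys
    land in the same or adjacent partitions, which keeps range scans local.
    """
    return bisect.bisect_right(split_points, key)


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    keys = [f"key{i}" for i in range(1000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.remove("node-c")
    moved = sum(1 for k in keys if ring.lookup(k) != before[k])
    print(f"keys that changed owner after removing node-c: {moved}")
    print(range_partition("melon", ["g", "n", "t"]))
```

With the ring, removing a node only reassigns the keys that node owned; the range partitioner keeps keys in sorted order across partitions, at the cost of needing well-chosen split points to avoid hot spots.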

Implementation details

Project	Impl.	Client protocol	Refs
Cassandra	Java	Thrift^[4]	[1], [2], [3]
CloudBase	Java	JDBC (Java)
CouchDB	Erlang	HTTP + JSON	[1], [2], [3]
Dynomite	Erlang	Thrift^[4]	[1], [3]
HBase	Java
Hypertable	C++	C++ API, Thrift^[4]
Kai	Erlang
LightCloud	Python + Tokyo Tyrant	Python
LucidDB	Java/C++	JDBC (Java)
Memcached	C	all^*
MemcacheDB	C	all^* (memcached protocol)
MonetDB	C
MongoDB	C++	API (Python, Java, Ruby, PHP, C++, Perl, Erlang)
Neptune	Java
Redis	C
Ringo	Erlang	HTTP
Scalaris	Erlang
ThruDB	C
Tokyo Cabinet + Tyrant	C	C, Perl, Ruby, Java, Lua
Voldemort	Java	Java

Performance

I usually do not trust micro-benchmarks. I know that performance measuring is an art. But I also know that some are looking for this sort of data and sometimes even the smallest piece of information is more helpful than nothing.

Project	reads/s	writes/s	refs
Cassandra
CloudBase
CouchDB
Dynomite
HBase
Hypertable
Kai
LightCloud	See: Tokyo Tyrant results + this
LucidDB
Memcached	here, 2007, here
MemcacheDB	benchmark data
MonetDB
MongoDB	Performance testing
Neptune
Redis
Ringo
Scalaris
ThruDB
Tokyo Cabinet + Tyrant
Voldemort

Other projects

I have found a couple of other projects, but I couldn't decide whether they fit in. If you think I should include them, please let me know (a supporting argument is highly appreciated).

I'd also like to mention FriendFeed's usage of MySQL, which, while not a new system in itself, was designed to behave like a BASE system.

Resources:

17 comments:

Anonymous said... 12:48 PM: Nice post. A pity that so many columns are empty (all in the perf. table, ~ 1/2 in the 1st table).
Alex Popescu said... 12:52 PM: jakubholy,

I am also concerned about the performance table as it looks like there isn't much information out there. I have found some and I'll start filling it in immediately.
Anyways, I think the only option to get enough details would be to spread the link and hope that others will start sharing their information.
Mihai said... 10:16 PM: Note that for Tokyo Cabinet/Tyrant you can also use the memcache protocol but you don't get access to everything.

For performance see this post: http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/ but I don't think any benchmark can be relevant.

For me more important are facts like Redis stores everything in memory, saving snapshots to disk so your DB cannot exceed your memory size.
Anonymous said... 7:28 PM: Hi Alex,

Can you update CloudBase website link in your post- http://cloudbase.sourceforge.net

Thanks,
T
dwight said... 9:37 PM: MongoDB: persistence: disk, rebalancing: Y
dwight said... 9:37 PM: Other good potential columns would be:
Secondary indexes?
Sorting?
Alex Popescu said... 11:42 PM: Anonymous: done.

dm:
1. can you point me to where MongoDB rebalancing is mentioned?
2. I'd be glad to add these new criteria, but can you please give more details about them?
dwight said... 6:45 PM: 1. the docs are incomplete at the moment - i will post something when they are updated.

2. what i was thinking is that some datastores are pure "key/value stores" where you can *only* query on the primary key. memcached is a good example of that. Some of the other databases let you query on any field or combination of fields. This is important for some use cases and would be good for folks to know which capabilities are in a given tool.

By secondary indexes, I mean one can create a DB index on a field that is not the primary key. This makes the "non primary key query" above fast.

Sorting is pretty clear -- a case where it is really helpful for the database to do the sorting is when it already has a btree in that order (then it is very fast). Also, for an "ORDER BY ... LIMIT ..." style operation, client sorting isn't efficient.

So i'll revise my previous comment and say good potential columns would be:

- query on non-primary key fields?
- secondary indexes?
- sorting?
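
As a rough illustration of the distinction drawn in this comment (not code from MongoDB or any other listed project), the Python sketch below keeps records by primary key and maintains one sorted secondary index. `TinyStore` and its methods are hypothetical names invented for the example.

```python
import bisect


class TinyStore:
    """In-memory records keyed by primary key, plus one secondary index."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self._rows = {}    # primary key -> record (dict)
        self._index = []   # sorted list of (indexed field value, primary key)

    def put(self, pk, record):
        old = self._rows.get(pk)
        if old is not None:
            # Drop the stale index entry before re-inserting.
            self._index.remove((old[self.indexed_field], pk))
        self._rows[pk] = record
        bisect.insort(self._index, (record[self.indexed_field], pk))

    def get(self, pk):
        """Pure key/value access: the only query a plain key/value store offers."""
        return self._rows.get(pk)

    def find_by_scan(self, field, value):
        """Query on a non-indexed field: every record has to be examined."""
        return sorted(pk for pk, r in self._rows.items() if r.get(field) == value)

    def find_by_index(self, value):
        """Query on the indexed field: binary search, then read matching entries."""
        start = bisect.bisect_left(self._index, (value,))
        result = []
        for v, pk in self._index[start:]:
            if v != value:
                break
            result.append(pk)
        return result

    def first_n_ordered(self, n):
        """'ORDER BY indexed_field LIMIT n': read the first n index entries."""
        return [pk for _, pk in self._index[:n]]


if __name__ == "__main__":
    store = TinyStore(indexed_field="age")
    store.put("u1", {"age": 30, "city": "Paris"})
    store.put("u2", {"age": 25, "city": "Berlin"})
    store.put("u3", {"age": 30, "city": "Berlin"})
    store.put("u4", {"age": 41, "city": "Paris"})
    print(store.find_by_index(30))
    print(store.find_by_scan("city", "Berlin"))
    print(store.first_n_ordered(2))
```

The sorted index plays the role of the btree mentioned above: the lookup and the ordered-limit query touch only the entries they return, while the scan touches every record.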
Anonymous said... 9:00 PM: Great list - thanks a bunch.

We are looking at a bunch of these technologies. They differ greatly in maturity, ability to deal with different sizes of data, failures, etc.

Our case: we have 180 million keys (32 chars in length). The values average 100 chars in length. I would say that reads are 100 times more frequent than writes. Which of the distributed key/value hash systems would you guys recommend? We care about redundancy, performance and ease of administration (who doesn't :). We run our platform on Java. Thanks.
Anonymous said... 9:55 PM: A "License" column would be helpful. Great resource, thanks!
jaso said... 4:42 AM: Neptune
- Data Model: Bigtable
- Partitioning: Y
- Persistence: DFS(HDFS)
- Rebalancing: Y
- Replication: Y
- Impl: Java
- Client Protocol: Java, RESTFul, Thrift
- Performance: http://www.jaso.co.kr/neptune/performance.html
Anonymous said... 7:17 PM: hadoop support? (or any other kind of mapreduce)
TheGreeneMan said... 12:59 AM: I think you are missing a very important category of NoSQL database .... the object database. Versant was the sponsor of the NoSQL meetup in Berlin. Pretty interesting stuff for NoSQL when dealing with complex models and requiring transactions and large scale distribution. Handles C, C++, Java, C#, Python. In the Berlin presentation it showed multi-terabyte systems for folks like the European Space Agency, plus there was some stuff on high throughput txns...running a couple of the GDS systems for the airlines.
Anonymous said... 1:24 AM: Why db4o is not in the list? I'd like to see how it compares to the rest.
LKRaider said... 9:42 PM: Add Midgard2 to the list:
http://www.midgard-project.org/midgard2/

Written in C, API for most languages, P2P replication and data sharing support, Object Oriented storage (defined by XML schemas)
Martin Kersten said... 9:27 PM: MonetDB
- Data Model: column-store
- Partitioning: Y
- Persistence: disk
- Replication: available
- Client protocol: All
- Performance: e.g. http://www.cwi.nl/~mk/ontimeReport
Anonymous said... 7:36 PM: I'm rather new to this technology so this comment might not make much sense but should Katta be on this list?
