A relational database management system (RDBMS) imposes a fixed schema, so why would I use schema-less and relational database in the same sentence?
I will not write an introduction about what relational database are and what are the differences between them and other storage alternatives and I'll jump directly to the subject: is it be possible to use a row-based storage in a schema-less mode? Or to make it more explicit: is it possible to store free format objects/data in a fixed format?
We have to agree that at the first glance this looks like the eternal hammer hypothesis: give me a hammer and everything will look like a nail. But in this case things are a lot simpler.
So, what is the solution?
Most of the languages support one form or another of object serialization in binary or pure text format. Java has serialization in binary format, Python has marshaling and (c)pickle. Ruby has marshaling too and the list can go on and on. Both Google with protobuf and Facebook with Thrift have proposed their own high performance binary serialization mechanisms. Besides all these existing mechanisms there is also an universal object marshaling protocol: JSON.
By now the solution should become quite obvious: instead of persisting the object by storing it field by field, we would just serialize the whole object with all its properties. Basically, the table will need just 2 fields: an ID and a field that can store our serialized object. Like FriendFeed is doing it (nb Jackrabbit -- the open source implementation of Java Content Repository spec -- was using this solution for quite a while). And if you think this is not the best idea, I'll just say that most (if not all) of the alternative storage solution are using similar storage solutions.
What do you lose?
Referential integrity
Referential integrity in the sense of foreign keys will be lost. You can maintain it at the application level though.
Indexes
Serialized object in either binary or text format are pretty useless in indexes. So, if you need to search based on specific object properties then you'll have to implement your own solution. For example, Jackrabbit is using Lucene to keep an external index, while FriendFeed is using the database by creating indexes on-request.
What do you win?
A solid storage engine
You don't have to recreate or implement yourself the storage mechanism. The solution will use one that is proved to be solid.
Drivers and Tools
Most of the existing RDBMS are providing additional administration and visualization tools. Not to mention their drivers tested by tons of applications.
Last, but not least, you'll keep any other feature offered by your relational database (f.e. clustering).
I'll not end the post without throwing out the real question:
13 comments:
Another big thing you loose: relations.
You can no longer filter, select, sort, and combine actions based upon individual attributes (columns).
Windows Azure's original SQL Server Data Services used a schemaless Entity-Attribute-Value model with "flexible entities" to enable an entity set to contain entities with different property bags.
They abandoned the approach in favor of a fully relational, schema-centric "SQL Server in the clouds" version at the demand of potential Azure developers in early 2009.
I'd say the plain old EAV model (used by Azure Tables) is the way to go if you don't want to be bound by RDBMS constraints.
--rj
Most major apps I've worked with have at least one table that just contains a serialized blob (usually of XML, but, as you say, it doesn't matter what the blob is). As you note, you lose all ability to query this data, which IMO is a major loss of functionality.
In those app architectures I've advocated for replacing the blobs with nested-set style SQL trees of data, which you CAN query. To my knowledge, no one has ever taken my advice! :-)
Thanks for the comment Deron. I'd say that I've already pointed out those aspects:
1. Relationships are part of referential integrity checks.
2. Filter, select, sort are imo part of losing the automatic indexing, which as pointed out in the post is still possible but using additional systems.
Roger, thanks for bringing up the Azure SQL Server. I think that going back to a relational, schema-centric approach is a good strategy for Microsoft to convince more people to use Azure as the transition will be much easier.
My concern is the tight coupling this introduces between code and data.
What happens when you need to update your data model e.g. add or remove a field? How do you migrate your existing data? If you haven't rolled your own serialization then what happens when code changes affect auto-serialization?
Also, this design would make it more difficult to interface with the data from other applications, data imports etc.
ctford, these are all valid concerns and I don't think there is one single strategy to address them all.
Data migration is usually a problem whatever storage mechanism you are using (or at least this is how I see it). There is even a full book dedicated to DB refactorings and I don't think it is covering all possible changes.
For the scenario presented above, I guess one approach would be to introduce model versioning, but I haven't thought much about this aspect. Thanks for bringing it up!
Nice post!
I think that the answer to your "real question" is to find in the draw-backs pointed out by you and especially by ctford.
To actually answer your question Alex, I think schema-less approach can be justified when your data model is fluid and dependent on runtime inputs.
However, in that instance I would modify your approach slightly to more closely resemble the prototype pattern. I would have two tables, Object and Property. This would allow me to add and remove properties from objects at runtime, perform aggregate queries and would also provide a more stable and application-independent serialisation.
Steve Yegge wrote an interesting (and long) post on the prototype pattern: http://steve-yegge.blogspot.com/2008/10/universal-design-pattern.html
@ctford. the problem with prototypes in a databases (what I would call extension tables) is that they scale as a multiple of the number of attribute values whereas an EAV approach tends to be bounded to the number of objects. If that's a not a problem then it's possibly a valid approach.
Why not a hybrid approach?
Entities get their own table.
Tables have a number of common fields (identity, foreign keys, auditing, required fields) and then also get an entry into a property bag type table, which is simply a Name:Value pair (or JSON, if you like, to cut down on the number of records.) The property bag table can be entity specific, or just one for the entire DB.
The flexibility you lose from your inital approach is the entities themselves are not flexible (you could not create a new type of entity without a schema change.) Though the composition of the entities themselves would be fluid.
Another option is to have two tables in your database. One for Nodes (that works as you suggested -- contains serialized JSON.) The other table for Relationships (Edges) that store links between the Nodes (and also their own JSON data.)
You could even add additional index tables that contain important / indexed fields pulled out from the Nodes and Edge records.
Of course, at this point, it might be easier to write your own DB rather than try to shoehorn this into an existing RDBMS. Traversing edges in this sort of graph database would be a lot of expensive single record look ups. Doing any sort of search (depth-first, breadth-first) would take a long time.
There are papers written on doing this within an RDBMS, though.
Patrick,
The property bag approach is not scalable at all (even if you'd do it in a table per entity way).
Related to the 2nd approach, that is close to what I've described. Your connection table addition is interesting, but I'm not very sure now about the benefit (as you'll have to navigate through 3 tables instead of 2 for reconstructing a connection which would still not impose referential integrity).
For most of us, building a DB from scratch is not an option. My approach would be to take a well established persistence solution and see how I can make it do what I want (normally by giving up to those features that are not 100% needed).
Post a Comment