7 Flavors of MapReduce


I am pretty sure that those reading this post already know what MapReduce is (in case you want to refresh your memory, here is the PDF). I'm also pretty sure that you've already heard of Hadoop, the open source implementation of MapReduce that Yahoo contributed to the Apache Foundation, and you have probably also heard of Amazon Elastic MapReduce.

At least, that's pretty much all I knew about MapReduce and its implementations. But I have discovered a few other solutions that offer a MapReduce implementation (disclaimer: I haven't tried these projects and I don't know their current status).
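For anyone who prefers code to a paper as a refresher, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using the classic word-count example. It is not taken from any of the projects listed in this post; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real implementation would spread these steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

Because each call to `map_phase` looks at one document only, and each call to `reduce_phase` looks at one key only, both steps can run in parallel, which is what the frameworks below take care of.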



Disco

Description: Disco is an open-source implementation of the Map-Reduce framework for distributed computing. Like the original framework, Disco supports parallel computations over large data sets on unreliable clusters of computers.

Project: http://discoproject.org/



Skynet

Description: Skynet is an open source Ruby implementation of Google’s MapReduce framework, created at Geni. With Skynet, one can easily convert a time-consuming serial task, such as a computationally expensive Rails migration, into a distributed program running on many computers.

Project: http://skynet.rubyforge.org/


FileMap

Description: FileMap is a lightweight system for applying Unix-style file processing tools to large amounts of data stored in files. It provides full map-reduce functionality without requiring that you switch your processing to any particular language or runtime environment, install any special software, or have root on your storage and processing nodes.

Project: http://mfisk.github.com/filemap/
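To give a feel for the file-oriented style described above, here is a rough sketch of the general idea: run the same small tool on every file independently (the map), then combine the per-file results (the reduce). This is not FileMap's interface, which I haven't tried; the helpers `count_lines`, `total_lines`, and `write_sample_files` are hypothetical names, and the per-file step imitates what `wc -l` would do.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_lines(path):
    """Map step: process one file on its own (here, count its lines)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def total_lines(paths, workers=4):
    """Apply the map step to every file in parallel, then reduce by summing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = list(pool.map(count_lines, paths))
    return reduce(lambda left, right: left + right, per_file, 0)


def write_sample_files(directory, contents):
    """Hypothetical helper: write each string to its own file for the demo."""
    paths = []
    for index, text in enumerate(contents):
        path = os.path.join(directory, f"part-{index}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = write_sample_files(tmp, ["a\nb\n", "c\n", ""])
        print(total_lines(files))
```

The sketch uses threads on one machine only; the point is just that the per-file step needs nothing but the file itself, so it can run wherever the file happens to be stored.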



Greenplum

Description: Greenplum Database is a software solution built to support the next generation of data warehousing and large-scale analytics processing. Supporting SQL and MapReduce parallel processing, Greenplum Database offers industry-leading performance at low cost for companies managing terabytes to petabytes of data.

Project: http://www.greenplum.com/



Apache Hadoop

Description: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop Core, our flagship sub-project, provides the Hadoop Distributed File System (HDFS) and support for the MapReduce distributed computing framework.

Project: http://hadoop.apache.org/

Amazon Elastic MapReduce


Description: Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Project: http://aws.amazon.com/elasticmapreduce/

There is also a project from Microsoft Research that seems to be related to MapReduce: Dryad (investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data center) and its DryadLINQ module (make large-scale, distributed cluster computing simple, simple enough for ordinary programmers).

Do you know of any others? Also, if you have any experience with any of these projects, I'd really appreciate it if you could share it with us. Links to posts covering any of the projects are welcome.