Hadoop is probably the most complete and widely used of the 7 MapReduce implementations I have counted. The project was initiated at Yahoo! and some time ago it was contributed to the Apache Software Foundation. Judging by the committers page, 13 out of 22 committers are from Yahoo!, so I cannot stop wondering: what is Yahoo! Hadoop?
First Answer
Yahoo! is opening up its investment in Hadoop quality engineering to benefit the larger ecosystem and to increase the pace of innovation around open and collaborative research and development.
I'll let you decide whether this is just PR BS or not.
Second Answer
The Yahoo! Distribution of Hadoop has been tested and deployed at Yahoo! on the largest Hadoop clusters in the world.
Now, this makes a lot of sense: Yahoo! Hadoop is a version of Hadoop that has been tested and patched internally.
Still confused?
But my confusion persists:
- why haven't the patches been applied to the source base hosted by Apache?
- why is it not a tag on the source base hosted by Apache? Or at most a separate branch?
- why has Yahoo! decided to host it completely separately on GitHub?
And I guess there are only two possible answers:
- either Yahoo! has no idea how to run an open source project (NB: this is hard to believe)
- or Yahoo! has decided to fork back Hadoop and take full control over it.

5 comments:
Here's a theory: they're just using the event to pressure Apache and push their own patches through faster.
As far as I can tell, the Yahoo distro is the same as the Cloudera distro. The Cloudera distro is based on 0.18.3, with some patches that were applied to 0.19 or 0.20. Version 0.19 is largely deemed unstable and not production worthy, and 0.20 just came out. Most production deployments (likely Yahoo and Facebook) are still on the 0.18 branch, but benefit from the patches, hence the need for a separate distro.
In short, I don't think there's anything fishy going on here... this is simply how large enterprises work with open source projects, particularly ones that are still in the early stages.
p7a, would you mind pointing me to any other open source project contributed by a company that was afterwards forked back?
I know companies usually maintain a stable/tested/patched version internally, but what is happening with Hadoop is quite different.
There is a possibility in what you are saying. However, I don't think that it makes sense from a TCO perspective for Yahoo to maintain a different fork of Hadoop internally.
More likely, I think that Hadoop contributed the testing time for these patches. A lot of Hadoop patches are contributed as JIRA patches or Hadoop contrib packages. Keeping track of which patches you apply, and testing them, is a large effort that users are unlikely to take on. But the code is available, just not applied to the main branch.
I know that Yahoo also has a bunch of code that they haven't contributed back yet (see Arun C. Murthy and Owen O'Malley's presentation at the Hadoop summit, http://bit.ly/ptCd1).
My 0.2
adragomir, what you are saying makes sense, but my main questions remain:
- why would you host it under a different umbrella instead of just making it a branch of the existing project? (the costs are even higher this way)
- why merge changes from an externally hosted project? (once again, costs are higher)
Now, the last part of your comment may be the real answer. Yahoo has *internally forked* the project, and using git to manage it is much easier than using svn, so they decided to host it on GitHub.