Hadoop is probably the most complete and most widely used of the 7 MapReduce implementations I have counted. The project was initiated at Yahoo! and was later contributed to the Apache Software Foundation. Looking at the committers page, 13 of the 22 committers appear to be from Yahoo!, so, this being said, I cannot stop wondering: what is Yahoo! Hadoop?
First answer
Yahoo! is opening up its investment in Hadoop quality engineering to benefit the larger ecosystem and to increase the pace of innovation around open and collaborative research and development.
I'll let you decide whether this is just PR BS or not.
Second answer
The Yahoo! Distribution of Hadoop has been tested and deployed at Yahoo! on the largest Hadoop clusters in the world.
Now, this makes a lot of sense: Yahoo! Hadoop is a version of Hadoop that has been tested and patched internally.
Still confused?
But my confusion still persists:
- why haven't the patches been applied to the source base hosted by Apache?
- why is it not a tag on the source base hosted by Apache, or at most a separate branch?
- why has Yahoo! decided to host it completely separately on GitHub?
And I guess there are only two possible answers:
- either Yahoo! has no idea how to run an open source project (nb this is hard to believe)
- or Yahoo! has decided to fork back Hadoop and take full control over it.
6 comments:
Here's a theory: they're just using the event to pressure Apache and push their own patches faster.
As far as I can tell, the Yahoo distro is the same as the Cloudera distro. The Cloudera distro is based on 0.18.3, with some patches that were applied to 0.19 or 0.20. Version 0.19 is largely deemed unstable and not production worthy, and 0.20 just came out. Most production deployments (likely Yahoo and Facebook) are still on the 0.18 branch, but benefit from the patches, hence the need for a separate distro.
In short, I don't think there's anything fishy going on here... this is simply how large enterprises work with open source projects, particularly ones that are still in the early stages.
p7a, would you mind pointing me to any other open source project contributed by a company that was afterwards forked back?
I know companies are usually maintaining internally a stable/tested/patched version, but what happens with Hadoop is quite different.
There may be something to what you are saying. However, I don't think it makes sense from a TCO perspective for Yahoo to maintain a separate fork of Hadoop internally.
More likely, I think that Yahoo contributed the testing time for these patches. A lot of Hadoop patches are contributed as JIRA patches or Hadoop contrib packages. The effort of keeping track of which patches you apply is large, and unlikely for most users to take on. But the code is available, just not applied to the main branch.
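The workflow described above can be sketched in a few lines of shell. This is only an illustration of what "applying a JIRA patch and keeping track of it" looks like; the file names and the HADOOP-1234 issue number are made up, not taken from a real ticket.

```shell
set -e

# Stand-in for an unpacked release tree (hypothetical file contents).
mkdir -p hadoop-0.18.3/src
printf 'old line\n' > hadoop-0.18.3/src/Foo.java

# A JIRA patch is just a unified diff against the source root.
cat > HADOOP-1234.patch <<'EOF'
--- src/Foo.java
+++ src/Foo.java
@@ -1 +1 @@
-old line
+new line
EOF

# Apply it from the release root, then log the issue id so the set of
# locally applied patches stays auditable.
(cd hadoop-0.18.3 && patch -p0 < ../HADOOP-1234.patch)
echo 'HADOOP-1234' >> hadoop-0.18.3/APPLIED_PATCHES
```

Multiply this by dozens of patches per release and it becomes clear why the comment calls the bookkeeping effort large.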
I know that Yahoo also has a bunch of code that they haven't contributed back yet (see Arun C. Murthy and Owen O'Malley's presentation at the Hadoop summit, http://bit.ly/ptCd1).
My $0.02
adragomir, what you are saying makes sense, but my main questions remain:
- why would you host it under a different umbrella instead of making it just a branch of the existing project? (the costs are even higher this way)
- why merge changes from an externally hosted project? (once again, costs are higher)
Now, the last part of your comment may be the real answer. Yahoo has *internally forked* the project, and since managing it with git is much easier than with svn, they decided to host it on GitHub.
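The git workflow hinted at here can be sketched end to end. This is a minimal, hypothetical reconstruction: the repository names, branch name `yahoo-patches`, and file contents are invented for illustration, but the pattern (local patches on a branch, replayed on top of each upstream release) is the standard way git makes an internal fork cheap to maintain.

```shell
set -e

# Stand-in for the upstream (Apache) repository.
git init -q upstream && cd upstream
git config user.email you@example.com && git config user.name you
echo 'v1' > core.txt
git add core.txt && git commit -qm 'upstream release'
cd ..

# The internal distribution: a clone carrying local patches on a branch.
git clone -q upstream distro && cd distro
git config user.email you@example.com && git config user.name you
git checkout -qb yahoo-patches
echo 'local fix' > fix.txt
git add fix.txt && git commit -qm 'local patch not yet in Apache'
cd ..

# Upstream moves on...
cd upstream
echo 'v2' >> core.txt && git commit -qam 'next upstream release'
cd ..

# ...and the whole local patch set is replayed on top of the new release.
cd distro
git fetch -q origin
git rebase -q origin/HEAD yahoo-patches
cd ..
```

Doing the equivalent with svn means hand-maintained vendor branches and repeated manual merges, which is plausibly why GitHub hosting looked attractive.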
Only just seen this, sorry for a very late comment.
Hadoop is the most locked down of any ASF project I've ever encountered: no change is made without a JIRA issue, and every patch has to get through Hudson, including extra tests or a justification for no new tests. As a result, this is one of the highest quality apps I've ever come across, but then it needs to be: it is line of business for Y!, Facebook, and others. Because of that process, though, you can't be so agile; you can't do a quick fix for a problem you see on your deployment and check it in straight away. So most large deployments are generally a recent stable release with local patches for the things they've encountered, patches that may go into SVN_HEAD but have been backported to the older versions. Which people like to keep using, because if you are storing 4+ petabytes of data you don't want to be on nightly builds.
The emergence of the Y! branch probably came about as an answer to Cloudera, to stop them getting all the credit for Hadoop just by producing "the Cloudera branch" as RPMs. Y! gets to say "this is what we test at scale, what we run, you can play too". Nothing in there is secret; it's just a different set of patches applied to the primary releases.
Incidentally, the "have you patched Hadoop?" question is one way to measure how seriously someone is using the project, right up there with "what bugs have you found". It's not perfect: you look through that code, you find potential problems. We cherish bug reports.