I couldn't find anything that could handle many terabytes of data, though. Most Ruby implementations, like the classifier gem, offer only a simplistic implementation […]. I decided to write a better Naive Bayes implementation (using, for instance, Laplacian smoothing) that could also scale to many terabytes of corpus data. We already run a Hadoop cluster with HBase, and HBase is a good fit for storing data like word counts.
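To make the Laplacian (add-one) smoothing concrete, here is a minimal in-memory sketch of a Naive Bayes classifier in Ruby. The `NaiveBayes` class and its method names are illustrative assumptions, not the actual implementation or the classifier gem's API; plain Hashes stand in for the HBase word-count tables.

```ruby
# Minimal Naive Bayes with Laplace (add-one) smoothing.
# In-memory Hashes stand in for HBase-backed word-count tables.
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => {word => count}
    @doc_counts  = Hash.new(0)                            # category => number of documents
    @vocabulary  = {}                                     # set of all words seen
  end

  def train(category, words)
    @doc_counts[category] += 1
    words.each do |w|
      @word_counts[category][w] += 1
      @vocabulary[w] = true
    end
  end

  def classify(words)
    total_docs = @doc_counts.values.sum.to_f
    v = @vocabulary.size
    scores = @doc_counts.keys.map do |cat|
      total_words = @word_counts[cat].values.sum
      # Log prior plus log likelihoods with add-one smoothing:
      # P(w|cat) = (count(w, cat) + 1) / (total_words + |V|)
      # The smoothing keeps unseen words from zeroing out a category.
      score = Math.log(@doc_counts[cat] / total_docs)
      words.each do |w|
        score += Math.log((@word_counts[cat][w] + 1.0) / (total_words + v))
      end
      [cat, score]
    end
    scores.max_by { |_, s| s }.first
  end
end
```

In a terabyte-scale version, `train` would increment counters in HBase (whose atomic counter columns suit this access pattern) instead of mutating local Hashes, but the smoothing arithmetic stays the same.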