Extracting and Tokenizing 30TB of Web Crawl Data
All code for this five-step process of extracting and tokenizing Common Crawl's 30TB of data is available on GitHub:
via NoSQL databases
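The post only links to the code, but the tokenization step at the heart of such a pipeline can be sketched in a few lines. This is an illustrative assumption, not the repository's actual implementation: the `tokenize` helper and its regex are hypothetical stand-ins for whatever tokenizer the project uses.

```python
import re

def tokenize(text):
    # Lowercase the text and split on runs of non-alphanumeric
    # characters -- a deliberately simple stand-in for the real
    # tokenizer used in the linked pipeline.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# A sample record, standing in for text extracted from a crawled page.
sample = "Extracting and Tokenizing 30TB of Web Crawl Data"
print(tokenize(sample))
```

At crawl scale the same function would be applied record by record to text extracted from WARC/WET files rather than to in-memory strings.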
About me: Software architect, web aficionado, cloud computing fanboy, geek entrepreneur, speaker, co-founder and CTO of InfoQ.com. I also write about NoSQL on the myNoSQL blog.