NoSQL: Extracting and Tokenizing 30TB of Web Crawl Data

All code for this five-step process of extracting and tokenizing Common Crawl's 30TB of data is available on GitHub:

tags: hadoop, aws

via NoSQL databases