Extracting and Tokenizing 30TB of Web Crawl Data
All code for this five-step process of extracting and tokenizing Common Crawl's 30TB of data is available on GitHub:
via NoSQL databases
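The post only links to the code, but the tokenization step at the heart of such a pipeline can be sketched in a few lines. This is an illustrative assumption, not the repository's actual implementation: the `tokenize` helper and its regex are hypothetical stand-ins for whatever tokenizer the project uses.

```python
import re

def tokenize(text):
    # Lowercase the text and split on runs of non-alphanumeric
    # characters -- a deliberately simple stand-in for the real
    # tokenizer used in the linked pipeline.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# A sample record, standing in for text extracted from a crawled page.
sample = "Extracting and Tokenizing 30TB of Web Crawl Data"
print(tokenize(sample))
```

At crawl scale the same function would be applied record by record to text extracted from WARC/WET files rather than to in-memory strings.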
About me: Software architect, web aficionado, cloud computing fanboy, geek entrepreneur, speaker, co-founder and CTO of InfoQ.com. I also write about NoSQL on the myNoSQL blog.