Data wrangling for fun and profit

Welcome to DataPatterns.org, a collection of tips and tricks for data work. It is not a finished document but a set of opinions and evolving best practices. The purpose is not to present all available options and technologies but to pick one and follow it through. DataPatterns is also a collaborative effort: if you have some good hacks and would like to share them, please contribute a patch to the DataPatterns repository.

Some proposed chapters:

  • Types of data
  • Setting up a working environment
  • Scraping things
    • HTML
    • Index & Item
    • Page Elements
    • Have a cookie (State)
    • Threading / FlockScrape?
    • Caching: HTTP and Local
    • Put it somewhere (MongoDB)
    • Put it somewhere else (SQLite)
    • Take a peek inside
  • Storing data
    • Webstore
    • JSONdir
    • Metadata & CKAN
  • Extracting things & cleanup
    • Regexen
    • PDF
    • OCR/ocropus
    • Date parsing
    • Refine / Refine as a Server
    • Text Normalization
    • Calais and Auto-Tagging
  • Entities
    • NLP/NER basics
    • MDM/Codesheets
    • Google Spreadsheet Normalization
    • OpenCorporates.com Recon
    • Helmut
    • GeoNames
  • Graphs
    • RDF and Linked Data
    • NetworkX
    • graphviz + Gephi
  • Mapping (invite)
  • Dataviz (invite)
