Welcome to DataPatterns.org, a collection of tips and tricks for data work. It is not a finished document but a set of opinions and evolving best practices. The purpose is not to present every available option and technology but to pick one and follow it through. DataPatterns is also a collaborative effort: if you have some good hacks and would like to share them, please contribute a patch to the DataPatterns repository.
Some proposed chapters:
- Types of data
- Setting up a working environment
- Scraping things
- HTML
- Index & Item
- Page Elements
- Have a cookie (State)
- Threading / FlockScrape?
- Caching: HTTP and Local
- Put it somewhere (MongoDB)
- Put it somewhere else (SQLite)
- Take a peek inside
- Storing data
- Webstore
- JSONdir
- Metadata & CKAN
- Extracting things & cleanup
- Regexen
- OCR/ocropus
- Date parsing
- Refine / Refine as a Server
- Text Normalization
- Calais and Auto-Tagging
- Entities
- NLP/NER basics
- MDM/Codesheets
- Google Spreadsheet Normalization
- OpenCorporates.com Recon
- Helmut
- GeoNames
- Graphs
- RDF and Linked Data
- NetworkX
- Graphviz + Gephi
- Mapping (invite)
- Dataviz (invite)