Sourcing your data
Finding the right ingredients for your data creation is often the hardest part. You will often have to mix-and-match from the approaches below to get all the data and information you need.
1) Search the supermarkets – the data catalogues & data stores
There are a growing number of data catalogues that bring together listings of published open data (and there are also now data marketplaces that can help you find commercially licensed data as well – so be sure to check the details of the data you find).
Data catalogues often have a particular focus – and no one catalogue can tell you about all the data out there.
datahub.io is a catalogue of data from many different sources powered by CKAN software. It is worth checking if you are not quite sure where the dataset you want might be found, to see if someone has already created a ‘packaged’ version of it.
Data.gov.uk is the UK Government’s data catalogue, which aims to include listings of all open datasets in the public sector. It’s early days yet, but it boasts over 4,600 dataset listings, many of which link directly to spreadsheets and data downloads.
Guardian World Data Store makes it easy to search across a range of different government open data catalogues – browsing data by country and format.
Your local authority might have a data store, or at least a data page on their website. London has http://data.london.gov.uk and you can find a list of other local open data web pages through the ‘All Councils’ listing at OpenlyLocal.com.
Publicdata.eu is a new catalogue bringing together data from right across Europe.
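Catalogues built on CKAN can usually be searched from a script as well as through the website. The sketch below is only an illustration: it assumes the catalogue exposes CKAN's version 3 Action API (the package_search action), and the base address you pass in is your own choice. Check that the catalogue you are using still runs CKAN before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def ckan_search_url(base_url, query, rows=5):
    """Build a package_search URL for a CKAN catalogue's base address."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response):
    """Pull (title, name) pairs out of a decoded package_search response."""
    results = response.get("result", {}).get("results", [])
    return [(d.get("title", ""), d.get("name", "")) for d in results]


def search_catalogue(base_url, query, rows=5):
    """Query a live catalogue (needs network access)."""
    with urlopen(ckan_search_url(base_url, query, rows), timeout=30) as resp:
        return dataset_titles(json.load(resp))
```

Calling search_catalogue with a catalogue's address and a phrase such as "council spending" should return a short list of dataset titles and identifiers, which you can then look up on the catalogue's website.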
2) Specialist independents – data stores
While the supermarkets are stacking the datasets high and sharing them free, there might be a specialist in your area of interest, working hard to source and bring together the finest data they can. Fortunately, most of them provide the data for free too.
OpenlyLocal.com is focussed on making local council information accessible. You can find details of local council spending for many authorities, alongside details of council meetings and councillors that have been scrumped and scraped from the respective websites for you. Most of the raw data is available through an API, so you might need to explore a few new skills to get at it.
Timetric.com are specialists when it comes to time series data. If you can plot it on a graph over time, chances are they’ve taken the dataset, tidied it up, and provided ways to search and browse for it – with CSV spreadsheet downloads of the raw data.
Do you have a specialist independent you go to for data? Edit the page to add it in….
3) Foraging – searching for the data
GetTheData.org makes a great first port of call to see if other data-foragers have already found a good spot to get the data you are after. It’s a community website full of requests for data, and conversations about good places to find it. It is currently in an 'archived' read-only form, and a proposal is underway (which needs the support of more people) to replace it with a Q&A site on Stack Exchange.
If the data you want isn’t available pre-packaged and catalogued, you might need to head out foraging across the Internet. There is a lot of open data in the wild – you just need to know how to spot it.
Search – Try searching the web for the topic you are interested in, perhaps adding ‘data’ as an extra keyword. When you read news articles or web pages that appear to be based on data, take note of the names of the data sources they mention and plug those back into a search. Often that will lead you to some data you might be able to use.
Think-tank websites, academic researcher web pages and even newspaper sites can all host lots of datasets. Just make sure you find out all you can about the provenance of the information before you use it!
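Deep searching – You can use a standard Google Search to look for data published in common office formats hosted on a particular web domain: your local council or university for example. All you need are two handy operators: ‘filetype:’, which limits results to files of a given type, and ‘site:’, which limits results to a given web domain.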
Using those together you can construct searches like ‘filetype:xls site:oxford.gov.uk’ to find all the Excel spreadsheets that Google has indexed on the Oxford City Council website.
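If you want to run the same search across several sites or file types, it can help to build the queries in a small script. This is a minimal sketch: the function names are made up for illustration, and the Google search address is simply the ordinary search page with the query passed in the q parameter.

```python
from urllib.parse import urlencode


def deep_search_query(site, filetype, keywords=""):
    """Combine the filetype: and site: operators with optional keywords."""
    parts = [f"filetype:{filetype}", f"site:{site}"]
    if keywords:
        parts.append(keywords)
    return " ".join(parts)


def google_search_url(query):
    """Turn a query string into a link you can open in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # Print one search link per common data format for a single site.
    for filetype in ("xls", "csv"):
        print(google_search_url(deep_search_query("oxford.gov.uk", filetype)))
```

Swapping in your own council or university domain gives you a ready-made set of links to click through.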
4) Scrumping – screen-scrape the data
It’s not uncommon to find the data you need… only it’s just out of reach. Perhaps it’s in a table on a web page when you want it in the sort of table you can load into a spreadsheet to sort and chart. Or it might be spread across lots of different web pages and files. That’s where screen-scraping comes in – creating small computer scripts that turn structured information on a website into raw data.
There are recipes that explain the details of screen-scraping coming in the cook book, and you can go screen-scrape scrumping with a variety of different tools.
Google Spreadsheets – using a special formula you can grab tables and lists from other websites directly into your spreadsheet (recipe).
Scraper Wiki – helps you get started creating advanced scrapers, which they will run every day to grab information from websites and turn it into accessible raw data (recipe).
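To give a flavour of what a scraper does, here is a minimal sketch that turns an HTML table into CSV text using only Python's standard library. It is not how either tool above works internally. It assumes a simple table where every row and cell has an explicit closing tag, and the class and function names are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(scraper.rows)
    return out.getvalue()
```

Feed table_to_csv the source of a saved web page and write the result to a .csv file to open it in a spreadsheet. Real pages are often messier than this sketch allows for (nested tables, missing closing tags, cells that span columns), which is where the dedicated tools earn their keep.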
5) Special order – FOI
Perhaps you have found that no-one stocks the data you need – not even in places you can forage or scrump for it. If the data comes from a public body, then it might be time to explore putting in a special request for it using the Freedom of Information Act.
WhatDoTheyKnow.com is a service that makes it easy to submit a Freedom of Information Act request to a local authority, government department or other public body. You have a right to ask authorities for a copy of the information and data they hold, and you can ask for it to be returned as raw data. Search WhatDoTheyKnow to see if anyone has requested the data you want already, and if not, put in your request. (Often if data is available on WhatDoTheyKnow it will be locked up in PDFs. You might need to crowd-source the process of turning it into structured raw data, although there are a few tools and approaches that might help turn PDFs into data programmatically.)
The Public Sector Information Unlocking Service, available at http://unlockingservice.data.gov.uk/, provides a route for requesting that data be opened up by the Data.gov.uk team. It’s not backed by the legal framework of FOI, but may play a role in data requests under the currently debated ‘Right to Data’ legislation.
IsItOpenData.org provides a useful tool for asking non-public bodies to share their data as open data, or to clarify the licensing.
6) Home grown – research and crowdsourcing
Some data simply doesn’t exist yet – but you can create a raw dataset through research, and through crowd-sourcing, inviting others to help you research.
Simple spreadsheets - if you are systematically working through a research task, keep your results in a spreadsheet. See the section on raw data for ideas about how to structure it well.
Google Forms - available through http://docs.google.com, this allows you to create an online form that anyone can fill in, with all the responses going directly into a spreadsheet for you to use. You might be able to get supporters to research for you and collaboratively build up a useful dataset.
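Once responses start arriving, you can download the spreadsheet as a CSV file and summarise it with a few lines of code. The sketch below counts the answers in one column. The function name is made up for illustration, and the column heading you pass in has to match whatever question text appears in the first row of your own spreadsheet.

```python
import csv
import io
from collections import Counter


def tally(csv_text, column):
    """Count how often each non-empty answer appears in one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    answers = ((row.get(column) or "").strip() for row in reader)
    return Counter(answer for answer in answers if answer)
```

For a file saved on disk, read it with open(path, newline="", encoding="utf-8").read() and pass the text to tally along with the heading of the question you want to count.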