recipe:sliced_and_diced_aid_data_with_google_refine

Google refine is like a swiss-army-knife for working with tabular data. You can load in small and large CSV files into refine and use it to clean up the data, to browse through it and to restructure it. Refine is a great starting point for any data analysis - and is a key part of any data kitchen.

Refine works well with big files (thousands of rows of data), and includes many advanced features to fetch additional data and context from other data sources or from the web. In this recipe we show how to take Aid Data from a general spreadsheet and to prepare it for analysis.

You will need

A installed on your computer (you can , or talk with your IT department about getting access)
A CSV file of aid projects from the IATI Explorer Toolkit (or a spreadsheet/CSV file of your choice). You can find one here, or create a query to fetch your own here.

Step by step

Step 1: Preparing your data

Google Refine works best with data that is already organised into columns in some way. It doesn’t need to be perfectly structured (refine is all about helping you reshape and clean up your data), but it does help if the first row of your file includes a list of column headings, and the following rows include only data (rather than having sub-headings at various rows of the document).

The data we are working with from the International Aid Transparency Initiative (IATI) is already well structured for our use.

Step 2: Load your data into refine

Google refine runs on your computer, but you access it through a web browser. You can usually access it at http://localhost:3333 when it is running.

From the front-page of refine you can create a new project, either fetching CSV data directly from the Internet, or loading a file from your computer. In our example, we’ll load data from a file saved on our computer. As we have a single header line we can leave the default ‘Create Project’ options as they are.

Once we have created our project, refine will load up a spreadsheet-like view of the data. By default refine just shows 10 rows of data, but you can page through all the available data. Whenever you carry out an operation in refine, it will apply to all your active rows, regardless of whether they are displayed on screen or not.

Step 3: Work with columns & cells

In refine most commands are performed using the drop-down icon next to column headings. In our IATI spreadsheet there are a number of columns we don’t need, so we can use the drop-down, and the ‘Edit Column’ > ‘Remove Column’ option to get rid of these from our data. Notice how every action we take is added to the Undo/Redo tab at the side of refine, making it easy to go back if you make a change by mistake*.

Our IATI spreadsheet is a flattened version of an XML file. Because ‘fields’ in XML can have lots of additional attributes, and an XML file might include the same fields in different languages, or might repeat fields, the IATI spreadsheet has a lot of columns we might not need.

From the drop-down menu for a column, you will also find an ‘Edit Cells’ section. This lets you make a particular change to every cell in that column. For example, scrolling across to ‘participating-org’ columns in our IATI dataset we find that some organisation names are written in UPPERCASE, and others lowercase, and others in Title Case. Using ‘Edit Cells’ and the ‘Common Transformations’ options we can convert all the values in a column to one particular case.

*The undo/redo menu tab also lets you export a whole set of steps from your work in Refine so you can apply them to similar files in future. This allows you to create re-usable workflows for cleaning data.

Step 4: Explore what the dataset contains

Before you start changing columns and values, you need a sense of what they contain. For that refine provides the ‘Facet’ feature. This is also a really useful tool for exploring the contents of the data, and carrying out preliminary research.

For example, if we want to check the names of all the organisations involved in aid projects in our dataset we can scroll along to the ‘participating_orgs-implementing’ column and choose ‘Facet’ > ‘Text Facet’ from the drop-down menu. The ‘Facet/Filter’ tab on the left will now include a list of all the participating organisations in the data, along with a count of how many rows these were found in. You can sort the list by ‘count’ to see the most common organisation. When you click an option in the facet box, only the rows with that value will be displayed and you can explore and make changes to these rows. When you have multiple facets and you select an option in one, the others will be updated to represent the narrower set of data to which they now apply.

For example, if you scroll back to the ‘Commitment’ column and choose ‘Facet’ > ‘Numeric Facet’ you can get a graph of the sum of commitments in each aid activity in the dataset as a histogram. Drag the box at the left or right of the histogram to narrow down your exploration to only aid activities above or below a certain value of commitments, and you will see the participating-org facet update to show only organisations with activities in the range of commitments you have now selected.

This can be useful to find outliers in a dataset. Sometimes you might find values that are obviously wrong - perhaps as the result of transcription errors. You can click in any cell in the main refine spreadsheet area and choose ‘Edit’ to remove or fix such values.

In the IATI dataset, facets are also useful to flag up areas for caution in analysis of the data. Adding a text facet for ‘default-currency’ and ‘transaction-value-currency’ lets us quickly see whether any values we are looking at might be in multiple currencies, and thus not directly comparable.

If you are sure you are dealing with just one currency you can also create scatter-plot facets to compare, for example, commitments and disbursments recorded in the IATI dataset. You can also create custom facets. In our IATI dataset we might want to create a quick facet to let us see which sectors our projects fall into. However, the sectors are all in one cell, separated by ‘;’, so creating a simple text facet over this won’t help. Instead using a custom text facet we can use a Google Refine Expression Language (GREL) command to tell refine to split up the cell by ‘;’ before building the facet.

split(value,";")

With facets applied, you can use the ‘Export’ option from the top-right of the screen to export a new CSV file of just the narrowed down set of data you are interested in. This makes refine a useful way to take a massive document, and fetch out a smaller set of data to work with in other tools like your spreadsheet or an online data visualization tool.

(Tip: if you are struggling to find a column, both Firefox and Google Chrome web browsers let you search within a page by pressing Cntrl-F or Cmd-F on mac. Bring up the search box and type in ‘sector’ and then skip through the places it appears in the page to quickly get to the relevant column).

Step 5: Cleaning the data - editing and clustering

Often you will find that data needs some cleaning before you can use it for analysis. For example, even when you have converted a column to a single case, you might find the same organisation name with slight differences of spelling or punctuation in a column. The computer needs help to realise that two similar things are the same.

Google refine allows you to edit a value direct from the facet browser. Hovering over a text facet choice and selecting ‘Edit’ will let you change a value for all the rows it appears in.

For example, in the IATI dataset we might find the participating-org_implementing column includes both ‘UNDP’ (with 5 activities) and ‘United Nations Development Programme’ (with 10 activities), and we want to analyse these together. Clicking ‘Edit’ next to ‘UNDP’ in the Facet browse, we can change this to ‘United Nations Development Programme’. Our facets will update to now show ‘United Nations Development Programme’ with 15 activities. We could use a similar approach to manually collapse a longer list of sectors into a smaller list more general sector labels (although only once we have each sector on it’s own row after step 6)*.

Google Refine also has a powerful ‘Cluster’ tool that will try to automatically detect similar values that should be clustered together. You can choose a variety of methods to look for similar text. This is particularly useful when you are dealing with files with lots of rows of different organisation names or other ‘free text’ identifiers.

(*For large cross-mapping of classifications of IATI Data Publish What You Fund have a method which you could use. Requests for details can be made on the IATI Support Site)

Step 6: Reshaping the data

Because our IATI CSV spreadsheet is really a ‘flattened’ version of a relational (multi-dimensional) dataset, some forms of analysis might need the data ‘re-shaped’ to expand particular dimensions and give each it’s own row.

For this we use the ‘Edit Cells’ > ‘Split Multi Value Cells’ and ‘Edit Cells’ > ‘Fill Down’ transformations.

For example, IATI data includes details of the ‘sectors’ which an aid activity is working in, and often a sector percentage, showing how much of the project is contributing to that sector. In our IATI spreadsheet these are each stored in a single cell separated by ‘;’. By choosing ‘Edit Cells’ > ‘Split Multi Value Cells’ and enter ‘;’ as the separator on the ‘sectors’, ‘sector_vocabularies’, ‘sector_codes’ and ‘sector_percentages’ columns we end up with a row for each sector for each activity, and the associated vocabularies, codes and percentages. We can do this because there will always be the same number of sectors, vocabularies, codes and percentages for each activity in our file. As there might be different numbers of values in other flattened columns, we can only easily ‘expand out’ our data on one dimension at a time. (Other dimensions in the IATI file we might want to expand include ‘transaction_values’ and other transaction columns, participating-orgs, and policy-markers).

We now have one row per-sector, but if you scroll across the table you will see that columns like ‘Commitment’, ‘IATI-Identifier’ and ‘Title’ are only displayed every couple of rows - just for the first sector in each set that we have expanded out. We need to use the ‘Edit Cells’ > ‘Fill Down’ command on each relevant column. This tells refine to copy the value from a row to all blank rows below, until it find a non-blank row, in which case it picks up that value and copies that down over blank rows below.

With our sectors expanded out and values filled down we can now create a simple text facet over the ‘Sectors’ column, and could choose to cluster the sectors if we wanted.

Step 7: Calculation

Expanding out sectors has created one challenge: we have duplicated the ‘commitments’ and other monetary values a number of times over - so if we were to export our data to a spreadsheet and run calculations we would end up with lots of double-counting. Fortunately, we have ‘sector percentages’ to show us how much of the commitments could, notionally, be placed in each row*. For this we’re going to run a calculation transformation.

First we need to make sure refine will treat ‘sector percentages’ as a number. For this we go to the column and choose ‘Edit Cells’ > ‘Common Transformations’ > ‘To Number’. We then take careful note of the ‘sector percentages’ column title (sector_percentages) and scroll to the ‘commitments’ column. Here we choose ‘Edit Cells’ > ‘Transform’ and enter the following formula:

   round(value*(cells["sector_percentages"].value/100))

which tells refine to multiply the current cell value by ‘sector_percentages’ as a number between 0 and 1, and then to round the result.

We could now find roughly how much money is being committed to each sector.

*Sector percentages for each activity should usually add up to 100%; however, some donors include multiple ‘sector vocabularies’, so check carefully whether this is the case as you may still need to filter your data with a facet on sector_vocabulary when exporting to involve double-counting.

Step 8: Fetching third-party data

Refine also has powerful features to fetch additional data from the web to complement the data our file contains. For example, looking up ISO country codes for countries mention in a dataset, or using ‘reconciliation’ services to find identifiers for entities like companies.

However, looking in depth at these features will have to wait for a future recipe.

Step 9: Making a meal of it

You have seen that you can explore large datasets in refine. Sometimes this can be enough for you to track down the small morsels of data you really want to make a full meal of. Other times you will want to take your refined data and use it in other recipe to visualise or analyse it in more depth.

Health Warnings

IATI Data comes from a range of different donors, and the accuracy, completeness and origin of data varies. Statistical calculations over the data should be treated with care.

Examples and variations

Add links to any examples of this recipe, or add notes on possible variations.

refine,, csv,, transformation

recipe/sliced_and_diced_aid_data_with_google_refine.txt · Last modified: 2011/07/02 22:45 by Tim Davies