Wrangling Edinburgh Bike Counter Data

Over the course of the last nine years, the City of Edinburgh Council has installed automatic bike counters across the city, most of them on the network of off-road bike paths. The bike counters themselves are usually pretty unobtrusive and not designed for public display of counts, unlike for example this Sustrans counter on Middle Meadow Walk:

In fact, a few meters down from the public counter, you can also see a Council bike counter that looks like this:

And here is a picture of the internals of one of the counters:

For a number of years now, I (EK) have been intermittently trying to get hold of the data collected by the counters, and last year we finally made some progress, with help from Chris Brace and Alan Tinto.

In Edinburgh Cycling Data, we described our goal of bringing together different sources of cycling data in Edinburgh. A big component of this was getting access to previously unpublished data collected by the City of Edinburgh Council’s off-road bike path counters. A happy outcome of our work is that this data has now been published on the Council’s open data portal. In this post, we will give an overview of the steps that were involved in getting the data to state where it could be published as open data, and we’ll also look at some information that can be gleaned from the data.

Extract-Transform-Load

Extract-Transform-Load refers to the process of extracting data from a variety of sources, transforming into a consistent format, and loading it into a target database.

Extract: Our initial dataset was collected from 19 different counters installed on off-road cycling paths in Edinburgh over the period from 2008 to 2013. It contains information about the counters themselves, including their name and ID, their location coordinates and the number of channels. Each counter currently has two or four channels (which are oriented north, south, east or west), and cycle counts are collected on an hourly basis for each individual channel. Because the counters were installed gradually over the relevant period, and because battery failures intermittently prevented data collection, there are many gaps in coverage. The datasets were made available in both .csv and .xls formats, and exhibited somewhat heterogeneous structures, sometimes being split up into week-by-week tables, sometimes further split by channel, and more recently provided as one single table.

After initial work on this dataset, we received further data from the Council gathered since 2014 from an additional 29 counters, both off-road and on-road. As with the previous dataset, the new data was heterogeneous and incomplete, and only 27 of the counters were associated with bike counts.

Transform and Load: In order to facilitate subsequent analysis of the data, we decided to load it all into a relational database (using SQLite). The transformation step involved writing parsers that would recognise the different structures in the input data, and output them into a new structure which would allow them to be loaded into the new database.

Database structure consisting of five tables

Data Cleaning

Although the Transform step involved normalisation of data fields (e.g., imposing a uniform format for dates and times), there remained a number of problematic data values. For example, one counter recorded multiple counts of the form 9999, which we interpreted as an error. Following this, we examined outlier data value in the form of ‘flat regions’ (i.e., consecutive days with zero counts) and spikes