A Look at 10 Months of CapMetro Vehicle Position Data

Written by Sean Cascketta

Capital Metro introduced a new public API at the end of February 2015 for getting real-time positions of all their vehicles. With updates every one or two minutes, it's a half-decent data source for user-facing apps like Instabus, but it's not exactly "live".

I thought it might be useful to have a historical record of vehicle positions, perhaps to evaluate on-time performance or predict future arrival times, so I wrote a simple daemon to continuously archive these vehicle positions. Every day at midnight, the collected data is published to a repository on GitHub for anyone to use. Recently I spruced up the data and loaded it into a public dataset on Google BigQuery for anyone to query.

However, for the purpose of evaluating schedule deviation (the degree to which a vehicle is behind schedule or ahead of schedule), vehicle position data is incomplete on its own because there's no indication of when a vehicle is at a stop. For a given vehicle traveling on a route, we only have its position recorded every one or two minutes, and each of those positions is not necessarily at a stop. Fortunately, there's some additional info available in a format called GTFS which allows us to make a good guess.


From the GTFS overview:

The General Transit Feed Specification (GTFS) defines a common format for public transportation schedules and associated geographic information. GTFS "feeds" allow public transit agencies to publish their transit data and developers to write applications that consume that data in an interoperable way.

When Capital Metro switches to a new schedule (once or twice a year), they make a new GTFS feed available on the Austin Open Data portal. This GTFS feed has information on nearly everything including:

  • Schedules - the time a vehicle will arrive at a stop (depending on the day of the week)
  • Stops - locations of stops, which routes use which stops
  • Trips - which stops a vehicle will visit during a trip, which trip a vehicle will take on a certain day of the week
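To give a rough idea of what the schedule part looks like in practice, here is a minimal sketch that reads rows shaped like a GTFS `stop_times.txt` file and builds a lookup of scheduled arrival times. The file and column names come from the GTFS specification, but the sample rows, trip IDs and stop IDs are made up, and the helper names are just for illustration. Note that GTFS times can go past `24:00:00` for trips that run after midnight, and that `arrival_time` may be blank for some stops.

```python
import csv
import io

# Made-up sample in the shape of a GTFS stop_times.txt file.
STOP_TIMES_CSV = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:00,S1,1
T1,08:07:30,08:07:30,S2,2
T1,,,S3,3
T2,24:05:00,24:05:00,S1,1
"""


def gtfs_time_to_seconds(value):
    """Convert a GTFS 'HH:MM:SS' time to seconds after the start of the service day.

    Hours may be 24 or more for trips that continue past midnight.
    """
    hours, minutes, seconds = (int(part) for part in value.strip().split(':'))
    return hours * 3600 + minutes * 60 + seconds


def load_scheduled_arrivals(stop_times_file):
    """Return a dict mapping (trip_id, stop_id) to the scheduled arrival in seconds.

    Rows with a blank arrival_time are skipped. This simplification also assumes
    a trip visits each stop at most once, which is not true for loop routes.
    """
    arrivals = {}
    for row in csv.DictReader(stop_times_file):
        if not row['arrival_time'].strip():
            continue
        key = (row['trip_id'], row['stop_id'])
        arrivals[key] = gtfs_time_to_seconds(row['arrival_time'])
    return arrivals


arrivals = load_scheduled_arrivals(io.StringIO(STOP_TIMES_CSV))
print(arrivals[('T1', 'S2')])
```

Keeping times as seconds after the start of the service day makes it easy to subtract a scheduled arrival from an observed one later on.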

With this we can use a simple heuristic to determine if a vehicle is at a stop. If a bus is close to a stop (e.g. within 250 meters) and we know that the stop is supposed to be visited during its trip, then we can assume it has arrived at that stop at that time. The schedule data in GTFS contains the time a bus is scheduled to arrive at each stop, and we can compare this to the observed arrival time to calculate the deviation from the schedule.
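The heuristic can be sketched roughly as follows. This is not the code the daemon actually runs, just an illustration under a few assumptions: distances are great-circle (haversine) distances, times are seconds after the start of the service day, and a positive deviation means the vehicle is late. The function names, the shape of `trip_stops` and the sample coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters
ARRIVAL_RADIUS_M = 250.0    # "close to a stop" threshold from the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_trip_stop(lat, lon, trip_stops, radius_m=ARRIVAL_RADIUS_M):
    """Return the closest stop on the trip within radius_m, or None if there is none."""
    best_stop, best_dist = None, None
    for stop in trip_stops:
        dist = haversine_m(lat, lon, stop['lat'], stop['lon'])
        if dist <= radius_m and (best_dist is None or dist < best_dist):
            best_stop, best_dist = stop, dist
    return best_stop


def schedule_deviation(observed_secs, lat, lon, trip_stops):
    """Observed minus scheduled arrival in seconds, or None if no stop is close enough."""
    stop = nearest_trip_stop(lat, lon, trip_stops)
    if stop is None:
        return None
    return observed_secs - stop['arrival_secs']


# Made-up stops for one trip: S1 scheduled at 08:00:00, S2 at 08:07:30.
trip_stops = [
    {'stop_id': 'S1', 'lat': 30.2672, 'lon': -97.7431, 'arrival_secs': 28800},
    {'stop_id': 'S2', 'lat': 30.2700, 'lon': -97.7431, 'arrival_secs': 29250},
]

# A vehicle seen a few meters from S2 at 08:09:30.
deviation = schedule_deviation(29370, 30.2699, -97.7431, trip_stops)
print(deviation)
```

Because positions only arrive every minute or two, the observed time is really "the first time we saw the vehicle near the stop", so the result is an estimate rather than a measurement.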

Trying out BigQuery

Processing all this data is actually a huge pain. It's small enough that I could load it into a single virtual machine's memory, but any non-trivial processing of 32M rows on a single instance unsurprisingly takes forever.

My shortcut for now is to use BigQuery, Google's analytics database. You can query tables using SQL, and queries over huge datasets complete very quickly. Unfortunately, I'm new to analytics with SQL, so the queries I've written are pretty basic. I've made the dataset on BigQuery public and it costs nothing to query it with a Google Cloud account.

Caveats - Missing Data, Bugs, and Sources of Error

Undoubtedly, there are issues with this data. The GTFS data has errors: for example, some trips visit a stop with no schedule listed, and some trips don't have a schedule associated with them. Some trips in the vehicle positions don't exist in the GTFS data between certain dates, or if they do exist, they're only valid for a different day of the week.

In addition, while evaluating the data in BigQuery, I noticed a bug in my code. Sometimes the part that finds scheduled arrival times would return the arrival times for the day before or after, resulting in a non-trivial number of misleading huge outliers in schedule deviation. This I can fix, but it takes a while to process all the data again. For now, I'm making do by filtering out records where the schedule deviation is greater than an hour (3600 seconds).
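In pandas terms, that stopgap filter might look something like the sketch below. The column name `sched_dev` and the sample values are made up for illustration, and I'm assuming the cutoff applies in both directions (more than an hour late or more than an hour early), since an arrival time taken from the wrong day can be off either way.

```python
import pandas as pd

MAX_ABS_DEVIATION_SECS = 3600  # one hour


def drop_deviation_outliers(df, column='sched_dev', limit=MAX_ABS_DEVIATION_SECS):
    """Keep only rows whose schedule deviation is within +/- limit seconds.

    Rows with a missing deviation are dropped as well, because comparisons
    against NaN evaluate to False.
    """
    return df[df[column].abs() <= limit].reset_index(drop=True)


# Made-up sample: two plausible deviations and two "wrong day" outliers.
sample = pd.DataFrame({
    'vehicle_id': ['A', 'B', 'C', 'D'],
    'sched_dev': [120, -45, 86520, -86280],
})

filtered = drop_deviation_outliers(sample)
print(filtered)
```

This hides the symptom rather than fixing the cause, so any genuine delays longer than an hour are thrown away along with the bogus ones.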

I've implied this above, but I want to make it clear that the schedule deviation is estimated based on the scheduled arrival time of a vehicle at the nearest stop on the vehicle's trip within 250 meters. At present, I do not have any reliable sources for the schedule deviation so it's hard to say how much error is introduced by this method of estimation.

Anyway, on to the code.

Basic Questions

In [1]:
from __future__ import print_function

import pandas as pd

import plotly
from plotly import tools
from plotly.offline import iplot
from plotly import graph_objs as go

# Replace with your own Google Cloud project ID; queries run under this project.
project_id = 'YOUR-PROJECT-ID'

layout_theme = dict(paper_bgcolor='rgb(240, 240, 240)',
                    font={'color': '#444', 'family': 'Open sans, verdana, arial, sans-serif'},
                    plot_bgcolor='rgb(240, 240, 240)',
                    legend={'bordercolor': '#444', 'bgcolor': 'rgb(240, 240, 240)'})