A Look at 10 Months of CapMetro Vehicle Position Data

Written by Sean Cascketta

Capital Metro introduced a new public API at the end of February 2015 for getting real-time positions of all their vehicles. With updates every one or two minutes, it's a half-decent data source for user-facing apps like Instabus, but it's not exactly "live".

I thought it might be useful to have a historical record of vehicle positions, thinking that perhaps it could be used to evaluate on-time performance or predict future arrival times, so I wrote a simple daemon to continuously archive these vehicle positions. After every day at midnight, the collected data is published to a repository on GitHub for anyone to use. Recently I spruced up the data and loaded it into a public dataset on Google BigQuery for anyone to query.

However, for the purpose of evaluating schedule deviation (the degree to which a vehicle is behind schedule or ahead of schedule), vehicle position data is incomplete on its own because there's no indication of when a vehicle is at a stop. For a given vehicle traveling on a route, we only have its position recorded every one or two minutes, and each of those positions is not necessarily at a stop. Fortunately, there's some additional info available in a format called GTFS which allows us to make a good guess.

GTFS

From the GTFS overview:

The General Transit Feed Specification (GTFS) defines a common format for public transportation schedules and associated geographic information. GTFS "feeds" allow public transit agencies to publish their transit data and developers to write applications that consume that data in an interoperable way.

When Capital Metro switches to a new schedule (once or twice a year), they make a new GTFS feed available on the Austin Open Data portal. This GTFS feed has information on nearly everything including:

  • Schedules - the time a vehicle will arrive at a stop (depending on the day of the week)
  • Stops - locations of stops, which routes use which stops
  • Trips - which stops a vehicle will visit during a trip, which trip a vehicle will take on a certain day of the week

With this we can use a simple heuristic to determine if a vehicle is at a stop. If a bus is close to a stop (e.g. 250 meters) and we know that the stop is supposed to be visited during its trip, then we can assume it has arrived at that stop at that time. The schedule data in GTFS contains the time a bus is scheduled to arrive at each stop, and we can compare this to the observed arrival time to calculate the deviation from the schedule.

Trying out BigQuery

Now with all this data, processing it is actually a huge pain. It's small enough that I could load it into a single virtual machine's memory, but doing any non-trivial processing on 32M rows on a single instance unsurprisingly takes forever.

My shortcut for now is to use Google's analytics database called BigQuery. You can query tables using SQL and queries over huge datasets complete very quickly. Unfortunately, I'm new to analytics with SQL, so the queries I've written are pretty basic. I've made the dataset on BigQuery public and it costs nothing to query it with a Google Cloud account.

Caveats - Missing Data, Bugs, and Sources of Error

Undoubtedly, there are issues with this data. The GTFS data has errors, so for example, some trips visit a stop with no schedule listed, and some trips don't have a schedule associated with it. There are some trips in the vehicle positions which don't exist in the GTFS data between certain dates, or if the trips do exist, they're only valid for a different day of the week.

In addition, while evaluating the data in BigQuery, I noticed a bug in my code. Sometimes the part that finds scheduled arrival times would return the arrival times for the day before or after, resulting in a non-trivial number of misleading huge outliers in schedule deviation. This I can fix, but it takes a while to process all the data again. For now, I'm making due by filtering out records where the schedule deviation is greater than an hour (3600 seconds).

I've implied this above, but I want to make it clear that the schedule deviation is estimated based on the scheduled arrival time of a vehicle at the nearest stop on the vehicle's trip within 250 meters. At present, I do not have any reliable sources for the schedule deviation so it's hard to say how much error is introduced by this method of estimation.

Anyway, on to the code.

Basic Questions

In [1]:
from __future__ import print_function

import pandas as pd

import plotly
from plotly import tools
from plotly.offline import iplot
from plotly import graph_objs as go
plotly.offline.init_notebook_mode()

project_id = 'YOUR-PROJECT-ID'

layout_theme = dict(paper_bgcolor='rgb(240, 240, 240)',
                    font={'color': '#444', 'family': 'Open sans, verdana, arial, sans-serif'},
                    plot_bgcolor='rgb(240, 240, 240)',
                    legend={'bordercolor': '#444', 'bgcolor': 'rgb(240, 240, 240)'})

Using the average count of unique trips as a measure of activity, which routes are most active?

In [2]:
q = '''
SELECT
  route_id route,
  COUNT(1) c
FROM
  [capmetrics:capmetrics.vehicle_position]
GROUP BY
  route
ORDER BY
  c DESC'''

df = pd.read_gbq(q, project_id)
Waiting for job to complete...
In [3]:
top = df.iloc[:20]
sc = go.Scatter(x=top.route, y=top.c, name='Average # of Trips', mode='lines+markers', line=dict(color='rgb(141, 200, 242)', width=3), marker=dict(size=10))
layout = go.Layout(xaxis={'title': 'Route', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   yaxis={'title': 'Number of records', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   title='<b>The 20 Most Active Routes</b><br><i style="font-size: 0.7em">Based on total number of records per route</i>',
                   **layout_theme)
fig = go.Figure(data=[sc], layout=layout)
iplot(fig)

Using the count of unique trips as a measure of activity, which day of the week is most active?

In [4]:
q = '''
SELECT
  COUNT(1) c,
  DAYOFWEEK(timestamp) dayofweek
FROM
  [capmetrics:capmetrics.vehicle_position]
GROUP BY
  dayofweek
ORDER BY
  dayofweek ASC'''

df = pd.read_gbq(q, project_id)
days = {1: 'Sunday', 2: 'Monday', 3: 'Tuesday', 4: 'Wednesday', 5: 'Thursday', 6: 'Friday', 7: 'Saturday'}
df = df.replace({'dayofweek': days})
Waiting for job to complete...
In [5]:
sc = go.Scatter(x=df.dayofweek, y=df.c, name='# of Records', mode='lines+markers', fill='tozeroy', line=dict(color='rgb(166, 28, 0)', width=3), marker=dict(size=10))
layout = go.Layout(xaxis={'title': 'Day of the Week', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   yaxis={'title': 'Total number of records', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   title='<b>Activity by Day of the Week</b><br><i style="font-size: 0.7em">Based on number of records per day</i>',
                   **layout_theme)
fig = go.Figure(data=[sc], layout=layout)
iplot(fig)
In [6]:
# This is a little gnarly in order to account for the different UTC offsets for daylight savings, refactor suggestions welcome
q = '''
SELECT
  cdt.hourofday hourofday, cdt.c + cst.c c
FROM (
  SELECT
    HOUR(DATE_ADD(timestamp, -5, "HOUR")) hourofday,
    COUNT(1) c
  FROM [capmetrics.vehicle_position]
  WHERE
    timestamp >= '2015-03-08 07:00'
    AND timestamp <'2015-11-06 07:00'
  GROUP BY hourofday
  ORDER BY hourofday ASC) cdt
JOIN (
  SELECT
    HOUR(DATE_ADD(timestamp, -6, "HOUR")) hourofday,
    COUNT(1) c
  FROM [capmetrics.vehicle_position]
  WHERE
    timestamp < '2015-03-08 08:00'
  GROUP BY hourofday
  ORDER BY hourofday ASC) cst
ON
  cst.hourofday = cdt.hourofday'''

df = pd.read_gbq(q, project_id)
Waiting for job to complete...
In [7]:
sc = go.Scatter(x=df.hourofday, y=df.c, name='# of Records', mode='lines+markers', fill='tozeroy', line=dict(color='rgb(106, 168, 79)', width=3), marker=dict(size=10))
layout = go.Layout(xaxis={'title': 'Hour of day (local time in Austin)', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   yaxis={'title': 'Total number of records', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   title='<b>Activity by Hour of the Day</b><br><i style="font-size: 0.7em">Based on number of records per hour</i>',
                   **layout_theme)
fig = go.Figure(data=[sc], layout=layout)
iplot(fig)

How many vehicle positions have schedule info and a stop on its trip within 250 meters?

In [8]:
q = '''
SELECT
  COUNT(*) c
FROM [capmetrics.vehicle_position]
WHERE
  stop_id IS NOT NULL AND
  sched_dev IS NOT NULL AND
  distance_to_stop < 0.25 AND
  sched_dev > 0'''

df = pd.read_gbq(q, project_id)
print('{} vehicle positions loaded in BigQuery with schedule info and a stop on its trip within 250 meters.'.format(df.c.values[0]))
Waiting for job to complete...
17137418 vehicle positions loaded in BigQuery with schedule info and a stop on its trip within 250 meters.

So about half of the vehicle positions are "complete" in the sense that they can be used to evaluate on-time performance. Moving on...

Of the complete vehicle positions, which routes have the most positive deviation from the schedule?

When I say schedule deviation, I'm referring to the difference between the actual and expected arrival time. When schedule deviation is > 0, the vehicle is behind schedule, when it's < 0, the vehicle is ahead of schedule.

In [9]:
q = '''
SELECT
  route_id route,
  AVG(sched_dev) mean_sched_dev,
  NTH(501, QUANTILES(sched_dev, 1001)) med
FROM [capmetrics.vehicle_position]
WHERE
  stop_id IS NOT NULL AND
  sched_dev IS NOT NULL AND
  distance_to_stop < 0.25 AND
  sched_dev > 0 AND
  sched_dev < 3600
GROUP BY route
ORDER BY med DESC'''

df = pd.read_gbq(q, project_id)
Waiting for job to complete...
In [10]:
top = df.iloc[:20]
mean = go.Bar(x=top.route, y=top.mean_sched_dev, name='Mean', marker=dict(color='rgb(74, 134, 232)'))
median = go.Bar(x=top.route, y=top.med, name='Median', marker=dict(color='rgb(228, 105, 37)'))
layout = go.Layout(barmode='group',
                   xaxis={'title': 'Route', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   yaxis={'title': '<b>Schedule Deviation</b><br><i style="font-size: 0.9em">In seconds, smaller is better</i>', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   title='<b>The 20 Most Delayed Routes</b><br><i style="font-size: 0.7em">Based on deviation from scheduled arrival times</i>',
                   **layout_theme)
fig = go.Figure(data=[median, mean], layout=layout)
iplot(fig)
In [11]:
bottom = df.sort_values(by='med', ascending=True).iloc[:20]
mean = go.Bar(x=bottom.route, y=bottom.mean_sched_dev, name='Mean', marker=dict(color='rgb(74, 134, 232)'))
median = go.Bar(x=bottom.route, y=bottom.med, name='Median', marker=dict(color='rgb(228, 105, 37)'))
layout = go.Layout(barmode='group',
                   xaxis={'title': 'Route', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   yaxis={'title': '<b>Schedule Deviation</b><br><i style="font-size: 0.9em">In seconds, smaller is better</i>', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
                   title='<b>The 20 Least Delayed Routes</b><br><i style="font-size: 0.7em">Based on deviation from scheduled arrival times</i>',
                   **layout_theme)
fig = go.Figure(data=[median, mean], layout=layout)
iplot(fig)

How does schedule variation vary by certain routes, the day of the week, and the hour of the day?

In [12]:
q = '''
SELECT
  route_id,
  DAYOFWEEK(timestamp) dayofweek,
  HOUR(DATE_ADD(timestamp, -6, "HOUR")) hourofday,
  NTH(501, QUANTILES(sched_dev, 1001)) med
FROM
  [capmetrics.complete_vp_view]
GROUP BY
  hourofday, dayofweek, route_id
ORDER BY route_id ASC, dayofweek ASC, hourofday ASC
'''
df = pd.read_gbq(q, project_id)
df.replace({'dayofweek': days}, inplace=True)
Waiting for job to complete...
In [13]:
def get_heatmap(route_id, df):
    route = df.loc[df.route_id == route_id]
    z = [route.loc[route.hourofday == hour].med.tolist() for hour in route.hourofday.unique()]
    x = route.dayofweek.unique()
    y = sorted(route.hourofday.unique().tolist())
    text = [map(lambda s: str(s) + ' seconds', route.loc[route.hourofday == hour].med.tolist()) for hour in route.hourofday.unique()]
    return go.Heatmap(z=z, x=x, y=y, text=text, colorscale='Jet', zauto=False, zmin=0, zmax=400, colorbar=dict(title='Schedule Deviation (in seconds)'))
In [14]:
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=('Route 801', 'Route 1', 'Route 803', 'Route 3'))
fig.append_trace(get_heatmap(801, df), 1, 1)
fig.append_trace(get_heatmap(1, df), 1, 2)
fig.append_trace(get_heatmap(803, df), 2, 1)
fig.append_trace(get_heatmap(3, df), 2, 2)

fig['layout']['yaxis1'].update(title='Hour of Day')
fig['layout']['yaxis2'].update(title='Hour of Day')
fig['layout']['yaxis3'].update(title='Hour of Day')
fig['layout']['yaxis4'].update(title='Hour of Day')
for k, v in layout_theme.items():
    fig['layout'][k] = v
fig['layout'].update(height=800, width=1000, title='Median Schedule Deviation across Select MetroRapid/Local Routes by Day of the Week and Hour')

iplot(fig)
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]