Written by Sean Cascketta
Capital Metro introduced a new public API at the end of February 2015 for getting real-time positions of all their vehicles. With updates every one or two minutes, it's a half-decent data source for user-facing apps like Instabus, but it's not exactly "live".
I thought it might be useful to have a historical record of vehicle positions, thinking that perhaps it could be used to evaluate on-time performance or predict future arrival times, so I wrote a simple daemon to continuously archive these vehicle positions. After every day at midnight, the collected data is published to a repository on GitHub for anyone to use. Recently I spruced up the data and loaded it into a public dataset on Google BigQuery for anyone to query.
However, for the purpose of evaluating schedule deviation (the degree to which a vehicle is behind schedule or ahead of schedule), vehicle position data is incomplete on its own because there's no indication of when a vehicle is at a stop. For a given vehicle traveling on a route, we only have its position recorded every one or two minutes, and each of those positions is not necessarily at a stop. Fortunately, there's some additional info available in a format called GTFS which allows us to make a good guess.
From the GTFS overview:
The General Transit Feed Specification (GTFS) defines a common format for public transportation schedules and associated geographic information. GTFS "feeds" allow public transit agencies to publish their transit data and developers to write applications that consume that data in an interoperable way.
When Capital Metro switches to a new schedule (once or twice a year), they make a new GTFS feed available on the Austin Open Data portal. This GTFS feed has information on nearly everything including:
With this we can use a simple heuristic to determine if a vehicle is at a stop. If a bus is close to a stop (e.g. 250 meters) and we know that the stop is supposed to be visited during its trip, then we can assume it has arrived at that stop at that time. The schedule data in GTFS contains the time a bus is scheduled to arrive at each stop, and we can compare this to the observed arrival time to calculate the deviation from the schedule.
Now with all this data, processing it is actually a huge pain. It's small enough that I could load it into a single virtual machine's memory, but doing any non-trivial processing on 32M rows on a single instance unsurprisingly takes forever.
My shortcut for now is to use Google's analytics database called BigQuery. You can query tables using SQL and queries over huge datasets complete very quickly. Unfortunately, I'm new to analytics with SQL, so the queries I've written are pretty basic. I've made the dataset on BigQuery public and it costs nothing to query it with a Google Cloud account.
Undoubtedly, there are issues with this data. The GTFS data has errors, so for example, some trips visit a stop with no schedule listed, and some trips don't have a schedule associated with it. There are some trips in the vehicle positions which don't exist in the GTFS data between certain dates, or if the trips do exist, they're only valid for a different day of the week.
In addition, while evaluating the data in BigQuery, I noticed a bug in my code. Sometimes the part that finds scheduled arrival times would return the arrival times for the day before or after, resulting in a non-trivial number of misleading huge outliers in schedule deviation. This I can fix, but it takes a while to process all the data again. For now, I'm making due by filtering out records where the schedule deviation is greater than an hour (3600 seconds).
I've implied this above, but I want to make it clear that the schedule deviation is estimated based on the scheduled arrival time of a vehicle at the nearest stop on the vehicle's trip within 250 meters. At present, I do not have any reliable sources for the schedule deviation so it's hard to say how much error is introduced by this method of estimation.
Anyway, on to the code.
from __future__ import print_function
import pandas as pd
import plotly
from plotly import tools
from plotly.offline import iplot
from plotly import graph_objs as go
plotly.offline.init_notebook_mode()
project_id = 'YOUR-PROJECT-ID'
layout_theme = dict(paper_bgcolor='rgb(240, 240, 240)',
font={'color': '#444', 'family': 'Open sans, verdana, arial, sans-serif'},
plot_bgcolor='rgb(240, 240, 240)',
legend={'bordercolor': '#444', 'bgcolor': 'rgb(240, 240, 240)'})
q = '''
SELECT
route_id route,
COUNT(1) c
FROM
[capmetrics:capmetrics.vehicle_position]
GROUP BY
route
ORDER BY
c DESC'''
df = pd.read_gbq(q, project_id)
top = df.iloc[:20]
sc = go.Scatter(x=top.route, y=top.c, name='Average # of Trips', mode='lines+markers', line=dict(color='rgb(141, 200, 242)', width=3), marker=dict(size=10))
layout = go.Layout(xaxis={'title': 'Route', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
yaxis={'title': 'Number of records', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
title='<b>The 20 Most Active Routes</b><br><i style="font-size: 0.7em">Based on total number of records per route</i>',
**layout_theme)
fig = go.Figure(data=[sc], layout=layout)
iplot(fig)
q = '''
SELECT
COUNT(1) c,
DAYOFWEEK(timestamp) dayofweek
FROM
[capmetrics:capmetrics.vehicle_position]
GROUP BY
dayofweek
ORDER BY
dayofweek ASC'''
df = pd.read_gbq(q, project_id)
days = {1: 'Sunday', 2: 'Monday', 3: 'Tuesday', 4: 'Wednesday', 5: 'Thursday', 6: 'Friday', 7: 'Saturday'}
df = df.replace({'dayofweek': days})
sc = go.Scatter(x=df.dayofweek, y=df.c, name='# of Records', mode='lines+markers', fill='tozeroy', line=dict(color='rgb(166, 28, 0)', width=3), marker=dict(size=10))
layout = go.Layout(xaxis={'title': 'Day of the Week', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
yaxis={'title': 'Total number of records', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
title='<b>Activity by Day of the Week</b><br><i style="font-size: 0.7em">Based on number of records per day</i>',
**layout_theme)
fig = go.Figure(data=[sc], layout=layout)
iplot(fig)
# This is a little gnarly in order to account for the different UTC offsets for daylight savings, refactor suggestions welcome
q = '''
SELECT
cdt.hourofday hourofday, cdt.c + cst.c c
FROM (
SELECT
HOUR(DATE_ADD(timestamp, -5, "HOUR")) hourofday,
COUNT(1) c
FROM [capmetrics.vehicle_position]
WHERE
timestamp >= '2015-03-08 07:00'
AND timestamp <'2015-11-06 07:00'
GROUP BY hourofday
ORDER BY hourofday ASC) cdt
JOIN (
SELECT
HOUR(DATE_ADD(timestamp, -6, "HOUR")) hourofday,
COUNT(1) c
FROM [capmetrics.vehicle_position]
WHERE
timestamp < '2015-03-08 08:00'
GROUP BY hourofday
ORDER BY hourofday ASC) cst
ON
cst.hourofday = cdt.hourofday'''
df = pd.read_gbq(q, project_id)
sc = go.Scatter(x=df.hourofday, y=df.c, name='# of Records', mode='lines+markers', fill='tozeroy', line=dict(color='rgb(106, 168, 79)', width=3), marker=dict(size=10))
layout = go.Layout(xaxis={'title': 'Hour of day (local time in Austin)', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
yaxis={'title': 'Total number of records', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
title='<b>Activity by Hour of the Day</b><br><i style="font-size: 0.7em">Based on number of records per hour</i>',
**layout_theme)
fig = go.Figure(data=[sc], layout=layout)
iplot(fig)
q = '''
SELECT
COUNT(*) c
FROM [capmetrics.vehicle_position]
WHERE
stop_id IS NOT NULL AND
sched_dev IS NOT NULL AND
distance_to_stop < 0.25 AND
sched_dev > 0'''
df = pd.read_gbq(q, project_id)
print('{} vehicle positions loaded in BigQuery with schedule info and a stop on its trip within 250 meters.'.format(df.c.values[0]))
So about half of the vehicle positions are "complete" in the sense that they can be used to evaluate on-time performance. Moving on...
When I say schedule deviation, I'm referring to the difference between the actual and expected arrival time. When schedule deviation is > 0, the vehicle is behind schedule, when it's < 0, the vehicle is ahead of schedule.
q = '''
SELECT
route_id route,
AVG(sched_dev) mean_sched_dev,
NTH(501, QUANTILES(sched_dev, 1001)) med
FROM [capmetrics.vehicle_position]
WHERE
stop_id IS NOT NULL AND
sched_dev IS NOT NULL AND
distance_to_stop < 0.25 AND
sched_dev > 0 AND
sched_dev < 3600
GROUP BY route
ORDER BY med DESC'''
df = pd.read_gbq(q, project_id)
top = df.iloc[:20]
mean = go.Bar(x=top.route, y=top.mean_sched_dev, name='Mean', marker=dict(color='rgb(74, 134, 232)'))
median = go.Bar(x=top.route, y=top.med, name='Median', marker=dict(color='rgb(228, 105, 37)'))
layout = go.Layout(barmode='group',
xaxis={'title': 'Route', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
yaxis={'title': '<b>Schedule Deviation</b><br><i style="font-size: 0.9em">In seconds, smaller is better</i>', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
title='<b>The 20 Most Delayed Routes</b><br><i style="font-size: 0.7em">Based on deviation from scheduled arrival times</i>',
**layout_theme)
fig = go.Figure(data=[median, mean], layout=layout)
iplot(fig)
bottom = df.sort_values(by='med', ascending=True).iloc[:20]
mean = go.Bar(x=bottom.route, y=bottom.mean_sched_dev, name='Mean', marker=dict(color='rgb(74, 134, 232)'))
median = go.Bar(x=bottom.route, y=bottom.med, name='Median', marker=dict(color='rgb(228, 105, 37)'))
layout = go.Layout(barmode='group',
xaxis={'title': 'Route', 'type': 'category', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
yaxis={'title': '<b>Schedule Deviation</b><br><i style="font-size: 0.9em">In seconds, smaller is better</i>', 'showgrid': 'true', 'gridcolor': 'rgb(205, 205, 205)'},
title='<b>The 20 Least Delayed Routes</b><br><i style="font-size: 0.7em">Based on deviation from scheduled arrival times</i>',
**layout_theme)
fig = go.Figure(data=[median, mean], layout=layout)
iplot(fig)
q = '''
SELECT
route_id,
DAYOFWEEK(timestamp) dayofweek,
HOUR(DATE_ADD(timestamp, -6, "HOUR")) hourofday,
NTH(501, QUANTILES(sched_dev, 1001)) med
FROM
[capmetrics.complete_vp_view]
GROUP BY
hourofday, dayofweek, route_id
ORDER BY route_id ASC, dayofweek ASC, hourofday ASC
'''
df = pd.read_gbq(q, project_id)
df.replace({'dayofweek': days}, inplace=True)
def get_heatmap(route_id, df):
route = df.loc[df.route_id == route_id]
z = [route.loc[route.hourofday == hour].med.tolist() for hour in route.hourofday.unique()]
x = route.dayofweek.unique()
y = sorted(route.hourofday.unique().tolist())
text = [map(lambda s: str(s) + ' seconds', route.loc[route.hourofday == hour].med.tolist()) for hour in route.hourofday.unique()]
return go.Heatmap(z=z, x=x, y=y, text=text, colorscale='Jet', zauto=False, zmin=0, zmax=400, colorbar=dict(title='Schedule Deviation (in seconds)'))
fig = tools.make_subplots(rows=2, cols=2, subplot_titles=('Route 801', 'Route 1', 'Route 803', 'Route 3'))
fig.append_trace(get_heatmap(801, df), 1, 1)
fig.append_trace(get_heatmap(1, df), 1, 2)
fig.append_trace(get_heatmap(803, df), 2, 1)
fig.append_trace(get_heatmap(3, df), 2, 2)
fig['layout']['yaxis1'].update(title='Hour of Day')
fig['layout']['yaxis2'].update(title='Hour of Day')
fig['layout']['yaxis3'].update(title='Hour of Day')
fig['layout']['yaxis4'].update(title='Hour of Day')
for k, v in layout_theme.items():
fig['layout'][k] = v
fig['layout'].update(height=800, width=1000, title='Median Schedule Deviation across Select MetroRapid/Local Routes by Day of the Week and Hour')
iplot(fig)