Thursday, January 7, 2016

What time of day do Portlanders get in the most accidents?

Everyone thinks that the city they live in has the worst drivers.  I've never been to a city where the
locals didn't warn me how awful the other local drivers are.

While I was in Oregon for the holidays, a friend mentioned to me how awful drivers are in Oregon.  Overall, they actually seemed incredibly polite.  Compared to Houston drivers, whom are all in a massive hurry to get places, Oregonians seemed to be relaxed.  One thing that really struck me was how they all seemed to maintain four car lengths between others on the freeway.  As a practiced Houston driver, it was honestly a little disconcerting.

One thing that was mentioned did seem to make some sense.  It was claimed that Oregon drivers are at their worst during the first rain after a long dry spell.  Because it rains so often, they forget that oil can build up on roads and the first rain can make the roads slick.

With the help of data from the USGS and a twitter feed of police dispatch events maintained by the Portland police department Brandan from cad.oregon911.com, hopefully I can figure out if they really are that bad (at least, in the Portland area).

To get the ball rolling, this first post is simply about how I gathered the data I'll be using and answers the much simpler question of "What time of day is the worst for accidents in Portland"

Getting the Rain Data

Since the USGS rainfall data for Portland is split into 53 files (separated by reading location), I felt like it would be faster to just throw a script together to grab them all.

The script below grabs the main page, iterates through all of the links on that page, and then saves them locally if the link points to a *.rain file (the extension used for the precipitation data).

import requests
from bs4 import BeautifulSoup

req = requests.get('http://or.water.usgs.gov/precip/')
html_doc = req.text
soup = BeautifulSoup(html_doc, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href')
    if href.endswith('.rain'):
        print(href)
        page = requests.get('http://or.water.usgs.gov/precip/'+href)
        with open('../data/rainfall/'+href, 'w') as fd:
            fd.write(page.text) 

So that was easy enough.  This took a little while to run, but when it was finished I had hourly rainfall information for 53 locations in Portland, going back in some cases to 2000.

Getting the Accident Data

This ended up being far trickier than I thought it would be.  My initial idea was to use the @pdxpolicelog twitter account to grab all of the accident reports they had done since the account opened.  Sadly, the twitter API only allows you to grab the last 3200 tweets from an account, which meant I only had accident data for December 6th-10th (314 accidents total).

Since that date range wouldn't be nearly large enough to draw any sort of conclusion, I had to find an alternative source.

It appears that the twitter feed may have been consuming this civicapps data.  Sadly, that data feed stopped updating around December 10th and is currently returning an empty file.

While the Portland Police Department's website has some interesting crime statistics, they did not have anything for accidents.  Visiting the Multnomah County Sheriff website didn't provide anything useful at all...

Eventually, I found daily accident totals here

I was hoping to get more granular than just daily totals, but I'll work with what I have for now. After talking to Brandan, who runs cad.oregon911.com, I was able to get a CSV file containing two years worth of accident data (over 10,000 accidents!).  I'm incredibly grateful to him for the help.

So anyway...

What time of day do accidents occur most frequently?

Before I dive into my main question, I wanted to play with my new data a little bit and get comfortable with it.  The first question I could think to ask was "what time of the day do most accidents occur?"

A quick query gave me the information I was interested in:

SELECT
  strftime('%H:00:00', timestamp) as hour,
  count(*)
FROM traffic_incidents
WHERE
  timestamp >= "2014-01-01 00:00:01"
GROUP BY hour
ORDER BY hour;


This isn't a very interesting question and the graph above follows a trend that seems pretty logical.  There are fairly few accidents that happen early in the morning.  The first spike is at 7:00am and, I assume, corresponds with morning rush hour.  Then there is a small drop off until lunch when the accident totals begin to climb until the 5:00pm rush hour, where it peaks.

The raw numbers for this graph are below:

Hour Number of Accidents (Since January 1st 2014)
00:00161
01:00120
02:00130
03:00109
04:0077
05:00147
06:00307
07:00515
08:00444
09:00391
10:00381
11:00542
12:00618
13:00700
14:00695
15:00838
16:00863
17:001070
18:00779
19:00512
20:00381
21:00370
22:00269
23:00197


So there you go. Nothing particularly exciting in this post. In my next post I'll explore the rain data and hopefully do some correlation between the two datasets.