First Steps into Data Scraping

There is historical weather data available for download on a Government of Canada website. One click will provide you with a CSV file containing one year's worth of daily data from one monitoring station.

weather.png

I want many years' worth of data. And what if I want the data from several stations? This could get very tedious…

Enter Selenium, a bridge between computer programming and surfing the web. Using Selenium I can write a very simple program to automate the task of navigating a website, clicking a button, and downloading some data.

The first few steps went very smoothly: import the package, set up the browser, and direct it to a website. (I installed Selenium using pip. Conda didn’t have it, maybe because I’m using 32-bit conda.)

from selenium import webdriver

# Open a new Firefox window controlled by Selenium
browser = webdriver.Firefox()

# Load the daily-data page for one station and one year
browser.get('http://climate.weather.gc.ca/climateData/dailydata_e.html?timeframe=2&Prov=ON&StationID=5097&dlyRange=1937-11-01|2013-06-12&Year=1932&Month=1&Day=01')

Running this code made a new Firefox window open up, and it directed itself to the URL I specified. I must say this was a whole new experience for me, and I was thrilled.

I noticed that the URL for the page has a ‘Year’ field, which means I can navigate to the page for each year of interest simply by changing that value in the URL.

# The URL is split around the Year value so each year can be slotted in
a = 'http://climate.weather.gc.ca/climateData/dailydata_e.html?timeframe=2&Prov=ON&StationID=5097&dlyRange=1937-11-01|2013-06-12&Year='
c = '&Month=1&Day=01'
for i in range(1953, 2013):  # 1953 through 2012
    b = str(i)
    d = a + b + c
    browser.get(d)
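The same URL construction can also be sketched with the standard library, which makes it easier to swap in other stations later. This is only a sketch in Python 3: `build_url`, `BASE_URL`, and the `station_id` and `prov` parameters are names made up for illustration, and whether the site accepts a different `StationID` alongside the same `dlyRange` value is an untested assumption.

```python
from urllib.parse import urlencode

# Page address taken from the URL used above
BASE_URL = 'http://climate.weather.gc.ca/climateData/dailydata_e.html'


def build_url(year, station_id=5097, prov='ON'):
    """Return the daily-data page URL for one station and one year (hypothetical helper)."""
    params = [
        ('timeframe', 2),
        ('Prov', prov),
        ('StationID', station_id),
        ('dlyRange', '1937-11-01|2013-06-12'),  # copied as-is from the original URL
        ('Year', year),
        ('Month', 1),
        ('Day', '01'),
    ]
    # safe='|' leaves the pipe in dlyRange unescaped, matching the original URL
    return BASE_URL + '?' + urlencode(params, safe='|')


# One URL per year, 1953 through 2012
urls = [build_url(year) for year in range(1953, 2013)]
```

Each entry in `urls` could then be passed to `browser.get()` in place of the hand-concatenated string.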

Now I want to click the button and download the CSV file. The first step is to identify the element on the page, and then have Selenium click it.

Right-clicking on the button produces a menu with an option to “inspect element”. Inspecting the element shows that the “name” of this button is “submit”. So now just select that element and click it.

# Find the button whose name attribute is "submit" and click it
browser.find_element_by_name('submit').click()

I now get a pop-up asking me whether I want to open this file or save it. Amazing! But this is a problem. I don’t want to have to deal with this pop-up for each download. Firefox offers “Always do the same with files of this type?”, but each time the script runs, the browser it opens starts with no saved preferences, so the choice would need to be made again every time.

Google directed me to several pages that discuss how to set the Firefox preferences from the script itself (One, Another one), but these didn’t work for me.
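For reference, the general approach those pages describe looks roughly like the sketch below. It assumes a recent Selenium release that provides `webdriver.FirefoxOptions`. The helper names `firefox_download_prefs` and `make_firefox` are made up for illustration, and the `text/csv` MIME type is an assumption: the setting only takes effect if it matches the content type the server actually sends, which is not verified here.

```python
def firefox_download_prefs(download_dir):
    """Firefox preferences commonly suggested for skipping the download dialog."""
    return {
        # 2 = save into the directory named by browser.download.dir
        'browser.download.folderList': 2,
        'browser.download.dir': download_dir,
        # MIME types to save without asking ('text/csv' is an assumption)
        'browser.helperApps.neverAsk.saveToDisk': 'text/csv',
    }


def make_firefox(download_dir):
    """Start Firefox with the preferences above applied (hypothetical helper)."""
    # Imported here so the preference dict can be inspected without Selenium installed
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    for name, value in firefox_download_prefs(download_dir).items():
        options.set_preference(name, value)
    return webdriver.Firefox(options=options)
```

Calling `make_firefox('/some/download/folder')` would replace the plain `webdriver.Firefox()` call, with the folder path being whatever directory the CSV files should land in.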

I decided to see whether Selenium works with the Chrome browser, and whether I would have the same issue. Selenium for Chrome requires a separate program called chromedriver. At first it didn’t work: I kept getting an error about my PATH. I tried adding the directory where the chromedriver file was located to my PATH, but that didn’t help either. Then I found a post suggesting that I put the chromedriver file into my /usr/bin directory. Voilà! It worked. Chrome is up and running, and Chrome doesn’t prompt you about what to do with the file. It just downloads it. Beautiful.
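A quick way to check whether chromedriver is visible from Python is to search the PATH the same way the shell would. This is a minimal sketch using only the standard library. `find_chromedriver` and its `extra_dirs` parameter are hypothetical names, and the check only reflects the PATH of the process running the script, which may differ from the one set in another terminal.

```python
import os
import shutil


def find_chromedriver(extra_dirs=()):
    """Return the full path to chromedriver, or None if it cannot be found."""
    # Search any extra directories first, then the PATH seen by this process
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get('PATH', '')])
    return shutil.which('chromedriver', path=search_path)


location = find_chromedriver()
if location is None:
    print('chromedriver not found on PATH')
else:
    print('chromedriver found at', location)
```

If this prints "not found" even after editing PATH, the change probably has not reached the process running the script, which would explain why copying the file into a directory already on PATH, such as /usr/bin, works.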

So now I have 150 years’ worth of weather data for Toronto. Now what do I do with it?
