Mo data mo problems.
Today's update comes courtesy of HHV contributor Koozdra, who decided to liberate the complete monthly data from the confines of the PDF files in VREB's excellent historical archive. For those who are interested in the technical process, Koozdra explains here.
Downloading the pdfs
I went to the site and, using the JavaScript console in Chrome, injected jQuery into the page. Using a selector for links whose href ends in pdf, I was able to get a list of all the downloadable files. I copied the list into a text file.
Next I wrote a small Ruby script to read through the list of links and download their contents into a local folder.
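The original script isn't shown, so here is a minimal sketch of what such a downloader might look like, using only Ruby's standard library. The helper names, the links.txt file and the pdfs folder are assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'
require 'fileutils'

# Keep only the non-blank lines of the saved list that point at a pdf.
def pdf_links(list_text)
  list_text.lines.map(&:strip).reject(&:empty?).select { |link| link.downcase.end_with?('.pdf') }
end

# Map a pdf URL to a file path inside the local folder.
def local_path_for(url, folder)
  File.join(folder, File.basename(URI.parse(url).path))
end

# Fetch every link into the folder. This does network I/O,
# so it only runs when called explicitly.
def download_all(links, folder)
  FileUtils.mkdir_p(folder)
  links.each do |url|
    path = local_path_for(url, folder)
    next if File.exist?(path) # skip files that were already fetched
    File.binwrite(path, Net::HTTP.get(URI.parse(url)))
  end
end

# Hypothetical usage (file and folder names are assumptions):
#   download_all(pdf_links(File.read('links.txt')), 'pdfs')
```

Skipping files that already exist makes the script safe to re-run if a download fails partway through.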
Processing the files
Reading the pdfs
For extracting the text from each pdf I used a gem called 'pdf-reader'. This library lets you iterate over the pages of a pdf and get all of the text as a string. The next challenge was to classify each line as a data line and to determine which property type it belongs to. Data lines were identified as those whose last four strings start with a dollar sign (I strip out all lines that start with "total"). Identifying property type lines was a little more challenging. They began with a z or an l (that is how my library reports them).
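The classification rules above could be sketched roughly as follows. The function name, the order of the checks, the case-sensitive match on z and l, and the sample lines are all assumptions; only the rules themselves come from the description.

```ruby
# With pdf-reader, the lines would come from something like:
#   PDF::Reader.new(path).pages.flat_map { |page| page.text.lines }
# That step is left out here so the sketch runs without the gem.

# Classify one line of extracted text.
# Returns :skip, :data, :property_type or :other.
def classify(line)
  text = line.strip
  # Lines starting with "total" are dropped before anything else.
  return :skip if text.downcase.start_with?('total')

  tokens = text.split
  # A data line has its last four strings starting with a dollar sign.
  return :data if tokens.length >= 4 && tokens.last(4).all? { |t| t.start_with?('$') }

  # Property type lines begin with a z or an l. Checking this after the
  # data test is an assumption, so a data line is never mistaken for one.
  return :property_type if text.start_with?('z', 'l')

  :other
end
```

Testing for data lines first matters because an area name could itself begin with one of the marker letters.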
Area consistency challenges
Not every pdf has all the data for each area. This is a concern because I don't know how many columns are required when creating a data line. For example, if the first file had 10 areas and a second had only eight, I would have to know which columns to place the data in so that the eight areas line up with the 10 areas from the first file. In order to do this I have to know the union of all the areas by property type before I start generating the output. To do this I create a map of property type to a set (sets prevent duplicates) of areas. This provides the union required.
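A small sketch of that map of sets, assuming the parsed lines have already been reduced to [property type, area] pairs (the pair format and the function name are assumptions):

```ruby
require 'set'

# Build the union of areas seen for each property type across all files.
# records is an array of [property_type, area] pairs.
def areas_by_property_type(records)
  union = Hash.new { |hash, key| hash[key] = Set.new }
  records.each { |property_type, area| union[property_type] << area }
  union
end
```

Because each value is a Set, an area that appears in many pdfs is stored once, and the final set for a property type gives the full list of columns needed.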
Processing the data
I iterate over all the data lines and build a database. In this situation it is a map of maps. Suffice it to say that using this data structure I can provide a property type, month (including year) and area in order to get the five numbers of that specific data line.
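One way to picture the map of maps is a nested hash keyed by property type, then month, then area. The row format, the function names and the order of the five numbers are assumptions made for this sketch.

```ruby
# Build a nested hash: property type -> month -> area -> five numbers.
# Each row is a hash with :type, :month, :area and :numbers keys.
def build_database(rows)
  db = {}
  rows.each do |row|
    by_month = (db[row[:type]] ||= {})
    by_area = (by_month[row[:month]] ||= {})
    by_area[row[:area]] = row[:numbers]
  end
  db
end

# Look up the five numbers for one data line; nil if that line never appeared.
def lookup(db, type, month, area)
  db.dig(type, month, area)
end
```

Returning nil for a missing combination is what later lets the output step leave a cell empty when a pdf had no data for an area.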
Creating the output
Using the property type map and the data map, I'm able to generate a csv file with appropriate column spacing.
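The output step might look something like the sketch below, which writes one metric per csv with one column per area. The function name, the month header and the choice of a single metric (picked by index) are assumptions; the real script may well have written all five numbers.

```ruby
require 'csv'

# Build csv text for one property type.
# areas:  every area known for that property type (the union)
# db:     nested hash of property type -> month -> area -> five numbers
# months: the months to output, in order
# field:  index of the number to write (assumed here: 4 = median price)
def csv_for(property_type, areas, db, months, field = 4)
  columns = areas.to_a.sort
  CSV.generate do |csv|
    csv << ['month'] + columns
    months.each do |month|
      cells = columns.map do |area|
        numbers = db.dig(property_type, month, area)
        numbers && numbers[field] # nil becomes an empty cell
      end
      csv << [month] + cells
    end
  end
end
```

Since the columns come from the union of areas rather than from any single pdf, every row has the same width, and months where an area had no data simply get an empty cell.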
In short, we now have the monthly sales numbers, volume, average price, 6-month average price and median price for each housing type and each region back to 2006.
Here are some basic charts of the median prices for the various areas. Note that even after a 3-month averaging window, the data is quite noisy due to few sales per region. The regions with extremely low sales have been left out (Vic West, View Royal, Metchosin, Highlands, etc.).
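For anyone wanting to reproduce the smoothing, a 3-month averaging window can be sketched as below. Whether the charts used a trailing or a centred window isn't stated; this version averages each run of three consecutive months, and the function name and sample prices are made up.

```ruby
# Average each run of `window` consecutive monthly values.
def moving_average(values, window = 3)
  values.each_cons(window).map { |slice| slice.sum / window.to_f }
end

# Made-up median prices in $K for four consecutive months:
#   moving_average([300, 330, 360, 390])  # => [330.0, 360.0]
```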
Oak Bay gets its own graph since it flattens the rest. Again few sales make the data very noisy.
Langford and Colwood have been relatively hard hit. Interesting how flat Sooke is though. Maybe the credit restrictions have more effect in the younger communities?
10 comments:
Interesting read! Thanks for all the work you put in to that Koozdra.
Thanks Koozdra - interesting data.
Oak Bay is all over the map.
Thanks Kooz!
MLS# 318115 Jubilee area with detached suite
“Sold in June 2010 for $447,500...”
Now listed @ $332,500
http://www.realtor.ca/PropertyDetails.aspx?PropertyID=12737380&PidKey=-124603090
Ouch - that is a big loss for someone. Poor person.
Glad to help.
Wow, you've got a lot of patience, Koozdra. Nice work.
Neat, thanks Koozdra!
MLS has a beta version now. Looks a lot flash-ier
Dave3
Wow, thanks Koozdra.
Here are my stats for last week for SFH in Vic,OB,Esq,SE&SW min 2 beds and 2 baths, priced between $375K & $775K:
Sold: 21
Avg Price: $514K
Med Price: $511K
7 of the 21 had in-law suites and 10 sold for less than BC Assessment.
Compared with same week in 2011:
Sold: 31
Avg Price: $591K
Med Price: $553K
Last week in the areas of Gordon Head, Lambrick Park and Mount Doug, 2 homes sold for an average of $617K. Since the beginning of the year the average sale price in those areas is $577K.
Apartments & Townhomes:
Min 2 beds and 2 baths, priced between $248K & $550K in pretty much the same areas including downtown.
Sold: 16
Avg Sale Price Apts: $356K
Med Sale Price Apts: $357K
10 out of the 16 went for below BC Assessment.
For condos within this criteria, this has been the best sales week in terms of volume since June of last year.
Avg Sale Price T/H: $390K
Med Sale Price T/H: $406K
Whoops, I meant compared with 2012 not 2011. Sorry.
Nice work Koozdra!