Web Scraping 101 using NYC's DOE's Fair Student Funding Page
Elevator Pitch: I scraped NYC's Department of Education's Fair Student Funding budget page to show how basic scripting knowledge can unleash a world of data opportunity. This post is targeted at municipal data analysts who are stuck in the world of Excel and want to move up a notch.
What is done - The DOE's Fair Student Funding budget page was scraped to extract the "Need" metric for middle schools, using a combination of observing the underlying HTML code and simple Unix commands, applied iteratively.
Note - Sins of omission and commission were definitely committed to keep this post palatable. It is aimed at "analysts" with no prior programming knowledge, to show what is possible.
As part of NYU CUSP's Urban Science Intensive, we are currently working on a collaborative "Social Impact" project. Our team of four chose education, specifically after-school expansion for middle school students in NYC. This mayoral initiative is the little cousin of Universal Pre-K, which gets most of the press.
The initiative has budgeted $190 million to expand after-school programming to almost double its current offering. Our project involves creating a siting model to assist with the RFP process of disbursing the funds where there is the most "need". A future post will showcase that in detail.
"Need" is where things get interesting, because need can mean a lot of things. We found that the DOE's Fair Student Funding budgeting has documented this extensively.
Identifying what we "need"
One of our data wrangling tasks was to quickly get this data for all middle schools in New York City. The list of schools is easily available on the open data portal. However, the DOE has collected mounds of data on schools, and considering the scope and timeline of our project, we needed to narrow down. We eventually stumbled upon a metric used by the DOE itself, and it presented a good opportunity to discuss web scraping for the civic data analyst.
This is what the Fair Student Funding budget page looks like for a randomly selected middle school. The page is well designed and conforms to most design standards. Tooltips are presented everywhere, and every page has a download option, which is great for the end user, most likely someone associated with that particular school. We, however, needed the same information from all schools.
The address (URL) of the above page is:

http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm?schoolcode=Q190

That schoolcode=Q190 bit at the end signifies a lot. To me this says "REST"ful, which means that every single page on this site is uniquely identifiable through its URL. Even better, the URL actually contains the school's DBN.
DBN = school code representing the District, Borough, and Number for the school
So in theory, one can get the Fair Student Funding page, with the Need metric we want, for each and every middle school in the city. (Of course, you could download the Excel version of each page and copy-paste the fields you want into another sheet, but that's manual labor in this digital age.) For perspective, there are over 500 middle schools in the city.
Getting under the hood of a web page
To scrape a web page, you need to know what that page is made of. Some understanding of HTML helps here, but it's not necessary. Just think of the web page as a regular Word doc, and all we are doing is a "search".
So what I "need" is the Need Weight Total number in the FY14 Actual Registers column (See image above)
Google Chrome makes this simple. I show a video for this part rather than describing it. I use the browser's Developer Tools mode to get to this point. The goal is to find the unique ID for the specific cell that holds that number: the FY14 Actual Registers column of the Need Weight Total row.
The above id (highlighted, and shown in the video) identifies that specific value within the innards of the web page. We now need to pull this value 500+ times, once for each middle school. The main assumption here is that all the pages share the same format, which is usually the case.
curl -s 'http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm?schoolcode=Q190' | egrep "SchoolPortals|doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07"
This is the command that does it all and I'm going to explain this in a way that doesn't involve glazed eyes :)
curl is the command that fetches the HTML code of a web page. The "-s" means I don't want it to output the progress meter and what not. It's aptly called silent mode :)
This is followed by the same URL (address) of the page we are trying to scrape.
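If you want to try curl on its own first, run it against the budget page by itself. The output is just the raw HTML of the page, so expect a wall of text:

```shell
# fetch the raw HTML of the school's budget page and print it;
# -s suppresses the download progress meter
curl -s 'http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm?schoolcode=Q190'
```

A handy trick for practicing offline: curl also reads local files if you give it a file:// address instead of http://.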
The "|" character is the "pipe" operator. It is used to transfer the output from the command on the left of the pipe as input to the command on the right of the pipe.
So command1 | command2 means the output of command1 becomes the input of command2
In this command, I am "piping" the output of the curl command (aka the html code of the web page) to something else.
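A toy example makes the pipe concrete. Here the output of echo (a command that simply prints its arguments) becomes the input of wc -w, which counts words:

```shell
# echo prints the phrase; wc -w counts the words it receives
echo "fair student funding" | wc -w
# prints 3
```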
egrep is the cousin of grep, one of the most versatile commands in programming. grep is basically a search tool: conventionally, it lets you search for a single word in a file and, in its basic form, returns every line containing that word. egrep additionally lets you search for multiple words at once, separated by "|". That's all you need to know for this.
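A quick toy example (the alpha/beta/gamma lines are made up just to show the behavior): egrep keeps every line matching either word, because the "|" in the pattern means OR:

```shell
# print three lines, then keep only the ones containing alpha OR gamma
printf 'alpha line\nbeta line\ngamma line\n' | egrep 'alpha|gamma'
# prints:
# alpha line
# gamma line
```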
So in plain English, the above command reads: "Get me the HTML code of the web page without the unnecessary progress output, then search (grep) for the words 'SchoolPortals' and 'doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07', so that I get one line identifying the school and another line showing the Need Weight Total value I need from the page."
That translation has been important for me in my education. The output of that command is :
<a href="/SchoolPortals/28/Q190">
<span id="doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07">465</span></td>
Notice what I now have: the school's page link and that magic number. The next step is to use, *ahem*, Excel to construct the same command 500+ times. Of course you can use a loop and all that programming mumbo jumbo, but gotta keep those eyes unglazed :) That's how I proceeded to construct 500+ curl commands. I then threw all of them into a single script and ran it. The output can then be massaged in a regular text editor to get what you need.
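For the curious, the loop version of the Excel trick looks something like this. It assumes a file named dbns.txt (a hypothetical name; build it from the open data portal list) with one school code per line:

```shell
# for each school code, fetch its budget page and grep out the two
# lines we care about, exactly like the single command shown earlier
while read code; do
  curl -s "http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm?schoolcode=$code" \
    | egrep "SchoolPortals|doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07"
done < dbns.txt
```

Append "> results.txt" after the "done" to collect all 500+ rows in one file, ready for massaging.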
There you have it. As always comments / feedback welcome.
Note: This post assumes you either have a Mac, which comes with a Unix terminal, or something similar in the Windows universe (Cygwin is what I've used before). You could also try getting this done through a GUI using import.io. I've used it to scrape the transcripts of NPR's This American Life with good results :)