Decision making within the Civic Data Science framework

TLDR: My observations on why decision making needs the same "big" treatment as data or devices. I refer to past decisions that cost NYC dearly and that can be used as learning opportunities. I also include some pedantic descriptions of decision making from academia and bring it back to relevant examples of how technology and decision making come together well. I added pictures to correct for the eye-glazing effect.

At ARGO we are building towards a sustainable framework to better scope, understand, and eventually deliver solutions for urban problems. So far we have "not disagreed" on Data Discovery, Data Analysis, and Data Integration as buckets that define distinct data tasks within the larger normative structure of the Device, Data & Decisions framework that encompasses civic data science. Together with rapid prototyping, this offers a comprehensive and flexible lens towards a digitally native service delivery model.

Device, Data & Decisions
Image credits: from the Noun Project
Router by Yorlmar Campos; Export Database by Arthur Shlain; strategy by Gregor Črešnar; cube by Ates Evren Aydinel.

Our process is the result of many discussions, amongst ourselves and with others, ranging from the epistemological to the semantic. The overarching mission at ARGO is to partner with city agencies and local governments to help them make qualitatively better decisions about delivering services. "Better", however, is a loaded term.

In 2009, "Better" meant spending $549,000,000 to develop a citywide wifi network that turned out to be obsolete in 5 years. 

In 2012, when Hurricane Sandy hit, "Better" meant spending billions on a disaster response that was sometimes dysfunctional.

These were well-intentioned and understandably debatable decisions that were not the best use of public $$$. But as we move head first into a digital age where policy making is more reliant on data than ever, these errors of the past are also immense learning opportunities. While tools to "grab the damn data" are evolving at breakneck speed, we need to consider whether our ability to make actionable decisions is evolving in step. It often is not, and it is also not part of the typical data science skill set.

The decision maker (often not data savvy) ends up swimming/drowning in data, left with inadequate tools to convert the <<<insert awesome predictive analysis using ridiculous amounts of data but woefully difficult to replicate or communicate>>> into decisions that move the proverbial needle on said policy intervention.

Created using wordle.net

Whenever I sit in a room with "data scientists" or "data-{dashes}", I often wonder how they define terms such as "Algorithm", "Big Data" & "Urban Science". I would argue that, if asked, their definitions of these terms would reveal the inherent biases that could very well lead us down the path of the aforementioned billion-$$$ errors. I often question my own definitions of these terms, as they are heavily contextual. (Disclosure: I spent some time supporting algorithmic trading systems at a big bank.)

As a Master's student in Penn State's IST program, I researched decision making within crisis management. This included the study of Computer-Supported Cooperative Work (CSCW), Human-Computer Interaction (HCI), and Human Factors (ergonomic design). I ended up writing my thesis on a theory of team cognition called transactive memory, which seeks to better understand group behavior through the processes by which individual members of a group make sense of incoming information. Most of the work dealt with developing a theoretical model to better situate crisis responders to organize incoming information so that they can make effective decisions in the field.

The transactive memory command center is the application of Daniel Wegner's transactive memory theory to an information environment where decisions are facilitated by individuals who have specific information roles to organize incoming data. This was presented, along with a research colleague, at a 2008 Department of Homeland Security University Network Summit focused on catastrophes and complex systems.

A big takeaway from this research was my affinity for the Common Operational Picture, a concept heavily used in the military for command & control within a distributed command structure. I find it immensely useful outside the military context, underutilized in data-intensive environments, and potentially very useful in the civic space. To self-plagiarize from my thesis:

Working groups solving problems together often need to achieve a common consensus on the important elements of the problem. This common understanding is necessary so that decision-making for an evolving and complex situation can be effectively enabled if knowledge about the situation is aggregated onto a common space for all the decision-makers to make use of collectively. The centralization of information that facilitates such convergent processes is referred to as the common operational picture (COP).
A COP is first and foremost a visual representation; it is a structurally emergent artifact that visually illustrates the relevant information characterizing the situation. (USJFCOM, 2008). A COP is most useful when multiple groups operating under a multi-level organizational structure require quickly accessible and actionable knowledge to rapidly make decisions.
Design and Development of a Transactive Memory Prototype for Geo-collaborative Crisis Management, Adibhatla, V., Master's Thesis, 2008, Penn State University

Anthony Townsend, in his book Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, refers to a similar concept: topsight, as described by David Gelernter in his prophetic and seminal book Mirror Worlds: or the Day Software Puts the Universe in a Shoebox.

Some gems from Mirror Worlds on topsight:

Topsight is what comes from a far-overhead vantage point, from a bird's eye view that reveals the whole—the big picture; how the parts fit together.
It's easy to organize a data-gathering project, and you can count on a rush of neo-Victorian curatorial satisfaction as your collection grows. But analyzing data requires at least a measure of topsight, and topsight is a rare commodity

The desire for the ultimate topsight. (1) Rio Operations Center, 2012 [IBM]; (2) Mission Control Center, Houston, 1965 [NASA]. These images were taken from Mission Control: A History of the Urban Dashboard (Mattern, Shannon. "Mission Control: A History of the Urban Dashboard." Places Journal (2015)).

Townsend argues that this drive for ultimate "topsight" in a city is what led Rio de Janeiro & IBM to build massive top-down surveillance systems costing billions of $$$. These systems eventually yield something resembling the "informatics of domination" originally articulated in Donna Haraway's Cyborg Manifesto and referred to in "Critiquing Big Data: Politics, Ethics, Epistemology". This is an unfortunate outcome, reminiscent of a Robert Moses approach to constructing digital infrastructures for civic applications.

The Common Operational Picture, although it originates from the military, where the domination narrative is not only implied but required, can, I argue, be effectively repurposed for a less grandiose, localized, and practical approach to making day-to-day operational decisions in city agencies.

The Department of Sanitation's Bladerunner platform is a superb example of what a Common Operational Picture looks like in a city operations setting. The platform takes data from GPS devices on DSNY vehicles, transmitted over a "cellular network" (I am curious to know if NYCWiN is used here), and feeds it into a flexible UI (the Common Operational Picture) that lets DSNY managers locate and group DSNY vehicles in real time by distinct functions (Plowing, Salting, Collection, Supervision) and attain an innocuous yet extremely usable topsight. Bladerunner, too, cost several million $$ to implement, but I'd bet that without it DSNY managers would find themselves operationally crippled (feel free to call me out on this).

DSNY's Bladerunner platform, a Common Operational Picture for DSNY managers

Finally, we designed SQUID to follow the same principles of decision making. Providing a common operational picture of street quality, we hope, would optimize the $1,400,000,000 ($1.4 billion) budgeted for NYC street resurfacing over the next 10 years (Ten-Year Capital Strategy, Fiscal Years 2016-2025, The City of New York, pg. 22). A 1% savings from better decision making around street resurfacing projects, roughly $14 million over the decade, would more than pay for the SQUID program, not only in NYC but even more so in small- to medium-sized cities where street paving dollars are limited.

I leave you with some maps that we made for NYU's GIS Day 2015 that aggregate the sensor data from SQUID to the Neighborhood Tabulation Areas (NTAs) provided by the Department of City Planning. We intend to follow design principles similar to those used in Bladerunner to develop an effective Common Operational Picture using SQUID data.

Thanks,

Varun Adibhatla, Argonaut


Web Scraping 101 using NYC's DOE's Fair Student Funding Page

Elevator Pitch: I scraped NYC Department of Education's Fair Student Funding Budget page to show how basic knowledge of scripting can unleash a world of data opportunity. This post is targeted at municipal data analysts who are stuck in the world of Excel and want to move up a notch.

What is done - The DOE's Fair Student Funding Budget page was scraped to extract the "Need" metric for middle schools, using a combination of observing the underlying HTML code and simple Unix commands, done iteratively.

Note - Sins of omission and commission were definitely committed to keep this post palatable. This piece is targeted at "analysts" who have no prior knowledge of programming, to show what is possible.


The longer version:

As part of NYU CUSP's Urban Science Intensive, we are currently working on a collaborative "Social Impact" project. Our team of 4 chose education, specifically after-school expansion for middle school students in NYC. This mayoral initiative is the little cousin of Universal Pre-K, which gets most of the press.

The initiative has budgeted $190 million to expand after-school programming to almost double its current offering. Our project involves creating a siting model to assist with the RFP process so the funds are disbursed where there is the most "need". A future post will showcase that in detail.

"Need' is where things get interesting because need can mean a lot of things. We found that the DOE's Fair student Budgeting has documented this extensively. 

Identifying what we "need"

One of our data wrangling tasks was to quickly get this data for all middle schools in New York City. The list of schools is easily available on the open data portal. However, the DOE has collected mounds of data on schools, and considering the scope and time of our project, we needed to narrow down. We eventually stumbled upon a metric used by the DOE, and I felt it presented a good opportunity to discuss web scraping for the civic data analyst.

This is what the Fair Student Funding Budget page looks like for a randomly selected middle school. The page is well designed and conforms to most design standards. Tooltips are presented everywhere and every page has a download option, which is great for the end user, who is most likely someone associated with that particular school. We, however, needed the same information from all schools.

Highlighted are the "Need Weight Total" and the corresponding value for FY14 Actuals

The address (URL) of the above page is: http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm?schoolcode=Q190

That schoolcode=Q190 bit at the end signifies a lot. To me this means the site is "REST"ful: every single page on this site is uniquely identifiable through its URL. Even better, the URL actually contains the school's DBN.

DBN = school code representing the District, Borough, and Number for the school

So in theory, one can get the Fair Student Funding Budget page that has the Need metric we want for each and every middle school in the city. (Of course you could download the Excel version of each page and then copy-paste the fields you want into another sheet, but that's manual labor in this digital age.) For context, there are 500+ middle schools in the city.
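To make that pattern concrete, here is a tiny sketch (not something from the original project) that prints the budget-page URL for a few school codes. Q190 comes from the example above; the other codes are made up:

#!/bin/bash
# Sketch: print the Fair Student Funding Budget URL for a few school codes.
# Q190 is the example school from this post; K123 and M045 are placeholder codes.
BASE="http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm"

for code in Q190 K123 M045; do
    echo "${BASE}?schoolcode=${code}"
done

In practice you would swap the inline list for the full list of 500+ middle school codes pulled from the open data portal.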

Getting under the hood of a web page

To be able to scrape a web page, you need to know what makes up that web page. Some understanding of HTML helps here, but it's not necessary. Just think of the web page as a regular Word doc, and all we are doing is a "search".

So what I "need" is the Need Weight Total number in the FY14 Actual Registers column (See image above)

Google Chrome makes this simple. I show a video for this part rather than describe it in words. I use the browser's Developer Tools to get to this point. The goal is to get the unique ID for the specific cell that holds that number in the Need Weight Total row.

Google Chrome's Developer Tools to assist with scraping

Scraping

doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07

The above ID (highlighted in the video) identifies that specific value within the innards of the web page. We now need to get this value 500+ times (once for each middle school). The main assumption here is that all the pages share the same format, which is usually the case.

curl -s 'http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm?schoolcode=Q190' |
  egrep "SchoolPortals|doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07"

This is the command that does it all and I'm going to explain this in a way that doesn't involve glazed eyes :)

curl -s

curl is the command that gets the HTML code of a web page. The "-s" means I don't want it to output the progress meter and whatnot. It's aptly called silent mode :)

This is followed by the same URL (address) of the page we are trying to scrape.

|

This is the "pipe" operator. It is used to transfer the output from the command on the left of the pipe as input to the command on the right of the pipe.

  • So command1 | command2 means the output of command1 becomes the input of command2

  • In this command, I am "piping" the output of the curl command (aka the HTML code of the web page) into something else - a quick toy example follows below.
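Here is that promised toy example (nothing to do with the DOE page; animals.txt is a made-up file) that you can paste into any terminal to see a pipe in action:

printf "cat\ndog\nfish\n" > animals.txt   # create a tiny 3-line file
cat animals.txt | wc -l                   # pipe its contents into wc, which counts the lines and prints 3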

egrep

egrep is a cousin of grep, one of the most versatile commands in Unix. grep is basically a search tool: in its basic form, you give it a word and a file and it returns every line in that file containing the word. egrep extends this so you can search for several words at once (the | in the pattern means "or"). That's all you need to know for this.
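Sticking with the made-up animals.txt file from the pipe example above, egrep with a | in the pattern matches either word:

egrep "cat|dog" animals.txt   # prints the "cat" and "dog" lines and skips "fish"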

So in plain English, the curl + egrep command from earlier reads: "Get me the HTML code of the web page without the unnecessary status output, then search (grep) it for the words 'SchoolPortals' and 'doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07', so that I get one line that tells me the school name and another line that shows me the Need Weight Total value I need from the page."

Making that kind of plain-English translation has been important throughout my own education. The output of that command is:

<a href="/SchoolPortals/28/Q190">
&nbsp;<span id="doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07">465</span></td>

Notice what I now have: the school name and that magic number. The next step is to use *ahem* Excel to construct the same command 500+ times. Of course you could use a loop and all that programming mumbo jumbo (see the sketch after the image below), but gotta keep those eyes unglazed :) That's how I proceeded to construct 500+ curl commands. I then threw all of them into a single script and ran it. The output can then be massaged in a regular text editor to get what you need.

Constructed the curl command 500+ times, one for every middle school in NYC, using Excel
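And for those who do want the mumbo jumbo, here is a minimal sketch of the loop version. It assumes a made-up file called schoolcodes.txt with one school code per line (e.g. Q190) and writes everything to a made-up need_output.txt; it is not the exact Excel-built script I used:

#!/bin/bash
# Sketch: fetch the school link and Need Weight Total line for every code in schoolcodes.txt
BASE="http://schools.nyc.gov/AboutUs/funding/schoolbudgets/FY15FairStudentFundingBudget.htm"
ID="doecontrol_bottomcentercontainer_School_Budget_Overview_lblNWTotal_C07"

while read -r code; do
    curl -s "${BASE}?schoolcode=${code}" | egrep "SchoolPortals|${ID}" >> need_output.txt
    sleep 1   # pause between requests to be polite to the DOE's servers
done < schoolcodes.txt

The resulting need_output.txt still needs the same text-editor massaging described above to turn the HTML fragments into a clean two-column table.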

There you have it. As always, comments/feedback welcome.

Thanks!

Varun

Argonaut

 

Note: This post assumes you either have a Mac, which comes with a Unix terminal, or something similar in the Windows universe (Cygwin is what I've used before). You could also try getting this done through a GUI using import.io. I've used it to scrape the transcripts of NPR's This American Life with good results :)
