Towards a definition of civic data science
Data as “factual information used as a basis for reasoning, discussion, or calculation” has been integral to human civilization for thousands of years. Statistics as a formal mathematical discipline dates back to the 18th century. What’s new is the dramatic maturation of the digital revolution in the past decade.
So what does the digital revolution change?
Here it’s important to remember the basics, which the following flowchart from Wikipedia illustrates nicely.
New York statistician Andrew Gelman digs beneath the hype to nail what’s important about data science tools:
“the point of data science (as I see it) is to be able to grab the damn data. All the fancy statistics in the world won’t tell you where the data are. To move forward, you have to find the data, you need to know how to scrape and grab and move data from one format into another.”
The ability to discover new data sources through scraping and sensors doesn’t change statistical methodology. But it does make it possible to collect data in creative new ways. Similarly, modern programming tools make it easier to parse messy data or integrate multiple datasets into a usable format.
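To make "parse messy data or integrate multiple datasets" concrete, here is a minimal sketch using only Python's standard library. Everything in it is an assumption made for illustration: the two inline datasets, the column names, and the helper functions `read_clean` and `join_potholes` are hypothetical and do not come from any real city data feed.

```python
import csv
import io

# Hypothetical inputs: a messy complaints export and a clean street lookup.
COMPLAINTS_CSV = """street_id, reported ,type
 101 ,2016-03-01,Pothole
102,2016-03-02, pothole
,2016-03-02,Pothole
103,2016-03-04,Street Light
"""

STREETS_CSV = """street_id,name
101,Atlantic Ave
102,Flatbush Ave
103,Court St
"""


def read_clean(text):
    """Parse CSV text, stripping stray whitespace from headers and values."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    return [
        dict(zip(header, (value.strip() for value in row)))
        for row in reader
        if row  # skip completely blank lines
    ]


def join_potholes(complaints, streets):
    """Keep pothole complaints and attach the street name for each one."""
    names = {s["street_id"]: s["name"] for s in streets}
    joined = []
    for c in complaints:
        if c["type"].lower() != "pothole":
            continue  # a different kind of complaint
        if c["street_id"] not in names:
            continue  # missing or unknown street id, cannot be joined
        joined.append({"street": names[c["street_id"]], "reported": c["reported"]})
    return joined


rows = join_potholes(read_clean(COMPLAINTS_CSV), read_clean(STREETS_CSV))
for row in rows:
    print(row["street"], row["reported"])
```

The statistics here are trivial on purpose. Most of the code deals with whitespace, inconsistent capitalization, and records that cannot be matched, which is the "grab the damn data" work Gelman describes.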
Much of the excitement about data science stems from some truly impressive developments in turning unstructured text, video, and audio data into useful information. Yet tags of cats in YouTube videos aren’t intelligence, just processed data that can be more readily used in analysis (or more likely pushed into a production model for optimizing ads in Google’s context).
So in a civic context, what these new tools mean in a nutshell is the ability to gather data in creative new ways.
Why “grabbing the damn data” matters
Too often in civic inquiry, we settle for what data is available and familiar rather than what actually illuminates the challenge in question. Case in point: social science studies on the efficacy of class size reduction that look at immediate test score changes rather than more robust measures of learning like college or career readiness.
Yet today’s digital reality means that student learning could be tracked at finer granularity through ongoing rather than annual assessment, and connected to the big life-preparation questions through longitudinal databases. There’s no metaphysical reason we couldn’t have a system for integrating quizzes from Khan Academy with long-term life trajectories on LinkedIn (or other less purely digital data sources).
Or to offer another example: today cities like New York largely track potholes through people calling in their locations. That dataset is neither comprehensive nor reliably accurate, since it often includes street defects that are not actually potholes. ARGO’s Street Quality Identification Device (“SQUID”) aims to provide a more comprehensive and robust ground truth of where street quality defects actually are.
These examples obviously stem from our experience at ARGO, so we invite your ideas and insights into what defines "civic data science" in the comments below.