What actually defines data science? The role has been labeled the “sexiest job of the twenty-first century,” yet “data science” lacks a robust definition. Its Wikipedia page lists no fewer than twenty different academic disciplines that it draws on, and that’s just the technical statistical and computer science piece, ignoring the substantive domain expertise in Drew Conway’s canonical Venn diagram.
Yet there’s broad consensus that the above skills alone don’t quite a data scientist make. There’s something special, a certain inquisitiveness or artistic flair with visualization or creative streak that marks a data scientist.
Jeff Hammerbacher, Facebook’s first Data Scientist and early pioneer of the job title, has a Quora profile consisting simply of the word “Curious.” DJ Patil, formerly of LinkedIn and now the US Chief Data Scientist, describes data science as a “team sport” and is skeptical of the dedicated “data science” programs -- academic and boot camps -- popping up.
Rather, DJ sees data science skills being built by working in a hard technical academic field like meteorology, experimental physics or bioinformatics that requires not only technical statistical and computer science knowledge but the sustained application of those skills to an applied problem (or set of problems).
Of course, there’s a word for that: praxis.
The Greeks, led by Aristotle, distinguished between theoretical knowledge and its application. Wikipedia has a great definition of the latter:
“Praxis is the process by which a theory, lesson, or skill is enacted, embodied, or realised.”
This aligns with the intuition that data science is something that is done, and with the observation that the canonical data science examples like Netflix’s movie recommendation algorithm or LinkedIn’s “people you may know” feature came out of industry rather than academia. Despite the boom in new academic data science graduate degrees, these programs generally represent interdisciplinary partnerships rather than a new academic department. This reflects the fact that data science is firmly about how you apply or implement a “theory, lesson or skill.”
Alright, you might be wondering: “OK, praxis as a framework offers potential for intellectual consistency, but so what? What does that mean practically?” Let’s make that concrete by considering a few examples within the simple and straightforward data collection, data processing and data analysis buckets we borrowed earlier from the US Department of Defense:
Data collection as a concept is far from new, and statistics has been concerned with data acquisition since the beginning. The same holds true for science more broadly, and really for the longer tradition of rigorous inquiry involving observation to understand our world.
Yet scraping a webpage, building a sensor or parsing a messy dataset demands coding chops, skills that stand on their own, distinct from statistics. As a discipline, statistics offers great principled insights into how to collect data that are invariant to the technological method (sampling, for instance), yet I don't think we can just stay within the ethereal abstract world where a line worker entering industrial process information into a notebook and a social media water conservation marketing experiment are the same thing. The same methodological principles about sampling hold, but the technological implementation, from words to the web, is dramatically different.
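To make the point about sampling a little more tangible, here is a minimal sketch in Python. Everything in it is made up for illustration: the `simple_random_sample` helper, the notebook readings and the web events are all hypothetical. It only shows that the same sampling principle applies unchanged to two very differently collected sources.

```python
import random


def simple_random_sample(records, k, seed=0):
    """Draw k records uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(list(records), k)


# Hypothetical data: readings a line worker copied out of a paper notebook
notebook_entries = [{"source": "notebook", "reading": 70 + i % 5} for i in range(200)]

# Hypothetical data: events logged by a social media marketing experiment
web_events = [{"source": "web", "clicked": i % 3 == 0} for i in range(5000)]

# The sampling step is identical; only the collection technology differed
notebook_sample = simple_random_sample(notebook_entries, 20)
web_sample = simple_random_sample(web_events, 20)
```

The hard part in practice is getting `notebook_entries` and `web_events` into existence at all, which is exactly where the coding chops come in.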
Processing and Exploitation
In addition to collecting new data, new tools allow new strategies to programmatically convert unstructured data like text, audio or images into useful information. Many of us have used Siri on the iPhone (or similar Google or Amazon services), which automatically turns our voices into actionable information like a search result or phone call. Likewise, data science techniques such as Google's fancy neural network algorithms can turn data sources like YouTube cat videos into quantifiable information like counts of cats in videos over time.
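As a rough sketch of that last step, suppose some upstream image classifier has already labeled each video. The `detections` list and the `count_label_by_year` helper below are hypothetical, and the classifier itself is assumed rather than shown. Turning labels into counts over time is then a small aggregation:

```python
from collections import Counter

# Hypothetical classifier output: (upload_year, predicted_label) for each video
detections = [
    (2012, "cat"), (2012, "dog"), (2013, "cat"),
    (2013, "cat"), (2014, "cat"), (2014, "car"),
]


def count_label_by_year(detections, label):
    """Count how many videos per year were given the requested label."""
    counts = Counter(year for year, predicted in detections if predicted == label)
    return dict(sorted(counts.items()))


cat_counts = count_label_by_year(detections, "cat")
```

The interesting work lives in the classifier that produces the labels; once the unstructured video has become a label, the rest is ordinary counting.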
Analysis and Production
With processed data in hand, the next step is to apply mathematical and analytical techniques to derive meaningful insights. An oft-cited though apocryphal quotation on what statistics is and is not good for helps clarify what data science tools can and cannot do (the paraphrased version is more illuminating than Fisher’s actual words, at least in isolation):
“Consider the case of extracting gold from ore. The expectation of a good gold extraction machine is that it cleanly separates the gold from the ore with no waste. We would not criticize this machine for failing to extract gold if none was originally present in the ore, nor would we judge it too harshly if it failed to extract gold from ore with only minute quantities. Similarly, statistics is a machine which extracts information from the data. Statistics cannot create information, the data must contain that.”
Similarly, data science, no matter how fancy the tool or how deep the neural network, cannot create gold out of useless ore. Sometimes people talk about data science and these sorts of tools as if they’re magic, and it’s important to draw a bright line against that sort of alchemistic thinking.
What these new tools can do is make it easier to operate that statistical machine. As we saw earlier, data science can enable new strategies to collect and process data so that it’s amenable to analysis: “grabbing the damn data,” in Andrew Gelman’s memorable phrase.
In terms of actually conducting an analysis, computation matters too: it is difficult, for instance, to perform a calculation involving an arbitrarily large number of steps by hand. Furthermore, we can take the resulting gold and make it (potentially) more amenable to human insights by creating an interactive visualization rather than simply a static graph.
Lastly, we can apply the resulting gold to actions in a qualitatively different manner using the scale that computation allows. Consider a production data science model used to offer recommendations on Amazon or detect fraud in a stream of credit card transactions. It would be simply impossible to provide that level of granularity in model outputs in real time for so many data points without digital tools.
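For a feel of what scoring a stream one record at a time looks like, here is a toy sketch. It is not how any real fraud system works: the rule (flag an amount far above the running mean), the `flag_unusual` name, the thresholds and the sample amounts are all assumptions chosen for illustration.

```python
import math


def flag_unusual(amounts, threshold=3.0, warmup=5):
    """Yield (amount, is_flagged) for each amount as it arrives.

    Toy rule: flag an amount that sits more than `threshold` standard
    deviations above the running mean of the amounts seen before it.
    """
    n, mean, m2 = 0, 0.0, 0.0  # running count, mean, sum of squared deviations
    for x in amounts:
        flagged = False
        if n >= warmup:
            std = math.sqrt(m2 / (n - 1))
            flagged = std > 0 and (x - mean) / std > threshold
        yield x, flagged
        # Welford's update: fold the current amount into the running statistics
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)


# Hypothetical transaction amounts arriving one at a time
stream = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 950.0, 22.0]
results = list(flag_unusual(stream))
flagged_amounts = [amount for amount, is_flagged in results if is_flagged]
```

Each decision uses only the amounts seen so far and a constant amount of memory, which is what lets this kind of check keep pace with a live stream in a way no human reviewer could.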
Together, those facets of how the digital revolution subtly impacts data collection, processing and analysis enable us to formalize what is distinct about data science: new tools to discover data like sensor readings of street quality, new tools to better integrate even previously unmanageable data like unstructured audio, and new tools to connect a data analysis to an action. Thus, the word "science" in data science highlights the connection to computer science more than the loosey-goosey exhortations to the scientific method that some practitioners proffer. And the key facet of this new field is the integration of computational, statistical and substantive skills into a single actor and, crucially, the application of those skills to practical ends through the process of praxis.
Yet what does this emerging suite of tools and techniques mean for how we manage our roads or deal with the California drought or tackle all manner of civic challenges?
An open question: what this new field means for civic challenges
Data science emerged out of Silicon Valley in the context of the social media mega-platforms Facebook and LinkedIn. It’s also illustrative that the example domains listed on the data science Wikipedia page are business and medical applications. Applying data science to civic challenges is a nascent field, with a handful of largely isolated case studies that everyone quotes (e.g. the New York building inspector example or the Chicago restaurant inspection model).
The challenge in direct comparisons is that not all domains have correlation patterns that are stable across space and time. A cat photo is a cat photo, whether 20 years ago or today, here or in Zimbabwe. For civic challenges that's largely not the case. Humans have this funny habit of making choices. The causal confusions that result from millions of independent actors, each with their own free will, bumping into each other in something like a city are a big part of what makes "civic" inquiry different from data science generally. In many ways, it goes back to the distinction that the Enlightenment thinkers drew between natural and moral philosophy.
That larger intellectual tradition helps clarify things, yet what the digital revolution means for how we educate the next generation or manage water demand is still very much a frontier. It's that frontier that ARGO aims to explore and help settle. We've by no means got it all figured out, so we welcome your comments, questions and insights in the comments below.