Preliminary system architecture analysis of California's water data systems

Tony Castelletto has worked for over a decade in library science and IT management leadership roles.  He is helping our big California water data project as a "System Architecture" research fellow, looking at the institutional and information technology landscape governing current California water data.

As part of that work, he is part of the team, including Seed Consulting Group working pro bono in partnership with the California Data Collaborative, conducting an agile development sprint to ingest the data sources identified by Dodd's AB 1755 as the first step in a feasibility study on implementing the proposed data integration.  See below for how this project integrates with the already completed version 1.0 of the Seed Consulting WaterLog Project:

He has also written the below analysis of California state-level water data collection efforts at my direction.  I do not agree with all of the analysis, and reminded Tony after review that it is much easier to point out an institutional problem than to propose a solution, much less implement one.

Yet pointing out a problem and understanding its root causes is critical to developing and executing a plan to improve the situation, as G.K. Chesterton eloquently articulated in his parable of the fence.  Please consider this the preliminary analysis of an experienced IT professional working to get up to speed on water.

Any shortcomings in understanding the nuances of water management, in my professional opinion, illustrate the challenges of navigating our complex water institutional, administrative and technological architecture.  See below for the memo, and as always your insights, questions and challenges are appreciated in the comments.  Thanks much.




To:  Patrick Atwater, Project Manager of the California Data Collaborative

From: Anthony Castelletto, System Architecture Research Fellow, California Data Collaborative

Subject:  Diagnostic Memo on the California Department of Water Resources' Data Practices

Date:  April 20, 2016


            The ongoing drought and the forecast of more frequent and persistent droughts throughout the 21st century have forced the State of California to review its water management and allocation practices.  While relatively accurate information on aggregate water use can be found, the detailed, granular water usage data essential to planning and evaluation remains fragmented among a myriad of agencies, commissions, and corporations.  If California is to remain a populous and economically vibrant state, it must transition to adaptive management to carefully husband its limited water resources.  Urban areas must continue to improve their efficiency.  To do this, California governments, municipal water agencies and regional water boards will need a centralized data warehouse containing raw water-use data down to the household level.  Currently no agency collects and disseminates detailed water usage data, hobbling efforts at collaborative water governance and frustrating the development of a statewide water transfer market.  The state agency best able to collect and centralize such data would be the Department of Water Resources.  However, this agency has not done so.  This leaves independent water management districts to act on their own.  Some of these districts have taken action through collaboration, creating the California Data Collaborative to coordinate their work.

            The California Data Collaborative is a project run by a coalition of water agencies operating under a Memorandum of Understanding, with Moulton Niguel Water District acting as the lead administrator.  The Collaborative seeks to develop data tools that enable smarter water conservation and promote greater efficiency in water use.  The limits of utility-wide average use statistics stymie the development of these advanced management tools.  Adaptive and responsive management requires fine-grained data that accurately captures the behavior of individual consumers.  Such data not only makes possible the examination of usage, but also allows economic effects such as price elasticity to be determined.  In addition, since 2013, water utilities have been subject to more stringent reporting requirements.  The California Data Collaborative works to support water managers in achieving their reliability objectives by developing a common data infrastructure.

Description and Scope of the Problem

            Two developments in state policy drive the need for better data collection and coordination.  First, the ongoing shortages of water due to drought, and the increased demands on shrinking supplies caused by an expanding population, force water agencies to coordinate their plans.  In fact, such collaborations already exist.  The largest is the CALFED Bay-Delta Program, which manages the Sacramento River and San Joaquin River deltas.  Under CALFED, state agencies collaborate with the Federal government to regulate water resources in this watershed.  Such collaborations in allocating water supplies and managing waste require usable and current data.  Second, California's current water transfer markets suffer many inefficiencies due to a lack of timely information about supply and demand.  At present, no agency provides the data needed to achieve meaningful collaboration and efficient market transfers.

            While the State of California collects a great deal of information on water use, supply, and rights, these data sets have never been collected into a single data warehouse, nor are they made fully available to water managers and state agencies.  Although legislation such as the Sustainable Groundwater Management Act has led to the collection of such data, a number of problems remain.  The data available is disjointed and lacking in context; much of it consists of aggregated information amounting to little more than executive summaries or reports.  Policy and academic researchers cannot generally use this sort of data for their work.  This constrains efforts to evaluate and develop policy and severely impedes innovation from academia or the private sector.  The data format itself presents problems.  The Department of Water Resources and the State Water Resources Control Board deliver data as Adobe PDF documents.  Although humans find such reports easy to read, extracting data for substantive analysis consumes time and resources.  Ultimately, the effort to extract data is wasteful, as it duplicates work already done by the boards and the department.  The lack of open data protocols and common data structures severely hampers data sharing.  This situation harms the public and impairs the functioning of many public agencies.  Standardized and open data formats are now commonplace in many disciplines and industries.  The question, then, is why the State of California has not adopted modern data practices in its water management system.  Furthermore, the Department of Water Resources seemingly has the authority to undertake the construction of a modern water accounting system.  Why has it failed to act?


            California's water governance is fragmented, conflicted and contentious.  This fragmentation has only increased over the years.  The Public Policy Institute of California describes the system as “highly decentralized, with many hundreds of local and regional agencies responsible for water supply, wastewater treatment, flood control, and related land use decisions.”  Currently, the Federal government, the Department of Water Resources, the State Water Resources Control Board and nine regional control boards administer California's water resources.  Each organization has distinct priorities and different constituencies.  California's lead water management agency, however, is the Department of Water Resources.

            Created by legislation in 1956, the California Department of Water Resources, along with the State Water Resources Control Board (SWRCB), administers the state's water resources.  The Department of Water Resources (DWR) controls the water distribution infrastructure of the state, while the Board, originally the State Water Rights Board, allocates supply and manages water rights.  The Board also reviews planning decisions made by the director of the DWR.  The DWR itself holds water rights of its own amounting to 31 million acre-feet.  Thus, authority over water rights was separated out into the State Water Rights Board, now the SWRCB, to prevent conflicts of interest.  At its formation, the DWR was granted sweeping powers and authority over the state's water system and was the sole administrator of the State Water Project, the system of dams and canals that collects and distributes water to agricultural users and population centers.  The State Water Rights Board initially focused solely on the management of water rights, while almost every other function fell under the DWR.  This all changed in 1969 with the passage of the Porter-Cologne Act.  This act, an early effort to deal with pollution, made water quality a primary concern and served as a prototype for the Federal Clean Water Act.  Under the legislation, water agencies and other users were forced to collect and report data on the quality of their water and to regulate discharges.  The State Water Rights Board was merged with the State Water Pollution Control Board to form the State Water Resources Control Board.  The Porter-Cologne Act also split up responsibility for water quality among nine regional boards reporting to the SWRCB.  The SWRCB must also determine allocations for all water users.  Meanwhile, the DWR controls almost all planning functions, setting the stage for conflict over water management.  Complicating matters, the California EPA also has jurisdiction in many areas.

            Both agencies find themselves constrained by the state's complex and multi-tiered water rights laws.  These rights range from the Pueblo era to rights granted in the present day.  Riparian rights secured prior to 1914 remain unregulated to this day, meaning those with land along rivers may divert as much water as they like.  This inconsistent and diverse portfolio of rights makes regulation and measurement of inflows difficult, so the state lacks any clear notion of how water is tapped.  The Porter-Cologne Act constrains the Board's ability to track consumption by prioritizing its mission as enforcement of water quality and by situating it within the California Environmental Protection Agency.  Meanwhile, the Board lacks the ability to initiate investigations on its own initiative; it can only respond to complaints made about appropriation of water in the context of rights violations.

            The DWR, on the other hand, has substantial powers of investigation.  The DWR has the ability to collect data and has made some efforts to do so in recent years, though historically this has not been a priority.  The 2009 Water Conservation Act mandated increased tracking of use by holders of water rights.  This, however, leaves a large gap in the understanding of end-point water use and appears to omit a great deal of information.  The DWR's Integrated Water Resource Information System began to track this data, but lacks deep and detailed information about urban usage.  Meanwhile, the SWRCB's Electronic Water Rights Management Information System only compiles self-reported usage figures from rights holders, bringing its accuracy into question.  Without any central coordination, California water management districts rely on their own data for planning and coordinate activities on an ad hoc basis.  While the evident fragmentation of governance along regional and functional lines certainly impedes collection and coordination of usage data, the DWR has not used its statutory powers of investigation to collect data.  Under its authorizing legislation (Sections 153 and 158), the department has the ability to conduct any investigation into water issues that it deems necessary.  This power is reinforced by the California Water Code, which incorporates the changes of subsequent legislation.  Sections 225 through 238 of the code provide the department with ample authority to collect usage data.

            The problem, then, results from the powers of discretion given to the director of the DWR.  While the director of the DWR may initiate data collection at any time, the department's authorizing legislation provides neither a statutory trigger for routine data collection nor powers of enforcement to compel the delivery of data.  Thus any director initiating such a data collection effort takes a risk over the statutory scope of the department's authority, especially where it impinges on local water management districts.  Without direction from the governor or a clear legislative directive, the agency risks considerable backlash in attempting to create a comprehensive water data management system.  The development of a modern water accounting system may have to wait for new legislation from the state assembly.  As of this writing, Assemblyman Bill Dodd has introduced a bill, AB 1755, which would direct the creation of a new data management agency under the DWR to accomplish this task.


Machine-assisted sentiment analysis on NYTimes comments

TLDR: I extract the comments from a recent NYTimes article on a proposed public transit option and perform sentiment analysis on the comment text using a "Machine Learning as a Service" API to analyze public reaction to the news release. I operationalize Albert Wenger's "Future of Programming" post by stitching together various web services: Google Sheets + Blockspring + Indico + Tableau.

Courtesy: NYTimes & Friends of the Brooklyn Queens Connector


Here is my stream of consciousness as I read the Times article on the proposed streetcar to connect Brooklyn and Queens:

  1. Woah - shiny new train!
  2. It is hellish to get to anywhere between Brooklyn and Queens - glad to see the city taking steps to address this.
  3. Wait, shiny new train that travels at 12 mph and will be ready only in 2024? 
  4. (From wife) Street cars remind me of charming European cities.
  5. What about buses? - the SBS is pretty ok and you can make them shiny and they can be here <2024!
  6. Hey, that's my school's logo on the bus! 
  7. I wonder what everyone else is saying ...

...and hence this post. Since I was not entirely sure how I felt about a streetcar connecting BKLYN-QNS, I thought it would be cool to navigate the spectrum of opinion that internet comments offer. Perhaps this could be useful for policy makers who announce a new program and want to gauge public response. This is far from representative of the universe of all public reaction on this topic, but this article was on the front page of The New York Times on Feb 3, 2016, and they moderate their comments, so there is some filtering of internet-isms.

This post is also influenced by Albert Wenger's blog post on the future of programming (yes, the same dude who is proposing a Basic Minimum Income Guarantee), which explains stitching together digital "services" rather than putting together blocks of code to achieve a certain digital task.

First, I found and tweaked code to extract NYTimes comments from Neal Caren's blog post, which basically constructs the following URL: <ARTICLE-URL>&offset=<INCREMENT-BY-25-FOR-EACH-PAGE>&sort=newest
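That pagination pattern is simple enough to sketch in a few lines of Python. This is a minimal illustration of the offset-stepping idea, not the actual scraper (the real thing needs the NYT Community API endpoint and key, which I'm omitting here):

```python
def comment_page_urls(api_url, total_comments, page_size=25):
    """Build one URL per page of comments, stepping the offset by 25
    as in the pattern above: <ARTICLE-URL>&offset=N&sort=newest."""
    return [f"{api_url}&offset={offset}&sort=newest"
            for offset in range(0, total_comments, page_size)]

# e.g. 60 comments -> three pages at offsets 0, 25 and 50
urls = comment_page_urls("https://api.example.com/comments?url=ARTICLE", 60)
```

Each returned URL is then fetched and its JSON payload parsed for the comment fields.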

Here is the raw comment output for the streetcar article

The available fields from each comment are:

  •     commentID
  •     status
  •     commentSequence
  •     userID
  •     userDisplayName
  •     userLocation
  •     userTitle
  •     userURL
  •     commentTitle
  •     commentBody
  •     createDate
  •     updateDate
  •     approveDate
  •     recommendations
  •     editorsSelection
  •     commentType
  •     trusted
  •     recommendedFlag
  •     reportAbuseFlag
  •     permID
  •     timespeople
  •     sharing
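For analysis, only a handful of those fields matter. Here is a hypothetical sketch of trimming one page of the API response down to the useful columns; the `results -> comments` nesting is an assumption based on the field list above, so check it against the raw output linked earlier:

```python
# Fields worth keeping for sentiment analysis (a judgment call, not the API's).
KEEP = ["commentID", "userLocation", "commentBody",
        "recommendations", "approveDate"]

def flatten_comments(page_json):
    """Return one trimmed dict per comment, keeping only the KEEP fields."""
    comments = page_json.get("results", {}).get("comments", [])
    return [{k: c.get(k) for k in KEEP} for c in comments]
```

The trimmed rows can then be pasted straight into a Google Sheet.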

Once I extract all the comments, I throw them up on a Google spreadsheet and use Blockspring to connect to Indico to do sentiment analysis on each comment. Indico is a Machine Learning as a Service company. They abstract away all the complex and "important to understand" math behind cutting-edge machine learning approaches and make it really easy to use, so that you don't need an applied math degree to use these techniques, as I hope to demonstrate here.

Blockspring allows users to extend Google Sheets by connecting tabular data with a host of digital services, ranging from Amazon products to image recognition and, in this case, sentiment analysis. It's a freemium service and I am using the free version.

This is how easy it is to call Indico's sentiment analysis API from a plain-jane Google sheet using Blockspring.

=BLOCKSPRING("higher-quality-sentiment-analysis-indico", "text","i think this is a terrible idea!") returns a sentiment score of 0.004784229677 


=BLOCKSPRING("higher-quality-sentiment-analysis-indico", "text","i think this is a fantastic idea!") returns a sentiment score of 0.9275863767

I proceed to do this for all comments and then throw it up on Tableau. What you see here is a dashboard showing comment activity on the streetcar article overlaid with sentiment scores and other bells and whistles.

Some of the sentiment analysis results are way off, but most are pretty good. The sentiment histogram reveals that opinion is divided, with close to 14% of the comments having very negative sentiment, while the next largest group (8%) has very positive comments. Looking at the commenting behavior over time, close to half the comments were generated within the first 3 hours of the article being posted (11 PM - 2 AM).

The average sentiment is 0.4254 while the median is 0.3564. In its documentation, Indico states that "Values greater than 0.5 indicate positive sentiment, while values less than 0.5 indicate negative sentiment," so overall sentiment in response to this article leans negative.
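Those summary statistics are easy to reproduce once the scores sit in a list. Here is a small sketch using Python's standard library and Indico's 0.5 cutoff; the scores in the example are made up, not the actual comment data:

```python
import statistics

def summarize_sentiment(scores, threshold=0.5):
    """Mean, median, and share of comments below the positive/negative
    cutoff (0.5 per Indico's documentation)."""
    negative_share = sum(s < threshold for s in scores) / len(scores)
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "negative_share": negative_share,
    }

# Made-up scores purely for illustration:
summary = summarize_sentiment([0.1, 0.3, 0.45, 0.6, 0.9])
```

A mean below 0.5 with an even lower median is what "leans negative with a long positive tail" looks like in numbers.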

Click to interact with dashboard



As always "comments" / feedback / critique / any other ideas appreciated.

NYC Trolley Throwback video.

Varun, Team ARGO

PS: Advance apologies for any grammatical infractions


The Unique Civic Data Opportunity in Southern California


"Projects begun in the enthusiasm of boom years have collapsed with the particular boom or have been abandoned like a wagon wheel in the desert.  Reform movements inaugurated during short periods of comparative stability, when the population has begun to take stock of its environment, have been quickly disrupted by new avalanches of population."
--Carey McWilliams, Southern California Country

  The history of boom and bust and population influx has resulted in a curiously convoluted administrative architecture in Southern California's local governance.  Exhibit A: just look at a map of SoCal's 88 cities and compare it to the urban reality you see from space.  And that's before you add in the 200-plus water districts, oodles of school districts (neither of which adhere to city boundaries) and the plethora of miscellaneous other municipalities like vector control.

The whole regional governance thing has been talked about again and again, yet there really is no clear line beyond the administrative fictions we impose dividing up the 20-some-odd million people who call SoCal home.  There's also the general confusion over whether LA means the city, the county, or the SoCal region minus San Diego and the IE / OC (though the latter two less so) -- or simply wherever people root for the Dodgers.

It's illustrative that many of the Hack for LA people live in Santa Monica...

Yet why Southern California and not the world?  I say that only somewhat tongue in cheek, as the cool thing about the explosion in open data portals around the world is that the internet is truly global and you can instantaneously see how LA's budget or many other issues compare to cities everywhere (theoretically at least -- often these get buried in format, metadata and provenance quagmires).

Southern California's Unique Opportunity

Here's where we turn LA's famous weakness -- its fragmented, fractionalized Yugoslavia-esque overlapping jurisdictions (see here for a civic tech example) -- into a huge asset.

How? One word: experimentation.  

Esther Duflo's pioneering work on measuring how best to combat poverty leverages randomized controlled trials, which definitely could be used more and better in American cities.  There's also a host of experiments already happening all around us that we could exploit.

Water districts that employed different conservation strategies to cope with the drought.  Schools that tried different after-school programs.  Etc. etc. etc.  It's not as neat and tidy as an RCT, but hey, nothing in the real world of public affairs ever is as academics would like, and what's already happening all around us is radically cheaper: free!

This notion of using experiments to better understand how to effectively tackle civic challenges connects nicely into using these insights as a mechanism to start new civic entrepreneurial endeavors to actually deliver improvements targeted for the specific needs of groups within Southern California's hugely diverse population.

Recently, for the first time in over a century, a majority of California's residents were born here.  Perhaps we can take the time, then, to take that pioneering spirit, that willingness to experiment with the new rather than rest with the received idea, and apply it to our brass-tacks, bread-and-butter civic challenges.

"A nearly perfect physical environment, Southern California is a great laboratory of experimentation.  Here, under ideal testing conditions, one can discover what will work, in houses, clothes, furniture, etc."
     --Carey McWilliams, Southern California Country


TLDR: I document our preparation and experiences at NYC BigApps, where we presented SQUID as a finalist in the Connected Cities category and Learnr as a semi-finalist in the Civic Engagement category. This post is intended for anyone who may want to apply their creative energies to future BigApps competitions. We have provided links to our final pitch presentation and script, and the code for the demo we created for the BigApps finals.

The pitch booths for Learnr & SQUID at NYC Big Apps semi-finals

Wednesday, Dec 2nd was an intense day for ARGO. We did not win at NYC BigApps, but it gave us an opportunity to prepare for a larger stage. Congratulations to Tommy Mitchell at Citycharge, an idea that took shape during the Occupy Wall Street movement, and all the other winners - supremely deserving! The entire BigApps experience was intense in a good way. While I wish we could have leveraged more of the network that BigApps provides the space for, we met some pretty awesome people during the semis and finals and heard their stories. I'm sure BigApps created many happy accidents to fuel NYC's epic civic tech scene.

NYU CUSP and the awesome SONYC team were also part of the roller coaster ride. The gathering at BAM Cafe before the big final pitch will be remembered as a palpable moment of nerves and adrenaline with a nice balance of camaraderie and competition.

Graham and I had spent the previous week agonizing over a presentation that would last 180 seconds, followed by 120 seconds of Q&A. The judging panel was a collection of very accomplished people with gobs of experience in tech, policy, government and academia. Our final presentation needed to be a tight pitch, controlled and rehearsed down to the syllable while not sounding robotic. BigApps also allowed us to present a demo of our final product. Since SQUID relies on being outside where we could get a GPS fix, the added challenge was to show something that worked and gave the judges just enough of a peek at the idea for them to "get it" during the 1-2 minutes they had to evaluate our demo. We had a weekend and 2 evenings to put this together. Game ON!

Our eventual demo consisted of SQUID connected to a USB-powered LCD screen that I impulsively bought on Amazon as part of the Black Friday froth. The basic idea was to have an interactive demo with some real-time visual feedback of the accelerometer readings. We overlaid the video feed with a graph (generated in matplotlib) of the real-time accelerometer readings.

The LCD screen displaying a live video feed overlaid with a graph of accelerometer readings and annotations


Demo day rapid prototyping!


While this may look disjointed, it was a quick way of showing the sensors at work, i.e. the camera and accelerometer, and giving someone with little prior understanding of SQUID the aha moment (that SQUID measures street quality using data from vibrations and imagery - a supplemental document was provided just to be sure :). It also gave me a chance to get my hands dirty coding up the demo.

The Raspberry Pi, in addition to being a full-fledged Linux computer that leverages the accomplishments of the open source community over the past 20 years, also has a thriving Python ecosystem. A great example of this is the picamera module, a Python interface to the Pi's camera module.

Before putting together this demo, I only had a vague idea of what we wanted to do, and there were no ready-made examples that we could quickly repurpose. The basic elements of this vision of a demo were:

  • Display some imagery superimposed with concurrent accelerometer readings.
  • Package the entire thing into a self-contained unit that explained itself.

picamera allows you to easily annotate text or an image on a video feed. HOWEVER, overlaying anything more complicated quickly becomes beast mode. Short of some pretty dense and customized C++ implementations, the options I found that could be implemented quickly were limited.

Displaying a real-time graph of sensor readings on top of the video feed was painful, but it eventually worked! In a nutshell:
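For the curious, here is a rough sketch of the approach, not the actual demo code (which is linked at the end of this post). picamera overlay buffers must be padded to multiples of 32 pixels wide and 16 pixels tall, and the `run_demo` part assumes a Raspberry Pi with the camera module attached; the graph-drawing step is stubbed out with a comment:

```python
def pad_to_block(width, height):
    """picamera overlays require widths in multiples of 32 and heights
    in multiples of 16; round the graph image dimensions up to fit."""
    return ((width + 31) // 32) * 32, ((height + 15) // 16) * 16

def run_demo(seconds=30):
    # Raspberry Pi only: float a sensor graph over the live camera preview.
    import time
    import numpy as np
    import picamera

    with picamera.PiCamera(resolution=(640, 480)) as camera:
        camera.start_preview()
        pad_w, pad_h = pad_to_block(320, 240)
        buf = np.zeros((pad_h, pad_w, 3), dtype=np.uint8)
        # ...draw the latest accelerometer graph into buf here, e.g. by
        # rendering a matplotlib Agg figure and copying its RGB pixels...
        overlay = camera.add_overlay(buf.tobytes(), size=(320, 240),
                                     layer=3, alpha=128)
        time.sleep(seconds)
        camera.remove_overlay(overlay)
```

The semi-transparent overlay layer is what makes the graph appear to float on top of the video feed.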

I borrowed code from all over and repurposed it. That's it. The screen and other trappings worked out of the box. I want to belabor this point about repurposing code and turning a vague concept into a prototype in a short time.  I do not identify as a software developer, and I am not one. I find that I am way too restless and impatient to carefully implement beautiful complexity. Doing yoga does not fix this, I have learnt; it's innate, although the design patterns from more established software implementations are a great resource.

I relate to the quick-and-dirty school of thought: get a whisk of an idea and then be persistent toward a minimal viable form of execution so that it "just works". This way of doing things is not always comfortable, but it is FUN when things come together.

This post is an attempt to document that experience and address it to a non-technical audience. I want to demonstrate the many (messy) ways of being able to program and make stuff, to think about programming in unconventional ways that are not part of some prescriptive cookbook (although those help tremendously :), and finally to eliminate self-doubt through blind optimism and persistence. This is primarily intended for the programmatically challenged, who I happily identify with and learn from.

Eric S. Raymond, one of the pioneering evangelists of Linux and the early open source movement and author of The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, said this of programmers:

 "Good programmers know what to write. Great ones know what to rewrite (and reuse)".

That is a loaded statement, and ESR is a provocative figure, but it gave me the mental currency to try stuff, however zany, unintuitive and often of no practical use. It may not be the most "optimal" way of doing something, and that's ok.

Large swaths of the internet, I argue, were built this way. I am going to end this thought with yet another reference to Anthony Townsend's Smart Cities book, a source of so many great tech origin stories, which provides further evidence of this bottom-up approach to technology development (function vs. specification):

In the 1970s, telecommunications companies and academic computer scientists battled over the design of the future Internet. Industry engineers backed X. 25, a complex scheme for routing data across computer networks. The computer scientists favored a simpler, collaborative, ad hoc approach. As Joi Ito, director of the MIT Media Lab, describes it: The battle between X. 25 and the Internet was the battle between heavily funded, government backed experts and a loosely organized group of researchers and entrepreneurs. The X. 25 people were trying to plan and anticipate every possible problem and application. They developed complex and extremely well-thought-out standards that the largest and most established research labs and companies would render into software and hardware. The Internet, on the other hand, was being designed and deployed by small groups of researchers following the credo “rough consensus and running code,” coined by one of its chief architects, David Clark. Instead of a large inter-governmental agency, the standards of the Internet were stewarded by small organizations, which didn’t require permission or authority. It functioned by issuing the humbly named “Request for Comment” or RFCs as the way to propose simple and light-weight standards against which small groups of developers could work on the elements that together became the Internet.

The above may ring true for some big breakthrough in the Internet of Things space as well, and most of that ad hoc energy exists today in nondescript DIY community forums. So, in the spirit of early internet innovation, we humbly issue an RFC on this post and the larger thinking behind SQUID and civic data science.  Here is a video of everything coming together for the SQUID BigApps demo.

Special thanks to Oklahomer and his contribution in the picamera space.

Our final pitch presentation & syllable-controlled script

The code for this demo is available here.

