TL;DR: this post documents how to parse water usage data and integrate it with tax assessor and census data. The examples are water specific but the approach in cleaning and integrating messy government data have broader applications.
This post details some tricks for manipulating and merging messy government data from a specific water conservation research study. The goal in sharing these code snippets is to show the power of off the shelf free open source tools to tackle regular municipal data challenges.
Loading the data
With python's pandas package it's a breeze. We just import the library so python has the pandas package loaded and then it's just a line of code.
import pandas as pd accountKey = pd.read_csv('account_apn.csv')
It's also possible to get fancier and specify header names and columns that you want to parse as dates.
usage = pd.read_table('usage.txt', header=0, names = ['account','date','days in billing period','usage','et','unknown'])
The same functionality is built into R and also a breeze.
rebate_clean <- read.csv("rebate_clean.csv")
Transforming rebates from long to wide format
The water utility in question pays cash rebates to encourage water conservation. Each rebate is described individually by a seperate line. We want the rebate type to be its own column so that we can join all the rebates from a given account by date. Here's the quick little r script we used to do that:
data.wide.value <- dcast(rebate_clean, ACCT + REBATE. ~ CODE, value.var="TOTAL.AMOUNT", sum) data.wide.date <- dcast(rebate_clean, ACCT + REBATE. ~ CODE, value.var="ENTER.DATE") data.wide.qty <- dcast(rebate_clean, ACCT + REBATE. ~ CODE, value.var="QTY", sum) data.wide1 <- merge(data.wide.value, data.wide.date, by="REBATE.") data.wide <-merge(data.wide1,data.wide.qty, by="REBATE.")
This code depends on R's reshape2 library's dcast which has excellent documentation that you can pull up either online or using R's help(dcast) command. The tutorial here has a great overview of wide and long formats and how reshape2 can help you move between the two. It's possible to do the same thing in pandas though IMHO R strikes a better balance between flexibility and ease of use.
Getting running totals by account
As part of the research project, we wanted to get running totals for a couple variables by account. To get running totals across an entire data frame it's easy enough to use pandas' built-in .cumsum() functionality but the challenge here is to sum by each individual account. Initially I tried doing this by naively looping over each account but that took longer than 1.5 days so we needed a vectorized approach. The following is what we came up with.
for var in rebateVars: taVar = str(var) + '.totalAmount' columnsToAdd = ['account','date',var,taVar] cumRebate = usageCustomerRebate[columnsToAdd].groupby(by=['account','date']).sum().groupby(level=).cumsum() cumRebate.columns = ['cum'+str(var),'cum'+str(taVar)] usageCustomerRebate = pd.merge(usageCustomerRebate, cumRebate, left_on = ['account', 'date'], right_index = ['account', 'date'])
This approach takes about 824 seconds to run over about a million rows of data. So a vast improvement over the naive loop.
Merging many excel files
For a somewhat tangential but related water flow task, we needed to join a couple hundred excel files by month and key words embedded in the title. See below for what we're working with here.
Here I used python's glob package to manage the many file names, python's xlrd package to read the excel files and pandas to manage the extracted data. Here's how we get all the relevant file names:
path =r'C:\Users\patwater\Dropbox\CUSP_Projects\Drought data\MNWDflow\data' allFiles = glob.glob(path + "/*.xlsx")
Then the trick is to only grab the excel files that we care about. The data comes form three different water treatment facilities so we only want to grab the "regional" data so that we keep apples grouped with apples and oranges grouped with oranges. Here's the code I used to just grab the "regional" excel files:
regional = pd.DataFrame() parseList =  for file in allFiles: if 'regional' in file or 'Regional' in file: print file regionalNames = ['DATE','INFLUENT FLOW MGD', 'SECONDARY EFF. FLOW MGD','FINAL EFFLUENT FLOW MGD', 'RECLAIMED WATER MGD'] df = pd.read_excel(file, skiprows = 7, skipfooter = 8, parse_cols = [1,2,3,4,5]) df.columns = regionalNames parseList.append(df) regional = pd.concat(parseList)
Hopefully that gets your data juices flowing. Please leave any other civic data challenges you have or better ways of dealing with these sorts of data parsing and integration issues in the comments.