By Leon Yin Last Updated 2017-06-11
View this notebook in NBViewer or Github
This two-part module will show you how to request data from USASpending.gov, store it as a tab-separated value (tsv) file, and perform some temporal and spatial analysis.
I hope these two modules are clear enough to be used by journalists, lawyers, and other folks who wish to use data to audit government contractors.
This notebook describes how to download annual records from the USASpending.gov website for a specific contractor.
In this example we download all records for the Corrections Corporation of America (CCA) and all its subsidiaries.
Let's start by installing the Python packages used in this module.
%%sh
pip install -r requirements.txt
%matplotlib inline
import os
from zipfile import ZipFile
from multiprocessing import Pool
import requests
from io import BytesIO
from itertools import repeat
import datetime
import pandas as pd
latest_update = '20170515'
next_latest = '20170115'
year = 2017
dep= 'All'
url = ('http://download.usaspending.gov/data_archives/{UP_MONTH}/'
       'tsv/{{YEAR}}_{DEP}_Contracts_Full_{UP_DATE}.tsv.zip'.format(
           UP_MONTH=latest_update[:-2], DEP=dep, UP_DATE=latest_update))
url
'http://download.usaspending.gov/data_archives/201705/tsv/{YEAR}_All_Contracts_Full_20170515.tsv.zip'
# for files that were not updated by the date above...
url_legacy = ('http://download.usaspending.gov/data_archives/{UP_MONTH}/'
              'tsv/{{YEAR}}_{DEP}_Contracts_Full_{UP_DATE}.tsv.zip'.format(
                  UP_MONTH=next_latest[:-2], DEP=dep, UP_DATE=next_latest))
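Note the double braces around `YEAR`: they escape the first `.format()` call, so a `{YEAR}` placeholder survives and can be filled in per year later. A quick check, using the same values as above:

```python
# The double braces {{YEAR}} survive the first .format() call as {YEAR},
# so the same template can be reused for every fiscal year.
latest_update = '20170515'
dep = 'All'
url = ('http://download.usaspending.gov/data_archives/{UP_MONTH}/'
       'tsv/{{YEAR}}_{DEP}_Contracts_Full_{UP_DATE}.tsv.zip'.format(
           UP_MONTH=latest_update[:-2], DEP=dep, UP_DATE=latest_update))

# Filling in the year gives the final download URL:
print(url.format(YEAR=2016))
# http://download.usaspending.gov/data_archives/201705/tsv/2016_All_Contracts_Full_20170515.tsv.zip
```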
data_in = 'data_in/spending'
# these are the years we are interested in:
start = 2000
end = 2017
years = [y for y in range(start, end + 1)]
# These are the aliases for each company. Matching is case-insensitive,
# but each alias needs to be enclosed in single or double quotes!
companies = ['Corrections Corporation of America',
'CoreCivic',
'TransCor']
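The aliases get joined into a single regex alternation (`name1|name2|...`), which `str.contains` matches case-insensitively. A minimal sketch of how that filter behaves (the vendor names below are made up for illustration):

```python
import pandas as pd

companies = ['Corrections Corporation of America',
             'CoreCivic',
             'TransCor']

# Join the aliases into one alternation pattern; str.contains treats it
# as a regex and, with case=False, matches regardless of capitalization.
pattern = '|'.join(companies)

# Fabricated vendor names, purely to illustrate the matching:
vendors = pd.Series(['CORECIVIC OF TENNESSEE LLC',
                     'TRANSCOR AMERICA LLC',
                     'ACME STAPLERS INC'])
mask = vendors.str.contains(pattern, case=False)
print(mask.tolist())  # [True, True, False]
```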
def load_and_sift(year, regex):
    '''Downloads a zipped tsv file from:
    https://www.usaspending.gov/DownloadCenter/Pages/dataarchives.aspx
    to a requests object.
    Expands the zipfile and reads each file, chunkwise, into Pandas dataframes.
    The dataframe (df) is filtered by the companies' regex expression.
    Args:
        year (int): The fiscal year of records to load.
        regex (string): A regex expression of company name(s).
    Returns:
        df_final: a Pandas dataframe containing records from the given year.
    '''
    print(year)
    r = requests.get(url.format(YEAR=year))
    last_update = datetime.datetime.strptime(latest_update, '%Y%m%d')
    if r.status_code == 404:  # if the url doesn't work, use the legacy url.
        r = requests.get(url_legacy.format(YEAR=year))
        last_update = datetime.datetime.strptime(next_latest, '%Y%m%d')
    if r.status_code == 200:  # make sure the download was successful.
        # the downloaded stream is a zip archive
        zipfile = ZipFile(BytesIO(r.content))
        df_final = pd.DataFrame()
        # for each file in the zip archive
        for f in zipfile.namelist():
            # process the file in dataframe chunks!
            for df in pd.read_csv(zipfile.open(f), sep='\t',
                                  chunksize=100000, low_memory=False):
                # filter the dataframe chunk for active vendors
                # and relevant company names. copy() avoids a
                # SettingWithCopyWarning on the assignments below.
                df = df[(~df['vendorname'].isnull()) &
                        (df['vendorname'].str.contains(regex, case=False))].copy()
                # some date tags...
                df['lastupdate'] = last_update
                df['contract_year'] = year
                df['filename'] = f
                df['search_terms'] = regex
                df_final = df_final.append(df, ignore_index=True)
        return df_final
    else:
        # raising a bare string is invalid in Python 3; raise a real exception.
        raise requests.exceptions.HTTPError(
            'bad request: {} returned {}'.format(r.url, r.status_code))
This next step might take a while. It's expedited using `Pool`, which parallelizes the task across processes. It is equivalent to this serial loop:
df_list = []
for year, co in zip(years, repeat('|'.join(companies))):
    df_list.append(load_and_sift(year, co))
df = pd.concat(df_list, ignore_index=True)
with Pool() as pool:
df_list = pool.starmap(load_and_sift, zip(years, repeat('|'.join(companies))))
df = pd.concat(df_list, ignore_index=True)
We can take a peek at 5 random records here:
df.sample(5)
unique_transaction_id | transaction_status | dollarsobligated | baseandexercisedoptionsvalue | baseandalloptionsvalue | maj_agency_cat | mod_agency | maj_fund_agency_cat | contractingofficeagencyid | contractingofficeid | ... | prime_awardee_executive4 | prime_awardee_executive4_compensation | prime_awardee_executive5 | prime_awardee_executive5_compensation | interagencycontractingauthority | last_modified_date | lastupdate | contract_year | filename | search_terms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
782 | 43e9c36cedfb72813511fe13fc39d742 | active | 0.00 | 0.0 | 0.0 | 1500: JUSTICE, DEPARTMENT OF | 1501: OFFICES, BOARDS AND DIVISIONS | 1500: JUSTICE, DEPARTMENT OF | 1501: OFFICES, BOARDS AND DIVISIONS | OFDT: OFFICE OF THE FEDERAL DETENTION TRUSTEE | ... | NaN | 0.0 | NaN | 0.0 | X: Not Applicable | 01/09/2012 | 2017-04-15 | 2012 | datafeeds\2012_All_Contracts_Full_20170415.tsv | Corrections Corporation of America|CoreCivic|T... |
280 | 4871c9c2bf29ef438ad8c8e2a3a22926 | active | 48073.00 | 48073.0 | 48073.0 | 9700: DEPT OF DEFENSE | 5700: DEPT OF THE AIR FORCE | 9700: DEPT OF DEFENSE | 5700: DEPT OF THE AIR FORCE | FA5613: FA5613 700 CONS LGC | ... | NaN | 0.0 | NaN | 0.0 | X: Not Applicable | 04/18/2008 | 2017-04-15 | 2008 | datafeeds\2008_All_Contracts_Full_20170415.tsv | Corrections Corporation of America|CoreCivic|T... |
874 | 4ec36c4a3a56e151c7e7f89d77f0be77 | active | 0.00 | 0.0 | 0.0 | 1500: JUSTICE, DEPARTMENT OF | 1501: OFFICES, BOARDS AND DIVISIONS | 1500: JUSTICE, DEPARTMENT OF | 1501: OFFICES, BOARDS AND DIVISIONS | OFDT: OFFICE OF THE FEDERAL DETENTION TRUSTEE | ... | NaN | 0.0 | NaN | 0.0 | X: Not Applicable | 06/18/2013 | 2017-04-15 | 2012 | datafeeds\2012_All_Contracts_Full_20170415.tsv | Corrections Corporation of America|CoreCivic|T... |
919 | 20204082b037e602ef22bad3c6d7a918 | active | 1522136.00 | 1522136.0 | 1522136.0 | 1500: JUSTICE, DEPARTMENT OF | 1501: OFFICES, BOARDS AND DIVISIONS | 1500: JUSTICE, DEPARTMENT OF | 1501: OFFICES, BOARDS AND DIVISIONS | OFDT: OFFICE OF THE FEDERAL DETENTION TRUSTEE | ... | NaN | 0.0 | NaN | 0.0 | X: Not Applicable | 04/26/2013 | 2017-04-15 | 2013 | datafeeds\2013_All_Contracts_Full_20170415.tsv | Corrections Corporation of America|CoreCivic|T... |
1306 | 3d423442685d42d99a05519784779622 | active | -2087743.31 | 0.0 | 0.0 | 1500: Department of Justice | 1544: U.S. MARSHALS SERVICE | 1500: Department of Justice | 1544: U.S. MARSHALS SERVICE | HQ018: OFFICE OF CONTRACTS&AGREEMENTS | ... | NaN | 0.0 | NaN | 0.0 | X: NOT APPLICABLE | 12/07/2015 | 2017-04-15 | 2016 | datafeeds\2016_All_Contracts_Full_20170415.tsv | Corrections Corporation of America|CoreCivic|T... |
5 rows × 229 columns
There are a lot of columns! So it might be hard to know what you're after.
len(df.columns)
229
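One way to cut through the 229 columns is to select just the handful you care about. A sketch, using column names that appear in the code and sample output above (the demo frame below holds placeholder values, not real records):

```python
import pandas as pd

# Columns that show up in the filtering code and the sample above:
cols = ['unique_transaction_id', 'vendorname', 'dollarsobligated',
        'maj_agency_cat', 'contract_year']

# A stand-in frame with fabricated values, plus one extra column
# to show that selection drops everything not in `cols`:
demo = pd.DataFrame(
    [['abc123', 'CORECIVIC', 48073.0, '9700: DEPT OF DEFENSE', 2008, 'noise']],
    columns=cols + ['interagencycontractingauthority'])

slim = demo[cols]
print(slim.shape)  # (1, 5)
```

On the real `df` built earlier, `df[cols]` works the same way.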
Save the data to a gzipped tab-separated value (tsv) file.
outfile = data_in + '_' + companies[0].replace(' ', '_').lower() + '.tsv.gz'
df.to_csv(outfile, sep='\t', compression='gzip', index=False)
print("Data saved to {}".format(outfile))
Data saved to data_in/spending_corrections_corporation_of_america.tsv.gz
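Reading the archive back in later is symmetric to the save: pandas infers gzip compression from the `.gz` suffix, though passing it explicitly does no harm. A small round-trip sketch on a fabricated frame (the real file path from above would be used in practice):

```python
import os
import tempfile
import pandas as pd

# A tiny stand-in frame with made-up values:
df = pd.DataFrame({'vendorname': ['CORECIVIC'],
                   'dollarsobligated': [48073.0]})

# Save the same way as above, then read it back.
outfile = os.path.join(tempfile.mkdtemp(), 'spending_demo.tsv.gz')
df.to_csv(outfile, sep='\t', compression='gzip', index=False)

df_back = pd.read_csv(outfile, sep='\t', compression='gzip')
print(df_back.equals(df))  # True
```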
Now we're ready to perform some analysis!