
Source Data

The underlying data about the U.S. Congress that powers this site is the only such database made freely available for others to reuse. The data covers the activity of bills, PDFs of bill texts, roll call votes, and photos of members of Congress. Almost all of the data files are in XML format. You can browse the underlying source data for this website at http://www.govtrack.us/data. These are the very same data files that GovTrack itself runs on, so pretty much anything you see on the site is in one of those files.

The data files are released into the public domain, but see the license terms, which cover your access. Please contact me if you would like to start using the data, just because I'm curious and like to know what it's being used for. Also, if you are an ongoing user of the data, it is really important that you join the GovTrack mailing list to stay updated on data format changes, which are occasionally not backwards compatible.

Data Directory

Here is an overview of much of the data provided by GovTrack. Read on below for how to access the data. Occasional downloads by HTTP are permitted, but in most cases the only supported method for you to get these files is to download them in bulk with Unix rsync (see below).

  • The root of the data directory is http://www.govtrack.us/data. Feel free to explore. However, do not use HTTP to refresh a file more than once daily or to download data in bulk. See the section on rsync below for that.
  • data/us/people.xml: This file contains everyone that has ever served in Congress, and U.S. presidents, with their party affiliation, terms in Congress, birthdays, etc. This file is quite large... best not to open it in your browser. This file has been put together from a variety of sources and is maintained by hand. All people in the database are identified by a numeric ID with no particular meaning.
    • The format of this file is essentially self-explanatory.
    • On the person elements, the only required fields are id, firstname, lastname, name, and bioguideid. Other fields are omitted if they are not known.
    • bioguideid, osid, and pvsid refer to the ids assigned at bioguide.congress.gov, the Center for Responsive Politics, and VoteSmart.
    • title and state attributes are set if the person currently has a role in Congress.
    • role elements within each person node indicate each elected term in Congress the person has served or is serving.
    • The current attribute is present and set to 1 if the role is current. You can also determine this by looking at the start and end dates, which are in YYYY-MM-DD format.
    • The type attribute is sen (senator), rep (congressman), or prez (U.S. president). For senators, a class attribute gives their election class (1, 2, or 3). For representatives, the district attribute gives the congressional district: 0 for at-large, -1 for historical data where the district is not known.
    • For senators and representatives, the state attribute gives the USPS state abbreviation of the state or territory they represent. Besides the 50 states, this includes delegates from American Samoa (AS), District of Columbia (DC), Guam (GU), Northern Mariana Islands (MP), Puerto Rico (PR), Virgin Islands (VI), and the former (for historical data) Dakota Territory (DK), Philippines Territory/Commonwealth (PI), and Territory of Orleans (OL). Puerto Rico's delegate is called a Resident Commissioner.
  • The data/photos directory contains JPEG images of Members of Congress, past and present. Not all MoCs have photos. The name of each photo is the GovTrack numeric identifier for the person followed by: nothing, for the largest original image available; or 200px, 100px, or 50px, for three sizes of the photo, by width; all followed by .jpeg. The -credit.txt files give a tab-delimited source URL and source description for each photo.
  • Most other files are organized by "Congress". A "Congress" is a two-year term of activity, starting in the year after an election year. Many things in Congress reset after each two-year term, such as bill numbers. In GovTrack, a "Congress" is called a "session", which is actually a misnomer because each "Congress" is made up of two "sessions" which follow the calendar years. The 112th Congress starts in 2011. Each Congress is in its own directory: data/us/112, data/us/111, data/us/110, etc. Roll call data goes back to the first Congress, so the directories go back to data/us/1.
  • data/us/sessions.tsv gives the start and end date of each (one-year) session and (two-year) Congress.
  • data/us/111/people.xml mimics the layout of the full people.xml file described above but only contains those Members of Congress who have or had a role during this particular session of Congress.
  • Roll call votes in data/us/111/rolls represent all votes where individual votes have been recorded. Votes by unanimous consent, for instance, are not included here. The format of this file type is described in additional documentation.
  • Bills and resolutions are encoded in data/us/111/bills. The format of this file type is described in additional documentation.
  • data/us/gis/zip4dist-prefix.txt.gz: A table (.gz-compressed) mapping ZIP+4 codes to congressional districts. If all ZIP codes starting with a prefix map to the same district, only the prefix is included. An entry like "123 NY-01" means all ZIP and ZIP+4 codes starting with "123" are in New York's first congressional district. Current as of the summer of 2008; this comes from work by Carl Malamud and Aaron Swartz.
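
The prefix rule in the ZIP+4 table can be applied with a longest-prefix lookup. The sketch below assumes each row is a prefix and a district separated by whitespace, as in the "123 NY-01" example; the actual delimiter in the file has not been verified here. The function names and the sample rows are made up for illustration.

```python
import gzip


def load_prefix_table(lines):
    """Build a {prefix: district} dict from rows like '123 NY-01'.

    Assumption: two whitespace-separated fields per row. Other rows are skipped.
    """
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            prefix, district = parts
            table[prefix] = district
    return table


def load_prefix_file(path):
    """Read the .gz-compressed table from a local copy of the file."""
    with gzip.open(path, "rt") as handle:
        return load_prefix_table(handle)


def lookup_district(table, zip_code):
    """Return the district for a ZIP or ZIP+4 code, or None if no prefix matches."""
    # Keep only the digits so "12345-6789" and "123456789" behave the same.
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    # Try the longest candidate prefix first, then shorten one digit at a time.
    for length in range(len(digits), 0, -1):
        district = table.get(digits[:length])
        if district is not None:
            return district
    return None


# Hypothetical sample rows, not taken from the real file.
SAMPLE_ROWS = ["123 NY-01", "100011 NY-08", "100012 NY-14"]
sample_table = load_prefix_table(SAMPLE_ROWS)
```

With a local copy of the file, `load_prefix_file("zip4dist-prefix.txt.gz")` would build the same kind of table. Since the data is current only as of the summer of 2008, results for later district maps should be treated with care.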
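
As an illustration of the people.xml layout described in the list above, here is a minimal sketch that lists people with a current role. It relies only on the attributes named in this document (id, name, type, state, class, district, current). The sample XML, the root element name, and the function name are invented for the example, so check them against the real file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape described above; all values are invented.
SAMPLE = """<people>
  <person id="300001" firstname="Jane" lastname="Doe" name="Jane Doe"
          bioguideid="D000000" title="Sen." state="NY">
    <role type="rep" state="NY" district="0"/>
    <role type="sen" state="NY" class="1" current="1"/>
  </person>
  <person id="300002" firstname="John" lastname="Roe" name="John Roe"
          bioguideid="R000000">
    <role type="rep" state="OH" district="-1"/>
  </person>
</people>"""


def current_members(root):
    """Return (id, name, role type, seat) tuples for every current role."""
    members = []
    for person in root.iter("person"):
        for role in person.findall("role"):
            # The current attribute is present and set to 1 on current roles.
            if role.get("current") != "1":
                continue
            kind = role.get("type")
            if kind == "sen":
                seat = f"{role.get('state')} class {role.get('class')}"
            elif kind == "rep":
                district = role.get("district")
                if district == "0":
                    district = "at large"
                elif district == "-1":
                    district = "unknown district"
                seat = f"{role.get('state')}-{district}"
            else:
                # type is prez; no state or district applies.
                seat = "president"
            members.append((person.get("id"), person.get("name"), kind, seat))
    return members


members = current_members(ET.fromstring(SAMPLE))
```

Because the full people.xml is quite large, a streaming parser such as `ET.iterparse` may be a better fit than loading the whole tree at once.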

Bulk Access

To download files in bulk, or if you plan to regularly update files, you must use the rsync tool. I require rsync because it will only download updated files when you want to refresh your local files, and it supports compression.

Getting the Data On Linux/Mac OS X

rsync is available on Linux, and you should be able to find it for Mac OS X as well. On a command line, type, for example:

rsync -az --delete --delete-excluded govtrack.us::govtrackdata/us/111/bills .

This will download the 111th Congress bill data into a directory called bills in the current directory.
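
If you script your downloads, the command above can be assembled per Congress. This is only a sketch: the function names are hypothetical, and the year-to-Congress arithmetic simply treats each Congress as two calendar years starting in an odd year (2011 is the 112th), which ignores the exact start and end dates listed in data/us/sessions.tsv.

```python
def congress_for_year(year):
    """Approximate Congress number for a calendar year (1789 -> 1, 2011 -> 112)."""
    if year < 1789:
        raise ValueError("the first Congress began in 1789")
    return (year - 1789) // 2 + 1


def rsync_command(congress, subdir="bills", dest="."):
    """Build the argument list for the rsync call shown above."""
    source = f"govtrack.us::govtrackdata/us/{congress}/{subdir}"
    return ["rsync", "-az", "--delete", "--delete-excluded", source, dest]


command = rsync_command(congress_for_year(2010))

# To actually run it (this needs rsync installed and network access):
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the arguments as a list rather than a single string avoids shell quoting problems if the destination path contains spaces.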

Getting the Data on Windows

On Windows, install DeltaCopy, which contains rsync for Windows. Then on a command line type:

mkdir C:\GovTrackData
cd "\Program Files\Synametrics Technologies\DeltaCopy"
rsync -avz --delete govtrack.us::govtrackdata/us/111/bills /GovTrackData

Note that you have to give the path to your GovTrackData directory without the drive letter (/GovTrackData rather than C:\GovTrackData), because rsync will interpret "C:" as something other than a drive letter, since there are no drive letters in the Unix world.

This will put bill XML files in either C:\GovTrackData\bills or C:\cygwin\GovTrackData\bills. cygwin is the name of a common Windows wrapper around Unix tools. That's something to do with DeltaCopy, not GovTrack.

What This Does

The commands above download the 111th Congress bill data. The first download should be roughly 75MB. Subsequent updates will be much smaller. The directory structure exposed by rsync mirrors the HTTP-browsable data directory (but, again, please don't do massive downloading by HTTP).

In all, the source data is 16 gigabytes, so don't think about downloading the whole thing in one shot. And be nice to my bandwidth.

The XML files are updated roughly daily (a good time for you to rsync them is 4PM Eastern time, daily). The directories for roll call votes (e.g. data/us/111/rolls) are updated much more frequently. If you need almost-real-time roll call vote data, you can rsync that directory hourly.