Source Data

The underlying data about the U.S. Congress that powers this site is the only such database made freely available for others to reuse. The data covers bill activity, PDFs of bill texts, roll call votes, indexes for fast searches (meant for me, not for you), and photos of members of Congress. Almost all of the data files are in XML format.
You can browse the underlying source data for this website here. These are the very same data files that GovTrack uses to make itself go, so pretty much anything you see on the site is in one of those files. You can think of this directory as a really simple API: once you know the naming convention for the files (that's the API), you can access the XML directly via HTTP. There's no API key for this data and no restrictions on how the data can be used. To the extent that I have any copyright claims to any of the data described here, I am releasing it into the public domain.
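As a rough illustration of treating the directory as an API, here is a minimal Python sketch that joins path components onto a base URL, fetches one file with HTTP GET, and parses it as XML. The base URL is a placeholder (use whatever address the data directory link points to), and the function names data_url and fetch_xml are made up for this example.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder only: substitute the real address of the data directory.
BASE_URL = "http://example.org/data/"


def data_url(*parts, base=BASE_URL):
    """Join path components onto the data directory's base URL."""
    path = "/".join(str(p).strip("/") for p in parts)
    return base.rstrip("/") + "/" + path


def fetch_xml(url, timeout=30):
    """Download one file with a plain HTTP GET and parse it as XML."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return ET.fromstring(response.read())
```

With the real base URL filled in, something like fetch_xml(data_url("us", 110, "bills.index.xml")) would return the root element of that file, assuming the file lives at that path. Check the path against the data directory listing before relying on it.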
Some documentation of the structure of the data files is in the Data Directory page on the wiki. You're encouraged to update the wiki as you figure things out, or email me with questions so I can update the documentation.
Please contact me if you would like to start using the data, just because I'm curious and like to know what it's being used for. (I never say no. Commercial users should keep in mind that the data is provided as-is, with no guarantee of continued service!)
Quick links:
- data: The root of the data directory.
- Documentation for the data files.
- people.xml - current Congress: Everyone who is serving in Congress now, plus anyone who has resigned/retired/died/etc. since the start of the current Congress (i.e. two-year session). It includes their party affiliation, terms in Congress this session (more than one if they changed roles), birthdays, current committee assignments, etc. Note that the location of the file changes every two years.
- people.xml - with historical info: Everyone who has ever served in Congress, and U.S. presidents, with their party affiliation, terms in Congress, birthdays, etc. This file is quite large... best not to open it in your browser.
- bills.index.xml: A summary of the bills introduced this session of Congress.
- zip4dist-prefix.txt.gz: A table (.gz-compressed) mapping ZIP+4 codes to congressional districts. If all ZIP codes starting with a prefix map to the same district, only the prefix is included. An entry like "123 NY-01" means all ZIP and ZIP+4 codes starting with "123" are in New York's first congressional district. Current as of the summer of 2008; this comes from work by Carl Malamud and Aaron Swartz here.
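The prefix table described in the last item can be queried with a simple longest-prefix lookup. The Python sketch below is one way to do it, assuming each line holds a prefix and a district separated by whitespace (as in the "123 NY-01" example) and that prefixes are stored as plain digits with no hyphen. The function names are hypothetical.

```python
import gzip


def parse_prefix_table(lines):
    """Build a {prefix: district} dict from lines such as '123 NY-01'."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            table[fields[0]] = fields[1]
    return table


def load_prefix_table(path):
    """Read the .gz-compressed table from a local copy of the file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_prefix_table(handle)


def lookup_district(table, zipcode):
    """Return the district for a ZIP or ZIP+4 code, or None if no prefix matches."""
    # Assumption: prefixes in the file are digits only, so drop any hyphen.
    digits = "".join(ch for ch in zipcode if ch.isdigit())
    # Try the longest candidate prefix first, then shorten one digit at a time.
    for end in range(len(digits), 0, -1):
        district = table.get(digits[:end])
        if district is not None:
            return district
    return None
```

For example, with a table built from the single line "123 NY-01", lookup_district(table, "12345-6789") returns "NY-01", and a code with no matching prefix returns None.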
Bulk Access
If you're going to be using the data directory as an API and are requesting select files as needed, then just use HTTP GET. That is, just download the files from the data directory like normal. But if you want a large chunk of the database, and especially if you want to keep it up to date, you may not use HTTP.
To download files in bulk, or if you plan to regularly update files, you must use the rsync tool. I require rsync because it only downloads updated files when you refresh your local copy, and it supports compression. rsync is readily available on Linux; just type:
rsync -az govtrack.us::govtrackdata/us/110/bills .
On Windows, install DeltaCopy, which contains rsync for Windows. Then go to a command line and type:
mkdir C:\GovTrackData
cd "\Program Files\Synametrics Technologies\DeltaCopy"
rsync -avz --delete govtrack.us::govtrackdata/us/110/bills C:/GovTrackData
This will download the 110th Congress bill data into a directory called bills in the current directory on Linux, or into C:\GovTrackData\bills on Windows. The first download should be roughly 75 MB. Subsequent updates will be much smaller. The directory structure exposed by rsync mirrors the HTTP-browsable data directory (but, again, please don't do massive downloading by HTTP).
The source data totals 16 gigabytes, so don't think about downloading the whole thing in one shot. And be nice to my bandwidth.
The XML files are updated roughly daily (a good time for you to rsync them is 4PM Eastern time, daily). The directories for roll call votes (e.g. .../110/rolls) are updated much more frequently. If you need almost-real-time roll call vote data, you can rsync that directory hourly.
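If you want to drive these updates from a script, the rsync invocations above can be wrapped in a few lines of Python. This is only a sketch: the function names are made up, it assumes rsync is on your PATH, and scheduling (daily for bills, hourly for rolls) is left to cron or whatever scheduler you already use.

```python
import subprocess

# rsync module and path layout copied from the commands shown above.
RSYNC_MODULE = "govtrack.us::govtrackdata"


def rsync_command(congress, subdir, dest=".", delete=False):
    """Return the rsync argument list for one directory of one Congress."""
    source = f"{RSYNC_MODULE}/us/{congress}/{subdir}"
    command = ["rsync", "-az"]
    if delete:
        # Also remove local files that no longer exist on the server.
        command.append("--delete")
    return command + [source, dest]


def sync(congress, subdir, dest=".", delete=False):
    """Run rsync and raise CalledProcessError if it exits with an error."""
    subprocess.run(rsync_command(congress, subdir, dest, delete), check=True)
```

Calling sync(110, "bills") runs the same command as the Linux example, and sync(110, "rolls") fetches the roll call vote directory for the same Congress.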
