So you want to build a site like GovTrack for your state? Cool!
A lot of governing that affects our lives occurs at the state level, but there are only a few sites like GovTrack following state legislation. If you're interested in applying technology to improve government, then this is a call for you to get involved in building a legislation tracking tool in your own state.
First of all...
- Check out your state legislature's websites. See what is already available. Maybe their sites are pretty good and you don't really need to build an independent site for your state.
- Drop them an email and see what they think about extending their website to do the things you want. They might be happy to hear your thoughts. Governments always face budget limitations and narrow mandates, so if they aren't enthusiastic it's probably "the system" not the people.
- You're not going to want to type in all of the information by hand each day, so get a feel for what kind of information about your state's legislation is already online. Can you find out who is elected to the state legislature, what bills they sponsor, and how they voted?
- Sunlight Foundation has something called the Fifty State Project where they're trying to get developers to help build a large 50-state database of legislative information. If you're a software developer, take a look. That's the first place to start.
- If you want to talk to others with a similar interest, please join GovTrack's mailing list.
To build a site like this, you need a good programmer with experience in building data-driven websites. Depending on what your state makes available and how they make it available (i.e., as a database directly or only as a website), it can take more or less time to program the data-gathering part of the site. Then you still have to build the website itself. 100 hours is what it takes if you're lucky; 500 hours to get off the ground is probably more like it.
Gathering the data
There are really two parts to a government-tracking system. The first part is gathering the information into a central database. How to do this depends on what your state makes available to you.
In the worst (but usual) case, the state provides no machine-usable database of legislative information. This is the state of things at the federal level. GovTrack therefore gets its information by "screen-scraping" various government websites (mainly THOMAS). Screen-scraping means writing a program that can automatically fetch a web page and extract information out of the mess of HTML tags. GovTrack's screen-scrapers are written in Perl because it happens to be well-suited for flexibly searching through text with "regular expressions". But Python and the other lightweight scripting languages would work just as well.
Screen-scraping is an unfortunate necessity most of the time. In an ideal world, the government would simply allow direct access to the databases it uses to power its own websites. After all, we're talking about public domain information. If "machine-readable" formats of the data were available, the gathering part of this task would be straightforward. The downside of screen-scraping is that minute changes to the format or layout of a page can break the screen-scraper, so you have to be committed to maintaining the screen-scraper after you've finished writing it. The first task, then, is to transform the human-readable government web pages into machine-readable databases.
Now, the information available at the federal level is not the same as what is available at the state level. The government site THOMAS provides incredibly comprehensive information about all activity on federal legislation. States vary widely in what information they put online, and that limits the type of independent tracking site that can be made (without an enormous effort of requesting hard copies of the information). The first step is to look for the website of your state legislature and check out what kind of information is published. Is there a list of all pending legislation? Does it show the status of legislation, i.e., has it been voted on, enacted, etc.? Is there a list of legislators and how they have voted?
If you're particularly daring, you could try calling up your state assembly and asking someone in the know about what kind of information they put online (maybe it's not all publicly visible), and whether they would be interested in publishing information in a structured format.
Pick some page with interesting information and write a little screen-scraping program to extract the information out of it. Your program might want to go down through the raw HTML line-by-line, applying a regular expression on each line to pick out the number or title of a bill.
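Here is a rough sketch of what such a little screen-scraper might look like in Python. The page layout, the sample HTML, and the names BILL_LINE and scrape_bills are all made up for illustration; your state's pages will look different, and the regular expression will need to be adapted to them.

```python
import re

# Hypothetical layout: assumes each bill sits on its own line as a link
# followed by a dash and a title, e.g.
#   <li><a href="/bills/hb1201">HB 1201</a> - An act concerning ...</li>
BILL_LINE = re.compile(
    r'<a[^>]*>\s*(?P<type>[A-Z]+)\s*(?P<number>\d+)\s*</a>\s*-\s*(?P<title>[^<]+)'
)


def scrape_bills(html):
    """Go through the raw HTML line by line and pick out bill type, number and title."""
    bills = []
    for line in html.splitlines():
        match = BILL_LINE.search(line)
        if match:
            bills.append((
                match.group("type"),
                int(match.group("number")),
                match.group("title").strip(),
            ))
    return bills


# Made-up sample page standing in for a downloaded legislature page.
SAMPLE_HTML = """\
<html><body>
<h1>Pending Legislation</h1>
<li><a href="/bills/hb1201">HB 1201</a> - An act concerning school funding</li>
<li><a href="/bills/sb17">SB 17</a> - An act concerning water rights</li>
</body></html>
"""

if __name__ == "__main__":
    for bill in scrape_bills(SAMPLE_HTML):
        print(bill)
```

In a real scraper the HTML would come from fetching the page (for instance with urllib.request from the standard library) rather than from a string in the program.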
The information isn't useful unless it ends up in a structured format, something that will help you work with the data when you want to display it on your own web page, or when you want to scan it for events to update users about. So you'll want to output what you screen-scrape into a structured data file. XML works very well for this purpose. The XML should be "normalized," which means that you put things like bill identifiers, like "H.R. 1201," into a very precise, rigid format, for instance by splitting the two parts of the identifier into attributes, e.g. "<bill type='h' number='1201'/>". Beyond normalization, you shouldn't worry much about what the XML file looks like, as long as it very precisely captures the information you've found in the original HTML.
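As a sketch of the normalization step, the Python below turns a loosely written identifier like "H.R. 1201" into the rigid attribute form described above and writes it out as XML. The TYPE_CODES table, the function names, and the "title" child element are assumptions made for this example, not GovTrack's actual schema.

```python
import re
import xml.etree.ElementTree as ET

# Letters (possibly with dots and spaces) followed by a number.
ID_PATTERN = re.compile(r'^\s*([A-Za-z.\s]+?)\s*(\d+)\s*$')

# Illustrative mapping from cleaned-up prefixes to short type codes.
# These codes are an assumption for this sketch; choose your own scheme.
TYPE_CODES = {"hr": "h", "s": "s"}


def normalize_bill_id(raw):
    """Split an identifier such as 'H.R. 1201' into a (type, number) pair."""
    match = ID_PATTERN.match(raw)
    if not match:
        raise ValueError("unrecognized bill identifier: %r" % raw)
    # Drop dots, spaces and case so 'H.R.', 'HR' and 'h. r.' all look alike.
    prefix = re.sub(r'[^a-z]', '', match.group(1).lower())
    return TYPE_CODES.get(prefix, prefix), int(match.group(2))


def bill_to_xml(raw_id, title):
    """Build a <bill> element with normalized attributes and return it as text."""
    bill_type, number = normalize_bill_id(raw_id)
    bill = ET.Element("bill", {"type": bill_type, "number": str(number)})
    ET.SubElement(bill, "title").text = title
    return ET.tostring(bill, encoding="unicode")


if __name__ == "__main__":
    print(bill_to_xml("H.R. 1201", "An act concerning school funding"))
```

Doing the normalization once, at scraping time, means every later step can compare bills by type and number without worrying about how a particular page happened to write the identifier.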
Once you have at least one scraper, you should think about how to fetch updated information from the state's website on a regular basis, preferably without having to download their entire website to your computer each day. You may have to scrape just for a list of updated bills, and then re-fetch the pages for only those bills.
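One way to avoid re-downloading everything might be to remember the "last updated" stamp you saw for each bill and compare it against a freshly scraped index. The sketch below assumes the state's site exposes some such stamp (a date, a last-action string, anything that changes when the bill does); the bill identifiers, dates, and function names are hypothetical.

```python
import json
from pathlib import Path


def load_state(path):
    """Read the stamps remembered from the previous run (empty on the first run)."""
    state_file = Path(path)
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {}


def save_state(path, state):
    """Remember the stamps for the next run."""
    Path(path).write_text(json.dumps(state, indent=2, sort_keys=True))


def bills_needing_refresh(index, seen):
    """Return the ids of bills that are new or whose stamp has changed.

    index: {bill_id: stamp} scraped from today's list of bills
    seen:  {bill_id: stamp} remembered from the previous run
    """
    return sorted(
        bill_id for bill_id, stamp in index.items()
        if seen.get(bill_id) != stamp
    )


if __name__ == "__main__":
    # Made-up data standing in for yesterday's state and today's index page.
    previous = {"hb1201": "2009-01-05", "sb17": "2009-01-06"}
    todays_index = {"hb1201": "2009-01-05", "sb17": "2009-01-09", "hb1300": "2009-01-09"}
    print(bills_needing_refresh(todays_index, previous))  # ['hb1300', 'sb17']
```

A daily job would call load_state, re-fetch only the pages for the returned ids, and then call save_state with today's index.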
If you're lucky enough to live in a state where such a database is already made available to you by your legislature, you'll still probably want to bring the information together into your own database system that's more suited for the next step, and to develop some scripts to keep your database up to date.
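Whichever way the data arrives, a small local database makes the next step easier. Below is a minimal sketch using SQLite from Python's standard library; the table layout, column names, and function names are assumptions for illustration only. Re-running the load with fresh data replaces the old rows, which is one simple way to keep the database up to date.

```python
import sqlite3


def open_database(path=":memory:"):
    """Open (or create) the database and make sure the bills table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS bills (
               type   TEXT    NOT NULL,
               number INTEGER NOT NULL,
               title  TEXT,
               status TEXT,
               PRIMARY KEY (type, number)
           )"""
    )
    return conn


def store_bills(conn, bills):
    """Insert bills, replacing any row with the same (type, number)."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO bills (type, number, title, status) "
            "VALUES (?, ?, ?, ?)",
            bills,
        )


def bills_with_status(conn, status):
    """Return (type, number, title) for every bill with the given status."""
    rows = conn.execute(
        "SELECT type, number, title FROM bills WHERE status = ? "
        "ORDER BY type, number",
        (status,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_database()
    store_bills(conn, [
        ("h", 1201, "School funding", "introduced"),
        ("s", 17, "Water rights", "introduced"),
    ])
    # A later run sees that the first bill has moved along.
    store_bills(conn, [("h", 1201, "School funding", "passed")])
    print(bills_with_status(conn, "passed"))
```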
Publishing the Information
The second part of the government-tracking system is providing a website interface like GovTrack's for navigating through the data so users can do research, and then for letting users subscribe to track particular events.
Because you might be interested in providing some of the same functionality as GovTrack, like email updates, RSS/Atom feeds, maps for finding legislative districts, etc., it might be useful to avoid the duplicated effort of creating a whole new state-specific site by instead integrating your state-level data into www.GovTrack.us. This hasn't been done before, but it's an interesting possibility.
Integrating in some form with GovTrack helps to create a network of political information at multiple levels of government. The practical applications of integration include being able to easily track a politician's history in different aspects of government, or in searching for topic-specific legislation at the federal and local level at the same time.
On the other hand, you might rather run your own website. Maybe that's even better! (Especially since the code base of OpenCongress.org is totally reusable, and GovTrack's website sources are a bit of a mess.)
How hard is all of this?
This is a project that requires a real commitment. GovTrack took about two years of free-time-now-and-then to create, and it continues to require regular attention to keep the screen-scrapers going when the sources of data change their sites. (It doesn't happen often, but it happens.) But, now that GovTrack exists, spin-offs should go much faster since you'll have the moral support of GovTrack behind you.
Closing Remarks
There's a real need for state-level sites like GovTrack. There is virtually no awareness of state-level politics, at least not in the communities I've been in, and yet it's a very important part of a federalist system. So please get involved!

