Tracking State Governments
So you want to build a site like GovTrack for your state? Cool!
A lot of governing that affects our lives occurs at the state level,
but there are only a few sites
like GovTrack following state legislation.
If you're interested in applying
technology to improving government, then this is a call for you to get
involved in building a legislation tracking tool in your own state.
First of all...
- Check out your state legislature's websites. See what is already
available. Maybe their sites are pretty good and you don't really
need to build an independent site for your state.
- Drop them an email and see what they think about extending their
website to do the things you want. They might be happy to hear your
thoughts. Governments always face budget
limitations and narrow mandates, so if they aren't enthusiastic,
it's probably "the system," not the people.
- You're not going to want to type in all of the information
by hand each day, so get a feel for what kind of information about
your state legislation is already online. Can you find out who is
elected to the state legislature, what bills they sponsor, and how
they voted?
- Sunlight Foundation has something called the
Fifty State Project
where they're trying to get developers to help build a large 50-state
database of legislative information. If you're a software developer,
take a look. That's the first place to start.
- If you want to talk to others with a similar interest, please join GovTrack's mailing list.
To build a site like this, you need a good programmer with experience
in building data-driven websites. Depending on what your state makes
available and how they make it available (i.e. as a database
directly or only as a website), the data-gathering part of the site
can take more or less time to program. Then you have to
build the website itself. If you're lucky it might take 100 hours,
but 500 hours to get off the ground could be more like it.
Gathering the data
There are really two parts to a government-tracking system. The
first part is gathering the information into a central database.
How to do this depends on what your state makes available to you.
In the worst (but usual) case, the state provides no machine-usable
database of legislative information. This is the state of things at
the federal level. GovTrack therefore gets its information by "screen-scraping" various government websites
(mainly THOMAS). Screen-scraping
means writing a program that can automatically fetch a web page
and extract information out of the mess of HTML tags. GovTrack's
screen-scrapers are written in Perl because it happens to be
well-suited for flexibly searching through text with "regular expressions".
But Python and the other lightweight scripting languages
would work just as well.
Screen-scraping is an unfortunate necessity most of the time. In an ideal world, the
government would simply allow direct access to the databases it uses to
power its own websites. After all, we're talking about public domain
information. If "machine-readable" formats of the data were
available, the gathering part of this task would be straightforward.
The downside of screen-scraping is that minute changes to the format or
layout of a page can break the screen-scraper, so you have to be
committed to maintaining the screen-scraper after you've
finished writing it. The first task, then, is to transform the
human-readable government web pages into machine-readable
databases.
Now, the information available at the federal level is not the
same as what is available at the state level. The government site
THOMAS provides incredibly comprehensive information about all
activity on federal legislation. States vary widely in what information
they put online, and that limits the type of independent tracking site
that can be made (without an enormous effort of requesting hard-copies
of the information). The first step is to look for the website
of your state legislature and check out what kind of information
is published. Is there a list of all pending legislation? Does it
show the status of legislation, i.e., has it been voted on, enacted, etc.?
Is there a list of legislators and how they have voted?
If you're particularly daring, you could try calling up your
state assembly and asking someone in the know about what kind of
information they put online (maybe it's not all publicly visible),
and whether they would be interested in publishing information
in a structured format.
Pick some page with
interesting information and write a little screen-scraping program
to extract the information out of it. Your program might want to
go down through the raw HTML line-by-line, applying a regular expression
on each line to pick out the number or title of a bill.
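To make the line-by-line approach concrete, here is a minimal sketch in Python. The HTML sample, the "H.B."/"S.B." bill prefixes, and the function name scrape_bills are all made up for illustration; your state's pages will look different, and the regular expression will need to be adapted to them.

```python
import re

# Hypothetical HTML, invented for this sketch. A real state site will differ.
SAMPLE_HTML = """\
<table>
<tr><td><a href="/bills/hb1201">H.B. 1201</a></td><td>Clean Water Act of 2009</td></tr>
<tr><td><a href="/bills/sb45">S.B. 45</a></td><td>School Funding Amendments</td></tr>
<tr><td colspan="2">Page 1 of 12</td></tr>
</table>
"""

# Assumes one bill per table row: capture the chamber letter, number, and title.
BILL_ROW = re.compile(
    r'<a href="[^"]*">\s*([HS])\.B\.\s*(\d+)\s*</a></td>\s*<td>(.*?)</td>'
)


def scrape_bills(html):
    """Return a list of (chamber, number, title) tuples found in the HTML."""
    bills = []
    for line in html.splitlines():
        match = BILL_ROW.search(line)
        if match:  # lines without a bill (headers, paging) are skipped
            chamber, number, title = match.groups()
            bills.append((chamber.lower(), int(number), title.strip()))
    return bills


if __name__ == "__main__":
    for bill in scrape_bills(SAMPLE_HTML):
        print(bill)
```

A pattern this tightly tied to the page layout is exactly what breaks when the site changes, so it helps to keep each pattern in one place where it is easy to update.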
The information isn't useful unless it ends up in a structured format,
something that will help you work with the data when you want to
display it on your own web page, or when you want to scan it for
events to update users about. So you'll want to output what you screen-scrape
into a structured data file. XML works very well for this purpose.
The XML should be "normalized," which means that you put things like
bill identifiers, like "H.R. 1201," into a very precise, rigid format,
for instance by splitting the two parts of the identifier into
attributes, e.g. "<bill type='h' number='1201'/>". Beyond
normalization, you shouldn't worry much about what the XML file
looks like, as long as it very precisely captures the information
you've found in the original HTML.
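As a rough illustration of normalization, the sketch below turns an identifier like "H.R. 1201" into a bill element with separate type and number attributes, using Python's standard xml.etree.ElementTree module. The type codes, the title child element, and the function names are assumptions made for this example, not a format GovTrack prescribes.

```python
import re
import xml.etree.ElementTree as ET

# Maps the prefix as printed on a web page to a short type code.
# These codes are an assumption for this sketch; pick your own and stick to them.
TYPE_CODES = {"H.R.": "h", "S.": "s"}

BILL_ID = re.compile(r"^\s*(H\.\s*R\.|S\.)\s*(\d+)\s*$", re.IGNORECASE)


def normalize_bill_id(text):
    """Split an identifier such as 'H.R. 1201' into ('h', 1201)."""
    match = BILL_ID.match(text)
    if not match:
        raise ValueError("unrecognized bill identifier: %r" % text)
    # Drop inner spaces and case differences so 'h. r.' and 'H.R.' agree.
    prefix = re.sub(r"\s+", "", match.group(1)).upper()
    return TYPE_CODES[prefix], int(match.group(2))


def bill_to_xml(identifier, title):
    """Build a <bill> element with normalized attributes and return it as text."""
    bill_type, number = normalize_bill_id(identifier)
    bill = ET.Element("bill", {"type": bill_type, "number": str(number)})
    ET.SubElement(bill, "title").text = title
    return ET.tostring(bill, encoding="unicode")


if __name__ == "__main__":
    print(bill_to_xml("H.R. 1201", "A sample title"))
```

Raising an error on an identifier the pattern doesn't recognize is a deliberate choice here: it is better for the scraper to stop loudly than to write sloppy identifiers into your data.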
Once you have at least one scraper, you should think about
how to fetch updated information from the state's website on
a regular basis, preferably without having to download
their entire website to your computer each day. You may have
to scrape just for a list of updated bills, and then re-fetch
the pages for the bills.
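One way to avoid re-downloading everything is to remember what the listing page said last time and compare. The sketch below assumes (and this is only an assumption) that your state's listing page shows some "last action" date or text for each bill that you have already scraped into a dictionary. The bill ids, file format, and function names are hypothetical.

```python
import json
from pathlib import Path


def bills_to_refetch(previous, current):
    """Return ids of bills that are new or whose last-action stamp changed.

    Both arguments map a bill id to the last-action text scraped from a
    listing page (an assumed feature of the state's site).
    """
    return sorted(
        bill_id
        for bill_id, stamp in current.items()
        if previous.get(bill_id) != stamp
    )


def load_state(path):
    """Load the stamps saved by the previous run, or {} on the first run."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path, state):
    """Save the current stamps so the next run can compare against them."""
    Path(path).write_text(
        json.dumps(state, indent=2, sort_keys=True), encoding="utf-8"
    )


if __name__ == "__main__":
    previous = {"hb1201": "2009-03-01", "sb45": "2009-02-10"}
    current = {"hb1201": "2009-03-04", "sb45": "2009-02-10", "hb1300": "2009-03-03"}
    print(bills_to_refetch(previous, current))
```

Only the bills returned by bills_to_refetch need their individual pages fetched again; everything else can be left alone until its stamp changes.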
If you're lucky enough to live in a state where a machine-readable database is
already made available to you by your legislature, you'll still
probably want to bring the information together into your own
database system that's more suited for the next step, and to develop
some scripts to keep your database up to date.
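Whichever way the data arrives, a small update script might look something like the sketch below, which uses Python's built-in sqlite3 module. The table layout, column names, and function names are assumptions chosen for illustration; a real site would likely need more tables (legislators, votes, and so on) and perhaps a different database system.

```python
import sqlite3

# Hypothetical schema for this sketch: one row per bill, keyed by type and number.
SCHEMA = """
CREATE TABLE IF NOT EXISTS bills (
    type   TEXT    NOT NULL,
    number INTEGER NOT NULL,
    title  TEXT,
    status TEXT,
    PRIMARY KEY (type, number)
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_bills(conn, bills):
    """Insert new bills and overwrite existing ones with fresh data.

    `bills` is an iterable of (type, number, title, status) tuples.
    """
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO bills (type, number, title, status) "
            "VALUES (?, ?, ?, ?)",
            bills,
        )


if __name__ == "__main__":
    conn = open_db()
    store_bills(conn, [("h", 1201, "A sample title", "introduced")])
    store_bills(conn, [("h", 1201, "A sample title", "passed house")])
    print(conn.execute("SELECT * FROM bills").fetchall())
```

Because the bill's type and number form the primary key, running the script again with newer data replaces the old row instead of creating a duplicate.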
Publishing the Information
The second part of the government-tracking system is
providing a website interface like GovTrack's
for navigating through the data so users can do research, and then
for letting users subscribe to track particular events.
Because you might be interested in providing some of the same
functionality as GovTrack, like email updates, RSS/Atom feeds, maps for
finding legislative districts, etc., it might be useful to avoid the
duplicated effort of creating a whole new state-specific site by instead
integrating your state-level data into www.GovTrack.us. This hasn't
been done before, but it's an interesting possibility.
Integrating in some form with GovTrack helps to create a network
of political information at multiple levels of government. The
practical applications of integration include being able to easily
track a politician's history in different aspects of government,
or search for topic-specific legislation at the federal
and local level at the same time.
On the other hand, you might rather run your own website.
Maybe that's even better! (Especially since the code base of
OpenCongress.org is
totally reusable, and GovTrack's website sources are a bit
of a mess.)
How hard is all of this?
This is a project that requires a real commitment. GovTrack took
about two years of free-time-now-and-then to create, and it continues to
require regular attention to keep the screen-scrapers going when the
sources of data change their sites. (It doesn't happen often, but
it happens.) But, now that GovTrack exists,
spin-offs should go much faster since you'll have the moral support of
GovTrack behind you.
Closing Remarks
There's a real need for state-level sites like GovTrack.
There is virtually no awareness of state-level politics, at
least not in the communities I've been in, and yet it's a
very important part of a federalist system. So please
get involved!