- What is RDF? Unless you know RDF well, it's probably best if you try to forget what you already know. Here is RDF from the beginning.
- RDF file formats: RDF can be written in different ways. I'll show the basics of Notation 3 (N3).
- Agreeing, or agreeing to disagree: Integrating with other RDF information gives RDF real meaning.
- Distributed information: RDF is useful for mixing and meshing data from different sources.
- Comparing RDF with XML: XML is too restrictive for meshing decentralized information. I'll compare how XML and RDF would be used in a simple scenario.
- Intermission: Take a break here. It only gets more complicated.
- RDF Ontologies, Schemas, and Vocabularies: Higher-level information -- data about data -- lets computers draw useful inferences from data automatically.
- More To Learn: This isn't the end of what there is to know about RDF.
What is RDF?
Unless you know RDF well, it's probably best if you try to forget what you already know about it. RDF exists at the intersection of a few different technologies (XML and syndication, for instance), so it's easy to be led into thinking that it is a particular data format or a tool for blog feeds. In fact, RDF is more abstract than this. So, forget what you know. Here is RDF from the beginning. (For the official documentation on RDF, start with the RDF Primer.)
RDF is nothing more than a general method to decompose information into pieces. The emphasis is on general here because the same method can be used for any type of information. And the method is this: Express information as a list of statements in the form SUBJECT PREDICATE OBJECT. The subject and object are names for two things in the world, and the predicate is the name of a relation between the two. You can think of predicates as verbs.
Here's how I would break down information about my apartment into S-P-O statements:
| SUBJECT | PREDICATE | OBJECT |
|---|---|---|
| I | own | my apartment |
| my apartment | has | my computer |
| my apartment | has | my bed |
| my apartment | is in | Philadelphia |
The subjects in this example are I and my apartment (three times); the predicates are own, has, and is in; and the objects are my apartment, my computer, my bed, and Philadelphia. Note how the predicates are all relations between two things. Own is a relation between an owner and an 'ownee', has is a relation between the container and the thing contained, and is_in is the inverse relation (between the contained and the container, order is very important). Also, consider that my_apartment is just a name, but it refers to (or denotes) an entity in the real world. Entities are sometimes also called resources.
Each subject-predicate-object line above is called a statement or triple.
The next aspect of RDF almost goes without saying, but I want to put everything down in print: If someone refers to something as X in one place and X is used in another place, the two X's refer to the same thing in the real world. When I wrote my_apartment in the first row, it's the same apartment that I meant when I wrote my_apartment in the other three rows. (The reverse is not true. Two things with different names in RDF may still refer to the same thing in the real world, although normally that doesn't happen.)
This already gets us a lot farther than you might realize. Given this table of S-P-O statements, I can write a simple program that can answer questions like "who own my_apartment" and "my_apartment has what." The question itself is in the form of an S-P-O statement, except the program will consider wh-words to be wild-cards. A simple question-answering program can compare the question to each row in the table. Each matching row is an answer.
Question: my_apartment has what
Answer: my_apartment has my_computer
Answer: my_apartment has my_bed
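Here is a minimal sketch of such a question-answering program in Python. The tuple representation, the set of wh-words, and the function name ask are my own assumptions for illustration, not part of RDF.

```python
# Each statement is a (subject, predicate, object) tuple.
TRIPLES = [
    ("I", "own", "my_apartment"),
    ("my_apartment", "has", "my_computer"),
    ("my_apartment", "has", "my_bed"),
    ("my_apartment", "is_in", "Philadelphia"),
]

# Wh-words act as wildcards: they match any name in that position.
WILDCARDS = {"who", "what", "where"}


def ask(question, triples=TRIPLES):
    """Return every statement that matches a three-word question."""
    pattern = question.split()
    if len(pattern) != 3:
        raise ValueError("a question must have exactly three parts")
    return [
        triple
        for triple in triples
        if all(q in WILDCARDS or q == t for q, t in zip(pattern, triple))
    ]


for answer in ask("my_apartment has what"):
    print("Answer:", " ".join(answer))
```

Note that the program never looks inside the names. It only checks whether two names are identical.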
The computer doesn't need to know what has actually means in English for this to be useful. That is, it's left up to the application writer to choose appropriate names for things (e.g. my_apartment) and to use the right predicates (own, has). RDF tools are ignorant of what these names mean, but they can still usefully process the information. (I'll get to more useful things later.)
RDF information is meant to be published on the Internet, and so the names I used above have a problem. I shouldn't name something my_apartment because someone else might use the name my_apartment for their apartment too. Following from the last fact about RDF, RDF tools would think the two instances of my_apartment referred to the same thing in the real world, whereas in fact they were intended to refer to two different apartments. The last aspect of RDF is that names must be global, in the sense that you must not choose a name that someone else might conceivably also use to refer to something different. Formally, names for subjects, predicates, and objects must be Uniform Resource Identifiers (URIs).
URIs have the same syntax or format as website addresses, so you will see RDF files that contain URIs like http://www.w3.org/1999/02/22-rdf-syntax-ns#, where that URI is the global name for some entity. It happens to be the global name for RDF itself, but the fact that it looks like a web address is totally incidental. There may or may not be an actual website at that address, and it doesn't matter. The URIs you see in RDF documents are merely verbose names for entities, nothing more.
URIs are used as global names because they provide a way to break down the space of all possible names into units that have obvious owners. URIs that start with http://www.govtrack.us/ are implicitly controlled by me, or whoever is running the website at that address. By convention, if there's an obvious owner for a URI, no one but that owner will create a new resource with that URI. This prevents name clashes. If you create a URI in the space of URIs that you control, you can rest assured no one will use the same URI to denote something else.
I might re-write the table about my apartment as it is below, replacing the simple names above with URIs:
| SUBJECT | PREDICATE | OBJECT |
|---|---|---|
| http://taubz.for.net/me | urn://global/name/own | http://taubz.for.net/my_apartment |
| http://taubz.for.net/my_apartment | urn://global/name/has | http://taubz.for.net/my_computer |
| http://taubz.for.net/my_apartment | urn://global/name/has | http://taubz.for.net/my_bed |
| http://taubz.for.net/my_apartment | urn://global/name/is_in | urn://us_cities/Philadelphia |
That's all there is to knowing what RDF actually is. It's not a particular format, but instead a method of describing things using S-P-O statements and URIs.
RDF file formats
There are a few standard formats for writing out RDF information. The two most common are Notation 3 (N3), which is basically a tabular format, and RDF/XML, which is an XML-based format.
These formats allow you to abbreviate parts of URIs in the same way that XML allows you to abbreviate namespaces. I won't go into the syntax of defining RDF namespaces in each RDF file format, but I will use taubz as an abbreviation for http://taubz.for.net/ (note that in URIs, the trailing slash is a part of the name) and other namespaces as needed.
The N3 version of the information about my apartment is written as:
taubz:me global:own taubz:my_apartment .
taubz:my_apartment global:has taubz:my_computer .
taubz:my_apartment global:has taubz:my_bed .
taubz:my_apartment global:is_in <urn://us_cities/Philadelphia> .
It's just subject, predicate, and object followed by a period. Entities (subjects, predicates, and objects) are written either as URIs within angled brackets, or as namespace-prefixed names (as in taubz:me, the shorthand for http://taubz.for.net/me, provided I previously defined the taubz namespace in the N3 file).
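To make the prefix shorthand concrete, here is a small Python sketch that expands prefixed names into full URIs. The prefix table is an assumption (the global prefix is inferred from the URIs in the earlier table), and the function names are hypothetical. It handles only plain S P O lines with no literals.

```python
# Hypothetical prefix table; a real N3 file would declare these itself.
PREFIXES = {
    "taubz": "http://taubz.for.net/",
    "global": "urn://global/name/",
}


def expand(term, prefixes=PREFIXES):
    """Turn one N3 term into a full URI string."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]  # already a full URI in angle brackets
    prefix, _, local = term.partition(":")
    return prefixes[prefix] + local


def expand_statement(line, prefixes=PREFIXES):
    """Expand a simple 'S P O .' line into a tuple of three full URIs."""
    subject, predicate, obj, dot = line.split()
    if dot != ".":
        raise ValueError("a statement must end with a period")
    return tuple(expand(term, prefixes) for term in (subject, predicate, obj))


print(expand_statement("taubz:me global:own taubz:my_apartment ."))
```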
N3 has some syntactic sugar that allows further abbreviations:
taubz:me global:own taubz:my_apartment .
taubz:my_apartment global:has taubz:my_computer, taubz:my_bed ;
global:is_in <urn://us_cities/Philadelphia> .
Commas indicate another object for the same subject and predicate. Semicolons indicate another predicate and object for the same subject.
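The abbreviations are purely mechanical, as this rough Python sketch suggests. It is not a real N3 parser: it assumes terms contain no whitespace, commas, or semicolons, and that each closing period is set off by whitespace. The name expand_sugar is my own.

```python
SAMPLE = """
taubz:me global:own taubz:my_apartment .
taubz:my_apartment global:has taubz:my_computer, taubz:my_bed ;
    global:is_in <urn://us_cities/Philadelphia> .
"""


def expand_sugar(text):
    """Expand ',' and ';' abbreviations into full (S, P, O) tuples."""
    # Set the punctuation off with spaces so split() yields it as tokens.
    for mark in (",", ";"):
        text = text.replace(mark, f" {mark} ")
    # How many new terms must follow each punctuation mark.
    terms_after = {".": 3, ";": 2, ",": 1}
    triples = []
    subject = predicate = None
    expected = 3
    current = []
    for token in text.split():
        if token not in terms_after:
            current.append(token)
            continue
        if len(current) != expected:
            raise ValueError(f"expected {expected} terms before {token!r}")
        # Reuse the previous subject (and predicate) for the omitted parts.
        subject, predicate, obj = [subject, predicate][: 3 - expected] + current
        triples.append((subject, predicate, obj))
        expected = terms_after[token]
        current = []
    if current:
        raise ValueError("the last statement is missing its period")
    return triples


for triple in expand_sugar(SAMPLE):
    print(*triple, ".")
```

The output is the same four statements as the unabbreviated version.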
Turtle and N-Triples are just like N3, but each is simpler than N3.
The RDF/XML format is the most common. (It's the basis of RSS 1.0, for instance.) The format is fairly complex in its details, so I won't even use it here. I'll continue to use N3.
Agreeing, or agreeing to disagree
One really important aspect of RDF is that users of RDF don't need to agree ahead of time on anything more than they need to. By using RDF, you've already got an extensible foundation for any type of data. If no one has coined a URI for something you want to describe, you can create your own URI for it. This goes for not just subjects and objects but predicates as well. In the examples above, I made up all of the URIs, and it's still perfectly valid RDF.
But, RDF data has more meaning if you choose URIs that other people are using. Two RDF documents with no URIs in common have no information that can be interrelated. But, two documents that have some URIs in common are talking about some of the same things. Someone else might want to publish information about my apartment, such as how far it is from where they live. By using the same URI for my apartment in the two documents, RDF tools will be able to recognize that the two documents are describing the same thing.
Here is an example of using RDF to describe books. Let's say, hypothetically, that the Library of Congress posted an RDF list of books and Amazon.com did the same.
<urn:isbn:0143034650> dc:title "Free Culture : The Nature and Future of Creativity" .
<urn:isbn:0613917472> dc:title "Code and Other Laws of Cyberspace" .
<urn:isbn:B00005U7WO> dc:title "The Future of Ideas" .

<urn:isbn:0143034650> amazon:price "$15.00" .
<urn:isbn:0613917472> amazon:price "$26.35" .
<urn:isbn:B00005U7WO> amazon:price "$9.95" .
(Note that in these files I've used literal values for objects, rather than names of other resources. Literals, such as strings, numbers, dates, can be objects of statements but not subjects or predicates.)
The URIs for the books (urn:isbn:...) are what tie the two files together. An RDF application using these files would be able to report that "The Future of Ideas" is "$9.95" at Amazon because both the title and the price are related to a common resource, denoted by the URI urn:isbn:B00005U7WO. If the files did not have the same URI for that book, nothing would indicate that the titles went with particular prices.
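A minimal Python sketch of that join follows. The two lists stand in for the hypothetical Library of Congress and Amazon files, and the merge function is my own illustration. For simplicity it assumes each book has at most one value per predicate.

```python
from collections import defaultdict

library_of_congress = [
    ("urn:isbn:0143034650", "dc:title",
     "Free Culture : The Nature and Future of Creativity"),
    ("urn:isbn:0613917472", "dc:title", "Code and Other Laws of Cyberspace"),
    ("urn:isbn:B00005U7WO", "dc:title", "The Future of Ideas"),
]
amazon = [
    ("urn:isbn:0143034650", "amazon:price", "$15.00"),
    ("urn:isbn:0613917472", "amazon:price", "$26.35"),
    ("urn:isbn:B00005U7WO", "amazon:price", "$9.95"),
]


def merge(*sources):
    """Group predicate/value pairs from every source under the subject URI."""
    merged = defaultdict(dict)
    for source in sources:
        for subject, predicate, value in source:
            merged[subject][predicate] = value
    return merged


books = merge(library_of_congress, amazon)
for uri, properties in books.items():
    print(f'"{properties["dc:title"]}" is {properties["amazon:price"]} at Amazon')
```

The only thing connecting a title to a price is the shared subject URI. Change one character of a URI in either list and the connection is lost.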
It's also important to use the same predicates others are using when the predicates you want already exist. Only a human can determine how a predicate should be interpreted (how an application can make use of it), so when you make up a URI for a predicate, only you will know what that predicate means. Existing RDF applications won't be able to make heads or tails of it. For instance, if you're describing documents, you should use the existing Dublin Core title and description predicates so RDF applications that already use those predicates will be able to extract the titles and descriptions from your data. If you're describing people, use existing FOAF predicates so that FOAF-based applications will be able to take advantage of the information.
I used dc:title in my books example above because the predicate already existed to convey the information I wanted to put into the files: a relation between a book and its title. No standard predicate exists for giving the Amazon.com price of a book, so I made one up. (Remember amazon:price is an abbreviated form of a full URI, but I've left out the declaration of amazon: for the sake of brevity.)
Be careful to respect the meanings of existing predicates, though. RDF applications expect the dc:title predicate to relate a document to its title. You wouldn't want to use it to relate a telephone to its phone number, for instance.
Distributed information
Most of the time you get your information from one place. For instance, a single database might contain the information for an entire product line. There isn't much on the web in the way of distributed information because it's usually hard to put information together from multiple sources, each of which may have its own data format and conventions.
Here's a scenario where distributed information makes a lot of sense: a database of products from multiple vendors and reviews of those products from multiple reviewers. No one vendor is going to want to be responsible for maintaining a central database for this project, especially since it will contain information for competing products and negative reviews. Likewise, no one reviewer may have the resources to keep such a database up to date. How can this become a reality?
RDF is particularly suited for this project. Each vendor and reviewer will publish a file in RDF on their own websites. The vendors will choose URIs for their products, and the reviewers will use those URIs when composing their reviews. Vendors don't need to agree on a common naming scheme for products, and reviewers aren't tied to a vendor-controlled data format. RDF allows the vendors and reviewers to agree on what they need to agree on, without forcing anyone to use one particular vocabulary.
vendor1:productX dc:title "Cool-O-Matic" .
vendor1:productX retail:price "$50.75" .
vendor1:productX vendor1:partno "TTK583" .
vendor1:productY dc:title "Fluffertron" .
vendor1:productY retail:price "$26.50" .
vendor1:productY vendor1:partno "AAL132" .

vendor2:product1 dc:title "Can Closer" .
vendor2:product1 retail:price "$28.11" .
vendor2:product1 vendor2:warranty_code "None." .
vendor2:product2 dc:title "Dust Unbuster" .
vendor2:product2 retail:price "$33.21" .
vendor2:product2 vendor2:warranty_code "X12" .

vendor1:productX dc:description "This product makes this really good. A good buy!" .

vendor2:product2 dc:description "Who needs something to unbust dust? A dust buster would be a better idea." .
vendor2:product2 review:rating review:Excellent .
It's an open question just how an application will retrieve these files, but I'll put that aside. Once an application has these files, it has enough information to relate products to reviews to prices, and even to vendor-specific information like vendor1:partno and vendor2:warranty_code. What you should take away from this example is how unconstraining RDF is, while still allowing applications to immediately be able to relate information together.
And, RDF applications don't need to know about the nature of the data in these files to be able to make use of it. If an application already knows what the dc:title and dc:description predicates are for -- which applies to any type of data -- then it is already able to present the titles and reviews of the four subject entities.
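As a rough illustration, the Python sketch below knows only about dc:title and dc:description, yet it can list every product with its reviews. The lists hold a subset of the statements from the four files, and the function names are hypothetical. Vendor-specific predicates are simply ignored.

```python
vendor1 = [
    ("vendor1:productX", "dc:title", "Cool-O-Matic"),
    ("vendor1:productX", "vendor1:partno", "TTK583"),
    ("vendor1:productY", "dc:title", "Fluffertron"),
]
vendor2 = [
    ("vendor2:product1", "dc:title", "Can Closer"),
    ("vendor2:product2", "dc:title", "Dust Unbuster"),
    ("vendor2:product2", "vendor2:warranty_code", "X12"),
]
reviewer1 = [
    ("vendor1:productX", "dc:description",
     "This product makes this really good. A good buy!"),
]
reviewer2 = [
    ("vendor2:product2", "dc:description",
     "Who needs something to unbust dust? A dust buster would be a better idea."),
]


def objects(triples, subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def catalog(*sources):
    """Return (title, reviews) for every subject that has a dc:title."""
    triples = [triple for source in sources for triple in source]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    subjects = dict.fromkeys(s for s, p, o in triples if p == "dc:title")
    return [
        (objects(triples, s, "dc:title")[0], objects(triples, s, "dc:description"))
        for s in subjects
    ]


for title, reviews in catalog(vendor1, vendor2, reviewer1, reviewer2):
    print(title, "-", reviews or "no reviews yet")
```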
In addition, the vendors and reviewers did not have to agree on much to make this happen. They had to agree to use RDF, but they didn't have to agree on any specific data format or even on specific URIs. It helps that they agreed on URIs for indicating titles and prices, although even that wasn't strictly necessary. But, crucially, they didn't have to enumerate everything any vendor would want to include about their products. When a vendor needed something that wasn't already agreed on (product numbers and warranty codes), they were able to create a new predicate without disrupting any existing systems. Likewise, the reviewers aren't tied to a vendor-controlled vocabulary. Reviewers were free to add their own relations, such as ratings, to their RDF files.
Another way to look at this is from the standpoint of interoperability. Vendor 1's format is entirely interoperable with anyone else's format, even though Vendor 1 didn't hash out a common format with anyone. When someone comes along and wants to be interoperable with Vendor 1's information, they don't need a new format; they just need to choose the right subjects, predicates, and objects.
Is this any better than XML? I'll take a look at that next.
Comparing RDF with XML
In the last section I showed how RDF could be used to create a decentralized database for product information and reviews. Here is how a similar system would be accomplished using non-RDF XML.
First, though, I'll look at how a single vendor would approach this alone. The vendor, if it was so inclined, might publish an XML file with a node for each product, and within that a node for its name and some vendor-specific information.
<products>
<product title="Cool-O-Matic">
<price>50.75</price>
<partno>TTK583</partno>
...
What can be done with this file? An application to display this information would have to be specifically programmed to know that <product> nodes are for products, with titles in the title attribute, etc. And, if a reviewer wanted to post a review XML file, the only way to relate reviews to products would be by name. Two vendors might have products with the same name, so vendors would have to use IDs of some sort to keep their products separate.
The first problem arises. Vendors will need to come together to establish a product ID system so that IDs are unique within the local ID space of this vendor consortium. RDF solves this problem by requiring that all IDs be globally unique, and by using URIs for IDs, allowing individuals to create IDs in a local space that they control.
With IDs in the XML files, reviewers will be able to identify products in their review files, but applications still won't be able to relate products to reviews. IDs aren't enough. The applications themselves have to be told where to find the IDs. In Vendor 1's file, it might be in the product node's ID attribute. In the review files, it might be in the review node's product attribute. Even if the vendors and reviewers agreed where to put the ID, the application still needs to know where it is. RDF solves this problem by making everything a global ID (except literals), so anything the RDF application sees is an ID that means something.
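To see what "the application needs to know where it is" means in practice, consider this Python sketch. Both XML documents are hypothetical: the id and product attribute names and the ID value are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical files: element names, attribute names, and IDs are assumptions.
VENDOR_XML = """
<products>
  <product id="v1-x" title="Cool-O-Matic">
    <price>50.75</price>
  </product>
</products>
"""
REVIEW_XML = """
<reviews>
  <review product="v1-x">A good buy!</review>
</reviews>
"""


def reviews_by_title(vendor_xml, review_xml):
    """Join reviews to product titles using layout knowledge baked into the code."""
    # The application must know that vendors keep the ID in the 'id' attribute ...
    titles = {
        product.get("id"): product.get("title")
        for product in ET.fromstring(vendor_xml).iter("product")
    }
    # ... and that reviewers keep the same ID in the 'product' attribute.
    return {
        titles[review.get("product")]: review.text
        for review in ET.fromstring(review_xml).iter("review")
    }


print(reviews_by_title(VENDOR_XML, REVIEW_XML))
```

Every element and attribute name in the function is a piece of agreement that someone had to negotiate and then hard-code. If a second vendor puts its IDs somewhere else, the function has to change.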
The vendors and reviewers next have to decide what constitutes a valid product or review XML file, and how the nodes of these files should be interpreted by software. If these files are defined by a DTD or Schema, the files will not be extensible. Before adding anything new into these files, such as vendor-specific information, all of the vendors and reviewers will need to agree to the DTD or Schema change. Without a DTD or Schema, there are no rules for what elements go where. It's a trade-off.
I could go on, but you should see now that XML isn't particularly suited for distributed, extensible information.
Intermission
This is a good point to take a break. What I've discussed above are the fundamentals of RDF and why those fundamentals make RDF different. There's a lot more to say about the fundamentals of RDF. Notably, I left out the idea that collections of RDF statements can be thought of as a web, or a graph in the mathematical sense. But, I need to move on.
Below I'll continue, explaining some advanced uses of RDF, including ontologies.
RDF Ontologies, Schemas, and Vocabularies
So far I've shown how RDF can be used to describe the relationships between entities in the world. RDF can be used at a higher level, too, to describe relationships among the predicates. Ontologies, schemas, and vocabularies, which all mean roughly the same thing, are RDF information about RDF information.
RDF ontologies play a role vaguely similar to that of XML Document Type Definitions and XML Schema. But they are as different as they are the same. DTDs and XML Schema specify what constitutes a valid document. They don't indicate how a document should be interpreted, and they only restrict the set of elements that can be used in any given file. RDF ontologies provide relations between higher-level things, entirely for the purpose of indicating to applications how some information should be interpreted. (RDF ontologies also don't restrict at all which predicates are valid where. Any statement is valid anywhere.)
RDF, RDF Schema (RDFS), and Web Ontology Language (OWL) define a few classes and predicates that are, merely by convention, used to provide higher-level descriptions of data.
The first higher-level predicate is the rdf:type predicate. (rdf is the usual namespace abbreviation for http://www.w3.org/1999/02/22-rdf-syntax-ns#.) It relates any entity to another entity whose rdf:type is rdfs:Class. (rdfs is the usual namespace abbreviation for http://www.w3.org/2000/01/rdf-schema#.) The purpose of this predicate is to indicate what kind of thing a resource is. But, as with anything else in RDF, the choice of class is either by convention or arbitrary.
To add class information into the vendor N3 files from a few sections ago, a vendor would simply append this:
vendor1:productX rdf:type general:Product .
As with choosing predicates, it's helpful to choose URIs for classes that are used by others. Agreement among different parties for classes, and for the other things in this section, is very important.
One interesting class is rdf:Property. (Note that rdf:Property is merely a URI that in the RDF specs appears as the subject of a statement that is related by rdf:type to the URI rdfs:Class. Things that are classes are usually given uppercase names.) This is the class that predicates are typed as. To be explicit about amazon:price, I would create an RDF ontology file that contained:
amazon:price rdf:type rdf:Property .
Previously amazon:price was a predicate, and here it is a subject. Since it's just shorthand for a URI that denotes an entity, we can use it in statements just like anything else.
Other RDFS predicates are used to provide even more information about predicates. The rdfs:domain and rdfs:range predicates relate a predicate to the class (an rdfs:Class) of resources that can serve as the subject or object of the predicate, respectively. Here's an example:
vendor2:warranty_code rdfs:domain general:Product .
vendor2:warranty_code rdfs:range rdfs:Literal .
These statements say that the subjects of vendor2:warranty_code are things typed as general:Product and the objects of this predicate are literals (raw text). Specifying domains and ranges for predicates serves two purposes. First, it allows applications to make inferences from statements about the types of things. If it sees something that is the subject of vendor2:warranty_code, it can infer that it is a general:Product. Second, these specifications serve as documentation of a vocabulary for people. The RDF itself is used to indicate how predicates should be used.
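The first purpose, inference, can be sketched in a few lines of Python. This is a simplified illustration that handles only rdfs:domain. The data statements are taken from the vendor example, and the function name infer_types is my own.

```python
ontology = [
    ("vendor2:warranty_code", "rdfs:domain", "general:Product"),
]
data = [
    ("vendor2:product1", "vendor2:warranty_code", "None."),
    ("vendor2:product2", "vendor2:warranty_code", "X12"),
    ("vendor2:product2", "dc:title", "Dust Unbuster"),
]


def infer_types(data, ontology):
    """Return the rdf:type statements implied by rdfs:domain declarations."""
    # Map each predicate to the classes declared as its domain.
    domains = {}
    for predicate, relation, cls in ontology:
        if relation == "rdfs:domain":
            domains.setdefault(predicate, set()).add(cls)
    # Any subject of such a predicate belongs to each of those classes.
    return {
        (subject, "rdf:type", cls)
        for subject, predicate, _ in data
        for cls in domains.get(predicate, ())
    }


for statement in sorted(infer_types(data, ontology)):
    print(*statement, ".")
```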
Two RDFS predicates are used to give relations between classes and between predicates. The rdfs:subClassOf relation indicates that one class is a sub-class of another. For instance, the class Mammal is a sub-class of the class Animal. Anything that is a Mammal is also an Animal, and applications are able to make such inferences once this predicate is present. The rdfs:subPropertyOf predicate is similar, but for predicates. For example, the friend predicate is a sub-property of the knows predicate. Any friend is someone you know.
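Here is one way an application might apply these two predicates, sketched in Python. The ex: names are hypothetical, and the loop simply repeats the two rules until nothing new turns up.

```python
ontology = [
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:friend", "rdfs:subPropertyOf", "ex:knows"),
]
data = [
    ("ex:rex", "rdf:type", "ex:Mammal"),
    ("ex:alice", "ex:friend", "ex:bob"),
]


def infer(data, ontology):
    """Return the data plus everything implied by sub-class and sub-property."""
    sub_class = {(s, o) for s, p, o in ontology if p == "rdfs:subClassOf"}
    sub_prop = {(s, o) for s, p, o in ontology if p == "rdfs:subPropertyOf"}
    facts = set(data)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "rdf:type":
                # A member of a class is a member of each parent class.
                new |= {(s, p, parent) for child, parent in sub_class if child == o}
            # A statement with a sub-property also holds for the parent property.
            new |= {(s, parent, o) for child, parent in sub_prop if child == p}
        if new <= facts:
            return facts
        facts |= new


for statement in sorted(infer(data, ontology)):
    print(*statement, ".")
```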
Lastly, I want to mention four classes that are sub-classes of rdf:Property: owl:SymmetricProperty, owl:TransitiveProperty, owl:FunctionalProperty, and owl:InverseFunctionalProperty. (The OWL namespace is http://www.w3.org/2002/07/owl#.) Applications can use these classes, by convention, to make inferences about data. You would use these classes in an ontology like this:
amazon:price rdf:type owl:FunctionalProperty .
Because these classes are defined in the OWL ontology as being sub-classes of rdf:Property, applications can infer the following:
amazon:price rdf:type rdf:Property .
That's the same statement as earlier. So, when you use a sub-class in place of the 'parent' class, you're being strictly more informative. Anything the application knew before it still knows, and it knows more because the sub-class is more specific.
OWL symmetric and transitive properties tell applications that the following inferences are valid. If the application sees the statement S P O, and if P is typed as a symmetric property, then O P S is also true. (The friend relation is symmetric, more or less. If you're my friend, I'm your friend.) If the application sees the statements X P Y and Y P Z, and if P is typed as a transitive property, then X P Z is also true. (subClassOf is a transitive relation. If Mammal is a sub-class of Animal and Animal is a sub-class of Organism, then Mammal is a sub-class of Organism.)
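Both rules are easy to apply mechanically. The Python sketch below is only an illustration, with hypothetical ex: names. It repeats the two rules until no new statements appear.

```python
ontology = [
    ("ex:friend", "rdf:type", "owl:SymmetricProperty"),
    ("rdfs:subClassOf", "rdf:type", "owl:TransitiveProperty"),
]
data = [
    ("ex:you", "ex:friend", "ex:me"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:Animal", "rdfs:subClassOf", "ex:Organism"),
]


def typed_as(ontology, cls):
    """Return the set of predicates declared to be of the given class."""
    return {s for s, p, o in ontology if p == "rdf:type" and o == cls}


def close(data, ontology):
    """Add statements implied by symmetric and transitive properties."""
    symmetric = typed_as(ontology, "owl:SymmetricProperty")
    transitive = typed_as(ontology, "owl:TransitiveProperty")
    facts = set(data)
    while True:
        new = set()
        for s, p, o in facts:
            if p in symmetric:
                new.add((o, p, s))  # S P O implies O P S
            if p in transitive:
                # X P Y and Y P Z imply X P Z
                new |= {(s, p, z) for y, p2, z in facts if p2 == p and y == o}
        if new <= facts:
            return facts
        facts |= new


for statement in sorted(close(data, ontology)):
    print(*statement, ".")
```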
OWL functional and inverse-functional properties tell applications that two entities are, in some circumstances, the same thing, even if they have different URIs. (Remember that like URIs must refer to the same thing, but different URIs may still refer to the same thing.) Functional properties can occur just once for any given subject. If a functional property is used twice to relate a single subject to two URIs, then RDF applications can conclude that the two URIs denote the same thing in the real world. (amazon:price is a functional property. For any product, there is only one amazon:price.) Inverse functional properties do the same in reverse. For any object, there is only one subject for a particular inverse functional property. (The ISBN relation is inverse functional. For any ISBN, there is only one book that has that ISBN.)
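A Python sketch of this reasoning follows. All of the names are hypothetical, including the ex:isbn and ex:mother predicates, and the sketch assumes the objects of functional properties are names rather than literals.

```python
from itertools import combinations

ontology = [
    ("ex:mother", "rdf:type", "owl:FunctionalProperty"),
    ("ex:isbn", "rdf:type", "owl:InverseFunctionalProperty"),
]
data = [
    ("ex:me", "ex:mother", "ex:mom"),
    ("ex:me", "ex:mother", "ex:jane"),
    ("loc:book42", "ex:isbn", "0143034650"),
    ("amazon:item7", "ex:isbn", "0143034650"),
    ("loc:book99", "ex:isbn", "0613917472"),
]


def same_things(data, ontology):
    """Return pairs of names that must denote the same real-world thing."""
    functional = {s for s, p, o in ontology
                  if p == "rdf:type" and o == "owl:FunctionalProperty"}
    inverse = {s for s, p, o in ontology
               if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}
    groups = {}
    for s, p, o in data:
        if p in functional:
            # One subject, one value: all objects seen are the same thing.
            groups.setdefault(("subject", s, p), set()).add(o)
        if p in inverse:
            # One object, one subject: all subjects seen are the same thing.
            groups.setdefault(("object", p, o), set()).add(s)
    pairs = set()
    for names in groups.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


for a, b in sorted(same_things(data, ontology)):
    print(a, "is the same thing as", b)
```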
These classes and predicates are some of the standard tools that are available to you when you write ontologies. For some examples of complete ontologies, including the standard RDF, RDFS, and OWL ontologies, see SchemaWeb.
More To Learn
I could probably write about RDF continuously for another few years, but I have to stop for now. Here are some things I haven't gotten to writing about.
There are a few query languages for RDF. You provide a query like who owns what, and the query engine returns the part of the RDF source data that matches your query. You could also write a query as an RDF graph itself, in which case the query engine's job is to match the query graph that you provide against the target data.
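To give a flavor of graph-style querying, here is a toy matcher in Python. It is not any real query language: the ?name convention for variables and the function names are assumptions of this sketch.

```python
data = [
    ("I", "own", "my_apartment"),
    ("my_apartment", "has", "my_computer"),
    ("my_apartment", "has", "my_bed"),
    ("my_apartment", "is_in", "Philadelphia"),
]


def match(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None."""
    bindings = dict(bindings)
    for q, t in zip(pattern, triple):
        if q.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if bindings.setdefault(q, t) != t:
                return None
        elif q != t:
            return None
    return bindings


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    extended_solutions.append(extended)
        solutions = extended_solutions
    return solutions


# "Who owns a place, and what does that place have?"
graph_query = [("?who", "own", "?place"), ("?place", "has", "?thing")]
for solution in query(graph_query, data):
    print(solution)
```

Because ?place appears in both patterns, the two statements in each answer are forced to share a node, which is what makes the query a graph rather than a list of independent questions.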
I hope to return to these points in the future.

