What is RDF and what is it good for?

Author: Joshua Tauberer. Link to original: http://www.rdfabout.com/intro/ (English).
This is an introduction to RDF (“Resource Description Framework”), which is the standard for encoding metadata and other knowledge on the Semantic Web. In the Semantic Web, computer applications make use of structured information spread in a distributed and decentralized way throughout the current web. RDF is an abstract model, a way to break down knowledge into discrete pieces, and while it is most popularly known for its RDF/XML syntax, RDF can be stored in a variety of formats. This article discusses the abstract RDF model, two concrete serialization formats, how RDF is used and how it differs from plain XML, higher-level RDF semantics, best practices for deployment, and querying RDF data sources.

This document was originally written in October 2005. In July 2006 it was revised and extended with material from my xml.com article "What is RDF". In January 2008 it was revised with more on N3 and RDF/XML and extended with the new sections Linked Data for the Web and Querying Semantic Web Databases.

Why we need a new standard for the Semantic Web

On the Semantic Web, computers do the browsing for us. The “SemWeb” enables computers to seek out knowledge distributed throughout the Web, mesh it, and then take action based on it. To use an analogy, the current Web is a decentralized platform for distributed presentations while the SemWeb is a decentralized platform for distributed knowledge. RDF is the W3C standard for encoding knowledge.

There is of course knowledge on the current Web, but it's off limits to computers. Consider a Wikipedia page: it might convey a lot of information to the human reader, but all the computer displaying the page sees is presentation markup. To the extent that computers make sense of HTML, images, Flash, etc., it's almost always for the purpose of creating a presentation for the end-user. The real content, the knowledge the files are conveying to the human, is opaque to the computer.

What is meant by “semantic” in the Semantic Web is not that computers are going to understand the meaning of anything, but that the logical pieces of meaning can be mechanically manipulated by a machine to useful ends.

So now imagine a new Web where the real content can be manipulated by computers. For now, picture it as a web of databases. One “semantic” website publishes a database about a product line, with products and descriptions, while another publishes a database of product reviews. A third site for a retailer publishes a database of products in stock. What standards would make it easier to write an application to mesh distributed databases together, so that a computer could use the three data sources together to help an end-user make better purchasing decisions?

There's nothing stopping anyone from writing a program now to do those sorts of things, in just the same way that nothing stopped anyone from exchanging data before we had XML. But standards facilitate building applications, especially in a decentralized system. Here are some of the things we would want a standard about distributed knowledge to consider:

1. Files on the Semantic Web need to be able to express information flexibly. Life can't be neatly packed into tables, as in relational databases, or hierarchies, as in XML. The information about movies and TV shows in the figure below is really best expressed as a graph:

[[diagram: Knowledge as a Graph]]

Of course, we can't be drawing our way through the Semantic Web, so instead we will need a tabular notation for these graphs. Compare the table below to the figure above. Each row represents an arrow (an “edge”) in the figure. The first column has the name of the “node” at the start of the edge. The second column has the label of the edge itself (the kind of edge). The third column has the name of the node at the end of the arrow.

[[table: Start Node Edge Label End Node]]

Whether we represent the graph as a picture or in a table, we're talking about the same thing. Both describe what is abstractly called a graph. More on this later.

2. Files on the Semantic Web need to be able to relate to each other. A file about product prices posted by a vendor and a file with product reviews posted independently by a consumer need to have a way of indicating that they are talking about the same products. Just using product names isn't enough. Two products might exist in the world both called “The Super Duper 3000,” and we want to eliminate ambiguity from the SemWeb so that computers can process the information with certainty. The SemWeb needs globally unique identifiers that can be assigned in a decentralized way.

3. We will use vocabularies for making assertions about things, but these vocabularies must be able to be mixed together. A vocabulary about TV shows developed by TV aficionados and a vocabulary about movies independently developed by movie connoisseurs must be able to be used together in the same file, to talk about the same things, for instance to assert that an actor has appeared in both TV shows and movies.

These are some of the requirements that RDF, Resource Description Framework, provides a standard for, as we'll see in the next section. Before getting too abstract, here are actual RDF examples of the information from the graph above, first in the Notation 3 format, which closely follows the tabular encoding of the underlying graph:

[[example: Notation 3 Example]]

And here it is in the standard RDF/XML format, which may have a more intuitive feel and is more explicit about hierarchical structure in the graph, but which in most cases tends to obscure the underlying graph:

[[example: RDF/XML Example]]

RDF was originally created in 1999 as a standard on top of XML for encoding metadata, literally data about data. Metadata is of course things like who authored a Web page, what date a blog entry was published, etc., information that is in some sense secondary to some other content already on the regular Web. Since then, and perhaps even after the updated RDF spec in 2004, the scope of RDF has really evolved into something greater. The most exciting uses of RDF aren't in encoding information about Web resources, but information about and relations between things in the real world: people, places, concepts, etc.

Introducing RDF

Unless you know Resource Description Framework (RDF) well, it's probably best if you try to forget what you already know about it as you read the rest of this section. RDF exists at the intersection of a few different technologies, so it's easy to be led into thinking that it is merely a particular XML data format or a tool for blog feeds. Forget what you know. Here is RDF from the beginning.

RDF is a general method to decompose knowledge into small pieces, with some rules about the semantics, or meaning, of those pieces. The point is to have a method so simple that it can express any fact, and yet so structured that computer applications can do useful things with knowledge expressed in RDF. I say "method" in particular, rather than format, because one can write down those pieces in any number of ways and still preserve the original information and structure, just like how one can express the same meaning in different human languages or implement the same data structure in multiple ways.

In some ways, RDF can be compared to XML. XML also is designed to be simple and applicable to any type of data. XML is also more than a file format. It is a foundation for dealing with hierarchical, self-contained documents, whether they be stored on disk in the usual brackets-and-slashes format, or held in memory.

What sets RDF apart from XML is that RDF is designed to represent knowledge in a distributed world. That RDF is designed for knowledge, and not data, means RDF is particularly concerned with meaning. Everything mentioned in RDF means something. It may be a reference to something in the world, like a person or movie, or it may be an abstract concept, like the state of being friends with someone else. And by putting three such entities together, the RDF standard says how to arrive at a fact. The meaning of the triple “(John, the state of being friends, Bob)” might be that John and Bob are friends. By putting a lot of facts together, one arrives at some form of knowledge. Standards built on top of RDF, including RDFS and OWL, add to RDF semantics for drawing logical inferences from data.

For comparison, XML itself is not very much concerned with meaning. XML nodes don't need to be associated with particular concepts, and the XML standard doesn't indicate how to derive a fact from a document. For instance, if you were presented with a few XML documents whose root nodes were in a foreign language you don't understand, you couldn't do anything useful with the documents but display them. RDF documents with nodes you can't understand could still actually be usefully processed because RDF specifies some basic level of meaning. Now, this isn't to say that you couldn't develop your own standard on top of XML that says how to derive the set of facts in an XML document, but you'll find you've probably just reinvented something like RDF.

The second key aspect of RDF is that it works well for distributed information. That is, RDF applications can put together RDF files posted by different people around the Internet and easily learn from them new things that no single document asserted. It does this in two ways, first by linking documents together by the common vocabularies they use, and second by allowing any document to use any vocabulary. This allows enormous flexibility in expressing facts about a wide range of things, drawing on information from a wide range of sources.

[[block: For the official documentation on RDF, start with the RDF Primer.]]

Triples of knowledge

RDF provides a general, flexible method to decompose any knowledge into small pieces, called triples, with some rules about the semantics (meaning) of those pieces.

The foundation is breaking knowledge down into what is called a labeled, directed graph, if you know the terminology.

Each edge in the graph represents a fact, or a relation between two things. The edge in the figure above from the node vincent_donofrio labeled starred_in to the node the_thirteenth_floor represents the fact that actor Vincent D'Onofrio starred in the movie “The Thirteenth Floor.” A fact represented this way has three parts: a subject, a predicate (i.e. verb), and an object. The subject is what's at the start of the edge, the predicate is the type of edge (its label), and the object is what's at the end of the edge.
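As a minimal sketch of this idea in Python, each edge can be held as a three-part record. The `Triple` type and the `edges_from` helper are hypothetical names made up for this illustration, not part of any RDF library, and the second edge in the graph is an assumption added only so there is more than one edge to look at.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # node at the start of the edge
    predicate: str  # label of the edge
    object: str     # node at the end of the edge

# The edge described in the text.
fact = Triple("vincent_donofrio", "starred_in", "the_thirteenth_floor")

# A graph is just a collection of such edges. The second edge is an
# assumption added for illustration only.
graph = {
    fact,
    Triple("the_thirteenth_floor", "is_a", "movie"),
}

def edges_from(graph, node):
    """Return every edge that starts at the given node."""
    return sorted(t for t in graph if t.subject == node)

print(edges_from(graph, "vincent_donofrio"))
```

Nothing in the record says what starred_in means; the structure only fixes which name is the start of the edge, which is the label, and which is the end.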

The six documents composing the RDF specification tell us two things. First, they outline the abstract model, i.e. how to use triples to represent knowledge about the world. Second, they describe how to encode those triples in XML. We'll take each in turn.

The abstract RDF model: Statements

RDF is nothing more than a general method to decompose information into pieces. The emphasis is on general here because the same method can be used for any type of information. And the method is this: Express information as a list of statements in the form SUBJECT PREDICATE OBJECT. The subject and object are names for two things in the world, and the predicate is the name of a relation between the two. You can think of predicates as verbs.

Here's how I would break down information about my apartment into RDF statements:

SUBJECT PREDICATE OBJECT
I own my_apartment
my_apartment has my_computer
my_apartment has my_bed
my_apartment is_in Philadelphia

These four lines express four facts. Each line is called a statement or triple.

The subjects, predicates, and objects in RDF are always simple names for things: concrete things, like my_apartment, or abstract concepts, like has. These names don't have internal structure or significance of their own. They're like proper names or variables. It doesn't matter what name you choose for anything, as long as you use it consistently throughout.

Names in RDF statements are said to refer to or denote things in the world. The things that names denote are called resources (dating back to RDF's use for metadata for web resources), nodes (from graph terminology), or entities. These terms are generally all synonymous. For instance, the name my_apartment denotes my actual apartment, which is an entity in the real world. The distinction between names and the entities they denote is minute but important because two names can be used to refer to the same entity.

Predicates are always relations between two things. Own is a relation between an owner and an 'ownee'; has is a relation between the container and the thing contained; is_in is the inverse relation, between the contained and the container. In RDF, the order of the subject and object is very important.

The next aspect of RDF almost goes without saying, but I want to put everything down in print: If someone refers to something as X in one place and X is used in another place, the two X's refer to the same entity. When I wrote my_apartment in the first line, it's the same apartment that I meant when I wrote it in the other three lines.

The rules so far already get us a lot farther than you might realize. Given this table of statements, I can write a simple program that can answer questions like "who own my_apartment" and "my_apartment has what." The question itself is in the form of an RDF statement, except the program will consider wh-words like who and what to be wild-cards. A simple question-answering program can compare the question to each row in the table. Each matching row is an answer. Here's the pseudocode:

[[pseudocode: Pseudocode for Question-Answering]]
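The matching procedure just described could be sketched in Python roughly as follows. This is one possible rendering under simple assumptions: statements are plain tuples of strings, and the set of wild-card words (`WILDCARDS`) and the function name `answer` are choices made for this sketch.

```python
# The four statements about the apartment, one tuple per row.
statements = [
    ("I", "own", "my_apartment"),
    ("my_apartment", "has", "my_computer"),
    ("my_apartment", "has", "my_bed"),
    ("my_apartment", "is_in", "Philadelphia"),
]

WILDCARDS = {"who", "what"}  # wh-words match anything in their position

def answer(question, statements):
    """Return every statement that matches the question, position by position."""
    matches = []
    for row in statements:
        # A position matches if the question has a wild-card there
        # or if the two names are identical.
        if all(q in WILDCARDS or q == term for q, term in zip(question, row)):
            matches.append(row)
    return matches

print(answer(("who", "own", "my_apartment"), statements))
print(answer(("my_apartment", "has", "what"), statements))
```

The comparison is purely on names, which is why the question has to use the predicate exactly as it appears in the table (own, not owns).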

The computer doesn't need to know what has actually means in English for this to be useful. That is, it's left up to the application writer to choose appropriate names for things (e.g. my_apartment) and to use the right predicates (own, has). RDF tools are ignorant of what these names mean, but they can still usefully process the information. (I'll get to more useful things later.)

URIs to Name Resources

RDF information is meant to be published on the Internet, and so the names I used above have a problem. I shouldn't name something my_apartment because someone else might use the name my_apartment for their apartment too. Following from the last fact about RDF, RDF tools would think the two instances of my_apartment referred to the same thing in the real world, whereas in fact they were intended to refer to two different apartments. The last aspect of RDF is that names must be global, in the sense that you must not choose a name that someone else might conceivably also use to refer to something different. Formally, names for subjects, predicates, and objects must be Uniform Resource Identifiers (URIs).

Now, in the SemWeb world, URIs are treated in a somewhat inconsistent way, so bear with me here.

On the one hand, URIs are supposed to be opaque. URIs can have the same syntax or format as website addresses, so you will see RDF files that contain URIs like http://www.w3.org/1999/02/22-rdf-syntax-ns#type, where that URI is the global name for some entity. But, the fact that it looks like a web address is totally incidental. There may or may not be an actual website at that address, and it doesn't matter. There are other types of URIs besides http:-type URIs. URNs are a subtype of URI used for things like identifying books by their ISBN number, e.g. urn:isbn:0143034650. TAGs are a general-purpose type of URI. They look like tag:govtrack.us,2005:congress/senators/frist. This article will use a mix of these three URI types.

Whatever their form, URIs you see in RDF documents are merely verbose names for entities, nothing more.

That was the first perspective on URIs. Now, the second perspective is that in recent years (2007 and on), there actually is an expectation that if you create an http: URI, or any dereferenceable URI, you actually put something at that address so that RDF clients can access that page and get some information. Here's the bottom line: As for what a URI means in a document, what the URI is simply doesn't matter, but when you use dereferenceable URIs, there may be an expectation that you put something on the web at that address. We will return to this in the section about Linked Data.

URIs are used as global names because they provide a way to break down the space of all possible names into units that have obvious owners. URIs that start with http://www.govtrack.us/ are implicitly controlled by me, or whoever is running the website at that address. By convention, if there's an obvious owner for a URI, no one but that owner will "mint" a new resource with that URI. This prevents name clashes. If you create a URI in the space of URIs that you control, you can rest assured no one will use the same URI to denote something else. (Of course, someone might use your URIs in a way that you would not appreciate, but this is a subject for another article.)

Since URIs can be quite long, in various RDF notations they're usually abbreviated using the concept of namespaces from XML. As in XML, a namespace is generally declared at the top of an RDF document and then used in abbreviated form later on. Let's say I've declared the abbreviation taubz for the URI http://razor.occams.info/index.html#. In many RDF notations, I can then abbreviate URIs like http://razor.occams.info/index.html#my_apartment by replacing the namespace URI exactly as it is given in the declaration with the abbreviation and a colon, in this case simply as taubz:my_apartment. The precise rules for namespacing depend on the RDF serialization syntax being used.

Importantly, namespaces have no significant status in RDF. They are merely a tool to abbreviate long URIs.
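To make the abbreviation mechanism concrete, here is a small Python sketch of expanding and abbreviating names using the taubz declaration from the text. The function names `expand` and `abbreviate` are hypothetical, and real serialization syntaxes have stricter rules about which characters may appear in the local part, which this sketch ignores.

```python
# Prefix declarations: abbreviation -> namespace URI.
prefixes = {"taubz": "http://razor.occams.info/index.html#"}

def expand(name, prefixes):
    """Expand an abbreviated name like 'taubz:my_apartment' to a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"no namespace declared for {name!r}")
    return prefixes[prefix] + local

def abbreviate(uri, prefixes):
    """Abbreviate a full URI if it starts with a declared namespace URI."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no matching namespace; leave the URI as it is

full = expand("taubz:my_apartment", prefixes)
print(full)
print(abbreviate(full, prefixes))
```

Both functions are plain string operations, which reflects the point that the abbreviated and full forms are the same name.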

I might re-write the table about my apartment as it is below, replacing the simple names I first used above with arbitrary URIs:

[[example: RDF about My Apartment]]

The table above is just an informal table representing the graph of information that exists at an abstract level, which could just as well be described by the figure below. We will talk more about standard ways of actually writing out RDF later on.

[[diagram: RDF as a Graph]]

Wrapping It Up So Far

And that's RDF. Everything else in the Semantic Web builds on those three rules, repeated here to hammer home the simplicity of the system:

1. A fact is expressed as a triple of the form (Subject, Predicate, Object).
2. Subjects, predicates, and objects are given as names for entities, whether concrete or abstract, in the real world.
3. Names are in the format of URIs, which are opaque and global.

These concepts form most of the abstract RDF model for encoding knowledge. It's analogous to the common API that most XML libraries provide. If it weren't for us curious humans always peeking into files, the actual format of XML wouldn't matter so much as long as we had our appendChild, setAttribute, etc. Of course, we do need a common file format for exchanging data, and in fact there are two for RDF, which we look at later.

Blank Nodes and Literal Values

There is actually a bit more to RDF than the three rules above. So far I've described three types of things in RDF: resources (things or concepts) that exist in the real world, global names for resources (i.e. URIs), and RDF statements (triples, or rows in a table). There are two more things.

Literals

The first new thing is the literal value. Literal values are raw text that can be used instead of objects in RDF triples. Unlike names (i.e. URIs) which are stand-ins for things in the real world, literal values are just raw text data inserted into the graph. Literal values could be used to relate people to their names, books to their ISBN numbers, etc.:

[[example: Some Uses of Literals]]

Blank/Anonymous Nodes

Then there are anonymous nodes, blank nodes, or bnodes. These terms are all synonymous. The words anonymous or blank are meant to indicate that these are nodes in a graph without a name, either because the author of the document doesn't know or doesn't want to or need to provide a name. In a sense, this is like saying “John is friends with someone, but I'm not telling who.” When we say these nodes are nameless, keep in mind two things. First, the real-world thing that the node denotes is not inherently nameless. John's friend, in the example, has a name, after all. Second, when we say nameless here, we are referring to the concept of naming things with URIs. Actual blank nodes in documents may be given “local” identifiers so that they may be referred to multiple times within a document. It is only that these local identifiers are explicitly not global, and have no meaning outside of the document in which they occur.

Here's one way literal values and anonymous nodes are used. One literal value in the example is "Joshua Tauberer", and the anonymous or blank node is _:anon123.

[[example: Blank Nodes and Literal Values]]

To distinguish between URIs, namespaced names (abbreviated URIs), anonymous nodes, and literal values, I used the following common convention:

* Full URIs are enclosed in angle brackets.
* Namespaced names are written plainly, but their colons give them away.
* Anonymous nodes are written like namespaced names, but in the reserved "_" namespace with an arbitrary local name after the colon.
* Literal values are enclosed in quotation marks.

You should take a moment to try to visualize what graph is described by the table. Picture arrows between nodes.

There is one blank node in this example, _:anon123. What we know about this resource is that it is the author of <urn:isbn:0143034650> and it has the name Lawrence Lessig. Because no global name is used for this resource, we can't really be sure who we're talking about here. And, if we wanted to say more about whatever is denoted by _:anon123, we would have to do it in this very RDF document because we would have no way to refer to this particular Lawrence Lessig outside of the document.

More on Literals: Language Tags and Datatypes

Literal values can be optionally adorned with one of two pieces of metadata. The first is a language tag, to specify what language the raw text is written in. The language tag should be viewed as a vestige of how RDF was used in the early days. Today it is an ugly hack. You may see “ "chat"@en ”, the literal value “chat” with an English language tag, or “ "chat"@fr ”, the same with the French language tag.

Alternatively, a literal value can be tagged with a URI indicating a datatype. The datatype indicates how to interpret the raw text: as a number, a URI, a date or time, etc. Datatypes can be any URI, although the datatypes defined in XML Schema are used by convention. The notation for datatypes is often the literal value in quotes followed by two carets, followed by the datatype URI (possibly abbreviated):

[[example: Datatypes]]

Datatypes are a bit tricky. Let's think of the datatype for floating-point numbers. At an abstract level, the floating-point numbers themselves are different from the text we use to represent them on paper. For instance, the text “5.1” represents the number 5.1, but so does “5.1000” and “05.10”. Here there are multiple textual representations — what are called lexical representations — for the same value. A datatype tells us how to map lexical representations to values, and vice versa.

The semantics of RDF takes language tags and datatypes into account. This means two things. First, a literal value without either a language tag or datatype is different from a literal with a language tag, which in turn is different from a literal with a datatype. These four statements say four different things and none can be inferred from the others:

[[example: Literal Semantics]]

So, an untyped literal with or without a language tag is not the same as a typed literal. The second part of the semantics of literals is that two typed literals that appear different may be the same if their datatype maps their lexical representations to the same value. The following statements are equivalent (at least for an RDF application that has been given the semantics of the XSD datatypes):

[[example: Datatype Semantics]]

These mean John's age is 10. That is, the textual representation of the number is beside the point and is not part of the meaning encoded by the triples. Note that if the float datatype were not specified, the triples would not be inherently equivalent, and the textual representation of the 10 would be maintained as part of the information content.
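The two rules about literal semantics can be sketched in Python as a comparison function. This is an illustration under stated assumptions: the `Literal` type and `same_meaning` function are hypothetical, only the XSD float datatype is handled, and Python's built-in `float()` stands in for the real XSD parsing rules.

```python
from typing import NamedTuple, Optional

XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

class Literal(NamedTuple):
    text: str                        # the raw text (lexical representation)
    language: Optional[str] = None   # a language tag such as "en", if any
    datatype: Optional[str] = None   # a datatype URI, if any

def same_meaning(a, b):
    """Compare two literals: by value if both are floats, otherwise exactly."""
    if a.datatype == b.datatype == XSD_FLOAT:
        # The datatype maps both lexical representations to values,
        # and the values are what get compared.
        return float(a.text) == float(b.text)
    # Without a shared, understood datatype, the text, the language tag,
    # and the datatype must all be identical.
    return a == b

print(same_meaning(Literal("10", datatype=XSD_FLOAT), Literal("10.0", datatype=XSD_FLOAT)))
print(same_meaning(Literal("10"), Literal("10.0")))
print(same_meaning(Literal("chat", language="en"), Literal("chat", language="fr")))
```

An application that has not been given the semantics of a datatype would fall through to the exact comparison, which matches the caveat that equivalence depends on the application knowing the XSD datatypes.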

More on Blank Nodes: Some Caveats

Unlike the rule for URIs stating that they are global, local identifiers used to name blank nodes are explicitly not global. A local bnode identifier used in two separate documents can refer to two different things. Still, however, the identifier itself is arbitrary, and the actual identifier used in any particular case is not a part of the information content of the document.
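One practical consequence shows up when combining documents: blank node identifiers have to be kept apart, whereas URIs can simply be pooled. The Python sketch below shows one way to do that by giving each local identifier a document-specific suffix. The function name, the two tiny documents, the ex: predicate, and the names in them are all made up for this illustration.

```python
def rename_blank_nodes(triples, doc_id):
    """Give each blank node identifier a document-specific suffix."""
    def rename(term):
        return f"{term}_{doc_id}" if term.startswith("_:") else term
    return {tuple(rename(term) for term in triple) for triple in triples}

# Two hypothetical documents that both happen to use the identifier _:a.
doc1 = {("_:a", "ex:name", '"Alice"')}
doc2 = {("_:a", "ex:name", '"Bob"')}

# Pooling the documents without renaming would wrongly treat the two
# _:a nodes as one node with two names.
merged = rename_blank_nodes(doc1, "doc1") | rename_blank_nodes(doc2, "doc2")

for triple in sorted(merged):
    print(triple)
```

Because the identifier is not part of the information content, replacing _:a with another unused identifier changes nothing about what either document says.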

There are some contexts in which blank nodes are treated slightly differently. In queries, they may be taken as variables. Sometimes they may also be thought of as “existentially bound” (but that is a topic for another time).

Anonymous nodes are often used to avoid having to assign people URIs, as in the example above. They're also often used in representing more complex relations:

[[example: Blank Nodes for Complex Relations]]

Here the anonymous node was used as an intermediate step in the relation between me and the parts of my name. The node represents my name in a structured way, rather than using a single opaque literal value "Joshua Ian Tauberer". RDF only allows binary relations, so it's necessary to express many-way relations using intermediate nodes, and these nodes are often anonymous.
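The intermediate-node pattern can be sketched in Python as follows. The predicate names (ex:name, ex:first, ex:middle, ex:last), the identifier _:name1, and the helper functions are assumptions invented for this sketch; only the name itself comes from the text.

```python
# A structured name: one binary relation to an intermediate blank node,
# then one binary relation per part of the name.
triples = [
    ("taubz:me", "ex:name", "_:name1"),
    ("_:name1", "ex:first", '"Joshua"'),
    ("_:name1", "ex:middle", '"Ian"'),
    ("_:name1", "ex:last", '"Tauberer"'),
]

def objects(triples, subject, predicate):
    """Return the objects of all triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def follow(triples, start, *predicates):
    """Follow a chain of predicates from a start node, one hop per predicate."""
    nodes = [start]
    for predicate in predicates:
        nodes = [o for node in nodes for o in objects(triples, node, predicate)]
    return nodes

print(follow(triples, "taubz:me", "ex:name", "ex:last"))
```

Reaching a part of the name takes two hops instead of one, which is the cost of keeping every relation binary.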

Reading and Writing RDF

In this section, we describe two standard ways of writing out RDF. Because RDF is the abstract graph-like model, we call these written formats serialization syntaxes for RDF.

Notation 3

Notation 3 (“N3”), or the subset called Turtle, is a de facto standard for writing out RDF. N3 is not a W3C standard (Turtle has since been standardized by the W3C), but it is widely deployed, commonly used in email discussions between SemWeb developers, and the most important RDF notation to understand because it most clearly captures the abstract graph.

Here is an example, and it should be mostly clear what triples are encoded by this document.

[[example: Notation 3 Example]]

In tabular form (actually also a standard called N-Triples), this is:

[[example: Tabular or NTriples Form]]

This is meant to express the geographic latitude and longitude coordinates of Princeton University, that it has a department, and that that department is named "Department of Computer Science".
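Since the tabular form writes one triple per line, producing it from triples held in memory is straightforward. The Python sketch below writes two of the facts just described. Every URI in it is a hypothetical placeholder under http://example.org/ rather than the URIs of the actual example, the function names are invented here, and only URIs and plain literals are handled (no blank nodes, language tags, or datatypes).

```python
def nt_term(term):
    """Write one term: a URI in angle brackets, or a quoted, escaped literal."""
    kind, value = term
    if kind == "uri":
        return f"<{value}>"
    if kind == "literal":
        escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        return f'"{escaped}"'
    raise ValueError(f"unsupported term kind: {kind!r}")

def to_ntriples(triples):
    """Write one 'subject predicate object .' line per triple."""
    return "\n".join(" ".join(nt_term(t) for t in triple) + " ." for triple in triples)

EX = "http://example.org/"  # hypothetical namespace for this sketch

triples = [
    (("uri", EX + "princeton"), ("uri", EX + "hasDept"), ("uri", EX + "cs")),
    (("uri", EX + "cs"), ("uri", EX + "title"), ("literal", "Department of Computer Science")),
]

print(to_ntriples(triples))
```

Each term is tagged with its kind because, once in memory, a URI and a literal with the same text would otherwise be indistinguishable.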
