Since several readers yesterday seemed to struggle with some of the big-picture concepts in my essay about standards-based journalism in a semantic economy, I'm taking this opportunity to build a better cognitive bucket for all this information I'm shoveling your way. The big problem seems to be understanding what RDF is and how it might work in this context.
Rather than doing this in essay form, I'm going to try it as an FAQ.
Q. What's so special about RDF? How is linking to RDF statements any different than just linking to source documents?
A. I focus on RDF because I think of it as a machine-readable grammar that humans can understand, too. And because of the simple and amazingly flexible way that RDF statements are constructed, adding a new RDF statement doesn't mean that all the previous statements about its subject have to be edited to reflect the new stuff.
For instance, at this hour, Google News links to a story about a three-way-tie among the top Republicans in a poll about likely 2012 presidential candidates. How should I store that info?
If I wanted to store information in a Wikipedia-like text summary of facts, and if I wanted it to be useful no matter how a future user might wish to retrieve it, I'd have to go in and create a page for this poll. Then I'd have to edit the pages on polling, on the polling company involved, on the Republican Party, on the 2012 election, on each individual candidate, plus updates to pages for various concepts such as "Christian Right" and "voters over 65." That's a lot of writing.
On the other hand, if I simply ingested the results as a spreadsheet document, I could convert each of the cells in the spreadsheet into an RDF triple:
(candidate) | has support in poll X | (y percent of respondents)
And if I made sure that I used the same terms over and over again ("Huckabee" would normalize to "Michael Dale Huckabee," and so on), without "editing" the database entry known as "Michael Dale Huckabee," I would have expanded it. There would likely be several manual triples that I would have to create to describe additional information about the poll, but once I've established an information template for handling poll data, this won't take me too long.
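Here is a rough sketch of that spreadsheet-to-triples step, written by someone who is not a programmer's programmer, so treat it as an illustration rather than a design. The alias table, the function names and the percentages are all made up for the example; only the shape of the triple comes from the text above.

```python
# Hypothetical alias table: maps names as typed in a spreadsheet
# to the one canonical term used everywhere in the directory.
CANONICAL_NAMES = {
    "Huckabee": "Michael Dale Huckabee",
    "Mike Huckabee": "Michael Dale Huckabee",
}


def normalize(name):
    """Return the canonical term for a name, or the cleaned name if unknown."""
    cleaned = name.strip()
    return CANONICAL_NAMES.get(cleaned, cleaned)


def rows_to_triples(rows, poll_name):
    """Turn (candidate, percent) spreadsheet rows into (subject, predicate, object) triples."""
    predicate = f"has support in {poll_name}"
    return [
        (normalize(candidate), predicate, f"{percent} percent of respondents")
        for candidate, percent in rows
    ]


# Placeholder rows, NOT real poll results.
rows = [("Huckabee", 21), ("Candidate B", 19)]
triples = rows_to_triples(rows, "poll X")

for triple in triples:
    print(" | ".join(triple))
```

Nothing in the existing entry for "Michael Dale Huckabee" gets edited here; the new triples simply share its canonical name.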
For purposes of example, let's pretend that three weeks from now, a report surfaces that calls into question the accuracy of this poll. Instead of going back into each record I've created or modified, I'd simply write a new set of triples, including:
Washington Post-ABC News poll conducted by telephone Jan. 13-16, 2011 | has controversy | employee claims data flaw
You'd probably need dozens of RDF triples to represent just the first sentence of that news report:
A supervisor deliberately skewed data collected during a recent poll of likely Republican voters in an effort to boost fundraising for one of the candidates, Gallup data analyst Erica Borges claimed Friday in a sworn affidavit.
...but that's because human language is based on the assumption that both the speaker and the listener agree that words have common meanings. Computers, however, don't do nuance. They require us to specify meanings in unambiguous ways. This makes RDF absurdly redundant to human readers, but extremely powerful to computer programmers. If you give a computer a set of statements in RDF and provide it the right set of instructions, it can derive valuable statements about large data sets in far less time than humans reading documents and assembling statements with a standard word processor.
Another thing about this logic: I can write triples about triples, and expand the set of things that I write about:
[Washington Post-ABC News poll conducted by telephone 13012011 to 16012011 | has controversy | employee claims data flaw] | has date of complaint | 18022011
[Washington Post-ABC News poll conducted by telephone 13012011 to 16012011 | has controversy | employee claims data flaw] | has source | Erica Borges
Erica Borges | has employer | The Gallup Organization
[Erica Borges | has employer | The Gallup Organization] | has job title | data analyst
...which means that I can modify more complex concepts without having to write lots of triples from scratch each time I want to say something. If I were to build a system that stores, retrieves and recycles existing statements, I could describe things in RDF with greater efficiency than I could in English.
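To make the "triples about triples" idea concrete, here is a minimal sketch that treats a triple as a plain tuple, so that one triple can be the subject of another. This is only a toy model under my own assumptions: the store, the add() helper and the statements_about() helper are invented for illustration, and real RDF tooling has its own mechanisms for making statements about statements, which this does not attempt to reproduce.

```python
# A toy, append-only store: a set of (subject, predicate, object) tuples.
store = set()


def add(subject, predicate, obj):
    """Record a new triple and return it so it can be the subject of later triples."""
    triple = (subject, predicate, obj)
    store.add(triple)
    return triple


def statements_about(subject):
    """Return every (predicate, object) pair recorded for a subject."""
    return {(p, o) for s, p, o in store if s == subject}


poll = "Washington Post-ABC News poll conducted by telephone 13012011 to 16012011"

# The original claim, then triples whose subject is that claim.
controversy = add(poll, "has controversy", "employee claims data flaw")
add(controversy, "has date of complaint", "18022011")
add(controversy, "has source", "Erica Borges")

employment = add("Erica Borges", "has employer", "The Gallup Organization")
add(employment, "has job title", "data analyst")

for predicate, obj in sorted(statements_about(controversy)):
    print(predicate, "|", obj)
```

Each call to add() only appends; no earlier statement is rewritten when a new one arrives.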
Q. Are you some kind of expert in RDF?
Absolutely NOT. I have never had the opportunity to create RDF as part of a workflow. Neither have I ever modeled complex concepts in RDF. I'm not a computer programmer, or a semantic technology expert.
When I write about RDF, I typically avoid going into much depth, as I know there are people who've been doing these things for years who would probably look at my examples and laugh (for instance, I'm not sure that the examples above are cleanly constructed).
That doesn't mean that my general statements are wrong. It simply means that I'm not the best candidate to construct a demonstration project.
Q. So are you saying we should just stop writing news stories and produce databases instead?
No. I'm saying that if we used what I call a Semantic Content Management System, it might be possible to create and link to RDF expressions that support the statements we make in collecting and reporting the news.
Q. That's stupid. "Facts" are fluid things. We disagree about facts. Such a system would never work, because it's too explicit. Don't you realize that context matters?
As human beings, we go around assigning meaning to things. When we see light in the wavelength of 630–740 nm, we call it "red," and we don't get much disagreement. But when we look at health care reform, we look at the same data and reach different conclusions, which causes many of us to see red.
In the system I propose, everyone using it would be encouraged to cite sources for their statements. So if a critic of health care reform sees evidence of "death panels" in the law, it would be up to us to collect exactly what sub-section of the bill is being referenced. We'd create a triple that includes that citation. And later, when everyone is talking about "death panels," it would be much easier for people to find that section.
(as an exercise, go find that exact section of the law. time yourself)
Also, because of the nature of RDF, we could collect all the eventual citations about "death panels" very easily, even if the stories in which the citation occurs don't use the searchable term "death panels." As various people discuss and examine the death panel allegation, our reporting on the subject would create a library of triples that can be examined by machines on behalf of our efforts to clarify competing claims.
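A small sketch of how that kind of lookup might work, assuming the triples are stored as simple tuples. The story and section identifiers are placeholders (I am not claiming to know which section of the bill is at issue), and the predicate names "cites" and "is cited in claim about" are invented for this example.

```python
# Placeholder triples. None of these identifiers refer to real stories or sections.
triples = [
    ("story-1", "cites", "bill-section-A"),
    ("story-2", "cites", "bill-section-A"),
    ("story-3", "cites", "bill-section-B"),
    ("bill-section-A", "is cited in claim about", "death panels"),
]


def stories_about(concept, triples):
    """Find stories that cite any section linked to a concept,
    whether or not the story text ever uses the concept's name."""
    sections = {
        s for s, p, o in triples
        if p == "is cited in claim about" and o == concept
    }
    return sorted(
        s for s, p, o in triples
        if p == "cites" and o in sections
    )


print(stories_about("death panels", triples))
```

The point of the design is that the search follows the citation, not the wording: a story is found because it points at the same section, not because it contains the phrase.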
Would everyone be forced to cite a source? Absolutely not. But would un-sourced claims be evaluated as strongly as claims with citations? Of course not.
Here's a more common example. I'm covering a political rally. I'm making notes. I'm looking at my notes later for quotes and paraphrases. What's the fact I'm citing?
Answer: My reporting. I am the source. And consequently, I am not only writing a news story, but adding to and republishing my publication's directory of meaning. So if I go to a political rally and I hear Sarah Palin say the words: "Just because the words 'death panel' aren't in the law doesn't mean the effect of the law won't be death panels," it will be up to me to tag that quote like this:
Sarah Palin | has quote | "Just because the words 'death panel' aren't in the law doesn't mean the effect of the law won't be death panels."
[Sarah Palin | has quote | "Just because the words 'death panel' aren't in the law doesn't mean the effect of the law won't be death panels."] | has date | 1834Z13122011
[Sarah Palin | has quote | "Just because the words 'death panel' aren't in the law doesn't mean the effect of the law won't be death panels."] | has location | Manchester High School West auditorium, Manchester, New Hampshire
[Sarah Palin | has quote | "Just because the words 'death panel' aren't in the law doesn't mean the effect of the law won't be death panels."] | has source | "Palin defends 'death panels' claim" http://myNewHamphireDailyExample.com/2011/12-13/palin-defends-death-panels.html
"Palin defends 'death panels' claim" http://myNewHamphireDailyExample.com/2011/12-13/palin-defends-death-panels.html | has author | James Olsen
["Palin defends 'death panels' claim" http://myNewHamphireDailyExample.com/2011/12-13/palin-defends-death-panels.html] | has audio file object | http://myNewHamphireDailyExample.com/2011/december/reporting-notes-recordings/palinRally12122011.mp3
And so on, an example that also serves to demonstrate the need for an SCMS interface that automates the creation of these RDF expressions.
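As a sketch of what that automation might look like, here is one hypothetical helper that takes the handful of things a reporter actually knows and expands them into the full set of triples. The function name and its parameters are my own invention; the values are the ones from the example above, with the headline left out of the story identifier for brevity.

```python
def quote_triples(speaker, quote, date, location, story, author, audio_url):
    """Expand one reported quote into the triples that document it."""
    statement = (speaker, "has quote", quote)
    return [
        statement,
        (statement, "has date", date),
        (statement, "has location", location),
        (statement, "has source", story),
        (story, "has author", author),
        (story, "has audio file object", audio_url),
    ]


story = "http://myNewHamphireDailyExample.com/2011/12-13/palin-defends-death-panels.html"

triples = quote_triples(
    speaker="Sarah Palin",
    quote=(
        "Just because the words 'death panel' aren't in the law "
        "doesn't mean the effect of the law won't be death panels."
    ),
    date="1834Z13122011",
    location="Manchester High School West auditorium, Manchester, New Hampshire",
    story=story,
    author="James Olsen",
    audio_url=(
        "http://myNewHamphireDailyExample.com/2011/december/"
        "reporting-notes-recordings/palinRally12122011.mp3"
    ),
)

print(len(triples), "triples generated from one tagged quote")
```

The reporter supplies seven values once; the repetition that makes the hand-written version so tedious is handled by the tool.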
Palin's claim may or may not be predictively accurate. But all the factual claims in RDF are, in essence, merely provisional. Not only that, but more complete citations and data can be added to the stories we write later, as needed.
Q. That sounds expensive! Why would anyone want to do that?
I wrote a previous essay called "Imagining the Semantic Economy" that went into that subject. My web site contains a directory that links to other posts I've written speculating on the revenue streams that could be derived from complete data sets. I believe that reporting and editing workflows can be adapted to create specific data sets. (My usual example involves covering a house fire, and supposes that we could sell the resulting data set to insurance companies, home improvement warehouses, etc.)
And while I haven't written about this except in passing, I believe that directories of meaning will eventually acquire a financial value in and of themselves.
Q. How is this different than a database?
It's actually several databases. There's the database of all web stories on your site's server. There's the database of everything you produce for print, which may or may not be available to Web readers. There's a database of objects and resources, like digital video files and recordings of interviews and spreadsheets. There's a database of structured information from various publishing CMS, like the complete data set for all music venues that you get if you use Ellington to publish information about nightlife in your market.
And then there's what I call the directory of meaning. It's a public-facing directory of RDF expressions derived from our work as journalists.
Q. So if I viewed the source code for a story, would it have all these RDF triples embedded in it?
No. When you view the HTML source code, you'll notice a declaration in the metadata directing bots to the URL where the directory of meaning is located. You will also notice that there are tags around many of the words and phrases in the text, most of which have been created by the SCMS without input from the reporter. The inline RDF tags will include a value, and when that value is combined with the URL declared at the top of the document metadata, the result will be the URI for the RDF expression that contains the "aboutness" of the statement.
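The "combine the value with the declared URL" step is simple enough to sketch. The base URL and the inline values below are hypothetical, and I'm assuming the declared URL ends with a slash so that the value is appended rather than replacing the last path segment.

```python
from urllib.parse import urljoin

# Hypothetical URL declared once in the document's metadata.
DIRECTORY_BASE = "http://example.com/directory-of-meaning/"


def expression_uri(inline_value, base=DIRECTORY_BASE):
    """Combine the declared directory URL with a tag's inline value."""
    return urljoin(base, inline_value)


# Hypothetical inline values, as an SCMS might attach to words in a story.
for value in ["sarah-palin", "death-panels"]:
    print(expression_uri(value))
```

Keeping only a short value inline means the story's HTML stays light, while the full RDF expression lives in the directory where other documents can point to it too.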
Q. So this is just about metadata? There are already plenty of news metadata schema out there! Why not just use the stuff that's already out there instead of trying to propose new stuff? What's the big deal?
First, and I've said this before in several ways, this is not just about metadata. Traditional metadata describes the "aboutness" of objects: a .doc file that contains a story; a .jpeg file containing a news photograph; a .mov file containing a video. What I propose would embed the "aboutness" of every concept included in the document, but it would also store those statements in a way that would make them useful independent of the documents in which they appear.
So while a document would continue to have metadata (a file creation date, a pub date, an author, a dateline, etc.), the document's metadata would include links to a directory of meanings that would grow and develop with each new definition and relationship defined by the users of that directory.
You will notice that I'm actually using general terms. I don't get into the differences between RDF and RDFa, or weigh in on whether we should develop an SCMS with a preference for NewsML or NITF. Should we use the terms standardized by IPTC or go with a Dublin Core vocabulary? I have some thoughts, but they're irrelevant. These are decisions to be made during development, and frankly, an SCMS should be designed to be as platform-agnostic as possible. The world is just changing too fast to endorse one schema over another at such an early stage.
Why not just go with what we've got? I think that's exactly what I'm doing. What we've got is RDF within XML. Both are built to be extensible. That makes them compatible with multiple ontological systems.
What we don't "got" is a practical system to manage and publish directories of meaning, and then to embed that meaning within documents. No system on the planet today will do that job. But the individual tasks required to construct such a system are not exotic.
Q. That's all pretty in an abstract geek way, but you're asking us to believe you on a lot of things. Where's your proof?
I'm not offering proof, or examples. I'm offering an alternative that I think gives us the ability to create things that would be better than what we create now. I'm offering my logic, and I'm willing to be wrong. But I'm not going to agree that I'm wrong if you're challenging the claim without yet grasping the logic.
Q. The Don't Repeat Yourself model is compelling in the abstract, but I'm not convinced that the resulting directory of meaning will be useful -- in other words, it could "work" without providing value, or even doing the things that you say it would do. Is it possible that the writerly arguments against atomizing content beneath the story level are the winning arguments?
It's certainly possible. And for the record, these questions are largely questions that I've been asking myself for more than a year. And yes, I think I'm familiar with the principles that would argue against a system like mine, and to the extent that I'm capable of doing so I've tried to construct this idea with those critiques in mind.
Do I propose one information structure for all users? No.
Do I propose that governing or reviewing bodies should control facts? No.
Do I propose centralized, top-down control over semantic expressions? No.
I propose that cooperative, interoperable information structures will be an emergent property of media that makes use of SCMS. I propose that the driving force behind such developments will be capitalistic, not idealistic -- but that idealists will enjoy using these tools. I even suspect that a few idealists with the right backers could begin to develop this technology as an open source project. Right now.
Q. Let me get this right: I'm a reporter, and I've had to swallow shit sandwich after shit sandwich for the past decade, I'm doing my work PLUS the work of the three friends of mine who USED to sit in the three empty cubicles that surround mine in this godforsaken moonscape we used to call a newsroom. Management has been asking me to do all this extraneous Web bullshit like blog and Tweet and Facebook, like that's a verb. And now here you come, telling my bosses that they should be asking me to tag every goddamn definitive statement in my news story to an alphabet soup of geek bullshit that has no concrete proof that it's ever going to pay off for anybody. What I want to know is, what's your address? Because I'm going to come around there personally and kick your ass.
I live at 794 Rutledge Ave. in Charleston, SC, but please don't come around and hurt me.
My hope is that we can use these tools to create value for news information that goes beyond how many hits your stories get for the banner ads on the pages where they appear. My hope is that additional revenue will convince your company to start hiring again -- not just reporters and editors, but people who will come in and help you create, manage and benefit from these semantic initiatives. I hope that news companies will get back to competing based on quality, not just low cost. I'm hoping that by changing the rules of the game, we can make life miserable for the hacks who have been tormenting you and better for the people who want to do good, honest, honorable work.
I recognize that these are new concepts (or at the very least, old concepts arranged in ways that are unfamiliar to most journalists). But if you're interested in making something better, I encourage you to stick with it a bit. If I didn't believe this had the possibility of being helpful, I promise you I wouldn't have spent so much time thinking about it.