XARK 3.0

  • Xark began as a group blog in June 2005 but continues today as founder Dan Conover's primary blog-home. Posts by longtime Xark authors Janet Edens and John Sloop may also appear alongside Dan's here from time to time, depending on whatever.


Thursday, January 20, 2011



Jay Rosen

Thanks, Dan. Tons to think about here. I need to do some more reading to understand it all.

Two suggestions...

Seventh W: "What's the background here?" Meaning: who, what, where, when, why, how and who cares are all necessary to understand a news story, but frequently the story as a whole isn't understandable without the context and background that make it news.

More people might get what you are talking about if you used an existing journalistic product to illustrate. One I kept thinking about as I tried to "get" what this future system would look like is:


Am I way off here?


Yeah, unfortunately, there is no existing journalistic example, which makes these concepts harder to communicate.

Crunchbase, for instance, is a database in the sense that it includes links to semi-structured data organized in a logical manner. But when you read the profiles of each company or person, the statements are not linked to their sources.

To imagine what I'm describing, you have to imagine that when you read the New York Times online, every statement in every sentence is linked to an RDF expression in a semantic directory. And if breaking news is occurring faster than computer-assisted semantic markup can keep up with, then you can expect that an editor will be going back to complete that formal linking later.

So you'll read the news the same way, but the way we report and record information will create a machine-readable model of that information that will feed the creation of as many data sets as we can make profitably.

The reason there is no current journalistic equivalent is that there is no content management system in which you can both create and publish text while creating, managing and publishing the directory of semantic references that supports those statements. It's not practical without computer assistance to writers and editors.

Reading the links to my definitions of "meaning models" and "directories of meaning" might help you grok the concept. I accept that it's a hurdle.



Like Jay I am commenting while still trying to digest this.

I love the box-score comparison. Structured data are critical to the enjoyment of baseball. The great moments of our baseball memory come from someone (Hank Aaron, Cal Ripken, Sammy Sosa and Mark McGwire) challenging a cherished record that exists only because of structured data. Not every baseball fan reads box scores, but everyone cherishes the records and honors the record holders. And a key part of the anger over performance-enhancing drugs is that they tarnish the records.

A challenge to such a semantic system might be that some incorrect facts (like the performance-enhancing drugs) can tarnish the data repository. For instance, if Beck and others of his ilk repeat a lie often enough, how do we keep these multiple links from providing it with credibility (a data version of the echo chamber that repeats the talking points of ideological voices today)? The system also will spread and lend credibility to innocent errors (such as NPR reporting that Gabrielle Giffords had died, which was corrected, but many errors go unreported). A critical part of developing effective standards will be to include ways to challenge, verify and correct information.

Now, I'll reread and read some related links before commenting further in ignorance.


I think that there are two possible answers to the points you raise, Steve, and they're why I've focused on RDF.

First, people are going to get things wrong. No problem there. And the reason that you base standards on information practices instead of information outcomes is that if your info is cleanly coded and the referring links work, it's easier for people and bots to check your facts.

Like I said about Beck, he can repeat a big lie over and over, just like he does now. But if he tried to do so while qualifying for certification, it would be relatively easy for us to check his claims in real-time, or show that the claim is just fantasy. And while Beck and Fox could certainly create their own directory of meaning, if that directory is standards compliant, it will be possible to compare what Fox considers to be fact-based citation to what the rest of us consider to be fact-based citation.

Second, information is going to need to be updated, whether because initial reports are wrong, or because we learn additional facts. If you're using the RDF grammar of triples (subject/predicate/object) as your system for describing your factual statements, then you create additional triples to make the correction. That way your record would show that an error existed, but the error wouldn't propagate.
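To make the correction-by-addition idea concrete, here is a minimal sketch in plain Python rather than real RDF tooling. Everything in it is an assumption for illustration: the `ex:` identifiers, the `ex:correctedBy` predicate, and the helper names are hypothetical, not part of any RDF vocabulary or of the system described above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject/predicate/object statement."""
    subject: str
    predicate: str
    obj: str


def add_correction(statements, corrections, old_id, new_id, new_triple):
    """Record a correction by adding triples; nothing is deleted."""
    statements[new_id] = new_triple
    # A triple *about* the old statement, pointing at its replacement.
    corrections.append(Triple(old_id, "ex:correctedBy", new_id))


def current_statements(statements, corrections):
    """Return only the statements that have not been corrected."""
    superseded = {
        c.subject for c in corrections if c.predicate == "ex:correctedBy"
    }
    return {sid: t for sid, t in statements.items() if sid not in superseded}


# Hypothetical data: an initial report that later turns out to be wrong.
statements = {
    "ex:stmt1": Triple("ex:personA", "ex:hasStatus", "deceased"),
}
corrections = []

add_correction(
    statements,
    corrections,
    old_id="ex:stmt1",
    new_id="ex:stmt2",
    new_triple=Triple("ex:personA", "ex:hasStatus", "hospitalized"),
)
```

In this sketch the erroneous statement stays in `statements`, so the record shows that the error existed, while `current_statements` leaves it out of anything published downstream.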

One benefit of doing this in RDF/XML is that it gives you the ability to run scripts against your records. So if a record includes the predicate "has date of death" with a value of Jan. 8, 2011, AND the predicate "has rehab start date" with a value of Jan. 19, 2011, it's possible that we'll spot these contradictions with machine intelligence.
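A script of that kind could be sketched as follows. This is a toy check in plain Python, not RDF/XML processing: the `ex:` subjects, the predicate names, and the single rule it applies (no dated event after a recorded date of death) are all assumptions made for the example.

```python
from datetime import date

# Hypothetical records as (subject, predicate, object) triples.
triples = [
    ("ex:personA", "hasDateOfDeath", date(2011, 1, 8)),
    ("ex:personA", "hasRehabStartDate", date(2011, 1, 19)),
    ("ex:personB", "hasRehabStartDate", date(2011, 1, 19)),
]


def find_contradictions(triples):
    """Flag dated statements that fall after the subject's date of death."""
    death_dates = {
        s: o for s, p, o in triples if p == "hasDateOfDeath"
    }
    flagged = []
    for s, p, o in triples:
        if p == "hasDateOfDeath" or s not in death_dates:
            continue
        # Only compare objects that are actually dates.
        if isinstance(o, date) and o > death_dates[s]:
            flagged.append((s, p, o))
    return flagged


suspect = find_contradictions(triples)
```

Here `suspect` holds the one triple about `ex:personA` whose date falls after the recorded death; a human editor would still have to decide which of the two statements is wrong.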

Finally, because the meaning model for each story is based on URI links to a directory of statements and knowledge, if there's an update to that URI (we DIDN'T find WMDs in Iraq after all, Darth Vader turns out to be Luke's FATHER, etc.), we only have to update the directory, not every instance of semantic markup that references the URI.
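The update-once property comes from indirection: stories hold references, and the directory holds the statements. A minimal sketch, assuming a directory that is just a dictionary keyed by made-up `ex:` URIs (the story IDs and statement text are likewise hypothetical):

```python
# Hypothetical directory: one entry per URI.
directory = {
    "ex:stmt/42": "Initial report: the claim is confirmed.",
}

# Stories store only the URIs they reference, never the statement text.
stories = {
    "story-a": ["ex:stmt/42"],
    "story-b": ["ex:stmt/42"],
}


def resolve(story_id):
    """Look up the current text behind every URI a story references."""
    return [directory[uri] for uri in stories[story_id]]


def update_directory(uri, new_text):
    """Change the directory entry; no story markup is touched."""
    directory[uri] = new_text


update_directory("ex:stmt/42", "Update: the claim was not confirmed.")
```

After the single call to `update_directory`, both `resolve("story-a")` and `resolve("story-b")` return the updated text, even though neither story's markup changed.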


Dan, some fascinating ideas here.

I have to echo some of the comments brought up already - It's probably very difficult for the average reporter to picture working in a semantic web world. As a journalism grad student, does this mean in the future I'll have to research, interview, write, report - and do programming too? (By the way, I'm writing a paper on your ideas for Steve Buttry's class.)

I wonder, too, how much money could actually be made here. You mention in a previous post that in the future we could have "practical ways of tracing the value of original contributions and collecting and distributing marginal payments across vast scales." The word that stands out for me is "marginal." I could see larger organizations like the New York Times and Wall Street Journal benefiting from the semantic web because they have that wealth of data. Where does that leave smaller publications, say, hyperlocal sites?

Still reading and understanding all this so forgive me if I'm bringing up some obvious questions.


Jolie, sorry I'm so delayed in responding. Long story.

Anyway, several responses:

No. 1, I think experience shows that good journalism is a full-time job. We need programmers who are also journalists, and programming that communicates reported information should be considered a form of journalism. But asking our reporters to do both is silly.

No. 2, good reporting is good reporting, whatever the means of communicating it. Our dirty little secret as an industry is that most of the stories in average metros are not only based on shoddy concepts (We need content, so just go out and interview some people and make it interesting, regardless of the value of what we're presenting), but the reporting really isn't very good. By that I mean: it doesn't answer valuable questions, it isn't advancing our knowledge, it isn't expanding our understanding. It's too easy to tart up some human interest angle and call it a story.

No. 3, if you think the money to be made in data is marginal, look at the money to be made in web advertising. THAT'S marginal.

No. 4, if you think that the NYT and the WSJ are the only types of publications that could make money off of data, you're not thinking about data clearly. Journalists tend to think of data as numbers about important investigative subjects that other people compile for us to download and study before reporting. That's a very artificial concept of data.

As I've tried to illustrate, the data that has value to someone is information that can be used to make money. Which is to say that the data we compile and report is valuable to individuals and groups when it either cuts the cost of something they currently do, or replaces a data collecting cost they used to have, or gives them a competitive advantage in their market or business, or increases the quality of a product, or makes particular research questions affordable, or just generally makes their decision-making better.

And the truth is, the information that service providers or business people desire isn't stories, and it isn't artificially balanced, and it's often not all that "newsworthy." It's focused and it's accurate, and it solves a problem for them.

So the issues in compiling data are tools (you need tools we don't currently have), market value (you figure out what information has value as a data product and what information doesn't, and you adjust your reporting and your staffing accordingly), and marketing & packaging (you have to put the data into useful formats and you have to sell it, not just wait for clients to buy it). These are novel ideas to most of us. And those are just the primary tasks and products...

Reporters aren't going to do all of those things. But you're going to be part of a team that's going to have more rigorous standards than today's journalism. Because your only real function in the eyes of your company now is providing audience attention bait. In the new model, you'll have to conduct journalism to a repeatable standard, or the information will just be junk.

Remember: information outside of a meaningful structure is just static.

No. 5, that idea about tracing the value of original contributions across networks is an idea about an emergent property of a semantic economy. I think it's fascinating, and I can see some of these things potentially driving those developments, but I wouldn't focus on that if I were you. Try to get the immediate stuff first.

No. 6, and perhaps most importantly, STOP THINKING OF THIS AS THE "SEMANTIC WEB." It's NOT the Semantic Web. It's based on semantic principles, and it's on the web, so I understand the confusion, but the distinction is significant. The idea of the Semantic Web as a "Web of Meaning" supposes an end state and some starting points, but leaves the middle stuff to some highly questionable assumptions and some absurd philosophical questions. This idea takes semantic technologies and applies them to some practical, for-profit issues.

So you might get a de facto "Web of Meaning" as an emergent property of such work by many parties, but it wouldn't be the Semantic Web that others have imagined.
