Abstract: This 2,600-word essay describes how a news-media industry equipped with semantic tools could develop a standards-based certification program for journalism. Certification would be handled by non-governmental bodies similar to the ISO, and would focus on information-processing and coding standards rather than on matters of ethics. Such a standard would enable interoperability between semantic directories, allowing for the creation of software that compares facts and checks the accuracy of statements. It is the second in a series of essays about semantic media. Since some readers had difficulty imagining the system, I've added an FAQ for these ideas and added descriptions to the links listed at the end.
There's nothing exciting or romantic about standards. Personally, I find them tedious.
There, I've said it.
But there's no happy and profitable future for journalism without professional standards, and for that future we should be willing to sweat a little tedium. If my conclusions are correct, then our emerging information-based global economy will require interchangeable, robust data in the same way that our current economy requires that every finely threaded quarter-inch screw must have 28 threads per inch, machined to precisely the same height and pitch and thread-axis angle, regardless of whether the screw is manufactured in the People's Republic of China or Alpharetta, Ga.
Manufacturers of products that require screws need to know not only that the fasteners they buy are machined to precise dimensions and specifications, but that the tolerance of the machining and the durability of the materials fall within known quality standards. This generic uniformity makes it possible for manufacturers around the world to compete primarily on cost, and it works because manufacturers voluntarily abide by standards that were published and certified by international bodies.
If we are to benefit from a global semantic economy, then the information that fuels its industries and drives its markets must be structured, coded, stored, published and quality-certified to international standards. This is a foreseeable future (one among many), because we now possess the tools and techniques required to build the first platform that would make semantic markup of narrative documents a profitable, persistent revenue stream across the publishing industry. So long as the value of these resources continues to expand, we can expect to see voluntary participation in these activities, with resulting increases in productivity and organization.
In other words, don't look for this to occur via the W3C's promulgation of abstract (though thoughtful) Web standards. Instead we should expect to see the industries that benefit from the semantic economy creating “voluntary” standards and certifications based on their mutual interest in increased profits. Those information producers who choose not to voluntarily upgrade their processes will suffer the same fate as manufacturers who choose to ignore ISO 9000 certification: death by capitalism.
Which raises the question: If we imagine a future business (not to mention a larger economy) that derives value from the communication of explicit meaning and robust data, how will we conduct journalism in that context?
The view from the past
The notion of applying professional standards to journalism isn't a new idea, just an impossible dream. Previous discussions of journalism standards have typically degraded into debates over journalism ethics – a worthy goal, no doubt, but certainly not an objective, measurable standard. What is fairness? When is it acceptable to invade a person's privacy? Does balance require the equal presentation of popular but factually discredited perspectives? What should be considered in measurements of accuracy? And so on.
The problem with using traditional press ethics as the starting point for a universal standard is that one cannot equate an explicit standard to an abstract goal. To require that a news report be “fair and balanced” without defining those terms in unambiguous detail is the equivalent of throwing out the maximum outside diameter and TPI standards for fasteners and publishing something that declares that in future, all screws “shall be well-made with the highest regard for quality.” Good luck rebuilding your old engine with that hardware, pal.
Previous attempts at press standards worked forward from ethical news judgment and backwards from desired outcomes to propose professional standards of conduct and process. None have succeeded in improving journalism.
The standards I propose ignore ethics, but not out of hostility toward the subject. My assumption is that applying extremely basic rules to the news and information we produce will improve journalism and public understanding. When we focus on the quality of the workmanship, we need not fret the intent of its creator. An ethical journalist with the right tools who uses professional standards will proudly link her work to facts that support her statements. Those who seek to mislead the public or profit from shoddy work would face two bad options: either avoid certification and be judged less credible, or adopt the professional standard and thereby reveal their true practices.
Future standards for future communications
Today, as in 2005, as in 1905, most news is organized and communicated as “stories.” To be a story, a collection of information must have a who, a when, a what, a where, and in most cases, a why and a how.
But that traditional approach to news leaves out the sixth W: “Who cares?” A story must be potentially interesting to a valuable audience, or it isn't worth producing.
This is the fatal flaw with a theory of the press that is based on stories. It assumes that the only information that has value is the information that seems immediately interesting.
But what if baseball coverage worked that way? If our only records of Major League Baseball games were stories about dramatic, game-defining events, we'd never read about a bloop single in the scoreless third, because who cares about facts irrelevant to the story? The “story” is the game-winning RBI in the bottom of the ninth.
We enjoy baseball's intricacies today because of an 1860s sports journalist named Henry Chadwick. Chadwick earned a spot in the Baseball Hall of Fame for inventing a structured code for recording the outcome of every play in a baseball game. It's because of Chadwick's “box score” that we value inconsequential third-inning singles, because that hit – like the ground-out before it and the strike-out that followed – is part of a comprehensive data set. If we recorded for posterity only those plays that individual sportswriters deemed immediately interesting, we would have no baseball statistics. No objective way of comparing players across eras. No Sabermetrics.
Today's journalists tend to view new alternatives to narrative as barbaric assaults on their art form, but the simple truth is that we love good baseball stories in large part because box scores give writers the freedom to reject stenography in favor of the most interesting narrative, as well as the ability to make wonderfully definitive statements. We need not cautiously declaim that “many close observers agree that Josh Hamilton is one of the best batters in the game.” We can instead write that “Hamilton, who hit .359 this season, has added a new milestone to his improbable comeback from drug addiction: 2010 American League batting champion.”
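Chadwick's insight translates directly into modern terms: once every play is recorded as structured data, definitive statements fall out of a simple query. Here is a minimal sketch of that idea – the play records and names are hypothetical, and this toy version ignores real scoring rules like walks and sacrifices:

```python
# Each play is a structured record, in the spirit of Chadwick's box score.
# A batting average then becomes a trivial query over the full data set,
# including the "boring" plays no story would ever mention.
plays = [
    {"batter": "Hamilton", "outcome": "single"},     # the bloop single in the third
    {"batter": "Hamilton", "outcome": "groundout"},
    {"batter": "Hamilton", "outcome": "home_run"},   # the "story" play
    {"batter": "Ortiz",    "outcome": "strikeout"},
]

HITS = {"single", "double", "triple", "home_run"}

def batting_average(plays, batter):
    """Hits divided by at-bats for one batter (simplified)."""
    at_bats = [p for p in plays if p["batter"] == batter]
    hits = sum(1 for p in at_bats if p["outcome"] in HITS)
    return hits / len(at_bats)

print(round(batting_average(plays, "Hamilton"), 3))  # → 0.667
```

The definitive ".359 this season" sentence is exactly this kind of query, run over a season's worth of records instead of four.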
If life were as linear and rules-driven as a baseball game, this would be easy. Since it isn't, finding the right approach is important. So let's begin with a founding concept: Don't Repeat Yourself, also known by its unfortunate acronym, DRY.
DRY facts
The DRY Principle comes from programming, and it's as elegant an approach to managing semantic assets as we're likely to find:
“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.”
The person who wrote this definition was referring to a software system, but when journalists cover a beat, they create an implicit system of knowledge, organized almost exclusively by documents. Our job is to make that implicit system explicit, and to organize it by each piece of data involved, regardless of whether the information is contained in a published text document, an unpublished spreadsheet, or a semi-public database.
Because the tool for creating this DRY representation of facts will be a Semantic Content Management System, our output will be valid XML/XHTML with embedded inline semantic markup. Its coded declaration will include a link to our directory of “single, unambiguous, authoritative representations.” More to the point, our SCMS will make such markup practical by offering a window beside our text composition screen that will continually display the connections between the natural language text we're composing and its likely links to the DRY representations in our existing directory. That directory will grow with every statement we make about every fact we acquire.
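The suggestion window described above can be sketched in a few lines. This is a deliberately naive toy – the directory entries and `ex:` identifiers are hypothetical, and a real SCMS would use linguistic analysis rather than bare substring matching – but it shows the basic loop: match the draft against the directory of DRY representations and surface likely links.

```python
# A toy version of the SCMS "suggestion window": as the journalist types,
# match the draft against the existing directory of DRY representations
# and surface likely links. All entries and IDs here are hypothetical.
directory = {
    "ex:hamilton-josh": {"label": "Josh Hamilton", "type": "Person"},
    "ex:2010-al-batting-title": {
        "label": "2010 American League batting champion",
        "type": "Event",
    },
}

def suggest_links(draft_text, directory):
    """Return directory IDs whose labels appear in the draft text."""
    draft = draft_text.lower()
    return [entry_id for entry_id, entry in directory.items()
            if entry["label"].lower() in draft]

draft = "Josh Hamilton capped his comeback as 2010 American League batting champion."
print(suggest_links(draft, directory))
# → ['ex:hamilton-josh', 'ex:2010-al-batting-title']
```

Each accepted suggestion becomes an explicit link in the story's markup, and each rejected one is a prompt to create a new directory entry.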
We'll edit that semantic structure (the invisible, machine-readable “meaning model” beneath our story) alongside the text document, connecting to existing items and creating new ones as necessary. And when we publish our story, we'll also “republish” our directory of meaning. Not a directory for one story, but the complete directory of every concept, entity or relationship we've ever published.
Because our system is founded on XML and the DRY Principle, a curious human or a relatively simple computer program will be able to trace all the statements within our story back to definitive conclusions. The improvement over traditional search algorithms will be immediately noticeable.
For all these things to work properly, we must be precise in our reporting, writing, editing and data coding. To meet the requirements of standards-based certification, a news organization will have to demonstrate not only the use of data-entry and ontological conventions, but also an acceptable level of quality-control. Finally, machine audits of the system will have to reach certain levels of accuracy and completeness.
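What a machine audit might measure can be sketched concretely. In this toy version (statement records and `ex:` IDs are hypothetical), completeness is the fraction of statements linked to the directory, and accuracy is the fraction of those links that actually resolve:

```python
# A toy "machine audit": every statement in a published story should link
# to a DRY representation in the directory. The audit scores completeness
# (fraction of statements linked) and accuracy (fraction of links that
# resolve). All IDs are hypothetical.
directory = {"ex:fact-001", "ex:fact-002"}

story_statements = [
    {"text": "Hamilton hit .359 this season.", "link": "ex:fact-001"},
    {"text": "He won the batting title.",      "link": "ex:fact-002"},
    {"text": "Fans were thrilled.",            "link": None},          # unlinked
    {"text": "He batted .400 in April.",       "link": "ex:fact-999"}, # dangling
]

def audit(statements, directory):
    linked = [s for s in statements if s["link"] is not None]
    resolved = [s for s in linked if s["link"] in directory]
    return {
        "completeness": len(linked) / len(statements),
        "accuracy": len(resolved) / len(linked) if linked else 1.0,
    }

print(audit(story_statements, directory))
# completeness: 0.75, accuracy: ~0.667 — below any plausible certification bar
```

Certification would amount to requiring that scores like these stay above published thresholds, audited by machine rather than by committee.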
At first blush, this sounds incredibly dull. No mention of ethics, no grandiose references to the First Amendment and the Founding Fathers. So yes, it's dull – but it also accomplishes several objectives that modern journalism can only dream of addressing:
- By standardizing and publicizing semantic markup and data entry, this system would make the data contained and described within its directories and archives generically useful to anyone who wishes to write scripts for it – or purchase some portion of a data-set for use in their own companies;
- By declaring and publishing our directory of DRY representations, this system would make it possible for others to index, participate in, or subscribe to that directory;
- By demonstrating the degree of accuracy and completeness of the archive and its associated semantic directory, the system will give an objective measure of the credibility of the information it presents.
Whereas most attempts at press reform attempt to fix news judgment, fairness or accuracy – three wildly subjective concepts – this approach bypasses them. If we can produce news in which the factual basis for every statement can be immediately derived, sourced and evaluated by all observers, using relatively simple technology, then why concern ourselves with meta concepts like “fairness”? Debates will continue, as they should. But if we give people user-friendly tools that provide authoritative access to facts, then over time we will isolate the less credible voices in society to rhetorical ghettos of their own construction.
Professional standards for a serial abuser
Take Glenn Beck for example. Without professional standards, Beck is able to play at journalistic presentation, then retreat behind the defense that he is only an "opinion-maker" or an entertainer each time his outrageous conduct gets him in trouble. His opinions are based on mixtures of unsourced paranoid fantasy and out-of-context facts, yet anyone who believes that Beck shuns research is utterly wrong. His show notes are a jumble of claims and charges supported by an unsorted link-dump to a mixture of mainstream sources and right-wing institutions. It is research offered to provide the illusion of credibility, not meaningful answers.
The show description from Beck's Jan. 6, 2011, show clocks in at 343 words, followed by 3,600 words in the cut-and-paste jumble called “Links to stories and information used in tonight's show.” The bottom of that section includes 24 footnote links, about a quarter of which are sourced to the conservative Heritage Foundation. Simply determining whether these links support Beck's claims would require a hefty amount of reading, and even then you wouldn't have definitive statements of fact, only general correlations between claims and source documents.
But no link supports the obvious whopper in Beck's summary:
Glenn also criticized the Congress for skipping parts of the original Constitution because they were offensive, such as the three-fifths compromise, which the Founders put in so that The South would not have too much power and slavery could eventually (sic) repealed.
Why no link? Because no such link exists. There is no serious historian on earth who believes the three-fifths compromise was an effort to repeal slavery. Yet this fiction served as the foundation to Beck's subsequent argument (which is rather difficult to describe, as he seemed to make one point about it on the 6th and a very different point about it on the 7th).
In an era of journalistic standards, an obvious snake-oil salesman like Beck could opt out of certification. He could still broadcast the same bullshit, but he could hardly make a convincing claim to be more trustworthy than certified organizations with standards for information structure and transparency. Freedom of speech covers everyone, but it's nice to have reliable ways of distinguishing (and continually fact-checking) reputable sources.
Of course, Beck could always adopt professional standards – but tagging his statements to DRY representations of fact would expose his counter-factual practices in short order. Nothing would keep Beck and Fox News from creating their own directories of fact, nor should it. Standards-based directory rules would allow privately curated directories to operate cooperatively with publicly curated directories. But abiding by those rules also means that your directory will be parsable by robots. If you're creating phony facts to support phony claims, it will be only a matter of time before a bot uncovers the scheme and holds your organization up for ridicule – and eventual decertification.
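The cross-directory comparison a bot could run is straightforward once both directories follow the same rules. A minimal sketch, treating each directory as a map from (subject, predicate) to an asserted object – all triples and `ex:` identifiers here are hypothetical illustrations:

```python
# A toy cross-directory check: if a publicly curated directory and a
# privately curated one share the same triple structure, a bot can flag
# every entry where they assert different objects for the same subject
# and predicate. All keys and values are hypothetical.
public_directory = {
    ("ex:three-fifths-compromise", "ex:purpose"): "apportionment of representation",
    ("ex:hamilton", "ex:2010-batting-average"): ".359",
}

private_directory = {
    ("ex:three-fifths-compromise", "ex:purpose"): "eventual repeal of slavery",
    ("ex:hamilton", "ex:2010-batting-average"): ".359",
}

def conflicts(dir_a, dir_b):
    """Return (subject, predicate) keys where the two directories disagree."""
    shared = dir_a.keys() & dir_b.keys()
    return sorted(key for key in shared if dir_a[key] != dir_b[key])

print(conflicts(public_directory, private_directory))
# → [('ex:three-fifths-compromise', 'ex:purpose')]
```

The point isn't that the bot decides who is right; it's that the disagreement becomes a public, machine-discoverable fact rather than something buried in a link-dump.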
Standard information practices won't make journalists honest or ethical. But over time, they will reveal the liars, the hacks and the terribly confused.
How we get there
Sadly, no one is likely to build an SCMS simply because they can't wait to begin creating standards-based journalism. Companies are going to spend millions of dollars to build these systems because they will add new, profitable revenue-streams to newsgathering and publishing. It will be up to our journalists to spot the opportunity to build something better.
Neither can we expect our journalistic foundations, think-tanks and university departments to take up this cause, at least not initially. These are quorum-sensing entities with their own agendas, and few of these people are likely to jump on the semantic bandwagon until the bandwagon overtakes them.
But it is not too early to begin talking about these ideas. The more I write about them, the more people I find who have entertained similar thoughts. The more we discuss them, the less exotic they appear.
And it's certainly not too early for journalists to begin educating themselves on semantic principles. If you're interested in this future, now's the time to learn some basics: ontologies, controlled vocabularies, XML, RDF, parsing engines. Figure out what Dublin Core means. Learn enough that you can distinguish between an information structure and an information model. Then wow your friends at the next newsroom kegger. If you really want to get on board, read the book.
It's also a good time for the semantic researchers out there to realize that there is a world beyond libraries and the ponderous groves of academe. Semantic developers need to take a long, detailed look at the idea of seeding the news and publishing industries with new tools – not because such tools are cool and civic-minded, but because they have enormous profit potential. Their potential transformative power is just a side-effect.
And all you geek geniuses out there, working in the search industry? We need you. We need your participation in this development, because once the tools I describe begin their spread, Google will have to catch up, pay up, or fall behind. And you know it better than anyone.
The first step toward standards-based journalism will be the development of a Semantic Content Management System, an XML middleware dashboard that goes beyond standard metadata to embed machine-readable meaning within natural language text documents. The pieces of that platform exist today, but putting them together into a reliable, useful product is a major undertaking. Deploying it will take time. Developing the markets for robust data and the infrastructure for standards-based journalism won't take place overnight.
It's an enormous task, with enormous rewards attached to the results. Our future awaits it.
NOTES
These are the 24 sub-pages I created for the home-grown links in this essay:
- Tools and techniques, a short list of functions I consider to be part of this concept.
- SCMS, a stand-alone description of my concept of a Semantic Content Management System.
- Semantic Markup, short description of what I mean by the term.
- Narrative, a short discussion of what I see as the strengths and weaknesses of a journalism model based solely on the communication of stories.
- Information Products from media workflows, stand-alone definition of this concept, with links to the posts where I first explored it.
- Robust data, a definition of the concept.
- An overview of possible information standards, a very basic list of 11 possible topics for standardization.
- The Sixth W, traditionally known as "news judgment," and why it belongs in these discussions.
- How a beat represents an implicit system of knowledge, an attempt to draw the distinction between the limits of implicit systems and explicit ones.
- Thoughts on DRY Principle journalism. This is one of my key concepts, and this page describes my history of thinking on the subject, along with how I think it could be useful.
- Declarations in XHTML, a short thing for people who aren't familiar with the concept of metadata declarations.
- What I mean by a "meaning model". This winds up being one of the longest pages I created for this essay, and it includes the most extensive examples I wrote about how the system might work.
- What I mean by a "directory of meaning". Not as detailed as the "meaning model" page, but not a short definition page either.
- Benefits of semantic directories over search, explores this topic a bit further, but not a long post.
- What I mean by "data coding" in this context, a short definition page.
- What I mean by "machine audits", a short definition page.
- Open-sourcing our semantic directories, a short discussion page.
- More on buying, selling and valuing data, a short discussion page.
- Professionals, amateurs and credibility, a long-form, blog-post-style page arguing that the lack of substantial progress in the six years since "Bloggers v. Journalists is Over" shows that the trouble with distinguishing professionals from amateurs is that there isn't enough difference between the two to draw a clear line.
- What I mean by standards-based directory rules, a short definition.
- What I mean by "your directory will be parsable by robots", a short discussion.
- RDF, triples, structured and semi-structured data, a page where I talk about my own continuing ambivalence toward a "pure RDF" solution, leaving open the possibility of designing the system to make use of structured and semi-structured data created outside of an RDF system.
- Building a better bucket, is a blog-style page describing why we have to make extra efforts at helping people understand ideas that are new to them. Very meta.
- An FAQ for my semantic journalism essay is a long-form FAQ exploring how these ideas would function, and most importantly, how they would be different from the models that people are familiar with today.
This technique of writing "in-depth" for the Web by creating additional supporting links that expand on the top-level document is part of what I described in my essay "A New Form of Writing." We still don't have an interface that makes it easy, or a browser plug-in that would reveal the benefits of such techniques. Such functions could easily be baked into WordPress or Drupal, and I expect to see features in Firefox and Chrome someday that will take full advantage of semantically structured Web pages, probably in ways I've never imagined.
The first essay in this series, "Imagining the Semantic Economy," was published in December 2010.
More of my essays and posts on media topics can be found on this directory, where I've included brief summaries to help you find subjects that interest you. Of related interest to readers of this topic are the posts "2020 Vision: What's Next For News?", "The Lack of Vision Thing," "Narrative is Dead! Long Live Narrative!" "A New Form of Writing," "Free Wants to be Big," "The Limits of Social," "The Imagination Gap" and "The Future Is Nearer Than You Think."
And if you have never heard of me before, here's my website.
14:50 note: Based on a comment that I just read on Twitter, at least one reader got the impression that I was being hostile toward journalists with my suggestions for things that journalists could read to become more conversant with semantic concepts. I didn't see that coming because it didn't sound that way in my head when I was writing it. Not trying to be snarky, trying to be helpful. dc
Thanks, Dan. Tons to think about here. I need to do some more reading to understand it all.
Two suggestions...
Seventh W: "what's the background here?" Meaning: who, what, where, when, why, how and who cares are all necessary to understand a news story, but frequently the story as a whole isn't understandable without the context and background that make it news.
More people might get what you are talking about if you used an existing journalistic product to illustrate. One I kept thinking about as I tried to "get" what this future system would look like is:
http://www.crunchbase.com/
am I way off here?
Posted by: Jay Rosen | Thursday, January 20, 2011 at 09:48
Yeah, unfortunately, there is no existing journalistic example, which makes these concepts harder to communicate.
Crunchbase, for instance, is a database in the sense that it includes links to semi-structured data organized in a logical manner. But when you read the profiles of each company or person, the statements are not linked to their sources.
To imagine what I'm describing, you have to imagine that when you read the New York Times online, every statement in every sentence is linked to an RDF expression in a semantic directory. And if breaking news is occurring faster than computer-assisted semantic markup can keep up with, then you can expect that an editor will be going back to complete that formal linking later.
So you'll read the news the same way, but the way we report and record information will create a machine-readable model of that information that will feed the creation of as many data sets as we can make profitably.
The reason there is no current journalistic equivalent is that there is no content management system in which you can both create and publish text while creating, managing and publishing the directory of semantic references that supports those statements. It's not practical without computer assistance to writers and editors.
Reading the links to my definitions of "meaning models" and "directories of meaning" might help you grok the concept. I accept that it's a hurdle.
Posted by: Dan | Thursday, January 20, 2011 at 10:11
Dan,
Like Jay I am commenting while still trying to digest this.
I love the box-score comparison. Structured data are critical to the enjoyment of baseball. The great moments of our baseball memory come from someone (Hank Aaron, Cal Ripken, Sammy Sosa and Mark McGwire) challenging a cherished record that exists only because of structured data. Not every baseball fan reads box scores, but everyone cherishes the records and honors the record holders. And a key part of the anger over performance-enhancing drugs is that they tarnish the records.
A challenge to such a semantic system might be that some incorrect facts (like the performance-enhancing drugs) can tarnish the data repository. For instance, if Beck and others of his ilk repeat a lie often enough, how do we keep these multiple links from providing it with credibility (a data version of the echo chamber that repeats the talking points of ideological voices today)? The system also will spread and lend credibility to innocent errors (such as NPR's report that Gabrielle Giffords had died, which was corrected, but many errors go unreported). A critical part of developing effective standards will be to include ways to challenge, verify and correct information.
Now, I'll reread and read some related links before commenting further in ignorance.
Posted by: Stevebuttry | Thursday, January 20, 2011 at 11:25
I think that there are two possible answers to the points you raise, Steve, and they're why I've focused on RDF.
First, people are going to get things wrong. No problem there. And the reason that you base standards on information practices instead of information outcomes is because if your info is cleanly coded and the referring links work, it's easier for people and bots to check your facts.
Like I said about Beck, he can repeat a big lie over and over, just like he does now. But if he tried to do so while qualifying for certification, it would be relatively easy for us to check his claims in real-time, or show that the claim is just fantasy. And while Beck and Fox could certainly create their own directory of meaning, if that directory is standards compliant, it will be possible to compare what Fox considers to be fact-based citation to what the rest of us consider to be fact-based citation.
Second, information is going to need to be updated, whether because initial reports are wrong, or because we learn additional facts. If you're using the RDF grammar of triples (subject/predicate/object) as your system for describing your factual statements, then you create additional triples to make the correction. That way your record would show that an error existed, but it wouldn't propagate.
One benefit in doing this in RDF/XML is that it gives you the ability to run scripts against your records. So if a record includes the predicate "has date of death" Jan. 8, 2011 AND a predicate that says "has rehab start date" of Jan. 19, 2011, it's possible that we'll spot these contradictions with machine intelligence.
Finally, because the meaning model for each story is based on URI links to a directory of statements and knowledge, if there's an update to that URI (we DIDN'T find WMDs in Iraq after all, Darth Vader turns out to be Luke's FATHER, etc.), we only have to update the directory, not every instance of semantic markup that references the URI.
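This correction-by-addition pattern can be sketched with plain (subject, predicate, object) tuples – no full RDF stack required. Everything here is hypothetical illustration: the predicate names, the retraction convention, and the Giffords dates from the example above.

```python
from datetime import date

# Facts as RDF-style triples. An erroneous report is never deleted; a later
# triple retracts it, so the record shows the error without propagating it.
triples = [
    ("ex:giffords", "has_date_of_death", date(2011, 1, 8)),            # erroneous report
    ("ex:giffords", "retracts", ("has_date_of_death", date(2011, 1, 8))),
    ("ex:giffords", "has_rehab_start_date", date(2011, 1, 19)),
]

def contradictions(triples):
    """Flag subjects whose rehab start falls after a recorded death date."""
    facts = {(s, p): o for s, p, o in triples if p != "retracts"}
    return [s for (s, p), o in facts.items()
            if p == "has_rehab_start_date"
            and facts.get((s, "has_date_of_death"))
            and facts[(s, "has_date_of_death")] < o]

def current_facts(triples):
    """The live record: all facts minus anything explicitly retracted."""
    retracted = {(s,) + o for s, p, o in triples if p == "retracts"}
    return [(s, p, o) for s, p, o in triples
            if p != "retracts" and (s, p, o) not in retracted]

print(contradictions(triples))   # the bot spots the clash: ['ex:giffords']
print(current_facts(triples))    # only the rehab-start triple survives
```

A bot runs `contradictions` to spot the impossible pairing; an editor then adds the retraction triple, and `current_facts` shows the cleaned record while the error remains visible in the history.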
Posted by: Dan | Thursday, January 20, 2011 at 14:19
Dan, some fascinating ideas here.
I have to echo some of the comments brought up already - It's probably very difficult for the average reporter to picture working in a semantic web world. As a journalism grad student, does this mean in the future I'll have to research, interview, write, report - and do programming too? (By the way, I'm writing a paper on your ideas for Steve Buttry's class.)
I wonder, too, how much money could actually be made here. You mention in a previous post that in the future we could have "practical ways of tracing the value of original contributions and collecting and distributing marginal payments across vast scales." The word that stands out for me is "marginal." I could see larger organizations like the New York Times and Wall Street Journal benefiting from the semantic web because they have that wealth of data. Where does that leave smaller publications – say, hyperlocal sites?
Still reading and understanding all this so forgive me if I'm bringing up some obvious questions.
Posted by: Jolie | Friday, March 18, 2011 at 15:09
Jolie, sorry I'm so delayed in responding. Long story.
Anyway, several responses:
No. 1, I think experience shows that good journalism is a full-time job. We need programmers who are also journalists, and programming that communicates reported information should be considered a form of journalism. But asking our reporters to do both is silly.
No. 2, good reporting is good reporting, whatever the means of communicating it. Our dirty little secret as an industry is that most of the stories in average metros are not only based on shoddy concepts (We need content, so just go out and interview some people and make it interesting, regardless of the value of what we're presenting), but the reporting really isn't very good. By that I mean: it doesn't answer valuable questions, it isn't advancing our knowledge, it isn't expanding our understanding. It's too easy to tart up some human interest angle and call it a story.
No. 3, if you think the money to be made in data is marginal, look at the money to be made in web advertising. THAT'S marginal.
No. 4. If you think that the NYT and the WSJ are the only types of publications that could make money off of data, you're not thinking about data clearly. Journalists tend to think of data as numbers about important investigative subjects that other people compile for us to download and study before reporting. Very artificial concept of data.
As I've tried to illustrate, the data that has value to someone is information that can be used to make money. Which is to say that the data we compile and report is valuable to individuals and groups when it either cuts the cost of something they currently do, or replaces a data collecting cost they used to have, or gives them a competitive advantage in their market or business, or increases the quality of a product, or makes particular research questions affordable, or just generally makes their decision-making better.
And the truth is, the information that service providers or business people desire isn't stories, and it isn't artificially balanced, and it's often not all that "newsworthy." It's focused and it's accurate, and it solves a problem for them.
So the issues in compiling data are tools (you need tools we don't currently have), market value (you figure out what information has value as a data product and what information doesn't, and you adjust your reporting and your staffing accordingly), marketing & packaging (you have to put the data into useful formats and you have to sell it, not just wait for clients to buy it). These are novel ideas to most of us. And those are just the primary tasks and products...
Reporters aren't going to do all of those things. But you're going to be part of a team that's going to have more rigorous standards than today's journalism. Because your only real function in the eyes of your company now is providing audience attention bait. In the new model, you'll have to conduct journalism to a repeatable standard, or the information will just be junk.
Remember: information outside of a meaningful structure is just static.
No. 5, that idea about tracing the value of original contributions across networks is an idea about an emergent property of a semantic economy. I think it's fascinating, and I can see some of these things potentially driving those developments, but I wouldn't focus on that if I were you. Try to get the immediate stuff first.
No. 6, and perhaps most importantly, STOP THINKING OF THIS AS THE "SEMANTIC WEB." It's NOT the Semantic Web. It's based on semantic principles, and it's on the web, so I understand the confusion, but the distinction is significant. The idea of the Semantic Web as a "Web of Meaning" supposes an end state and some starting points, but leaves the middle stuff to some highly questionable assumptions and some absurd philosophical questions. This idea takes semantic technologies and applies them to some practical, for-profit issues.
So you might get a de facto "Web of Meaning" as an emergent property of such work by many parties, but it wouldn't be the Semantic Web that others have imagined.
Posted by: Dan | Wednesday, March 23, 2011 at 08:21