Abstract: This 2,600-word essay describes how a news-media industry equipped with semantic tools could develop a standards-based certification program for journalism. Certification would be by non-governmental bodies similar to the ISO, and would focus on information processing and coding standards rather than on matters of ethics. Such a standard would enable interoperability between semantic directories, which in turn would allow for the creation of software that compares facts and checks the accuracy of statements. This is the second in a series of essays about semantic media. Since some readers had difficulty imagining the system, I've added an FAQ for these ideas, and added descriptions to the links listed at the end.
There's nothing exciting or romantic about standards. Personally, I find them tedious.
There, I've said it.
But there's no happy and profitable future for journalism without professional standards, and for that future we should be willing to sweat a little tedium. If my conclusions are correct, then our emerging information-based global economy will require interchangeable, robust data in the same way that our current economy requires that every finely threaded quarter-inch screw must have 28 threads per inch, machined to precisely the same height and pitch and thread-axis angle, regardless of whether the screw is manufactured in the People's Republic of China or Alpharetta, Ga.
Manufacturers of products that require screws need to know not only that the fasteners they buy are machined to precise dimensions and specifications, but that the tolerance of the machining and the durability of the materials fall within known quality standards. This generic uniformity makes it possible for manufacturers around the world to compete primarily on cost, and it works because manufacturers voluntarily abide by standards that were published and certified by international bodies.
If we are to benefit from a global semantic economy, then the information that fuels its industries and drives its markets must be structured, coded, stored, published and quality-certified to international standards. This is a foreseeable future (one among many), because we now possess the tools and techniques required to build the first platform that would make semantic markup of narrative documents a profitable, persistent revenue stream across the publishing industry. So long as the value of these resources continues to expand, we can expect to see voluntary participation in these activities, with resulting increases in productivity and organization.
In other words, don't look for this to occur via the W3C's promulgation of abstract (though thoughtful) Web standards. Instead we should expect to see the industries that benefit from the semantic economy creating “voluntary” standards and certifications based on their mutual interest in increased profits. Those information producers who choose not to voluntarily upgrade their processes will suffer the same fate as manufacturers who choose to ignore ISO 9000 certification: death by capitalism.
Which raises the question: If we imagine a future business (not to mention a larger economy) that derives value from the communication of explicit meaning and robust data, how will we conduct journalism in that context?
The view from the past
The notion of applying professional standards to journalism isn't a new idea, just an impossible dream. Previous discussions of journalism standards typically degraded into debates over journalism ethics – a worthy subject, no doubt, but certainly not an objective, measurable standard. What is fairness? When is it acceptable to invade a person's privacy? Does balance require the equal presentation of popular but factually discredited perspectives? What should be considered in measurements of accuracy? And so on.
The problem with using traditional press ethics as the starting point for a universal standard is that one cannot equate an explicit standard to an abstract goal. To require that a news report be “fair and balanced” without defining those terms in unambiguous detail is the equivalent of throwing out the maximum outside diameter and TPI standards for fasteners and publishing something that declares that in future, all screws “shall be well-made with the highest regard for quality.” Good luck rebuilding your old engine with that hardware, pal.
Previous attempts at press standards worked forward from ethical news judgment and backwards from desired outcomes to propose professional standards of conduct and process. None have succeeded in improving journalism.
The standards I propose ignore ethics, but not out of hostility toward the subject. My assumption is that applying extremely basic rules to the news and information we produce will improve journalism and public understanding. When we focus on the quality of the workmanship, we need not fret over the intent of its creator. An ethical journalist with the right tools who uses professional standards will proudly link her work to facts that support her statements. Those who seek to mislead the public or profit from shoddy work would face two bad options: either avoid certification and be judged less credible, or adopt the professional standard and thereby reveal their true nature.
Future standards for future communications
Today, as in 2005, as in 1905, most news is organized and communicated as “stories.” To be a story, a collection of information must have a who, a when, a what, a where, and in most cases, a why and a how.
But that traditional approach to news leaves out the sixth W: “Who cares?” A story must be potentially interesting to a valuable audience, or it isn't worth producing.
This is the fatal flaw with a theory of the press that is based on stories. It assumes that the only information that has value is the information that seems immediately interesting.
But what if baseball coverage worked that way? If our only records of Major League Baseball games were stories about dramatic, game-defining events, we'd never read about a bloop single in the scoreless third, because who cares about facts irrelevant to the story? The “story” is the game-winning RBI in the bottom of the ninth.
We enjoy baseball's intricacies today because of an 1860s sports journalist named Henry Chadwick. Chadwick earned a spot in the Baseball Hall of Fame for inventing a structured code for recording the outcome of every play in a baseball game. It's because of Chadwick's “box score” that we value inconsequential third-inning singles, because that hit – like the ground-out before it and the strike-out that followed – is part of a comprehensive data set. If we recorded for posterity only those plays that individual sportswriters deemed immediately interesting, we would have no baseball statistics. No objective way of comparing players across eras. No Sabermetrics.
Today's journalists tend to view new alternatives to narrative as barbaric assaults on their art form, but the simple truth is that we love good baseball stories in large part because box scores give writers the freedom to reject stenography in favor of the most interesting narrative, as well as the ability to make wonderfully definitive statements. We need not cautiously declaim that “many close observers agree that Josh Hamilton is one of the best batters in the game.” We can instead write that “Hamilton, who hit .359 this season, has added a new milestone to his improbable comeback from drug addiction: 2010 American League batting champion.”
If life were as linear and rules-driven as a baseball game, this would be easy. Since it isn't, finding the right approach is important. So let's begin with a founding concept: Don't Repeat Yourself, also known by its unfortunate acronym, DRY.
The DRY Principle comes from programming, and it's as elegant an approach to managing semantic assets as we're likely to find:
“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.”
The person who wrote this definition was referring to a software system, but when journalists cover a beat, they create an implicit system of knowledge, organized almost exclusively by documents. Our job is to make that implicit system explicit, and to organize it by each piece of data involved, regardless of whether the information is contained in a published text document, an unpublished spreadsheet, or a semi-public database.
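To make the idea concrete, here is a minimal sketch of what a DRY fact directory might look like in code. Everything here (the field names, the fact IDs, the flat-dictionary format) is invented for illustration; the point is only that each piece of knowledge gets exactly one authoritative entry, which stories then reference by ID instead of restating.

```python
# A toy DRY fact directory. Every piece of knowledge gets exactly one
# authoritative entry under a stable ID; stories reference the ID rather
# than restating the fact. All IDs and field names here are hypothetical.

directory = {}

def assert_fact(fact_id, statement, source):
    """Record a fact once; refuse duplicates so the entry stays authoritative."""
    if fact_id in directory:
        raise ValueError(fact_id + " already defined; update the entry, don't repeat it")
    directory[fact_id] = {"statement": statement, "source": source}

assert_fact(
    "hamilton-2010-batting-avg",
    "Josh Hamilton batted .359 in the 2010 MLB regular season.",
    "MLB official statistics",
)

# A second, conflicting version of the same fact is rejected outright,
# which is what keeps the directory DRY.
try:
    assert_fact("hamilton-2010-batting-avg", "Hamilton batted .360.", "blog post")
except ValueError as err:
    print(err)
```

The design choice that matters is the refusal to store a second copy: updates go through the one existing entry, so every story linked to that ID inherits the correction automatically.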
Because the tool for creating this DRY representation of facts will be a Semantic Content Management System, our output will be valid XML/XHTML with embedded inline semantic markup. Its coded declaration will include a link to our directory of “single, unambiguous, authoritative representations.” More to the point, our SCMS will make such markup practical by offering a window beside our text composition screen that will continually display the connections between the natural language text we're composing and its likely links to the DRY representations in our existing directory. That directory will grow with every statement we make about every fact we acquire.
We'll edit that semantic structure (the invisible, machine-readable “meaning model” beneath our story) alongside the text document, connecting to existing items and creating new ones as necessary. And when we publish our story, we'll also “republish” our directory of meaning. Not a directory for one story, but the complete directory of every concept, entity or relationship we've ever published.
Because our system is founded on XML and the DRY Principle, a curious human or a relatively simple computer program will be able to trace all the statements within our story back to definitive conclusions. The improvement over traditional search algorithms will be immediately noticeable.
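As a sketch of that “relatively simple computer program,” the snippet below walks a marked-up story fragment and checks each statement against a DRY directory, flagging anything it can't trace. The `data-fact` attribute convention and all of the fact IDs are my own invention for this example, not an existing standard.

```python
# Trace each marked-up statement in a story back to its DRY directory entry.
# The data-fact attribute and the fact IDs are hypothetical conventions.
import xml.etree.ElementTree as ET

directory = {
    "hamilton-2010-batting-avg": "Josh Hamilton batted .359 in the 2010 season.",
}

story = """<p><span data-fact="hamilton-2010-batting-avg">Hamilton, who hit
.359 this season,</span> is the <span data-fact="al-2010-batting-champ">2010
American League batting champion</span>.</p>"""

results = []
for span in ET.fromstring(story).iter("span"):
    fact_id = span.get("data-fact")
    if fact_id in directory:
        results.append((fact_id, "OK"))         # statement traces to an entry
    else:
        results.append((fact_id, "UNSOURCED"))  # flag for human review

for fact_id, status in results:
    print(status, fact_id)
```

Because the story is valid XML, the audit needs nothing more exotic than the standard library's parser; that low barrier to entry is what would let curious humans and simple bots check a publication's work.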
For all these things to work properly, we must be precise in our reporting, writing, editing and data coding. To meet the requirements of standards-based certification, a news organization will have to demonstrate not only the use of data-entry and ontological conventions, but also an acceptable level of quality-control. Finally, machine audits of the system will have to reach certain levels of accuracy and completeness.
At first blush, this sounds incredibly dull. No mention of ethics, no grandiose references to the First Amendment and the Founding Fathers. So yes, it's dull – but it also accomplishes several objectives that modern journalism can only dream of addressing:
By standardizing and publicizing semantic markup and data entry, this system would make the data contained and described within its directories and archives generically useful to anyone who wishes to write scripts for it – or purchase some portion of a data-set for use in their own companies;
By declaring and publishing our directory of DRY representations, this system would make it possible for others to index, participate in, or subscribe to that directory;
By demonstrating the degree of accuracy and completeness of the archive and its associated semantic directory, the system will give an objective measure of the credibility of the information it presents.
Whereas most attempts at press reform try to fix news judgment, fairness or accuracy – three wildly subjective concepts – this approach bypasses them. If we can produce news in which the factual basis for every statement can be immediately derived, sourced and evaluated by all observers, using relatively simple technology, then why concern ourselves with meta concepts like “fairness?” Debates will continue, as they should. But if we give people user-friendly tools that provide authoritative access to facts, then over time we will isolate the less credible voices in society to rhetorical ghettos of their own construction.
Professional standards for a serial abuser
Take Glenn Beck for example. Without professional standards, Beck is able to play at journalistic presentation, then retreat behind the defense that he is only an "opinion-maker" or an entertainer each time his outrageous conduct gets him in trouble. His opinions are based on mixtures of unsourced paranoid fantasy and out-of-context facts, yet anyone who believes that Beck shuns research is utterly wrong. His show notes are a jumble of claims and charges supported by an unsorted link-dump to a mixture of mainstream sources and right-wing institutions. It is research offered to provide the illusion of credibility, not meaningful answers.
The show description from Beck's Jan. 6, 2011, show clocks in at 343 words, followed by 3,600 words in the cut-and-paste jumble called “Links to stories and information used in tonight's show.” The bottom of that section includes 24 footnote links, about a quarter of which are sourced to the conservative Heritage Foundation. Simply determining whether these links support Beck's claims would require a hefty amount of reading, and even then you wouldn't have definitive statements of fact, only general correlations between claims and source documents.
But no link supports the obvious whopper in Beck's summary:
Glenn also criticized the Congress for skipping parts of the original Constitution because they were offensive, such as the three-fifths compromise, which the Founders put in so that The South would not have too much power and slavery could eventually (sic) repealed.
Why no link? Because no such link exists. There is no serious historian on earth who believes the three-fifths compromise was an effort to repeal slavery. Yet this fiction served as the foundation to Beck's subsequent argument (which is rather difficult to describe, as he seemed to make one point about it on the 6th and a very different point about it on the 7th).
In an era of journalistic standards, an obvious snake-oil salesman like Beck could opt out of certification. He could still broadcast the same bullshit, but he could hardly make a convincing claim to be more trustworthy than certified organizations with standards for information structure and transparency. Freedom of speech covers everyone, but it's nice to have reliable ways of distinguishing (and continually fact-checking) reputable sources.
Of course, Beck could always adopt professional standards – but tagging his statements to DRY representations of fact would expose his counter-factual practices in short order. Nothing would keep Beck and Fox News from creating their own directories of fact, nor should it. Standards-based directory rules would allow privately curated directories to operate cooperatively with publicly curated directories. But abiding by those rules also means that your directory will be parsable by robots. If you're creating phony facts to support phony claims, it will be only a matter of time before a bot uncovers the scheme and holds your organization up for ridicule – and eventual decertification.
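A minimal sketch of such a bot, assuming both directories follow a shared convention of stable fact IDs (the IDs, values and flat format below are all invented for illustration): it compares two directories and reports every fact they assert differently.

```python
# A toy cross-directory consistency bot. Given two directories that share
# a fact-ID convention, flag every ID whose asserted values disagree.
# All IDs, values and the flat-dict format are hypothetical.

public_directory = {
    "three-fifths-compromise-purpose": "apportioning representation and direct taxes",
    "hamilton-2010-batting-avg": ".359",
}

partisan_directory = {
    "three-fifths-compromise-purpose": "a plan to eventually repeal slavery",
    "hamilton-2010-batting-avg": ".359",
}

def find_contradictions(a, b):
    """Return the sorted fact IDs present in both directories whose values differ."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])

print(find_contradictions(public_directory, partisan_directory))
# prints ['three-fifths-compromise-purpose']
```

The bot doesn't decide who is right; it only surfaces the disagreement so that humans (or a chain of sourcing links) can settle it, which is all the decertification process would need.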
Standard information practices won't make journalists honest or ethical. But over time, they will reveal the liars, the hacks and the terribly confused.
How we get there
Sadly, no one is likely to build an SCMS simply because they can't wait to begin creating standards-based journalism. Companies are going to spend millions of dollars to build these systems because they will add new, profitable revenue-streams to newsgathering and publishing. It will be up to our journalists to spot the opportunity to build something better.
Neither can we expect our journalistic foundations, think-tanks and university departments to take up this cause, at least not initially. These are quorum-sensing entities with their own agendas, and few of them are likely to jump on the semantic bandwagon until the bandwagon overtakes them.
But it is not too early to begin talking about these ideas. The more I write about them, the more people I find who have entertained similar thoughts. The more we discuss them, the less exotic they appear.
And it's certainly not too early for journalists to begin educating themselves on semantic principles. If you're interested in this future, now's the time to learn some basics: ontologies, controlled vocabularies, XML, RDF, parsing engines. Figure out what Dublin Core means. Learn enough that you can distinguish between an information structure and an information model. Then wow your friends at the next newsroom kegger. If you really want to get on board, read the book.
It's also a good time for the semantic researchers out there to realize that there is a world beyond libraries and the ponderous groves of academe. Semantic developers need to take a long, detailed look at the idea of seeding the news and publishing industries with new tools – not because such tools are cool and civic-minded, but because they have enormous profit potential. Their potential transformative power is just a side-effect.
And all you geek geniuses out there, working in the search industry? We need you. We need your participation in this development, because once the tools I describe begin their spread, Google will have to catch up, pay up, or fall behind. And you know it better than anyone.
The first step toward standards-based journalism will be the development of a Semantic Content Management System, an XML middleware dashboard that goes beyond standard metadata to embed machine-readable meaning within natural language text documents. The pieces of that platform exist today, but putting them together into a reliable, useful product is a major undertaking. Deploying it will take time. Developing the markets for robust data and the infrastructure for standards-based journalism won't take place overnight.
It's an enormous task, with enormous rewards attached to the results. Our future awaits it.
These are the 24 sub-pages I created for the home-grown links in this essay:
- Tools and techniques, a short list of functions I consider to be part of this concept.
- SCMS, a stand-alone description of my concept of a Semantic Content Management System.
- Semantic Markup, short description of what I mean by the term.
- Narrative, a short discussion of what I see as the strengths and weaknesses of a journalism model based solely on the communication of stories.
- Information Products from media workflows, stand-alone definition of this concept, with links to the posts where I first explored it.
- Robust data, a definition of the concept.
- An overview of possible information standards, a very basic list of 11 possible topics for standardization.
- The Sixth W, traditionally known as "news judgment," and why it belongs in these discussions.
- How a beat represents an implicit system of knowledge, trying to make the distinction between the limits of implicit systems and explicit systems.
- Thoughts on DRY Principle journalism. This is one of my key concepts, and this page describes my history of thinking on the subject, along with how I think it could be useful.
- Declarations in XHTML, a short thing for people who aren't familiar with the concept of metadata declarations.
- What I mean by a "meaning model". This winds up being one of the longest pages I created for this essay, and it includes the most detailed examples I wrote about how the system might work.
- What I mean by a "directory of meaning". Not as detailed as the "meaning model" page, but not a short definition page either.
- Benefits of semantic directories over search, explores this topic a bit further, but not a long post.
- What I mean by "data coding" in this context, a short definition page.
- What I mean by "machine audits", a short definition page.
- Open-sourcing our semantic directories, a short discussion page.
- More on buying, selling and valuing data, a short discussion page.
- Professionals, amateurs and credibility, a long-form, blog-post-style page. It argues that the lack of substantial progress since "Bloggers v. Journalists is Over" six years ago demonstrates that the difficulty of distinguishing between professionals and amateurs stems from the fact that there's not enough difference between the two to draw a clear line.
- What I mean by standards-based directory rules, a short definition.
- What I mean by "your directory will be parsable by robots", a short discussion.
- RDF, triples, structured and semi-structured data, a page where I talk about my own continuing ambivalence toward a "pure RDF" solution, and leave open the possibility of designing the system to make use of structured and semi-structured data created outside of an RDF system.
- Building a better bucket, a blog-style page describing why we have to make extra efforts at helping people understand ideas that are new to them. Very meta.
- An FAQ for my semantic journalism essay is a long-form FAQ exploring how these ideas would function, and most importantly, how they would be different from the models that people are familiar with today.
This technique of writing "in-depth" for the Web by creating additional supporting links that expand on the top-level document is part of what I described in my essay "A New Form of Writing." We still don't have an interface that makes it easy, or a browser plug-in that would reveal the benefits of such techniques. Such functions could easily be baked into WordPress or Drupal, and I expect to see features in Firefox and Chrome someday that will take full advantage of semantically structured Web pages, probably in ways I've never imagined.
The first essay in this series, "Imagining the Semantic Economy," was published in December 2010.
More of my essays and posts on media topics can be found on this directory, where I've included brief summaries to help you find subjects that interest you. Of related interest to readers of this topic are the posts "2020 Vision: What's Next For News?", "The Lack of Vision Thing," "Narrative is Dead! Long Live Narrative!" "A New Form of Writing," "Free Wants to be Big," "The Limits of Social," "The Imagination Gap" and "The Future Is Nearer Than You Think."
And if you have never heard of me before, here's my website.
14:50 note: Based on a comment that I just read on Twitter, at least one reader got the impression that I was being hostile toward journalists with my suggestions for things that journalists could read to become more conversant with semantic concepts. I didn't see that coming because it didn't sound that way in my head when I was writing it. Not trying to be snarky, trying to be helpful. dc