Not so very long ago, I believed that the "goldilocks zone" for journalistic information was semistructured data. That is: unstructured data (natural language text) is too soft, structured data (formatted cells) is too hard, and semistructured data (a descriptor containing some amount of text) is just right.
If what we're trying to do is "reverse publishing" (Web first, propagating into multiple print and republishing formats), then this is still pretty close to true.
For instance, I could create a document in my Content Management System with a structured bit of information called "NAME." And I could enter the value NAME="Barack Hussein Obama." And then every time I ran a script against that value, I could retrieve some associated documents and records, like DOB or BIO or SUMMARY.
Some of these bits lend themselves to structure. A person's date of birth should always be referenced in the same date format, no matter how it's expressed in natural language text. But how would you create a "structured" biography for all people?
One solution to these issues, if all you're looking at is publishing, is to construct a database of knowledge assets in atomized chunks. Everything that can be defined in structured terms would be defined in structured terms (for example, date of birth is expressed in 13 characters: HHMM + "Z" for Zulu, indicating that the time is given in Coordinated Universal Time, + DDMMYYYY). Everything that couldn't be communicated that way would be described in semi-structured containers (for example, BIO is a 500-character field containing a brief biography of a person).
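As a rough illustration of the difference between the two kinds of field, here is a minimal Python sketch. The function names `parse_stamp` and `check_bio` are my own inventions for this example, and the time portion of the sample stamp is a placeholder, not a known time of birth.

```python
from datetime import datetime, timezone

# HHMM + "Z" + DDMMYYYY, 13 characters in total, as described above.
STAMP_FORMAT = "%H%MZ%d%m%Y"
BIO_LIMIT = 500  # the semi-structured BIO field is capped at 500 characters


def parse_stamp(stamp: str) -> datetime:
    """Turn a 13-character stamp into a timezone-aware UTC datetime."""
    if len(stamp) != 13:
        raise ValueError(f"expected 13 characters, got {len(stamp)}")
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def check_bio(text: str) -> str:
    """A BIO can only be checked for length; its content is still free text."""
    if len(text) > BIO_LIMIT:
        raise ValueError(f"BIO is {len(text)} characters, limit is {BIO_LIMIT}")
    return text


# "0000" is a placeholder time; the date part is 4 August 1961.
dob = parse_stamp("0000Z04081961")
print(dob.date().isoformat())
```

The structured field can be validated and compared by software. The semi-structured one can only be measured, which is the limitation the rest of this post keeps running into.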
The benefit of such a system is that if you construct stories or news items or general descriptions of resources this way, you have the ability to write scripts that say "Go get the field NAME and run it with the object PHOTO over the field BIO over the field EMAIL." If you construct your information this way, you can do all sorts of recombinant automated work, but if you work entirely in "stories" (unstructured data), none of that power is available to you.
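The kind of script described above might look something like the following Python sketch. The record is a plain dictionary standing in for a CMS entry; the photo path, biography text and email address are made-up placeholders, and `render_card` is a hypothetical helper, not part of any real CMS.

```python
# A stand-in for one CMS record. Every value except NAME is a placeholder.
record = {
    "NAME": "Barack Hussein Obama",
    "PHOTO": "photos/placeholder.jpg",
    "BIO": "Placeholder biography text, 500 characters or fewer.",
    "EMAIL": "someone@example.com",
}


def render_card(rec, fields=("NAME", "PHOTO", "BIO", "EMAIL")):
    """Stack the requested fields in order, skipping any the record lacks."""
    return "\n".join(f"{field}: {rec[field]}" for field in fields if field in rec)


# The same record can be recombined for different outputs.
print(render_card(record))
print(render_card(record, fields=("NAME", "EMAIL")))
```

Because each field is addressable on its own, changing the layout means changing the `fields` argument, not rewriting the story.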
Over time, the drawbacks to this approach became apparent. It requires extra work to do this, extra work to store it, extra work to make these stored bits of knowledge useable. And because these semi-structured snippets are still natural language text, they still require editing and updating. They cannot be minded, tended and farmed solely by automated means.
RDF gives us the ability to improve that system. For instance, if I manage my knowledge of Barack Obama not by writing narrative snippets but by writing RDF triples, that information is much easier to update, fact-check and deploy.
For instance, Obama's press secretary, Robert Gibbs, recently announced he would be leaving the administration in February. If I managed my knowledge of the administration via narrative snippet, I would have to find my references to Gibbs and update them as natural language text. Wikipedia does this successfully through the use of volunteers, but media companies would have to update all sorts of records with limited staff.
On the other hand, if I managed my information base by RDF, then the triple series...
Robert Gibbs | has job | White House press secretary
(Robert Gibbs | has job | White House press secretary) | has start date | xxxxZ20012009
would be completed when he steps down with the triple...
(Robert Gibbs | has job | White House press secretary) | has end date | whatever the February end date winds up being
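To make the "add a fact, don't edit one" idea concrete, here is a minimal sketch in plain Python, with no RDF library involved. The `Triple` type, `close_out` and `is_current` are names invented for this example, the start stamp is 20 January 2009 written in DDMMYYYY order, and the end stamp is left as a placeholder because the date is not yet known.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    # The subject may itself be a triple, which lets us say things
    # about a statement (when it started, when it ended).
    subject: Union[str, "Triple"]
    predicate: str
    obj: str


job = Triple("Robert Gibbs", "has job", "White House press secretary")
store = [job, Triple(job, "has start date", "xxxxZ20012009")]


def close_out(triples, statement, end_stamp):
    """Record that a statement stopped being true by appending a new fact."""
    triples.append(Triple(statement, "has end date", end_stamp))


def is_current(triples, statement):
    """A statement is current if it is in the store and has no end date."""
    ended = any(
        t.subject == statement and t.predicate == "has end date" for t in triples
    )
    return statement in triples and not ended


print(is_current(store, job))  # True
close_out(store, job, "END_DATE_NOT_YET_KNOWN")  # placeholder value
print(is_current(store, job))  # False
```

Nothing in the store was rewritten when Gibbs's tenure ended. The two original triples are untouched, and a third was added alongside them.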
Because this system adds new facts and is constructed with that approach in mind, we can extend our knowledge model without having to edit it obsessively. Not only is this approach easier to maintain, but we can script all sorts of things out of the results, and check our accuracy with software rather than just human quality control.
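One example of checking accuracy with software: a script can scan the date triples and flag any statement that ends before it starts, or that has an end date with no start date. The sketch below is only an assumption about how such a check might be written. `find_problems` is a hypothetical name, the triples are plain Python tuples, only the DDMMYYYY part of each stamp is read, and the end date in the sample data is deliberately wrong so the check has something to catch.

```python
from datetime import datetime


def stamp_date(stamp):
    """Read the trailing DDMMYYYY part of a stamp, ignoring the time."""
    return datetime.strptime(stamp[-8:], "%d%m%Y").date()


def find_problems(triples):
    """Return (statement, reason) pairs for date triples that look wrong."""
    starts, ends = {}, {}
    for subject, predicate, obj in triples:
        if predicate == "has start date":
            starts[subject] = stamp_date(obj)
        elif predicate == "has end date":
            ends[subject] = stamp_date(obj)

    problems = []
    for statement, end in ends.items():
        start = starts.get(statement)
        if start is None:
            problems.append((statement, "end date without start date"))
        elif end < start:
            problems.append((statement, "ends before it starts"))
    return problems


job = ("Robert Gibbs", "has job", "White House press secretary")
sample = [
    job,
    (job, "has start date", "xxxxZ20012009"),
    (job, "has end date", "xxxxZ01012009"),  # deliberately wrong, for the demo
]
for statement, reason in find_problems(sample):
    print(statement[0], "|", reason)
```

A human editor would likely catch this one error too, but a script can run the same check against every record in the store each time a new triple is added.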
However, as much as I love the elegance of this approach, I suspect there may still be a place in our directories of meaning for the occasional semi-structured piece of information, identified by URI and deployable in creative ways.
More work? You bet. Worth the effort? Depends on the asset and its use.
My feeling is that despite the urge to semantic purity, the most evolved SCMS would manage a combination of techniques and tools, with users deciding how best to make use of their informational assets.



