XARK 3.0

  • Xark began as a group blog in June 2005 but continues today as founder Dan Conover's primary blog-home. Posts by longtime Xark authors Janet Edens and John Sloop may also appear alongside Dan's here from time to time, depending on whatever.



Monday, June 06, 2011




Favorite quote ever:

I love bicycles, too, but I wouldn't recommend them as a spacefaring technology.

You make a very good argument, and I agree with it in the sense that "answer" is a good atomic unit (though stories will continue to be popular, but you know that.) But I wonder how structured our structured data really has to be before it's useful. Consider how well IBM's Watson does in answering questions from a huge database of unstructured text. I don't think we really yet know enough about knowledge representation to get ambitious about representing the world as metadata.


Thanks Dan. My favorite quote is "we must connect the question “what's new?” to the question “what do we know?” so that the first feeds the second and the second informs the first."

In this fast developing globally networked world, our kerosene approaches are holding us back, and are not much fun. To have more fun, and make more progress, we need to approach our work with a focus on the information network, not any existing product.

The technological issues are within our grasp. The cultural addiction to kerosene is much more difficult.




Jonathan, I took a different lesson from the Watson stunt. It took IBM years and an estimated $2 billion to build a "computer" that consisted of 90 servers in 10 racks, and that computer was barely better than humans in a speed contest that involves low levels of confidence and -- in this case -- had to have specially selected questions so that Watson would even stand a chance.

This is not to say that we won't have true AI someday. But we don't today. And while everyone has been pinning their hopes on Natural Language Text Analysis engines, my experience is that none can produce results with the level of confidence (better than 95 percent) that you'd need to build reliable systems.

The one system that I know that produces information with those kinds of degrees of confidence is the human news reporting system. Not that we journalists approach 95 percent confidence in "breaking" information, but that as we move through the process, our confidence on discrete statements and sourcing is supposed to reach these levels.

So rather than spend billions of dollars and years of development cycles on Watson-style AI, why not invest a few million on relatively simple upgrades to word processing software and web/print publishing CMS? Embed the coding steps in an intuitive, machine-assisted workflow. That's the foundation of the business idea I'm working to develop with Abe Abreu.

And I disagree about what we know about representing the world... so long as we're building that representation pixel by pixel with discrete RDF triples, not block by block with massive, dependent summaries or top-down, heavy ontologies.

Chuck, I think getting wild-eyed every now and then is an absolute requirement. It's where visions come from.

"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man." -- George Bernard Shaw, 1903.


Really interesting post Dan. I wonder what your thoughts are on this somewhat related post by VC Albert Wenger: http://continuations.com/post/4158738112/embrace-messy-data-to-reach-internet-scale

The point being that you can only enforce so much structure if you want to achieve real scale.

I tend to agree with you on the whole and think that a professional creator class has the incentive to structure their data properly if it results in economic value for that professional organization. I wrote about that here: http://goldbergfam.info/blog/2011/04/20/an-open-standard-for-commercial-content-syndication/

On a related note: what is the business idea you're working on with Abe Abreu?


I interpret Watson in a different way: doesn't matter what it cost. Point is they demonstrated and advanced the state of the art in extracting useful knowledge from all available data sources, structured and unstructured. And when you read the technical papers, structured data wasn't at the heart of that success. General ontologies and DBPedia tables worked well for certain closed domains (say, presidents, countries, species, basic constraint relations like country-isnt-a-person, etc.) but weren't the main knowledge store or type inference engine. The main knowledge store was unstructured text from a huge variety of places (Wikipedia, newswire archives, web crawls, etc.) plus an open type inference system based on statistical patterns of word use (the algorithm is called PRISMATIC.)

Yeah, structured data is great. But what do we want to use it for? Answering someone's question, right? At the moment, the best open-domain question answering techniques are in Watson, which improved the state of the art accuracy from about 30% to >80% in half a decade. As for the hardware required: every time you do a Google search you use that much computing power. If I were going to base a startup on answering people's questions, I would be stockpiling unstructured data -- and smart humans, as you so rightly point out.


Ah. Perhaps I see now the difference between our premises. My assumption is that question answering by smart humans is going to be so massively amplified by algorithmic question answering systems -- really they're just very clever search engines -- that that's where we want to focus our investment at the current time, if we want the fastest possible increase in the general quality and fastest possible decrease in the general cost to get a question answered for a random member of the majority of humanity.

Or, let me put a question to you instead: how do you foresee this massive new quantity of structured data being used during the question answering process in the future?


David, I'd say that the excitement in the Web 2.0 period of the past decade was around distributed knowledge. We were all excited by the power of folksonomic systems, hashtags and ad-hocracies. There was a sense that we had discovered a self-assembling principle for information. This messy looseness was a strength, because it enabled connections that occurred without top-down control.

I think the new experience is of the limits of these systems, particularly from an economic viability standpoint. As powerful and democratic as they are, they're not assembling the large audiences that are required to produce traffic that investors can get excited about. Very few of us are making money off content via these systems.

So I see messy data as abundant and valuable and easily acquired, but the processing costs of using it for anything other than low-value, ad-supported media are probably higher than you'd want them to be. Since most people are focused on trying to drive down overhead so that low-value media can make profits, that's probably a good place to put some attention.

I'm just not particularly attracted to the low-cost model, because I don't like where I suspect that leads. And the things that interest me tend to be information applications that value the information instead of the audience.

On your post about open standards for commercial content, two small thoughts:

First, I think your parting thoughts about social media "finding" content for people are top-level important. Because the idea that news finds the consumer is becoming increasingly true, and the channels by which that news finds us are increasingly proprietary.

Second, I'd be willing to use any open standard that works, but the big issue to me is the value of the content, and who it really belongs to.

I think the value of data lies not in each individual point, but in the structure that gives it context and meaning. But can we apply the same standard to unstructured news? Who does "news" belong to, and how long does that "ownership" last? The answers that I get when I ask those questions are so ephemeral that I haven't wanted to invest great energies in exploring them.

So yes, if you've got content that you can value and clear ownership of that content, then you could make progress in these areas. But my concern is that it won't be the best open standard that wins, but the agreement on a standard -- any standard -- between the industry organizations that have the most to gain or lose.

The business that I'm working to develop with Abe is one that would demonstrate and develop a system for embedding and publishing machine-readable meaning in regular old human-readable documents.


Jonathan: Now THAT could be a very successful business! Smart humans, working with algorithmic tools producing FAQs that can be easily generated and retrieved. Very interesting idea you have there (if I've read it right).

Here's how I think my structured approach might play out:

A news company starts using a workflow that not only tags factual statements with machine-assisted RDF statements, but also records those RDF statements in a directory. The directory belongs to the company, but the majority of the directory is publicly available to any user (or robot).

And then magic happens. ;-)

I kid about this because it's the thing I hear a lot: "You can't just black-box what happens next, because that's not a business." Which is absolutely true.

BUT: it's also true that if I owned a directory like this, I'd be doing two things: 1. I'd be writing apps and APIs that could parse this data into things that answer questions that certain groups of people consider valuable; and 2. I'd be expanding, shaping and improving the data I capture so that I can filter it into datasets that I can sell to institutional clients.

I don't think THAT'S magic. I think that's an economic process of discovering what data has the most value and then adapting your business to make that transaction profitable.

The magic part would be the secondary, emergent aspects of these transactions. If any organization can use a similar workflow tool to mark up its valuable data, wouldn't it make economic sense for those organizations to develop cooperative, open standards that allow for the buying and selling of their data? I think that's where things would get interesting, because the market would bend toward datasets that offered truly authoritative answers. I think you'd find authority through cooperation.

So I don't think you start by imagining every standard. I think you start by creating a tool that makes it practical for groups of humans without special training to mark up text, re-use existing knowledge, and customize their information models without having to write a work order to the I.T. department every time.
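
To make the directory idea above a little more concrete, here is a minimal sketch in Python. It is an illustration only, not the tool described in this thread: the `Statement` and `Directory` names, the `ex:` identifiers, and the sample facts are all hypothetical, and it models RDF-style triples with plain strings rather than a real RDF library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Statement:
    """One RDF-style triple plus bookkeeping (all field names are hypothetical)."""
    subject: str
    predicate: str
    obj: str
    source: str          # the document the statement was tagged in
    public: bool = True  # most of the directory is open to any user or robot


class Directory:
    """A company-owned store of tagged statements (hypothetical design)."""

    def __init__(self) -> None:
        self._statements: List[Statement] = []

    def record(self, statement: Statement) -> None:
        # Skip exact duplicates so re-publishing a story does not inflate the store.
        if statement not in self._statements:
            self._statements.append(statement)

    def query(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
        include_private: bool = False,
    ) -> List[Statement]:
        # A field left as None acts as a wildcard.
        results = []
        for s in self._statements:
            if not s.public and not include_private:
                continue
            if subject is not None and s.subject != subject:
                continue
            if predicate is not None and s.predicate != predicate:
                continue
            if obj is not None and s.obj != obj:
                continue
            results.append(s)
        return results


# Made-up example statements, as a reporter's workflow might record them.
directory = Directory()
directory.record(Statement("ex:CityCouncil", "ex:approved", "ex:Budget2011", source="story-101"))
directory.record(Statement("ex:Budget2011", "ex:totalDollars", "150000000", source="story-101"))
directory.record(Statement("ex:CityCouncil", "ex:metWith", "ex:UnnamedSource", source="notes-7", public=False))

# What an outside user (or robot) could learn about the council.
public_answers = directory.query(subject="ex:CityCouncil")
print([s.obj for s in public_answers])  # ['ex:Budget2011']
```

The `public` flag stands in for the "majority publicly available" part of the idea; the apps, APIs, and datasets sold to institutional clients would be layered on top of something like `query`.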

The comments to this entry are closed.