RWW argues that for the Semantic Web to really take off, content-management systems need to incorporate semantic markup. As they put it:
Allowing authors or readers to add tags to articles or posts allows a measure of classification, but it does not capture the true semantic essence of the document. Automated Semantic Parsing (especially within a given domain) is on the way – à la Spock, Twine and Powerset – but it is currently limited in scope and needs a lot of computing power; in addition, if we could put the proper tools in the authors’ hands in the first place, extracting the semantic meaning would be so much easier.
For example, imagine that you are building an online repository of content, using paid expert authors or community collaboration, to create a large number of similar records – say, a cookbook of recipes, a stack of electrical circuit designs, or something similar. Naturally, you would want to create domain-specific semantic knowledge of your stack at the same time, so that you can classify and search for content in a variety of ways, including by using intelligent queries.
Ideally, the authors would create the content as meaningful XML text, so that parsing the semantics would be much easier. A side benefit is that this content can then be easily published in a variety of ways and there would be SEO benefits as well, if search engines could understand it more easily. But tools that create such XML, and yet are natural and easy for authors to use, don’t appear to be on their way; and the creation of a custom tool for each individual domain seems a difficult and expensive proposition.
The problem with XML authoring, as the author notes, is that it’s too time-consuming from a user perspective. You’re basically requiring that the user fill out a detailed, unique form on every post or content node.
What’s really needed is a way for the CMS to prefill semantic data for the user, and then let the user tweak it. The prefill would have to come from contextual information (post title keywords, word frequency analysis, link text) and metadata (category, tags). In a way, you’d have a mini search-engine index running against your own post, giving you “search results” that let you “rank” the sub-content into a structured form (there’s a rough sketch of this below). And even then, the CMS should take pains to hide the XML-ness; instead of showing the user a pile of confusing <blah>blahblah</blah> XML markup, it should present a cleaner view like
drink: triple latte
cost: $4.50
opinion: sucks, overpriced
where of course the labels (drink, cost, opinion) are mapped to the actual XML containers <drink></drink> etc. The user can edit the list easily, insert or delete labels as they choose, and then hit publish.
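For what it’s worth, here’s a minimal sketch of that prefill-and-publish flow. Everything in it is hypothetical – the prefill_fields and to_xml helpers, the toy lexicon, and the crude regexes standing in for real word-frequency analysis – but it shows the shape of the idea: the CMS guesses label/value pairs from the post’s title, body, and tags; the author edits the flat label: value list; and the XML containers are only generated at publish time.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical domain lexicon: words that suggest a value for the "drink" label.
DRINK_WORDS = {"latte", "espresso", "mocha", "cappuccino"}

def prefill_fields(title, body, tags):
    """Guess label -> value pairs from contextual info and metadata.

    A crude stand-in for the "mini search-engine" idea: mine the title,
    body, and tags for likely values, so the author only has to tweak
    the guesses instead of filling out a blank form on every post.
    """
    fields = {}

    # A price-like string in the body is an easy prefill for "cost".
    price = re.search(r"\$\d+(?:\.\d{2})?", body)
    if price:
        fields["cost"] = price.group()

    # Title words that hit the domain lexicon prefill "drink",
    # keeping a preceding modifier ("triple latte") if there is one.
    words = re.findall(r"[a-z]+", title.lower())
    for i, word in enumerate(words):
        if word in DRINK_WORDS:
            fields["drink"] = " ".join(words[max(i - 1, 0):i + 1])

    # Tags can seed an "opinion" field for the author to refine.
    for tag in tags:
        if tag in {"sucks", "overpriced", "great"}:
            fields.setdefault("opinion", tag)

    return fields

def to_xml(fields, root="review"):
    """Map the edited label/value list onto the actual XML containers."""
    inner = "".join(f"<{k}>{escape(v)}</{k}>" for k, v in fields.items())
    return f"<{root}>{inner}</{root}>"

fields = prefill_fields(
    title="Triple Latte at JoeBucks",
    body="Paid $4.50 for this one. Not worth it.",
    tags=["coffee", "overpriced"],
)
fields["opinion"] = "sucks, overpriced"  # the user tweaks, then hits publish
print(to_xml(fields))
# <review><cost>$4.50</cost><drink>triple latte</drink><opinion>sucks, overpriced</opinion></review>
```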
To achieve this, you need good metadata. By good, I mean “rich”: tagging alone is actually pretty poor as far as metadata goes, because it’s usually just a taxonomy imposed by the author, not a true folksonomy. The advantage of the latter is that the metadata is more varied, giving any semantic algorithm more to work with. Note that tagging as implemented in WordPress is not a true folksonomy, though a plugin now exists to rectify that. Semantic algorithms will starve on taxonomies alone.
Thanks for pointing that out. It’s a pretty good read.
I’m still trying to wrap my head around how the <drink></drink> is converted into XHTML. Or is this all for internal use/mashups? Or some sort of future where web content is displayed directly in raw XML?
Ideally you’d use XSLT transforms to convert from XML to HTML – though it should be noted that XSLT authoring tools are also not very good. That’s kind of the “last mile” problem for the Semantic Web: an XSLT transform assumes a rigid input XML structure, but the Semantic Web might deliver all sorts of funky tags. You’d need some kind of dynamic XSLT instead, one that takes metadata and contextual info into account as well… we are bordering on AI here.
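To make the static half of that concrete, here’s a minimal sketch of an XSLT transform applied with Python’s lxml library; the stylesheet itself is made up for the toy <review> record above. It works precisely because it assumes the rigid input structure just described – any funky tag it doesn’t expect is silently ignored, which is where the “dynamic XSLT” problem begins.

```python
from lxml import etree

# A toy stylesheet for the <review> record sketched earlier. It hard-codes
# the expected structure: drink, cost, opinion. Anything else is dropped.
XSLT = b"""<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/review">
    <div class="review">
      <h2><xsl:value-of select="drink"/></h2>
      <p>Cost: <xsl:value-of select="cost"/></p>
      <p>Verdict: <xsl:value-of select="opinion"/></p>
    </div>
  </xsl:template>
</xsl:stylesheet>"""

doc = etree.fromstring(
    b"<review><drink>triple latte</drink>"
    b"<cost>$4.50</cost><opinion>sucks, overpriced</opinion></review>")
transform = etree.XSLT(etree.fromstring(XSLT))
print(str(transform(doc)))
# <div class="review"><h2>triple latte</h2><p>Cost: $4.50</p>
# <p>Verdict: sucks, overpriced</p></div>
```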