Summary Tables from Encyclopedias, Automatically

Chocolate bars on a shelf

Suppose you want to create a web-based encyclopedia about chocolate bars. You want a web page (article) about every chocolate bar that ever existed. Each page would give factual information about the chocolate bar in question: the date it first came out, whether it included peanuts, its mass, its length, whether it was milk chocolate, and so on. Pages could also include any other text that might be considered important. For example, maybe SuperMumboBar played a key role in a 1978 movie. Other chocolate bars may not have been in any movies, but that’s okay; you can still include the movie fact in the page about SuperMumboBar.

Suppose you spend a few months and you now have 258 articles about chocolate bars (not every bar that’s ever existed, but a nice start).

Now suppose you want to make a table of chocolate bars that contain milk chocolate (i.e. not all 258). Each row should list one chocolate bar, and you want the table to have three columns: the name of the chocolate bar, the mass, and the length. What would you do?

Most likely, you would have to go through each of the 258 articles, check to see if it was milk chocolate, then manually copy the name, mass, and length into your table. Isn’t that crazy? You already have all the information! What if I add some more articles about milk chocolate bars in the future? What if I got the mass of SuperMumboBar wrong in the original version of the article? Do I have to maintain the table as if it’s independent of the articles? There must be a better way.

This is where the database people say, “Oh Troy, you just need a database!” The thing is, I don’t think in terms of tables or databases. I think in terms of sentences and paragraphs. My extra bit of information about SuperMumboBar being in a 1978 movie is not some random extra stuff for the “Miscellaneous” field (column) in a database. It’s just as relevant as the mass.

In April, I spent some time searching for a web Content Management System (CMS) that would make it easy for me to create summary tables from articles. I wanted something reliable by construction, so all the “natural language processing” solutions were not acceptable. I’m okay with marking up my articles to identify places where I’m giving facts.

Eventually, I came across Semantic MediaWiki and it turns out to be exactly what I need. For example, I can go into edit mode and write:

SuperMumboBar first appeared on store shelves in [[had launch date::1967]].

People reading that (not in edit mode) would just see “SuperMumboBar first appeared on store shelves in 1967.” but Semantic MediaWiki would know the fact that SuperMumboBar::had launch date::1967. That subject::predicate::object structure is what’s known as a “semantic triple” because it adds meaning (semantics) to the connection between SuperMumboBar and 1967. Semantic MediaWiki stores all the semantic triples and can use them to construct tables as needed. For example, suppose I go into edit mode and insert this code into an article:

{{#ask: [[has type::milk chocolate]]
| mainlabel = Chocolate Bar Name
| ?has mass = Mass (g)
| ?has length = Length (cm)
}}

The reader would see a nicely formatted table with all the milk chocolate bars listed. It would have three columns: Chocolate Bar Name, Mass (g), and Length (cm). The table would always be up to date, because its data is extracted automatically from the articles themselves. The reader would even be able to click on a column header to sort the table using the values in that column (e.g. sort in order of increasing mass). Isn’t that awesome?
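If it helps to see the idea in ordinary code, here is a toy Python sketch of the concept. It is not how Semantic MediaWiki is actually implemented, and it only understands the simple [[property::value]] form shown above. The function names (extract_triples, render, ask), the DarkCrunch article, and every mass and length value are made up for illustration.

```python
import re

# Matches simple inline annotations of the form [[property::value]].
# Assumption: real Semantic MediaWiki syntax is richer than this pattern.
ANNOTATION = re.compile(r"\[\[([^\]:|]+)::([^\]|]+)\]\]")


def extract_triples(page_name, wikitext):
    """Return (subject, property, value) triples found in one article."""
    return [
        (page_name, prop.strip(), value.strip())
        for prop, value in ANNOTATION.findall(wikitext)
    ]


def render(wikitext):
    """Return what a reader sees: each annotation replaced by its value."""
    return ANNOTATION.sub(lambda m: m.group(2).strip(), wikitext)


def ask(triples, condition, printouts):
    """Build table rows for every subject that satisfies the condition.

    condition is a (property, value) pair; printouts lists the properties
    to show as extra columns after the subject name.
    """
    cond_prop, cond_value = condition
    facts = {}
    for subject, prop, value in triples:
        facts.setdefault(subject, {})[prop] = value
    rows = []
    for subject in sorted(facts):
        props = facts[subject]
        if props.get(cond_prop) == cond_value:
            rows.append([subject] + [props.get(p, "") for p in printouts])
    return rows


# Hypothetical articles; the masses and lengths are invented.
pages = {
    "SuperMumboBar": (
        "SuperMumboBar first appeared on store shelves in "
        "[[had launch date::1967]]. It is a [[has type::milk chocolate]] bar "
        "with a mass of [[has mass::50]] g and a length of [[has length::12]] cm."
    ),
    "DarkCrunch": (
        "DarkCrunch is a [[has type::dark chocolate]] bar "
        "with a mass of [[has mass::45]] g."
    ),
}

# Collect the triples from every article, then query them like the #ask above.
triples = [t for name, text in pages.items() for t in extract_triples(name, text)]
table = ask(triples, ("has type", "milk chocolate"), ["has mass", "has length"])

for row in table:
    print(row)
```

The point of the sketch is that the table is never stored anywhere. It is rebuilt from the facts marked up inside the articles, so if I correct the mass in the SuperMumboBar article, the next table built from it picks up the correction automatically.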

Semantic MediaWiki is actually an extension for MediaWiki, the same open-source software that powers Wikipedia and many other wikis, so it’s quite powerful. Unfortunately, it’s a bit of a pain to install MediaWiki and keep it up to date. I’d rather focus on creating content than mess around with the underlying code and servers.

I found a great site that offers managed hosting of Semantic MediaWiki, called Referata.com. They take care of the servers and the software updates, freeing you up to make great semantic encyclopedias. They even have an introductory plan that’s free, so you can try it out with no financial obligations. The guy behind Referata.com (Yaron Koren) wrote many extensions to make Semantic MediaWiki even easier to use (and those are also installed at Referata.com). I emailed him some questions and he replied very quickly, so I’m super impressed with the customer service.

You might wonder how I’m using Semantic MediaWiki. It’s still in the setup stages, but I’ll be sure to let you know once it’s ready for the public. No, it’s not an encyclopedia of chocolate bars!

Summary: Semantic MediaWiki is an extension for MediaWiki (the software that powers Wikipedia) that lets you extract summary tables from the articles very easily. Referata.com offers managed hosting of Semantic MediaWiki wikis.

Aside: Semantic MediaWiki lets you create pages that are part of the “Semantic Web” (RDF and all that). That’s nice, but the main reason I like it is for its ability to generate up-to-date summary tables easily.

Photo Credit: Chocolate Bars by Gary Soup on Flickr is licensed under a Creative Commons Attribution 2.0 license.
