MITCH

Mining for information in texts from the cultural heritage: An NWO CATCH project

MITCH is part of the CATCH programme (Continuous access to cultural heritage), which aims to expose knowledge hidden in the Dutch cultural heritage. As curator of numerous specimens, Naturalis is obviously a large contributor to that cultural heritage. Less evident, but of no less value, are the vast numbers of documents describing these specimens: logs, labels, registries, publications, taxonomies, etc. Combining these sources reveals Naturalis' value: the documents are the key to the information and knowledge about the collection.

A major obstacle in exposing these relations is the sheer quantity of the available data and the variety of media, formats and methods used to store them. Many sources of information are available only as paper documents, and the digital sources are neither readily accessible nor uniform. Variations range from minor typing errors to the use of different taxonomies, as a result of progressing research and international conventions.

To harness the enormous quantity of existing and future data, the MITCH project will develop the technological utilities to "mine" these data. Research in text mining has advanced to a level at which language technology and information extraction modules can be used to structure large volumes of unstructured or semi-structured data. The project's goal is to provide the tools to extract, correct, normalize and link data, so that information from different sources can be combined, disclosed and put to better use.

Note that the focus of this project is the automation of knowledge enrichment and understanding of digital data in flat text and textual object databases. Other projects, such as sister project SCRATCH, will deal with the capture of paper documents and their transformation into flat text.

The MITCH research programme is a joint effort of Naturalis and Tilburg University under the umbrella of NWO/CATCH.

More information

See the MITCH website.

Participants

Research team

Piroska Lendvai
postdoc researcher
P.Lendvai (at) uvt.nl

Marieke van Erp
PhD student
M.G.J.vanErp (at) uvt.nl

Steve Hunt
scientific programmer
S.J.Hunt (at) uvt.nl

Coordination

Antal van den Bosch
coordinator Tilburg University

René Dekker
coordinator Naturalis

Former staff

Caroline Sporleder
postdoc researcher

Tijn Porcelijn
scientific programmer

Friday, August 27, 2010