text processing


Nederlab, online laboratory for humanities research on Dutch text collections


A user-friendly, tool-enriched open access web interface that aims to contain all digitized texts relevant to the Dutch national heritage and the history of Dutch language and culture (c. 800 - present).




The OpenConvert tools convert a number of input formats (ALTO, plain text, Word, HTML, ePub) to TEI or FoLiA. The tools are available as a Java command-line tool, a web service and a web application.


PaQu - Parse and Query


The PaQu web service makes it possible to search in syntactically annotated corpora in Dutch. You can parse your own Dutch text corpus or use one of two corpora provided by the developers.


TTNWW integrates and makes available existing Language Technology (LT) software components for the Dutch language that were developed in the STEVIN and CGN projects. The LT components are offered as web services in a simplified workflow system that enables researchers without much technical background to use standard LT workflow recipes. The web services are available in two separate domains: "Text" and "Speech" processing. The TTNWW services were created in a Dutch-Flemish collaboration project that builds on the results of past Dutch and Flemish projects. The web services are partly deployed in the SURF-SARA BiG-Grid cloud or at CLARIN centres in the Netherlands and at CLARIN VL University partners.


Online (Ancient) Greek - Dutch dictionary for the letter Pi. Search functions include searches for Greek lemmata; searches for declined or conjugated Greek word forms that lead to the correct lemma (‘lemmatizer’); searches for Dutch words that lead to different Greek lemmata; and etymological searches. The dictionary is linked to Logeion, the international website of Greek dictionaries at the University of Chicago. The developers estimate that a complete version of the dictionary will be finished by the end of 2016 and that it will be published by the end of 2017.


With this web application, an end user can have historical Dutch texts tokenized, lemmatized and part-of-speech tagged using the most appropriate resources (such as lexica) for the text in question. For each text, the user can select the best resources from those available in CLARIN, wherever they reside, supplemented where necessary by the user's own lexica.


TICCL (Text-Induced Corpus Clean-up) is a system intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. It is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that state, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus.
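
The frequency bookkeeping described for TICCL can be illustrated with a minimal Python sketch. This is not TICCL's own code or interface: the function names, the sample corpus and the variant-to-normalized-form mapping are all hypothetical, and the mapping is simply assumed to be given, whereas TICCL detects variants itself.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count how often each word type occurs across all texts (case-folded)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def normalized_frequencies(counts, variant_to_norm):
    """Sum the frequencies of actual word forms under their normalized form.

    Words without an entry in the mapping count as their own normalized form.
    """
    totals = Counter()
    for word, freq in counts.items():
        totals[variant_to_norm.get(word, word)] += freq
    return totals


# Hypothetical two-text corpus with OCR-like misreadings.
corpus = ["The quick brown fox", "Tbe quick hrown fox and the dog"]

# Hypothetical mapping from observed variant to normalized form (assumed given).
variants = {"tbe": "the", "hrown": "brown"}

raw = word_frequencies(corpus)
norm = normalized_frequencies(raw, variants)

for word, freq in sorted(norm.items()):
    print(word, freq)
```

In this sketch the normalized form "the" receives the combined count of "the" and its variant "tbe", which mirrors the statement that the frequency of a normalized word form is the sum of the frequencies of the actual word forms.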