OpenSoNaR: a 500-million-word reference corpus
Summary
OpenSoNaR is an online system for analyzing and searching the large-scale Dutch reference corpus SoNaR. Because of the size of the corpus (500 million words), accessing the information in the dataset has proven difficult for less technically inclined researchers. OpenSoNaR makes the SoNaR corpus easier to use by providing a user-friendly online interface.
Background
SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and in the development of applications. The STEVIN-funded SoNaR project (2008-2011) built on the results obtained in the D-Coi and Corea projects, which were awarded funding in the first call for proposals within the STEVIN programme.
SoNaR contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types, including texts from both conventional media and the new media. All texts except those from social media (Twitter, Chat, SMS) have been tokenized, tagged for part of speech and lemmatized, and in that same set the Named Entities have been labelled. All annotations were produced automatically; no manual verification took place.
The SoNaR project was carried out by Katholieke Universiteit Leuven (CCL), Hogeschool Gent (Dept. Vertaalkunde, LT3), Radboud University Nijmegen (CLST), Tilburg University (TiCC/ILK), Twente University (HMI), and Utrecht University (UiL-OTS). It was coordinated by Radboud University.
Because of the size of the SoNaR corpus, the number of hits shown in OpenSoNaR is limited to 8 million. If the results of your query exceed this limit, only the first 8,000,000 hits will be shown.
OpenSoNaR is an online application for exploring and searching the SoNaR corpus. In the Exploration (Dutch: verken) interface you can look into the corpus distributions, request statistics from sub-corpora, retrieve n-grams from sub-corpora and search for specific documents using the SoNaR document ID. In the Search (Dutch: zoek) interface you can use four different search strategies: simple (simpel), extended (uitgebreid), advanced (geavanceerd) or expert (expert).
In OpenSoNaR, click the green question mark in the upper left corner for a guided tour (in Dutch).
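To make the n-gram retrieval mentioned above concrete, the following is a minimal sketch of what counting n-grams over a tokenized text involves. It is an illustration only and does not reflect how OpenSoNaR computes n-grams internally; the function name `count_ngrams` and the sample sentence are assumptions introduced here, not part of the tool or the corpus.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens in a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the tokens; a text shorter than n
    # yields no windows and therefore an empty Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical, already tokenized sample sentence (not taken from SoNaR).
tokens = "de kat zit op de mat en de kat slaapt".split()
bigrams = count_ngrams(tokens, 2)
print(bigrams.most_common(1))  # [(('de', 'kat'), 2)]
```

In the sketch, a sub-corpus would simply be a different token list passed to the same function.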
- Project leader: Dr. Martin Reynaert (Tilburg University)
- CLARIN center: INL
- Help contact: reynaert@uvt.nl
- Website: http://opensonar.inl.nl
- User scenarios (screencasts, screenshots): n.a.
- Manual: http://ticclops.uvt.nl/SoNaR_end-user_documentation_v.1.0.4.pdf; http://zilla.taalmonsters.nl/opensonar/OpenSoNaR%20Handleiding.pdf [Dutch]
- Tool/Service link: http://opensonar.inl.nl
- Publications: Oostdijk, N., Reynaert, M., Hoste, V., Schuurman, I. (2013) The Construction of a 500 Million Word Reference Corpus of Contemporary Written Dutch in: Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme (eds. P. Spyns, J. Odijk), Springer Verlag.