OpenSONAR

Title

OpenSONAR: a 500 MW reference corpus of Contemporary Written Dutch

Description

SoNaR is a reference corpus of contemporary written Dutch containing over 500 million words (word tokens), intended for linguistic (including lexicographic) and HLT research and for the development of applications. The STEVIN-funded SoNaR project (2008-2011) built on the results of the D-Coi and Corea projects, which were funded in the first call for proposals of the STEVIN programme. The corpus consists of full texts from a wide variety of text types, covering both conventional media and new media. All texts except those from social media (Twitter, chat, SMS) have been tokenized, tagged for part of speech and lemmatized, and named entities have been labelled in that same set. All annotations were produced automatically; no manual verification took place. The annotated texts are available as FoLiA XML files (folia.xml). OpenSONAR is an online application for exploring and searching the SoNaR corpus. It uses BlackLab Server as its back end and WhiteLab as its user interface.
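As an illustration of how the token-level annotation might be read by a script, the following Python sketch pulls the word form, part-of-speech tag and lemma out of a small FoLiA fragment. The fragment is hand-written for this example and is not taken from SoNaR. The element layout (w, t, pos and lemma elements with class attributes, in the http://ilk.uvt.nl/folia namespace) is assumed from the FoLiA format as commonly documented, and the function name extract_tokens is hypothetical. Real SoNaR files carry more metadata and annotation layers than shown here.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for FoLiA documents; verify against the actual files.
FOLIA_URI = "http://ilk.uvt.nl/folia"
FOLIA_NS = {"f": FOLIA_URI}

# Hand-written illustrative fragment, NOT an excerpt from the SoNaR corpus.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1">
        <t>De</t>
        <pos class="LID(bep,stan,rest)"/>
        <lemma class="de"/>
      </w>
      <w xml:id="example.s.1.w.2">
        <t>boeken</t>
        <pos class="N(soort,mv,basis)"/>
        <lemma class="boek"/>
      </w>
    </s>
  </text>
</FoLiA>"""


def extract_tokens(folia_xml):
    """Return a list of (word, pos_tag, lemma) tuples, one per <w> element."""
    root = ET.fromstring(folia_xml)
    tokens = []
    for w in root.iter("{%s}w" % FOLIA_URI):
        text = w.find("f:t", FOLIA_NS)
        pos = w.find("f:pos", FOLIA_NS)
        lemma = w.find("f:lemma", FOLIA_NS)
        # Any layer may be absent (e.g. unannotated social media texts),
        # so missing elements are reported as None rather than raising.
        tokens.append((
            text.text if text is not None else None,
            pos.get("class") if pos is not None else None,
            lemma.get("class") if lemma is not None else None,
        ))
    return tokens


if __name__ == "__main__":
    for word, pos_tag, lemma in extract_tokens(SAMPLE):
        print(word, pos_tag, lemma)
```

For files of realistic size, a streaming parser such as ET.iterparse would likely be a better fit than loading each document in full.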

Project

OpenSONAR

CLARIN National Project

CLARIN-NL

CLARIAH-CORE

CLARIN centre

Dutch Language Institute

Research domain

Linguistics

Linguistic Subject

Computational Linguistics

general linguistics

Lexicology

Morphology

Syntax

text and corpus linguistics

Tool task

corpus browsing

corpus searching

corpus exploration

Country

Netherlands

Input Language

Dutch

Access Contact

servicedesk@ivdnt.org

Institute for the Dutch Language

Project Contact

not specified

Creator Contact

Dr. Nelleke Oostdijk

Radboud University

Katholieke Universiteit Leuven (CCL)

Hogeschool Gent (Dept. Vertaalkunde, LT3)

Tilburg University (TiCC/ILK)

Twente University (HMI)

Utrecht University (UiL-OTS)

Documentation

OpenSONAR Manual - First Use

SoNaR User Manual 1.0.4

Source code

not specified

Original source

http://portal.clarin.nl/node/4195

Publications

van de Camp, M, Reynaert, M and Oostdijk, N. 2017. WhiteLab 2.0: A Web Interface for Corpus Exploitation. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 231–243. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.19. License: CC-BY 4.0

de Does, J, Niestadt, J and Depuydt, K. 2017. Creating Research Environments with BlackLab. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 245–257. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.20. License: CC-BY 4.0

Oostdijk, N, Reynaert, M, Hoste, V and Schuurman, I. 2013. The Construction of a 500 Million Word Reference Corpus of Contemporary Written Dutch. In: Spyns, P and Odijk, J. (eds.) Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme. Springer Verlag.

Resource

SearchPage

CMDI File Link

License

other

Inventory Scope

local