DiscAn: Towards a Discourse Annotation system for Dutch language corpora


The DiscAn corpus is a collection of Dutch-language subcorpora that have been annotated at the discourse level. These subcorpora form a set of Dutch corpus analyses of coherence relations and discourse connectives, compiled and annotated by researchers at several universities in the Netherlands and Belgium. In the DiscAn project, funded by CLARIN-NL, this set of corpus analyses has been standardized, in terms of both the raw data (the texts) and the analyses, and opened up for further scientific research.


The six subcorpora included in the DiscAn corpus were all compiled with a focus on Dutch causal connectives. The compilers all selected examples of specific Dutch connectives from naturally occurring Dutch. They included some context: for some subcorpora only the two segments connected by the causal connective, for others a complete paragraph or more. At a later stage, the developers hope to include other corpora that have not focused on causal connectives but also include coherence relations that are not explicitly marked by a connective, as well as subcorpora containing coherence relations other than causal ones. Users of the DiscAn corpus should bear in mind that, for now, the corpus contains only causal connective subcorpora. For that reason, certain research questions cannot currently be answered with the DiscAn corpus, such as questions comparing the frequency of connectives across discourse genres. The DiscAn corpus contains the following subcorpora on causal connectives:

  • Degand (2001): 143 cases of aangezien, want and omdat from newspapers (NRC Handelsblad 1994)
  • Pander Maat & Sanders (2000): 100 cases of dus and daarom from newspapers (Volkskrant 1994 and 1995)
  • Persoon (2010): 105 cases of omdat and want from the Corpus of Spoken Dutch (CGN)
  • Pit (2003): 198 cases of doordat/omdat/want/aangezien from newspapers (Volkskrant 1995) and 107 cases of doordat/omdat/want from fiction texts
  • Sanders & Spooren (2009): 551 cases in total of omdat and want from newspapers, spontaneous conversations and chat interaction: 100 cases of omdat and 101 cases of want from newspapers (D-Coi); 100 cases of omdat and 100 cases of want from spontaneous conversations (CGN); 50 cases of omdat and 100 cases of want from chat interaction (VU-Chat-corpus and D-Coi)
  • Stukker (2005): 286 cases of daardoor (94), daarom (94) and dus (98) from newspapers (Trouw 2001)

  • Project leader: Prof. Dr. T.J.M. Sanders (Utrecht University)
  • CLARIN center: Max Planck Institute for Psycholinguistics
  • Help contact: https://tla.mpi.nl/contact/



