Corpus Studio Web

Title

Corpus Studio Web

Description

Summary

CorpusStudio is a web application that facilitates in-depth quantitative syntactic research for linguists.

Background

CorpusStudio supports researchers in writing queries that operate on syntactically parsed text corpora in a number of major XML formats. Queries that belong together are kept in XML documents called 'Corpus Research Projects' (CRPs). These documents contain the queries, the order in which they are to be executed, meta-information about the queries and the project as a whole, and a specification of the input used for the project. The use of CRPs helps improve the replicability of corpus research.

Access

Any CLARIN-NL user can access the CorpusStudio web application and make use of the 'standard' corpora. New users must provide a login name and password before they can use the application.

Adaptable

The CorpusStudio code is open source. Users can take the code, adapt it and use it for their own purposes. Users can also take the code from GitHub as it is and set up their own server in order to run the application on their own text corpora. User documentation and an API are available (see below). The current version of CorpusStudio supports XML text corpora in the FoLiA and Psdx formats. Extensions to other XML formats are possible.

CrpxProcessor provides the basic functionality and is on GitHub at https://github.com/ErwinKomen/CrpxProcessor.
CrppServer takes care of /crpp and uses CrpxProcessor. It is on GitHub at https://github.com/ErwinKomen/CrppServer.
CrpStudio takes care of /crpstudio and uses CrpxProcessor. It is on GitHub at https://github.com/ErwinKomen/CrpStudio.
Main features

Keep all important aspects of a research project in one file
Define one or more search queries in a hierarchy
Uses the W3C-developed XQuery and XPath languages
Integrated CorpusStudio-specific XQuery functions
User-definable functions and variables
Create corpus result databases with user-definable features accompanying each hit
Divide the output into calculable categories
Divide the results into metadata-dependent groups
Parallel processing yields a speed-up by a factor of 20 to 100 compared to the Windows version
Compatibility with the Windows programs "Cesax" and "CorpusStudio"

Limitations and future developments

Current limitations of the program include: working with result databases, a restricted login system, no document view, grouping restricted to system-defined groups, and no query or project wizard. Although the CLARIN-NL project ended in December 2015, every effort will be made to add a number of essential features.
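
Illustration

The description above mentions queries over syntactically parsed XML corpora, hits accompanied by features, and output divided into categories. The following is a minimal sketch of that general idea in Python, not CorpusStudio's own query language or implementation (CorpusStudio uses XQuery and XPath). The XML layout, the element and attribute names (`corpus`, `sentence`, `node`, `word`, `label`) and the function `find_hits` are all assumptions made up for this example; they are not the FoLiA or Psdx schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified parsed corpus. This is NOT the real FoLiA or
# Psdx format; it only mimics the idea of labelled syntactic nodes.
SAMPLE = """
<corpus>
  <sentence id="s1">
    <node label="IP-MAT">
      <node label="NP-SBJ"><word>The king</word></node>
      <node label="VBD"><word>rode</word></node>
    </node>
  </sentence>
  <sentence id="s2">
    <node label="IP-MAT">
      <node label="NP-SBJ"><word>She</word></node>
      <node label="VBD"><word>sang</word></node>
      <node label="NP-OBJ"><word>a song</word></node>
    </node>
  </sentence>
</corpus>
"""


def find_hits(xml_text, label_prefix):
    """Return one record per node whose label starts with label_prefix.

    Each record carries a few features of the hit: the sentence id,
    the full node label and the text the node covers.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for sentence in root.iter("sentence"):
        for node in sentence.iter("node"):
            label = node.get("label", "")
            if label.startswith(label_prefix):
                text = " ".join(w.text for w in node.iter("word"))
                hits.append(
                    {"sentence": sentence.get("id"), "label": label, "text": text}
                )
    return hits


# Find all noun phrases, then divide the hits into categories by label.
hits = find_hits(SAMPLE, "NP")
counts = Counter(hit["label"] for hit in hits)

for hit in hits:
    print(hit["sentence"], hit["label"], hit["text"])
print(dict(counts))
```

Keeping each hit as a small record of features, rather than as a bare count, is what allows the results to be regrouped or categorised afterwards without running the query again.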

Project

Corpus Studio Web

CLARIN National Project

CLARIN centre

Meertens Institute

Research domain

Linguistic Subject

Tool task

Country

Netherlands

Tool Type

Research Phase

Tool status

Output format

Input Language

Access Contact

Project Contact

Creator Contact

Documentation

Source code

Original source

Publications

Komen, E. R. 2017. Beyond Counting Syntactic Hits. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 259–268. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.21. License: CC-BY 4.0
Komen, Erwin R. 2011. Coreferenced corpora for information structure research. In Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources. (Studies in Variation, Contacts and Change in English 10) Jukka Tyrkkö, Terttu Nevalainen, Matti Rissanen & Matti Kilpiö (eds). Helsinki, Finland: Research Unit for Variation, Contacts, and Change in English.
Komen, Erwin R. 2013. Finding focus: a study of the historical development of focus in English. Utrecht: LOT.
Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12). Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian academy of sciences.

Resource

CMDI File Link

License

unknown

Inventory Scope