Ucto Engine

https://raw.githubusercontent.com/LanguageMachines/ucto/master/logo.svg

Title

Ucto Tokeniser Engine

Description

The Ucto tokenisation engine is language-independent: given an external configuration file with tokenisation rules for a specific language, it yields a tokeniser for that language. The tokeniser processes text files by separating words from punctuation and splitting sentences, which is one of the first tasks in almost any Natural Language Processing application. Ucto also offers other basic preprocessing steps, such as changing case, that make text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. Output is UTF-8 encoded and NFC-normalised; other encodings are optionally accepted as input. Text can optionally be converted to all lowercase or all uppercase. Ucto supports FoLiA XML.

Project

CLARIN-NL
CLARIAH-CORE

CLARIN National Project

CLARIN centre

none yet

Research domain

Linguistic Subject

Tool task

Country

Netherlands

Tool Type

Research Phase

Tool status

Input format

Output format

Input Language

Version

v0.13

Access Contact

Project Contact

Creator Contact

Ko van der Sloot

Documentation

Source code

Resource

CMDI File Link

License

GNU GPL

Inventory Scope