PICCL

Title

PICCL: Philosophical Integrator of Computational and Corpus Libraries

Description

PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract offers Open Source software for optical character recognition. TICCL (Text Induced Corpus Clean-up) is a system that is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. This corpus can be one text, or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often the word occurs in the corpus. These frequencies of the normalized word forms are the sum of the frequencies of the actual word forms found in the corpus. TICCL is a system that is intended to detect and correct typographical errors (misprints) and OCR errors (optical character recognition) in texts. When books or other texts are scanned from paper by a machine, that then turns these scans, i.e. images, into digital text files, errors occur. For instance, the letter combination `in' can be read as `m', and so the word `regeering' is incorrectly reproduced as `regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.

Project

CLARIN-NL
CLARIAH-CORE
Nederlab

CLARIN National Project

CLARIN centre

none yet

Research domain

Linguistic Subject

Tool task

Country

Netherlands

Tool Type

Research Phase

Tool status

Input format

Output format

Input Language

Version

v0.6.4

Access Contact

Project Contact

Creator Contact

Ko van der Sloot

Documentation

Source code

Publications

Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf

Resource

CMDI File Link

License

GNU GPL

Inventory Scope