CLARIN Standards List

Title Resource type Status Summary Comments
CES (Corpus Encoding Standard) data format, text approved MULTEXT, along with EAGLES and the Vassar/CNRS collaboration (supported by the U.S. National Science Foundation), have developed a Corpus Encoding Standard that will "serve as a widely accepted set of encoding standards for corpus-based work... The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora." The CES is available in SGML and XML. Also known as TEI light
Citer Unspecified Is now known as PISA (see PISA)
CHAT (Codes for the Human Analysis of Transcripts) data format
CMDI (Component Metadata Infrastructure) metadata format, Infrastructure
Controlled Vocabulary Guideline Too generic, specific controlled vocabulary standards, e.g., CLAVAS, should be mentioned
Country Codes Unspecified Too generic, specific country code sets, e.g., ISO 3166, should be mentioned
CLARIN MDI Unspecified Is now known as CMDI (see CMDI)
CQLF (Corpus Query Lingua Franca) Guideline, query language Under development
CQP (Corpus Query Processor) Tool, query language
CSV (Comma Separated Values) data format There is no CSV standard. Implementations differ in escaping, field names in the first row, single or double quotes, …
DataCITE Metadata schema
DCAM (Dublin Core Abstract Model) Metadata schema
DCMES (Dublin Core Metadata Element Set) Metadata schema
DCR (Data Category Registry) Knowledge resource Two successors of ISOcat are (most likely) upcoming: a new ISO DCR (using and a CLARIN Concept Registry
DiAML (Dialogue Act Markup Language) data format
DictionaryEntry-RePresentation data format What is the relationship to LMF?
DITA (Darwin Information Typing Architecture) data format
DOL (Distributed Ontology Language) Knowledge representation See OntoIOp
DSSSL (Document Style Semantics and Specification Language) Transformation language
EAD (Encoded Archival Description) Metadata schema
EAF (ELAN (or EUDICO) Annotation Format) data format
Feature structures data format Should be more specific as there are several FSRs
GOLD (General Ontology for Linguistic Description) Knowledge resource
H264 Data format or encoding H264 is an encoding; should the recommendation also include a container, e.g., H.264/MPEG-4 AVC?
Handle Infrastructure
HTML (Hypertext Markup Language) data format Be more specific about HTML versions
HTTP (Hypertext Transfer Protocol) Infrastructure
HyTime data format
IMDI (ISLE Metadata Initiative) Metadata schema
IPA (International Phonetic Alphabet) Data format? Is there a standard (Unicode) representation of IPA?
ISBD (International Standard Bibliographic Description) Metadata schema, Guideline Gives a structure but not really a representation (?)
ISO-Thesauri Knowledge representation
ITS (Internationalization Tag Set) data format
JATS (Journal Article Tag Suite) data format
JPEG (Joint Photographic Experts Group) data format
JSON (JavaScript Object Notation) data format Could be considered too general, i.e., just like plain XML
LAF (Linguistic Annotation Framework) data format
KAF (Kyoto (or Knowledge) Annotation Framework (or Format)) data format
Language Codes Unspecified Too generic, specific language code sets should be mentioned, e.g., ISO 639-3
LMF (Lexical Markup Framework) data format
MAF (Morpho-syntactic Annotation Framework) data format
MARC (Machine-readable Cataloging) Metadata schema
MARTIF (Machine-readable Terminology Interchange Format) data format
METS (Metadata Encoding and Transmission Standard) Metadata schema
MJPEG 2000 (Motion JPEG 2000) data format
MLIF (Multilingual Information Framework) data format
MPEG21 DID (Digital Item Declaration) data format
MPEG7 data format
MPEG-2 data format
MPEG-1 Layer 3/MP3 data format
Multilingual Thesaurus Guideline
NISO MIX (Metadata for Images in XML Schema) Metadata schema
nRQL (new Racer Query Language) Tool, query language Proprietary? Recommend SPARQL
OAI-PMH (Open Archives Initiative - Protocol for Metadata Harvesting) Infrastructure
OLAC Metadata Metadata schema
OLiA Knowledge resource
OntoIOp (Ontology Integration and Interoperability) Knowledge representation
Ontology Mapping Language Knowledge representation
OAI-ORE (Open Archives Initiative - Object Reuse and Exchange) Metadata schema, Infrastructure
OWL (Web Ontology Language) Knowledge representation
PCM (Pulse-code modulation) data format PCM is an encoding used by various formats, e.g., WAV; shouldn’t these formats be recommended?
PDF/A (Portable Document Format) data format
PISA (Persistent identification and sustainable access) Infrastructure, Guideline
PNG (Portable Network Graphics) data format
PML (Prague Markup Language) data format
RDF (Resource Description Framework) Knowledge representation, data format See RDF/XML, RDFS
RDF/XML data format Should recommend one common RDF serialization
RDFS (RDF Schema) Schema language
RELAX NG (REgular LAnguage for XML Next Generation) Schema language
Representation of names and scripts Unspecified Too generic, specific standards, e.g., ISO 15924, should be recommended.
Representation of entries in dictionaries Unspecified Too generic, specific standards should be recommended.
RDQL (RDF Data Query Language) query language Recommend SPARQL
REST (Representational state transfer) Infrastructure
RQL (RDF Query Language) query language Recommend SPARQL
RTF (Rich Text Format) data format
SeRQL (Sesame RDF Query Language) query language, Tool Recommend SPARQL
SGML (Standard Generalized Markup Language) data format
SKOS (Simple Knowledge Organization System) Knowledge representation
SemAF (Semantic annotation framework) data format
SemRoleML (Semantic role markup language) data format
SimpL-1 (Simplified natural language) Guideline?
SOAP (Simple Object Access Protocol) Infrastructure
SPARQL (SPARQL Protocol and RDF Query Language) query language
SRX (Segmentation Rules eXchange) Data format or Knowledge representation?
SRU/CQL Infrastructure
Structured vocabularies for information retrieval data format
SynAF (Syntactic annotation framework) data format
Systems to manage terminology, knowledge and content Unspecified Too generic
TermLR () Knowledge resource Can the TC 37 i-Term terminology resource be opened up for the public?
TBX (TermBase eXchange) data format
TEI (Text Encoding Initiative) Guidelines Guideline, data format, Metadata schema
TextMD (Technical Metadata for Text) Metadata schema
TIFF (Tagged Image File Format) data format
TimeML (Markup Language for Temporal and Event Expressions) data format
Tipster data format
TMF (Terminological markup framework) Data format?
TMS (terminology management systems) Guideline
TMX (Translation Memory eXchange) data format
Topic Maps Knowledge representation
TRIPLE query language Recommend SPARQL
Turtle data format See RDF, RDF/XML
Unicode data format
URI (Uniform Resource Identifier) / URN (Uniform Resource Name) Infrastructure
WordNet Knowledge resource, Knowledge representation There are many WordNets. This can refer to the original, i.e., or the approach. The original can be used as a pivot. The approach would need recommended data formats.
WordSeg () data format
WSDL (Web Service Description Language) Infrastructure
XCES (Corpus Encoding Standard in XML) data format See CES
XHTML (Extensible HyperText Markup Language) data format
XML (Extensible Markup Language) data format Could be considered too general. Better to recommend specific XML schemas/vocabularies.
XMLNS (XML Namespace) data format
XPath query language
XQuery query language
XSD (XML Schema Definition) Schema language
XSL-FO (XSL Formatting Objects) data format
XSLT (Extensible Stylesheet Language Transformations) Transformation language
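
The comment on the CSV entry (implementations differ in escaping, header rows, and single or double quotes) can be illustrated with a minimal sketch using Python's standard csv module. The two sample exports and the helper name read_rows are invented for illustration and are not part of the CLARIN list; the point is only that the delimiter and quote character have to be stated explicitly when exchanging such files.

```python
import csv
import io

# Two hypothetical exports of the same record; the sample data is invented
# purely for illustration.
DOUBLE_QUOTED = 'title,comment\r\nCES,"available in SGML, and XML"\r\n'
SINGLE_QUOTED = "title;comment\nCES;'available in SGML; and XML'\n"


def read_rows(text, delimiter, quotechar):
    """Parse CSV text with an explicitly stated delimiter and quote character."""
    buffer = io.StringIO(text, newline="")
    reader = csv.reader(buffer, delimiter=delimiter, quotechar=quotechar)
    return list(reader)


# Each export parses cleanly when read with its own conventions.
rows_a = read_rows(DOUBLE_QUOTED, delimiter=",", quotechar='"')
rows_b = read_rows(SINGLE_QUOTED, delimiter=";", quotechar="'")

# Reading the second export with the first export's settings does not fail,
# but every line comes back as one unsplit field.
rows_wrong = read_rows(SINGLE_QUOTED, delimiter=",", quotechar='"')

print(rows_a)
print(rows_b)
print(rows_wrong)
```

Both exports yield the same two-column header once the right settings are supplied, while the mismatched read silently returns one field per line, which is the kind of interoperability problem the comment points at.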