THE CURRENT STATE OF THE SYNTAGRUS CORPUS
Abstract:
The paper describes the main features and options of SynTagRus, a diversely tagged corpus of Russian texts. The corpus has been developed by the A. A. Kharkevich Institute for Information Transmission Problems, RAS, and is currently treated as a subcorpus of the Russian National Corpus (RNC), where it is referred to as the “Syntactic Corpus”. Much attention is given to the linguistic principles underlying the different annotation types: morphological, syntactic, lexical semantic, lexical functional, elliptical, microsyntactic, coreferential, and temporal. Statistical data are given that characterize a variety of aspects of SynTagRus and its fragments. SynTagRus is fully (100 percent) disambiguated at all levels of annotation. The paper outlines the obvious advantages of this approach but also notes the difficulties that arise from the need to always make definite decisions and choose a single annotation option, even in cases where the linguistic material undeniably allows for more than one linguistic description. Attention is also given to certain differences between SynTagRus and the main RNC subcorpora, such as the distribution of words by part of speech and the specific morphological solutions adopted in SynTagRus in contrast to the RNC (e.g. individual morphological categories such as verbal aspect and voice, certain noun cases, etc.).