THE SYNTAGRUS TODAY
Abstract:
The paper describes the current state of the SynTagRus corpus, which consists of Russian texts tagged with morphosyntactic structures. At various points in the development of the corpus, additional kinds of tagging were introduced: lexical-semantic, lexical-functional, anaphoric, and microsyntactic.
The morphosyntactic tagging of a sentence includes the morphological structures of all its words and the syntactic structure of the sentence in the form of a dependency tree, in accordance with I. Mel’čuk and A. Zholkovsky’s “Meaning–Text” model. Lexical-semantic tagging assigns each word the corresponding entry of the Russian combinatorial dictionary. Lexical-functional tagging identifies the phrases in the text that can be interpreted in terms of lexical functions. Anaphoric tagging marks the antecedents of pronouns. Microsyntactic tagging identifies syntactic phrasemes and certain nonstandard syntactic constructions occurring in the text.
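As an illustration of what a morphosyntactic structure of this kind might look like in memory, the following Python sketch stores each word with its morphological features, its governor, and the label of the incoming dependency, and checks that the links form a tree. The class name, field names, feature values, and relation labels are assumptions made for this example; they do not reproduce the actual SynTagRus format or tag inventory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Token:
    """One word of a sentence (hypothetical layout, not the corpus format)."""
    idx: int                                  # 1-based position in the sentence
    form: str                                 # word form as it appears in the text
    lemma: str                                # dictionary form
    features: Dict[str, str] = field(default_factory=dict)  # morphological features
    head: Optional[int] = None                # idx of the governor; None for the root
    relation: Optional[str] = None            # label of the incoming dependency


def is_dependency_tree(tokens: List[Token]) -> bool:
    """Return True if the tokens have exactly one root and no cycles."""
    by_idx = {t.idx: t for t in tokens}
    roots = [t for t in tokens if t.head is None]
    if len(roots) != 1:
        return False
    for t in tokens:
        seen = set()
        cur = t
        # Walk up towards the root; revisiting a node means there is a cycle.
        while cur.head is not None:
            if cur.idx in seen or cur.head not in by_idx:
                return False
            seen.add(cur.idx)
            cur = by_idx[cur.head]
    return True


# Toy sentence "Мама мыла раму"; the labels below are illustrative only.
sentence = [
    Token(1, "Мама", "МАМА", {"pos": "noun", "case": "nom"}, head=2, relation="predicative"),
    Token(2, "мыла", "МЫТЬ", {"pos": "verb", "tense": "past"}),
    Token(3, "раму", "РАМА", {"pos": "noun", "case": "acc"}, head=2, relation="1-completive"),
]

print(is_dependency_tree(sentence))
```

Storing the governor index on each dependent keeps the structure flat, so additional layers of tagging could be attached as extra fields without changing the tree itself.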
A new text is tagged in several stages. First, the text is processed by the multifunctional linguistic processor ETAP-4, which performs morphosyntactic and lexical-semantic tagging automatically. The output of the processor is then checked and, where necessary, corrected by specially trained annotators. After that, ETAP-4 uses the morphosyntactic structures of the text to perform lexical-functional and anaphoric tagging. Finally, the annotators check these types of tagging and perform microsyntactic tagging manually.
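The order of the stages described above can be sketched as a simple pipeline. The sketch below models only the sequence and the dependency of the later automatic stage on the morphosyntactic layer; the stage functions are hypothetical stand-ins and do not call ETAP-4 or represent its interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Annotation:
    """A text plus the tagging layers produced so far (hypothetical container)."""
    text: str
    layers: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


Stage = Callable[[Annotation], Annotation]


def make_stage(actor: str, action: str, layers: Sequence[str],
               requires: Sequence[str] = ()) -> Stage:
    """Build a stand-in stage; real tagging or checking logic is out of scope."""
    def run(doc: Annotation) -> Annotation:
        missing = [r for r in requires if r not in doc.layers]
        if missing:
            raise ValueError(f"missing layers: {missing}")
        if action == "tag":
            doc.layers.extend(l for l in layers if l not in doc.layers)
        doc.log.append(f"{actor}:{action}:{'+'.join(layers)}")
        return doc
    return run


PIPELINE: List[Stage] = [
    # 1. Automatic morphosyntactic and lexical-semantic tagging.
    make_stage("processor", "tag", ["morphosyntactic", "lexical-semantic"]),
    # 2. Manual check and correction by annotators.
    make_stage("annotator", "check", ["morphosyntactic", "lexical-semantic"]),
    # 3. Automatic tagging that relies on the checked morphosyntactic structures.
    make_stage("processor", "tag", ["lexical-functional", "anaphoric"],
               requires=["morphosyntactic"]),
    # 4. Manual check of the layers from step 3.
    make_stage("annotator", "check", ["lexical-functional", "anaphoric"]),
    # 5. Manual microsyntactic tagging.
    make_stage("annotator", "tag", ["microsyntactic"]),
]


def annotate(text: str) -> Annotation:
    """Run all stages in order and return the accumulated annotation."""
    doc = Annotation(text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


result = annotate("Мама мыла раму.")
for entry in result.log:
    print(entry)
```

Declaring the `requires` argument makes explicit that the lexical-functional and anaphoric stage cannot run before the morphosyntactic layer exists, which mirrors the ordering given in the text.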
The SynTagRus corpus may be used in theoretical linguistic research as well as in practical lexicography. Corpus statistics may help optimize decision making in various automatic text processing procedures. Another very promising possibility is to use the SynTagRus data as input for modern machine learning systems.