TODAY’S STATE OF THE DIALECTAL SUBCORPUS


2015. № 3 (6), 142-162

Moscow State University, Vinogradov Russian Language Institute of the Russian Academy of Sciences

Abstract:

The paper presents the current state of the Corpus of Dialectal Texts, a subcorpus of the Russian National Corpus (RNC). Between 2005 and 2009, a pilot dialectal corpus was marked up. Since then, considerable work has been done to improve the markup of the dialectal texts at different levels (metatextual information on the time and place of collection, genre, and other properties of the texts; phonetics, morphology, semantics, and elements of syntax). Software tools for corpus markup, including a graphical user interface, have been created. Texts are being collected in different regions of Russia; previously collected texts, both published and archival, are digitized and marked up. Transcribed texts are converted into an orthographized form, and the orthographized version is then marked up semi-automatically. Texts that exist only in an orthographized version can also be included. The morphological annotation includes tags for dialectal properties at different levels (stem, inflection, and derivation). Multimodal information (video and audio) can also be included. The paper also provides information on other dialectal corpora whose developers collaborate with the RNC team.