THE MULTIMODAL MODULE AS PART OF THE RUSSIAN NATIONAL CORPUS


2015. № 3 (6), 65-87

Vinogradov Russian Language Institute of the Russian Academy of Sciences

Abstract:

The article describes the Multimodal Russian Corpus (MURCO), a new project within the framework of the Russian National Corpus (RNC). The pilot version of MURCO was opened for general access in October 2010; since then, MURCO has grown to 4.3 million tokens and its structure has been further developed. The corpus presently consists of the following subcorpora: 1) Movie speech, 2) Public Spoken Russian, 3) Private Spoken Russian, 4) Theater speech, 5) Written-to-be-Spoken Russian. MURCO is organized as a collection of clixts. A clixt is a pair consisting of an audio or video clip and the corresponding fragment of the text transcript. A user can download not only the text component of a clixt but also its sound and video components, and can then analyze them with any program. MURCO is marked up with several types of annotation. Some are standard for the RNC (metatextual, morphological, and semantic annotation), some are specific to the spoken component of the RNC (sociological and accentological annotation), and some are specific to MURCO alone (orthoepic annotation, annotation of vocal structure, and speech act and gesture annotation). Speech act and gesture annotation is used in a smaller part of MURCO, called the deeply annotated MURCO. The article describes the main ways of retrieving information from the corpus and the types of search queries. The prospects for further development of MURCO are as follows: increasing the number of texts, developing software for processing multimedia content, and creating new modules of MURCO, namely multimodal parallel corpora.