PARALLEL TEXTS WITHIN THE RUSSIAN NATIONAL CORPUS: NEW DIRECTIONS AND RESULTS


2015. № 3 (6), 194-234

Vinogradov Russian Language Institute of the Russian Academy of Sciences

Abstract:

The paper presents the current state of the parallel corpora within the RNC and the updates made over the last six years. The parallel corpora within the RNC include bilingual Russian-X corpora for the following languages: Armenian, Belarusian, Bulgarian, English, Estonian, French, German, Italian, Latvian, Polish, Spanish, and Ukrainian. For virtually all of these languages, we have language pairs in both directions of translation. The RNC now includes a multilingual corpus that consists of nine texts and involves more than 20 languages, mostly Slavic. Since 2012, the Russian-French corpus has been developed as a polyvariant corpus, with multiple translations of the same text into the same language. Polyvariant texts are also included in the multilingual corpus. The parallel corpora now include texts of different genres: not only fiction, but also news, technical, scientific, religious, and legal texts, viz. all the main subdivisions represented in the main RNC. Discrepancies between the original text and the translation (arbitrary omission, addition, or change by the translator) are now specially marked. Most languages represented in the corpus also have a morphological tagset and annotation. Some examples of corpus-based studies of lexicon and grammar are examined in more detail.