2015. № 3 (6), 194-234

Vinogradov Russian Language Institute of the Russian Academy of Sciences

Abstract:

The paper presents the current state of the parallel corpora within the Russian National Corpus (RNC) and the updates made over the last six years. The parallel corpora within the RNC include the following bilingual Russian-X corpora: Armenian, Belarusian, Bulgarian, English, Estonian, French, German, Italian, Latvian, Polish, Spanish, and Ukrainian. For virtually all of these languages we have language pairs in both directions of translation. The RNC now includes a multilingual corpus that consists of nine texts and involves more than 20 languages, mostly Slavic. Since 2012, the Russian-French corpus has been developed as a polyvariant corpus, that is, one with multiple translations of the same text into the same language. Polyvariant texts are also included in the multilingual corpus. The parallel corpora now include texts of different genres: not only fiction, but also news, technical, scientific, religious, and legal texts, that is, all the main subdivisions represented in the main RNC. Discrepancies between the original text and the translation (arbitrary omissions, additions, or changes made by the translator) are now specially marked. Most languages represented in the corpus also have a morphological tagset and annotation. Some examples of corpus-based studies of lexicon and grammar are examined in more detail.

Keywords:

parallel corpus

annotation

multilingual corpus

lexical typology

grammatical typology

language-specific lexicon

perfect
