ON PARALLEL TEXTS WITHIN THE RUSSIAN NATIONAL CORPUS: NEW LANGUAGES AND NEW CHALLENGES


2019. № 3 (21), 41-61

 Vinogradov Russian Language Institute of the Russian Academy of Sciences, National Research University Higher School of Economics

Abstract:

The paper discusses the main trends in the development of the parallel corpora within the Russian National Corpus (RNC) since 2015. The New languages section deals with the seven new language pairs that emerged during this period, their architecture and tagging.

Since 2015, the following languages have been added to those that form bilingual parallel pairs with Russian in the RNC: Bashkir, Buryat, Chinese, Czech, Finnish, Lithuanian and Swedish. The parallel subcorpora are created through the combined efforts of autonomous Russian and foreign teams, coordinated by the RNC development team in Moscow. Virtually all the new languages pose specific annotation challenges for the Corpus developers. Some of the language pairs already available in 2015 have also grown significantly in size over the four years since.

The New challenges section goes on to explore general trends in parallel corpora across different language pairs, such as regional/national language varieties, representativeness with regard to text genres, and new annotation and search types. At the present stage, the set of additional markup parameters has been significantly expanded, owing both to the greater typological diversity of the languages and scripts involved and to the use of more complex morphological analyzers. Incorporating samples of polycentric languages into the corpus has become one of the most important aims. The genre representativeness of the corpora, which was only at the planning stage in 2015, is one of the main goals in 2019 and is taken into account from the very beginning when new language pairs are created.

A case study examines the pluperfect in 24 European languages on the basis of an expanded multilingual corpus. Analysis of a multilingual text from this extended collection allows us to construct a network of distances between the data for the Pluperfect category in 24 European languages/lects.
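
As an illustration only, a network of distances of this kind could be computed along the following lines. The abstract does not specify the distance measure, so the sketch assumes a normalized Hamming distance over aligned contexts; the function names (`hamming_distance`, `distance_network`), the language labels and the toy data are all hypothetical and are not taken from the RNC.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Share of aligned contexts in which two languages differ (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same aligned contexts")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_network(profiles):
    """Return {(lang1, lang2): distance} for every unordered pair of languages."""
    return {
        (l1, l2): hamming_distance(profiles[l1], profiles[l2])
        for l1, l2 in combinations(sorted(profiles), 2)
    }


# Invented toy data: 1 = a pluperfect form is used in the aligned context, 0 = not.
toy_profiles = {
    "lang_A": [1, 1, 0, 1],
    "lang_B": [1, 0, 0, 1],
    "lang_C": [0, 0, 1, 0],
}

network = distance_network(toy_profiles)
for pair, dist in network.items():
    print(pair, dist)
```

The resulting pairwise distances can serve as edge weights in a graph of languages/lects; any other dissimilarity measure could be substituted for the one assumed here.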