PERSIAN POETIC CORPUS


2022. № 1 (31), 65-71

National Research University “Higher School of Economics”,

The Institute for Linguistic Studies, Russian Academy of Sciences

Abstract:

This paper describes the technical principles underlying a new corpus of the Persian language, published online at linghub.ru/persian_poet_corpus. It is a poetic corpus, i.e. it contains poetic works and carries special markup representing the poetic level of text structure, in our case meter and rhyme. Corpora of this type have already been created for Russian, Bashkir and Czech. For Persian, development was more difficult because no tools exist for automatic markup of several key parameters of the language (for example, the text cannot be automatically transcribed phonetically or transliterated). The corpus contains about 4 million tokens in 16 thousand works. It is diverse in genre (15 genres), authorship and period, covering the works of several dozen authors who lived from the 9th to the 17th century. The texts have morphological markup. The meter annotation was taken from ganjoor.net, while the markup of rhyme and redif was produced by the authors.