A SIMPLE PIPELINE FOR QUANTITATIVE ANALYSIS OF «EUGENE ONEGIN»


2017. № 4 (14), 183-192

Moscow State University

Abstract:

Natural language processing methods are widely used for both applied problems (web search, document classification) and scientific ones (distributional semantics research with word2vec models). However, such methods are rarely applied to the analysis of poetry. In this paper we demonstrate a simple case of data analysis applied to poetic texts. We show that a simple pipeline of NLP utilities can save a huge amount of manual work. The pipeline consists of the free morphological analyzer Yandex mystem, a simple web application for manual morphological disambiguation implemented in JavaScript, and Python utilities for word stress detection and data aggregation. Core text processing is based on regular expressions. We demonstrate the approach on the classic text of Eugene Onegin. Note that the pipeline described in the paper was not designed as a standalone application; it is an auxiliary utility developed for a purely linguistic task, the analysis of the large text of Eugene Onegin. The proposed approach may therefore contain design flaws. Moreover, modern programming languages and their libraries (e.g. the nltk library for Python) now allow even faster development of data analysis applications for texts. The pipeline described in the paper may serve as an example for further quantitative research on poetic texts.
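As an illustration of the regular-expression processing and data aggregation mentioned above, the following is a minimal sketch in Python, not the code used in the paper. The names WORD_RE, tokenize and aggregate are hypothetical, the choice to count line-final words is an assumption made for the example, and the sample is the opening quatrain of the novel. Morphological analysis with mystem and stress detection are not shown.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of Cyrillic or Latin letters,
# with optional internal hyphens (e.g. "кое-как").
WORD_RE = re.compile(r"[A-Za-zА-Яа-яЁё]+(?:-[A-Za-zА-Яа-яЁё]+)*")


def tokenize(line):
    """Split one verse line into lowercase word tokens."""
    return [word.lower() for word in WORD_RE.findall(line)]


def aggregate(lines):
    """Count all word tokens and, separately, line-final words."""
    word_counts = Counter()
    final_words = Counter()
    for line in lines:
        tokens = tokenize(line)
        if not tokens:
            continue  # skip blank lines and stanza numbers
        word_counts.update(tokens)
        final_words[tokens[-1]] += 1
    return word_counts, final_words


# Sample input: the opening quatrain of Eugene Onegin.
SAMPLE = [
    "Мой дядя самых честных правил,",
    "Когда не в шутку занемог,",
    "Он уважать себя заставил",
    "И лучше выдумать не мог.",
]

word_counts, final_words = aggregate(SAMPLE)
for word, count in word_counts.most_common(3):
    print(word, count)
```

Counters of this kind can be merged across chapters with ordinary addition, which keeps the aggregation step independent of how the text was split into files.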