2017. № 4 (14), 113-126

Institute of Czech Literature, Czech Academy of Sciences


This study presents the first results from the statistical processing of the Czech Verse Corpus, a lemmatized corpus of Czech poetry from the 19th and early 20th centuries with phonetic, morphological, metrical, and strophic annotation. It contains 1,700 books, with approximately 2.5 million verse lines and 15 million words. The paper compares the frequencies of parts of speech in the Czech Verse Corpus with those in other available corpora of Czech, as well as in their particular subcorpora (fiction, journalism, technical literature, spoken Czech). It also examines the relation between the frequency of parts of speech and the length of the line (the frequency of finite verbs, nominal vs. verbal phrases), literary movement (poetry of the 1850s and 1860s vs. poetry of the 1870s vs. 1880s), author (Svatopluk Čech vs. Viktor Dyk), and literary genre (the lyrics vs. the epics of Adolf Heyduk). Because the study was carried out on rather small samples, the results should be considered provisional. In the future we plan to broaden our material and to use the parameters analyzed here as stylometric indicators that may be useful for both literary history and textual criticism.