DOI: 10.31912/pvrli-2019.3.5

2019. № 3 (21), 100-110

Saint Petersburg State University

Abstract:

The article describes some stages in the creation and extension of the Russian everyday speech corpus “One Day of Speech” (the ORD corpus), as well as the specific conditions for publishing its online version. The speech material of the ORD corpus was recorded in natural communicative situations. The respondents were volunteers who agreed to spend a day “with a dictaphone dangling around their necks” and to fill out several questionnaires. Before recording, the respondents were given instructions, learned to use the sound recording equipment, and received the “Informant’s memo”. We asked them to intervene in the recording process as little as possible, not to turn off the dictaphone, and to behave “as usual”. All recordings are anonymous: respondents and their interlocutors are referred to by codes, and personal information in the transcripts is specially marked. The ORD corpus contains mainly private conversations, so an important requirement for publishing its online version is ensuring the anonymity of speakers. Transcripts of the sound recordings can therefore be published only after personal names, surnames, and nicknames have been anonymized and any information that could reveal a speaker’s identity has been removed from the texts. The article describes the method proposed for anonymizing personal data and raises the problem of censoring text transcripts.

Keywords:

Russian language

everyday spoken speech

speech corpus

personal information

pre-publishing processing
