Russian Learner Translator Corpus
Free Cultural Work

Russian Learner Translator Corpus (RusLTC) is a multiple learner translator corpus that stores translations made by Russian translator trainees from English into Russian (their L1) and from Russian into English (their L2). The project was initiated in 2011 and is being developed by a cross-functional team of translator trainees and computational linguists from the National Research University Higher School of Economics (Moscow, Russia) and Tyumen State University (Tyumen, Russia).


We collect translations from 10 Russian universities that offer specialist, BA and MA translation programmes. The translations are mostly made as part of routine and exam assignments or as submissions for translation contests by students majoring in translation. There are also translations from trainees who study translation as a supplementary course or have chosen translation as their second higher education (part-time and continuous learning programmes). We also include translations made as internship tasks by students majoring in translation and as graduation translation projects by part-time students. All these trainee populations are described in the searchable Corpus metadata.

RusLTC is a large corpus that was not designed for one specific research purpose but for a broad research agenda in translation studies, including:

  1. exploring variation and choice in translation by comparing different translations of the same source;
  2. comparing learner translator output with native data, which can lead to conclusions about non-nativeness and 'translationese', the translator interlanguage, and the consequences of the constraints of translation as a communicative activity as opposed to free speech;
  3. exploring the interdependence between translation characteristics and various metadata (direction and conditions of translation, source text genre);
  4. analysing concordances of multiple translations (comparing several translations with their sources), which can help to develop and test hypotheses about error-prone linguistic items or other sources of mistakes in translation (“problem areas”), for example: “Are translator’s false friends a problem in the sample selected?” or “Do learner translators overuse less frequent ways of expressing epistemic modality when translating from English?”;
  5. computer-aided error analysis of the most common translation errors, which allows speculating about the weaker components of the current translator population’s competences; this strand of research can be extended to the didactics of translation quality assessment (TQA);
  6. beyond learner corpus and translation studies research, whose results can be applied in curriculum and materials design, there are numerous ways in which the Corpus can be used directly as a teaching and learning aid in a practical translation course.

As of June 14, 2015, RusLTC contains a total of more than 1.5 million word tokens. The number of translations of a single source varies from 1 to more than 60. All translations and metadata are anonymised. Detailed, automatically updated statistics are also available.

We follow an open-knowledge philosophy in choosing corpus-building technology and are happy to make the Corpus available online. To the best of our knowledge, RusLTC is the third multiple learner translator corpus available online after and MeLLANGE LTC (see Table 1, which outlines LTC-related research as of March 2014).


Figure: Genres in the corpus

The Corpus is aligned at sentence level with LF Aligner; misalignments in the output were corrected manually in Olifant. The query interface supports lexical search in both sources and targets and returns all occurrences of the query item in the respective texts along with their targets or sources. Full texts can be viewed, and the search can be narrowed down by relevant metadata, including the translator’s gender and affiliation, education type and level, grade for the translation, year and conditions of translation (routine/exam; home/classroom) and source text genre. The source genre types are those most commonly offered as translation tasks at partner universities (academic, informational, technical, fiction, educational, speech, letters, advertisement, review; see the figure on the right for their distribution). All query results can be exported in CSV format. Note that the current query interface is a beta version and can malfunction.

There is a comparatively small translation error annotation project associated with the Corpus content. Error annotation is performed with a customized version of the brat text annotation framework (Stenetorp et al. 2012), which produces standardized machine-readable annotation. It is based on a pre-defined error typology, a hierarchy of 30 tags that includes 'good_solution' and 'note' (see the typology description). As of June 2015, we have enriched 477 translations with error annotation (391 translations into Russian and 86 translations into English; information on the annotated texts can be found in Table 3), but as the annotation tool is in everyday classroom use, this data is growing rapidly. So far this part of the Corpus lacks a proper query interface and can only be viewed online.

RusLTC is downloadable as a customized TMX-file and a plaintext archive.
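As a rough illustration of working with the bitext download, the sketch below reads translation units from a TMX document using only the Python standard library. It assumes the standard TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child); since the RusLTC file is described as customized, its actual structure may differ. The function name `read_tmx_units` and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_units(tmx_text):
    """Return one list of (language, segment) pairs per <tu> element.

    Assumes standard TMX element names; a customized TMX file
    may need different element or attribute names.
    """
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = []
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or "und"
            seg = tuv.find("seg")
            if seg is not None:
                variants.append((lang.lower(), "".join(seg.itertext()).strip()))
        units.append(variants)
    return units


# Hypothetical sample, not taken from the Corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="sketch" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello, world.</seg></tuv>
      <tuv xml:lang="ru"><seg>Привет, мир.</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = read_tmx_units(SAMPLE_TMX)
```

Each unit is kept as a list of pairs rather than a dictionary so that several target segments in the same language, as one would expect in a multiple-translation corpus, are not overwritten.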

The corpus structure is based on the following filename conventions. Each text in the Corpus receives a unique name consisting of the language code of the text (EN or RU), the number indicating the contributor, the unique number of the source and (for translations) the number of the individual translation of that source, all separated by underscores. For example, RU_1_35.txt and EN_1_35_3.txt are a Russian source and its third English translation, respectively. If a source has only one translation, the translation still receives the number 1 after the third underscore in the filename to mark it as a translation. Table 2 in the attachment describes the Corpus content in terms of source text title, genre, size and number of translations available and can be used to navigate the Corpus archive.
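The naming scheme can be unpacked mechanically. The following is a minimal sketch, assuming that filenames match the two examples above exactly (a two-letter code, then two or three underscore-separated numbers, then `.txt`) and that the leading code is the language of the file itself, as the examples suggest. The names `CorpusName`, `parse_name` and `source_filename` are hypothetical helpers, not part of any RusLTC tooling.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the examples RU_1_35.txt and EN_1_35_3.txt.
FILENAME_RE = re.compile(r"^(EN|RU)_(\d+)_(\d+)(?:_(\d+))?\.txt$")


class CorpusName(NamedTuple):
    language: str                  # language of this file: "EN" or "RU"
    contributor: int               # number indicating the contributor
    source_id: int                 # unique number of the source text
    translation_no: Optional[int]  # None for a source text

    @property
    def is_translation(self):
        return self.translation_no is not None


def parse_name(filename):
    """Split a corpus filename into its components."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised corpus filename: {filename!r}")
    lang, contributor, source_id, trans = match.groups()
    return CorpusName(lang, int(contributor), int(source_id),
                      int(trans) if trans is not None else None)


def source_filename(translation_filename):
    """Name of the source text for a translation file.

    Assumes only the EN/RU pair exists, so the source language
    is simply the other one.
    """
    name = parse_name(translation_filename)
    if not name.is_translation:
        raise ValueError("file is already a source text")
    source_lang = "RU" if name.language == "EN" else "EN"
    return f"{source_lang}_{name.contributor}_{name.source_id}.txt"


parsed = parse_name("EN_1_35_3.txt")
```

With these helpers, all translations of one source can be grouped by the pair `(contributor, source_id)` after parsing every filename in the archive.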

Figure: Error types

For all types of metadata and tagging we use stand-off annotation, which means that the annotation is stored in separate files (e.g. RU_1_35.head.txt is the file containing metadata for the respective text, and EN_1_35_3.ann is the file containing the error annotation). The error statistics can be processed to show the distribution of any error categories in any sample, from an individual translation to the whole collection of texts. These statistics for the English-Russian subcorpus are shown in the figure to the left.
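To give an idea of how such error statistics could be computed, here is a minimal sketch that derives the stand-off filenames for a text and tallies error tags in the content of an `.ann` file. It assumes the standard brat stand-off layout, in which text-bound annotations are tab-separated lines whose ID starts with `T` and whose second field begins with the tag name; the customized RusLTC setup may store additional information. The helper names and the tag `lexis` are hypothetical, while `good_solution` is one of the tags mentioned above.

```python
from collections import Counter


def standoff_files(text_filename):
    """Return the metadata and annotation filenames for a corpus text,
    following the .head.txt / .ann pattern of the examples above."""
    stem = text_filename[:-len(".txt")] if text_filename.endswith(".txt") else text_filename
    return stem + ".head.txt", stem + ".ann"


def count_error_tags(ann_text):
    """Count text-bound annotations per tag in brat stand-off content."""
    counts = Counter()
    for line in ann_text.splitlines():
        # Text-bound annotations have IDs such as T1, T2, ...;
        # notes (#1), attributes (A1) and relations (R1) are skipped.
        if not line.startswith("T"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        tag = fields[1].split(" ", 1)[0]
        counts[tag] += 1
    return counts


# Hypothetical annotation content, not taken from the Corpus.
SAMPLE_ANN = (
    "T1\tlexis 0 5\tHello\n"
    "T2\tgood_solution 10 15\tworld\n"
    "#1\tAnnotatorNotes T1\tan annotator comment\n"
    "T3\tlexis 20 24\ttext\n"
)

tag_counts = count_error_tags(SAMPLE_ANN)
```

Summing the counters of several files (`Counter` objects support `+`) gives the distribution for any sample, from a single translation to a whole subcorpus.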


Feel free to contact us at rlpcorpus@gmail.com.

The Corpus content is available under the Creative Commons Attribution-ShareAlike license. You can use it freely as long as you cite the authors (Kunilovskaya, Kutuzov 2014 in the References below); if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

If you are the author of any translations included in the Corpus and don’t want them to be freely available for research purposes anymore, contact the RusLTC team, and we will remove them from the Corpus.

Download RusLTC texts as a plain text archive (including metadata)

Download RusLTC as an aligned TMX file (bitext)

Query interface


The RusLTC team is a group of enthusiasts: translation students, computational linguists and translator trainers from Tyumen State University (Tyumen, Russia) and the National Research University Higher School of Economics (Moscow, Russia).

Since the start of the project, RusLTC data has been used in a number of research projects. Undergraduate corpus-based comparative and translation studies research is under way, and much more is planned for the future.

Some of the publications based on RusLTC can be found below:

  1. Kunilovskaya M. (2017) Linguistic tendencies in English to Russian translation: the case of connectives. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2017” (pp. 221–233). Appendix 1. Semantic taxonomy of connectives
  2. Kunilovskaya M., Kutuzov A. (2015) A quantitative study of translational Russian (based on a translational learner corpus). In Proceedings of the International Conference “Corpus Linguistics-2015”. St. Petersburg: St. Petersburg State University, Faculty of Philology, 2015, 33–40.
  3. Kunilovskaya M. (2015) How Far do we Agree on the Quality of Translation? In English Studies at NBU, 2015, Vol. 1, Issue 1, 18–31.
  4. Kunilovskaya M., Kutuzov A. (2014) Russian Learner Translator Corpus: Design, Research Potential and Applications. In Text, Speech and Dialogue (pp. 315–323). Springer International Publishing (2014, January).
  5. Kunilovskaya, M.A. (2013). Error-based TQA and Error Mark-up in BRAT. In Proceedings of International Conference on Translatology, Problems of Translation and Methods of Teaching Translation. Nizhny Novgorod, Russia. Issue 16. Vol 1, 59–71 (in Russian). [Куниловская, М.А. Классификация переводческих ошибок и их электронная разметка в brat //Проблемы теории, практики и дидактики перевода: Сборник научных трудов. Серия "Язык. Культура. Коммуникация". Выпуск 16. Том 1. - Нижний Новгород: Нижегородский государственный лингвистический университет им. Н.А. Добролюбова, 2013. - С. 59–71].
  6. Kunilovskaya, M.A. and Morgoun, N.L. (2013). Gains and Pitfalls of Sentence-Splitting in Translation. In Perm National Research Polytechnic University Herald / Linguistic and Pedagogy. # 8 (50), 152–166.
  7. Ilyushchenya, T.A. and Kunilovskaya, M.A. (2013). Inter-rater Reliability in Student Translation Evaluation. In Proceedings of International Conference on Translation Studies Ecology of Translation: Interdisciplinary Research and Perspectives. OCTOBER 4-5, Tyumen, Russia, 105–115 (in Russian). [Ильющеня, Т.А, Куниловская, М.А. Надежность результатов описания и оценки качества учебного письменного перевода // Экология перевода: перспективы междисциплинарных исследований: материалы I Международной научно-практической конференции (г. Тюмень, 4-5 октября 2013 г.)". – Тюмень: Издательство «ШУКЛИН & АЛЕКСАНДРОВ» 2013. – С. 105–115].
  8. Kunilovskaya, M.A. (2012). Towards Translation Error Mark-up in Russian Learner Translator Corpus. Paper presented at Translation Forum Russia, September 30. Kazan, Russia [Куниловская, М.А. Классификация переводческих ошибок для создания разметки в учебном параллельном корпусе Russian Learner Translator Corpus, in Lingua mobilis: Научный журнал, № 40. – ЧелГУ: Лаборатория межкультурных коммуникаций, 2013. – С. 141–159].
  9. Kutuzov, A.B., Kunilovskaya M.A., Oschepkov A.Yu. and Chepurkova A.Yu. (2012). Russian Learner Parallel Corpus as a Tool for Translation Studies. In Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference "Dialogue", Issue 11. Vol 1 of 2, 362–369.
  10. Kutuzov, A.B. (2012). Is there a difference between male and female translations (based on the RusLTC data). In Proceedings of International Conference on Translatology, Problems of Translation and Methods of Teaching Translation. Nizhny Novgorod, Russia. Issue 15. Vol 1, 97–104 (in Russian). [Кутузов, А.Б. Переводы мужские и женские: есть ли разница (на материале Корпуса несовершенных переводов)? // Проблемы перевода, лингвистики и литературы: сборник научных трудов. Серия "Язык. Культура. Коммуникация". Выпуск 15. Том 1. - Нижний Новгород: НГЛУ, 2012. – C. 97–104].
  11. Kutuzov, A.B. (2011). Russian Learner Translator Corpus: Importance of the Project. In Proceedings of International Conference on Translatology, Problems of Translation and Methods of Teaching Translation. Nizhny Novgorod, Russia. Issue 14. Vol 1 (in Russian). [Кутузов, А.Б. Корпус несовершенных переводов: необходимость проекта // Сборник научных трудов "Проблемы теории, практики и дидактики перевода", вып.14, т.1. - Нижний Новгород, 2011].

Here are the slides from a recent presentation on the Corpus, given at the TSD-2014 conference in Brno.

RusLTC at TSD-2014 (Brno) from Maria Kunilovskaya
Russian Learner Translator Corpus by RusLTC Team is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Permissions beyond the scope of this license may be available at rlpcorpus@gmail.com.