In this paper, we present a dataset for the cross-language (Russian-English) text alignment subtask of plagiarism detection and compare three models for detecting translated plagiarism. The first is based on several textual similarity scores computed from word embeddings. The second extends the first with features obtained via neural machine translation. The third is built on a pre-trained language representation model (BERT) fine-tuned for our task. The BERT model outperforms the other models but requires far more computational resources than the simpler ones. It therefore seems reasonable to use context-free and contextual models together in modern plagiarism detection systems.
PDF on the website of the International Conference "Dialogue" (in English): http://www.dialog-21.ru/media/4642/zubarevdvplussochenkoviv-110.pdf
Presentation on the website of the International Conference "Dialogue" (in English): www.dialog-21.ru/media/4843/zubarev-sochenkov.pptx
Semantic Scholar: https://api.semanticscholar.org/CorpusID:219600500
Zubarev D. V., Sochenkov I. V. Cross-language text alignment for plagiarism detection based on contextual and context-free models // Papers from the Annual International Conference "Dialogue", 2019, pp. 809-820