Cross-language text alignment for plagiarism detection based on contextual and context-free models

Authors

Sochenkov I., Zubarev D.

Abstract

In this paper, we present a dataset for the cross-language (Russian-English) text alignment subtask of plagiarism detection. We compare three models for detecting translated plagiarism. The first is based on several textual similarity scores that exploit word embeddings. The second extends the first with features obtained via neural machine translation. The third is built on top of a pre-trained language representation model (BERT), fine-tuned for our task. The BERT model outperforms the other models but requires far more computational resources than the simpler ones. Therefore, it seems reasonable to use context-free and contextual models together in modern plagiarism detection systems.

External links

PDF at the Dialogue international conference website: http://www.dialog-21.ru/media/4642/zubarevdvplussochenkoviv-110.pdf

Presentation at the Dialogue international conference website: www.dialog-21.ru/media/4843/zubarev-sochenkov.pptx

Semantic Scholar: https://api.semanticscholar.org/CorpusID:219600500

Reference link

Zubarev D. V., Sochenkov I. V. Cross-language text alignment for plagiarism detection based on contextual and context-free models // Papers from the Annual International Conference "Dialogue", 2019, pp. 809-820