Fragment detection in cross-language plagiarism search using word embeddings

Authors

Zubarev D. V., Sochenkov I. V.

Abstract

In this paper, we present a dataset for the cross-language (Russian-English) text alignment subtask of plagiarism detection. We compare three models for detecting translated plagiarism. The first is based on several textual similarity scores that exploit word embeddings. The second extends the first with features obtained via neural machine translation. The third is built on top of a pre-trained language representation model (BERT), fine-tuned for our task. The BERT model outperforms the other models. However, it requires far more computational resources than the simpler models. Therefore, it seems reasonable to use context-free and contextual models together in modern plagiarism detection systems.

External links

PDF on the website of the International Conference "Dialogue" (in English): http://www.dialog-21.ru/media/4642/zubarevdvplussochenkoviv-110.pdf

Presentation on the website of the International Conference "Dialogue" (in English): www.dialog-21.ru/media/4843/zubarev-sochenkov.pptx

Semantic Scholar: https://api.semanticscholar.org/CorpusID:219600500

How to cite

Zubarev D. V., Sochenkov I. V. Cross-language text alignment for plagiarism detection based on contextual and context-free models // Papers from the Annual International Conference "Dialogue", 2019, pp. 809-820.