In this paper, we compare methods for cross-lingual similar document retrieval for a distant language pair, namely Russian and English. The methods compared include classical Cross-Lingual Explicit Semantic Analysis (CL-ESA), machine translation methods, and approaches based on cross-lingual embeddings. We introduce two datasets for evaluating this task: Russian-English aligned Wikipedia articles and an automatically translated Paraplag. The experiments show that an approach based on an inverted index, with an extra step that maps the top keywords from one language to the other using cross-lingual word embeddings, achieves better recall and MAP than the other methods on both datasets.
DOI: 10.1007/978-3-030-81200-3_16
Download the full text of the proceedings (PDF) from the DAMDID 2020 conference website (in English): http://damdid2020.cs.vsu.ru/DAMDID_2020_Extended_Abstracts.pdf
Download the full text of the proceedings (PDF) from eLibrary (in English, registration required): https://www.elibrary.ru/item.asp?id=44512068
ResearchGate: https://www.researchgate.net/publication/353285239_Comparison_of_Cross-Lingual_Similar_Documents_Retrieval_Methods
D. V. Zubarev, I. V. Sochenkov. Comparison of cross-lingual similar documents retrieval methods // Data Analytics and Management in Data Intensive Domains: XXII International Conference DAMDID/RCDL'2020 (October 13–16, 2020, Voronezh, Russia): Extended Abstracts of the Conference. Edited by Bernhard Thalheim, Sergey Makhortov, Alexander Sychev. – Voronezh: Voronezh State University, 2020. pp. 207–210.