In this paper, we compare methods for cross-lingual similar document retrieval for a distant language pair, Russian and English. The methods compared include classical Cross-Lingual Explicit Semantic Analysis (CL-ESA), machine translation methods, and approaches based on cross-lingual embeddings. We introduce two datasets for evaluating this task: Russian-English aligned Wikipedia articles and an automatically translated Paraplag. Our experiments show that an inverted-index approach, with an extra step that maps the top keywords from one language to the other using cross-lingual word embeddings, achieves better recall and MAP than the other methods on both datasets.
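As a rough illustration of the keyword-mapping idea named in the abstract, the sketch below maps Russian keywords to their nearest English words in a shared embedding space and then looks the mapped words up in an inverted index over English documents. It is not the authors' implementation: the two-dimensional vectors, the tiny document collection, the overlap-count scoring, and the names `map_keywords`, `build_index`, and `retrieve` are all hypothetical and chosen only to keep the example self-contained.

```python
import math
from collections import defaultdict

# Toy cross-lingual embeddings in one shared space (hypothetical values).
RU_VECS = {"кошка": (0.9, 0.1), "банк": (0.1, 0.9)}
EN_VECS = {
    "cat": (0.88, 0.12),
    "dog": (0.7, 0.3),
    "bank": (0.12, 0.95),
    "river": (0.4, 0.6),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_keywords(ru_keywords, k=1):
    """Map each Russian keyword to its k nearest English words.

    Keywords without an embedding are skipped (an assumption of this sketch).
    """
    mapped = []
    for word in ru_keywords:
        if word not in RU_VECS:
            continue
        ranked = sorted(
            EN_VECS,
            key=lambda en: cosine(RU_VECS[word], EN_VECS[en]),
            reverse=True,
        )
        mapped.extend(ranked[:k])
    return mapped


def build_index(docs):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def retrieve(index, keywords):
    """Rank documents by how many query keywords they contain."""
    scores = defaultdict(int)
    for kw in keywords:
        for doc_id in index.get(kw, ()):
            scores[doc_id] += 1
    # Higher overlap first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical English collection.
EN_DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "the bank raised interest rates",
    "d3": "a cat walked into a bank",
}
INDEX = build_index(EN_DOCS)

if __name__ == "__main__":
    query = map_keywords(["кошка", "банк"])
    print(query, retrieve(INDEX, query))
```

In a realistic setting the keyword list would come from a term-weighting step over the source document and the vectors from pretrained aligned embeddings; both are outside the scope of this sketch.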
DOI: 10.1007/978-3-030-81200-3_16
Download the collection of proceedings (PDF) from the DAMDID 2020 conference website: http://damdid2020.cs.vsu.ru/DAMDID_2020_Extended_Abstracts.pdf
Download the collection of proceedings (PDF) from eLibrary (registration required): https://www.elibrary.ru/item.asp?id=44512068
ResearchGate: https://www.researchgate.net/publication/353285239_Comparison_of_Cross-Lingual_Similar_Documents_Retrieval_Methods
D. V. Zubarev, I. V. Sochenkov. Comparison of cross-lingual similar documents retrieval methods // Data Analytics and Management in Data Intensive Domains: XXII International Conference DAMDID/RCDL'2020 (October 13–16, 2020, Voronezh, Russia): Extended Abstracts of the Conference. Edited by Bernhard Thalheim, Sergey Makhortov, Alexander Sychev. – Voronezh: Voronezh State University, 2020. pp. 207–210.