Comparison of cross-lingual similar documents retrieval methods

Authors

Sochenkov I., Zubarev D.

Abstract

In this paper, we compare methods for cross-lingual similar document retrieval for a distant language pair, namely Russian and English. The methods compared include classical Cross-Lingual Explicit Semantic Analysis (CL-ESA), machine translation methods, and approaches based on cross-lingual embeddings. We introduce two datasets for evaluating this task: Russian-English aligned Wikipedia articles and an automatically translated version of Paraplag. Our experiments show that an approach based on an inverted index, with an extra step that maps top keywords from one language to the other using cross-lingual word embeddings, achieves better recall and MAP than the other methods on both datasets.
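
The best-performing approach named in the abstract (an inverted index plus keyword mapping through cross-lingual word embeddings) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors are made up, the names `map_keyword`, `build_index` and `retrieve` are hypothetical, and scoring documents by the number of matched keywords is a simplification. The abstract does not specify the keyword selection, weighting or ranking actually used.

```python
import math
from collections import defaultdict

# Toy "cross-lingual" embeddings in one shared space (made-up 3-d vectors,
# not taken from any real model).
RU_VECS = {"кошка": (0.9, 0.1, 0.0), "собака": (0.1, 0.9, 0.0), "дом": (0.0, 0.1, 0.9)}
EN_VECS = {"cat": (1.0, 0.0, 0.0), "dog": (0.0, 1.0, 0.0),
           "house": (0.0, 0.0, 1.0), "kitten": (0.8, 0.3, 0.1)}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def map_keyword(ru_word, k=2):
    """Return the k English words closest to a Russian keyword."""
    vec = RU_VECS[ru_word]
    ranked = sorted(EN_VECS, key=lambda w: cosine(vec, EN_VECS[w]), reverse=True)
    return ranked[:k]


def build_index(docs):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def retrieve(ru_keywords, index, k=2):
    """Map Russian keywords to English and rank documents by match count."""
    scores = defaultdict(int)
    for keyword in ru_keywords:
        if keyword not in RU_VECS:
            continue  # no embedding available for this keyword
        for en_word in map_keyword(keyword, k):
            for doc_id in index.get(en_word, ()):
                scores[doc_id] += 1
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


EN_DOCS = {
    "d1": "the cat sat near the house",
    "d2": "a dog barked",
    "d3": "stock markets fell",
}
INDEX = build_index(EN_DOCS)
```

Calling `retrieve(["кошка", "дом"], INDEX)` looks up the English neighbours of each Russian keyword in the index and returns the matching document ids. Mapping each keyword to several neighbours rather than one is meant to make the lookup tolerant of imperfect embedding alignment, at the cost of some extra candidate documents.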

External links

DOI: 10.1007/978-3-030-81200-3_16

Download the collection of proceedings (PDF) from the DAMDID 2020 conference website: http://damdid2020.cs.vsu.ru/DAMDID_2020_Extended_Abstracts.pdf

Download the collection of proceedings (PDF) from eLibrary (registration required): https://www.elibrary.ru/item.asp?id=44512068

ResearchGate: https://www.researchgate.net/publication/353285239_Comparison_of_Cross-Lingual_Similar_Documents_Retrieval_Methods

Reference link

D. V. Zubarev, I. V. Sochenkov. Comparison of cross-lingual similar documents retrieval methods // Data Analytics and Management in Data Intensive Domains: XXII International Conference DAMDID/RCDL'2020 (October 13–16, 2020, Voronezh, Russia): Extended Abstracts of the Conference. Edited by Bernhard Thalheim, Sergey Makhortov, Alexander Sychev. – Voronezh: Voronezh State University, 2020. pp. 207–210.