In this paper, we compare methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods such as Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. We use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach that uses an inverted index, with an extra step of mapping the top keywords from one language to the other using cross-lingual word embeddings. We use Russian-English aligned Wikipedia articles to evaluate all approaches. Our experiments show that the inverted-index approach achieves better recall and mean average precision (MAP) than the other methods.
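The inverted-index approach described above can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding vectors, the document collection, and the helper names (`map_keywords`, `build_inverted_index`, `retrieve`) are hypothetical, and ranking by the number of matched terms is a simplifying assumption that stands in for whatever scoring the paper actually uses.

```python
import math
from collections import defaultdict

# Toy vectors in a shared Russian-English space (hypothetical values, illustration only).
RU_EMB = {"кошка": [0.9, 0.1, 0.0], "банк": [0.0, 0.2, 0.9]}
EN_EMB = {
    "cat": [0.8, 0.2, 0.1],
    "dog": [0.6, 0.6, 0.0],
    "bank": [0.1, 0.1, 0.9],
    "river": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def map_keywords(keywords, src_emb, tgt_emb, k=1):
    """Map each source keyword to its k nearest target-language words."""
    mapped = []
    for word in keywords:
        if word not in src_emb:
            continue  # out-of-vocabulary keywords are skipped in this sketch
        ranked = sorted(
            tgt_emb,
            key=lambda t: cosine(src_emb[word], tgt_emb[t]),
            reverse=True,
        )
        mapped.extend(ranked[:k])
    return mapped

def build_inverted_index(docs):
    """Build a term -> set of document ids index from whitespace tokens."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def retrieve(query_terms, index):
    """Rank documents by the number of query terms they contain (assumed scoring)."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

EN_DOCS = {
    "d1": "the cat sat near the bank",
    "d2": "a dog ran along the river",
    "d3": "the bank raised rates",
}
INDEX = build_inverted_index(EN_DOCS)
query = map_keywords(["кошка", "банк"], RU_EMB, EN_EMB)
results = retrieve(query, INDEX)
```

In this toy setup the Russian keywords are first translated into their nearest English neighbors in the shared embedding space, and only then looked up in an ordinary monolingual index, so the retrieval step itself needs no cross-lingual machinery.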
DOI: http://dx.doi.org/10.15514/ISPRAS-2019-31(5)-9
Denis Zubarev's presentation at ISPRAS OPEN-2019 is available on YouTube.
PDF at the Proceedings of the Institute for System Programming journal's website: https://ispranproceedings.elpub.ru/jour/article/view/1221
PDF at the Ivannikov Institute for System Programming of the RAS: https://www.ispras.ru/proceedings/docs/2019/31/5/isp_31_2019_5_127.pdf
PDF at MathNet: http://mi.mathnet.ru/eng/tisp458
ResearchGate: https://www.researchgate.net/publication/338217425_Cross-lingual_similar_document_retrieval_methods
Zubarev D. V., Sochenkov I. V. Cross-lingual similar document retrieval methods. Proceedings of the Institute for System Programming, vol. 31, issue 5, 2019, pp. 127-136. DOI: 10.15514/ISPRAS-2019-31(5)-9.