Cross-lingual similar document retrieval methods

Authors

Sochenkov I., Zubarev D.

Abstract

In this paper, we compare different methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods such as Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. We use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach that uses an inverted index, with an extra step of mapping the top keywords from one language to the other with the help of cross-lingual word embeddings. We use Russian-English aligned Wikipedia articles to evaluate all approaches. The experiments show that the approach with an inverted index achieves better performance in terms of recall and mean average precision (MAP) than the other methods.
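The keyword-mapping step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `map_keywords`, the `top_n` parameter, and the three-dimensional toy vectors are all hypothetical, and a real system would presumably load pretrained cross-lingual word embeddings and send the mapped terms to an inverted index. The sketch only shows the idea of replacing each source-language keyword with its nearest target-language words by cosine similarity in a shared embedding space.

```python
import math

# Toy vectors in a shared Russian-English embedding space.
# The values are invented for illustration only.
EMBEDDINGS = {
    "ru": {
        "кошка": [0.9, 0.1, 0.0],
        "поиск": [0.0, 0.2, 0.9],
    },
    "en": {
        "cat": [0.88, 0.15, 0.05],
        "dog": [0.7, 0.6, 0.1],
        "search": [0.05, 0.1, 0.95],
        "retrieval": [0.1, 0.3, 0.85],
    },
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_keywords(keywords, src, tgt, top_n=1):
    """Map each source keyword to its top_n nearest target-language words.

    Keywords without a source embedding are skipped.
    """
    mapped = {}
    for word in keywords:
        vec = src.get(word)
        if vec is None:
            continue
        ranked = sorted(tgt, key=lambda w: cosine(vec, tgt[w]), reverse=True)
        mapped[word] = ranked[:top_n]
    return mapped


# Hypothetical top keywords of a Russian document, mapped to English terms
# that could then be used as a query against an English inverted index.
query_terms = map_keywords(
    ["кошка", "поиск"], EMBEDDINGS["ru"], EMBEDDINGS["en"], top_n=2
)
print(query_terms)
```

The exhaustive scan over the target vocabulary is fine for a toy example; with a realistic vocabulary the nearest-word lookup would more plausibly be precomputed or served by an approximate nearest neighbor index.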

External links

DOI: http://dx.doi.org/10.15514/ISPRAS-2019-31(5)-9

Denis Zubarev's presentation at ISPRAS OPEN-2019 (also available on YouTube)

PDF at the Proceedings of the Institute for System Programming journal's website: https://ispranproceedings.elpub.ru/jour/article/view/1221

PDF at the Ivannikov Institute for System Programming of the RAS: https://www.ispras.ru/proceedings/docs/2019/31/5/isp_31_2019_5_127.pdf

PDF at MathNet: http://mi.mathnet.ru/eng/tisp458

ResearchGate: https://www.researchgate.net/publication/338217425_Cross-lingual_similar_document_retrieval_methods

Reference link

Zubarev D. V., Sochenkov I. V. Cross-lingual similar document retrieval methods. Proceedings of the Institute for System Programming, vol. 31, issue 5, 2019, pp. 127-136. DOI: 10.15514/ISPRAS-2019-31(5)-9.