In this article we compare the quality of various cross-lingual embeddings on the cross-lingual text classification problem and explore the possibility of transferring knowledge between languages. We consider Multilingual Unsupervised and Supervised Embeddings (MUSE), multilingual BERT embeddings, XLM-RoBERTa (XLM-R) model embeddings, and Language-Agnostic Sentence Representations (LASER). Various classification algorithms use these embeddings as inputs for solving the patent categorization task. It is a zero-shot cross-lingual classification task, since the training and validation sets contain English texts, while the test set consists of documents in Russian.
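To illustrate the zero-shot transfer setup described above, the sketch below trains a classifier on English sentence embeddings and evaluates it on Russian documents. It is only a minimal illustration: it assumes the laserembeddings and scikit-learn packages, and the document texts, labels, and classifier choice are hypothetical placeholders rather than the patent data or models used in the paper.

# Minimal sketch of zero-shot cross-lingual classification with LASER embeddings.
# Assumes: pip install laserembeddings scikit-learn
#          python -m laserembeddings download-models
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

laser = Laser()

# Hypothetical English training documents with category labels (not the paper's data).
train_texts = ["A method for wireless data transmission between mobile devices.",
               "A pharmaceutical composition for treating inflammation."]
train_labels = ["H04", "A61"]

# Hypothetical Russian test documents: zero-shot, no Russian data is seen during training.
test_texts = ["Способ беспроводной передачи данных между мобильными устройствами.",
              "Фармацевтическая композиция для лечения воспаления."]
test_labels = ["H04", "A61"]

# LASER maps sentences from different languages into a shared embedding space,
# so a classifier trained on English vectors can be applied to Russian ones.
X_train = laser.embed_sentences(train_texts, lang="en")
X_test = laser.embed_sentences(test_texts, lang="ru")

# Any classifier can be plugged in here; logistic regression is just one option.
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
pred = clf.predict(X_test)
print("Macro F1 on the Russian test set:", f1_score(test_labels, pred, average="macro"))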
DOI: 10.1007/978-3-030-81200-3_13
Download the collection of proceedings (PDF) from the DAMDID 2020 conference website: http://damdid2020.cs.vsu.ru/DAMDID_2020_Extended_Abstracts.pdf
Download the collection of proceedings (PDF) from eLibrary (registration required): https://www.elibrary.ru/item.asp?id=44512058
Ryzhova, A., Sochenkov, I. Extrinsic Evaluation of Cross-Lingual Embeddings on the Patent Classification Task // Data Analytics and Management in Data Intensive Domains: XXII International Conference DAMDID/RCDL' 2020 (October 13–16, 2020, Voronezh, Russia): Extended Abstracts of the Conference. Edited by Bernhard Thalheim, Sergey Makhortov, Alexander Sychev. – Voronezh : Voronezh State University, 2020. pp. 181–183.