Development of Cross-Language Embeddings for Extracting Chemical Structures from Texts in Russian and English

Authors

Devyatkin D., Molodchenkov A., Lukin A.

Abstract

This study describes an approach to applying cross-lingual embeddings for extracting chemical structures from texts in both Russian and English. The proposed approach relies on fine-tuning pre-trained transformer-based models. After an analysis of existing models, mBERT and LaBSE were selected. The training data for these models included texts on chemistry and adjacent fields of science. Fine-tuning was performed on a collected set of scientific articles and patent texts in Russian and English; for English, the ChemProt corpus was also used. The models were trained on masked language modeling and entity recognition tasks. Comparisons were made with several models, including BioBERT. The experimental results showed that the proposed embeddings solve the task of recognizing chemical structure names in Russian and English texts more effectively.
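To illustrate the entity recognition setup mentioned above, the sketch below shows BIO-style labeling of chemical names at the token level, the target format commonly used when fine-tuning transformer models for token classification. The example data, the `CHEM` label name, and the `bio_tags` helper are hypothetical; the article's actual pipeline, tokenizer (mBERT/LaBSE subword tokenization), and label set are not specified here.

```python
# Minimal sketch of BIO labeling for chemical-name entity recognition.
# Hypothetical data and label names; a real pipeline would use a
# pretrained subword tokenizer and gold-standard span annotations.

def bio_tags(tokens, entity_spans):
    """Assign B-CHEM/I-CHEM/O tags given token-index spans of entities.

    entity_spans is a list of (start, end) pairs with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B-CHEM"              # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-CHEM"              # continuation tokens
    return tags

tokens = ["The", "sample", "contained", "sodium", "chloride", "."]
spans = [(3, 5)]                             # "sodium chloride"
print(bio_tags(tokens, spans))
# ['O', 'O', 'O', 'B-CHEM', 'I-CHEM', 'O']
```

During fine-tuning, such tag sequences serve as per-token classification targets on top of the embeddings produced by the pre-trained model.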

External links

Download the article (PDF) from the official website: http://injoit.ru/index.php/j1/article/view/2131

Download the article (PDF) from eLibrary (registration required): https://www.elibrary.ru/item.asp?id=82341426

Reference link

Alexey Molodchenkov, Dmitry Deviatkin, Sergey Loginov, Alexey Lupatov, Alisa Gisina, Anton Lukin. Development of Cross-Language Embeddings for Extracting Chemical Structures from Texts in Russian and English // International Journal of Open Information Technologies, ISSN: 2307-8162, vol. 13, no. 5, 2025, pp. 62–66.