Architecture of an information collection and extraction system for an intelligent search and analytical platform

Authors

Sochenkov I., Zubarev D.

Abstract

Internet data serves as the foundation for a wide range of tasks, from information retrieval to analytical processing. With the rapid growth of data volumes, efficient metadata extraction from dynamic web resources has become critically important. Traditional information collection and extraction methods based on static templates are largely ineffective when processing interactive content. This paper presents the architecture of an adaptive information collection and extraction system that integrates standard data extraction techniques with machine learning technologies. The system has a modular structure comprising the following subsystems: task management, monitoring and logging, crawling, link management, and metadata extraction. The crawling subsystem processes both static and dynamic content through browser emulation. A hybrid approach combining structured rules and machine learning is used for metadata extraction. Experimental results demonstrated successful metadata extraction from various web resources, including pages with dynamic content and complex structures. The system exhibited high accuracy and resilience to changes in data formats while strictly adhering to ethical data collection standards, such as compliance with robots.txt directives and applying reasonable request intervals. Thus, the proposed solution represents a significant step toward the development of universal data collection and extraction systems for modern information environments. The developed software tools have been utilized in populating the index databases of the Neopoisk system.
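The ethical collection constraints mentioned above (compliance with robots.txt directives and reasonable request intervals) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `PolitenessGate`, the user agent `example-bot`, and the one-second default interval are all assumptions made for illustration. It uses only the standard-library `urllib.robotparser` module and takes the robots.txt lines as input, so it makes no network requests itself.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_lines, user_agent="example-bot", default_delay=1.0):
        # user_agent and default_delay are illustrative assumptions.
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.parser = RobotFileParser()
        self.parser.parse(robots_lines)  # robots.txt content as a list of lines
        self.last_fetch = {}  # host -> timestamp of the last recorded request

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def delay_for(self, url, now=None):
        """Seconds to wait before requesting the URL; 0.0 if it may go now."""
        now = time.monotonic() if now is None else now
        # Prefer the site's Crawl-delay; fall back to the default interval.
        interval = self.parser.crawl_delay(self.user_agent) or self.default_delay
        last = self.last_fetch.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, last + interval - now)

    def record_fetch(self, url, now=None):
        """Remember when the host of this URL was last requested."""
        now = time.monotonic() if now is None else now
        self.last_fetch[urlsplit(url).netloc] = now


# Usage with an in-memory robots.txt (example.org is a placeholder host).
gate = PolitenessGate(["User-agent: *", "Disallow: /private/", "Crawl-delay: 2"])
print(gate.allowed("https://example.org/news/1"))
print(gate.allowed("https://example.org/private/x"))
```

In a crawler, a scheduler would call `allowed` before queueing a URL, sleep for `delay_for` seconds before the request, and call `record_fetch` afterwards. Tracking the interval per host is one possible design choice, not something the abstract specifies.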

External links

DOI: 10.15514/ISPRAS-2025-37(2)-20

Download the article (PDF) or read online at the journal's website (in Russian): https://ispranproceedings.elpub.ru/jour/article/view/1922

Download the article (PDF) from the ISP RAS website (in Russian): https://www.ispras.ru/proceedings/docs/2025/37/2/isp_37_2025_2_263.pdf

Download the full issue (PDF) or read online at the journal's website (in Russian): https://ispranproceedings.elpub.ru/jour/issue/viewIssue/123/174

Download the article (PDF) from eLibrary (in Russian, registration required): https://www.elibrary.ru/item.asp?id=80645858

Math-Net.Ru: https://www.mathnet.ru/eng/tisp981

Reference link

Serenko D. S., Terentev E. D., Zubarev D. V., Sochenkov I. V. Architecture of an information collection and extraction system for an intelligent search and analytical platform // Proceedings of ISP RAS, vol. 37, issue 2, 2025. pp. 263–280.