Training tools that detect hate messages in social media requires labeled message datasets, and collecting such datasets is laborious. They are widely available in the public domain for English and Arabic but are hard to find for Russian and other languages. The article presents an automated approach to creating labeled sets of religious hatred messages from social media. The approach combines focused crawling of social network messages with active learning. Crawling proceeds step by step: at each step, active learning methods are used to correct message labels and to train the classifier that filters out irrelevant texts. The approach makes it possible to build a multilingual corpus of religious hatred messages and to train a classifier that identifies them at the same time.
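The interplay of step-by-step crawling and active learning can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifier, the query strategy, or the crawler, so the bag-of-words Naive Bayes model, the uncertainty-sampling rule, and every name here (`TinyClassifier`, `crawl_with_active_learning`, `oracle`, `budget_per_step`, `threshold`) are assumptions made for illustration. Each element of `batches` stands in for the messages fetched in one crawling step, and `oracle` stands in for a human annotator.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class TinyClassifier:
    """Bag-of-words Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = {0: 0, 1: 0}

    def fit(self, labeled):
        # Retrain from scratch on all (text, label) pairs; label 1 = relevant.
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = {0: 0, 1: 0}
        for text, label in labeled:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def prob_relevant(self, text):
        vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (0, 1):
            total_words = sum(self.word_counts[label].values())
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for tok in tokenize(text):
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalize the two log scores into a probability for class 1.
        top = max(log_scores.values())
        e0 = math.exp(log_scores[0] - top)
        e1 = math.exp(log_scores[1] - top)
        return e1 / (e0 + e1)


def crawl_with_active_learning(batches, seed, oracle,
                               budget_per_step=1, threshold=0.5):
    """Assumed loop: score a batch, query uncertain items, retrain, filter."""
    labeled = list(seed)
    clf = TinyClassifier()
    clf.fit(labeled)
    corpus = []
    for batch in batches:  # one batch = one crawling step
        scored = [(clf.prob_relevant(text), text) for text in batch]
        # Uncertainty sampling: ask the annotator about the messages whose
        # predicted probability is closest to 0.5.
        scored.sort(key=lambda item: abs(item[0] - 0.5))
        for _, text in scored[:budget_per_step]:
            labeled.append((text, oracle(text)))
        clf.fit(labeled)
        # Keep only messages the retrained classifier considers relevant.
        corpus.extend(t for t in batch if clf.prob_relevant(t) >= threshold)
    return clf, corpus, labeled
```

In a real crawler, the retrained classifier would also decide which pages or threads to fetch next; the sketch only filters messages that have already been fetched, which keeps the loop small enough to read in one pass.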
DOI: 10.18127/j20729472-202302-06
Buy the article at the Highly Available Systems journal website: http://radiotec.ru/en/journal/Highly_available_systems/number/2023-2/article/23538
Volkov S. S., Devyatkin D. A., Sochenkov I. V., Shelmanov A. O. Automated approach to collect religious hatred messages from social media // Highly Available Systems. 2023. V. 19. № 2. P. 70–80.