Random forests of univariate decision trees are widely used for data and text analysis, but their applicability to large, sparse datasets is limited. One approach to this problem is to build forests of decision trees with oblique or more complex splits. However, the training time for such forests on large-scale data significantly exceeds that of univariate decision trees. In addition, different types of splits require different computing resources: a linear split can be trained on CPUs, while training nonlinear splits requires GPUs. This paper proposes a distributed architecture for training random forests in which the training process is parallelized at the level of individual splits. This reduces idle time of hardware resources, allows training tasks to be assigned to different types of computing nodes depending on the split type, and enables dynamic scaling of resources with load. Experiments have shown that the architecture can significantly reduce the total compute time required to train forests on large high-dimensional datasets. The presented architecture can serve as a basis for applied systems for data mining and image processing in various sectors of the economy: agriculture, industry, and transport.
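The split-level scheduling idea described above can be illustrated with a minimal sketch. All names here (`SplitTask`, `train_split`, `dispatch`) are hypothetical and not from the paper; two thread pools stand in for the CPU and GPU node groups that a real deployment would route tasks to through a distributed queue:

```python
# Hypothetical sketch of split-level task dispatch to heterogeneous workers.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class SplitTask:
    node_id: int
    kind: str  # "linear" -> CPU pool, "kernel" (nonlinear) -> GPU pool


def train_split(task: SplitTask) -> str:
    # Placeholder for the actual split optimization; a real system would
    # run a linear or kernel solver here depending on task.kind.
    return f"node {task.node_id}: trained {task.kind} split"


def dispatch(tasks):
    # Separate pools simulate the two classes of computing nodes; in a real
    # deployment these would be distinct machines, and pool sizes would be
    # scaled dynamically with the load.
    cpu_pool = ThreadPoolExecutor(max_workers=4)
    gpu_pool = ThreadPoolExecutor(max_workers=2)
    futures = [
        (cpu_pool if t.kind == "linear" else gpu_pool).submit(train_split, t)
        for t in tasks
    ]
    results = [f.result() for f in futures]
    cpu_pool.shutdown()
    gpu_pool.shutdown()
    return results


# Tree nodes emit split-training tasks independently, so splits from many
# trees can be trained concurrently rather than tree by tree.
tasks = [SplitTask(i, "linear" if i % 2 == 0 else "kernel") for i in range(4)]
results = dispatch(tasks)
```

The key property this sketches is that the unit of parallelism is a single split, not a whole tree, so fast linear splits never block on slow nonlinear ones and each split type runs on hardware suited to it.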
Devyatkin D. A. System for distributed training of Kernel forests. Highly Available Systems / Sistemy vysokoy dostupnosti. 2022. V. 18. № 3. P. 59−68. DOI: https://doi.org/10.18127/j20729472-202203-05 (in Russian)