Forest Packing: Fast, Parallel Decision Forests

J Browne, D Mhembere, TM Tomita, JT Vogelstein… - Proceedings of the 2019 SIAM International Conference on Data Mining, 2019 - SIAM
Abstract
Decision forests are popular machine learning techniques that help scientists extract knowledge from massive data sets. These tools remain popular because of their interpretability and ease of use, unlike other modern machine learning methods such as kernel machines and deep learning. Decision forests also scale well to large data because training and inference are trivially parallelizable, allowing for high inference throughput. A drawback of these forests, and an untenable property for many real-time applications, is their high inference latency, caused by the combination of large model sizes with random memory access patterns. We present memory packing techniques and a novel tree traversal method to overcome this deficiency. Our system groups trees into a hierarchical structure. At low levels, we pack the nodes of multiple trees into contiguous memory blocks so that each memory access fetches data for multiple trees. At higher levels, we use leaf cardinality to identify the most popular paths through a tree and collocate those paths in contiguous cache lines. We extend this layout with a reordering of the tree traversal algorithm to take advantage of the increased memory throughput provided by out-of-order execution and cache-line prefetching. Together, these optimizations increase the performance and parallel scalability of ensemble classification by a factor of ten over an optimized C++ implementation and a popular R-language implementation.
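The abstract's key traversal idea can be illustrated with a small sketch: rather than walking each tree root-to-leaf in turn, the nodes of all trees are laid out in one contiguous array and the traversal advances every tree one level per pass, so the independent node fetches can overlap via out-of-order execution and prefetching. The node layout and names below (`PackedNode`, `predict`) are hypothetical illustrations, not the paper's actual data structures:

```cpp
#include <vector>

// Hypothetical packed node: internal nodes test one feature against a
// threshold; a leaf stores its predicted class in `feature` and has left == -1.
struct PackedNode {
    int   feature;   // feature index at an internal node, class label at a leaf
    float threshold; // split threshold
    int   left;      // index of left child in the packed array, -1 at a leaf
    int   right;     // index of right child in the packed array
};

// Interleaved traversal sketch: advance each still-active tree one level per
// outer pass, so memory fetches for different trees are issued back-to-back
// instead of serializing on one tree's root-to-leaf dependency chain.
int predict(const std::vector<PackedNode>& nodes,
            const std::vector<int>& roots,        // root index of each tree
            const std::vector<float>& sample) {
    std::vector<int> cursor = roots;              // one cursor per tree
    std::vector<int> votes;
    bool active = true;
    while (active) {
        active = false;
        for (int& c : cursor) {
            if (c < 0) continue;                  // this tree already finished
            const PackedNode& n = nodes[c];
            if (n.left < 0) {                     // reached a leaf: record vote
                votes.push_back(n.feature);
                c = -1;
            } else {                              // descend one level
                c = (sample[n.feature] <= n.threshold) ? n.left : n.right;
                active = true;
            }
        }
    }
    // Majority vote over per-tree class labels (labels assumed 0..K-1).
    std::vector<int> counts;
    for (int v : votes) {
        if (v >= (int)counts.size()) counts.resize(v + 1, 0);
        ++counts[v];
    }
    int best = 0;
    for (int k = 1; k < (int)counts.size(); ++k)
        if (counts[k] > counts[best]) best = k;
    return best;
}
```

The paper's layout additionally bins the shallow, frequently visited levels of several trees into shared blocks and uses leaf cardinality to pack popular root-to-leaf paths into contiguous cache lines; the sketch above only shows the interleaved traversal over a flat packed array.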
Society for Industrial and Applied Mathematics