LightLDA: Big Topic Models on Modest Computer Clusters

J Yuan, F Gao, Q Ho, W Dai, J Wei, X Zheng, et al. Proceedings of the 24th International Conference on World Wide Web (WWW), 2015. dl.acm.org
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters of thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters) on a document collection with 200 billion tokens, a scale not yet reported even with thousands of machines. Our major contributions include:

1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size and which empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers;

2) a model-scheduling scheme to handle the big-model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in frugal use of limited memory capacity and network bandwidth;

3) a differential data structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory while maintaining high inference speed.

These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time-cost reductions with increasing cluster size.
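The O(1) claim in contribution 1) rests on alias tables: a proposal distribution is pre-tabulated once in O(K) time and then reused for many O(1) draws, with a Metropolis-Hastings accept/reject step correcting for the proposal being stale or factorized. Below is a minimal Python sketch of that pattern (Vose's alias method plus a generic MH step); it illustrates the technique rather than the paper's actual implementation, and `true_p`/`proposal_q` are hypothetical callables assumed to return strictly positive unnormalized probabilities.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) setup, then O(1) draws from a
    K-outcome discrete distribution with the given weights."""
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [1.0] * k, list(range(k))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] += scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias):
    """One O(1) draw: pick a bucket, then keep it or take its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

def mh_step(current, true_p, proposal_q, prob, alias):
    """One Metropolis-Hastings step: draw a candidate topic from the
    (possibly stale) alias-table proposal q, then accept it with
    probability min(1, p(new)*q(old) / (p(old)*q(new))). Reusing the
    table across many draws keeps the amortized per-token cost O(1)."""
    candidate = alias_draw(prob, alias)
    accept = min(1.0, (true_p(candidate) * proposal_q(current)) /
                      (true_p(current) * proposal_q(candidate)))
    return candidate if random.random() < accept else current
```

In LightLDA's setting the sampler cycles between a document-proposal and a word-proposal, each cheap to draw from, which is how the per-token cost stays constant even with a million topics.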
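Contribution 2) amounts to streaming the model rather than holding all of it: each worker walks over its data blocks and pulls only the slice of the word-topic table that the current block's vocabulary touches, then pushes its updates back before moving on. A hedged sketch of that loop follows; `fetch_rows`, `sample_block`, and `push_updates` are hypothetical stand-ins for parameter-server operations, not Petuum API calls.

```python
def schedule_model_slices(data_blocks, fetch_rows, sample_block, push_updates):
    """Sketch of slice-by-slice model scheduling: for each data block,
    fetch only the word-topic rows its vocabulary needs, run local
    sampling sweeps, then push deltas back and free the memory."""
    for block in data_blocks:
        vocab = {w for doc in block for w in doc}
        rows = fetch_rows(vocab)    # bounded memory: only the rows this block needs
        sample_block(block, rows)   # local Gibbs/MH sweeps over the block's tokens
        push_updates(rows)          # send count deltas back, release the slice
```

The design point is that memory and network cost scale with the vocabulary of one block at a time, not with the full trillion-parameter model.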
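Contribution 3) exploits the power-law shape of word frequencies: a handful of hot words account for most tokens and deserve dense rows, while the long tail is sparse. The following sketch shows one way such a two-tier store could look, assuming the hot/cold split is decided up front from corpus frequencies; the class and its methods are illustrative, not the paper's data structure.

```python
class DifferentialWordTopicTable:
    """Two-tier word-topic count store (an assumption based on the
    abstract): dense arrays for frequent words, sparse dicts for rare
    ones, so huge models fit in memory without slowing hot-path reads."""

    def __init__(self, num_topics, hot_words):
        self.num_topics = num_topics
        self.hot = {w: [0] * num_topics for w in hot_words}  # dense rows
        self.cold = {}  # word -> {topic: count}, sparse rows

    def incr(self, word, topic, delta=1):
        if word in self.hot:
            self.hot[word][topic] += delta
        else:
            row = self.cold.setdefault(word, {})
            row[topic] = row.get(topic, 0) + delta
            if row[topic] == 0:      # keep tail rows truly sparse
                del row[topic]

    def count(self, word, topic):
        if word in self.hot:
            return self.hot[word][topic]
        return self.cold.get(word, {}).get(topic, 0)
```

Dense rows give O(1) counter access for the words that dominate sampling time, while the sparse tier keeps per-rare-word memory proportional to the number of topics actually observed for that word.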