LightLDA: Big Topic Models on Modest Computer Clusters

J Yuan, F Gao, Q Ho, W Dai, J Wei, X Zheng, et al. Proceedings of the 24th International Conference on World Wide Web (WWW), 2015. dl.acm.org
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters of thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters) on a document collection with 200 billion tokens, a scale not yet reported even with thousands of machines. Our major contributions include:

1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size and which empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers;

2) a model-scheduling scheme to handle the big-model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in frugal use of limited memory capacity and network bandwidth;

3) a differential data structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory while maintaining high inference speed.

These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time-cost reductions with increasing cluster size.
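The O(1) claim in contribution 1) rests on alias tables: a proposal distribution is pre-tabulated once in O(K) time and then reused for many O(1) draws, with a Metropolis-Hastings accept/reject step correcting for the proposal being stale or factorized. Below is a minimal Python sketch of that pattern (Vose's alias method plus a generic MH step); it illustrates the technique rather than the paper's actual implementation, and `true_p`/`proposal_q` are hypothetical callables assumed to return strictly positive unnormalized probabilities.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) setup, then O(1) draws from a
    K-outcome discrete distribution with the given weights."""
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [1.0] * k, list(range(k))
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] += scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias):
    """One O(1) draw: pick a bucket, then keep it or take its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

def mh_step(current, true_p, proposal_q, prob, alias):
    """One Metropolis-Hastings step: draw a candidate topic from the
    (possibly stale) alias-table proposal q, then accept it with
    probability min(1, p(new)*q(old) / (p(old)*q(new))). Reusing the
    table across many draws keeps the amortized per-token cost O(1)."""
    candidate = alias_draw(prob, alias)
    accept = min(1.0, (true_p(candidate) * proposal_q(current)) /
                      (true_p(current) * proposal_q(candidate)))
    return candidate if random.random() < accept else current
```

In LightLDA's setting the sampler cycles between a document-proposal and a word-proposal, each cheap to draw from, which is how the per-token cost stays constant even with a million topics.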
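Contribution 2) amounts to streaming the model rather than holding all of it: each worker walks over its data blocks and pulls only the slice of the word-topic table that the current block's vocabulary touches, then pushes its updates back before moving on. A hedged sketch of that loop follows; `fetch_rows`, `sample_block`, and `push_updates` are hypothetical stand-ins for parameter-server operations, not Petuum API calls.

```python
def schedule_model_slices(data_blocks, fetch_rows, sample_block, push_updates):
    """Sketch of slice-by-slice model scheduling: for each data block,
    fetch only the word-topic rows its vocabulary needs, run local
    sampling sweeps, then push deltas back and free the memory."""
    for block in data_blocks:
        vocab = {w for doc in block for w in doc}
        rows = fetch_rows(vocab)    # bounded memory: only the rows this block needs
        sample_block(block, rows)   # local Gibbs/MH sweeps over the block's tokens
        push_updates(rows)          # send count deltas back, release the slice
```

The design point is that memory and network cost scale with the vocabulary of one block at a time, not with the full trillion-parameter model.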
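Contribution 3) exploits the power-law shape of word frequencies: a handful of hot words account for most tokens and deserve dense rows, while the long tail is sparse. The following sketch shows one way such a two-tier store could look, assuming the hot/cold split is decided up front from corpus frequencies; the class and its methods are illustrative, not the paper's data structure.

```python
class DifferentialWordTopicTable:
    """Two-tier word-topic count store (an assumption based on the
    abstract): dense arrays for frequent words, sparse dicts for rare
    ones, so huge models fit in memory without slowing hot-path reads."""

    def __init__(self, num_topics, hot_words):
        self.num_topics = num_topics
        self.hot = {w: [0] * num_topics for w in hot_words}  # dense rows
        self.cold = {}  # word -> {topic: count}, sparse rows

    def incr(self, word, topic, delta=1):
        if word in self.hot:
            self.hot[word][topic] += delta
        else:
            row = self.cold.setdefault(word, {})
            row[topic] = row.get(topic, 0) + delta
            if row[topic] == 0:      # keep tail rows truly sparse
                del row[topic]

    def count(self, word, topic):
        if word in self.hot:
            return self.hot[word][topic]
        return self.cold.get(word, {}).get(topic, 0)
```

Dense rows give O(1) counter access for the words that dominate sampling time, while the sparse tier keeps per-rare-word memory proportional to the number of topics actually observed for that word.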