Hybridized artificial intelligence models with nature-inspired algorithms for river flow modeling: A comprehensive review, assessment, and possible future research …

H Tao, SI Abba, AM Al-Areeq, F Tangang… - … Applications of Artificial …, 2024 - Elsevier
River flow (Q flow) is a hydrological process that considerably impacts the management and
sustainability of water resources. The literature has shown great potential for nature-inspired …
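
The hybridization the review surveys pairs a nature-inspired optimizer with a data-driven model. As a minimal sketch (the data, constants, and single-hidden-neuron model below are illustrative, not from the paper), particle swarm optimization can tune the parameters of a tiny flow predictor:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((200, 3))                          # synthetic lagged-flow predictors
    y = 2 * X[:, 0] - X[:, 1] + 0.5 * np.sin(3 * X[:, 2])  # synthetic Q_flow target

    def predict(w, X):
        # tiny model: one tanh neuron; w = [w1, w2, w3, bias, output scale]
        return w[4] * np.tanh(X @ w[:3] + w[3])

    def mse(w):
        return np.mean((predict(w, X) - y) ** 2)

    # Plain PSO over the 5 model parameters (swarm size and coefficients are illustrative).
    n, d = 30, 5
    pos = rng.normal(size=(n, d)); vel = np.zeros((n, d))
    pbest = pos.copy(); pbest_val = np.array([mse(p) for p in pos])
    g = pbest[pbest_val.argmin()].copy()
    for _ in range(200):
        r1, r2 = rng.random((n, d)), rng.random((n, d))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
        pos += vel
        vals = np.array([mse(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    print("best MSE:", pbest_val.min())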

Efficient memory management for large language model serving with pagedattention

W Kwon, Z Li, S Zhuang, Y Sheng, L Zheng… - Proceedings of the 29th …, 2023 - dl.acm.org
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
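
PagedAttention's core idea is to manage the KV cache like paged virtual memory: cached keys and values live in fixed-size physical blocks, and each sequence indirects through a block table, so memory is allocated on demand with no large contiguous reservation per request. A minimal sketch with illustrative block and pool sizes (not vLLM's actual data structures):

    import numpy as np

    BLOCK, HEAD_DIM, NUM_PHYS = 16, 8, 64             # illustrative sizes
    kv_pool = np.zeros((NUM_PHYS, BLOCK, HEAD_DIM))   # physical KV blocks
    free = list(range(NUM_PHYS))                      # free-block list

    class Sequence:
        def __init__(self):
            self.block_table = []     # logical block index -> physical block id
            self.len = 0
        def append_kv(self, kv):
            if self.len % BLOCK == 0:                 # current block full: map a new one
                self.block_table.append(free.pop())
            kv_pool[self.block_table[-1], self.len % BLOCK] = kv
            self.len += 1

    seq = Sequence()
    for t in range(40):
        seq.append_kv(np.full(HEAD_DIM, t, dtype=float))
    print(seq.block_table, seq.len)   # 40 tokens occupy just 3 physical blocks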

Orca: A distributed serving system for Transformer-based generative models

GI Yu, JS Jeong, GW Kim, S Kim, BG Chun - 16th USENIX Symposium …, 2022 - usenix.org
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …
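
Orca's key mechanism is iteration-level scheduling: the serving batch is recomposed at every decoding step, so finished requests leave and queued requests join without waiting for the whole batch to drain. A toy sketch of that loop (request lengths and the batch-size cap are made up):

    from collections import deque

    class Request:
        def __init__(self, rid, target_len):
            self.rid, self.generated, self.target = rid, 0, target_len

    def step(batch):
        # one decoding iteration: every batched request emits one token
        for r in batch:
            r.generated += 1

    queue = deque(Request(i, n) for i, n in enumerate([3, 7, 5]))
    batch, MAX_BATCH = [], 2
    while queue or batch:
        while queue and len(batch) < MAX_BATCH:   # admit new requests each iteration
            batch.append(queue.popleft())
        step(batch)
        for r in [r for r in batch if r.generated >= r.target]:
            print(f"request {r.rid} finished after {r.generated} tokens")
            batch.remove(r)                       # its slot frees up immediately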

A fast post-training pruning framework for transformers

W Kwon, S Kim, MW Mahoney… - Advances in …, 2022 - proceedings.neurips.cc
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
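
Retraining-free structured pruning reduces to scoring structures (e.g., attention heads) and masking the lowest-scoring ones at inference time. The paper derives importance from Fisher information on a small calibration set; the sketch below substitutes a plain weight-norm score to stay self-contained:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(12, 64))   # e.g., 12 heads' projections, one row per head

    # Score each head, keep the top half, and apply a binary mask; no
    # gradient steps or retraining are involved.
    scores = np.linalg.norm(W, axis=1)                  # stand-in importance score
    keep = scores >= np.sort(scores)[len(scores) // 2]  # prune 50% of heads
    W_pruned = W * keep[:, None].astype(W.dtype)
    print(f"kept {keep.sum()} of {len(keep)} heads")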

Dnnfusion: accelerating deep neural networks execution with advanced operator fusion

W Niu, J Guan, Y Wang, G Agrawal, B Ren - Proceedings of the 42nd …, 2021 - dl.acm.org
Deep Neural Networks (DNNs) have emerged as the core enabler of many major
applications on mobile devices. To achieve high accuracy, DNN models have become …
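
Operator fusion replaces a chain of kernels, each writing its intermediate result back to memory, with a single kernel that makes one pass over the data. A schematic contrast (tiny array; NumPy and a Python loop stand in for generated kernels):

    import numpy as np

    x = np.arange(-4, 4, dtype=np.float32)

    def unfused(x):
        a = x * 2.0                   # kernel 1 materializes a
        b = a + 1.0                   # kernel 2 reads a, materializes b
        return np.maximum(b, 0.0)     # kernel 3 reads b

    def fused(x):
        out = np.empty_like(x)
        for i in range(x.size):       # one pass, no intermediate buffers
            out[i] = max(x[i] * 2.0 + 1.0, 0.0)
        return out

    assert np.allclose(unfused(x), fused(x))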

ROLLER: Fast and efficient tensor compilation for deep learning

H Zhu, R Wu, Y Diao, S Ke, H Li, C Zhang… - … USENIX Symposium on …, 2022 - usenix.org
Despite recent advances in tensor compilers, it often costs hours to generate an efficient
kernel for an operator, a compute-intensive sub-task in a deep neural network (DNN), on …
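
Rather than searching a huge schedule space, a tile-based approach enumerates a small set of tile shapes aligned to the memory hierarchy and ranks them with an analytic model, which is why kernel construction takes seconds rather than hours. A toy version for a matmul, with an assumed fast-memory budget and a compute-intensity score (all constants illustrative):

    M = N = K = 1024
    SMEM_FLOATS = 12 * 1024            # illustrative shared-memory budget, in floats

    best = None
    for tm in (16, 32, 64, 128):
        for tn in (16, 32, 64, 128):
            for tk in (8, 16, 32):
                footprint = tm * tk + tk * tn        # tiles resident in fast memory
                if footprint > SMEM_FLOATS:
                    continue                          # violates the memory budget
                # flops per byte moved: a simple stand-in for the cost model
                intensity = (tm * tn * tk) / (footprint * 4)
                if best is None or intensity > best[0]:
                    best = (intensity, tm, tn, tk)
    print("chosen tile (M, N, K):", best[1:])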

nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training

Z Lin, Y Miao, Q Zhang, F Yang, Y Zhu, C Li… - … USENIX Symposium on …, 2024 - usenix.org
With the growing model size of deep neural networks (DNN), deep learning training is
increasingly relying on handcrafted search spaces to find efficient parallelization execution …
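
A constraint-guided plan generator enumerates candidate partitionings of an operator across devices, discards any plan that violates the stated constraints, and ranks the survivors by estimated cost. A toy version for a single matmul (device count, constraints, and the communication term are illustrative):

    D = 8
    M, N, K = 4096, 4096, 4096

    plans = []
    for dm in (1, 2, 4, 8):
        for dn in (1, 2, 4, 8):
            for dk in (1, 2, 4, 8):
                if dm * dn * dk != D:
                    continue                  # constraint: use all devices
                if M % dm or N % dn or K % dk:
                    continue                  # constraint: even partitioning
                # partial-sum reduction traffic when K is split (illustrative cost)
                comm = (dk - 1) * (M // dm) * (N // dn)
                plans.append((comm, dm, dn, dk))
    print(sorted(plans)[0])                   # cheapest valid plan (ties broken arbitrarily)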

AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction

S Zheng, R Chen, A Wei, Y Jin, Q Han, L Lu… - Proceedings of the 49th …, 2022 - dl.acm.org
Hardware specialization is a promising trend to sustain performance growth. Spatial
hardware accelerators that employ specialized and hierarchical computation and memory …
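
The hardware abstraction lets a compiler enumerate mappings from software loop dimensions to the accelerator's spatial levels and validate each one automatically, instead of relying on a few hand-written intrinsics. A toy enumeration over an assumed 16x16 PE array (all shapes invented):

    from itertools import permutations

    PE_ROWS, PE_COLS = 16, 16
    dims = {"m": 256, "n": 512, "k": 60}      # loop extents of a matmul (illustrative)

    valid = []
    for rows_dim, cols_dim, serial_dim in permutations(dims):
        # keep mappings whose spatial dimensions tile the PE array exactly
        if dims[rows_dim] % PE_ROWS == 0 and dims[cols_dim] % PE_COLS == 0:
            valid.append((rows_dim, cols_dim, serial_dim))
    print(valid)    # only mappings that serialize "k" survive here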

Apollo: Automatic partition-based operator fusion through layer by layer optimization

J Zhao, X Gao, R Xia, Z Zhang… - Proceedings of …, 2022 - proceedings.mlsys.org
We study fusion for deep neural networks (DNNs) in Apollo, a just-in-time (JIT) compilation
framework. It considers both memory- and compute-bound tensor operators for fusion …
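
Partition-based fusion first cuts the operator graph into fusible partitions, for instance by attaching runs of memory-bound elementwise ops to the compute-bound op that produces their input. A greedy one-dimensional sketch over a made-up op sequence:

    MEMORY_BOUND = {"add", "mul", "relu", "cast"}   # illustrative classification

    ops = ["matmul", "add", "relu", "matmul", "cast", "mul", "conv", "relu"]
    partitions, current = [], []
    for op in ops:
        if op in MEMORY_BOUND and current:
            current.append(op)          # fuse into the open partition
        else:
            if current:
                partitions.append(current)
            current = [op]              # a compute-bound op opens a new partition
    if current:
        partitions.append(current)
    # [['matmul','add','relu'], ['matmul','cast','mul'], ['conv','relu']]
    print(partitions)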

AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures

Z Zheng, X Yang, P Zhao, G Long, K Zhu… - Proceedings of the 27th …, 2022 - dl.acm.org
This work reveals that memory-intensive computation is a rising performance-critical factor in
recent machine learning models. Due to a unique set of new challenges, existing ML …
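
Stitching goes beyond simple elementwise fusion by letting dependent memory-intensive ops, reductions included, exchange intermediates through fast on-chip memory instead of round-tripping through global memory. A schematic contrast, with NumPy rows standing in for thread blocks:

    import numpy as np

    x = np.random.rand(4, 1024).astype(np.float32)

    # Two separate memory-intensive kernels: the reduction result goes out
    # to global memory and is read back by the elementwise op.
    def two_kernels(x):
        m = x.max(axis=1, keepdims=True)     # kernel 1: row-wise reduce
        return np.exp(x - m)                 # kernel 2: elementwise

    # "Stitched": one pass per row keeps the reduction result in fast
    # memory, the kind of locality the new optimization space exposes.
    def stitched(x):
        out = np.empty_like(x)
        for r in range(x.shape[0]):          # one "thread block" per row
            m = x[r].max()                   # stays in fast memory
            out[r] = np.exp(x[r] - m)
        return out

    assert np.allclose(two_kernels(x), stitched(x))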