Hybridized artificial intelligence models with nature-inspired algorithms for river flow modeling: A comprehensive review, assessment, and possible future research …
River flow (Qflow) is a hydrological process that considerably impacts the management and
sustainability of water resources. The literature has shown great potential for nature-inspired …
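To make the hybridization pattern this review surveys concrete, here is a minimal sketch of one common pairing: a nature-inspired optimizer (a basic particle swarm) tuning the weights of a small feedforward network on synthetic lagged-flow data. The network size, PSO coefficients, and data are illustrative assumptions, not taken from the review.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic lagged-flow series: predict Q(t) from Q(t-1), Q(t-2) (illustrative only).
q = np.sin(np.linspace(0, 20, 203)) + 0.1 * rng.standard_normal(203)
X = np.stack([q[1:-1], q[:-2]], axis=1)  # lag-1 and lag-2 features
y = q[2:]

def predict(w, X):
    """Tiny 2-4-1 MLP; w is a flat vector of 17 parameters."""
    W1, b1 = w[:8].reshape(2, 4), w[8:12]
    W2, b2 = w[12:16].reshape(4, 1), w[16]
    h = np.tanh(X @ W1 + b1)
    return (h @ W2).ravel() + b2

def rmse(w):
    return np.sqrt(np.mean((predict(w, X) - y) ** 2))

# Minimal particle swarm optimization over the 17 network parameters.
n_particles, dim = 30, 17
pos = rng.standard_normal((n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([rmse(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(200):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    f = np.array([rmse(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

print(f"PSO-trained MLP RMSE: {rmse(gbest):.4f}")
```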
Efficient memory management for large language model serving with PagedAttention
High throughput serving of large language models (LLMs) requires batching sufficiently
many requests at a time. However, existing systems struggle because the key-value cache …
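A minimal sketch of the block-table bookkeeping behind the PagedAttention idea: KV-cache memory is carved into fixed-size physical blocks, and each sequence maps logical blocks to physical ones on demand, so memory is only committed as tokens are generated. Class and method names here are hypothetical, not vLLM's actual API.

```python
# Hypothetical block-table KV-cache bookkeeping (not vLLM's real API).
BLOCK_SIZE = 16  # tokens per physical KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted; a real system would preempt")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self):
        self.block_table = []  # block_table[i] = physical block for logical block i
        self.num_tokens = 0

    def append_token(self, allocator):
        # Allocate a new physical block only when the last one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence()
for _ in range(40):                      # generate 40 tokens
    seq.append_token(allocator)
print(seq.num_tokens, seq.block_table)   # 40 tokens -> only 3 physical blocks
allocator.release(seq.block_table)       # finished sequence frees all blocks
```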
Orca: A distributed serving system for Transformer-Based generative models
Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have
recently attracted huge interest, emphasizing the need for system support for serving models …
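The snippet is cut off before the mechanism, but Orca is known for iteration-level scheduling: requests join and leave the running batch at every model iteration rather than at batch boundaries. A toy simulation of that policy follows, with hypothetical names and per-token lengths.

```python
import collections

# Illustrative iteration-level scheduling in Orca's spirit (hypothetical names).
Request = collections.namedtuple("Request", "rid remaining")

def serve(incoming, max_batch=4):
    queue = collections.deque(incoming)
    running, step = [], 0
    while queue or running:
        # Admit new requests at every iteration, not only when a batch drains.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # One model iteration: each running request produces one token.
        running = [Request(r.rid, r.remaining - 1) for r in running]
        done = [r.rid for r in running if r.remaining == 0]
        running = [r for r in running if r.remaining > 0]
        step += 1
        if done:
            print(f"iter {step}: finished {done}, batch now {len(running)}")

serve([Request("a", 2), Request("b", 5), Request("c", 3),
       Request("d", 1), Request("e", 4)])
```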
A fast post-training pruning framework for transformers
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
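A sketch of the post-training setting: attention heads are masked by an importance score with no retraining step. Note that the paper itself searches masks with a Fisher-information-based objective; the magnitude proxy below is a simplified stand-in used only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-head output projections of one attention layer.
num_heads, head_dim, hidden = 12, 64, 768
W_o = rng.standard_normal((num_heads, head_dim, hidden)) \
      * np.linspace(0.1, 1.0, num_heads)[:, None, None]

# Importance proxy: L2 norm of each head's output projection.
# (The real framework scores masks with Fisher information instead.)
scores = np.linalg.norm(W_o.reshape(num_heads, -1), axis=1)

def prune_heads(scores, latency_budget=0.5):
    """Keep the highest-scoring heads within a head-count budget; no retraining."""
    keep = int(np.ceil(latency_budget * len(scores)))
    kept = np.argsort(scores)[::-1][:keep]
    mask = np.zeros(len(scores), dtype=bool)
    mask[kept] = True
    return mask

mask = prune_heads(scores)
print(f"kept heads: {np.flatnonzero(mask)}  ({mask.sum()}/{num_heads})")
```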
DNNFusion: accelerating deep neural networks execution with advanced operator fusion
Deep Neural Networks (DNNs) have emerged as the core enabler of many major
applications on mobile devices. To achieve high accuracy, DNN models have become …
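DNNFusion's key move is classifying operators by their input/output mapping type and deciding fusion legality from that classification rather than from hand-written pairwise rules. The toy legality table and greedy chain pass below are illustrative simplifications, not the paper's actual rules.

```python
# Simplified mapping-type-based fusion (inspired by DNNFusion's classification;
# the table and ops here are illustrative stand-ins).
ONE_TO_ONE, SHUFFLE, REDUCE = "one-to-one", "shuffle", "reduce"

OP_KIND = {
    "add": ONE_TO_ONE, "relu": ONE_TO_ONE, "mul": ONE_TO_ONE,
    "transpose": SHUFFLE, "softmax": REDUCE, "matmul": REDUCE,
}

def fusable(kind_a, kind_b):
    # One-to-one ops fuse freely; a shuffle can absorb a following one-to-one op.
    return kind_a == ONE_TO_ONE or (kind_a == SHUFFLE and kind_b == ONE_TO_ONE)

def fuse_chain(ops):
    """Greedily group a linear operator chain into fused blocks."""
    groups, current = [], [ops[0]]
    for prev, op in zip(ops, ops[1:]):
        if fusable(OP_KIND[prev], OP_KIND[op]):
            current.append(op)
        else:
            groups.append(current)
            current = [op]
    groups.append(current)
    return groups

print(fuse_chain(["matmul", "add", "relu", "transpose", "mul", "softmax", "add"]))
# -> [['matmul'], ['add', 'relu', 'transpose', 'mul', 'softmax'], ['add']]
```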
ROLLER: Fast and efficient tensor compilation for deep learning
Despite recent advances in tensor compilers, it often costs hours to generate an efficient
kernel for an operator, a compute-intensive sub-task in a deep neural network (DNN), on …
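ROLLER constructs kernels from hardware-aligned tiles (rTiles) and ranks candidates with an analytic model instead of on-device measurement, which is what cuts generation time from hours to seconds. A toy version of that enumerate-align-score loop follows; all hardware constants are made up.

```python
# Toy rTile-style enumeration (in ROLLER's spirit; numbers are hypothetical).
ALIGN = 32             # assumed alignment unit, e.g. a memory transaction width
SMEM_BYTES = 48 * 1024 # assumed shared-memory budget per thread block

def tile_candidates(M, N, K, dtype_bytes=4):
    for tm in (32, 64, 128):
        for tn in (32, 64, 128):
            for tk in (8, 16, 32):
                smem = (tm * tk + tk * tn) * dtype_bytes
                # Keep only tiles aligned to hardware and fitting in shared memory.
                if tm % ALIGN or tn % ALIGN or smem > SMEM_BYTES:
                    continue
                # Analytic score: favor a high compute-to-memory-traffic ratio,
                # instead of benchmarking each candidate on the device.
                flops = 2 * tm * tn * tk
                traffic = (tm * tk + tk * tn + tm * tn) * dtype_bytes
                yield (tm, tn, tk), flops / traffic

best = max(tile_candidates(4096, 4096, 4096), key=lambda c: c[1])
print("picked rTile (tm, tn, tk):", best[0], f"score={best[1]:.2f}")
```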
nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
With the growing model size of deep neural networks (DNN), deep learning training is
increasingly relying on handcrafted search spaces to find efficient parallelization execution …
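A toy rendering of constraint-guided plan generation: enumerate per-operator parallelization choices, then let user-supplied constraints prune the space before any cost evaluation. The operators, primitives, and constraints below are hypothetical; nnScaler's real primitives operate on a dataflow graph of the model.

```python
import itertools

# Hypothetical per-op parallelization choices and user constraints.
OPS = ["embed", "attn", "mlp"]
PARTITIONS = ["data", "tensor", "replicate"]

def valid(plan, constraints):
    return all(rule(plan) for rule in constraints)

# Example constraints that prune the search space:
constraints = [
    lambda p: p["embed"] != "tensor",  # e.g., never tensor-partition the embedding
    lambda p: p["attn"] == p["mlp"],   # e.g., align attn and mlp partitioning
]

plans = [dict(zip(OPS, choice))
         for choice in itertools.product(PARTITIONS, repeat=len(OPS))]
feasible = [p for p in plans if valid(p, constraints)]
print(f"{len(feasible)}/{len(plans)} plans survive the constraints")
for p in feasible:
    print(p)
```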
AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction
Hardware specialization is a promising trend to sustain performance growth. Spatial
hardware accelerators that employ specialized and hierarchical computation and memory …
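AMOS validates mappings from a software iteration space to hardware intrinsics through an abstraction layer instead of hand-written mapping rules. The sketch below enumerates axis bindings of a matmul against a hypothetical 16x8x16 intrinsic and keeps only the ones the divisibility check admits; shapes and rules are stand-ins.

```python
import itertools

# Toy software-to-intrinsic mapping check (inspired by AMOS; the intrinsic
# shape is a stand-in for, e.g., a tensor-core-style primitive).
INTRINSIC = {"m": 16, "n": 8, "k": 16}

def valid_mappings(M, N, K):
    """Enumerate which problem axes can be bound to which intrinsic axes."""
    problem = {"i": M, "j": N, "l": K}
    for perm in itertools.permutations(problem):       # axis binding choices
        binding = dict(zip(("m", "n", "k"), perm))
        # A binding is valid if every bound axis is divisible by the
        # intrinsic extent on the hardware axis it is mapped to.
        if all(problem[ax] % INTRINSIC[hw] == 0 for hw, ax in binding.items()):
            yield binding

for b in valid_mappings(M=64, N=48, K=24):
    print("software->intrinsic binding:", b)   # only 2 of 6 bindings survive
```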
Apollo: Automatic partition-based operator fusion through layer by layer optimization
J Zhao, X Gao, R Xia, Z Zhang… - Proceedings of …, 2022 - proceedings.mlsys.org
We study fusion for deep neural networks (DNNs) in a just-in-time (JIT) compilation
framework, Apollo. It considers both memory- and compute-bound tensor operators for fusion …
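A toy version of partition-based fusion in Apollo's spirit: compute-bound operators anchor new partitions and absorb the memory-bound operators that follow them. The greedy pass below is a simplification; the paper's pass works on a JIT IR with several layered sub-passes.

```python
# Toy partition-based fusion pass (illustrative only).
MEMORY_BOUND = {"add", "relu", "cast", "bias"}
COMPUTE_BOUND = {"matmul", "conv"}

def partition(ops):
    """Greedy layer-by-layer grouping: a compute-bound op opens a new
    partition; memory-bound ops fuse into the current one."""
    parts = []
    for op in ops:
        if op in COMPUTE_BOUND or not parts:
            parts.append([op])        # new partition anchored on this op
        else:
            parts[-1].append(op)      # fuse memory-bound op into current partition
    return parts

print(partition(["cast", "matmul", "bias", "relu", "conv", "add", "cast"]))
# -> [['cast'], ['matmul', 'bias', 'relu'], ['conv', 'add', 'cast']]
```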
AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures
This work reveals that memory-intensive computation is a rising performance-critical factor in
recent machine learning models. Due to a unique set of new challenges, existing ML …
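A sketch of the stitching idea: adjacent memory-intensive operators stay in one kernel by choosing a data-passing scheme per producer-consumer edge (registers, shared memory, or global memory). The scheme names echo the paper's hierarchy of reuse levels, but the selection rules below are simplified stand-ins.

```python
# Toy stitching-scheme selection for memory-intensive op pairs (illustrative).
ELEMENTWISE = {"add", "mul", "relu", "cast"}
REDUCTION = {"reduce_sum", "softmax"}

def stitch_scheme(producer, consumer):
    """Pick how the producer's output is passed to the consumer."""
    if producer in ELEMENTWISE and consumer in ELEMENTWISE:
        return "local"      # keep values in registers, thread-local reuse
    if producer in ELEMENTWISE and consumer in REDUCTION:
        return "regional"   # stage through shared memory within a block
    return "global"         # fall back to global memory between kernels

pipeline = ["cast", "add", "relu", "reduce_sum", "mul"]
for producer, consumer in zip(pipeline, pipeline[1:]):
    print(f"{producer:10s} -> {consumer:10s}: {stitch_scheme(producer, consumer)}")
```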