Best practices and lessons learned from deploying and operating large-scale data-centric parallel file systems

S Oral, J Simmons, J Hill, D Leverman… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale
parallel file systems (PFS) to support its operations. During this process, OLCF acquired …

[PDF] OLCF's 1 TB/s, next-generation Lustre file system

S Oral, DA Dillow, D Fuller, J Hill, D Leverman… - Proceedings of Cray …, 2013 - cug.org
The Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National
Laboratory (ORNL) has a long history of deploying the world's fastest supercomputers to …

Theta: Rapid installation and acceptance of an XC40 KNL system

K Harms, T Leggett, B Allen, S Coghlan… - Concurrency and …, 2018 - Wiley Online Library
In order to provide a stepping stone from the Argonne Leadership Computing Facility's
(ALCF) world class production 10 petaFLOP IBM BlueGene/Q system, Mira, to its next …

Improving large-scale storage system performance via topology-aware and balanced data placement

F Wang, S Oral, S Gupta, D Tiwari… - 2014 20th IEEE …, 2014 - ieeexplore.ieee.org
With the advent of big data, the I/O subsystems of large-scale compute clusters are
becoming a center of focus. More applications are putting greater demands on end-to-end …

[HTML] Accelerating network communication and I/O in scientific high performance computing environments

SM Neuwirth - 2019 - ub.uni-heidelberg.de
High performance computing has become one of the major drivers behind technology
inventions and science discoveries. Originally driven through the increase of operating …

PIFS-Rec: Process-In-Fabric-Switch for Large-Scale Recommendation System Inferences

P Huo, A Devulapally, H Al Maruf… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Deep Learning Recommendation Models (DLRMs) have become increasingly popular and
prevalent in today's datacenters, consuming most of the AI inference cycles. The …

Alleviating I/O interference through workload-aware striping and load-balancing on parallel file systems

Y Tsujita, T Yoshizaki, K Yamamoto, F Sueyasu… - … Conference, ISC High …, 2017 - Springer
Nowadays parallel file systems have been widely used in many supercomputers. Lustre is
one of the most used parallel file systems, and its enhanced file system named FEFS (Fujitsu …

[PDF] A next-generation parallel file system environment for the OLCF

GM Shipman, DA Dillow, D Fuller… - Proceedings of Cray …, 2012 - cug.org
When deployed in 2008/2009 the Spider system at the Oak Ridge National Laboratory's
Leadership Computing Facility (OLCF) was the world's largest scale Lustre parallel file …

[PDF] I/O router placement and fine-grained routing on Titan to support Spider II

M Ezell, D Dillow, S Oral, F Wang, D Tiwari… - Cray User Group …, 2014 - cug.org
The Oak Ridge Leadership Computing Facility (OLCF) introduced the concept of Fine-
Grained Routing in 2008 to improve I/O performance between the Jaguar supercomputer …

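Cooperative burst buffer technique for layered hybrid storage architectures

周恩强, 张伟, 董勇, 卢宇彤 - Journal of National University of Defense Technology, 2015 - journal.nudt.edu.cn
The volume of data produced and analyzed by scientific computing keeps growing, and the storage systems of
high-performance computers face major challenges in architecture and software management methods. Targeting the new layered hybrid storage architecture of the Tianhe-2 system, this work proposes an application-coupled cooperative …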