
ProtTrans: Toward understanding the language of life through self-supervised learning

Authors

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, Burkhard Rost

Publication date

2021/7/7

Journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Volume

Issue

Pages

7112-7127

Publisher

IEEE

Description

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2 …
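
For readers unfamiliar with the Q3 figure quoted in the description, the following is a minimal sketch of how a 3-state per-residue accuracy is commonly computed: the fraction of residues whose predicted class matches the observed class. The function name `q3_accuracy`, the H/E/C label alphabet, and the toy strings are illustrative assumptions, not taken from the paper or its code.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the fraction of residues whose predicted 3-state label
    matches the observed label (one character per residue)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical toy labels (assumed alphabet): H = helix, E = strand, C = other.
observed = "CCHHHHHCCEEEECC"
predicted = "CCHHHHCCCEEEECH"

print(f"Q3 = {q3_accuracy(predicted, observed):.1%}")
```

In practice the score is usually aggregated over all residues of a test set rather than over a single short sequence as in this toy case.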

Total citations

Cited by 1294

Citations per year: 2020: 11, 2021: 111, 2022: 272, 2023: 468, 2024: 426

Scholar articles

ProtTrans: Toward understanding the language of life through self-supervised learning

A Elnaggar, M Heinzinger, C Dallago, G Rehawi… - IEEE Transactions on Pattern Analysis and Machine …, 2021

Cited by 1293 · Related articles · All 17 versions

ProtTrans: Toward understanding the language of life through self-supervised learning*

A Elnaggar, M Heinzinger, C Dallago, G Rehawi… - IEEE Transactions on

Cited by 3 · Related articles