Authors
Henning Redestig, Floris van der Flier, David Estell, Sina Pricelius, Lydia Dankmeyer, Sander van Stigt Thans, Harm Mulder, Rei Otsuka, Frits Goedegebuur, Laurens Lammerts, Diego Staphorst, Aalt DJ van Dijk, Dick de Ridder
Publication date
2023/11/20
Description
Protein engineering increasingly relies on machine learning models to computationally pre-screen variants and identify those that meet the target requirements. Although machine learning approaches have proven effective, their performance on prospective screening data has room for improvement. Prediction accuracy can vary greatly from one variant to the next, and so far it has been unclear what characterizes variants that are associated with large model error. To answer this question, we designed and generated a dataset that can be stratified by four structural characteristics (buriedness, number of contact residues, proximity to the active site, and presence of secondary structure). We found that variants with multiple mutations that are buried, closely connected with other residues, or close to the active site, which we call challenging mutations, are harder to model than their counterparts (i.e., exposed, loosely connected, or far from the active site). This effect emerges only for variants with multiple challenging mutations; single mutations at these sites were not harder to model. Our findings indicate that variants with challenging mutations are appropriate benchmarking targets for assessing model quality, and that stratified dataset design can be leveraged to highlight areas of improvement for machine-learning-guided protein engineering.