Artefacts and biases affecting the evaluation of scoring functions on decoy sets for protein structure prediction

Julia Handl, Joshua Knowles and Simon C. Lovell

Decoy data sets, consisting of a solved protein structure and numerous alternative native-like structures, are in common use for both the evaluation and the development of scoring functions in protein structure prediction. Several pitfalls with the use of these data sets have been identified in the literature, as well as useful guidelines for generating more effective decoy data sets. We contribute to this ongoing discussion an empirical assessment of several decoy data sets commonly used in experimental studies. We find that artefacts in the large majority of these data make it trivial to discriminate the native structure, and even the use of several decoy sets does not insure against this. More fundamentally, sampling biases present in the way these data are generated or used can strongly affect measurement of the correlation between score and RMSD to the native. We demonstrate how, depending on the type of bias and the evaluation context, sampling biases may lead to either over- or under-estimation of the quality of scoring terms, functions or methods.

Supplementary material

Data sets used in this work

Software used in this work

  • The R Project for Statistical Computing. The specific functions used were cor(..,method="spearman") to compute Spearman rank correlations, cor.test(..,method="spearman") to test the statistical significance of a given correlation, and parcoord(..) from the MASS library to generate parallel coordinates plots.
  • The Rosetta method for protein structure prediction, version 2.3.
  • The TINKER molecular modelling software, which implements the Amber99 all-atom energy function.
  • C code to select the Pareto optimal set from amongst a set of vectors.
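
For readers without access to the tools listed above, the two generic computations they mention (Spearman rank correlation and selection of the Pareto optimal set) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R or C code. The function names are hypothetical, ties are ranked by averaging, and every objective is assumed to be minimised, as would be the case for energies or RMSD.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better
    in at least one (assumption: all objectives are minimised)."""
    return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))


def pareto_optimal(vectors: Sequence[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the vectors that no other vector dominates (quadratic-time scan)."""
    return [
        v
        for i, v in enumerate(vectors)
        if not any(dominates(w, v) for j, w in enumerate(vectors) if j != i)
    ]
```

For example, spearman(scores, rmsds) would take one score and one RMSD value per decoy, and pareto_optimal(rows) would take one tuple of scoring-term values per decoy. The pairwise scan is adequate for decoy sets of a few thousand structures; larger sets would call for a sort-based method.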