Decoy data sets, consisting of a solved protein structure and numerous alternative native-like structures, are in common use for both the evaluation and the development of scoring functions in protein structure prediction. Several pitfalls in the use of these data sets have been identified in the literature, along with useful guidelines for generating more effective decoy data sets. We contribute to this ongoing discussion an empirical assessment of several decoy data sets commonly used in experimental studies. We find that artefacts in the large majority of these data make it trivial to discriminate the native structure, and that even the use of several decoy sets does not guard against this. More fundamentally, sampling biases present in the way these data are generated or used can strongly affect measurements of the correlation between score and RMSD to the native. We demonstrate how, depending on the type of bias and the evaluation context, sampling biases may lead to either over- or under-estimation of the quality of scoring terms, functions or methods.
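The effect of sampling bias on a score-RMSD correlation can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration and is not taken from the data sets assessed in this work: the decoys are simulated, RMSD is drawn uniformly from 0 to 10 Å, the hypothetical score is RMSD plus Gaussian noise, and the two biased subsets (a narrow RMSD band, and only the extremes of the RMSD range) are arbitrary choices.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def correlation_on(pairs):
    """Score-RMSD correlation over a list of (rmsd, score) pairs."""
    return pearson([r for r, _ in pairs], [s for _, s in pairs])


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical decoys (assumption): RMSD uniform in [0, 10] Å,
# score = RMSD + Gaussian noise with standard deviation 2.
decoys = []
for _ in range(2000):
    rmsd = rng.uniform(0.0, 10.0)
    decoys.append((rmsd, rmsd + rng.gauss(0.0, 2.0)))

# Unbiased reference: all decoys.
r_full = correlation_on(decoys)

# Bias 1: only decoys in a narrow RMSD band (range restriction).
r_narrow = correlation_on([d for d in decoys if 4.0 <= d[0] <= 6.0])

# Bias 2: only very near-native and very far decoys (extreme groups).
r_extreme = correlation_on([d for d in decoys if d[0] <= 1.0 or d[0] >= 9.0])

print(f"all decoys:       r = {r_full:.2f}")
print(f"narrow RMSD band: r = {r_narrow:.2f}")
print(f"extremes only:    r = {r_extreme:.2f}")
```

In this toy setting the same scoring function appears weaker when decoys are confined to a narrow RMSD band and stronger when only the extremes are sampled, even though the underlying score-RMSD relationship is unchanged.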
Supplementary material
Data sets used in this work
Software used in this work