TY - GEN
T1 - Evaluating the evaluations of code recommender systems
T2 - 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016
AU - Proksch, Sebastian
AU - Amann, Sven
AU - Nadi, Sarah
AU - Mezini, Mira
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/8/25
Y1 - 2016/8/25
N2 - While researchers develop many new exciting code recommender systems, such as method-call completion, code-snippet completion, or code search, an accurate evaluation of such systems is always a challenge. We analyzed the current literature and found that most of the current evaluations rely on artificial queries extracted from released code, which begs the question: Do such evaluations reflect real-life usages? To answer this question, we capture 6,189 fine-grained development histories from real IDE interactions. We use them as a ground truth and extract 7,157 real queries for a specific method-call recommender system. We compare the results of such real queries with different artificial evaluation strategies and check several assumptions that are repeatedly used in research, but never empirically evaluated. We find that an evolving context that is often observed in practice has a major effect on the prediction quality of recommender systems, but is not commonly reflected in artificial evaluations.
AB - While researchers develop many new exciting code recommender systems, such as method-call completion, code-snippet completion, or code search, an accurate evaluation of such systems is always a challenge. We analyzed the current literature and found that most of the current evaluations rely on artificial queries extracted from released code, which begs the question: Do such evaluations reflect real-life usages? To answer this question, we capture 6,189 fine-grained development histories from real IDE interactions. We use them as a ground truth and extract 7,157 real queries for a specific method-call recommender system. We compare the results of such real queries with different artificial evaluation strategies and check several assumptions that are repeatedly used in research, but never empirically evaluated. We find that an evolving context that is often observed in practice has a major effect on the prediction quality of recommender systems, but is not commonly reflected in artificial evaluations.
KW - Artificial Evaluation
KW - Empirical Study
KW - IDE Interaction Data
UR - http://www.scopus.com/inward/record.url?scp=84989170526&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84989170526&partnerID=8YFLogxK
U2 - 10.1145/2970276.2970330
DO - 10.1145/2970276.2970330
M3 - Conference contribution
AN - SCOPUS:84989170526
T3 - ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering
SP - 111
EP - 121
BT - ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering
A2 - Khurshid, Sarfraz
A2 - Lo, David
A2 - Apel, Sven
PB - Association for Computing Machinery, Inc
Y2 - 3 September 2016 through 7 September 2016
ER -