Selecting third-party libraries: the data scientist’s perspective

Sarah Nadi, Nourhan Sakr

Research output: Contribution to journal > Article > peer-review

Abstract

With the increased reliance on data-driven decisions and software services, data scientists are becoming an integral part of many software teams and enterprise operations. To perform their tasks, data scientists rely on various third-party libraries (e.g., pandas in Python for data wrangling or ggplot in R for data visualization). Selecting the right library to use is often a difficult task, with many factors influencing this selection. While there has been a lot of research on the factors that software developers take into account when selecting a library, it is not clear if these factors influence data scientists’ library selection in the same way, especially given several differences between the two groups. To address this gap, we replicate a recent survey of library selection factors, but target data scientists instead of software developers. Our survey of 90 participants shows that data scientists consider several factors when selecting libraries to use, with technical factors such as the usability of the library, fit for purpose, and documentation being the three most influential factors. Additionally, we find that there are 11 factors that data scientists rate differently than software developers. For example, data scientists are influenced more by the collective experience of the community but less by the library’s security or license. We also uncover new factors that influence data scientists’ library selection, such as the statistical rigor of the library. We triangulate our survey results with feedback from five focus groups involving 18 additional data science experts with various roles, whose input allows us to further interpret our survey results. We discuss the implications of our findings for data science library maintainers as well as researchers who want to design recommender and/or comparison systems that help data scientists with library selection.

Original language: English (US)
Article number: 15
Journal: Empirical Software Engineering
Volume: 28
Issue number: 1
State: Published - Jan 2023

Keywords

  • Data science
  • Data scientists
  • Library selection
  • Software libraries
  • Third-party dependencies

ASJC Scopus subject areas

  • Software
