TY - GEN
T1 - A Large-scale Data Set and an Empirical Study of Docker Images Hosted on Docker Hub
AU - Lin, Changyuan
AU - Nadi, Sarah
AU - Khazaei, Hamzeh
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/9
Y1 - 2020/9
N2 - Docker is currently one of the most popular containerization solutions. Previous work investigated various characteristics of the Docker ecosystem, but has mainly focused on Dockerfiles from GitHub, limiting the type of questions that can be asked, and did not investigate evolution aspects. In this paper, we create a recent and more comprehensive data set by collecting data from Docker Hub, GitHub, and Bitbucket. Our data set contains information about 3,364,529 Docker images and 378,615 git repositories behind them. Using this data set, we conduct a large-scale empirical study with four research questions where we reproduce previously explored characteristics (e.g., popular languages and base images), investigate new characteristics such as image tagging practices, and study evolution trends. Our results demonstrate the maturity of the Docker ecosystem: we find more reliance on ready-to-use language and application base images as opposed to yet-to-be-configured OS images, a downward trend of Docker image sizes demonstrating the adoption of best practices of keeping images small, and a declining trend in the number of smells in Dockerfiles suggesting a general improvement in quality. On the downside, we find an upward trend in using obsolete OS base images, posing security risks, and find problematic usages of the latest tag, including version lagging. Overall, our results bring good news such as more developers following best practices, but they also indicate the need to build tools and infrastructure embracing new trends and addressing potential issues.
UR - http://www.scopus.com/inward/record.url?scp=85096684066&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096684066&partnerID=8YFLogxK
U2 - 10.1109/ICSME46990.2020.00043
DO - 10.1109/ICSME46990.2020.00043
M3 - Conference contribution
AN - SCOPUS:85096684066
T3 - Proceedings - 2020 IEEE International Conference on Software Maintenance and Evolution, ICSME 2020
SP - 371
EP - 381
BT - Proceedings - 2020 IEEE International Conference on Software Maintenance and Evolution, ICSME 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th IEEE International Conference on Software Maintenance and Evolution, ICSME 2020
Y2 - 27 September 2020 through 3 October 2020
ER -