A Large-scale Data Set and an Empirical Study of Docker Images Hosted on Docker Hub

Changyuan Lin, Sarah Nadi, Hamzeh Khazaei

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

Docker is currently one of the most popular containerization solutions. Previous work investigated various characteristics of the Docker ecosystem, but has mainly focused on Dockerfiles from GitHub, limiting the type of questions that can be asked, and did not investigate evolution aspects. In this paper, we create a recent and more comprehensive data set by collecting data from Docker Hub, GitHub, and Bitbucket. Our data set contains information about 3,364,529 Docker images and 378,615 git repositories behind them. Using this data set, we conduct a large-scale empirical study with four research questions where we reproduce previously explored characteristics (e.g., popular languages and base images), investigate new characteristics such as image tagging practices, and study evolution trends. Our results demonstrate the maturity of the Docker ecosystem: we find more reliance on ready-to-use language and application base images as opposed to yet-to-be-configured OS images, a downward trend of Docker image sizes demonstrating the adoption of best practices of keeping images small, and a declining trend in the number of smells in Dockerfiles suggesting a general improvement in quality. On the downside, we find an upward trend in using obsolete OS base images, posing security risks, and find problematic usages of the latest tag, including version lagging. Overall, our results bring good news such as more developers following best practices, but they also indicate the need to build tools and infrastructure embracing new trends and addressing potential issues.

Original language: English (US)
Title of host publication: Proceedings - 2020 IEEE International Conference on Software Maintenance and Evolution, ICSME 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 371-381
Number of pages: 11
ISBN (Electronic): 9781728156194
State: Published - Sep 2020
Event: 36th IEEE International Conference on Software Maintenance and Evolution, ICSME 2020 - Virtual, Adelaide, Australia
Duration: Sep 27 2020 to Oct 3 2020

Publication series

Name: Proceedings - 2020 IEEE International Conference on Software Maintenance and Evolution, ICSME 2020

Conference

Conference: 36th IEEE International Conference on Software Maintenance and Evolution, ICSME 2020
Country/Territory: Australia
City: Virtual, Adelaide
Period: 9/27/20 to 10/3/20

ASJC Scopus subject areas

  • Software
  • Safety, Risk, Reliability and Quality
  • Modeling and Simulation
