CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot

Liang Niu, Shujaat Mirza, Zayd Maradni, Christina Pöpper

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Code generation language models are trained on billions of lines of source code to provide code generation and auto-completion features, like those offered by code assistant GitHub Copilot with more than a million users. These datasets may contain sensitive personal information—personally identifiable, private, or secret—that these models may regurgitate. This paper introduces and evaluates a semi-automated pipeline for extracting sensitive personal information from the Codex model used in GitHub Copilot. We employ carefully-designed templates to construct prompts that are more likely to result in privacy leaks. To overcome the non-public training data, we propose a semi-automated filtering method using a blind membership inference attack. We validate the effectiveness of our membership inference approach on different code generation models. We utilize hit rate through the GitHub Search API as a distinguishing heuristic followed by human-in-the-loop evaluation, uncovering that approximately 8% (43) of the prompts yield privacy leaks. Notably, we observe that the model tends to produce indirect leaks, compromising privacy as contextual integrity by generating information from individuals closely related to the queried subject in the training corpus.

Original languageEnglish (US)
Title of host publication32nd USENIX Security Symposium, USENIX Security 2023
PublisherUSENIX Association
Pages2133-2150
Number of pages18
ISBN (Electronic)9781713879497
StatePublished - 2023
Event32nd USENIX Security Symposium, USENIX Security 2023 - Anaheim, United States
Duration: Aug 9 2023Aug 11 2023

Publication series

Name32nd USENIX Security Symposium, USENIX Security 2023
Volume3

Conference

Conference32nd USENIX Security Symposium, USENIX Security 2023
Country/TerritoryUnited States
CityAnaheim
Period8/9/238/11/23

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Safety, Risk, Reliability and Quality

Fingerprint

Dive into the research topics of 'CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot'. Together they form a unique fingerprint.

Cite this