ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire

Research output: Contribution to journal › Conference article › peer-review

Abstract

Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
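The four stages named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be pictured with a minimal Python sketch. This is not the authors' implementation: every function name, the prompt wording, the longest-value sampling heuristic, and the fuzzy-matching remap rule are assumptions made for illustration, and `query_model` is a hypothetical callable standing in for whatever LLM is used.

```python
import difflib
import random
from typing import Callable, List, Optional, Sequence


def sample_context(column: Sequence[str], k: int = 5, seed: int = 0) -> List[str]:
    """Context sampling: keep up to k distinct, non-empty values.

    Assumed heuristic: prefer longer values, with ties broken at random.
    """
    unique = sorted({v.strip() for v in column if v and v.strip()})
    random.Random(seed).shuffle(unique)
    unique.sort(key=len, reverse=True)  # stable sort keeps shuffled tie order
    return unique[:k]


def serialize_prompt(samples: Sequence[str], labels: Sequence[str]) -> str:
    """Prompt serialization: turn samples and the label set into one string."""
    return (
        "Column values: " + ", ".join(samples) + "\n"
        "Options: " + ", ".join(labels) + "\n"
        "Answer with exactly one option."
    )


def remap_label(raw: str, labels: Sequence[str], fallback: str) -> str:
    """Label remapping: coerce free-form model output onto the allowed labels."""
    cleaned = raw.strip().lower()
    by_lower = {label.lower(): label for label in labels}
    if cleaned in by_lower:  # exact match
        return by_lower[cleaned]
    for low, label in by_lower.items():  # label mentioned inside a longer answer
        if low in cleaned:
            return label
    # Assumed rule: fall back to the closest label by string similarity.
    close = difflib.get_close_matches(cleaned, list(by_lower), n=1, cutoff=0.6)
    return by_lower[close[0]] if close else fallback


def annotate_column(
    column: Sequence[str],
    labels: Sequence[str],
    query_model: Callable[[str], str],
    k: int = 5,
    fallback: Optional[str] = None,
) -> str:
    """Run the four stages in order and return one label from `labels`."""
    samples = sample_context(column, k)
    prompt = serialize_prompt(samples, labels)
    raw = query_model(prompt)  # model querying (hypothetical callable)
    return remap_label(raw, labels, fallback if fallback is not None else labels[0])
```

Because the label set is passed in at call time rather than fixed during training, a sketch like this can be pointed at a new set of types without retraining, which is the zero-shot property the abstract emphasizes.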

Original language: English (US)
Pages (from-to): 2279-2292
Number of pages: 14
Journal: Proceedings of the VLDB Endowment
Volume: 17
Issue number: 9
State: Published - 2024
Event: 50th International Conference on Very Large Data Bases, VLDB 2024 - Guangzhou, China
Duration: Aug 25 2024 to Aug 29 2024

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • General Computer Science
