Feature Extraction for Large-Scale Text Collections

Luke Gallagher, Antonio Mallia, J. Shane Culpepper, Torsten Suel, B. Barla Cambazoglu

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Feature engineering is a fundamental but poorly documented component in Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is a growing interest in the creation of open-access test collections that promote reproducible research. However, there are still few open-source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or implementation details of the extracted features. Both of these are crucial to collection creation and deployment of a search engine into production. So in this regard, the experiments are rarely reproducible with new features or collections, or helpful for companies wishing to deploy LTR systems. In this paper, we introduce Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can benefit from Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.

    Original language: English (US)
    Title of host publication: CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management
    Publisher: Association for Computing Machinery
    Pages: 3015-3022
    Number of pages: 8
    ISBN (Electronic): 9781450368599
    State: Published - Oct 19 2020
    Event: 29th ACM International Conference on Information and Knowledge Management, CIKM 2020 - Virtual, Online, Ireland
    Duration: Oct 19 2020 to Oct 23 2020

    Publication series

    Name: International Conference on Information and Knowledge Management, Proceedings

    Conference

    Conference: 29th ACM International Conference on Information and Knowledge Management, CIKM 2020
    Country/Territory: Ireland
    City: Virtual, Online
    Period: 10/19/20 to 10/23/20

    Keywords

    • clueweb
    • feature extraction
    • feature importance
    • feature index
    • feature repository
    • lambdamart
    • learning to rank
    • ltr

    ASJC Scopus subject areas

    • General Business, Management and Accounting
    • General Decision Sciences
