Facial expressions (FEs) communicate a rich variety of social, grammatical, and affective signals. However, the most generally accepted set of recognizable FEs remains limited to seven basic displays of emotion: happiness, sadness, fear, anger, disgust, surprise, and contempt. To develop intelligent virtual agents capable of interpreting and synthesizing nuanced facial behavior, we need a more complete lexicon. One roadblock has been the limiting nature of forced-choice study designs, the most common paradigm for investigating observer judgements of FEs. Free-response designs are an alternative, but there has been no consensus on an objective way to evaluate them. We present a human-in-the-loop artificial intelligence pipeline for analyzing sets of freely chosen natural language labels. The pipeline, FreeRes-NLP, makes it possible to automatically identify whether there is consensus on the signal value of an FE and which label best classifies it. FreeRes-NLP scales to process very large datasets. We validate our approach in two stages: (1) a comparison of label synonymy scores from ten computer algorithms with those from human raters across three synonym datasets, and (2) examples of pipeline results compared with manual data processing results from emotion and FE recognition studies. The pipeline can potentially improve automated facial expression recognition and procedural modeling of virtual humans.