Abstract
Neural network models have recently made striking progress in natural language processing, but they are typically trained on orders of magnitude more language input than children receive. What can these neural networks, which are primarily distributional learners, learn from a naturalistic subset of a single child's experience? We examine this question using a recent longitudinal dataset collected from a single child, consisting of egocentric visual data paired with text transcripts. We train both language-only and vision-and-language neural networks and analyze the linguistic knowledge they acquire. In parallel with findings from Jeffrey Elman's seminal work, the neural networks form emergent clusters of words corresponding to syntactic (nouns, transitive and intransitive verbs) and semantic categories (e.g., animals and clothing), based solely on one child's linguistic input. The networks also acquire sensitivity to acceptability contrasts from linguistic phenomena, such as determiner-noun agreement and argument structure. We find that incorporating visual information produces an incremental gain in predicting words in context, especially for syntactic categories that are more readily grounded, such as nouns and verbs, but the underlying linguistic representations are not fundamentally altered. Our findings demonstrate which kinds of linguistic knowledge are learnable from a snapshot of a single child's real developmental experience.
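As an illustrative sketch only (not the authors' code or data), the kind of emergent category structure described above is typically probed by hierarchically clustering a trained network's word embeddings, in the spirit of Elman's analyses; the vocabulary and embedding matrix below are hypothetical placeholders.

```python
# Hypothetical sketch: probe whether learned word embeddings form
# syntactic/semantic clusters via hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Placeholder inputs: one embedding vector per word from a trained network.
vocab = ["dog", "cat", "ball", "shoe", "eat", "kick", "sleep", "run"]
embeddings = np.random.default_rng(0).normal(size=(len(vocab), 64))  # stand-in values

# Agglomerative clustering over cosine distances between word vectors.
distances = pdist(embeddings, metric="cosine")
tree = linkage(distances, method="average")

# Cut the tree into two clusters; with real embeddings, words of the same
# syntactic or semantic class would be expected to group together.
labels = fcluster(tree, t=2, criterion="maxclust")
for word, cluster_id in zip(vocab, labels):
    print(word, cluster_id)
```

With embeddings from an actual trained model, early merges in the clustering tree would be the signal of interest (e.g., nouns merging with nouns, verbs with verbs); the random vectors here will not show such structure.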
| Original language | English (US) |
| --- | --- |
| Article number | e13305 |
| Journal | Cognitive Science |
| Volume | 47 |
| Issue number | 6 |
| DOIs | |
| State | Published - Jun 2023 |
Keywords
- Child development
- First-person video
- Language learning
- Learnability
- Multimodal learning
- Neural networks
- Statistical learning
ASJC Scopus subject areas
- Experimental and Cognitive Psychology
- Cognitive Neuroscience
- Artificial Intelligence