TY - JOUR
T1 - Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
AU - Saparov, Abulhair
AU - Pang, Richard Yuanzhe
AU - Padmakumar, Vishakh
AU - Joshi, Nitish
AU - Kazemi, Seyed Mehran
AU - Kim, Najoung
AU - He, He
N1 - Publisher Copyright:
© 2023 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2023
Y1 - 2023
AB - Given the intractably large size of the space of proofs, any model capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs that use modus ponens or have a specific size, drawn from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test them on a broad set of deduction rules and measure their ability to generalize from simpler demonstrations to more complex proofs along three axes: depth, width, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
UR - http://www.scopus.com/inward/record.url?scp=85175596521&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175596521&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85175596521
SN - 1049-5258
VL - 36
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Y2 - 10 December 2023 through 16 December 2023
ER -