A Corpus-based Probabilistic Grammar with Only Two Non-terminals

Satoshi Sekine, Ralph Grishman

Research output: Contribution to conference › Paper › peer-review


The availability of large, syntactically-bracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broad-coverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than context-free grammars in selecting a correct parse. To make maximal use of context, we have automatically constructed, from the Penn Tree Bank version 2, a grammar in which the symbols S and NP are the only real non-terminals, and the other non-terminals or grammatical nodes are in effect embedded into the right-hand sides of the S and NP rules. For example, one of the rules extracted from the tree bank would be S -> NP VBX JJ CC VBX NP [1] (where NP is a non-terminal and the other symbols are terminals - part-of-speech tags of the Tree Bank). The most common structure in the Tree Bank associated with this expansion is (S NP (VP (VP VBX (ADJ JJ) CC (VP VBX NP)))) [2]. So if our parser uses rule [1] in parsing a sentence, it will generate structure [2] for the corresponding part of the sentence. Using 94% of the Penn Tree Bank for training, we extracted 32,296 distinct rules (23,386 for S, and 8,910 for NP). We also built a smaller version of the grammar based on higher-frequency patterns for use as a back-up when the larger grammar is unable to produce a parse due to memory limitations. We applied this parser to 1,989 Wall Street Journal sentences (separate from the training set and with no limit on sentence length). Of the parsed sentences (1,899), the percentage of no-crossing sentences is 33.9%, and Parseval recall and precision are 73.43% and 72.61%.
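The relationship between rule [1] and structure [2] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`parse`, `flatten`, `extract`) are hypothetical, and it assumes that tree leaves are already part-of-speech tags and that the input is a plain bracketed string. Every node other than S and NP is folded into the right-hand side of the enclosing rule, and the bracketed structure is stored alongside the rule so that the most frequent one can be recovered at parse time.

```python
from collections import Counter, defaultdict

KEPT = {"S", "NP"}  # the only symbols kept as non-terminals


def parse(text):
    """Parse a bracketed tree into nested lists [label, child, ...]; leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                  # skip "("
        tree = [tokens[pos]]      # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                tree.append(node())
            else:
                tree.append(tokens[pos])
                pos += 1
        pos += 1                  # skip ")"
        return tree

    return node()


def flatten(children, rules):
    """Return (right-hand-side symbols, structure pieces) for a list of children."""
    rhs, parts = [], []
    for child in children:
        if isinstance(child, str):     # POS tag: a terminal of the grammar
            rhs.append(child)
            parts.append(child)
        elif child[0] in KEPT:         # S or NP: stays a non-terminal, gets its own rule
            extract(child, rules)
            rhs.append(child[0])
            parts.append(child[0])
        else:                          # any other node is embedded into this rule
            sub_rhs, sub_parts = flatten(child[1:], rules)
            rhs.extend(sub_rhs)
            parts.append("(" + " ".join([child[0]] + sub_parts) + ")")
    return rhs, parts


def extract(tree, rules):
    """Record the rule for an S or NP node together with its bracketed structure."""
    rhs, parts = flatten(tree[1:], rules)
    structure = "(" + " ".join([tree[0]] + parts) + ")"
    rules[(tree[0], tuple(rhs))][structure] += 1
    return rules


# Toy tree (an assumption for illustration, not taken from the Tree Bank).
tree = parse("(S (NP DT NN) (VP (VP VBX (ADJ JJ) CC (VP VBX (NP NNS)))))")
rules = extract(tree, defaultdict(Counter))
for (lhs, rhs), structures in rules.items():
    best, count = structures.most_common(1)[0]
    print(lhs, "->", " ".join(rhs), "|", best, "|", count)
```

Keying the table on the flat rule and counting structures per rule mirrors the idea that a parser applying a rule can emit the structure most often seen with it; how the original system handled words, traces, or function tags is not stated in the abstract and is not modelled here.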

Original language: English (US)
Number of pages: 8
State: Published - 1995
Event: 4th International Workshop on Parsing Technologies, IWPT 1995 - Prague and Karlovy Vary, Czech Republic
Duration: Sep 20 1995 to Sep 24 1995


Conference: 4th International Workshop on Parsing Technologies, IWPT 1995
Country/Territory: Czech Republic
City: Prague and Karlovy Vary

ASJC Scopus subject areas

  • Artificial Intelligence
  • Human-Computer Interaction
  • Linguistics and Language

