Noun phrases (NPs) are a crucial part of natural language, exhibiting in many cases an
extremely complex structure. However, NP structure is largely ignored by the statistical parsing field, as the most widely-used corpus is not annotated with it. This lack of gold-standard data has
restricted all previous efforts to parse NPs, making it impossible to perform the supervised experiments that have achieved high performance in so many Natural Language Processing (NLP) tasks.
We comprehensively solve this problem by manually annotating NP structure for the entire Wall Street Journal section of the Penn Treebank. The inter-annotator agreement scores that we attain refute the belief that the task is too difficult, and demonstrate that consistent NP annotation is possible. Our gold-standard NP data is now available and will be useful for all parsers.
We present three statistical methods for parsing NP structure. Firstly, we apply the Collins (2003) model, and find that its recovery of NP structure is significantly worse than its overall performance. Through much experimentation, we determine that this is not a result of the special base-NP model used by the parser, but primarily caused by a lack of lexical information.
Secondly, we construct a wide-coverage, large-scale NP Bracketing system, applying a
supervised model to achieve excellent results. Our Penn Treebank data set, which is orders of magnitude larger than those used previously, makes this possible for the first time. We then implement and experiment with a wide variety of features in order to determine an optimal model. Having achieved this, we use the NP Bracketing system to reanalyse NPs outputted by the Collins (2003)
parser. Our post-processor outperforms this state-of-the-art parser.
For our third model, we convert the NP data to CCGbank (Hockenmaier and Steedman, 2007), a corpus that uses the Combinatory Categorial Grammar (CCG) formalism. We experiment with a CCG parser and again, implement features that improve performance. We also evaluate the CCG parser against the Briscoe and Carroll (2006) reannotation of DepBank (King et al., 2003), another corpus that annotates NP structure. This supplies further evidence that parser performance
is increased by improving the representation of NP structure.
Finally, the error analysis we carry out on the CCG data shows that again, a lack of lexicalisation causes difficulties for the parser. We find that NPs are particularly reliant on this lexical information, due to their exceptional productivity and the reduced explicitness present in modifier
sequences. Our results show that NP parsing is a significantly harder task than parsing in general.
This thesis comprehensively analyses the NP parsing task. Our contributions allow wide-coverage, large-scale NP parsers to be constructed for the first time, and motivate further NP parsing research for the future. The results of our work can provide significant benefits for many NLP tasks, as the crucial information contained in NP structure is now available for all downstream systems.