http://hdl.handle.net/2123/6229
Title: | The Inductive Inference of Structure in Text Streams |
Authors: | Patrick, Jon; Palko, Dusan; Khan, Asiz |
Keywords: | Humanities Computing |
Issue Date: | 2001 |
Publisher: | Research Institute for Humanities and Social Sciences (RIHSS), the University of Sydney. |
Citation: | Computing Arts 2001 : digital resources for research in the humanities : 26th-28th September 2001, Veterinary Science Conference Centre, the University of Sydney / hosted by the Scholarly Text and Imaging Service (SETIS), the University of Sydney Library, and the Research Institute for Humanities and Social Sciences (RIHSS), the University of Sydney |
Abstract: | Text can be thought of as a data stream with embedded structural elements that indicate semantic changes in content. A dictionary is one of the simpler examples. We work from the general principles of inductive inference and use automata theory to model a text data stream. On this basis it is feasible to build an Intelligent Self-Learning Parser-editor that infers the structure in the text, verifies it for accuracy and automates error correction. A good case study for testing the quality of the solution is the conversion of dictionaries from text format into a database format. This is not necessarily a straightforward task, because the data is somewhat noisy owing to typographic errors and inconsistent structure across dictionary entries. In addition, attribute boundaries are most often implied by changes in text format rather than marked by explicit symbols. The aim of this project has been to build a parser-editor that can be trained to identify the structure of dictionary entries and then learn from examples to parse unseen entries. The software has to cope with erroneous, missing and irregularly formatted data, intelligently prompt a user to intervene in the parsing process, and allow and record irregular structures. The technique has been used to convert a Basque-English bilingual dictionary from Word processing files into XML files. |
URI: | http://hdl.handle.net/2123/6229 |
Rights and Permissions: | Copyright the University of Sydney |
Type of Work: | Conference paper |
Appears in Collections: | Computing Arts 2001: Digital Resources for Research in the Humanities |
File | Description | Size | Format |
---|---|---|---|
patrick02.pdf | | 23.42 kB | Adobe PDF |
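
The abstract notes that attribute boundaries in dictionary entries are usually implied by changes in text format, and that the parser-editor should prompt a user when an entry is irregular. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes an entry has already been split into (text, style) runs, and the transition table, the field names (`headword`, `pos`, `gloss`), the `parse_entry` function and the sample entry are all hypothetical. In the system described, the automaton would be learned from examples rather than written by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written automaton (an assumption for illustration):
# (current state, style of the next run) -> (next state, XML field name).
TRANSITIONS = {
    ("start", "bold"): ("headword", "headword"),
    ("headword", "italic"): ("pos", "pos"),
    ("headword", "plain"): ("gloss", "gloss"),  # part of speech may be missing
    ("pos", "plain"): ("gloss", "gloss"),
}
ACCEPTING = {"gloss"}


def parse_entry(runs):
    """Parse one entry given as (text, style) runs.

    Returns (xml_string, None) on success, or (None, problem) when the
    entry does not fit the automaton and a person should inspect it.
    """
    state = "start"
    entry = ET.Element("entry")
    for text, style in runs:
        step = TRANSITIONS.get((state, style))
        if step is None:
            return None, f"unexpected {style!r} run {text!r} in state {state!r}"
        state, field = step
        # A change of style marks the start of a new attribute.
        ET.SubElement(entry, field).text = text.strip()
    if state not in ACCEPTING:
        return None, f"entry ended early in state {state!r}"
    return ET.tostring(entry, encoding="unicode"), None


# Made-up sample entry, not taken from the dictionary in the paper.
result, problem = parse_entry(
    [("etxe", "bold"), ("n.", "italic"), ("house", "plain")]
)
print(result if problem is None else f"needs review: {problem}")
```

Returning a problem description instead of raising an exception keeps irregular entries in the workflow, so they can be queued for a user to correct or record as a legitimate variant.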