Information Extraction from Radiology Reports for a Population Based Cancer Registry

Nguyen, Hoang Minh Dung

Access status:

USyd Access

Field	Value	Language
dc.contributor.author	Nguyen, Hoang Minh Dung
dc.date.accessioned	2013-10-24
dc.date.available	2013-10-24
dc.date.issued	2013-03-31
dc.identifier.uri	http://hdl.handle.net/2123/9466
dc.description.abstract	In a noisy corpus such as clinical data, the text usually contains a large number of misspelled words, abbreviations and acronyms that can be an obstacle to high-quality information extraction and classification. Furthermore, the gold-standard training data needed for supervised learning usually contains many errors and inconsistencies due to differences between human annotators. In this research, a specialised proof-reading process for the clinical domain is introduced to resolve unknown tokens and convert scores and measures into a standard layout. After the automatic correction process, automatic coding of the texts increased the coded content significantly. Accuracy of the automatic coding and annotation of the notes which have not been coded by the clinical staff is suggested by the system output. To deal with the problem of noisy training data, this thesis proposes an algorithm for a method named “reverse active learning”, which means applying active learning in reverse order to improve the performance of supervised machine learning on clinical corpora. The effects of automatic proof-reading and reverse active learning are shown to produce results on the i2b2 2010 clinical corpus that are state of the art for supervised learning methods and offer a means of improving all processing strategies in clinical language processing. Finally, a Cancer Staging Information Extraction System based on the combination of the proposed methods of proof-reading, supervised learning, active learning and reverse active learning is presented. In this research, free-text reports are annotated for examples of the information to be extracted and then algorithms are developed that use the examples to compute a more general model of the desired content. 
Besides traditional supervised learning methods such as Conditional Random Fields and Support Vector Machines, active learning approaches are investigated to bring further improvement to information extraction system performance.	en_AU
dc.rights	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en_AU
dc.subject	information extraction	en_AU
dc.subject	active learning	en_AU
dc.subject	machine learning	en_AU
dc.subject	clinical	en_AU
dc.subject	radiology reports	en_AU
dc.subject	cancer	en_AU
dc.title	Information Extraction from Radiology Reports for a Population Based Cancer Registry	en_AU
dc.type	Thesis	en_AU
dc.type.thesis	Doctor of Philosophy	en_AU
usyd.faculty	Faculty of Engineering and Information Technologies, School of Information Technologies	en_AU
usyd.department	Graduate School of Engineering and Information Technologies	en_AU
usyd.degree	Doctor of Philosophy Ph.D.	en_AU
usyd.awardinginst	The University of Sydney	en_AU