Accurate variable selection and causal structure recovery in high-dimensional data

Xu, Ning

Access status: USyd Access

Field	Value	Language
dc.contributor.author	Xu, Ning
dc.date.accessioned	2020-07-24
dc.date.available	2020-07-24
dc.date.issued	2020	en_AU
dc.identifier.uri	https://hdl.handle.net/2123/22920
dc.description.abstract	From the perspective of econometrics, an accurate variable selection method greatly enhances the reliability of causal analysis and interpretation of the estimators, especially in a world of ever-expanding data dimensions. While variable selection methods in machine learning and statistics have been developed rapidly and applied widely in different branches of data science in the last decade, they have been more slowly adopted in econometrics. Nevertheless, the machine learning methods, including lasso, forward regression, cross-validation and marginal correlation ranking (also called variable screening), are subject to a range of issues that may result in errors in variable selection and inaccurate causal interpretation. I propose two new variable-selection methods that significantly mitigate the issues with existing techniques and that provide accurate variable selection and reliable causal structure estimation in high-dimensional data. In Chapter 1, I develop bounds for cross-validation errors that may be used as a criterion for variable selection with many existing learning algorithms (including lasso, forward regression and variable screening), yielding a sparse and stable model that retains all of the relevant variables. In Chapter 2, I develop an entirely new learning algorithm for variable selection, subsample-ordered least-angle regression (solar), and show in simulations that solar outperforms coordinate descent and lars-lasso in terms of the sparsity, stability, accuracy, and robustness of variable selection. In Chapter 3, I demonstrate the superior variable-selection performance of solar using real-world data from two completely different samples: prostate cancer patients and house prices. I also show that combining solar variable selection with linear probabilistic graph learning yields a plausible, data-driven method to recover causal structure in data.	en_AU
dc.language.iso	en	en_AU
dc.publisher	University of Sydney	en_AU
dc.subject	variable selection	en_AU
dc.subject	least angle regression	en_AU
dc.subject	directed acyclic graph	en_AU
dc.subject	constraint based learning	en_AU
dc.subject	causal structure recovery	en_AU
dc.subject	high dimensional spaces	en_AU
dc.title	Accurate variable selection and causal structure recovery in high-dimensional data	en_AU
dc.type	Thesis
dc.type.thesis	Doctor of Philosophy	en_AU
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en_AU
usyd.faculty	SeS faculties schools::Faculty of Arts and Social Sciences::School of Economics	en_AU
usyd.degree	Doctor of Philosophy Ph.D.	en_AU
usyd.awardinginst	The University of Sydney	en_AU
usyd.advisor	Fisher, Timothy