Accurate variable selection and causal structure recovery in high-dimensional data
Field | Value | Language |
dc.contributor.author | Xu, Ning | |
dc.date.accessioned | 2020-07-24 | |
dc.date.available | 2020-07-24 | |
dc.date.issued | 2020 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/22920 | |
dc.description.abstract | From the perspective of econometrics, an accurate variable selection method greatly enhances the reliability of causal analysis and interpretation of the estimators, especially in a world of ever-expanding data dimensions. While variable selection methods in machine learning and statistics have been developed rapidly and applied widely in different branches of data science in the last decade, they have been more slowly adopted in econometrics. Nevertheless, the machine learning methods, including lasso, forward regression, cross-validation and marginal correlation ranking (also called variable screening) are subject to a range of issues that may result in errors in variable selection and inaccurate causal interpretation. I propose two new variable-selection methods that significantly mitigate the issues with existing techniques and that provide accurate variable selection and reliable causal structure estimation in high-dimensional data. In Chapter 1, I develop bounds for cross-validation errors that may be used as a criterion for variable selection with many existing learning algorithms (including lasso, forward regression and variable screening), yielding a sparse and stable model that retains all of the relevant variables. In Chapter 2, I develop an entirely new learning algorithm for variable selection, subsample-ordered least-angle regression (solar), and show in simulations that solar outperforms coordinate descent and lars-lasso in terms of the sparsity, stability, accuracy, and robustness of variable selection. In Chapter 3, I demonstrate the superior variable-selection performance of solar using real-world data from two completely different samples: prostate cancer patients and house prices. I also show that combining solar variable selection with linear probabilistic graph learning yields a plausible, data-driven method to recover causal structure in data. | en_AU |
dc.language.iso | en | en_AU |
dc.publisher | University of Sydney | en_AU |
dc.subject | variable selection | en_AU |
dc.subject | least angle regression | en_AU |
dc.subject | directed acyclic graph | en_AU |
dc.subject | constraint based learning | en_AU |
dc.subject | causal structure recovery | en_AU |
dc.subject | high dimensional spaces | en_AU |
dc.title | Accurate variable selection and causal structure recovery in high-dimensional data | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Arts and Social Sciences::School of Economics | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Fisher, Timothy | |