Ultra-high dimensional partial correlation analysis and its applications
Access status:
USyd Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Neo, Emily
Abstract
Modern large datasets, which have emerged from advances in information technology, present both opportunities and challenges for statistical analysis and applications. The proliferation of data samples benefits numerous statistical learning approaches and has enabled powerful insights and applications of machine learning and artificial intelligence in numerous domains. At the same time, the growth in the number of features (p), which at times exceeds the number of samples (n), has presented an obstacle to rigorous statistical analysis in a number of other scientific areas. This large-scale data regime, where p grows much faster than n (p >> n), is known as the ultra-high dimensional data regime and is commonly encountered in climate science, health and finance. In this setting, numerous classical statistical approaches are ill-defined, presenting challenges to data mining, multivariate analysis and statistical learning. This dissertation proposes methods tailored to the challenging ultra-high dimensional setting. We first propose a large-scale partial correlation screening approach with error control, named PARSEC, which provides a principled method, rooted in an inferential framework, for identifying significant partial correlations. PARSEC provides a sparse estimate of the inverse covariance structure in the large-p setting, which can be used for learning Gaussian Graphical Models (GGMs, or undirected graphical models) and can also be leveraged by other methods that rely on a stable estimate of the inverse covariance matrix. We demonstrate PARSEC's use for these purposes on prevalent real-world applications, including breast cancer gene screening and financial portfolio selection. Lastly, we outline a constraint-based approach known as RECON, which extends the PARSEC framework to learning the mechanisms of large-scale directed graphical models.
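As background to the abstract, the following minimal sketch illustrates the standard relationship between an inverse covariance (precision) matrix and partial correlations, and why the classical estimate breaks down when p exceeds n. It is not the PARSEC or RECON procedure, which this record does not describe in detail; the function name `partial_correlations`, the simulated data and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def partial_correlations(precision):
    """Convert a precision (inverse covariance) matrix to partial correlations.

    Uses the standard identity rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
    """
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


rng = np.random.default_rng(0)  # illustrative simulated data, not thesis data

# Classical regime (n > p): the sample covariance is invertible,
# so partial correlations follow directly from its inverse.
x_tall = rng.standard_normal((200, 5))
cov_tall = np.cov(x_tall, rowvar=False)
pcor_tall = partial_correlations(np.linalg.inv(cov_tall))

# High-dimensional regime (p > n): the sample covariance has rank at most
# n - 1, so it is singular and the classical inverse is not defined.
x_wide = rng.standard_normal((20, 50))
cov_wide = np.cov(x_wide, rowvar=False)
rank_wide = np.linalg.matrix_rank(cov_wide)
print("p =", cov_wide.shape[0], "rank of sample covariance =", rank_wide)
```

In the second case the 50 x 50 sample covariance cannot be inverted directly, which is the kind of difficulty that motivates methods designed for the p >> n setting.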
Date
2024
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
The University of Sydney Business School, Discipline of Business Analytics
Awarding institution
The University of Sydney