Novel deep learning-based methods for improved prediction and feature-learning in high-throughput proteomic and transcriptomic data
Access status:
Open Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Geddes, Thomas Andrew
Abstract
The rise of high-throughput Omics technologies has allowed researchers to measure biomolecular species of interest en masse at the sample or individual cell level. These technologies, including bulk and single cell transcriptomics, mass spectrometry (MS) proteomics, and other MS techniques capable of quantifying post-translational modifications (PTMs) of proteins, produce extremely large datasets, presenting new opportunities and challenges for data analysis. These datasets may capture complex relationships in the regulation of genes, proteins and PTMs. However, the development of sophisticated techniques is required both to extract this information and to overcome pathologies and challenges that arise. Issues such as missingness, biological noise, the curse of dimensionality, and others make these datasets non-trivial to analyse. This thesis explores different approaches to analysing high-throughput datasets, extracting useful information and addressing some of the challenges involved. Chapter 2 introduces Thunderbolt, a traditional analysis pipeline which provides tools for diagnosis and remedy of pathologies inherent to specific MS proteomics datasets; differential expression analysis; and downstream analysis tools. The chapter demonstrates a full analysis workflow to address a specific hypothesis and discusses approaches to dealing with dataset pathologies. Chapter 3 introduces scCCESS, a flexible autoencoder-based framework for improving the performance of clustering methods when applied to single-cell RNA-seq datasets by diversifying and simplifying inputs to the chosen clustering algorithm. Chapter 4 introduces ConGregatE-PPI, a predictive ensemble artificial neural network model which leverages complementary information from multiple datasets to improve prediction of protein-protein interactions in a specific biological context.
Date
2025
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Science, School of Life and Environmental Sciences
Awarding institution
The University of Sydney