Automatic Privacy Compliance Checks for Mobile Apps Using Natural Language Processing
Access status:
Open Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Pinchahewage, Bhanuka Malith Silva
Abstract
The rapid growth of the mobile app ecosystem has intensified concerns about how user data is collected, shared, and communicated through privacy disclosures. Privacy compliance in app marketplaces relies heavily on developer self-reporting and user awareness. As a result, privacy information, whether in detailed policy documents or in summarised forms, often fails to accurately reflect intended data practices. This thesis explores how recent advances in natural language processing (NLP) can enable automated and scalable privacy compliance checks in the Google Play Store. It identifies key factors that limit the transparency and usability of privacy policies and proposes enhanced parsing and structuring techniques to improve comprehension and support more effective regulatory oversight. Existing encoder-based models provide accurate predictions but lack interpretability, while decoder-based LLMs provide meaningful explanations but lack verifiability. To address this gap, this thesis first introduces an entailment-driven LLM framework that couples generative reasoning and re-evaluation strategies with embedding-based verification, improving both the interpretability and factual consistency of privacy policy classification. It then presents PrivPRISM, a novel language-modelling framework that leverages both encoder and decoder architectures for large-scale compliance analysis and cross-examines privacy policies, Play Store disclosures, and installation artefacts to detect inconsistencies. Findings reveal that 53% of analysed apps exhibit discrepancies, highlighting the need for evidence-driven auditing. Finally, this thesis details PrivSTRUCT, a structured modelling approach that leverages developer-defined structural cues to disentangle complex privacy disclosures by linking data items to their stated or implied purposes.
The findings reveal a persistent transparency gap in which broadly defined purpose disclosures obscure sensitive first- and third-party data practices in mobile apps.
Date
2026
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
The University of Sydney