Automatic Data Repair without Format Specifications
| Field | Value | Language |
| dc.contributor.author | Luo, Zijian | |
| dc.date.accessioned | 2026-03-05T23:43:19Z | |
| dc.date.available | 2026-03-05T23:43:19Z | |
| dc.date.issued | 2026 | en |
| dc.identifier.uri | https://hdl.handle.net/2123/34960 | |
| dc.description | Includes publication | |
| dc.description.abstract | In data processing, datasets are expected to adhere to specific formats so that they can be consumed reliably by downstream tools and analyses. In practice, however, human error, data corruption, inconsistent formatting, and partial transmission routinely produce nonconforming records that parsers reject outright. When no trustworthy formal specification is available, practitioners are left with two unsatisfactory options: discarding these records, which loses potentially valuable information, or repairing them manually, which is time-consuming, error-prone, and difficult to scale. This thesis studies how to automate data repair in specification-independent settings while treating existing parsers and validators as black boxes. We first analyse the state-of-the-art format-free technique DDMax, showing that its reliance on deletion-heavy edits, valid empty records, waypoints, and strong parser feedback can lead to severe data loss and limited applicability. We then introduce two complementary format-free repair approaches that relax these assumptions. ϵREPAIR leverages minimal viable-prefix feedback from parsers to guide a small number of targeted edits near parse boundaries, and βMAX uses examples of valid and augmented invalid records to infer regular structure and suggest repairs for regex-style formats. Across several real-world formats and regex-described data categories, our empirical evaluation shows that these approaches substantially improve repair quality and data retention over existing methods while remaining practical in runtime. Together, they demonstrate that high-fidelity, specification-independent record repair is achievable even when formal formats are unavailable or unreliable. | en |
| dc.language.iso | en | en |
| dc.subject | data repair | en |
| dc.subject | format-free repair | en |
| dc.subject | black-box parsers | en |
| dc.subject | regular-language learning | en |
| dc.subject | error-correcting parsing | en |
| dc.title | Automatic Data Repair without Format Specifications | en |
| dc.type | Thesis | |
| dc.type.thesis | Masters by Research | en |
| dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en |
| usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en |
| usyd.degree | Master of Philosophy M.Phil | en |
| usyd.awardinginst | The University of Sydney | en |
| usyd.advisor | Gopinath, Rahul | |
| usyd.include.pub | Yes | en |