Show simple item record

Field | Value | Language
dc.contributor.author | Luo, Zijian |
dc.date.accessioned | 2026-03-05T23:43:19Z |
dc.date.available | 2026-03-05T23:43:19Z |
dc.date.issued | 2026 | en
dc.identifier.uri | https://hdl.handle.net/2123/34960 |
dc.description | Includes publication |
dc.description.abstract | In data processing, datasets are expected to adhere to specific formats so that they can be consumed reliably by downstream tools and analyses. In practice, however, human error, data corruption, inconsistent formatting, and partial transmission routinely produce nonconforming records that parsers reject outright. When no trustworthy formal specification is available, practitioners are left with two unsatisfactory options: discarding these records, losing potentially valuable information, or performing manual data repair, which is time-consuming, error-prone, and difficult to scale. This thesis studies how to automate data repair in specification-independent settings while treating existing parsers and validators as black boxes. We first analyse the state-of-the-art format-free technique DDMax, showing that its reliance on deletion-heavy edits, valid empty records, waypoints, and strong parser feedback can lead to severe data loss and limited applicability. We then introduce two complementary format-free repair approaches that relax these assumptions. ϵREPAIR leverages minimal viable-prefix feedback from parsers to guide a small number of targeted edits near parse boundaries, and βMAX uses examples of valid and augmented invalid records to infer regular structure and suggest repairs for regex-style formats. Across several real-world formats and regex-described data categories, our empirical evaluation shows that these approaches substantially improve repair quality and data retention over existing methods while remaining practical in runtime. Together, they demonstrate that high-fidelity, specification-independent record repair is achievable even when formal formats are unavailable or unreliable. | en
dc.language.iso | en | en
dc.subject | data repair | en
dc.subject | format-free repair | en
dc.subject | black-box parsers | en
dc.subject | regular-language learning | en
dc.subject | error-correcting parsing | en
dc.title | Automatic Data Repair without Format Specifications | en
dc.type | Thesis |
dc.type.thesis | Masters by Research | en
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en
usyd.degree | Master of Philosophy M.Phil | en
usyd.awardinginst | The University of Sydney | en
usyd.advisor | Gopinath, Rahul |
usyd.include.pub | Yes | en

