Challenges in lemmatising signed language digital video corpora: the measure of lexical frequency in Australian and British signed languages
Field | Value | Language |
dc.contributor.author | Schembri, Adam | |
dc.contributor.author | Johnston, Trevor | |
dc.contributor.author | Fenlon, Jordan | |
dc.contributor.author | Cormier, Kearsy | |
dc.contributor.author | Rentelis, Ramas | |
dc.date.accessioned | 2012-02-07 | |
dc.date.available | 2012-02-07 | |
dc.date.issued | 2011-01-01 | |
dc.identifier.uri | http://hdl.handle.net/2123/8109 | |
dc.description.abstract | Digital video archives of Auslan (Australian Sign Language) and BSL (British Sign Language) are slowly being transformed into machine-readable linguistic corpora. Each archive (Auslan 2004-2008, BSL 2008-2011) consists of data collected from deaf native and near-native signers. The datasets are being annotated using ELAN software. The majority of the video data will be made accessible online (with some limits to access for sensitive data). In this presentation, we report on the ongoing studies of lexical frequency in these two signed languages: 63,436 sign tokens produced in 360 clips by 109 participants in the currently annotated Auslan dataset, and 25,000 sign tokens from the corpus conversation data in the BSL dataset (500 signs each from 50 participants). Preliminary results indicate that approximately 65% and 60% of the Auslan and BSL data respectively consist of signs from the core lexicon (i.e., those signs which are highly conventionalised in form and meaning across contexts; see Johnston, 2011; Johnston & Schembri, 1999, 2010). The next two largest categories are pointing signs (12% and 23% respectively) and signs from outside the core lexicon (i.e., gestures and sequences of enactment or 'constructed action') (6.5% and 9% respectively). The remaining tokens consist of fingerspelled signs (5% in both datasets), depicting constructions (i.e., depicting verbs of location, motion and/or handling; 11% and 3% respectively), and sign names (0.2% and 0.3% respectively). We discuss some of the challenges in creating a lemmatised corpus of a sign language, including difficulties in differentiating core from non-core signs and sign from gesture, as well as how our work informs both sign language documentation and description specifically and linguistic theory more generally. References: Johnston, T. (2011). Lexical frequency in sign languages. Journal of Deaf Studies and Deaf Education. Johnston, T., & Schembri, A. (1999). On defining lexeme in a signed language. Sign Language and Linguistics, 2(2), 115-185. Johnston, T., & Schembri, A. (2010). Variation, lexicalization and grammaticalization in signed languages. Langage et Société, 131, 19-35. | en_AU |
dc.description.sponsorship | PARADISEC (Pacific And Regional Archive for Digital Sources in Endangered Cultures), Australian Partnership for Sustainable Repositories, Ethnographic E-Research Project and Sydney Object Repositories for Research and Teaching. | en_AU |
dc.language.iso | en | en_AU |
dc.relation.ispartof | Sustainable data from digital research: Humanities perspectives on digital scholarship. | en_AU |
dc.title | Challenges in lemmatising signed language digital video corpora: the measure of lexical frequency in Australian and British signed languages | en_AU |
dc.type | Conference paper | en_AU |