Challenges in lemmatising signed language digital video corpora: the measure of lexical frequency in Australian and British signed languages
Access status:
Open Access
Type
Conference paper
Abstract
Digital video archives of Auslan (Australian Sign Language) and BSL (British Sign Language) are slowly being transformed into machine-readable linguistic corpora. Each archive (Auslan 2004-2008, BSL 2008-2011) consists of data collected from deaf native and near-native signers. The datasets are being annotated using ELAN software. The majority of the video data will be made accessible online (with some limits to access for sensitive data). In this presentation, we report on the ongoing studies of lexical frequency in these two signed languages: 63,436 sign tokens produced in 360 clips by 109 participants in the currently annotated Auslan dataset, and 25,000 sign tokens from the corpus conversation data in the BSL dataset (500 signs each from 50 participants). Preliminary results indicate that 65% and 60% of the Auslan and BSL data, respectively, consist of signs from the core lexicon (i.e., those signs which are highly conventionalised in form and meaning across contexts; see Johnston, 2011; Johnston & Schembri, 1999, 2010). The next two largest categories are pointing signs (12% and 23% respectively) and signs from outside the core lexicon (i.e., gestures and sequences of enactment or 'constructed action') (6.5% and 9% respectively). The remaining tokens consist of fingerspelled signs (5% in both datasets), depicting constructions (i.e., depicting verbs of location, motion and/or handling; 11% and 3% respectively), and sign names (0.2% and 0.3% respectively). We discuss some of the challenges in creating a lemmatised corpus of a sign language, including difficulties in differentiating core from non-core signs and sign from gesture, as well as how our work informs both sign language documentation and description specifically and linguistic theory more generally.
Johnston, T. (2011). Lexical frequency in sign languages. Journal of Deaf Studies and Deaf Education.
Johnston, T., & Schembri, A. (1999). On defining lexeme in a signed language. Sign Language and Linguistics, 2(2), 115-185.
Johnston, T., & Schembri, A. (2010). Variation, lexicalization and grammaticalization in signed languages. Langage et Société, 131, 19-35.
Date
2011-01-01
Source title
Sustainable data from digital research: Humanities perspectives on digital scholarship.
Licence
Other
Faculty/School
Sydney Conservatorium of Music, PARADISEC (Pacific And Regional Archive for Digital Sources in Endangered Cultures)