Multi-label Learning for Public Speaking Annotation
Access status: USyd Access
Type: Thesis
Thesis type: Doctor of Philosophy
Author/s: Xu, Jiahao
Abstract:
Public speaking, as a critical communication skill, has always been challenging for many people. Despite the numerous books and courses available, there is still no established formula for giving a successful public speech. To help people with public speaking, we investigate audience responses to public speeches and analyse how TED speakers deliver their talks. To this end, we formulate the task as an audio affective annotation problem, which predicts audience emotional ratings from the audio signals of public speaking scenes. In addition, we quantitatively analyse the influence of speech delivery techniques on audience impressions. We can therefore provide speakers with personalised, constructive feedback on their public speaking skills and help them improve their delivery effectively and efficiently. The research presented in this thesis explores the audio annotation problem in multi-label learning settings from three perspectives and presents our approach to each.

The first perspective examines clustering features as mid-level representations for input space learning. While most existing audio annotation studies focus on high-level representations such as spectral features, we propose to learn intermediate-level features from the input space, enabling more discriminative representations and improving annotation accuracy. Building on the rapid development and successful application of deep learning techniques, we propose a novel convolutional clustering neural network (CCNN) for effective input space learning. A clustering layer is proposed for the first time to derive intermediate representations, and we explore the effects of different clustering strategies. Experiments on our TEDtalk dataset, which consists of more than 2,000 video clips from the TED website with user ratings, show state-of-the-art annotation results.

The second perspective is to learn from the output space as well as the input space to further improve annotation accuracy. In other words, we exploit the correlations between labels in the multi-label setting as complementary information. We therefore propose a deep learning framework with flexible modules that jointly learn from the input and output spaces, extracting label-specific features and learning multi-label classifiers simultaneously. For the input space, we introduce a label-specific feature pooling method that refines convolutional features into features specific to each label. For the output space, we adopt a Graph Convolutional Network (GCN) to model inter-label correlations and enhance the multi-label classifiers. Although the label set of our TEDtalk dataset is limited and the improvements on audio affective annotation are marginal, the proposed method achieves superior performance on image multi-label classification tasks.

For the third and final perspective, we propose a deep affective scoring network for audio affective annotation, which both predicts audience emotion scores and provides users with constructive feedback for improving their speech delivery. The proposed network adopts a deep ranking framework to address this multi-label problem, reformulating the binary classification task as a continuous regression task, which is more intuitive for this specific problem. Furthermore, we model the correlation between speakers' emotions and audience perception as auxiliary features.
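As a minimal, purely illustrative sketch of this kind of reformulation (PyTorch-style; the ScoringHead module, the 14-label count and the pairing scheme are placeholder assumptions, not the network actually used in the thesis), a deep ranking objective can be written as a pairwise margin loss over pairs of clips whose audience ratings order them for a given emotion label:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoringHead(nn.Module):
    """Placeholder scoring network: maps an audio embedding to one
    continuous score per emotion label."""
    def __init__(self, feat_dim: int, num_labels: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)  # shape: (batch, num_labels)

def pairwise_ranking_loss(score_net, feats_hi, feats_lo, label_idx, margin=1.0):
    """Clips rated higher by the audience for a given label should receive
    a higher continuous score than clips rated lower (margin ranking)."""
    s_hi = score_net(feats_hi)[:, label_idx]
    s_lo = score_net(feats_lo)[:, label_idx]
    target = torch.ones_like(s_hi)  # +1 means "first argument should rank higher"
    return F.margin_ranking_loss(s_hi, s_lo, target, margin=margin)

# Illustrative usage with random embeddings standing in for audio features.
net = ScoringHead(feat_dim=128, num_labels=14)
hi, lo = torch.randn(8, 128), torch.randn(8, 128)
pairwise_ranking_loss(net, hi, lo, label_idx=0).backward()
```

Averaging such pairwise losses over many rated pairs pushes the network toward a continuous per-label ordering of clips, which is the regression-style behaviour described above.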
For affective annotation, the trained scoring network outperforms existing methods in annotation accuracy. We also use the trained network to examine how general speaking attributes (e.g. pitch, speaking speed and pauses) influence speech delivery, which can in turn be used to provide users with qualitative and quantitative advice on public speaking.
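As a hypothetical sketch of how a trained scorer could be probed for such attribute effects (extract_features and scoring_model are placeholders, and the perturbations use standard librosa effects rather than the thesis's actual analysis pipeline), one can perturb a single delivery attribute and compare the predicted scores before and after:

```python
import librosa

def probe_delivery_attribute(scoring_model, extract_features, wav_path, sr=22050):
    """Hypothetical probe: perturb one delivery attribute at a time and
    measure how the predicted audience score changes relative to the original."""
    y, sr = librosa.load(wav_path, sr=sr)

    variants = {
        "original": y,
        "faster": librosa.effects.time_stretch(y, rate=1.15),   # ~15% faster delivery
        "slower": librosa.effects.time_stretch(y, rate=0.85),   # ~15% slower delivery
        "higher_pitch": librosa.effects.pitch_shift(y, sr=sr, n_steps=2),
        "lower_pitch": librosa.effects.pitch_shift(y, sr=sr, n_steps=-2),
    }

    scores = {name: float(scoring_model(extract_features(audio, sr)))
              for name, audio in variants.items()}
    baseline = scores["original"]
    return {name: score - baseline for name, score in scores.items()}
```

Pauses could be probed in the same way by inserting or removing silent segments before re-scoring.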
Date: 2022
Rights statement: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering, School of Computer Science
Awarding institution: The University of Sydney