Instance-Dependent Positive-Unlabelled Learning
Access status: USyd Access
Type: Thesis
Thesis type: Masters by Research
Author/s: He, Fengxiang
Abstract:
An emerging topic in machine learning is how to learn classifiers from datasets containing only positive and unlabelled examples (PU learning). This problem is of significant importance in both academia and industry. This thesis addresses the PU learning problem with a natural strategy that treats unlabelled data as negative. In this way, a PU dataset is transformed into a fully labelled dataset, but one with label noise. This strategy has been employed by many existing works and is usually called the one-sided noise model. Under the framework of the one-sided noise model, this thesis proposes an instance-dependent model that expresses how likely a negative label is to be corrupted. The model relies on the probabilistic gap, defined as the difference between the posterior probabilities that an instance belongs to the positive and negative classes, respectively. Intuitively, an instance with a smaller probabilistic gap is more likely to be wrongly labelled. Motivated by this intuition, this thesis assumes that the probability of an instance's label being corrupted is negatively correlated with its probabilistic gap. This model is named the probabilistic-gap PU (PGPU) model. Based on the PGPU model, this thesis designs a Bayesian relabelling method that selects a group of the unlabelled instances and assigns them new labels identical to those given by a Bayesian optimal classifier. In this way, the labelled dataset can be significantly extended. Finally, this thesis employs conventional binary classification methods to learn a classifier from the extended labelled dataset. It is worth noting that there could be a sub-domain of the instance space where no data point can be relabelled, which could lead to a biased classifier. A kernel mean matching technique is therefore employed to remedy this problem. This thesis also evaluates the proposed method both theoretically and empirically.
Both theoretical and empirical results support the proposed method.
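The probabilistic-gap idea above can be illustrated with a minimal sketch. This is not the thesis's actual algorithm: it assumes the positive-class posteriors have already been estimated somehow, and the `upper`/`lower` relabelling thresholds are hypothetical stand-ins for the Bayesian-optimal selection rule the thesis derives.

```python
import numpy as np

def probabilistic_gap(posterior_pos):
    # Gap Δ(x) = P(y=+1|x) - P(y=-1|x) = 2 * P(y=+1|x) - 1.
    # Small |Δ(x)| means the instance is near the decision boundary,
    # so (under the PGPU assumption) its label is more likely corrupted.
    return 2.0 * posterior_pos - 1.0

def bayesian_relabel(posterior_pos, upper=0.8, lower=-0.8):
    # Relabel only unlabelled instances whose estimated gap clears a
    # confidence threshold; the rest stay unlabelled (encoded as 0).
    # The thresholds here are illustrative, not those from the thesis.
    gap = probabilistic_gap(posterior_pos)
    labels = np.zeros_like(gap, dtype=int)
    labels[gap >= upper] = 1    # confidently positive
    labels[gap <= lower] = -1   # confidently negative
    return labels

# Example: a confident positive, a borderline point, a confident negative.
posteriors = np.array([0.95, 0.50, 0.02])
print(bayesian_relabel(posteriors))  # → [ 1  0 -1]
```

Borderline instances are left untouched, which is exactly the sub-domain where no point can be relabelled; the thesis applies kernel mean matching to correct the resulting distribution mismatch.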
Date: 2018-09-30
Licence: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering and Information Technologies, School of Computer Science
Awarding institution: The University of Sydney