Technical Reports (CIS)

Document Type

Technical Report

Date of this Version

1-1-2012

Comments

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-12-17.

Abstract

In many domain adaptation formulations, it is assumed that a large amount of unlabeled data is available from the domain of interest (the target domain), some portion of which may be labeled, together with a large amount of labeled data from other domains, also known as source domain(s). Motivated by the fact that labeled data is hard to obtain in any domain, we design algorithms for the setting in which a large amount of unlabeled data exists from all domains, a small portion of which may be labeled.

We build on recent advances in graph-based semi-supervised learning and supervised metric learning. Given all instances, labeled and unlabeled, from all domains, we build a large similarity graph over them, in which an edge exists between two instances if they are close according to some metric. Instead of using a predefined metric, as is commonly done, we feed the labeled instances into metric-learning algorithms and (re)construct a data-dependent metric, which is then used to construct the graph. We employ different types of edges depending on the domain identity of the two vertices each edge connects, and learn the weights of each edge.
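
The pipeline described above can be illustrated with a minimal sketch in Python. It is not the report's implementation: the abstract does not name the metric learner, the graph construction details, or the inference procedure, so every name here is hypothetical. As stand-ins, the sketch uses a diagonal inverse-variance metric in place of a real metric-learning algorithm, fixed edge-type weights (w_same, w_cross) in place of learned ones, and plain iterative label propagation on a k-nearest-neighbor graph.

```python
import numpy as np

def learn_diag_metric(X_lab, y_lab, eps=1e-8):
    """Stand-in metric learner (assumption): weight each feature by the
    inverse of its pooled within-class variance on the labeled data."""
    within = np.zeros(X_lab.shape[1])
    for c in np.unique(y_lab):
        Xc = X_lab[y_lab == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 / (within / len(y_lab) + eps)

def build_graph(X, domains, metric_w, k=10, w_same=1.0, w_cross=0.5):
    """Symmetric k-NN graph under the learned metric. The edge type
    (same-domain vs. cross-domain) scales the weight; w_same and w_cross
    are fixed here, whereas the report learns edge weights."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2 * metric_w).sum(axis=2)      # weighted squared distances
    np.fill_diagonal(d2, np.inf)                 # exclude self-loops
    sigma2 = max(np.median(d2[np.isfinite(d2)]), 1e-12)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:
            type_w = w_same if domains[i] == domains[j] else w_cross
            w = type_w * np.exp(-d2[i, j] / sigma2)
            W[i, j] = W[j, i] = max(W[i, j], w)
    return W

def propagate_labels(W, y, labeled_mask, n_classes, n_iter=100):
    """Iterative label propagation with labeled vertices clamped."""
    idx = np.flatnonzero(labeled_mask)
    Y0 = np.zeros((W.shape[0], n_classes))
    Y0[idx, y[idx]] = 1.0
    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                          # guard isolated vertices
    P = W / deg                                  # row-normalized transitions
    F = Y0.copy()
    for _ in range(n_iter):
        F = P @ F
        F[idx] = Y0[idx]                         # re-clamp known labels
    return F.argmax(axis=1)

# Synthetic two-domain, two-class toy data (purely illustrative).
rng = np.random.default_rng(0)
n_per = 20
y_true = np.repeat([0, 1, 0, 1], n_per)
domains = np.repeat(["A", "A", "B", "B"], n_per)
centers = np.where(y_true == 0, -5.0, 5.0) + np.where(domains == "B", 0.3, 0.0)
X = np.column_stack([centers, np.zeros(4 * n_per)])
X = X + rng.normal(0.0, 0.5, size=X.shape)

# A small labeled portion in every domain: 3 per class in A, 1 per class in B.
labeled = np.zeros(4 * n_per, dtype=bool)
labeled[[0, 1, 2, 20, 21, 22, 40, 60]] = True

metric_w = learn_diag_metric(X[labeled], y_true[labeled])
W = build_graph(X, domains, metric_w)
pred = propagate_labels(W, y_true, labeled, n_classes=2)
accuracy = (pred[~labeled] == y_true[~labeled]).mean()
```

In this sketch the labeled instances influence the result twice: once through the metric that decides which vertices become neighbors, and once as clamped seeds during propagation. Setting w_cross below w_same is only one possible way to treat cross-domain edges differently from same-domain edges.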

We provide extensive empirical evidence demonstrating that our approach leads to a significant reduction in classification error across domains, and we evaluate the contribution of each resource: the labeled and unlabeled data of the various domains.

Keywords

Sentiment Analysis, Machine Learning, Domain Adaptation

Included in

Engineering Commons

Date Posted: 31 October 2012