Data heterogeneity consideration in semi-supervised learning
Affiliation:1. Institute of Humanities, Arts and Sciences, Federal University of Southern Bahia, BR-367, Km 10, CEP: 45810-000, Porto Seguro, Bahia, Brazil;2. Department of Computer Science, Institute of Mathematics and Computer Science, University of São Paulo, Av. Trabalhador São-carlense, 400, Caixa Postal: 668, CEP: 13560-970, São Carlos, São Paulo, Brazil;3. Department of Computation and Mathematics, School of Philosophy, Science and Literature in Ribeirão Preto, University of São Paulo, Av. Bandeirantes, 3900, CEP: 14090-901, Ribeirão Preto, São Paulo, Brazil;1. DEI, University of Padua, viale Gradenigo 6, Padua, Italy;2. DIN, State University of Maringa (UEM), Maringa, PR, Brazil;3. DISI, University of Bologna, Cesena, Italy;4. DTIC, Sejong University, Seoul, Republic of Korea;1. Computer and Network Center, National Cheng Kung University, Tainan 701, Taiwan;2. Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan
Abstract: In the class (cluster) formation process of machine learning techniques, data instances are usually assumed to be equally relevant. This assumption frequently does not hold. The issue is especially pronounced in semi-supervised learning, where the structure of labeled and unlabeled data must be understood at the same time. In this paper, we investigate the organizational heterogeneity of data in semi-supervised learning using a graph representation. A graph is a natural choice for characterizing the relationship between any pair of nodes or any pair of groups of nodes; consequently, the strategic location of each node or group of nodes can be determined by graph measures. Specifically, two issues are addressed: (1) We propose an adaptive graph construction method, called AdaRadius, that considers the heterogeneity of the local interaction structure among nodes. It has several useful properties, namely adaptability to variations in data density, low dependence on parameter settings, and reasonable computational cost, for both pool-based and incremental data. (2) We present heuristic criteria for selecting representative data samples to be labeled. Experiments show that selective labeling usually yields better classification results than random labeling. To our knowledge, both issues have so far received little investigation; our approach is therefore an important step toward characterizing data heterogeneity, not only in semi-supervised learning but also in machine learning in general.
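The abstract does not give the details of AdaRadius, so the following is only a minimal sketch of the general idea of a density-adaptive radius graph, not the authors' method. It assumes a hypothetical rule in which each node's radius is the distance to its k-th nearest neighbor, and it uses node degree as a stand-in for the unspecified heuristic criteria for choosing samples to label. The function names, the parameter `k`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def adaptive_radius_graph(X, k=2):
    """Hypothetical density-adaptive radius graph (NOT the paper's AdaRadius).

    Each node i gets its own radius r_i, assumed here to be the distance to
    its k-th nearest neighbor. Nodes i and j are linked when their distance
    is within r_i or within r_j, so the adjacency matrix is symmetric.
    Requires more than k points.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    # After sorting a row, column 0 is the self-distance (0) and
    # column k is the distance to the k-th nearest neighbor.
    radii = np.sort(D, axis=1)[:, k]
    A = (D <= radii[:, None]) | (D <= radii[None, :])
    np.fill_diagonal(A, False)                 # no self-loops
    return A, radii

def select_by_degree(A, n_labels):
    """Pick the n_labels highest-degree nodes as candidates for labeling.

    Degree is only one possible graph measure; it is an assumption here.
    """
    degree = A.sum(axis=1)
    return np.argsort(-degree, kind="stable")[:n_labels]

# Toy 1-D data: a dense group near 0 and a sparse group near 10.
X = np.array([[0.0], [0.1], [0.2], [0.3], [10.0], [12.0], [14.0], [16.0]])
A, radii = adaptive_radius_graph(X, k=2)
candidates = select_by_degree(A, n_labels=2)
print("radii:", radii)
print("candidates to label:", candidates)
```

Because the radius is chosen per node, the sparse group receives larger radii than the dense group, which is the kind of adaptability to density variation that the abstract attributes to AdaRadius. A single global radius would either disconnect the sparse group or over-connect the dense one.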
This article is indexed in ScienceDirect and other databases.