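Unsupervised Automatic Segmentation of Uyghur Text and Unsupervised Feature Selection
Cite this article: TOHTI Turdi, PATTA Akbarr, HAMDULLA Askar. Unsupervised Automatic Segmentation of Uyghur Text and Unsupervised Feature Selection[J]. Pattern Recognition and Artificial Intelligence, 2013, 26(9): 845-852
Authors: TOHTI Turdi  PATTA Akbarr  HAMDULLA Askar
Affiliation: School of Information Science and Engineering, Xinjiang University, Urumqi 830046
Funding: Supported by the National Natural Science Foundation of China (No. 61063022, 61262062, 61163033, 61163032), the Program for New Century Excellent Talents in University of the Ministry of Education (No. NCET-10-0969), the High-Tech Research and Development Program of Xinjiang Uyghur Autonomous Region (No. 201212124), and the Key Project of the Scientific Research Program of Higher Education Institutions of Xinjiang Uyghur Autonomous Region (No. XJEDU2012I11)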
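Abstract:
Common segmentation methods for Uyghur text produce a large number of word features that are semantically abstract or even polysemous, which makes it hard for learning algorithms to discover the structure hidden in high-dimensional data. This paper proposes an unsupervised segmentation method, dme-TS, and an unsupervised feature selection method, UMRMR-UFS. dme-TS automatically acquires word Bi-grams and contextual information from a large-scale raw corpus. It then uses a linear fusion of three statistics computed between adjacent words (the t-test difference, the mutual information, and the entropy of two-word contextual adjacency pairs) as a combined statistic (dme) to evaluate how strongly the words bind together, and so segments text into a feature set of independent linguistic units with concrete semantics. UMRMR-UFS evaluates the importance of each feature with an unsupervised feature selection criterion (UMRMR) that jointly considers maximum relevance and minimum redundancy, and moves the most important features into the feature subset one at a time. Experimental results show that dme-TS effectively controls the size of the original feature set and improves the quality of the feature terms themselves, and that the learning algorithm achieves its highest performance when text is represented by the output of UMRMR-UFS.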
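
The abstract does not give the formulas behind dme, so the following is only a minimal sketch of the general idea, under stated assumptions. It computes pointwise mutual information for adjacent word pairs from raw counts, combines three statistics linearly, and joins adjacent words whose combined score reaches a threshold. The function names, the weights, the threshold, and the use of plain pointwise mutual information are all hypothetical choices for illustration, not the authors' definitions. The t-test difference and the adjacency-pair entropy are passed in as precomputed values.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens):
    """Return {(w1, w2): PMI in bits} for each adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def combined_score(t_diff, mi, entropy, weights=(1.0, 1.0, 1.0)):
    """Linear combination of three statistics.

    The weights (and their signs) are hypothetical; the abstract does not
    state how the three terms are weighted.
    """
    a, b, c = weights
    return a * t_diff + b * mi + c * entropy


def segment(tokens, pair_score, threshold):
    """Join adjacent words whose pair_score is >= threshold into one unit."""
    if not tokens:
        return []
    units = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if pair_score(prev, cur) >= threshold:
            units[-1].append(cur)   # strong association: extend current unit
        else:
            units.append([cur])     # weak association: start a new unit
    return [" ".join(unit) for unit in units]
```

In this sketch `pair_score` is any callable that returns a score for two adjacent words, for example one that looks up the three statistics and calls `combined_score`.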

Keywords: Uyghur segmentation  mutual information  t-test difference  entropy of adjacency pairs  unsupervised feature selection
Received: 2012-08-14

Unsupervised Uyghur Segmentation and Unsupervised Feature Selection
TOHTI Turdi, PATTA Akbarr, HAMDULLA Askar. Unsupervised Uyghur Segmentation and Unsupervised Feature Selection[J]. Pattern Recognition and Artificial Intelligence, 2013, 26(9): 845-852
Authors: TOHTI Turdi  PATTA Akbarr  HAMDULLA Askar
Affiliation: School of Information Science and Engineering, Xinjiang University, Urumqi 830046
Abstract:
Commonly used Uyghur segmentation methods produce a large number of semantically abstract and even polysemous word features, which makes it difficult for learning algorithms to find the hidden structure in high-dimensional data. An unsupervised segmentation approach, dme-TS, and an unsupervised feature selection approach, UMRMR-UFS, are proposed. In dme-TS, word-based Bi-grams and contextual information are derived automatically from a large-scale raw text corpus, and a linear combination of the difference of t-test, mutual information and entropy of double-word adjacency is taken as a measurement (dme) to estimate the agglutinative strength between two adjacent Uyghur words. In UMRMR-UFS, an improved unsupervised feature selection criterion (UMRMR) is proposed, and the importance of each feature is estimated according to its maximum relevance and minimum redundancy. The experimental results show that dme-TS effectively reduces the dimensionality of the original feature set and improves the quality of the features themselves, and that the learning algorithm achieves its highest performance on the feature subset selected by UMRMR-UFS.
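
The UMRMR criterion itself is not defined in this record, so the sketch below only illustrates the general shape of a greedy selection that balances maximum relevance against minimum redundancy and moves one feature at a time into the subset. The scoring rule (relevance minus mean redundancy with the already selected features) is the common difference form of mRMR and is an assumption here, as are the function name `greedy_mrmr` and the caller-supplied `relevance` and `redundancy` callables.

```python
def greedy_mrmr(features, relevance, redundancy, k):
    """Greedily pick up to k features.

    relevance(f) and redundancy(f, s) are hypothetical, caller-supplied
    scoring functions; how UMRMR computes them without labels is not
    stated in the abstract.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:

        def score(f):
            if not selected:
                return relevance(f)
            # Mean redundancy of f with the features already chosen.
            penalty = sum(redundancy(f, s) for s in selected) / len(selected)
            return relevance(f) - penalty

        best = max(remaining, key=score)
        selected.append(best)    # move the best feature into the subset
        remaining.remove(best)
    return selected
```

With this rule, a feature that is highly relevant but nearly duplicates one already selected can rank below a less relevant but more distinct feature.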
Keywords: Uyghur Segmentation  Mutual Information  Difference of t-Test  Entropy of Adjacency  Unsupervised Feature Selection