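Unsupervised Text Feature Extraction Based on Natural Annotation and Latent Topic Model
Cite this article: RAO Gaoqi, YU Dong, XUN Endong. Unsupervised Text Feature Extraction Based on Natural Annotation and Latent Topic Model[J]. Journal of Chinese Information Processing, 2015, 29(6): 141-149.
Authors: RAO Gaoqi  YU Dong  XUN Endong
Affiliations: 1. Institute of Big Data and Language Education, Beijing Language and Culture University, Beijing 100083; 2. Institute for Chinese Language Policies and Standards, Beijing 100083
Funding: National Natural Science Foundation of China (61300081, 61170162); Major Program of the National Social Science Fund of China (12&ZD173); Research Fund of the State Language Commission (YB125-42); Graduate Innovation Fund of Beijing Language and Culture University (14YCX074)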
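Abstract: Terms and idiomatic phrases can reflect the features of a text. Unsupervised extraction of such feature words supports many natural language processing tasks. This paper proposes a "cluster-verification" process: a topic model clusters the characters in a text, and natural annotation information is used to verify and filter the extracted strings, yielding an unsupervised method for obtaining a lexicon from unsegmented domain corpora. With further optimization and filtering, a high-confidence feature lexicon rich in terminology and characteristic phrases can be obtained. In experiments on corpora from six different domains, including computer science, the feature lexicons extracted by this method discriminate well between writing styles and between domains.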

Keywords: natural annotation information  natural chunk  latent topic model  domain feature  stylistic feature

Unsupervised Text Feature Extraction Based on Natural Annotation and Latent Topic Model
RAO Gaoqi, YU Dong, XUN Endong. Unsupervised Text Feature Extraction Based on Natural Annotation and Latent Topic Model[J]. Journal of Chinese Information Processing, 2015, 29(6): 141-149.
Authors:RAO Gaoqi  YU Dong  XUN Endong
Affiliation:1. Institute of Big Data and Language Education, Beijing Language and Culture University, Beijing 100083, China;
2. Institute for Chinese Language Policies and Standards, Beijing 100083, China
Abstract:Terms and phrases often reveal the features of a text, and extracting them without supervision can support a wide range of natural language processing tasks. We propose a “Cluster-Verification” method that builds a lexicon from a raw corpus by combining a latent topic model with natural annotation. Topic modeling clusters strings, and natural annotations in the raw corpus are then used to filter and optimize the result. The resulting lexicon shows high accuracy and describes the domains and writing styles of the texts well. Experiments on corpora from six domains show promising results in distinguishing both domain and writing style.
Keywords:natural annotation  natural chunk  latent topic model  domain feature  stylistic features  
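
The verification half of the “Cluster-Verification” idea can be illustrated with a minimal sketch. It is not the authors' implementation, which the abstract does not detail. It assumes that punctuation serves as the natural annotation, that text between punctuation marks forms a natural chunk, and that a candidate string is more credible when it repeatedly starts or ends such chunks. The names `natural_chunks`, `boundary_support`, `verify`, and the `min_support` threshold are hypothetical, and the topic-model clustering step is omitted: candidate strings are assumed to be given.

```python
import re

# Punctuation treated as natural annotation (an assumption of this sketch).
PUNCT = r"[,。;:、!?,.;:!?\s]+"

def natural_chunks(text):
    """Split raw, unsegmented text at punctuation into natural chunks."""
    return [c for c in re.split(PUNCT, text) if c]

def boundary_support(candidate, chunks):
    """Count the chunks that start or end with the candidate string."""
    return sum(
        1 for c in chunks
        if c.startswith(candidate) or c.endswith(candidate)
    )

def verify(candidates, text, min_support=2):
    """Keep candidates that sit at chunk boundaries at least min_support times."""
    chunks = natural_chunks(text)
    return [w for w in candidates if boundary_support(w, chunks) >= min_support]

if __name__ == "__main__":
    text = "主题模型用于聚类,我们使用主题模型。主题模型很有效,聚类结果良好。"
    # "型用" straddles a word boundary and never aligns with a chunk edge.
    print(verify(["主题模型", "聚类", "型用"], text))
```

In this toy input, the candidate that cuts across a word boundary is filtered out because it never coincides with a punctuation-delimited edge, while the two genuine terms are kept.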