
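Feature Selection for Genre-Based Chinese Web Page Categorization
Cite this article:黄臻臻,吴扬扬.基于体裁的中文网页分类的特征选取[J].计算机工程与设计,2007,28(11):2743-2745.
Authors:黄臻臻 (HUANG Zhen-Zhen)  吴扬扬 (WU Yang-yang)
Affiliation:Department of Computer Science, Huaqiao University, Quanzhou, Fujian 362021, China
Funding:Fujian Provincial Science and Technology Key Project; Natural Science Foundation of Fujian Province
Abstract:This paper examines feature selection for genre-based Chinese web page categorization. Lexical features are obtained by combining automatic extraction with manual induction. Frequent-string features are obtained by sequence mining over an improved PAT-tree storage structure, which frees the text classification system from its dependence on word segmentation and dictionaries; a feature representation based on fuzzy string patterns is also proposed. In addition, the feature set incorporates formal features of the text and, reflecting the characteristics of web pages, link-information features. A genre-based Chinese web page categorization system was implemented, and the results show that classification performance is effectively improved.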

Keywords:web page categorization  genre  feature selection  sequence mining  fuzzy string pattern  Chinese web page categorization
Article ID:1000-7024(2007)11-2743-03
Revision received:2006-06-29

Feature selection of Chinese web page categorization based on genre
HUANG Zhen-Zhen,WU Yang-yang.Feature selection of Chinese web page categorization based on genre[J].Computer Engineering and Design,2007,28(11):2743-2745.
Authors:HUANG Zhen-Zhen  WU Yang-yang
Affiliation:Department of Computer Science, Huaqiao University, Quanzhou 362021, China
Abstract:This paper studies feature selection for genre-based Chinese web page categorization. Lexical features are obtained by combining automatic extraction with manual induction. Frequent-string features are extracted by sequence mining over a modified PAT-tree storage structure, so that the classifier no longer depends on word segmentation procedures or large dictionaries. A new feature representation based on fuzzy string patterns is proposed. Furthermore, the feature set includes formal features of the documents and link-information features. A genre-based Chinese web page categorization system is implemented, and the experimental results show that the method improves classifier performance.
Keywords:web page categorization  genre  feature selection  sequence mining  fuzzy character pattern
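
The frequent-string idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper mines frequent strings with an improved PAT tree, whereas this sketch swaps in plain substring counting with a `Counter`, and the function names, length bounds, document-frequency threshold, and sample texts are all assumptions made for illustration. What it shows is only that character-level substrings shared across documents can serve as features without any word segmentation or dictionary.

```python
from collections import Counter


def frequent_strings(docs, min_len=2, max_len=4, min_df=2):
    """Return substrings (min_len..max_len characters) that occur in at
    least min_df documents, mapped to their document frequency.

    Assumption: brute-force counting stands in for the PAT-tree-based
    sequence mining described in the abstract.
    """
    df = Counter()
    for text in docs:
        seen = set()  # count each substring at most once per document
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                seen.add(text[i:i + n])
        df.update(seen)
    return {s: c for s, c in df.items() if c >= min_df}


def feature_vector(text, features):
    """Count non-overlapping occurrences of each feature string in text."""
    return [text.count(f) for f in features]


# Hypothetical sample texts: two news-style openings and one home page.
docs = ["本报讯记者报道", "本报讯今日消息", "欢迎访问我的主页"]
freq = frequent_strings(docs)
features = sorted(freq)
vector = feature_vector("本报讯快讯", features)
```

Here the strings shared by the two news-style texts become features without any segmentation step, and `vector` holds their counts in a new text. A real system would need a more memory-efficient index for large corpora, which is the role the PAT tree plays in the paper.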
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.