A Semi-Structured Document Model for Text Mining
Authors: Yang Jianwu (杨建武), Chen Xiaoou (陈晓鸥)
Affiliations: [1] National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Peking University, Beijing 100871, P.R. China [2] National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Pek
Funding: This research is supported by the National Technology Innovation Project and by the Peking University Graduate Student Development Foundation as an innovative doctoral dissertation research project.
Abstract: A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can be fully utilized. To take advantage of the structure and link information in semi-structured documents for better mining, this paper presents a structured link vector model (SLVM), in which a vector represents a document and the vector's elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the two steps of the K-means procedure: calculating document similarity and calculating the cluster center. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model: the F value increases from 0.65-0.73 to 0.82-0.86.
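The two K-means steps named in the abstract (document similarity and cluster center) can be illustrated with a minimal Python sketch. The abstract does not give the SLVM weighting formulas, so everything specific here is an assumption made for illustration: the `STRUCTURE_WEIGHTS` and `LINK_WEIGHT` values, the choice of cosine similarity, the way neighboring documents are averaged in, and the toy documents. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical weights: the abstract does not state the paper's actual values.
STRUCTURE_WEIGHTS = {"title": 3.0, "body": 1.0}
LINK_WEIGHT = 0.5  # assumed contribution of neighboring (linked) documents


def term_vector(doc):
    """Term counts weighted by the structural element each term occurs in."""
    vec = Counter()
    for element, text in doc["elements"].items():
        weight = STRUCTURE_WEIGHTS.get(element, 1.0)
        for term in text.lower().split():
            vec[term] += weight
    return vec


def document_vector(doc_id, docs):
    """Own term vector plus a damped average of the linked documents' vectors."""
    vec = Counter(term_vector(docs[doc_id]))
    links = docs[doc_id].get("links", [])
    for other in links:
        for term, value in term_vector(docs[other]).items():
            vec[term] += LINK_WEIGHT * value / len(links)
    return vec


def cosine(u, v):
    """K-means step 1: similarity between two sparse vectors."""
    dot = sum(value * v.get(term, 0.0) for term, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """K-means step 2: cluster center as the mean of the member vectors."""
    total = Counter()
    for vec in vectors:
        for term, value in vec.items():
            total[term] += value / len(vectors)
    return total


def kmeans(vectors, seeds, iterations=10):
    """Assign each vector to its most similar center, then recompute centers."""
    centers = [vectors[i] for i in seeds]
    labels = []
    for _ in range(iterations):
        labels = [max(range(len(centers)), key=lambda c: cosine(vec, centers[c]))
                  for vec in vectors]
        for c in range(len(centers)):
            members = [vec for vec, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = centroid(members)
    return labels


# Toy corpus (invented for this sketch): two XML documents, two soccer documents.
docs = {
    "a": {"elements": {"title": "xml mining", "body": "xml documents structure"}, "links": ["b"]},
    "b": {"elements": {"title": "xml model", "body": "vector model xml"}, "links": []},
    "c": {"elements": {"title": "soccer match", "body": "goal soccer team"}, "links": ["d"]},
    "d": {"elements": {"title": "soccer league", "body": "team goal league"}, "links": []},
}
vectors = [document_vector(doc_id, docs) for doc_id in ["a", "b", "c", "d"]]
labels = kmeans(vectors, seeds=[0, 2])
```

Setting every structure weight to 1.0 and `LINK_WEIGHT` to 0 reduces `document_vector` to a plain term-count vector, which is one simple way to compare against a conventional vector space model baseline in this sketch.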

Keywords (from the Chinese record): HTML, XML, semi-structured document model, text mining, structural information

Jianwu Yang, Xiaoou Chen. A Semi-Structured Document Model for Text Mining[J]. Journal of Computer Science and Technology, 2002, 17(5): 0-0.
Keywords: semi-structured document, XML, text mining, vector space model, structured link vector model
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), SpringerLink, and other databases.