首页 | 本学科首页   官方微博 | 高级检索  
     

半结构化数据相似搜索的索引技术研究
引用本文:杨建武,陈晓鸥.半结构化数据相似搜索的索引技术研究[J].计算机学报,2002,25(11):1219-1226.
作者姓名:杨建武  陈晓鸥
作者单位:北京大学计算机研究所文字信息处理技术国家重点实验室,北京,100871
摘    要:为了在海量、高维、动态的半结构化数据集上进行有效的相似搜索,该文提出一种采用聚类技术进行索引构建与更新的多路平衡树--CSS-树以及基于CSS-树的相似搜索与动态更新的算法。CSS-树借鉴SS^ -树基于聚类进行节点组织与分裂的基本思想,避免了根据坐标准进行分裂时所要求的维不相关性,同时在节点组织、分裂算法和搜索算法等方面进行了改进,提出了新的搜索剪枝策略,实验表明,该结构及算法对海量半结构化数据相似搜索和效率明显优于传统算法。

关 键 词:半结构化数据  相似搜索  索引  相似索引  聚类  数据挖掘  数据库  多路平衡树
修稿时间:2001年8月3日

An Index Structure of Semi-Structure Data Set for Similarity Search
YANG Jian-Wu,CHEN Xiao-Ou.An Index Structure of Semi-Structure Data Set for Similarity Search[J].Chinese Journal of Computers,2002,25(11):1219-1226.
Authors:YANG Jian-Wu  CHEN Xiao-Ou
Abstract:A new index, called CSS-tree, is proposed to organize and search dynamic high-dimension vast semi-structure data set. The CSS-tree is a multi-way balance tree, which is combining the benefit of R-tree and SS-tree to deal with high-dimension vast data sets, and the benefit of M-tree to deal with "metric space" data sets. This paper details the structure of CSS-tree, whose each inner node is composed of a group of index elements including cover center and cover radius of child tree and every leaf is in same level and all data indices is in leaves. The paper give algorithms for similarity search based CSS-tree both range search and k-NN search, and dynamic update algorithms of the CSS-tree. It describes the simply split policy which reference to CF-tree's split policy of BIRTH, and reorganizing algorithms which using clustering technique to keep the index elements that the similar elements are neighbor in the index tree, and avoid the need of independent between feather values. It also describes how to keep minimum cover space and overlap space. Using simulation data sets and using part of "Chinese Encyclopedia Database" as data set, which is on XML document set, experiments show that the CSS-tree is close to SS -tree and M-tree in building tree, but CSS-tree outperforms both SS -tree and M-tree in similarity search in semi-structured data sets.
Keywords:similarity indexing  similarity search  semi-structured document  cluster  XML
本文献已被 CNKI 维普 万方数据 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号