Research on Indexing Techniques for Similarity Search over Semi-Structured Data

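Cite this article:	YANG Jian-Wu, CHEN Xiao-Ou. Research on Indexing Techniques for Similarity Search over Semi-Structured Data [J]. Chinese Journal of Computers, 2002, 25(11): 1219-1226.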

Authors:	YANG Jian-Wu, CHEN Xiao-Ou

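Affiliation:	National Key Laboratory of Text Information Processing Technology, Institute of Computer Science, Peking University, Beijing 100871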

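Abstract:	To support effective similarity search over massive, high-dimensional, dynamic semi-structured data sets, this paper proposes the CSS-tree, a multi-way balanced tree that uses clustering to build and update the index, together with CSS-tree-based algorithms for similarity search and dynamic update. The CSS-tree borrows from the SS+-tree the basic idea of organizing and splitting nodes by clustering, which avoids the independence between dimensions that splitting along coordinate axes requires. It also improves node organization, the split algorithm, and the search algorithm, and introduces a new search pruning strategy. Experiments show that the structure and its algorithms are markedly more efficient than traditional algorithms for similarity search over massive semi-structured data.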
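Keywords:	semi-structured data; similarity search; index; similarity indexing; clustering; data mining; database; multi-way balanced tree
Revised:	August 3, 2001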
An Index Structure of Semi-Structure Data Set for Similarity Search

YANG Jian-Wu, CHEN Xiao-Ou. An Index Structure of Semi-Structure Data Set for Similarity Search [J]. Chinese Journal of Computers, 2002, 25(11): 1219-1226.

Authors:	YANG Jian-Wu, CHEN Xiao-Ou

Abstract:	A new index, the CSS-tree, is proposed for organizing and searching dynamic, high-dimensional, very large semi-structured data sets. The CSS-tree is a multi-way balanced tree that combines the strengths of the R-tree and SS-tree in handling large high-dimensional data sets with the strength of the M-tree in handling "metric space" data sets. This paper details the structure of the CSS-tree: each inner node consists of a group of index elements, each holding the cover center and cover radius of a child subtree; all leaves are on the same level; and all data indices are stored in the leaves. The paper gives CSS-tree-based algorithms for similarity search, covering both range search and k-NN search, as well as algorithms for dynamically updating the tree. It describes a simple split policy modeled on the split policy of the CF-tree in BIRCH, and reorganization algorithms that use clustering to keep similar index elements adjacent in the tree, which avoids requiring independence between feature values. It also describes how to keep the cover space and overlap space minimal. Experiments on simulated data sets and on part of the "Chinese Encyclopedia Database", a set of XML documents, show that the CSS-tree is comparable to the SS+-tree and M-tree in tree construction but outperforms both in similarity search over semi-structured data sets.

Keywords:	similarity indexing; similarity search; semi-structured document; cluster; XML
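
The abstract describes inner nodes that store a cover center and cover radius for each child subtree, and a range search that prunes subtrees. A minimal sketch of how such cover spheres allow pruning is given below. It is not the paper's algorithm: the abstract does not specify the CSS-tree's pruning strategy, so this uses the generic triangle-inequality test common to sphere-based indexes. The names `Node`, `distance`, and `range_search` are hypothetical, and Euclidean distance is assumed as the metric.

```python
import math
from dataclasses import dataclass, field


def distance(a, b):
    """Euclidean distance; an assumed stand-in for the index's metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Node:
    """Hypothetical node: a cover sphere plus children (inner) or points (leaf)."""
    center: tuple                                  # cover center of the subtree
    radius: float                                  # cover radius of the subtree
    children: list = field(default_factory=list)   # child nodes, if inner node
    points: list = field(default_factory=list)     # data vectors, if leaf


def range_search(node, query, r, out=None):
    """Collect every stored point within distance r of query."""
    if out is None:
        out = []
    # Triangle inequality: if the query is farther from the cover center than
    # r + radius, no point inside the cover sphere can lie within r of it.
    if distance(query, node.center) > r + node.radius:
        return out  # prune the whole subtree
    for p in node.points:
        if distance(query, p) <= r:
            out.append(p)
    for child in node.children:
        range_search(child, query, r, out)
    return out


# Tiny two-leaf tree; every cover sphere contains all points below it.
leaf_a = Node(center=(0.0, 0.0), radius=1.5, points=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Node(center=(10.0, 10.0), radius=1.0, points=[(10.0, 10.0), (10.5, 10.5)])
root = Node(center=(5.0, 5.0), radius=9.0, children=[leaf_a, leaf_b])

near = range_search(root, (0.5, 0.5), 1.0)      # leaf_b is pruned
far = range_search(root, (100.0, 100.0), 1.0)   # root itself is pruned
```

The same bound could order candidates in a k-NN search, with the current k-th best distance taking the place of the fixed radius `r`.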
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
