Similar literature
1.
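Existing tree clustering algorithms cannot cope with dynamic changes and uncertainty in data. This paper studies the clustering of uncertain data and proposes a dynamic clustering algorithm for uncertain tree databases, which resolves the failure to cluster caused by dynamically changing data. First, the concepts of transformed tree set, similarity group, and tree class set are introduced to describe a clustering model for an uncertain tree database. Second, because subtrees have both node-level semantic features and structural properties, a semantic similarity measure and a structural similarity measure are proposed to assess subtree similarity more accurately; the two are weighted and summed to give the final similarity. Third, a dynamic clustering procedure is designed that obtains the clustering threshold adaptively, which largely reduces the inaccuracy introduced by manual intervention: subtrees with similar structure are gathered into the same similarity group, while the similarity between subtrees in different groups is minimized. For each similarity group, a formula is defined for extracting a representative subtree, and these representatives, as tree classes, make up the tree class set. Finally, experiments on simulated data and in a real environment indicate that the algorithm is feasible and effective, and that it produces fairly accurate clusters with good running efficiency.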
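
The weighted combination of semantic and structural similarity described in this abstract can be sketched as follows. The abstract does not give the two measures or the weights, so this sketch assumes Jaccard overlap of node labels for the semantic part and Jaccard overlap of parent-child edges for the structural part; `jaccard`, `subtree_similarity`, and `alpha` are hypothetical names, not the paper's.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; taken to be 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def subtree_similarity(labels_a, edges_a, labels_b, edges_b, alpha=0.5):
    """Weighted sum of a semantic score and a structural score.

    Assumption: the semantic score compares node labels and the structural
    score compares (parent, child) edges. alpha weights the semantic part.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    semantic = jaccard(set(labels_a), set(labels_b))
    structural = jaccard(set(edges_a), set(edges_b))
    return alpha * semantic + (1.0 - alpha) * structural
```

Keeping both scores in [0, 1] means the combined value also stays in [0, 1], which makes a single clustering threshold easier to interpret.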

2.
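The R*-tree is widely regarded as an R-tree variant with very good query performance, but its construction cost is several times that of the original R-tree, so it performs poorly on spatial data with frequent insertions, deletions, and updates. This paper therefore proposes a dynamic R-tree implementation based on lazy clustering splitting (the LR-tree). With lazy clustering splitting, when inserting an object overflows a node, the node is not split immediately; instead, the entry is tentatively inserted into a neighboring node that is not full. Only when all neighboring nodes are full is the node split using a clustering technique, and the entries are regrouped among the neighboring nodes and the split nodes. While preserving query performance, the LR-tree greatly reduces construction cost and substantially improves the space utilization of the index structure. Analysis and experiments confirm the efficiency of the LR-tree.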
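
The lazy-split idea in this abstract can be illustrated with a toy sketch. It assumes one-dimensional entries and uses a sorted median cut as a stand-in for the clustering-based split, which the abstract does not specify; `lazy_insert` and its parameters are hypothetical, and the regrouping of entries among neighbors is omitted.

```python
def lazy_insert(nodes, target, entry, capacity, neighbors):
    """Insert entry into nodes[target], deferring a split while a neighbor has room.

    nodes is a list of lists of numbers, neighbors a list of node indices.
    Returns "direct", "redirected", or "split".
    """
    if len(nodes[target]) < capacity:
        nodes[target].append(entry)
        return "direct"
    # Overflow: try each neighboring node that still has free space.
    for n in neighbors:
        if len(nodes[n]) < capacity:
            nodes[n].append(entry)
            return "redirected"
    # Every neighbor is full, so split now (median cut stands in for clustering).
    items = sorted(nodes[target] + [entry])
    mid = len(items) // 2
    nodes[target] = items[:mid]
    nodes.append(items[mid:])
    return "split"
```

Deferring the split keeps nodes fuller on average, which is consistent with the space-utilization gain the abstract reports.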

3.
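Research on a P2P-based cooperative browser cache system   Total citations: 1, self-citations: 0, citations by others: 1
This paper proposes IntraCache, a cooperative browser cache system based on P2P technology. IntraCache has three kinds of nodes: a registration server, fat nodes, and thin nodes. The registration server handles node registration; a fat node manages all the nodes in a group and the index of the cache content those nodes share; thin nodes can communicate and share resources with other nodes. Compared with traditional proxy cache systems, the system is easy to scale and insensitive to node failures. IntraCache uses the PB grouping method to cluster nodes dynamically online, organizing nodes with similar interests into an interest group, and searches using a method based on interest similarity. Log-driven simulation tests show that PB grouping clusters better than earlier methods and effectively raises the hit rate, and that the similarity-based search strategy effectively shrinks the search space and thereby reduces latency.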

4.
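A dimension-by-dimension clustering similarity index algorithm   Total citations: 5, self-citations: 0, citations by others: 5
With the rapid development of multimedia information technology, multidimensional indexing has become an important research area for storing and retrieving visual information such as images and video. To address the "curse of dimensionality", a dimension-by-dimension clustering similarity index algorithm is proposed. Based on the distribution of the data set, the algorithm clusters each dimension of the feature vectors; during retrieval it progressively filters out data that are dissimilar to the query vector, narrowing the search range and thus speeding up retrieval. Experimental results show that the algorithm is suitable for similarity-based retrieval and querying of high-dimensional vectors and is a simple, flexible index structure.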

5.
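A clustering-based Hilbert R-tree spatial index algorithm   Total citations: 2, self-citations: 2, citations by others: 0
The R-tree is well suited to dynamic indexing but suffers from large spatial overlap, and the Hilbert R-tree likewise fails to reduce node coverage and overlap effectively, which directly degrades R-tree query efficiency. To meet the needs of large numbers of GIS query applications, an index algorithm that clusters Hilbert R-tree nodes is proposed. It stores neighboring data in clusters, which shrinks the MBR (minimum bounding rectangle) area of leaf nodes and reduces overlap among internal nodes. Experimental tests and performance analysis show that the algorithm achieves high query efficiency.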
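
For background, the Hilbert ordering that a Hilbert R-tree relies on can be sketched as below. This is the standard grid-to-curve mapping followed by a simple sort-and-chunk packing of leaves; it is not the clustering step this paper proposes, and `hilbert_index` and `pack_leaves` are illustrative names.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve over an n-by-n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def pack_leaves(points, n, capacity):
    """Sort points by Hilbert position and cut the sequence into leaves."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]
```

Because consecutive curve positions are adjacent grid cells, points that share a leaf tend to be spatially close, which is what keeps leaf MBRs small.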

6.
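Z-tree: an index structure for high-dimensional data   Total citations: 3, self-citations: 0, citations by others: 3
Zhang Qiang, Zhao Zheng. 《计算机工程》 (Computer Engineering), 2007, 33(15): 49-51
The Z-tree efficiently handles rectangular range queries and nearest-neighbor searches on high-dimensional data sets. It chooses the insertion position according to the change in node shape, so that node shapes tend toward reasonable ones. The paper gives a new overlap-free split algorithm that reduces the creation of supernodes, and introduces dynamic pruning and reinsertion strategies that reduce the number and volume of supernodes. It also proposes a method for spherizing rectangular nodes and an optimal-subtree search algorithm. Experiments show that the Z-tree is far more efficient than the X-tree and the SR-tree for rectangular range queries and nearest-neighbor search.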

7.
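Structured data are both massive and complex, which makes anomalies harder to identify. An anomaly identification algorithm for massive structured data based on an improved random forest is therefore proposed. Complementary ensemble empirical mode decomposition is used to obtain the intrinsic mode functions of the data and remove noise points. Decision-tree nodes are split on randomly selected feature subsets, and the random forest is weighted with the AdaBoost algorithm, which completes the improvement. The extended spatial range of the improved random forest is defined as the range of anomalous values, and a locality-sensitive hashing algorithm is used to measure the degree of anomaly of the denoised data, which yields the anomaly identification. In experiments, the identification accuracy reached up to 95.8%, and identification took 2.52 min for 400 G of structured data, indicating high accuracy, short running time, and good application value.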

8.
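Multi-representative feature tree and a spatial clustering algorithm   Total citations: 1, self-citations: 0, citations by others: 1
Spatial data are massive, complex, continuous, and spatially autocorrelated, and contain gaps and errors. A spatial clustering algorithm must therefore be efficient, handle clusters of arbitrary complex shape, give results independent of the order in which the data are distributed in space, and be robust to outliers; existing algorithms struggle to meet all these requirements at once. This paper proposes a data structure suited to massive, complex spatial data: the multi-representative feature tree. Based on it, the clustering algorithm CAMFT is proposed for mining massive, complex spatial data. The algorithm uses the multi-representative feature tree to compress the data and combines it with random sampling to further strengthen its ability to handle massive data; the tree also preserves the clustering features of complex shapes, making it suitable for complex spatial data. Experiments show that CAMFT quickly clusters spatial data with complex-shaped clusters and outliers, that its results are independent of the spatial order of the objects, and that it is more efficient than the comparable existing clustering algorithms BIRCH and CURE.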

9.
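Research on a clustering algorithm based on hierarchical and partitioning methods   Total citations: 3, self-citations: 1, citations by others: 3
In hierarchical clustering, once a split or merge has been performed it cannot be undone, which limits clustering quality. To address this, an algorithm is proposed that uses inter-cluster dissimilarity, together with a clustering quality criterion based on information entropy or overall similarity, to merge and split clusters dynamically during the cluster-splitting process. Simulation results show that the algorithm yields more compact and better separated clusters and hence better clustering quality.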

10.
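Multilevel indexing of high-dimensional point data using SOM clustering   Total citations: 4, self-citations: 0, citations by others: 4
Indexing high-dimensional point data is one of the main research problems in content-based information retrieval. Starting from the SOM clustering algorithm and exploiting the good properties of self-organizing maps, this work solves the boundary-indexing problem of the R-tree and its variants and scales to point data of higher dimensionality. To overcome the limitation that traditional clustering algorithms can organize only a single-level index, it proposes organizing a multilevel index with SOM networks and an optimization that prunes by radius. Experimental results show that the method not only avoids the query errors that the search process of traditional clustering methods can produce, but also greatly improves the efficiency of index construction and querying.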

11.
Indexing high-dimensional data for efficient in-memory similarity search   Total citations: 3, self-citations: 0, citations by others: 3
In main memory systems, the L2 cache typically employs cache line sizes of 32-128 bytes. These values are relatively small compared to high-dimensional data, e.g., >32D. The consequence is that existing techniques (on low-dimensional data) that minimize cache misses are no longer effective. We present a novel index structure, called the Δ-tree, to speed up high-dimensional queries in a main-memory environment. The Δ-tree is a multilevel structure where each level represents the data space at a different dimensionality: the number of dimensions increases toward the leaf level. The remaining dimensions are obtained using principal component analysis. Each level of the tree serves to prune the search space more efficiently, as the lower dimensions can reduce the distance computation and better exploit the small cache line size. Additionally, the top-down clustering scheme can capture the features of the data set and hence reduces the search space. We also propose an extension, called the Δ⁺-tree, that globally clusters the data space and then partitions clusters into small regions. The Δ⁺-tree can further reduce the computational cost and cache misses. We conducted extensive experiments to evaluate the proposed structures against existing techniques on different kinds of data sets. Our results show that the Δ⁺-tree is superior in most cases.
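
One reason low-dimensional levels can prune safely is that a Euclidean distance computed over only the first k coordinates never exceeds the full distance. The sketch below applies that bound to a flat list of points. It assumes the vectors are already in a PCA-rotated basis with the most informative coordinates first, and it is not the Δ-tree itself; `nearest_with_pruning` and `levels` are hypothetical names.

```python
import math


def partial_distance(q, p, k):
    """Euclidean distance over the first k coordinates (a lower bound on the full one)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q[:k], p[:k])))


def nearest_with_pruning(query, points, levels):
    """Nearest neighbor by linear scan, skipping points a cheap bound rules out.

    levels lists the reduced dimensionalities to test, smallest first.
    Returns (best_point, number_of_full_distance_evaluations).
    """
    best, best_dist, full_evals = None, math.inf, 0
    for p in points:
        # If any low-dimensional bound already reaches best_dist, p cannot win.
        if any(partial_distance(query, p, k) >= best_dist for k in levels):
            continue
        full_evals += 1
        d = math.dist(query, p)
        if d < best_dist:
            best, best_dist = p, d
    return best, full_evals
```

The bound holds in any orthonormal basis; ordering coordinates by variance, as PCA does, simply makes the early coordinates more likely to trigger the skip.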

12.
Similarity search (e.g., k-nearest neighbor search) in high-dimensional metric space is the key operation in many applications, such as multimedia databases, image retrieval and object recognition, among others. The high dimensionality and the huge size of the data set require an index structure to facilitate the search. State-of-the-art index structures are built by partitioning the data set based on distances to certain reference point(s). Using the index, search is confined to a small number of partitions. However, these methods either ignore the property of the data distribution (e.g., VP-tree and its variants) or produce non-disjoint partitions (e.g., M-tree and its variants, DBM-tree); these greatly affect the search efficiency. In this paper, we study the effectiveness of a new index structure, called Nested-Approximate-eQuivalence-class tree (NAQ-tree), which overcomes the above disadvantages. NAQ-tree is constructed by recursively dividing the data set into nested approximate equivalence classes. The conducted analysis and the reported comparative test results demonstrate the effectiveness of NAQ-tree in significantly improving the search efficiency.

13.
Repositories of unstructured data types, such as free text, images, audio and video, have recently been emerging in various fields. A general searching approach for such data types is similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic, paged and balanced access method for similarity search in metric data sets, named the CM-tree (Clustered Metric tree). It fully supports dynamic insertions and deletions, both of single objects and in bulk. Unlike other methods, it is specifically designed to achieve a structure of tight, low-overlap clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, a clustering-based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods, the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.
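
Stored pairwise distances allow pruning through the triangle inequality: for a pivot p inside the node, |d(q, p) - d(p, o)| is a lower bound on d(q, o), so an object can be skipped when that bound exceeds the query radius. The sketch below shows this for a range query over a single node. It is a generic illustration, not the CM-tree's own pruning method, and `range_search_node` and its parameters are hypothetical.

```python
def range_search_node(query, objects, pairwise, radius, dist, pivot=0):
    """Range query over one node using precomputed in-node distances.

    pairwise[i][j] holds dist(objects[i], objects[j]).
    Returns (hits, number_of_distance_computations).
    """
    d_qp = dist(query, objects[pivot])
    computed = 1
    hits = [objects[pivot]] if d_qp <= radius else []
    for i, obj in enumerate(objects):
        if i == pivot:
            continue
        # Triangle inequality: d(query, obj) >= |d(query, pivot) - d(pivot, obj)|.
        if abs(d_qp - pairwise[pivot][i]) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed
```

The skip test uses only stored values, so each pruned object saves one evaluation of the metric, which is often the dominant CPU cost.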

15.
The TV-tree: An index structure for high-dimensional data   Total citations: 20, self-citations: 0, citations by others: 20
We propose a file structure to index high-dimensionality data, which are typically points in some feature space. The idea is to use only a few of the features, using additional features only when the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such varying-length feature vectors. Finally, we report simulation results, comparing the proposed structure with the R*-tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, which saves up to 80% in disk accesses.

16.
A generalization of binary search trees and binary split trees is developed that takes advantage of two-way key comparisons: the two-way comparison tree. The two-way comparison tree has little use for dynamic situations but is an improvement over the optimal binary search tree and the optimal binary split tree for static data sets. An O(n) time and space algorithm is presented for constructing an optimal two-way comparison tree when access probabilities are equal, and an exact formula for the optimal cost is developed. The construction of the optimal two-way comparison tree for unequal access frequencies, both successful and unsuccessful, is computable in O(n^5) time and O(n^3) space using algorithms similar to those for the optimal binary split tree. The optimal two-way comparison tree can improve search cost by up to 50% over the optimal binary search tree.

17.
Range and k-nearest neighbor searching are core problems in pattern recognition. Given a database S of objects in a metric space M and a query object q in M, in a range searching problem the goal is to find the objects of S within some threshold distance to q, whereas in a k-nearest neighbor searching problem, the k elements of S closest to q must be produced. These problems can obviously be solved with a linear number of distance calculations, by comparing the query object against every object in the database. However, the goal is to solve such problems much faster. We combine and extend ideas from the M-tree, the multivantage point structure, and the FQ-tree to create a new structure in the "bisector tree" class, called the Antipole tree. Bisection is based on the proximity to an "Antipole" pair of elements generated by a suitable linear randomized tournament. The final winners a, b of such a tournament are far enough apart to approximate the diameter of the splitting set. If dist(a, b) is larger than the chosen cluster diameter threshold, then the cluster is split. The proposed data structure is an indexing scheme suitable for (exact and approximate) best match searching on generic metric spaces. The Antipole tree outperforms existing structures such as list of clusters, M-trees, and others by a factor of approximately two and, in many cases, it achieves better clustering properties.
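
A randomized tournament of the kind this abstract describes can be sketched as follows. This is a simplified variant under stated assumptions: each round shuffles the survivors, cuts them into small groups, and keeps only the farthest pair of each group. The group size, the seeded generator, and the names `farthest_pair`, `antipole_pair`, and `should_split` are illustrative choices, not taken from the paper.

```python
import itertools
import random


def farthest_pair(items, dist):
    """Exact farthest pair by brute force; only used on small groups."""
    return max(itertools.combinations(items, 2), key=lambda ab: dist(*ab))


def antipole_pair(items, dist, group_size=3, rng=None):
    """Approximate the farthest pair with a randomized elimination tournament."""
    if group_size < 3:
        raise ValueError("group_size must be at least 3 so each round shrinks the pool")
    pool = list(items)
    if len(pool) < 2:
        raise ValueError("need at least two items")
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    while len(pool) > group_size:
        rng.shuffle(pool)
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            # Each group sends its two most distant members to the next round.
            winners.extend(farthest_pair(group, dist) if len(group) >= 2 else group)
        pool = winners
    return farthest_pair(pool, dist)


def should_split(items, dist, threshold):
    """Split a cluster when its approximate diameter exceeds the threshold."""
    a, b = antipole_pair(items, dist)
    return dist(a, b) > threshold
```

The pair returned is always made of real members of the set, so its distance can only underestimate the true diameter, never overestimate it.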

18.
Many recent database applications need to deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The slim-tree uses the triangle inequality to prune the distance calculations that are needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the slim-tree uses a minimal spanning tree to help with the splitting. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The slim-tree is a metric access method that tackles the problem of overlaps between nodes in metric spaces and that allows one to minimize the overlap. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "slim-down" algorithm. This paper also presents a new tool in the slim-tree's arsenal of resources, aimed at visualizing it. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new slim-tree algorithms lead to performance improvements. These results show that the slim-tree outperforms the M-tree by up to 200% for range queries. For insertion and splitting, the minimal-spanning-tree-based algorithm achieves up to 40 times faster insertions. We observed improvements of up to 40% in range queries after applying the slim-down algorithm.
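
One common way to split an overflowing node with a minimal spanning tree is to build the tree over the node's objects and cut its longest edge, leaving two connected groups. The sketch below does that with Prim's algorithm. The abstract only says a minimal spanning tree helps with splitting, so cutting the longest edge is an assumption here, and `mst_split` is a hypothetical name.

```python
def mst_split(objects, dist):
    """Split objects into two groups by removing the longest MST edge."""
    n = len(objects)
    if n < 2:
        raise ValueError("need at least two objects to split")
    # Prim's algorithm: best[j] = (cheapest distance to the tree, tree endpoint).
    best = {j: (dist(objects[0], objects[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        weight, u = best.pop(j)
        edges.append((weight, u, j))
        for k in best:
            d = dist(objects[j], objects[k])
            if d < best[k][0]:
                best[k] = (d, j)
    # Drop the longest edge, then collect everything still connected to one end.
    longest = max(edges, key=lambda e: e[0])
    edges.remove(longest)
    adjacency = {i: [] for i in range(n)}
    for _, u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen, stack = {longest[1]}, [longest[1]]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    first = [objects[i] for i in range(n) if i in seen]
    second = [objects[i] for i in range(n) if i not in seen]
    return first, second
```

This version makes no attempt to balance the two groups, so one side may end up with a single object when the data contain an outlier.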

19.
Knowledge discovery refers to identifying hidden and valid patterns in data, and it can be used to build knowledge inference systems. The decision tree is one such successful technique for supervised learning and for extracting knowledge or rules. This paper aims to develop a decision tree model that predicts the occurrence of diabetes. Traditional decision tree algorithms have a problem with crisp boundaries. Much better decision rules can be identified from these clinical data sets with the use of fuzzy decision boundaries. The key step in the construction of a decision tree is the identification of split points, and in this work the best split points are identified using the Gini index. The authors propose a method that minimizes the calculation of Gini indices by identifying false split points, and they use the Gaussian fuzzy function because the clinical data sets are not crisp. As the efficiency of the decision tree depends on many factors, such as the number of nodes and the length of the tree, pruning of the decision tree plays a key role. The modified Gini index-Gaussian fuzzy decision tree algorithm is proposed and is tested for accuracy on the Pima Indian Diabetes (PID) clinical data set. This algorithm outperforms other decision tree algorithms.
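
Choosing a split point with the Gini index can be sketched as below for a single numeric feature. This is the plain crisp version: it evaluates every midpoint between distinct sorted values and does not include the paper's false-split-point shortcut or its Gaussian fuzzy boundaries. `gini` and `best_split` are illustrative names.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split, or (None, inf)."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score
```

Recomputing the impurity from scratch at every candidate costs quadratic time overall, which is the kind of work that a method for skipping candidate split points aims to reduce.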

20.
With the explosive growth of data, a database system is expected to provide tree-like indexes, such as the R-tree, M-tree, and B+-tree, suited to different types of data in order to support efficient data management, including queries and updates. In a distributed environment, the indexes have to be scattered across the compute nodes to improve reliability and scalability. Indexes can speed up queries, but they incur maintenance cost when updates occur. In a distributed environment, each compute node maintains a subset of an index tree, so keeping the communication cost small is even more crucial; otherwise it consumes a large share of network bandwidth and degrades the scalability and availability of the database system. Further, to achieve reliability and scalability for queries, several replicas of the index are needed, but keeping the replicas consistent is not straightforward. In this paper, we propose a framework supporting tree-like indexes, based on the Chord overlay, which is a popular P2P structure. The framework dynamically tunes the number of index replicas to balance the query cost and the update cost. Several techniques are designed to improve the efficiency of updates without sacrificing query performance. We implement the M-tree and R-tree in our framework, and extensive experiments on real-life and synthetic data sets verify the efficiency and scalability of our framework.
