Similar Documents
20 similar documents found (search time: 15 ms)
1.
Metric indexing is the state of the art in general distance-based retrieval. Relying on the triangle inequality, metric indexes achieve significant online speed-up beyond a linear scan. Recently, the idea of Ptolemaic indexing was introduced, which substitutes Ptolemy's inequality for the triangle inequality, potentially yielding higher efficiency for the distances where it applies. In this paper we adapt several metric indexes to support Ptolemaic indexing, thus establishing a class of Ptolemaic access methods (PtoAM). In particular, we include Ptolemaic Pivot tables, Ptolemaic PM-Trees and the Ptolemaic M-Index. We also show that the most important and promising family of distances suitable for Ptolemaic indexing is the signature quadratic form distance, an adaptive similarity measure which can cope with flexible content representations of multimedia data, among other things. While this distance has shown remarkable search effectiveness, its high computational complexity underscores the need for efficient search methods. We show that these distances are Ptolemaic metrics and present a study applying Ptolemaic indexing methods to real-world image databases, resolving exact queries nearly four times as fast as the state-of-the-art metric solution, and up to three orders of magnitude faster than a sequential scan.
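The contrast between the two bounds can be sketched in a few lines (illustrative Python, not the paper's implementation; the pivot points below are arbitrary). For a Ptolemaic metric such as the Euclidean distance, a pivot pair (p, s) yields the lower bound d(q,o) ≥ |d(q,p)·d(o,s) − d(q,s)·d(o,p)| / d(p,s), alongside the familiar triangle bound |d(q,p) − d(o,p)|:

```python
from math import dist  # Euclidean distance (Python 3.8+), a Ptolemaic metric

def tri_lower_bound(dq_p, do_p):
    """Triangle-inequality lower bound on d(q, o) from a single pivot p."""
    return abs(dq_p - do_p)

def ptolemy_lower_bound(q, o, p, s, d=dist):
    """Ptolemaic lower bound on d(q, o) from a pivot pair (p, s):
    d(q, o) >= |d(q,p)*d(o,s) - d(q,s)*d(o,p)| / d(p,s)."""
    return abs(d(q, p) * d(o, s) - d(q, s) * d(o, p)) / d(p, s)

q, o = (0.0, 0.0), (3.0, 4.0)       # true distance d(q, o) = 5
p, s = (1.0, 0.0), (0.0, 2.0)       # two pivots (arbitrary choices here)
lb_tri = max(tri_lower_bound(dist(q, p), dist(o, p)),
             tri_lower_bound(dist(q, s), dist(o, s)))
lb_pto = ptolemy_lower_bound(q, o, p, s)
assert lb_tri <= dist(q, o) and lb_pto <= dist(q, o)  # both bounds are valid
```

Which bound is tighter depends on the pivot configuration; an index can take the maximum of both before resorting to an exact (and, for the signature quadratic form distance, expensive) computation.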

2.
Repositories of unstructured data types, such as free text, images, audio and video, have recently been emerging in various fields. A general searching approach for such data types is similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic, paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic insertions and deletions, both of single objects and in bulk. Unlike other methods, it is especially designed to achieve a structure of tight, low-overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, a clustering-based node-split algorithm with criteria for triggering a split, and an improved sub-tree pruning method used during search. To facilitate these methods, the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.
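The sub-tree pruning this family of trees relies on builds on a standard test, which can be sketched as follows (a generic M-tree-style criterion for illustration, not the CM-tree's exact rule):

```python
def can_prune(dq_parent, d_parent_child, child_radius, query_radius):
    """M-tree-style pruning test: by the triangle inequality,
    d(q, child) >= |d(q, parent) - d(parent, child)|, so the child's
    subtree cannot intersect the query ball when that bound, minus the
    child's covering radius, exceeds the query radius -- decided without
    ever computing d(q, child)."""
    return abs(dq_parent - d_parent_child) - child_radius > query_radius

# q is 10 away from the parent routing object; a child routing object lies
# 2 away from the parent and covers a ball of radius 1, so every object in
# that subtree is at least 10 - 2 - 1 = 7 from q: a radius-3 query skips it.
assert can_prune(10.0, 2.0, 1.0, 3.0)
assert not can_prune(10.0, 8.0, 1.0, 3.0)
```

Maintaining the pairwise distances inside each node, as the CM-tree does, makes `d_parent_child` available for free, so the test costs no extra distance computations.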

3.
Some approximate indexing schemes have recently been proposed for metric spaces which sort the objects in the database according to pseudo-scores. It is known that (1) some of them provide a very good trade-off between response time and accuracy, and (2) probability-based pseudo-scores can provide an optimal trade-off in range queries if the probabilities are correctly estimated. Based on these facts, we propose a probabilistic enhancement scheme which can be applied to any pseudo-score based scheme. Our scheme computes probability-based pseudo-scores from the pseudo-scores produced by the underlying scheme. To estimate the probability-based pseudo-scores, we use object-specific parameters in a logistic regression model and learn the parameters using MAP (maximum a posteriori) estimation and the empirical Bayes method. We also propose a technique which speeds up learning the parameters using pseudo-scores. We applied our scheme to two state-of-the-art schemes, the standard pivot-based scheme and the permutation-based scheme, and evaluated them on various kinds of datasets from the Metric Space Library. The results showed that our scheme outperformed the conventional schemes, in both the number of distance computations and CPU time, on all the datasets.
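The core mapping can be sketched as follows (illustrative only: the logistic parameters below are shared across objects for brevity, whereas the paper learns object-specific parameters via MAP estimation and empirical Bayes, which is not reproduced here):

```python
from math import exp

def prob_pseudo_score(score, a, b):
    """Map a raw pseudo-score to a probability-based pseudo-score with a
    logistic model; candidates are then ranked by descending probability."""
    return 1.0 / (1.0 + exp(-(a + b * score)))

# Rank three candidate objects: smaller raw pseudo-scores should rank
# first, hence the negative slope b in this illustrative parameterization.
scores = {"x": 0.2, "y": 1.5, "z": 0.7}
a, b = 1.0, -2.0
ranking = sorted(scores, key=lambda o: prob_pseudo_score(scores[o], a, b),
                 reverse=True)
assert ranking == ["x", "z", "y"]
```

Because the logistic function is monotone, per-object parameters (rather than the shared `a, b` above) are what allow the probabilistic ranking to differ from the raw pseudo-score ranking.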

4.
Similarity searching in metric spaces has a vast number of applications in several fields like multimedia databases, text retrieval, computational biology, and pattern recognition. In this context, one of the most important similarity queries is the k nearest neighbor (k-NN) search. The standard best-first k-NN algorithm uses a lower bound on the distance to prune objects during the search. Although optimal in several aspects, the disadvantage of this method is that its space requirements for the priority queue that stores unprocessed clusters can be linear in the database size. Most of the optimizations used in spatial access methods (for example, pruning using MinMaxDist) cannot be applied in metric spaces, due to the lack of geometric properties. We propose a new k-NN algorithm that uses distance estimators, aiming to reduce the storage requirements of the search algorithm. The method stays optimal, yet it can significantly prune the priority queue without altering the output of the query. Experimental results with synthetic and real datasets confirm the reduction in storage space of our proposed algorithm, showing savings of up to 80% of the original space requirement.
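The baseline this paper improves on, best-first k-NN over clusters keyed by a triangle-derived lower bound, can be sketched as follows (illustrative Python; the flat cluster layout is an assumption for brevity, and the paper's distance estimators are not reproduced):

```python
import heapq
from math import dist

def knn_best_first(q, clusters, k):
    """Textbook best-first k-NN: the frontier heap keys each cluster by the
    lower bound max(0, d(q, center) - radius); a cluster is expanded only
    if that bound can still beat the current k-th nearest distance."""
    frontier = [(max(0.0, dist(q, c) - r), pts) for c, r, pts in clusters]
    heapq.heapify(frontier)
    result = []  # max-heap of (-distance, point) holding the best k so far
    while frontier:
        lb, pts = heapq.heappop(frontier)
        if len(result) == k and lb >= -result[0][0]:
            break  # no remaining cluster can improve the answer
        for p in pts:
            d = dist(q, p)
            if len(result) < k:
                heapq.heappush(result, (-d, p))
            elif d < -result[0][0]:
                heapq.heapreplace(result, (-d, p))
    return sorted((-nd, p) for nd, p in result)

clusters = [((0.0,), 1.0, [(0.5,), (-0.5,)]),
            ((10.0,), 1.0, [(9.5,), (10.5,)])]
out = knn_best_first((0.0,), clusters, 2)
assert {p for _, p in out} == {(0.5,), (-0.5,)}
```

The storage problem the paper addresses is visible in `frontier`: in the worst case it holds an entry per unexpanded cluster, which is what the proposed distance estimators prune.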

Benjamin Bustos is an assistant professor in the Department of Computer Science at the University of Chile. He is also a researcher at the Millennium Nucleus Center for Web Research. His research interests are similarity searching and multimedia information retrieval. He has a doctoral degree in natural sciences from the University of Konstanz, Germany. Contact him at bebustos@dcc.uchile.cl. Gonzalo Navarro earned his PhD in Computer Science at the University of Chile in 1998, where he is now a full professor. His research interests include similarity searching, text databases, compression, and algorithms and data structures in general. He has coauthored a book on string matching and around 200 international papers. He has (co)chaired the international conferences SPIRE 2001, SCCC 2004, SPIRE 2005, SIGIR Posters 2005, IFIP TCS 2006, and the ENC 2007 Scalable Pattern Recognition track, and belongs to the editorial board of the Information Retrieval Journal. He is currently Head of the Department of Computer Science at the University of Chile, and Head of the Millennium Nucleus Center for Web Research, the largest Chilean project in computer science research.

5.
Similarity search operations require executing expensive algorithms, and although broadly useful in many new applications, they rely on specific structures not yet supported by commercial DBMSs. In this paper we discuss the new Omni-technique, which allows building a variety of dynamic metric access methods based on a number of objects selected from the dataset and used as global reference objects. We call them the Omni-family of metric access methods. This technique enables building similarity search operations on top of existing structures, significantly improving their performance in terms of both disk accesses and distance calculations. Additionally, our methods scale up well, exhibiting sub-linear behavior with growing database size.

6.
Spatial indexing of high-dimensional data based on relative approximation
We propose a novel index structure, the A-tree (approximation tree), for similarity searches in high-dimensional data. The basic idea of the A-tree is the introduction of virtual bounding rectangles (VBRs), which contain and approximate MBRs or data objects. VBRs can be represented quite compactly and thus affect the tree configuration both quantitatively and qualitatively. First, since tree nodes can contain a large number of VBR entries, fanout becomes large, which increases search speed. More importantly, we have a free hand in arranging MBRs and VBRs in the tree nodes. Each A-tree node contains an MBR and its children's VBRs. Therefore, by fetching an A-tree node, we obtain the exact position of a parent MBR and the approximate positions of its children. We have performed experiments using both synthetic and real data sets. For the real data sets, the A-tree outperforms the SR-tree and the VA-file in all dimensionalities up to 64 dimensions, the highest dimension in our experiments. Additionally, we propose a cost model for the A-tree and verify its validity on synthetic and real data sets. Edited by T. Sellis. Received: December 8, 2000 / Accepted: March 20, 2002 / Published online: September 25, 2002
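The containment trick behind such approximations can be sketched per dimension (illustrative Python; the grid resolution and encoding below are assumptions, not the A-tree's actual layout): the low edge of the child MBR is rounded down and the high edge up on a coarse grid relative to the parent MBR, so the decoded rectangle always contains the exact one while needing only a few bits per coordinate.

```python
from math import floor, ceil

def quantize_vbr(parent_lo, parent_hi, child_lo, child_hi, bits=8):
    """Encode a child's MBR as per-dimension grid-cell indices relative to
    its parent's MBR, conservatively (floor the low edge, ceil the high)."""
    levels = (1 << bits) - 1
    codes = []
    for pl, ph, cl, ch in zip(parent_lo, parent_hi, child_lo, child_hi):
        scale = levels / (ph - pl)
        codes.append((floor((cl - pl) * scale), ceil((ch - pl) * scale)))
    return codes

def decode_vbr(parent_lo, parent_hi, codes, bits=8):
    """Recover the (slightly enlarged) rectangle from the grid indices."""
    levels = (1 << bits) - 1
    return [(pl + lc * (ph - pl) / levels, pl + hc * (ph - pl) / levels)
            for (pl, ph), (lc, hc) in zip(zip(parent_lo, parent_hi), codes)]

codes = quantize_vbr((0.0,), (100.0,), (12.34,), (56.78,))
(lo, hi), = decode_vbr((0.0,), (100.0,), codes)
assert lo <= 12.34 and hi >= 56.78  # the VBR conservatively contains the MBR
```

Conservative rounding is what keeps the search exact: a VBR may admit false positives, which the exact parent MBRs later filter out, but it can never miss a qualifying child.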

7.
We propose a scalable distributed data structure (SDDS) called SD-Rtree. We intend our structure for point, window and kNN queries over large spatial datasets distributed on clusters of interconnected servers. The structure balances the storage and processing load over the available resources, and aims at minimizing the size of the cluster. The SD-Rtree generalizes the well-known Rtree structure. It uses a distributed balanced binary tree that scales with insertions to potentially any number of storage servers through splits of the overloaded ones. A user/application manipulates the structure from a client node. The client addresses the tree through its image, which may be outdated due to later splits. This can generate addressing errors, which are resolved by forwarding among the servers; specific messages towards the clients incrementally correct the outdated images. We present the building of an SD-Rtree through insertions, focusing on the split and rotation algorithms, and follow with the query algorithms. We then describe a flexible allocation protocol which copes with a temporary shortage of storage resources through data storage balancing. Experiments show additional aspects of the SD-Rtree and compare its behavior with a distributed quadtree. The results justify our various design choices and the overall utility of the structure.

8.
Proximity searches become very difficult in “high-dimensional” metric spaces, that is, those whose histogram of distances has a large mean and/or a small variance. This so-called “curse of dimensionality”, well known in vector spaces, is also observed in metric spaces. The search complexity grows sharply with the dimension and with the search radius. We present a general probabilistic framework applicable to any search algorithm, whose net effect is to reduce the search radius. The higher the dimension, the more effective the technique. We illustrate its practical performance empirically on a particular class of algorithms, where large improvements in search time are obtained at the cost of a very small error probability.
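The radius-reduction idea can be sketched with a single-pivot range search (illustrative Python; the function and the shrink factor `beta` are assumptions for exposition, not the paper's framework): pruning tests use beta·r while final verification uses the true r, so beta < 1 saves distance computations at the risk of missing some answers.

```python
from math import dist

def probabilistic_range_search(q, pivot, db, radius, beta=1.0):
    """Prune candidates with the triangle lower bound |d(q,p) - d(o,p)|
    tested against beta*radius instead of radius; survivors are verified
    with the true radius. beta = 1 gives the exact search."""
    dq_p = dist(q, pivot)
    candidates = [o for o in db
                  if abs(dq_p - dist(o, pivot)) <= beta * radius]
    return [o for o in candidates if dist(q, o) <= radius]

db = [(0.0,), (1.0,), (2.0,), (5.0,)]
exact = probabilistic_range_search((0.0,), (3.0,), db, 2.0)            # beta = 1
approx = probabilistic_range_search((0.0,), (3.0,), db, 2.0, beta=0.4)
assert exact == [(0.0,), (1.0,), (2.0,)]
assert set(approx) <= set(exact)  # may miss answers, never adds false ones
```

The one-sided error is the key property: shrinking the pruning radius can only lose true answers, never introduce wrong ones, which is why a small beta reduction buys large savings at a small, quantifiable error probability.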

9.
Learned indexes use a model to restrict the search of a sorted table to a smaller interval. Typically, a final binary search is done using the lower_bound routine of the C++ Standard Library. Recent studies have shown that on current processors other search approaches (such as k-ary search) can be more efficient in some applications. Using the SOSD learned-indexing benchmarking software, we extend these results to show that k-ary search is indeed the better choice when using learned indexes. We highlight how this choice may depend on the computer architecture used, for example Intel i7 or Apple M1, and provide guidelines for selecting the search routine within the learned-indexing framework.
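A minimal k-ary lower_bound (a sketch of the general technique, not SOSD's implementation) probes k−1 evenly spaced separators per round instead of one midpoint; k = 2 recovers plain binary search:

```python
import bisect

def k_ary_lower_bound(a, key, k=4):
    """Return the first index i with a[i] >= key (len(a) if none), by
    narrowing [lo, hi) to the leftmost of k sub-intervals each round."""
    lo, hi = 0, len(a)
    while lo < hi:
        seps = [lo + (hi - lo) * j // k for j in range(1, k)]
        for m in seps:
            if a[m] < key:
                lo = m + 1      # answer lies right of this separator
            else:
                hi = m          # answer is at or left of this separator
                break
    return lo

a = [1, 3, 3, 7, 9]
for key in range(11):
    assert k_ary_lower_bound(a, key) == bisect.bisect_left(a, key)
```

The payoff on modern hardware comes from issuing the k−1 separator comparisons with fewer dependent branches per cache line touched, which is why the best k differs between, say, Intel and Apple M1 cores.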

10.
In the digital age, the dependence of the elderly on smartphones is growing, and the frequency of use and time spent on various apps continue to grow. The display elements in the quick-access area on the homepage of an app act as shortcuts for navigation, directly affecting the search efficiency and user experience of the elderly when they use the app. In this study, we investigate the impact of the density of the elements (icons and their text labels) located in the quick-access area of smartphone apps on the visual search efficiency and user experience of elderly people. First, three typical density designs were extracted by collating and analyzing the density rules for the elements in the quick-access areas on the homepages of existing elderly-oriented versions of mainstream apps. Then, the densities of three elements in the quick-access area on the homepage of a takeout app were used for a case study involving 96 elderly subjects, who were invited to participate in a task-search test and user interviews. The results showed that the density of the elements in the app's quick-access area had no significant effect on the visual search efficiency of the subjects but did affect the user experience. Additionally, the older elderly subjects preferred designs with lower density, while the younger elderly ones, who had more online shopping experience, preferred designs with higher density. The subjects were more concerned about the ease of use and the overall user experience of the quick-access area than its visual aesthetics. The results not only provide a theoretical reference and design basis for the icons and text labels of app quick-access areas, considering element density in the context of elderly-oriented apps, but also offer inspiration for improving the user experience of elderly app users.

11.
Computer-based assessments of complex problem solving (CPS) that have been used in international large-scale surveys require students to engage in an in-depth interaction with the problem environment. In doing so, they evoke manifest sequences of overt behavior that are stored in computer-generated log files. In the present study, we explored the relation between several overt behaviors, which N = 1476 Finnish ninth-grade students (mean age = 15.23, SD = .47 years) exhibited when exploring a CPS environment, and their CPS performance. We used the MicroDYN approach to measure CPS and inspected students' behaviors through log-file analyses. Results indicated that students who occasionally observed the problem environment in a noninterfering way in addition to actively exploring it (noninterfering observation) showed better CPS performance, whereas students who showed a high frequency of (potentially unplanned) interventions (intervention frequency) exhibited worse CPS performance. Additionally, both too much and too little time spent on a CPS task (time on task) were associated with poor CPS performance. The observed effects held after controlling for students' use of an exploration strategy that required a sequence of multiple interventions (VOTAT strategy), indicating that these behaviors exhibited incremental effects on CPS performance beyond the use of VOTAT.

12.
A new algorithm for decomposing mixed pixels, based on orthogonal bases of the data space, is proposed in this paper. It is a simplex-based method which extracts endmembers sequentially using computations of largest simplex volumes. At each step of the extraction, searching for the simplex with the largest volume is equivalent to searching for a new orthogonal basis vector with the largest norm; the new endmember corresponds to that basis vector. The algorithm runs very fast and also avoids the dilemma of traditional simplex-based endmember extraction algorithms, such as N-FINDR, which generally produce different sets of final endmembers under different initial conditions. Moreover, with this set of orthogonal bases, the proposed algorithm can also determine the proper number of endmembers and finish the unmixing of the original images, which the traditional simplex-based algorithms cannot do by themselves. Experimental results on both simulated images and practical remote sensing images demonstrate that the proposed algorithm is a fast and accurate method for the decomposition of mixed pixels.
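The equivalence between maximizing simplex volume and picking the pixel with the largest orthogonal-residual norm can be sketched as follows (illustrative Python using plain Gram-Schmidt on the linear span; the paper's exact formulation, including any affine or mean-removal step, is not reproduced):

```python
def residual(v, basis):
    """Component of v orthogonal to the span of `basis` (Gram-Schmidt step;
    `basis` is assumed pairwise orthogonal)."""
    r = list(v)
    for b in basis:
        nb = sum(x * x for x in b)
        if nb == 0.0:
            continue
        coef = sum(x * y for x, y in zip(r, b)) / nb
        r = [x - coef * y for x, y in zip(r, b)]
    return r

def extract_endmembers(pixels, count):
    """Sequentially pick the pixel whose component orthogonal to the span
    of the endmembers chosen so far has the largest norm (each pick
    maximizes the grown simplex volume); keep its residual as the next
    orthogonal basis vector."""
    basis, endmembers = [], []
    for _ in range(count):
        best = max(pixels,
                   key=lambda p: sum(x * x for x in residual(p, basis)))
        endmembers.append(best)
        basis.append(residual(best, basis))
    return endmembers

pixels = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]
assert extract_endmembers(pixels, 2) == [(0.0, 2.0, 0.0), (1.0, 0.0, 0.0)]
```

Because each step only reuses the already-orthogonalized basis, the selection is deterministic, which is the property that removes N-FINDR's sensitivity to initial conditions; the decay of the residual norms is also what signals a suitable number of endmembers.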

13.
Classification of medical data raises several problems, such as class imbalance, the double meaning of missing data, data volume, and the need for highly interpretable results. In this paper a new algorithm is proposed: MOCA-I (Multi-Objective Classification Algorithm for Imbalanced data), a multi-objective local search algorithm designed to address these issues together. It is based on a new formulation as a Pittsburgh multi-objective partial classification rule mining problem, which is described in the first part of this paper. An existing dominance-based multi-objective local search (DMLS) is modified to deal with this formulation. After experimentally tuning the parameters of MOCA-I and determining which version of the DMLS algorithm is the most effective, the resulting version of MOCA-I is compared to several state-of-the-art classification algorithms. This comparison is carried out on 10 small and medium-sized data sets from the literature and 2 real data sets; MOCA-I obtains the best results on the 10 data sets and is statistically better than the other approaches on the real data sets.

14.
Application of a spatial data fusion algorithm to temperature field computation
A spatial data fusion algorithm is proposed which combines the characteristics of interpolation approximation and least-squares approximation: on the one hand it retains the useful properties of least-squares approximation while reducing computational complexity, and on the other it overcomes the drawbacks of interpolation approximation, such as high-order oscillation. Given many data-acquisition points, it can therefore quickly produce a polynomial function describing how the measured quantity varies over the whole measurement interval. Its application to computing the temperature distribution of an in-situ combustion (fireflood) oil-layer temperature field shows it to be a real-time and effective approximation algorithm. Finally, a further performance analysis of the algorithm is given.
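The least-squares half of such a hybrid can be sketched with a plain normal-equations polynomial fit (illustrative Python; the fused interpolation/least-squares scheme itself is not reproduced):

```python
def polyfit_lsq(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations, solved by
    Gaussian elimination with partial pivoting (fine for the low degrees
    used in practice). Returns coefficients c0..c_degree of
    c0 + c1*x + ... minimizing the squared error over the points."""
    n = degree + 1
    # Normal equations A c = b: A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                       # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * p for a, p in zip(A[r], A[col])]
            b[r] -= f * b[col]
    c = [0.0] * n
    for i in reversed(range(n)):               # back substitution
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, n))) / A[i][i]
    return c

# Points on y = 2 + 3x are recovered by a degree-1 fit.
c0, c1 = polyfit_lsq([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0], 1)
assert abs(c0 - 2.0) < 1e-9 and abs(c1 - 3.0) < 1e-9
```

A low-degree least-squares fit like this smooths measurement noise but cannot pass through every point; the hybrid described above exists precisely to combine that smoothing with interpolation's fidelity at the acquisition points.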

15.
蒋胤傑, 况琨, 吴飞. 《智能系统学报》, 2020, 15(1): 175-182
Data-driven machine learning (especially deep learning) has made great progress in natural language processing, computer vision and speech recognition, and is a hot topic in artificial intelligence research. Traditional machine learning, however, uses optimization algorithms to fit the model that is optimal on the training set, i.e., the one minimizing the average loss. In many real-world problems (such as commercial auctions and resource allocation), the goal an AI algorithm should learn is instead an equilibrium solution, one that remains effective in dynamic situations. This requires applying game-theoretic ideas to big-data intelligence. Through methods such as Monte Carlo tree search and reinforcement learning, game theory can be combined with artificial intelligence to seek equilibrium solutions of adversarial game models. Moving from the optimal solution of data fitting to the equilibrium solution of adversarial games gives big-data intelligence a much broader application space.

16.
The rate of penetration (ROP) model is of great importance for achieving high efficiency in the complex geological drilling process. In this paper, a novel two-level intelligent modeling method is proposed for the ROP, considering the drilling-data characteristics of incompleteness, coupling, and strong nonlinearity. First, a piecewise cubic Hermite interpolation method is introduced to fill in missing drilling data. Then, a formation drillability (FD) fusion submodel is established using the Nadaboost extreme learning machine (Nadaboost-ELM) algorithm, and the mutual information method is used to select the parameters strongly correlated with the ROP. Finally, a ROP submodel is established by a radial basis function neural network optimized by improved particle swarm optimization (RBFNN-IPSO). This two-level ROP model is applied to a real drilling process, and the proposed method shows the best ROP prediction performance compared with conventional methods. The proposed ROP model provides a basis for intelligent optimization and control in the complex geological drilling process.

17.
Recently, terms such as "MOOC" and "flipped classroom" have become hot topics in education, gradually moving from the awareness stage to the practice stage. In China, a number of universities have already carried out related practice. What, then, is the "flipped classroom"? How should we understand the impact of this new development on undergraduate computer science education? And how can it be put into concrete teaching practice in Data Structures, a core foundational course of the computer science curriculum? This paper investigates these questions.

18.
To capitalize on multicore power, modern high-speed data transfer applications usually adopt a multi-threaded design and aggregate multiple network interfaces. However, NUMA introduces another dimension of complexity to these applications. In this paper, we undertook comprehensive experiments on real systems to illustrate the importance of NUMA-awareness to applications with intensive memory accesses and network I/O. Instead of simply attributing the NUMA effect to the physical layout, we provide an in-depth analysis of the underlying interactions inside hardware devices. We profile system performance by monitoring relevant hardware counters, and reveal how the NUMA penalty occurs during prefetch and cache synchronization. Consequently, we implement a thread mapping module in a bulk data transfer application, BBCP, as a practical example of enabling NUMA-awareness. The enhanced application is then evaluated on our high-performance testbed with storage area networks (SAN). Our experimental results show that the proposed NUMA optimizations can significantly improve BBCP's performance in memory-based tests with various contention levels and in realistic data transfers involving SAN-based storage.

19.
In butadiene production, real-time data from the DCS commonly contain random or gross errors caused by unknown disturbances and random noise. Correcting and smoothing these errors is important for improving data accuracy and reliability. The data reconciliation module described in this paper is based on linear constraints; following the practical situation, the module is structured into a matrix-processing part and a reconciliation part, with data structures and processing flows designed for the characteristics of industrial data. Practical application shows that, under linear constraints, the module performs the data reconciliation function as intended.

20.
No school is an island; each is part of a continuum of institutions that together form an educational pipeline through which groups of students pass. To turn a body of data into useful information for knowledge-based decision-making at any level, data must be collected, organised, analysed and reflected upon. The purpose of this paper is to discuss how schools and other educational institutions can not only collect better data but also learn how to transform those data so that the information they hold can be effectively shared among all stakeholders. This process will help ensure that the school and the entire education system provide a more seamless and effective educational pipeline for students, and ultimately improve the quality of education delivered in the country as a whole.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号