Similar Documents
20 similar documents found (search time: 15 ms)
1.
Metric indexing is the state of the art in general distance-based retrieval. Relying on the triangle inequality, metric indexes achieve significant online speed-up beyond a linear scan. Recently, the idea of Ptolemaic indexing was introduced, which substitutes Ptolemy's inequality for the triangle inequality, potentially yielding higher efficiency for the distances where it applies. In this paper we adapt several metric indexes to support Ptolemaic indexing, thus establishing a class of Ptolemaic access methods (PtoAM). In particular, we include Ptolemaic Pivot tables, Ptolemaic PM-Trees and the Ptolemaic M-Index. We also show that the most important and promising family of distances suitable for Ptolemaic indexing is the signature quadratic form distance, an adaptive similarity measure which can cope with flexible content representations of multimedia data, among other things. While this distance has shown remarkable search effectiveness, its high computational complexity underscores the need for efficient search methods. We show that these distances are Ptolemaic metrics and present a study applying Ptolemaic indexing methods to real-world image databases, resolving exact queries nearly four times as fast as the state-of-the-art metric solution, and up to three orders of magnitude faster than a sequential scan.
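The contrast between the two bounds can be sketched in a few lines (illustrative Python, not the paper's implementation; the pivot points below are arbitrary). For a Ptolemaic metric such as the Euclidean distance, a pivot pair (p, s) yields the lower bound d(q,o) ≥ |d(q,p)·d(o,s) − d(q,s)·d(o,p)| / d(p,s), alongside the familiar triangle bound |d(q,p) − d(o,p)|:

```python
from math import dist  # Euclidean distance (Python 3.8+), a Ptolemaic metric

def tri_lower_bound(dq_p, do_p):
    """Triangle-inequality lower bound on d(q, o) from a single pivot p."""
    return abs(dq_p - do_p)

def ptolemy_lower_bound(q, o, p, s, d=dist):
    """Ptolemaic lower bound on d(q, o) from a pivot pair (p, s):
    d(q, o) >= |d(q,p)*d(o,s) - d(q,s)*d(o,p)| / d(p,s)."""
    return abs(d(q, p) * d(o, s) - d(q, s) * d(o, p)) / d(p, s)

q, o = (0.0, 0.0), (3.0, 4.0)       # true distance d(q, o) = 5
p, s = (1.0, 0.0), (0.0, 2.0)       # two pivots (arbitrary choices here)
lb_tri = max(tri_lower_bound(dist(q, p), dist(o, p)),
             tri_lower_bound(dist(q, s), dist(o, s)))
lb_pto = ptolemy_lower_bound(q, o, p, s)
assert lb_tri <= dist(q, o) and lb_pto <= dist(q, o)  # both bounds are valid
```

Which bound is tighter depends on the pivot configuration; an index can take the maximum of both before resorting to an exact (and, for the signature quadratic form distance, expensive) computation.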

2.
Repositories of unstructured data types, such as free text, images, audio and video, have recently been emerging in various fields. A general searching approach for such data types is similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic, paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic insertions and deletions, both of single objects and in bulk. Unlike other methods, it is especially designed to achieve a structure of tight, low-overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, a clustering-based node-split algorithm with criteria for triggering a split, and an improved sub-tree pruning method used during search. To facilitate these methods, the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.
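The sub-tree pruning this family of trees relies on builds on a standard test, which can be sketched as follows (a generic M-tree-style criterion for illustration, not the CM-tree's exact rule):

```python
def can_prune(dq_parent, d_parent_child, child_radius, query_radius):
    """M-tree-style pruning test: by the triangle inequality,
    d(q, child) >= |d(q, parent) - d(parent, child)|, so the child's
    subtree cannot intersect the query ball when that bound, minus the
    child's covering radius, exceeds the query radius -- decided without
    ever computing d(q, child)."""
    return abs(dq_parent - d_parent_child) - child_radius > query_radius

# q is 10 away from the parent routing object; a child routing object lies
# 2 away from the parent and covers a ball of radius 1, so every object in
# that subtree is at least 10 - 2 - 1 = 7 from q: a radius-3 query skips it.
assert can_prune(10.0, 2.0, 1.0, 3.0)
assert not can_prune(10.0, 8.0, 1.0, 3.0)
```

Maintaining the pairwise distances inside each node, as the CM-tree does, makes `d_parent_child` available for free, so the test costs no extra distance computations.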

3.
Some approximate indexing schemes have recently been proposed for metric spaces which sort the objects in the database according to pseudo-scores. It is known that (1) some of them provide a very good trade-off between response time and accuracy, and (2) probability-based pseudo-scores can provide an optimal trade-off in range queries if the probabilities are correctly estimated. Based on these facts, we propose a probabilistic enhancement scheme which can be applied to any pseudo-score based scheme. Our scheme computes probability-based pseudo-scores from the pseudo-scores produced by the underlying scheme. To estimate the probability-based pseudo-scores, we use object-specific parameters in a logistic regression model and learn the parameters using MAP (maximum a posteriori) estimation and the empirical Bayes method. We also propose a technique which speeds up learning the parameters using pseudo-scores. We applied our scheme to two state-of-the-art schemes, the standard pivot-based scheme and the permutation-based scheme, and evaluated them on various kinds of datasets from the Metric Space Library. The results showed that our scheme outperformed the conventional schemes, in both the number of distance computations and CPU time, on all the datasets.
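The core mapping can be sketched as follows (illustrative only: the logistic parameters below are shared across objects for brevity, whereas the paper learns object-specific parameters via MAP estimation and empirical Bayes, which is not reproduced here):

```python
from math import exp

def prob_pseudo_score(score, a, b):
    """Map a raw pseudo-score to a probability-based pseudo-score with a
    logistic model; candidates are then ranked by descending probability."""
    return 1.0 / (1.0 + exp(-(a + b * score)))

# Rank three candidate objects: smaller raw pseudo-scores should rank
# first, hence the negative slope b in this illustrative parameterization.
scores = {"x": 0.2, "y": 1.5, "z": 0.7}
a, b = 1.0, -2.0
ranking = sorted(scores, key=lambda o: prob_pseudo_score(scores[o], a, b),
                 reverse=True)
assert ranking == ["x", "z", "y"]
```

Because the logistic function is monotone, per-object parameters (rather than the shared `a, b` above) are what allow the probabilistic ranking to differ from the raw pseudo-score ranking.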

4.
Similarity searching in metric spaces has a vast number of applications in several fields like multimedia databases, text retrieval, computational biology, and pattern recognition. In this context, one of the most important similarity queries is the k nearest neighbor (k-NN) search. The standard best-first k-NN algorithm uses a lower bound on the distance to prune objects during the search. Although optimal in several aspects, the disadvantage of this method is that its space requirements for the priority queue that stores unprocessed clusters can be linear in the database size. Most of the optimizations used in spatial access methods (for example, pruning using MinMaxDist) cannot be applied in metric spaces, due to the lack of geometric properties. We propose a new k-NN algorithm that uses distance estimators, aiming to reduce the storage requirements of the search algorithm. The method stays optimal, yet it can significantly prune the priority queue without altering the output of the query. Experimental results with synthetic and real datasets confirm the reduction in storage space of our proposed algorithm, showing savings of up to 80% of the original space requirement.
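The baseline this paper improves on, best-first k-NN over clusters keyed by a triangle-derived lower bound, can be sketched as follows (illustrative Python; the flat cluster layout is an assumption for brevity, and the paper's distance estimators are not reproduced):

```python
import heapq
from math import dist

def knn_best_first(q, clusters, k):
    """Textbook best-first k-NN: the frontier heap keys each cluster by the
    lower bound max(0, d(q, center) - radius); a cluster is expanded only
    if that bound can still beat the current k-th nearest distance."""
    frontier = [(max(0.0, dist(q, c) - r), pts) for c, r, pts in clusters]
    heapq.heapify(frontier)
    result = []  # max-heap of (-distance, point) holding the best k so far
    while frontier:
        lb, pts = heapq.heappop(frontier)
        if len(result) == k and lb >= -result[0][0]:
            break  # no remaining cluster can improve the answer
        for p in pts:
            d = dist(q, p)
            if len(result) < k:
                heapq.heappush(result, (-d, p))
            elif d < -result[0][0]:
                heapq.heapreplace(result, (-d, p))
    return sorted((-nd, p) for nd, p in result)

clusters = [((0.0,), 1.0, [(0.5,), (-0.5,)]),
            ((10.0,), 1.0, [(9.5,), (10.5,)])]
out = knn_best_first((0.0,), clusters, 2)
assert {p for _, p in out} == {(0.5,), (-0.5,)}
```

The storage problem the paper addresses is visible in `frontier`: in the worst case it holds an entry per unexpanded cluster, which is what the proposed distance estimators prune.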

Benjamin Bustos is an assistant professor in the Department of Computer Science at the University of Chile. He is also a researcher at the Millennium Nucleus Center for Web Research. His research interests are similarity searching and multimedia information retrieval. He has a doctoral degree in natural sciences from the University of Konstanz, Germany. Contact him at bebustos@dcc.uchile.cl. Gonzalo Navarro earned his PhD in Computer Science at the University of Chile in 1998, where he is now a full professor. His research interests include similarity searching, text databases, compression, and algorithms and data structures in general. He has coauthored a book on string matching and around 200 international papers. He has (co)chaired the international conferences SPIRE 2001, SCCC 2004, SPIRE 2005, SIGIR Posters 2005, IFIP TCS 2006, and the ENC 2007 Scalable Pattern Recognition track, and belongs to the editorial board of the Information Retrieval Journal. He is currently Head of the Department of Computer Science at the University of Chile, and Head of the Millennium Nucleus Center for Web Research, the largest Chilean project in computer science research.

5.
Similarity search operations require executing expensive algorithms, and although broadly useful in many new applications, they rely on specific structures not yet supported by commercial DBMSs. In this paper we discuss the new Omni-technique, which allows building a variety of dynamic metric access methods based on a number of objects selected from the dataset and used as global reference objects. We call them the Omni-family of metric access methods. This technique enables building similarity search operations on top of existing structures, significantly improving their performance in terms of both disk accesses and distance calculations. Additionally, our methods scale up well, exhibiting sub-linear behavior with growing database size.

6.
Spatial indexing of high-dimensional data based on relative approximation
We propose a novel index structure, the A-tree (approximation tree), for similarity searches in high-dimensional data. The basic idea of the A-tree is the introduction of virtual bounding rectangles (VBRs), which contain and approximate MBRs or data objects. VBRs can be represented quite compactly and thus affect the tree configuration both quantitatively and qualitatively. First, since tree nodes can contain a large number of VBR entries, fanout becomes large, which increases search speed. More importantly, we have a free hand in arranging MBRs and VBRs in the tree nodes. Each A-tree node contains an MBR and its children's VBRs. Therefore, by fetching an A-tree node, we obtain the exact position of a parent MBR and the approximate positions of its children. We have performed experiments using both synthetic and real data sets. For the real data sets, the A-tree outperforms the SR-tree and the VA-file in all dimensionalities up to 64 dimensions, the highest dimension in our experiments. Additionally, we propose a cost model for the A-tree and verify its validity on synthetic and real data sets. Edited by T. Sellis. Received: December 8, 2000 / Accepted: March 20, 2002 / Published online: September 25, 2002
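The containment trick behind such approximations can be sketched per dimension (illustrative Python; the grid resolution and encoding below are assumptions, not the A-tree's actual layout): the low edge of the child MBR is rounded down and the high edge up on a coarse grid relative to the parent MBR, so the decoded rectangle always contains the exact one while needing only a few bits per coordinate.

```python
from math import floor, ceil

def quantize_vbr(parent_lo, parent_hi, child_lo, child_hi, bits=8):
    """Encode a child's MBR as per-dimension grid-cell indices relative to
    its parent's MBR, conservatively (floor the low edge, ceil the high)."""
    levels = (1 << bits) - 1
    codes = []
    for pl, ph, cl, ch in zip(parent_lo, parent_hi, child_lo, child_hi):
        scale = levels / (ph - pl)
        codes.append((floor((cl - pl) * scale), ceil((ch - pl) * scale)))
    return codes

def decode_vbr(parent_lo, parent_hi, codes, bits=8):
    """Recover the (slightly enlarged) rectangle from the grid indices."""
    levels = (1 << bits) - 1
    return [(pl + lc * (ph - pl) / levels, pl + hc * (ph - pl) / levels)
            for (pl, ph), (lc, hc) in zip(zip(parent_lo, parent_hi), codes)]

codes = quantize_vbr((0.0,), (100.0,), (12.34,), (56.78,))
(lo, hi), = decode_vbr((0.0,), (100.0,), codes)
assert lo <= 12.34 and hi >= 56.78  # the VBR conservatively contains the MBR
```

Conservative rounding is what keeps the search exact: a VBR may admit false positives, which the exact parent MBRs later filter out, but it can never miss a qualifying child.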

7.
We propose a scalable distributed data structure (SDDS) called SD-Rtree. We intend our structure for point, window and kNN queries over large spatial datasets distributed on clusters of interconnected servers. The structure balances the storage and processing load over the available resources, and aims at minimizing the size of the cluster. The SD-Rtree generalizes the well-known Rtree structure. It uses a distributed balanced binary tree that scales with insertions to potentially any number of storage servers through splits of the overloaded ones. A user/application manipulates the structure from a client node. The client addresses the tree through its image, which may be outdated due to later splits. This can generate addressing errors, which are resolved by forwarding among the servers; specific messages towards the clients incrementally correct the outdated images. We present the building of an SD-Rtree through insertions, focusing on the split and rotation algorithms, and follow with the query algorithms. We then describe a flexible allocation protocol which copes with a temporary shortage of storage resources through data storage balancing. Experiments show additional aspects of the SD-Rtree and compare its behavior with a distributed quadtree. The results justify our various design choices and the overall utility of the structure.

8.
Proximity searches become very difficult in “high-dimensional” metric spaces, that is, those whose histogram of distances has a large mean and/or a small variance. This so-called “curse of dimensionality”, well known in vector spaces, is also observed in metric spaces. The search complexity grows sharply with the dimension and with the search radius. We present a general probabilistic framework applicable to any search algorithm, whose net effect is to reduce the search radius. The higher the dimension, the more effective the technique. We illustrate its practical performance empirically on a particular class of algorithms, where large improvements in search time are obtained at the cost of a very small error probability.
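The radius-reduction idea can be sketched with a single-pivot range search (illustrative Python; the function and the shrink factor `beta` are assumptions for exposition, not the paper's framework): pruning tests use beta·r while final verification uses the true r, so beta < 1 saves distance computations at the risk of missing some answers.

```python
from math import dist

def probabilistic_range_search(q, pivot, db, radius, beta=1.0):
    """Prune candidates with the triangle lower bound |d(q,p) - d(o,p)|
    tested against beta*radius instead of radius; survivors are verified
    with the true radius. beta = 1 gives the exact search."""
    dq_p = dist(q, pivot)
    candidates = [o for o in db
                  if abs(dq_p - dist(o, pivot)) <= beta * radius]
    return [o for o in candidates if dist(q, o) <= radius]

db = [(0.0,), (1.0,), (2.0,), (5.0,)]
exact = probabilistic_range_search((0.0,), (3.0,), db, 2.0)            # beta = 1
approx = probabilistic_range_search((0.0,), (3.0,), db, 2.0, beta=0.4)
assert exact == [(0.0,), (1.0,), (2.0,)]
assert set(approx) <= set(exact)  # may miss answers, never adds false ones
```

The one-sided error is the key property: shrinking the pruning radius can only lose true answers, never introduce wrong ones, which is why a small beta reduction buys large savings at a small, quantifiable error probability.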

9.
Learned indexes use a model to restrict the search of a sorted table to a smaller interval. Typically, a final binary search is done using the lower_bound routine of the C++ Standard Library. Recent studies have shown that on current processors other search approaches (such as k-ary search) can be more efficient in some applications. Using the SOSD learned-indexing benchmarking software, we extend these results to show that k-ary search is indeed the better choice when using learned indexes. We highlight how this choice may depend on the computer architecture used, for example Intel i7 or Apple M1, and provide guidelines for selecting the search routine within the learned-indexing framework.
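A minimal k-ary lower_bound (a sketch of the general technique, not SOSD's implementation) probes k−1 evenly spaced separators per round instead of one midpoint; k = 2 recovers plain binary search:

```python
import bisect

def k_ary_lower_bound(a, key, k=4):
    """Return the first index i with a[i] >= key (len(a) if none), by
    narrowing [lo, hi) to the leftmost of k sub-intervals each round."""
    lo, hi = 0, len(a)
    while lo < hi:
        seps = [lo + (hi - lo) * j // k for j in range(1, k)]
        for m in seps:
            if a[m] < key:
                lo = m + 1      # answer lies right of this separator
            else:
                hi = m          # answer is at or left of this separator
                break
    return lo

a = [1, 3, 3, 7, 9]
for key in range(11):
    assert k_ary_lower_bound(a, key) == bisect.bisect_left(a, key)
```

The payoff on modern hardware comes from issuing the k−1 separator comparisons with fewer dependent branches per cache line touched, which is why the best k differs between, say, Intel and Apple M1 cores.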

10.
In the digital age, the dependence of the elderly on smartphones is growing, and the frequency of use and time spent on various apps continue to grow. The display elements in the quick-access area on the homepage of an app act as shortcuts for navigation, directly affecting the search efficiency and user experience of the elderly when they use the app. In this study, we investigate the impact of the density of the elements (icons and their text labels) located in the quick-access area of smartphone apps on the visual search efficiency and user experience of elderly people. First, three typical density designs were extracted by collating and analyzing the density rules for the elements in the quick-access areas on the homepages of existing elderly-oriented versions of mainstream apps. Then, the densities of three elements in the quick-access area on the homepage of a takeout app were used for a case study involving 96 elderly subjects, who were invited to participate in a task-search test and user interviews. The results showed that the density of the elements in the app's quick-access area had no significant effect on the visual search efficiency of the subjects but did affect the user experience. Additionally, the older elderly subjects preferred designs with lower density, while the younger elderly ones, who had more online shopping experience, preferred designs with higher density. The subjects were more concerned about the ease of use and the overall user experience of the quick-access area than its visual aesthetics. The results not only provide a theoretical reference and design basis for the icons and text labels of app quick-access areas, considering element density in the context of elderly-oriented apps, but also offer inspiration for improving the user experience of elderly app users.

11.
Computer-based assessments of complex problem solving (CPS) that have been used in international large-scale surveys require students to engage in an in-depth interaction with the problem environment. In doing so, they evoke manifest sequences of overt behavior that are stored in computer-generated log files. In the present study, we explored the relation between several overt behaviors, which N = 1476 Finnish ninth-grade students (mean age = 15.23, SD = .47 years) exhibited when exploring a CPS environment, and their CPS performance. We used the MicroDYN approach to measure CPS and inspected students' behaviors through log-file analyses. Results indicated that students who occasionally observed the problem environment in a noninterfering way in addition to actively exploring it (noninterfering observation) showed better CPS performance, whereas students who showed a high frequency of (potentially unplanned) interventions (intervention frequency) exhibited worse CPS performance. Additionally, both too much and too little time spent on a CPS task (time on task) were associated with poor CPS performance. The observed effects held after controlling for students' use of an exploration strategy that required a sequence of multiple interventions (VOTAT strategy), indicating that these behaviors exhibited incremental effects on CPS performance beyond the use of VOTAT.

12.
A new algorithm for decomposing mixed pixels, based on orthogonal bases of the data space, is proposed in this paper. It is a simplex-based method which extracts endmembers sequentially using computations of largest simplex volumes. At each step of the extraction, searching for the simplex with the largest volume is equivalent to searching for a new orthogonal basis vector with the largest norm; the new endmember corresponds to that basis vector. The algorithm runs very fast and also avoids the dilemma of traditional simplex-based endmember extraction algorithms, such as N-FINDR, which generally produce different sets of final endmembers under different initial conditions. Moreover, with this set of orthogonal bases, the proposed algorithm can also determine the proper number of endmembers and finish the unmixing of the original images, which the traditional simplex-based algorithms cannot do by themselves. Experimental results on both simulated images and practical remote sensing images demonstrate that the proposed algorithm is a fast and accurate method for the decomposition of mixed pixels.
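The equivalence between maximizing simplex volume and picking the pixel with the largest orthogonal-residual norm can be sketched as follows (illustrative Python using plain Gram-Schmidt on the linear span; the paper's exact formulation, including any affine or mean-removal step, is not reproduced):

```python
def residual(v, basis):
    """Component of v orthogonal to the span of `basis` (Gram-Schmidt step;
    `basis` is assumed pairwise orthogonal)."""
    r = list(v)
    for b in basis:
        nb = sum(x * x for x in b)
        if nb == 0.0:
            continue
        coef = sum(x * y for x, y in zip(r, b)) / nb
        r = [x - coef * y for x, y in zip(r, b)]
    return r

def extract_endmembers(pixels, count):
    """Sequentially pick the pixel whose component orthogonal to the span
    of the endmembers chosen so far has the largest norm (each pick
    maximizes the grown simplex volume); keep its residual as the next
    orthogonal basis vector."""
    basis, endmembers = [], []
    for _ in range(count):
        best = max(pixels,
                   key=lambda p: sum(x * x for x in residual(p, basis)))
        endmembers.append(best)
        basis.append(residual(best, basis))
    return endmembers

pixels = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]
assert extract_endmembers(pixels, 2) == [(0.0, 2.0, 0.0), (1.0, 0.0, 0.0)]
```

Because each step only reuses the already-orthogonalized basis, the selection is deterministic, which is the property that removes N-FINDR's sensitivity to initial conditions; the decay of the residual norms is also what signals a suitable number of endmembers.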

13.
Classification of medical data raises several problems, such as class imbalance, the double meaning of missing data, data volume, and the need for highly interpretable results. In this paper a new algorithm is proposed: MOCA-I (Multi-Objective Classification Algorithm for Imbalanced data), a multi-objective local search algorithm designed to address these issues together. It is based on a new formulation as a Pittsburgh multi-objective partial classification rule mining problem, which is described in the first part of this paper. An existing dominance-based multi-objective local search (DMLS) is modified to deal with this formulation. After experimentally tuning the parameters of MOCA-I and determining which version of the DMLS algorithm is the most effective, the resulting version of MOCA-I is compared to several state-of-the-art classification algorithms. This comparison is carried out on 10 small and medium-sized data sets from the literature and 2 real data sets; MOCA-I obtains the best results on the 10 data sets and is statistically better than the other approaches on the real data sets.

14.
Application of a spatial data fusion algorithm to temperature field computation
A spatial data fusion algorithm is proposed which combines the characteristics of interpolation approximation and least-squares approximation: on the one hand it retains the useful properties of least-squares approximation while reducing computational complexity, and on the other it overcomes the drawbacks of interpolation approximation, such as high-order oscillation. Given many data-acquisition points, it can therefore quickly produce a polynomial function describing how the measured quantity varies over the whole measurement interval. Its application to computing the temperature distribution of an in-situ combustion (fireflood) oil-layer temperature field shows it to be a real-time and effective approximation algorithm. Finally, a further performance analysis of the algorithm is given.
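The least-squares half of such a hybrid can be sketched with a plain normal-equations polynomial fit (illustrative Python; the fused interpolation/least-squares scheme itself is not reproduced):

```python
def polyfit_lsq(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations, solved by
    Gaussian elimination with partial pivoting (fine for the low degrees
    used in practice). Returns coefficients c0..c_degree of
    c0 + c1*x + ... minimizing the squared error over the points."""
    n = degree + 1
    # Normal equations A c = b: A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                       # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * p for a, p in zip(A[r], A[col])]
            b[r] -= f * b[col]
    c = [0.0] * n
    for i in reversed(range(n)):               # back substitution
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, n))) / A[i][i]
    return c

# Points on y = 2 + 3x are recovered by a degree-1 fit.
c0, c1 = polyfit_lsq([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0], 1)
assert abs(c0 - 2.0) < 1e-9 and abs(c1 - 3.0) < 1e-9
```

A low-degree least-squares fit like this smooths measurement noise but cannot pass through every point; the hybrid described above exists precisely to combine that smoothing with interpolation's fidelity at the acquisition points.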

15.
蒋胤傑, 况琨, 吴飞. 《智能系统学报》, 2020, 15(1): 175-182
Data-driven machine learning (especially deep learning) has made great progress in natural language processing, computer vision and speech recognition, and is a hot topic in artificial intelligence research. Traditional machine learning, however, uses optimization algorithms to fit the model that is optimal on the training set, i.e., the one minimizing the average loss. In many real-world problems (such as commercial auctions and resource allocation), the goal an AI algorithm should learn is instead an equilibrium solution, one that remains effective in dynamic situations. This requires applying game-theoretic ideas to big-data intelligence. Through methods such as Monte Carlo tree search and reinforcement learning, game theory can be combined with artificial intelligence to seek equilibrium solutions of adversarial game models. Moving from the optimal solution of data fitting to the equilibrium solution of adversarial games gives big-data intelligence a much broader application space.

16.
The rate of penetration (ROP) model is of great importance for achieving high efficiency in the complex geological drilling process. In this paper, a novel two-level intelligent modeling method is proposed for the ROP, considering the drilling-data characteristics of incompleteness, coupling, and strong nonlinearity. First, a piecewise cubic Hermite interpolation method is introduced to fill in missing drilling data. Then, a formation drillability (FD) fusion submodel is established using the Nadaboost extreme learning machine (Nadaboost-ELM) algorithm, and the mutual information method is used to select the parameters strongly correlated with the ROP. Finally, a ROP submodel is established by a radial basis function neural network optimized by improved particle swarm optimization (RBFNN-IPSO). This two-level ROP model is applied to a real drilling process, and the proposed method shows the best ROP prediction performance compared with conventional methods. The proposed ROP model provides a basis for intelligent optimization and control in the complex geological drilling process.

17.
Recently, terms such as "MOOC" and "flipped classroom" have become hot topics in education, gradually moving from the awareness stage to the practice stage. In China, a number of universities have already carried out related practice. What, then, is the "flipped classroom"? How should we understand the impact of this new development on undergraduate computer science education? And how can it be put into concrete teaching practice in Data Structures, a core foundational course of the computer science curriculum? This paper investigates these questions.

18.
To capitalize on multicore power, modern high-speed data transfer applications usually adopt a multi-threaded design and aggregate multiple network interfaces. However, NUMA introduces another dimension of complexity to these applications. In this paper, we undertook comprehensive experiments on real systems to illustrate the importance of NUMA-awareness to applications with intensive memory accesses and network I/O. Instead of simply attributing the NUMA effect to the physical layout, we provide an in-depth analysis of the underlying interactions inside hardware devices. We profile system performance by monitoring relevant hardware counters, and reveal how the NUMA penalty occurs during prefetch and cache synchronization. Consequently, we implement a thread mapping module in a bulk data transfer application, BBCP, as a practical example of enabling NUMA-awareness. The enhanced application is then evaluated on our high-performance testbed with storage area networks (SAN). Our experimental results show that the proposed NUMA optimizations can significantly improve BBCP's performance in memory-based tests with various contention levels and in realistic data transfers involving SAN-based storage.

19.
In butadiene production, real-time data from the DCS commonly contain random or gross errors caused by unknown disturbances and random noise. Correcting and smoothing these errors is important for improving data accuracy and reliability. The data reconciliation module described in this paper is based on linear constraints; following the practical situation, the module is structured into a matrix-processing part and a reconciliation part, with data structures and processing flows designed for the characteristics of industrial data. Practical application shows that, under linear constraints, the module performs the data reconciliation function as intended.

20.
No school is an island; each is part of a continuum of institutions that together form an educational pipeline through which groups of students pass. To turn a body of data into useful information for knowledge-based decision-making at any level, data must be collected, organised, analysed and reflected upon. The purpose of this paper is to discuss how schools and other educational institutions can not only collect better data but also learn how to transform those data so that the information they hold can be effectively shared among all stakeholders. This process will help ensure that the school and the entire education system provide a more seamless and effective educational pipeline for students, and ultimately improve the quality of education delivered in the country as a whole.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号