首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The K-Nearest Neighbor (K-NN) search problem is the way to find the K closest and most similar objects to a given query. The K-NN is essential for many applications such as information retrieval and visualization, machine learning and data mining. The exponential growth of data imposes to find approximate approaches to this problem. Permutation-based indexing is one of the most recent techniques for approximate similarity search. Objects are represented by permutation lists ordering their distances to a set of selected reference objects, following the idea that two neighboring objects have the same surrounding. In this paper, we propose a novel quantized representation of permutation lists with its related data structure for effective retrieval on single and multicore architectures. Our novel permutation-based indexing strategy is built to be fast, memory efficient and scalable. This is experimentally demonstrated in comparison to existing proposals using several large-scale datasets of millions of documents and of different dimensions.  相似文献   

2.
Constant technological advances in electronic devices have led to the growth of elaborated data such as large texts, time series, georeferenced imagery, genetic sequences, photos, videos and several other types of complex data. Differently from scalar, traditional data types such as numbers and strings, complex data do not present the order relation property, which allows identifying whether an element precedes another according to some criterion. Therefore, these data are usually compared by the similarity degree among them. The Metric Access Methods (MAMs) are recognized as well-suited to perform similarity queries over such kind of data more efficiently than other access methods. MAMs can be considered dynamic or static depending on the pivot type used to construct them. Pivots are often employed to narrow the search for data. Global pivots can be employed to look into elements in the whole dataset, thus they have a high impact in the process of pruning irrelevant elements, since a single global pivot can be used to discard a large amount of irrelevant elements. Nevertheless, MAMs based on global pivots may have their dynamicity compromised by the fact that eventual pivot-related updates must be propagated through the entire structure. Local pivots, on the other hand, allow the maintenance to occur locally at the price of a lower pruning ability. In this paper, we propose novel techniques for improving the performance of dynamic MAMs without harming their dynamicity, once that several applications handle online complex data and, consequently, demand efficient dynamic indexes to be successful. Specifically, our main contributions are three techniques: (i) CLAP, which consists of employing local additional pivots to reduce distance calculations; (ii) ACIR, which is combined with CLAP and anticipates information from child nodes to reduce unnecessary disk accesses; and (iii) SCOOP, which is combined with CLAP as an extended version of ACIR, anticipating a larger amount of information from child nodes. The techniques have been applied to a dynamic MAM and evaluated over real datasets ranging from moderate to high dimensionality and cardinality. The experimental results show that our techniques were able to reduce query execution time in up to 63% for point queries and up to 53% for queries retrieving multiple elements.  相似文献   

3.
In the context of software startups, project failure is embraced actively and considered crucial to obtain validated learning that can lead to pivots. A pivot is the strategic change of a business concept, product or the different elements of a business model. A better understanding is needed on different types of pivots and different factors that lead to failures and trigger pivots, for software entrepreneurial teams to make better decisions under chaotic and unpredictable environment. Due to the nascent nature of the topic, the existing research and knowledge on the pivots of software startups are very limited. In this study, we aimed at identifying the major types of pivots that software startups make during their startup processes, and highlighting the factors that fail software projects and trigger pivots. To achieve this, we conducted a case survey study based on the secondary data of the major pivots happened in 49 software startups. 10 pivot types and 14 triggering factors were identified. The findings show that customer need pivot is the most common among all pivot types. Together with customer segment pivot, they are common market related pivots. The major product related pivots are zoom-in and technology pivots. Several new pivot types were identified, including market zoom-in, complete and side project pivots. Our study also demonstrates that negative customer reaction and flawed business model are the most common factors that trigger pivots in software startups. Our study extends the research knowledge on software startup pivot types and pivot triggering factors. Meanwhile it provides practical knowledge to software startups, which they can utilize to guide their effective decisions on pivoting.  相似文献   

4.
We present methods to store and access templates of data arrays in parallel processors with shuffle-exchange-type interconnection networks. For this purpose, we define the class of composite linear permutations. In our method, each element of the data array is stored in the memory module determined by applying a suitable composite linear permutation on its indices. Simple necessary and sufficient criteria to avoid memory conflicts in the access of important templates such as row, column, main diagonal, and square block are given based on the composite linear permutation involved. The criteria so derived also specify the set of permutations to be realized by an interconnection network to avoid network conflicts. In particular, we give the criteria to be satisfied by a scheme of the proposed class to avoid network conflicts during the access of templates, when shuffle-exchange-type networks are used. Almost all the previously known scrambled storage methods are special cases in the class of storage methods presented in this paper.  相似文献   

5.
Mechanized systems for equational inference often produce many terms that are permutations of one another. We propose to gain efficiency by dealing with such sets of terms in a uniform manner, by the use of efficient general algorithms on permutation groups. We show how permutation groups arise naturally in equational inference problems, and study some of their properties. We also study some general algorithms for processing permutations and permutation groups, and consider their application to equational reasoning and term-rewriting systems. Finally, we show how these techniques can be incorproated into resolution theorem-proving strategies.  相似文献   

6.
Access Structures for Angular Similarity Queries   总被引:2,自引:0,他引:2  
Angular similarity measures have been utilized by several database applications to define semantic similarity between various data types such as text documents, time-series, images, and scientific data. Although similarity searches based on Euclidean distance have been extensively studied in the database community, processing of angular similarity searches has been relatively untouched. Problems due to a mismatch in the underlying geometry as well as the high dimensionality of the data make current techniques either inapplicable or their use results in poor performance. This brings up the need for effective indexing methods for angular similarity queries. We first discuss how to efficiently process such queries and propose effective access structures suited to angular similarity measures. In particular, we propose two classes of access structures, namely, angular-sweep and cone-shell, which perform different types of quantization based on the angular orientation of the data objects. We also develop query processing algorithms that utilize these structures as dense indices. The proposed techniques are shown to be scalable with respect to both dimensionality and the size of the data. Our experimental results on real data sets from various applications show two to three orders of magnitude of speedup over the current techniques  相似文献   

7.
This paper evaluates the use of standard database indexes and query processing as a way to do metric indexing in the LAESA approach. By utilizing B-trees and R-trees as pivot-based indexes, we may use well-known optimization techniques from the database field within metric indexing and search. The novelty of this paper is that we use a cost-based approach to dynamically evaluate which and how many pivots to use in the evaluation of each query. By a series of measurements using our database prototype we are able to evaluate the performance of this approach. Compared to using all available pivots for filtering, the optimized approach gives half the response times for main memory data, but much more varied results for disk resident data. However, by use of the cost model we are able to dynamically determine when to bypass the indexes and simply perform a sequential scan of the base data. The conclusion of this evaluation is that it is beneficial to create many pivots, but to use only the most selective ones during evaluation of each query. R-trees give better performance than B-trees when utilizing all pivots, but when being able to dynamically select the best pivots, B-trees often provide better performance.  相似文献   

8.
Visualization is one of the most effective methods for analyzing how high-dimensional data are distributed. Dimensionality reduction techniques, such as PCA, can be used to map high dimensional data to a two- or three-dimensional space. In this paper, we propose an algorithm called HyperMap that can be effectively applied to visualization. Our algorithm can be seen as a generalization of FastMap. It preserves its linear computation complexity, and overcomes several main shortcomings, especially in visualization. Since there are more than two pivot objects in each axis of a target space, more distance information needs to be preserved in each dimension. Then in visualization, the number of pivot objects can go beyond the limitation of six (2-pivot objects × 3-dimensions). Our HyperMap algorithm also gives more flexibility to the target space, such that the data distribution can be observed from various viewpoints. Its effectiveness is confirmed by empirical evaluations on both real and synthetic datasets.  相似文献   

9.
We say an LP (linear programming) is fully nondegenerate if both the primal and the dual problems are nondegenerate. In this paper, we prove the existence of a sequence of | B ∖ B | admissible pivot from any basis B (not necessarily feasible) to the unique optimal basis B , if the given LP has an optimal solution and is fully nondegenerate. Here admissible pivots are those pivots (satisfying certain sign conditions) that exist if the current LP dictionary is not terminal, i.e., neither optimal, inconsistent nor dual inconsistent. A natural extension of the result to LCPs (linear complementarity problems) with sufficient matrices is given. The existence itself does not yield a strongly polynomial pivot algorithm for LPs but provides us with a good motivation to study the class of admissible pivot methods for LPs, as opposed to the narrower class of simplex methods for which the shortest sequence of pivots is not known to be polynomially bounded.  相似文献   

10.
The simple intramolecular model for gene assembly in ciliates is particularly interesting because it can predict the correct assembly of all available experimental data, although it is not universal. The simple model also has a confluence property that is not shared by the general model. A previous formalization of the simple model through sorting of signed permutations is unsatisfactory because it effectively ignores one operation of the model and thus, it cannot be used to answer questions about parallelism in the model, or about measures of complexity. We propose in this paper a string-based model in which a gene is represented through its sequence of pointers and markers and its assembly is represented as a string rewriting process. We prove that this string-based model is equivalent to the permutation-based model as far as gene assembly is concerned, while it tracks all operations of the simple model. We also consider overlap graphs for these strings and prove the results with respect to the overlap of markers.  相似文献   

11.
Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.  相似文献   

12.
Clustering is the process of grouping objects that are similar, where similarity between objects is usually measured by a distance metric. The groups formed by a clustering method are referred as clusters. Clustering is a widely used activity with multiple applications ranging from biology to economics. Each clustering technique has some advantages and disadvantages. Some clustering algorithms may even require input parameters which strongly affect the result. In most cases, it is not possible to choose the best distance metric, the best clustering method, and the best input argument values for an input data set. Therefore, multiple clusterings can be obtained by several distance metrics, several clustering methods, and several input argument values. And, multiple clusterings can be combined into a new and better quality final clustering. We propose a family of combining multiple clustering algorithms that are memory efficient, scalable, robust, and intuitive. Our new algorithms offer tremendous speed gain and low memory requirements by working at cluster level, while producing very good quality final clusters. Extensive experimental evaluations on some very challenging artificially generated and real data sets from a diverse set of domains establish the usefulness of our methods.  相似文献   

13.
Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.  相似文献   

14.
In this paper we propose the use of Boolean permutations to design public key cryptosystems. The security of the cryptosystems is based on the difficulty of inverting Boolean permutations. Using two Boolean permutations for which the inverses are easy to find, one can construct a composite Boolean permutation which is hard to invert. The paper proposes three such Boolean permutation based public key systems. The paper also consider applications of a Boolean permutation based public key system to digital signatures and shared signatures.  相似文献   

15.
Many applications — such as content-based image retrieval, subspace clustering, and feature selection — may benefit from efficient subspace similarity search. Given a query object, the goal of subspace similarity search is to retrieve the most similar objects from the database, where the similarity distance is defined over an arbitrary subset of dimensions (or features) — that is, an arbitrary axis-aligned projective subspace — specified along with the query. Though much effort has been spent on similarity search in fixed subspaces, relatively little attention has been given to the problem of similarity search when the dimensions are specified at query time. In this paper, we propose new methods for the subspace similarity search problem for real-valued data. Extensive experiments are provided showing very competitive performance relative to state-of-the-art solutions.  相似文献   

16.
General-purpose computing on graphics processing unit (GPGPU) has been adopted to accelerate the running of applications which require long execution time in various problem domains. Tabu Search belonging to meta-heuristics optimization has been used to find a suboptimal solution for NP-hard problems within a more reasonable time interval. In this paper, we have investigated in how to improve the performance of Tabu Search algorithm on GPGPU and took the permutation flow shop scheduling problem (PFSP) as the example for our study. In previous approach proposed recently for solving PFSP by Tabu Search on GPU, all the job permutations are stored in global memory to successfully eliminate the occurrences of branch divergence. Nevertheless, the previous algorithm requires a large amount of global memory space, because of a lot of global memory access resulting in system performance degradation. We propose a new approach to address the problem. The main contribution of this paper is an efficient multiple-loop struct to generate most part of the permutation on the fly, which can decrease the size of permutation table and significantly reduce the amount of global memory access. Computational experiments on problems according with benchmark suite for PFSP reveal that the best performance improvement of our approach is about 100%, comparing with the previous work.  相似文献   

17.
A paired data set is common in microarray experiments, where the data are often incompletely observed for some pairs due to various technical reasons. In microarray paired data sets, it is of main interest to detect differentially expressed genes, which are usually identified by testing the equality of means of expressions within a pair. While much attention has been paid to testing mean equality with incomplete paired data in previous literature, the existing methods commonly assume the normality of data or rely on the large sample theory. In this paper, we propose a new test based on permutations, which is free from the normality assumption and large sample theory. We consider permutation statistics with linear mixtures of paired and unpaired samples as test statistics, and propose a procedure to find the optimal mixture that minimizes the conditional variances of the test statistics, given the observations. Simulations are conducted for numerical power comparisons between the proposed permutation tests and other existing methods. We apply the proposed method to find differentially expressed genes for a colorectal cancer study.  相似文献   

18.
XML is a new standard for exchanging and representing information on the Internet. Documents can be hierarchically represented by XML-elements. In this paper, we propose that an XML document collection be represented and indexed using a bitmap indexing technique. We define the similarity and popularity operations suitable for bitmap indexes. We also define statistical measurements in the BitCube: center, and radius. Based on these measurements, we describe a new bitmap indexing based technique to cluster XML documents. The techniques for clustering are motivated by the fact that the bitmap indexes are expected to be very sparse.Furthermore, a 2-dimensional bitmap index is extended to a 3-dimensional bitmap index, called the BitCube. Sophisticated querying of XML document collections can be performed using primitive operations such as slice, project, and dice. Experiments show that the BitCube can be created efficiently and the primitive operations can be performed more efficiently with the BitCube than with other alternatives.  相似文献   

19.
Appropriately defining and efficiently calculating similarities from large data sets are often essential in data mining, both for gaining understanding of data and generating processes and for building tractable representations. Given a set of objects and their correlations, we here rely on the premise that each object is characterized by its context, i.e., its correlations to the other objects. The similarity between two objects can then be expressed in terms of the similarity between their contexts. In this way, similarity pertains to the general notion that objects are similar if they are exchangeable in the data. We propose a scalable approach for calculating all relevant similarities among objects by relating them in a correlation graph that is transformed to a similarity graph. These graphs can express rich structural properties among objects. Specifically, we show that concepts—abstractions of objects—are constituted by groups of similar objects that can be discovered by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of fields and will be demonstrated here in three domains: computational linguistics, music, and molecular biology, where the numbers of objects and correlations range from small to very large.  相似文献   

20.
Graph colouring approaches for a satellite range scheduling problem   总被引:2,自引:0,他引:2  
A goal of this paper is to efficiently adapt the best ingredients of the graph colouring techniques to an NP-hard satellite range scheduling problem, called MuRRSP. We propose two new heuristics for the MuRRSP, where as many jobs as possible have to be scheduled on several resources, while respecting time and capacity constraints. In the permutation solution space, which is widely used by other researchers, a solution is represented by a permutation of the jobs, and a schedule builder is needed to generate and evaluate a feasible schedule from the permutation. On the contrary, our heuristics are based on the solution space which contains all the feasible schedules. Based on the similarities between the graph colouring problem and the MuRRSP, we show that the latter solution space has significant advantages. A tabu search and an adaptive memory algorithms are designed to tackle the MuRRSP. These heuristics are derived from efficient graph colouring methods. Numerical experiments, performed on large, realistic, and challenging instances, showed that our heuristics are very competitive, robust, and outperform algorithms based on the permutation solution space.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号