首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Semantic-based searching in peer-to-peer (P2P) networks has drawn significant attention recently. A number of semantic searching schemes, such as GES proposed by Zhu Y et al., employ search models in Information Retrieval (IR). All these IR-based schemes use one vector to summarize semantic contents of all documents on a single node. For example, GES derives a node vector based on the IR model: VSM (Vector Space Model). A topology adaptation algorithm and a search protocol are then designed according to the similarity between node vectors of different nodes. Although the single semantic vector is suitable when the distribution of documents in each node is uniform, it may not be efficient when the distribution is diverse. When there are many categories of documents at each node, the node vector representation may be inaccurate. We extend the idea of GES and present a new class-based semantic searching scheme (CSS) specifically designed for unstructured P2P networks with heterogeneous single-node document collection. It makes use of a state-of-the-art data clustering algorithm, online spherical k-means clustering (OSKM), to cluster all documents on a node into several classes. Each class can be viewed as a virtual node. Virtual nodes are connected through virtual links. As a result, the class vector replaces the node vector and plays an important role in the class-based topology adaptation and search process. This makes CSS very efficient. Our simulation using the IR benchmark TREC collection demonstrates that CSS outperforms GES in terms of higher recall, higher precision, and lower search cost.  相似文献   

2.
Finding similar items in a large and unstructured dataset is a challenging task in many applications of data science, such as searching, indexing, and retrieval. With the increasing data volume and demand for real time responses, similarity search has gained much consideration. In this paper, a parallel computational approach for similarity search using Bloom filters (PCASSB) has been proposed, which uses Bloom filter for the representation of features of document and comparison with user's query. Query features are stored in integer query array (IQA), an array of integer. The PCASSB, an approximate similarity search technique, has been implemented on graphics processing unit with compute unified device architecture as the programming platform. To compute the similarity score between query and reference dataset, Dice coefficient has been used as a baseline method. The accuracy of the results generated by PCASSB is compared with the baseline method and other state‐of‐the‐art methods. The experimental results show that the proposed technique is quite effective in processing large number of text documents as it takes less computational time.  相似文献   

3.
We present two new techniques for regular expression searching and use them to derive faster practical algorithms. Based on the specific properties of Glushkov’s nondeterministic finite automaton construction algorithm, we show how to encode a deterministic finite automaton (DFA) using O(m2m) bits, where m is the number of characters, excluding operator symbols, in the regular expression. This compares favorably against the worst case of O(m2m|Σ|) bits needed by a classical DFA representation (where Σ is the alphabet) and O(m22m) bits needed by the Wu and Manber approach implemented in Agrep. We also present a new way to search for regular expressions, which is able to skip text characters. The idea is to determine the minimum length ℓ of a string matching the regular expression, manipulate the original automaton so that it recognizes all the reverse prefixes of length up to ℓ of the strings originally accepted, and use it to skip text characters as done for exact string matching in previous work. We combine these techniques into two algorithms, one able and one unable to skip text characters. The algorithms are simple to implement, and our experiments show that they permit fast searching for regular expressions, normally faster than any existing algorithm.  相似文献   

4.
We study the problem of minimizing the expected cost of binary searching for data where the access cost is not fixed and depends on the last accessed element, such as data stored in magnetic or optical disk. We present an optimal algorithm for this problem that finds the optimal search strategy in O(n 3 ) time, which is the same time complexity of the simpler classical problem of fixed costs. Next, we present two practical linear expected time algorithms, under the assumption that the access cost of an element is independent of its physical position. Both practical algorithms are online, that is, they find the next element to access as the search proceeds. The first one is an approximate algorithm which minimizes the access cost disregarding the goodness of the problem partitioning. The second one is a heuristic algorithm, whose quality depends on its ability to estimate the final search cost, and therefore it can be tuned by recording statistics of previous runs. We present an application for our algorithms related to text retrieval. When a text collection is large it demands specialized indexing techniques for efficient access. One important type of index is the suffix array, where data access is provided through an indirect binary search on the text stored in magnetic disk or optical disk. Under this cost model we prove that the optimal algorithm cannot perform better than Ω(1/ log n) times the standard binary search. We also prove that the approximate strategy cannot, on average, perform worse than 39% over the optimal one. We confirm the analytical results with simulations, showing improvements between 34% (optimal) and 60% (online) over standard binary search for both magnetic and optical disks. Received February 13, 1997; revised May 27, 1998.  相似文献   

5.
Peer-to-peer systems have been widely used for sharing and exchanging data and resources among numerous computer nodes. Various data objects identifiable with high dimensional feature vectors, such as text, images, genome sequences, are starting to leverage P2P technology. Most of the existing works have been focusing on queries on data objects with one or few attributes and thus are not applicable on high dimensional data objects. In this study, we investigate K nearest neighbors query (KNN) on high dimensional data objects in P2P systems. Efficient query algorithm and solutions that address various technical challenges raised by high dimensionality, such as search space resolution and incremental search space refinement, are proposed. An extensive simulation using both synthetic and real data sets demonstrates that our proposal efficiently supports KNN query on high dimensional data in P2P systems.  相似文献   

6.
Faster Approximate String Matching   总被引:11,自引:0,他引:11  
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω (log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)) , where m is the pattern length and k<m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m=O(log n) . Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost. We then explore other novel techniques to cope with longer patterns. We show how to partition the pattern into short subpatterns which can be searched with less errors using the simple automaton, to obtain an average cost close to . Moreover, we allow the superimposition of many subpatterns in a single automaton, obtaining near average complexity (σ is the alphabet size). We perform a complete analysis of all the techniques and show how to combine them in an optimal form, also obtaining new tighter bounds for the probability of an approximate occurrence in random text. Finally, we show experimental results comparing our algorithms against previous work. These experiments show that our algorithms are among the fastest for typical text searching, being the fastest in some cases. Although we aim mainly at text searching, we believe that our ideas can be successfully applied to other areas such as computational biology. Received November 22, 1996; revised October 13 and December 5, 1997.  相似文献   

7.
Gonzalo Navarro  Jorma Tarhio 《Software》2005,35(12):1107-1130
We present a Boyer–Moore (BM) approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the BM approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, modifying the basic technique where the algorithm can choose which characters to inspect. We present and compare several solutions for single and multipattern searches. We show that our algorithms obtain speedups of up to 50% compared to the simple decompress‐then‐search approach. Finally, we present a public tool, LZgrep, which uses our algorithms to offer grep‐like capabilities directly searching files compressed using Unix's Compress, a LZW compressor. LZgrep can also search files compressed with Unix gzip, using new decompress‐then‐search techniques we develop, which are faster than the current tools. This way, users can always keep their files in compressed form and still search them, uncompressing only when they want to see them. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

8.
κ Nearest Neighbor (κNN) search is one of the most important operations in spatial and spatio-temporal databases. Although it has received considerable attention in the database literature, there is little prior work on κNN retrieval for moving object trajectories. Motivated by this observation, this paper studies the problem of efficiently processing κNN (κ≥ 1) search on R-tree-like structures storing historical information about moving object trajectories. Two algorithms are developed based on best-first traversal paradigm, called BFPκNN and BFTκNN, which handle the κNN retrieval with respect to the static query point and the moving query trajectory, respectively. Both algorithms minimize the number of node access, that is, they perform a single access only to those qualifying nodes that may contain the final result. Aiming at saving main-memory consumption and reducing CPU cost further, several effective pruning heuristics are also presented. Extensive experiments with synthetic and real datasets confirm that the proposed algorithms in this paper outperform their competitors significantly in both efficiency and scalability.  相似文献   

9.
《Information Systems》2005,30(4):277-298
An important problem in unstructured peer-to-peer (P2P) networks is the efficient content-based retrieval of documents shared by other peers. However, existing searching mechanisms are not scaling well because they are either based on the idea of flooding the network with queries or because they require some form of global knowledge.We propose the Intelligent Search Mechanism (ISM) which is an efficient, scalable yet simple mechanism for improving the information retrieval problem in P2P systems. Our mechanism is efficient since it is bounded by the number of neighbors and scalable because no global knowledge is required to be maintained.ISM consists of four components: A Profiling Structure which logs queryhit messages coming from neighbors, a Query Similarity function which calculates the similarity queries to a new query, RelevanceRank which is an online neighbor ranking function and a Search Mechanism which forwards queries to selected neighbors.We deploy and compare ISM with a number of other distributed search techniques over static and dynamic environments. Our experiments are performed with real data over Peerware, our middleware simulation infrastructure which is deployed on 75 workstations. Our results indicate that ISM outperforms its competitors and that in some cases it manages to achieve 100% recall rate while using only half of the network resources required by its competitors. Further, its performance is also superior with respect to the total query response time and our algorithm exhibits a learning behavior as nodes acquire more knowledge. Finally ISM works well in dynamic network topologies and in environments with replicated data sources.  相似文献   

10.
Summary The order in which the variables are tested in a backtrack program can have a major effect on its running time. The best search order usually varies among the branches of the backtrack tree, so the number of possible search orders can be astronomical. We present an algorithm that chooses a search order dynamically by investigating all possibilities for k levels below the current level, extending beyond k levels wherever possible by setting the variables that have unique forced values. The algorithm takes time O(n k+1) to process a node. For k=2 and binary variables the analysis for selecting the next variable to introduce into the backtrack tree makes complete use of the information contained in the two-level investigations. For larger k or variables of higher degree there is no polynomial-time algorithm that makes complete use of the k-level investigations to limit searching (unless P=NP). The search rearrangement algorithm is closely related to constraint propagation. Experimental studies on conjunctive normal form predicates confirm that 1-level search rearrangement saves a great deal of time compared to 0-level (ordinary backtracking), and show that 2-level saves time over 1-level on large problems. For such problems with 256 variables 2-level is better than 1-level by a factor of two.  相似文献   

11.
Boolean networks provide a simple and intuitive model for gene regulatory networks, but a critical defect is the time required to learn the networks. In recent years, efficient network search algorithms have been developed for a noise-free case and for a limited function class. In general, the conventional algorithm has the high time complexity of O(22k mn k+1) where m is the number of measurements, n is the number of nodes (genes), and k is the number of input parents. Here, we suggest a simple and new approach to Boolean networks, and provide a randomized network search algorithm with average time complexity O (mn k+1/ (log m)(k−1)). We show the efficiency of our algorithm via computational experiments, and present optimal parameters. Additionally, we provide tests for yeast expression data. Editor: David Page  相似文献   

12.
Efficient file searching is an essential feature in P2P systems. While many current approaches use brute force techniques to search files by meta information (file names, extensions or user-provided tags), the interest is in implementing techniques that allow content-based search in P2P systems. Recently, clustering techniques have been used for searching text documents to increase the efficiency of document discovery and retrieval. Integrating such techniques into P2P systems is important to enhance searching in P2P file sharing systems. While some effort has been taken for content-based searching for text documents in P2P systems, there has been few research work for applying these techniques to multimedia content in P2P systems. In this paper, we introduce two P2P content-based clustering techniques for multimedia documents. These techniques are an adaptation of the existing Class-based Semantic Search algorithm for text documents. The proposed algorithms have been integrated into a JXTA-based Overlay P2P platform, and evaluation results are provided. The JXTA-Overlay together with the considered clustering techniques is thus very useful for developing P2P multimedia applications requiring efficient searching of multimedia contents in peer nodes.  相似文献   

13.
An optimal parallel algorithm for volume ray casting   总被引:3,自引:0,他引:3  
Volume rendering by ray casting is computationally expensive. For interactive volume visualization, rendering must be done in real time (30 frames/s). Since the typical size of a 3D dataset is 2563, parallel processing is imperative. In this paper, we present anO(logn) EREW algorithm for volume rendering. We useO(n 3) processors that can be optimized toO(log3 n) time withO(n 3/log3 n) processors. We have implemented our algorithm on a MasPar MP-1. The implementation results show that a frame of size 2563 is generated in 11 s by 4096 processors. This time can be further reduced by the use of large number of processors.  相似文献   

14.
问题如下:给定图G=(V, E)和正整数k,要求将图G中所有节点合并成为k个超节点,满足由这些超节点组成的摘要图能够在一定误差范围内表示原图G.这是一个基于图划分的组合优化问题,一个主要求解思路是逐次地随机抽取节点对集并用启发式方法从中选取节点对进行合并.本文提出一个有效的两阶段求解算法TS_LGS.算法根据图G的平均点度特征设置阶段阈值:当前超节点数大于阶段阈值为第1阶段,期间算法在采样节点对中基于当前最佳合并分数批量选择节点对合并,旨在有效减少迭代次数;否则为第2阶段,期间算法在加权采样的基础上优先挑选相邻的节点对,旨在找到重构误差增量较小的节点对合并,直至超节点的个数为k.在典型的真实网络实例图上与现有最好算法SAA进行了实验对比,结果表明,算法TS_LGS以较低时间复杂度提取到的图摘要具有更低的重构误差和查询误差.  相似文献   

15.
We propose four fast synchronous distributed message-based algorithms, to identify maximum cardinality sets of edge- and node-disjoint paths, between a source node and a target node in a digraph. Previously, Dinneen et?al. presented two algorithms, based on the classical distributed depth-first search (DFS), which run in O(mf) steps, where m is the number of edges and f is the number of disjoint paths. Combining Cidon??s distributed DFS and our new result, Theorem 3, we propose two improved DFS-based algorithms, which run in O(nf) steps, where n is the number of nodes. We also present improved versions of our two breadth-first search (BFS) based algorithms, with the same complexity upperbound, O(nf). Empirically, for a large set of randomly generated digraphs, our DFS-based edge-disjoint algorithm is 39?% faster than Dinneen et?al.??s edge-disjoint algorithm and our BFS-based edge-disjoint algorithm is 80?% faster. All these improved algorithms have been inspired and guided by a P system modelling exercise, but are suitable for any distributed implementation.  相似文献   

16.
针对深层次分类中分类准确率低、处理速度慢等问题,提出一种待分类文本的候选类别搜索算法。首先,引入搜索、分类两阶段的处理思想,结合类别层次树的结构特点和类别间的相关联系等隐含的领域知识,进行了类别层次权重分析和特征项的动态更新,为类树层次结构的各个节点构建更具分类判断力的特征项集合;进而,采用深度优先搜索算法并结合设定阈值的剪枝策略缩小搜索范围,搜索得到待分类文本的最优候选类别;最后,在候选类别的基础上应用经典的K最近邻(KNN)分类算法和支持向量机(SVM)分类算法进行分类测试和对比分析。实验结果显示,所提算法的总体分类性能优于传统的分类算法,而且使平均F1值较基于贪心策略的启发式搜索算法提高了6%左右。该算法显著提高了深层次文本分类的分类准确度。  相似文献   

17.
Constraint satisfaction problems are ubiquitous in artificial intelligence and many algorithms have been developed for their solution. This paper provides a unified survey of some of these, in terms of three classes: (i) tree search, (ii) arc consistency (AC), and (iii) hybrid tree search/arc consistency algorithms. It is shown that several important algorithms, when slightly rearranged, are of the latter hybrid form, but with arc consistency components that do not necessarily achieve full arc consistency at the tree nodes. Accordingly, we define several new partial AC procedures, AC1/5, AC1/4, AC1/3, and AC½, analogous to the well-known full AC algorithms which Mackworth has called AC1, AC2, and AC3. The fractional suffixes on our AC algorithms are roughly proportional to the degree of partial arc consistency they achieve. Unlike traditional versions, our AC algorithms (full and partial) are presented in a parameterized form to allow them to be embedded efficiently at the nodes of a tree search process. Algorithm complexities are compared empirically, using the n-queens problem and a new version called confused n-queens. Gaschnig's Backmarking (a tree search algorithm) and Haralick's Forward Checking (a hybrid algorithm) are found to be the most efficient. For the hybrid algorithms, we find that it pays to do little arc consistency processing at the nodes, incurring more nodes, but sufficiently reducing the work per node so as to obtain less work over the whole tree. The unified view taken here suggests several new algorithms. Preliminary results show one of these to be the best algorithm so far.  相似文献   

18.
Similarity searching in metric spaces has a vast number of applications in several fields like multimedia databases, text retrieval, computational biology, and pattern recognition. In this context, one of the most important similarity queries is the k nearest neighbor (k-NN) search. The standard best-first k-NN algorithm uses a lower bound on the distance to prune objects during the search. Although optimal in several aspects, the disadvantage of this method is that its space requirements for the priority queue that stores unprocessed clusters can be linear in the database size. Most of the optimizations used in spatial access methods (for example, pruning using MinMaxDist) cannot be applied in metric spaces, due to the lack of geometric properties. We propose a new k-NN algorithm that uses distance estimators, aiming to reduce the storage requirements of the search algorithm. The method stays optimal, yet it can significantly prune the priority queue without altering the output of the query. Experimental results with synthetic and real datasets confirm the reduction in storage space of our proposed algorithm, showing savings of up to 80% of the original space requirement.
Gonzalo NavarroEmail:

Benjamin Bustos   is an assistant professor in the Department of Computer Science at the University of Chile. He is also a researcher at the Millennium Nucleus Center for Web Research. His research interests are similarity searching and multimedia information retrieval. He has a doctoral degree in natural sciences from the University of Konstanz, Germany. Contact him at bebustos@dcc.uchile.cl. Gonzalo Navarro   earned his PhD in Computer Science at the University of Chile in 1998, where he is now Full Professor. His research interests include similarity searching, text databases, compression, and algorithms and data structures in general. He has coauthored a book on string matching and around 200 international papers. He has (co)chaired international conferences SPIRE 2001, SCCC 2004, SPIRE 2005, SIGIR Posters 2005, IFIP TCS 2006, and ENC 2007 Scalable Pattern Recognition track; and belongs to the Editorial Board of Information Retrieval Journal. He is currently Head of the Department of Computer Science at University of Chile, and Head of the Millenium Nucleus Center for Web Research, the largest Chilean project in Computer Science research.   相似文献   

19.
针对时间依赖路网中的K近邻(KNN)查询TD-FTT算法查询点发起时间与到达时间在同一时段的限制和预处理阶段计算时间代价大的问题,提出基于动态选择启发值改进的TD-FTT (ITD-FTT)算法。首先,在预处理阶段,根据各时段各边时间函数的最小值构建最小路网Gmin;然后,在路网Gmin中利用网络泰森图(NVD)并行计算节点最近邻来减少预处理阶段的计算时间;最后,在查找阶段通过计算节点到达时间所在时段,动态选择启发值来解除时间段的限制。实验结果显示,在预处理阶段ITD-FTT算法比TD-FTT算法计算时间减少了70.12%;在查询阶段ITD-FTT比TD-INE算法和TD-A算法在遍历节点个数上分别减少了46.52%和16.63%,响应时间比TD-INE算法和TD-A算法分别降低47.46%和18.24%。实验结果表明,ITD-FTT算法减少了查询扩展的节点数,降低了查找K近邻的时间,提高了查找效率。  相似文献   

20.
Distance-based outliers: algorithms and applications   总被引:20,自引:0,他引:20  
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (distance-based) outliers. Specifically, we show that (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., ); and (ii), outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of , k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexity that is linear with respect to N, but exponential with respect to k. We provide experimental results indicating that this algorithm significantly outperforms the two simple algorithms for . Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most three passes over a dataset. Again, experimental results show that this algorithm is by far the best for . Finally, we discuss our work on three real-life applications, including one on spatio-temporal data (e.g., a video surveillance application), in order to confirm the relevance and broad applicability of DB outliers. Received February 15, 1999 / Accepted August 1, 1999  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号