首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 198 毫秒
1.
Harry M. Chang 《Computing》2011,91(3):241-264
The Zipf–Mandelbrot law is widely used to model a power-law distribution on ranked data. One of the best known applications of the Zipf–Mandelbrot law is in the area of linguistic analysis of the distribution of words ranked by their frequency in a text corpus. By exploring known limitations of the Zipf–Mandelbrot law in modeling the actual linguistic data from different domains in both printed media and online content, a new algorithm is developed to effectively construct n-gram rules for building natural language (NL) models required for a human-to-computer interface. The construction of statistically-oriented n-gram rules is based on a new computing algorithm that identifies the area of divergence between Zipf–Mandelbrot curve and the actual frequency distribution of the ranked n-gram text tokens extracted from a large text corpus derived from the online electronic programming guide (EPG) for television shows and movies. Two empirical experiments were carried out to evaluate the EPG-specific language models created using the new algorithm in the context of NL-based information retrieval systems. The experimental results show the effectiveness of the algorithm for developing low-complexity concept models with high coverage for the user’s language models associated with both typed and spoken queries when interacting with a NL-based EPG search interface.  相似文献   

2.
In this paper we describe an elegant and efficient approach to coupling reordering and decoding in statistical machine translation, where the n-gram translation model is also employed as distortion model. The reordering search problem is tackled through a set of linguistically motivated rewrite rules, which are used to extend a monotonic search graph with reordering hypotheses. The extended graph is traversed in the global search when a fully informed decision can be taken. Further experiments show that the n-gram translation model can be successfully used as reordering model when estimated with reordered source words. Experiments are reported on the Europarl task (Spanish–English and English–Spanish). Results are presented regarding translation accuracy and computational efficiency, showing significant improvements in translation quality with respect to monotonic search for both translation directions at a very low computational cost.  相似文献   

3.
We present a new sublinear-size index structure for finding all occurrences of a given q -gram in a text. Such a q -gram index is needed in many approximate pattern matching algorithms. All earlier q -gram indexes require at least O(n) space, where n is the length of the text. The new Lempel—Ziv index needs only O(n/log n) space while being as fast as previous methods. The new method takes advantage of repetitions in the text found by Lempel—Ziv parsing. Received November 1996; revised March 1997.  相似文献   

4.
Due to the famous dimensionality curse problem, search in a high-dimensional space is considered as a "hard" problem. In this paper, a novel composite distance transformation method, which is called CDT, is proposed to support a fast κ-nearest-neighbor (κ-NN) search in high-dimensional spaces. In CDT, all (n) data points are first grouped into some clusters by a κ-Means clustering algorithm. Then a composite distance key of each data point is computed. Finally, these index keys of such n data points are inserted by a partition-based B^+-tree. Thus, given a query point, its κ-NN search in high-dimensional spaces is transformed into the search in the single dimensional space with the aid of CDT index. Extensive performance studies are conducted to evaluate the effectiveness and efficiency of the proposed scheme. Our results show that this method outperforms the state-of-the-art high-dimensional search techniques, such as the X-Tree, VA-file, iDistance and NB-Tree.  相似文献   

5.
Let X be a finite alphabet containing more than one letter. A dense language over X is a language containing a disjunctive language. A language L is an n-dense language if for any distinct n words there exist two words such that In this paper we classify dense languages into strict n-dense languages and study some of their algebraic properties. We show that for each n ≥ 0, the n-dense language exists. For an n-dense language L, n ≠ 1, the language LQ is a dense language, where Q is the set of all primitive words over X. Moreover, for a given n ≥ 1, the language L is such that , then for some m, nm ≤ 2n + 1. Characterizations on 0-dense languages and n-dense languages are obtained. It is true that for any dense language L, there exist such that for some. We show that everyn-dense language, n ≥ 0, can be split into disjoint union of infinitely many n-dense languages.  相似文献   

6.
Graph conductance queries, also known as personalized PageRank and related to random walks with restarts, were originally proposed to assign a hyperlink-based prestige score to Web pages. More general forms of such queries are also very useful for ranking in entity-relation (ER) graphs used to represent relational, XML and hypertext data. Evaluation of PageRank usually involves a global eigen computation. If the graph is even moderately large, interactive response times may not be possible. Recently, the need for interactive PageRank evaluation has increased. The graph may be fully known only when the query is submitted. Browsing actions of the user may change some inputs to the PageRank computation dynamically. In this paper, we describe a system that analyzes query workloads and the ER graph, invests in limited offline indexing, and exploits those indices to achieve essentially constant-time query processing, even as the graph size scales. Our techniques—data and query statistics collection, index selection and materialization, and query-time index exploitation—have parallels in the extensive relational query optimization literature, but is applied to supporting novel graph data repositories. We report on experiments with five temporal snapshots of the CiteSeer ER graph having 74–702 thousand entity nodes, 0.17–1.16 million word nodes, 0.29–3.26 million edges between entities, and 3.29–32.8 million edges between words and entities. We also used two million actual queries from CiteSeer’s logs. Queries run 3–4 orders of magnitude faster than whole-graph PageRank, the gap growing with graph size. Index size is smaller than a text index. Ranking accuracy is 94–98% with reference to whole-graph PageRank.  相似文献   

7.
In this paper we study the external memory planar point enclosure problem: Given N axis-parallel rectangles in the plane, construct a data structure on disk (an index) such that all K rectangles containing a query point can be reported I/O-efficiently. This problem has important applications in e.g. spatial and temporal databases, and is dual to the important and well-studied orthogonal range searching problem. Surprisingly, despite the fact that the problem can be solved optimally in internal memory with linear space and O(log N+K) query time, we show that one cannot construct a linear sized external memory point enclosure data structure that can be used to answer a query in O(log  B N+K/B) I/Os, where B is the disk block size. To obtain this bound, Ω(N/B 1−ε ) disk blocks are needed for some constant ε>0. With linear space, the best obtainable query bound is O(log 2 N+K/B) if a linear output term O(K/B) is desired. To show this we prove a general lower bound on the tradeoff between the size of the data structure and its query cost. We also develop a family of structures with matching space and query bounds. An extended abstract of this paper appeared in Proceedings of the 12th European Symposium on Algorithms (ESA’04), Bergen, Norway, September 2004, pp. 40–52. L. Arge’s research was supported in part by the National Science Foundation through RI grant EIA–9972879, CAREER grant CCR–9984099, ITR grant EIA–0112849, and U.S.-Germany Cooperative Research Program grant INT–0129182, as well as by the US Army Research Office through grant W911NF-04-01-0278, by an Ole Roemer Scholarship from the Danish National Science Research Council, a NABIIT grant from the Danish Strategic Research Council and by the Danish National Research Foundation. V. Samoladas’ research was supported in part by a grant co-funded by the European Social Fund and National Resources-EPEAEK II-PYTHAGORAS. K. Yi’s research was supported in part by the National Science Foundation through ITR grant EIA–0112849, U.S.-Germany Cooperative Research Program grant INT–0129182, and Hong Kong Direct Allocation Grant (DAG07/08).  相似文献   

8.
9.
Binhai Zhu 《GeoInformatica》2000,4(3):317-334
This paper studies the idea of answering range searching queries using simple data structures. The only data structure we need is the Delaunay Triangulation of the input points. The idea is to first locate a vertex of the (arbitrary) query polygon and walk along the boundary of the polygon in the Delaunay Triangulation and report all the points enclosed by the query polygon. For a set of uniformly distributed random points in 2-D and a query polygon the expected query time of this algorithm is O(n 1/3 + Q + E K + L r n 1/2), where Q is the size of the query polygon , {\bf E}K = O(n\bcdot area is the expected number of output points, L r is a parameter related to the shape of the query polygon and n, and L r is always bounded by the sum of the edge lengths of . Theoretically, when L r = O(1/n1/6) the expected query time is O(n1/3 + Q + E K), which improves the best known average query time for general range searching. Besides the theoretical meaning, the good property of this algorithm is that once the Delaunay Triangulation is given, no additional preprocessing is needed. In order to obtain empirical results, we design a new algorithm for generating random simple polygons within a given domain. Our empirical results show that the constant coefficient of the algorithm is small, at least for the special (practical) cases when the query polygon is either a triangle (simplex range searching) or an axis-parallel box (orthogonal range searching) and for the general case when the query polygons are generated by our new polygon-generating algorithms and their sizes are relatively small.  相似文献   

10.
刘鹏远  赵铁军 《软件学报》2009,20(5):1292-1300
为了解决困扰词义及译文消歧的数据稀疏及知识获取问题,提出一种基于Web利用n-gram统计语言模型进行消歧的方法.在提出词汇语义与其n-gram语言模型存在对应关系假设的基础上,首先利用Hownet建立中文歧义词的英文译文与知网DEF的对应关系并得到该DEF下的词汇集合,然后通过搜索引擎在Web上搜索,并以此计算不同DEF中词汇n-gram出现的概率,然后进行消歧决策.在国际语义评测SemEval-2007中的Multilingual Chinese English Lexical Sample Task测试集上的测试表明,该方法的Pmar值为55.9%,比其上该任务参评最好的无指导系统性能高出12.8%.  相似文献   

11.
Summary writing is an important part of many English Language Examinations. As grading students’ summary writings is a very time-consuming task, computer-assisted assessment will help teachers carry out the grading more effectively. Several techniques such as latent semantic analysis (LSA), n-gram co-occurrence and BLEU have been proposed to support automatic evaluation of summaries. However, their performance is not satisfactory for assessing summary writings. To improve the performance, this paper proposes an ensemble approach that integrates LSA and n-gram co-occurrence. As a result, the proposed ensemble approach is able to achieve high accuracy and improve the performance quite substantially compared with current techniques. A summary assessment system based on the proposed approach has also been developed.  相似文献   

12.
Ran Raz 《Algorithmica》2009,55(3):462-489
Our main result is that the membership xSAT (for x of length n) can be proved by a logarithmic-size quantum state |Ψ〉, together with a polynomial-size classical proof consisting of blocks of length polylog(n) bits each, such that after measuring the state |Ψ〉 the verifier only needs to read one block of the classical proof. This shows that if a short quantum witness is available then a (classical) PCP with only one query is possible. Our second result is that the class QIP/qpoly contains all languages. That is, for any language L (even non-recursive), the membership xL (for x of length n) can be proved by a polynomial-size quantum interactive proof, where the verifier is a polynomial-size quantum circuit with working space initiated with some quantum state |Ψ L,n 〉 (depending only on L and n). Moreover, the interactive proof that we give is of only one round, and the messages communicated are classical. The advice |Ψ L,n 〉 given to the verifier can also be replaced by a classical probabilistic advice, as long as this advice is kept as a secret from the prover. Our result can hence be interpreted as: the class IP/rpoly contains all languages. For the proof of the second result, we introduce the quantum low-degree-extension of a string of bits. The main result requires an additional machinery of quantum low-degree-test. R. Raz’s research was supported by Israel Science Foundation (ISF) grant.  相似文献   

13.
Given m ordered segments that form a partition of some universe (e.g., a two-dimensional strip), the multisearch problem consists of determining, for a set of n query points in the universe, the segments they belong to. We present the first nontrivial parallel deterministic scheme for performing multisearch on a distributed-memory machine when m=ω(n) . The scheme is designed on the BSP* model of parallel computation, a variant of Valiant's BSP which rewards blockwise communication, and relies on a suitable redundant representation of the segments. The time needed to answer the queries is analyzed as a function of the redundancy and of the BSP* parameters. We show that optimal performance can be obtained using logarithmic redundancy. We also prove a lower bound on the communication requirements of any deterministic multisearch scheme realized on a distributed-memory machine. The lower bound exhibits a tradeoff between the redundancy used to represent the segments and the performance of the scheme. Received June 1, 1997; revised March 10, 1998.  相似文献   

14.
Spatiotemporal objects – that is, objects that evolve over time – appear in many applications. Due to the nature of such applications, storing the evolution of objects through time in order to answer historical queries (queries that refer to past states of the evolution) requires a very large specialized database, what is termed in this article a spatiotemporal archive. Efficient processing of historical queries on spatiotemporal archives requires equally sophisticated indexing schemes. Typical spatiotemporal indexing techniques represent the objects using minimum bounding regions (MBR) extended with a temporal dimension, which are then indexed using traditional multidimensional index structures. However, rough MBR approximations introduce excessive overlap between index nodes, which deteriorates query performance. This article introduces a robust indexing scheme for answering spatiotemporal queries more efficiently. A number of algorithms and heuristics are elaborated that can be used to preprocess a spatiotemporal archive in order to produce finer object approximations, which, in combination with a multiversion index structure, will greatly improve query performance in comparison to the straightforward approaches. The proposed techniques introduce a query efficiency vs. space tradeoff that can help tune a structure according to available resources. Empirical observations for estimating the necessary amount of additional storage space required for improving query performance by a given factor are also provided. Moreover, heuristics for applying the proposed ideas in an online setting are discussed. Finally, a thorough experimental evaluation is conducted to show the merits of the proposed techniques. Edited by B. Seeger A short version of this article appeared as “Efficient indexing of spatiotemporal objects” in the Proceedings of Extending Database Technology 2002 [19]. This work was partially supported by NSF grants IIS-9907477, EIA-9983445, NSF IIS 9984729, NSF ITR 0220148, NSF IIS-0133825, NRDRP, and the U.S. Department of Defense.  相似文献   

15.
Many estimation methods for independent component analysis (ICA) requires prewhitening of observed signals. This paper proposes a new method of prewhitening named β-prewhitening by minimizing the empirical β-divergence over the space of all the Gaussian distributions. The value of the tuning parameter β plays the key role in the performance of our current proposal. An attempt is made to propose an adaptive selection procedure for the tuning parameter β for this algorithm. At last, a measure of performance index is proposed for assessing prewhitening procedures. Simulation results show that β-prewhitening efficiently improves the performance over the standard prewhitening when outliers exist; it keeps equal performance otherwise. Performance of the proposed method is compared with the standard prewhitening by both FastICA and our proposed performance index.  相似文献   

16.
Let V be a set of points in a d-dimensional l p -metric space. Let s,tV and let L be a real number. An L-bounded leg path from s to t is an ordered set of points which connects s to t such that the leg between any two consecutive points in the set has length of at most L. The minimal path among all these paths is the L-bounded leg shortest path from s to t. In the st Bounded Leg Shortest Path (stBLSP) problem we are given two points s and t and a real number L, and are required to compute an L-bounded leg shortest path from s to t. In the All-Pairs Bounded Leg Shortest Path (apBLSP) problem we are required to build a data structure that, given any two query points from V and a real number L, outputs the length of the L-bounded leg shortest path (a distance query) or the path itself (a path query). In this paper we obtain the following results:  相似文献   

17.
We consider the problem of ray shooting in a three-dimensional scene consisting of k (possibly intersecting) convex polyhedra with a total of n facets. That is, we want to preprocess them into a data structure, so that the first intersection point of a query ray and the given polyhedra can be determined quickly. We describe data structures that require preprocessing time and storage (where the notation hides polylogarithmic factors), and have polylogarithmic query time, for several special instances of the problem. These include the case when the ray origins are restricted to lie on a fixed line 0, but the directions of the rays are arbitrary, the more general case when the supporting lines of the rays pass through 0, and the case of rays orthogonal to some fixed line with arbitrary origins and orientations. We also present a simpler solution for the case of vertical ray-shooting with arbitrary origins. In all cases, this is a significant improvement over previously known techniques (which require Ω(n 2) storage, even when k n). Work by Haim Kaplan and Natan Rubin has been supported by Grant 975/06 from the Israel Science Fund. Work by Micha Sharir and Natan Rubin was partially supported by NSF Grant CCF-05-14079, by a grant from the U.S.–Israeli Binational Science Foundation, by grant 155/05 from the Israel Science Fund, Israeli Academy of Sciences, by a grant from the AFIRST French–Israeli program, and by the Hermann Minkowski–MINERVA Center for Geometry at Tel Aviv University. A preliminary version of this paper appeared in Proc. 15th Annu. Europ. Sympos. Alg. (2007), 287–298.  相似文献   

18.
k代表轮廓查询是从传统轮廓查询中衍生出来的一类查询.给定多维数据集合D,轮廓查询从D中找到所有不被其他对象支配的对象,将其返回给用户,便于用户结合自身偏好选择高质量对象.然而,轮廓对象规模通常较大,用户需要从大量数据中进行选择,导致选择速度和质量无法得到保证.与传统轮廓查询相比,k代表轮廓查询从所有轮廓对象中选择“代表性”最强的k个对象返回给用户,有效地解决了传统轮廓查询存在的这一问题.给定滑动窗口W和连续查询q,q监听窗口中的数据.当窗口滑动时,查询q返回窗口中,组合支配面积最大的k个对象.现有算法的核心思想是:实时监测当前窗口中的轮廓对象集合,当轮廓对象集合更新时,算法更新k代表轮廓.然而,实时监测窗口中,轮廓集合的计算代价通常较大.此外,当轮廓集合规模较大时,从中选择k代表轮廓的计算代价是同样巨大的,导致已有算法无法在高速流环境下使用.针对上述问题,提出了ρ-近似k代表轮廓查询.为了支持该查询,提出了查询处理框架PAKRS(predict-basedapproximatekrepresentativeskyline).首先,PAKRS利用高速流的特性对当前窗口进行划分,根据划分结...  相似文献   

19.
We consider theorthgonal clipping problem in a set of segments: Given a set ofn segments ind-dimensional space, we preprocess them into a data structure such that given an orthogonal query window, the segments intersecting it can be counted/reported efficiently. We show that the efficiency of the data structure significantly depends on a geometric discrete parameterK named theProjected-image complexity, which becomes Θ(n 2) in the worst case but practically much smaller. If we useO(m) space, whereK log4d−7 nmn log4d−7 n, the query time isO((K/m)1/2 logmax{4, 4d−5} n). This is near to an Ω((K/m)1/2) lower bound.  相似文献   

20.
We present new efficient deterministic and randomized distributed algorithms for decomposing a graph with n nodes into a disjoint set of connected clusters with radius at most k−1 and having O(n 1+1/k ) intercluster edges. We show how to implement our algorithms in the distributed CONGEST\mathcal{CONGEST} model of computation, i.e., limited message size, which improves the time complexity of previous algorithms (Moran and Snir in Theor. Comput. Sci. 243(1–2):217–241, 2000; Awerbuch in J. ACM 32:804–823, 1985; Peleg in Distributed Computing: A Locality-Sensitive Approach, 2000) from O(n) to O(n 1−1/k ). We apply our algorithms for constructing low stretch graph spanners and network synchronizers in sublinear deterministic time in the CONGEST\mathcal{CONGEST} model.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号