Similar Documents
A total of 20 similar documents were found (search time: 15 ms).
1.
Due to idiosyncrasies in their syntax, semantics or frequency, Multiword Expressions (MWEs) have received special attention from the NLP community, as the methods and techniques developed for the treatment of simplex words are not necessarily suitable for them. This is certainly the case for the automatic acquisition of MWEs from corpora. Much effort has been directed at the task of automatically identifying them, with considerable success. In this paper, we propose an approach for the identification of MWEs in a multilingual context, as a by-product of a word alignment process, that not only identifies possible MWE candidates but also associates some of them with semantics. The results obtained indicate that the approach is feasible and demands little in terms of tools and resources, which could, for example, facilitate and speed up lexicographic work.
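The abstract does not spell out how MWE candidates fall out of the alignment; a common heuristic in alignment-based MWE work, sketched below in Python, is to treat contiguous many-to-one alignments (several source words linked to a single target word) as candidates. The function name and alignment format are illustrative, not the authors' code.

```python
from collections import defaultdict

def mwe_candidates_from_alignments(src_tokens, alignments):
    """Collect contiguous source spans whose words all align to the same
    single target word -- a many-to-one pattern that often signals an MWE
    candidate (illustrative heuristic, not the authors' exact method)."""
    by_target = defaultdict(list)          # target position -> source positions
    for s, t in alignments:
        by_target[t].append(s)
    candidates = []
    for positions in by_target.values():
        positions.sort()
        # Keep only contiguous multiword spans (length >= 2).
        if len(positions) >= 2 and positions[-1] - positions[0] == len(positions) - 1:
            candidates.append(" ".join(src_tokens[p] for p in positions))
    return candidates

# Toy example: "kicked the bucket" aligned to one target word.
src = ["he", "kicked", "the", "bucket"]
alignment = [(0, 0), (1, 1), (2, 1), (3, 1)]   # source words 1-3 -> target word 1
print(mwe_candidates_from_alignments(src, alignment))  # ['kicked the bucket']
```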

2.
The present paper investigates multiword expressions (MWEs) in spoken language and possible ways of identifying MWEs automatically in speech corpora. Two MWEs that emerged from previous studies and that occur frequently in Dutch are analyzed to study their pronunciation characteristics and compare them to those of other utterances in a large speech corpus. The analyses reveal that these MWEs display extreme pronunciation variation and reduction, i.e., many phonemes and even syllables are deleted. Several measures of pronunciation reduction are calculated for these two MWEs and for all other utterances in the corpus. Five of these measures are more than twice as high for the MWEs, indicating considerable reduction. One overall measure of pronunciation deviation is then calculated and used to automatically identify MWEs in a large speech corpus. The results show that neither this overall measure nor frequency of co-occurrence alone is suitable for identifying MWEs. The best results are obtained by using a metric that combines overall pronunciation reduction with weighted frequency. In this way, recurring “islands of pronunciation reduction” that contain (potential) MWEs can be identified in a large speech corpus.
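The combined metric itself is not given in the abstract; purely as an illustration, one plausible way to combine a mean reduction score with weighted frequency is sketched below. The formula, names and toy values are hypothetical, not the paper's metric.

```python
import math

def mwe_score(mean_reduction, frequency, alpha=1.0):
    """Hypothetical combined score: mean pronunciation reduction weighted
    by log frequency, so that frequent, strongly reduced n-grams rank
    high. The actual metric in the paper may differ."""
    return mean_reduction * math.log(1 + frequency) ** alpha

# Toy candidates: n-gram -> (mean reduction score, corpus frequency).
candidates = {"op een gegeven moment": (0.62, 431), "in de auto": (0.15, 390)}
ranked = sorted(candidates, key=lambda c: mwe_score(*candidates[c]), reverse=True)
print(ranked)  # the strongly reduced, frequent n-gram comes first
```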

3.
Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Indeed, although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction still remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop: Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, Lisbon, Portugal, pp. 7–12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale, semantically classified multiword expression template database, the system is also capable of detecting many multiword expressions, as well as assigning semantic field information to the MWEs extracted. Whilst USAS therefore offers a unique tool for MWE extraction, allowing us to both extract and semantically classify MWEs, it can sometimes suffer from low recall. Consequently, we have been comparing USAS, which employs a symbolic approach, to a statistical tool, which is based on collocational information, in order to determine the pros and cons of these different tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the two tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur with low frequencies (below three in this case). Due to this complementary relation, we propose that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach and a collocation-based statistical approach.
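The statistical tool is described only as collocation-based; pointwise mutual information (PMI) is one standard collocation measure such tools use, and the union step below mirrors the proposed combination of symbolic and statistical candidates. This is a hedged sketch: the choice of PMI, all names and all thresholds are our assumptions.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_freq=3):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), f in bigrams.items():
        if f >= min_freq:  # the frequency cut-off mentioned in the abstract
            scores[(x, y)] = math.log2((f / n) / ((unigrams[x] / n) * (unigrams[y] / n)))
    return scores

def combined_candidates(tokens, lexicon_mwes, min_freq=3, pmi_threshold=3.0):
    """Union of lexicon-based (symbolic) and PMI-based (statistical) candidates."""
    statistical = {" ".join(b) for b, s in pmi_bigrams(tokens, min_freq).items()
                   if s >= pmi_threshold}
    text = " ".join(tokens)
    symbolic = {m for m in lexicon_mwes if m in text}
    return symbolic | statistical
```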

4.
We describe the annotation of multiword expressions (MWEs) in the Prague Dependency Treebank, using several automatic pre-annotation steps. We use subtrees of the tectogrammatical tree structures of the Prague Dependency Treebank to store representations of the MWEs in the dictionary and to pre-annotate subsequent occurrences automatically. We also show a way to measure the reliability of this type of annotation.

5.
6.
7.
8.
Automatic Extraction and Alignment of Multiword Expressions from English–Chinese Comparable Corpora   (Total citations: 3; self-citations: 1; citations by others: 2)
Multiword expressions (MWEs) are used not only to improve the quality of current machine translation systems, but also in other natural language processing areas such as cross-language retrieval and data mining. To this end, a method combining semantic templates with statistical tools is proposed to automatically extract native English MWEs from a triple comparable corpus. Word similarity is computed with lexicon-based and distributional methods to enlarge MWE coverage. The GIZA++ alignment algorithm is used to extract the corresponding Chinese MWE translations; mutual translation probabilities are then computed statistically, and the English–Chinese MWE pair with the highest probability is selected as the best translation pair. Experimental results show that this method effectively improves the accuracy of MWE extraction and alignment.
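A minimal sketch of the final selection step described above: estimating mutual translation probabilities by relative frequency over aligned candidate pairs and keeping the most probable Chinese MWE for each English MWE. The data format and function name are illustrative, not the paper's implementation.

```python
from collections import Counter

def best_translation_pairs(aligned_pairs):
    """Estimate p(zh | en) by relative frequency over aligned MWE pairs
    and keep, for each English MWE, the Chinese MWE with the highest
    probability (sketch of the selection step, not the paper's code)."""
    pair_counts = Counter(aligned_pairs)              # (en_mwe, zh_mwe) -> count
    en_counts = Counter(en for en, _ in aligned_pairs)
    best = {}
    for (en, zh), c in pair_counts.items():
        p = c / en_counts[en]
        if en not in best or p > best[en][1]:
            best[en] = (zh, p)
    return best

pairs = [("data mining", "数据挖掘"), ("data mining", "数据挖掘"),
         ("data mining", "资料探勘")]
print(best_translation_pairs(pairs))  # {'data mining': ('数据挖掘', ~0.67)}
```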

9.
The storage and retrieval of multimedia data is a crucial problem in multimedia information systems due to the huge storage requirements. It is necessary to provide an efficient methodology for the indexing of multimedia data for rapid retrieval. The aim of this paper is to introduce a methodology to represent, simplify, store, retrieve and reconstruct an image from a repository. An algebraic representation of the spatio-temporal relations present in a document is constructed from an equivalent graph representation and used to index the document. We use this representation to simplify and later reconstruct the complete index. This methodology has been tested by implementing a prototype system called Simplified Modeling to Access and ReTrieve multimedia information (SMART). Experimental results show that the complexity of an index of a 2D document is O(n(n−1)/k) with k ≥ 2, as opposed to the O(n(n−1)/2) known so far. Since k depends on the number of objects in an image, more complex documents have lower overall complexity.

10.
We present here an equivalence checking algorithm which operates directly on a pair of strict deterministic vs. LL(k) grammars. It is also straightforwardly applicable to a pair of LL(k) grammars, though an LL(k) grammar is not necessarily strict deterministic. The basic idea is from Korenjak and Hopcroft's branching algorithm for simple deterministic grammars, but ours is distinguished by keeping the nonterminals of the two grammars entirely separate throughout, which makes it very simple.

11.
Anomaly detection is considered an important data mining task, aiming at the discovery of elements (known as outliers) that diverge significantly from the expected case. More specifically, given a set of objects, the problem is to return the suspicious objects that deviate significantly from the typical behavior. As in the case of clustering, the application of different criteria leads to different definitions for an outlier. In this work, we focus on distance-based outliers: an object x is an outlier if there are fewer than k objects lying at distance at most R from x. The problem offers significant challenges when a stream-based environment is considered, where data arrive continuously and outliers must be detected on the fly. There are a few research works studying the problem of continuous outlier detection. However, none of these proposals meets the requirements of modern stream-based applications, for the following reasons: (i) they demand a significant storage overhead, (ii) their efficiency is limited and (iii) they lack flexibility, in the sense that they assume a single configuration of the k and R parameters. In this work, we propose new algorithms for continuous outlier monitoring in data streams, based on sliding windows. Our techniques are able to reduce the required storage overhead, are more efficient than previously proposed techniques and offer significant flexibility with regard to the input parameters. Experiments performed on real-life and synthetic data sets verify our theoretical study.
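For concreteness, here is a direct quadratic-time rendering of the outlier definition quoted above. The paper's contribution is avoiding exactly this recomputation as a sliding window moves, which this sketch does not attempt.

```python
import math

def distance_based_outliers(window, k, R):
    """Naive check of the definition: a point is an outlier if fewer
    than k other points lie within distance R of it. O(n^2) per window;
    the paper's algorithms maintain this incrementally instead."""
    outliers = []
    for i, p in enumerate(window):
        neighbours = sum(
            1 for j, q in enumerate(window)
            if i != j and math.dist(p, q) <= R
        )
        if neighbours < k:
            outliers.append(p)
    return outliers

window = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(distance_based_outliers(window, k=2, R=1.0))  # [(5.0, 5.0)]
```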

12.
Recent advances in 3D modeling provide us with real 3D datasets to answer queries, such as “What is the best position for a new billboard?” and “Which hotel room has the best view?” in the presence of obstacles. These applications require measuring and differentiating the visibility of an object (target) from different viewpoints in a dataspace, e.g., a billboard may be seen from many points but is readable only from a few points closer to it. In this paper, we formulate the above problem of quantifying the visibility of (from) a target object from (of) the surrounding area with a visibility color map (VCM). A VCM is essentially defined as a surface color map of the space, where each viewpoint of the space is assigned a color value that denotes the visibility measure of the target from that viewpoint. Measuring the visibility of a target even from a single viewpoint is an expensive operation, as we need to consider factors such as distance, angle, and obstacles between the viewpoint and the target. Hence, a straightforward approach to construct the VCM that requires visibility computation for every viewpoint of the surrounding space of the target is prohibitively expensive in terms of both I/Os and computation, especially for a real dataset comprising thousands of obstacles. We propose an efficient approach to compute the VCM based on a key property of the human vision that eliminates the necessity for computing the visibility for a large number of viewpoints of the space. To further reduce the computational overhead, we propose two approximations; namely, minimum bounding rectangle and tangential approaches with guaranteed error bounds. Our extensive experiments demonstrate the effectiveness and efficiency of our solutions to construct the VCM for real 2D and 3D datasets.

13.
We consider the system of intuitionistic fuzzy sets (IF-sets) in a universe X and study the cuts of an IF-set. Suppose a left-continuous triangular norm is given. The t-norm-based cut (level set) of an IF-set is defined in a way that binds the membership and nonmembership functions via the triangular norm. This is an extension of the usual cuts of IF-sets. We show that the system of these cuts fulfils properties analogous to those of the usual systems of cuts. However, it is not possible to reconstruct an IF-set from the system of t-norm-based cuts.
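The abstract does not state the exact cut definition, so the following LaTeX fragment is only one illustrative way a left-continuous t-norm T could bind membership and nonmembership; the paper's precise definition may differ.

```latex
% Illustrative only: one way a left-continuous t-norm T can bind the
% membership \mu_A and nonmembership \nu_A of an IF-set A in a cut.
% An IF-set assigns each x \in X a pair (\mu_A(x), \nu_A(x)) with
% \mu_A(x) + \nu_A(x) \le 1.
\[
  C^{T}_{\alpha}(A) = \{\, x \in X : T\bigl(\mu_A(x),\, 1 - \nu_A(x)\bigr) \ge \alpha \,\},
  \qquad \alpha \in (0, 1].
\]
% Taking T = \min gives a classical-style cut; for other t-norms the
% family \{C^T_\alpha\}_\alpha need not determine A, matching the
% abstract's remark that reconstruction is not possible in general.
```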

14.
In emergency evacuations, not all pedestrians know the destination or the routes to the destination, especially when the route is complex. Many pedestrians follow a leader or leaders during an evacuation. A Trace Model was proposed to simulate such tracing processes, including (1) a Dynamic Douglas–Peucker algorithm to extract global key nodes from dynamically partial routes, (2) a key-node complementation rule to address the issue in which the Dynamic Douglas–Peucker algorithm does not work for an extended time when the route is straight and long, and (3) a modification to a follower’s impatience factor, which is associated with the distance from the leader. The tracing process of pupils following their teachers in a primary school during an evacuation was simulated. The virtual process was shown to be reasonable both in the indoor classroom and on the outdoor campus along complex routes. The statistical data obtained in the simulation were also studied. The results show that the Trace Model can extract relatively global key nodes from dynamically partial routes that are very similar to the results obtained by the classical Douglas–Peucker algorithm based on whole routes, and that data redundancy is effectively reduced. The results also show that the Trace Model is adaptive to the motions between followers and leaders, which demonstrates that the Trace Model is applicable to the tracing process along complex routes and is an improvement on the classical Douglas–Peucker algorithm and the social force model.
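For reference, here is a sketch of the classical (whole-route) Douglas–Peucker algorithm on which the Dynamic variant builds; the dynamic partial-route handling, key-node complementation and impatience modification are not reproduced.

```python
import math

def douglas_peucker(points, epsilon):
    """Classical Douglas-Peucker simplification on a complete route: keep
    the interior point farthest from the chord between the endpoints and
    recurse if its distance exceeds epsilon. The paper's Dynamic variant
    instead works on partially known routes, which is not shown here."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]

    def chord_dist(p):
        # Perpendicular distance from p to the line through the endpoints.
        num = abs((y2 - y1) * p[0] - (x2 - x1) * p[1] + x2 * y1 - y2 * x1)
        den = math.hypot(x2 - x1, y2 - y1)
        return num / den if den else math.dist(p, (x1, y1))

    idx = max(range(1, len(points) - 1), key=lambda i: chord_dist(points[i]))
    if chord_dist(points[idx]) > epsilon:
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right   # drop the duplicated split point
    return [points[0], points[-1]]

route = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(douglas_peucker(route, epsilon=1.0))  # extracted key nodes of the route
```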

15.
With the rapid development of Web 2.0 sites such as blogs and wikis, users are encouraged to express opinions about certain products, services or social topics over the web. There is a method for aggregating these opinions, called Opinion Aggregation, which is made up of four steps: Collect, Identify, Classify and Aggregate. In this paper, we present a new conceptual multidimensional data model based on the Fuzzy Model based on Semantic Translation to solve the Aggregate step of an Opinion Aggregation architecture, which allows exploiting the measure values resulting from integrating heterogeneous information (including unstructured data such as free texts) by means of traditional Business Intelligence tools. We also present an entire Opinion Aggregation architecture that includes the Aggregate step and solves the remaining steps (Collect, Identify and Classify) by means of an Extraction, Transformation and Loading (ETL) process. This architecture has been implemented in an Oracle Relational Database Management System. We have applied it to integrate heterogeneous data extracted from the websites of certain high-end hotels, and we show a case study using data collected over several years from the websites of high-end hotels located in Granada (Spain). With this integrated information, the Data Warehouse user can perform several analyses with the benefit of easy linguistic interpretability and high precision, by means of interactive tools such as dashboards.

16.
A colouring of a graph is ecological if every pair of vertices that have the same set of colours in their neighbourhoods are coloured alike. We consider the following problem: given a graph G and an ecological colouring c of G, can further vertices, added to G one at a time, be coloured so that at each stage the current graph is ecologically coloured? If the answer is yes, then we say that the pair (G,c) is ecologically online extendible. By generalizing the well-known First-Fit algorithm, we are able to characterize when (G,c) is ecologically online extendible, and to show that deciding whether (G,c) is ecologically online extendible can be done in polynomial time. We also describe when the extension is possible using only colours from a given finite set C. For the case where c is a colouring of G in which each vertex is coloured distinctly, we give a simple characterization of when (G,c) is ecologically online extendible using only the colours of c, and we also show that (G,c) is always online extendible using the colours of c plus one extra colour. We also study (off-line) ecological H-colourings (an H-colouring of G is a homomorphism from G to H). We study the problem of deciding whether G has an ecological H-colouring for some fixed H and give a characterization of its computational complexity in terms of the structure of H.
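The paper generalizes First-Fit; for orientation, the basic First-Fit (greedy) colouring it starts from assigns each arriving vertex the smallest colour not present among its already-coloured neighbours. A plain sketch of that baseline, not the paper's generalization:

```python
def first_fit_colouring(vertices, adjacency):
    """Basic First-Fit: colour vertices in arrival order with the smallest
    colour (a positive integer) not used by an already-coloured neighbour.
    The paper generalizes this idea to maintain ecological colourings online."""
    colour = {}
    for v in vertices:
        used = {colour[u] for u in adjacency.get(v, ()) if u in colour}
        c = 1
        while c in used:
            c += 1
        colour[v] = c
    return colour

# Path a-b-c arriving in order: a -> 1, b -> 2, c -> 1.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(first_fit_colouring(["a", "b", "c"], adj))
```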

17.
Let X = C^n. In this paper we present an algorithm that computes the de Rham cohomology groups H^i_dR(U, C), where U is the complement of an arbitrary Zariski-closed set Y in X. Our algorithm is a merger of the algorithm given in Oaku and Takayama (1999), who considered the case where Y is a hypersurface, and our methods from Walther (1999) for the computation of local cohomology. We further extend the algorithm to compute de Rham cohomology groups with supports H^i_dR,Z(U, C), where again U is an arbitrary Zariski-open subset of X and Z is an arbitrary Zariski-closed subset of U. Our main tool is a generalization of the restriction process from Oaku and Takayama (in press) to complexes of modules over the Weyl algebra. The restriction rests on an existence theorem on V_d-strict resolutions of complexes, which we prove by means of an explicit construction via Cartan–Eilenberg resolutions. All presented algorithms are based on Gröbner basis computations in the Weyl algebra, and the examples are carried out using the computer system Kan by Takayama (1999).

18.
The problem of k nearest neighbors (kNN) is to find the k nearest neighbors of a query point in a given data set. In this paper, a novel fast kNN search method using an orthogonal search tree is proposed. The proposed method creates an orthogonal search tree for a data set using an orthonormal basis evaluated from the data set. To find the kNN for a query point, projection values of the query point onto the orthogonal vectors in the orthonormal basis and a node elimination inequality are applied to prune unlikely nodes. For a node which cannot be deleted, a point elimination inequality is further used to reject impossible data points. Experimental results show that the proposed method performs well in finding the kNN of query points and always requires less computation time than available kNN search algorithms, especially for data sets with a large number of data points or a large standard deviation.
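The abstract names the elimination inequalities without stating them; a standard inequality of this kind is, for a unit vector v, |v·q − v·p| ≤ ‖q − p‖, so a point whose precomputed projection differs from the query's by more than the current k-th best distance can be skipped. The sketch below applies only this point-level test; the tree organization and node-level test are not reproduced, and all names and the data layout are illustrative.

```python
import heapq
import math

def knn_with_projection_pruning(query, points, proj_points, proj_query, k):
    """Point-elimination sketch: for a unit vector v, |v.q - v.p| is a
    lower bound on ||q - p||, so a point whose projection gap already
    exceeds the current k-th best distance cannot be a k-NN and is
    skipped without a full distance computation."""
    heap = []  # max-heap via negated distances: current k best candidates
    for p, vp in zip(points, proj_points):
        if len(heap) == k and abs(proj_query - vp) > -heap[0][0]:
            continue  # elimination inequality rejects p outright
        d = math.dist(query, p)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, p))
    return sorted((-nd, p) for nd, p in heap)  # (distance, point) pairs
```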

19.
An application of bucket sort in Kruskal's minimal spanning tree algorithm is proposed. The modified algorithm is very fast if the edge costs are drawn from a distribution that is close to uniform, because the sorting phase then takes O(m) average time for a graph with m edges. The O(m log m) worst case occurs when there is a strong peak in the distribution of the edge costs.
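A sketch of the proposed modification, assuming edge costs lie in [0, 1): edges are bucket-sorted by cost before the usual union-find scan. Under near-uniform costs the buckets stay small, giving the O(m) average sorting time; the graph encoding and helper names are our own.

```python
def kruskal_bucket(n, edges, num_buckets=None):
    """Kruskal's MST with a bucket sort on edge costs assumed in [0, 1).
    Near-uniform costs keep buckets small (O(m) average sorting time);
    a strong peak sends most edges to one bucket, degrading toward the
    O(m log m) comparison-sort case."""
    m = len(edges)
    b = num_buckets or max(m, 1)
    buckets = [[] for _ in range(b)]
    for u, v, cost in edges:
        buckets[min(int(cost * b), b - 1)].append((cost, u, v))
    parent = list(range(n))  # union-find for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for bucket in buckets:
        bucket.sort()  # small buckets on average under near-uniform costs
        for cost, u, v in bucket:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((u, v, cost))
    return mst

edges = [(0, 1, 0.2), (1, 2, 0.5), (0, 2, 0.9), (2, 3, 0.1)]
print(kruskal_bucket(4, edges))  # [(2, 3, 0.1), (0, 1, 0.2), (1, 2, 0.5)]
```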

20.
Assume that each vertex of a graph G is assigned a nonnegative integer weight and that l and u are given integers such that 0 ≤ l ≤ u. One wishes to partition G into connected components by deleting edges from G so that the total weight of each component is at least l and at most u. Such a partition is called an (l,u)-partition. We deal with three problems to find an (l,u)-partition of a given graph: the minimum partition problem is to find an (l,u)-partition with the minimum number of components; the maximum partition problem is defined analogously; and the p-partition problem is to find an (l,u)-partition with a given number p of components. All these problems are NP-hard even for series-parallel graphs, but are solvable in linear time for paths. In this paper, we present the first polynomial-time algorithm to solve the three problems for arbitrary trees.
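As a concrete rendering of the path case, a simple O(n²) dynamic program over prefix sums is sketched below; linear time is achievable on paths, and the paper's tree algorithms (not reproduced here) are the actual contribution.

```python
import math

def min_lu_partition_path(weights, l, u):
    """Minimum number of components in an (l, u)-partition of a path,
    via a simple O(n^2) dynamic program over prefix sums. A linear-time
    algorithm exists for paths; the paper's polynomial algorithms for
    arbitrary trees are not covered by this sketch."""
    n = len(weights)
    prefix = [0] * (n + 1)
    for i, w in enumerate(weights):
        prefix[i + 1] = prefix[i] + w
    dp = [math.inf] * (n + 1)  # dp[i]: min components covering first i vertices
    dp[0] = 0
    for i in range(1, n + 1):
        for j in range(i):
            # The segment of vertices j+1..i forms one component.
            if l <= prefix[i] - prefix[j] <= u and dp[j] + 1 < dp[i]:
                dp[i] = dp[j] + 1
    return dp[n] if dp[n] < math.inf else None  # None: no (l, u)-partition

print(min_lu_partition_path([2, 3, 1, 4, 2], l=3, u=6))  # 2, e.g. {2,3,1} | {4,2}
```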
