Similar Documents
20 similar documents found (query time: 46 ms)
1.
Building on separate studies of entropy-based and frequency-distribution-based term extraction, this paper proposes a term extraction method that combines information entropy with changes in word frequency distribution. Information entropy reflects the completeness of a term, while the change in word frequency distribution reflects its domain relevance. The method incorporates information entropy into the frequency-distribution formula for term extraction and applies simple linguistic rules to filter out ordinary strings. Experiments on a corpus from the automotive domain show that the method extracts 1,300 terms with a precision of 73.7%. The results indicate that the method performs better on low-frequency terms and that the extracted terms are structurally more complete.
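The entropy side of the method can be illustrated with a minimal sketch (not the authors' implementation; the function name and toy corpus are invented here): boundary entropy is high when a candidate string is followed (or preceded) by many different characters, which suggests the candidate is a complete term rather than a fragment.

```python
import math
from collections import Counter

def boundary_entropy(corpus, candidate, side="right"):
    """Entropy of the characters adjacent to `candidate` in `corpus`.

    High entropy on both sides suggests the candidate combines freely
    with many contexts and is therefore likely a complete term."""
    neighbours = Counter()
    start = corpus.find(candidate)
    while start != -1:
        pos = start + len(candidate) if side == "right" else start - 1
        if 0 <= pos < len(corpus):
            neighbours[corpus[pos]] += 1
        start = corpus.find(candidate, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in neighbours.values())
```

With three distinct right neighbours appearing once each, the entropy is log2(3).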

2.
A statistical test of the significance of biodiversity and evenness, with web-based computing software
The Shannon-Wiener diversity index and evenness are widely used in biodiversity and community ecology research because they are simple to apply. However, the reliability of such analyses is low, partly because suitable statistical tests have been lacking. Based on the Shannon-Wiener index and the Ewens-Caswell test, we developed EwensCaswellTest, Internet-based software for statistical significance testing of biodiversity and evenness. The software consists of four Java classes and one HTML file and runs in a variety of web browsers. We used EwensCaswellTest to analyze the diversity of arthropod communities in rice paddies (15 sites, 17 functional groups) and the polymorphism of HLA-DQB1 alleles in Chinese ethnic groups (12 ethnic groups and populations, 17 alleles). The results show that the test effectively reflects the significance of diversity and evenness.
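The two statistics the software builds on can be sketched in a few lines (this is not the EwensCaswellTest code itself, which is written in Java; the evenness formula used here, H'/ln S, is the common Pielou form and is an assumption):

```python
import math

def shannon_wiener(counts):
    """Shannon-Wiener diversity index H' (natural log) and
    evenness J' = H' / ln(S) for a list of group counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in props)
    s = len(props)
    j = h / math.log(s) if s > 1 else 0.0
    return h, j
```

A perfectly even community attains J' = 1; concentrating abundance in one group lowers both statistics.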

3.
Three models for word frequency distributions, the lognormal law, the generalized inverse Gauss-Poisson law and the extended generalized Zipf's law are compared and evaluated with respect to goodness of fit and rationale. Application of these models to frequency distributions of a text, a corpus and morphological data reveals that no model can lay claim to exclusive validity, while inspection of the extrapolated theoretical vocabulary sizes raises doubts as to whether the urn scheme with independent trials is the correct underlying model for word frequency data. The role of morphology in shaping word frequency distributions is discussed, as well as parallelisms between vocabulary richness in literary studies and morphological productivity in linguistics.

R. Harald Baayen received his PhD at the Free University, Amsterdam, where he was involved in research on morphological productivity. He is now at the Max-Planck Institute for Psycholinguistics, Nijmegen, participating in a project on computational modelling of lexical representation and process.

4.
Multi-objective evolutionary optimization algorithms are among the best optimizers for solving problems in control systems, engineering, and industrial planning. The performance of these algorithms degrades severely when the Pareto dominance relation loses selection pressure, causing the algorithm to act randomly. Various recent methods try to restore selection pressure, but this causes the population to converge to a specific region, which is undesirable. Diversity loss in high-dimensional problems, which weakens these approaches, is a decisive factor in their overall performance. The novelty of this paper is a new diversity measure and a diversity control mechanism that can be used in combination to remedy this problem. The measure is based on the shortest Hamiltonian path, which captures an ordering of the population in any dimension. To control population diversity, we designed an adaptive framework that adjusts the selection operator according to diversity variation in the population, using existing diversity measures as well as the proposed one. This study incorporates the proposed framework into MOEA/D, an efficient and widely used evolutionary algorithm. The obtained results validate the motivation in terms of diversity and performance measures in comparison with state-of-the-art algorithms and demonstrate the applicability of our method to many-objective problems. Moreover, an extensive comparison with several diversity-measure algorithms reveals the competitiveness of the proposed measure.

5.
Burmese is a low-resource language, and harvesting large-scale Chinese-Burmese bilingual lexicons from the web can partly alleviate the shortage of sentence-aligned corpora for Chinese-Burmese machine translation. This paper therefore proposes a Chinese-Burmese bilingual lexicon extraction method that fuses topic and context features. First, an LDA topic model is used to obtain the topic distributions of Chinese and Burmese documents; cross-lingual topic vectors are mapped into a shared semantic space via bilingual word embeddings, and highly similar words under the same topic are extracted as Chinese-Burmese candidate pairs. Next, BERT is used to obtain contextual semantic representations of the candidates' surrounding words to build context vectors. Finally, the candidates are weighted by the similarity of their context vectors, yielding higher-quality Chinese-Burmese translation pairs. Experimental results show that, compared with a bilingual-dictionary-based method and a bilingual LDA+CBW method, the proposed method improves accuracy by 11.07% and 3.82%, respectively.

6.
This paper shows practical examples of the application of a new image fusion paradigm for obtaining a 2-D all-in-focus image from a set of multi-focus images of a 3-D real object. The goal is to provide an enhanced 2-D image showing the object entirely in focus. The fusion procedure shown here is based on a pixel-level focus measure defined in the space-frequency domain through a 1-D pseudo-Wigner distribution. The method is illustrated with different sets of images. Evaluation measures applied to artificially blurred, cut-and-pasted regions show that the present scheme can perform as well as or better than alternative image fusion algorithms.

7.
Generating the Tibetan-Chinese vocabulary is not only the first step in Tibetan-Chinese bidirectional machine translation but also affects translation quality. This paper improves vocabulary generation to boost downstream Tibetan-Chinese bidirectional translation. On the one hand, we concatenate vocabularies, using a regular word vocabulary for high-frequency words and a byte-pair-encoding (BPE) vocabulary for low-frequency words, and find the best frequency threshold through repeated training. On the other hand, we learn the Tibetan-Chinese vocabulary with an optimal-transport vocabulary learning method, adapted to the characteristics of the Tibetan language before being applied to bidirectional translation. Experimental results show that the BPE-plus-optimal-transport vocabulary learning method tailored to Tibetan performs best, reaching BLEU scores of 37.35 on Tibetan-to-Chinese and 27.60 on Chinese-to-Tibetan translation.
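The vocabulary-concatenation idea can be sketched as follows (a toy illustration, not the paper's pipeline; the naive character fallback stands in for a trained BPE segmenter, and the threshold is an assumption): words at or above a frequency threshold are kept as whole units, while rarer words are handed to subword segmentation.

```python
from collections import Counter

def concat_vocab(tokens, threshold):
    """Split the vocabulary by frequency: frequent words are kept whole,
    infrequent words fall back to smaller units (here single characters,
    standing in for BPE subwords)."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= threshold}
    for w, c in counts.items():
        if c < threshold:
            vocab.update(w)  # character fallback for rare words
    return vocab

def segment(word, vocab):
    """Emit the word whole if it is in the vocabulary, else split it."""
    return [word] if word in vocab else list(word)
```

Tuning the threshold trades vocabulary size against segmentation granularity, which is the quantity the paper searches over.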

8.
In this paper we introduce a set of related confidence measures for large vocabulary continuous speech recognition (LVCSR) based on local phone posterior probability estimates output by an acceptor HMM acoustic model. In addition to their computational efficiency, these confidence measures are attractive because they may be applied at the state, phone, word, or utterance level, potentially enabling discrimination between different causes of low-confidence recognizer output, such as unclear acoustics or mismatched pronunciation models. We have evaluated these confidence measures for utterance verification using a number of different metrics. Experiments reveal several trends in "profitability of rejection", as measured by the unconditional error rate of a hypothesis test. These trends suggest that crude pronunciation models can mask the relatively subtle reductions in confidence caused by out-of-vocabulary (OOV) words and disfluencies, but not the gross model mismatches elicited by non-speech sounds. A purely acoustic confidence measure provides improved performance over a measure based on both acoustic and language model information for data drawn from the Broadcast News corpus, but not for data drawn from the North American Business News corpus; this suggests that a trigram language model fits Broadcast News data less well. We also argue that acoustic confidence measures may be used to inform the search for improved pronunciation models.

9.
How Variable May a Constant be? Measures of Lexical Richness in Perspective
A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
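The text-length dependence at the heart of the argument is easy to make concrete (an illustrative sketch; the helper name is ours): tracking vocabulary size V(N) as the first N tokens of a text are read shows that V grows with N rather than staying constant.

```python
def vocab_growth(tokens, step=100):
    """Vocabulary size V(N) sampled every `step` tokens as the text is
    read left to right; plotting these points exposes the dependence of
    raw vocabulary size on text length."""
    seen, points = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0:
            points.append((n, len(seen)))
    return points
```

Any measure built directly on V(N) inherits this growth, which is why the paper checks the alternative "constants" empirically.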

10.
Kazakh is one of the minority languages of Xinjiang, and word frequency statistics for Kazakh, a foundational task in natural language processing, urgently need to be addressed. This paper introduces Zipf's law and its connection to Kazakh word frequency statistics. Continuous Kazakh character strings are segmented, and the resulting word sequences are used to build a Kazakh dictionary that stores distinct word forms together with their frequencies of occurrence. Statistical experiments on Kazakh show that there are inherent relationships among Kazakh word frequencies and verify that Kazakh word frequencies conform to Zipf's power law.
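The rank-frequency relationship such experiments verify can be checked with a short sketch (illustrative only; the helper name and toy data are ours): Zipf's law predicts frequency proportional to 1/rank^a with a close to 1, i.e. a straight line of slope -a in log-log coordinates.

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Least-squares slope of log(frequency) against log(rank);
    for text obeying Zipf's law the returned exponent is near 1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return -slope
```

On a toy corpus with frequencies exactly 120/rank, the fitted exponent is 1.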

11.
One of the most serious challenges for speech synthesis is the systematic treatment of events in language and speech that are known to have low frequencies of occurrence. The problems that extremely unbalanced frequency distributions pose for rule-based or data-driven models are often underestimated or even unrecognized. This paper discusses the problems pertinent to rare events in four components of speech synthesis systems: in linguistic text analysis, where productive word formation processes generate a potentially unbounded lexicon and cause heavily skewed word frequency distributions; in syllabification, where some syllables occur very frequently but most phonotactically possible syllables are very infrequent; in speech timing, where most constellations of factors affecting segmental duration are sparsely or not at all represented in training databases; and in unit selection synthesis, where the uneven distribution of speech unit frequencies poses challenges to speech corpus design. Currently available techniques for coping with the problem of rare or unseen events in each of these components are reviewed. Finally, a distinction is made between a strictly closed domain with a fixed vocabulary and a merely restricted domain with loopholes for unseen words and names, and the consequences of the respective type of domain for appropriate synthesis strategies are discussed.

12.
In addition to the newly developed scaling diversity index, eleven traditional diversity indices can be found in the literature. Analyses show that these eleven traditional indices are unable to formulate the richness component of diversity. In particular, the most widely used index, the Shannon-Wiener index, cannot express the evenness component. By contrast, the scaling diversity index is able to formulate both the richness aspect and the evenness aspect of diversity. The scaling diversity index has been applied to developing scenarios of ecological diversity at different spatial resolutions and spatial scales. A case study in Fukang in the Xinjiang Uygur Autonomous Region in China shows that the scaling diversity index is sensitive to spatial resolution and is easy to understand. It is scientifically sound and can be operated at affordable cost.

13.
The use of a space-optimal storage scheme for compiler diagnostic messages is described. The space reduction achievable through word coding can be predicted in terms of text length, vocabulary size, and average word length. A method of word or phrase selection is discussed. The approach is illustrated by application to the FORTRAN FTN and COBOL compilers under the KRONOS operating system on the CDC 6400 computer.

14.
Efficient modeling of actions is critical for recognizing human actions. Recently, bag of video words (BoVW) representation, in which features computed around spatiotemporal interest points are quantized into video words based on their appearance similarity, has been widely and successfully explored. The performance of this representation, however, is highly sensitive to two main factors: the granularity, and therefore the size, of the vocabulary, and the space in which features and words are clustered, i.e., the distance measure between data points at different levels of the hierarchy. The goal of this paper is to propose a representation and learning framework that addresses both these limitations. We present a principled approach to learning a semantic vocabulary from a large amount of video words using Diffusion Maps embedding. As opposed to flat vocabularies used in traditional methods, we propose to exploit the hierarchical nature of feature vocabularies representative of human actions. Spatiotemporal features computed around interest points in videos form the lowest level of representation. Video words are then obtained by clustering those spatiotemporal features. Each video word is then represented by a vector of Pointwise Mutual Information (PMI) between that video word and training video clips, and is treated as a mid-level feature. At the highest level of the hierarchy, our goal is to further cluster the mid-level features, while exploiting semantically meaningful distance measures between them. We conjecture that the mid-level features produced by similar video sources (action classes) must lie on a certain manifold. To capture the relationship between these features, and retain it during clustering, we propose to use diffusion distance as a measure of similarity between them. The underlying idea is to embed the mid-level features into a lower-dimensional space, so as to construct a compact yet discriminative, high-level vocabulary. Unlike some of the supervised vocabulary construction approaches and the unsupervised methods such as pLSA and LDA, Diffusion Maps can capture local relationships between the mid-level features on the manifold. We have tested our approach on diverse datasets and have obtained very promising results.
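The PMI mid-level feature can be sketched as follows (a hedged illustration: the data layout as (word, clip) co-occurrence pairs and the base-2 logarithm are our assumptions, not the paper's): each video word gets a vector of PMI scores, one per training clip.

```python
import math
from collections import Counter

def pmi_vector(pairs, word):
    """Pointwise mutual information between one video word and every
    clip, from (word, clip) co-occurrence pairs:
    PMI(w, c) = log2( p(w, c) / (p(w) * p(c)) )."""
    total = len(pairs)
    word_count = Counter(w for w, _ in pairs)
    clip_count = Counter(c for _, c in pairs)
    joint = Counter(pairs)
    vec = {}
    for clip in clip_count:
        j = joint[(word, clip)]
        if j > 0:
            vec[clip] = math.log2(j * total /
                                  (word_count[word] * clip_count[clip]))
        else:
            vec[clip] = 0.0  # unobserved pair; a common PMI convention
    return vec
```

A positive score means the word and clip co-occur more often than independence would predict.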

15.
16.

The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article aims to automatically estimate the strength of word associations, whether or not the words are semantically related. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies on not only the rank ordering of word pairs but also the strength of associations can reveal some findings that go unnoticed by traditional measures such as Spearman's and Pearson's correlation coefficients.

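The core scoring step can be sketched in a few lines (a hedged illustration: the function names, the 0.5 default weight, and the toy vectors are ours, not the article's): compute cosine similarity in each embedding space independently and combine the two with a weighted average.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def association_strength(pair, corpus_emb, network_emb, weight=0.5):
    """Weighted average of cosine similarities from two independently
    constructed embedding spaces (corpus- and network-based)."""
    w1, w2 = pair
    return (weight * cosine(corpus_emb[w1], corpus_emb[w2])
            + (1 - weight) * cosine(network_emb[w1], network_emb[w2]))
```

The weight controls how much each space contributes; the article reports that this combined score correlates well with human judgements.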

17.
Multi-depot logistics vehicle scheduling is a highly practical NP-hard problem. The standard differential evolution algorithm lacks dynamic adjustment during evolution, and in later stages the loss of population diversity makes it prone to premature convergence. To address this, an improved differential evolution algorithm is proposed: the scaling factor is adjusted dynamically and adaptively during mutation, Gaussian perturbation is added during crossover to increase population diversity, and a new selection mechanism is introduced after the mutation operation. The algorithm is applied to the multi-depot logistics vehicle scheduling problem; a mathematical model is established and the implementation is described in detail. Simulations comparing the algorithm with a genetic algorithm and the standard differential evolution algorithm show that it achieves better optimization results, demonstrating the feasibility and effectiveness of applying the algorithm to this problem.
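One generation of such a modified differential evolution can be sketched as follows (the linear schedule for the scaling factor F, the 0.01 Gaussian scale, and the greedy selection are illustrative assumptions, not the paper's exact settings):

```python
import random

def de_step(pop, fitness, gen, max_gen, cr=0.9, seed=None):
    """One generation of differential evolution with an adaptive
    scaling factor (shrinking over the run) and a small Gaussian
    perturbation during crossover.  `fitness` is minimised."""
    rng = random.Random(seed)
    f = 0.9 - 0.5 * gen / max_gen  # adaptive scaling factor
    dim = len(pop[0])
    new_pop = []
    for i, target in enumerate(pop):
        a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
        trial = []
        for d in range(dim):
            if rng.random() < cr:
                # DE/rand/1 mutation plus Gaussian diversity perturbation
                trial.append(a[d] + f * (b[d] - c[d]) + rng.gauss(0, 0.01))
            else:
                trial.append(target[d])
        new_pop.append(trial if fitness(trial) <= fitness(target) else target)
    return new_pop
```

Because selection is greedy, the best fitness in the population never worsens from one generation to the next.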

18.
Operational risk is commonly analyzed in terms of the distribution of aggregate yearly losses. Risk measures can then be computed as statistics of this distribution that focus on the region of extreme losses. Assuming independence among the operational risk events and between the likelihood that they occur and their magnitude, separate models are made for the frequency and for the severity of the losses. These are then combined to estimate the distribution of aggregate losses. While the detailed form of the frequency distribution does not significantly affect the risk analysis, the choice of model for the severity often has a significant impact on operational risk measures. For heavy-tailed distributions these measures are dominated by extreme losses, whose probability cannot be reliably extrapolated from the available data. With limited empirical evidence, it is difficult to distinguish among alternative models that produce very different values of the risk measures. Furthermore, the estimates obtained can be unstable and overly sensitive to the presence or absence of single extreme events. Setting a bound on the maximum amount that can be lost in a single event reduces the dependence on the distributional assumptions and improves the robustness and stability of the risk measures, while preserving their sensitivity to changes in the risk profile. This bound should be determined by expert assessment on the basis of economic arguments and validated by the regulator, so that it can be used as a control parameter in the risk analysis.
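The effect of bounding the single-event loss shows up in a small Monte-Carlo sketch (all choices here are illustrative assumptions, not the paper's model: Poisson frequency, lognormal severity capped at a fixed bound, and a quantile of the simulated aggregate yearly loss as the risk measure):

```python
import math
import random

def aggregate_loss_var(lam, mu, sigma, cap,
                       quantile=0.99, n_sims=10000, seed=1):
    """Quantile (VaR-style) of simulated aggregate yearly losses with
    Poisson(lam) event counts and lognormal(mu, sigma) severities,
    each severity capped at `cap`."""
    rng = random.Random(seed)

    def poisson(l):
        # Knuth's multiplication algorithm, adequate for small lam
        k, p, target = 0, 1.0, math.exp(-l)
        while p > target:
            k += 1
            p *= rng.random()
        return k - 1

    totals = []
    for _ in range(n_sims):
        n = poisson(lam)
        totals.append(sum(min(rng.lognormvariate(mu, sigma), cap)
                          for _ in range(n)))
    totals.sort()
    return totals[int(quantile * n_sims)]
```

With the same random seed, lowering the cap can only reduce each simulated yearly loss, so the capped risk measure never exceeds the uncapped one, which is the stabilising effect the abstract describes.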

19.
Phonetic, lexical, and grammatical studies are the three main components of dialect research, and identifying dialect words is the first step in dialect lexical research. At present, corpus collection and curation for Chinese dialect vocabulary relies mainly on manual work by experts, which is time-consuming and labor-intensive. With the development of information technology, much communication now takes place online, and input-method data contain massive language resources along with regional information that can support the automatic discovery of dialect word corpora. However, no existing work has studied how to systematically analyze dialect vocabulary using pinyin input-method data. In this paper, we therefore explore methods for automatically discovering regional dialect words from the behavior of Chinese input-method users. In particular, we identify two classes of features in input-method data that characterize dialect words, and recognize dialect words based on different combinations of these features. Finally, we experimentally evaluate how different combinations of the two feature classes affect dialect word recognition.

20.
Determining the most appropriate inputs to a model has a significant impact on the performance of the model and associated algorithms for classification, prediction, and data analysis. Previously, we proposed an algorithm, ICAIVS, which utilizes independent component analysis (ICA) as a preprocessing stage to overcome issues of dependencies between inputs, before the data are passed to an input variable selection (IVS) stage. While we previously demonstrated with artificial data that ICA can prevent an overestimation of the number of necessary input variables, we show here that mixing between input variables is common in real-world data sets, so ICA preprocessing is useful in practice. This experimental test is based on new measures introduced in this paper. Furthermore, we extend the implementation of our variable selection scheme to a statistical dependency test based on mutual information and test several algorithms on Gaussian and sub-Gaussian signals. Specifically, we propose a novel method of quantifying linear dependencies using ICA estimates of mixing matrices with a new linear mixing measure (LMM).
