Similar Documents
20 similar documents retrieved (search took 15 ms)
1.
Neural Computing and Applications - Detecting and correcting misspelled words in a written text are of great importance in many natural language processing applications. Errors can be broadly...

2.
Stemming is a basic operation in natural language processing (NLP) that removes derivational and inflectional affixes without performing a full morphological analysis; it is essential for extracting the root, or stem, of a word. In NLP, stemmers are used to improve information retrieval (IR), text classification (TC), text mining (TM) and related applications. Existing Urdu stemmers, however, use only unigram words from the input text, ignoring bigram and trigram words; to improve the effectiveness and efficiency of stemming, bigram and trigram words must be included as well. Moreover, only a few Urdu stemming methods have been developed in past studies. In this paper, we therefore propose an improved Urdu stemmer that uses a hybrid, multi-step approach to handle unigram, bigram, and trigram features. To evaluate the proposed method we used two corpora, a word corpus and a text corpus, and applied two different evaluation metrics to measure the performance of the algorithm. The proposed algorithm achieved an accuracy of 92.97% and a compression rate of 55%. These experimental results indicate that the proposed system can increase the effectiveness and efficiency of Urdu stemming for better information retrieval and text mining applications.
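As a rough illustration of the multi-step affix stripping this abstract describes, the sketch below strips suffixes and prefixes from single words and applies the same rule word by word to bigram or trigram inputs. The affix lists, stem lexicon, and minimum-stem-length guard are illustrative assumptions, not the paper's actual Urdu rules.

```python
SUFFIXES = ["ing", "ed", "s"]         # hypothetical suffix list
PREFIXES = ["un", "re"]               # hypothetical prefix list
STEM_LEXICON = {"run", "walk", "do"}  # known stems bypass rule stripping

def stem_token(token: str) -> str:
    if token in STEM_LEXICON:
        return token
    # Strip the longest matching suffix, then the longest matching prefix,
    # refusing any strip that would leave an implausibly short stem.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suf) and len(token) > len(suf) + 2:
            token = token[:-len(suf)]
            break
    for pre in sorted(PREFIXES, key=len, reverse=True):
        if token.startswith(pre) and len(token) > len(pre) + 2:
            token = token[len(pre):]
            break
    return token

def stem_ngram(ngram: str) -> str:
    # Bigram/trigram inputs are stemmed word by word, so multi-word
    # features are not silently ignored.
    return " ".join(stem_token(w) for w in ngram.split())
```

For example, `stem_ngram("walking runs")` reduces both words of the bigram to their stems.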

3.
Chinese spelling correction is the task of detecting and correcting spelling errors in text. Most Chinese spelling errors arise when characters that are similar in meaning, pronunciation, or shape are misused, so a common practice is to extract and model features from these different modalities. However, fusing the features directly, or summing them with fixed weights, ignores the relative importance of the modalities and biases the model when identifying errors, preventing it from learning effectively. To address this, a new model is proposed: a Chinese spelling correction algorithm that fuses the text-sequence error probability with the probabilities of common Chinese spelling errors. The method uses the text-sequence error probability as a dynamic weight and the probability of common Chinese spelling errors as a fixed weight to fuse semantic, phonetic, and glyph information efficiently. The model can thus control how each modality flows into the mixed-modality representation and focus learning on the positions where errors occur. Experiments on the SIGHAN benchmark show that the proposed model improves every evaluation score across the datasets, confirming the feasibility of the algorithm.

4.
Research on spelling correction aimed at detecting errors in texts tends to focus on context-sensitive spelling error correction, which is more difficult than traditional isolated-word error correction. A novel and efficient algorithm for CInsunSpell, a Chinese spelling error correction system, is presented. In this system, correction consists of two phases: a checking phase and a correcting phase. In the first phase, a trigram algorithm within one fixed-size window is designed to locate potential errors in a local area. The second phase employs a new method of automatically and dynamically distributing weights among the characters in the confusion set as well as in a Bayesian language model. These tactics exhibit good performance.
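The checking phase described here, a trigram model scanned over a fixed-size window, can be sketched roughly as follows. The toy corpus, the count threshold, and the character-level granularity are assumptions for illustration, not CInsunSpell's trained model.

```python
from collections import Counter

def train_trigrams(corpus: str) -> Counter:
    # Count every overlapping character trigram in the training text.
    return Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))

def flag_errors(text: str, trigrams: Counter, min_count: int = 1) -> list:
    # Slide a 3-character window over the input; a starting position is
    # suspicious when its trigram never reached min_count in training.
    suspects = []
    for i in range(len(text) - 2):
        if trigrams[text[i:i + 3]] < min_count:
            suspects.append(i)
    return suspects
```

Positions flagged here would then be handed to the correcting phase, which reweights candidates from a confusion set.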

5.
In this paper we address the problem of providing an order of relevance, or ranking, among entities’ properties used in RDF datasets, Linked Data and SPARQL endpoints. We first motivate the importance of ranking RDF properties by providing two killer applications for the problem, namely property tagging and entity visualization. Moved by the desiderata of these applications, we propose to apply Machine Learning to Rank (MLR) techniques to the problem of ranking RDF properties. Our solution is based on a deep empirical study of all the dimensions involved: feature selection, the MLR algorithm, and model training. The major advantages of our approach are the following: (a) flexibility/personalization, as the properties’ relevance can be user-specified by personalizing the training set in a supervised approach, or set by a novel automatic classification approach based on SWiPE; (b) speed, since it can be applied without computing frequencies over the whole dataset, leveraging existing fast MLR algorithms; (c) effectiveness, as it can be applied even when no ontology data is available, by using novel dataset-independent features; (d) precision, which is high both in terms of F-measure and Spearman’s rho. Experimental results show that the proposed MLR framework outperforms the two existing approaches in the literature related to RDF property ranking.

6.
To enhance security in dynamic networks, it is important to evaluate vulnerabilities and to offer an economical and practical patching strategy, since vulnerability is the major driving force for attacks. In this paper, a hybrid ranking approach is presented to estimate vulnerabilities under dynamic scenarios; it combines low-level rating of vulnerability instances with high-level evaluation of the security level of the network system. Moreover, a novel quantitative model, an adapted attack graph, is proposed to avoid isolated scoring: it takes the dynamic and logical relations among exploits into account and significantly benefits vulnerability analysis. To validate the applicability and performance of our approach, a hybrid ranking case is implemented as an experimental platform. The ranking results show that our approach differentiates the influence levels among vulnerabilities under dynamic attack scenarios and economically enhances the security of the network system.

7.
In this paper, the problem of spatial error concealment for real-time applications is addressed. The proposed method belongs to the exemplar-based error concealment approaches, in which a patch of corrupted pixels is replaced by another patch of the image that contains correct pixels. For splitting the erroneous block into different patches, a novel context-dependent exemplar-based algorithm built on a previously proposed segmentation method is presented. The capability of the proposed method for concealment in diverse image regions is demonstrated. Our detailed experiments show that the proposed method outperforms state-of-the-art spatial error concealment methods in terms of output quality.

8.
Extracting significant features from high-dimensional, small-sample-size biological data is a challenging problem. Recently, Michał Dramiński proposed the Monte Carlo feature selection (MC) algorithm, which is able to search over large feature spaces and achieves good classification accuracies. However, MC neither utilizes the information in feature rank variations nor dynamically updates the ranks of features. Here, we propose a novel feature selection algorithm that integrates ideas from professional tennis rankings, such as seed players and dynamic ranking, into the Monte Carlo simulation. Seed players make the feature selection game more competitive and selective, and the dynamic ranking strategy ensures that it is always the current best players that take part in each competition. The proposed algorithm is tested on 8 biological datasets. Results demonstrate that the proposed method is computationally efficient, stable, and has favorable classification performance.
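A minimal sketch of the tennis-ranking idea, assuming a generic subset scorer: the current top-ranked features are always seeded into the next Monte Carlo round, and credits (and hence ranks) are updated dynamically after every round. The round count, subset size, and toy scorer are invented; in the paper the score would come from a classifier evaluated on the biological data.

```python
import random

def mc_feature_ranking(n_features, score_subset, rounds=200,
                       subset_size=3, n_seeds=1, rng=None):
    rng = rng or random.Random(0)
    credit = [0.0] * n_features
    for _ in range(rounds):
        # "Seed players": the current top-ranked features always compete.
        seeds = sorted(range(n_features), key=lambda f: -credit[f])[:n_seeds]
        pool = [f for f in range(n_features) if f not in seeds]
        subset = seeds + rng.sample(pool, subset_size - n_seeds)
        score = score_subset(subset)
        for f in subset:
            credit[f] += score   # dynamic ranking: updated every round
    return sorted(range(n_features), key=lambda f: -credit[f])
```

With a scorer that rewards subsets containing two "informative" features, those features rise to the top of the returned ranking.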

9.
When a document is prepared using a computer system, it can be checked for spelling errors automatically and efficiently. This paper reviews and compares several methods for searching an English spelling dictionary. It also presents a new technique, hash-bucket search, for searching a static table in general, and a dictionary in particular. Analysis shows that with only a small amount of space beyond that required to store the keys, the hash-bucket search method has many advantages over existing methods. Experimental results with a sample dictionary using double hashing and the hash-bucket techniques are presented.
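The hash-bucket idea can be sketched as follows for a static word list: keys are hashed into fixed buckets once at build time, so a lookup touches only one short bucket. The bucket count and the use of Python's built-in hash are illustrative choices, not the paper's exact construction.

```python
def build_buckets(words, n_buckets=8):
    # Build the static table once: each word lands in exactly one bucket.
    buckets = [[] for _ in range(n_buckets)]
    for w in words:
        buckets[hash(w) % n_buckets].append(w)
    return buckets

def in_dictionary(word, buckets):
    # A lookup scans only the one bucket the word hashes into.
    return word in buckets[hash(word) % len(buckets)]
```

Since the dictionary is static, the buckets are built once and only the fast membership test runs per queried word.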

10.
To exploit the characteristics of label ranking problems, a feature selection algorithm for label ranking datasets, Label Ranking Based Feature Selection (LRFS), is proposed. The algorithm first defines a new neighborhood information measure based on neighborhood rough sets, which can directly quantify the relevance, redundancy, and interaction among continuous, discrete, and ranking-type features. On this basis, a label ranking feature selection algorithm using neighborhood interaction weight factors is proposed. Experimental results show that LRFS can effectively remove irrelevant or redundant features from label ranking datasets without reducing ranking accuracy.

11.
12.
Automatic document summarization aims to create a compressed summary that preserves the main content of the original documents. It is a well-recognized fact that a document set often covers a number of topic themes, each represented by a cluster of highly related sentences. More importantly, topic themes are not equally important: the sentences in an important theme cluster are generally deemed more salient than those in a trivial one. Existing clustering-based summarization approaches apply clustering and ranking in sequence, which unavoidably ignores the interaction between them. In this paper, we propose a novel approach based on spectral analysis for simultaneous clustering and ranking of sentences. Experimental results on the DUC generic summarization datasets demonstrate the improvement of the proposed approach over other existing clustering-based approaches.

13.
In mountainous areas, slope and altitude variations modulate the airborne-sensed hyperspectral radiance image. A new algorithm, SIERRA, has been developed for atmospheric, relief and BRDF corrections in order to extract the surface reflectance in the form of a bi-hemispherical albedo that does not depend on solar incidence and observation angles. The forward modeling efforts focus on the estimation of diffuse irradiance and upwelling diffuse radiance, and on the formulation of BRDF effects. The inversion scheme consists of four steps that go progressively deeper into the complexity of the phenomena. To validate the model, reflectance images are assessed from radiance images simulated with different radiative transfer codes or forward models: MODTRAN4 in the case of homogeneous and flat ground, and the AMARTIS and SIERRA forward models for heterogeneous and mountainous cases. The surface reflectance is retrieved with a 5% relative error under standard acquisition conditions. SIERRA is applied to HyMap data acquired over the hilly landscape near Calanas, Spain. The hypercube reflectances are compared with those obtained using ATCOR4 and COCHISE. The benefit of the relief correction is clearly demonstrated.

14.
To further improve the measurement accuracy of the linear time-grating displacement sensor, a combined calibration method is proposed on the basis of a full error model covering the sensor's periodic error, Abbe error, and thermal expansion error. Fourier harmonic analysis and the principle of linear thermal expansion of materials are used to correct the various errors of the linear time grating, after which the accuracy reaches ±0.5 μm/m. Experiments show that the method solves the problem of separating the errors in linear measurement as well as the problem of continuous automatic computer sampling, and improves calibration efficiency, making wide application of the method in production practice possible.

15.
A new error concealment algorithm for field-interleaved image coding
To address the severe degradation of reconstructed image quality caused by data loss in the channel, directional interpolation is introduced into an error concealment algorithm based on field-interleaved image coding. Each frame is split into two fields that are encoded and transmitted independently; when one field is lost, an improved Sobel operator computes the internal edges of the correctly received field, and the lost field is recovered by interpolating along the edge directions. Experimental results show that with this row-interleaving-based error concealment algorithm, both the subjective quality and the peak signal-to-noise ratio of the recovered images are significantly higher than with existing methods. The algorithm has low computational complexity, is compatible with the H.264 coding standard, and is suitable for practical engineering applications.

16.
Fuzzy linear programming (FLP) problems, with a wide variety of applications in science and engineering, allow working with imprecise data and constraints, leading to more realistic models. The main contribution of this study is the formulation of a kind of FLP problem, known as the bounded interval-valued fuzzy numbers linear programming (BIVFNLP) problem, in which the coefficients of the decision variables in the objective function, the resource vector, and the coefficients of the technological matrix are represented as interval-valued fuzzy numbers (IVFNs), and the crisp decision variables are limited to lower and upper bounds. Here, based on signed distance ranking to order IVFNs, the bounded simplex method is extended to obtain an interval-valued fuzzy optimal value for the BIVFNLP problem under consideration. Finally, an illustrative example is given to show the superiority of the proposed algorithm over existing ones.

17.
Ordinary spatial keyword queries often return too many results. This paper proposes a top-k query and ranking method based on the location-text relevance of spatial objects, which retrieves representative spatial objects that are textually relevant and spatially close to a given spatial keyword query. The method has two stages: offline processing and online query processing. In the offline stage, the closeness of the location-text relationship between every pair of spatial objects is measured from their spatial proximity and textual similarity. On this basis, a probability-density-based algorithm selects representative spatial objects and builds a spatial-object list for each representative object according to these location-text relationships. In the online stage, for a given spatial keyword query, cosine similarity is used to compute the relevance between the query and the representative spatial objects, and the threshold algorithm (TA) then quickly selects the top-k representative spatial objects from the precomputed lists. Experimental results show that the proposed top-k query and ranking method effectively satisfies user query needs, with high accuracy, representativeness, and efficiency.
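The threshold algorithm (TA) used in the online phase can be sketched over two descending score lists; the toy objects and scores below are invented, and the aggregate is a plain sum rather than the paper's cosine-based relevance.

```python
import heapq

def ta_topk(lists, k):
    # lists: dict attr -> list of (object, score) sorted by score descending.
    scores = {a: dict(lst) for a, lst in lists.items()}  # random access
    best = {}    # object -> aggregate score over all attributes
    depth = 0
    while depth < max(len(lst) for lst in lists.values()):
        threshold = 0.0
        for a, lst in lists.items():
            if depth < len(lst):
                obj, s = lst[depth]
                threshold += s   # best possible score for unseen objects
                if obj not in best:
                    best[obj] = sum(scores[b].get(obj, 0.0) for b in scores)
        topk = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        # Stop early once no unseen object can beat the current k-th best.
        if len(topk) == k and topk[-1][1] >= threshold:
            break
        depth += 1
    return topk
```

The early stop is what makes TA fast on the precomputed lists: the scan halts as soon as the threshold falls below the k-th aggregate seen so far.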

18.
An efficient and systematic LL(1) error recovery method is presented that has been implemented for an LL(1) parser generator. Error messages which provide good diagnostic information are generated automatically. Error correction is done by discarding some input symbols and popping up some symbols from the parsing-stack in order to restore the parser to a valid configuration. Thus, symbol deletions and insertions are simulated. The choice between different possible corrections is made by comparing the cost of the inserted (popped) symbols with the reliability value of the recovery symbol (the first input symbol that is not discarded). Our concept of reliability is based on the observation that input symbols differ from each other in their ability to serve as recovery points. A high reliability value of a symbol asserts that it was probably not placed in the input by accident. So it is reasonable not to discard that symbol but to resume parsing. This is done even if a string with high insert-cost has to be inserted before that symbol in order to fit it to the part of the program that has already been analysed. The error recovery routine is invoked only when an error is detected. Thus, there is no additional time required for parsing correct programs. Error-correcting parsers for different languages, including Pascal, have been generated. Some experimental results are summarized.
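A toy sketch of the recovery decision described here, assuming hypothetical cost and reliability tables: each candidate correction is scored by the reliability of its recovery symbol minus the total insert-cost of the symbols inserted (or popped) before it, and the best-scoring candidate wins.

```python
# Illustrative tables only -- a real parser generator derives these
# from the grammar and from empirical tuning.
INSERT_COST = {";": 1, ")": 1, "end": 3, "id": 4}
RELIABILITY = {";": 2, "end": 8, "begin": 8, "id": 5}

def choose_recovery(candidates):
    # candidates: list of (symbols_to_insert, recovery_symbol).
    # A reliable recovery symbol justifies even a costly insertion
    # before it, rather than discarding that symbol from the input.
    def score(cand):
        inserted, recovery = cand
        cost = sum(INSERT_COST.get(s, 2) for s in inserted)
        return RELIABILITY.get(recovery, 0) - cost
    return max(candidates, key=score)
```

For instance, inserting a cheap `;` to resume at a highly reliable `end` beats a costlier insertion that resumes at a weak recovery point.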

19.
In intuitionistic fuzzy sets and their generalizations, such as Pythagorean fuzzy sets and q-rung orthopair fuzzy sets, ranking is not easy to define. Several techniques are available in the literature for ranking values in the above-mentioned orthopair fuzzy sets, and it is interesting to see that almost all of the proposed ranking methods produce distinct rankings. The notion of a knowledge base is very important for studying the rankings proposed by different techniques. The aim of this paper is to critically analyze the available ranking techniques for q-rung orthopair fuzzy values and to propose a new graphical ranking method based on hesitancy index and entropy. Several numerical examples are tested with the proposed technique, which shows that it is intuitive and convenient for real-life applications.

20.
Due to the lack of parallel data for the current grammatical error correction (GEC) task, models based on the sequence-to-sequence framework cannot be trained adequately enough to obtain high performance. We propose two data synthesis methods that can control the error rate and the ratio of error types in synthetic data. The first approach corrupts each word in a monolingual corpus with a fixed probability, using replacement, insertion, and deletion operations. The second approach trains error generation models and then filters their decoding results. Experiments on different synthetic data show that an error rate of 40% combined with an equal ratio of error types improves model performance the most. Finally, we synthesize about 100 million examples and achieve performance comparable to the state of the art, which uses twice as much data as we do.
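The first synthesis method, corrupting each word with a fixed probability, might look like the following sketch; the 40% default rate matches the abstract, while the dummy replacement vocabulary and the uniform choice among operations are illustrative assumptions.

```python
import random

def corrupt(sentence, error_rate=0.4, vocab=("foo", "bar"), rng=None):
    rng = rng or random.Random(0)
    out = []
    for word in sentence.split():
        if rng.random() < error_rate:
            op = rng.choice(["replace", "insert", "delete"])
            if op == "replace":
                out.append(rng.choice(vocab))      # swap in a wrong word
            elif op == "insert":
                out.append(word)
                out.append(rng.choice(vocab))      # add a spurious word
            # "delete": drop the word entirely
        else:
            out.append(word)
    return " ".join(out)
```

Pairing each corrupted sentence with its clean original yields the synthetic parallel data the sequence-to-sequence GEC model is trained on.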


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号