20 similar documents retrieved (search time: 78 ms)
1.
2.
Grammatical and context‐sensitive error correction using a statistical machine translation framework
Producing electronic rather than paper documents has considerable benefits, such as easier organization and data management. The existence of automatic writing-assistance tools such as spelling and grammar checkers/correctors can therefore improve the quality of electronic texts by removing noise and correcting erroneous sentences. Errors in a text can be categorized into spelling, grammatical, and real-word errors. In this article, we present a language-independent approach based on a statistical machine translation framework to develop a proofreading tool that detects grammatical errors as well as context-sensitive spelling mistakes (real-word errors). A hybrid model for grammar checking is proposed by combining this approach with an existing rule-based grammar checker. Experimental results on both English and Persian indicate that the proposed statistical method and the rule-based grammar checker are complementary in detecting and correcting syntactic errors. The hybrid grammar checker, applied to English texts, improves recall by about 24% with almost no change in precision. Experiments on a real-world data set show that state-of-the-art results are achieved for grammar checking and context-sensitive spell checking in Persian. Copyright © 2012 John Wiley & Sons, Ltd.
3.
4.
5.
6.
Addressing the shortcomings of current methods for automatic error detection in Tibetan text, this paper proposes an approach combining rules and statistics. First, building on Tibetan spelling grammar together with formal-language and automata theory, 37 deterministic finite automata are constructed to recognize modern Tibetan syllables; next, Sanskrit-transliterated Tibetan syllables are recognized by dictionary lookup; finally, statistical measures such as mutual information and the t-test difference are used to find true-word errors such as collocation errors and grammatical errors, achieving automatic error detection for Tibetan text…
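The mutual-information check described in this abstract can be sketched generically; the function below computes pointwise mutual information for an adjacent word pair from corpus counts and flags pairs that co-occur less often than chance as candidate collocation errors. The counts and threshold are illustrative assumptions, not the authors' implementation:

```python
import math

def pmi(pair_count: int, x_count: int, y_count: int,
        total_pairs: int, total_words: int) -> float:
    # Pointwise mutual information of an adjacent word pair:
    # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ).
    p_xy = pair_count / total_pairs
    p_x = x_count / total_words
    p_y = y_count / total_words
    return math.log2(p_xy / (p_x * p_y))

def is_suspicious(pair_count: int, x_count: int, y_count: int,
                  total_pairs: int, total_words: int,
                  threshold: float = 0.0) -> bool:
    # A pair that co-occurs less often than chance predicts (low PMI)
    # is a candidate collocation error.
    return pmi(pair_count, x_count, y_count,
               total_pairs, total_words) < threshold
```

In practice the threshold would be tuned on held-out data; here it simply separates above-chance from below-chance co-occurrence.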
7.
Natural language processing (NLP) has been used to process text from patient records and narratives. However, most existing methods were developed for specific systems, so new research is needed to assess whether they can be easily retargeted to new applications and goals with the same performance. In this paper, open-source tools are reused as building blocks for a new system. The aim of our work is to evaluate the applicability of current NLP technology to a new domain: automatic knowledge acquisition of diagnostic and therapeutic procedures from the free text of clinical practice guidelines. To this end, two publicly available syntactic parsers, several terminology resources, and a tool for identifying semantic predications were tailored to increase the performance of each tool individually. We apply this approach to 171 sentences selected by experts from a clinical guideline and compare the results with those of the tools applied without tailoring. The results show that, with some adaptation, open-source NLP tools can be retargeted to new tasks with accuracy equivalent to that of methods designed for specific tasks.
8.
9.
A Rule- and Statistics-Based Model and Algorithm for Automatic Error Detection in Chinese Text
Automatic proofreading of Chinese text is a challenging research topic in natural language processing. This paper proposes a model and algorithm for automatic error detection in Chinese text that combines rules with statistics. Based on the distribution of single-character words in correctly segmented text and the notion of "non-multi-character-word errors", a set of error-detection rules is proposed; these are combined with character bigram/trigram and part-of-speech bigram/trigram statistical models built over the scattered single-character strings produced by word segmentation, yielding an automatic error-detection model and its implementation. In experiments on 30 texts containing 578 error test points, the algorithm achieved a recall of 86.85%, a precision of 69.43%, and a false-alarm rate of 30.57%.
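The character-bigram component of such a model can be sketched as follows; this toy version trains add-one-smoothed character bigrams on a correct corpus and flags low-probability transitions as suspected error positions. The corpus, vocabulary size, and threshold below are illustrative assumptions, not the paper's setup:

```python
from collections import Counter

def train_bigrams(corpus: list[str]):
    # Count character unigrams and adjacent-character bigrams.
    uni, bi = Counter(), Counter()
    for sent in corpus:
        for ch in sent:
            uni[ch] += 1
        for a, b in zip(sent, sent[1:]):
            bi[(a, b)] += 1
    return uni, bi

def bigram_prob(uni, bi, a, b, vocab_size: int) -> float:
    # Add-one (Laplace) smoothed conditional probability P(b | a).
    return (bi[(a, b)] + 1) / (uni[a] + vocab_size)

def flag_errors(sentence: str, uni, bi, vocab_size: int,
                threshold: float) -> list[int]:
    # Positions whose smoothed bigram probability falls below the
    # threshold are flagged as suspected single-character errors.
    flags = []
    for i, (a, b) in enumerate(zip(sentence, sentence[1:])):
        if bigram_prob(uni, bi, a, b, vocab_size) < threshold:
            flags.append(i + 1)
    return flags
```

The paper combines several such models (character and part-of-speech, bigram and trigram) with hand-written rules; the sketch shows only the shared statistical core.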
10.
11.
Can Eyupoglu, Computer Systems Science and Engineering, 2019, 34(3): 113-121
12.
Learning word-embedding representations is one of the most fundamental yet important topics in natural language processing (NLP), underpinning all downstream language-processing tasks. Early one-hot representations ignored word semantics and frequently suffered from data sparsity in applications; with the advent of neural language models (NLMs), words came to be represented as low-dimensional real-valued vectors, effectively alleviating the sparsity problem. Word-level embeddings were the original input representation of neural-network language models, and many variants have since been proposed from different perspectives. This survey classifies embedding models by the number of languages involved into monolingual and cross-lingual models. Within the monolingual setting, models are further divided by input granularity into character-level, word-level, and phrase-level-and-above models; models at different granularities suit different application scenarios, each with its own strengths. These models are then subdivided again by whether they take context into account. Word embeddings are also often combined with models from other settings, introducing additional modalities or relational information to improve embedding quality, so this survey also reviews joint applications of word-embedding models with models from other fields. Based on this study, the characteristics of each model are summarized and compared, and the survey closes with directions and prospects for future research on word embeddings.
13.
14.
Fiona Lyddy, Francesca Farina, James Hanney, Lynn Farrell, Niamh Kelly O'Neill, Journal of Computer-Mediated Communication, 2014, 19(3): 546-561
Concerns over the effects of 'textisms' on literacy have been reinforced by research identifying processing costs associated with reading them. But to what extent do such studies reflect actual textism use? This study examined the textual characteristics of 936 text messages in English (13,391 words). Message length, nonstandard spelling, sender and message characteristics, and word frequency were analyzed. The data showed that 25% of word content used nonstandard spelling, the most frequent category being omission of capital letters. Types of nonstandard spelling varied only slightly with the purpose of the message, while the overall proportion of nonstandard spelling did not differ significantly. Less than 0.2% of content was 'semantically unrecoverable.' Implications for experimental studies of textisms are discussed.
15.
Design,Implementation and Evaluation of an Inflectional Morphology Finite State Transducer for Irish
Minority languages must endeavour to keep up with and avail of language technology advances if they are to prosper in the modern world. Finite state technology is mature, stable and robust. It is scalable and has been applied successfully in many areas of linguistic processing, notably in phonology, morphology and syntax. In this paper, the design, implementation and evaluation of a morphological analyser and generator for Irish using finite state transducers is described. In order to produce a high-quality linguistic resource for NLP applications, a complete set of inflectional morphological rules for Irish is handcrafted, as is the initial test lexicon. The lexicon is then further populated semi-automatically using both electronic and printed lexical resources. Currently we achieve coverage of 89% on unrestricted text. Finally we discuss a number of methodological issues in the design of NLP resources for minority languages.
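The analyser/generator duality at the heart of a finite-state morphology can be illustrated with a deliberately tiny suffix-rule transducer; the rules below are toy English plural rules, purely illustrative assumptions, not the paper's Irish grammar:

```python
# Toy inflectional rules: (lemma_suffix, tag) -> surface_suffix.
# e.g. cat + "+Pl" -> cats, pony + "+Pl" -> ponies.
RULES = {
    ("", "+Pl"): "s",
    ("y", "+Pl"): "ies",
}

def generate(lemma: str, tag: str) -> str:
    # Lexical-to-surface direction: apply the longest-matching
    # lemma-suffix rule for the requested tag.
    for (lsuf, t), ssuf in sorted(RULES.items(),
                                  key=lambda r: -len(r[0][0])):
        if t == tag and lemma.endswith(lsuf):
            return lemma[: len(lemma) - len(lsuf)] + ssuf
    return lemma

def analyse(surface: str) -> list[tuple[str, str]]:
    # Surface-to-lexical direction: run the same rules in reverse.
    # Like a real transducer, analysis may be ambiguous, so a list
    # of (lemma, tag) candidates is returned.
    out = []
    for (lsuf, tag), ssuf in RULES.items():
        if surface.endswith(ssuf):
            out.append((surface[: len(surface) - len(ssuf)] + lsuf, tag))
    return out
```

A production system (e.g. one built with lexc/xfst-style tools) composes thousands of such rules with a lexicon into a single transducer that runs in both directions; the sketch only shows why one rule set serves both analysis and generation.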
16.
Semantic relatedness computation plays an important role in many natural language processing tasks, including information retrieval, word-sense disambiguation, automatic summarization, and spelling correction. This paper computes semantic relatedness between Chinese words using Explicit Semantic Analysis over Wikipedia: based on the Chinese Wikipedia, each word is represented as a weighted vector of concepts, so that computing the relatedness of two words reduces to comparing their concept vectors. Furthermore, a prior probability for each page is introduced, and the link structure between Wikipedia pages is used to correct the component weights of the concept vectors. Experimental results show that this method reaches a Spearman rank correlation of 0.52 against human-annotated gold standards, a significant improvement in relatedness computation.
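Under ESA, relatedness reduces to comparing sparse weighted concept vectors; a minimal sketch of that comparison follows, using plain cosine similarity over {concept: weight} dictionaries and omitting the paper's link-based weight correction. The example vectors are invented for illustration:

```python
import math

def cosine(u: dict, v: dict) -> float:
    # Cosine similarity between two sparse weighted concept vectors
    # represented as {concept: weight} dictionaries.
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical concept vectors for two words, as ESA might produce them
# from tf-idf weights of the words in Wikipedia articles (concepts).
bank_money = {"Finance": 0.9, "Economy": 0.6, "River": 0.1}
bank_river = {"River": 0.8, "Geography": 0.5, "Finance": 0.1}
```

Here `cosine(bank_money, bank_river)` is low because the vectors concentrate their weight on different concepts; two near-synonyms would share most of their heavy concepts and score close to 1.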
17.
A Survey of Paraphrasing Techniques
Paraphrase is a pervasive phenomenon in natural language and a concentrated reflection of its diversity; paraphrase research mainly concerns synonymy at the phrase or sentence level. The steady maturation of the underlying NLP technologies has made paraphrase research feasible, and it has attracted growing attention. For English and Japanese, paraphrasing techniques have been successfully applied to information retrieval, question answering, information extraction, automatic summarization, and machine translation, effectively improving system performance. This paper surveys recent progress in constructing paraphrase corpora, extracting paraphrase rules, and generating paraphrases, and briefly introduces our preliminary work on Chinese paraphrasing. The final section discusses the difficulties of paraphrasing technology and directions for future work, and concludes the paper.
18.
Diana Santos, Machine Translation, 1999, 14(2): 83-112
Is adaptation of English NLP applications the right way to go multilingual? Should one prefer "language-independent" systems with a view to applying them to a large number of different languages? Experience from the processing of Portuguese in several different areas (part-of-speech tagging, corpus tools, lexical decomposition, machine translation, etc.) suggests that neither offers a satisfactory solution. This paper argues for a thorough study of the way individual languages work in order to develop applications suited to the language in question, i.e., "language-dependent" systems.
19.
Most mainstream discriminative learning algorithms can only optimize smooth, differentiable loss functions, yet in natural language processing (NLP) the direct evaluation metrics of many applications, such as character error rate (CER), are non-differentiable step functions. To address this problem, we study a recently proposed discriminative learning algorithm, Minimum Sample Risk (MSR). Unlike other discriminative training algorithms, MSR uses the step function directly as its loss. We first analyze and improve the time and space complexity of MSR, and propose an improved variant, MSR-II, which makes the computation of correlations between features more stable. We also examine the robustness of MSR-II through extensive domain-adaptation modeling experiments. Evaluation on a Japanese kanji input task shows that: (1) MSR/MSR-II significantly outperform a traditional trigram model, reducing the error rate by 20.9%; (2) MSR/MSR-II perform comparably to two other mainstream discriminative learning algorithms, Boosting and Perceptron; (3) MSR-II is not only superior to MSR in time and space complexity but is also more stable in feature selection; (4) the domain-adaptation results demonstrate the good robustness of MSR-II. In sum, MSR/MSR-II is a highly effective algorithm; because it uses a step-shaped loss function, it can be applied widely across NLP tasks such as spelling correction and machine translation.
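The key point of MSR, that a step-shaped sample risk can be minimized directly because it is piecewise constant in each feature weight, can be sketched with a toy per-feature line search; the candidate tuples and weight grid below are illustrative assumptions, not the actual MSR/MSR-II procedure:

```python
def sample_risk(weight: float, samples) -> int:
    # samples: list of candidate lists, each candidate a tuple
    # (base_score, feature_value, error_count). The decoder picks the
    # candidate with the highest base_score + weight * feature_value;
    # the risk is the total error count of the picked candidates,
    # a step function of the weight.
    total = 0
    for cands in samples:
        best = max(cands, key=lambda c: c[0] + weight * c[1])
        total += best[2]
    return total

def line_search(samples, grid) -> float:
    # MSR-style update for a single feature: since the risk is
    # piecewise constant in the weight, it can be minimized exactly
    # by evaluating it on a finite grid of candidate weights.
    return min(grid, key=lambda w: sample_risk(w, samples))
```

The full algorithm iterates this one-dimensional search over a selected subset of features (MSR-II additionally stabilizes the feature-correlation computation); the sketch shows only why a non-differentiable loss poses no obstacle.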
20.
Error Detection and Correction for Uyghur Words Based on Minimum Edit Distance
Detecting spelling errors and selecting candidate corrections is an important technical problem in text analysis. Drawing on the phonetic and morphological characteristics of Uyghur, this paper catalogues the common types of spelling errors found in text and analyzes their solutions in detail. Using the minimum edit distance algorithm, it implements the error-detection and error-correction functions of Uyghur spelling analysis, and, building on this, further improves the accuracy and speed of candidate suggestions by incorporating Uyghur word-formation rules. The algorithm has been successfully applied to automatic proofreading of Uyghur text and multilingual text retrieval. In tests on a corpus of Xinjiang university journals, the error detection-and-correction rate exceeded 85%.
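A minimal sketch of minimum-edit-distance candidate generation follows: the standard dynamic-programming Levenshtein distance plus a distance-ranked lexicon filter. The lexicon and cutoff are illustrative, and the Uyghur-specific word-formation rules the paper adds on top are omitted:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance: the minimum
    # number of insertions, deletions and substitutions turning a into b.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def suggest(word: str, lexicon: list[str], max_dist: int = 2) -> list[str]:
    # Rank in-lexicon words by distance to the misspelled word and
    # keep only those within the cutoff as correction candidates.
    scored = [(edit_distance(word, w), w) for w in lexicon]
    return [w for d, w in sorted(scored) if d <= max_dist]
```

Filtering candidates by morphological validity, as the paper does, would further narrow this list and speed up lookup by pruning words that no word-formation rule can produce.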