首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Long organization names are often abbreviated in spoken Chinese, and abbreviated utterances cannot be recognized correctly if the abbreviations are not included in the recognition vocabulary. Therefore, it is very important to automatically generate and add abbreviations for organization names to the vocabulary. Generation of Chinese abbreviations is much more complex than English abbreviations which are mostly acronyms and truncations. In this paper, we propose a new hybrid method for automatically generating Chinese abbreviations and we perform vocabulary expansion using output of the abbreviation model for voice search. In our abbreviation modeling, we treat the abbreviation generation problem as a tagging problem and use conditional random fields (CRF) as the tagging tool, the output of which is then re-ranked by a length model and web information. In the vocabulary expansion, considering the multiple abbreviation phenomenon and limited coverage of the top-1 abbreviation candidate, we add top-10 candidates into the vocabulary. In our experiments, for the abbreviation modeling, we achieved a top-10 coverage of 88.3% with the proposed method. For the voice search using abbreviated utterances, we improved the full-name search accuracy from 16.9% to 79.2% by incorporating the top-10 abbreviation candidates to the vocabulary.  相似文献   

2.
Abbreviation Completion is a novel technique to improve the efficiency of code-writing by supporting code completion of multiple keywords based on non-predefined abbreviated input??a different approach from conventional code completion that finds one keyword at a time based on an exact character match. Abbreviated input consisting of abbreviated keywords and non-alphanumeric characters between each abbreviated keyword (e.g. pb st nm) is expanded into a full expression (e.g. public String name) by a Hidden Markov Model learned from a corpus of existing code and abbreviation examples. The technique does not require the user to memorize abbreviations and provides incremental feedback of the most likely completions. In addition to code completion by disabbreviation of multiple keywords, abbreviation completion supports prediction of the next keywords and non-alphanumeric characters of a code completion candidate, a technique called code completion by extrapolation. The system finds the most likely next keywords and non-alphanumeric characters using an n-gram model of programming language. This enables a code completion scenario in which a user first types a short abbreviated expression to complete the beginning part of a desired full expression and then uses the extrapolation feature to complete the remaining part without further typing. This paper presents the algorithm for abbreviation completion, integrated with a new user interface for multiple-keyword code completion. We tested the system by sampling 4919 code lines from open source projects and found that more than 99% of the code lines could be resolved from acronym-like abbreviations. The system could also extrapolate code completion candidates to complete the next one or two keywords with the accuracy of 96% and 82%, respectively. A user study of code completion by disabbreviation found 30% reduction in time usage and 41% reduction of keystrokes over conventional code completion.  相似文献   

3.
汉语缩略语在现代汉语中被广泛使用,其研究对于中文信息处理有着重要地意义。该文提出了一种从英汉平行语料库中自动提取汉语缩略语的方法。首先对双语语料进行词对齐,再抽取出与词对齐信息一致的双语短语对,然后用SVM分类器提取出质量高的双语短语对,最后再从质量高的短语对集合中利用相同英文及少量汉语缩略—全称对应规则提取出汉语缩略语及全称语对。实验结果表明,利用平行语料的双语对译信息,自动提取出的缩略语具有较高地准确率,可以作为一种自动获取缩略语词典的有效方法。  相似文献   

4.
简称是自然语言词汇的重要组成部分,其获取是自然语言处理中的一个基本而又关键的问题。提出了一种根据汉语全称从Web中获取对应汉语简称的方法。该方法包括获取和验证两个步骤。获取步骤通过选择查询模式从Web上获得候选简称集合。为了验证候选简称,定义了全简称关系约束,分别定性和定量地表示全称和对应简称之间的约束,构建了全简称关系图来表示所有全称和简称之间的联系,在验证过程中,先分别用约束公理和关系图对候选简称进行过滤,再用约束函数对候选简称分类,并以分类类别、语料标记和约束函数值作为属性构建决策树,利用决策树对候选简称进行验证。实验结果表明,获取方法的最终准确率为94.63%,召回率为84.09%,验证方法的准确率为94.81%。  相似文献   

5.
针对目前企业营销的不断深入,企业简称被各大新闻广泛使用,而作为新词又难以被有效识别的问题,提出一种基于构成模式和条件随机场(CRF)的企业简称预测方法。首先,从语言学的角度对企业全称和简称的构成规律进行了总结,并采用词库以及规则相结合的方式对Bi-gram算法进行改进,提出CBi-gram算法,实现了对企业全称的结构化切分,并提高了企业全称中核心词识别的准确性。然后,依据上述切分结果对企业类型进行再次细分,并通过人工总结和规则自学习的方法形成不同企业类型下的简称规则集。最后再基于规则生成企业的候选简称集,降低了不适用的规则对于不同类型的企业在生成简称过程中产生的噪声。另外,为了弥补单纯基于规则在解决全称缩写和简写缩写混合的局限性,引入CRF,从统计的角度对简称进行预测,并选取词、音调以及词在全称组成成分中的位置作为模型特征,进行模型训练,以实现两种方法的相互补充。实验结果显示,该方法具有较高的准确率,输出的企业简称集基本覆盖了企业的常用简称范围。  相似文献   

6.
中文组织机构名称与简称的识别   总被引:2,自引:0,他引:2  
本文提出了一种基于规则识别中文组织机构全称和简称的方法。全称的识别首先借助机构后缀词库获得其右边界,然后通过规则匹配并借助贝叶斯概率模型加以决策获得其左边界。简称的识别是在全称的基础上应用其对应的简称规则实现的。在开放性测试中,该方法的总体查全率为85.19%,查准率为83.03%,F Measure为84.10%;简称的查全率为67.18%,查准率为74.14%。目前该方法已应用于中文关系的抽取系统。  相似文献   

7.
为净化网络环境,需要对网络信息进行审查。针对网络信息中所包含的敏感词,尤其是中文敏感词变形体的识别成为了一个迫切需要解决的问题。通过分析汉字的结构和读音等特征提出了一种中文敏感词变形体的识别方法。该方法针对词的拼音、词的简称和词的拆分三种敏感词变形体分别设计了基于易混拼音分组的敏感词的识别算法(SPGR)、字符串的简称识别算法(SNR)和基于KMP的汉字拆分识别算法(WS-KMP),有效提高了敏感词审查的准确率和效率。实验结果表明,该方法在识别中文敏感词变形体的时候有较高的查全率和查准率。  相似文献   

8.
微博情感倾向性分析旨在发现用户对热点事件的观点态度。由于微博噪声大、新词多、缩写频繁、有自己的固定搭配、上下文信息有限等原因,微博情感倾向性分析是一项有挑战性的工作。该文主要探讨利用卷积神经网络进行微博情感倾向性分析的可行性,分别将字级别词向量和词级别词向量作为原始特征,采用卷积神经网络来发现任务中的特征,在COAE2014任务4的语料上进行了实验。实验结果表明,利用字级别词向量及词级别词向量的卷积神经网络分别取得了95.42%的准确率和94.65%的准确率。由此可见对于中文微博语料而言,利用卷积神经网络进行微博情感倾向性分析是有效的,且使用字级别的词向量作为原始特征会好于使用词级别的词向量作为原始特征。  相似文献   

9.
错别字自动识别是自然语言处理中一项重要的研究任务,在搜索引擎、自动问答等应用中具有重要价值.尽管传统方法在识别文本中多字词错误方面的准确率较高,但由于中文单字词错误具有特殊性,传统方法对中文单字词检错准确率较低.该文提出了一种基于Transformer网络的中文单字词检错方法.首先,该文通过充分利用汉字混淆集和Web网...  相似文献   

10.
基于邻接矩阵全文索引模型的文本压缩技术   总被引:1,自引:0,他引:1  
基于不定长单词的压缩模型的压缩效率高于基于字符的压缩模型,但是它的最优符号集的寻找算法是NP完全问题,本文提出了一种基于贪心算法的计算最小汉字平均熵的方法,发现一个局部最优的单词表。这种方法的关键是将文本的邻接矩阵索引作为统计基础,邻接矩阵全文索引是论文提出的一种新的全文索引模型,它忠实地反映了原始文本,很利于进行原始文本的初步统计,因此算法效率得以提高,其时间复杂度与文本的汉字种数成线性关系,能够适应在线需要。并且,算法生成的压缩模型的压缩比是0.47,比基于字的压缩模型的压缩效率提高25%。  相似文献   

11.
基于决策树的汉语未登录词识别   总被引:13,自引:0,他引:13  
未登录词识别是汉语分词处理中的一个难点。在大规模中文文本的自动分词处理中,未登录词是造成分词错识误的一个重要原因。本文首先把未登录词识别问题看成一种分类问题。即分词程序处理后产生的分词碎片分为‘合’(合成未登录词)和‘分’(分为两单字词)两类。然后用决策树的方法来解决这个分类的问题。从语料库及现代汉语语素数据库中共统计出六类知识:前字前位成词概率、后字后位成词概率、前字自由度、后字自由度、互信息、单字词共现概率。用这些知识作为属性构建了训练集。最后用C4.5算法生成了决策树。在分词程序已经识别出一定数量的未登录词而仍有分词碎片情况下使用该方法,开放测试的召回率:69.42%,正确率:40.41%。实验结果表明,基于决策树的未登录词识别是一种值得继续探讨的方法。  相似文献   

12.
Effective identifier names for comprehension and memory   总被引:1,自引:0,他引:1  
Readers of programs have two main sources of domain information: identifier names and comments. When functions are uncommented, as many are, comprehension is almost exclusively dependent on the identifier names. Assuming that writers of programs want to create quality identifiers (e.g., identifiers that include relevant domain knowledge), one must ask how should they go about it. For example, do the initials of a concept name provide enough information to represent the concept? If not, and a longer identifier is needed, is an abbreviation satisfactory or does the concept need to be captured in an identifier that includes full words? What is the effect of longer identifiers on limited short term memory capacity? Results from a study designed to investigate these questions are reported. The study involved over 100 programmers who were asked to describe 12 different functions and then recall identifiers that appeared in each function. The functions used three different levels of identifiers: single letters, abbreviations, and full words. Responses allow the extent of comprehension associated with the different levels to be studied along with their impact on memory. The functions used in the study include standard computer science textbook algorithms and functions extracted from production code. The results show that full-word identifiers lead to the best comprehension; however, in many cases, there is no statistical difference between using full words and abbreviations. When considered in the light of limited human short-term memory, well-chosen abbreviations may be preferable in some situations since identifiers with fewer syllables are easier to remember.  相似文献   

13.
A comprehensive online unconstrained Chinese handwriting dataset, SCUT-COUCH2009, is introduced in this paper. As a revision of SCUT-COUCH2008 [1], the SCUT-COUCH2009 database consists of more datasets with larger vocabularies and more writers. The database is built to facilitate the research of unconstrained online Chinese handwriting recognition. It is comprehensive in the sense that it consists of 11 datasets of different vocabularies, named GB1, GB2, TradGB1, Big5, Pinyin, Letters, Digit, Symbol, Word8888, Word17366 and Word44208. In particular, the SCUT-COUCH2009 database contains handwritten samples of 6,763 single Chinese characters in the GB2312-80 standard, 5,401 traditional Chinese characters of the Big5 standard, 1,384 traditional Chinese characters corresponding to the level 1 characters of the GB2312-80 standard, 8,888 frequently used Chinese words, 17,366 daily-used Chinese words, 44,208 complete words from the Fourth Edition of “The Contemporary Chinese Dictionary”, 2,010 Pinyin and 184 daily-used symbols. The samples were collected using PDAs (Personal Digit Assistant) and smart phones with touch screens and were contributed by more than 190 persons. The total number of character samples is over 3.6 million. The SCUT-COUCH2009 database is the first publicly available large vocabulary online Chinese handwriting database containing multi-type character/word samples. We report some evaluation results on the database using state-of-the-art recognizers for benchmarking.  相似文献   

14.
基于统计和规则的未登录词识别方法研究   总被引:8,自引:0,他引:8       下载免费PDF全文
周蕾  朱巧明 《计算机工程》2007,33(8):196-198
介绍了一种基于统计和规则的未登录词识别方法.该方法分为2个步骤:(1)对文本进行分词,对分词结果中的碎片进行全切分生成临时词典,并利用规则和频度信息给临时词典中的每个字串赋权值,利用贪心算法获得每个碎片的最长路径,从而提取未登录词;(2)在上一步骤的基础上,建立二元模型,并结合互信息来提取由若干个词组合而成的未登录词(组).实验证明该方法开放测试的准确率达到81.25%,召回率达到82.38%.  相似文献   

15.
信息时代推进盲文数字化, 关乎我国广大盲人文化素质的提高和生活水平的改善. 本文实现了一种基于国家通用盲文标调规则的汉盲转换系统, 能够快速生成海量符合国家通用盲文方案的数字化资源, 满足视障人士无障碍获取信息的需求. 此系统按通用盲文规则处理汉语文本, 将其转换为符合标调规则、简写规则的盲文结果. 测试结果表明, 此系统可以准确处理标调规则、简写规则, 可得到准确的符合国家通用盲文方案的盲文数字化结果. 声调省写覆盖率、韵母简写覆盖率和篇幅增加量均与国家通用盲文方案的理论值相当, 能够快速处理长篇语料文件, 程序执行效率高, 具有实用价值, 可以用于推广国家通用盲文, 促进我国盲文数字化无障碍建设.  相似文献   

16.
缩略语在自然语言中被广泛使用。因其是新词的重要来源之一,成为了自然语言处理领域的一大问题。该文以汉语为对象,研究了从完整形式预测缩略语形式的方法。首先,使用条件随机场模型对完整形式进行序列标注,生成缩略语候选集合。再利用搜索引擎获取网络数据,并通过不同策略利用网络数据对各候选依次评估,结合各项评估分数进行重排序,选择最终的缩略语结果。实验结果表明,增加Web信息之后,缩略语预测的准确率可以提高约五个百分点。  相似文献   

17.
关键词抽取技术是自然语言处理领域的一个研究热点。在目前的关键词抽取算法中,深度学习方法较少考虑到中文的特点,汉字粒度的信息利用不充分,中文短文本关键词的提取效果仍有较大的提升空间。为了改进短文本的关键词提取效果,针对论文摘要关键词自动抽取任务,提出了一种将双向长短时记忆神经网络(Bidirectional Long Shot-Term Memory,BiLSTM)与注意力机制(Attention)相结合的基于序列标注(Sequence Tagging)的关键词提取模型(Bidirectional Long Short-term Memory and Attention Mechanism Based on Sequence Tagging,BAST)。首先使用基于词语粒度的词向量和基于字粒度的字向量分别表示输入文本信息;然后,训练BAST模型,利用BiLSTM和注意力机制提取文本特征,并对每个单词的标签进行分类预测;最后使用字向量模型校正词向量模型的关键词抽取结果。实验结果表明,在8159条论文摘要数据上,BAST模型的F1值达到66.93%,比BiLSTM-CRF(Bidirectional Long Shoft-Term Memory and Conditional Random Field)算法提升了2.08%,较其他传统关键词抽取算法也有进一步的提高。该模型的创新之处在于结合了字向量和词向量模型的抽取结果,充分利用了中文文本信息的特征,可以有效提取短文本的关键词,提取效果得到了进一步的改进。  相似文献   

18.
与拼音文字不同,用户在进行中文输入时需要借助输入法软件完成从拼音串到汉字串的转换过程,输入法因此成为中文用户进行人机交互的基础性工具,而输入法的相关技术研发也一直是学术界与产业界的关注热点。在中文输入法技术的研究中,用户的行为特点对输入法软件的词库建立、算法设计、交互方式设计与性能评价等多方面都有着至关重要的作用,但由于数据获取与分析的困难,这方面的相关研究尚不多见。该文利用某中文输入法在用户许可下收集的超过4.1亿条用户输入行为记录,进行了中文输入法用户行为的分析研究,针对不同类别应用程序的输入词频差异,不同用户在同类应用程序中的不同候选词条的选择等行为特点进行了挖掘分析,研究结果会对深入了解中文输入法用户行为,进而改进输入法软件性能具有一定的指导意义。  相似文献   

19.
藏语自动分词是藏语信息处理的基础性关键问题,而紧缩词识别是藏语分词中的重点和难点。目前公开的紧缩词识别方法都是基于规则的方法,需要词库支持。该文提出了一种基于条件随机场的紧缩词识别方法,并在此基础上实现了基于条件随机场的藏语自动分词系统。实验结果表明,基于条件随机场的紧缩词识别方法快速、有效,而且可以方便地与分词模块相结合,显著提高了藏语分词的效果。  相似文献   

20.
A word w is called synchronizing (recurrent, reset, directable) word of deterministic finite automata (DFA) if w brings all states of the automaton to a unique state. According to the famous conjecture of Cerny from 1964, every n-state synchronizing automaton possesses a synchronizing word of length at most (n - 1)2. The problem is still open. It will be proved that the Cerny conjecture holds good for synchronizing DFA with transition monoid having no involutions and for every n-state (n 〉 2) synchronizing DFA with transition monoid having only trivial subgroups the minimal length of synchronizing word is not greater than (n - 1)2/2. The last important class of DFA involved and studied by Schutzenberger is called aperiodic; its automata accept precisely star-free languages. Some properties of an arbitrary synchronizing DFA were established.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号