首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This research looks at the effects of segment order and segmentation on translation retrieval performance for an experimental Japanese–English translation memory system. We implement a number of both bag-of-words and segment-order-sensitive string comparison methods, and test each over character-based and word-based indexing using n-grams of various orders. To evaluate accuracy, we propose an automatic method which identifies the target-language string(s) which would lead to the optimal translation for a given input, based on analysis of the held-out translation and the current contents of the translation memory. Our results indicate that character-based indexing is superior to word-based indexing, and also that bag-of-words methods are equivalent to segment-order-sensitive methods in terms of accuracy but vastly superior in terms of retrieval speed, suggesting that word segmentation and segment-order sensitivity are unnecessary luxuries for translation retrieval.  相似文献   

2.
基于字表的中文搜索引擎分词系统的设计与实现   总被引:9,自引:0,他引:9  
丁承  邵志清 《计算机工程》2001,27(2):191-192,F003
分析了常用的基于词典的汉语分词方法用于中文搜索引擎开发中的不足,提出基于字表的中文搜索引擎分词系统,并在索引,查询,排除歧义等方面进行了设计和实现。  相似文献   

3.
一种基于字词联合解码的中文分词方法   总被引:9,自引:1,他引:8  
宋彦  蔡东风  张桂平  赵海 《软件学报》2009,20(9):2366-2375
近年来基于字的方法极大地提高了中文分词的性能,借助于优秀的学习算法,由字构词逐渐成为中文分词的主要技术路线.然而,基于字的方法虽然在发现未登录词方面有其优势,却往往在针对表内词的切分效果方面不及基于词的方法,而且还损失了一些词与词之间的信息以及词本身的信息.在此基础上,提出了一种结合基于字的条件随机场模型与基于词的Bi-gram语言模型的切分策略,实现了字词联合解码的中文分词方法,较好地发挥了两个模型的长处,能够有效地改善单一模型的性能,并在SIGHAN Bakeoff3的评测集上得到了验证,充分说明了合理的字词结合方法将有效地提高分词系统的性能,可以更好地应用于中文信息处理的各个方面.  相似文献   

4.
Probability-Based Chinese Text Processing and Retrieval   总被引:1,自引:0,他引:1  
We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results that compare a word-based text processing method with a character-based method. The experimental results also compare a number of term-weighting functions including both single-unit weighting and compound-unit weighting functions.  相似文献   

5.
中文信息的标引是国内信息导航系统实现的基础,汉语分词和语义提取是目前尚未解决的难题。本文比较了信息检索系统中目前主要使用的标引方法,根据国内信息导航系统处理对象的“中文”特征,提出了关键词标引与全文标引相结合的混合标引方法,并给出了具体的实现方法,较好地解决了查全、查准和标引空间的增长问题。文中最后也给出了中文信息标引处理后入库的数据的检索方法。  相似文献   

6.
Text retrieval systems require an index to allow efficient retrieval of documents at the cost of some storage overhead. This paper proposes a novel full-text indexing model for Chinese text retrieval based on the concept of adjacency matrix of directed graph. Using this indexing model, on one hand, retrieval systems need to keep only the indexing data, instead of the indexing data and the original text data as the traditional retrieval systems always do. On the other hand, occurrences of index term are identified by labels of the so-called s-strings where the index term appears, rather than by its positions as in traditional indexing models. Consequently, system space cost as a whole can be reduced drastically while retrieval efficiency is maintained satisfactory. Experiments over several real-world Chinese text collections are carried out to demonstrate the effectiveness and efficiency of this model. In addition to Chinese, The proposed indexing model is also effective and efficient for text retrieval of other Oriental languages, such as Japanese and Korean. It is especially useful for digital library application areas where storage resource is very limited (e.g., e-books and CD-based text retrieval systems).  相似文献   

7.
本文将部分语义信息加入到二元文法中,提出改进的二元文法索引策略。本文应用2-泊松模型的BM25公式在TREC公开数据集上进行了测试。实验表明,改进的二元文法索引策略与基于字的索引策略、基于词的索引策略和基于二元文法的索引策略对比,在主要性能评测参数平均精确率、R-精确率参数上相对较优。  相似文献   

8.
中文分词十年回顾   总被引:56,自引:6,他引:56  
过去的十年间,尤其是2003年国际中文分词评测活动Bakeoff开展以来,中文自动分词技术有了可喜的进步。其主要表现为: (1)通过“分词规范+词表+分词语料库”的方法,使中文词语在真实文本中得到了可计算的定义,这是实现计算机自动分词和可比评测的基础;(2)实践证明,基于手工规则的分词系统在评测中不敌基于统计学习的分词系统;(3)在Bakeoff数据上的评估结果表明,未登录词造成的分词精度失落至少比分词歧义大5倍以上;(4)实验证明,能够大幅度提高未登录词识别性能的字标注统计学习方法优于以往的基于词(或词典)的方法,并使自动分词系统的精度达到了新高。  相似文献   

9.
10.
概率潜在语义检索模型使用统计的方法建立“文档—潜在语义一词”之间概率分布关系并利用这种关系进行检索。本文比较了在概率潜在语义检索模型中不同中文索引技术对检索效果的影响,考察了基于分词、二元和关键词抽取三种不同的索引技术,并和向量空间模型作了对比分析。实验结果表明:在概率潜在语义检索模型中,词的正确切分能提高检索的平均精度。  相似文献   

11.
Extraction and recognition of artificial text in multimedia documents   总被引:1,自引:0,他引:1  
Abstract The systems currently available for contentbased image and video retrieval work without semantic knowledge, i. e. they use image processing methods to extract low level features of the data. The similarity obtained by these approaches does not always correspond to the similarity a human user would expect. A way to include more semantic knowledge into the indexing process is to use the text included in the images and video sequences. It is rich in information but easy to use, e. g. by key word based queries. In this paper we present an algorithm to localise artificial text in images and videos using a measure of accumulated gradients and morphological processing. The quality of the localised text is improved by robust multiple frame integration. A new technique for the binarisation of the text boxes based on a criterion maximizing local contrast is proposed. Finally, detection and OCR results for a commercial OCR are presented, justifying the choice of the binarisation technique.An erratum to this article can be found at  相似文献   

12.
目前比较流行的中文分词方法为基于统计模型的机器学习方法。基于统计的方法一般采用人工标注的句子级的标注语料进行训练,但是这种方法往往忽略了已有的经过多年积累的人工标注的词典信息。这些信息尤其是在面向跨领域时,由于目标领域句子级别的标注资源稀少,从而显得更加珍贵。因此如何充分而且有效的在基于统计的模型中利用词典信息,是一个非常值得关注的工作。最近已有部分工作对它进行了研究,按照词典信息融入方式大致可以分为两类:一类是在基于字的序列标注模型中融入词典特征,而另一类是在基于词的柱搜索模型中融入特征。对这两类方法进行比较,并进一步进行结合。实验表明,这两类方法结合之后,词典信息可以得到更充分的利用,最终无论是在同领域测试和还是在跨领域测试上都取得了更优的性能。  相似文献   

13.
Tsimhoni O  Smith D  Green P 《Human factors》2004,46(4):600-610
A driving simulator experiment was conducted to determine the effects of entering addresses into a navigation system during driving. Participants drove on roads of varying visual demand while entering addresses. Three address entry methods were explored: word-based speech recognition, character-based speech recognition, and typing on a touch-screen keyboard. For each method, vehicle control and task measures, glance timing, and subjective ratings were examined. During driving, word-based speech recognition yielded the shortest total task time (15.3 s), followed by character-based speech recognition (41.0 s) and touch-screen keyboard (86.0 s). The standard deviation of lateral position when performing keyboard entry (0.21 m) was 60% higher than that for all other address entry methods (0.13 m). Degradation of vehicle control associated with address entry using a touch screen suggests that the use of speech recognition is favorable. Speech recognition systems with visual feedback, however, even with excellent accuracy, are not without performance consequences. Applications of this research include the design of in-vehicle navigation systems as well as other systems requiring significant driver input, such as E-mail, the Internet, and text messaging.  相似文献   

14.
中文关系抽取采用基于字符或基于词的神经网络,现有的方法大多存在分词错误和歧义现象,会不可避免的引入大量冗余和噪音,从而影响关系抽取的结果.为了解决这一问题,本文提出了一种基于多粒度并结合语义信息的中文关系抽取模型.在该模型中,我们将词级别的信息合并进入字符级别的信息中,从而避免句子分割时产生错误;借助外部的语义信息对多义词进行建模,来减轻多义词所产生的歧义现象;并且采用字符级别和句子级别的双重注意力机制.实验表明,本文提出的模型能够有效提高中文关系抽取的准确率和召回率,与其他基线模型相比,具有更好的优越性和可解释性.  相似文献   

15.
基于中文题名的计算机辅助标引   总被引:1,自引:0,他引:1  
本文阐述了基于中文文献题名的计算机辅助标引系统的组成结构,并讨论了其中的一些关键技术问题,文章从系统结构设计方面,对该系统的建表模块,目录模块,分词标模块,校对模块,选号打印模块和系统管理模块进行了讨论,并着重讨论了分词标引技术。  相似文献   

16.
This paper introduces a new feature vector for shape-based image indexing and retrieval. This feature classifies image edges based on two factors: their orientations and correlation between neighboring edges. Hence it includes information of continuous edges and lines of images and describes major shape properties of images. This scheme is effective and robustly tolerates translation, scaling, color, illumination, and viewing position variations. Experimental results show superiority of proposed scheme over several other indexing methods. Averages of precision and recall rates of this new indexing scheme for retrieval as compared with traditional color histogram are 1.99 and 1.59 times, respectively. These ratios are 1.26 and 1.04 compared to edge direction histogram.  相似文献   

17.
Sentiment analysis is a text mining task that determines the polarity of a given text, i.e., its positiveness or negativeness. Recently, it has received a lot of attention given the interest in opinion mining in micro-blogging platforms. These new forms of textual expressions present new challenges to analyze text because of the use of slang, orthographic and grammatical errors, among others. Along with these challenges, a practical sentiment classifier should be able to handle efficiently large workloads.The aim of this research is to identify in a large set of combinations which text transformations (lemmatization, stemming, entity removal, among others), tokenizers (e.g., word n-grams), and token-weighting schemes make the most impact on the accuracy of a classifier (Support Vector Machine) trained on two Spanish datasets. The methodology used is to exhaustively analyze all combinations of text transformations and their respective parameters to find out what common characteristics the best performing classifiers have. Furthermore, we introduce a novel approach based on the combination of word-based n-grams and character-based q-grams. The results show that this novel combination of words and characters produces a classifier that outperforms the traditional word-based combination by 11.17% and 5.62% on the INEGI and TASS’15 dataset, respectively.  相似文献   

18.
本文给出一种以词语为索引项的索引文件存储结构,以及基于这种结构的索引查询算法.首先分析中文索引库的分布规律,接着在此基础上设计了一种逆序存储的三层索引结构,这种结构在创建索引时能根据词语频率自动调整存储顺序,最后给出一种基于自动机和逆向最大匹配的索引查询算法.实验系统TIFS将三层索引结构与B树、哈希方法在时间和空间复杂度方面进行对比,结果表明,对于大规模的中文文本检索,三层索引结构的综合效果最好.  相似文献   

19.
全文检索字索引技术的研究与实现   总被引:12,自引:1,他引:12  
针对中文全文检索字表法检索索引的创建,提出了快速的建立方法,并根据中文文体的特点,提出了有效的索引压缩方法。实验表明,使用虚拟内存技术可以大大节省索引的建立时间;采用字节对齐的索相压缩技术,不但可以有效地减少索引占用的磁盘空间,而且可以加快检索时间,索引的空间和时间效率都得以提高。  相似文献   

20.
隐含语义索引及其在中文文本处理中的应用研究   总被引:33,自引:0,他引:33  
信息检索本质上是语义检索,而传统信息检索系统都是基于独立词索引,因此检索效果并不理想,隐含语义索引是一种新型的信息检索模型,它通过奇异值分析,将词向量和文档向量投影到一个低维空间,消减了词和文档之间的语义模糊度,使得文档之间的语义关系更为明晰。实验和理论结果证实了隐含语义索引能够取得更好的检索效果。本文论述了隐含语义索引的理论基础,研究了隐含语义索引在中文文本处理中的应用,包括中文文本检索、中文文本分类和中文文本聚类等。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号