Similar Literature
 Found 20 similar documents; search took 31 ms
1.
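Adaptive Chinese word segmentation system for information retrieval   Total citations: 16, self-citations: 0, citations by others: 16
New word identification and ambiguity resolution are important factors affecting the accuracy of information retrieval systems. This paper proposes an adaptive Chinese word segmentation algorithm for information retrieval based on a statistical model. Based on this algorithm, a new word segmentation system, BUAASEISEG, was designed and implemented. It can identify new words of all kinds in any domain, resolve ambiguities, and segment words of any reasonable length. It uses an iterative binary segmentation method, gathers word frequency statistics online from the target document, and uses an offline word frequency dictionary or the inverted index of a search engine to filter candidate words and resolve ambiguities. On top of the statistical model, post-processing with a surname list, a measure word list, and a stop word list further improves accuracy. A comparative evaluation against the well-known ICTCLAS segmentation system on news articles and academic papers shows that BUAASEISEG has clear advantages in new word identification and ambiguity resolution.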
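The abstract above does not spell out how the online frequency statistics are gathered, so the following is only a minimal sketch of one common way to do it: count every character n-gram in the target document and keep the ones that recur as candidate words. The function name `candidate_words` and the parameters `max_len` and `min_count` are assumptions made for illustration; this is not the BUAASEISEG algorithm itself.

```python
from collections import Counter


def candidate_words(text, max_len=4, min_count=2):
    """Count character n-grams (2..max_len) in one document.

    Hypothetical helper: n-grams that occur at least `min_count` times
    are returned as candidate words with their in-document frequency.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

In a fuller pipeline these candidates would then be filtered against an offline frequency dictionary or an inverted index, as the abstract describes; that step is left out here.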

2.
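LIU Chunli, LI Xiaoge, LIU Rui, FAN Xian, DU Liping. Journal of Computer Applications (《计算机应用》), 2016, 36(10): 2794-2798
To improve the accuracy of Chinese word segmentation and the recognition rate for out-of-vocabulary (OOV) words, a Chinese word segmentation system based on character representation learning is proposed. First, the Skip-gram model is used to map the words in the text to vectors in a high-dimensional vector space. Next, the K-means clustering algorithm is used to cluster the word vectors, and the clustering results are used as features for training a conditional random field (CRF) model. Finally, segmentation and OOV word recognition are performed with this language model. The effects of the word vector dimension, the number of clusters, and different clustering algorithms on segmentation are analyzed. Tests were run on the microblog evaluation corpus provided by the 4th Conference on Natural Language Processing and Chinese Computing (NLPCC2015). The experimental results show that, without using external knowledge, the segmentation F-score and the OOV recognition rate reach 95.67% and 94.78% respectively, demonstrating that adding character clustering features to the CRF model can effectively improve segmentation performance on short Chinese texts.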
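The step in which cluster assignments become CRF features can be sketched as follows. This is an illustrative sketch only: it assumes the clustering has already been run and that its result is available as a plain dictionary `cluster_of` from character to cluster ID. The function name, the feature key names, and the `window` parameter are hypothetical and are not taken from the paper.

```python
def char_features(sentence, cluster_of, window=1):
    """Build one feature dict per character for a CRF-style tagger.

    `cluster_of` is an assumed, precomputed mapping from character to
    cluster ID; characters without an entry get the placeholder "UNK".
    """
    features = []
    for i, ch in enumerate(sentence):
        feats = {"char": ch, "cluster": cluster_of.get(ch, "UNK")}
        # Add the cluster IDs of neighbouring characters as context features.
        for offset in range(-window, window + 1):
            if offset == 0:
                continue
            j = i + offset
            if 0 <= j < len(sentence):
                key = f"cluster[{offset:+d}]"
                feats[key] = cluster_of.get(sentence[j], "UNK")
        features.append(feats)
    return features
```

Feature dictionaries of this shape are what typical CRF toolkits accept per token; the actual feature templates used by the authors are not given in the abstract.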

3.
Many online returns are caused by mismatches between what consumers see online and what they actually receive. This paper discusses e-sellers’ use of purchase-risk notices (PRNs) for possible mismatches as a preemptive action to avoid returns. Two one-factor (purchase-risk notice: presence vs. absence) scenario experiments were conducted via two studies (Study 1 and Study 2). The pre-purchase effects and post-purchase effects of PRNs were examined in Study 1 and Study 2, respectively. One-way ANOVAs were used to test the hypotheses. It was found that returns can be avoided by using PRNs without negatively affecting consumers’ purchase intentions. Additionally, using PRNs can make consumers more tolerant of minor mismatches, attract more repurchases, and reduce consumers’ dissatisfaction and regret about purchase decisions.

4.
This paper presents a new high-accuracy technique for recognizing both typewritten and handwritten English and Arabic texts without thinning. After segmenting the text into lines (horizontal segmentation) and the lines into words, it separates each word into its letters. Separating a text line (row) into words and a word into letters is performed using the region growing technique (implicit segmentation) on the basis of three essential lines in a text row. This saves time, as there is no need to skeletonize or to physically isolate letters from the tested word, while the input data involves only the basic information: the scanned text. The baseline is detected, the word contour is defined, and the word is implicitly segmented into its letters according to a novel algorithm described in the paper. The extracted letter with its dots is used as one unit in the recognition system. It is resized into a 9 × 9 matrix by bilinear interpolation after applying a lowpass filter to reduce aliasing. Then the elements are scaled to the interval [0,1]. The resulting array is the input to the designed neural network. For typewritten texts, three Arabic fonts are used: Arial, Arabic Transparent and Simplified Arabic. The results showed an average recognition success rate of 93% for Arabic typewriting. This segmentation approach has also been applied to handwritten text, where words are classified with a relatively high recognition rate for both Arabic and English. The experiments were performed in MATLAB and have shown promising results that can be a good base for further analysis of Arabic and other cursive-language text recognition as well as English handwritten texts. For English handwritten classification, a success rate of about 80% on average was achieved, while for Arabic handwritten text the success rate was about 90%. The recent results have shown increasing success for both Arabic and English texts.
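The resizing and scaling step in the abstract above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the image is a list of rows of grey values, the corner-aligned sampling convention is one choice among several, and the lowpass filter mentioned in the abstract is omitted. The function names `resize_bilinear` and `scale_unit` are hypothetical; the original experiments were done in MATLAB.

```python
def resize_bilinear(img, out_h=9, out_w=9):
    """Resize a 2D list of numbers to out_h x out_w by bilinear interpolation.

    Assumes corner-aligned sampling: output corners map to input corners.
    """
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two bracketing rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def scale_unit(mat):
    """Linearly rescale all elements to the interval [0, 1]."""
    lo = min(min(row) for row in mat)
    hi = max(max(row) for row in mat)
    if hi == lo:
        return [[0.0 for _ in row] for row in mat]
    return [[(v - lo) / (hi - lo) for v in row] for row in mat]
```

Flattening the resulting 9 × 9 matrix gives an 81-element input vector, which matches the fixed-size network input the abstract describes.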

5.
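To build an online library of magnetic resonance images from biological literature, the characteristics of images in online medical literature were analyzed. Image features are extracted with the pyramid histogram of oriented gradients and combined with the text annotations attached to each image, and a classification method based on Gaussian processes is used to design and implement an MRI image recognition system for online biological literature. Experimental results show that the system achieves a higher recognition rate than systems based on a single feature and outperforms recognition methods based on standard SVM and KNN, indicating that the design is feasible, reliable, and effective.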

6.
An information retrieval system has to retrieve all and only those documents that are relevant to a user query, even if index terms and query terms are not matched exactly. However, term mismatches between index terms and query terms have been a serious obstacle to the enhancement of retrieval performance. In this article, we discuss automatic term normalization between words and phrases in text corpora and their application to a Korean information retrieval system. We perform three new types of term normalizations: transliterated word normalization, noun phrase normalization, and context-based term normalization. Transliterated words are normalized into equivalence classes by using contextual similarity to alleviate lexical term mismatches. Then, noun phrases are normalized into phrasal terms by segmenting compound nouns as well as normalizing noun phrases. Moreover, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is done by using the K-means algorithm and cooccurrence clusters are identified to alleviate semantic term mismatches. These term normalizations are used in both the indexing and the retrieval system. The experimental results show that our proposed system can alleviate three types of term mismatches and can also provide the appropriate similarity measurements. As a result, our system can improve the retrieval effectiveness of the information retrieval system.
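The abstract above mentions mutual information as one ingredient of term similarity. As a hedged illustration, the sketch below computes pointwise mutual information (PMI) between term pairs from document-level co-occurrence. Treating each document as the co-occurrence window and the function name `pmi_table` are assumptions; the article's own combination with word context is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return PMI (base 2) for every term pair that co-occurs in a document.

    `documents` is a list of token lists. Probabilities are estimated as
    document frequencies divided by the number of documents.
    """
    n = len(documents)
    term_df = Counter()
    pair_df = Counter()
    for doc in documents:
        terms = sorted(set(doc))  # sort so each pair has one canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return {
        (a, b): math.log2((c / n) / ((term_df[a] / n) * (term_df[b] / n)))
        for (a, b), c in pair_df.items()
    }
```

A positive score means the two terms appear together more often than independence would predict, which is the signal a similarity measure of this kind builds on.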

7.
Many people with disabilities do not have the dexterity necessary to control a joystick on an electric wheelchair. Moreover, they have difficulty avoiding obstacles. The aim of this work is to implement a multi-modal system that controls the movement of an electric wheelchair using a small-vocabulary word recognition system and a set of sensors to detect and avoid obstacles. The methodology adopted is based on combining a microcontroller with a speech recognition development kit for isolated words from a dependent speaker and a set of sensors. Tests have shown that, to save design time, it is better to choose an existing speech recognition kit and adapt it to the application. The text was submitted by the authors in English.

8.
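YIN Hao, XU Jian, LI Shoushan, ZHOU Guodong. Computer Science (《计算机科学》), 2018, 45(Z11): 105-112
Text emotion recognition is a basic task in natural language processing. The task aims to determine, by analyzing a text, whether it contains emotion. For this task, a microblog emotion recognition method based on fused character and word features is proposed. Compared with traditional methods, the proposed method takes the characteristics of microblog language fully into account and exploits fused character-word features to improve recognition performance. Specifically, the microblog text is first represented with character features and with word features separately. An LSTM model (or a bidirectional LSTM model) is then used to extract hidden-layer features from each of the two representations. Finally, the two sets of hidden-layer features are fused to obtain the character-word fusion features used for emotion recognition. Experimental results show that the method achieves better emotion recognition performance.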

9.
Offline handwritten Amharic word recognition   Total citations: 1, self-citations: 0, citations by others: 1
This paper describes two approaches for Amharic word recognition in unconstrained handwritten text using HMMs. The first approach builds word models from concatenated features of constituent characters, and in the second method HMMs of constituent characters are concatenated to form a word model. In both cases, the features used for training and recognition are a set of primitive strokes and their spatial relationships. The recognition system does not require segmentation of characters but requires text line detection and extraction of structural features, which is done by making use of the direction field tensor. The performance of the recognition system is tested on a dataset of unconstrained handwritten documents collected from various sources, and promising results are obtained.

10.
11.
A new method is described and tested for using an unreliable character recognition device to produce a reliable index for a collection of documents. All highly likely substitution errors of the recognition device are handled by transforming characters which confuse readily into the same pseudocharacter. An analysis of the method is done showing the expected precision (fraction of words correctly found to words present) and recall (fraction of words retrieved properly to those which were retrieved). Published substitution error matrices were employed, along with a large file of words and word frequencies to evaluate the method. Performance was surprisingly good. Suggestions for further enhancements are given.
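The pseudocharacter idea above can be sketched as follows: characters that a recognizer tends to confuse are collapsed into one representative before indexing, and queries go through the same mapping. The confusion classes listed here are made-up examples, not the published substitution error matrices the abstract refers to, and all function names are hypothetical.

```python
# Hypothetical confusion classes; each class collapses to its first character.
CONFUSION_CLASSES = ["il1I", "oO0", "5sS"]
PSEUDO = {ch: cls[0] for cls in CONFUSION_CLASSES for ch in cls}


def to_pseudo(word):
    """Replace every confusable character with its class representative."""
    return "".join(PSEUDO.get(ch, ch) for ch in word)


def build_index(docs):
    """Map each pseudo-word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(to_pseudo(word), set()).add(doc_id)
    return index


def lookup(index, query):
    """Find documents whose (possibly misrecognized) text matches the query."""
    return index.get(to_pseudo(query), set())
```

The trade-off is visible in the mapping itself: a misread `bi1l` still matches a query for `bill`, but distinct words that differ only in confusable characters now collide, which costs precision.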

12.
The focus on communications technology in recent years has led to the question of how to best display electronic text onto small-screened devices. Past studies have shown that the compact method of rapid serial visual presentation (RSVP) is efficient but not well liked. Two experiments were conducted to explore ways of improving the preference for and feasibility of RSVP. In experiment 1, the effects of a completion meter, punctuation pauses, and variable word duration were studied. Although the sentence-by-sentence and normal page formats were still superior, post-experiment ratings indicated that punctuation pauses improved user preference for RSVP, and its preference increased in general with practice. In experiment 2, a modified RSVP condition included a completion meter, punctuation pauses, interruption pauses and pauses at clause boundaries. This condition was significantly preferred to a normal RSVP condition. The present enhancements may increase the feasibility of using RSVP with small displays.

13.
Text recognition in natural scene images is a challenging task that has recently been garnering increased research attention. In this paper, we propose a method for recognizing text by utilizing the layout consistency of a text string. We estimate the layout (four lines of a text string) using initial character extraction and recognition results. On the basis of the layout consistency across a word, we perform character extraction and recognition again using the four lines, which is more accurate than the first pass. Our layout estimation method differs from previous methods in that it exploits character recognition results and uses a class-conditional layout model. More accurate and robust estimation is achieved, and it can be used to refine character extraction and recognition. We call this two-way process (from extraction and recognition to layout, and from layout to extraction and recognition) “bidirectional” to distinguish it from previous feedback refinement approaches. Experimental results demonstrate that our bidirectional processes provide a boost to the performance of word recognition.

14.
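The vector space model (VSM) is a method for modeling text with feature vectors and is widely used in text classification, pattern recognition, and related fields. When a text is long, however, traditional VSM modeling can suffer a dimensionality explosion, which makes it inefficient and makes good classification results hard to guarantee. To address the high dimensionality of VSM, a method is proposed that uses word sense and word frequency to reduce the dimensionality of text modeling and thereby improve efficiency and accuracy. A synonym clustering method with optimized polysemous word disambiguation is proposed: after the sense of each polysemous word is determined from its context, feature terms are weighted according to their sense similarity, and feature terms with similar senses are merged. The new method greatly reduces the dimensionality of the feature vectors, and polysemous word disambiguation improves the accuracy of feature extraction. A comparison with other text feature extraction and text classification methods shows that the algorithm is clearly better in both efficiency and accuracy.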
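The dimensionality reduction described above, merging feature terms with similar senses, can be illustrated with a small sketch. It assumes that disambiguation and synonym clustering have already produced a mapping `groups` from each term to a representative term and a similarity weight; that mapping, the weights, and the function name `merged_vector` are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def merged_vector(tokens, groups):
    """Build a term-weight vector in which synonyms share one dimension.

    `groups` maps a term to (representative, similarity_weight). Terms
    without an entry keep their own dimension with weight 1.0.
    """
    vector = defaultdict(float)
    for token in tokens:
        representative, weight = groups.get(token, (token, 1.0))
        vector[representative] += weight
    return dict(vector)
```

For example, with `car` and `auto` both mapped to `automobile`, a four-token text over four distinct terms collapses to a two-dimensional vector.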

15.
An architecture for handwritten text recognition systems   Total citations: 1, self-citations: 1, citations by others: 0
This paper presents an end-to-end system for reading handwritten page images. Five functional modules included in the system are introduced in this paper: (i) pre-processing, which concerns introducing an image representation for easy manipulation of large page images and image handling procedures using the image representation; (ii) line separation, concerning text line detection and extracting images of lines of text from a page image; (iii) word segmentation, which concerns locating word gaps and isolating words from a line of text image obtained efficiently and in an intelligent manner; (iv) word recognition, concerning handwritten word recognition algorithms; and (v) linguistic post-processing, which concerns the use of linguistic constraints to intelligently parse and recognize text. Key ideas employed in each functional module, which have been developed for dealing with the diversity of handwriting in its various aspects with a goal of system reliability and robustness, are described in this paper. Preliminary experiments show promising results in terms of speed and accuracy. Received October 30, 1998 / Revised January 15, 1999

16.
Text Retrieval from Document Images Based on Word Shape Analysis   Total citations: 2, self-citations: 1, citations by others: 2
In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We directly extract image features instead of using optical character recognition. Document images are segmented into word units, and features called vertical bar patterns are then extracted from these word units through local extrema point detection. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images obtained with this method was compared with the result for the ASCII versions of those documents based on the N-gram algorithm for text documents.
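The final similarity step above, a scalar product of document vectors, can be sketched as follows. The extraction of vertical bar patterns is outside the scope of this sketch, so each document is simply given as a list of pattern labels. Normalizing the vectors to unit length is an assumption made here so that scores fall between 0 and 1; the abstract does not say whether the authors normalize. Function names are hypothetical.

```python
import math
from collections import Counter


def document_vector(patterns):
    """Count pattern occurrences and scale the vector to unit length."""
    counts = Counter(patterns)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    return {pattern: v / norm for pattern, v in counts.items()}


def similarity(u, v):
    """Scalar product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(pattern, 0.0) for pattern, weight in u.items())
```

Two documents with identical pattern distributions score 1.0, and documents with no pattern in common score 0.0.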

17.
Humans are able to recognise a word before its acoustic realisation is complete. This is in contrast to conventional automatic speech recognition (ASR) systems, which compute the likelihood of a number of hypothesised word sequences, and identify the words that were recognised on the basis of a trace back of the hypothesis with the highest eventual score, in order to maximise efficiency and performance. In the present paper, we present an ASR system, SpeM, based on principles known from the field of human word recognition, that is able to model the human capability of ‘early recognition’ by computing word activation scores (based on negative log likelihood scores) during the speech recognition process. Experiments on 1463 polysyllabic words in 885 utterances showed that 64.0% (936) of these polysyllabic words were recognised correctly at the end of the utterance. For 81.1% of the 936 correctly recognised polysyllabic words the local word activation allowed us to identify the word before its last phone was available, and 64.1% of those words were already identified one phone after their lexical uniqueness point. We investigated two types of predictors for deciding whether a word is considered as recognised before the end of its acoustic realisation. The first type is related to the absolute and relative values of the word activation, which trade false acceptances for false rejections. The second type of predictor is related to the number of phones of the word that have already been processed and the number of phones that remain until the end of the word. The results showed that SpeM’s performance increases if the amount of acoustic evidence in support of a word increases and the risk of future mismatches decreases.

18.
Three models for word frequency distributions, the lognormal law, the generalized inverse Gauss-Poisson law and the extended generalized Zipf's law, are compared and evaluated with respect to goodness of fit and rationale. Application of these models to frequency distributions of a text, a corpus and morphological data reveals that no model can lay claim to exclusive validity, while inspection of the extrapolated theoretical vocabulary sizes raises doubts as to whether the urn scheme with independent trials is the correct underlying model for word frequency data. The role of morphology in shaping word frequency distributions is discussed, as well as parallelisms between vocabulary richness in literary studies and morphological productivity in linguistics.
R. Harald Baayen received his PhD at the Free University, Amsterdam, where he was involved in research on morphological productivity. He is now at the Max-Planck Institute for Psycholinguistics, Nijmegen, participating in a project on computational modelling of lexical representation and process.

19.
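Research on speech recognition for low-resource Mongolian   Total citations: 1, self-citations: 1, citations by others: 0
ZHANG Aiying, NI Chongjia. Computer Science (《计算机科学》), 2017, 44(10): 318-322
With the development of speech recognition technology, research on speech recognition systems for low-resource languages has attracted wider attention. Taking Mongolian as the target language, this work studies how to use information from other languages to improve recognition performance when resources are scarce (for example, with only 10 hours of annotated speech). Cross-lingual transfer learning based on multilingual deep neural networks, together with features extracted by multilingual deep bottleneck neural networks, yields a more discriminative acoustic model. Large amounts of web page data, obtained through search engines and targeted web crawling, supply text data that strengthens the language model. Multiple recognition results are fused to further improve recognition accuracy. Compared with the baseline system, the fusion of multiple systems reduces the recognition error rate by 12% absolute.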

