1.
Eyas El-Qawasmeh 《Information Processing Letters》2004,92(5):257-265
Word prediction methodologies depend heavily on the statistical approach that uses unigrams, bigrams, and trigrams of words. However, constructing the N-gram model requires a very large amount of memory, which is beyond the capability of many existing computers. Besides this, approximation reduces the accuracy of word prediction. In this paper, we suggest using a cluster of computers to build an Optimal Binary Search Tree (OBST) that will be used for the statistical approach in word prediction. The OBST will contain extra links so that the bigrams and trigrams of the language are represented. In addition, we suggest incorporating other enhancements to achieve optimal word prediction performance. Our experimental results showed that the suggested approach improves keystroke savings.
2.
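A self-word-forming phonetic-to-character conversion algorithm based on single-character statistical bigrams Total citations: 3, self-citations: 0, citations by others: 3
Phonetic-to-character conversion plays an important role in both speech recognition and keyboard input of Chinese sentences. The currently popular approach is conversion based on Markov models trained on large corpora. Among these, conversion algorithms based on character N-grams need little data and are simple, but their conversion accuracy is relatively low, while algorithms based on word N-grams are exactly the opposite. Building on the character-level bigram algorithm, this paper proposes a self-word-forming conversion method that not only keeps the small space footprint of the character bigram method but can also make full use of the word-based Bi…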
3.
Chinese Phonetic-Character Conversion (CPCC) is an important issue in Chinese speech recognition and Chinese sentence keyboard input systems. Approaches based on Markov language models (such as bigram and trigram models) estimated from large corpora are increasingly popular. This paper presents an improved Chinese word bigram, the space-compressed Chinese word bigram, which stores the bi-word co-articulation frequency in the form of the bi-character co-articulation frequency. The bi-word co-articulation frequency is estimated from the bi-character co-articulation frequency library. The CPCC experiment with the improved Chinese word bigram shows that it reaches a higher correct conversion ratio while occupying less space.
4.
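Uyghur bigram text feature extraction Total citations: 1, self-citations: 0, citations by others: 1
Text feature representation is the most important step in automatic text classification. In text representation based on the vector space model (VSM), the choice of feature-unit granularity directly affects classification performance. In Uyghur text classification, single-word features do not adequately represent text content. To address this, after analyzing the role of Uyghur bigrams in text classification, the authors construct a new statistic, CHIMI, and on that basis propose a Uyghur bigram feature extraction algorithm. The extracted bigrams are used as text features, and classification experiments on Uyghur text are run with a support vector machine (SVM). The results show that, compared with classification using words as features, bigram features improve both precision and recall of Uyghur text classification, and the experiments confirm the effectiveness of the algorithm.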
5.
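Research on bigram-based feature word extraction and automatic classification Total citations: 1, self-citations: 1, citations by others: 1
Wang Xiaomin (王笑旻) 《Computer Engineering and Applications》 (计算机工程与应用) 2005,41(22):177-179,210
Automatic text classification by means of computer information processing is a topic of shared interest across natural language understanding research. This paper proposes a dictionary-free, bigram-based method for extracting feature words from Chinese text, and applies mutual information to the extracted feature words to improve extraction accuracy. In addition, a support vector machine, which is based on statistical learning theory and the structural risk minimization principle, is used to classify a set of texts, verifying the effectiveness and feasibility of the feature words produced by the proposed algorithm.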
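The abstract above does not give its formulas, so the following is only a minimal sketch of one common way to combine dictionary-free bigram extraction with mutual information: adjacent character pairs are counted and scored by pointwise mutual information (PMI). The function name `bigram_pmi`, the `min_count` threshold, and the choice of PMI as the mutual-information measure are assumptions for illustration, not the paper's method.

```python
import math
from collections import Counter


def bigram_pmi(texts, min_count=2):
    """Score adjacent character pairs by pointwise mutual information.

    Hypothetical helper: no dictionary or word segmentation is used;
    every pair of neighbouring non-space characters is a candidate.
    """
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        chars = [c for c in text if not c.isspace()]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:  # drop rare pairs, whose PMI is unreliable
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[a + b] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Pairs with a high score occur together more often than their individual frequencies would predict, which makes them candidates for feature words; a real system would also need to handle punctuation and choose the threshold empirically.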
6.
7.
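Applied research on part-of-speech tagging based on the hidden Markov model (HMM) Total citations: 3, self-citations: 0, citations by others: 3
A hidden Markov model (HMM) is used for part-of-speech tagging of English text. The paper first describes an improvement to the Viterbi algorithm and the steps for training with the HMM-based method. A series of comparative experiments then yields two conclusions: the bigram model offers a more satisfactory cost-performance ratio than the trigram model, and the size of the part-of-speech tag set affects tagging accuracy. Finally, closed and open tests are carried out on the basis of these conclusions.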
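For orientation, here is a sketch of the standard Viterbi decoder for a bigram (first-order) HMM tagger. It is the textbook algorithm, not the improved variant the abstract mentions, whose details are not given. The function name, the dictionary-based probability tables, and the `floor` value used in place of zero probabilities are all assumptions.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a bigram HMM.

    start_p[t], trans_p[prev][t] and emit_p[t][word] are plain dicts of
    probabilities; missing entries are treated as `floor` (an assumption
    standing in for proper smoothing).
    """
    if not words:
        return []

    def lg(p):
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best path ending in tag t at word i
    best = [{t: lg(start_p.get(t, 0)) + lg(emit_p[t].get(words[0], 0))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lg(trans_p[p].get(t, 0))) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + lg(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

Each word costs on the order of |tags|² operations here; a trigram model conditions on two previous tags and so raises that to roughly |tags|³, which is one way to read the cost-performance trade-off the abstract reports.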
8.
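Research on a statistical word sense disambiguation method based on Chinese bigram co-occurrence Total citations: 4, self-citations: 0, citations by others: 4
Using 《汉语义词词林》 and an English-Chinese bilingual corpus, the single-token translations in an English-Chinese dictionary were expanded through bilingual alignment; bigram co-occurrence counts of bilingual phrases were collected over a large-scale Chinese corpus, with a B+ tree algorithm as the backbone.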
9.
《Measurement》2014
With the rapid expansion of e-commerce over the decades, user-generated content in the form of reviews has grown enormously on the Web. A need to organize e-commerce reviews arises to help users and organizations make informed decisions about products. Opinion mining systems based on machine learning approaches are used online to categorize customer opinions into positive or negative reviews. Unlike previous approaches that employed a single rule-based or statistical technique, we propose a hybrid machine learning approach built under the framework of a combination (ensemble) of classifiers, with principal component analysis (PCA) as a feature reduction technique. This paper introduces two hybrid models, PCA with bagging and PCA with Bayesian boosting, for feature-based opinion classification of product reviews. The results are compared with two individual classifier models based on statistical learning, logistic regression (LR) and support vector machine (SVM). We found that the hybrid methods do better on four quality measures (misclassification rate, correctness, completeness, and effectiveness) in classifying opinions as positive or negative.
10.
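To speed up string matching in off-line mode by building a fast Chinese index structure, this paper proposes a Chinese index based on two-level hashing of bigrams. The index handles Chinese characters in GB2312 encoding, uses Chinese bigrams as vocabulary terms, and stores the vocabulary in a two-level hash structure. Experimental data show that although the index takes more than twice the storage of a word index, it matches more than four times as fast. The results indicate that the index has a speed advantage for Chinese matching.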
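The abstract does not describe the hash layout, so the following is only a rough sketch of the general idea, assuming the first character of a bigram selects a first-level bucket and the second character selects a posting list of text offsets. It uses nested Python dicts keyed by characters rather than GB2312 codes; `build_index` and `find` are hypothetical names.

```python
from collections import defaultdict


def build_index(text):
    """Map first char -> second char -> start offsets of that bigram."""
    index = defaultdict(lambda: defaultdict(list))
    for pos in range(len(text) - 1):
        index[text[pos]][text[pos + 1]].append(pos)
    return index


def find(text, index, pattern):
    """Return the start offsets of `pattern` (two or more characters).

    The leading bigram is looked up through the two hash levels, and each
    candidate offset is then verified against the full pattern.
    """
    if len(pattern) < 2:
        raise ValueError("pattern must contain at least two characters")
    level2 = index.get(pattern[0])
    if level2 is None:
        return []
    candidates = level2.get(pattern[1], [])
    return [pos for pos in candidates if text.startswith(pattern, pos)]
```

Because every adjacent pair of characters is indexed, no word segmentation is needed at build or query time, at the cost of a larger index than one keyed by words.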