Similar Literature
20 similar documents retrieved.
1.
Information extraction is an important area of data mining. Text information extraction refers to extracting specified information from free text and storing it as structured data in a knowledge base for user queries or further processing. Extraction of person attribute information is an important foundation for building intelligent person-oriented search engines, and structured information is also a data format that computers can understand. The authors propose a method for automatically acquiring person attributes from encyclopedia articles: the part-of-speech information of each attribute value is used to locate it in the encyclopedia free text, extraction rules are discovered statistically, and person attribute information is then obtained from the encyclopedia text by rule matching. Experiments show that the method is effective for extracting person attribute information from encyclopedia text, and the extracted results can be used to build a knowledge base of person attributes.

2.
Existing machine-learning-based intrusion detection research mostly starts from classifier design aimed at improving detection accuracy and rarely analyzes the sample features themselves. This paper proposes an anomaly detection method based on feature extraction: a principal component neural network (PCNN) extracts intrusion features, and an SVM then detects intrusions. The linear principal component neural network is trained with the generalized Hebbian learning rule, and the optimal SVM parameters are obtained by grid search. On the KDD99 dataset, the linear PCNN-SVM is compared with a plain SVM; the results show that PCNN feature extraction effectively reduces the dimensionality of the input data without degrading classifier performance.
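As a rough illustration of this kind of pipeline (a sketch under stated assumptions, not the authors' implementation), the snippet below learns principal components with the generalized Hebbian (Sanger) rule, projects the data onto them, and grid-searches the SVM parameters. The random placeholder arrays stand in for 41-dimensional KDD99-style records, and the learning rate, epoch count, and parameter grid are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

def gha_components(X, n_components, lr=1e-3, epochs=20, seed=0):
    """Learn principal directions with the generalized Hebbian (Sanger) rule."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_components, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            y = W @ x  # component activations for one sample
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Placeholder arrays standing in for KDD99-style feature vectors and labels
X_train = np.random.rand(500, 41); y_train = np.random.randint(0, 2, 500)
X_test = np.random.rand(100, 41)

scaler = StandardScaler().fit(X_train)
W = gha_components(scaler.transform(X_train), n_components=10)

# Project onto the learned components, then grid-search the SVM parameters
Z_train = scaler.transform(X_train) @ W.T
Z_test = scaler.transform(X_test) @ W.T
grid = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": [0.1, 0.01, 0.001]}, cv=3)
grid.fit(Z_train, y_train)
pred = grid.predict(Z_test)
```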

3.
Opinions carry important information in text, and comparative sentences are a common sentence pattern in opinionated reviews. For the problem of recognizing Chinese comparative sentences, this paper proposes and evaluates a method that combines rules and statistics. The corpus and its word segmentation results are first normalized, and a coarse extraction is performed by combining a dictionary of comparative feature words with syntactic structure templates and dependency relations. A CSR rule extraction algorithm is then designed, and CRFs are used to mine entity objects and semantic role information. Finally, an SVM classifier is trained with different feature dimensionalities to find the feature configuration that yields the best performance and to complete the fine extraction.

4.
For large-scale heterogeneous web pages, web information extraction methods based on visual features generally suffer from poor generality and low extraction efficiency. To address the generality problem, this paper proposes WEMLVF, a supervised machine-learning framework for web information extraction based on visual features. The framework generalizes well, and its effectiveness is verified by information extraction experiments on forum sites and news comment sites. To address the low extraction efficiency caused by the high time cost of extracting visual features, the paper then uses WEMLVF to propose two methods for automatically generating extraction templates, one based on XPath and the other on the classic wrapper induction algorithm SoftMealy. Both methods use visual features to generate the templates automatically, but the templates themselves do not contain visual features, so no visual features need to be extracted when the templates are applied. This makes full use of visual features for information extraction while significantly improving extraction efficiency, as the experimental results confirm.

5.
赵江江  秦兵 《电脑学习》2012,2(1):16-17,20
A Chinese event argument extraction system is implemented using a BootStrapping-based method, in which event argument extraction is formulated as a pattern matching problem. An initial seed set is first constructed; the BootStrapping method is then introduced to build a pattern set, and pattern matching is used to extract event arguments. During pattern construction, a BestMatch-based pattern generalization algorithm is proposed [1]: any two event instance patterns [2] are matched, their matching cost is computed, and they are generalized, which improves the coverage of the patterns. The implemented system achieves good results in tests on the ACE 2005 corpus.

6.
To improve the accuracy of sentiment polarity classification of online review texts and help consumers and merchants obtain information accurately and efficiently, this paper proposes a sentiment classification model for online reviews that combines a semantic-rule method with deep learning. Semantic rule information based on a sentiment lexicon is expanded and embedded into commonly used feature templates to form more effective hybrid feature templates; the Fisher discriminant criterion is then used to reduce the dimensionality of the hybrid feature templates and eliminate redundant information among features; the deep learning model is an RNN improved with LSTM, and data crawled from the web are fed into the model for training and testing. The results show that the features extracted by the semantic rules contain more, and more accurate, sentiment information, so the hybrid feature templates can cover the sentiment feature granularity of the text more comprehensively; the Fisher criterion effectively identifies highly discriminative low-dimensional text features, further improving the classification performance of the improved RNN model on review texts.
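As an illustration of the dimensionality-reduction step only, the following sketch computes a Fisher-style discriminant score per feature (between-class scatter over within-class scatter) and keeps the top-ranked features. The exact criterion used in the paper, the placeholder data, and the cutoff of 50 features are assumptions.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher criterion: between-class scatter over within-class scatter."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

# Placeholder hybrid-template features and polarity labels
X = np.random.rand(200, 300)
y = np.random.randint(0, 2, 200)

top_k = np.argsort(fisher_scores(X, y))[::-1][:50]  # indices of the 50 most discriminative features
X_reduced = X[:, top_k]
```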

7.
Sentence boundary detection is an important and fundamental task in Tibetan information processing. This paper proposes a method that combines maximum entropy with rules to identify Tibetan sentence boundaries. A Tibetan boundary word list and rules are first used to identify sentence boundaries, and a maximum entropy model is then applied to the ambiguous boundaries that the rules cannot resolve. The method makes effective use of Tibetan sentence boundary rules to reduce the boundary misjudgments that the maximum entropy model makes when the training corpus is sparse or of low quality. Experiments show that the proposed method performs well, with an F1 value of 97.78%.

8.
Building a Tibetan dependency treebank is an important foundation for Tibetan syntactic parsing and is valuable for both Tibetan linguistics and information processing. This paper proposes a method for constructing a Tibetan dependency treebank based on treebank conversion. The previously built Tibetan phrase-structure treebank is first expanded; conversion rules are then designed according to the characteristics of Tibetan phrase-structure trees and dependency trees to perform a preliminary conversion from phrase-structure trees to dependency trees; finally, the automatic conversion results are manually verified, yielding 22,000 Tibetan dependency trees. To evaluate the conversion quantitatively, 5% of the dependency trees were sampled and their dependency relations were checked and counted: the accuracy of the dependency relations reached 89.36%, and the accuracy of the head words reached 92.09%. In addition, a neural-network-based parsing model was used to verify the usefulness of the treebank, achieving a UAS of 83.62% and an LAS of 81.90%. The study shows that this semi-automatic treebank conversion method can effectively accomplish the construction of a Tibetan dependency treebank.

9.
Protein relation extraction has broad application value in many areas of the life sciences. However, machine-learning-based protein relation extraction methods generally stop at binary relation extraction and lose the rich relation-type information, while rule-based open information extraction methods can extract complete protein relations ("protein 1, relation word, protein 2") but have low recall. To address these problems, this paper proposes a protein relation extraction framework that combines machine learning and rules. The framework first uses machine learning for named entity recognition and binary relation extraction, and then uses syntactic templates and dictionary matching to extract the relation word that expresses the relation type between the two proteins. The method achieves an F-score of 40.18% on the AImed corpus, far higher than the rule-based Stanford Open IE method.

10.
Automatic word segmentation is a fundamental and key problem in Tibetan information processing, and recognizing contracted words is a major difficulty in Tibetan segmentation. Currently published contraction recognition methods are all rule-based and require lexicon support. This paper proposes a contracted-word recognition method based on conditional random fields (CRFs) and, on this basis, implements a CRF-based Tibetan word segmentation system. Experimental results show that the CRF-based contracted-word recognition method is fast and effective, can be conveniently combined with the segmentation module, and significantly improves Tibetan word segmentation.
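A hedged sketch of CRF-based sequence labelling in the spirit of this approach, using the third-party sklearn-crfsuite package. The syllable features, the single toy "sentence", and the tag set (with "C" marking a contracted syllable) are illustrative assumptions, not the authors' feature design or annotation scheme.

```python
import sklearn_crfsuite  # third-party package: pip install sklearn-crfsuite

def syllable_features(syllables, i):
    """Simple context-window features for the i-th syllable of a sentence."""
    feats = {"syl": syllables[i], "is_first": i == 0, "is_last": i == len(syllables) - 1}
    if i > 0:
        feats["prev_syl"] = syllables[i - 1]
    if i + 1 < len(syllables):
        feats["next_syl"] = syllables[i + 1]
    return feats

# Toy, hand-made "annotated" sentence: segmentation tags plus a dedicated tag "C"
# for a contracted syllable. Real training data would be an annotated Tibetan corpus.
train_sents = [(["ཀ", "ཁ", "ག"], ["B", "I", "C"])]

X_train = [[syllable_features(s, i) for i in range(len(s))] for s, _ in train_sents]
y_train = [tags for _, tags in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```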

11.
The matrix, as an extended pattern representation to the vector, has proven to be effective in feature extraction. However, the subsequent classifier following the matrix-pattern-oriented feature extraction is generally still based on the vector pattern representation (namely, MatFE + VecCD), where it has been demonstrated that the effectiveness in classification just attributes to the matrix representation in feature extraction. This paper looks at the possibility of applying the matrix pattern representation to both feature extraction and classifier design. To this end, we propose a so-called fully matrixized approach, i.e., the matrix-pattern-oriented feature extraction followed by the matrix-pattern-oriented classifier design (MatFE + MatCD). To more comprehensively validate MatFE + MatCD, we further consider all the possible combinations of feature extraction (FE) and classifier design (CD) on the basis of patterns represented by matrix and vector respectively, i.e., MatFE + MatCD, MatFE + VecCD, just the matrix-pattern-oriented classifier design (MatCD), the vector-pattern-oriented feature extraction followed by the matrix-pattern-oriented classifier design (VecFE + MatCD), the vector-pattern-oriented feature extraction followed by the vector-pattern-oriented classifier design (VecFE + VecCD) and just the vector-pattern-oriented classifier design (VecCD). The experiments on the combinations have shown the following: 1) the designed fully matrixized approach (MatFE + MatCD) has an effective and efficient performance on those patterns with the prior structural knowledge such as images; and 2) the matrix gives us an alternative feasible pattern representation in feature extraction and classifier designs, and meanwhile provides a necessary validation for "ugly duckling" and "no free lunch" theorems.

12.
To deal with the noisy, high-dimensional data encountered in fault diagnosis, a fast feature extraction method is used to reduce the dimensionality of fault data; the method weights each feature signal according to its mean and variance. Using the pattern classification capability of support vector machines, a multi-fault classifier based on this feature extraction is constructed. An example shows that the method reduces the data dimensionality and the computational complexity while preserving the diagnostic performance.
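The abstract does not give the exact weighting formula, so the sketch below simply combines each feature's mean and variance into a weight, keeps the highest-weight features, and trains a multi-class SVM. The formula, the placeholder data, and the choice of 16 retained features are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

def mean_var_weights(X):
    """Weight each feature by a simple combination of its mean and variance (illustrative)."""
    w = np.abs(X.mean(axis=0)) * X.var(axis=0)
    return w / (w.sum() + 1e-12)

# Placeholder vibration-style feature matrix with four fault classes
X_train = np.random.rand(300, 64)
y_train = np.random.randint(0, 4, 300)

weights = mean_var_weights(X_train)
keep = np.argsort(weights)[::-1][:16]  # retain the 16 highest-weight features

clf = SVC(kernel="rbf", decision_function_shape="ovr")  # multi-fault classifier on reduced data
clf.fit(X_train[:, keep], y_train)
```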

13.
An SVM Color Image Segmentation Method Based on Automatic Training Sample Selection
张荣  王文剑  白雪飞 《计算机科学》2012,39(11):267-271
Image segmentation is an important research topic in pattern recognition, image understanding, and computer vision. Support Vector Machine (SVM) based methods are now widely used for image segmentation, but their training samples are mostly selected manually, which reduces the adaptivity of the segmentation and affects the classification performance of the SVM. This paper proposes an SVM color image segmentation method in which the training samples are selected automatically: the Fuzzy C-Means (FCM) clustering algorithm is first used to obtain training samples automatically; the color and texture features of the image are then extracted and used as the feature attributes of the SVM training samples for training; finally, the trained classifier segments the image. Experimental results show that the proposed method achieves good segmentation results.
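A compact sketch of the described pipeline under stated assumptions: a plain Fuzzy C-Means implementation labels pixels, the highest-membership pixels become SVM training samples, and the trained SVM segments the image. Only colour features are used here (the paper also uses texture features), the random image is a placeholder, and the top-quartile membership cutoff is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def fuzzy_cmeans(X, c=2, m=2.0, iters=50, seed=0):
    """Plain Fuzzy C-Means; returns cluster centres and the membership matrix U (n_samples x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
    return centres, U

# Placeholder RGB image; colour values per pixel are the only features used here
image = np.random.rand(64, 64, 3)
pixels = image.reshape(-1, 3)

_, U = fuzzy_cmeans(pixels, c=2)
labels = U.argmax(axis=1)
confident = U.max(axis=1) >= np.quantile(U.max(axis=1), 0.75)  # highest-membership pixels as training samples

svm = SVC(kernel="rbf", gamma="scale")
svm.fit(pixels[confident], labels[confident])
segmented = svm.predict(pixels).reshape(image.shape[:2])
```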

14.
Language style is an important aspect assessed in reading comprehension questions on the Chinese college entrance examination (Gaokao), but different question types require different levels of classification granularity, so this paper casts language style appreciation as a hierarchical classification problem. Guided by the class labels, a graph partitioning algorithm is used to obtain the original clusters corresponding to specific classes. Based on these clusters, hierarchical clustering is used to obtain a hierarchy of language style classes, and an SVM hierarchical classifier is then trained on this hierarchy. When answering a language style appreciation question, the required classification level is determined from the question stem, the SVM hierarchical classifier determines the language style of the reading material, and the answer is finally generated with the help of a knowledge base. Experimental results show that this hierarchy-based language style classification method can provide technical support for answering appreciation-type Gaokao questions.

15.
Brain–machine interfaces are systems that allow the control of a device such as a robot arm through a person’s brain activity; such devices can be used by disabled persons to enhance their life and improve their independence. This paper is an extended version of a work that aims at discriminating between left and right imagined hand movements using a support vector machine (SVM) classifier to control a robot arm in order to help a person find an object in the environment. The main focus here is to search for the features that most efficiently describe the electroencephalogram data during such imagined gestures by comparing two feature extraction methods, namely the continuous wavelet transform (CWT) and the empirical mode decomposition (EMD), combined with principal component analysis (PCA) and fed to a linear and a radial basis function (RBF) kernel SVM classifier. The experimental results showed high performance, achieving an average accuracy across all subjects of 92.75% with an RBF kernel SVM classifier using CWT and PCA, compared to 80.25% accuracy obtained with EMD and PCA. The proposed system has been implemented and tested using data collected from five male subjects, and it enabled the control of the robot arm in the right and the left direction.
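A minimal sketch of the CWT + PCA + RBF-SVM variant using PyWavelets and scikit-learn; the Morlet wavelet, the scale range, the number of principal components, and the random placeholder trials are assumptions rather than the paper's exact settings.

```python
import numpy as np
import pywt  # PyWavelets
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def cwt_features(trial, scales=np.arange(1, 31), wavelet="morl"):
    """Mean wavelet power per scale for every EEG channel, concatenated into one vector."""
    feats = []
    for channel in trial:  # trial has shape (n_channels, n_samples)
        coefs, _ = pywt.cwt(channel, scales, wavelet)
        feats.append((coefs ** 2).mean(axis=1))
    return np.concatenate(feats)

# Placeholder trials: 40 epochs, 3 channels, 256 samples; labels 0 = left, 1 = right imagery
trials = np.random.randn(40, 3, 256)
labels = np.random.randint(0, 2, 40)

X = np.array([cwt_features(t) for t in trials])
model = make_pipeline(PCA(n_components=10), SVC(kernel="rbf", gamma="scale"))
model.fit(X, labels)
print(model.predict(X[:5]))
```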

16.
Word vectors play an important role in many areas of natural language processing research. From a linguistic perspective, this paper discusses the relationship between word vector techniques and linguistic theory and, based on the properties of word vectors, proposes using Tibetan word vectors to build a knowledge base of semantically similar words. Starting from the Harbin Institute of Technology Cilin (《词林》) thesaurus, entries are mapped to Tibetan via a Chinese-Tibetan bilingual dictionary; after obtaining the word vectors of the translated words, the difference between each translated word's vector and the average vector of its atomic word group is computed, and these differences are used to automatically filter out words with low semantic similarity to the atomic word group. Word vectors are computed at both the word level and the syllable level in Tibetan, and the words that do not belong to an atomic word group are filtered out automatically. A comparison of the automatic and manual filtering results shows high agreement, indicating that the word vector computations are highly consistent with human linguistic intuition. Overall, the method adopted here helps improve the efficiency of building a Tibetan knowledge base of semantically similar words.
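As a rough illustration of the filtering step only: the sketch below compares each word vector in an atomic word group with the group's average vector and drops the outliers. Cosine similarity to the centroid stands in for the difference measure described above, and the threshold and random placeholder vectors are assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def filter_group(group_vectors, threshold=0.4):
    """Keep only the words whose vector stays close to the average vector of the atomic word group."""
    centroid = np.mean(list(group_vectors.values()), axis=0)
    return {w: v for w, v in group_vectors.items() if cosine(v, centroid) >= threshold}

# Placeholder vectors standing in for the Tibetan translations of one Cilin atomic word group
group = {
    "word_a": np.random.rand(100),
    "word_b": np.random.rand(100),
    "word_c": np.random.rand(100),
}
kept = filter_group(group)
```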

17.
SVM+BiHMM: A Hybrid Statistical Model for Metadata Extraction
张铭  银平  邓志鸿  杨冬青 《软件学报》2008,19(2):358-368
This paper proposes SVM+BiHMM, a hybrid method for automatic metadata extraction based on support vector machines (SVM) and a bigram HMM (bigram hidden Markov model, BiHMM). The BiHMM modifies the emission probability computation of the HMM by distinguishing the initial emission probability from the within-state emission probability while keeping the model structure unchanged. In the SVM+BiHMM composite model, a document is first coarsely split into header, body, and reference sections by rules; an SVM model then classifies text blocks into metadata subclasses; the SVM classification results are then mapped through a sigmoid function to fit and adjust the word emission probabilities of the BiHMM model; finally, the composite model performs the metadata extraction. The SVM effectively accounts for the relations between blocks, and the BiHMM model fully considers the position of words within states, so the two extraction results complement and correct each other well. Evaluation results show that the SVM+BiHMM algorithm outperforms the other methods.

18.
In this paper, a Texas Instruments TMS320C6713 DSP based real-time speech recognition system using a Modified One Against All Support Vector Machine (SVM) classifier is proposed. The major contributions of this paper are the study and evaluation of the classifier's performance using three feature extraction techniques and a proposal for minimizing the classifier's computation time. From this study, it is found that recognition accuracies of 93.33%, 98.67% and 96.67% are achieved for the classifier using Mel Frequency Cepstral Coefficients (MFCC) features, zerocrossing (ZC) and zerocrossing with peak amplitude (ZCPA) features respectively. To reduce the computation time required by the systems, two techniques are proposed – one using an optimum threshold technique for the SVM classifier and another using linear assembly. The ZC based system requires the least computation time, and the above techniques reduce the execution time by factors of 6.56 and 5.95 respectively. For the purpose of comparison, the speech recognition system is also implemented using an Altera Cyclone II FPGA with a Nios II soft processor and custom instructions. Of the two approaches, the DSP approach requires 87.40% fewer clock cycles. Custom design of the recognition system on the FPGA without using the soft-core processor would have resulted in less computational complexity. The proposed classifier is also found to reduce the number of support vectors by a factor of 1.12–3.73 when applied to speaker identification and isolated letter recognition problems. The techniques proposed here can be adapted for various other SVM based pattern recognition systems.
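A hedged sketch of the feature-plus-classifier part of such a system on a general-purpose machine, not the TMS320C6713 implementation: librosa computes MFCCs and a standard one-against-all SVM is trained. The synthetic signals, sampling rate, and class count are placeholders, and the plain SVC is a stand-in rather than the modified classifier proposed in the paper.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_vector(signal, sr=8000, n_mfcc=13):
    """Average MFCC vector of one utterance (librosa defaults, not the paper's DSP front end)."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# Synthetic signals standing in for isolated-word recordings; five word classes assumed
signals = [np.random.randn(8000) for _ in range(20)]
labels = np.random.randint(0, 5, 20)

X = np.array([mfcc_vector(s) for s in signals])
clf = SVC(kernel="rbf", decision_function_shape="ovr")  # plain one-against-all SVM, not the modified variant
clf.fit(X, labels)
print(clf.predict(X[:3]))
```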

19.
In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words as they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams bring syntactic knowledge into machine learning methods, although prior parsing is necessary for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As baselines we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
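A small illustration of how sn-grams follow dependency arcs instead of surface word order, on a toy parse of "John eats green apples"; the head map is hand-written for the example, is not tied to any particular parser, and is only meant to show the path-following construction.

```python
# Toy dependency parse of "John eats green apples": child -> head, with 0 as the artificial root.
heads = {1: 2, 2: 0, 3: 4, 4: 2}
words = {1: "John", 2: "eats", 3: "green", 4: "apples"}

def children(node):
    return [c for c, h in heads.items() if h == node]

def paths_from(node, n):
    """All length-n paths that start at `node` and follow head-to-dependent arcs."""
    if n == 1:
        return [[words[node]]]
    out = []
    for child in children(node):
        out += [[words[node]] + tail for tail in paths_from(child, n - 1)]
    return out

sn_bigrams = [g for node in words for g in paths_from(node, 2)]
print(sn_bigrams)  # [['eats', 'John'], ['eats', 'apples'], ['apples', 'green']]
```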

20.
In this paper, we introduce the backoff hierarchical class n-gram language models to better estimate the likelihood of unseen n-gram events. This multi-level class hierarchy language modeling approach generalizes the well-known backoff n-gram language modeling technique. It uses a class hierarchy to define word contexts. Each node in the hierarchy is a class that contains all the words of its descendant nodes. The closer a node is to the root, the more general the class (and context) is. We investigate the effectiveness of the approach to model unseen events in speech recognition. Our results illustrate that the proposed technique outperforms backoff n-gram language models. We also study the effect of the vocabulary size and the depth of the class hierarchy on the performance of the approach. Results are presented on the Wall Street Journal (WSJ) corpus using two vocabulary sets: 5000 words and 20,000 words. Experiments with the 5000-word vocabulary, which contains a small number of unseen events in the test set, show up to 10% improvement in unseen-event perplexity when using the hierarchical class n-gram language models. With a vocabulary of 20,000 words, characterized by a larger number of unseen events, the perplexity of unseen events decreases by 26%, while the word error rate (WER) decreases by 12% when using the hierarchical approach. Our results suggest that the largest gains in performance are obtained when the test set contains a large number of unseen events.
