Similar Literature
 Found 20 similar documents; search took 31 ms
1.
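An English text retrieval algorithm is proposed that extracts singular value vectors from the text as complex feature vectors and uses the cosine similarity between vectors as the similarity measure for retrieval. Experimental results show that the algorithm outperforms the traditional LSA algorithm in both retrieval accuracy and computational efficiency.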
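
The cosine-similarity ranking step named in this abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes real-valued feature vectors that have already been extracted, and the function names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length real vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by decreasing similarity to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

Cosine similarity ignores vector length, so a long and a short document with the same direction in feature space score the same against a query.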

2.
An essay-based discourse analysis system can help students improve their writing by identifying relevant essay-based discourse elements in their essays. Our discourse analysis software, which is embedded in Criterion, an online essay evaluation application, uses machine learning to identify discourse elements in student essays. The system makes decisions that exemplify how teachers perform this task. For instance, when grading student essays, teachers comment on the discourse structure. Teachers might explicitly state that the essay lacks a thesis statement or that an essay's single main idea has insufficient support. Training the systems to model this behavior requires human judges to annotate a data sample of student essays. The annotation schema reflects the highly structured discourse of genres such as persuasive writing. Our discourse analysis system uses a voting algorithm that takes into account the discourse labeling decisions of three independent systems.
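
The abstract does not say how the three systems' decisions are combined. One plausible reading is a simple per-sentence majority vote, sketched below under that assumption; the label names and the functions `majority_vote` and `vote_per_sentence` are hypothetical.

```python
from collections import Counter


def majority_vote(labels, fallback=None):
    """Return the label chosen by most systems, or `fallback` on a tie."""
    counts = Counter(labels).most_common()
    if not counts:
        return fallback
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback  # no single winner
    return counts[0][0]


def vote_per_sentence(system_outputs, fallback="unlabeled"):
    """Combine per-sentence labels from several systems by majority vote.

    `system_outputs` holds one label list per system, all of equal length.
    """
    return [majority_vote(sentence_labels, fallback)
            for sentence_labels in zip(*system_outputs)]
```

With three voters and more than two possible labels, a three-way disagreement is possible, which is why the sketch takes an explicit fallback.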

3.
Web search users complain of the inaccurate results produced by current search engines. Most of these inaccurate results are due to a failure to understand the user's search goal. This paper proposes a method to extract users' intentions and to build an intention map representing these extracted intentions. The proposed method builds intention vectors from the pages clicked in previous search logs for a given query. The components of an intention vector are the weights of the keywords in a document. The method extracts the user's intentions by clustering the intention vectors and extracting intention keywords from each cluster. The intentions extracted for a query are represented in an intention map. To analyze the efficiency of the intention map, we extracted users' intentions from 2,600 search log records of a current domestic commercial search engine. The experimental results with a search engine using the intention maps show statistically significant improvements in user satisfaction scores.

4.
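Off-topic essay detection is an important module of automated essay scoring systems. Traditional off-topic detection generally computes a content relevance score for an essay and compares it against a fixed threshold to decide whether the essay is off topic. In practice, however, the score depends directly on the prompt: essays written for divergent (open-ended) prompts and for non-divergent prompts score markedly differently, so a single fixed threshold is hard to apply to all essays. This paper proposes an off-topic essay detection method based on document divergence. Its novelty lies in studying the concept of the divergence of an essay collection, building a model that relates divergence to the off-topic threshold, and dynamically selecting a different threshold for each prompt. An off-topic detection system is built and tested on a real dataset. Experimental results show that the system based on document divergence identifies off-topic essays effectively.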

5.
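To enable keyword-based retrieval of Uyghur document images, a keyword retrieval method based on coarse-to-fine hierarchical matching is proposed. An improved projection segmentation method splits preprocessed document images into a library of word images, and template matching performs coarse matching of keywords. On top of the coarse matching, histogram of oriented gradients (HOG) feature vectors are extracted from the word images, and a support vector machine (SVM) classifier learns the feature vectors to perform keyword image retrieval. In experiments on a database of 108 document images, the average retrieval precision is 91.14% and the average recall is 79.31%, showing that the method effectively supports keyword-based retrieval of Uyghur document images.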

6.
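Existing unsupervised off-topic essay detection methods represent an essay by its content vector, which contains noise from non-topic words and leads to inaccurate similarity. To address this, this paper proposes an unsupervised off-topic essay detection method based on essay topic-word extraction and local-density threshold selection. The LDA topic model is first used to mine the topic words of the essay under test, and distributed representation vectors are used to find words semantically similar to the terms of the essay prompt, which serve as a topic-word expansion of the prompt. On this basis, the proposed on-topic degree calculation computes the on-topic degree of the essay, and the proposed threshold extraction method, based on the local density of on-topic degrees across the essay set, dynamically selects the on-topic threshold. The result is an unsupervised, topic-independent off-topic detection method that needs no training set. Experiments on 8 essay sets totaling 9,381 essays, written by learners who are native English speakers and learners who are native Chinese speakers, show that the method effectively identifies off-topic essays: with spell-checking added as preprocessing, the average F1 is 79.64%, and the best F1 for a single essay prompt is 96.1%.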

7.
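To improve the translation accuracy of English-Chinese translation systems, an English-Chinese translation system model based on human-computer interaction and feature extraction is proposed. First, to extract contextual features for translation, a feature extraction algorithm extracts a semantic translation context matrix and a non-semantic translation context matrix. Second, to measure the similarity between two semantic vectors in the same translation context, the cosine similarity function is chosen to compute translation similarity. Translation similarity is introduced into the system model, and translation between English and Chinese is achieved by comparing the translation similarity of two semantic vectors. Comparison with SOA, SCA, and SLA shows that translation based on human-computer interaction and feature extraction achieves higher accuracy, precision, and recall, offering a new method and approach for English translation.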

8.
WANG Wei, ZHAO Erping, CUI Zhiyuan, SUN Hao. 《计算机应用》 (Journal of Computer Applications), 2021, 41(8): 2193-2198
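Current word vector representations are of poor quality for low-frequency words and tend to conflate semantic information, and existing disambiguation models cannot accurately distinguish polysemous words. To address these problems, a multi-feature fusion disambiguation method based on fused word vector representation is proposed. The method fuses word vectors represented with HowNet sememes with word vectors generated by Word2vec, to fill in the polysemy information of words and improve the representation quality of low-frequency words. First, the cosine similarity between the entity to be disambiguated and each candidate entity is computed to obtain their similarity. Second, a clustering algorithm and the HowNet knowledge base are used to obtain the entity category feature similarity. Third, an improved latent Dirichlet allocation (LDA) topic model extracts topic keywords to compute the entity topic feature similarity. Finally, the three feature similarities are fused by weighting to disambiguate polysemous words. Experiments on a test set from the Tibetan animal husbandry domain show that the accuracy of the proposed method (90.1%) is 7.6 percentage points higher than that of a typical graph-model disambiguation method.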

9.
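With the rapid development of the mobile Internet, the mobile classroom has emerged as a new teaching mode. In the traditional composition class, only the teacher marks students' essays; students cannot mark one another's essays and see only the teacher's evaluation of their own work. As a result, students do not know what essay scoring focuses on or how to improve their writing. To address this, a teaching approach based on a mobile-classroom essay peer-review mode is proposed, and an iOS-based essay peer-review system comprising a back-end service and a front-end client is designed and implemented. Experimental results show that the system effectively improves teaching efficiency and increases students' interest in learning.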

10.
Electronic Health Records (EHR) form a valuable resource in the healthcare enterprise because clinical evidence can be provided to identify potential complications and support decisions on early intervention. Simple string matching, the common search algorithm, is not able to map a query to the similar health records in the database with respect to the medical concepts. A novel ontological vector model supported by the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) is proposed in this paper to project the disease terms of a health record to a feature space so that each health record can be characterized using a feature vector, giving a fingerprint of the record. The similarity between the query and database health records was measured by similarity measures of their feature vectors and string matching score respectively. Three types of similarity measures were considered in this study, namely, Euclidean distance (ED), direction cosine (DC) and modified direction cosine (mDC). Medical history and carotid ultrasonic imaging findings were collected from 47 subjects in Hong Kong. The dataset formed 1081 pairs of health records and ROC analysis was used to evaluate and compare the accuracy of the ontological vector model and simple string matching against the agreement of the presence or absence of carotid plaques identified by carotid ultrasound between two subjects. It was found that the score generated by simple string matching was a random rater but the ontological vector model was not. In other words, the degree of health record similarity based on the ontological vector model is associated with the agreement of atherosclerosis between two patients. The vector model using feature terms at the SNOMED-CT level 4 gave the best performance. The performance of mDC was very close to that of ED and DC but the properties of mDC make it more suitable for the retrieval of similar health records. 
It was also shown that the ontological vector model was enhanced by the support vector classifier approach.

11.
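Personal microblogs are a popular social tool, but their sheer volume makes browsing difficult for users. This paper clusters microblogs with high semantic similarity to make browsing easier. The main work is as follows: 1. segment personal microblogs with the Python jieba tokenizer and remove stop words; 2. train word vectors on the segmented dataset with the CBOW model; 3. represent each microblog as a sentence vector built from the word vectors; 4. treat the sentence vectors as points in space and compute distances with an improved Manhattan sentence algorithm, which gives the similarity between microblogs; 5. cluster with an improved CLARANS algorithm. Experiments show that the method is a clear improvement over traditional clustering algorithms such as partitioning, hierarchical, and density-based methods.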
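
Steps 3 and 4 of this abstract can be sketched as follows. The abstract does not specify how sentence vectors are built or what its "improved" Manhattan algorithm changes, so this sketch assumes plain averaging of word vectors and the ordinary Manhattan (L1) distance; the toy embedding table and function names are hypothetical.

```python
def sentence_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a word vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def manhattan_distance(u, v):
    """Ordinary L1 distance; smaller means more similar."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy 2-dimensional embeddings standing in for trained CBOW vectors.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [1.0, 1.0]}
post_a = sentence_vector(["cat", "dog"], toy_vectors)
post_b = sentence_vector(["fish"], toy_vectors)
distance = manhattan_distance(post_a, post_b)
```

In practice the token lists would come from a tokenizer such as jieba and the vectors from a trained CBOW model; both are left out here to keep the sketch self-contained.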

12.
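Personalized sites rarely take the user's search intent into account. To address this, a keyword extraction method combining cross information entropy with word feature information and a text ranking method combining cosine similarity with weighted Hamming distance are proposed, with the aim of recommending more satisfactory search results without any user feedback. The access addresses of the user's requests to the personalized site are filtered to obtain the text of the web pages the user browsed; a keyword set that represents the user's search intent is extracted from that text and used to search again; the search results are ranked; and the ranked results are returned to the user as a recommendation module. Experiments show that query recommendations obtained this way match the user's search intent more closely and provide a better user experience.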

13.
WANG Jing. 《计算机应用研究》 (Application Research of Computers), 2020, 37(10): 2951-2955, 2960
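Keywords extracted from texts of the same category take diverse forms and are fuzzily related in both similarity and relatedness. To address this, a text feature extraction method that clusters words hierarchically is proposed. On the premise that words shared between texts contribute to text similarity, the method combines word similarity and word relatedness into a semantic distance, introduces hierarchical clustering according to differences in that distance, and assigns different cluster weights, finally yielding a vector space model with cluster weights in which both words and clusters serve as feature units. word2vec is used to train word vectors and obtain text similarity, and, based on the algorithmic characteristics of the Skip-Gram + Huffman Softmax model, the pointwise mutual information formula is used to obtain the relatedness between words accurately. Text classification experiments show that the proposed method improves the accuracy of text feature extraction more effectively than the commonly used approach of single-level clustering by similarity followed by counting.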
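
The pointwise mutual information formula mentioned in this abstract is, in its standard form, PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below estimates it from raw co-occurrence counts. How the paper derives these probabilities from the Skip-Gram model is not stated in the abstract, so the count-based estimate and the name `pmi` are assumptions.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts.

    count_xy: windows containing both words; count_x, count_y: windows
    containing each word; total: number of windows observed.
    """
    if min(count_xy, count_x, count_y) <= 0 or total <= 0:
        raise ValueError("counts must be positive")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))
```

A positive value means the two words co-occur more often than independence would predict, and a value near zero means they are roughly independent.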

14.
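Current knowledge distillation algorithms distill only between corresponding layers. To solve this problem and improve distillation performance, the guidance that the teacher model's low-level features provide to the student model's high-level features is first analyzed, and on that basis an object detection distillation method based on knowledge review and decoupling is proposed. The method first aligns and fuses the student model's high-level features with its low-level features and extracts attention separately along the spatial and channel dimensions, so that the student's high-level features progressively learn the teacher's low-level and high-level knowledge. It then decouples foreground from background and distills them separately. Finally, pyramid pooling is used to compute the similarity to the teacher model's features at different scales. Experiments on different object detection models show that the method is simple and effective and applies to a variety of detectors. RetinaNet and FCOS with a ResNet-50 backbone achieve 39.8% and 42.8% mAP on the COCO2017 dataset, 2.4% and 2.3% higher than the baseline, respectively.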

15.
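In emergency decision-making for major emergencies, experts in a large group often have incomplete preference information. To address this, a new large-group emergency decision-making method for incomplete preference information is proposed. First, the TF-IDF (term frequency-inverse document frequency) algorithm extracts keywords from the stream of microblog big-data text related to the major emergency to obtain the event attributes and their weights. Second, each expert's hesitancy degree is computed from the preference information the expert provides, which in turn yields the expert's weight. Third, attribute association and alternative closeness are measured on the incomplete preference information matrix, and a new value-completion model based on attribute association and alternative closeness is proposed to obtain a complete preference information matrix. Then, expert weights and attribute weights are combined to aggregate the information and select the best alternative. Finally, the feasibility and effectiveness of the proposed method are verified with a flood disaster event in Jiangxi.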
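
The TF-IDF keyword extraction in the first step of this abstract can be sketched as follows. This uses one common variant (relative term frequency times the natural log of the inverse document frequency); the paper's exact weighting is not given in the abstract, and `tf_idf` and `top_keywords` are hypothetical names.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms of one document."""
    return sorted(doc_weights, key=doc_weights.get, reverse=True)[:k]
```

Under this variant a term that appears in every document gets weight zero, which is what pushes event-specific words to the top of the keyword list.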

16.
With the increasing amount of information available in recent years, searching for the desired content is becoming a challenging task. In this work, a tool for searching abstracts submitted to scientific conferences is introduced. It not only searches abstracts by the given keyword(s) but also displays abstracts related to a single or multiple selection. It also displays highly relevant abstracts together with possible keywords to help users refine their search. Analysis of the conditional similarity algorithm proposed here has shown that it provides better output than ordinary cosine similarity, and that the list of possible keywords reflects the results of latent topic analysis. An interface for storing and sorting selected abstracts for future review and/or printing is also provided.

17.
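VSM cannot reveal the latent semantic relations among feature words in a document, so its similarity computation is not very accurate. To address this, a document vector space model based on a domain ontology is proposed, which draws on the structural characteristics of the ontology model and computes the similarity between concepts by jointly considering semantic overlap, semantic distance, and ontology structure. The model adjusts feature-word weights by constructing a semantic similarity matrix between concepts, builds vector space models of the standard (student) answers that include semantic relations, and evaluates the similarity between a student answer and the standard answer with the "VSM model + cosine value" algorithm. Experiments show that, compared with traditional methods, the approach improves evaluation quality and accuracy.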

18.
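Judging question similarity is an important research direction in community question answering (CQA). A question in CQA usually consists of a subject and a description. Because CQA is open, user questions vary in length and contain a great deal of background information that interferes with a model's judgment of whether questions are similar. To reduce this effect on question similarity computation, the model treats keywords and the question subject as the key information of a question and uses that information to compute question similarity. First, keyword extraction is introduced into a CNN model based on the similar and dissimilar information between texts. In addition, to make better use of the question subject, the model incorporates a feature for the similarity of question subjects. The model was evaluated on the question similarity task of SemEval-2017, where its mean average precision (MAP) reached 49.65%, exceeding the best result in the evaluation.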
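
The mean average precision (MAP) metric reported in this abstract can be sketched as follows. The sketch assumes each query's ranked list contains all of its relevant items, so average precision is normalized by the number of relevant items found; official scorers may differ in such details, and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one query; `ranked_relevance` lists booleans in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)
```

For example, a ranking with relevant items at positions 1 and 3 has precision 1/1 and 2/3 at those positions, giving an average precision of 5/6.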

19.
Semantic-oriented service matching is one of the challenges in automatic Web service discovery. Service users may search for Web services using keywords and receive the matching services in terms of their functional profiles. A number of approaches to computing the semantic similarity between words have been developed to enhance the precision of matchmaking, which can be classified into ontology-based and corpus-based approaches. The ontology-based approaches commonly use the differentiated concept information provided by a large ontology to measure lexical similarity with word sense disambiguation. Nevertheless, most ontologies are domain-specific and limited in lexical coverage, which limits their applicability. Corpus-based approaches, on the other hand, rely on the distributional statistics of context to represent each word as a vector and measure the distance between word vectors. However, polysemy may lead to low computational accuracy. In this paper, to augment the semantic information content of word vectors, we propose a multiple semantic fusion (MSF) model that generates a sense-specific vector for each word. In this model, various semantic properties of the general-purpose ontology WordNet are integrated to fine-tune the distributed word representations learned from a corpus, using vector combination strategies. The retrofitted word vectors are modeled as semantic vectors for estimating semantic similarity. The MSF model-based similarity measure is validated against other similarity measures on multiple benchmark datasets. Experimental results of word similarity evaluation indicate that our computational method obtains a higher correlation coefficient with human judgment in most cases. Moreover, the proposed similarity measure is shown to improve the performance of Web service matchmaking based on a single semantic resource. Accordingly, our findings provide a new method and perspective to understand and represent lexical semantics.

20.
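Compared with traditional machine learning methods, lifelong machine learning can make effective use of the knowledge accumulated in a knowledge base to improve learning on the current task. However, the classic lifelong topic model (LTM) lacks bias when selecting domains and cannot make full use of a target word's context when computing its similarity. An improved model, HW-LTM, is proposed from the perspective of word and topic selection. It uses the cosine similarity of Word2vec word vectors and the Hellinger distance between topics to find highly similar words and domains, achieving better selection of words and domains and more effective knowledge acquisition during iterative learning. It also preloads a word vector similarity matrix to avoid repeatedly computing cosine distances between word vectors, and computes topic similarity with the Hellinger distance to speed up model convergence. Experiments on a JD.com product review dataset show that HW-LTM outperforms baseline topic mining models; compared with LTM, its topic coherence improves by 48 and its running time is 43.75% shorter.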
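
The Hellinger distance used in this abstract to compare topics has the standard form H(P, Q) = sqrt( (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2 ) for two discrete distributions. The sketch below assumes each topic is given as a word-probability list over the same vocabulary; the function name is hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    Returns a value in [0, 1]: 0 for identical distributions, 1 when the
    distributions have disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)
```

Because the distance is bounded and symmetric, it can be thresholded directly when deciding whether two topics are similar enough to share knowledge.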
