Similar Documents
Found 20 similar documents (search time: 15 ms)
1.
Research on a New Keyword Extraction Algorithm   (Cited by 2: 0 self-citations, 2 by others)
Traditional keyword extraction algorithms are usually based on selecting high-frequency words, but a document's keywords are not always high-frequency words, so keywords must also be found among the non-high-frequency words. This work abstracts a document as a graph: nodes represent words and edges represent co-occurrence relations between words. Based on this topological structure of the document, a new keyword extraction algorithm is proposed and compared with traditional keyword extraction algorithms, achieving good results in both precision and coverage.
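The document-as-graph scheme above can be sketched in a few lines. Since the entry does not give its exact scoring function, this hypothetical sketch ranks words by degree centrality in the co-occurrence graph, with sentence-level co-occurrence as an assumed window:

```python
from collections import defaultdict
from itertools import combinations

def extract_keywords(sentences, top_n=3):
    """Rank words by degree in a co-occurrence graph: nodes are words,
    an edge links two words that appear in the same sentence."""
    degree = defaultdict(int)
    seen = set()
    for sentence in sentences:
        # one edge per distinct word pair, counted once across the text
        for a, b in combinations(sorted(set(sentence.split())), 2):
            if (a, b) not in seen:
                seen.add((a, b))
                degree[a] += 1
                degree[b] += 1
    ranked = sorted(degree, key=lambda w: (-degree[w], w))
    return ranked[:top_n]
```

A word linked to many distinct words ranks high even if its raw frequency is modest, which is the point the abstract makes about non-high-frequency keywords.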

2.
The amount of big data collected during human–computer interactions requires natural language processing (NLP) applications to be executed efficiently, especially in parallel computing environments. Scalability and performance are critical in many NLP applications such as search engines or web indexers. However, there is a lack of mathematical models helping users to design and apply scheduling theory for NLP approaches. Moreover, many researchers and software architects reported various difficulties related to common NLP benchmarks. Therefore, this paper aims to introduce and demonstrate how to apply a scheduling model for a class of keyword extraction approaches. Additionally, we propose methods for the overall performance evaluation of different algorithms, which are based on processing time and correctness (quality) of answers. Finally, we present a set of experiments performed in different computing environments together with obtained results that can be used as reference benchmarks for further research in the field.

3.
Semantics-based keyword extraction is an effective way to improve the accuracy of automatic extraction. Taking Chinese documents as the processing object, semantic distances between words are computed using the Tongyici Cilin (a Chinese thesaurus), words are grouped by density clustering into topic-related classes, and center words are selected from these classes as keywords.

4.
To overcome topic drift and topic misjudgment in traditional topic-word extraction algorithms, this work proposes using word co-occurrence information to improve the accuracy of topic-word extraction. A word's weight score is adjusted according to its co-occurrence with the contextual words of the text; words with a high co-occurrence rate with the text's topic are preferentially extracted as topic words, improving extraction precision. Experiments show that the proposed method achieves higher accuracy than general topic-word extraction methods, and is clearly superior to them on shorter texts.
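The co-occurrence re-weighting idea can be illustrated with a minimal sketch. The entry does not specify the exact adjustment formula, so this assumes sentence-level co-occurrence and a simple multiplicative boost controlled by a hypothetical `alpha` parameter:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_boosted_scores(sentences, alpha=0.5):
    """Boost a word's raw frequency score by how often it co-occurs
    with the rest of the vocabulary (sentence-level co-occurrence)."""
    freq = Counter(w for s in sentences for w in s.split())
    cooc = Counter()
    for s in sentences:
        for a, b in combinations(set(s.split()), 2):
            cooc[a] += 1
            cooc[b] += 1
    max_cooc = max(cooc.values(), default=1)
    # words co-occurring widely with the text's context words score higher
    return {w: freq[w] * (1 + alpha * cooc[w] / max_cooc) for w in freq}
```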

5.
Measuring the semantic similarity between sentences is an essential issue for many applications, such as text summarization, Web page retrieval, question-answer models, and image extraction. Several studies have explored this issue with techniques such as knowledge-based, corpus-based, and hybrid strategies. Most of these studies focus on improving effectiveness. In this paper, we address the efficiency issue, i.e., for a given sentence collection, how to efficiently discover the top-k semantically similar sentences to a query. Previous methods cannot handle big data efficiently: applying such strategies directly is time consuming because every candidate sentence needs to be tested. In this paper, we propose efficient strategies to tackle this problem based on a general framework. The basic idea is that for each similarity, we build a corresponding index in preprocessing. Traversing these indices in the querying process avoids testing many candidates, thereby improving efficiency. Moreover, an optimal aggregation algorithm is introduced to assemble these similarities. Our framework is general enough that many similarity metrics can be incorporated, as discussed in the paper. We conduct extensive experimental evaluation on three real datasets to evaluate the efficiency of our proposal. In addition, we illustrate the trade-off between effectiveness and efficiency. The experimental results demonstrate that our proposal outperforms state-of-the-art techniques in efficiency while maintaining the same high precision.
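The index-per-similarity framework described above can be sketched with a simplified threshold-style traversal (in the spirit of Fagin's threshold algorithm): each similarity contributes a list pre-sorted by score, and sequential access stops once the depth threshold cannot beat the current k-th best aggregate. Sum aggregation and the input layout here are assumptions, not the paper's exact design:

```python
import heapq

def topk_threshold(lists, k):
    """`lists` maps a similarity name to (id, score) pairs sorted by
    score descending -- the per-similarity index built offline.
    An id's aggregate score is the sum of its scores across lists."""
    lookup = {name: dict(pairs) for name, pairs in lists.items()}
    best = {}  # id -> aggregate score, for ids seen so far
    max_len = max(len(p) for p in lists.values())
    for depth in range(max_len):
        threshold = 0.0
        for name, pairs in lists.items():
            if depth < len(pairs):
                sid, score = pairs[depth]
                threshold += score
                if sid not in best:  # random access to the other lists
                    best[sid] = sum(l.get(sid, 0.0) for l in lookup.values())
        topk = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        # unseen ids can score at most `threshold`; stop when safe
        if len(topk) == k and topk[-1][1] >= threshold:
            break
    return topk
```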

6.
Wu  Xing  Du  Zhikang  Guo  Yike 《Multimedia Tools and Applications》2018,77(19):25355-25367
Document classification plays an important role in natural language processing. Among its components, keyword extraction shows great potential in summarizing...

7.
In webpage-browsing recommendation tasks, how to select suitable recommendation keywords from page content is a challenging research topic. To build an effective keyword recommendation method, this work uses large-scale real web-user browsing data, together with relevant extraction algorithms and new-word discovery algorithms, to implement and compare two keyword recommendation methods: one based on domain keyword extraction and one based on a candidate set of query terms. Experimental results show that both methods can effectively characterize users' information needs, while the first achieves higher accuracy and better recommendation performance.

8.
This article focuses on the problems of applying artificial intelligence to represent legal knowledge. The volume of legal knowledge used in practice is unusually large, so an ontological knowledge representation is proposed for semantic analysis, for presenting and using a common vocabulary, and for integrating knowledge of the problem domain. Some features of legal knowledge representation in Ukraine have also been taken into account. A software package has been developed to work with the ontology. The main features of the program complex, which has a Web-based interface and supports multi-user filling of the knowledge base, are described. Crowdsourcing is to be used for filling the knowledge base of legal information; the success of this method is explained by the self-organization principle of information. However, such collective work introduces a number of errors, which are distributed throughout the structure of the ontology. The results of applying this program complex are discussed at the end of the article, and ways of improving the technique are outlined.

9.
Mining the interests of Chinese microbloggers via keyword extraction   (Cited by 1: 0 self-citations, 1 by others)
Microblogging provides a new platform for communicating and sharing information among Web users. Users can express opinions and record daily life using microblogs. Microblogs that are posted by users indicate their interests to some extent. We aim to mine user interests via keyword extraction from microblogs. Traditional keyword extraction methods are usually designed for formal documents such as news articles or scientific papers. Messages posted by microblogging users, however, are usually noisy and full of new words, which is a challenge for keyword extraction. In this paper, we combine a translation-based method with a frequency-based method for keyword extraction. In our experiments, we extract keywords for microblog users from the largest microblogging website in China, Sina Weibo. The results show that our method can identify users’ interests accurately and efficiently.
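The abstract combines a translation-based signal with a frequency-based signal but does not say how; a common and minimal assumption is linear interpolation, sketched here with a hypothetical `lam` weight:

```python
def combine_scores(translation_scores, frequency_scores, lam=0.5):
    """Linearly interpolate two keyword-scoring signals per word.
    `lam` weights the translation-based score against the
    frequency-based one; 0.5 is an assumption, not the paper's value."""
    words = set(translation_scores) | set(frequency_scores)
    return {w: lam * translation_scores.get(w, 0.0)
               + (1 - lam) * frequency_scores.get(w, 0.0)
            for w in words}
```

Words scored by only one signal still receive a partial score, which matters for the noisy, new-word-heavy microblog text the abstract describes.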

10.
To address the concise content and highly variable formats of short English texts, a keyword extraction algorithm based on multi-threaded multi-factor weighting is proposed. The algorithm uses term frequency-inverse document frequency (TF-IDF) to compute each word's frequency factor, plus a position factor, a word-length factor, and a co-occurrence factor representing where the word appears, its length, and its co-occurrence relations. The weights of the four factors are computed concurrently by multiple threads based on the Future pattern; the accumulated weight of the four factors is then computed for each word, and words are ranked to extract keywords. Experimental results show that the multi-threaded multi-factor weighted algorithm effectively improves the precision and recall of keyword extraction from short texts.
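The four-factor, Future-pattern scheme above can be sketched with `concurrent.futures`. The factor definitions here (relative frequency, first-occurrence position, relative length, distinct-neighbour share) are simplified assumptions standing in for the paper's formulas, and the factors are summed unweighted:

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

def factor_frequency(words):
    counts = Counter(words)
    return {w: counts[w] / len(words) for w in counts}

def factor_position(words):
    first = {}
    for i, w in enumerate(words):
        first.setdefault(w, i)
    return {w: 1 - first[w] / len(words) for w in first}  # earlier is better

def factor_length(words):
    longest = max(len(w) for w in words)
    return {w: len(w) / longest for w in set(words)}

def factor_cooccurrence(words):
    neighbours = defaultdict(set)
    for a, b in zip(words, words[1:]):
        neighbours[a].add(b)
        neighbours[b].add(a)
    vocab = len(set(words))
    return {w: len(neighbours[w]) / vocab for w in set(words)}

def multi_factor_scores(words):
    """Submit the four factor computations as futures (Future pattern),
    then sum the per-word weights once all results are collected."""
    factors = (factor_frequency, factor_position,
               factor_length, factor_cooccurrence)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fn, words) for fn in factors]
        results = [f.result() for f in futures]
    return {w: sum(r.get(w, 0.0) for r in results) for w in set(words)}
```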

11.
A Text Keyword Extraction Method Based on Word-Frequency Statistics   (Cited by 1: 0 self-citations, 1 by others)
To address the low efficiency and limited accuracy of the traditional TF-IDF algorithm for keyword extraction, a text keyword extraction method based on word-frequency statistics is proposed. First, a formula for the number of same-frequency words in a text is derived from Zipf's law. Second, the proportion of words at each frequency is determined from this formula, showing that the vast majority of words in a text are low-frequency words. Finally, these word-frequency statistics are applied to keyword extraction, yielding a TF-IDF algorithm based on word-frequency statistics. In simulation experiments on Chinese and English text datasets, the derived formula for same-frequency word counts has an average relative error below 0.05, and the maximum absolute error of the estimated proportion of words at each frequency is 0.04. Compared with the traditional TF-IDF algorithm, the proposed algorithm improves average precision, average recall, and average F1-measure while reducing average running time. The experimental results show that, for text keyword extraction, the word-frequency-statistics-based TF-IDF algorithm outperforms the traditional TF-IDF algorithm on precision, recall, and F1, and effectively reduces extraction running time.
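The same-frequency word count derived from Zipf's law has a classical closed form under the assumption of a Zipf exponent of exactly 1 (the paper's derivation may differ): the share of distinct words occurring exactly f times is 1/(f(f+1)), so roughly half of a text's vocabulary occurs only once, matching the abstract's observation that most words are low-frequency:

```python
def same_frequency_share(f):
    """Predicted share of distinct words that occur exactly f times,
    assuming a pure Zipf distribution with exponent 1."""
    return 1.0 / (f * (f + 1))
```

The shares telescope to 1 when summed over all frequencies, which is a quick sanity check on the formula.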

12.
A software defect analysis method based on K-Means and a topic model is proposed to study the categories and keywords of software defects. Defect reports are collected and preprocessed to obtain effective features; texts are represented with the vector space model, weights are computed, and the final feature vectors are clustered. The topic and keywords of each defect class are then extracted to help maintainers quickly find the corresponding fix, with keyword extraction results presented to maintainers as words. Experiments on 1,500 defect reports from three projects (Bugzilla, Firefox, and SeaMonkey) show that the method achieves an average clustering accuracy of 81%.
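The clustering step above can be sketched with a generic pure-Python k-means over dense feature vectors; this is a textbook sketch, not the paper's pipeline, and the iteration count and seed are arbitrary assumptions:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means with squared-Euclidean distance over dense vectors
    (e.g. VSM-weighted defect-report features)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centers[c])))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, clusters
```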

13.
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. This compact representation can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with a high-dimensional feature space challenge. Hence, extracting the most important/relevant words about the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-measure-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction, and the TextRank algorithm) with classification algorithms and ensemble methods for scientific text document classification (categorization). A comprehensive comparison of base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely used ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that a Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method yields promising results for text classification.
For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent-based keyword extraction method and a Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in text classification applications.
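Bagging, the ensemble the study found most promising, is easy to sketch: train each base learner on a bootstrap resample and take a majority vote. The 1-nearest-neighbour base learner and the `n_models`/`seed` values here are arbitrary stand-ins, not the study's Random Forest setup:

```python
import random
from collections import Counter

def nn_predict(sample, x):
    """1-nearest-neighbour base learner over (vector, label) pairs."""
    vec, label = min(sample, key=lambda p: sum(
        (a - b) ** 2 for a, b in zip(p[0], x)))
    return label

def bagging_predict(train, x, n_models=11, seed=0):
    """Bagging: each model sees a bootstrap resample of the training
    set; the final label is the majority vote across models."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        votes.append(nn_predict(sample, x))
    return Counter(votes).most_common(1)[0][0]
```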

14.
杨朝举  葛维益  王羽  徐建 《计算机应用研究》2021,38(4):1022-1026,1032
Keyword extraction plays an important role in many text mining tasks, and its quality directly affects those tasks. Taking text as the research object, a keyword extraction method based on k-truss graph decomposition, named KEK (keyword extraction based on k-truss), is proposed. Drawing on vector space model theory, the method first builds a text graph whose nodes are the words of the text, connected according to their co-occurrence relations. It then applies k-truss decomposition to capture the text's semantic features and, combining word frequency, word-position features, and complex-network features, constructs a parameter-free scoring function; keywords are finally extracted according to the scores. Experiments on benchmark datasets show that KEK outperforms other text-graph-based keyword extraction methods in F1 for short-text keyword extraction.
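The k-truss decomposition at the heart of KEK keeps only edges that sit in enough triangles. A minimal peeling sketch (not KEK's full scoring pipeline) over an undirected edge set:

```python
def k_truss(edges, k):
    """Return the k-truss of an undirected graph: every kept edge must
    lie in at least (k - 2) triangles within the kept subgraph.
    Edges are peeled iteratively until the condition holds."""
    edges = {tuple(sorted(e)) for e in edges}
    changed = True
    while changed:
        changed = False
        adj = {}
        for a, b in edges:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
        for a, b in list(edges):
            support = len(adj[a] & adj[b])  # triangles through this edge
            if support < k - 2:
                edges.discard((a, b))
                changed = True
    return edges
```

Words on edges that survive deep trusses sit in densely interlinked parts of the text graph, which is the semantic signal the method exploits.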

15.
Considering the influence of semantic information on TextRank, together with the highly condensed information in news headlines and the coverage and diversity required of keywords, a new keyword extraction method fusing LSTM and LDA difference is proposed. First, news texts are preprocessed to obtain candidate keywords. Second, an LDA topic model provides each candidate's topic-difference influence. Then, an LSTM model and a word2vec model are combined to compute each candidate's semantic-relevance influence with respect to the headline. Finally, candidate keyword nodes transition non-uniformly according to topic-difference influence and semantic-relevance influence, producing the final candidate ranking from which keywords are extracted. The method fuses the semantic importance, coverage, and diversity of keywords. Experiments on the Sogou whole-web news corpus show clear improvements in precision and recall over traditional methods.
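The non-uniform transition step amounts to TextRank/PageRank with weighted edges. This sketch assumes the topic-difference and semantic-relevance influences have already been blended into per-edge weights upstream (the blending itself is not shown):

```python
def weighted_textrank(weights, d=0.85, iters=50):
    """TextRank with non-uniform transitions: weights[a][b] is the
    edge weight from candidate a to candidate b; a node distributes
    its score to neighbours in proportion to edge weight."""
    nodes = list(weights)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            rank = 0.0
            for m in nodes:
                out = sum(weights[m].values())
                if weights[m].get(n) and out:
                    rank += score[m] * weights[m][n] / out
            new[n] = (1 - d) + d * rank
        score = new
    return score
```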

16.

17.
18.
To address the unsuitability of pixel-based analysis for information extraction from high-resolution imagery, an object-based image analysis method for extracting urban building information is proposed. Multi-resolution image segmentation is used to obtain image objects, and an unsupervised optimal-scale determination method is proposed to resolve the under- and over-segmentation caused by single-scale segmentation. In object classification and extraction, buildings are extracted by combining terrain surface elevation information from LiDAR data with spectral information, and misclassifications are corrected using size, spatial position, and other information. Eighteen building targets were extracted in the experimental area, showing that the proposed method is effective and feasible.

19.
20.
Attribute-based encryption with keyword search (ABKS) enables data owners to grant their search capabilities to other users by enforcing an access control policy over the outsourced encrypted data. However, existing ABKS schemes cannot guarantee the privacy of the access structures, which may contain some sensitive private information. Furthermore, resulting from the exposure of the access structures, ABKS schemes are susceptible to an off-line keyword guessing attack if the keyword space has a polynomial size. To solve these problems, we propose a novel primitive named hidden policy ciphertext-policy attribute-based encryption with keyword search (HP-CPABKS). With our primitive, the data user is unable to search on encrypted data and learn any information about the access structure if his/her attribute credentials cannot satisfy the access control policy specified by the data owner. We present a rigorous selective security analysis of the proposed HP-CPABKS scheme, which simultaneously keeps the indistinguishability of the keywords and the access structures. Finally, the performance evaluation verifies that our proposed scheme is efficient and practical.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号