Similar Documents
20 similar documents found (search time: 31 ms)
1.
We present PKIP, an adaptable learning-assistant tool for managing question items in item banks. PKIP not only automatically assists educational users in categorizing question items into predefined categories by their content, but also correctly retrieves items by category and/or difficulty level. PKIP applies a “categorization learning model” to improve its categorization performance as new question items arrive. It has an advantage over traditional document categorization methods in that it can correctly categorize question items that lack keywords, since it combines a feature selection technique with a support vector machine approach to item-bank text categorization. In our initial experiments, PKIP was designed and implemented to manage Thai upper-primary mathematics question items and was evaluated in terms of both system accuracy and user satisfaction. The results show that the accuracy is acceptable and that PKIP satisfies users' needs.
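The retrieval side of the abstract above (fetching items by category and/or difficulty level) can be sketched as a toy item bank; the class and sample items below are invented for illustration, and the SVM-based categorizer itself is not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class QuestionItem:
    text: str
    category: str
    difficulty: int   # e.g. 1 (easy) .. 3 (hard)

class ItemBank:
    """Toy item bank: items retrievable by category and/or difficulty."""
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def retrieve(self, category=None, difficulty=None):
        # None means "any": filter only on the criteria the caller specifies
        return [it for it in self.items
                if (category is None or it.category == category)
                and (difficulty is None or it.difficulty == difficulty)]

bank = ItemBank()
bank.add(QuestionItem("What is 7 x 8?", "arithmetic", 1))
bank.add(QuestionItem("Solve x + 3 = 10.", "algebra", 2))
bank.add(QuestionItem("Factor x^2 - 9.", "algebra", 3))
algebra = bank.retrieve(category="algebra")
hard = bank.retrieve(difficulty=3)
```

In the full system the `category` field would be filled in automatically by the classifier rather than by hand.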

2.
This paper presents a complete system for categorizing handwritten documents, i.e. classifying documents by topic. The approach detects a set of discriminative keywords and then applies the well-known tf-idf representation for document categorization. Two keyword extraction strategies are explored. The first recognizes the whole document; however, its performance degrades sharply as the lexicon size increases. The second extracts only the discriminative keywords from the handwritten documents, relying on the integration of a rejection model (or anti-lexicon model) into the recognition system. Experiments were carried out on an unconstrained handwritten document database from an industrial application concerned with processing incoming mail. Results show that the discriminative keyword extraction system yields better recall/precision trade-offs than the full recognition strategy, and that it also outperforms full recognition on the categorization task.

3.
Text categorization is the task of automatically assigning unlabeled text documents to predefined category labels by means of an induction algorithm. Since the data in text categorization are high-dimensional, feature selection is often used to reduce the dimensionality. In this paper, we evaluate and compare the feature selection policies used in text categorization, employing several popular feature selection metrics. For the experiments, we use datasets that vary in size, complexity, and skewness, with a support vector machine as the classifier and tf-idf term weighting. In addition to evaluating these policies, we propose new feature selection metrics that show high success rates, especially with a low number of keywords. These metrics are two-sided local metrics based on the difference between the distributions of a term in the documents belonging to a class and in the documents not belonging to it. Moreover, we propose a keyword selection framework called adaptive keyword selection, which selects a different number of terms for each class and shows significant improvement on skewed datasets that have few training instances for some classes.
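The two-sided local metric described above can be sketched as the signed difference between a term's document frequency inside and outside a class; the exact metric in the paper may differ, so treat this as an illustrative stand-in:

```python
def term_class_score(docs, labels, term, cls):
    """Two-sided local score: P(term | class) - P(term | not class).
    Positive -> the term indicates the class; negative -> its absence does."""
    in_cls = [d for d, l in zip(docs, labels) if l == cls]
    out_cls = [d for d, l in zip(docs, labels) if l != cls]
    p_in = sum(term in d for d in in_cls) / len(in_cls)
    p_out = sum(term in d for d in out_cls) / len(out_cls)
    return p_in - p_out

def select_keywords(docs, labels, cls, k):
    """Keep the k terms with the largest absolute score for this class;
    adaptive keyword selection would choose a different k per class."""
    vocab = sorted({w for d in docs for w in d})
    return sorted(vocab,
                  key=lambda w: abs(term_class_score(docs, labels, w, cls)),
                  reverse=True)[:k]

# toy corpus: each document is a set of terms
docs = [{"goal", "match", "team"}, {"team", "coach"},
        {"stock", "market"}, {"market", "bank"}]
labels = ["sport", "sport", "finance", "finance"]
```

Because the score is signed, terms that indicate the *absence* of a class are kept too, which is what makes the metric two-sided.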

4.
程岚岚  何丕廉  孙越恒 《计算机应用》2005,25(12):2780-2782
This paper proposes a Chinese keyword extraction algorithm based on the naive Bayes model. The algorithm first estimates the parameters of the naive Bayes model in a training phase, and then uses them to extract keywords in the testing phase. Experiments show that, compared with the traditional tf*idf method, the algorithm extracts more accurate keywords from small document collections; moreover, it can flexibly incorporate additional features that characterize word importance, and therefore has better extensibility.
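A minimal sketch of a naive Bayes keyword classifier in the spirit of the abstract above: parameters are estimated in a training phase and then used to label candidate words. The binary features (`high_freq`, `in_title`) are invented for illustration and are not the paper's actual feature set:

```python
import math

class NBKeywordScorer:
    """Naive Bayes over binary word features, with Laplace smoothing."""
    def fit(self, samples, labels):
        self.classes = sorted(set(labels))
        n = len(labels)
        self.prior = {c: labels.count(c) / n for c in self.classes}
        self.feats = sorted({f for s in samples for f in s})
        self.p = {}
        for c in self.classes:
            rows = [s for s, y in zip(samples, labels) if y == c]
            for f in self.feats:
                # Laplace-smoothed P(feature = 1 | class)
                self.p[(c, f)] = (sum(s.get(f, 0) for s in rows) + 1) / (len(rows) + 2)
        return self

    def predict(self, sample):
        def loglik(c):
            ll = math.log(self.prior[c])
            for f in self.feats:
                pf = self.p[(c, f)]
                ll += math.log(pf if sample.get(f, 0) else 1 - pf)
            return ll
        return max(self.classes, key=loglik)

train = [({"high_freq": 1, "in_title": 1}, "keyword"),
         ({"high_freq": 1, "in_title": 0}, "keyword"),
         ({"high_freq": 0, "in_title": 0}, "other"),
         ({"high_freq": 0, "in_title": 1}, "other")]
model = NBKeywordScorer().fit([s for s, _ in train], [y for _, y in train])
```

Adding a feature here is just adding a key to the feature dicts, which is the extensibility the abstract claims over plain tf*idf.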

5.
To address the facts that traditional collaborative filtering ignores that user interest originates in keywords and that rating data are sparse, a collaborative filtering recommendation algorithm combined with user-interest clustering is proposed. Using users' item ratings and keywords extracted from item attributes, a new RF-IIF (rating frequency-inverse item frequency) algorithm derives a user's preference for a keyword from the frequency with which the target user rates that keyword and the frequency with which all users rate it, yielding a user-keyword preference matrix on which clustering is performed. A logistic function then converts preferences into the user's degree of interest in each item, making user tastes explicit; similar users are sought within the target user's cluster, and the top-N items favored by those neighbors are recommended to the user. Experimental results show that the algorithm's accuracy consistently exceeds that of the traditional algorithm, that it judges user tastes accurately, and that it alleviates the data sparsity problem, effectively improving both the accuracy and the efficiency of recommendation.
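By analogy with TF-IDF, the RF-IIF score sketched below multiplies the user's rating frequency for a keyword by the log inverse of how many users rated items carrying it; the paper's exact formula is not given in the abstract, so this combination rule is an assumption:

```python
import math
from collections import Counter

def rf_iif(user, all_ratings):
    """RF-IIF sketch. all_ratings maps user -> list of keywords attached to
    the items that user rated (one entry per rating)."""
    rf = Counter(all_ratings[user])          # rating frequency per keyword
    users_with = Counter()                   # keyword -> #users who rated it
    for ratings in all_ratings.values():
        users_with.update(set(ratings))
    total = sum(rf.values())
    n_users = len(all_ratings)
    # rf term x inverse "item frequency" term, as in TF-IDF
    return {k: (rf[k] / total) * math.log(n_users / users_with[k]) for k in rf}

ratings = {"u1": ["action", "action", "comedy"],
           "u2": ["comedy"],
           "u3": ["comedy", "drama"]}
prefs = rf_iif("u1", ratings)
```

One row of the resulting user-keyword preference matrix is exactly `prefs`; stacking the rows for all users gives the matrix that the paper clusters.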

6.
Research on a New Keyword Extraction Algorithm
Traditional keyword extraction algorithms are based on extracting high-frequency words, but the keywords of a document are not always high-frequency, so keywords must also be found among the non-high-frequency words. A document is abstracted as a graph, with nodes representing words and edges representing co-occurrence relations between words. Based on this topological structure of the document, a new keyword extraction algorithm is proposed and compared with a traditional keyword extraction algorithm, achieving good results in both precision and coverage.
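The co-occurrence graph described above can be sketched in a few lines: nodes are words, edge weights count shared sentences, and words are ranked by weighted degree (the paper's actual ranking function is not specified in the abstract, so degree is used as a stand-in):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_keywords(sentences, top_n=3):
    """Rank words by weighted degree in a word co-occurrence graph."""
    weight = defaultdict(int)
    for sent in sentences:
        # one edge per unordered word pair sharing a sentence
        for a, b in combinations(sorted(set(sent)), 2):
            weight[(a, b)] += 1
    degree = defaultdict(int)
    for (a, b), w in weight.items():
        degree[a] += w
        degree[b] += w
    return sorted(degree, key=degree.get, reverse=True)[:top_n]

sents = [["graph", "keyword", "extraction"],
         ["graph", "ranking"],
         ["keyword", "ranking", "graph"]]
top = cooccurrence_keywords(sents, top_n=2)
```

Note that a well-connected word ranks high even if its raw frequency is modest, which is the abstract's point about non-high-frequency keywords.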

7.
Research on a Chinese Keyword Extraction Algorithm Based on a Separated Model
Keyword extraction plays an important role in automatic summarization, information retrieval, text classification, text clustering, and related tasks. A considerable share of what are commonly called keywords are in fact key phrases and out-of-vocabulary words, and extracting these is a very difficult problem. This paper divides keyword extraction into two subproblems, key single-word extraction and key word-string extraction, and designs a Chinese keyword extraction algorithm based on a separated model, with different features designed for each of the two subproblems to improve extraction accuracy. Experiments show that the separated-model algorithm outperforms traditional keyword extraction algorithms.

8.
This paper presents an integrated approach to spotting spoken keywords in digitized Tamil documents by combining word-image matching and spoken-word recognition techniques. The work involves segmenting document images into words, creating an index of keywords, and constructing a word-image hidden Markov model (HMM) and a speech HMM for each keyword. The word-image HMMs, built from seven-dimensional profile and statistical moment features, recognize a segmented word image for possible inclusion of the keyword in the index. The spoken query word is recognized by maximum likelihood over the speech HMMs, using 39-dimensional mel-frequency cepstral coefficients derived from speech samples of the keywords. During word spotting, the positional details of the search keyword, obtained from the automatically updated index, retrieve the relevant portion of text from the document. Recall, precision, and F-measure are reported for 40 test words from four groups of literary documents to illustrate the ability of the proposed scheme and highlight its worth in the emerging multilingual information retrieval scenario.

9.
陈伟鹤  刘云 《计算机科学》2016,43(12):50-57
Keyword extraction from Chinese text is a difficult problem in natural language processing research. Most keyword extraction research at home and abroad targets English text and does not transfer to Chinese, and the existing algorithms for Chinese text are mostly suited to long documents. The focus of this work is how to accurately extract words or phrases that carry real meaning and are closely related to the topic from a short Chinese text. A keyword extraction algorithm for Chinese text based on the length and frequency of words or phrases is proposed: it first extracts the words or phrases that appear with higher frequency in the text, then computes a weight from their length and their frequency of occurrence, and selects the keywords or key phrases accordingly. The algorithm can accurately pick out the relatively important words or phrases from Chinese text, and thereby captures the topic of a short Chinese text quickly and accurately. Experimental results show that, compared with other existing algorithms, the proposed algorithm is applicable to Chinese text and achieves higher accuracy.
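One plausible reading of the length-and-frequency weighting is a normalized linear combination; the mixing rule below (parameter `alpha`) is an assumption, since the abstract only states that both length and frequency feed into the weight:

```python
def length_freq_weight(candidates, alpha=0.5):
    """candidates: dict mapping candidate word/phrase -> frequency.
    Weight = alpha * normalized length + (1 - alpha) * normalized frequency."""
    max_len = max(len(w) for w in candidates)
    max_f = max(candidates.values())
    return {w: alpha * len(w) / max_len + (1 - alpha) * f / max_f
            for w, f in candidates.items()}

# toy candidates from a short Chinese text: frequency alone would pick 网络,
# but the length term promotes the longer, more topical phrase
cands = {"神经网络": 3, "网络": 5, "深度神经网络": 2}
weights = length_freq_weight(cands)
```

The example shows the intended effect: a longer, more specific phrase can outrank a shorter, more frequent one.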

10.
Keyword extraction is widely used in natural language processing, and extracting keyword information from text accurately and quickly has become a key problem in text processing. Many keyword extraction methods exist, but their accuracy and generality leave room for improvement. An improved TextRank keyword extraction method is therefore proposed: the TF-IDF method and the average information entropy method are used to compute the importance of words in the text, and the results are combined into a comprehensive word weight. This comprehensive weight modifies the initial node values and the node transition-probability matrix of the TextRank algorithm; the weight of each node is then computed iteratively until convergence, yielding the word weights, and the top-N words are output as the extracted keywords. Experimental results show that, compared with the traditional TF-IDF and TextRank methods, the improved TextRank method is more general and extracts keywords with higher accuracy.
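The improvement described above can be sketched as a TextRank power iteration in which an external word weight (standing in for the TF-IDF-plus-entropy combination, whose exact formula is not given in the abstract) biases both the initial node values and the transition probabilities:

```python
from collections import defaultdict

def weighted_textrank(words, weight, window=2, d=0.85, iters=100):
    """TextRank over a sliding-window co-occurrence graph; `weight` maps a
    word to its external importance (default 1.0) and biases both the
    initial scores and the transition matrix."""
    adj = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                adj[w].add(words[j])
                adj[words[j]].add(w)
    vocab = sorted(adj)
    total = sum(weight.get(v, 1.0) for v in vocab)
    score = {v: weight.get(v, 1.0) / total for v in vocab}   # weighted init
    for _ in range(iters):
        score = {v: (1 - d) / len(vocab) + d * sum(
                        score[u] * weight.get(v, 1.0) /
                        sum(weight.get(x, 1.0) for x in adj[u])  # weighted transition
                        for u in adj[v])
                 for v in vocab}
    return sorted(vocab, key=score.get, reverse=True)

# "b" is both well connected and externally heavy, so it should rank first
ranking = weighted_textrank(["a", "b", "a", "b", "c"], {"b": 3.0}, window=1)
```

With all weights equal to 1.0 this reduces to plain TextRank, which is why the comprehensive weight can be swapped in without changing the iteration.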

11.
In e-commerce, choosing the right words matters: without them a product is easily lost on a huge platform, and the right keywords play an important role in advertising one's products there. The wording of a listing determines its position on search-engine result pages, so to be visible and well placed a seller needs to know the common keywords under which competitors sell. This paper proposes a keyword selection algorithm tailored to keyword bidding in the e-commerce domain. Such research has strong practical value: because of the complexity of the keyword auction mechanism, its incomplete-information character, and advertisers' often shallow understanding of keywords, bidding strategies for keyword auction advertising frequently go wrong. This paper redefines the keyword selection algorithm to fit the textual characteristics of the e-commerce domain, deriving recommended keywords from product title text. The process first builds a hierarchical, industry-specific lexicon, and then performs keyword recommendation against that lexicon in two steps: seed keyword extraction and recommended keyword expansion.

12.
This study investigated eighth graders' (14-year-olds) web searching strategies and outcomes, and analyzed their correlations with students' web experience, epistemological beliefs, and the nature of the search tasks. Eighty-seven eighth graders completed a questionnaire probing epistemological beliefs (from positivist to constructivist-oriented views) and three different types of search tasks. Their searching process was recorded by screen-capture software, and their answers were reviewed by two expert teachers for accuracy, richness, and soundness. Five quantitative indicators were used to assess students' searching strategies: number of keywords, visited pages, maximum depth of exploration, keyword refinement, and number of words in the first keyword. The main findings suggest that students with richer web experience found more correct answers in “close-ended” search tasks, and that students with better metacognitive skills, such as keyword refinement, tended to achieve more successful searching outcomes in such tasks. However, in “open-ended” tasks, where questions were less certain and answers more elaborated, students with more advanced epistemological beliefs, concurring with a constructivist view, produced searching outcomes that were sounder and richer. The study concludes that epistemological beliefs play an influential role in open-ended Internet learning environments.

13.
Social tagging is the process by which many users add metadata in the form of keywords to annotate and categorize items (songs, pictures, web links, products, etc.). Social tagging systems (STSs) can provide three types of recommendations: 1) tags to users, based on the tags other users have used for the same items; 2) items to users, based on tags they share with other similar users; and 3) users with common social interests, based on common tags on similar items. However, users may have different interests in an item, and items may have multiple facets. In contrast to current recommendation algorithms, our approach develops a unified framework to model the three types of entities that exist in a social tagging system: users, items, and tags. These data are modeled by a 3-order tensor, on which multiway latent semantic analysis and dimensionality reduction are performed using both the Higher Order Singular Value Decomposition (HOSVD) method and the Kernel-SVD smoothing technique. We experimentally compare the proposed method against state-of-the-art recommendation algorithms on two real datasets (Last.fm and BibSonomy). Our results show significant improvements in effectiveness measured through recall/precision.
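The tensor-decomposition step can be sketched with a plain truncated HOSVD in NumPy; the Kernel-SVD smoothing step mentioned in the abstract is omitted, and the demo tensor is random rather than real (user, item, tag) usage data:

```python
import numpy as np

def hosvd(T, ranks):
    """Truncated HOSVD of a 3-order tensor: the left singular vectors of
    each mode unfolding give the factor matrices; the core is T contracted
    with them."""
    factors = []
    for mode in range(3):
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :ranks[mode]])
    core = T
    for U in factors:
        # contracting mode 0 each time cycles the modes, so after three
        # contractions the original mode order is restored
        core = np.tensordot(core, U, axes=([0], [0]))
    return core, factors

def reconstruct(core, factors):
    X = core
    for U in factors:
        X = np.tensordot(X, U.T, axes=([0], [0]))
    return X

rng = np.random.default_rng(0)
T = rng.normal(size=(4, 3, 2))               # (users, items, tags) toy tensor
core, factors = hosvd(T, (4, 3, 2))          # full ranks: exact reconstruction
```

In the recommendation setting, truncating the ranks and reading large entries of the reconstructed tensor yields the (user, item, tag) triples to recommend.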

14.
Mining the interests of Chinese microbloggers via keyword extraction
Microblogging provides a new platform for communicating and sharing information among Web users. Users can express opinions and record daily life using microblogs. Microblogs that are posted by users indicate their interests to some extent. We aim to mine user interests via keyword extraction from microblogs. Traditional keyword extraction methods are usually designed for formal documents such as news articles or scientific papers. Messages posted by microblogging users, however, are usually noisy and full of new words, which is a challenge for keyword extraction. In this paper, we combine a translation-based method with a frequency-based method for keyword extraction. In our experiments, we extract keywords for microblog users from the largest microblogging website in China, Sina Weibo. The results show that our method can identify users’ interests accurately and efficiently.

15.

Online privacy policies are known to have inconsistent formats and incomplete content. They are also hard to understand and do not effectively help individuals make decisions about the data practices of online service providers. Several studies have focused on deficiencies of privacy policies such as length and readability, but only a very limited number have explored their content. This paper aims to shed some light on the content of these legal documents. To this end, we performed a comprehensive analysis of the keywords and content of over 2000 online policies, collected from a variety of websites, application domains, and regulatory regimes. Topic modeling algorithms, such as Latent Dirichlet Allocation, were used for topic-coverage analysis. The study also measured the coverage of ambiguous words in privacy policies, and a method was used to evaluate keyword similarity between privacy policies belonging to different regulatory frameworks or applications. The findings suggest that regulations have an impact on the terminology used in privacy policies. The results also suggest that European policies use fewer ambiguous words but more words such as “cookie,” in compliance with the regional regulations. We also observed that the seed keywords extracted for each section of the privacy policies were used consistently in all policies, regardless of application domain and regulations.

16.
Text summarization and keyword extraction are two important research topics in natural language processing (NLP); both generate concise information describing the gist of a text. Although the two tasks have a similar objective, they are usually studied independently, and their association is seldom considered. Based on graph-based ranking methods, several collaborative extraction methods have been proposed that capture the associations between sentences, between words, and between sentences and words. Although they generate both a text summary and keywords in an iteratively reinforced framework, most existing models are limited to expressing various binary relations between sentences and words, ignoring a number of potentially important higher-order relationships among text units. In this paper, we propose a new collaborative extraction method based on a hypergraph: sentences are modeled as hyperedges and words as vertices, and the summary and keywords are then generated by exploiting higher-order information from sentences and words within the unified hypergraph. Experiments on the Weibo-oriented Chinese news summarization task at NLPCC 2015 demonstrate that the proposed method is feasible and effective.
Keywords: hypergraph; document summarization; keyword extraction; collaborative extraction


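A minimal sketch of mutual reinforcement on a sentence-word hypergraph: sentences are hyperedges whose scores are the summed scores of their words, word scores are the summed scores of the sentences containing them, and both are normalized each round. This HITS-style update rule is an assumption; the paper's exact formulation is not given in the abstract:

```python
def hypergraph_rank(sentences, iters=50):
    """sentences: list of word lists (each sentence is a hyperedge).
    Returns (word scores, sentence scores), each summing to 1."""
    vocab = sorted({w for s in sentences for w in s})
    w_score = {w: 1.0 for w in vocab}
    for _ in range(iters):
        # sentence score <- sum of its word scores, normalized
        s_score = [sum(w_score[w] for w in s) for s in sentences]
        norm = sum(s_score)
        s_score = [x / norm for x in s_score]
        # word score <- sum of scores of sentences containing it, normalized
        w_score = {w: sum(sc for s, sc in zip(sentences, s_score) if w in s)
                   for w in vocab}
        norm = sum(w_score.values())
        w_score = {w: x / norm for w, x in w_score.items()}
    return w_score, s_score

sents = [["summary", "keyword"], ["summary", "graph"], ["graph"]]
w, s = hypergraph_rank(sents)
```

Top-scoring words become the keywords and top-scoring hyperedges become the summary sentences, which is the collaborative part of the method.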

17.
For the problem of classifying Uyghur-language text, a Uyghur keyword extraction and text classification method based on the TextRank algorithm and mutual-information similarity is proposed. First, the input text is preprocessed to filter out non-Uyghur characters and stop words. Then, a TextRank algorithm weighted by word semantic similarity, word position, and term-frequency importance extracts the text's keyword set. Finally, under a mutual-information similarity measure, the similarity between the input text's keyword set and each class's keyword set is computed, yielding the final classification. Experimental results show that the scheme extracts keywords with high discriminative power; with a keyword-set size of 1250, the average classification rate reaches 91.2%.

18.
Recommendation systems aim to recommend items, or packages of items, that are likely to be of interest to users. Previous work on recommendation systems has mostly focused on recommending points of interest (POIs): identifying and suggesting top-k items or packages that meet selection criteria and satisfy compatibility constraints on the items in a package, where the (packages of) items are ranked by their usefulness to the users. In contrast to prior work, this paper investigates two issues beyond POI recommendation that are also important to recommendation systems. When there are not sufficiently many POIs to recommend, we propose (1) query relaxation recommendation, to help users revise their selection criteria, or (2) adjustment recommendation, to guide recommendation systems in modifying their item collections so that the users' requirements can be satisfied. We study two related problems: deciding (1) whether the query expressing the selection criteria can be relaxed to a limited extent, and (2) whether a bounded number of items can be updated so that users get the desired recommendations. We establish matching upper and lower bounds for these problems, for both combined and data complexity, when selection criteria and compatibility constraints are expressed in a variety of query languages, for both item recommendation and package recommendation. To understand where the complexity comes from, we also study the impact of variable package sizes, compatibility constraints, and selection criteria on these analyses. Our results indicate that in most cases the complexity bounds of query relaxation and adjustment recommendation are comparable to their counterparts for the basic recommendation problem of testing whether a given set of items (resp. packages) constitutes the top-k items (resp. packages). In other words, extending recommendation systems with query relaxation and adjustment recommendation functionality typically incurs no extra overhead.

19.
Most current research separates the selection of feature terms in the vector space model from the computation of their weights, which masks the semantic loss introduced by Chinese word segmentation and reduces the discriminative power of the feature terms. To address this, a keyword extraction method combining statistics and rules is proposed: syntactic rules are used to extract base phrases, which replace single words in the bag-of-words model, and feature-term weights are computed comprehensively from information such as term position, distribution, and grammatical role. Experimental results show that, compared with existing methods, the proposed method filters text information more effectively.

20.
Traditional recommendation algorithms mostly optimize the accuracy of the recommendation list while neglecting another important metric of recommendation: diversity. A new method for improving the diversity of recommendation lists is proposed. The method recasts list generation as N rounds of probabilistic selection, each round consisting of two steps: type selection and item selection. In type selection, item type information is introduced; a probability matrix is computed from the user's preferences over the different item types, and a type is sampled according to that matrix. In item selection, an item's final score is recomputed from three factors, its predicted rating, its historical popularity, and its recommendation popularity, and the highest-scoring item is recommended to the user. A threshold TR tunes the trade-off between diversity and accuracy. Finally, comparative experiments demonstrate the effectiveness of the method.
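The two-step probabilistic selection can be sketched as below: each round samples a type (genre) from the user's preference distribution and then takes the best-scored remaining item of that type. Plain predicted ratings stand in for the paper's combined three-factor score, and the threshold TR is omitted:

```python
import random

def diversified_list(candidates, genre_pref, n, seed=0):
    """candidates: item -> (genre, predicted score);
    genre_pref: genre -> positive preference weight."""
    rng = random.Random(seed)
    pool = dict(candidates)
    chosen = []
    while len(chosen) < n and pool:
        # step 1: sample a genre still represented in the pool
        avail = [g for g in genre_pref
                 if any(gi == g for gi, _ in pool.values())]
        g = rng.choices(avail, weights=[genre_pref[a] for a in avail])[0]
        # step 2: best-scored remaining item of that genre
        best = max((i for i in pool if pool[i][0] == g),
                   key=lambda i: pool[i][1])
        chosen.append(best)
        del pool[best]
    return chosen

items = {"m1": ("action", 4.5), "m2": ("action", 4.0),
         "m3": ("comedy", 3.9), "m4": ("drama", 3.5)}
recs = diversified_list(items, {"action": 0.5, "comedy": 0.3, "drama": 0.2}, 3)
```

Because genres are sampled rather than always taking the globally best item, lower-preference genres still appear in the list with nonzero probability, which is where the diversity comes from.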


Copyright©北京勤云科技发展有限公司  京ICP备09084417号