Similar Literature
 19 similar documents found; search time 62 ms
1.
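This paper analyzes the hierarchical structure of the well-known Yahoo site and introduces a machine learning technique for a document classifier built on that hierarchy. The hierarchical learning algorithm is used to automatically construct the background knowledge for each domain of the classifier.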

2.
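Based on concept category attributes, a tea domain ontology was built on the Protege platform, and a DocOnto text classifier based on that ontology was implemented. Classification experiments on tea, wine, and pizza documents were run on this classifier and compared with the results of a naive Bayes classifier. The results show that, at comparable overall precision, the DocOnto classifier effectively improves recall and achieves a higher F1 score.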

3.
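Web Document Classification Based on Association Rules   Total citations: 5, self-citations: 2, citations by others: 5
Among existing Web document classifiers, some produce relatively accurate classification results and others produce more interpretable classification models, but no classifier yet combines both advantages. In view of this, the paper proposes a Web document classification method based on association rules. The method adopts the notion of a transaction and addresses two main problems: (1) discovering the best term association rules in the training document set; (2) using these rules to build a Web document classifier. Experiments show that the classifier performs well, trains quickly, produces rules that are easy for people to understand, and is easy to update and adjust.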
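
The two steps named in this abstract (mine term association rules from training documents, then classify with them) can be illustrated with a minimal sketch. It is not the paper's algorithm: treating each document as a transaction of terms, the support and confidence thresholds, the rule ordering, and the function names `mine_rules` and `classify_by_rules` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def mine_rules(train, min_support=2, min_conf=0.8, max_len=2):
    """Mine rules of the form (term set -> class label).

    `train` is a list of (set_of_terms, label) pairs; each document is
    treated as one transaction. All thresholds are illustrative assumptions.
    """
    itemset_count = Counter()   # how many documents contain the term set
    rule_count = Counter()      # how many of those carry a given label
    for terms, label in train:
        for k in range(1, max_len + 1):
            for items in combinations(sorted(terms), k):
                itemset_count[items] += 1
                rule_count[(items, label)] += 1
    rules = []
    for (items, label), n in rule_count.items():
        conf = n / itemset_count[items]
        if n >= min_support and conf >= min_conf:
            rules.append((items, label, n, conf))
    # Prefer higher confidence, then higher support (an assumed ordering).
    rules.sort(key=lambda r: (-r[3], -r[2], r[0]))
    return rules

def classify_by_rules(terms, rules, default=None):
    """Return the label of the first rule whose term set the document contains."""
    terms = set(terms)
    for items, label, _, _ in rules:
        if terms.issuperset(items):
            return label
    return default
```

Because every rule is a readable pair of a term set and a label, the resulting model stays easy to inspect and to update, which is the property the abstract emphasizes.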

4.
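Li Daosheng, Zhao Qiang. Computer Engineering (计算机工程), 2006, 32(12): 208-209, 228
This paper presents a preliminary design of a focused Web crawler based on context graphs. It describes the vector space model used for text learning in the NB classifier (the Bernoulli model) and the design of the naive Bayes classifier. It proposes a simplified priority-ordering scheme for the front-end queues: the cosine similarity between the normalized vector of a downloaded document and the query vector serves as the ranking criterion for downloaded documents within a layer, so that this ranking can be compared with the ranking by class likelihood score of the documents in each layer's queue. It also outlines an idea for automatically integrating crawl results with a topic directory.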
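
The ranking criterion described in this abstract, cosine similarity between a normalized document vector and the query vector, can be sketched as follows. The sparse dictionary representation and the names `normalize`, `cosine`, and `rank_layer` are assumptions for illustration; the crawler's actual data structures are not given in the abstract.

```python
import math

def normalize(vec):
    """Scale a sparse term-weight vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}

def cosine(doc, query):
    """Cosine similarity of two sparse vectors, computed on normalized copies."""
    d, q = normalize(doc), normalize(query)
    return sum(w * q.get(term, 0.0) for term, w in d.items())

def rank_layer(docs, query):
    """Order the document names of one layer by descending similarity to the query."""
    return sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)
```

A queue ordered this way can then be compared against the same queue ordered by class likelihood score, which is the comparison the abstract describes.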

5.
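To meet the practical need for accurate, multi-level automatic classification and management of large volumes of electronic documents, a hierarchical classification method based on multiple feature selection and multi-classifier fusion is proposed. A confidence function is introduced to evaluate the output of a single classifier, and auxiliary classifiers are brought in when needed to vote on documents that are hard to classify. Experimental results show that, compared with a single classifier, the method achieves better classification accuracy on both flat and hierarchical classification corpora, has favorable time complexity, and has good prospects for practical application.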

6.
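Automatic text classification means having a computer determine, under a given classification scheme, the categories associated with a text based on its content. Most existing text classification algorithms are based on the vector space model and therefore cannot fully express the semantic feature information of documents, which degrades classifier performance. To address this problem, this paper constructs a similarity matrix from the training documents, obtains the topic information of each category from it, builds a classifier on that basis, and finally combines it with classical classifiers to determine the text category. Experiments show that the proposed method substantially improves classifier performance.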

7.
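He Li, Liu Jun. Computer Engineering (计算机工程), 2006, 32(20): 4-6
A naive Bayes (NB) document classification method based on concept feature vectors is proposed. On an unlabeled document set, the method uses SOM (Self-Organizing Maps) clustering to produce a number of initial document classes, assigns a class label to each, and builds a concept feature vector for each document class using the maximum information entropy method. The final document classifier, CFB-NB, is then built on the concept feature vector space.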

8.
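Web Text Classification and Its Blocking Reduction Strategies   Total citations: 1, self-citations: 0, citations by others: 1
In Web mining, classifying Web documents by content is a crucial step. A common approach to Web document classification is hierarchical classification, which assigns a document top-down to the appropriate category in a category tree. However, hierarchical classification often suffers from blocking: a document is wrongly rejected by a classifier in the upper levels of the category tree. To address this, a classifier-centric blocking factor is used to measure the degree of blocking, and two new hierarchical classification methods are introduced, one based on threshold reduction and one based on restricted voting, to reduce the wrongful blocking of documents in Web document classification.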
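
Blocking and the idea behind threshold reduction can be shown with a small sketch of top-down classification. Everything concrete here is an assumption: the keyword-overlap scorer, the `Node` structure, and the single `internal_scale` factor that lowers the thresholds of internal (upper-level) classifiers. The paper's actual blocking factor and methods are not specified in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Node:
    name: str
    score: Callable[[Set[str]], float]      # classifier score for a document
    threshold: float = 0.5                  # acceptance threshold at this node
    children: List["Node"] = field(default_factory=list)

def keyword_scorer(keywords):
    """Toy classifier: fraction of the node's keywords found in the document."""
    kws = set(keywords)
    def score(doc_terms):
        return len(kws & doc_terms) / len(kws)
    return score

def classify_top_down(doc_terms, node, internal_scale=1.0):
    """Return the leaf categories reached from `node`.

    internal_scale < 1 lowers the thresholds of internal nodes only, a
    simplified stand-in for threshold reduction.
    """
    is_leaf = not node.children
    threshold = node.threshold if is_leaf else node.threshold * internal_scale
    if node.score(doc_terms) < threshold:
        return []   # rejected here; at an internal node the whole subtree is blocked
    if is_leaf:
        return [node.name]
    labels = []
    for child in node.children:
        labels.extend(classify_top_down(doc_terms, child, internal_scale))
    return labels
```

In this toy setup a document about football that shares only one keyword with the parent "sports" node is blocked at the default threshold, even though the "football" leaf would have accepted it; lowering the internal threshold lets it through.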

9.
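Single-Character Chinese Font Recognition Based on Stroke Features   Total citations: 1, self-citations: 0, citations by others: 1
In the automatic analysis, understanding, and recognition of text during document digitization, font recognition must be addressed in addition to recognizing the characters that make up the document content. Font recognition is not only an important basis for layout analysis, understanding, and restoration; it also helps in building high-performance character recognition systems. Unlike current font recognition methods that operate on text blocks composed of multiple characters, this paper proposes a font recognition method based on a single Chinese character. Two types of features are extracted from the single character: stroke attribute features and stroke distribution features. Each feeds its own classifier for font recognition on the single character, and the results of the two classifiers are combined to obtain the final recognition result. The stroke attribute classifier is text-independent, whereas the stroke distribution classifier is text-dependent, so the combined classifier is a text-dependent font recognition classifier. Tests on a sample set containing 7 fonts show that the single-character font recognition rate reaches 94.48%.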

10.
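A Text Classification Algorithm Based on Hidden Markov Models   Total citations: 2, self-citations: 0, citations by others: 2
Yang Jian, Wang Haihang. Journal of Computer Applications (计算机应用), 2010, 30(9): 2348-2350
In recent years the field of automatic text classification has produced a number of mature classification algorithms, but these are mainly based on probabilistic and statistical models and make no connection to the syntax and semantics of the text itself. This paper proposes an algorithm that applies the hidden Markov model (HMM) for sequence analysis to automatic text classification. It first constructs a set of feature words representing each document category and uses the feature-word sequence of a document category as the observation sequence of the corresponding HMM classifier, while the HMM's state transition sequence implicitly represents how the content of documents in that category is formed and evolves. At classification time, the class label of the HMM classifier with the highest generation probability is taken as the classification result for the test document. The classifier models built by this algorithm reflect, to some extent, the syntactic and semantic characteristics of documents in different categories, support multi-class automatic text classification, and classify with relatively high efficiency.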
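
The decision rule in this abstract, picking the class whose HMM assigns the highest generation probability to the observed feature-word sequence, can be sketched with the standard forward algorithm. The discrete emission tables, the small floor probability for unseen words, and the names `forward_likelihood` and `classify_hmm` are assumptions; how the paper trains its models is not described here.

```python
def forward_likelihood(obs, start, trans, emit, floor=1e-6):
    """P(obs | model) for a discrete HMM, via the forward algorithm.

    start[s]    : initial probability of state s
    trans[p][s] : probability of moving from state p to state s
    emit[s]     : dict mapping a word to its emission probability in state s
    floor       : assumed probability for words a state never emits
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(start)
    alpha = [start[s] * emit[s].get(obs[0], floor) for s in range(n)]
    for word in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s].get(word, floor)
            for s in range(n)
        ]
    return sum(alpha)

def classify_hmm(obs, models):
    """Return the label whose HMM gives the observation sequence the highest likelihood.

    `models` maps a class label to a (start, trans, emit) tuple.
    """
    return max(models, key=lambda label: forward_likelihood(obs, *models[label]))
```

For long documents the products above underflow, so a practical version would work with log probabilities or per-step scaling; that refinement is omitted to keep the sketch short.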

11.
Fang X  Rau PL 《Ergonomics》2003,46(1-3):242-254
Two experiments were carried out to examine the effects of cultural differences between Chinese and US users on the perceived usability and search performance of World Wide Web (WWW) portal sites. Chinese users in Taiwan and US users in Chicago were recruited to perform search tasks on two versions of the Yahoo! portal site: the standard Yahoo! and Yahoo! Chinese. The layout of Yahoo! Chinese is the same as the layout of Yahoo!, and the categories on Yahoo! Chinese have been translated from its US counterpart. A special browser was programmed to record all keystroke data, and participants were asked to fill out a satisfaction questionnaire after finishing the tasks. Significant differences in satisfaction and in the number of steps needed to perform some tasks were found between the two groups. The experimental results also provided more detailed insights into the cultural differences between Chinese and US users.

12.
《Ergonomics》2012,55(1-3):242-254
Two experiments were carried out to examine the effects of cultural differences between Chinese and US users on the perceived usability and search performance of World Wide Web (WWW) portal sites. Chinese users in Taiwan and US users in Chicago were recruited to perform search tasks on two versions of the Yahoo! portal site: the standard Yahoo! and Yahoo! Chinese. The layout of Yahoo! Chinese is the same as the layout of Yahoo!, and the categories on Yahoo! Chinese have been translated from its US counterpart. A special browser was programmed to record all keystroke data, and participants were asked to fill out a satisfaction questionnaire after finishing the tasks. Significant differences in satisfaction and in the number of steps needed to perform some tasks were found between the two groups. The experimental results also provided more detailed insights into the cultural differences between Chinese and US users.

13.
To stay vibrant, community question-answering platforms need to link questions, especially open ones, with members who might be interested in answering them or in viewing their content. For establishing this connection, demographic aspects of the asker (e.g., gender, age, interests and location) play a pivotal role. However, this information is often incomplete or unavailable, since it is optional for community members to provide it. This paper studies discriminant learning for guessing one of these demographic facets: the gender of an asker. In so doing, it capitalizes on a large-scale corpus automatically constructed from the integration of Yahoo! Search and Yahoo! Answers profiles. This corpus is then used to examine the impact of numerous features extracted from assorted sources: texts, demographics, meta-data, social interactions and web search. In brief, good non-linguistic gender indicators were age, industry and second-level question categories. If these are inaccessible, our outcomes indicate that models can still infer them, to some extent, from textual sources by means of semantic analysis and dependency relations. Overall, our best configuration reached an accuracy of 74.50%.

14.
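Multi-label data contain many redundant features and much redundant data. To deal with redundant and irrelevant features in multi-label data and improve the generalization ability of multi-label learning algorithms, a simulated-annealing-based feature selection method of the convolutional type (卷积式), SAML (simulated annealing based feature selection for multi-label data), is proposed. Existing algorithms use only a genetic algorithm for optimization; the new algorithm uses simulated annealing to search for the optimal feature subset and, in existing work, has shown better results than the genetic algorithm. Experiments on the publicly evaluated Yahoo web page classification datasets show that SAML outperforms several recently proposed, popular multi-label feature selection methods.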
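
As a rough illustration of searching for a feature subset with simulated annealing, the sketch below flips one feature in or out per step and accepts worse subsets with a probability that shrinks as the temperature cools. It is not the SAML algorithm: the single-flip neighborhood, the geometric cooling schedule, the parameter defaults, and the caller-supplied `score` function (which in practice would wrap a multi-label learner evaluated on validation data) are all assumptions.

```python
import math
import random

def anneal_select(n_features, score, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Search for a high-scoring feature mask with simulated annealing.

    score(mask) must return a number to maximize, where mask is a list of
    booleans of length n_features. Returns (best_mask, best_score).
    """
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    cur_score = score(current)
    best, best_score = current[:], cur_score
    t = t0
    for _ in range(steps):
        cand = current[:]
        i = rng.randrange(n_features)
        cand[i] = not cand[i]               # neighbor: flip one feature
        cand_score = score(cand)
        delta = cand_score - cur_score
        # Always accept improvements; accept worse moves with prob. exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = current[:], cur_score
        t *= cooling                        # geometric cooling (assumed schedule)
    return best, best_score
```

The early high-temperature phase lets the search leave poor subsets that a greedy method would get stuck in, while the late low-temperature phase behaves almost like hill climbing.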

15.
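Six evaluation metrics that together measure the performance of a search engine are proposed. Three mainstream search engines (Google, Yahoo, and Baidu) were selected for evaluation, and the complete automatic evaluation system was successfully implemented on two large datasets. Experiments show that Google's performance is the most stable; the first result returned by Yahoo best satisfies the user's need but is affected by time factors; Baidu is clearly affected by the category of the keywords. Finally, Zhongsou (中搜), 狗 [sic], and iAsk (爱问) were also evaluated.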

16.
Wang Min, Feng Tingting, Shan Zhaohui, Min Fan 《Applied Intelligence》2022,52(10):11131-11146

In multi-label learning, each instance is simultaneously associated with multiple class labels. A large number of labels in an application exacerbates the problem of label scarcity. An interesting issue concerns how to query as few labels as possible while obtaining satisfactory classification accuracy. For this purpose, we propose the attribute and label distribution driven multi-label active learning (MCAL) algorithm. MCAL considers the characteristics of both attributes and labels to enable the selection of critical instances based on different measures. Representativeness is measured by the probability density function obtained by non-parametric estimation, while informativeness is measured by the bilateral softmax predicted entropy. Diversity is measured by the distance metric among instances, and richness is measured by the number of softmax predicted labels. We describe experiments performed on eight benchmark datasets and eleven real Yahoo webpage datasets. The results verify the effectiveness of MCAL and its superiority over state-of-the-art multi-label algorithms and multi-label active learning algorithms.
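
The informativeness measure mentioned in this abstract is built on the entropy of softmax predictions. The sketch below shows only the plain Shannon entropy of a softmax distribution and uses it to pick the most uncertain instance from a pool; the "bilateral" variant used by MCAL is not defined in the abstract and is not reproduced here. The function names and the pool format (instance name mapped to raw scores) are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_informative(pool):
    """Return the instance whose softmax prediction has the highest entropy."""
    return max(pool, key=lambda name: entropy(softmax(pool[name])))
```

An instance with nearly uniform predicted scores has high entropy and is therefore a natural candidate for querying its labels, whereas a confidently predicted instance adds little.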


17.
Do acquisitions lead to instrumental innovations related to the acquired knowledge? Past arguments on vertical integration espouse how a quest for knowledge drives acquisitions culminating in innovation performance. Using Google and Yahoo as cases in point, we examine how facets of acquired innovation knowledge impact post-innovation performance. In particular, the apparently opposing fortunes of Google and Yahoo allow us to investigate the pace of their innovation performance using a hazards model. Results from our investigation highlight Google’s ambidexterity over Yahoo, with a swifter, more systematic pace of innovation performance, from shortening the time to patent new ideas to shortening the time to release new applications from acquisitions.

18.
This paper investigates the composition of search engine results pages. We define which elements the most popular web search engines use on their results pages (e.g., organic results, advertisements, shortcuts) and to what degree they are used for popular vs. rare queries. To this end, we send 500 queries of both types to the major search engines Google, Yahoo, Live.com and Ask, and count how often the different elements are used by the individual engines. In total, our study is based on 42,758 elements. Findings include that search engines use quite different approaches to results page composition, so the user gets to see quite different result sets depending on the search engine and search query used. Organic results still play the major role on the results pages, but different shortcuts are of some importance, too. Regarding the frequency of certain hosts within the result sets, we find that all search engines show Wikipedia results quite often, while the other hosts shown depend on the search engine used. Both Google and Yahoo prefer results from their own offerings (such as YouTube or Yahoo Answers). Since we used the .com interfaces of the search engines, the results may not be valid for other country-specific interfaces.

19.
Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interest. Considering the hierarchical relationship among categories raises several additional issues in the development of methods for automated document classification. These concern the representation of documents, the learning process, the classification process and the evaluation criteria for experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework in which the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document in a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.
