Similar Documents
20 similar documents found.
1.
When searching or browsing documents, the genre of a document is an important consideration that complements topical characterization. We examine design considerations for automatic tagging of office document pages with genre membership. These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method to improve the performance of genre identification. Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. Our experiments show that when combined with image-based features, text-based features do not significantly influence performance. These results provide support for a topic-independent approach to identification of coarse office document genres. Experiments also show that our simple ensemble method significantly improves performance relative to using a support vector machine (SVM) classifier alone. We demonstrate the utility of our approach by integrating our automatic genre tags in a faceted search and browsing application for office document collections.
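The abstract does not specify the ensemble rule, so as an illustration only, a minimal majority-vote combination of per-feature-set classifiers (e.g. one trained on image-based and one on text-based features) might look like this:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine genre predictions from several classifiers by majority vote.

    `predictions` holds one label per classifier for a single page, e.g.
    the outputs of classifiers trained on different feature sets.
    Ties are broken by first occurrence.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# One page scored by three hypothetical base classifiers:
labels = ["slide", "slide", "table"]
print(majority_vote(labels))  # slide
```

This is a sketch of ensemble voting in general, not the specific combination method evaluated in the paper.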

2.
As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently input into a machine learning algorithm that performs classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information.

3.
Recently, genre collection and automatic genre identification for the web have attracted much attention. However, there is currently no genre-annotated corpus of web pages for which inter-annotator reliability has been established; the existing corpora are either not tested for inter-annotator reliability or exhibit low inter-coder agreement. Annotation has also mostly been carried out by a small number of experts, raising concerns about the scalability of these annotation efforts and the transferability of the schemes to annotators outside these small expert groups. In this paper, we tackle these problems by using crowd-sourcing for genre annotation, leading to the Leeds Web Genre Corpus—the first web corpus which is demonstrably reliably annotated for genre and which can be easily and cost-effectively expanded using naive annotators. We also show that the corpus is source and topic diverse.

4.
5.
Given a specific information need, documents of the wrong genre can be considered as noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic errors represent a second, finer notion of noise. Since specific genres often include documents with many errors, an interesting question is whether this “micro-noise” can help to classify genre. In this paper we consider both problems. After introducing a comprehensive hierarchy of genres, we present an intuitive method to build specialized and distinctive classifiers that also work for very small training corpora. Special emphasis is given to the selection of intelligent high-level features. We then investigate the correlation between genre and micro-noise. Using special error dictionaries, we estimate the typical error rates for each genre. Finally, we test if the error rate of a document represents a useful feature for genre classification.
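The error-rate feature the abstract describes can be sketched as a dictionary lookup. This is a simplification: the paper uses special error dictionaries, which are approximated here by a plain word list (an assumption for illustration).

```python
def error_rate(tokens, dictionary):
    """Fraction of tokens not found in a reference dictionary.

    A crude proxy for the orthographic 'micro-noise' of a document;
    `dictionary` is an ordinary word set standing in for the paper's
    error dictionaries (illustrative assumption).
    """
    if not tokens:
        return 0.0
    errors = sum(1 for t in tokens if t.lower() not in dictionary)
    return errors / len(tokens)

dictionary = {"the", "cat", "sat", "on", "mat"}
print(error_rate(["The", "cat", "zat", "on", "teh", "mat"], dictionary))  # ≈0.333
```

Per the paper's hypothesis, this scalar would then be appended to the genre classifier's feature vector.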

6.
Web sites are a ubiquitous Internet genre employed by student organizations. This article investigates the role of a web site in an Interfraternity Council at a large midwestern university in the United States. The web site is examined through the work of Anthony Giddens, specifically his structuration theory, and the recent research on ITexts. In turn, the composing process required for such IText creation and maintenance is considered in light of the complicated network of forces and restraints surrounding the Interfraternity Council and the web site. By positioning the web site as an IText, the article revisits the field's understanding of genre as well as the knowledge creation surrounding such genres. Ultimately, the article contends that it may be in everyday (I)texts, such as organization web sites, where the intertwined shifts to post-industrialism and an emphasis on multiliteracies are most recognizable and accessible for teachers of writing.

7.
In this paper we discuss an experiment carried out with a prototype designed in conformity with the concept of parallelism and the Parallel Instruction theory (the PI theory). We designed this prototype with five different interfaces and ran an empirical study in which 18 participants completed an abstract task. The five basic designs were based on the PI theory's hypothesis that, for solving tasks on screens, all task-relevant information must be in view on the computer monitor, as clearly as possible. The condition with two parallel frames and the condition with one long web page proved to be the best designs for this type of task, better than the window versions we normally use for our computer simulations on the web. We not only describe the results of the abstract task in the five conditions, but also discuss the results from the perspective of concrete, realistic tasks with computer simulations. The interface with two parallel frames is the best solution here, but the interface with long web pages ('virtual parallelism') is also a strong favourite in practice for realistic tasks.

8.
Numerous state-of-the-art classification algorithms are designed to handle data with nominal or binary class labels. Unfortunately, less attention has been given to classification problems where the classes are organized as a structured hierarchy, such as protein function prediction (the target area of this work), test scores, gene ontology, web page categorization, and text categorization. The structured hierarchy is usually represented as a tree or a directed acyclic graph (DAG) in which IS-A relationships hold among the class labels. Class labels at upper levels of the hierarchy are more abstract and easier to predict, whereas class labels at deeper levels are more specific and challenging to predict correctly. It is helpful to exploit this class hierarchy when designing a hypothesis that can handle the tradeoff between prediction accuracy and prediction specificity. In this paper, a novel ant colony optimization (ACO) based single-path hierarchical classification algorithm is proposed that incorporates the given class hierarchy during its learning phase. The algorithm produces an ordered list of IF-THEN rules and thus offers a comprehensible classification model. A detailed discussion of the architecture and design of the proposed technique is provided, followed by an empirical evaluation on six ion-channel data sets (related to protein function prediction) and two publicly available data sets. The performance of the algorithm is encouraging compared to existing methods under the statistically significant Student's t-test (with respect to prediction accuracy and specificity), confirming the promise of the proposed technique for hierarchical classification.
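The accuracy/specificity tradeoff over an IS-A tree can be illustrated with a generic top-down single-path predictor: descend from the root, pick one child per level, and optionally stop early at a more abstract (safer) label. This is a generic sketch of single-path hierarchical classification, not the paper's ACO rule-induction algorithm; the class names and the `classify` callback are hypothetical.

```python
def predict_path(features, tree, classify, root="root"):
    """Greedy single-path prediction over an IS-A class tree.

    `tree` maps each class label to its child labels; `classify(node,
    children, features)` picks one child, or returns None to stop early,
    trading prediction specificity for accuracy.
    """
    path, node = [], root
    while tree.get(node):
        child = classify(node, tree[node], features)
        if child is None:
            break  # stop at the current, more abstract level
        path.append(child)
        node = child
    return path

tree = {"root": ["ion channel", "receptor"],
        "ion channel": ["voltage-gated", "ligand-gated"]}
pick_first = lambda node, children, feats: children[0]  # stand-in classifier
print(predict_path({}, tree, pick_first))  # ['ion channel', 'voltage-gated']
```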

9.
Recent research suggests that older Internet users seem to find it more difficult to locate navigation links than to find information content in web pages. One possibility is that older Internet users’ visual exploration of web pages is more linear in nature, even when this type of processing is not appropriate for the task. In the current study, the eye movements of young and older Internet users were recorded using an ecological version of the web pages or a discursive version designed to induce a linear exploration. The older adults found more targets when performing content-oriented compared to navigation-oriented searches, thus replicating previous results. Moreover, they performed less well than young people only when required to locate navigation links and tended to outperform the younger participants in content-oriented searches. Although the type of search task and type of web page resulted in different visual strategies, little or no support was found for the hypothesis that older participants explore web pages in a more linear way in cases where this strategy was not appropriate. The main conclusion is that differences in visual exploration do not seem to mediate the specific difficulty older adults experience in navigation-oriented searches in web pages.

10.
An HTML Parsing Method for Improving the Retrieval Quality of Chinese Search Engines
Chinese search engines often return large numbers of irrelevant items, or indirect items that contain no concrete information. One cause of this problem is the large amount of topic-irrelevant text in web pages. For keyword-based search engines, eliminating such results at retrieval time or in post-processing is costly and, in most cases, impossible. In this paper we introduce the concept of web page noise and, targeting the characteristics of Chinese web pages, implement an HTML parsing method that automatically segments pages into blocks and removes the noisy ones, thereby eliminating potential irrelevant and indirect items at the preprocessing stage. Experimental results show that, without adding to query time, the method eliminates 100% of the indirect items hidden by Chinese search engines, as well as about 11% of the irrelevant or indirect items that cannot otherwise be filtered or hidden, substantially improving the precision of retrieval results.

11.
Session identification is an important step in data preprocessing for Web log mining. This paper proposes an improved session identification method. First, after user identification, frame pages are filtered out, greatly reducing the number of effective pages in the experiments. A visit-time threshold is then set for each page and adjusted according to the page's importance, as determined by its content and the site structure. Experiments show that, compared with the traditional method of applying a single prior threshold to all pages, the session sets obtained by this method are more realistic.
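Threshold-based session splitting can be sketched as follows. The `threshold_for` callback stands in for the paper's importance-adjusted per-page thresholds; it is a hypothetical helper, and the fixed 600-second value in the example is an assumption.

```python
def split_sessions(visits, threshold_for):
    """Split one user's time-ordered page visits into sessions.

    `visits` is a list of (page, timestamp_in_seconds) pairs; a new
    session starts whenever the gap since the previous visit exceeds
    that page's threshold, given by `threshold_for(page)`.
    """
    sessions, current = [], []
    prev_page, prev_time = None, None
    for page, t in visits:
        if current and t - prev_time > threshold_for(prev_page):
            sessions.append(current)
            current = []
        current.append(page)
        prev_page, prev_time = page, t
    if current:
        sessions.append(current)
    return sessions

visits = [("/a", 0), ("/b", 100), ("/c", 2000)]
print(split_sessions(visits, lambda page: 600))  # [['/a', '/b'], ['/c']]
```

The paper's refinement amounts to making `threshold_for` return larger values for important pages, so long dwell times there do not spuriously split a session.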

12.
Research and Implementation of a Topic Crawler Based on a Probabilistic Model
Building on existing topic crawlers, this paper proposes a topic crawler based on a probabilistic model. It combines multiple kinds of feature information obtained during crawling and uses the probabilistic model to compute a priority value for each URL, which is then used to filter and rank URLs. The probabilistic-model crawler addresses the limitation that most crawlers follow a single crawling strategy: unlike previous topic crawlers, it uses not only a topic-relevance measure but also historical evaluation and page-quality measures, which better handle the "topic drift" and "tunneling" problems while ensuring resource quality. Several groups of experiments verify its superiority in topical page recall and average topic relevance.

13.
As CSS+DIV layout has gradually become the mainstream way of structuring web pages, efficient extraction of topical information from such pages has become an urgent task for specialized search engines. This paper proposes a topic information extraction method based on the DIV tag tree: the HTML document is first parsed into a forest of DIV trees according to the DIV tags; noise nodes in the DIV trees are then filtered out and an STU-DIV model tree is built; finally, topic-relevance analysis and a pruning algorithm cut away the DIV subtrees unrelated to the topic. Experiments processing pages from several news websites show that the method can effectively extract the topical information of news pages.

14.

15.
In this work we propose a model to represent the web as a directed hypergraph (instead of a graph), where links connect pairs of disjoint sets of pages. The web hypergraph is derived from the web graph by dividing the set of pages into non-overlapping blocks and using the links between pages of distinct blocks to create hyperarcs. A hyperarc connects a block of pages to a single page, in order to provide more reliable information for link analysis. We use the hypergraph model to create the hypergraph versions of the Pagerank and Indegree algorithms, referred to as HyperPagerank and HyperIndegree, respectively. The hypergraph is derived from the web graph by grouping pages by two different partition criteria: grouping together the pages that belong to the same web host or to the same web domain. We compared the original page-based algorithms with the host-based and domain-based versions of the algorithms, considering a combination of the page reputation, the textual content of the pages and the anchor text. Experimental results using three distinct web collections show that the HyperPagerank and HyperIndegree algorithms may yield better results than the original graph versions of the Pagerank and Indegree algorithms. We also show that the hypergraph versions of the algorithms were slightly less affected by noise links and spamming.
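A host-based HyperIndegree can be sketched as counting, for each page, the number of distinct source blocks (hosts) linking to it rather than the raw number of in-links; intra-block links are ignored. This is a minimal sketch of the block-counting idea, assuming a simplistic `host_of` extractor, not the paper's full implementation.

```python
from collections import defaultdict

def hyper_indegree(links, host_of):
    """HyperIndegree under a host partition: for each target page, count
    the distinct source hosts linking to it.

    `links` is an iterable of (src_page, dst_page) pairs; links between
    pages of the same host (block) contribute nothing.
    """
    sources = defaultdict(set)
    for src, dst in links:
        if host_of(src) != host_of(dst):  # cross-block links only
            sources[dst].add(host_of(src))
    return {page: len(hosts) for page, hosts in sources.items()}

host_of = lambda url: url.split("/")[0]  # toy host extractor (assumption)
links = [("a.com/1", "b.com/x"), ("a.com/2", "b.com/x"), ("c.com/1", "b.com/x")]
print(hyper_indegree(links, host_of))  # {'b.com/x': 2}
```

The two links from a.com collapse into one hyperarc, which is what makes the measure harder to inflate with many links from a single spamming host.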

16.
Web accessibility means that disabled people can effectively perceive, understand, navigate, and interact with the web. Web accessibility evaluation methods are needed to validate the accessibility of web pages. However, the role of subjectivity and of expertise in such methods is unknown and has not previously been studied. This article investigates the effect of expertise in web accessibility evaluation methods by conducting a Barrier Walkthrough (BW) study with 19 expert and 57 nonexpert judges. The BW method is an evaluation method that can be used to manually assess the accessibility of web pages for different user groups such as motor impaired, low vision, blind, and mobile users.

Our results show that expertise matters, and even though the effect of expertise varies depending on the metric used to measure quality, the level of expertise is an important factor in the quality of accessibility evaluation of web pages. In brief, when pages are evaluated with nonexperts, we observe a drop in validity and reliability. We also observe a negative monotonic relationship between number of judges and reproducibility: more evaluators mean more diverse outputs. After five experts, reproducibility stabilizes, but this is not the case with nonexperts. The ability to detect all the problems increases with the number of judges: With 3 experts all problems can be found, but for such a level 14 nonexperts are needed. Even though our data show that experts rated pages differently, the difference is quite small. Finally, compared to nonexperts, experts spent much less time and the variability among them is smaller, they were significantly more confident, and they rated themselves as being more productive. The article discusses practical implications regarding how BW results should be interpreted, how to recruit evaluators, and what happens when more than one evaluator is hired.

Supplemental materials are available for this article. Go to the publisher's online edition of Human–Computer Interaction for statistical details and additional measures for this article.

17.
Correlation-Based Web Document Clustering for Adaptive Web Interface Design
A great challenge for web site designers is how to ensure users' efficient access to important web pages. In this paper we present a clustering-based approach to this problem: we perform efficient and effective correlation analysis based on web logs and construct clusters of web pages that reflect the co-visit behavior of web site users. We present a novel approach for adapting previous clustering algorithms, originally designed for databases, to the problem domain of web page clustering, and show that our new methods can generate high-quality clusters for very large web logs where previous methods fail. Based on the high-quality clustering results, we then apply the mined clustering knowledge to the problem of adapting web interfaces to improve users' performance. We develop an automatic method for web interface adaptation by introducing index pages that minimize overall user browsing costs. The index pages are aimed at providing shortcuts that ensure users reach their target web pages quickly, and we solve the previously open problem of how to determine an optimal number of index pages. We show empirically that our approach performs better than many previous algorithms in experiments on several realistic web log files. Received 25 November 2000 / Revised 15 March 2001 / Accepted in revised form 14 May 2001

18.
The concept of genre represents a meaningful pattern of communication, and it has been applied in the information systems field. Genres are socially constructed: they may consequently be socially more or less acceptable, or contested. This paper focuses on the concept of the communicative genre and addresses how meta-communication processes guided by discursive-ethical principles can promote a rational and legitimate definition, design, and structuring of genres. Such a meta-communication process has not yet been thoroughly discussed in relation to the concept of genre as a means of structuring (organizational) communication. This paper makes the following contributions: first, it provides a wider spectrum of discursive concepts for critical reflection on, and discursive evaluation of, the content and structures of genres and genre instances. Second, it demonstrates how different kinds of meta-communication (ex ante, in-action, and ex post) can be used to legitimate genres in a manner compatible with discourse ethics. It illustrates the discourse-ethical viewpoint on the legitimacy of genre-structuring processes and thus, also, on the legitimacy of the resultant norms and contents of communication, especially in global contexts.

19.
In this paper, music genre taxonomies are used to design hierarchical classifiers that perform better than flat classifiers. More precisely, a novel method based on sequential pattern mining techniques is proposed for extracting relevant characteristics that enable a vector representation of music genres. From this representation, an agglomerative hierarchical clustering algorithm is used to produce music genre taxonomies. Experiments are conducted on the GTZAN dataset for performance evaluation, and a second evaluation is carried out on GTZAN augmented with Afro genres. The results show that the hierarchical classifiers obtained with the proposed taxonomies reach accuracies of 91.6% (more than 7% higher than the performance of the existing hierarchical classifiers).

20.
Objective: Existing object detection is usually performed in a closed-set setting. In real-world problems, however, the images to be processed often contain objects of unknown categories. To improve the model's ability to detect newly appearing categories in realistic detection tasks, while preserving its performance on known classes, this paper studies open-set object detection. Method: Unlike existing open-set detection frameworks, which treat the background class and the unknown class as a single category during optimization, our framework first decides whether a candidate box belongs to the background or contains an object to be recognized, and only then distinguishes known from unknown classes among the latter. We propose a detector based on ring-shaped prototype-space optimization, which performs a ring-ordered discrimination among known, unknown, and background classes by optimizing the sparsity of candidate-box features in a high-dimensional space, thereby improving open-set detection performance. After the region proposal networks (RPN) layer, randomly masked candidate boxes are used to select relevant background training boxes, avoiding the tedious background-sampling steps of previous open-set detection work. Results: While preserving the model's performance in the closed-set setting, by gradually increasing the number of unknown categories, on Visual Object Classes-Common Objects in Context-20 (VOC-COCO-20), Vi...

