Similar documents (20 results)
1.
As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach to genre identification is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are then fed into a machine learning algorithm that performs classification. However, such approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages in this situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages together with On-Page features to improve genre identification performance. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information.
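The abstract does not spell out how the neighbor and on-page signals are merged; as a rough, hypothetical sketch (the function names and the similarity-weighted voting rule below are assumptions, not the paper's method), a multiple-classifier combination can be approximated by weighting each neighbor's vote by its GenreSim-style similarity:

```python
# Hypothetical sketch: combine an on-page genre prediction with predictions
# made on linked (backward/forward) neighbor pages via weighted voting.
from collections import Counter

def combine_genre_votes(on_page_pred, neighbor_preds, neighbor_sims,
                        on_page_weight=1.0):
    """Weighted vote: the on-page prediction plus one vote per selected
    neighbor, each weighted by its similarity score."""
    votes = Counter({on_page_pred: on_page_weight})
    for pred, sim in zip(neighbor_preds, neighbor_sims):
        votes[pred] += sim
    return votes.most_common(1)[0][0]

# Example: the page itself looks like a "blog", but strongly similar
# neighbors vote "shop", which wins the weighted vote.
print(combine_genre_votes("blog", ["shop", "shop", "news"], [0.9, 0.8, 0.2]))
```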

2.
Although many web pages consist of blocks of text surrounded by graphics, there is a lack of valid empirical research to aid the design of this type of page [D. Diaper, P. Waelend, Interact. Comput. 13 (2000) 163]. In particular, little is known about the influence of animations on interaction with web pages. Proportion, in particular the Golden Section, is known to be a key determinant of the aesthetic quality of objects, and aesthetics have recently been identified as a powerful factor in the quality of human–computer interaction [N. Tractinsky, A.S. Katz, D. Ikar, Interact. Comput. 13 (2000) 127]. The current study aimed to establish the relative strength of the effects of graphical display and of the screen ratio of content to navigation areas in web pages, using an information retrieval task and a split-plot experimental design. Results demonstrated an effect of screen ratio but no effect of graphical display on task performance or on two subjective outcome measures. However, there was an effect of graphical display on perceived distraction, with animated display leading to more distraction than static display, t(64) = 2.33. Results are discussed in terms of processes of perception and attention, and recommendations for web page design are given.

3.
This paper addresses new and significant research issues in web page design in relation to the use of graphics. The original findings are that (a) graphics play an important role in enhancing the appearance of web pages and thus users' feelings (aesthetics) about them, and (b) the effective use of graphics is crucial in designing web pages. In addition, we have developed a web page design support database based on a user-centered experimental procedure and a neural network model. This design support database can be used to examine how a specific combination of design elements, particularly the ratio of graphics to text, will affect users' feelings about a web page. As a general rule, a ratio of graphics to text between 3:1 and 1:1 gives users the strongest feeling that a page is easy to use and clear to follow. A web page with a ratio of 1:1 has the most realistic look, while a ratio of over 3:1 has the fanciest appearance. These results provide useful insights into the use of graphics on web pages, helping web designers best meet users' specific expectations and maintain aesthetic consistency.

4.
At present, the vast majority of web pages are described in the hypertext markup language HTML and viewed through a browser, a browsing style modeled on reading documents. This paper proposes HPML, a web page markup language based on slide-show presentation, together with its browser. It automatically plays the text and image content of a web page as a slide show, converts selected textual content of the slides into speech via a text-to-speech engine, and synchronizes the playback of slide images and speech. This approach greatly relaxes the requirements on display screen size and network bandwidth, making it particularly suitable for web browsing in mobile environments.

5.
Targeting data-intensive dynamic pages on the Web, which contain little textual data but are highly structured, this paper introduces a web information extraction method based on HTML structure. The method first parses web pages after noise removal, then computes inter-page similarity from the tree edit distance and clusters the pages accordingly; for each cluster it generates corresponding extraction rules, which are then used to extract data from the web pages.
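The paper computes similarity with tree edit distance over parsed HTML; as a lightweight stand-in (a deliberate simplification, not the paper's algorithm), the sketch below applies ordinary Levenshtein edit distance to flattened tag sequences and derives a similarity in [0, 1]:

```python
# Rough sketch only: true tree edit distance is replaced here by sequence
# edit distance over each page's flattened sequence of opening tag names.
import re

def tag_sequence(html):
    """Flatten a page into its sequence of opening tag names."""
    return re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)

def edit_distance(a, b):
    """Classic one-row dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[len(b)]

def similarity(html_a, html_b):
    sa, sb = tag_sequence(html_a), tag_sequence(html_b)
    denom = max(len(sa), len(sb)) or 1
    return 1.0 - edit_distance(sa, sb) / denom

print(similarity("<html><body><div><p>x</p></div></body></html>",
                 "<html><body><div><span>y</span></div></body></html>"))
```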

6.
In this work we propose a model that represents the web as a directed hypergraph (instead of a graph), where links connect pairs of disjoint sets of pages. The web hypergraph is derived from the web graph by dividing the set of pages into non-overlapping blocks and using the links between pages of distinct blocks to create hyperarcs. A hyperarc connects a block of pages to a single page, in order to provide more reliable information for link analysis. We use the hypergraph model to create hypergraph versions of the Pagerank and Indegree algorithms, referred to as HyperPagerank and HyperIndegree, respectively. The hypergraph is derived from the web graph by grouping pages under two different partition criteria: pages that belong to the same web host, or pages that belong to the same web domain. We compared the original page-based algorithms with the host-based and domain-based versions, considering a combination of page reputation, the textual content of the pages, and the anchor text. Experimental results using three distinct web collections show that the HyperPagerank and HyperIndegree algorithms may yield better results than the original graph versions of the Pagerank and Indegree algorithms. We also show that the hypergraph versions of the algorithms were slightly less affected by noise links and spamming.
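As a hedged illustration of the hyperarc idea (the update rule below is a plausible sketch, not the authors' exact HyperPagerank), a power iteration can treat each block as casting one aggregated vote toward its target page:

```python
# Illustrative sketch: a PageRank-style power iteration where hyperarcs
# connect a block of pages (e.g., one host) to a single target page.
def hyper_pagerank(hyperarcs, pages, damping=0.85, iters=50):
    """hyperarcs: list of (source_block, target_page) pairs, where
    source_block is a frozenset of pages grouped by host or domain."""
    rank = {p: 1.0 / len(pages) for p in pages}
    out_count = {p: sum(1 for b, _ in hyperarcs if p in b) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for block, target in hyperarcs:
            # A block's vote aggregates the rank of its member pages, each
            # share normalized by how many blocks the page points into.
            vote = sum(rank[p] / out_count[p] for p in block if out_count[p])
            new[target] += damping * vote / max(len(block), 1)
        rank = new
    return rank

pages = ["a1", "a2", "b1", "c1"]
arcs = [(frozenset({"a1", "a2"}), "c1"), (frozenset({"b1"}), "c1"),
        (frozenset({"c1"}), "a1")]
print(hyper_pagerank(arcs, pages))
```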

7.
Web page main-text extraction using characteristic text density (cited: 1; self-citations: 0; by others: 1)
To cope with the increasing diversity, complexity, and irregularity of web pages on today's Internet, a method for extracting the main text of web pages based on characteristic text density is proposed. The method classifies the text contained in a page according to its purpose and features, and builds a mathematical model for proportional density analysis, thereby accurately identifying the topical text. Both the time and space complexity of the method are low. Experiments show that it can effectively extract the main text of complex pages as well as pages with multiple topical sections, and that it generalizes well.
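A minimal density-style sketch of the general idea, assuming a crude density measure (visible text length per tag) rather than the paper's proportional-density model:

```python
# Score each candidate block by the ratio of visible text to markup, then
# keep the densest block as the presumed main text.
import re

def text_density(block_html):
    text = re.sub(r"<[^>]+>", "", block_html)
    tags = len(re.findall(r"<[^>]+>", block_html))
    return len(text.strip()) / (tags + 1)

def densest_block(blocks):
    return max(blocks, key=text_density)

blocks = [
    "<div><a href='/'>Home</a> <a href='/news'>News</a></div>",     # nav
    "<div><p>Long article paragraph with the actual story text, "
    "spanning many sentences of real content.</p></div>",           # body
]
print(densest_block(blocks))
```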

8.
Users of web sites often do not know exactly what information they are looking for or what the site has to offer. The purpose of their interaction is not only to fulfill but also to articulate their information needs. In these cases users need to pass through a series of pages before they can use the information that will eventually answer their questions. Current systems that support navigation predict which pages are interesting for the users on the basis of commonalities in the contents or the usage of the pages. They do not take into account the order in which the pages must be visited. In this paper we propose a method to automatically divide the pages of a web site, on the basis of user logs, into sets of pages that correspond to navigation stages. The method searches for an optimal number of stages and assigns each page to a stage. The stages can be used in combination with the pages' topics to give better recommendations or to structure or adapt the site. The resulting navigation structures guide users step by step through the site, providing pages that match not only the topic of the user's search but also the current stage of the navigation process.
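As a toy illustration of stage assignment (the paper additionally searches for the optimal number of stages; here k is fixed and the bucketing rule is an assumption), pages can be bucketed by their mean normalized position within logged sessions:

```python
# Hedged sketch: estimate each page's mean normalized position across
# sessions and bucket pages into k navigation stages.
def assign_stages(sessions, k=3):
    positions = {}
    for session in sessions:
        n = len(session)
        for idx, page in enumerate(session):
            positions.setdefault(page, []).append(idx / max(n - 1, 1))
    return {page: min(int(sum(v) / len(v) * k), k - 1)
            for page, v in positions.items()}

logs = [["home", "catalog", "product", "checkout"],
        ["home", "search", "product", "checkout"],
        ["catalog", "product"]]
print(assign_stages(logs))  # e.g. home -> stage 0, checkout -> stage 2
```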

9.
A deep detection system for phishing web pages based on ensemble learning (cited: 1; self-citations: 0; by others: 1)
Phishing is a form of online fraud in which phishing pages imitate normal, legitimate pages to steal users' sensitive information for illicit purposes. A deep detection method for phishing pages based on ensemble learning is proposed. Page rendering is used to counter common page-camouflage techniques; URL features, link features, and page-text features are then extracted from the rendered page. Using ensemble learning, separate base classifier models are built and trained for the different feature groups, and a classifier-ensemble strategy combines the base classifiers to produce the final result. Detection experiments on PhishTank phishing pages show that the proposed method achieves good precision and recall.
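A loose sketch of the ensemble scheme on synthetic data (the feature groups, base models, and majority-vote rule below are placeholders, not the paper's configuration):

```python
# One base classifier per feature group (URL, link, text), combined by
# majority vote over their predictions; data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                       # 1 = phishing, 0 = benign
groups = {                                      # one synthetic matrix per group
    "url":  rng.normal(y[:, None], 1.0, (n, 4)),
    "link": rng.normal(y[:, None], 1.2, (n, 3)),
    "text": rng.normal(y[:, None], 1.5, (n, 5)),
}
bases = {"url": LogisticRegression(),
         "link": DecisionTreeClassifier(max_depth=3),
         "text": LogisticRegression()}
for name, clf in bases.items():
    clf.fit(groups[name], y)                    # train each base model on its group

votes = sum(bases[name].predict(groups[name]) for name in bases)
final = (votes >= 2).astype(int)                # majority of the three classifiers
print("training agreement:", (final == y).mean())
```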

10.
In converting task listings into multiple pages of documentation for job aids or training, the two major problems are deciding on how much material should go on each page and how text and graphics should be laid out on the page. A questionnaire study was used to collect input from fourteen human factors personnel in order to design algorithms for page splitting and page layout. From the rules or heuristics used for page-splitting, an algorithm was devised which closely matched human page-splitting results. Layout of individual pages was automated with an algorithm based on the (significant) consensus among the subjects on questions of graphics positioning and label ordering. The two algorithms have been combined in a Computer Aided Design Procedure which automatically pages task lists and lays out individual pages.

11.
In this paper we present a graphical software system that provides an automatic support to the extraction of information from web pages. The underlying extraction technique exploits the visual appearance of the information in the document, and is driven by the spatial relations occurring among the elements in the page. However, the usual information extraction modalities based on the web page structure can be used in our framework, too. The technique has been integrated within the Spatial Relation Query (SRQ) tool. The tool is provided with a graphical front-end which allows one to define and manage a library of spatial relations, and to use a SQL-like language for composing queries driven by these relations and by further semantic and graphical attributes.

12.
Recent research suggests that older Internet users seem to find it more difficult to locate navigation links than to find information content in web pages. One possibility is that older Internet users' visual exploration of web pages is more linear in nature, even when this type of processing is not appropriate for the task. In the current study, the eye movements of young and older Internet users were recorded using an ecological version of the web pages or a discursive version designed to induce a linear exploration. The older adults found more targets when performing content-oriented compared to navigation-oriented searches, thus replicating previous results. Moreover, they performed less well than young people only when required to locate navigation links and tended to outperform the younger participants in content-oriented searches. Although the type of search task and type of web page resulted in different visual strategies, little or no support was found for the hypothesis that older participants explore web pages in a more linear way in cases where this strategy was not appropriate. The main conclusion is that differences in visual exploration do not seem to mediate the specific difficulty older adults experience in navigation-oriented searches in web pages.

13.
Device-aware desktop web page transformation for rendering on handhelds (cited: 1; self-citations: 0; by others: 1)
This paper illustrates a new approach to the automatic re-authoring of web pages for rendering on small-screen devices. The approach is based on automatic detection of the device type and screen size from the HTTP request header, in order to render either the desktop web page or a transformed one for display on small-screen devices, for example, PDAs. Known algorithms (transforms) are employed to reduce the size of page elements, to hide parts of the text, and to transform tables into text while preserving the structural format of the web page. The system comprises a preprocessor that works offline and a just-in-time handler that responds to HTTP requests. The preprocessor employs Cascading Style Sheets (CSS) to set default attributes for the page and prepares it for the handler. The latter is responsible for downsizing graphical elements in the page, converting tables to text, and inserting visibility attributes and JavaScript code that allow the user of the client device to interact with the page and cause parts of the text to disappear or reappear. A system implementing the approach was developed and used to collect performance results and conduct usability testing. The importance of the approach lies in its ability to display hidden parts of the web page without having to revisit the server, thus reducing user wait times considerably, saving battery power, and cutting down on wireless network traffic.
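The detection step can be sketched roughly as follows (the token list and file names are illustrative assumptions, not the paper's implementation):

```python
# Simplified sketch: inspect the HTTP User-Agent header and choose whether
# to serve the desktop page or the transformed small-screen version.
HANDHELD_TOKENS = ("Mobile", "PDA", "Windows CE", "PalmOS", "Symbian")

def select_variant(headers):
    ua = headers.get("User-Agent", "")
    if any(token.lower() in ua.lower() for token in HANDHELD_TOKENS):
        return "transformed_page.html"   # downsized images, tables as text
    return "desktop_page.html"

print(select_variant({"User-Agent": "Mozilla/5.0 (Windows CE; PPC)"}))
print(select_variant({"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"}))
```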

14.
Web information retrieval based on anchor text and its context (cited: 20; self-citations: 0; by others: 20)
The hyperlink structure between documents is one of the biggest differences between web information retrieval and traditional information retrieval, and has given rise to retrieval techniques based on hyperlink structure. This paper describes the concept of link-description documents and, on that basis, studies the role of anchor text and its context in retrieval. Tests on a large-scale real-world dataset of more than 1.69 million web pages, together with the relevant documents and evaluation methodology provided by TREC 2001, lead to the following conclusions. First, link-description documents summarize a page's topic with high precision, but describe its content very incompletely. Second, compared with traditional retrieval methods, using anchor text improves system performance by 96% on known-page-finding tasks, but anchor text and its context fail to improve retrieval performance on ad hoc information-seeking tasks. Finally, combining the anchor-text-based method with traditional methods improves retrieval performance by nearly 16%.

15.
The Internet contains a large number of duplicate web pages, and in information retrieval and large-scale web crawling, duplicate removal is one of the keys to improving efficiency. Building on a study of duplicate-detection algorithms based on "fingerprints" or feature codes, this paper proposes a duplicate-detection algorithm based on edit distance, which measures the similarity between pages by computing the edit distance between their fingerprint sequences. It overcomes the drawback of fingerprint and feature-code algorithms, which ignore the structure of the page body, by comparing pages on both content and body structure, making duplicate judgments more accurate. Experiments show the algorithm is effective, with relatively high precision and recall for deduplication.
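A hedged sketch of the core comparison, assuming one short hash per sentence as the fingerprint unit (the paper's fingerprint construction may differ):

```python
# Fingerprint each page as the sequence of hashes of its sentences, then
# compare pages by edit distance over the two fingerprint sequences.
import hashlib

def fingerprints(text):
    """One short hash per sentence; the sequence preserves body structure."""
    return [hashlib.md5(s.strip().encode()).hexdigest()[:8]
            for s in text.replace("。", ".").split(".") if s.strip()]

def seq_edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[len(b)]

def near_duplicate(t1, t2, threshold=0.3):
    f1, f2 = fingerprints(t1), fingerprints(t2)
    return seq_edit_distance(f1, f2) / max(len(f1), len(f2), 1) <= threshold

a = "News body sentence one. Sentence two. Sentence three. Sentence four."
b = "News body sentence one. Sentence two. Sentence three. Extra footer."
print(near_duplicate(a, b))  # 1 differing sentence of 4 -> 0.25 <= 0.3 -> True
```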

16.
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers that rely on web page content and link information to estimate the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also the paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is described as well. The crawlers have been implemented using the same baseline implementation (only the priority assignment function differs in each crawler), providing an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improves on the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.
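As a schematic illustration of priority assignment (the scoring functions below are simple stand-ins; the paper's crawlers use learned relevance estimates), a crawl frontier can weight the linking page's content against the anchor text:

```python
# Focused-crawler frontier sketch: a URL's download priority combines the
# relevance of the linking page's content with that of the anchor text.
import heapq

def relevance(text, topic_terms):
    words = text.lower().split()
    return sum(words.count(t) for t in topic_terms) / max(len(words), 1)

def priority(page_text, anchor_text, topic_terms, w_anchor=0.5):
    return ((1 - w_anchor) * relevance(page_text, topic_terms)
            + w_anchor * relevance(anchor_text, topic_terms))

topic = ["solar", "energy"]
frontier = []
# heapq is a min-heap, so push negated priorities to pop the best URL first.
for url, page, anchor in [
        ("http://a.example/1", "solar energy panels and storage", "solar energy news"),
        ("http://b.example/2", "celebrity gossip update", "photo gallery")]:
    heapq.heappush(frontier, (-priority(page, anchor, topic), url))

print(heapq.heappop(frontier))  # the solar-energy link comes out first
```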

17.
Research on a fast feature-string-based deduplication algorithm for large-scale Chinese web pages (cited: 16; self-citations: 1; by others: 16)
In web search results, users often receive redundant pages with identical content, a large portion of which stems from reposting between sites. These pages not only waste storage resources but also inconvenience users' searches. Based on the characteristics of redundant pages, this paper introduces the idea of fuzzy matching and, using both the content and the structural information of page text, proposes a fast feature-string-based deduplication algorithm for Chinese web pages, together with optimizations of the algorithm. Experimental results show the algorithm is effective: in a large-scale open test, recall of duplicate pages reached 97.3% and deduplication precision reached 99.5%.

18.
Based on how textual information is stored in web pages, a strategy for extracting web page text is proposed, which effectively extracts the main textual information from text-rich pages. The extraction method adapts well in terms of both space and time.

19.
To counter the forged HTTPS sites and other obfuscation techniques commonly used by phishing attackers, this paper draws on RMLR and PhishDef, the current mainstream phishing-detection methods based on machine learning and rule matching, adds a feature-extraction step for page text keywords and page sub-links, and proposes the Nmap-RF classification method. Nmap-RF is an ensemble phishing-site detection method based on rule matching and random forests. Sites are pre-filtered according to the page protocol; if a site is judged to be phishing at this stage, the subsequent feature-extraction steps are skipped. Otherwise, text-keyword confidence, sub-link confidence, similarity to phishing vocabulary, and page PageRank serve as key features, while common URL, Whois, DNS, and page-tag information serve as auxiliary features, and a random forest classification model produces the final classification result. Experiments show that the Nmap-RF ensemble method can detect a phishing page in 9-10 μs on average, filter out 98.4% of illegitimate pages, and achieve an average overall precision of up to 99.6%.
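A very loose sketch of the two-stage design on synthetic data (the pre-filter rule and features below are placeholders, not the authors' Nmap-RF implementation):

```python
# Two-stage scheme: a protocol-based rule pre-filter, then a random forest
# over the remaining pages' numeric features (synthetic here).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rule_prefilter(url):
    """Toy stand-in for the protocol rules: flag plain-HTTP login pages."""
    return url.startswith("http://") and "login" in url

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))          # e.g. keyword/sub-link confidences, PageRank...
y = (X[:, 0] + X[:, 1] > 0).astype(int)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

def classify(url, features):
    if rule_prefilter(url):            # pre-filtered: skip feature extraction
        return 1                       # 1 = phishing
    return int(forest.predict([features])[0])

print(classify("http://bank-login.example/login", None))
print(classify("https://bank.example/", X[0]))
```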

20.
Most web pages contain, in addition to the main text, noise such as navigation, advertisements, and disclaimers. To improve the accuracy of main-text extraction, an extraction method based on text block density and tag path coverage (CETD-TPC) is proposed. Combining the advantages of text-block-density features and tag-path features, it designs a new feature that fuses the two, uses this new feature to select the best text block in a page, and finally extracts the main text from that block. The method effectively addresses the problems of filtering noisy blocks and of extracting short texts, and requires neither training nor manual processing. Experimental results on the CleanEval dataset and on a dataset of news pages randomly selected from well-known sites show that CETD-TPC adapts well to different data sources and outperforms algorithms such as CETR, CETD, and CEPR.
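An illustrative fusion of the two signals, using a block's share of the page's total text as a crude proxy for tag path coverage (CETD-TPC's actual formulas are more refined):

```python
# Score each candidate block by text density multiplied by its share of
# the page's total text, then keep the best-scoring block.
import re

def strip_tags(html):
    return re.sub(r"<[^>]+>", "", html)

def fused_score(block_html, page_text_len):
    text = strip_tags(block_html).strip()
    tags = len(re.findall(r"<[^>]+>", block_html)) + 1
    density = len(text) / tags                       # text-block density
    coverage = len(text) / max(page_text_len, 1)     # path-coverage proxy
    return density * coverage

blocks = ["<ul><li><a>Home</a></li><li><a>Ads</a></li></ul>",
          "<div><p>The story itself, several sentences of genuine article "
          "text that dominate the page's visible content.</p></div>"]
total = sum(len(strip_tags(b).strip()) for b in blocks)
print(max(blocks, key=lambda b: fused_score(b, total)))
```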
