Similar Literature
Found 20 similar documents (search time: 93 ms)
1.
We propose a new Web information extraction system. This paper explains the outline of the system and the algorithm used to extract information. A typical Web page consists of multiple elements with different functionalities, such as main content, navigation panels, copyright and privacy notices, and advertisements. Visitors usually need only a small part of a page, so a system for extracting pieces of Web pages is needed. Our system enables users to extract Web blocks simply by setting clipping areas with the mouse. Web blocks are clickable image maps, generated by imaging the page and detecting hyperlink areas on the client side. The distinguishing feature of our system is that Web blocks preserve the layouts and hyperlinks of the original Web pages. Users can access and manage their Web blocks via Evernote, a cloud storage system, and HTML snippets for Web blocks let users easily reuse Web content on their own sites.

2.
A system and method for saving a Web page from a Website on the Internet to a computer-readable medium is disclosed. A Web page is downloaded from the Internet to the computer-readable medium, and the Internet address of the Web page is stored on that medium. When the Web page is opened from the computer-readable medium, the Internet address is used to identify a security context for the page. By using the Internet address to identify the security context, the disclosed system and method allow users to securely view and execute Web pages downloaded from the Internet.

3.
A rapidly increasing number of Web databases have become accessible via their HTML form-based query interfaces. Query result pages are dynamically generated in response to user queries; they encode structured data and are displayed for human use. Query result pages usually contain other types of information in addition to query results, e.g., advertisements and navigation bars. The problem of extracting structured data from query result pages is critical for Web data integration applications, such as comparison shopping and meta-search engines, and has been studied intensively; a number of approaches have been proposed. As the structures of Web pages become more and more complex, the existing approaches start to fail, and most of them do not remove irrelevant content that may affect the accuracy of data record extraction. We propose an automated approach for Web data extraction. First, it uses visual features and query terms to identify data sections and extracts data records within them; we also define several content and visual features of visual blocks in a data section and use them to filter out noisy blocks. Second, it measures the similarity between data items in different data records based on their visual and content features, and aligns them into groups so that data in the same group share the same semantics. Experiments on a large set of Web query result pages in different domains show that the proposed approach is highly effective.

4.
A logical foundation for the semantic Web (cited 8 times, 0 self-citations)
The World Wide Web (WWW) has become an important channel through which people acquire information and services, but at present most Web pages are usable only by humans and cannot be processed and understood automatically by computers. The Semantic Web is an essential reformation of the Web. Its main objective is to enrich the Web with semantics so that it can be understood by computers, enabling communication and cooperation between people and computers. The key of the se…

5.
Innovating Web Page Classification Through Reducing Noise (cited 5 times, 0 self-citations)
This paper presents a new method that eliminates noise in Web page classification. It first describes the representation of a Web page based on HTML tags. Then, through a novel distance formula, it eliminates noise in the similarity measure. After carefully analyzing Web pages, we design an algorithm (the AWN algorithm) that can distinguish related hyperlinks from noisy ones, so that non-noisy hyperlinks can be used to improve the performance of Web page classification. Any page can then be classified using its text together with the categories of its related neighbor pages. Experimental results show that our approach improves classification accuracy.

6.
The topology of topic-specific websites hides rich information for data mining. A new topic-specific association rule mining algorithm is proposed to further research in this area. The key idea is to analyze frequent hyperlink relations between pages of different topics. In a topic-specific area, if pages of one topic are frequently hyperlinked by pages of another topic, we consider the two topics relevant; likewise, if pages of two different topics are frequently hyperlinked together by pages of a third topic, we consider those two topics relevant. Initial experiments show that the algorithm performs quite well in guiding a topic-specific crawling agent and can be applied to further discovery and mining on topic-specific websites.
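The cross-topic hyperlink counting described above can be sketched as follows, under the assumption that every page already carries a topic label; the function name, the `min_support` threshold, and the sample data are illustrative, not from the paper.

```python
from collections import Counter

def related_topics(links, topic_of, min_support=2):
    """Count hyperlinks between pages of different topics; topic pairs
    linked at least `min_support` times are treated as relevant.
    `links` is an iterable of (source_page, target_page) pairs and
    `topic_of` maps each page to its topic label."""
    pair_counts = Counter()
    for src, dst in links:
        a, b = topic_of[src], topic_of[dst]
        if a != b:  # only cross-topic hyperlinks are evidence of relevance
            pair_counts[frozenset((a, b))] += 1
    return {tuple(sorted(p)) for p, n in pair_counts.items() if n >= min_support}

links = [("p1", "q1"), ("p2", "q2"), ("p3", "q1"), ("p4", "r1")]
topic_of = {"p1": "chemistry", "p2": "chemistry", "p3": "chemistry",
            "p4": "chemistry", "q1": "physics", "q2": "physics", "r1": "sports"}
print(related_topics(links, topic_of))  # {('chemistry', 'physics')}
```

With three chemistry-to-physics links but only one chemistry-to-sports link, only the first topic pair passes the support threshold.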

7.
Summarizing Vocabularies in the Global Semantic Web (cited 1 time, 0 self-citations)
In the Semantic Web, vocabularies are defined and shared among knowledge workers to describe linked data for scientific, industrial, or daily-life use. With the rapid growth of online vocabularies, there is an emergent need for approaches that help users understand vocabularies quickly. In this paper, we study the summarization of vocabularies as a means of helping users understand them; vocabulary summarization is based on structural analysis and pragmatics statistics in the global Semantic Web. Local Bipa…

8.
It is not practical to test a website exhaustively because it has too many pages and user groups. We therefore begin with the log files of the website's server, gathering users' visits and the server's responses over a period of time. We then analyze this information and, based on it, identify the key pages, predominant pages, and users' visiting modes, so that testing resources can be allocated appropriately with emphasis on these pages and modes. Finally, we may improve the display structure of the site and refine its functionality, enhancing users' visiting efficiency as much as possible.

9.
A site-based proxy cache (cited 4 times, 0 self-citations)
In traditional proxy caches, any visited page from any Web server is cached independently, ignoring the connections between pages, and users still have to visit indexing pages frequently just to reach the informative ones, which wastes caching space and causes unnecessary Web traffic. To solve this problem, this paper introduces a site graph model describing the WWW and builds a site-based replacement strategy on it. The concept of "access frequency" is developed to evaluate whether a Web page is worth keeping in the caching space. On the basis of a user's access history, auxiliary navigation information is provided to help the user reach target pages more quickly. Performance tests show that the proposed proxy cache system achieves a higher hit ratio than traditional ones and effectively reduces users' access latency.
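The "access frequency" criterion above can be sketched as evicting the least frequently accessed entry when the cache is full. This is a deliberate simplification that ignores the site graph; the function name and sample URLs are hypothetical.

```python
def evict(cache, access_counts):
    """Remove and return the cached URL with the lowest access frequency --
    the criterion the abstract uses to judge whether a page is worth
    keeping in caching space. `cache` is a mutable set of URLs and
    `access_counts` maps a URL to its hit count in the access history."""
    victim = min(cache, key=lambda url: access_counts.get(url, 0))
    cache.remove(victim)
    return victim

cache = {"site-a/index", "site-a/page1", "site-b/index"}
hits = {"site-a/index": 12, "site-a/page1": 1, "site-b/index": 5}
print(evict(cache, hits))  # site-a/page1, the least frequently accessed page
```

A site-based strategy would additionally weight pages by their position in the site graph, so that index pages leading to frequently used content are retained even when they themselves receive few direct hits.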

10.
There are two important problems in online retail: 1) the conflict between customers' differing interest in different commodities and the commodity classification structure of the Web site; and 2) many customers will simultaneously buy both the beer and the diaper even though they are classified in different classes and levels on the Web site, which is the typical problem in data mining. Both problems force the majority of customers to access an overabundance of Web pages. To solve them, we mine the Web page data, server data, and marketing data to build an adaptive model. In this model, frequently purchased commodities and their associated commodity sets, discovered by association rule mining, are placed onto suitable Web pages according to the placing method and the backing-off method, so that navigation Web pages become navigation-content Web pages. The Web site can thus adapt itself according to users' access and purchase information. In online retail, designers need to understand latent users' interests in order to convert latent users into purchasing users; we therefore give an approach to discover Internet marketing intelligence through OLAP to help designers improve their service.

11.
A domain-specific Chinese Web page classifier based on support vector machines (cited 4 times, 1 self-citation)
This paper proposes a domain-specific Chinese Web page classification algorithm based on support vector machines. An SVM first performs binary classification of Web pages to find the Chinese pages belonging to the desired domain; a vector space model then performs multi-class classification of the selected domain pages. To improve the recall of classification, an offset factor is introduced when constructing the SVM. The algorithm only needs to compute a binary SVM classifier, and experiments show that it not only trains efficiently but also achieves high classification precision and recall.
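The offset factor can be pictured as shifting the SVM decision threshold so that borderline pages are accepted, trading a little precision for recall. This is a hedged sketch, not the paper's implementation: `classify` and the sample decision values are hypothetical, and a real system would obtain scores from a trained SVM.

```python
def classify(score, offset=0.0):
    """Binary SVM decision rule with an offset factor: `score` is the
    signed distance w.x + b from the separating hyperplane, and a
    positive `offset` shifts the boundary so that borderline pages
    are classified as in-domain, raising recall."""
    return score + offset >= 0.0

scores = [-0.3, -0.05, 0.2, 0.7]  # hypothetical SVM decision values
print(sum(classify(s) for s in scores))              # 2 pages accepted
print(sum(classify(s, offset=0.1) for s in scores))  # 3 pages accepted
```

The page scoring -0.05 is rejected by the plain rule but accepted once the offset is applied, which is exactly the recall-raising effect the abstract describes.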

12.
To address the problem of mining massive Web page data, this paper proposes a vector-space-based algorithm for computing the content similarity of Web pages, together with a software system framework. A search engine is used to extract the URLs of Chinese-encoded Web pages from the massive Web; the Chinese characters of each page are then extracted and analyzed into Chinese content words, and a vector space model is built to compute the similarity between page contents. The system narrows the range of Web documents whose similarity must be computed, saving substantial time and space and laying a good foundation for the classification, querying, and intelligent processing of Web information.
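The vector space similarity step can be sketched with term-frequency vectors and the cosine measure; the function name and the English sample terms are illustrative (the paper operates on Chinese content words extracted from pages).

```python
import math
from collections import Counter

def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two pages represented as bags of
    content words: build term-frequency vectors, then divide the dot
    product by the product of the vector norms."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity(["web", "mining", "page"], ["web", "page", "cache"]))
```

Two of the three terms overlap, so the similarity here is 2/3; identical pages score 1.0 and disjoint pages score 0.0.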

13.
Clustering techniques for a personalized learning system (cited 1 time, 1 self-citation)
Based on log-driven Web usage mining, a user interest measure derived from the relevance of the pages a user visits is proposed and applied to data preparation and page recommendation in distance education. The feasibility of on-demand learning and teaching students according to their aptitude is discussed, and the design and application of a clustering algorithm for predicting recommended pages is presented. Experimental results show that the algorithm is feasible and effective.

14.
Design of a Web usage pattern mining model (cited 1 time, 1 self-citation)
Web usage pattern mining is the data mining of the information that users leave in server logs while browsing the Web. This paper introduces the common techniques and workflow of such mining, proposes a Web usage pattern mining architecture, explains how the system works, and discusses in detail key techniques in the system design, such as data cleaning and session identification.

15.
Based on an analysis of the relationships among website topology, Web usage mining, and personalized recommendation, this paper proposes a classification method for hyperlink structures. By analyzing and processing a website's structural information, the site's topology is obtained and stored, which solves several practical problems in Web usage mining and recommendation within a single website.

16.
Little research links knowledge management to government institutions. Knowledge management is viewed primarily as value-added for managing business organizations. The information and communication technology (ICT) use in government institutions is often limited to composing a Web page and posting information. The content of these Web pages is regulated; however, it is not judged in terms of communication effectiveness. In this article, the author provides an overview of the regulatory environment of ICT usage affecting Web pages of public sector organizations in Estonia. It also discusses some principles of ICT regulation in a European context. The author evaluates the outcome of the regulatory mechanism through analysis of Web sites of Estonian county governments. Specifically, the author examines whether the Web pages conform to regulatory acts, whether they are user-friendly, and whether they are concise. The content, structure, visual form, and other evaluation categories are analyzed by discourse and content analysis. After analysis, it is concluded that decision makers both on a regulatory level and on county government level have to consider the importance of the generation of contextually appropriate content through Web pages. © 2006 Wiley Periodicals, Inc.

17.
Based on a systematic study of the basic steps of Web usage mining, an association rule mining module based on Web log files is designed. The system mines the access records left on the server when users visit the Web, deriving users' access patterns and interests. To identify users' browsing patterns, a module is implemented that applies the Apriori association rule mining algorithm to the user session files produced in the preprocessing stage of Web usage mining. For pages selected by the user, the module generates strong association rules between pages that satisfy the minimum support and minimum confidence, and displays the mining results as text.
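The rule-mining step described above can be sketched as follows, restricted to page pairs for brevity (full Apriori iterates over itemsets of any size); the function name, thresholds, and sample sessions are illustrative.

```python
from itertools import combinations
from collections import Counter

def page_rules(sessions, min_support=2, min_confidence=0.6):
    """Mine strong page-to-page rules from user session files: a rule
    A -> B is strong if the pair {A, B} occurs in at least `min_support`
    sessions and confidence(A -> B) = support(A, B) / support(A) is at
    least `min_confidence`."""
    item_counts, pair_counts = Counter(), Counter()
    for s in sessions:
        s = set(s)
        item_counts.update(s)
        pair_counts.update(frozenset(p) for p in combinations(sorted(s), 2))
    rules = []
    for pair, n in pair_counts.items():
        if n < min_support:
            continue  # pair fails the minimum support threshold
        a, b = tuple(pair)
        for x, y in ((a, b), (b, a)):  # test both rule directions
            conf = n / item_counts[x]
            if conf >= min_confidence:
                rules.append((x, y, conf))
    return sorted(rules)

sessions = [["/index", "/news"], ["/index", "/news", "/faq"], ["/index"]]
print(page_rules(sessions))
```

Here {/index, /news} appears in two of three sessions, yielding the strong rules /index -> /news (confidence 2/3) and /news -> /index (confidence 1.0), while pairs involving /faq fail the support threshold.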

18.
A new application of term correlation in Web retrieval (cited 1 time, 2 self-citations)
This paper first analyzes the limitations of how term correlation has been used in previous information retrieval work. Considering that the objects of Internet retrieval are Web pages, which tend to be short [2], the concept of a "topic keyword set" is proposed: using term correlation, the distance between the set of user query terms and a page's topic keyword set is computed, and the retrieval results are re-ranked accordingly.

19.
Applying support vector machines in a chemistry-focused crawler (cited 3 times, 0 self-citations)
A crawler is an important component of a search engine: it crawls automatically along the hyperlinks in Web pages, collecting resources. To improve the efficiency of collecting resources on a specific topic, text classification is used to guide the crawl. This paper applies SVM-based automatic text classification to a chemistry-focused crawler: an SVM classifier scores the crawled pages, guiding the crawler toward chemistry-related pages. Comparisons with a non-focused crawler based on breadth-first search and a focused crawler based on keyword matching show that the SVM-based focused crawler effectively improves the efficiency of collecting chemistry Web resources.

20.
Computer Networks, 1999, 31(11-16): 1467-1479
When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to Web searching in which the input to the search process is not a set of query terms but the URL of a page, and the output is a set of related Web pages. A related Web page is one that addresses the same topic as the original page; for example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers. We describe two algorithms for identifying related Web pages. These algorithms use only the connectivity information in the Web (i.e., the links between pages), not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate their effectiveness, we performed a user study comparing them with Netscape's "What's Related" service (http://home.netscape.com/escapes/related/). Our study showed that the precision at 10 of our two algorithms is 73% and 51% better, respectively, than that of Netscape, despite the fact that Netscape uses both content and usage pattern information in addition to connectivity information.
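The connectivity-only idea can be sketched with a cocitation count: pages that share many in-link parents with the input URL are ranked as related. This is an illustrative simplification of the paper's algorithms, not their exact implementation, and the sample link data is invented.

```python
from collections import Counter

def cocited_pages(links, url, k=3):
    """Rank pages related to `url` using only link structure: find the
    parents (pages linking to `url`), count how often each other page
    is co-linked by those parents, and return the top `k` by count.
    `links` is an iterable of (source, target) pairs."""
    parents = {src for src, dst in links if dst == url}
    cocitations = Counter(dst for src, dst in links
                          if src in parents and dst != url)
    return [page for page, _ in cocitations.most_common(k)]

links = [("hub1", "nytimes.com"), ("hub1", "washingtonpost.com"),
         ("hub2", "nytimes.com"), ("hub2", "washingtonpost.com"),
         ("hub2", "cnn.com")]
print(cocited_pages(links, "nytimes.com"))
```

Both hubs link to www.nytimes.com and www.washingtonpost.com, so the latter ranks first among related pages; nothing about page content or usage is consulted, matching the paper's setting.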


Copyright©北京勤云科技发展有限公司  京ICP备09084417号