Similar Literature
 Found 20 similar articles; search took 62 ms
1.
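Page-Block-Based Web Archive Collection and Storage   Total citations: 1, self-citations: 0, citations by others: 1
This paper proposes a page-block-based approach to collecting and storing Web pages, and describes in detail how the method performs layout-based page segmentation, block topic extraction, version and difference comparison, and incremental storage. A prototype Web archiving system was implemented and the proposed algorithms were tested in detail. Theory and experiments show that the proposed block-based method for collecting and storing Web archives fits the way Web archives are managed and provides a useful data resource for applications built on Web archives, such as querying, search, knowledge discovery, and data mining.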

2.
In this paper, we deal with the problem of handling solutions in an external archive with the use of a relaxed form of Pareto dominance called ε-dominance and a variation of it called paε-dominance. These two relaxed forms of Pareto dominance have been used as archiving strategies in some multi-objective evolutionary algorithms (MOEAs). The main objective of this work is to improve the ε-dominance based schemes to handle nondominated solutions, or to retain nondominated solutions in an external archive. Thus, our main contribution is to add an extra objective function only at the time of accepting a nondominated solution into the external archive, in order to preserve some solutions which are normally lost when using any of the aforementioned relaxed forms of Pareto dominance. Such a proposal is inexpensive (computationally speaking) and quite effective, since it is able to produce Pareto fronts of much better quality than the aforementioned archiving techniques.
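
The relaxed dominance relation named in the abstract above can be illustrated with a minimal sketch. It assumes minimization and the common additive form of ε-dominance; the function names and the `eps` parameter are illustrative assumptions, and the sketch does not reproduce the paper's extra-objective archiving scheme.

```python
from typing import Sequence


def pareto_dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def epsilon_dominates(a: Sequence[float], b: Sequence[float], eps: float) -> bool:
    """Additive epsilon-dominance (assumed form): a, improved by eps in
    every objective, is no worse than b everywhere."""
    return all(x - eps <= y for x, y in zip(a, b))
```

Under this assumed definition, a point can ε-dominate a neighbour that it does not Pareto-dominate, which is why an ε-based archive may discard solutions that are in fact nondominated.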

3.
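Web log access sequence pattern mining applies sequential pattern techniques from data mining to the log files on Web servers in order to improve Web information services; however, mining massive data incurs a large system resource overhead. Combining the ideas of SPAM and PrefixSpan, this paper proposes a new algorithm, SPAM-FPT. By building a First_Positon_Table, the algorithm avoids the "AND operations" and "join operations" of SPAM as well as the construction of large numbers of "projected databases" in PrefixSpan, and can quickly mine all "frequent subsequences" in a database.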

4.
《Computer Networks》1999,31(11-16):1495-1507
The Web mostly contains semi-structured information. It is, however, not easy to search and extract structural data hidden in a Web page. Current practices address this problem by (1) syntax analysis (i.e., HTML tags); or (2) wrappers or user-defined declarative languages. The former is only suitable for highly structured Web sites and the latter is time-consuming and offers low scalability. Wrappers could handle tens, but certainly not thousands, of information sources. In this paper, we present a novel information mining algorithm, namely KPS, over semi-structured information on the Web. KPS employs keywords, patterns and/or samples to mine the desired information. Experimental results show that KPS is more efficient than existing Web extraction methods.

5.
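Adaptive Web sites can improve the quality of service a site offers its users. This paper first presents an overall framework for adaptive Web sites and analyzes its main modules in detail, including data preprocessing, data mining, page recommendation, and site adjustment. For the data mining module, it gives an effective algorithm for identifying user access patterns: the algorithm uses database queries to simplify finding the set of frequent maximal forward access paths, and on that basis builds a frequent access path graph in preparation for page recommendation and site adjustment. Finally, design principles for adaptive Web sites are given.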
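
The notion of a maximal forward access path used in the abstract above can be sketched as follows. This is the textbook in-memory extraction of maximal forward references from a single session, not the database-query formulation the paper describes; the function name and the session representation (a list of page identifiers) are assumptions made for illustration.

```python
from typing import Hashable, List, Sequence


def maximal_forward_references(session: Sequence[Hashable]) -> List[list]:
    """Split one click session into maximal forward paths.

    A revisit of a page already on the current path is treated as a
    backward move: the path walked so far is emitted (if the previous
    move was forward) and then truncated back to the revisited page.
    """
    path: list = []
    results: List[list] = []
    moving_forward = False
    for page in session:
        if page in path:
            if moving_forward:
                results.append(list(path))
            # Cut the path back to the revisited page.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        results.append(list(path))
    return results
```

Counting how often each returned path (or each of its contiguous sub-paths) occurs across all sessions would then give candidates for frequent access paths.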

6.
Web mining involves the application of data mining techniques to large amounts of web-related data in order to improve web services. Web traversal pattern mining involves discovering users' access patterns from web server access logs. This information can provide navigation suggestions for web users indicating appropriate actions that can be taken. However, web logs keep growing continuously, and some web logs may become out of date over time. The users' behaviors may change as web logs are updated, or when the web site structure is changed. Additionally, it can be difficult to determine a perfect minimum support threshold during the data mining process to find interesting rules. Accordingly, we must constantly adjust the minimum support threshold until satisfactory data mining results can be found. The essence of incremental data mining and interactive data mining is the ability to use previous mining results in order to reduce unnecessary processes when web logs or web site structures are updated, or when the minimum support is changed. In this paper, we propose efficient incremental and interactive data mining algorithms to discover web traversal patterns that match users' requirements. The experimental results show that our algorithms are more efficient than other comparable approaches.

7.
It is important to provide long-term preservation of digital data even when those data are stored in an unreliable system such as a filesystem, a legacy database, or even the World Wide Web. In this paper we focus on the problem of archiving the contents of a Web site without disrupting users who maintain the site. We propose an archival storage system, the InfoMonitor, in which a reliable archive is integrated with an unmodified existing store. Implementing such a system presents various challenges related to the mismatch of features between the components, such as differences in naming and data manipulation operations. We examine each of these issues as well as solutions for the conflicts that arise. We also discuss our experience using the InfoMonitor to archive the Stanford Database Group's Web site.

8.
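Mining repeated patterns in HTML documents is the key to finding the coding templates of Web pages and is the basis of automatic Web data extraction and Web content mining. Traditional repeated-pattern mining methods based on string matching and tree matching achieve high precision, but their performance remains a challenge when processing massive numbers of Web pages. To improve performance, a repeated-pattern mining method for HTML documents based on the indentation profile is proposed. The method first defines the indentation profile model, a data structure composed of the indentation value of each line of code in an HTML document and the HTML tag at the start of that line; it is a simplified abstraction of the HTML document. The method then mines repeated patterns in the HTML document indirectly by detecting tandem repeat segments in the indentation profile. Experiments show that the method not only achieves high precision but also noticeably improves performance.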
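
The indentation profile described in the abstract above can be approximated with a short sketch. It assumes the HTML is already pretty-printed one element per line, that tabs count as four spaces, and that a brute-force search is acceptable; the names `indentation_profile` and `longest_tandem_repeat` are hypothetical, and the paper's actual detection procedure is not reproduced here.

```python
import re
from typing import List, Optional, Tuple

# Matches the tag name (with a leading "/" for closing tags) at line start.
TAG_RE = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")

Entry = Tuple[int, Optional[str]]


def indentation_profile(html: str) -> List[Entry]:
    """Return one (indent width, first tag) pair per non-blank line."""
    profile: List[Entry] = []
    for line in html.splitlines():
        if not line.strip():
            continue
        expanded = line.expandtabs(4)
        stripped = expanded.lstrip(" ")
        match = TAG_RE.match(stripped)
        tag = match.group(1).lower() if match else None
        profile.append((len(expanded) - len(stripped), tag))
    return profile


def longest_tandem_repeat(profile: List[Entry], min_copies: int = 2):
    """Brute-force search for the tandem repeat covering the most lines.

    Returns (start index, unit length, number of copies) or None.
    """
    best = None
    n = len(profile)
    for unit in range(1, n // min_copies + 1):
        for start in range(0, n - unit * min_copies + 1):
            pattern = profile[start:start + unit]
            copies = 1
            while profile[start + copies * unit:start + (copies + 1) * unit] == pattern:
                copies += 1
            if copies >= min_copies and (best is None or copies * unit > best[1] * best[2]):
                best = (start, unit, copies)
    return best
```

Because the profile is a flat list of small tuples, repeat detection never has to parse or compare DOM trees, which is the kind of saving the abstract attributes to the approach.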

9.
Online mining of path traversal patterns from continuous Web click streams is one of the challenging research problems of Web usage mining. Most previous work focuses on mining path traversal patterns over the entire history of Web click streams. Mining the recent changes of Web click streams can provide valuable information for the analysis of the Web click streams. In this paper, we propose a new online mining algorithm, called Top-DSW (top-k path traversal patterns of stream Damped Sliding Window), to discover the set of top-k path traversal patterns from streaming maximal forward references, where k is the desired number of path traversal patterns to be mined. An effective summary data structure, called TKP-DSW-list (a list of top-k path traversal patterns of stream Damped Sliding Windows), is developed to maintain the essential information about the top-k path traversal patterns from the maximal forward references within a stream damped sliding window. An effective space pruning mechanism, called TKR-list-maintain, is developed to control the memory requirement of the TKP-DSW-list. Experimental studies show that the proposed Top-DSW algorithm is an efficient, single-pass algorithm for online mining of the set of top-k path traversal patterns over stream damped sliding windows.

10.
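Frame Page Filtering Algorithm in Web Log Mining Preprocessing   Total citations: 12, self-citations: 0, citations by others: 12
Web log mining applies data mining techniques to Web server logs to discover the behavior patterns of Web users. After introducing typical data preprocessing techniques, this paper points out that Frame pages reduce the interestingness of mining results and proposes a corresponding solution, the Frame page filtering algorithm, to eliminate their effect. The algorithm is validated on experimental data, showing that Frame page filtering can significantly improve the interestingness of Web log mining results.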

11.
In the present scenario of the global economy and the World Wide Web, large sets of evolving and distributed data can be handled efficiently by incremental data mining. Frequent patterns are very important in knowledge discovery and data mining tasks such as mining association rules and correlations. The FP-tree is a very versatile data structure used for mining frequent patterns in the knowledge discovery and data mining process. An FP-tree is a compact representation of a transaction database that contains frequency information for all relevant frequent patterns (FP) of the database. All of the existing incremental frequent pattern mining algorithms, such as AFPIM, CATS, CanTree, CP-tree, and SPO-tree, perform incremental mining by processing one transaction of the incremental part of the database at a time and updating it into the FP-tree of the initial (original) database. In this paper, we propose a novel method that takes advantage of the FP-tree representation of the incremental transaction database for incremental mining. We propose a batch incremental processing algorithm, BIT_FPGrowth, that restructures and merges two small consecutive-duration FP-trees to obtain an FP-tree for the FP-Growth algorithm. BIT_FPGrowth uses the FP-tree as a preprocessed data repository from which to get transactions (i.e., item-sets), unlike other sequential incremental algorithms that read transactions from the database. The BIT_FPGrowth algorithm takes less time to construct the FP-tree. Our experimental results show that, as the size of the database increases, the runtime of BIT_FPGrowth grows much more slowly than that of the other algorithms and is the lowest among them.

12.
黄亮  赵泽茂  梁兴开 《计算机应用》2012,32(6):1662-1665
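Div+CSS is popular for Web page layout; under this layout, many data records in a page are gathered at one level in the form of repeated structures. To better mine data from Web pages, a new Web data mining algorithm is proposed that converts the computation of tree edit distance into the computation of string edit distance, improves the string edit distance algorithm, and uses string edit distance to evaluate tree similarity, thereby finding the repeated patterns in a page and extracting the data. Experiments on Web pages with different repeated-pattern characteristics show that the edit-distance-based Web data mining algorithm can extract data not only from pages whose root node and top few levels are identical, but also from pages whose bottom-level nodes are identical.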
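
The idea of scoring tree similarity through string edit distance, as described in the abstract above, can be sketched as follows. The sketch assumes a tree is a nested `(tag, children)` tuple, serializes it in preorder, and applies the plain Levenshtein distance; it does not include the paper's improvements to the edit distance algorithm, and the names `serialize`, `edit_distance`, and `similarity` are illustrative.

```python
from typing import List, Sequence, Tuple

Tree = Tuple[str, list]  # (tag name, list of child trees)


def serialize(node: Tree) -> List[str]:
    """Flatten a tree into its preorder sequence of tag names."""
    tag, children = node
    sequence = [tag]
    for child in children:
        sequence.extend(serialize(child))
    return sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance with unit costs, using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # delete x
                               current[j - 1] + 1,           # insert y
                               previous[j - 1] + (x != y)))  # substitute
        previous = current
    return previous[-1]


def similarity(t1: Tree, t2: Tree) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical tag sequences."""
    s1, s2 = serialize(t1), serialize(t2)
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))
```

A plain preorder sequence discards nesting depth, so two differently shaped trees can serialize identically; a fuller treatment would encode depth or closing markers in the sequence.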

13.
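Data Mining on the Web   Total citations: 1, self-citations: 0, citations by others: 1
This article describes the basic tasks of Web page data mining, including content, structure, and usage mining. Given the complexity and particularity of Web data, only a small portion of it, such as logs, can be mined with common data mining methods. For the rest, Web pages must undergo the necessary data processing to meet the requirements of structured data mining, or XML technology must be used to construct a semi-structured data schema before mining.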

14.
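Web users mostly access sites anonymously. The main goal of Web log mining is to extract user behavior patterns from Web access records and to understand user behavior by analyzing the mining results, so as to improve site structure. The first step of Web log mining is data preprocessing, which is the most time-consuming stage of Web page analysis. The paper first studies the data preprocessing process, including data cleaning, user identification, session identification, and path completion. A path completion algorithm is proposed, ...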

15.
Improving pattern quality in web usage mining by using semantic information   Total citations: 1, self-citations: 1, citations by others: 0
Frequent Web navigation patterns generated by using Web usage mining techniques provide valuable information for several applications such as Web site restructuring and recommendation. In conventional Web usage mining, semantic information of the Web page content does not take part in the pattern generation process. In this work, we investigate the effect of semantic information on the patterns generated for Web usage mining in the form of frequent sequences. To this end, we developed a technique and a framework for integrating semantic information into the Web navigation pattern generation process, where frequent navigational patterns are composed of ontology instances instead of Web page addresses. The quality of the generated patterns is measured through an evaluation mechanism involving Web page recommendation. Experimental results show that more accurate recommendations can be obtained by including semantic information in navigation pattern generation, which indicates an increase in pattern quality.

16.
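Data preprocessing plays a vital role in Web log mining and directly affects the quality and results of log mining. The data preprocessing process is analyzed in detail, and an improved data cleaning method is proposed to raise the efficiency of data preprocessing in log mining. For session identification, an important step in Web log data preprocessing, an improved session identification method is proposed. After user identification, page importance is determined from page content and site structure, and the threshold is adjusted accordingly. Then link pages and pages of no interest are removed from the session according to the user's degree of interest in page content. Experimental results show that the proposed method determines the page access time threshold more accurately and yields a more reasonable and effective set of sessions.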
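
Session identification with a page-dependent time threshold, as outlined in the abstract above, might look like the following minimal sketch. It assumes requests for a single, already identified user, sorted by time and given as `(timestamp in seconds, url)` pairs; the `threshold_for` callback and its 1800-second default are placeholders for the paper's importance-based threshold adjustment, which is not specified here.

```python
from typing import Callable, Iterable, List, Tuple


def split_sessions(
    requests: Iterable[Tuple[float, str]],
    threshold_for: Callable[[str], float] = lambda url: 1800.0,
) -> List[List[str]]:
    """Cut one user's request stream into sessions.

    A new session starts when the gap since the previous request is
    longer than the threshold assigned to the previously viewed page.
    """
    sessions: List[List[str]] = []
    current: List[str] = []
    previous = None  # (timestamp, url) of the last request seen
    for timestamp, url in requests:
        if previous is not None and timestamp - previous[0] > threshold_for(previous[1]):
            sessions.append(current)
            current = []
        current.append(url)
        previous = (timestamp, url)
    if current:
        sessions.append(current)
    return sessions
```

Passing a `threshold_for` that returns larger values for content-heavy pages and smaller ones for navigation pages is one way to express the idea that the threshold should follow page importance.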

17.
Data Preparation for Mining World Wide Web Browsing Patterns   Total citations: 8, self-citations: 0, citations by others: 8
The World Wide Web (WWW) continues to grow at an astounding rate in both the sheer volume of traffic and the size and complexity of Web sites. The complexity of tasks such as Web site design, Web server design, and of simply navigating through a Web site has increased along with this growth. An important input to these design tasks is the analysis of how a Web site is being used. Usage analysis includes straightforward statistics, such as page access frequency, as well as more sophisticated forms of analysis, such as finding the common traversal paths through a Web site. Web Usage Mining is the application of data mining techniques to usage logs of large Web data repositories in order to produce results that can be used in the design tasks mentioned above. However, there are several preprocessing tasks that must be performed prior to applying data mining algorithms to the data collected from server logs. This paper presents several data preparation techniques in order to identify unique users and user sessions. Also, a method to divide user sessions into semantically meaningful transactions is defined and successfully tested against two other methods. Transactions identified by the proposed methods are used to discover association rules from real world data using the WEBMINER system [15].

18.
张鹏  向马 《工业控制计算机》2022,35(2):19-20+23
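The process flow of a cut-tobacco production line is complex and many process indicators need monitoring. To improve the comprehensiveness and accuracy of monitoring and reduce losses caused by quality defects in production, Web technology is used to monitor the process indicators of every process step online and to raise out-of-limit alerts. A Web page accesses the cut-tobacco production data acquisition system in real time to obtain the production data of each process section; the monitoring page renders the data as trend charts, checks whether the process indicators meet their set values, and alerts monitoring staff by voice. The results show that staff can view the real-time production status of all process steps on a single Web page, and when production becomes abnormal they can, on hearing the voice prompt, quickly identify the abnormal process section and avoid losses from quality defects in time. The system has been applied to the cut-tobacco production line of the Great Wall Cigar Factory (长城雪茄烟厂), where it effectively reduced the number of material interruptions and improved cut-tobacco quality.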

19.
Most work on pattern mining focuses on simple data structures such as itemsets and sequences of itemsets. However, many recent applications dealing with complex data, such as chemical compounds, protein structures, XML and Web log databases, and social networks, require much more sophisticated data structures such as trees and graphs. In these contexts, interesting patterns involve not only frequent object values (labels) appearing in the graphs (or trees) but also frequent specific topologies found in these structures. Recently, several techniques for tree and graph mining have been proposed in the literature. In this paper, we focus on constraint-based tree pattern mining. We propose to use tree automata as a mechanism to specify user constraints over tree patterns. We present the algorithm CoBMiner, which allows user constraints specified by a tree automaton to be incorporated in the mining process. An extensive set of experiments executed over synthetic and real data (XML documents and Web usage logs) allows us to conclude that incorporating constraints during the mining process is far more effective than filtering the interesting patterns after the mining process.

20.
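Overview of S4PA, the U.S. Satellite Science Data Processing and Archiving System   Total citations: 1, self-citations: 0, citations by others: 1
S4PA, a simple, scalable, script-based science data processing and archiving system, has been developed and used in recent years by GES DISC DAAC, the distributed data archive center of the NASA Goddard Earth Sciences Data and Information Services Center, to process, store, and manage meteorological satellite remote sensing data. This paper gives a comprehensive introduction to S4PA's development history, construction principles, system design, and technical characteristics. Its core idea is to use mature technology to design the processing and storage of meteorological satellite data as a workflow, so that effective resource utilization is optimized. The successful application of S4PA demonstrates that it is an advanced, low-cost, stable, and scalable data processing and archiving system. This paper helps apply internationally advanced data archiving technology to the practical work of processing, archiving, and managing massive data in China.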
