首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 735 毫秒
1.
近年来,Web使用挖掘成为数据挖掘领域中一个新的研究热点,Web使用挖掘是从记录了大量网络用户行为信息的Web日志中发现用户访问行为特征和潜在规律.本文结合某高校主页的真实运行数据,通过Web使用挖掘对于网站的运行日志文件进行全面的挖掘分析,分析用户对信息内容的兴趣度,并通过用户对网页的访问数据推算出各个页面受众的兴趣度高低,借此改良网站的内容和布局.  相似文献   

2.
基于兴趣度的Web用户访问模式分析   总被引:1,自引:0,他引:1  
吕佳 《计算机工程与设计》2007,28(10):2403-2404,2407
Web日志隐含了用户访问Web行为的动因和规律,如何有效地从中挖掘出用户访问模式是Web日志挖掘的重要研究内容.构造了User_ID-URL矩阵,矩阵元素为用户访问页面的兴趣度.应用经典的模糊C-均值聚类算法进行用户访问模式分析,通过在真实数据集上的实验,结果表明引入了用户兴趣度的日志挖掘算法是行之有效的.  相似文献   

3.
企业日志数据,即员工在企业内部使用网络服务时系统保存的记录,包括员工网页访问日志、邮件日志等。在一定程度上反映了企业内部的组织结构、员工的日常工作模式和各种异常情况等。对日志数据进行分析有助于企业高层及时把控企业的运行状况,发现企业潜在威胁,进而帮助更好地进行决策。现有的企业日志分析方法大多是在单一数据基础上使用数据挖掘和机器学习等算法来进行分析。将以数据为中心的分析算法和以人为中心的交互式可视化结合起来能够同时发挥算法和人的分析优势;可视分析方法可以更有效地将多源异构、时变、多维的日志数据分析结合起来,提供多角度分析。为此,设计并实现了面向企业日志数据的员工工作行为可视分析系统EWB-VIS。在ChinaVis2018挑战赛所提供的公开数据集上进行实验,证明了系统的可用性和相关可视化方法的有效性。  相似文献   

4.
Web mining involves the application of data mining techniques to large amounts of web-related data in order to improve web services. Web traversal pattern mining involves discovering users’ access patterns from web server access logs. This information can provide navigation suggestions for web users indicating appropriate actions that can be taken. However, web logs keep growing continuously, and some web logs may become out of date over time. The users’ behaviors may change as web logs are updated, or when the web site structure is changed. Additionally, it can be difficult to determine a perfect minimum support threshold during the data mining process to find interesting rules. Accordingly, we must constantly adjust the minimum support threshold until satisfactory data mining results can be found.The essence of incremental data mining and interactive data mining is the ability to use previous mining results in order to reduce unnecessary processes when web logs or web site structures are updated, or when the minimum support is changed. In this paper, we propose efficient incremental and interactive data mining algorithms to discover web traversal patterns that match users’ requirements. The experimental results show that our algorithms are more efficient than other comparable approaches.  相似文献   

5.
Correlation-Based Web Document Clustering for Adaptive Web Interface Design   总被引:2,自引:2,他引:2  
A great challenge for web site designers is how to ensure users' easy access to important web pages efficiently. In this paper we present a clustering-based approach to address this problem. Our approach to this challenge is to perform efficient and effective correlation analysis based on web logs and construct clusters of web pages to reflect the co-visit behavior of web site users. We present a novel approach for adapting previous clustering algorithms that are designed for databases in the problem domain of web page clustering, and show that our new methods can generate high-quality clusters for very large web logs when previous methods fail. Based on the high-quality clustering results, we then apply the data-mined clustering knowledge to the problem of adapting web interfaces to improve users' performance. We develop an automatic method for web interface adaptation: by introducing index pages that minimize overall user browsing costs. The index pages are aimed at providing short cuts for users to ensure that users get to their objective web pages fast, and we solve a previously open problem of how to determine an optimal number of index pages. We empirically show that our approach performs better than many of the previous algorithms based on experiments on several realistic web log files. Received 25 November 2000 / Revised 15 March 2001 / Accepted in revised form 14 May 2001  相似文献   

6.
A Data Cube Model for Prediction-Based Web Prefetching   总被引:7,自引:0,他引:7  
Reducing the web latency is one of the primary concerns of Internet research. Web caching and web prefetching are two effective techniques to latency reduction. A primary method for intelligent prefetching is to rank potential web documents based on prediction models that are trained on the past web server and proxy server log data, and to prefetch the highly ranked objects. For this method to work well, the prediction model must be updated constantly, and different queries must be answered efficiently. In this paper we present a data-cube model to represent Web access sessions for data mining for supporting the prediction model construction. The cube model organizes session data into three dimensions. With the data cube in place, we apply efficient data mining algorithms for clustering and correlation analysis. As a result of the analysis, the web page clusters can then be used to guide the prefetching system. In this paper, we propose an integrated web-caching and web-prefetching model, where the issues of prefetching aggressiveness, replacement policy and increased network traffic are addressed together in an integrated framework. The core of our integrated solution is a prediction model based on statistical correlation between web objects. This model can be frequently updated by querying the data cube of web server logs. This integrated data cube and prediction based prefetching framework represents a first such effort in our knowledge.  相似文献   

7.
为了获取用户访问页面的行为全过程以及准确时间,在网站中建立自动记录离开访问页面机制,准确的记录了用户访问页面的行为的全过程,确保访问日志的完整性和准确性.在此基础上,提出了服务器访问日志数据清理算法,确保准确提取出页面访问时间,从而解决了常见的页面访问时间算法不能准确确定每个页面被访问的确切时间的问题.  相似文献   

8.
基于归纳化会话的网络用户的聚类   总被引:7,自引:0,他引:7  
为了发掘具有相似的访问兴趣的网络用户,探讨了网络用户聚类的问题。网络用户的访问信息从服务器日志文件中抽取出来,组织成会话向量的形式,会话描述为一段时间内用户向服务器发出一系列访问请求。为了减少会话向量的维度,根据网页的层次性,采用面向属性的推理方法,对这些会话进行了归纳,并且定义了一个新的距离测度来描述两个会话之间的相似度,最后采用某种非欧几里德的关系聚类算法聚类这些归纳化的会话。实验表明,这种方法对在大型的日志文件集中挖掘出有意义的网络用户的分类是高效可行的。  相似文献   

9.
Web使用挖掘是数据挖掘技术在Web信息仓库中的应用.Web使用挖掘通过挖掘Web服务器日志获取的知识来预测用户浏览行为,是Web挖掘技术中的一个重要研究方向.通常发现的知识或一些意外规则很可能是不精确的、不完备的,这就需要用软计算技术如粗糙集来解决.提出一种基于粗糙近似的聚类方法,该方法能够实现从Web访问日志中聚类Web事务.通过这种方法可以有效地挖掘Web日志记录,从而发现用户存取Web页面的模式.  相似文献   

10.
Client interactions with web-accessible network services are organized into sessions involving requests that read and write shared application data. When executed concurrently, web sessions may invalidate each other's data. Allowing the session with invalid data to progress might lead to financial penalties for the service provider, while blocking the session's progress will result in user dissatisfaction. A compromise would be to tolerate some bounded data inconsistency, which would allow most of the sessions to progress, while limiting the potential financial loss incurred by the service. This paper develops analytical models of concurrent web sessions with bounded inconsistency in shared data, which enable quantitative reasoning about these tradeoffs. We illustrate our models using the sample buyer scenario from the TPC-W benchmark, and validate them by showing their close correspondence to measured results in both a simulated and a real web server environments. We augment our web server with a profiling and automated decision making infrastructure which is shown to successfully choose the best concurrency control algorithm in real time in response to changing service usage patterns.  相似文献   

11.
本文研究了使用集群环境下的用户访问日志数据生成用户会话聚类的方法:编制Perl脚本从用户访问日志中生成用户会话,以新的相似度度量取代欧几里德距离改进Leader算 法对用户会话集合进行聚类,并计算聚类的内部距离和间隔距离来验证算法的有效性。实验结果表明,这种实现能有效地对用户访问日志进行聚类,并能满足服务器预取机制
制在线分析的时间、空间要求。  相似文献   

12.
Distributed Denial of Service (DDoS) is one of the most damaging attacks on the Internet security today. Recently, malicious web crawlers have been used to execute automated DDoS attacks on web sites across the WWW. In this study we examine the effect of applying seven well-established data mining classification algorithms on static web server access logs in order to: (1) classify user sessions as belonging to either automated web crawlers or human visitors and (2) identify which of the automated web crawlers sessions exhibit ‘malicious’ behavior and are potentially participants in a DDoS attack. The classification performance is evaluated in terms of classification accuracy, recall, precision and F1 score. Seven out of nine vector (i.e. web-session) features employed in our work are borrowed from earlier studies on classification of user sessions as belonging to web crawlers. However, we also introduce two novel web-session features: the consecutive sequential request ratio and standard deviation of page request depth. The effectiveness of the new features is evaluated in terms of the information gain and gain ratio metrics. The experimental results demonstrate the potential of the new features to improve the accuracy of data mining classifiers in identifying malicious and well-behaved web crawler sessions.  相似文献   

13.
Web数据挖掘中的数据预处理   总被引:11,自引:0,他引:11  
Web数据挖掘是分析网络应用的主要手段,其数据源一般是网络服务器日志,然而日志记录的是杂乱的,不完整的,不准确的并且是非结构化的数据,必须进行数据预处理。文章将预处理过程分为3个阶段-数据清洗、区分使用者,会话识别,并提出了一个高效的Web数据挖掘预处理结构WLP和相应的算法。  相似文献   

14.
虽然基本的Web分析工具非常普遍,目前的分析工具只分析页面级别的用户交互,为了加深对用户行为的理解,该研究提出了将Web应用程序与运行时日志相结合的方法。实现对web应用程序的日志以及用户与页面的交互的跟踪,达到了基于大数据的用户行为分析和可视化。  相似文献   

15.
On Using a Warehouse to Analyze Web Logs   总被引:1,自引:0,他引:1  
Analyzing Web Logs for usage and access trends can not only provide important information to web site developers and administrators, but also help in creating adaptive web sites. While there are many existing tools that generate fixed reports from web logs, they typically do not allow ad-hoc analysis queries. Moreover, such tools cannot discover hidden patterns of access embedded in the access logs. We describe a relational OLAP (ROLAP) approach for creating a web-log warehouse. This is populated both from web logs, as well as the results of mining web logs. We discuss the design criteria that influenced our choice of dimensions, facts and data granularity. A web based ad-hoc tool for analytic queries on the warehouse was developed. We present some of the performance specific experiments that we performed on our warehouse.  相似文献   

16.
Mobile computing systems usually express a user movement trajectory as a sequence of areas that capture the user movement trace. Given a set of user movement trajectories, user movement patterns refer to the sequences of areas through which a user frequently travels. In an attempt to obtain user movement patterns for mobile applications, prior studies explore the problem of mining user movement patterns from the movement logs of mobile users. These movement logs generate a data record whenever a mobile user crosses base station coverage areas. However, this type of movement log does not exist in the system and thus generates extra overheads. By exploiting an existing log, namely, call detail records, this article proposes a Regression-based approach for mining User Movement Patterns (abbreviated as RUMP). This approach views call detail records as random sample trajectory data, and thus, user movement patterns are represented as movement functions in this article. We propose algorithm LS (standing for Large Sequence) to extract the call detail records that capture frequent user movement behaviors. By exploring the spatio-temporal locality of continuous movements (i.e., a mobile user is likely to be in nearby areas if the time interval between consecutive calls is small), we develop algorithm TC (standing for Time Clustering) to cluster call detail records. Then, by utilizing regression analysis, we develop algorithm MF (standing for Movement Function) to derive movement functions. Experimental studies involving both synthetic and real datasets show that RUMP is able to derive user movement functions close to the frequent movement behaviors of mobile users.  相似文献   

17.
模仿正常访问行为的HTTP泛洪攻击较为隐蔽,在消耗网站服务器资源的同时还带来信息安全隐患,提出了一种主动防御方法。用URL重写的方法使Web日志记录HTTP请求的CookieId和SessionId;定时分析Web日志,利用CookieId和SessionID识别用户,根据请求时间特征来识别傀儡主机;对HTTP请求进行预处理,拦截傀儡主机的请求。该方法成本低、便于实施,实践证明了其有效性。  相似文献   

18.
The rapid development of the World Wide Web as a medium of commerce and information dissemination has generated a growing interest of web portal managers in systems able to identify user profiles from the web access logs. The interpretation of these profiles can help re-organize the web portal, e.g., by restructuring the site’s content more efficiently, or even to build adaptive web portals, i.e., portals whose organization and presentation change depending on the specific visitor’s needs. In this paper, we assume that the pages of the web portal have been prearranged in a number of different categories. We introduce a systematic approach to determine a hierarchy of user profiles from the history of users’ accesses to the categories. First, we filter the access log by removing both occasional users and categories of poor interest. Then, we apply an Unsupervised Fuzzy Divisive Hierarchical Clustering (UFDHC) algorithm to cluster the users of the web portal into a hierarchy of fuzzy groups characterized by a set of common interests and each represented by a prototype, which defines the profile of the group typical member. To identify the profile a specific user belongs to, we propose a novel classification method which completely exploits the information contained in the hierarchy. To prove the effectiveness of our approach, we apply the UFDHC algorithm to access log data collected over a period of 15 days and use the classification method to associate a profile with the users defined by access log data collected during subsequent 60 days. Finally, we highlight the good characteristics of our system by comparing our results with the ones obtained by applying a profiling system based on a modified version of the fuzzy C-means.  相似文献   

19.
从Web日志中挖掘用户浏览偏爱路径   总被引:55,自引:0,他引:55  
邢东山  沈钧毅  宋擒豹 《计算机学报》2003,26(11):1518-1523
Web日志中包含了大量的用户浏览信息,如何有效地从其中挖掘出用户浏览兴趣模式是一个重要的研究课题.作者在分析目前用户浏览模式挖掘算法存在的问题的基础上,利用提出的支持一偏爱度的概念,设计了网站访问矩阵,并基于这个矩阵提出了用户浏览偏爱路径挖掘算法:先利用Web日志建立以引用网页URL为行、浏览网页URL为列、路径访问频度为元素值的网站访问矩阵.该矩阵为稀疏矩阵,将该矩阵用三元组法来进行表示.然后,通过对该矩阵进行支持一偏爱度计算得到偏爱子路径.最后进行合并生成浏览偏爱路径.实验表明该算法能准确地反映用户浏览兴趣,而且系统可扩展性较好.这可以应用于电子商务网站的站点优化和个性化服务等.  相似文献   

20.
一种Web用户行为聚类算法   总被引:13,自引:0,他引:13  
提出了一种新的路径相似度系数计算方法,并使之与雅可比相似系数结合,用于计算用户访问行为的相似度,在此基础之上又提出了一种分析web用户行为的聚类算法(FCC)。通过挖掘Web日志,找出具有相似行为的web用户,由于FCC聚类算法过滤了小于指定阚值的相似度系数,大大缩小了数据规模,很好地解决了其他聚类算法(如层次聚类)在高堆空间聚类时的“堆数灾难”问题,最后的实验结果很好。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号