20 similar articles found; search time: 31 ms
1.
Web sites contain an ever-increasing amount of information within their pages. As the amount of information increases, so does
the complexity of the structure of the web site. Consequently, it has become difficult for visitors to find the information
relevant to their needs. To overcome this problem, various clustering methods have been proposed to cluster data in an effort
to help visitors find the relevant information. These clustering methods have typically focused either on the content or the
context of the web pages. In this paper we propose a method based on Kohonen’s self-organizing map (SOM) that utilizes
both content and context mining clustering techniques to help visitors identify relevant information more quickly. The input of
the content mining is the set of web pages of the web site, whereas the source of the context mining is the access logs of
the web site. SOM can be used to identify clusters of web sessions with similar context and also clusters of web pages with
similar content. It can also provide a means of visualizing the outcome of this processing. We show how this two-level
clustering can help visitors identify the relevant information faster. The procedure has been tested on the access logs and
web pages of the Department of Informatics and Telecommunications of the University of Athens.
2.
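A Clustering Algorithm for Web User Behavior  Total citations: 13, self-citations: 0, citations by others: 13
A new method for computing a path similarity coefficient is proposed and combined with the Jaccard similarity coefficient to measure the similarity of users' access behavior. On this basis, a clustering algorithm for analyzing Web user behavior (FCC) is proposed. By mining Web logs, the algorithm finds Web users with similar behavior. Because FCC filters out similarity coefficients below a specified threshold, it greatly reduces the size of the data and effectively addresses the "curse of dimensionality" that other clustering algorithms (such as hierarchical clustering) face when clustering in high-dimensional spaces. The final experimental results are good.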
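The threshold filtering described in this abstract can be illustrated with a minimal Python sketch. It is not the FCC algorithm itself: it uses only the plain Jaccard coefficient on sets of visited pages (the paper's path similarity coefficient is not given here), and the function names, toy sessions, and threshold value are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; taken as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filtered_similarities(sessions: dict, threshold: float) -> dict:
    """Keep only the user pairs whose similarity reaches the threshold."""
    kept = {}
    for u, v in combinations(sorted(sessions), 2):
        s = jaccard(sessions[u], sessions[v])
        if s >= threshold:
            kept[(u, v)] = s
    return kept


# Hypothetical toy data: the set of pages each user visited.
sessions = {
    "u1": {"/", "/news", "/sports"},
    "u2": {"/", "/news", "/weather"},
    "u3": {"/shop", "/cart"},
}
pairs = filtered_similarities(sessions, threshold=0.4)
```

Only the pair (u1, u2) survives the filter here, so a later clustering step would work on one similarity value instead of three, which is the kind of data reduction the abstract refers to.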
3.
4.
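Research on the Application of Data Mining Techniques to Web Prefetching  Total citations: 69, self-citations: 0, citations by others: 69
The WWW is popular for its multimedia delivery and good interactivity. Although network speeds have improved greatly in recent years, the sharp growth in the number of Internet users, together with the inherent latency of Web services and networks, has made the network increasingly congested, so users' quality of service cannot be well guaranteed. This paper therefore proposes an intelligent Web prefetching technique that speeds up page retrieval while users browse the Web. The technique represents the data in the user's browser cache with a simplified WWW data model. On this basis, it applies data mining to discover the user's interest association rules and stores them in an interest association knowledge base, which serves as the basis for predicting user behavior. On the client side, an intelligent agent mines user interests and performs Web prefetching based on the interest association knowledge base, thereby providing browser acceleration that is transparent to the user.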
5.
Time-Aware Web Users' Clustering  Total citations: 1, self-citations: 0, citations by others: 1
Petridou S.G., Koutsonikola V.A., Vakali A.I., Papadimitriou G.I. 《Knowledge and Data Engineering, IEEE Transactions on》2008,20(5):653-667
Web users' clustering is a crucial task for mining information related to users' needs and preferences. Up to now, popular clustering approaches build clusters based on usage patterns derived from users' page preferences. This paper emphasizes the need to discover similarities in users' accessing behavior with respect to the time locality of their navigational acts. In this context, we present two time-aware clustering approaches for tuning and binding the page and time visiting criteria. The two tracks of the proposed algorithms define clusters with users that show similar visiting behavior at the same time period, by varying the priority given to page or time visiting. The proposed algorithms are evaluated using both synthetic and real data sets and the experimentation has shown that the new clustering schemes result in enriched clusters compared to those created by the conventional non-time-aware user clustering approaches. These clusters contain users exhibiting similar access behavior in terms not only of their page preferences but also of their access time.
6.
Multiobjective evolutionary clustering of Web user sessions: a case study in Web page recommendation
G. Nildem Demir, A. Şima Uyar, Şule Gündüz-Öğüdücü 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2010,14(6):579-597
In this study, we experiment with several multiobjective evolutionary algorithms to determine a suitable approach for clustering
Web user sessions, which consist of sequences of Web pages visited by the users. Our experimental results show that the multiobjective
evolutionary algorithm-based approaches are successful for sequence clustering. We look at a commonly used cluster validity
index to verify our findings. The results for this index indicate that the clustering solutions are of high quality. As a
case study, the obtained clusters are then used in a Web recommender system for representing usage patterns. As a result of
the experiments, we see that these approaches can successfully be applied for generating clustering solutions that lead to
a high recommendation accuracy in the recommender model we used in this paper.
7.
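Because existing Web logs lack explicit semantics, a semantic Web log model, SWLM, is proposed, together with clustering algorithms for web pages and users based on this model. Web pages and users are clustered by quantitatively computing the semantic distance between log concepts, which lays a foundation for personalized Web services. Performance tests show that the model has good overall performance and clusters web pages and users effectively.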
8.
Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth, Steven White 《Data mining and knowledge discovery》2003,7(4):399-424
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data; and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.
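As a rough illustration of the building block this abstract mentions, the Python sketch below fits a single first-order Markov model to sessions of URL categories and scores a session under it. It deliberately omits the mixture and the Expectation-Maximization step, so it is not the authors' method; the additive smoothing constant `alpha`, the function names, and the toy sessions are assumptions.

```python
import math
from collections import Counter, defaultdict


def fit_transitions(sessions, categories, alpha=1.0):
    """Estimate P(next | current) from sessions, with additive smoothing."""
    counts = defaultdict(Counter)
    for s in sessions:
        for cur, nxt in zip(s, s[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur in categories:
        total = sum(counts[cur].values()) + alpha * len(categories)
        probs[cur] = {nxt: (counts[cur][nxt] + alpha) / total for nxt in categories}
    return probs


def log_likelihood(session, probs):
    """Log-probability of the transitions in one session (initial state ignored)."""
    return sum(math.log(probs[cur][nxt]) for cur, nxt in zip(session, session[1:]))


# Hypothetical toy data: sessions as sequences of URL categories.
categories = ["news", "sports", "weather"]
sessions = [["news", "sports", "news"], ["news", "sports", "sports"]]
probs = fit_transitions(sessions, categories)
```

In a mixture model, one such transition table would be kept per cluster, and a session's likelihood under each table would drive its (soft) cluster assignment.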
9.
Sequence-based clustering for Web usage mining: A new experimental framework and ANN-enhanced K-means algorithm  Total citations: 1, self-citations: 0, citations by others: 1
We develop a general sequence-based clustering method by proposing new sequence representation schemes in association with Markov models. The resulting sequence representations allow for calculation of vector-based distances (dissimilarities) between Web user sessions and thus can be used as inputs of various clustering algorithms. We develop an evaluation framework in which the performances of the algorithms are compared in terms of whether the clusters (groups of Web users who follow the same Markov process) are correctly identified using a replicated clustering approach. A series of experiments is conducted to investigate whether clustering performance is affected by different sequence representations and different distance measures as well as by other factors such as number of actual Web user clusters, number of Web pages, similarity between clusters, minimum session length, number of user sessions, and number of clusters to form. A new, fuzzy ART-enhanced K-means algorithm is also developed and its superior performance is demonstrated.
10.
Mining Navigation Patterns Using a Sequence Alignment Method  Total citations: 2, self-citations: 0, citations by others: 2
In this article, a new method is illustrated for mining navigation patterns on a web site. Instead of clustering patterns by means of a Euclidean distance measure, in this approach users are partitioned into clusters using a non-Euclidean distance measure called the Sequence Alignment Method (SAM). This method partitions navigation patterns according to the order in which web pages are requested and handles the problem of clustering sequences of different lengths. The performance of the algorithm is compared with the results of a method based on Euclidean distance measures. SAM is validated by means of user-traffic data of two different web sites. Empirical results show that SAM identifies sequences with similar behavioral patterns not only with regard to content, but also considering the order of pages visited in a sequence.
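To make the idea of an order-sensitive, length-tolerant distance concrete, here is a minimal Python sketch that uses plain Levenshtein edit distance over page sequences. This is a simplified stand-in, not SAM as defined in the article (whose operation weights are not given in the abstract); unit costs for insertion, deletion, and substitution are an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two page sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Two sessions that visit the same pages in a different order are not identical,
# and sessions of different lengths can still be compared.
d_order = edit_distance(["A", "B"], ["B", "A"])
d_length = edit_distance(["A", "B", "C"], ["A", "C"])
```

A set-based or bag-of-pages measure would treat the first pair as identical, whereas an alignment-style distance separates them.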
11.
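Clustering Web Search Results Using Link Analysis  Total citations: 3, self-citations: 0, citations by others: 3
With the rapid growth of information on the Web, how to obtain high-quality Web information effectively has become a hot research topic in many fields, such as databases and information retrieval. In information retrieval, Web search engines are the most commonly used tools, yet today's search engines are still far from satisfactory. Using link analysis, this paper proposes a new method for clustering Web search results. Unlike information retrieval clustering algorithms that rely on keywords or terms shared between texts, the method applies literature citation and matching analysis, based on the common links that two Web pages share and match. It also extends the standard K-means clustering algorithm to make it better suited to handling noisy pages, and applies it to clustering Web result pages. Preliminary experiments were carried out to verify its effectiveness, and the results show that clustering Web search results through link analysis achieves the expected effect.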
12.
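A Web Recommendation Service Combining Usage Mining and Content Mining  Total citations: 10, self-citations: 1, citations by others: 9
As the Internet's infrastructure keeps expanding and the information it holds keeps growing, Internet users increasingly find themselves "lost among resources" in WWW services. Methods for improving users' access efficiency include page prefetching, dynamic site restructuring, and personalized Web recommendation. Most existing personalized Web recommendation techniques are data mining methods based on user usage records and give little or no consideration to page content, which is what users are really interested in. This paper proposes a Web recommendation service that combines usage mining and content mining. Based on frequent maximal forward access paths, the service introduces the concept of a frequent access path graph containing navigation pages and content pages, and provides personalized recommendations according to the content relevance between the pages the user has most recently visited within a sliding window and the pages in the candidate recommendation set. An analysis of recommendation quality shows that the method has good recommendation optimization capability.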
13.
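A Framework for Web Mining  Total citations: 4, self-citations: 3, citations by others: 1
As the volume of Web information grows, the number of Web users is also growing rapidly, and finding the information users need in this mass of data becomes ever more important. Analyzing online users' browsing behavior from Web server logs, mining Web data, and discovering users' traversal patterns has become a new research area. Taking the structure of Web sites into account, this paper presents a complete framework for Web mining that allows constraints to be added when analyzing complex traversal patterns. It then compares and analyzes the execution efficiency and accuracy of the algorithms in the framework, and looks ahead to future research directions in Web mining.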
14.
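A Fast Algorithm for Mining Frequent User Paths from Web Logs  Total citations: 10, self-citations: 0, citations by others: 10
Web access logs contain a large amount of user browsing information, and effectively mining users' frequent paths from them is a prerequisite for building adaptive web sites. Building on the Apriori algorithm and a directed-graph storage structure, this paper introduces the concepts of a session matrix and a traversal matrix and designs a fast algorithm for mining frequent user paths. First, the session matrix is used to select the frequent 1-itemsets that satisfy a given threshold, which avoids generating a large number of intermediate itemsets. Next, pages are quickly clustered within groups of similar users to obtain associated pages. Finally, paths over the associated pages are merged according to the traversal matrix to yield the frequent paths. Experiments demonstrate the accuracy and speed of the algorithm.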
15.
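Clustering web pages by structural similarity is an important technique in Web data mining. Traditional web page clustering provides no representation for the center of a page cluster, so computing point-to-cluster and cluster-to-cluster similarity requires computing the similarity of many pairs of points. Such clustering algorithms are generally slower than those that use cluster centers and struggle to meet the demands of large-scale, fast incremental clustering. To address this problem, this paper proposes a fast incremental web page clustering method, FPC (Fast Page Clustering). The method first introduces a new way of computing web page similarity that is 500 times faster than the simple tree matching algorithm. It then gives a representation for web page cluster centers and, on this basis, clusters with MKmeans (Merge-Kmeans), a variant of the Kmeans algorithm, improving efficiency at the clustering-algorithm level. Finally, it uses locality-sensitive hashing to quickly find the most similar cluster among a very large set of web page clusters, improving efficiency at the incremental-merging level.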
16.
Karane Vieira, André Luiz da Costa Carvalho, Klessius Berlt, Edleno S. de Moura, Altigran S. da Silva, Juliana Freire 《World Wide Web》2009,12(2):171-211
Templates are pieces of HTML code common to a set of web pages, usually adopted by content providers to enhance the uniformity
of layout and navigation of their Web sites. They are usually generated using authoring/publishing tools or by programs that
build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively
affect the quality of results produced by systems that automatically process information available in web sites, such as search
engines, clustering and automatic categorization programs. Further, the information available in templates is redundant and
thus processing and storing such information just once for a set of pages may save computational resources. In this paper,
we present and evaluate methods for detecting templates considering a scenario where multiple templates can be found in a
collection of Web pages. Most previous work has studied template detection algorithms in a scenario where the collection
has just a single template. The scenario with multiple templates is more realistic and, as discussed here, it raises
important questions that may require extensions and adjustments in previously proposed template detection algorithms. We show
how to apply and evaluate two template detection algorithms in this scenario, creating solutions for detecting multiple templates.
The methods studied partition the input collection into clusters that contain common HTML paths and share a high number of
HTML nodes and then apply a single-template detection procedure over each cluster. We also propose a new algorithm for single
template detection based on a restricted form of bottom-up tree-mapping that requires only a small set of pages to correctly
identify a template and which has a worst-case linear complexity. Our experimental results over a representative set of Web
pages show that our approach is efficient and scalable while obtaining accurate results.
17.
18.
A major bottleneck in content-based image retrieval (CBIR) systems or search engines is the large gap between the low-level image features used to index images and the high-level semantic contents of images. One solution to this bottleneck is to apply relevance feedback to refine the query or similarity measures in the image search process. In this paper, we first address the key issues involved in relevance feedback of CBIR systems and present a brief overview of a set of commonly used relevance feedback algorithms. Almost all of the previously proposed methods fall well into such a framework. We present a framework of relevance feedback and semantic learning in CBIR. In this framework, low-level features and keyword annotations are integrated in image retrieval and in feedback processes to improve the retrieval performance. We have also extended the framework to a content-based web image search engine in which hosting web pages are used to collect relevant annotations for images and users' feedback logs are used to refine annotations. A prototype system has been developed to evaluate our proposed schemes, and our experimental results indicate that our approach outperforms traditional CBIR systems and relevance feedback approaches.
19.
Although efficient identification of user access sessions from very large web logs is an unavoidable data preparation task for the success of higher level web log mining, little attention has been paid to algorithmic study of this problem. In this paper we consider two types of user access sessions, interval sessions and gap sessions. We design two efficient algorithms for finding respectively those two types of sessions with the help of some proposed structures. We present theoretical analysis of the algorithms and prove that both algorithms have optimal time complexity and certain error-tolerant properties as well. We conduct empirical performance analysis of the algorithms with web logs ranging from 100 megabytes to 500 megabytes. The empirical analysis shows that the algorithms just take several seconds more than the baseline time, i.e., the time needed for reading the web log once sequentially from disk to RAM, testing whether each user access record is valid or not, and writing each valid user access record back to disk. The empirical analysis also shows that our algorithms are substantially faster than the sorting based session finding algorithms. Finally, optimal algorithms for finding user access sessions from distributed web logs are also presented.
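The two session types can be sketched in Python under commonly used definitions, which are assumptions here because the abstract does not spell them out: a gap session ends when the pause between consecutive requests exceeds a limit, and an interval session ends when the time since its first request exceeds a limit. The sketch handles one user's already-sorted timestamps and is not the paper's algorithm or data structures; the names and numbers are hypothetical.

```python
def gap_sessions(times, max_gap):
    """Split sorted timestamps wherever two consecutive requests are > max_gap apart."""
    sessions = []
    for t in times:
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


def interval_sessions(times, max_len):
    """Split sorted timestamps so each session spans at most max_len from its start."""
    sessions = []
    for t in times:
        if sessions and t - sessions[-1][0] <= max_len:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Hypothetical request times (in seconds) for a single user, already sorted.
times = [0, 10, 25, 100, 110, 400]
by_gap = gap_sessions(times, max_gap=30)
by_interval = interval_sessions(times, max_len=20)
```

Both functions make a single pass over the timestamps. Real logs interleave many users and are not sorted per user, which is where the cost that the abstract attributes to sorting-based approaches comes from.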
20.
Interval Set Clustering of Web Users with Rough K-Means  Total citations: 1, self-citations: 0, citations by others: 1
Data collection and analysis in web mining face certain unique challenges. Due to a variety of reasons inherent in web browsing and web logging, the likelihood of bad or incomplete data is higher than in conventional applications. The analytical techniques in web mining need to accommodate such data. Fuzzy and rough sets provide the ability to deal with incomplete and approximate information. Fuzzy set theory has been shown to be useful in three important aspects of web and data mining, namely clustering, association, and sequential analysis. There is increasing interest in research on clustering based on rough set theory. Clustering is an important part of web mining that involves finding natural groupings of web resources or web users. Researchers have pointed out some important differences between clustering in conventional applications and clustering in web mining. For example, the clusters and associations in web mining do not necessarily have crisp boundaries. As a result, researchers have studied the possibility of using fuzzy sets in web mining clustering applications. Recent attempts have used genetic algorithms based on rough set theory for clustering. However, genetic-algorithm-based clustering may not be able to handle the large amount of data typical in a web mining application. This paper proposes a variation of the K-means clustering algorithm based on properties of rough sets. The proposed algorithm represents clusters as interval or rough sets. The paper also describes the design of an experiment including data collection and the clustering process. The experiment is used to create interval set representations of clusters of web visitors.
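The idea of representing a cluster as an interval (rough) set can be sketched with a single assignment step in Python. The rule used below is an assumption for illustration and may differ from the paper's exact formulation: a point whose distance to another centroid is within `epsilon` of its nearest distance is treated as ambiguous and placed only in the upper approximations of all such clusters; otherwise it goes into the lower (and upper) approximation of its nearest cluster. Centroid updates are omitted.

```python
import math


def rough_assign(points, centroids, epsilon):
    """One rough assignment step: return (lower, upper) approximations per cluster."""
    k = len(centroids)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for p in points:
        d = [math.dist(p, c) for c in centroids]
        nearest = min(range(k), key=d.__getitem__)
        # Clusters whose distance is nearly as small as the nearest one.
        close = [j for j in range(k) if d[j] - d[nearest] <= epsilon]
        if len(close) == 1:
            lower[nearest].append(p)  # unambiguous member
        for j in close:
            upper[j].append(p)        # possible member
    return lower, upper


# Hypothetical toy data: four visitors described by 2-D feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
lower, upper = rough_assign(points, centroids, epsilon=1.0)
```

The point (5.0, 0.0) is equidistant from both centroids, so it appears in both upper approximations and in neither lower approximation, which is how a cluster without a crisp boundary is expressed.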