Similar Literature
 20 similar documents found (search time: 125 ms)
1.
Mining Navigation Patterns Using a Sequence Alignment Method   Total citations: 2, self-citations: 0, citations by others: 2
This article presents a new method for mining navigation patterns on a web site. Instead of clustering patterns by means of a Euclidean distance measure, this approach partitions users into clusters using a non-Euclidean distance measure called the Sequence Alignment Method (SAM). The method partitions navigation patterns according to the order in which web pages are requested and handles the problem of clustering sequences of different lengths. The performance of the algorithm is compared with the results of a method based on Euclidean distance measures. SAM is validated on user-traffic data from two different web sites. Empirical results show that SAM identifies sequences with similar behavioral patterns not only with regard to content, but also with regard to the order of pages visited in a sequence.
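
The idea of an order-sensitive distance between page sequences of different lengths can be illustrated with a generic edit-distance sketch. This is not the SAM measure from the article, whose exact operations and weights are not given in the abstract; the function name `alignment_distance` and the costs `indel` and `substitute` are assumptions chosen for illustration.

```python
def alignment_distance(a, b, indel=1.0, substitute=2.0):
    """Edit-style distance between two page sequences of any length.

    Hypothetical costs: inserting or deleting a page costs `indel`,
    replacing one page by another costs `substitute`.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel          # delete all pages of a[:i]
    for j in range(1, cols):
        dist[0][j] = j * indel          # insert all pages of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + indel,                             # deletion
                dist[i][j - 1] + indel,                             # insertion
                dist[i - 1][j - 1] + (0.0 if same else substitute), # match/replace
            )
    return dist[-1][-1]
```

Two sessions that visit the same pages in a different order, such as `["a", "b"]` and `["b", "a"]`, get a non-zero distance here, whereas a bag-of-pages Euclidean measure would treat them as identical.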

2.
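Discovering Interest Migration Patterns Based on Hidden Markov Models   Total citations: 17, self-citations: 0, citations by others: 17
王实, 高文. 《计算机学报》 (Chinese Journal of Computers), 2001, 24(2): 152-157
An important research direction in Web mining is the discovery of users' migration patterns. In general, a user's migration is purposeful, and this purpose shows up as the user's interest in some concept. This paper proposes a method based on hidden Markov models for discovering interest migration patterns, that is, user migration patterns that carry a particular interest; such a pattern is essentially a special kind of association rule. In this method, the authors first define a hidden Markov model from users' access records, then propose a new incremental discovery algorithm, Increase_R, for discovering interest migration patterns, and give a proof that the algorithm finds all interest migration patterns.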

3.
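An Interest Model for Dynamically Mining User Navigation Patterns Based on Ant Colony Behavior   Total citations: 1, self-citations: 1, citations by others: 0
With the rapid development of e-commerce, how to mine and predict users' navigation patterns has become an increasingly important problem. Mining user navigation patterns is an important task in Web usage mining and the basic method for producing navigation recommendations. Because users' interests change constantly, it is hard to track their navigation patterns accurately. This paper proposes an ant colony model to address the problem. Web users are treated as artificial ants, and ant colony theory is applied to guide users' choices on the web site. First, a user navigation model is built from Web log data; second, an algorithm is designed to dynamically mine the navigation patterns preferred by groups of users; finally, experiments on real data sets show that the method is effective.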

4.
The rapid development of the World Wide Web as a medium of commerce and information dissemination has generated a growing interest among web portal managers in systems able to identify user profiles from web access logs. The interpretation of these profiles can help re-organize the web portal, e.g., by restructuring the site’s content more efficiently, or even help build adaptive web portals, i.e., portals whose organization and presentation change depending on the specific visitor’s needs. In this paper, we assume that the pages of the web portal have been prearranged in a number of different categories. We introduce a systematic approach to determine a hierarchy of user profiles from the history of users’ accesses to the categories. First, we filter the access log by removing both occasional users and categories of poor interest. Then, we apply an Unsupervised Fuzzy Divisive Hierarchical Clustering (UFDHC) algorithm to cluster the users of the web portal into a hierarchy of fuzzy groups characterized by a set of common interests and each represented by a prototype, which defines the profile of the group's typical member. To identify the profile a specific user belongs to, we propose a novel classification method which completely exploits the information contained in the hierarchy. To prove the effectiveness of our approach, we apply the UFDHC algorithm to access log data collected over a period of 15 days and use the classification method to associate a profile with the users defined by access log data collected during the subsequent 60 days. Finally, we highlight the good characteristics of our system by comparing our results with the ones obtained by applying a profiling system based on a modified version of the fuzzy C-means.

5.
Summarizing Large-Scale Database Schema Using Community Detection   Total citations: 1, self-citations: 1, citations by others: 0
Schema summarization on large-scale databases is a challenge. In a typical large database schema, a great proportion of the tables are closely connected through a few high-degree tables. It is thus difficult to separate these tables into clusters that represent different topics. Moreover, as a schema can be very big, the schema summary needs to be structured into multiple levels to further improve usability. In this paper, we introduce a new schema summarization approach utilizing the techniques of community detection in social networks. Our approach contains three steps. First, we use a community detection algorithm to divide a database schema into subject groups, each representing a specific subject. Second, we cluster the subject groups into abstract domains to form a multi-level navigation structure. Third, we discover representative tables in each cluster to label the schema summary. We evaluate our approach on Freebase, a real-world large-scale database. The results show that our approach can identify subject groups precisely. The generated abstract schema layers are very helpful for users exploring the database.

6.
Correlation-Based Web Document Clustering for Adaptive Web Interface Design   Total citations: 2, self-citations: 2, citations by others: 2
A great challenge for web site designers is how to ensure that users can reach important web pages easily and efficiently. In this paper we present a clustering-based approach to address this problem. Our approach is to perform efficient and effective correlation analysis based on web logs and construct clusters of web pages that reflect the co-visit behavior of web site users. We present a novel approach for adapting previous clustering algorithms, designed for databases, to the problem domain of web page clustering, and show that our new methods can generate high-quality clusters for very large web logs when previous methods fail. Based on the high-quality clustering results, we then apply the data-mined clustering knowledge to the problem of adapting web interfaces to improve users' performance. We develop an automatic method for web interface adaptation that introduces index pages that minimize overall user browsing costs. The index pages are aimed at providing shortcuts so that users get to their objective web pages fast, and we solve a previously open problem of how to determine an optimal number of index pages. We empirically show that our approach performs better than many of the previous algorithms based on experiments on several realistic web log files. Received 25 November 2000 / Revised 15 March 2001 / Accepted in revised form 14 May 2001

7.
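Web site navigation is an important research area in Web data mining and is key to accurately understanding how users visit a site. Traditional Web site navigation techniques struggle to fully reflect users' degree of interest in the pages they browse, and their accuracy in finding the page paths users are interested in is relatively low. To improve this accuracy, a Web site navigation technique based on the ant colony algorithm is proposed. Web users are treated as artificial ants and their browsing interest as the ants' pheromone. Using Web log data, a Web site navigation model is built with positive and negative feedback mechanisms and a probabilistic path selection mechanism, and the navigation paths to pages of interest to users are mined. Simulation results show that the technique improves the accuracy of finding paths to pages of interest, reflects users' browsing interests more accurately, and is feasible for Web site navigation.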
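
A minimal sketch of the pheromone idea described above, assuming that each logged click deposits a fixed amount of pheromone on the link it follows (positive feedback), that all trails decay by a fixed factor before each new session is processed (negative feedback), and that the next page is chosen with probability proportional to pheromone. The function names and the `deposit` and `evaporation` parameters are hypothetical; the abstract does not give the paper's actual update rules.

```python
from collections import defaultdict

def build_pheromone(sessions, deposit=1.0, evaporation=0.1):
    """Accumulate pheromone on (source, destination) links from log sessions."""
    pheromone = defaultdict(float)
    for session in sessions:
        # Negative feedback: existing trails fade before the next session.
        for edge in pheromone:
            pheromone[edge] *= 1.0 - evaporation
        # Positive feedback: every followed link is reinforced.
        for src, dst in zip(session, session[1:]):
            pheromone[(src, dst)] += deposit
    return dict(pheromone)

def transition_probabilities(pheromone, page):
    """Probability of each next page, proportional to link pheromone."""
    out = {dst: level for (src, dst), level in pheromone.items() if src == page}
    total = sum(out.values())
    return {dst: level / total for dst, level in out.items()} if total else {}
```

With the sessions `[["A", "B"], ["A", "B"], ["A", "C"]]`, the link from A to B ends up with more pheromone than the link from A to C, so B is the more likely recommendation from page A, while the decay keeps older sessions from dominating indefinitely.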

8.
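Taking into account the characteristics of Web user access, this work studies the uncertainty of object class membership that is common in cluster analysis of Web user access paths. Combining features of fuzzy clustering and possibilistic clustering, it proposes a new possibilistic fuzzy clustering algorithm for user access paths. By defining suitable cut sets, the new method automatically assigns objects to several clusters, which avoids manual intervention and achieves overlapping clustering. The method is built on the framework of the leader clustering algorithm and needs to scan the data set only once, which greatly improves efficiency. Comparative experiments on standard data sets show that the new algorithm is both effective and efficient.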

9.
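Path Clustering: Knowledge Discovery in Web Sites   Total citations: 41, self-citations: 0, citations by others: 41
Users' visits to a Web site represent their interest in the pages on that site, and the degree of interest can be seen in the order in which users browse the pages. After transaction identification has been performed on the site's access log, users can be clustered according to the order in which the user population visits the site, i.e., path clustering, so that each resulting cluster reflects that all users in it have similar access interests. To obtain such a partition of the user set by access interest, the K-paths path clustering method is proposed. In this method, a new similarity measure and new cluster centers are defined according to users' access interests. The experimental results were successful.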

10.
Abstract. The analysis of web usage has mostly focused on sites composed of conventional static pages. However, huge amounts of information available in the web come from databases or other data collections and are presented to the users in the form of dynamically generated pages. The query interfaces of such sites allow the specification of many search criteria. Their generated results support navigation to pages of results combining cross-linked data from many sources. For the analysis of visitor navigation behaviour in such web sites, we propose the web usage miner (WUM), which discovers navigation patterns subject to advanced statistical and structural constraints. Since our objective is the discovery of interesting navigation patterns, we do not focus on accesses to individual pages. Instead, we construct conceptual hierarchies that reflect the query capabilities used in the production of those pages. Our experiments with a real web site that integrates data from multiple databases, the German SchulWeb, demonstrate the appropriateness of WUM in discovering navigation patterns and show how those discoveries can help in assessing and improving the quality of the site. Received June 21, 1999 / Accepted December 24, 1999

11.
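A Suffix-Tree-Based Method for Generating Cluster Labels for Web Search Results   Total citations: 1, self-citations: 0, citations by others: 1
Clustering search results helps users quickly find the information they need among the results. Various clustering methods and systems are already in wide use, but most existing methods fall short of expectations because their cluster labels are hard to read and poorly descriptive. This paper proposes a new idea that focuses on producing good labels before clustering, and then clusters the search results on the basis of the generated labels. For the results returned by a search engine, we first build a single suffix tree over all of them, then compute a score for each phrase in the suffix tree and select the highest-scoring phrases as candidate labels. Once the labels are obtained, each result item returned by the search engine is assigned to the categories corresponding to the labels it contains, forming the final clusters. Experiments show that our method is fairly effective.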

12.
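Clustering web pages by structural similarity is an important technique in Web data mining. Traditional web page clustering gives no representation of the cluster center, so computing point-to-cluster and cluster-to-cluster similarity requires computing the similarity of many point pairs. Such algorithms are generally slower than those that use cluster centers and cannot meet the needs of fast, large-scale incremental clustering. To address this, the paper proposes a fast incremental web page clustering method, FPC (Fast Page Clustering). The method first proposes a new way of computing web page similarity that is 500 times faster than the simple tree matching algorithm. It then gives a representation of the web page cluster center and, on that basis, clusters with MKmeans (Merge-Kmeans), a variant of the Kmeans algorithm, improving efficiency at the level of the clustering algorithm. Finally, it uses locality-sensitive hashing to quickly find the most similar cluster among a very large set of web page clusters, improving efficiency at the level of incremental merging.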

13.
Person name queries often bring up web pages that correspond to individuals sharing the same name. The Web People Search (WePS) task consists of organizing search results for ambiguous person name queries into meaningful clusters, with each cluster referring to one individual. This paper presents a fuzzy ant based clustering approach for this multi-document person name disambiguation problem. The main advantage of fuzzy ant based clustering, a technique inspired by the behavior of ants clustering dead nestmates into piles, is that no specification of the number of output clusters is required. This makes the algorithm very well suited for the Web Person Disambiguation task, where we do not know in advance how many individuals each person name refers to. We compare our results with state-of-the-art partitional and hierarchical clustering approaches (k-means and Agnes) and demonstrate favorable results. This is particularly interesting as the latter involve manual setting of a similarity threshold, or estimating the number of clusters in advance, while the fuzzy ant based clustering algorithm does not.

14.
The well-known Fuzzy C-Means (FCM) algorithm for data clustering has been extended to the Evidential C-Means (ECM) algorithm in order to work in the belief functions framework with credal partitions of the data. Depending on the data clustering problem, some barycenters of clusters given by ECM can become very close to each other, and this can seriously degrade the performance of ECM. To circumvent this problem, we introduce the notion of imprecise cluster in this paper. The principle of our approach is to consider that objects lying in the middle of the barycenters of specific classes (clusters) must be committed with equal belief to each specific cluster instead of belonging to an imprecise meta-cluster, as is done classically in the ECM algorithm. Outlier objects far away from the centers of two (or more) specific clusters that are hard to distinguish will be committed to the imprecise cluster (a disjunctive meta-cluster) composed of these specific clusters. The new Belief C-Means (BCM) algorithm proposed in this paper follows this very simple principle. In BCM, the mass of belief of a specific cluster for each object is computed according to the distance between the object and the center of the cluster it may belong to. The distances between the object and the centers of the specific clusters and the distances among these centers are both taken into account in determining the mass of belief of the meta-cluster. We do not use the barycenter of the meta-cluster in the BCM algorithm, contrary to what is done with ECM. In this paper we also present several examples to illustrate the interest of BCM and to show its main differences with respect to clustering techniques based on FCM and ECM.

15.
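Person name ambiguity is a form of identity uncertainty in which identical name strings in text refer to different real-world persons. Person name disambiguation has long been a challenging problem; this paper focuses on disambiguating person names in web pages. Because the classic K-means algorithm can converge to a local optimum when poor random initial cluster centers are chosen, the paper proposes an improved K-means algorithm based on the max-min principle for person name disambiguation. The WePS training data is used as the experimental corpus. Experimental results show that the improved method performs better than hierarchical clustering.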
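
One common reading of the max-min principle is farthest-first seeding: each new initial center is the point whose distance to its nearest already-chosen center is largest. The sketch below follows that reading and is not necessarily the paper's exact procedure. The function name `max_min_centers` is hypothetical, and using the first point as the starting seed is an assumption; `k` should not exceed the number of distinct points.

```python
import math

def max_min_centers(points, k):
    """Pick k initial centers by the max-min (farthest-first) rule."""
    centers = [points[0]]  # assumption: seed with the first point
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Take the point that is farthest from all chosen centers.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for p, d in zip(points, nearest)]
    return centers
```

The resulting centers can then be passed to an ordinary K-means loop in place of random initial centers, which spreads the seeds across the data and makes a poor local optimum less likely.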

16.
In this paper, we present a modified filtering algorithm (MFA) that makes use of center variations to speed up the clustering process. Our method first divides clusters into static and active groups. We use the information of cluster displacements to reject unlikely cluster centers for all nodes in the kd-tree. We reduce the computational complexity of the filtering algorithm (FA) by finding candidates for each node mainly from the set of active cluster centers. Two conditions for determining the set of candidate cluster centers for each node from active clusters are developed. Our approach is different from the major available algorithm, which passes no information from one stage of iteration to the next. Theoretical analysis shows that our method can reduce the computational complexity of FA, in terms of the number of distance calculations, at each stage of iteration by a factor of FC/AC, where FC and AC are the numbers of total clusters and active clusters, respectively. Compared with the FA, our algorithm can effectively reduce the computing time and number of distance calculations. It is noted that our proposed algorithm generates the same clusters as those produced by hard k-means clustering. The superiority of our method is more remarkable when a larger data set with higher dimension is used.

17.
Reconstruction‐based one‐class classification has been shown to be very effective in a number of domains. This approach works by attempting to capture the underlying structure of the normal class, typically, by means of clusters of objects. It has the main disadvantage, however, that one has to indicate the number of clusters in advance, for this yields an efficient way of computing a clustering. In this paper, we introduce a new algorithm, OCKRA++, which achieves a better performance by enhancing a clustering‐based one‐class ensemble classifier (OCKRA) with a cluster validity index that is used to set the best number of clusters during the classifier's training process. We have thoroughly tested OCKRA++ in a particular domain, namely masquerade detection. For this purpose, we have used the Windows‐Users and ‐Intruder simulation Logs data set repository, which contains 70 different masquerade data sets. We have found that OCKRA++ is currently the algorithm that achieves the best area under the curve, with a significant difference, in masquerade detection using the file system navigation approach.

18.
The degree of personalization that a Web site offers in presenting its services to users is an important attribute contributing to the site's popularity. Web server access logs contain substantial data about user access patterns. One way to solve this problem is to group users on the basis of their Web interests and then organize the site's structure according to the needs of different groups. Two main difficulties inhibit this approach: the essentially infinite diversity of user interests and the change in these interests with time. We have developed a clustering algorithm that groups users according to their Web access patterns. The algorithm is based on the ART1 version of adaptive resonance theory. In our ART1-based algorithm, a prototype vector represents each user cluster by generalizing the URLs most frequently accessed by all cluster members. We have compared our algorithm's performance with the traditional k-means clustering algorithm. Results showed that the ART1-based technique performed better in terms of intracluster distances. We also applied the technique in a prefetching scheme that predicts future user requests.

19.
Preferred navigation patterns (PNP) are those contiguous sequential patterns whose elements users prefer to select as the next step among several alternatives and on which users prefer to spend much time. Such navigation path and time preferred patterns are more actionable than findings that consider only path or only time in various web applications, such as web user navigation, targeted online advertising and recommendation. However, due to the conceptual confusion and limitation on navigation preference in the existing work, the corresponding algorithms cannot discover actionable preferred navigation patterns. In this paper, we study the problem of preferred navigation pattern mining by involving both navigation path and time length. Firstly, we carefully define the concepts of time preference and selection preference for time-related path sequences, which can well reflect user interests from the relative path selection and time consumption respectively. Secondly, we propose an efficient PNP-forest algorithm for identifying PNPs, by first introducing the PNP-forest data structure, and then presenting the PNP-forest growth and maintenance mechanism, together with optimization strategies. Then we introduce a more efficient mining algorithm called PrefixSpan_Forest, which integrates the advantages of PrefixSpan and PNP-forest. The performance of these two algorithms is also evaluated, and the results show that the algorithms can discover PNPs effectively.

20.
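In identifying unknown protocols from bit streams, the question of how to separate the captured multi-protocol data frames into single-protocol data frames is addressed with an improved agglomerative hierarchical clustering algorithm. Building on the idea of the traditional agglomerative hierarchical clustering algorithm and on the characteristics of bit-stream data frames, the algorithm defines the similarity between data frames and between clusters, and extracts clusters that meet the requirements while clustering proceeds, so that data frames are clustered quickly and effectively. The algorithm also determines the number of clusters automatically, and the resulting clusters carry a similarity evaluation index. Tests on the data set published by Lincoln Laboratory show that the algorithm clusters protocol data frames with high accuracy.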
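
As a rough illustration of how agglomerative clustering can determine the number of clusters on its own, the sketch below merges the two most similar clusters until no pair reaches a similarity threshold. It is plain threshold-stopped agglomerative clustering, not the improved algorithm of the paper: the position-wise bit-matching similarity, the average-linkage cluster similarity, and the `threshold` value are all assumptions, and the early extraction of finished clusters described in the abstract is not implemented.

```python
def frame_similarity(a, b):
    """Hypothetical similarity: share of positions with equal bits,
    measured against the longer frame so length mismatches are penalised."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest

def cluster_similarity(c1, c2):
    """Average pairwise frame similarity between two clusters."""
    total = sum(frame_similarity(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def agglomerate(frames, threshold=0.8):
    """Merge the most similar clusters until none reach the threshold."""
    clusters = [[f] for f in frames]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_similarity(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        i, j = pair
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters
```

The number of clusters is whatever remains when merging stops, so it follows from the threshold rather than being specified in advance.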
