首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Correlation-Based Web Document Clustering for Adaptive Web Interface Design   总被引:2,自引:2,他引:2  
A great challenge for web site designers is how to ensure users' easy access to important web pages efficiently. In this paper we present a clustering-based approach to address this problem. Our approach to this challenge is to perform efficient and effective correlation analysis based on web logs and construct clusters of web pages to reflect the co-visit behavior of web site users. We present a novel approach for adapting previous clustering algorithms that are designed for databases in the problem domain of web page clustering, and show that our new methods can generate high-quality clusters for very large web logs when previous methods fail. Based on the high-quality clustering results, we then apply the data-mined clustering knowledge to the problem of adapting web interfaces to improve users' performance. We develop an automatic method for web interface adaptation: by introducing index pages that minimize overall user browsing costs. The index pages are aimed at providing short cuts for users to ensure that users get to their objective web pages fast, and we solve a previously open problem of how to determine an optimal number of index pages. We empirically show that our approach performs better than many of the previous algorithms based on experiments on several realistic web log files. Received 25 November 2000 / Revised 15 March 2001 / Accepted in revised form 14 May 2001  相似文献   

2.
为了克服搜索引擎在搜索过程中经常重复性地把当前受欢迎的网页放在搜索结果的首要位置,而忽略那些不受大多数用户欢迎的网页的问题,文中提出一个采用改进受欢迎度的PageRank优化算法.该改进算法首先通过建立网页的真实质量函数来纠正搜索引擎的上述问题,然后再采用一个新的网页受欢迎度来消除内在的网页质量问题从而避免该问题.实验...  相似文献   

3.
Mining Navigation Patterns Using a Sequence Alignment Method   总被引:2,自引:0,他引:2  
In this article, a new method is illustrated for mining navigation patterns on a web site. Instead of clustering patterns by means of a Euclidean distance measure, in this approach users are partitioned into clusters using a non-Euclidean distance measure called the Sequence Alignment Method (SAM). This method partitions navigation patterns according to the order in which web pages are requested and handles the problem of clustering sequences of different lengths. The performance of the algorithm is compared with the results of a method based on Euclidean distance measures. SAM is validated by means of user-traffic data of two different web sites. Empirical results show that SAM identifies sequences with similar behavioral patterns not only with regard to content, but also considering the order of pages visited in a sequence.  相似文献   

4.
Multi-Instance Learning Based Web Mining   总被引:7,自引:0,他引:7  
In multi-instance learning, the training set comprises labeled bags that are composed of unlabeled instances, and the task is to predict the labels of unseen bags. In this paper, a web mining problem, i.e. web index recommendation, is investigated from a multi-instance view. In detail, each web index page is regarded as a bag, while each of its linked pages is regarded as an instance. A user favoring an index page means that he or she is interested in at least one page linked by the index. Based on the browsing history of the user, recommendation could be provided for unseen index pages. An algorithm named Fretcit-kNN, which employs the Minimal Hausdorff distance between frequent term sets and utilizes both the references and citers of an unseen bag in determining its label, is proposed to solve the problem. Experiments show that in average the recommendation accuracy of Fretcit-kNN is 81.0% with 71.7% recall and 70.9% precision, which is significantly better than the best algorithm that does not consider the specific characteristics of multi-instance learning, whose performance is 76.3% accuracy with 63.4% recall and 66.1% precision.  相似文献   

5.
The problem of obtaining relevant results in web searching has been tackled with several approaches. Although very effective techniques are currently used by the most popular search engines when no a priori knowledge on the user's desires beside the search keywords is available, in different settings it is conceivable to design search methods that operate on a thematic database of web pages that refer to a common body of knowledge or to specific sets of users. We have considered such premises to design and develop a search method that deploys data mining and optimization techniques to provide a more significant and restricted set of pages as the final result of a user search. We adopt a vectorization method based on search context and user profile to apply clustering techniques that are then refined by a specially designed genetic algorithm. In this paper we describe the method, its implementation, the algorithms applied, and discuss some experiments that has been run on test sets of web pages.  相似文献   

6.
结合使用挖掘和内容挖掘的web推荐服务   总被引:10,自引:1,他引:9  
随着Internet的基础结构不断扩大和其所含信息的持续增长,Internet用户越来越感觉容易在WWW服务中“资源迷向”。提高用户访问效率的方法有页面预取技术,站点动态重构技术和web个性化推荐技术等。现有的大多数web个性化推荐技术主要是基于用户使用记录的数据挖掘方法,没有或很少考虑结合页面内容—这才是用户真正感兴趣的。该文提出一种结合用户使用挖掘和内容挖掘的web推荐服务,该推荐服务根据频繁最大前向访问路径,提出含有导航页和内容页的频繁访问路径图概念,根据滑动窗口内的最近用户访问页面内容和候选推荐集中页面内容相关性,来向用户提供个性化推荐服务。经推荐质量分析,这种方法具有较好的推荐优化能力。  相似文献   

7.
Users of web sites often do not know exactly which information they are looking for nor what the site has to offer. The purpose of their interaction is not only to fulfill but also to articulate their information needs. In these cases users need to pass through a series of pages before they can use the information that will eventually answer their questions. Current systems that support navigation predict which pages are interesting for the users on the basis of commonalities in the contents or the usage of the pages. They do not take into account the order in which the pages must be visited. In this paper we propose a method to automatically divide the pages of a web site on the basis of user logs into sets of pages that correspond to navigation stages. The method searches for an optimal number of stages and assigns each page to a stage. The stages can be used in combination with the pages’ topics to give better recommendations or to structure or adapt the site. The resulting navigation structures guide the users step by step through the site providing pages that do not only match the topic of the user’s search, but also the current stage of the navigation process.  相似文献   

8.
Ranking web pages for presenting the most relevant web pages to user's queries is one of the main issues in any search engine. In this paper, two new ranking algorithms are offered, using Reinforcement Learning (RL) concepts. RL is a powerful technique of modern artificial intelligence that tunes agent's parameters, interactively. In the first step, with formulation of ranking as an RL problem, a new connectivity-based ranking algorithm, called RL_Rank, is proposed. In RL_Rank, agent is considered as a surfer who travels between web pages by clicking randomly on a link in the current page. Each web page is considered as a state and value function of state is used to determine the score of that state (page). Reward is corresponded to number of out links from the current page. Rank scores in RL_Rank are computed in a recursive way. Convergence of these scores is proved. In the next step, we introduce a new hybrid approach using combination of BM25 as a content-based algorithm and RL_Rank. Both proposed algorithms are evaluated by well known benchmark datasets and analyzed according to concerning criteria. Experimental results show using RL concepts leads significant improvements in raking algorithms.  相似文献   

9.
PageRank算法对页面评价太过客观,对不同重要程度的网页被授予相同的权重,并且在排序时,一些旧的页面经常出现在Web检索结果的前面,而新加入的高质量页面用户很难找到.针对Pagerank算法存在的这些缺陷,引入时间维加权概念,开发出TimedWPR算法,同时保证了两种页面的排序优化.该算法采用服务器反馈回来的网页修改时间表示网页年龄,并在此基础上对网络的组织结构和链接质量以及时间序列进行挖掘,从而克服现有Web超链接分析中的不足.  相似文献   

10.
Existing PageRank algorithm exploits the Hyperlink Structure of the web with uniform transition probability distribution to measure the relative importance of web pages. This paper proposes a novel method namely Proportionate Prestige Score (PPS) for prestige analysis. This proposed PPS method is purely based on the exact prestige of web pages, which is applicable to Initial Probability Distribution (IPD) matrix and Transition Probability Distribution (TPD) matrix. This proposed PPS method computes the single PageRank vector with non-uniform transition probability distribution, using the link structure of the web pages offline. This non-uniform transition probability distribution has efficiently overcome the dangling page problem than the existing PageRank algorithm. This paper provides benchmark analysis of ranking methods: PageRank and proposed PPS. These methods are tested with real social network data from three different domains: Social Circle:Facebook, Wikipedia vote network and Enron email network. The findings of this research work propose that the quality of the ranking has improved by using the proposed PPS method compared with the existing PageRank algorithm.  相似文献   

11.
Search engines are among the most popular as well as useful services on the web. There is a need, however, to cater to the preferences of the users when supplying the search results to them. We propose to maintain the search profile of each user, on the basis of which the search results would be determined. This requires the integration of techniques for measuring search quality, learning from the user feedback and biased rank aggregation, etc. For the purpose of measuring web search quality, the “user satisfaction” is gauged by the sequence in which he picks up the results, the time he spends at those documents and whether or not he prints, saves, bookmarks, e-mails to someone or copies-and-pastes a portion of that document. For rank aggregation, we adopt and evaluate the classical fuzzy rank ordering techniques for web applications, and also propose a few novel techniques that outshine the existing techniques. A “user satisfaction” guided web search procedure is also put forward. Learning from the user feedback proceeds in such a way that there is an improvement in the ranking of the documents that are consistently preferred by the users. As an integration of our work, we propose a personalized web search system.  相似文献   

12.
网页在其生命周期内的活跃程度会随时间发生变化。有的网页只在特定的阶段有价值,此后就会过时。从用户的角度对网页的生命周期进行分析可以提高网络爬虫和搜索引擎的性能,改善网络广告的效果。利用一台代理服务器收集的网页访问量信息,我们对网页的生命周期进行了研究,给出了用户兴趣演变的模型。这个模型有助于更好地理解网络的组织与运行机理。  相似文献   

13.
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data; and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.  相似文献   

14.
Abstract. The analysis of web usage has mostly focused on sites composed of conventional static pages. However, huge amounts of information available in the web come from databases or other data collections and are presented to the users in the form of dynamically generated pages. The query interfaces of such sites allow the specification of many search criteria. Their generated results support navigation to pages of results combining cross-linked data from many sources. For the analysis of visitor navigation behaviour in such web sites, we propose the web usage miner (WUM), which discovers navigation patterns subject to advanced statistical and structural constraints. Since our objective is the discovery of interesting navigation patterns, we do not focus on accesses to individual pages. Instead, we construct conceptual hierarchies that reflect the query capabilities used in the production of those pages. Our experiments with a real web site that integrates data from multiple databases, the German SchulWeb, demonstrate the appropriateness of WUM in discovering navigation patterns and show how those discoveries can help in assessing and improving the quality of the site. Received June 21, 1999 / Accepted December 24, 1999  相似文献   

15.
Content distribution networks (CDNs) improve scalability and reliability, by replicating content to the “edge” of the Internet. Apart from the pure networking issues of the CDNs relevant to the establishment of the infrastructure, some very crucial data management issues must be resolved to exploit the full potential of CDNs to reduce the “last mile” latencies. A very important issue is the selection of the content to be prefetched to the CDN servers. All the approaches developed so far, assume the existence of adequate content popularity statistics to drive the prefetch decisions. Such information though, is not always available, or it is extremely volatile, turning such methods problematic. To address this issue, we develop self-adaptive techniques to select the outsourced content in a CDN infrastructure, which requires no apriori knowledge of request statistics. We identify clusters of “correlated” Web pages in a site, called Web site communities, and make these communities the basic outsourcing unit. Through a detailed simulation environment, using both real and synthetic data, we show that the proposed techniques are very robust and effective in reducing the user-perceived latency, performing very close to an unfeasible, off-line policy, which has full knowledge of the content popularity.  相似文献   

16.
基于用户访问事务文法的序列关联规则发现   总被引:4,自引:0,他引:4  
王实  高文  李锦涛 《软件学报》2001,12(10):1503-1509
在Web挖掘中,应用关联规则发现方法可以发现Web页面之间用户访问的关联度.由于Web站点内含丰富的页面结构信息,也由于用户的访问总是要遵循一定的访问顺序,因此提出一种新的可以发现用户访问序列之间关联度的方法——序列关联规则发现方法.该方法首先得到用户访问事务;然后根据正则文法,定义了一种新的用户访问事务文法,用于从用户访问事务中得到用户序列访问事务;最后应用关联规则发现算法进而发现序列关联规则.为了进一步评价所发现的序列关联规则,引入了互信息的概念.发现的序列关联规则可以帮助Web站点的设计者更好地理解用户的访问,以用于调整Web站点的结构.  相似文献   

17.
In Web classification, web pages are assigned to pre-defined categories mainly according to their content (content mining). However, the structure of the web site might provide extra information about their category (structure mining). Traditionally, both approaches have been applied separately, or are dealt with techniques that do not generate a model, such as Bayesian techniques. Unfortunately, in some classification contexts, a comprehensible model becomes crucial. Thus, it would be interesting to apply rule-based techniques (rule learning, decision tree learning) for the web categorisation task. In this paper we outline how our general-purpose learning algorithm, the so called distance based decision tree learning algorithm (DBDT), could be used in web categorisation scenarios. This algorithm differs from traditional ones in the sense that the splitting criterion is defined by means of metric conditions (“is nearer than”). This change allows decision trees to handle structured attributes (lists, graphs, sets, etc.) along with the well-known nominal and numerical attributes. Generally speaking, these structured attributes will be employed to represent the content and the structure of the web-site.  相似文献   

18.
为了有效地吸引和留住用户,提高网站服务的质量,在原有个性化实现技术基础上,提出了一种前后端日志相结合的方式存取用户浏览信息,对用户浏览站点的行为进行跟踪,为Web日志挖掘提供更精确有效的信息.结合前后端日志记录相结合的策略,提出了一个可伸缩的,独立于具体Web站点的页面推荐系统架构.实验分析结果表明,该方式能更准确全面的收集用户数据,同时个性化模块以一种非侵入的方式与系统集成,提高了系统的灵活性,方便系统重用.  相似文献   

19.
Web sites contain an ever increasing amount of information within their pages. As the amount of information increases so does the complexity of the structure of the web site. Consequently it has become difficult for visitors to find the information relevant to their needs. To overcome this problem various clustering methods have been proposed to cluster data in an effort to help visitors find the relevant information. These clustering methods have typically focused either on the content or the context of the web pages. In this paper we are proposing a method based on Kohonen’s self-organizing map (SOM) that utilizes both content and context mining clustering techniques to help visitors identify relevant information quicker. The input of the content mining is the set of web pages of the web site whereas the source of the context mining is the access-logs of the web site. SOM can be used to identify clusters of web sessions with similar context and also clusters of web pages with similar content. It can also provide means of visualizing the outcome of this processing. In this paper we show how this two-level clustering can help visitors identify the relevant information faster. This procedure has been tested to the access-logs and web pages of the Department of Informatics and Telecommunications of the University of Athens.  相似文献   

20.
该文提出了一种从搜索引擎返回的结果网页中获取双语网页的新方法,该方法分为两个任务。第一个任务是自动地检测并收集搜索引擎返回的结果网页中的数据记录。该步骤通过聚类的方法识别出有用的记录摘要并且为下一个任务即高质量双语混合网页的验证及其获取提供有效特征。该文中把双语混合网页的验证看作是有效的分类问题,该方法不依赖于特定领域和搜索引擎。基于从搜索引擎收集并经过人工标注的2 516条检索结果记录,该文提出的方法取得了81.3%的精确率和94.93%的召回率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号