首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Although efficient identification of user access sessions from very large web logs is an unavoidable data preparation task for the success of higher level web log mining, little attention has been paid to algorithmic study of this problem. In this paper we consider two types of user access sessions, interval sessions and gap sessions. We design two efficient algorithms for finding respectively those two types of sessions with the help of some proposed structures. We present theoretical analysis of the algorithms and prove that both algorithms have optimal time complexity and certain error-tolerant properties as well. We conduct empirical performance analysis of the algorithms with web logs ranging from 100 megabytes to 500 megabytes. The empirical analysis shows that the algorithms just take several seconds more than the baseline time, i.e., the time needed for reading the web log once sequentially from disk to RAM, testing whether each user access record is valid or not, and writing each valid user access record back to disk. The empirical analysis also shows that our algorithms are substantially faster than the sorting based session finding algorithms. Finally, optimal algorithms for finding user access sessions from distributed web logs are also presented.  相似文献   

2.
Web数据仓库的异步迭代查询处理方法   总被引:2,自引:0,他引:2  
何震瀛  李建中  高宏 《软件学报》2002,13(2):214-218
数据仓库信息量的飞速膨胀对数据仓库提出了巨大挑战.如何提高Web环境下数据仓库的查询效率成为数据仓库研究领域重要的研究问题.对Web数据仓库的体系结构和查询方法进行了研究和探讨.在分析几种Web数据仓库实现方法的基础上,提出了一种Web数据仓库的层次体系结构,并在此基础上提出了Web数据仓库的异步迭代查询方法.该方法充分利用了流水线并行技术,在Web数据仓库的查询处理过程中不同层次的结点以流水线方式运行,并行完成查询的处理,提高了查询效率.理论分析表明,该方法可以有效地提高Web数据仓库的查询效率.  相似文献   

3.
One of the useful tools offered by existing web search engines is query suggestion (QS), which assists users in formulating keyword queries by suggesting keywords that are unfamiliar to users, offering alternative queries that deviate from the original ones, and even correcting spelling errors. The design goal of QS is to enrich the web search experience of users and avoid the frustrating process of choosing controlled keywords to specify their special information needs, which releases their burden on creating web queries. Unfortunately, the algorithms or design methodologies of the QS module developed by Google, the most popular web search engine these days, is not made publicly available, which means that they cannot be duplicated by software developers to build the tool for specifically-design software systems for enterprise search, desktop search, or vertical search, to name a few. Keyword suggested by Yahoo! and Bing, another two well-known web search engines, however, are mostly popular currently-searched words, which might not meet the specific information needs of the users. These problems can be solved by WebQS, our proposed web QS approach, which provides the same mechanism offered by Google, Yahoo!, and Bing to support users in formulating keyword queries that improve the precision and recall of search results. WebQS relies on frequency of occurrence, keyword similarity measures, and modification patterns of queries in user query logs, which capture information on millions of searches conducted by millions of users, to suggest useful queries/query keywords during the user query construction process and achieve the design goal of QS. Experimental results show that WebQS performs as well as Yahoo! and Bing in terms of effectiveness and efficiency and is comparable to Google in terms of query suggestion time.  相似文献   

4.
网页搜索引擎查询日志的Session划分研究   总被引:4,自引:1,他引:3  
搜索引擎查询日志中的session (以下简称session)是指某特定用户为得到某个信息需求而在一段时间内的搜索行为的连续序列。Session的正确划分是进行用户搜索行为分析等一系列工作的重要基础,目前尚没有关于session的系统研究工作。本文针对相关研究工作的问题重新统一定义了session的概念并进行探索和比较研究,得出结论(1)统计语言模型因数据稀疏问题不适合做session划分;(2)利用多种属性的决策树方法可以得到比较理想的结果,以session为单位进行评价,F值达到了78.6%。  相似文献   

5.
A Data Cube Model for Prediction-Based Web Prefetching   总被引:7,自引:0,他引:7  
Reducing the web latency is one of the primary concerns of Internet research. Web caching and web prefetching are two effective techniques to latency reduction. A primary method for intelligent prefetching is to rank potential web documents based on prediction models that are trained on the past web server and proxy server log data, and to prefetch the highly ranked objects. For this method to work well, the prediction model must be updated constantly, and different queries must be answered efficiently. In this paper we present a data-cube model to represent Web access sessions for data mining for supporting the prediction model construction. The cube model organizes session data into three dimensions. With the data cube in place, we apply efficient data mining algorithms for clustering and correlation analysis. As a result of the analysis, the web page clusters can then be used to guide the prefetching system. In this paper, we propose an integrated web-caching and web-prefetching model, where the issues of prefetching aggressiveness, replacement policy and increased network traffic are addressed together in an integrated framework. The core of our integrated solution is a prediction model based on statistical correlation between web objects. This model can be frequently updated by querying the data cube of web server logs. This integrated data cube and prediction based prefetching framework represents a first such effort in our knowledge.  相似文献   

6.
Correlation-Based Web Document Clustering for Adaptive Web Interface Design   总被引:2,自引:2,他引:2  
A great challenge for web site designers is how to ensure users' easy access to important web pages efficiently. In this paper we present a clustering-based approach to address this problem. Our approach to this challenge is to perform efficient and effective correlation analysis based on web logs and construct clusters of web pages to reflect the co-visit behavior of web site users. We present a novel approach for adapting previous clustering algorithms that are designed for databases in the problem domain of web page clustering, and show that our new methods can generate high-quality clusters for very large web logs when previous methods fail. Based on the high-quality clustering results, we then apply the data-mined clustering knowledge to the problem of adapting web interfaces to improve users' performance. We develop an automatic method for web interface adaptation: by introducing index pages that minimize overall user browsing costs. The index pages are aimed at providing short cuts for users to ensure that users get to their objective web pages fast, and we solve a previously open problem of how to determine an optimal number of index pages. We empirically show that our approach performs better than many of the previous algorithms based on experiments on several realistic web log files. Received 25 November 2000 / Revised 15 March 2001 / Accepted in revised form 14 May 2001  相似文献   

7.
Web使用挖掘是数据挖掘技术在Web信息仓库中的应用.Web使用挖掘通过挖掘Web服务器日志获取的知识来预测用户浏览行为,是Web挖掘技术中的一个重要研究方向.通常发现的知识或一些意外规则很可能是不精确的、不完备的,这就需要用软计算技术如粗糙集来解决.提出一种基于粗糙近似的聚类方法,该方法能够实现从Web访问日志中聚类Web事务.通过这种方法可以有效地挖掘Web日志记录,从而发现用户存取Web页面的模式.  相似文献   

8.
9.
基于兴趣度的Web用户访问模式分析   总被引:1,自引:0,他引:1  
吕佳 《计算机工程与设计》2007,28(10):2403-2404,2407
Web日志隐含了用户访问Web行为的动因和规律,如何有效地从中挖掘出用户访问模式是Web日志挖掘的重要研究内容.构造了User_ID-URL矩阵,矩阵元素为用户访问页面的兴趣度.应用经典的模糊C-均值聚类算法进行用户访问模式分析,通过在真实数据集上的实验,结果表明引入了用户兴趣度的日志挖掘算法是行之有效的.  相似文献   

10.
The development of data warehouses begins with the definition of multidimensional models at the conceptual level in order to structure data, which will facilitate decision makers with an easier data analysis. Current proposals for conceptual multidimensional modelling focus on the design of static data warehouse structures, but few approaches model the queries which the data warehouse should support by means of OLAP (on-line analytical processing) tools. OLAP queries are, therefore, only defined once the rest of the data warehouse has been implemented, which prevents designers from verifying from the very beginning of the development whether the decision maker will be able to obtain the required information from the data warehouse. This article presents a solution to this drawback consisting of an extension to the object constraint language (OCL), which has been developed to include a set of predefined OLAP operators. These operators can be used to define platform-independent OLAP queries as a part of the specification of the data warehouse conceptual multidimensional model. Furthermore, OLAP tools require the implementation of queries to assure performance optimisations based on pre-aggregation. It is interesting to note that the OLAP queries defined by our approach can be automatically implemented in the rest of the data warehouse, in a coherent and integrated manner. This implementation is supported by a code-generation architecture aligned with model-driven technologies, in particular the MDA (model-driven architecture) proposal. Finally, our proposal has been validated by means of a set of sample data sets from a well-known case study.  相似文献   

11.
Many read‐intensive systems where fast access to data is more important than the rate at which data can change make use of multidimensional index structures, like the generalized search tree (GiST). Although in these systems the indexed data are rarely updated and read access is highly concurrent, the existing concurrency control mechanisms for multidimensional index structures are based on locking techniques, which cause significant overhead. In this article we present the multiversion‐GiST (MVGiST), an in‐memory mechanism that extends the GiST with multiversion concurrency control. The MVGiST enables lock‐free read access and ensures a consistent view of the index structure throughout a reader's series of queries, by creating lightweight, read‐only versions of the GiST that share unchanging nodes among themselves. An example of a system with high read to write ratio, where providing wait‐free queries is of utmost importance, is a large‐scale directory that indexes web services according to their input and output parameters. A performance evaluation shows that for low update rates, the MVGiST significantly improves scalability w.r.t. the number of concurrent read accesses when compared with a traditional, locking‐based concurrency control mechanism. We propose a technique to control memory consumption and confirm through our evaluation that the MVGiST efficiently manages memory. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

12.
Web servers keep track of web users' browsing behavior in web logs. From these logs, one can build statistical models that predict the users' next requests based on their current behavior. These data are complex due to their large size and sequential nature. In the past, researchers have proposed different methods for building association-rule based prediction models using the web logs, but there has been no systematic study on the relative merits of these methods. In this paper, we provide a comparative study on different kinds of sequential association rules for web document prediction. We show that the existing approaches can be cast under two important dimensions, namely the type of antecedents of rules and the criterion for selecting prediction rules. From this comparison we propose a best overall method and empirically test the proposed model on real web logs.  相似文献   

13.
Yao Liu  Hui Xiong 《Information Sciences》2006,176(9):1215-1240
A data warehouse stores current and historical records consolidated from multiple transactional systems. Securing data warehouses is of ever-increasing interest, especially considering areas where data are sold in pieces to third parties for data mining practices. In this case, existing data warehouse security techniques, such as data access control, may not be easy to enforce and can be ineffective. Instead, this paper proposes a data perturbation based approach, called the cubic-wise balance method, to provide privacy preserving range queries on data cubes in a data warehouse. This approach is motivated by the following observation: analysts are usually interested in summary data rather than individual data values. Indeed, our approach can provide a closely estimated summary data for range queries without providing access to actual individual data values. As demonstrated by our experimental results on APB benchmark data set from the OLAP council, the cubic-wise balance method can achieve both better privacy preservation and better range query accuracy than random data perturbation alternatives.  相似文献   

14.
Web caching proxy servers are essential for improving web performance and scalability, and recent research has focused on making proxy caching work for database-backed web sites. In this paper, we explore a new proxy caching framework that exploits the query semantics of HTML forms. We identify two common classes of form-based queries from real-world database-backed web sites, namely, keyword-based queries and function-embedded queries. Using typical examples of these queries, we study two representative caching schemes within our framework: (i) traditional passive query caching, and (ii) active query caching, in which the proxy cache can service a request by evaluating a query over the contents of the cache. Results from our experimental implementation show that our form-based proxy is a general and flexible approach that efficiently enables active caching schemes for database-backed web sites. Furthermore, handling query containment at the proxy yields significant performance advantages over passive query caching, but extending the power of the active cache to do full semantic caching appears to be less generally effective.  相似文献   

15.
Traditional search engines have become the most useful tools to search the World Wide Web. Even though they are good for certain search tasks, they may be less effective for others, such as satisfying ambiguous or synonym queries. In this paper, we propose an algorithm that, with the help of Wikipedia and collaborative semantic annotations, improves the quality of web search engines in the ranking of returned results. Our work is supported by (1) the logs generated after query searching, (2) semantic annotations of queries and (3) semantic annotations of web pages. The algorithm makes use of this information to elaborate an appropriate ranking. To validate our approach we have implemented a system that can apply the algorithm to a particular search engine. Evaluation results show that the number of relevant web resources obtained after executing a query with the algorithm is higher than the one obtained without it.  相似文献   

16.
17.
Web Search is increasingly entity centric; as a large fraction of common queries target specific entities, search results get progressively augmented with semi-structured and multimedia information about those entities. However, search over personal web browsing history still revolves around keyword-search mostly. In this paper, we present a novel approach to answer queries over web browsing logs that takes into account entities appearing in the web pages, user activities, as well as temporal information. Our system, B-hist, aims at providing web users with an effective tool for searching and accessing information they previously looked up on the web by supporting multiple ways of filtering results using clustering and entity-centric search. In the following, we present our system and motivate our User Interface (UI) design choices by detailing the results of a survey on web browsing and history search. In addition, we present an empirical evaluation of our entity-based approach used to cluster web pages.  相似文献   

18.
Towards Intelligent Semantic Caching for Web Sources   总被引:2,自引:0,他引:2  
An intelligent semantic caching scheme suitable for web sources is presented. Since web sources typically have weaker querying capabilities than conventional databases, existing semantic caching schemes cannot be directly applied. Our proposal takes care of the difference between the query capabilities of an end user system and web sources. In addition, an analysis on the match types between a user's input query and cached queries is presented. Based on this analysis, we present an algorithm that finds the best matched query under different circumstances. Furthermore, a method to use semantic knowledge, acquired from the data, to avoid unnecessary access to web sources by transforming the cache miss to the cache hit is presented. To verify the effectiveness of the proposed semantic caching scheme, we first show how to generate synthetic queries exhibiting different levels of semantic localities. Then, using the test sets, we show that the proposed query matching technique is an efficient and effective way for semantic caching in web databases.  相似文献   

19.
20.
We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression & associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers a performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark showing its superior performance compared to the Spark’s COUNT method.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号