首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Efficient data mining for path traversal patterns   总被引:16,自引:0,他引:16  
The authors explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. The solution procedure consists of two steps. First, they derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, one can filter out the effect of some backward references, which are mainly made for ease of traveling and concentrate on mining meaningful user access sequences. Second, they derive algorithms to determine the frequent traversal patterns-i.e., large reference sequences-from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted  相似文献   

2.
Mining patterns from graph traversals   总被引:12,自引:0,他引:12  
In data models that have graph representations, users navigate following the links of the graph structure. Conducting data mining on collected information about user accesses in such models, involves the determination of frequently occurring access sequences. In this paper, the problem of finding traversal patterns from such collections is examined. The determination of patterns is based on the graph structure of the model. For this purpose, three algorithms, one which is level-wise with respect to the lengths of the patterns and two which are not are presented. Additionally, we consider the fact that accesses within patterns may be interleaved with random accesses due to navigational purposes. The definition of the pattern type generalizes existing ones in order to take into account this fact. The performance of all algorithms and their sensitivity to several parameters is examined experimentally.  相似文献   

3.
The recent increase in HyperText Transfer Protocol (HTTP) traffic on the World Wide Web (WWW) has generated an enormous amount of log records on Web server databases. Applying Web mining techniques on these server log records can discover potentially useful patterns and reveal user access behaviors on the Web site. In this paper, we propose a new approach for mining user access patterns for predicting Web page requests, which consists of two steps. First, the Minimum Reaching Distance (MRD) algorithm is applied to find the distances between the Web pages. Second, the association rule mining technique is applied to form a set of predictive rules, and the MRD information is used to prune the results from the association rule mining process. Experimental results from a real Web data set show that our approach improved the performance over the existing Markov-model approach in precision, recall, and the reduction of user browsing time. Mei-Ling Shyu received her Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN in 1999, and three Master's degrees from Computer Science, Electrical Engineering, and Restaurant, Hotel, Institutional, and Tourism Management from Purdue University. She has been an Associate Professor in the Department of Electrical and Computer Engineering (ECE) at the University of Miami (UM), Coral Gables, FL, since June 2005, Prior to that, she was an Assistant Professor in ECE at UM dating from January 2000. Her research interests include data mining, multimedia database systems, multimedia networking, database systems, and security. She has authored and co-authored more than 120 technical papers published in various prestigious journals, refereed conference/symposium/workshop proceedings, and book chapters. She is/was the guest editor of several journal special issues. Choochart Haruechaiyasak received his Ph.D. degree from the Department of Electrical and Computer Engineering, University of Miami, in 2003 with the Outstanding Departmental Graduating Student award from the College of Engineering. After receiving his degree, he has joined the National Electronics and Computer Technology Center (NECTEC), located in Thailand Science Park, as a researcher in Information Research and Development Division (RDI). His current research interests include data/ text/ Web mining, Natural Language Processing, Information Retrieval, Search Engines, and Recommender Systems. He is currently leading a small group of researchers and programmer to develop an open-source search engine for Thai language. One of his objectives is to promote the use of data mining technology and other advanced applications in Information Technology in Thailand. He is also a visiting lecturer for Data Mining, Artificial Intelligence and Decision Support Systems courses in many universities in Thailand. Shu-Ching Chen received his Ph.D. from the School of Electrical and Computer Engineering at Purdue University, West Lafayette, IN, USA in December, 1998. He also received Master's degrees in Computer Science, Electrical Engineering, and Civil Engineering from Purdue University. He has been an Associate Professor in the School of Computing and Information Sciences (SCIS), Florida International University (FIU) since August, 2004. Prior to that, he was an Assistant Professor in SCIS at FIU dating from August, 1999. His main research interests include distributed multimedia database systems and multimedia data mining. Dr. Chen has authored and co-authored more than 140 research papers in journals, refereed conference/symposium/workshop proceedings, and book chapters. In 2005, he was awarded the IEEE Systems, Man, and Cybernetics Society's Outstanding Contribution Award. He was also awarded a University Outstanding Faculty Research Award from FIU in 2004, Outstanding Faculty Service Award from SCIS in 2004 and Outstanding Faculty Research Award from SCIS in 2002.  相似文献   

4.
遍历模式数据挖掘方法已经在多种应用中被提出,传统的遍历模式挖掘仅仅考虑了非加权遍历。为解决加权遍历模式挖掘问题,首先提出了一种从EWDG(边加权有向图)到VWDG(顶点加权有向图)的变换模型;基于这种模型,提出了在具有层次特性的局部图遍历中,挖掘加权频繁模式的LGTWFPMiner(局部图遍历加权频繁模式挖掘法)及其支持度/权值界的局部评估方法。针对合成数据的实验结果表明该算法能够有效地进行基于图遍历的加权频繁模式挖掘。  相似文献   

5.
为解决加权图遍历模式的挖掘问题,提出了一种从加权有向图中挖掘加权频繁模式算法.在该算法中,利用图全局拓扑结构和顶点权值信息评估遍历模式的权支持度,从而将剪枝问题转化成模式可扩展性问题,再利用可扩展模式产生候选模式集.本算法把图,顶点权值融合进来,提高了挖掘结果的准确度.实验结果表明,该算法可以有效地进行基于加权向图的权频繁模式挖掘.  相似文献   

6.
用户访问兴趣路径挖掘方法   总被引:1,自引:1,他引:1  
针对当前挖掘用户访问模式算法仅将频繁访问路径作为用户浏览兴趣路径的问题,依据使用Web日志挖掘用户兴趣页面时,通过引入页面信息量参数,综合考虑页面访问次数、浏览时间和页面信息量大小来定义用户兴趣度,提出了基于兴趣度的用户访问模式挖掘算法。实验证明该算法是有效的,在用户浏览兴趣度量方面比当前的频繁访问路径挖掘算法更准确。  相似文献   

7.
Extensions to query languages for graph traversal problems   总被引:2,自引:0,他引:2  
Extensions to database query languages for retrievals that involve inferencing on the nodes and edges of a graph are surveyed. Common types of inferencing are to find paths between two nodes, compute a value for a path such as a distance or an elapsed time, and to choose among alternative paths. The survey is based on the data model (relational or functional), method of extension (iteration, recursion, or special operators), interface style (string or tabular), and restrictions (data- and problem-oriented). The Quel, objected-oriented functional data, G-Whin, and Alpha languages are examined in detail with different values for these properties. The characteristics of other languages are summarized in several tables. The results of the survey indicate the diversity of language extensions and the need to provide data-model and query-language features to address such problems  相似文献   

8.
Grahne et al. have presented a graph algorithm for evaluating a subset of recursive queries. This method consists of two phases. In the first phase, the method transforms a linear binary-chain program into a set of equations over expressions containing predicate symbols. In the second phase, a graph is constructed from the equations and the answers are produced by traversing the relevant paths. In this paper, we describe a new algorithm which requires less time than Grahne's. The key idea of the improvement is to reduce the search space that will be traversed when a query is invoked. Furthermore, we speed up the evaluation of cyclic data by generating most answers directly in terms of the answers already found and the associated "path information" instead of traversing the corresponding paths as usual. In this way, our algorithm achieves a linear time complexity for both acyclic and cyclic data.  相似文献   

9.
This paper considers a problem of finding predictive and useful association rules with a new Web mining algorithm, a streaming association rule (SAR) model. We first adopt a weighted order-dependent scheme (assigning more weights for early visited pages) rather than taking a traditional Boolean scheme (assigning 1 for visited and 0 for non-visited pages). This way, we intend to improve the limited representation of navigation patterns in previous association rule mining (ARM) algorithms. We also note that most traditional association rule models are not scalable because they require multiple scans of all records to re-calibrate a predictive model when there are new updates in original databases. The proposed SAR model takes a “divide-and-conquer” approach and requires only single scan of data sets to avoid the curse of dimensionality. Through comparative experiments on a real-world data set, we show that prediction models based on a weighted order-dependent representation are more accurate in predicting the next moves of Web navigators than models based on a Boolean representation. In particular, when combined with several heuristics developed to eliminate redundant association rules, SAR models show a very comparable prediction accuracy while maintaining a small fraction of association rules compared to traditional ARM models. Finally, we quantify and graphically show the significance or contribution of each pages to forming unique rule sets in each database segments.  相似文献   

10.
Recent theoretical insights have led to the introduction of efficient algorithms for mining closed item-sets. This paper investigates potential generalizations of this paradigm to mine closed patterns in relational, graph and network databases. Several semantics and associated definitions for closed patterns in relational data have been introduced in previous work, but the differences among these and the implications of the choice of semantics was not clear. The paper investigates these implications in the context of generalizing the LCM algorithm, an algorithm for enumerating closed item-sets. LCM is attractive since its run time is linear in the number of closed patterns and since it does not need to store the patterns output in order to avoid duplicates, further reducing memory signature and run time. Our investigation shows that the choice of semantics has a dramatic effect on the properties of closed patterns and as a result, in some settings a generalization of the LCM algorithm is not possible. On the other hand, we provide a full generalization of LCM for the semantic setting that has been previously used by the Claudien system.  相似文献   

11.
A k-disjoint path cover of a graph is a set of k internally vertex-disjoint paths which cover the vertex set with k paths and each of which runs between a source and a sink. Given that each source and sink v is associated with an integer-valued demand d(v)≥1, we are concerned with general-demand k-disjoint path cover in which every source and sink v is contained in the d(v) paths. In this paper, we present a reduction of a general-demand disjoint path cover problem to an unpaired many-to-many disjoint path cover problem, and obtain some results on disjoint path covers of restricted HL-graphs and proper interval graphs with faulty vertices and/or edges.  相似文献   

12.
Preferred navigation patterns (PNP) are those contiguous sequential patterns whose elements are preferred by users to be selected as the next steps between several different selections and are preferred by users to spend much time on. Such navigation path and time preferred patterns are more actionable than any other finds only considering either path or time in various web applications, such as web user navigation, targeted online advertising and recommendation. However, due to the conceptual confusion and limitation on navigation preference in the existing work, the corresponding algorithms cannot discover actionable preferred navigation patterns. In this paper, we study the problem of preferred navigation pattern mining by involving both navigation path and time length. Firstly, we carefully define the concepts of time preference and selection preference for time-related path sequences, which can well reflect user interests from the relative path selection and time consumption respectively. Secondly, we propose an efficient PNP-forest algorithm for identifying PNPs, by first introducing PNP-forest data structure, and then presenting PNP-forest growth and maintenance mechanism, associated with optimization strategies. Then we introduce a more efficient mining algorithm called PrefixSpan_Forest, which integrates the advantages of PrefixSpan and PNP-forest. The performance of these two algorithms are also evaluated and the results show that the algorithms can discover PNPs effectively.  相似文献   

13.
网络论坛中蕴涵着大量具有实用价值和商业价值的信息,是搜索引擎和问答系统信息的重要来源。针对论坛结构复杂、链接种类繁多,以及容易陷入采集陷阱等问题,提出了一种基于结构驱动的采集路径选择方法。首先根据用户标注的少量类型数据,利用DOM树对采样网页基于网页结构进行结构聚类;其次根据各节点的评价进行采集路径选择;最后对翻页链接进行有效的识别和处理。实验表明,该方法采集的覆盖率和有效率明显优于传统算法,并且应用在中国科学院计算所舆情监测平台上取得了良好的效果。  相似文献   

14.
一种基于后缀树的Web访问模式挖掘算法   总被引:4,自引:0,他引:4  
何丽  韩文秀 《计算机应用》2004,24(11):68-70
在Web使用挖掘中,分析用户的行为模式是一个关键的问题。文中提出了一种基于后缀树的最大频繁序列MFS(Maximal Frequent Sequences)的有效挖掘算法,该算法能够从增量数据中动态发现和输出MFS。  相似文献   

15.
On approximating the longest path in a graph   总被引:6,自引:0,他引:6  
We consider the problem of approximating the longest path in undirected graphs. In an attempt to pin down the best achievable performance ratio of an approximation algorithm for this problem, we present both positive and negative results. First, a simple greedy algorithm is shown to find long paths in dense graphs. We then consider the problem of finding paths in graphs that are guaranteed to have extremely long paths. We devise an algorithm that finds paths of a logarithmic length in Hamiltonian graphs. This algorithm works for a much larger class of graphs (weakly Hamiltonian), where the result is the best possible. Since the hard case appears to be that of sparse graphs, we also consider sparse random graphs. Here we show that a relatively long path can be obtained, thereby partially answering an open problem of Broderet al. To explain the difficulty of obtaining better approximations, we also prove hardness results. We show that, for any ε<1, the problem of finding a path of lengthn-n ε in ann-vertex Hamiltonian graph isNP-hard. We then show that no polynomial-time algorithm can find a constant factor approximation to the longest-path problem unlessP=NP. We conjecture that the result can be strengthened to say that, for some constant δ>0, finding an approximation of ration δ is alsoNP-hard. As evidence toward this conjecture, we show that if any polynomial-time algorithm can approximate the longest path to a ratio of , for any ε>0, thenNP has a quasi-polynomial deterministic time simulation. The hardness results apply even to the special case where the input consists of bounded degree graphs. D. Karger was supported by an NSF Graduate Fellowship, NSF Grant CCR-9010517, and grants from the Mitsubishi Corporation and OTL. R. Motwani was supported by an Alfred P. Sloan Research Fellowship, an IBM Faculty Development Award, grants from Mitsubishi and OTL, NSF Grant CCR-9010517, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, the Schlumberger Foundation, the Shell Foundation, and the Xerox Corporation, G. D. S. Ramkumar was supported by a grant from the Toshiba Corporation. Communicated by M. X. Goemans.  相似文献   

16.
A fast path planning by path graph optimization   总被引:1,自引:0,他引:1  
A fast path planning method by optimization of a path graph for both efficiency and accuracy is proposed. A conventional quadtree-based path planning approach is simple, robust, and efficient. However, it has two limitations. We propose a path graph optimization technique employing a compact mesh representation. A world space is triangulated into a base mesh and the base mesh is simplified to a compact mesh. The compact mesh representation is object-dependent; the positions of vertexes of the mesh are optimized according to the curvatures of the obstacles. The compact mesh represents the obstacles as accurately as the quadtree even though using much fewer vertexes than the quadtree. The compact mesh distributes vertexes in a free space in a balanced way by ensuring that the lengths of edges are below an edge length threshold. An optimized path graph is extracted from the compact mesh. An iterative vertex pushing method is proposed to include important obstacle boundary edges in the path graph. Dijkstra's shortest path searching algorithm is used to search the shortest path in the path graph. Experimental results show that the path planning using the optimized path graph is an order of magnitude faster than the quadtree approach while the length of the path generated by the proposed method is almost the same as that of the path generated by the quadtree.  相似文献   

17.
Web mining involves the application of data mining techniques to large amounts of web-related data in order to improve web services. Web traversal pattern mining involves discovering users’ access patterns from web server access logs. This information can provide navigation suggestions for web users indicating appropriate actions that can be taken. However, web logs keep growing continuously, and some web logs may become out of date over time. The users’ behaviors may change as web logs are updated, or when the web site structure is changed. Additionally, it can be difficult to determine a perfect minimum support threshold during the data mining process to find interesting rules. Accordingly, we must constantly adjust the minimum support threshold until satisfactory data mining results can be found.The essence of incremental data mining and interactive data mining is the ability to use previous mining results in order to reduce unnecessary processes when web logs or web site structures are updated, or when the minimum support is changed. In this paper, we propose efficient incremental and interactive data mining algorithms to discover web traversal patterns that match users’ requirements. The experimental results show that our algorithms are more efficient than other comparable approaches.  相似文献   

18.
Competitor Mining with the Web   总被引:1,自引:0,他引:1  
This paper is concerned with the problem of mining competitors from the web automatically. Nowadays the fierce competition in the market necessitates every company not only to know which companies are its primary competitors, but also in which fields the company's rivals compete with itself and what its competitors' strength is in a specific competitive domain. The task of competitor mining that we address in the paper includes mining all the information such as competitors, competing fields and competitors' strength. A novel algorithm called CoMiner is proposed, which tries to conduct a web-scale mining in a domain-independent manner. The CoMiner algorithm consists of three parts: 1) given an input entity, extracting a set of comparative candidates and then ranking them according to comparability; 2) extracting the fields in which the given entity and its competitors play against each other; 3) identifying and summarizing the competitive evidence that details the competitors' strength. As for evaluation, a prototype system implementing the CoMiner algorithm is presented. An evaluation data set consisting of 70 entities is constructed. 728 competitors and 3,640 competitive fields with 6,381 competitive evidences are discovered with the prototype. The experimental results show that the proposed algorithm is highly effective.  相似文献   

19.
挖掘频繁访问模式是Web日志挖掘的一个重要任务。针对类Apriori算法和GITC算法的不足,提出了基于双亲链的单次扫描求交的Web频繁访问模式挖掘算法—BIPL,该算法首先对用户的访问模式两两进行交集运算,生成候选访问模式,并在求交集过程中保存各个候选访问模式的双亲模式,然后通过简单的求和运算,计算出各个候选访问模式的支持数。最后通过理论分析和实验验证,该算法是稳定的和高效的。  相似文献   

20.
Web数据挖掘   总被引:30,自引:4,他引:26  
王实  高文 《计算机科学》2000,27(4):28-31
Web Mining is an important branch in Data Mining.It attracts more research interest for rapidly developing Internet. Web Mining includes(1)Web Content Mining;(g)Web Usage Mining;(3) Web structure Mining.In this paper we define Web Mining and present an overview of the various research issues,techniques and development efforts.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号