Similar Documents
20 similar documents found (search time: 31 ms)
1.
To mine hot-topic information of public concern effectively from the Web logs of e-government sites, this paper proposes PHIMA, a public-opinion hot-topic mining algorithm based on region-channel access degree. Building on an analysis of the shortcomings of existing Web log mining algorithms, the algorithm uses the proposed region-channel access degree to construct a Web access matrix, and combines that matrix with grey relational analysis. Experiments show that the algorithm mines public-opinion hot topics effectively and can support e-government site optimization, personalized services, and decision support for policy makers.
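The region-channel access matrix at the core of this approach can be illustrated as a plain visit-count aggregation over parsed log records; the function name and the (region, channel) input format are assumptions for the sketch, not details from the paper:

```python
from collections import defaultdict

def access_matrix(log_entries):
    """Aggregate (region, channel) pairs parsed from Web log records
    into a region x channel visit-count matrix."""
    counts = defaultdict(lambda: defaultdict(int))
    for region, channel in log_entries:
        counts[region][channel] += 1
    return {region: dict(row) for region, row in counts.items()}
```

Grey relational analysis would then rank channels (topics) by comparing the rows of this matrix against a reference access pattern.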

2.
This paper presents the application of data mining algorithms to the prediction of Web performance. Our domain-driven data mining uses historic HTTP transactions data reflecting Web performance as perceived by the end-users located in the Internet domain of Wroclaw University of Technology, Wroclaw, Poland. The predictive modeling features of two general data mining systems, Microsoft SQL Server and IBM Intelligent Miner, are compared. The neural networks, decision tree, time series, and transform regression models are evaluated. It is shown that the data mining algorithms return quite accurate prediction results. The best results are achieved using IBM's transform regression algorithm.

3.
With increasing richness in features such as personalization of content, Web applications are becoming increasingly complex and hence compute intensive. Traditional approaches for improving performance of static content Web sites have been based on the assumption that static content such as images is network intensive. However, these methods are not applicable to dynamic content applications, which are more compute intensive than static content. This paper proposes a suite of algorithms which jointly optimize the performance of dynamic content applications by reducing the client access times while also minimizing the resource utilization. A server migration algorithm allocates servers on-demand within a cluster such that the client access times are not affected even under sudden overload conditions. Further, a server selection mechanism enables statistical multiplexing of resources across clusters by redirecting requests away from overloaded clusters. We also propose a cluster decision algorithm which decides whether to migrate in additional servers at the local cluster or redirect requests remotely under different workload conditions. Through a combination of analytical modeling, trace-driven simulation over traces from large e-commerce sites and testbed implementation, we explore the performance savings achieved by the proposed algorithms.

4.
The recently introduced Datalog+/- family of ontology languages is especially useful for representing and reasoning over lightweight ontologies, and is set to play a central role in the context of query answering and information extraction for the Semantic Web. Recently, it has become apparent that it is necessary to develop a principled way to handle uncertainty in this domain. In addition to uncertainty as an inherent aspect of the Web, one must also deal with forms of uncertainty due to inconsistency and incompleteness, uncertainty resulting from automatically processing Web data, as well as uncertainty stemming from the integration of multiple heterogeneous data sources. In this paper, we take an important step in this direction by developing a probabilistic extension of Datalog+/-. This extension uses Markov logic networks as the underlying probabilistic semantics. Here, we focus especially on scalable algorithms for answering threshold queries, which correspond to the question “what is the set of all ground atoms that are inferred from a given probabilistic ontology with a probability of at least p?”. These queries are especially relevant to Web information extraction, since uncertain rules lead to uncertain facts, and only information with a certain minimum confidence is desired. We present several algorithms, namely a basic approach, an anytime one, and one based on heuristics, which is guaranteed to return sound results. Furthermore, we also study inconsistency in probabilistic Datalog+/- ontologies. We propose two approaches for computing preferred repairs based on two different notions of distance between repairs, namely symmetric and score-based distance. We also study the complexity of the decision problems corresponding to computing such repairs, which turn out to be polynomial and NP-complete in the data complexity, respectively.

5.
Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering by prioritizing important pages with the well-known PageRank as the importance metric. In order to score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces computational overhead by a factor as great as three in running time while improving effectiveness by 5% in cumulative PageRank.
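A best-first crawl ordering of this kind can be sketched with a priority queue keyed on an estimated page-importance score; here the partial-PageRank, title, and host features are abstracted into a precomputed `scores` map, and all names are illustrative rather than the paper's actual implementation:

```python
import heapq

def crawl_order(seeds, links, scores, limit=100):
    """Best-first crawl: always fetch the highest-scoring known URL next."""
    frontier = [(-scores.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen, order = set(seeds), []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        order.append(url)                    # "fetch" the page
        for nxt in links.get(url, ()):       # discover outlinks
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-scores.get(nxt, 0.0), nxt))
    return order
```

In a real crawler the scores would be updated incrementally as link structure is discovered, which is where the crawl-ordering algorithms compared in the paper differ.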

6.
《Computer Networks》1999,31(11-16):1467-1479
When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to Web searching where the input to the search process is not a set of query terms, but instead is the URL of a page, and the output is a set of related Web pages. A related Web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers. We describe two algorithms to identify related Web pages. These algorithms use only the connectivity information in the Web (i.e., the links between pages) and not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing our algorithms with Netscape's `What's Related' service (http://home.netscape.com/escapes/related/). Our study showed that the precision at 10 for our two algorithms is 73% and 51% better, respectively, than that of Netscape, despite the fact that Netscape uses both content and usage pattern information in addition to connectivity information.

7.
Application of Web Services in a Talent-Security Decision Support System   (cited 1 time: 0 self-citations, 1 external)
To address current problems in talent security, this paper proposes an architectural model for a talent-security decision support system. Web Services technology is applied within the system to build a self-learning analysis system (AS) linked to the model management system, the database/knowledge-base system, and the data warehouse system. The analysis system compares the computation, simulation, and prediction results of various model algorithms and, guided by the resulting feedback, continually adjusts the solution process, thereby providing decision makers with the best solution so that effective talent-security measures can be taken.

8.
Automatic identification of informative sections of Web pages   (cited 3 times: 0 self-citations, 3 external)
Web pages - especially dynamically generated ones - contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, copyright notices, etc. Most clients and end-users search for the primary content, and largely do not seek the noninformative content. A tool that assists an end-user or application to search and process information from Web pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms, ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, by 2) looking for blocks with desired features, and by 3) using classifiers, trained with block-features, respectively. While operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
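The first of these heuristics, discarding blocks that recur across many pages, can be illustrated as follows; the 50% threshold and the page-as-list-of-block-strings layout are assumptions for the sketch, not parameters from the paper:

```python
from collections import Counter

def primary_blocks(pages, max_share=0.5):
    """Treat blocks occurring on more than max_share of pages as
    boilerplate (navigation bars, footers, ads) and keep the rest."""
    n_pages = len(pages)
    freq = Counter(block for page in pages for block in set(page))
    return [[b for b in page if freq[b] / n_pages <= max_share]
            for page in pages]
```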

9.
A data mining algorithm for generalized Web prefetching   (cited 8 times: 0 self-citations, 8 external)
Predictive Web prefetching refers to the mechanism of deducing the forthcoming page accesses of a client based on its past accesses. In this paper, we present a new context for the interpretation of Web prefetching algorithms as Markov predictors. We identify the factors that affect the performance of Web prefetching algorithms. We propose a new algorithm called WM_o, which is based on data mining and is proven to be a generalization of existing ones. It was designed to address their specific limitations, and its characteristics include all the above factors. It compares favorably with previously proposed algorithms. Further, the algorithm efficiently addresses the increased number of candidates. We present a detailed performance evaluation of WM_o with synthetic and real data. The experimental results show that WM_o can provide significant improvements over previously proposed Web prefetching algorithms.
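The idea of interpreting prefetching algorithms as Markov predictors can be illustrated with a minimal first-order predictor; the paper's algorithm generalizes well beyond this, so the class and method names here are purely illustrative:

```python
from collections import defaultdict, Counter

class MarkovPrefetcher:
    """First-order Markov predictor over page-access sequences."""
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def train(self, session):
        # Count each observed page-to-page transition.
        for cur, nxt in zip(session, session[1:]):
            self.transitions[cur][nxt] += 1

    def predict(self, page):
        # Prefetch the most frequent successor, if any was observed.
        followers = self.transitions[page]
        return followers.most_common(1)[0][0] if followers else None
```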

10.
In this paper, we introduce a media decision taking engine (MDTE), enabling the automatic selection and/or rating of multimedia content versions based on the available context information. The presented approach is fully semantic-driven: we semantically model not only the context information but also the decision algorithms themselves, which are represented in N3 Rules, a rule language that extends RDF. The decision rules are based on a rating function supporting the specification of weights and affinity parameters for each environment property. Finally, we show how the MDTE is integrated in a media delivery platform, using the provisions of the existing Web infrastructure.

11.
Automatic fragment detection in dynamic Web pages and its impact on caching   (cited 2 times: 0 self-citations, 2 external)
Constructing Web pages from fragments has been shown to provide significant benefits for both content generation and caching. In order for a Web site to use fragment-based content generation, however, good methods are needed for fragmenting the Web pages. Manual fragmentation of Web pages is expensive, error prone, and unscalable. This paper proposes a novel scheme to automatically detect and flag fragments that are cost-effective cache units in Web sites serving dynamic content. Our approach analyzes Web pages with respect to their information sharing behavior, personalization characteristics, and change patterns. We identify fragments which are shared among multiple documents or have different lifetime or personalization characteristics. Our approach has three unique features. First, we propose a framework for fragment detection, which includes a hierarchical and fragment-aware model for dynamic Web pages and a compact and effective data structure for fragment detection. Second, we present an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, we develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. This paper shows the results when the algorithms are applied to real Web sites. We evaluate the proposed scheme through a series of experiments, showing the benefits and costs of the algorithms. We also study the impact of using the fragments detected by our system on key parameters such as disk space utilization, network bandwidth consumption, and load on the origin servers.

12.
When many services offer the same or similar functionality, QoS becomes a key consideration in service selection. QoS attributes are classified into crisp numbers, interval numbers, and triangular fuzzy numbers. On this basis, a service selection process is derived from a multi-attribute group decision-making model under TOPSIS (technique for order preference by similarity to an ideal solution); the process accounts for the weight each decision maker carries in the decision process as well as each decision maker's QoS preference weights. An example verifies the effectiveness of the method.
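For the crisp-number case, the TOPSIS ranking at the heart of this selection process can be sketched as below; treating all QoS criteria as benefit-type and folding the group and preference weights into a single weight vector are simplifications for illustration:

```python
def topsis(matrix, weights):
    """Score each alternative (row of the decision matrix) by relative
    closeness to the ideal solution; higher is better.
    All criteria are assumed benefit-type in this sketch."""
    cols = list(zip(*matrix))
    norms = [sum(v * v for v in col) ** 0.5 for col in cols]
    weighted = [[w * v / n for v, n, w in zip(row, norms, weights)]
                for row in matrix]
    ideal = [max(col) for col in zip(*weighted)]      # positive ideal
    anti = [min(col) for col in zip(*weighted)]       # negative ideal

    def dist(row, ref):
        return sum((a - b) ** 2 for a, b in zip(row, ref)) ** 0.5

    return [dist(row, anti) / (dist(row, anti) + dist(row, ideal))
            for row in weighted]
```

Cost-type criteria would be handled by inverting them before normalization, and interval or triangular fuzzy attributes require the extended distance measures discussed in the paper.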

13.
Browser-based cryptomining embeds mining code in Web pages so that, while users visit the site, the attacker illegally occupies victims' system and network resources to mine cryptocurrency for the attacker's own profit. By fusing Web page mining features, eight features were selected for detecting malicious mining attacks, and models were trained with four algorithms: logistic regression, support vector machine, decision tree, and random forest. The resulting detection model achieves an average recognition rate of 98.7%. Experiments further show that the random-forest model performs best for malicious-mining detection, and that the combination of three features (whether a WebSocket connection is present, the number of Web Workers, and the total number of postMessage and onmessage events) is highly discriminative for malicious-mining detection.

14.
In this paper we study the problem of searching the Web with online learning algorithms. We consider that Web documents can be represented by vectors of n boolean attributes. A search engine is viewed as a learner, and a user is viewed as a teacher. We investigate the number of queries a search engine needs from the user to search for a collection of Web documents. We design several efficient learning algorithms to search for any collection of documents represented by a disjunction (or a conjunction) of relevant attributes with the help of membership queries or equivalence queries.

15.
Multi-objective optimization with genetic algorithms and its application in decision support systems   (cited 1 time: 1 self-citation, 0 external)
Addressing the current state of regional economic development, an improved multi-objective economic optimization model is proposed, and a multi-objective genetic algorithm is used to optimize and simulate the model. On this basis, by combining MATLAB with Web technology, a new B/S-architecture decision support system solution is proposed.

16.
Link spam is created with the intention of boosting a target's rank in exchange for business profit. This unethical way of deceiving Web search engines is known as Web spam, and many anti-link-spam detection techniques have since been proposed. Web spam detection is a crucial task because of its devastating effect on Web search engines and a global cost of billions of dollars annually. In this paper, we propose a novel technique that incorporates weight properties to enhance Web spam detection algorithms. A weight property can be defined as the influence of one Web node on another. We modified existing Web spam detection algorithms with our technique and evaluated their performance on a large public Web spam dataset, WEBSPAM-UK2007. The results show that the modified algorithms outperform the benchmark algorithms by up to 30.5% at host level and 6.11% at page level.

17.
Research on decision-tree pruning optimization has focused mainly on pre-pruning and post-pruning algorithms. These pruning algorithms, however, are usually applied to traditional decision-tree classifiers, and little work has combined cost-sensitive learning with pruning optimization. Based on cost-benefit analysis theory from economics, this paper introduces the cost-benefit matrix, unit cost benefit, and related concepts, assigns class labels to decision-tree leaf nodes by the principle of maximizing unit cost benefit, and, in combination with a pre-pruning strategy, designs a new decision-tree pruning algorithm. Pruning the generated decision tree by unit cost benefit makes the tree cost-sensitive and well suited to practical problems. Experimental results show that the algorithm produces smaller decision trees and achieves better classification results than the REP and EBP algorithms.

18.
We develop a general sequence-based clustering method by proposing new sequence representation schemes in association with Markov models. The resulting sequence representations allow for calculation of vector-based distances (dissimilarities) between Web user sessions and thus can be used as inputs of various clustering algorithms. We develop an evaluation framework in which the performances of the algorithms are compared in terms of whether the clusters (groups of Web users who follow the same Markov process) are correctly identified using a replicated clustering approach. A series of experiments is conducted to investigate whether clustering performance is affected by different sequence representations and different distance measures as well as by other factors such as number of actual Web user clusters, number of Web pages, similarity between clusters, minimum session length, number of user sessions, and number of clusters to form. A new, fuzzy ART-enhanced K-means algorithm is also developed and its superior performance is demonstrated.
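One such vector representation, a normalized first-order transition-frequency encoding of a session, can be sketched as follows; the paper's actual schemes differ, and these function names and the fixed page vocabulary are illustrative assumptions:

```python
def session_vector(session, pages):
    """Encode a session as normalized page-transition frequencies,
    yielding a fixed-length vector for any vector-based clusterer."""
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    vec = [0.0] * (n * n)
    for a, b in zip(session, session[1:]):
        vec[idx[a] * n + idx[b]] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

def euclidean(u, v):
    """Vector-based dissimilarity between two encoded sessions."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Vectors produced this way can be fed directly to K-means or, as in the paper, to a fuzzy-ART-enhanced variant.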

19.
With the proliferation of smartphones and social media, journalistic practices are increasingly dependent on information and images contributed by local bystanders through Internet-based applications and platforms. Verifying the images produced by these sources is integral to forming accurate news reports, given that there is very little or no control over the type of user-contributed content, and hence, images found on the Web are always likely to be the result of image tampering. In particular, image splicing, i.e. the process of taking an area from one image and placing it in another, is a typical such tampering practice, often used with the goal of misinforming or manipulating Internet users. Currently, the localization of splicing traces in images found on the Web is a challenging task. In this work, we present the first, to our knowledge, exhaustive evaluation of today’s state-of-the-art algorithms for splicing localization, that is, algorithms attempting to detect which pixels in an image have been tampered with as the result of such a forgery. As our aim is the application of splicing localization on images found on the Web and social media environments, we evaluate a large number of algorithms aimed at this problem on datasets that match this use case, while also evaluating algorithm robustness in the face of image degradation due to JPEG recompressions. We then extend our evaluations to a large dataset we formed by collecting real-world forgeries that have circulated on the Web during the past years. We review the performance of the implemented algorithms and attempt to draw broader conclusions with respect to the robustness of splicing localization algorithms for application in Web environments, their current weaknesses, and the future of the field. Finally, we openly share the framework and the corresponding algorithm implementations to allow for further evaluations and experimentation.

20.
We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over Hierarchical Agglomeration Clustering and AutoClass algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms and we show that our algorithms are fast and scalable.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号