Similar Documents
20 similar documents found.
1.
Phishing attacks are growing significantly each year and are considered one of the most dangerous threats on the Internet, capable of causing people to lose confidence in e-commerce. In this paper, we present a heuristic method to determine whether a webpage is legitimate or a phishing page. This scheme can detect new phishing pages that blacklist-based anti-phishing tools cannot. We first convert a web page into 12 features, selected based on existing normal and phishing pages. A training set of web pages including normal and phishing pages is then used to train a support vector machine, and a testing set is finally fed into the trained model for evaluation. The experimental results show that, compared with existing methods, the proposed phishing detector achieves a high accuracy rate with relatively low false positive and false negative rates.
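A minimal sketch of the approach this abstract describes: convert a page into a fixed-length feature vector and train an SVM on labeled normal and phishing pages. The specific features below are illustrative guesses, not the paper's 12 features.

```python
# Heuristic phishing classifier sketch: page -> feature vector -> SVM.
# Feature definitions are illustrative stand-ins for the paper's 12 features.
from urllib.parse import urlparse
import numpy as np
from sklearn.svm import SVC

def extract_features(url: str, html: str) -> list:
    host = urlparse(url).netloc
    return [
        len(url),                               # long URLs are suspicious
        url.count("-"),                         # hyphen-heavy hostnames
        url.count("@"),                         # '@' can hide the real host
        1 if any(c.isdigit() for c in host.split(".")[0]) else 0,  # digits in leading host label
        html.lower().count("<iframe"),          # hidden iframes
        html.lower().count("password"),         # credential forms
    ]

# toy training data: (url, html, label) with label 1 = phishing
samples = [
    ("http://paypal-secure-login.example-bad.com/@verify", "<form>password</form>", 1),
    ("https://www.example.com/about", "<p>About us</p>", 0),
]
X = np.array([extract_features(u, h) for u, h, _ in samples])
y = np.array([label for _, _, label in samples])

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([extract_features("http://login-update.bad.example/@x", "password")]))
```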

2.
A growing volume of phishing attacks is encountered every day because of the high financial returns they bring attackers. Recently, there has been significant interest in applying machine learning to phishing Web page detection. Unlike previous work, this paper introduces predicted labels of textual contents as part of the features and proposes a novel framework for phishing Web page detection using hybrid features consisting of URL-based, Web-based, rule-based, and textual content-based features. The framework is realized as an efficient two-stage extreme learning machine (ELM). The first stage constructs classification models on the textual contents of Web pages using ELM; in particular, Optical Character Recognition (OCR) is used as an assistant tool to extract textual contents from image-format Web pages. In the second stage, a classification model on the hybrid features is developed using a linear combination model-based ensemble of ELMs (LC-ELMs), with the weights calculated by the generalized inverse. Experimental results indicate that the proposed framework is promising for detecting phishing Web pages.
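The core building block of this framework, a single-hidden-layer extreme learning machine, can be sketched as follows. The random hidden weights and pseudoinverse output weights follow the standard ELM recipe; the two-stage pipeline, OCR step, and LC-ELMs ensemble are omitted here.

```python
# Basic ELM: random hidden-layer weights, output weights solved in closed form
# with the Moore-Penrose pseudoinverse (the "generalized inverse" in the abstract).
import numpy as np

class ELM:
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)        # hidden-layer activations

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ np.asarray(y, dtype=float)  # closed-form output weights
        return self

    def predict(self, X):
        return (self._hidden(np.asarray(X, dtype=float)) @ self.beta > 0.5).astype(int)

# toy usage on random feature vectors (1 = phishing, 0 = legitimate)
X = np.vstack([np.random.rand(20, 8), np.random.rand(20, 8) + 1.0])
y = np.array([0] * 20 + [1] * 20)
print(ELM().fit(X, y).predict(X[:5]))
```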

3.
To better balance accuracy against resource overhead when designing an online, automatic, and efficient classification system for massive numbers of Web pages, a Web page classification method based on cascaded classifiers is proposed. The method uses a cascade strategy to combine online and offline classification, drawing on the strengths of each. The first stage of the cascade uses an online method that predicts the class using only the features of the page title contained in the anchor text, and simultaneously computes the confidence of the result, measured by the information entropy of the posterior class distribution. If this measure exceeds a threshold (precomputed with a multi-objective particle swarm optimization algorithm), the second-stage classifier is triggered. The second stage extracts features from the downloaded page body and performs offline classification with a classifier trained in advance on body-text features. The results show that, compared with the online and offline methods used alone, the cascaded system improves the F1 score by 10.85% and 4.57% respectively; its efficiency is not much lower than that of the online method (about 30% lower) and is about 70% higher than that of the offline method. The cascaded classification system not only classifies better but also markedly reduces the computational cost and bandwidth consumption of classification.
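A sketch of the cascade decision rule described in the abstract: the cheap first-stage classifier predicts from title features, the entropy of its posterior serves as the confidence measure, and only uncertain pages are downloaded for the second-stage classifier. The classifiers and the threshold value are placeholders (the paper tunes the threshold with multi-objective particle swarm optimization).

```python
# Two-stage cascade: stage 1 is an online classifier on anchor-text/title features;
# stage 2 (offline, body-text classifier) runs only when stage 1 is uncertain.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def cascade_classify(title_probs, fetch_body_probs, threshold=0.5):
    """title_probs: posterior from the online (title-only) classifier.
    fetch_body_probs: callable that downloads the page body and returns the
    posterior from the offline (body-text) classifier."""
    if entropy(title_probs) <= threshold:           # confident: accept stage 1
        return max(range(len(title_probs)), key=title_probs.__getitem__)
    body_probs = fetch_body_probs()                 # uncertain: trigger stage 2
    return max(range(len(body_probs)), key=body_probs.__getitem__)

# toy example: stage 1 is unsure (high entropy), so stage 2 decides
print(cascade_classify([0.55, 0.45], lambda: [0.1, 0.9]))
```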

4.
Since the Web encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks. A Web document may be authored in multiple ways, such as: (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages containing keywords. We introduce the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing. These functionalities are essential for information retrieval on the Web and large XML databases.

5.
Most Web pages contain location information, which is usually neglected by traditional search engines. Queries combining location and textual terms are called spatial textual Web queries. Based on the fact that traditional search engines pay little attention to the location information in Web pages, in this paper we study a framework that utilizes location information for Web search. The proposed framework consists of an offline stage that extracts focused locations for crawled Web pages and an online ranking stage that performs location-aware ranking of search results. The focused locations of a Web page are the most appropriate locations associated with that page. In the offline stage, we extract the focused locations and keywords from Web pages and map each keyword to specific focused locations, which forms a set of <keyword, location> pairs. In the online query processing stage, we extract keywords from the query and compute the ranking scores based on location relevance and the location-constrained scores for each query keyword. Experiments on various real datasets crawled from nj.gov, BBC, and the New York Times show that our focused-location extraction outperforms previous methods and that the proposed ranking algorithm performs best w.r.t. different spatial textual queries.
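An illustrative sketch of the online ranking stage: pages carry precomputed <keyword, location> pairs from the offline stage, and the final score mixes keyword relevance with the closeness of the page's focused location to the query location. The scoring formula, weights, and distance measure below are assumptions, not the paper's exact ranking function.

```python
# Location-aware ranking over precomputed <keyword, location> pairs.
import math

# offline output: page -> {keyword: (lat, lon) of its focused location}
index = {
    "page_a": {"museum": (40.7128, -74.0060)},   # New York
    "page_b": {"museum": (39.9526, -75.1652)},   # Philadelphia
}

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])   # crude planar distance

def score(page, query_kw, query_loc, alpha=0.5):
    pairs = index[page]
    if query_kw not in pairs:
        return 0.0
    loc_score = 1.0 / (1.0 + distance(pairs[query_kw], query_loc))
    return alpha * 1.0 + (1 - alpha) * loc_score   # keyword match + location relevance

query = ("museum", (40.7580, -73.9855))            # query issued near Times Square
print(sorted(index, key=lambda p: score(p, *query), reverse=True))
```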

6.
A deep detection system for phishing Web pages based on ensemble learning
Phishing is a form of online fraud in which phishing pages imitate normal, legitimate pages in order to steal users' sensitive information for illegal purposes. This paper proposes a deep detection method for phishing Web pages based on ensemble learning. Page rendering is used to counter common page-camouflage techniques, and URL features, link features, and page-text features are extracted from the rendered page. Using ensemble learning, different base classifier models are built and trained for the different feature groups, and a classification ensemble strategy finally combines the base classifiers to produce the result. Detection experiments on PhishTank phishing pages show that the proposed method achieves good precision and recall.
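A minimal sketch of the ensemble idea: separate base classifiers are trained on the URL, link, and page-text feature groups of rendered pages, and their predictions are combined by voting. The models and feature layout below are illustrative placeholders.

```python
# One base classifier per feature group, combined by majority vote.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

N_URL, N_LINK, N_TEXT = 4, 3, 5    # columns per feature group in X (assumed layout)

def select(cols):
    # pass only this group's columns to the downstream classifier
    return FunctionTransformer(lambda X: X[:, cols])

ensemble = VotingClassifier([
    ("url",  make_pipeline(select(slice(0, N_URL)), LogisticRegression())),
    ("link", make_pipeline(select(slice(N_URL, N_URL + N_LINK)), LogisticRegression())),
    ("text", make_pipeline(select(slice(N_URL + N_LINK, None)), LogisticRegression())),
], voting="hard")

# toy data: rows are rendered pages, 1 = phishing
rng = np.random.default_rng(0)
X = rng.random((40, N_URL + N_LINK + N_TEXT))
y = (X[:, 0] > 0.5).astype(int)
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```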

7.
《Computer Networks》1999,31(11-16):1467-1479
When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to Web searching in which the input to the search process is not a set of query terms but the URL of a page, and the output is a set of related Web pages. A related Web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers. We describe two algorithms to identify related Web pages. These algorithms use only the connectivity information in the Web (i.e., the links between pages), not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing them with Netscape's `What's Related' service (http://home.netscape.com/escapes/related/). Our study showed that the precision at 10 for our two algorithms is 73% and 51% better, respectively, than that of Netscape, despite the fact that Netscape uses both content and usage-pattern information in addition to connectivity information.
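A toy connectivity-only computation in the spirit of this abstract: pages that are frequently co-cited with the seed URL (i.e., linked from the same parent pages) are treated as related. This is a simplified illustration, not the paper's actual algorithms.

```python
# "Related pages" from link structure alone: rank pages by co-citation count.
from collections import Counter

# link graph: parent page -> pages it links to (illustrative data)
links = {
    "portal1": ["nytimes.com", "washingtonpost.com", "wsj.com"],
    "portal2": ["nytimes.com", "washingtonpost.com"],
    "blog":    ["nytimes.com", "example.org"],
}

def related(seed, graph, top_k=3):
    cocited = Counter()
    for children in graph.values():
        if seed in children:
            for page in children:
                if page != seed:
                    cocited[page] += 1            # co-citation count
    return [page for page, _ in cocited.most_common(top_k)]

print(related("nytimes.com", links))   # washingtonpost.com ranks first
```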

8.
In this paper we outline the use of term rewriting techniques for modeling the dynamic behavior of Web sites. We associate with each Web page rewrite rules expressing the Web pages that are immediately reachable from it. The resulting system permits the application of well-known results from rewriting theory to analyse interesting properties of the Web site. In particular, we briefly discuss the use of some logics with strong connections to term rewriting as a basis for specifying and verifying dynamic properties of Web sites. We use Maude as a specification language for such rewriting models; it also makes it possible to directly explore interesting dynamic properties of Web sites.

9.
A home page is the gateway to an organization's Web site. To design effective Web home pages, it is necessary to understand the fundamental drivers of users' perception of Web pages. Not only do designers have to understand potential users' frame of mind, they also have at their disposal a stupefying array of attributes – including numerous font types, audio, video, and graphics – all of which can be arranged on a page in different ways, compounding the complexity of the design task. A theoretical model capable of explaining user reactions at a molar level should be invaluable to Web designers as a complement to prevalent intuitive and heuristic approaches. Such a model transcends piecemeal page attributes to focus on overall Web page perceptions of users. Reasoning that people perceive the cyberspace of Web pages in ways similar to their perception of physical places, we use Kaplan and Kaplan's informational model of place perception from the field of environmental psychology to predict that only two dimensions, understanding of information on a Web page and the involvement potential of a Web page, should adequately capture Web page perception at a molar level. We empirically verify the existence of these dimensions and develop valid scales for measuring them. Using a home page as a stimulus in a lab experiment, we find that understanding and involvement together account for a significant amount of the variance in the attitude toward the Web page and in the intention to browse the underlying Web site. We show that the informational model is a parsimonious and powerful theoretical framework to measure users' perceptions of Web home pages and it could potentially serve as a guide to Web page design and testing efforts.

10.
As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently inputted into a machine learning algorithm that will perform classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information.

11.
12.
This paper presents the QA-Pagelet as a fundamental data preparation technique for large-scale data analysis of the deep Web. To support QA-Pagelet extraction, we present the Thor framework for sampling, locating, and partitioning the QA-Pagelets from the deep Web. Two unique features of the Thor framework are 1) the novel page clustering for grouping pages from a deep Web source into distinct clusters of control-flow dependent pages and 2) the novel subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets within highly ranked page clusters. We evaluate the effectiveness of the Thor framework through experiments using both simulation and real data sets. We show that Thor performs well over millions of deep Web pages and over a wide range of sources, including e-commerce sites, general and specialized search engines, corporate Web sites, medical and legal resources, and several others. Our experiments also show that the proposed page clustering algorithm achieves low-entropy clusters, and the subtree filtering algorithm identifies QA-Pagelets with excellent precision and recall.

13.
In web browsers, a variety of anti-phishing tools and technologies are available to assist users to identify phishing attempts and potentially harmful pages. Such anti-phishing tools and technologies provide Internet users with essential information, such as warnings of spoofed pages. To determine how well users are able to recognise and identify phishing web pages with anti-phishing tools, we designed and conducted usability tests for two types of phishing-detection applications: blacklist-based and whitelist-based anti-phishing toolbars. The research results mainly indicate no significant performance differences between the application types. We also observed that, in many web browsing cases, a significant amount of useful and practical information for users is absent, such as information explaining professional web page security certificates. Such certificates are crucial in ensuring user privacy and protection. We also found other deficiencies in web identities in web pages and web browsers that present challenges to the design of anti-phishing toolbars. These challenges will require more professional, illustrative, instructional, and reliable information for users to facilitate user verification of the authenticity of web pages and their content.

14.
In this paper, we present a new rule-based method to detect phishing attacks in internet banking. Our method uses two novel feature sets proposed to determine webpage identity: four features that evaluate the identity of page resources and four features that identify the access protocol of page resource elements. In the first feature set, approximate string matching algorithms determine the relationship between the content and the URL of a page. The proposed features are independent of third-party services such as search engine results and web browser history. We employ a support vector machine (SVM) to classify webpages. Our experiments indicate that the proposed model detects phishing pages in internet banking with a 99.14% true positive rate and only a 0.86% false negative rate. A sensitivity analysis demonstrates the significant impact of the proposed features over traditional features. We extracted the hidden knowledge from the SVM model with a related rule-extraction method and embedded the extracted rules into a browser extension named PhishDetector to make the method more practical and easy to use. Evaluation of the implemented browser extension indicates that it detects phishing attacks in internet banking with high accuracy and reliability, and PhishDetector can detect zero-day phishing attacks as well.
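A sketch of one ingredient of this method: approximate string matching relating a page's claimed identity to its URL and resources. The feature definitions below are illustrative assumptions, not the paper's exact feature sets.

```python
# Identity-oriented features built with approximate string matching (difflib).
from difflib import SequenceMatcher
from urllib.parse import urlparse

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def page_identity_features(url: str, page_title: str, resource_urls: list) -> list:
    host = urlparse(url).netloc
    title_vs_host = similarity(page_title, host)                     # does the title match the domain?
    foreign = sum(urlparse(r).netloc not in ("", host) for r in resource_urls)
    https_resources = sum(r.startswith("https://") for r in resource_urls)
    return [title_vs_host,
            foreign / max(len(resource_urls), 1),                    # share of off-domain resources
            https_resources / max(len(resource_urls), 1)]            # share loaded over HTTPS

print(page_identity_features(
    "http://secure-bank-login.bad.example/",
    "MyBank Online Banking",
    ["http://cdn.mybank.com/logo.png", "http://bad.example/form.js"]))
```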

15.
To counter the forged HTTPS sites and other obfuscation techniques commonly used by phishing attackers, this paper builds on RMLR and PhishDef, two mainstream phishing detection methods based on machine learning and rule matching, adds feature extraction from page text keywords and page sub-links, and proposes the Nmap-RF classification method. Nmap-RF is an integrated phishing detection method based on rule matching and random forests. Sites are pre-filtered according to the page protocol; if a site is judged to be phishing at this step, the subsequent feature extraction is skipped. Otherwise, the text-keyword confidence, sub-link confidence, phishing-vocabulary similarity, and PageRank of the page are used as key features, with common URL, Whois, DNS, and page-tag information as auxiliary features, and a random forest model gives the final classification. Experiments show that Nmap-RF can detect a phishing page in 9-10 μs on average, filter out 98.4% of illegitimate pages, and reach an average overall precision of 99.6%.
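A minimal sketch of the Nmap-RF flow: a rule-based pre-filter on the protocol/URL rejects obvious phishing pages, and the remaining pages are scored by a random forest over the key features named in the abstract. The rule and the toy feature values are placeholders.

```python
# Rule-based pre-filter + random forest over key features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rule_prefilter(url: str) -> bool:
    """Return True if the URL is rejected outright as phishing (illustrative rule)."""
    return url.startswith("http://") and ("@" in url or ("login" in url and "-" in url))

# toy feature vectors: [keyword confidence, sub-link confidence,
#                       phishing-vocabulary similarity, PageRank]
X = np.array([[0.9, 0.8, 0.1, 0.7], [0.2, 0.3, 0.9, 0.0], [0.8, 0.9, 0.2, 0.5]])
y = np.array([0, 1, 0])    # 1 = phishing
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def classify(url: str, features):
    if rule_prefilter(url):
        return 1                                   # rejected without the ML step
    return int(forest.predict([features])[0])

print(classify("http://paypal-login.bad.example/@verify", [0.2, 0.3, 0.9, 0.0]))
print(classify("https://www.example.com/", [0.9, 0.8, 0.1, 0.7]))
```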

16.
Given a user keyword query, current Web search engines return a list of individual Web pages ranked by their "goodness" with respect to the query. Thus, the basic unit for search and retrieval is an individual page, even though information on a topic is often spread across multiple pages. This degrades the quality of search results, especially for long or uncorrelated (multitopic) queries (in which individual keywords rarely occur together in the same document), where a single page is unlikely to satisfy the user's information need. We propose a technique that, given a keyword query, on the fly generates new pages, called composed pages, which contain all query keywords. The composed pages are generated by extracting and stitching together relevant pieces from hyperlinked Web pages and retaining links to the original Web pages. To rank the composed pages, we consider both the hyperlink structure of the original pages and the associations between the keywords within each page. Furthermore, we present and experimentally evaluate heuristic algorithms to efficiently generate the top composed pages. The quality of our method is compared to current approaches by using user surveys. Finally, we also show how our techniques can be used to perform query-specific summarization of Web pages.  相似文献   

17.
刘强  郭景峰 《微机发展》2007,17(1):151-154
Most existing page recommendation systems based on access-path analysis consist of an offline part and an online part; because the periodic offline processing is time-consuming, they are ill-suited to large sites and sites whose content is updated frequently. A new page recommendation model based on analysis of users' access paths is proposed. The model works entirely online, forms page clusters with an incremental graph-partitioning method, and generates dynamic page recommendations from these clusters. It is implemented as an Apache module and is applicable to large sites and frequently updated sites. Experimental results show that the model has good overall performance.

18.
An antiphishing strategy based on visual similarity assessment
The authors' proposed antiphishing strategy uses visual characteristics to identify potential phishing sites and measures a suspicious page's similarity to the actual sites registered with the system. The first of two sequential processes in the SiteWatcher system runs on local email servers and monitors emails for keywords and suspicious URLs. The second process then compares the potential phishing pages against the actual pages and assesses visual similarities between them in terms of key regions, page layouts, and overall styles. The approach is designed to be part of an enterprise antiphishing solution.
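A toy illustration of the visual-similarity assessment: a suspicious page is compared with the protected page it imitates on key regions, page layout, and overall style, and the weighted score is checked against an assumed threshold. The similarity functions here are simple stand-ins for the paper's visual comparisons.

```python
# Weighted visual similarity over three signals: key regions, layout, style.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def weighted_visual_similarity(suspect, original, weights=(0.4, 0.4, 0.2)):
    signals = (
        jaccard(suspect["regions"], original["regions"]),   # key block regions
        jaccard(suspect["layout"], original["layout"]),     # block arrangement
        jaccard(suspect["styles"], original["styles"]),     # fonts / colours
    )
    return sum(w * s for w, s in zip(weights, signals))

original = {"regions": {"logo", "login-form", "footer"},
            "layout": {"2-column", "header-top"},
            "styles": {"#003087", "Arial"}}
suspect = {"regions": {"logo", "login-form"},
           "layout": {"2-column", "header-top"},
           "styles": {"#003087", "Verdana"}}

THRESHOLD = 0.7    # assumed alerting threshold
score = weighted_visual_similarity(suspect, original)
print(score, "phishing suspect" if score >= THRESHOLD else "dissimilar")
```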

19.
Real-time detection of Trojan-embedded (drive-by-download) Web pages based on statistical learning
In recent years, Web pages rigged to deliver malware (drive-by-download pages) have become a serious threat to Web security; the main client-side defences are anti-virus software and blacklists of malicious sites. Anti-virus software relies on signature matching and cannot effectively detect page script code that has been encrypted or obfuscated, while blacklists cannot defend against newly appearing malicious sites. This paper proposes a novel real-time detection method for such pages that is independent of the page's content and code. The method extracts statistical features of the HTTP sessions generated when a page is visited, builds a classification model with a decision-tree machine learning method, and uses it for online real-time detection. Experiments show that the method achieves a detection rate of 89.7% for malicious pages with a false positive rate of 0.3%.
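A sketch of the content-agnostic idea above: statistical features of the HTTP sessions observed while a page loads, rather than the page code itself, feed a decision tree. The feature list is an illustrative guess at the kind of session statistics meant, not the paper's exact set.

```python
# Decision tree over HTTP-session statistics (content- and code-independent).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def session_features(session):
    """session: list of (host, status_code, bytes, is_redirect) per request."""
    hosts = {h for h, _, _, _ in session}
    redirects = sum(r for _, _, _, r in session)
    sizes = [b for _, _, b, _ in session]
    return [len(session),             # number of requests
            len(hosts),               # distinct hosts contacted
            redirects,                # redirect chain length
            float(np.mean(sizes))]    # mean response size

benign   = [("example.com", 200, 18000, 0), ("cdn.example.com", 200, 52000, 0)]
drive_by = [("blog.example.net", 200, 9000, 0), ("x1.bad.example", 302, 300, 1),
            ("x2.bad.example", 302, 250, 1), ("exploit.bad.example", 200, 41000, 0)]

X = np.array([session_features(benign), session_features(drive_by)])
y = np.array([0, 1])    # 1 = page serving drive-by malware
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([session_features(drive_by)]))
```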

20.
Web users tend to search only the pages displayed at the top of the search engine results page (the ‘top link’ heuristic). Although it might be reasonable to use this heuristic to navigate simple and unambiguous facts, it might be risky when searching for conflicting socio-scientific topics, such as potential measures to reduce greenhouse gas emissions. In the present study, we explored the extent to which students consider other Web page characteristics, such as topic relevance and trustworthiness, when searching and bookmarking pages concerning a conflicting topic. We also examined the extent to which prior background knowledge moderates students’ behavior. The results revealed that while the study participants actually used a ‘top link’ heuristic to navigate the results, they engaged in more systematic processes to bookmark pages for further study. Furthermore, the students’ background knowledge was related to the assessment of Web page trustworthiness. We discuss these results from the perspective of a dual-processing model.
