19 similar documents found; search took 125 ms.
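Because traditional methods for detecting gambling websites suffer from poor timeliness and low accuracy, a gambling website detection method based on the PAM probabilistic topic model is proposed. The text content of a website and its associated pages is extracted, and different pieces of text are assigned different weights with reference to the site's structural information. The PAM model is then used to mine topics from the page text and to analyze whether the text leans with high probability toward the "gambling" topic. Finally, the topic information of all extracted pages is combined to judge whether the site is a gambling website, thereby achieving effective detection. Experiments show that the method reaches an accuracy of 72.3% in gambling website detection.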
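
The final aggregation step described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each page already has a structural weight and a topic-model probability for the "gambling" topic; the weighted mean, the function names, and the 0.5 threshold are illustrative assumptions, not the authors' actual formula.

```python
from typing import List, Tuple


def site_gambling_score(pages: List[Tuple[float, float]]) -> float:
    """Weighted mean of per-page gambling-topic probabilities.

    Each item is (weight, probability). Both values are assumed to come
    from earlier steps (structure-based weighting and topic inference).
    """
    total_weight = sum(weight for weight, _ in pages)
    if total_weight == 0:
        return 0.0  # no usable pages: treat the site as not gambling
    return sum(weight * prob for weight, prob in pages) / total_weight


def is_gambling_site(pages: List[Tuple[float, float]], threshold: float = 0.5) -> bool:
    """Flag the site when the aggregated score reaches the (assumed) threshold."""
    return site_gambling_score(pages) >= threshold
```

For example, a home page with weight 2.0 and probability 0.9 combined with a linked page with weight 1.0 and probability 0.3 gives a site score of 0.7, which this sketch would flag.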

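A blog can be described as an online personal journal. As an emerging medium, the blog has become a very popular way of expressing personal opinions and feelings on the Web. Extracting useful information from blogs quickly and accurately (post time, post title, post content, comments, and so on) has therefore become a very important step in blog applications. A template-based blog information extraction method is proposed: it analyzes the HTML source of a blog site, derives the site's template, and extracts information from blog pages according to that template. Templates were extracted for 10 well-known Chinese blog sites, and experiments were run on 7,374 blog pages from those sites. The results show that the method extracts information from blog pages quickly and accurately using the derived templates.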

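Web pages usually contain a large amount of irrelevant structure and HTML markup in which the page's main topical content is buried, which raises the question of how to obtain that content quickly. This paper proposes an extraction strategy: first determine whether the page is a topic-type page, then extract the page's body text, and finally use regular expressions to filter out the HTML tags and irrelevant text in the content blocks. Experimental results show that the method accurately performs body-text extraction for topic-type pages.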
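
The last step, filtering HTML tags out of a content block with regular expressions, might look roughly like the sketch below. The patterns and the `strip_markup` name are assumptions made for illustration; the abstract does not give the actual expressions, and regex-based stripping is only reasonable for a block that has already been isolated, not for parsing whole pages.

```python
import html
import re

# Remove <script>/<style> elements together with their contents.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)
# Remove any remaining tag, keeping the text between tags.
TAG_RE = re.compile(r"<[^>]+>")
# Collapse runs of whitespace left behind by the removals.
WHITESPACE_RE = re.compile(r"\s+")


def strip_markup(block: str) -> str:
    """Return the plain text of an already-isolated content block."""
    without_scripts = SCRIPT_STYLE_RE.sub(" ", block)
    without_tags = TAG_RE.sub(" ", without_scripts)
    text = html.unescape(without_tags)  # decode entities such as &amp;
    return WHITESPACE_RE.sub(" ", text).strip()
```

Calling `strip_markup("<p>A &amp; B</p>")` returns `"A & B"`.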

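The rapidly accumulating mass of product reviews is an important basis for merchants and consumers doing demand research or making purchase decisions. To address the heavy manual effort and low extraction accuracy common to existing web information extraction methods, WTS, a review extraction method based on weighted frequent-subtree similarity, is proposed. First, the page is pruned using visual features. Next, the best frequent subtree is extracted using a depth-weighted similarity measure. Finally, review paths are extracted by subtree alignment and the review content is parsed. Review extraction experiments on sites such as JD (京东) and Suning (苏宁) verify that WTS has some advantage over methods such as D-EEM and POL in extracting product reviews.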

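Accurately extracting the content of news pages is one of the key techniques for improving the quality of applications such as Web news analysis. Because there is no standard for Web news publishing, a large number of different publishing formats exist, and the Web itself is a highly heterogeneous big-data carrier, so Web news content extraction remains an open problem. Analysis of a large number of examples shows that the content of a news page is latently correlated with the tag paths it sits on. A family of tag-path features is therefore designed to distinguish page content from noise from different perspectives. Building on feature similarity analysis, a feature fusion strategy based on combined feature selection is proposed, together with CEPF, a Web news content extraction method based on the fused features. CEPF is a fast, general, training-free online Web news content extraction algorithm that can handle news pages from many sources, in many styles, and in many languages. Experiments on test datasets such as CleanEval show that CEPF outperforms extraction methods such as CETR.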

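Because manual extraction of web page information is inefficient and costly, and based on observation of a large number of page structures, a body-text extraction method based on the similarity of node paths in the page's Document Object Model (DOM) tree is proposed. Exploiting the fact that pages on the same website share the same structure, the method first removes page noise to obtain the page's main content, and then extracts the body text using the similarity of the paths of body-text nodes in the DOM tree. In experiments on 1,000 pages from different types of Chinese news sites, the method removed most of the noise while keeping the body text intact for 97.6% of the pages, and body-text extraction achieved 93.30% precision and 95.59% recall. The algorithm adapts well to different types of pages.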
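
To make the idea of DOM node path similarity concrete, here is a minimal sketch. It assumes that a node's path is written as a slash-separated tag sequence and that similarity is the shared-prefix length divided by the longer path length; both the representation and the measure are illustrative assumptions, since the abstract does not define them.

```python
from typing import List


def path_similarity(path_a: str, path_b: str) -> float:
    """Shared-prefix ratio of two tag paths such as 'html/body/div/p'."""
    tags_a = path_a.strip("/").split("/")
    tags_b = path_b.strip("/").split("/")
    common = 0
    for tag_a, tag_b in zip(tags_a, tags_b):
        if tag_a != tag_b:
            break
        common += 1
    return common / max(len(tags_a), len(tags_b))


def select_similar_paths(reference: str, candidates: List[str], threshold: float = 0.75) -> List[str]:
    """Keep candidate paths close enough to a known body-text path."""
    return [path for path in candidates if path_similarity(reference, path) >= threshold]
```

Under this measure, `html/body/div/p` and `html/body/ul/li` share two of four tags, giving a similarity of 0.5, so the second path would be treated as unlikely body text at the assumed 0.75 threshold.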

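Web information extraction usually relies on inductive learning, inducing extraction rules from designated template pages. Although this approach extracts information accurately, the rules must be re-derived whenever a site's template changes, so such extractors are costly to maintain and adapt poorly. To address this, the paper proposes an adaptive, DOM-tree-based method for Web information extraction from pages with multiple information blocks. The method first parses the page into a DOM tree with NekoHtml and then identifies the information blocks that contain the key phrases, thereby performing the extraction. Experiments on a large number of websites show that the method is applicable to information extraction across different sites and can handle Web pages with multiple information blocks.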

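An automatic data extraction method for template-generated web pages   Total citations: 5, self-citations: 0, citations by others: 5
Many pages on the Web today are generated dynamically: in response to a request, the site selects data from a back-end database and embeds it in a common template, as with product description pages on e-commerce sites. This paper studies how to detect the underlying template from such template-generated pages and automatically extract the embedded data (for example, product names and prices). A formal description of the template detection problem is given, and the structural characteristics of template-generated pages are analyzed in depth. A novel template detection method is proposed, and the detected template is used to extract data automatically from instance pages. Unlike other existing methods, it applies to both "list pages" and "detail pages". Experiments on two third-party test sets show that the method achieves very high extraction accuracy.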

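Classification methods based on traditional keyword statistics have difficulty correctly identifying the topic of a page, and therefore have difficulty classifying pages by topic. To classify structured data sources on the Web by topic effectively, a concept-based topic classification method that incorporates semantic knowledge is applied to the automatic topic classification of data sources from online shopping sites. Experiments show that the method improves the accuracy of topic classification.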

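The semi-structured nature of web pages and the characteristics of news itself make selective extraction of page content possible. Building on earlier work, we mine the structural features of Web pages, make full use of HTML tags and news characteristics, and, focusing on how page editors style text, propose a main-content extraction method based on page content segmentation. Experimental results show that the method effectively extracts the elements of a news item, with a tested extraction accuracy above 96%.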

Because of their convenience, auction sites are among the most popular and fastest-growing Internet shopping sites. Many problems have arisen in relation to online auction commerce, however, and these problems have been dealt with one by one. In addition to the important issue of security, some customers have faced problems such as complicated processes, insufficient product information, and bad interface design. Based on the consumer behavior model that we have built on auction sites, the purpose of this study is to evaluate the information and interface design of current domestic and foreign auction sites. The results indicated that some auction Web sites provided insufficient information and inconvenient interface design during some shopping steps. Foreign auction Web sites provided more sufficient information and a more convenient interface design than did the domestic auction Web sites. The results, hopefully, can be used as a reference for online auction Web sites to provide customers with more convenient shopping environments. © 2011 Wiley Periodicals, Inc.

As companies increase the quantity of information they provide through their Web sites, it is critical that content is structured with an appropriate architecture. However, resource constraints often limit the ability of companies to apply all Web design principles completely. This study quantifies the effects of two major information architecture principles in a controlled study that isolates the incremental effects of organizational scheme and labeling on user performance and satisfaction. Sixty participants with a wide range of Internet and on-line shopping experience were recruited to complete a series of shopping tasks on a prototype retail shopping Web site. User-centered labels provided a significant benefit in performance and satisfaction over labels obtained through company-centered methods. User-centered organization did not result in improved performance except when the label quality was poor. Significant interactions suggest specific guidelines for allocating resources in Web site design. Applications of this research include the design of Web sites for any commercial application, particularly E-commerce.

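For shopping information pages generated from templates, which carry large amounts of information and have complex structure, a method is proposed that extracts shopping information from template pages without using complex learning rules. The work covers defining page templates and page information extraction templates, designing a template language for building templates quickly, and proposing a model that extracts content based on the template language. Experimental results show that, on a standard test set of 450 pages, the recall of the proposed method is 12% higher than that of the extraction algorithm EXALG. On a test set of 250 pages, recall is 7.4% and 0.2% higher than that of the ViNTs method (a wrapper generator based on visual information and tag structure) and the ViPER method (augmenting automatic information extraction with visual perception) respectively, and precision is 5.2% and 0.2% higher than that of ViNTs and ViPER respectively. Both the recall and the precision of the information extraction method based on rapidly built templates improve substantially, which greatly improves the accuracy and information recall of page analysis in shopping information retrieval and price comparison systems.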

The Semantic Web and Web services provide many opportunities in various applications such as product search and comparison in electronic commerce. We implemented an intelligent meta-search and recommendation system for products through consideration of multiple attributes by using ontology mapping and Web services. Under the assumption that each shopping site offers product ontology and product search service with Web services, we proposed a meta-search framework to configure a customer’s search intent, make and dispatch proper queries to each shopping site, evaluate search results from shopping sites, and show the customer the relevant product list with associated rankings. Ontology mapping is used for generating proper queries for shopping sites that have different product categories. We also implemented our framework and performed empirical evaluation of our approach with two leading shopping sites in the world.

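Information extraction techniques for online forums   Total citations: 5, self-citations: 0, citations by others: 5
Based on an analysis of the information organization patterns and link structure within online forums, a framework for extracting semantic topic threads from online forums is proposed and its implementation is described. A complete specification of extraction rules is defined for information extraction, and the system provides a visual tool for user-customized rules and a back-end engine that automatically downloads and extracts semantic information units from forum sites.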

张万山  肖瑶  梁俊杰  余敦辉 《计算机应用》2014,34(11):3144-3146
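Traditional Web text clustering algorithms ignore the topic information of Web texts, which leads to low accuracy when clustering multi-topic Web texts. To address this, a topic-based Web text clustering method is proposed. It clusters multi-topic Web texts in three steps: topic extraction, feature extraction, and text clustering. Compared with traditional Web text clustering algorithms, the proposed method takes full account of the topic information of Web texts. Experimental results show that, for multi-topic Web texts, the proposed method is more accurate than both the K-means-based text clustering method and the HowNet (《知网》)-based text clustering method.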

Lightner NJ 《Ergonomics》2003,46(1-3):153-168
Preferences for certain characteristics of an online shopping experience may be related to demographic data. This paper discusses the characteristics of that experience, demographic data and preferences by demographic group. The results of an online survey of 488 individuals in the United States indicate that respondents are generally satisfied with their online shopping experiences, with security, information quality and information quantity ranking first in importance overall. The sensory impact of a site ranked last overall of the seven characteristics measured. Preferences for these characteristics in e-commerce sites were differentiated by age, education and income. The sensory impact of sites became less important as respondents increased in age, income or education. As the income of respondents increased, the importance of the reputation of the vendor rose. Web site designers may incorporate these findings into the design of e-commerce sites in an attempt to increase the shopping satisfaction of their users. Results from the customer relationship management portion of the survey suggest that current push technologies and site personalization are not an effective means of achieving user satisfaction.

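Traditional topic extraction methods obtain a page's topic automatically by analyzing the page content alone, and their results are not very precise. On the WWW, pages are connected to one another by hyperlinks, and closely linked pages tend to belong to the same topic. Based on this idea, this paper proposes a method that uses Web link structure information to refine topic extraction results: the topic weights of a page are corrected according to the influence of the pages it links to. The paper also analyzes the characteristics of the method through a practical application example.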
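
One simple way to picture link-based refinement is the sketch below. It assumes a single topic, a content-only weight per page, and a correction that blends each page's own weight with the mean weight of the pages it links to. The blending rule and the `alpha` parameter are hypothetical; the abstract does not state the actual correction formula.

```python
from typing import Dict, List


def refine_topic_weights(
    weights: Dict[str, float],
    links: Dict[str, List[str]],
    alpha: float = 0.3,
) -> Dict[str, float]:
    """Blend each page's content-based weight with its linked pages' mean weight.

    weights: page -> content-only topic weight.
    links:   page -> pages it links to (targets without a weight are ignored).
    alpha:   assumed share of the refined weight taken from linked pages.
    """
    refined = {}
    for page, own_weight in weights.items():
        neighbour_weights = [weights[t] for t in links.get(page, []) if t in weights]
        if neighbour_weights:
            neighbour_mean = sum(neighbour_weights) / len(neighbour_weights)
            refined[page] = (1 - alpha) * own_weight + alpha * neighbour_mean
        else:
            refined[page] = own_weight  # no linked evidence: keep the original weight
    return refined
```

With `alpha=0.5`, a page weighted 0.2 that links to pages weighted 0.8 and 0.6 is pulled up to 0.45, reflecting the assumption that closely linked pages share a topic.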
