首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
With the development of mobile technology, the users browsing habits are gradually shifted from only information retrieval to active recommendation. The classification mapping algorithm between users interests and web contents has been become more and more difficult with the volume and variety of web pages. Some big news portal sites and social media companies hire more editors to label these new concepts and words, and use the computing servers with larger memory to deal with the massive document classification, based on traditional supervised or semi-supervised machine learning methods. This paper provides an optimized classification algorithm for massive web page classification using semantic networks, such as Wikipedia, WordNet. In this paper, we used Wikipedia data set and initialized a few category entity words as class words. A weight estimation algorithm based on the depth and breadth of Wikipedia network is used to calculate the class weight of all Wikipedia Entity Words. A kinship-relation association based on content similarity of entity was therefore suggested optimizing the unbalance problem when a category node inherited the probability from multiple fathers. The keywords in the web page are extracted from the title and the main text using N-gram with Wikipedia Entity Words, and Bayesian classifier is used to estimate the page class probability. Experimental results showed that the proposed method obtained good scalability, robustness and reliability for massive web pages.  相似文献   

2.
网页检索结果中,用户经常会得到内容相同的冗余页面。提出了一种通过新闻主题要素学习新闻内容的新闻网页去重算法。该方法的基本思想是:首先,抽取新闻要素中关于事件发生的时间和地点短语;然后,通过抽取的时间和地点短语抽取新闻的内容;最终,根据学习的新闻内容通过计算它们的相似度来判断新闻网页的重复度。实验结果表明,该方法能够完成针对新闻内容的新闻网页的去重,并得到较高的查全率和查准率。  相似文献   

3.
Mining linguistic browsing patterns in the world wide web   总被引:2,自引:0,他引:2  
 World-wide-web applications have grown very rapidly and have made a significant impact on computer systems. Among them, web browsing for useful information may be most commonly seen. Due to its tremendous amounts of use, efficient and effective web retrieval has thus become a very important research topic in this field. Data mining is the process of extracting desirable knowledge or interesting patterns from existing databases for a certain purpose. In this paper, we use the data mining techniques to discover relevant browsing behavior from log data in web servers, thus being able to help make rules for retrieval of web pages. The browsing time of a customer on each web page is used to analyze the retrieval behavior. Since the data collected are numeric, fuzzy concepts are used to process them and to form linguistic terms. A sophisticated web-mining algorithm is thus proposed to find relevant browsing behavior from the linguistic data. Each page uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of the pages. Computational time can thus be greatly reduced. The patterns mined out thus exhibit the browsing behavior and can be used to provide some appropriate suggestions to web-server managers.  相似文献   

4.
基于发布时间的新闻网页去重方法研究   总被引:1,自引:0,他引:1  
网页检索结果中,用户经常会得到内容相同的冗余页面。它们不但浪费了存储资源,而且给信息检索或其它文本处理带来诸多不便。论文在抽取出新闻标题、主题内容和发布日期的前提下,依据新闻的时间性(易碎性),按发布日期分“群”,对冗余网页去重方法进行了探索性研究,从而很大程度地缩小了计算时间,提高了去重准确性。  相似文献   

5.
该文提出了一种从搜索引擎返回的结果网页中获取双语网页的新方法,该方法分为两个任务。第一个任务是自动地检测并收集搜索引擎返回的结果网页中的数据记录。该步骤通过聚类的方法识别出有用的记录摘要并且为下一个任务即高质量双语混合网页的验证及其获取提供有效特征。该文中把双语混合网页的验证看作是有效的分类问题,该方法不依赖于特定领域和搜索引擎。基于从搜索引擎收集并经过人工标注的2 516条检索结果记录,该文提出的方法取得了81.3%的精确率和94.93%的召回率。  相似文献   

6.
Contents, layout styles, and parse structures of web news pages differ greatly from one page to another. In addition, the layout style and the parse structure of a web news page may change from time to time. For these reasons, how to design features with excellent extraction performances for massive and heterogeneous web news pages is a challenging issue. Our extensive case studies indicate that there is potential relevancy between web content layouts and their tag paths. Inspired by the observation, we design a series of tag path extraction features to extract web news. Because each feature has its own strength, we fuse all those features with the DS (Dempster-Shafer) evidence theory, and then design a content extraction method CEDS. Experimental results on both CleanEval datasets and web news pages selected randomly from well-known websites show that the F 1-score with CEDS is 8.08% and 3.08% higher than existing popular content extraction methods CETR and CEPR-TPR respectively.  相似文献   

7.
平行语料是自然语言处理中一项重要的基础资源,在双语平行网页中大量存在。该文首先介绍双语URL匹配模式的可信度计算方法,然后提出基于局部可信度的双语平行网页识别算法,再依据匹配模式的全局可信度,提出两种优化方法: 即利用全局可信度,救回因低于局部可信度阈值而被初始算法滤掉的匹配模式;通过全局可信度和网页检测方法,挖出深层网页。进一步,结合网站双语可信度、链接关系,侦测出种子网站周边更多较具可信度的双语网站。除了双语URL匹配模式自动识别,还利用搜索引擎,依据少数高可信度的匹配模式快速识别双语网页。为了提高以上五种方法识别候选双语网页对的准确率,计算了候选双语网页对的双语相似度,并设置阈值过滤非双语网页对。通过实验验证了所提方法的有效性。  相似文献   

8.
当我们浏览网页时,在访问速度方面静态网页要明显比动态网页快得多,因此把一些关键性或经常访问的页面使用静态页技术做成静态页至关重要。在介绍什么是静态页生成技术之后分别以发布新闻和首页新闻条目处如何设计为例对静态页的生成作了详细的阐述,其中主要使用了文件对象来完成对文件生成、读取等操作,使用的技术为ASP。  相似文献   

9.
设计了一种采集分析互联网新闻网页的方法。该方法根据给定的新闻网站的入口地址在网络上找出所有的相关链接;区分这些链接所指向的页面特征,过滤掉相关性不大的内容,提取所有新闻网页的链接;进而进行多层次链接分析,根据新闻的图片、标题字体属性及日期,采用NewsPageRank算法计算每个新闻链接的权重。测试结果表明该方法对Internet上的新闻站点普遍具有较好的分析效果,性能可以满足实用要求。  相似文献   

10.
从互联网上挖掘大量双语平行句对,可以快速有效地构建大规模双语资源,服务于统计机器翻译。从挖掘对象的不同,将网络数据源分成对照网页和平行网页两类,提出一种抽取双语句对的方法。首先,从上述两类网页中分别抽取平行文本段,对照网页文本段抽取的主要方法为页面过滤和模板匹配,而平行网页依赖于网页结构的相似,采用对应节点匹配方法;其次,采用Gale—Church算法进行句对齐,得到平行句对;最后统一进行后处理。实验结果表明,从对照网页获取平行句对的准确率达到93.3%,平行网页为93.5%。  相似文献   

11.
通过对Web日志的预处理,构建动态矩阵,该矩阵能够反映用户访问路径的先后顺序,利用相似度计算进行网页推荐。提出的动态矩阵预测算法具有较快的响应速度,可以满足实时页面推荐的需要,同时该算法无需事先训练,还可以对动态矩阵进行增量更新,提高了预测性能。  相似文献   

12.
The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.  相似文献   

13.
互联网中存在着大量的重复网页,在进行信息检索或大规模网页采集时,网页去重是提高效率的关键之一。本文在研究"指纹"或特征码等网页去重算法的基础上,提出了一种基于编辑距离的网页去重算法,通过计算网页指纹序列的编辑距离得到网页之间的相似度。它克服了"指纹"或特征码这类算法没有兼顾网页正文结构的缺点,同时从网页内容和正文结构上进行比较,使得网页重复的判断更加准确。实验证明,该算法是有效的,去重的准确率和召回率都比较高。  相似文献   

14.
预取技术通过在用户浏览当前网页的时间内提前取回其将来最有可能请求的网页来减少实际感知的获取网页的时间.传统的Markov链模型是一种简单而有效的预测模型,但同时存在预测准确率偏低,存储复杂度偏高等缺点.通过提出一种算法来减小存储空间,最后通过证明能有效减小存储空间.  相似文献   

15.
传统网络爬虫为基于关键字检索的通用搜索引擎服务,无法抓取网页类别信息,给文本聚类和话题检测带来计算效率和准确度问题。本文提出基于站点分层结构的网页分类与抽取,通过构建虚拟站点层次分类树并抽取真实站点分层结构,设计并实现了面向分层结构的网页抓取;对于无分类信息的站点,给出了基于标题的网页分类技术,包括领域知识库构建和基于《知网》的词语语义相似度计算。实验结果表明,该方法具有良好的分类效果。  相似文献   

16.
Cellular phones are widely used to access the Web. However, most available Web pages are designed for desktop PCs, and it is inconvenient to browse these large Web pages on a cellular phone with a small screen and poor interfaces. Users who browse a Web page on a cellular phone have to scroll through the whole page to find the desired content, and must then search and scroll within that content in detail to get useful information. This paper describes the design and implementation of a novel Web browsing system for cellular phones. This system includes a Web page overview to reduce scrolling operations when finding objective content within the page. Furthermore, it adaptively presents content according to its characteristics to reduce burdensome operations when searching within content.  相似文献   

17.
It is common to browse web pages via mobile devices. However, most of the web pages were designed for desktop computers equipped with big screens. When browsing on mobile devices, a user has to scroll up and down to find the information they want because of the limited screen size. Some commercial products reformat web pages. However, the result pages still contain irrelevant information. We propose a system to personalize users’ mobile web pages. A user can determine which blocks in a web page should be retained. The sequence of these blocks can also be altered according to individual preferences.  相似文献   

18.
Web Search is increasingly entity centric; as a large fraction of common queries target specific entities, search results get progressively augmented with semi-structured and multimedia information about those entities. However, search over personal web browsing history still revolves around keyword-search mostly. In this paper, we present a novel approach to answer queries over web browsing logs that takes into account entities appearing in the web pages, user activities, as well as temporal information. Our system, B-hist, aims at providing web users with an effective tool for searching and accessing information they previously looked up on the web by supporting multiple ways of filtering results using clustering and entity-centric search. In the following, we present our system and motivate our User Interface (UI) design choices by detailing the results of a survey on web browsing and history search. In addition, we present an empirical evaluation of our entity-based approach used to cluster web pages.  相似文献   

19.
On web information exists in the form of text, audio, image, and video objects often referred to multiple media objects. Vertical web search provides the search of multiple media information usually via keyword-based queries. The search results in different media formats usually presented in separate panels/tabs; integration is mostly non-blended. Therefore, results exploration via vertical web search engines require the selection of a source and scrolling of a linear ranked list of results. Relationships in the results presented in separate panels/tabs are mostly not considered. Search aggregations unify results from several vertical web sources via blended integration, but exploration still requires scrolling of a linear ranked list. Multimedia search frameworks provide the exploration of results in different media formats but more focused towards the retrieval issues. We proposed a multiple media information search framework to address issues, particularly in aggregated search. Our search framework provides a mechanism to explore results via non-linear ways. The search framework realized by suggesting a framework architecture design and instantiating a search tool. The effectiveness of blended integration and browsing is measured via precision and click through rate respectively. Search task support in results exploration mechanism measured via task-based evaluation. We also validated the conformance of various search/exploration attributes discussed in the state-of-the-art in our frameworks.  相似文献   

20.
Web portals today offer a variety of content and services to their users. This content can be split into various categories and usually content semantically related is placed in the same area. In this paper, a software technique is presented that allows the viewers of web sites to build their own personalized portals, using specific areas of their preferred sites. This technique saves users’ time and reduces the cost of browsing the web by minimizing the volume of data that has to be downloaded. It is based on an algorithm, which fragments a web page in discrete fragments using the page's internal structure. Users utilize a web interface to define which parts of selected web pages they desire to appear in their personalized portal. No additional software needs to be installed on the users’ personal computers, since this technique is designed to function centrally as a data source for a Web Server. In addition, usage of this technique reduces user perceived latency during browsing sessions, since less data must be transferred to users’ personal computers.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号