Similar Documents
20 similar documents found
1.
At the IBM T.J. Watson Research Center we implemented a way of preserving state during HTTP sessions by modifying hypertext links to encode state information. We call the method dynamic argument embedding, and it was developed in response to problems we encountered implementing the Coyote Virtual Store, a transaction-processing system prototype. Virtual store applications, such as IBM's NetCommerce and Netscape's Merchant System, typically need to maintain information such as the contents of shopping baskets while customers are shopping. We wanted our application to be flexible enough to maintain such state information without restricting the sorts of HTML pages a customer might view. We also wanted a system that did not require extensions to the hypertext transfer protocol (HTTP) and so could be implemented on any standard Web server and client browser. Finally, we wanted to permit customers to access several accounts at once by using the browser's cache to concurrently store pages corresponding to multiple invocations of the virtual store application.
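
The abstract gives no implementation details; the following is a minimal sketch, in Python's standard library only, of link rewriting in the spirit of dynamic argument embedding. The `session_id` parameter and the sample page are illustrative assumptions, not the paper's actual scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def embed_state(url: str, state: dict) -> str:
    """Append session state to a URL's query string (hypothetical keys)."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query.update(state)                      # e.g. {"session_id": "abc123"}
    return urlunparse(parts._replace(query=urlencode(query)))

class LinkRewriter(HTMLParser):
    """Rewrite every <a href=...> so the emitted link carries the session state."""
    def __init__(self, state):
        super().__init__()
        self.state, self.out = state, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = [(k, embed_state(v, self.state) if k == "href" else v)
                     for k, v in attrs]
        attr_text = "".join(f' {k}="{v}"' for k, v in attrs)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

def rewrite_page(html: str, state: dict) -> str:
    rewriter = LinkRewriter(state)
    rewriter.feed(html)
    return "".join(rewriter.out)

print(rewrite_page('<a href="/basket?view=1">Basket</a>', {"session_id": "abc123"}))
# <a href="/basket?view=1&session_id=abc123">Basket</a>
```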

2.
The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. Page variation is more prodigious than the data's raw scale: taken as a whole, the set of Web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional text document collections. This level of complexity makes an “off-the-shelf” database management and information retrieval solution impossible. To date, index-based search engines for the Web have been the primary tool by which users search for information. Such engines can build giant indices that let you quickly retrieve the set of all Web pages containing a given word or string. Experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and phrases. These search engines are, however, unsuited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or million relevant Web pages. How then, from this sea of pages, should a search engine select the correct ones, those of most value to the user? Clever is a search engine that analyzes hyperlinks to uncover two types of pages: authorities, which provide the best source of information on a given topic; and hubs, which provide collections of links to authorities. We outline the thinking that went into Clever's design, report briefly on a study that compared Clever's performance to that of Yahoo and AltaVista, and examine how our system is being extended and updated.
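
Clever builds on the hubs-and-authorities idea; the abstract omits the computation, so here is a minimal sketch of the iterative HITS-style score update on a small, hypothetical link graph. Clever's actual system adds text weighting and other refinements not shown here.

```python
# Minimal hubs-and-authorities iteration on a toy link graph (page names are made up).
def hits(links, iterations=50):
    pages = {p for edge in links for p in edge}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the page.
        auth = {p: sum(hub[src] for src, dst in links if dst == p) for p in pages}
        # Hub score: sum of authority scores of pages the page links to.
        hub = {p: sum(auth[dst] for src, dst in links if src == p) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

links = [("hub1", "authA"), ("hub1", "authB"), ("hub2", "authA"), ("authA", "authB")]
authorities, hubs = hits(links)
print(sorted(authorities, key=authorities.get, reverse=True)[:2])  # best authorities
```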

3.
Mining the hyperlink structure of Web sites   (Total citations: 11, self-citations: 0, by others: 11)
The WWW is a global information system made up of many thousands of Web sites distributed around the world, and each Web site is itself an information (sub)system composed of many Web pages. Because a document author can use hyperlinks to connect a document to any known Web page, and the information resources of a Web site are usually contributed by many people, the hyperlinks within a Web site are typically of all kinds, carrying a wide variety of meanings and uses. This paper analyzes the usage characteristics and regularities of hyperlinks in the WWW, proposes a method for classifying hyperlink types and mining site structure, and gives a preliminary discussion of its applications in information gathering and querying.
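
The paper's concrete classification rules are not given in the abstract; the sketch below only illustrates one way to partition a site's hyperlinks into rough types from their URLs alone. The category names and the directory-based heuristic are illustrative assumptions.

```python
from urllib.parse import urlparse
import posixpath

def link_type(source_url: str, target_url: str) -> str:
    """Roughly classify a hyperlink by comparing source and target URLs.

    Relative targets should be resolved with urljoin(source_url, target) first.
    """
    src, dst = urlparse(source_url), urlparse(target_url)
    if dst.netloc and dst.netloc != src.netloc:
        return "outbound"                   # leaves the Web site
    src_dir = posixpath.dirname(src.path)
    dst_dir = posixpath.dirname(dst.path)
    if dst_dir == src_dir:
        return "sideways"                   # same directory level
    if dst_dir.startswith(src_dir):
        return "downward"                   # deeper into the site tree
    if src_dir.startswith(dst_dir):
        return "upward"                     # back toward the site root
    return "crosswise"                      # across branches of the site

print(link_type("http://example.edu/a/b/index.html",
                "http://example.edu/a/b/c/page.html"))   # downward
```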

4.
《Computer Networks》1999,31(11-16):1623-1640
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring outward and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.
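
The classifier and distiller themselves are not specified in the abstract; the sketch below shows only the kind of crawl-frontier logic a focused crawler might use, with the classifier, fetcher, and link extractor passed in as stand-in callables. The threshold and priority rule are illustrative assumptions.

```python
import heapq

def focused_crawl(seeds, relevance, fetch, extract_links, budget=1000, threshold=0.5):
    """Best-first crawl: expand the most promising pages first, prune irrelevant ones.

    relevance(page_text) -> score in [0, 1]       (stand-in for the paper's classifier)
    fetch(url) -> page_text                       (network access, supplied by the caller)
    extract_links(page_text, base_url) -> iterable of absolute URLs
    """
    frontier = [(-1.0, url) for url in seeds]     # max-heap via negated priorities
    heapq.heapify(frontier)
    seen, collected = set(seeds), []
    while frontier and len(collected) < budget:
        _, url = heapq.heappop(frontier)
        text = fetch(url)
        score = relevance(text)
        if score < threshold:
            continue                              # irrelevant region: do not expand it
        collected.append((url, score))
        for link in extract_links(text, url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))   # children inherit parent relevance
    return collected
```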

5.
Coordinated classification of Web pages with the Naive Bayes method   (Total citations: 41, self-citations: 0, by others: 41)
范焱  郑诚  王清毅  蔡庆生  刘洁 《软件学报》2001,12(9):1386-1392
Information on the WWW is extremely abundant, and how to effectively discover useful information within this huge volume is an urgent problem; the correct classification of Web pages is a core part of it. Focusing on the structural features of hypertext, this paper proposes a Naive Bayes method that coordinates two classifiers, one using the textual information and one using the structural information of hypertext pages. Experiments show that, compared with classifying hypertext by either method alone, the combined classification method effectively improves classification accuracy.
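
The abstract does not spell out the combination rule; the following sketch simply adds the log scores of two independently trained Naive Bayes models, one over page text and one over structural features such as anchor text, under a naive independence assumption. The feature names and toy training data are hypothetical.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over bags of tokens (add-one smoothing)."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, c in zip(docs, labels):
            self.counts[c].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def log_score(self, tokens, c):
        total = sum(self.counts[c].values()) + len(self.vocab)
        score = math.log(self.prior[c] / sum(self.prior.values()))
        for t in tokens:
            score += math.log((self.counts[c][t] + 1) / total)
        return score

def classify(text_tokens, struct_tokens, text_nb, struct_nb):
    """Combine the text model and the structure model by summing log scores."""
    return max(text_nb.classes,
               key=lambda c: text_nb.log_score(text_tokens, c)
                             + struct_nb.log_score(struct_tokens, c))

text_nb = NaiveBayes().fit([["cheap", "flights"], ["course", "syllabus"]], ["commerce", "academic"])
struct_nb = NaiveBayes().fit([["anchor:buy"], ["anchor:lecture"]], ["commerce", "academic"])
print(classify(["cheap", "tickets"], ["anchor:buy"], text_nb, struct_nb))  # commerce
```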

6.
7.
Kindlund  E. 《Software, IEEE》1997,14(5):22-25
The World Wide Web has emerged as a new application-delivery platform. In response, developers are offering users sophisticated Web-based Java applets that range from cybershopping carts to complex tools for genome mapping. These applets give you application functionality without taking up space on your hard drive. But trailing behind the applet bounty are new usability questions. A major one is how to make applet navigation seamless in the Web browser domain. Java applets are programs you write in Java and integrate into your Web page. Although applets can provide functionality similar to traditional applications, the applet code need not be installed on the users' hard drive. Instead, the applets execute in Java-compatible Web browsers. Unlike standard Web pages, which users simply visit and browse, applet-enhanced pages let the user manipulate applet components and dynamically interact with information. The author discusses tools, techniques and concepts to optimize user interfaces.

8.
Since the Web encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks. A Web document may be authored in multiple ways, such as: (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages containing keywords. We introduce the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing. These functionalities are essential for information retrieval on the Web and large XML databases. We also present experimental results on synthetic graphs and real Web data.
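
The paper's algorithm, including progressive query processing, is more involved than the abstract states; the sketch below only illustrates the underlying idea of an information unit: grow a small connected group of linked pages that together cover all query terms. The link graph, term assignments, and greedy strategy are illustrative assumptions.

```python
from collections import deque

def nearest(graph, sources, goals):
    """Multi-source BFS over the link graph to the closest goal page; returns the path."""
    queue = deque([(s, [s]) for s in sources])
    seen = set(sources)
    while queue:
        page, path = queue.popleft()
        if page in goals:
            return path
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

def information_unit(graph, page_terms, query):
    """Greedily grow a connected set of pages that jointly cover all query terms."""
    query = set(query)
    seed = max(page_terms, key=lambda p: len(query & page_terms[p]))
    unit, covered = {seed}, query & page_terms[seed]
    while covered != query:
        goals = {p for p in page_terms if page_terms[p] & (query - covered)}
        path = nearest(graph, unit, goals)
        if path is None:
            return None                      # no connected unit covers the query
        unit.update(path)
        covered |= page_terms[path[-1]] & query
    return unit

graph = {"main": ["spec", "price"], "spec": ["main"], "price": ["main"]}
page_terms = {"main": {"camera"}, "spec": {"zoom"}, "price": {"price"}}
print(information_unit(graph, page_terms, ["camera", "price"]))  # {'main', 'price'}
```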

9.
Mapping the semantics of Web text and links   (Total citations: 1, self-citations: 0, by others: 1)
Search engines use content and links to search, rank, cluster, and classify Web pages. These information discovery applications use similarity measures derived from this data to estimate relatedness between pages. However, little research exists on the relationships between similarity measures or between such measures and semantic similarity. The author analyzes and visualizes similarity relationships in massive Web data sets to identify how to integrate content and link analysis for approximating relevance. He uses human-generated metadata from Web directories to estimate semantic similarity and semantic maps to visualize relationships between content and link cues and what these cues suggest about page meaning. Highly heterogeneous topical maps point to a critical dependence on search context.
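
The author's actual measures and maps are not reproduced here; the sketch below shows two simple stand-ins one could compare: cosine similarity of term vectors as a content cue and Jaccard overlap of neighbor sets as a link cue. The sample pages and link sets are hypothetical.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Content cue: cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a: set, b: set) -> float:
    """Link cue: overlap of the pages' neighbor sets (in-links plus out-links)."""
    return len(a & b) / len(a | b) if a | b else 0.0

page_terms = {"p1": {"python": 3, "web": 1}, "p2": {"python": 2, "crawler": 1}}
page_links = {"p1": {"docs.python.org", "pypi.org"}, "p2": {"docs.python.org", "example.org"}}

content_sim = cosine(page_terms["p1"], page_terms["p2"])
link_sim = jaccard(page_links["p1"], page_links["p2"])
print(f"content={content_sim:.2f} link={link_sim:.2f}")
```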

10.
A Study of Approaches to Hypertext Categorization   (Total citations: 34, self-citations: 2, by others: 34)
Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related Web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our results show that the identification of hypertext regularities in the data and the selection of appropriate representations for hypertext in particular domains are crucial, but seldom obvious, in real-world problems. We find that adding the words in the linked neighborhood to the page having those links (both inlinks and outlinks) was helpful for all our classifiers on one dataset, but more harmful than helpful for two out of the three classifiers on the remaining datasets. We also observed that extracting meta data from related Web sites was extremely useful for improving classification accuracy in some of those domains. Finally, the relative performance of the classifiers being tested provided insights into their strengths and limitations for solving classification problems involving diverse and often noisy Web pages.
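
One regularity the paper tests is adding words from the linked neighborhood to a page's own representation. A minimal sketch of that feature augmentation might look like the following; the pages, graph, and down-weighting knob are illustrative assumptions rather than the paper's exact representation.

```python
from collections import Counter

def augment_with_neighbors(page, pages, graph, neighbor_weight=0.5):
    """Add (down-weighted) words from in-linked and out-linked pages to a page's bag of words.

    pages: {url: Counter of words}; graph: {url: set of out-linked urls}.
    """
    features = Counter(pages[page])
    outlinks = graph.get(page, set())
    inlinks = {p for p, targets in graph.items() if page in targets}
    for neighbor in outlinks | inlinks:
        for word, count in pages.get(neighbor, Counter()).items():
            features[word] += neighbor_weight * count   # neighbor words could also be prefixed
    return features

pages = {"a": Counter({"course": 2}), "b": Counter({"faculty": 1}), "c": Counter({"student": 3})}
graph = {"a": {"b"}, "c": {"a"}}
print(augment_with_neighbors("a", pages, graph))
# Counter({'course': 2, 'student': 1.5, 'faculty': 0.5})
```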

11.
Our current understanding of Web structure is based on large graphs created by centralized crawlers and indexers. They obtain data almost exclusively from the so-called surface Web, which consists, loosely speaking, of interlinked HTML pages. The deep Web, by contrast, is information that is reachable over the Web, but that resides in databases; it is dynamically available in response to queries, not placed on static pages ahead of time. Recent estimates indicate that the deep Web has hundreds of times more data than the surface Web. The deep Web gives us reason to rethink much of the current doctrine of broad-based link analysis. Instead of looking up pages and finding links on them, Web crawlers would have to produce queries to generate relevant pages. Creating appropriate queries ahead of time is nontrivial without understanding the content of the queried sites. The deep Web's scale would also make it much harder to cache results than to merely index static pages. Whereas a static page presents its links for all to see, a deep Web site can decide whose queries to process and how well. It can, for example, authenticate the querying party before giving it any truly valuable information and links. It can build an understanding of the querying party's context in order to give proper responses, and it can engage in dialogues and negotiate for the information it reveals. The Web site can thus prevent its information from being used by unknown parties. What's more, the querying party can ensure that the information is meant for it.

12.
A text mining approach for automatic construction of hypertexts   (Total citations: 1, self-citations: 0, by others: 1)
The research on automatic hypertext construction has emerged rapidly in the last decade because there exists an urgent need to translate the gigantic amount of legacy documents into web pages. Unlike traditional ‘flat’ texts, a hypertext contains a number of navigational hyperlinks that point to related hypertexts or to locations within the same hypertext. Traditionally, these hyperlinks were constructed by the creators of the web pages with or without the help of some authoring tools. However, the gigantic amount of documents produced each day prevents such manual construction. Thus an automatic hypertext construction method is necessary for content providers to efficiently produce adequate information that can be used by web surfers. Although most web pages contain a number of non-textual data such as images, sounds, and video clips, text data still contributes the major part of information about the pages. Therefore, it is not surprising that most automatic hypertext construction methods inherit from traditional information retrieval research. In this work, we propose a new automatic hypertext construction method based on a text mining approach. Our method applies the self-organizing map algorithm to cluster the flat text documents in a training corpus and generate two maps. We then use these maps to identify the sources and destinations of some important hyperlinks within these training documents. The constructed hyperlinks are then inserted into the training documents to translate them into hypertext form. Such translated documents will form the new corpus. Incoming documents can also be translated into hypertext form and added to the corpus through the same approach. Our method has been tested on a set of flat text documents collected from a newswire site. Although we only use Chinese text documents, our approach can be applied to any documents that can be transformed to a set of index terms.
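
The paper's two-map procedure is only summarized above; the sketch below shows the general idea with a tiny self-organizing map over term vectors, then proposes hyperlinks between documents whose winning map nodes coincide or are adjacent. The map size, training schedule, document vectors, and linking rule are illustrative assumptions.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=200, lr=0.5, sigma=1.5, seed=0):
    """Train a tiny self-organizing map; returns the weight grid (rows x cols x dim)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for epoch in range(epochs):
        frac = epoch / epochs
        cur_lr, cur_sigma = lr * (1 - frac), sigma * (1 - frac) + 0.1
        for x in data[rng.permutation(len(data))]:
            bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), (rows, cols))
            dist2 = ((grid - np.array(bmu)) ** 2).sum(-1)
            influence = np.exp(-dist2 / (2 * cur_sigma ** 2))[..., None]
            weights += cur_lr * influence * (x - weights)   # pull neighborhood toward x
    return weights

def winner(weights, x):
    return np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), weights.shape[:2])

def propose_links(doc_vectors, weights):
    """Link documents whose best-matching map units are identical or adjacent."""
    nodes = {d: winner(weights, v) for d, v in doc_vectors.items()}
    return [(a, b) for a in nodes for b in nodes
            if a < b and abs(nodes[a][0] - nodes[b][0]) <= 1
                     and abs(nodes[a][1] - nodes[b][1]) <= 1]

docs = {"d1": np.array([1.0, 0.0, 0.0]), "d2": np.array([0.9, 0.1, 0.0]), "d3": np.array([0.0, 0.0, 1.0])}
weights = train_som(np.stack(list(docs.values())))
print(propose_links(docs, weights))   # likely [('d1', 'd2')]
```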

13.
Vetter  R.J. Spell  C. Ward  C. 《Computer》1994,27(10):49-57
The World-Wide Web, an information service on the Internet, uses hypertext links to other textual documents or files. Users can click on a highlighted word or words in the text to provide additional information about the selected word(s). Users can also access graphic pictures, images, audio clips, or even full-motion video through hypermedia, an extension of hypertext. One of the most popular graphics-oriented browsers is Mosaic, which was developed at the National Center for Supercomputing Applications (NCSA) as a way to graphically navigate the WWW. Mosaic browsers are currently available for Unix workstations running X Windows, PCs running Microsoft Windows, and Macintosh computers. Mosaic can access data in WWW servers, Wide Area Information Servers (WAIS), Gopher servers, Archie servers, and several others. The World-Wide Web is still evolving at a rapid pace. Distributed hypermedia systems on the Internet will continue to be an active area of development in the future. The flexibility of the WWW design, its use of hyperlinks, and the integration of existing WAIS and Gopher information resources, make the WWW ideal for future research and study. Highly interactive multimedia applications will require more sophisticated tools than currently exist. The most significant issue that needs to be resolved is the mismatch between WWW system capabilities and user requirements in the areas of presentation and quality of service.

14.
15.
《Computer Networks》1999,31(11-16):1331-1345
This paper discusses how to augment the World Wide Web with an open hypermedia service (Webvise) that provides structures such as contexts, links, annotations, and guided tours stored in hypermedia databases external to the Web pages. This includes the ability for users to collaboratively create links from parts of HTML Web pages they do not own, as well as support for creating links to parts of Web pages without writing HTML target tags. The method for locating parts of Web pages can locate parts of pages across frame hierarchies, and it also supports certain repairs of links that break due to modified Web pages. Support for providing links to/from parts of non-HTML data, such as sound and movie, will be possible via interfaces to plug-ins and Java-based media players. The hypermedia structures are stored in a hypermedia database, developed from the Devise Hypermedia framework, and the service is available on the Web via an ordinary URL. The best user interface for creating and manipulating the structures is currently provided for the Microsoft Internet Explorer 4.x browser through COM integration that utilizes the Explorer's DOM representation of Web pages. But the structures can also be manipulated and used via special Java applets, and a pure proxy-server solution is provided for users who only need to browse the structures. A user can create and use the external structures as 'transparency' layers on top of arbitrary Web pages, and can switch between viewing pages with one or more layers (contexts) of structures or without any external structures imposed on them.

16.
In order to provide a ubiquitous, comprehensive and versatile service on the WWW, the development of a WWW telephone browsing system named Phone-Web is proposed. This Phone-Web browser system would act as an intermediary between the telephone user and Web sites, thereby facilitating access to the WWW from any phone. The Phone-Web system would filter Web page information and then convert it into speech format. Users of the Phone-Web system could retrieve and hear information stored on WWW servers by using telephone handsets. For this system to work, it requires a new hypertext language, the “Hyper Phone Markup Language” (HPML), and a dedicated Phone-Web browser. By using the proposed HPML language, Web page designers can easily specify service information in a set of HPML pages, which would be included in the site they are designing. The Phone-Web browser would be capable of retrieving and then converting the HPML pages into speech patterns. By connecting to the Phone-Web browser, telephone users can access any information on any site using the HPML language from any telephone anywhere in the world. However, HPML-specified pages can also be accessed using existing browsers (e.g., Netscape Navigator, Microsoft Internet Explorer, etc.). This means that both telephone and computer users can now access the same set of Web pages to retrieve the same information. Therefore, instead of maintaining the existing two systems (access via the telephone or computer), service providers can now maintain one system, which would provide a versatile and comprehensive service for users at all levels of Web literacy.

17.
A crawling strategy for Deep Web crawlers based on keyword relevance   (Total citations: 1, self-citations: 0, by others: 1)
田野  丁岳伟 《计算机工程》2008,34(15):220-222
The Deep Web holds rich, high-quality information resources. To obtain pages from a Deep Web site, users have to type in a series of keyword sets. Because there are no static links pointing directly to Deep Web pages, most current search engines cannot discover these pages. The Deep Web crawling strategy proposed in this paper can download Deep Web pages effectively. Since such pages are exposed only through a query interface, the main challenge in designing a Deep Web crawler is how to select the best query keywords to produce meaningful queries. Experiments show that the proposed selection method, based on the relevance weights of different keywords, is effective.
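
The paper's relevance-weight formula is not given in the abstract; the sketch below only shows the general shape of such a crawler loop: issue a query through the site's single search interface, harvest result pages, update candidate-keyword weights from the harvested text, and pick the highest-weight unused keyword next. The weighting rule (co-occurrence with topic terms) and the callables are illustrative assumptions.

```python
from collections import Counter

def deep_web_crawl(issue_query, extract_text, topic_terms, seed_keyword, max_queries=20):
    """Iteratively choose query keywords for a Deep Web site's search form.

    issue_query(keyword) -> list of result-page URLs   (site access, supplied by caller)
    extract_text(url) -> list of tokens on that result page
    """
    used, weights = set(), Counter()
    keyword, harvested = seed_keyword, set()
    for _ in range(max_queries):
        used.add(keyword)
        for url in issue_query(keyword):
            if url in harvested:
                continue
            harvested.add(url)
            tokens = extract_text(url)
            if any(t in topic_terms for t in tokens):
                # Candidate keywords gain weight when they co-occur with topic terms.
                for t in tokens:
                    weights[t] += 1
        candidates = [(w, t) for t, w in weights.items()
                      if t not in used and t not in topic_terms]
        if not candidates:
            break
        keyword = max(candidates)[1]            # next query: highest-weight unused keyword
    return harvested
```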

18.
Clustering Web search results using link analysis   (Total citations: 3, self-citations: 0, by others: 3)
With the rapid growth of information on the Web, how to effectively obtain high-quality Web information has become one of the hot research topics in many fields, such as databases and information retrieval. In information retrieval, Web search engines are the most commonly used tools, yet today's search engines are still far from satisfactory. Using link analysis, this paper proposes a new method for clustering Web search results. Unlike clustering algorithms in information retrieval that are based on keywords or terms shared between texts, this method applies citation and coupling analysis, based on the common links shared and matched by two Web pages. It also extends the standard K-means clustering algorithm to make it better suited to handling noise pages, and applies it to clustering Web result pages. Preliminary experiments were conducted to verify the method's effectiveness, and the results show that clustering Web search results through link analysis achieves the expected effect.
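
The paper's extended K-means is only described at a high level; the sketch below clusters search-result pages by the links they share (Jaccard overlap with link-set "centroids"), K-means style, and sets aside pages that share no links with any centroid as noise. The result URLs, link sets, initialization, and centroid rule are illustrative assumptions.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_by_links(page_links, k=2, iterations=10):
    """K-means-like clustering of result pages by shared links; zero-overlap pages become noise."""
    pages = list(page_links)
    centroids = [set(page_links[p]) for p in pages[:k]]   # naive initialization: first k pages
    clusters, noise = [[] for _ in range(k)], []
    for _ in range(iterations):
        clusters, noise = [[] for _ in range(k)], []
        for p in pages:
            sims = [jaccard(page_links[p], c) for c in centroids]
            best = max(range(k), key=lambda i: sims[i])
            (clusters[best] if sims[best] > 0 else noise).append(p)
        # New centroid: links shared by at least half of a cluster's members.
        for i, members in enumerate(clusters):
            if members:
                counts = {}
                for p in members:
                    for link in page_links[p]:
                        counts[link] = counts.get(link, 0) + 1
                centroids[i] = {l for l, c in counts.items() if 2 * c >= len(members)}
    return clusters, noise

page_links = {
    "r1": {"python.org", "docs.python.org"},
    "r2": {"java.com", "oracle.com"},
    "r3": {"python.org", "pypi.org"},
    "r4": {"spam.example"},                 # shares no links: treated as noise
}
clusters, noise = cluster_by_links(page_links, k=2)
print(clusters, noise)                      # [['r1', 'r3'], ['r2']] ['r4']
```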

19.
The evidenced fact that “Linking is as powerful as computing” in a dynamic web context has led to evaluating Turing completeness for hypertext systems based on their linking model. The same evaluation can be applied to the Semantic Web domain too. RDF is the default data model of Semantic Web links, so the evaluation comes back to whether or not RDF can support the required computational power at the linking level. RDF represents semantic relationships by explicitly naming the participating triples; however, enumeration is only one method amongst many for representing relations, and not always the most efficient or viable. In this paper we firstly consider that Turing completeness of binary-linked hypertext is realized if and only if the links are dynamic (functional). Ashman's Binary Relation Model (BRM) showed that binary relations can most usefully be represented with Mili's pE (predicate-expression) representation, and Moreau and Hall concluded that hypertext systems which use the pE representation as the basis for their linking (relation) activities are Turing-complete. Secondly we consider that RDF, as it is, is a static version of a general ternary relations model, called TRM. We then conclude that the current computing power of the Semantic Web depends on the dynamicity supported by its underlying TRM. The value of this is firstly that RDF's triples can be considered within a framework and compared to alternatives, such as the TRM version of pE, designated pfE (predicate-function-expression). Secondly, a system whose relations are represented with pfE is likewise going to be Turing-complete. Thus moving from RDF to a pfE representation of relations would give far greater power and flexibility within Semantic Web applications.
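
As one concrete reading of the distinction the paper draws, the snippet below contrasts a static, enumerated triple (RDF-style) with a relation whose object is computed by a function at resolution time (in the spirit of a pfE predicate-function-expression). The subject/predicate names and the resolver are illustrative assumptions, not an RDF API.

```python
from datetime import date

# Static, enumerated relation: the object is fixed when the triple is written down.
static_triples = [("conf:WWW2024", "ex:startsOn", "2024-05-13")]

# Dynamic (functional) relation: the object is computed when the link is resolved.
dynamic_relations = [
    ("conf:WWW2024", "ex:daysUntilStart", lambda: (date(2024, 5, 13) - date.today()).days),
]

def resolve(subject, predicate):
    """Resolve a predicate for a subject, evaluating functional objects on demand."""
    for s, p, o in static_triples + dynamic_relations:
        if s == subject and p == predicate:
            return o() if callable(o) else o
    return None

print(resolve("conf:WWW2024", "ex:startsOn"))        # fixed value from the triple store
print(resolve("conf:WWW2024", "ex:daysUntilStart"))  # recomputed each time it is resolved
```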

20.
The World Wide Web provides hypertext and multimedia-based information across the Internet. Many applications have been developed on http servers. One important and novel use of the servers has been the provision of courseware facilities. This includes on-line lecture notes, exercises and their solutions, as well as interactive packages suited primarily for teaching and demonstration. A variety of disciplines have benefited, notably C programming, X Windows, Computer Vision, Image Processing, Computer Graphics, Artificial Intelligence and Parallel Computing. This paper will address the issues of (i) implementing a variety of computer science courses and (ii) using the packages in a class environment. It also considers how best to provide information in such a hypertext-based system and how interactive image processing packages can be developed. A suite of multimedia-based tools has been developed to facilitate such systems, and these will be described in the paper. In particular we have developed a number of methods for running applications live over the WWW.
