20 related articles found (search time: 0 ms)
1.
A wrapper employing a new extraction rule is proposed. It combines the advantages of wrappers based on document-structure extraction rules with those based on feature-pattern-matching extraction rules, and can be applied to Web pages containing multiple information blocks.
2.
The information accessible through the Internet is increasing explosively as the Web becomes more and more widespread. In this situation, the Web is an indispensable information resource for both information gathering and information searching. Although traditional information retrieval techniques have been applied to information gathering and searching in the Web, they are insufficient for this new form of information source. Fortunately, some AI techniques are straightforwardly applicable to such tasks in the Web, and many researchers are pursuing this approach. In this paper, we describe the current state of information gathering and searching technologies in the Web, and the application of AI techniques in these fields. We then point out the limitations of these traditional and AI approaches and introduce two approaches for overcoming them: navigation planning and the Mondou search engine. The navigation planning system tries to collect systematic knowledge, rather than Web pages, which are only pieces of knowledge. The Mondou search engine copes with the problems of query expansion/modification based on techniques of text/web mining and information visualization.
Seiji Yamada, Dr. Eng.: He received the B.S., M.S., and Ph.D. degrees in control engineering and artificial intelligence from Osaka University, Osaka, Japan, in 1984, 1986, and 1989, respectively. From 1989 to 1991, he served as a Research Associate in the Department of Control Engineering at Osaka University. From 1991 to 1996, he served as a Lecturer in the Institute of Scientific and Industrial Research at Osaka University. In 1996, he joined the Department of Computational Intelligence and Systems Science at Tokyo Institute of Technology, Yokohama, Japan, as an Associate Professor. His research interests include artificial intelligence, planning, machine learning for robotics, intelligent information retrieval on the WWW, and human-computer interaction. He is a member of AAAI, IEEE, JSAI, RSJ, and IEICE.
Hiroyuki Kawano, Dr.Eng.: He is an Associate Professor at the Department of Systems Science, Graduate School of Informatics, Kyoto University, Japan.
He obtained his B.Eng. and M.Eng. degrees in Applied Mathematics and Physics, and his Dr.Eng. degree in Applied Systems Science
from Kyoto University. His research interests are in advanced database technologies, such as data mining, data warehousing,
knowledge discovery, and the web search engine Mondou. He has served on the program committees of several conferences in the area of database systems, and on technical committees of advanced information systems.
3.
The World Wide Web (WWW) has become the biggest information source for students solving information problems for school projects. Since anyone can post anything on the WWW, information is often unreliable or incomplete, and it is important to evaluate sources and information before using them. Earlier research has shown that students have difficulty evaluating sources and information. This study investigates the criteria secondary-education students use while searching the Web for information. Twenty-three students solved two information problems while thinking aloud. After completing the tasks, they were interviewed in groups about their use of criteria. Results show that students do not evaluate results, sources, and information very often. The criteria students mention when asked which criteria are important for evaluating information are not always the same criteria they mention while solving the information problems. They mentioned more criteria but also admitted to not always using these criteria while searching the Web.
4.
Previous wrappers mainly target Web pages containing only a single data block and cannot handle Web pages containing multiple information blocks, referred to as MIB (Multiple Information Block) Web pages. This paper proposes a new extraction rule that combines the advantages of document-structure-based extraction rules and feature-pattern-matching extraction rules, and can effectively extract information from MIB Web pages.
5.
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or applications need a smart way of extracting data from these web sources. One of the popular approaches is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems. In this paper, we describe the methodology and the software development of an XML-enabled wrapper construction system—XWRAP for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates tasks of building wrappers that are specific to a web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides inductive learning algorithms that derive or discover wrapper patterns by reasoning about sample pages or sample specifications. Third and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. 
The second phase combines the information extraction rules generated in the first phase with the XWRAP component library to construct an executable wrapper program for the given web source.
6.
Information retrieval from the World Wide Web through search engines is known to capture users' information needs poorly. The approach taken in this paper is to add intelligence to information retrieval from the World Wide Web by modeling users to improve the interaction between the user and information retrieval systems, in other words, to improve the user's performance in retrieving information from the information source. To effect such an improvement, a retrieval system must somehow make inferences about the information the user might want. The system can then aid the user, for instance by giving suggestions or by adapting the query based on predictions furnished by the model. By combining user modeling and fuzzy logic, a prototype system has been developed, the Fuzzy Modeling Query Assistant (FMQA), which modifies a user's query based on a fuzzy user model. The FMQA was tested via a user study which clearly indicated that, for the limited domain chosen, the modified queries are better than those left unmodified.
Received 10 November 1998 / Revised 14 June 2000 / Accepted in revised form 25 September 2000
7.
1. Introduction. The emergence of the World Wide Web has given computers access to massive information resources, yet this information rarely exists in a structure that computers can understand, because Web pages are written for people, not computers, to read. The presence of complex text structures, images, sound, and other kinds of information thus makes the Web a rich and colorful medium while also creating obstacles to further machine processing of Web information.
8.
With the development of the Internet and the growing richness of online information, traditional information processing has extended into the Internet domain. Processing information on the Internet often requires downloading Web pages distributed across the Internet to a local machine for further processing; this is the core function of the Web page collection tool discussed here. Built on a combined use of the link relationships between Web pages and page content, the page collection system adds a multi-level page filtering module so that it can collect Web pages in a specific domain, and it can raise collection efficiency through parallel collection across multiple machines. A large-scale database stores the collection metadata, and collected pages are compressed, so the system supports collection of massive volumes of data. A dynamic update mechanism keeps the locally downloaded page information up to date.
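The collection process described above, breadth-first traversal of page links with a filtering step, can be sketched as follows. This is a minimal illustration, not the system's actual implementation; `fetch`, `extract_links`, and `keep` are hypothetical caller-supplied functions standing in for the downloader, link parser, and multi-level filter.

```python
from collections import deque

def collect_pages(seed_urls, fetch, extract_links, keep):
    """Breadth-first page collection with a filtering step.
    `fetch(url)` returns page content, `extract_links(page)` yields
    outgoing links, and `keep(url, page)` is the domain filter."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    pages = {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if not keep(url, page):      # filtered out: do not store or expand
            continue
        pages[url] = page
        for link in extract_links(page):
            if link not in seen:     # avoid re-downloading visited pages
                seen.add(link)
                queue.append(link)
    return pages
```

A real system would add politeness delays, parallel workers, and persistent storage, which this sketch omits.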
9.
Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without the use of hand-labeled training examples. Because UIE systems do not require human intervention, they can recursively discover new relations, attributes, and instances in a scalable manner. When applied to massive corpora such as the Web, UIE systems present an approach to a primary challenge in artificial intelligence: the automatic accumulation of massive bodies of knowledge. A fundamental problem for a UIE system is assessing the probability that its extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? We present a combinatorial “balls-and-urns” model, called Urns, that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating Urns's parameters in practice and demonstrate experimentally that for UIE the model's log likelihoods are 15 times better, on average, than those obtained by methods used in previous work. We illustrate the generality of the redundancy model by detailing multiple applications beyond UIE in which Urns has been effective. We also provide a theoretical foundation for Urns's performance, including a theorem showing that PAC learnability in Urns is guaranteed without hand-labeled data, under certain assumptions.
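The intuition behind a balls-and-urns redundancy model can be illustrated in its simplest uniform form: correct labels and error labels each appear on a fixed number of balls in one urn, and Bayes' rule converts the number of times a label was drawn into a posterior probability of correctness. This is only a sketch of that uniform special case with illustrative parameters, not the full Urns model from the abstract.

```python
from math import comb

def urn_posterior(k, n, num_c, num_e, rep_c, rep_e):
    """Uniform balls-and-urns sketch: num_c correct labels each on
    rep_c balls and num_e error labels each on rep_e balls.
    Returns P(label is correct | it was drawn k times in n draws),
    using binomial likelihoods and a uniform prior over labels."""
    total = num_c * rep_c + num_e * rep_e
    p_c = rep_c / total              # per-draw hit rate for a correct label
    p_e = rep_e / total              # per-draw hit rate for an error label
    like_c = comb(n, k) * p_c**k * (1 - p_c)**(n - k)
    like_e = comb(n, k) * p_e**k * (1 - p_e)**(n - k)
    prior_c = num_c / (num_c + num_e)
    prior_e = 1 - prior_c
    return prior_c * like_c / (prior_c * like_c + prior_e * like_e)
```

Because correct labels are repeated on more balls (rep_c > rep_e), seeing the same extraction more often raises the posterior, which is the redundancy effect the abstract describes.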
10.
To find the most authoritative Web information resources on a given topic, an algorithm for computing the authority value of Web pages is proposed. The algorithm improves on HITS [1]: it requires no user-supplied keywords, obtains a set of example pages on the topic by expanding the links of example Web pages, and measures a page's authority by the number of hyperlinks that cite it.
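The authority measure described above, scoring a page by its in-link count, can be sketched as follows. The `links` adjacency structure is a hypothetical input representing the example-page set and its hyperlinks; this is an illustration of citation counting, not the paper's full algorithm.

```python
def authority_scores(links):
    """Score each page by the number of hyperlinks that cite it.
    `links` maps a page identifier to the set of pages it links to."""
    scores = {}
    for src, targets in links.items():
        scores.setdefault(src, 0)        # every known page gets a score
        for t in targets:
            scores[t] = scores.get(t, 0) + 1
    return scores
```

Pages cited by many others in the topic's example set receive the highest scores and are reported as the most authoritative.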
11.
Based on an analysis of the internal information organization patterns and link structures of online forums, a framework for extracting semantic topic threads from online forums is proposed and its implementation described. The framework defines a complete extraction-rule specification for information extraction, and provides a visual tool for users to customize rules as well as a back-end engine that automatically downloads and extracts semantic information units from forum sites.
12.
ABSTRACT This article reviews selected magazines and journals of interest to academic librarians that include information about the Web. Reviewed titles include library science titles, general educational titles, and popular titles found on newsstands. The titles reviewed are aimed towards two groups of librarians: those most involved with developing content (such as reference librarians and bibliographers), and those involved with Web design, development, and policy issues.
13.
ABSTRACT A selective, annotated bibliography of books, journal articles, and electronic resources relating to Web site design, aimed specifically at beginning library Web managers.
14.
We present the Object-Web Mediator for querying integrated Web data sources. It is composed of a retrieval component, based on an intermediate object-view mechanism and search views, and an XML engine. Search views map source capabilities to attributes defined on object classes, and parsers process retrieved documents and cache them in XML format. The XML engine queries the cached documents, extracts data, and returns the extracted data for evaluation. The originality of this approach lies in a generic view mechanism for accessing data sources with limited data-access and complex capabilities, and an XML engine to support data extraction and reorganization. The approach has been developed and demonstrated as part of a multi-database system supporting queries via uniform Object Protocol Model interfaces against public Web data sources of interest to biologists.
15.
Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich, multiple-record documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology – a conceptual model instance – that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth. Our approach is less labor-intensive than other approaches that manually or semiautomatically generate wrappers, and it is generally insensitive to changes in Web-page format.
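The "recognizers for constants" described above can be pictured as patterns derived from the ontology's lexical-appearance descriptions. The miniature sketch below is purely illustrative: the two regexes and concept names are hypothetical stand-ins for what a real ontology would generate, not the system's actual output.

```python
import re

# Hypothetical recognizers, standing in for patterns a parsed
# ontology would produce from its lexical-appearance descriptions.
RECOGNIZERS = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "year":  re.compile(r"\b(?:19|20)\d{2}\b"),
}

def extract_constants(text):
    """Return all recognized constants in `text`, keyed by concept."""
    return {name: rx.findall(text) for name, rx in RECOGNIZERS.items()}
```

Because extraction keys on the lexical form of the constants rather than on page layout, the same recognizers keep working when the Web-page format changes, which is the insensitivity the abstract claims.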
16.
ABSTRACT Library Web developers can improve their Web pages and work towards accessibility by all user groups. This article describes some general accessibility guidelines to follow when creating Web pages. It discusses the recommendations of the World Wide Web Consortium (W3C) known as the Web Accessibility Initiative (WAI). The article focuses on the accessibility of Web pages to users with visual impairments, auditory impairments, and mobility impairments, and suggests accessibility guidelines for these user groups. Another important aspect of creating accessible Web pages is to test the pages during and after development. Online products that test accessibility are discussed, as well as additional test methods such as user input.
17.
A geographic information system (GIS) can process spatial data by geographic coordinates or spatial position, manage data effectively, and study spatial entities and their interrelationships. A WWW-based GIS running on a power enterprise's intranet can deliver information promptly and accurately and support decision making. The key problem for power-enterprise transmission and distribution WebGIS at present is to build, on top of the existing C/S-mode MIS, a platform-independent software system running on open, TCP/IP-based networks. To solve this problem, this paper discusses three implementation strategies for WWW-based power-enterprise geographic information systems: a server-side strategy, which lets users submit requests for data and analysis results to the Web server; a client-side strategy, which lets users perform certain data processing and analysis on their local machines; and a hybrid strategy, which combines the two. The strengths and weaknesses of these strategies are compared, and the strategy best suited to each functional module of a power-enterprise WebGIS is analyzed.
18.
The closely related research areas of management of semistructured data and languages for querying the Web have recently attracted a lot of interest. We argue that languages supporting deduction and object-orientation (dood languages) are particularly suited in this context: object-orientation provides a flexible common data model for combining information from heterogeneous sources and for handling partial information. Techniques for navigating in object-oriented databases can be applied to semistructured databases as well, since the latter may be viewed as (very simple) instances of the former. Deductive rules provide a powerful framework for expressing complex queries in a high-level, declarative programming style. We elaborate on the management of semistructured data and show how reachability queries involving general path expressions and the extraction of data paths in the presence of cyclic data can be handled. We then propose a formal model for querying the structure and contents of Web data and present its declarative semantics. A main advantage of our approach is that it brings together the above-mentioned issues in a unified, formal framework and, using the system, supports rapid prototyping and experimenting with all these features. Concrete examples illustrate the concise and elegant programming style supported by the system and substantiate the above-mentioned claims.
19.
Through research on building an electronic environment for decision support on the World Wide Web (WWW), this paper uses the concept of an intermediary agent to build a bridge between users and the owners of decision technologies, forming a decision network that allows decision technologies on the WWW to be shared. Its structure, performance, and functions are discussed, providing an avenue for further research on WWW-based decision support.
20.
Librarians have offered reference and instruction services at the reference desk and in classrooms for many years. Now a third location, the network, is emerging as a viable place for reference and instruction. The widespread availability of software to browse, create, edit, and serve World Wide Web pages has opened exciting new opportunities for teaching librarians. With this software and the proliferation of networked personal computers in colleges and universities, it is possible to deliver reference and instruction to library users at their computers. Suggestions and guidelines for creating Web pages are offered. In the future, Web technology may include asynchronous interactive participation in the teaching-learning process by students, instructors, and librarians.