Data Extraction from the Web Based on Pre-Defined Schema

Keywords (Chinese record):	data extraction; Web; pre-defined schema; computer networks

Meng Xiaofeng, Lu Hongjun, Wang Haiyan, Gu Mingzhe. Data Extraction from the Web Based on Pre-Defined Schema[J]. Journal of Computer Science and Technology, 2002, 17(4): 0-0.

Authors:	Meng Xiaofeng, Lu Hongjun, Wang Haiyan, Gu Mingzhe

Affiliation:	(1) School of Information, Renmin University of China, 100872 Beijing, P.R. China; (2) Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, P.R. China

Abstract:	With the development of the Internet, the World Wide Web has become an invaluable information source for most organizations. However, most documents available from the Web are in HTML, which was originally designed for document formatting with little consideration of content. Effectively extracting data from such documents remains a nontrivial task. In this paper, we present a schema-guided approach to extracting data from HTML pages. Under this approach, the user defines a schema specifying what is to be extracted and provides sample mappings between the schema and the HTML page. The system then induces the mapping rules and generates a wrapper that takes an HTML page as input and produces the required data as XML conforming to the user-defined schema. A prototype system implementing the approach has been developed. Preliminary experiments indicate that the proposed semi-automatic approach is not only easy to use but also produces wrappers that extract the required data from input pages with high accuracy.

This research is partially supported by the National Natural Science Foundation of China under Grant No. 60073014.

MENG Xiaofeng is a professor at the School of Information, Renmin University of China. He received his M.S. degree from Renmin University of China and his Ph.D. degree from the Institute of Computing Technology, the Chinese Academy of Sciences. His recent research work includes Web data management and mobile data management. He is also interested in DBMS implementation and natural language interfacing. He has published more than 40 research papers in database-related international journals and conferences. He is a member of the Database Society of CCF, ACM SIGMOD and IEEE CS.

LU Hongjun is a professor at the Department of Computer Science, Hong Kong University of Science and Technology (HKUST). He received his B.Sc. in automatic control from Tsinghua University, China, and his M.Sc. and Ph.D. in computer science from the University of Wisconsin, Madison. Before joining HKUST, Dr. Lu served as a principal research scientist at the Computer Science Center of Honeywell Inc. from 1985 to 1987 and as a professor at the National University of Singapore from 1987 to 2000. His recent research work includes data quality, data warehousing and data mining. He is also interested in the development of Internet-based database applications and electronic business systems. He has published more than 100 research papers in database-related international journals and conferences.

WANG Haiyan is a graduate student in the School of Information, Renmin University of China. She received her B.Sc. in computer application from Renmin University of China in 2000. Her research interests include Web data management.

GU Mingzhe is a graduate student in the School of Information, Renmin University of China. She received her B.Sc. in computer application from Renmin University of China in 1999. Her research interests include Web data management.

Keywords:	data extraction; wrapper generation; data integration
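
The workflow summarized in the abstract (a user-defined schema, sample mappings, induced rules, and a generated wrapper that emits XML) can be illustrated with a minimal sketch. This is not the authors' system: the record gives no details of their rule language, so the sketch assumes a deliberately simple rule form, namely the tag path of the text node that holds each sample value. All names (`PathTextParser`, `induce_rules`, `make_wrapper`) and the sample pages are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs such as ("html/body/h1", "Book A")."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def path_texts(html):
    """Return every non-empty text node of the page with its tag path."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def induce_rules(sample_html, sample_mapping):
    """Assumed rule form: schema field -> tag path of its sample value."""
    items = path_texts(sample_html)
    rules = {}
    for field, value in sample_mapping.items():
        for path, text in items:
            if text == value:
                rules[field] = path
                break
        else:
            raise ValueError(f"sample value for {field!r} not found in page")
    return rules


def make_wrapper(schema_root, rules):
    """Build a function that maps an HTML page to XML shaped by the schema."""
    def wrapper(html):
        root = ET.Element(schema_root)
        items = path_texts(html)
        for field, rule_path in rules.items():
            # Take the first text node on the induced path, or "" if absent.
            match = next((t for p, t in items if p == rule_path), "")
            ET.SubElement(root, field).text = match
        return ET.tostring(root, encoding="unicode")
    return wrapper


# Hypothetical sample page plus the user's sample mapping for two fields.
sample = "<html><body><h1>Book A</h1><p><b>12.50</b></p></body></html>"
rules = induce_rules(sample, {"title": "Book A", "price": "12.50"})
wrapper = make_wrapper("book", rules)

# Apply the generated wrapper to another page with the same layout.
other = "<html><body><h1>Book B</h1><p><b>8.00</b></p></body></html>"
print(wrapper(other))
```

A plain tag path breaks as soon as page layouts vary or a field repeats, so a practical rule learner would need to generalize over several samples; the sketch only shows how the schema, the sample mapping, and the generated wrapper relate to one another.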
This article is indexed in databases including VIP, Wanfang Data, and SpringerLink.