Learning to Extract Symbolic Knowledge from the World Wide Web

M. Craven, D. DiPasquo, D. Freitag, A. McCallum, Tom Mitchell, K. Nigam, and S. Slattery

Conference Paper, Proceedings of the 15th National Conference on Artificial Intelligence (AAAI '98), pp. 509-516, July 1998


Abstract

The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer-understandable, world-wide knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs: an ontology defining the classes and relations of interest, and a set of training data consisting of labeled regions of hypertext representing instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This paper describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system.
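As a rough illustration of the two inputs named in the abstract, the sketch below pairs a tiny ontology with a handful of labeled page texts and trains a bag-of-words naive Bayes classifier to assign new pages to ontology classes. Everything here is an assumption made for illustration: the `Ontology` type, the `train` and `classify` functions, the class and relation names, and the toy training data are hypothetical, and the classifier is a generic stand-in rather than the system or algorithms the paper describes.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Ontology:
    """Hypothetical container: class names plus typed binary relations."""
    classes: List[str]
    # relation name -> (domain class, range class)
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)


def train(ontology, labeled_pages):
    """Count words per class from (page_text, class_name) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_pages:
        if label not in ontology.classes:
            raise ValueError(f"label {label!r} is not in the ontology")
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the most probable class under naive Bayes with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_pages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_pages in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n_pages / total_pages)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy inputs (assumed, not taken from the paper).
ontology = Ontology(
    classes=["Course", "Faculty"],
    relations={"instructor_of": ("Faculty", "Course")},
)
training_pages = [
    ("syllabus homework lecture exam", "Course"),
    ("lecture notes homework assignments", "Course"),
    ("professor research publications students", "Faculty"),
    ("research interests publications professor", "Faculty"),
]
model = train(ontology, training_pages)
print(classify(model, "homework and lecture schedule"))
```

The sketch covers only page classification. Extracting relation instances, such as linking a faculty page to a course page through a hyperlink, would need a separate learner over pairs of pages and is not shown.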

BibTeX

@conference{Craven-1998-14714,
author = {M. Craven and D. DiPasquo and D. Freitag and A. McCallum and Tom Mitchell and K. Nigam and S. Slattery},
title = {Learning to Extract Symbolic Knowledge from the World Wide Web},
booktitle = {Proceedings of 15th National Conference on Artificial Intelligence (AAAI '98)},
year = {1998},
month = {July},
pages = {509--516},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.