Ontology-driven web-based semantic similarity |
| |
Authors: | David Sánchez, Montserrat Batet, Aida Valls, Karina Gibert |
| |
Affiliation: | 1. Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Tarragona, Spain; 2. Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona, Spain |
| |
Abstract: | Estimation of the degree of semantic similarity/distance between concepts is a very common problem in research areas such
as natural language processing, knowledge acquisition, information retrieval or data mining. In the past, many similarity
measures have been proposed, exploiting explicit knowledge—such as the structure of a taxonomy—or implicit knowledge—such
as information distribution. In the former case, taxonomies and/or ontologies are used to introduce additional semantics;
in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises
suffer from some problems: in the first case, their excessive dependence on the taxonomical/ontological structure; in the
second case, the lack of semantics of a pure statistical analysis of occurrences and/or the ambiguity of estimating concept
statistical distribution from term appearances. Measures based on Information Content (IC) of taxonomical concepts combine
both approaches. However, they heavily depend on a properly pre-tagged and disambiguated corpus according to the ontological
entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to
other ontologies (such as domain-specific ontologies) and massive corpora (such as the Web). In this paper, several of the presented
issues are analyzed, and modifications of classical similarity measures are proposed. They are based on a contextualized
and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to obtain reliable results
without depending on corpus pre-processing, and to minimize language ambiguity. Our proposals are able
to outperform classical approaches when using the Web for estimating concept probabilities. |
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|
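
The abstract refers to measures based on the Information Content (IC) of taxonomical concepts without giving formulas. As a rough illustration only, the sketch below computes IC as the negative log of a concept's estimated probability and uses it in Resnik's classical measure (similarity as the IC of the least common subsumer). The abstract does not say which classical measures the paper modifies, so the choice of Resnik is an assumption. The toy taxonomy, the counts, and all function names are hypothetical; the counts merely stand in for corpus or Web hit counts and are not data from the paper.

```python
import math

# Hypothetical toy taxonomy: child -> parent (not taken from the paper).
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "trout": "fish",
    "fish": "animal",
    "animal": None,
}

# Made-up occurrence counts standing in for corpus or Web hit counts.
# Each count is assumed to include the concept's descendants, so counts
# grow toward the root of the taxonomy.
COUNTS = {"dog": 40, "cat": 30, "mammal": 100, "trout": 10, "fish": 50, "animal": 200}
TOTAL = COUNTS["animal"]


def information_content(concept):
    """IC(c) = -log p(c), with p(c) estimated as count(c) / TOTAL."""
    return -math.log(COUNTS[concept] / TOTAL)


def ancestors(concept):
    """Return the concept followed by its subsumers, up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def least_common_subsumer(a, b):
    """Most specific concept that subsumes both a and b."""
    b_ancestors = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in b_ancestors:
            return candidate
    raise ValueError("concepts share no subsumer")


def resnik_similarity(a, b):
    """Resnik similarity: IC of the least common subsumer of a and b."""
    return information_content(least_common_subsumer(a, b))


if __name__ == "__main__":
    print(resnik_similarity("dog", "cat"))    # shared subsumer: mammal
    print(resnik_similarity("dog", "trout"))  # shared subsumer: animal (root)
```

In this sketch the quality of the result depends entirely on how well `COUNTS` reflects concept probabilities, which is the dependency on corpus statistics that the abstract identifies as the weak point of IC-based measures.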