

Measuring structural similarity of semistructured data based on information-theoretic approaches
Authors: Sven Helmer, Nikolaus Augsten, Michael Böhlen
Affiliations: 1. Birkbeck, University of London, Malet Street, London, WC1E 7HX, UK
2. Free University of Bozen-Bolzano, Dominikanerplatz 3, 39100, Bozen-Bolzano, Italy
3. University of Zurich, Binzmühlestrasse 14, 8050, Zurich, Switzerland
Abstract: We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from the documents; then, we determine the distance between the documents using similarity measures based on Kolmogorov complexity and on Shannon entropy, respectively. Unlike other approaches, ours achieve linear run-time complexity, and our experimental evaluation shows that their clustering quality is on a par with, or even better than, that of other, slower approaches.
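The two-step procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the linearization or the exact measures, so the sketch assumes a simple pre-order tag sequence as the linearization and uses the normalized compression distance (NCD), a standard compressor-based approximation of Kolmogorov-complexity distances, with zlib as the compressor. The function names `linearize`, `compressed_size`, and `ncd` are hypothetical, and the Shannon-entropy variant is not covered.

```python
import zlib
import xml.etree.ElementTree as ET


def linearize(xml_text: str) -> bytes:
    """Step 1 (assumed form): flatten the element tree into a pre-order
    sequence of opening and closing tags, dropping text and attributes."""
    root = ET.fromstring(xml_text)
    parts = []

    def visit(node):
        parts.append("<" + node.tag + ">")
        for child in node:
            visit(child)
        parts.append("</" + node.tag + ">")

    visit(root)
    return "".join(parts).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a computable stand-in for
    the (uncomputable) Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Step 2 (assumed measure): normalized compression distance.
    Values near 0 suggest similar structure, values near 1 dissimilar."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample documents: two book catalogues and one order list.
BOOK = ("<book><title>{0}</title><author>{0}</author>"
        "<publisher>{0}</publisher><year>{0}</year></book>")
ORDER = ("<order><customer>{0}</customer><item><sku>{0}</sku>"
         "<quantity>{0}</quantity></item><total>{0}</total></order>")

doc_a = "<library>" + "".join(BOOK.format(i) for i in range(50)) + "</library>"
doc_b = "<library>" + "".join(BOOK.format(i * 7) for i in range(40)) + "</library>"
doc_c = "<orders>" + "".join(ORDER.format(i) for i in range(50)) + "</orders>"

lin_a, lin_b, lin_c = linearize(doc_a), linearize(doc_b), linearize(doc_c)

print("NCD(a, b):", round(ncd(lin_a, lin_b), 3))
print("NCD(a, c):", round(ncd(lin_a, lin_c), 3))
```

In this sketch, the two catalogues should come out closer to each other than either is to the order list, because their linearized tag sequences compress well together. Both steps make a single pass over the input, which is in keeping with the linear run-time behaviour the abstract emphasizes, although the actual cost depends on the compressor used.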
This article is indexed in SpringerLink and other databases.