Similar literature
 20 similar documents found (search time: 31 ms)
1.
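A TREC experimental study of link-based methods for Web information retrieval   Total citations: 1, self-citations: 0, citations by others: 1
This paper uses TREC experiments to study how link-based retrieval affects Web information retrieval, covering the use of link description text (anchor text), the use of link structure, and the combination of link-based methods with traditional content-based retrieval. The conclusions are as follows. First, link description documents summarize the topic of a page with high precision but describe the content of the page very incompletely. Second, compared with traditional retrieval methods, using anchor text improves system performance by 96% on the page-finding task but does not help on the information-query task. Finally, combining link-based retrieval with traditional content-based retrieval consistently improves system performance by 48% to 124.8% on the entry-page finding task, and also improves retrieval effectiveness to some extent on the specific-information query task.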

2.
Members of the academic community have increasingly turned to digital libraries to search for the latest work of their peers. Given their role in the academic community, it is very important that these digital libraries collect citations in a consistent, accurate, and up-to-date manner. Yet they fail to compile citations correctly for many authors for various reasons, including authors who share the same name, a problem known as the “name ambiguity problem.” This problem occurs when multiple authors share the same name, particularly when names are simplified to just the first initial and the last name. This paper proposes a reliable and accurate pair-wise similarity approach that disambiguates names using supervised classification on Web correlations and authorship correlations. The approach makes use of Web correlations among citations, assuming that citations that co-occur on publication lists on the Web should refer to the same author. It also makes use of authorship correlations, assuming that citations with the same rare author name refer to the same author and, furthermore, that citations with the same full author names or e-mail addresses likely refer to the same author. These two types of correlations are measured using pair-wise similarity metrics. In addition, a binary classifier, as part of supervised classification, is applied to label matching pairs of citations using the pair-wise similarity metrics, and these labels are then used to group citations into clusters such that each cluster represents an individual author. Results show that the approach greatly improves on the name disambiguation accuracy and performance of other proposed approaches, especially for name clusters with a high degree of ambiguity.
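
The final grouping step described in this abstract, turning pair-wise "match" labels into one cluster per author, can be sketched as below. This is only an illustration under stated assumptions: the abstract does not say how the grouping is done, so the sketch assumes matches are treated as transitive and uses a union-find structure. The function name `cluster_citations` and the integer citation ids are hypothetical.

```python
def cluster_citations(num_citations, matched_pairs):
    """Group citations into clusters from pairs a classifier labelled as matching.

    Assumption: matches are transitive, so the clusters are the connected
    components of the "match" graph (computed here with union-find).
    """
    parent = list(range(num_citations))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for citation in range(num_citations):
        clusters.setdefault(find(citation), []).append(citation)
    return sorted(clusters.values())


# Hypothetical labels: citations 0, 1, 2 match each other; so do 3 and 4.
print(cluster_citations(5, [(0, 1), (1, 2), (3, 4)]))
```

Transitive merging is the simplest choice, but a single false-positive pair can fuse two authors, so a real system might prefer a stricter clustering rule.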

3.
Computing the semantic similarity between terms (or short text expressions) that have the same meaning but are not lexicographically similar is a key challenge in many computer-related fields. The problem is that traditional approaches to semantic similarity measurement are not suitable for all situations; for example, many of them fail to deal with terms not covered by synonym dictionaries, or cannot cope with acronyms, abbreviations, buzzwords, brand names, proper nouns, and so on. In this paper, we present and evaluate a collection of emerging techniques developed to avoid this problem. These techniques use some form of web intelligence to determine the degree of similarity between text expressions, and they implement a variety of paradigms including the study of co-occurrence, text snippet comparison, frequent pattern finding, and search log analysis. The goal is to substitute for the traditional techniques where necessary.

4.
《Applied Soft Computing》2007,7(1):398-410
Personalized search engines are important tools for finding web documents for specific users, because they are able to provide the location of information on the WWW as accurately as possible, using efficient methods of data mining and knowledge discovery. Traditional search engines vary in type and features, including support for different functionality and ranking methods. New search engines that use link structures have produced improved search results which can overcome the limitations of conventional text-based search engines. Going a step further, this paper presents a system that provides users with personalized results derived from a search engine that uses link structures. The fuzzy document retrieval system (constructed from a fuzzy concept network based on the user's profile) personalizes the results yielded by link-based search engines according to the preferences of the specific user. A preliminary experiment with six subjects indicates that the developed system is capable of finding web pages that are not only relevant but also personalized, depending on the preferences of the user.

5.
We describe a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data. By “categorical data,” we mean tables with fields that cannot be naturally ordered by a metric – e.g., the names of producers of automobiles, or the names of products offered by a manufacturer. Our approach is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the dataset. Our techniques can be studied analytically in terms of certain types of non-linear dynamical systems. Received February 15, 1999 / Accepted August 15, 1999

6.
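Name ambiguity is a phenomenon of uncertain identity: identical name strings in text refer to different real-world persons. Personal name disambiguation has long been a challenging problem, and this paper focuses on name disambiguation in web pages. Because the classic K-means algorithm can converge to a local optimum when a poor random choice of initial cluster centers is made, the paper proposes an improved K-means algorithm based on the max-min principle for name disambiguation. The WePS training data is used as the experimental corpus. Experimental results show that the improved method performs better than hierarchical clustering.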
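
One common reading of a max-min principle for choosing initial K-means centers is sketched below: each new center is the point whose distance to its nearest already-chosen center is largest. This is an assumption about the general idea, not the paper's exact procedure; the function name `maximin_init`, the choice of the first point as seed, and the toy points are all hypothetical.

```python
import math


def maximin_init(points, k):
    """Pick k initial cluster centers with a max-min (farthest-first) rule."""
    centers = [points[0]]  # assumption: the first point seeds the selection
    while len(centers) < k:
        # For every point, the distance to its closest already-chosen center.
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        # The point farthest from all chosen centers becomes the next center.
        centers.append(points[nearest.index(max(nearest))])
    return centers


# Toy 2-D data with three visible groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(maximin_init(points, 3))
```

The centers returned here would replace the random initial centers of a standard K-means run, which is what spreads the starting points across the data.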

7.
Link-based similarity plays an important role in measuring similarities between nodes in a graph. As a widely used link-based similarity measure, SimRank scores the similarity between two nodes as the first-meeting probability of two random surfers. However, because real-world graphs are large and change dynamically, it is not viable to frequently update the whole similarity matrix. Moreover, users are often concerned only with the similarities of a small subset of nodes in a graph. In such cases, the existing approaches need to compute the similarities of all node pairs simultaneously and therefore suffer from high computation cost. In this paper, we propose a new algorithm, Iterative Single-Pair SimRank (ISP), based on the random surfer-pair model, to compute the SimRank similarity score for a single pair of nodes in a graph. To avoid computing the similarities of all other nodes, we introduce a new data structure, the position matrix, to facilitate computation of the first-meeting probabilities of two random surfers, and give two optimization techniques to further enhance performance. In addition, we theoretically prove that the time cost of ISP is always less than that of the original SimRank algorithm. Comprehensive experiments conducted on both synthetic and real datasets demonstrate the effectiveness and efficiency of our approach.
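
For orientation, the all-pairs baseline that this abstract contrasts with can be sketched as the standard SimRank iteration: a node is fully similar to itself, and two distinct nodes are similar to the extent that their in-neighbors are similar. This is not the ISP algorithm or its position matrix, which the abstract does not specify; the decay factor `c=0.8`, the iteration count, and the toy graph are assumptions.

```python
def simrank(in_neighbors, c=0.8, iterations=10):
    """Naive all-pairs SimRank. `in_neighbors` maps each node to its in-neighbors."""
    nodes = list(in_neighbors)
    # Iteration 0: every node is similar only to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {b: 0.0 for b in nodes} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    continue  # no in-neighbors: similarity stays 0
                # Average similarity over all pairs of in-neighbors, decayed by c.
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph: u links to both a and b, so a and b share their only in-neighbor.
graph = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(graph)
print(scores["a"]["b"])
```

Every iteration touches every node pair, which is the cost a single-pair method aims to avoid.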

8.
Over the last few years, CAPTCHAs have become ubiquitous on the internet as a security mechanism to distinguish humans from spam programs. Text-based CAPTCHAs ask users to recognize distorted text in challenge images. Being based on a hard AI problem, they have emerged as a hot research topic in computer vision and machine learning. Contemporary text-based CAPTCHAs rely on the segmentation problem, which involves decomposing the image into sub-images of individual characters. This is a challenging task for current OCR programs and remains largely unsolved. In this paper, we present a novel segmentation and recognition method for text-based CAPTCHAs which uses simple image processing techniques, including thresholding, thinning and pixel count methods, along with an artificial neural network. We attack the popular CCT (Crowded Characters Together) based CAPTCHAs and compare our results with other schemes. Overall, our system achieves a precision of 51.3%, 27.1% and 53.2% on the Taobao, MSN and eBay datasets, with 1000, 500 and 1000 CAPTCHAs respectively. The benefits of this research are twofold: by recognizing text-based CAPTCHAs, we not only explore the weaknesses in the current design but also find a way to segment and recognize connected characters in images. The proposed algorithm can be used in the digitization of ancient books, handwriting recognition and other similar tasks.

9.
It is common for large organizations to maintain repositories of business process models in order to document and to continuously improve their operations. Given such a repository, this paper deals with the problem of retrieving those models in the repository that most closely resemble a given process model or fragment thereof. Up to now, there is a notable research gap on comparing different approaches to this problem and on evaluating them in the same setting. Therefore, this paper presents three similarity metrics that can be used to answer queries on process repositories: (i) node matching similarity that compares the labels and attributes attached to process model elements; (ii) structural similarity that compares element labels as well as the topology of process models; and (iii) behavioral similarity that compares element labels as well as causal relations captured in the process model. These metrics are experimentally evaluated in terms of precision and recall. The results show that all three metrics yield comparable results, with structural similarity slightly outperforming the other two metrics. Also, all three metrics outperform text-based search engines when it comes to searching through a repository for similar business process models.

10.
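Web information retrieval based on link description text and its context   Total citations: 20, self-citations: 0, citations by others: 20
The hyperlink structure between documents is one of the biggest differences between Web information retrieval and traditional information retrieval, and it has given rise to retrieval techniques based on hyperlink structure. This paper describes the concept of a link description document and, on that basis, studies the role of anchor text and its context in retrieval. Tests on a large-scale real-world data set of more than 1.69 million web pages, using the relevance judgments and evaluation methods provided by TREC 2001, lead to the following conclusions. First, link description documents summarize the topic of a page with high precision but describe the content of the page very incompletely. Second, compared with traditional retrieval methods, using anchor text improves system performance by 96% on the task of finding known pages, but anchor text and its context cannot improve retrieval performance on the task of querying for unknown information. Finally, combining the method based on link description text with traditional methods improves retrieval performance by nearly 16%.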

11.
Identity verification is essential in our mission to identify potential terrorists and criminals. It is not a trivial task because terrorists reportedly assume multiple identities using either fraudulent or legitimate means. A national identification card and biometrics technologies have been proposed as solutions to the identity problem. However, several studies show their inability to tackle the complex problem. We aim to develop data mining alternatives that can match identities referring to the same individual. Existing identity matching techniques based on data mining primarily rely on personal identity features. In this research, we propose a new identity matching technique that considers both personal identity features and social identity features. We define two groups of social identity features including social activities and social relations. The proposed technique is built upon a probabilistic relational model that utilizes a relational database structure to extract social identity features. Experiments show that the social activity features significantly improve the matching performance while the social relation features effectively reduce false positive and false negative decisions.

12.
Link-based similarity measures play a significant role in many graph-based applications. Consequently, measuring node similarity in a graph is a fundamental problem of graph data mining. Personalized PageRank (PPR) and SimRank (SR) have emerged as the most popular and influential link-based similarity measures. Recently, a novel link-based similarity measure, penetrating rank (P-Rank), which enriches SR, was proposed. In practice, PPR, SR and P-Rank scores are calculated by iterative methods, and the overhead of the calculation grows with the number of iterations. Ideally, the similarity would be computed with the minimum number of iterations sufficient to guarantee a desired accuracy. However, the existing upper bounds are too coarse to be useful in general. Therefore, we focus on designing accurate and tight upper bounds for PPR, SR, and P-Rank in this paper. Our upper bounds are designed based on the following intuition: the smaller the difference between two consecutive iteration steps, the smaller the difference between the theoretical and iterative similarity scores. Furthermore, we demonstrate the effectiveness of our upper bounds in the scenario of top-k similar node queries, where they help accelerate the query. We also run a comprehensive set of experiments on real-world data sets to verify the effectiveness and efficiency of our upper bounds.
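
The intuition stated in this abstract, that a small change between consecutive iterations signals closeness to the limiting scores, can be illustrated with a plain power iteration for PPR that stops once the L1 change between iterates falls below a tolerance. This sketch does not implement the paper's upper bounds, which the abstract does not give; the restart probability `alpha`, the tolerance, the handling of dangling nodes, and the toy graph are assumptions.

```python
def personalized_pagerank(out_neighbors, source, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for PPR with restart to `source`.

    Returns the scores and the number of iterations actually used.
    """
    nodes = list(out_neighbors)
    rank = {v: 1.0 if v == source else 0.0 for v in nodes}
    for iteration in range(1, max_iter + 1):
        new = {v: 0.0 for v in nodes}
        new[source] += alpha  # restart mass returns to the source
        for v in nodes:
            # Assumption: a node without out-links sends its mass back to the source.
            targets = out_neighbors[v] or [source]
            share = (1 - alpha) * rank[v] / len(targets)
            for t in targets:
                new[t] += share
        # L1 difference between consecutive iterates drives the stopping rule.
        diff = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if diff < tol:
            break
    return rank, iteration


# Toy graph: a and b link to each other; scores are personalized to a.
graph = {"a": ["b"], "b": ["a"]}
ppr, used = personalized_pagerank(graph, "a")
print(ppr, used)
```

A tighter bound on the remaining error would let the loop stop earlier for the same accuracy, which is the benefit the abstract claims for top-k queries.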

13.
This paper describes a machine learning approach to building an efficient and accurate name spotting system. Finding names in free text is an important task in many text-based applications. Most previous approaches were based on hand-crafted modules encoding language- and genre-specific knowledge. These approaches had at least two shortcomings: they required large amounts of time and expertise to develop, and they were not easily portable to new languages and genres. This paper describes an extensible system that automatically combines weak evidence from different, easily available sources: part-of-speech tags, dictionaries, and surface-level syntactic information such as capitalization and punctuation. Individually, each piece of evidence is insufficient for robust name detection. However, the combination of evidence, through standard machine learning techniques, yields a system that achieves performance equivalent to the best existing hand-crafted approaches.

14.
Published scientific articles are linked together into a graph, the citation graph, through their citations. This paper explores the notion of similarity based on connectivity alone, and proposes several algorithms to quantify it. Our metrics take advantage of the local neighborhoods of the nodes in the citation graph. Two variants of link-based similarity estimation between two nodes are described, one based on the separate local neighborhoods of the nodes, and another based on the joint local neighborhood expanded from both nodes at the same time. The algorithms are implemented and evaluated on a subgraph of the citation graph of computer science in a retrieval context. The results are compared with text-based similarity, and demonstrate the complementarity of link-based and text-based retrieval.
Wangzhong Lu holds a Bachelor's degree from Hefei University of Technology (1993), and a Master's degree from Dalhousie University (2001), both in computer science. From 1993 to 1999 he worked as a developer with China National Computer Software and Technical Service Corp. in Beijing. From 2001 to 2005 he held industrial positions as a senior software architect in Atlantic Canada. He is currently with DST Systems, Charlotte, NC, as a senior data architect.
Jeannette Janssen's research area is applied graph theory. She has worked on the problem of frequency assignment in cellular and digital broadcasting networks. Her current interest is in graph theory applied to the World Wide Web and other networked information spaces. Dr. Janssen did her Master's studies at Eindhoven University of Technology in the Netherlands, and her doctorate at Lehigh University, USA. She is currently an associate professor at Dalhousie University, Canada.
Evangelos Milios received a diploma in electrical engineering from the National Technical University of Athens, and Master's and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology. He held faculty positions at the University of Toronto and York University. He is currently a professor of computer science at Dalhousie University, Canada, where he was Director of the Graduate Program. He has served on the committees of the ACM Dissertation Award, and the AAAI/SIGART Doctoral Consortium. He has worked on the interpretation of visual and range signals for landmark-based positioning, navigation and map construction in single- and multi-agent robotics. His current research activity is centered on Networked Information Spaces, Web information retrieval, and aquatic robotics. He is a senior member of the IEEE.
Nathalie Japkowicz is an associate professor at the School of Information Technology and Engineering of the University of Ottawa. She obtained her Ph.D. from Rutgers University, her M.Sc. from the University of Toronto, and her B.Sc. from McGill University. Prior to joining the University of Ottawa, she taught at Ohio State University and Dalhousie University. Her area of specialization is Machine Learning and her most recent research interests focused on the class imbalance problem. She made over 50 contributions in the form of journal articles, conference articles, workshop articles, magazine articles, technical reports or edited volumes.
Yongzheng Zhang obtained a B.E. in computer applications from Southeast University, China, in 1997 and a M.S. in computer science from Dalhousie University in 2002. From 1997 to 1999 he was an instructor and undergraduate advisor at Southeast University. He also worked as a software engineer in Ricom Information and Telecommunications Co. Ltd., China. He is currently a Ph.D. candidate at Dalhousie University. His research interests are in the areas of Information Retrieval, Machine Learning, Natural Language Processing, and Web Mining, particularly centered on Web Document Summarization. A paper based on his Master's thesis received the best paper award at the 2003 Canadian Artificial Intelligence conference.

15.
Alok  Arun K.  Kuldip K. 《Computers & Security》2007,26(7-8):488-495
This paper focuses on intrusion detection based on system call sequences using text processing techniques. It introduces a kernel-based similarity measure for the detection of host-based intrusions. The k-nearest neighbour (kNN) classifier is used to classify a process as either normal or abnormal. The proposed technique is evaluated on the DARPA-1998 database and its performance is compared with other existing techniques available in the literature. It is shown that this technique is significantly better than the other techniques in achieving lower false positive rates at a 100% detection rate.
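
The classification setup described here, treating a process's system call sequence like a text document and labelling it by its nearest neighbours, can be sketched as follows. The abstract does not define its kernel, so the sketch substitutes plain cosine similarity over system-call counts as a stand-in; the function names, `k=3`, and the toy traces are hypothetical and are not taken from the DARPA-1998 data.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bags of system calls (Counter objects)."""
    dot = sum(a[call] * b[call] for call in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_label(trace, labelled_traces, k=3):
    """Label a trace by majority vote among its k most similar labelled traces."""
    query = Counter(trace)
    ranked = sorted(labelled_traces,
                    key=lambda item: cosine(query, Counter(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training traces: (system call sequence, label).
training = [
    (["open", "read", "close"], "normal"),
    (["open", "read", "read", "close"], "normal"),
    (["open", "write", "close"], "normal"),
    (["execve", "setuid", "chmod"], "abnormal"),
    (["execve", "setuid", "setuid", "chmod"], "abnormal"),
    (["execve", "chmod", "socket"], "abnormal"),
]
print(knn_label(["open", "read", "close", "close"], training))
print(knn_label(["execve", "setuid", "socket"], training))
```

Counting calls ignores their order, so a sequence-aware similarity would be needed to catch attacks that only reorder otherwise ordinary calls.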

16.
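Zhang Yinglong  Li Cuiping  Chen Hong 《Journal of Software (软件学报)》2014,25(11):2602-2615
Information networks are ubiquitous. By abstracting the objects in a network as nodes and the relationships between objects as edges, an information network can be represented as a graph. Computing the similarity between nodes in a graph is a fundamental problem in graph data management and is used in many areas, such as social network analysis, information retrieval, and recommender systems. The best-known similarity measures are represented by Personalized PageRank and SimRank. Both are in essence defined in terms of paths in the graph, yet the paths they emphasize are entirely different. This paper therefore proposes a measure called SuperSimRank. It not only covers these paths but also takes into account paths that neither Personalized PageRank nor SimRank considers, and so better reflects the nature of the link relationship. On this basis, a theoretical analysis of SuperSimRank is given and a corresponding optimized algorithm is proposed, which improves the computational cost from O(kn^4) in the worst case to O(knl), where k is the number of iterations, n is the number of nodes, and l is the number of edges. Finally, experiments verify that SuperSimRank outperforms SimRank and Personalized PageRank, and that the optimized algorithm is effective in all cases.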

17.
This paper addresses the problem of handling semantic heterogeneity during database schema integration. We focus on the semantics of terms used as identifiers in schema definitions. Our solution does not rely on the names of the schema elements or the structure of the schemas. Instead, we utilize formal ontologies consisting of intensional definitions of terms represented in a logical language. The approach is based on similarity relations between intensional definitions in different ontologies. We present the definitions of similarity relations based on intensional definitions in formal ontologies. The extensional consequences of intensional relations are addressed. The paper shows how similarity relations are discovered by a reasoning system using a higher-level ontology. These similarity relations are then used to derive an integrated schema in two steps. First, we show how to use similarity relations to generate the class hierarchy of the global schema. Second, we explain how to enhance the class definitions with attributes. This approach reduces the cost of generating or re-generating global schemas for tightly-coupled federated databases.

18.
19.
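A comprehensive method for computing concept similarity   Total citations: 17, self-citations: 0, citations by others: 17
Ontology mapping can be used to resolve ontology heterogeneity, and it is also the technical basis for ontology alignment, ontology integration, ontology merging, ontology translation, and related tasks. To address the problems in concept similarity computation in current ontology mapping, this paper proposes a comprehensive similarity computation method. First, the most relevant concepts are filtered out according to the similarity of the two concepts' names, which reduces the number of similarity computations. Then concept similarity is computed based on concept instances, concept attributes, and concept relations, and the results are combined. Finally, the performance of the method is briefly analyzed.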

20.
《Computer Networks》2008,52(12):2360-2372
In this paper we present a new approach for VPN (virtual private network) traffic engineering with path protection in Multiprotocol Label Switching networks carrying QoS and best effort traffic. Our approach eliminates the path cycles, a problem often encountered in link-based traffic engineering methods. It also allows for control of the maximum path length and the size of the label space in each label switch router. We consider off-line computation of the working and backup paths using a link-based approach. Two cases of 1 + 1 and 1:1 path protection are considered. Numerical results are presented to show the efficacy of the algorithm in calculating link-disjoint and node-disjoint primary and backup paths for the QoS traffic.
