Similar Literature
20 similar documents retrieved (search time: 31 ms)
1.
Mining outliers in heterogeneous networks is crucial to many applications, but challenges abound. In this paper, we focus on identifying meta-path-based outliers in heterogeneous information networks (HINs) and on calculating the similarity between objects of different types. We propose a meta-path-based outlier detection method (MPOutliers) for heterogeneous information networks that deals with these problems in one go under a unified framework. MPOutliers calculates the heterogeneous reachable probability by combining different types of objects and their relationships. It discovers the semantic information among nodes in heterogeneous networks instead of considering only the network structure. It also computes the closeness degree between nodes of the same type, which is then extended to the whole heterogeneous network. Moreover, each node is assigned a reliability weight to measure its authority degree. Substantial experiments on two real datasets (the AMiner and Movies datasets) show that the proposed method is very effective and efficient for outlier detection.

2.
Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata—namely topical directories—to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.
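As a concrete point of reference for taxonomy-based information-theoretic similarity (the baseline this paper improves on), the following sketch computes a Lin-style score from information content over a toy topic hierarchy. The hierarchy, page counts and function names are illustrative assumptions, not the paper's ODP-based measure for arbitrary ontology graphs.

```python
import math

# Toy topical hierarchy (child -> parent) with page counts per topic; both are assumptions.
PARENT = {"python": "programming", "java": "programming",
          "programming": "computers", "hardware": "computers", "computers": None}
PAGES = {"python": 40, "java": 60, "programming": 150, "hardware": 50, "computers": 300}

def ancestors(t):
    """Return the set containing t and all of its ancestors."""
    out = set()
    while t is not None:
        out.add(t)
        t = PARENT[t]
    return out

def ic(t, total=300):
    """Information content IC(t) = -log p(t), with p(t) estimated from page counts."""
    return -math.log(PAGES[t] / total)

def lin_similarity(t1, t2):
    """Lin-style similarity: 2 * IC(lowest common ancestor) / (IC(t1) + IC(t2))."""
    common = ancestors(t1) & ancestors(t2)
    lca = max(common, key=ic)  # most informative shared ancestor
    return 2.0 * ic(lca) / (ic(t1) + ic(t2))

print(lin_similarity("python", "java"))  # ~0.38 on this toy hierarchy
```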

3.
An Outlier Mining Algorithm for High-Dimensional Data with Mixed Attributes
李庆华  李新  蒋盛益 《计算机应用》2005,25(6):1353-1356
Outlier detection is one of the most fundamental problems studied in data mining, with wide applications in fraud identification, weather forecasting, customer classification and intrusion detection. To meet the needs of network intrusion detection, a new outlier mining algorithm based on clustering of mixed-attribute data is proposed, and, based on the essential property that outliers are rare points in a dataset, new definitions of data similarity and outlier degree are given. The proposed algorithm has linear time complexity. Experiments on the KDD CUP 99 and Wisconsin Prognosis Breast Cancer datasets show that, while offering near-linear time complexity and good scalability, the algorithm can effectively detect the outliers in a dataset.

4.
Anomaly detection is considered an important data mining task, aiming at the discovery of elements (known as outliers) that show significant diversion from the expected case. More specifically, given a set of objects the problem is to return the suspicious objects that deviate significantly from the typical behavior. As in the case of clustering, the application of different criteria leads to different definitions for an outlier. In this work, we focus on distance-based outliers: an object x is an outlier if fewer than k objects lie at distance at most R from x. The problem offers significant challenges when a stream-based environment is considered, where data arrive continuously and outliers must be detected on-the-fly. There are a few research works studying the problem of continuous outlier detection. However, none of these proposals meets the requirements of modern stream-based applications for the following reasons: (i) they demand a significant storage overhead, (ii) their efficiency is limited and (iii) they lack flexibility in the sense that they assume a single configuration of the k and R parameters. In this work, we propose new algorithms for continuous outlier monitoring in data streams, based on sliding windows. Our techniques are able to reduce the required storage overhead, are more efficient than previously proposed techniques and offer significant flexibility with regard to the input parameters. Experiments performed on real-life and synthetic data sets verify our theoretical study.
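The distance-based definition above translates directly into a naive batch check. The sketch below is only the quadratic baseline that such streaming algorithms improve on; the helper name and example parameters are chosen for illustration.

```python
import numpy as np

def distance_based_outliers(points, k, R):
    """Flag x as an outlier if fewer than k other objects lie within
    distance R of x (the definition above). Naive O(n^2) batch check;
    streaming algorithms maintain the same property over sliding windows."""
    points = np.asarray(points, dtype=float)
    outliers = []
    for i in range(len(points)):
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.count_nonzero(dists <= R) - 1  # exclude the point itself
        if neighbours < k:
            outliers.append(i)
    return outliers

# Example: with k=2 and R=1.5, only the isolated point at (10, 10) is reported.
print(distance_based_outliers([[0, 0], [0, 1], [1, 0], [10, 10]], k=2, R=1.5))
```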

5.
Providing top-k typical relevant keyword queries would benefit the users who cannot formulate appropriate queries to express their imprecise query intentions. By extracting the semantic relationships both between keywords and keyword queries, this paper proposes a new keyword query suggestion approach which can provide typical and semantically related queries to the given query. Firstly, a keyword coupling relationship measure, which considers both intra- and inter-couplings between each pair of keywords, is proposed. Then, the semantic similarity of different keyword queries can be measured by using a semantic matrix, in which the coupling relationships between keywords in queries are reserved. Based on the query semantic similarities, we next propose an approximation algorithm to find the most typical queries from query history by using the probability density estimation method. Lastly, a threshold-based top-k query selection method is proposed to expeditiously evaluate the top-k typical relevant queries. We demonstrate that our keyword coupling relationship and query semantic similarity measures can capture the coupling relationships between keywords and semantic similarities between keyword queries accurately. The efficiency of query typicality analysis and top-k query selection algorithm is also demonstrated.

6.
Subsequence matching is a basic problem in the field of data stream mining. In recent years, there has been significant research effort spent on efficiently finding subsequences similar to a query sequence. Another challenging issue in relation to subsequence matching is how we identify common local patterns when both sequences are evolving. This problem arises in trend detection, clustering, and outlier detection. Dynamic time warping (DTW) is often used for subsequence matching and is a powerful similarity measure. However, the straightforward method using DTW incurs a high computation cost for this problem. In this paper, we propose a one-pass algorithm, CrossMatch, that achieves the above goal. CrossMatch addresses two important challenges: (1) how can we identify common local patterns efficiently without any omission? (2) how can we find common local patterns in data stream processing? To tackle these challenges, CrossMatch incorporates three ideas: (1) a scoring function, which computes the DTW distance indirectly to reduce the computation cost, (2) a position matrix, which stores starting positions to keep track of common local patterns in a streaming fashion, and (3) a streaming algorithm, which identifies common local patterns efficiently and outputs them on the fly. We provide a theoretical analysis and prove that our algorithm does not sacrifice accuracy. Our experimental evaluation and case studies show that CrossMatch can incrementally discover common local patterns in data streams within constant time (per update) and space.
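For orientation, the textbook dynamic-time-warping distance underlying this matching task can be computed as below. CrossMatch's scoring function avoids materialising a full table per candidate subsequence pair, so this is the baseline it builds on rather than the proposed algorithm.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-time-warping distance between two 1-D sequences,
    via the full (len(x)+1) x (len(y)+1) cost table."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeated value
```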

7.
Resolving semantic heterogeneity is one of the major research challenges involved in many fields of study, such as natural language processing, search engine development, document clustering, geospatial information retrieval, and knowledge discovery. While semantic heterogeneity is often considered an obstacle to realizing full interoperability among diverse datasets, proper quantification of semantic similarity is a further challenge when measuring the extent of association between two qualitative concepts. The proposed work addresses this issue for any geospatial application where modeling the spatial land-cover distribution is crucial. Most of these applications, such as prediction, change detection and land-cover classification, often require examining the land-cover distribution of the terrain. This paper presents an ontology-based approach to measure semantic similarity between spatial land-cover classes. As land-cover distribution is qualitative information about a terrain, it is challenging to measure the extent of similarity between land-cover classes pragmatically. Here, an ontology is considered as the concept hierarchy of different land-cover classes, which is built using domain experts’ knowledge. This work can be considered as the spatial extension of our earlier work presented in [1]. The similarity metric proposed in [1] is utilized here for spatial concepts. A case study with a real land-cover ontology is presented to quantify the semantic similarity between every pair of land-covers with the semantic hierarchy based similarity measurement (SHSM) scheme [1]. This work may facilitate quantification of semantic knowledge of the terrain for other spatial analyses as well.

8.
Uncertain data are common due to the increasing usage of sensors, radio frequency identification (RFID), GPS and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage and delay or loss of data in transfer. In order to manage, query or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach of distance-based outlier detection makes use of a nested loop. This approach is very costly due to the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach of outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets is also presented to prove the accuracy, efficiency and scalability of the proposed algorithms.
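One hypothetical way to realise the expensive distance function between two uncertain Gaussian objects is Monte-Carlo estimation of the expected Euclidean distance, as in the naive nested-loop baseline sketched below; the sampling scheme and helper name are assumptions, and the PC-list method exists precisely to avoid most of these evaluations.

```python
import numpy as np

def expected_distance(mu_a, cov_a, mu_b, cov_b, n_samples=1000, seed=None):
    """Monte-Carlo estimate of the expected Euclidean distance between two
    uncertain objects, each modelled as a Gaussian. Illustrative only; a naive
    nested-loop detector would call this for every pair of objects."""
    rng = np.random.default_rng(seed)
    a = rng.multivariate_normal(mu_a, cov_a, n_samples)
    b = rng.multivariate_normal(mu_b, cov_b, n_samples)
    return float(np.mean(np.linalg.norm(a - b, axis=1)))

print(expected_distance([0, 0], np.eye(2), [3, 4], np.eye(2), seed=0))  # close to 5
```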

9.
刘建明  史一民  张俊  陈存衡 《计算机工程》2013,39(3):223-228,235
In semantic similarity measurement over Resource Description Framework (RDF) graphs, structural similarity and semantic similarity are computed imprecisely. To address this problem, a structure-and-semantics (SAS) method is proposed. It combines an improved semantic distance formula based on a network-distance model with a weight measurement mechanism based on an information-content model to compute the semantic similarity of concept nodes, refines the semantic similarity algorithm for RDF graphs, and analyses the influence of structure, depth and density on semantic similarity measurement for RDF graphs. A prototype system is designed and implemented; experimental results show that the method effectively ensures that the computed semantic similarity of RDF graphs agrees with reality.

10.
Learning hash functions/codes for similarity search over multi-view data is attracting increasing attention, where similar hash codes are assigned to data objects whose neighborhood relationships are consistent across views. Traditional methods in this category inherently suffer from three limitations: 1) they commonly adopt a two-stage scheme where a similarity matrix is first constructed, followed by subsequent hash function learning; 2) these methods are commonly developed on the assumption that data samples with multiple representations are noise-free, which is not practical in real-life applications; and 3) they often incur a cumbersome training model caused by the neighborhood graph construction using all N points in the database (O(N)). In this paper, we motivate the problem of jointly and efficiently training robust hash functions over data objects with multi-feature representations which may be noise corrupted. To achieve both robustness and training efficiency, we propose an approach to effectively and efficiently learning low-rank kernelized hash functions shared across views. Specifically, we utilize landmark graphs to construct tractable similarity matrices in multiple views to automatically discover neighborhood structure in the data. To learn robust hash functions, a latent low-rank kernel function is used to construct hash functions in order to accommodate linearly inseparable data. In particular, a latent kernelized similarity matrix is recovered by rank minimization on multiple kernel-based similarity matrices. Extensive experiments on real-world multi-view datasets validate the efficacy of our method in the presence of error corruptions. (We use kernelized similarity rather than kernel, as the data-landmark affinity matrix is not a square symmetric matrix.)

11.
To improve the precision and efficiency of outlier detection, an outlier detection algorithm based on relevant subspaces is proposed. The algorithm first derives a sparsity matrix from the local density distribution of the data and amplifies the sparsity features with a Gaussian similarity kernel function; it then computes the sparsity similarity factor of the data in each attribute dimension to determine the subspace vectors and relevant subspaces, and combines data sparsity with dimension weights to obtain an outlier factor for each data object, selecting the objects with the largest factors as outliers; finally, synthetic …

12.
An Efficient R-Tree-Based Outlying Trajectory Detection Algorithm
An outlying-trajectory detection algorithm is proposed that judges whether two trajectories match globally by examining their degree of local anomaly, and thereby detects outlying trajectories. The main points of the algorithm are as follows: (1) to represent local trajectory features effectively, k consecutive trajectory points are taken as the basic comparison unit, a distance function is proposed to measure the degree of mismatch between two basic comparison units, and on this basis the notions of local match, global match and outlying trajectory are defined; (2) to address the generally high computational cost of outlying-trajectory detection, an R-Tree-based detection algorithm is proposed; its advantage is that the R-Tree and the inter-trajectory distance feature matrix are used to find all potentially matching pairs of basic comparison units, whose local match is then confirmed by actually computing their distances, eliminating a large number of unnecessary distance computations. Experimental results show that the algorithm is not only efficient but also detects outlying trajectories that are meaningful in practice.
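To make the basic comparison unit concrete, the sketch below takes k consecutive points per unit and uses the mean point-wise Euclidean distance as the mismatch measure. The paper defines its own distance function, so this particular choice, the threshold eps and the brute-force pairing (which the R-Tree index is designed to prune) are illustrative assumptions.

```python
import numpy as np

def unit_mismatch(u, v):
    """Mismatch between two basic comparison units (k consecutive trajectory
    points each), taken here as the mean point-wise Euclidean distance."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.mean(np.linalg.norm(u - v, axis=1)))

def local_matches(traj_a, traj_b, k, eps):
    """Enumerate all pairs of basic comparison units whose mismatch is below eps.
    Brute force; the R-Tree version prunes most pairs before this distance test."""
    units_a = [traj_a[i:i + k] for i in range(len(traj_a) - k + 1)]
    units_b = [traj_b[j:j + k] for j in range(len(traj_b) - k + 1)]
    return [(i, j) for i, ua in enumerate(units_a)
                   for j, ub in enumerate(units_b)
                   if unit_mismatch(ua, ub) <= eps]
```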

13.
14.
With the emergence of large-scale, interconnected, multi-typed heterogeneous information networks such as social networks and bibliographic networks, similarity search faces many challenges, among which similarity measurement is one of the key problems. Existing similarity measures designed for homogeneous networks do not take into account the different semantics of the multiple paths in a network. This paper proposes a new meta-path-based similarity measure that can search for objects of the same type in a heterogeneous network. A meta path is a path composed of a sequence of relations defined between different object types, and it can provide a common foundation for similarity search engines over the network. Experiments on real datasets show that, compared with path-order-insensitive similarity measures, the proposed method supports fast path-based similarity queries and can be widely applied in social networks and e-commerce.
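A representative meta-path-based measure of this kind normalises the number of meta-path instances between two objects by their self-instance counts (PathSim-style). The sketch below assumes the meta path is supplied as a chain of adjacency matrices; the abstract does not spell out its exact formula, so this is an illustration rather than the proposed measure.

```python
import numpy as np

def commuting_matrix(adjacency_chain):
    """Product of the adjacency matrices along a meta path, e.g.
    [author_paper, author_paper.T] for the co-authorship meta path A-P-A."""
    m = adjacency_chain[0]
    for a in adjacency_chain[1:]:
        m = m @ a
    return m

def pathsim(m, x, y):
    """PathSim-style similarity under a symmetric meta path:
    2 * M[x, y] / (M[x, x] + M[y, y])."""
    denom = m[x, x] + m[y, y]
    return 0.0 if denom == 0 else 2.0 * m[x, y] / denom

author_paper = np.array([[1, 1, 0], [0, 1, 1], [0, 0, 1]])   # toy author-paper matrix
m = commuting_matrix([author_paper, author_paper.T])          # meta path A-P-A
print(pathsim(m, 0, 1))
```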

15.
In this study, we propose a novel local outlier detection approach, called LOMA, for mining local outliers in high-dimensional data sets. To improve the efficiency of outlier detection, LOMA prunes irrelevant attributes and objects in the data set by analyzing attribute relevance with a sparse factor threshold. Such a pruning technique substantially reduces the size of data sets. The core of LOMA is searching sparse subspaces, which implements the particle swarm optimization method on the reduced data sets. In the process of searching sparse subspaces, we introduce a sparse coefficient threshold to represent the sparsity degree of data objects in a subspace, where such data objects are considered local outliers. The attribute relevance analysis provides guidance for experts and users to identify useless attributes when detecting outliers. In addition, our sparse-subspace-based outlier algorithm is a novel technique for local-outlier detection in a wide variety of applications. Experimental results driven by both synthetic and UCI data sets validate the effectiveness and accuracy of LOMA. In particular, LOMA achieves high mining efficiency and accuracy when the sparse factor threshold is set to a small value.

16.
Story link detection is a subtask of topic detection and tracking; it is the technique of judging whether two randomly selected news stories discuss the same topic. Inspired by the word co-occurrence model and considering the characteristics of story link detection, this paper proposes a dynamic co-occurrence relationship between words and implements a story similarity computation method based on this dynamic co-occurrence relationship; the application of the similarity computation method to Chinese story link detection is also discussed. Experiments show that the dynamic co-occurrence relationship can reflect the semantic information of stories to some extent, and the similarity computation method clearly improves the performance of the Chinese story link detection system, achieving good results.

17.
Computing the semantic similarity between terms (or short text expressions) that have the same meaning but which are not lexicographically similar is an important challenge in the information integration field. The problem is that techniques for textual semantic similarity measurement often fail to deal with words not covered by synonym dictionaries. In this paper, we try to solve this problem by determining the semantic similarity for terms using the knowledge inherent in the search history logs from the Google search engine. To do this, we have designed and evaluated four algorithmic methods for measuring the semantic similarity between terms using their associated history search patterns. These algorithmic methods are: a) frequent co-occurrence of terms in search patterns, b) computation of the relationship between search patterns, c) outlier coincidence on search patterns, and d) forecasting comparisons. We have shown experimentally that some of these methods correlate well with respect to human judgment when evaluating general purpose benchmark datasets, and significantly outperform existing methods when evaluating datasets containing terms that do not usually appear in dictionaries.
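Method (a), frequent co-occurrence of terms in search patterns, could be realised roughly as below, treating each log entry as a bag of terms and scoring pairs with a PMI-style statistic. The log format, the PMI choice and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_similarity(queries):
    """Build a co-occurrence-based similarity from a list of search queries,
    each given as a list of terms. Returns a function sim(a, b) that scores a
    term pair with positive pointwise mutual information."""
    term_count, pair_count = Counter(), Counter()
    for q in queries:
        terms = set(q)
        term_count.update(terms)
        pair_count.update(frozenset(p) for p in combinations(sorted(terms), 2))
    n = len(queries)

    def sim(a, b):
        joint = pair_count[frozenset((a, b))]
        if joint == 0:
            return 0.0
        pmi = math.log(joint * n / (term_count[a] * term_count[b]))
        return max(pmi, 0.0)

    return sim

sim = cooccurrence_similarity([["data", "mining"], ["data", "mining", "outlier"], ["web", "search"]])
print(sim("data", "mining"))  # positive: the terms co-occur more often than chance
```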

18.
A Novel Approach Towards Large Scale Cross-Media Retrieval
With the rapid development of the Internet and multimedia technology, cross-media retrieval is concerned with retrieving all related media objects of multiple modalities given a query media object. Unfortunately, the complexity and heterogeneity of multi-modal data pose the following two major challenges for cross-media retrieval: 1) how to construct a unified and compact model for media objects with multiple modalities, and 2) how to improve retrieval performance for large-scale cross-media databases. In this paper, we propose a novel method dedicated to solving these issues to achieve effective and accurate cross-media retrieval. Firstly, a multi-modality semantic relationship graph (MSRG) is constructed using the semantic correlations among the multi-modal media objects. Secondly, all the media objects in the MSRG are mapped onto an isomorphic semantic space. Further, an efficient index, the MK-tree, based on heterogeneous data distribution is proposed to manage the media objects within the semantic space and improve retrieval performance. Extensive experiments on real large-scale cross-media datasets indicate that our proposal dramatically improves the accuracy and efficiency of cross-media retrieval, significantly outperforming existing methods.

19.
In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers - objects that behave in an unexpected way or have abnormal properties. The identification of outliers is important for many applications such as intrusion detection, credit card fraud, criminal activities in electronic commerce, medical diagnosis and anti-terrorism. In this paper, we propose a hybrid approach to outlier detection, which combines the opinions of boundary-based and distance-based methods for outlier detection ([Jiang et al., 2005], [Jiang et al., 2009] and [Knorr and Ng, 1998]). We give a novel definition of outliers - BD (boundary and distance)-based outliers - by virtue of the notion of boundary region in rough set theory and the definitions of distance-based outliers. An algorithm to find such outliers is also given, and the effectiveness of our method for outlier detection is demonstrated on two publicly available databases.

20.
Spatial outlier detection is an important research problem that has received much attention in recent years. Most existing approaches are designed for numerical attributes, but are not applicable to categorical ones (e.g., binary, ordinal, and nominal) that are popular in many applications. The main challenges are the modeling of spatial categorical dependency as well as computational efficiency. This paper presents the first outlier detection framework for spatial categorical data. Specifically, a new metric, named the Pair Correlation Ratio (PCR), is measured for each pair of category sets based on their co-occurrence frequencies at specific spatial distance ranges. The relevances among spatial objects are then calculated using PCR values with regard to their spatial distances. The outlierness of each object is defined as the inverse of the average relevance between the object and its spatial neighbors. The objects with the highest outlier scores are returned as spatial categorical outliers. A set of algorithms is further designed for single-attribute and multi-attribute spatial categorical datasets. Extensive experimental evaluations on both simulated and real datasets demonstrated the effectiveness and efficiency of the proposed approaches.
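The outlierness definition quoted above, the inverse of an object's average relevance to its spatial neighbours, is straightforward to state in code. In the sketch below the relevance matrix is assumed to have been derived already from Pair Correlation Ratios; the PCR computation itself is not reproduced.

```python
import numpy as np

def outlier_scores(relevance_matrix, neighbor_lists):
    """Outlier score of each object as the inverse of its average relevance to
    its spatial neighbours. relevance_matrix[i, j] is assumed to be the
    PCR-derived relevance between objects i and j."""
    scores = []
    for i, neighbors in enumerate(neighbor_lists):
        avg = float(np.mean([relevance_matrix[i, j] for j in neighbors]))
        scores.append(np.inf if avg == 0 else 1.0 / avg)
    return scores

rel = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
# Object 2 is weakly related to its neighbours, so it gets the highest score.
print(outlier_scores(rel, neighbor_lists=[[1, 2], [0, 2], [0, 1]]))
```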
