首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Kin relationship has been well investigated in psychology community over the past decades, while kin verification using facial images is relatively new and challenging problem in biometrics society. Recently, it has attracted substantial attention from biometrics society, mainly motivated by the relative characteristics that children generally resemble their parents more than other persons with respect to facial appearance. Unlike most previous supervised metric learning methods focusing on learning the Mahalanobis distance metric for kin verification, we propose in this paper a new Ensemble similarity learning (ESL) method for this challenging problem. We first introduce a sparse bilinear similarity function to model the relative characteristics encoded in kin data. The similarity function parameterized by a diagonal matrix enjoys the superiority in computational efficiency, making it more practical for real-world high-dimensional kinship verification applications. Then, ESL learns from kin dataset by generating an ensemble of similarity models with the aim of achieving strong generalization ability. Specifically, ESL works by best satisfying the constraints (typically triplet-based) derived from the class labels on each base similarity model, while maximizing the diversity among the base similarity models. Experiments results demonstrate that our method is superior to some state-of-the-art methods in terms of both verification rate and computational efficiency.  相似文献   

2.
Due to the famous dimensionality curse problem, search in a high-dimensional space is considered as a "hard" problem. In this paper, a novel composite distance transformation method, which is called CDT, is proposed to support a fast k-nearest-neighbor (k-NN) search in high-dimensional spaces. In CDT, all (n) data points are first grouped into some clusters by a k-Means clustering algorithm. Then a composite distance key of each data point is computed. Finally, these index keys of such n data points are inserted by a partition-based B -tree. Thus, given a query point, its k-NN search in high-dimensional spaces is transformed into the search in the single dimensional space with the aid of CDT index. Extensive performance studies are conducted to evaluate the effectiveness and efficiency of the proposed scheme. Our results show-that this method outperforms the state-of-the-art high-dimensional search techniques, such as the X-Tree, VA-file, iDistance and NB-Tree.  相似文献   

3.
Learning hash functions/codes for similarity search over multi-view data is attracting increasing attention, where similar hash codes are assigned to the data objects characterizing consistently neighborhood relationship across views. Traditional methods in this category inherently suffer three limitations: 1) they commonly adopt a two-stage scheme where similarity matrix is first constructed, followed by a subsequent hash function learning; 2) these methods are commonly developed on the assumption that data samples with multiple representations are noise-free,which is not practical in real-life applications; and 3) they often incur cumbersome training model caused by the neighborhood graph construction using all N points in the database (O(N)). In this paper, we motivate the problem of jointly and efficiently training the robust hash functions over data objects with multi-feature representations which may be noise corrupted. To achieve both the robustness and training efficiency, we propose an approach to effectively and efficiently learning low-rank kernelized1 hash functions shared across views. Specifically, we utilize landmark graphs to construct tractable similarity matrices in multi-views to automatically discover neighborhood structure in the data. To learn robust hash functions, a latent low-rank kernel function is used to construct hash functions in order to accommodate linearly inseparable data. In particular, a latent kernelized similarity matrix is recovered by rank minimization on multiple kernel-based similarity matrices. Extensive experiments on real-world multi-view datasets validate the efficacy of our method in the presence of error corruptions.We use kernelized similarity rather than kernel, as it is not a squared symmetric matrix for data-landmark affinity matrix.  相似文献   

4.
徐鲲鹏  陈黎飞  孙浩军  王备战 《软件学报》2020,31(11):3492-3505
现有的类属型数据子空间聚类方法大多基于特征间相互独立假设,未考虑属性间存在的线性或非线性相关性.提出一种类属型数据核子空间聚类方法.首先引入原作用于连续型数据的核函数将类属型数据投影到核空间,定义了核空间中特征加权的类属型数据相似性度量.其次,基于该度量推导了类属型数据核子空间聚类目标函数,并提出一种高效求解该目标函数的优化方法.最后,定义了一种类属型数据核子空间聚类算法.该算法不仅在非线性空间中考虑了属性间的关系,而且在聚类过程中赋予每个属性衡量其与簇类相关程度的特征权重,实现了类属型属性的嵌入式特征选择.还定义了一个聚类有效性指标,以评价类属型数据聚类结果的质量.在合成数据和实际数据集上的实验结果表明,与现有子空间聚类算法相比,核子空间聚类算法可以发掘类属型属性间的非线性关系,并有效提高了聚类结果的质量.  相似文献   

5.
Outlier or anomaly detection is a fundamental data mining task with the aim to identify data points, events, transactions which deviate from the norm. The identification of outliers in data can provide insights about the underlying data generating process. In general, outliers can be of two kinds: global and local. Global outliers are distinct with respect to the whole data set, while local outliers are distinct with respect to data points in their local neighbourhood. While several approaches have been proposed to scale up the process of global outlier discovery in large databases, this has not been the case for local outliers. We tackle this problem by optimising the use of local outlier factor (LOF) for large and high-dimensional data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of random projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions. A further investigation into the use of high-dimensionality-specific indexing such as spatial approximate sample hierarchy (SASH) shows that our novel technique holds benefits over even these types of highly efficient indexing. We cement the practical applications of our novel technique with insights into what it means to find local outliers in real data including image and text data, and include potential applications for this knowledge.  相似文献   

6.
Due to the famous dimensionality curse problem, search in a high-dimensional space is considered as a "hard" problem. In this paper, a novel composite distance transformation method, which is called CDT, is proposed to support a fast κ-nearest-neighbor (κ-NN) search in high-dimensional spaces. In CDT, all (n) data points are first grouped into some clusters by a κ-Means clustering algorithm. Then a composite distance key of each data point is computed. Finally, these index keys of such n data points are inserted by a partition-based B^+-tree. Thus, given a query point, its κ-NN search in high-dimensional spaces is transformed into the search in the single dimensional space with the aid of CDT index. Extensive performance studies are conducted to evaluate the effectiveness and efficiency of the proposed scheme. Our results show that this method outperforms the state-of-the-art high-dimensional search techniques, such as the X-Tree, VA-file, iDistance and NB-Tree.  相似文献   

7.
针对传统高维轨迹隐私保护模型抑制点数过多而导致的数据匿名性差及数据损失大的问题,提出了一种基于信息熵抑制的轨迹隐私保护方法。通过为轨迹数据建立基于熵的流量图,根据轨迹时空点信息熵大小设计合理的花费代价函数,局部抑制时空点以达到隐私保护的目的;同时改进了一种比较抑制前后流量图相似性的算法,并提出了一个衡量隐私收益的函数;最后,与LK-Local方法进行了轨迹隐私度与数据实用性的比较。在模拟地铁交通运输系统数据集上的实验结果表明,与LK-Local方法相比,在相同的匿名参数取值下,所提方法在相似性度量上提高了约27%,在数据损失度量上降低了约25%,在隐私收益上提高了约21%。  相似文献   

8.
Network filtering is a challenging area in high-speed computer networks, mostly because lots of filtering rules are required and there is only a limited time available for matching these rules. Therefore, network filters accelerated by field-programmable gate arrays (FPGAs) are becoming common where the fast lookup of filtering rules is achieved by the use of hash tables. It is desirable to be able to fill-up these tables efficiently, i.e. to achieve a high table-load factor in order to reduce the offline time of the network filter due to rehashing and/or table replacement. A parallel reconfigurable hash function tuned by an evolutionary algorithm (EA) is proposed in this paper for Internet Protocol (IP) address filtering in FPGAs. The EA fine-tunes the reconfigurable hash function for a given set of IP addresses. The experiments demonstrate that the proposed hash function provides high-speed lookup and achieves a higher table-load factor in comparison with conventional solutions.  相似文献   

9.
High-dimensional similarity joins   总被引:3,自引:0,他引:3  
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence, the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the ε tree and the R-tree family, and show that the ε tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life data sets, shows that similarity join using the ε tree is twice to an order of magnitude faster than the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the ε tree  相似文献   

10.
高维数据之间的相似性度量问题是高维空间数据挖掘中所面临的问题之一。为了有效解决高维效应给相似性度量带来的种种问题,首先分析传统相似性度量算法,得出其局限性。再通过对传统度量算法进行改进,提出新的Close函数,以弥补传统相似性度量算法应用在高维空间时的不足。提出Close函数后,将其与几种传统的相似性度量算法作比较,得出新算法在高维空间相似性度量方面的优越性。文中最后用Matlab对该函数做了定量分析,实验证明该函数在高维空间中能有效避免噪声和维灾效应的影响。  相似文献   

11.
许健 《计算机应用研究》2021,38(8):2394-2400
针对传统漏洞检测分类需要定义人工特征以及相似度匹配算法不能检测非克隆漏洞、现有深度学习漏洞检测的方法特征维度过大以及只针对函数调用的问题,提出一种融合滑动窗口和哈希函数的深度学习方法,对源代码进行静态漏洞检测分类.首先抽取源代码的方法体,形成正负样本集,对样本集中的每个样本构建抽象语法树,根据语法树中的节点类型替换程序员自定义的变量名以及方法名,并以先序遍历的方式序列化抽象语法树;然后对抽象语法树节点中的节点信息进行分词,为每个词分配一个独立的节点编号;其次对树节点进行进一步的拆分,形成词序列,基于滑动窗口与哈希函数训练出相应的漏洞检测分类模型.最后,在SARD数据集中选取CWE190整数上溢和CWE191整数下溢两类漏洞进行实验,该模型在CWE190、CWE191中的分类准确率和召回率分别达到97.4%、94.2%和97.6%、95.1%.实验结果表明,提出方法能够检测到代码中的安全漏洞类型,并且在分类准确率和召回率上优于现有的方法.  相似文献   

12.
Clustering is one of the most important techniques for the design of intelligent systems, and it has been incorporated into a large number of real applications. However, classical clustering algorithms cannot process high-dimensional data, such as text, in a reasonable amount of time. To address this problem, we use techniques based on locality-sensitive hashing (LSH), which was originally designed as an efficient means of solving the near-neighbor search problem for high-dimensional data. We propose the use of two LSH strategies to group high-dimensional data: MinHash, which enables Jaccard similarity approximations, and SimHash, which approximates cosine similarity. Instead of creating a computational costly data structure for responding to queries from near neighbors, we use a low-dimensional Hamming embedding to approximate a pairwise similarity matrix using a single-pass procedure. This procedure does not require data storage. It requires only the maintenance of a low-dimensional embedding. Then, the clustering solution is found by applying the bisection method to the similarity matrix. In addition to the above, we propose an improvement to LSH that is beneficial for its use on high-dimensional data. This improvement introduces a penalty on the Hamming distance, which is used in conjunction with SimHash, thereby improving the cosine similarity approximation. Experimental results indicate that our proposal yields a solution that is very close to the one found by applying the bisection method to a matrix with complete information, with better running times and a lower use of memory.  相似文献   

13.
在多目标优化中,许多实际问题都是由很多目标(超过三个)所组成,但是目前提出的大多数算法却只有在三维以下时高效。由于超过三维的情况无法用欧式空间来表示,而且在处理高维问题时,算法的时间复杂度通常很高,因此人们开始考虑将高维目标转化为低维目标后再处理。首先介绍了目前已经存在的将高维目标转化为低维目标的算法,提出了一种新的算法,该方法通过数据拟合,将各目标函数拟合为一条直线,比较相互之间的斜率之差来确定目标是否存在冗余,以期减少冗余目标。  相似文献   

14.
In this paper, we introduce the concept of extended feature objects for similarity retrieval. Conventional approaches for similarity search in databases map each object in the database to a point in some high-dimensional feature space and define similarity as some distance measure in this space. For many similarity search problems, this feature-based approach is not sufficient. When retrieving partially similar polygons, for example, the search cannot be restricted to edge sequences, since similar polygon sections may start and end anywhere on the edges of the polygons. In general, inherently continuous problems such as the partial similarity search cannot be solved by using point objects in feature space. In our solution, we therefore introduce extended feature objects consisting of an infinite set of feature points. For an efficient storage and retrieval of the extended feature objects, we determine the minimal bounding boxes of the feature objects in multidimensional space and store these boxes using a spatial access structure. In our concrete polygon problem, sets of polygon sections are mapped to 2D feature objects in high-dimensional space which are then approximated by minimal bounding boxes and stored in an R-tree. The selectivity of the index is improved by using an adaptive decomposition of very large feature objects and a dynamic joining of small feature objects. For the polygon problem, translation, rotation, and scaling invariance is achieved by using the Fourier-transformed curvature of the normalized polygon sections. In contrast to vertex-based algorithms, our algorithm guarantees that no false dismissals may occur and additionally provides fast search times for realistic database sizes. We evaluate our method using real polygon data of a supplier for the car manufacturing industry. Edited by R. Güting. Received October 7, 1996 / Accepted March 28, 1997  相似文献   

15.
Learning distance metrics for measuring the similarity between two data points in unsupervised and supervised pattern recognition has been widely studied in unconstrained face verification tasks. Motivated by the fact that enforcing single distance metric learning for verification via an empirical score threshold is not robust in uncontrolled experimental conditions, we therefore propose to obtain a metric swarm by learning local patches alike sub-metrics simultaneously that naturally formulates a generalized metric swarm learning (GMSL) model with a joint similarity score function solved by an efficient alternative optimization algorithm. Further, each sample pair is represented as a similarity vector via the well-learned metric swarm, such that the face verification task becomes a generalized SVM-alike classification problem. Therefore, the verification can be enforced in the represented metric swarm space that can well improve the robustness of verification under irregular data structure. Experiments are preliminarily conducted using several UCI benchmark datasets for solving general classification problem. Further, the face verification experiments on real-world LFW and PubFig datasets demonstrate that our proposed model outperforms several state-of-the-art metric learning methods.  相似文献   

16.
Imbalance classification techniques have been frequently applied in many machine learning application domains where the number of the majority (or positive) class of a dataset is much larger than that of the minority (or negative) class. Meanwhile, feature selection (FS) is one of the key techniques for the high-dimensional classification task in a manner which greatly improves the classification performance and the computational efficiency. However, most studies of feature selection and imbalance classification are restricted to off-line batch learning, which is not well adapted to some practical scenarios. In this paper, we aim to solve high-dimensional imbalanced classification problem accurately and efficiently with only a small number of active features in an online fashion, and we propose two novel online learning algorithms for this purpose. In our approach, a classifier which involves only a small and fixed number of features is constructed to classify a sequence of imbalanced data received in an online manner. We formulate the construction of such online learner into an optimization problem and use an iterative approach to solve the problem based on the passive-aggressive (PA) algorithm as well as a truncated gradient (TG) method. We evaluate the performance of the proposed algorithms based on several real-world datasets, and our experimental results have demonstrated the effectiveness of the proposed algorithms in comparison with the baselines.  相似文献   

17.
Manifold learning algorithms seek to find a low-dimensional parameterization of high-dimensional data. They heavily rely on the notion of what can be considered as local, how accurately the manifold can be approximated locally, and, last but not least, how the local structures can be patched together to produce the global parameterization. In this paper, we develop algorithms that address two key issues in manifold learning: 1) the adaptive selection of the local neighborhood sizes when imposing a connectivity structure on the given set of high-dimensional data points and 2) the adaptive bias reduction in the local low-dimensional embedding by accounting for the variations in the curvature of the manifold as well as its interplay with the sampling density of the data set. We demonstrate the effectiveness of our methods for improving the performance of manifold learning algorithms using both synthetic and real-world data sets.  相似文献   

18.
最小距离鉴别投影及其在人脸识别中的应用   总被引:2,自引:1,他引:1       下载免费PDF全文
针对人脸识别问题,提出了最小距离鉴别投影算法,其与经典的线性鉴别分析不同,它是一种流形学习降维算法。该算法首先定义样本的类内相似度与类间相似度:前者能够度量样本与类内中心的距离关系,后者不仅能够反映样本与类间中心的距离关系而且能够反映样本类间距与类内距的大小关系;然后将高维数据映射到低维特征空间,使得样本到类内中心距离最小同时到类间中心距离最大。最后,在ORL、FERET及AR人脸库上的实验结果表明所提算法识别性能要优于其他算法。  相似文献   

19.
In this paper, extreme learning machine (ELM) for ε-insensitive error loss function-based regression problem formulated in 2-norm as an unconstrained optimization problem in primal variables is proposed. Since the objective function of this unconstrained optimization problem is not twice differentiable, the popular generalized Hessian matrix and smoothing approaches are considered which lead to optimization problems whose solutions are determined using fast Newton–Armijo algorithm. The main advantage of the algorithm is that at each iteration, a system of linear equations is solved. By performing numerical experiments on a number of interesting synthetic and real-world datasets, the results of the proposed method are compared with that of ELM using additive and radial basis function hidden nodes and of support vector regression (SVR) using Gaussian kernel. Similar or better generalization performance of the proposed method on the test data in comparable computational time over ELM and SVR clearly illustrates its efficiency and applicability.  相似文献   

20.
姜逸凡  叶青 《计算机应用》2019,39(4):1041-1045
在时间序列分类等数据挖掘工作中,不同数据集基于类别的相似性表现有明显不同,因此一个合理有效的相似性度量对数据挖掘非常关键。传统的欧氏距离、余弦距离和动态时间弯曲等方法仅针对数据自身进行相似度公式计算,忽略了不同数据集所包含的知识标注对于相似性度量的影响。为了解决这一问题,提出基于孪生神经网络(SNN)的时间序列相似性度量学习方法。该方法从样例标签的监督信息中学习数据之间的邻域关系,建立时间序列之间的高效距离度量。在UCR提供的时间序列数据集上进行的相似性度量和验证性分类实验的结果表明,与ED/DTW-1NN相比SNN在分类质量总体上有明显的提升。虽然基于动态时间弯曲(DTW)的1近邻(1NN)分类方法在部分数据上表现优于基于SNN的1NN分类方法,但在分类过程的相似度计算复杂度和速度上SNN优于DTW。可见所提方法能明显提高分类数据集相似性的度量效率,在高维、复杂的时间序列的数据分类上有不错的表现。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号