Similar Documents
20 similar documents found (search time: 0 ms)
1.
Previous work has shown that the Minkowski-p distance metrics are unsuitable for clustering very high dimensional document data. We extend this work. We frame statistical theory on the relationships between the Euclidean, cosine, and correlation distance metrics in terms of item neighborhoods. We discuss the differences between the cosine and correlation distance metrics and illustrate our discussion with an example from collaborative filtering. We introduce a family of normalized Minkowski metrics and test their use on both document data and synthetic data generated from the uniform distribution. We describe a range of criteria for testing neighborhood homogeneity relative to underlying latent classes. We discuss how these criteria are explicitly and implicitly linked to classification performance. By testing both normalized and non-normalized Minkowski-p metrics for multiple values of p, we separate out distance compression effects from normalization effects. For multi-class classification problems, we believe that distance compression on high dimensional data aids classification and data analysis. For document data, we find that the cosine (and normalized Euclidean), correlation, and proportioned city block metrics give strong neighborhood recovery. The proportioned city block metric gives particularly good results for nearest neighbor recovery and should be used with document data analysis techniques for which nearest neighbor recovery is important. For data generated from the uniform distribution, neighborhood recovery improves as the value of p increases.
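The distance definitions compared above can be made concrete in a few lines of NumPy. This is a minimal illustrative sketch, not code from the cited paper; in particular, the unit-norm scaling used for the "normalized" Minkowski family below is an assumption made for illustration.

```python
import numpy as np

def minkowski(x, y, p):
    """Standard Minkowski-p distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def normalized_minkowski(x, y, p):
    """Minkowski-p distance after scaling each vector to unit L_p norm
    (one plausible reading of 'normalized'; an assumption for this sketch)."""
    xn = x / np.sum(np.abs(x) ** p) ** (1.0 / p)
    yn = y / np.sum(np.abs(y) ** p) ** (1.0 / p)
    return minkowski(xn, yn, p)

def cosine_distance(x, y):
    """1 - cosine similarity; equals half the squared Euclidean distance
    between the unit-normalized vectors."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation_distance(x, y):
    """Cosine distance applied to mean-centered vectors."""
    return cosine_distance(x - x.mean(), y - y.mean())

x = np.array([3.0, 0.0, 1.0, 2.0])
y = np.array([1.0, 1.0, 0.0, 2.0])
print(minkowski(x, y, 1), normalized_minkowski(x, y, 1),
      cosine_distance(x, y), correlation_distance(x, y))
```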

2.
The goal of image annotation is to automatically assign a set of textual labels to an image to describe the visual contents thereof. Recently, with the rapid increase in the number of web images, nearest neighbor (NN) based methods have become more attractive and have shown exciting results for image annotation. One of the key challenges of these methods is to define an appropriate similarity measure between images for neighbor selection. Several distance metric learning (DML) algorithms derived from traditional image classification problems have been applied to annotation tasks. However, a fundamental limitation of applying DML to image annotation is that it learns a single global distance metric over the entire image collection and measures the distance between image pairs in the image-level. For multi-label annotation problems, it may be more reasonable to measure similarity of image pairs in the label-level. In this paper, we develop a novel label prediction scheme utilizing multiple label-specific local metrics for label-level similarity measure, and propose two different local metric learning methods in a multi-task learning (MTL) framework. Extensive experimental results on two challenging annotation datasets demonstrate that 1) utilizing multiple local distance metrics to learn label-level distances is superior to using a single global metric in label prediction, and 2) the proposed methods using the MTL framework to learn multiple local metrics simultaneously can model the commonalities of labels, thereby facilitating label prediction results to achieve state-of-the-art annotation performance.  相似文献   
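A minimal sketch of the label-level idea: each label gets its own (here diagonal) metric, and the query's neighbors are re-ranked per label before voting. All function and weight names are hypothetical, and this is not the paper's DML or MTL formulation, only an illustration of label-level distances.

```python
import numpy as np

def label_distance(x, y, w_label):
    """Distance between two image feature vectors under a label-specific
    diagonal metric w_label (non-negative per-dimension weights)."""
    d = x - y
    return float(np.sqrt(np.sum(w_label * d * d)))

def predict_labels(query, train_X, train_Y, label_metrics, k=5):
    """Score each label by voting among the query's k nearest neighbors,
    where neighbors are ranked with that label's own metric."""
    scores = {}
    for label, w in label_metrics.items():
        dists = np.array([label_distance(query, x, w) for x in train_X])
        nn = np.argsort(dists)[:k]
        scores[label] = float(np.mean([label in train_Y[i] for i in nn]))
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: two labels, six training images with 4-D features.
rng = np.random.default_rng(0)
train_X = rng.normal(size=(6, 4))
train_Y = [{"sky"}, {"sky", "sea"}, {"sea"}, {"sky"}, {"sea"}, {"sky", "sea"}]
label_metrics = {"sky": np.array([2.0, 1.0, 0.5, 0.5]),
                 "sea": np.array([0.5, 0.5, 1.0, 2.0])}
print(predict_labels(rng.normal(size=4), train_X, train_Y, label_metrics, k=3))
```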

3.
Vector quantization has been widely employed in nearest neighbor search because it can approximate the Euclidean distance between two vectors with precomputed table look-ups. The additive quantization (AQ) algorithm showed that a low approximation error can be achieved by representing each input vector as a sum of dependent codewords, each drawn from its own codebook. However, AQ relies on a computationally expensive beam search to encode each vector, which is prohibitive for efficient approximate nearest neighbor search. In this paper, we propose a fast AQ algorithm that significantly accelerates the encoding phase. We formulate the beam search as an optimization of codebook selection orders. Following the optimal order, we learn the codebooks with a hierarchical construction in which the search width can be set very small. Specifically, at each step the codewords are first exchanged into the appropriate codebooks according to how frequently they are indexed, and the codebooks are then updated successively to adapt to the quantization residual of the previous level. In the encoding phase, vectors are compressed with the learned codebooks in the best order, so the search range is considerably reduced. The proposed method achieves almost the same performance as AQ, while the vector encoding phase is accelerated by dozens of times. Experiments on two benchmark datasets verify this conclusion.
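To make the encoding cost concrete, here is a minimal greedy encoder, i.e. beam search with width 1, over a set of codebooks. It only illustrates the sum-of-codewords representation; the paper's hierarchical codebook learning and order optimization are not reproduced, and the codebook sizes are arbitrary.

```python
import numpy as np

def encode_greedy(x, codebooks):
    """Encode x as a sum of one codeword per codebook, chosen greedily on the
    running residual (equivalent to beam search with width 1)."""
    residual = x.copy()
    codes = []
    for C in codebooks:                       # each C has shape (K, d)
        errs = np.sum((residual - C) ** 2, axis=1)
        k = int(np.argmin(errs))
        codes.append(k)
        residual = residual - C[k]
    return codes

def decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(C[k] for C, k in zip(codebooks, codes))

rng = np.random.default_rng(0)
d, M, K = 16, 4, 256                          # dimensions, codebooks, codewords each
codebooks = [rng.normal(size=(K, d)) for _ in range(M)]
x = rng.normal(size=d)
codes = encode_greedy(x, codebooks)
print(codes, float(np.linalg.norm(x - decode(codes, codebooks))))
```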

4.
This paper aims to introduce and study two novel metrics on gray-tone images. These metrics are based on the General Adaptive Neighborhood Image Processing (GANIP) framework, which represents an image by spatial neighborhoods, named General Adaptive Neighborhoods (GANs), that fit their local context. These metrics are generalized in the sense that they do not satisfy all the axioms of a standard mathematical metric. This notion of adaptive generalized metrics leads to the definition of relevant GAN distance maps and GAN nearest neighbor transforms used for image segmentation.

5.
When classes are nonseparable or overlapping, training samples in a local neighborhood may come from different classes. In this situation, samples with different class labels may be comparable in the neighborhood of the query. As a consequence, a conventional nearest neighbor classifier, such as the k-nearest neighbor scheme, may produce a wrong prediction. To address this issue, in this paper we propose a new classification method that performs the classification task based on the local probabilistic centers of each class. This method works by reducing the number of negative contributing points in the training set, i.e., the known samples falling on the wrong side of the ideal decision boundary, and by restricting their influence regions. In classification, the query sample is classified using two measures: one is the distance between the query and the local categorical probability centers, and the other is the computed posterior probability of the query. Although both measures are effective, the experiments show that the second one achieves a smaller classification error. Theoretical analyses of the suggested methods are also provided, and experiments are conducted on both constructed and real datasets. The results show that this method substantially improves the classification performance of the nearest neighbor algorithm.
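A minimal sketch of the underlying idea of classifying by local class centers within the query's neighborhood. It uses plain per-class means and Euclidean distance rather than the probabilistic centers and influence-region restriction of the paper.

```python
import numpy as np

def classify_by_local_centers(query, X, y, k=20):
    """Take the k nearest training samples, compute one center per class
    present in that neighborhood, and return the class whose center is
    closest (a simplified stand-in for local probabilistic centers)."""
    d = np.linalg.norm(X - query, axis=1)
    nn = np.argsort(d)[:k]
    best_label, best_dist = None, np.inf
    for label in np.unique(y[nn]):
        center = X[nn][y[nn] == label].mean(axis=0)
        dist = float(np.linalg.norm(query - center))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Two overlapping Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(classify_by_local_centers(np.array([0.7, 0.7]), X, y, k=15))
```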

6.
A three-dimensional anisotropic Riemannian metric is constructed from a triangulated CAD model to control its spatial discretization for numerical analysis. In addition to the usual curvature criterion, the present geometric metric is also based on the local thickness of the modeled domain. This local thickness is extracted from the domain skeleton while local curvature is deduced from the model triangulated boundaries. A Cartesian background octree is used as the support medium for this metric and skeletonization takes advantage of this structure through an octree extension of a digital medial axis transform. For this purpose, the octree has to be refined according to not only boundary curvature but also a local separation criterion from digital topology theory. The resulting metric can be used to geometrically adapt any mesh type as long as metric-based adaptation tools are available. To illustrate such an application, geometric adaptation of overlay meshes used in grid-based methods for unstructured hexahedral mesh generation is presented. However, beyond mesh generation, the present metric may also be useful as a shape analysis tool and such a possibility could be explored in future developments.

7.
Array acoustic logging is currently an important well-logging method, but because it generates a large volume of data and cable transmission speed is limited, logging efficiency suffers. To improve logging efficiency, the raw acoustic logging data need to be compressed. Based on an analysis of the characteristics of the acoustic signal, a compression method combining the DCT, appropriate quantization, and arithmetic coding is proposed, and a series of improvements are made to the method according to tests on real waveforms. Tests on 660 sets of real logging data yield an average compression ratio of 4.56 for 672-point monopole acoustic data and 9.18 for 1200-point dipole acoustic data, with a root-mean-square error of about 2%, confirming the effectiveness of the method.
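A minimal sketch of the transform-and-quantize part of such a pipeline on a synthetic waveform, with the arithmetic coder replaced by an entropy-based size estimate. The quantization step and the 16-bit sample assumption are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_waveform(signal, q_step=8.0):
    """DCT + uniform quantization; estimate the coded size from the entropy of
    the quantized coefficients (standing in for the arithmetic coder).
    Returns the reconstruction, an estimated compression ratio, and the
    RMS error as a percentage of the signal RMS."""
    coeffs = dct(signal, norm='ortho')
    q = np.round(coeffs / q_step).astype(int)

    vals, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    est_bits = -np.sum(counts * np.log2(p))            # total bits if entropy-coded

    recon = idct(q * q_step, norm='ortho')
    ratio = (signal.size * 16) / max(est_bits, 1.0)    # assumes 16-bit raw samples
    rms_pct = 100 * np.sqrt(np.mean((signal - recon) ** 2)) / np.sqrt(np.mean(signal ** 2))
    return recon, ratio, rms_pct

t = np.linspace(0, 1, 672)
wave = 1000 * np.sin(2 * np.pi * 12 * t) * np.exp(-3 * t)   # toy 672-point waveform
_, ratio, rms = compress_waveform(wave)
print(f"compression ratio ~{ratio:.2f}, RMS error ~{rms:.2f}%")
```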

8.
In recent years, we often deal with an enormous amount of data in a large variety of pattern recognition tasks. Such data require a huge amount of memory space and computation time to process. One approach to coping with these problems is to use prototypes. We propose volume prototypes as an extension of traditional point prototypes. A volume prototype is defined as a geometric configuration that represents some data points inside it. In usage, a volume prototype is akin to a data point rather than a component of a mixture model. We present a one-pass algorithm for constructing such prototypes from a data stream, along with an application to classification. An oblivion (forgetting) mechanism is also incorporated to adapt to concept drift.
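A minimal one-pass sketch in the same spirit: spherical prototypes that either absorb an incoming stream point or spawn a new prototype. The absorption-radius rule is an assumption made for illustration, no oblivion mechanism is included, and this is not the algorithm of the cited paper.

```python
import numpy as np

class VolumePrototypes:
    """One-pass summary of a data stream by spherical prototypes
    (center, count). Purely illustrative."""

    def __init__(self, absorb_radius=1.0):
        self.absorb_radius = absorb_radius
        self.centers, self.counts = [], []

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.centers:
            d = np.linalg.norm(np.array(self.centers) - x, axis=1)
            j = int(np.argmin(d))
            if d[j] <= self.absorb_radius:
                n = self.counts[j]
                self.centers[j] = (self.centers[j] * n + x) / (n + 1)  # running mean
                self.counts[j] = n + 1
                return
        self.centers.append(x)          # no prototype close enough: start a new one
        self.counts.append(1)

rng = np.random.default_rng(0)
vp = VolumePrototypes(absorb_radius=1.5)
for point in rng.normal(size=(1000, 2)):
    vp.update(point)
print(len(vp.centers), "prototypes summarize 1000 stream points")
```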

9.
Context: Software development projects involve the use of a wide range of tools to produce a software artifact. Software repositories such as source control systems have become a focus for emergent research because they are a source of rich information regarding software development projects. The mining of such repositories is becoming increasingly common with a view to gaining a deeper understanding of the development process. Objective: This paper explores the concept of representing a software development project as a process that results in the creation of a data stream. It also describes the extraction of metrics from the Jazz repository and the application of data stream mining techniques to identify useful metrics for predicting build success or failure. Method: This research is a systematic study using the Hoeffding Tree classification method in conjunction with the Adaptive Sliding Window (ADWIN) method for detecting concept drift, applied with the Massive Online Analysis (MOA) tool. Results: The results indicate that only a relatively small number of the available measures considered have any significance for predicting the outcome of a build over time. These significant measures are identified and the implications of the results discussed, particularly the relative difficulty of predicting failed builds. The Hoeffding Tree approach is shown to produce a more stable and robust model than traditional data mining approaches. Conclusion: Overall prediction accuracies of 75% have been achieved using the Hoeffding Tree classification method. Despite this high overall accuracy, there is greater difficulty in predicting failure than success. The emergence of a stable classification tree is limited by the lack of data, but overall the approach shows promise in terms of informing software development activities in order to minimize the chance of failure.
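The split decisions of a Hoeffding Tree rest on the Hoeffding bound, which can be computed directly. The sketch below shows how the required gain gap shrinks as more stream examples arrive; the range and delta values are illustrative, not the settings used in the study.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound used by the Hoeffding Tree: with probability 1 - delta,
    the true mean of a random variable with the given range lies within
    epsilon of the mean observed over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# If the observed gain gap between the two best split attributes exceeds
# epsilon, the tree can split with confidence 1 - delta.
for n in (50, 500, 5000):
    print(n, hoeffding_bound(value_range=1.0, delta=1e-7, n=n))
```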

10.
When a file is to be transmitted from a sender to a recipient and when the latter already has a file somewhat similar to it, remote differential compression seeks to determine the similarities interactively so as to transmit only the part of the new file not already in the recipient's old file. Content-dependent chunking means that the sender and recipient chop their files into chunks, with the cutpoints determined by some internal features of the files, so that when segments of the two files agree (possibly in different locations within the files) the cutpoints in such segments tend to be in corresponding locations, and so the chunks agree. By exchanging hash values of the chunks, the sender and recipient can determine which chunks of the new file are absent from the old one and thus need to be transmitted. We propose two new algorithms for content-dependent chunking, and we compare their behavior, on random files, with each other and with previously used algorithms. One of our algorithms, the local maximum chunking method, has been implemented and found to work better in practice than previously used algorithms. Theoretical comparisons between the various algorithms can be based on several criteria, most of which seek to formalize the idea that chunks should be neither too small (so that hashing and sending hash values become inefficient) nor too large (so that agreements of entire chunks become unlikely). We propose a new criterion, called the slack of a chunking method, which seeks to measure how much of an interval of agreement between two files is wasted because it lies in chunks that don't agree. Finally, we show how to efficiently find the cutpoints for local maximum chunking.
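A minimal sketch of local-maximum cutpoint selection on random data, assuming a simple window hash (CRC32 of a few bytes) in place of whatever hash the paper uses; the horizon h yields an expected chunk length of roughly 2h + 1 bytes. This is an illustration of the idea, not the paper's efficient cutpoint-finding algorithm.

```python
import os
import zlib

def local_maximum_cutpoints(data, h=32, w=4):
    """Position i is a cutpoint when the hash of its w-byte window is strictly
    larger than the hashes at the h positions on either side, so cutpoints
    depend only on local content."""
    hashes = [zlib.crc32(data[i:i + w]) for i in range(len(data) - w)]
    cuts = []
    for i in range(h, len(hashes) - h):
        if hashes[i] > max(hashes[i - h:i]) and hashes[i] > max(hashes[i + 1:i + h + 1]):
            cuts.append(i)
    return cuts

data = os.urandom(1 << 14)                      # 16 KiB of random content
cuts = local_maximum_cutpoints(data)
print(f"{len(cuts)} cutpoints; average chunk length ~{len(data) / max(len(cuts), 1):.0f} bytes")
```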

11.
12.
FPGA Implementation of High-Speed Data Compression and Buffering   (Total citations: 1; self-citations: 0; by others: 1)
王宁  李冰 《微计算机信息》2008,24(8):213-214
This paper presents a high-speed data acquisition system that uses an FPGA as the data compression and buffering unit. Its main feature is that the data acquired at high speed are compressed in real time, and the compressed data are then buffered and stored. The design uses a data comparison module to save, in real time, the maximum value in an array of compression ratios and then buffers that maximum value, thereby meeting the requirements of the acquisition system. Two buffering schemes, one based on dual-port RAM and one based on a FIFO, are designed, and their simulation results are compared and analyzed. The system can operate at frequencies up to 90 MHz.

13.
The crux of the locally linear embedding algorithm is the selection of the number of nearest neighbors k. Some previous techniques have been developed for finding this parameter based on embedding quality measures. Nevertheless, they do not achieve suitable results when tested on several kinds of manifolds. In this work, a new method is presented for automatically computing the number of neighbors by analyzing global and local properties of the embedding results. In addition, a second strategy is proposed for choosing the parameter k on manifolds where the density and the intrinsic dimensionality of the neighborhoods vary. The first proposed technique, called preservation neighborhood error, calculates a single value of k for the whole manifold. The second method, named local neighborhood selection, computes a suitable number of neighbors for each sample point in the manifold. The methodologies were tested on artificial and real-world datasets, which allows us to visually confirm the quality of the embedding. According to the results, our methods find suitable values of k and appropriate embeddings.
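As a point of comparison, a generic quality-measure sweep can be written in a few lines with scikit-learn: pick the k that minimizes the LLE reconstruction error. This is the kind of baseline criterion such work improves on, not the preservation neighborhood error or local neighborhood selection methods themselves; the dataset and k range are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Sweep candidate neighborhood sizes and keep the k with the lowest
# reconstruction error reported by LLE.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
best_k, best_err = None, np.inf
for k in range(4, 21):
    lle = LocallyLinearEmbedding(n_neighbors=k, n_components=2, random_state=0)
    lle.fit(X)
    if lle.reconstruction_error_ < best_err:
        best_k, best_err = k, lle.reconstruction_error_
print(f"selected k = {best_k} (reconstruction error {best_err:.3e})")
```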

14.
Data gathering algorithms in sensor networks using energy metrics   (Total citations: 5; self-citations: 0; by others: 0)
Gathering sensed information in an energy efficient manner is critical to operating the sensor network for a long period of time. The LEACH protocol presented by Heinzelman et al. (2000) is an elegant solution where clusters are formed to fuse data before transmitting to the base station. In this paper, we present an improved scheme, called PEGASIS (power-efficient gathering in sensor information systems), which is a near-optimal chain-based protocol that minimizes energy. In PEGASIS, each node communicates only with a close neighbor and takes turns transmitting to the base station, thus reducing the amount of energy spent per round. Simulation results show that PEGASIS performs better than LEACH. For many applications, in addition to minimizing energy, it is also important to consider the delay incurred in gathering sensed data. We capture this with the energy × delay metric and present schemes that attempt to balance the energy and delay cost for data gathering from sensor networks. We present two new schemes to minimize energy × delay using CDMA and non-CDMA sensor nodes. We compared the performance of direct, LEACH, and our schemes with respect to energy × delay using extensive simulations for different network sizes. Results show that our schemes perform 80 or more times better than the direct scheme and also outperform the LEACH protocol.
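The chain construction at the heart of PEGASIS can be sketched in a few lines: start from the node farthest from the base station and greedily append the nearest remaining node, so each node only talks to a close neighbor. This is an illustrative greedy construction under assumed 2-D coordinates, not the exact published algorithm or its energy model.

```python
import numpy as np

def build_chain(positions, base_station):
    """Greedy nearest-neighbor chain starting from the node farthest from the
    base station. Returns the node indices in chain order."""
    positions = np.asarray(positions, dtype=float)
    unvisited = set(range(len(positions)))
    current = int(np.argmax(np.linalg.norm(positions - base_station, axis=1)))
    chain = [current]
    unvisited.remove(current)
    while unvisited:
        rest = list(unvisited)
        d = np.linalg.norm(positions[rest] - positions[current], axis=1)
        current = rest[int(np.argmin(d))]
        chain.append(current)
        unvisited.remove(current)
    return chain

rng = np.random.default_rng(1)
nodes = rng.uniform(0, 100, size=(20, 2))          # 20 sensors in a 100x100 field
print(build_chain(nodes, base_station=np.array([50.0, 150.0])))
```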

15.
Image-based relighting (IBL) solves the illumination control problem in image-based modelling and rendering. However, it comes at the cost of a drastic increase in storage requirements, caused by the large number of reference images pre-captured under various illumination conditions. In this paper, we propose a spherical wavelet transform and wavelet transform (SWT–WLT) based approach to compress the huge IBL dataset. Two major steps are involved. First, the spherical wavelet transform (SWT) is used to reduce the correlation between different reference images. Second, the wavelet transform (WLT) is applied to compress the SW-transformed images (SWT images). Due to the locality of the SWT and WLT, the proposed method has a low memory requirement and is therefore suitable for compressing arbitrarily large datasets. Using its integer implementation, the method can be further sped up with bit-shift operations. Simulations are given to evaluate these features.
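A minimal sketch of the second (WLT) stage only, using PyWavelets to hard-threshold the detail coefficients of a single image. The input here is random data standing in for an SW-transformed reference image, and the wavelet, level, and threshold are illustrative choices rather than the paper's settings.

```python
import numpy as np
import pywt

def wlt_compress(image, wavelet='db2', level=3, thresh=0.5):
    """2-D wavelet decomposition, hard-thresholding of the detail
    coefficients, and reconstruction: the WLT step applied to one SWT image."""
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    kept = [tuple(pywt.threshold(d, thresh, mode='hard') for d in trio)
            for trio in details]
    recon = pywt.waverec2([approx] + kept, wavelet)
    return recon[:image.shape[0], :image.shape[1]]   # trim possible padding

img = np.random.default_rng(2).normal(size=(128, 128))   # stand-in SWT image
rec = wlt_compress(img)
print("RMS reconstruction error:", float(np.sqrt(np.mean((img - rec) ** 2))))
```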

16.
This paper describes a new adaptive technique for coding the transform coefficients used in block-based image compression schemes. The presence and orientation of edge information in a sub-block are used to select different quantization tables and zigzag scan paths to cater to the local image pattern. Measures of edge presence and edge orientation in a sub-block are calculated from its DCT coefficients, and each sub-block is classified into one of four edge patterns. Experimental results show that, compared to JPEG and improved HVS-based coding, the new scheme significantly increases the compression ratio without sacrificing reconstructed image quality.
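A minimal sketch of deriving edge presence and orientation from a block's DCT coefficients: energy along the first row and first column signals vertical and horizontal edge content respectively. The four-way split below is a heuristic for illustration, not the paper's classifier or its quantization-table selection.

```python
import numpy as np
from scipy.fft import dctn

def classify_block(block):
    """Classify an 8x8 sub-block from its DCT coefficient energies."""
    c = dctn(block.astype(float), norm='ortho')
    horiz_freq = np.sum(c[0, 1:] ** 2)      # horizontal frequencies -> vertical edges
    vert_freq = np.sum(c[1:, 0] ** 2)       # vertical frequencies -> horizontal edges
    diag = np.sum(c[1:, 1:] ** 2)           # mixed frequencies -> diagonal/texture
    total = horiz_freq + vert_freq + diag
    if total < 1e-3:
        return "flat"
    if diag > max(horiz_freq, vert_freq):
        return "diagonal/texture"
    return "vertical edge" if horiz_freq > vert_freq else "horizontal edge"

# A left-to-right ramp varies horizontally, i.e. vertical-edge-type content.
block = np.tile(np.arange(8), (8, 1)) * 32
print(classify_block(block))
```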

17.
J. H. Poore 《Software》1988,18(11):1017-1027
Software is a product in serious need of quality control technology. Major effort notwithstanding, software engineering has produced few metrics for aspects of software quality that have the potential of being universally applicable. The present paper suggests that, although universal metrics are elusive, metrics that are applicable and useful in a fully defined setting are readily available. A theory is presented that a well-defined software work group can articulate their operational concept of quality and derive useful metrics for that concept and their environment.

18.
We present the Nearest Subclass Classifier (NSC), which is a classification algorithm that unifies the flexibility of the nearest neighbor classifier with the robustness of the nearest mean classifier. The algorithm is based on the Maximum Variance Cluster algorithm and, as such, it belongs to the class of prototype-based classifiers. The variance constraint parameter of the cluster algorithm serves to regularize the classifier, that is, to prevent overfitting. With a low variance constraint value, the classifier turns into the nearest neighbor classifier and, with a high variance parameter, it becomes the nearest mean classifier with the respective properties. In other words, the number of prototypes ranges from the whole training set to only one per class. In the experiments, we compared the NSC to several other prototype-based methods with regard to performance and data set compression ratio. On several data sets, the NSC performed similarly to the k-nearest neighbor classifier, which is a well-established classifier in many domains. Also concerning storage requirements and classification speed, the NSC has favorable properties, so it gives a good compromise between classification performance and efficiency.
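The prototype-count trade-off the NSC exploits can be illustrated with a simple stand-in: k-means prototypes per class, swept from one prototype (the nearest mean classifier) toward many (approaching the nearest neighbor classifier). This sketch uses scikit-learn and is not the Maximum Variance Cluster algorithm or its variance-constraint regularization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

def fit_prototypes(X, y, per_class):
    """Build `per_class` prototypes for every class with k-means."""
    protos, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
        protos.append(km.cluster_centers_)
        labels.extend([c] * k)
    return np.vstack(protos), np.array(labels)

def predict(X, protos, proto_labels):
    """Assign each sample the label of its nearest prototype."""
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return proto_labels[np.argmin(d, axis=1)]

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
for per_class in (1, 5, 20):
    protos, pl = fit_prototypes(Xtr, ytr, per_class)
    acc = np.mean(predict(Xte, protos, pl) == yte)
    print(f"{per_class:>3} prototypes/class: accuracy {acc:.3f}, {len(pl)} prototypes stored")
```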

19.
This article poses the problem of detecting local artificial changes (forensics) that exhibit JPEG compression properties [1]. Known methods for detecting such changes [2–4] describe only how JPEG-compressed images differ from uncompressed ones. In this work, we developed an algorithm for detecting local image embeddings with compression properties and for determining the shifts of embedded JPEG blocks relative to the embedding coordinates, which are multiples of 8, and we also derived the dependence between the period of peaks in the spectrum of the histogram of DCT coefficients and the JPEG compression quality factor. In the course of the study, we obtained numerical results on the rates of true and false embedding detections for the developed algorithm.
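A minimal sketch of the histogram-periodicity cue: coefficients that were once quantized with step q produce a comb of peaks with period q in the DCT-coefficient histogram, which shows up as a strong component in the histogram's spectrum. The simulated data and bin settings are illustrative; this is not the published detector or its quality-factor mapping.

```python
import numpy as np

def histogram_peak_period(dct_coeffs, bins=201, value_range=(-100, 100)):
    """Estimate the period of the comb-like peaks that prior JPEG quantization
    leaves in a DCT-coefficient histogram: take the histogram's magnitude
    spectrum and read off the strongest non-DC frequency."""
    hist, _ = np.histogram(dct_coeffs, bins=bins, range=value_range)
    spectrum = np.abs(np.fft.rfft(hist - hist.mean()))
    freqs = np.fft.rfftfreq(bins, d=1.0)        # cycles per histogram bin
    peak = 1 + int(np.argmax(spectrum[1:]))     # skip the DC term
    return 1.0 / freqs[peak]                    # period in bins (~coefficient units)

# Simulated coefficients previously quantized with step 7, then lightly dithered.
rng = np.random.default_rng(3)
coeffs = 7 * np.round(rng.laplace(scale=12, size=50000) / 7) + rng.normal(0, 0.4, 50000)
print(f"estimated peak period ~{histogram_peak_period(coeffs):.1f}")
```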

20.
The authors have solved the problem of detecting local artificial changes (falsifications) with JPEG compression properties [1]. The known methods for detecting these changes [2–4] describe only the properties that distinguish JPEG-compressed images from uncompressed ones. The authors have also developed an algorithm for detecting local embeddings with compression properties in images and for determining the shifts of embedded JPEG blocks relative to the embedding coordinates, which are multiples of eight. The relationship between the period of peaks in the spectrum of the histogram of DCT coefficients and the quality factor of the JPEG compression algorithm is established. The paper presents numerical results on the rates of true and false embedding detections for the developed algorithm.

