Found 20 similar documents (search time: 15 ms)
2.
This paper introduces a shape-based similarity measure for time series data, called the angular metric for shape similarity (AMSS). Unlike most similarity or dissimilarity measures, AMSS is based not on the individual data points of a time series but on vectors that equivalently represent it. AMSS treats a time series as a vector sequence to focus on the shape of the data and compares data shapes by employing a variant of cosine similarity. By design, AMSS is expected to be robust to time and amplitude shifting and scaling, but sensitive to short-term oscillations. To address this potential drawback, ensemble learning that integrates data smoothing is adopted when AMSS is used for classification. Evaluative experiments reveal distinct properties of AMSS and its effectiveness in the ensemble framework compared with existing measures.
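The idea of comparing shapes through vectors rather than raw points can be sketched as follows. This is not the AMSS algorithm, whose details the abstract does not give; it is only a minimal illustration that assumes equal-length series, a hypothetical vector representation (unit time step, change in value), and plain cosine similarity averaged over aligned segments. All function names are made up for the example.

```python
import math

def to_vectors(series):
    """Assumed representation: one 2-D vector (time step, value change) per consecutive pair."""
    return [(1.0, b - a) for a, b in zip(series, series[1:])]

def cosine(u, v):
    """Plain cosine similarity between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector_cosine(a, b):
    """Average cosine similarity of aligned segment vectors (hypothetical helper)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("series must have equal length of at least 2")
    va, vb = to_vectors(a), to_vectors(b)
    return sum(cosine(u, v) for u, v in zip(va, vb)) / len(va)

# A constant amplitude shift leaves every segment vector unchanged.
shifted = mean_vector_cosine([1, 2, 3], [11, 12, 13])
# A rising and a falling series have orthogonal segment vectors here.
mirrored = mean_vector_cosine([1, 2, 3], [3, 2, 1])
```

Because the vectors are built from differences, this sketch ignores amplitude shifts; it does not attempt the time shifting or scaling robustness that the abstract attributes to AMSS.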
3.
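Because time series are very long, and because the value of an uncertain time series at each sampling point is itself uncertain, similarity matching and cluster mining on time series have high time complexity. To address this problem, a trend-based similarity measure and a trend-based clustering method for time series are proposed. The trend-based similarity measure maps a time series to a short sequence of trend symbols according to its overall trend of change, and computes similarity using the first-order connectivity index of each trend and the Tanimoto coefficient. The trend-based clustering method defines a trend height, iteratively partitions the trend symbol sequence into intervals and determines their trends, and builds a trend tree from the result; finally, sequences with the same trend symbols in the root node of the trend tree are grouped into one cluster. Experimental results show that: (a) the first-order connectivity indices of the five trend symbols can uniquely represent a time series; (b) the trend-based similarity measure can effectively complete similarity matching of time series in polynomial time; (c) the trend-based clustering method combines similarity measurement and clustering into a single process, with notable clustering results.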
4.
Time-profiled association mining is an important and challenging research problem that has received relatively little attention. It poses two main challenges: (i) designing a dissimilarity measure that holds the monotonicity property and can efficiently prune itemset associations, and (ii) devising approaches for estimating the prevalence values of itemset associations over time. The pioneering work on time-profiled association mining, by J.S. Yoo, uses Euclidean distance. It is widely known that this distance measure suffers under high dimensionality. Given a time-stamped transaction database, time-profiled association mining refers to the discovery of underlying and hidden time-profiled itemset associations whose true prevalence variations are similar to the user query sequence, under subset constraints that include (i) an allowable dissimilarity value, (ii) a reference query time sequence, and (iii) a dissimilarity function that can find the degree of similarity between a temporal itemset and the reference. In this paper, we propose a novel dissimilarity measure designed as a function of a product-based Gaussian membership function, extending the similarity function proposed in our earlier research (G-Spamine). Our approach, MASTER (Mining of Similar Temporal Associations), which is primarily inspired by SPAMINE, uses the dissimilarity measure proposed in this paper and the support bound estimation approach proposed in our earlier research. Expressions for computing the distance bounds of temporal patterns are designed based on the proposed measure and support estimation approach. Experiments compare the naïve, sequential, Spamine and G-Spamine approaches under various test cases to study the scalability and computational performance of the proposed approach. The experimental results demonstrate the scalability and efficiency of the proposed approach. The correctness and completeness of the proposed approach are also proved analytically.
5.
Adverse drug events (ADEs) are a major limitation of drug safety. They are often caused by inappropriate selection of dose and the concurrent use of drugs modulating each other (drug interaction). Risk assessment and prevention strategies must therefore consider co-administered drugs, individual doses, and their timing. In a new approach, we evaluated the performance of cross-correlation, commonly used in signal processing, to determine similarities in patient treatments. To achieve this, patient treatments were modeled as groups of vectors representing discrete time intervals. These vectors were cross-correlated and the results evaluated to find clusters in time courses indicating similarity in the treatment of different patients. To evaluate our algorithm, we then created a number of test cases. The focus of this article is on each treatment and its pattern in time and dosage. The algorithm successfully produces a relatively low similarity score for cases that are completely different with respect to their pattern of time and dosage but high scores when they are equal (score of 0.699) or similar (score of 0.528) in their therapies, and thus achieves a relatively high specificity (27/30). Such an approach might help to considerably reduce the problem of false alarms, which hampers most existing alerting systems for medication errors or impending ADEs.
6.
As the number of Internet servers increases rapidly, it becomes difficult to determine the relevant servers when searching for information. The authors develop a new method to rank Internet servers for Boolean queries. Their method reduces time and space complexity from exponential to polynomial in the number of Boolean terms. They contrast it with other known methods and describe its implementation.
7.
Accurately measuring document similarity is important for many text applications, e.g. document similarity search, document recommendation, etc. Most traditional similarity measures are based only on the “bag of words” of documents and can evaluate document topical similarity well. In this paper, we propose the notion of document structural similarity, which is expected to further evaluate document similarity by comparing document subtopic structures. Three related factors (i.e. the optimal matching factor, the text order factor and the disturbing factor) are proposed and combined to evaluate document structural similarity, among which the optimal matching factor plays the key role and the other two factors rely on its results. The experimental results demonstrate the high performance of the optimal matching factor for evaluating document topical similarity, which is as good as or better than that of most popular measures. The user study shows that the proposed overall measure with all three factors can further find highly similar documents among topically similar documents, and does so much better than the popular measures and other baseline structural similarity measures.
Xiaojun Wan received a B.Sc. degree in information science, an M.Sc. degree in computer science and a Ph.D. degree in computer science from Peking University, Beijing, China, in 2000, 2003 and 2006, respectively. He is currently a lecturer at the Institute of Computer Science and Technology of Peking University. His research interests include information retrieval and natural language processing.
9.
Personalized recommendation has become a pivotal aspect of online marketing and e-commerce as a means of overcoming the information overload problem. There are several recommendation techniques, but collaborative recommendation is the most effective and widely used. It relies on either item-based or user-based nearest-neighbor algorithms, which use some kind of similarity measure to assess the similarity between different users or items when generating recommendations. In this paper, we present a new similarity measure based on rating frequency and compare its performance with the most commonly used similarity measures. The applicability and use of this similarity measure for multimedia content recommendation are presented and discussed.
10.
Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and making use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, but such sequences often have similar structural features in chronologically different order. In this paper we propose SCS, a novel, effective and domain-independent method for measuring the similarity between categorical sequences, based on an original pattern matching scheme that makes it possible to capture chronological and non-chronological dependencies. SCS captures significant patterns that represent the natural structure of sequences, and reduces the influence of those which are merely noise. It constitutes an effective approach to measuring the similarity between data in the form of categorical sequences, such as biological sequences, natural language texts, speech recognition data, certain types of network transactions, and retail transactions. To show its effectiveness, we have tested SCS extensively on a range of data sets from different application fields, and compared the results with those obtained by various mainstream algorithms. The results obtained show that SCS produces results that are often competitive with domain-specific similarity approaches.
11.
E-commerce systems employ recommender systems to enhance customer loyalty and hence increase the cross-selling of products. Choosing an appropriate similarity measure is key to a recommender system's success. Based on this measure, a set of neighbors for the current active user is formed, which in turn is used to recommend unseen items to this active user. The Pearson correlation coefficient, the most popular similarity measure for memory-based collaborative recommender systems (CRS), measures how strongly two users are correlated. However, the statistics literature has introduced many other coefficients for matching two sets (vectors) that may perform better than the Pearson correlation coefficient. This paper explores the Jaccard and Dice coefficients for matching users of CRS. A more general coefficient, called the Power coefficient, is proposed in this paper; it represents a family of coefficients. Specifically, the Power coefficient offers many degrees of emphasis on the positive matches between users. However, CRS users have both positive and negative matches, and therefore these coefficients have to be modified to take negative matches into consideration. Consequently, they become more suitable for CRS research. Many experiments are carried out for all the proposed variants, and the variants are compared with the traditional approaches. The experimental results show that the proposed variants outperform the Pearson correlation coefficient and the cosine similarity measure, the most common approaches for memory-based CRS.
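For reference, the standard set-based Jaccard and Dice coefficients mentioned in the abstract can be computed as below. This is a minimal sketch of the textbook definitions only; it assumes each user is represented by the set of items they rated positively, and it does not implement the Power coefficient or the negative-match modifications the paper proposes, since the abstract does not define them.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def dice(a, b):
    """Dice coefficient: 2 |A ∩ B| / (|A| + |B|) (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total

# Hypothetical users, each described by the item ids they liked.
user_a = {1, 2, 3}
user_b = {2, 3, 4}
j = jaccard(user_a, user_b)  # 2 shared items out of 4 distinct items
d = dice(user_a, user_b)     # 2 * 2 shared items over 3 + 3 items
```

Dice weights the shared items twice, so for the same pair of users it is never smaller than Jaccard.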
12.
This research analyzes relationships between genes according to their annotations. We present a similar genes discovery system (SGDS), based upon a semantic similarity measure of gene ontology (GO) and Entrez gene, to identify groups of similar genes. To validate the proposed measure, we analyze the relationship between similarity and expression correlation for pairs of genes. We explore a number of semantic similarity measures and compute the Pearson correlation coefficient. Highly correlated genes exhibit strong similarity in the ontology taxonomies. The results show that our proposed semantic similarity measure outperforms the others and seems better suited for use in GO. We use the MAPK homogenous gene group and the MAP kinase pathway as benchmarks to tune the parameters in our system for higher accuracy. We applied the SGDS to the RON and Lutheran pathways; the results show that it is able to identify a group of similar genes and to predict novel pathways based on a group of candidate genes.
13.
This paper proposes and evaluates a new statistical discrimination measure for hidden Markov models (HMMs) extending the notion of divergence, a measure of average discrimination information originally defined for two probability density functions. Similar distance measures have been proposed for the case of HMMs, but those have focused primarily on the stationary behavior of the models. However, in speech recognition applications, the transient aspects of the models have a principal role in the discrimination process and, consequently, capturing this information is crucial in the formulation of any discrimination indicator. This paper proposes the notion of average divergence distance (ADD) as a statistical discrimination measure between two HMMs, considering the transient behavior of these models. This paper provides an analytical formulation of the proposed discrimination measure, a justification of its definition based on the Viterbi decoding approach, and a formal proof that this quantity is well defined for a left-to-right HMM topology with a final nonemitting state, a standard model for basic acoustic units in automatic speech recognition (ASR) systems. Using experiments based on this discrimination measure, it is shown that ADD provides a coherent way to evaluate the discrimination dissimilarity between acoustic models.
14.
We introduce a new methodology for measuring the degree of similarity between two intuitionistic fuzzy sets. The new method is developed on the basis of a distance defined on an interval by the use of a convex combination of endpoints, and also focuses on the properties of the min and max operators. It is shown that, in contrast to existing methods, the proposed method meets all the well-known properties of a similarity measure and has no counter-intuitive examples. The validity and applicability of the proposed similarity measure are illustrated with two examples, in pattern recognition and medical diagnosis.
16.
Due to some unreasonable results obtained from most current similarity measures for intuitionistic fuzzy sets (IFSs), we introduce a necessary condition to obtain a stronger definition of similarity measures for IFSs, and present a new similarity measure derived from a general idea of similarity measures for concepts on a lattice. In experiments, we focus on two basic directions of performance evaluation: how reasonable the proposed measure is, and how accurate it is when applied to classification problems. The experimental results show that the proposed measure is reasonable and achieves satisfactory performance on classification problems.
17.
In image processing, image similarity indices evaluate how much structural information is maintained by a processed image in relation to a reference image. Commonly used measures, such as the mean squared error (MSE) and peak signal-to-noise ratio (PSNR), ignore the spatial information (e.g. redundancy) contained in natural images, which can lead to a similarity evaluation that is inconsistent with human visual perception. Recently, a structural similarity measure (SSIM) that quantifies image fidelity through estimation of local correlations scaled by local brightness and contrast comparisons was introduced by Wang et al. (2004). This correlation-based SSIM outperforms MSE in the similarity assessment of natural images. However, as correlation only measures linear dependence, distortions from multiple sources or nonlinear image processing such as nonlinear filtering can cause SSIM to under- or overestimate the true structural similarity. In this article, we propose a new similarity measure that replaces the correlation and contrast comparisons of SSIM with a term obtained from a nonparametric test that has superior power to capture general dependence, including linear and nonlinear dependence in the conditional mean regression function as a special case. Applied to images subjected to noise contamination, filtering, and watermarking, the new similarity measure provides a more consistent measure of image structural fidelity than commonly used measures.
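The two baseline measures named in the abstract, MSE and PSNR, are simple pixel-wise quantities, which is why they cannot see spatial structure. The sketch below uses the standard definitions and assumes, for illustration only, that images are flat lists of 8-bit grayscale values (peak value 255); it does not implement SSIM or the measure proposed in the paper.

```python
import math

def mse(reference, processed):
    """Mean squared error between two equal-length pixel sequences."""
    if len(reference) != len(processed) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    return sum((r - p) ** 2 for r, p in zip(reference, processed)) / len(reference)

def psnr(reference, processed, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    error = mse(reference, processed)
    if error == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / error)

# Toy 2x2 "images" flattened to lists; one pixel differs by 2 gray levels.
ref = [0, 0, 0, 0]
out = [0, 0, 0, 2]
error = mse(ref, out)
quality = psnr(ref, out)
```

Shuffling the pixel positions of both images in the same way leaves these values unchanged, which illustrates the abstract's point that the measures ignore spatial information.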
18.
Clustering analysis of temporal gene expression data is widely used to study dynamic biological systems, for example to identify sets of genes that are regulated by the same mechanism. However, temporal gene expression data often contain noise, missing data points, and non-uniformly sampled time points, which pose challenges for traditional clustering methods in extracting meaningful information. In this paper, we introduce an improved clustering approach based on regularized spline regression and an energy-based similarity measure. The proposed approach models each gene expression profile as a B-spline expansion, for which the spline coefficients are estimated by a regularized least squares scheme on the observed data. To compensate for the inadequate information from noisy and short gene expression data, we use a gene's correlated genes as the test set to choose the optimal number of basis functions and the regularization parameter. We show that this treatment can help to avoid over-fitting. After fitting the continuous representations of the gene expression profiles, we use an energy-based similarity measure for clustering. The energy-based measure can include the temporal information and relative changes of the time series by using its first and second derivatives. We demonstrate that our method is robust to noise and can produce meaningful clustering results.
20.
This article presents a new interestingness measure for association rules called confidence gain (CG). Focus is given to extraction
of human associations rather than associations between market products. There are two main differences between the two (human
and market associations). The first difference is the strong asymmetry of human associations (e.g., the association “shampoo”
→ “hair” is much stronger than “hair” → “shampoo”), where in market products asymmetry is less intuitive and less evident.
The second is the background knowledge humans employ when presented with a stimulus (input phrase).
CG calculates the local confidence of a given term compared to its average confidence throughout a given database. CG is found
to outperform several association measures since it captures both the asymmetric notion of an association (as in the confidence
measure) while adding the comparison to an expected confidence (as in the lift measure). The use of average confidence introduces
the “background knowledge” notion into the CG measure.
Various experiments have shown that CG and local confidence gain (a low-complexity version of CG) successfully generate association
rules when compared to human free associations. The experiments include a large-scale “free association Turing test” where
human free associations were compared to associations generated by the CG and other association measures. Rules discovered
by CG were found to be significantly better than those discovered by other measures.
CG can be used for many purposes, such as personalization, sense disambiguation, query expansion, and improving classification
performance of small item sets within large databases.
Although CG was found to be useful for Internet data retrieval, results can be easily used over any type of database.
Edited by J. Srivastava
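The two standard measures that the abstract above says CG builds on, confidence and lift, can be sketched as follows. The abstract does not give the CG formula, so it is not implemented here; the transaction data and function names are hypothetical and chosen only to show the asymmetry of confidence in the “shampoo” → “hair” example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = support(A and B) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

def lift(transactions, antecedent, consequent):
    """lift(A -> B) = confidence(A -> B) / support(B)."""
    expected = support(transactions, consequent)
    if expected == 0.0:
        return 0.0
    return confidence(transactions, antecedent, consequent) / expected

# Hypothetical toy data: every record mentions "hair", half mention "shampoo".
records = [
    {"shampoo", "hair"},
    {"shampoo", "hair"},
    {"hair"},
    {"hair", "comb"},
]
forward = confidence(records, {"shampoo"}, {"hair"})   # shampoo -> hair
backward = confidence(records, {"hair"}, {"shampoo"})  # hair -> shampoo
forward_lift = lift(records, {"shampoo"}, {"hair"})
```

In this toy data the forward rule has higher confidence than the backward rule, while lift, which divides by how common the consequent already is, reports no gain over chance for the forward rule.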