Similar Documents
20 similar documents found
1.
We examine the tail behaviour and extremal cluster characteristics of two-state Markov-switching autoregressive models where the first regime behaves like a random walk, the second regime is a stationary autoregression, and the generating noise is light-tailed. Under additional technical conditions we prove that the stationary solution has an asymptotically exponential tail and the extremal index is smaller than one. The extremal index and the limiting cluster size distribution of the process are calculated explicitly for some noise distributions, and simulated for others. The practical relevance of the results is illustrated by examining extremal properties of a regime-switching autoregressive process with Gamma-distributed noise, already applied successfully in river flow modeling. The limiting aggregate excess distribution is shown to possess a Weibull-like tail in this special case.
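As a rough illustration of the process class studied here, the sketch below simulates a two-state Markov-switching AR(1) driven by centred Gamma noise and computes a crude blocks estimate of the extremal index. All parameter values are assumptions for illustration, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Regime 1 is a random walk (phi = 1), regime 2 a stationary AR (|phi| < 1),
# driven by light-tailed, centred Gamma noise.  Parameters are illustrative.
P = np.array([[0.90, 0.10],    # regime transition matrix
              [0.05, 0.95]])
phi = np.array([1.0, 0.5])     # AR coefficients: random walk vs. stationary AR
shape, scale = 2.0, 1.0        # Gamma noise parameters
mean_noise = shape * scale     # centre the noise so the random walk has no drift

T = 100_000
s = 0                          # current regime
x = np.zeros(T)
for t in range(1, T):
    s = rng.choice(2, p=P[s])
    z = rng.gamma(shape, scale) - mean_noise
    x[t] = phi[s] * x[t - 1] + z

# Crude look at the upper tail and at clustering of exceedances.
u = np.quantile(x, 0.999)
exceed = x > u
print("P(X > u) =", exceed.mean())
# Blocks estimator: blocks containing an exceedance / total exceedances.
print("extremal-index estimate:",
      exceed.reshape(-1, 100).any(axis=1).sum() / exceed.sum())
```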

2.
Low overhead analysis of large distributed data sets is necessary for current data centers and for future sensor networks. In such systems, each node holds some data value, e.g., a local sensor read, and a concise picture of the global system state needs to be obtained. In resource-constrained environments like sensor networks, this needs to be done without collecting all the data at any location, i.e., in a distributed manner. To this end, we address the distributed clustering problem, in which numerous interconnected nodes compute a clustering of their data, i.e., partition these values into multiple clusters, and describe each cluster concisely. We present a generic algorithm that solves the distributed clustering problem and may be implemented in various topologies, using different clustering types. For example, the generic algorithm can be instantiated to cluster values according to distance, targeting the same problem as the famous k-means clustering algorithm. However, the distance criterion is often not sufficient to provide good clustering results. We present an instantiation of the generic algorithm that describes the values as a Gaussian Mixture (a set of weighted normal distributions), and uses machine learning tools for clustering decisions. Simulations show the robustness, speed and scalability of this algorithm. We prove that any implementation of the generic algorithm converges over any connected topology, clustering criterion and cluster representation, in fully asynchronous settings.
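The Gaussian-mixture instantiation summarises each cluster by a weighted normal distribution. Below is a minimal sketch of one ingredient such a summary needs, namely merging two weighted components by moment matching so the description stays concise; the paper's actual merge and message-passing rules are not reproduced here.

```python
import numpy as np

def merge_components(w1, mu1, var1, w2, mu2, var2):
    """Moment-matching merge of two weighted 1-D Gaussian components.
    A hedged sketch of how a node could keep its Gaussian-mixture summary
    small; the paper's exact merge rule may differ."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    # total variance = weighted within-component variance + between-component spread
    var = (w1 * (var1 + (mu1 - mu) ** 2) + w2 * (var2 + (mu2 - mu) ** 2)) / w
    return w, mu, var

# Example: a node folds a neighbour's component into its own.
print(merge_components(10, 0.0, 1.0, 5, 3.0, 0.5))
```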

3.
Rough k-means clustering describes uncertainty by assigning some objects to more than one cluster. A rough cluster quality index based on decision theory is applicable to the evaluation of rough clustering. In this paper we analyze rough k-means clustering with respect to the selection of the threshold, the value of risk for assigning an object, and the uncertainty of objects. The analysis shows that clusters represented as interval sets with lower and upper approximations in rough k-means clustering are not adequate to describe clusters. This paper proposes an interval set clustering based on decision theory. Lower and upper approximations in the proposed algorithm are hierarchical and constructed as outer-level and inner-level approximations. Uncertainty of objects in the outer-level upper approximation is described by the assignment of objects among different clusters. Accordingly, ambiguity of objects in the inner-level upper approximation is represented by local uniform factors of the objects. In addition, interval set clustering can be improved to obtain a satisfactory clustering result with the optimal number of clusters, as well as optimal values of the parameters, by taking advantage of the usefulness of the rough cluster quality index in the evaluation of clustering. Experimental results on synthetic and standard data demonstrate how to construct clusters with satisfactory lower and upper approximations in the proposed algorithm. Experiments with retail data from a promotional campaign illustrate the usefulness of interval set clustering for improving rough k-means clustering results.
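For concreteness, the sketch below shows a Lingras-style assignment step for rough k-means, in which an object almost equally close to more than one centroid is placed only in the upper approximations of the tied clusters. The threshold value is an assumption, and the interval set clustering proposed in the paper refines this scheme further rather than being reproduced here.

```python
import numpy as np

def rough_assign(X, centroids, eps=1.3):
    """Assignment step of rough k-means (Lingras-style sketch).
    An object whose other centroids are almost as close as the nearest one
    (within a factor eps) goes only to the upper approximations of the tied
    clusters; otherwise it belongs to the lower (and upper) approximation of
    its nearest cluster.  eps is an assumed threshold."""
    k = len(centroids)
    lower = [[] for _ in range(k)]
    upper = [[] for _ in range(k)]
    for i, x in enumerate(X):
        d = np.linalg.norm(centroids - x, axis=1)
        j = int(np.argmin(d))
        near = np.where(d <= eps * d[j])[0]   # clusters "almost as close"
        if len(near) > 1:
            for c in near:
                upper[c].append(i)            # boundary object: upper only
        else:
            lower[j].append(i)
            upper[j].append(i)
    return lower, upper

X = np.array([[0, 0], [0.1, 0], [5, 5], [2.5, 2.4]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
print(rough_assign(X, C))   # the point near (2.5, 2.4) lands in both upper approximations
```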

4.
Recently, a large amount of work has been devoted to the study of spectral clustering—a simple yet powerful method for finding structure in a data set using spectral properties of an associated pairwise similarity matrix. Most existing spectral clustering algorithms estimate a single cluster number, or estimate non-unique cluster numbers based on the eigengap criterion; however, a data set does not always have a unique number of clusters, and the eigengap criterion lacks theoretical justification. In this paper, we propose a method for determining non-unique cluster numbers based on stability in spectral clustering (NCNDBS). We first apply the multiway normalized cut spectral clustering algorithm to the data set for a candidate cluster number k. Then the ratio of the multiway normalized cut criterion of the obtained clusters to the sum of the leading eigenvalues (in descending order) of the stochastic transition matrix is used as a criterion to decide whether k is a reasonable cluster number. Finally, by varying the scaling parameter in the Gaussian function, we judge whether a reasonable cluster number k is also a stable one. Through these three stages, we can determine non-unique cluster numbers for a data set. The lumpability theorem of Meilă and Xu provides a theoretical basis for our method. Illustrative experiments show that NCNDBS successfully estimates non-unique cluster numbers of a data set.
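A hedged sketch of the two quantities named above: the multiway normalized cut of a k-way partition and the sum of the k leading eigenvalues of the row-stochastic transition matrix P = D^{-1}W, whose ratio NCNDBS uses to screen candidate cluster numbers. The Gaussian-kernel similarity, scale and toy data are assumptions; the thresholding and stability stages of the method are not reproduced.

```python
import numpy as np

def mncut_ratio(W, labels, k):
    """Ratio of the multiway normalized cut of a k-way partition to the sum
    of the k leading eigenvalues of P = D^{-1} W.  W is a symmetric
    pairwise-similarity matrix."""
    d = W.sum(axis=1)
    mncut = 0.0
    for c in range(k):
        idx = labels == c
        cut = W[np.ix_(idx, ~idx)].sum()      # weight leaving cluster c
        mncut += cut / d[idx].sum()           # ... divided by its volume
    # Eigenvalues of P equal those of the symmetric D^{-1/2} W D^{-1/2}.
    L_sym = W / np.sqrt(np.outer(d, d))
    eig = np.sort(np.linalg.eigvalsh(L_sym))[::-1]
    return mncut / eig[:k].sum()

# Toy example: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
sigma = 1.0                                   # assumed Gaussian-kernel scale
W = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / (2 * sigma ** 2))
labels = np.array([0] * 20 + [1] * 20)
print(mncut_ratio(W, labels, k=2))
```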

5.
Information Fusion, 2008, 9(2): 223-233
Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we present k-ANMI, a new efficient algorithm for clustering categorical data. The k-ANMI algorithm works in a way that is similar to the popular k-means algorithm, and the goodness of clustering in each step is evaluated using a mutual information based criterion (namely, average normalized mutual information, ANMI) borrowed from cluster ensembles. This algorithm is easy to implement, requiring multiple hash tables as the only major data structure. Experimental results on real datasets show that the k-ANMI algorithm is competitive with state-of-the-art categorical data clustering algorithms with respect to clustering accuracy.
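The ANMI criterion itself is easy to state: it is the average normalized mutual information between a candidate partition and a set of base clusterings. The sketch below computes it with scikit-learn; the hash-table machinery that k-ANMI uses to search over label changes is omitted, and the toy data are an assumption.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(candidate, base_clusterings):
    """Average normalized mutual information (ANMI) between one candidate
    labelling and a set of base clusterings -- the goodness criterion the
    k-ANMI iteration tries to increase."""
    return np.mean([normalized_mutual_info_score(candidate, b)
                    for b in base_clusterings])

# For categorical data, each attribute's value column can serve as one base
# clustering, as in the cluster-ensemble view used by k-ANMI.
data = np.array([["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]])
base = [np.unique(col, return_inverse=True)[1] for col in data.T]
candidate = np.array([0, 0, 1, 1])
print(anmi(candidate, base))   # 1.0 for this perfectly aligned toy example
```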

6.
Minimizing energy dissipation and maximizing network lifetime are among the central concerns when designing applications and protocols for sensor networks. Clustering has been proven to be energy-efficient in sensor networks since data routing and relaying are operated only by cluster heads. Besides, cluster heads can process, filter and aggregate data sent by cluster members, thus reducing the network load and relieving bandwidth demand. In this paper, we propose a novel distributed clustering algorithm where cluster heads are elected following a three-way message exchange between each sensor and its neighbors. A sensor's eligibility to be elected cluster head is based on its residual energy and its degree. Our protocol has a message exchange complexity of O(1) and a worst-case convergence time complexity of O(N). Simulations show that our algorithm outperforms EESH, one of the most recently published distributed clustering algorithms, in terms of network lifetime and ratio of elected cluster heads.
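A minimal sketch of the election idea, assuming a simple weighted score of residual energy and degree with tie-breaking by node id; the protocol's three-way message exchange and its exact eligibility formula are not modelled here.

```python
import random

def elect_cluster_heads(neighbors, energy, w_energy=0.7, w_degree=0.3):
    """Local cluster-head election sketch: each sensor scores itself from its
    residual energy and degree and becomes a head if no neighbour scores
    higher.  The weights and the tie-break by node id are assumptions."""
    degree = {u: len(nbrs) for u, nbrs in neighbors.items()}
    max_e = max(energy.values())
    max_d = max(degree.values()) or 1
    score = {u: w_energy * energy[u] / max_e + w_degree * degree[u] / max_d
             for u in neighbors}
    return [u for u in neighbors
            if all((score[u], u) >= (score[v], v) for v in neighbors[u])]

# Small random topology with symmetric links.
random.seed(3)
nodes = range(8)
neighbors = {u: [v for v in nodes if v != u and random.random() < 0.4] for u in nodes}
for u in nodes:
    for v in neighbors[u]:
        if u not in neighbors[v]:
            neighbors[v].append(u)
energy = {u: random.uniform(0.2, 1.0) for u in nodes}
print(elect_cluster_heads(neighbors, energy))
```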

7.
This paper discusses the mean-square performance of a first order random sampled-data system with feedback, where the sampling times constitute a stationary point process, with independent and identically distributed sampling intervals. The paper presents some new results for the cases of periodic sampling, periodic sampling with skips, and Poisson sampling.

8.
Undirected graphs are often used to describe high-dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 penalization methods. However, current methods assume that the data are independent and identically distributed. If the distribution, and hence the graph, evolves over time, then the data are no longer identically distributed. In this paper we develop a nonparametric method for estimating time-varying graphical structure for multivariate Gaussian distributions using an ℓ1 regularization method, and show that, as long as the covariances change smoothly over time, we can estimate the covariance matrix well (in predictive risk) even when p is large.
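One standard way to realise "covariances that change smoothly over time" is to weight observations with a kernel centred at the time of interest and feed the weighted covariance to a graphical-lasso solver. The sketch below does that with scikit-learn's graphical_lasso; the bandwidth, penalty and random data are assumptions, and this generic kernel smoother stands in for, rather than reproduces, the paper's estimator.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def time_varying_graph(X, t0, bandwidth=0.1, alpha=0.05):
    """Kernel-smoothed sparse graph estimate at time t0 (times rescaled to
    [0, 1]): Gaussian weights -> weighted empirical covariance -> l1-penalised
    precision matrix."""
    n = X.shape[0]
    times = np.linspace(0, 1, n)
    w = np.exp(-0.5 * ((times - t0) / bandwidth) ** 2)
    w /= w.sum()
    Xc = X - w @ X                                # weighted centring
    S = (Xc * w[:, None]).T @ Xc                  # weighted empirical covariance
    cov, prec = graphical_lasso(S, alpha=alpha)
    return prec                                   # sparse precision = graph at t0

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
print(np.round(time_varying_graph(X, t0=0.5), 2))
```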

9.
This paper deals with an infinite-capacity multi-server queueing system with a second optional service (SOS) channel. The inter-arrival times of arriving customers, the service times of the first essential service (FES) and the SOS channel are all exponentially distributed. A customer may leave the system after the FES channel with probability (1 − θ), or the completion of the FES may immediately require an SOS with probability θ (0 ≤ θ ≤ 1). The formulae for computing the rate matrix and stationary probabilities are derived by means of a matrix-analytical approach. A cost model is developed to simultaneously determine the optimal values of the number of servers and the two service rates at the minimal total expected cost per unit time. The quasi-Newton method and the particle swarm optimization (PSO) method are employed to deal with the optimization problem. Under optimal operating conditions, numerical results are provided from which several system performance measures are calculated based on the assumed numerical values of the system parameters.
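The matrix-analytic solution is not reproduced here, but the model is easy to probe by simulation. The sketch below assumes the same server performs the optional second phase and uses an illustrative linear cost structure; both assumptions stand in for details the abstract does not give.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(lam, mu1, mu2, theta, c, n_customers=100_000):
    """FCFS simulation: Poisson(lam) arrivals, c servers; every customer gets
    an exponential(mu1) first essential service and, with probability theta,
    an exponential(mu2) second optional service on the same server
    (an assumption -- the paper's channel structure may differ)."""
    arrivals = np.cumsum(rng.exponential(1.0 / lam, n_customers))
    service = rng.exponential(1.0 / mu1, n_customers)
    service += (rng.random(n_customers) < theta) * rng.exponential(1.0 / mu2, n_customers)
    free = np.zeros(c)                     # time at which each server becomes free
    wait = np.empty(n_customers)
    for i, (a, s) in enumerate(zip(arrivals, service)):
        j = np.argmin(free)                # earliest available server (FCFS)
        start = max(a, free[j])
        wait[i] = start - a
        free[j] = start + s
    return wait.mean()

lam, theta, c = 5.0, 0.4, 3
mu1, mu2 = 3.0, 4.0
Wq = simulate(lam, mu1, mu2, theta, c)
Lq = lam * Wq                              # Little's law
cost = 15.0 * c + 2.0 * mu1 + 2.0 * mu2 + 10.0 * Lq   # illustrative cost coefficients
print(f"mean wait {Wq:.3f}, mean queue length {Lq:.3f}, cost {cost:.2f}")
```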

10.
In this paper we present a new distance metric that incorporates the distance variation in a cluster to regularize the distance between a data point and the cluster centroid. It is then applied to the conventional fuzzy C-means (FCM) clustering in data space and the kernel fuzzy C-means (KFCM) clustering in a high-dimensional feature space. Experiments on two-dimensional artificial data sets, real data sets from public data libraries and color image segmentation have shown that the proposed FCM and KFCM with the new distance metric generally have better performance on non-spherically distributed data with uneven density for linear and nonlinear separation.
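For reference, here is a compact standard FCM with a pluggable squared-distance hook, into which a cluster-dependent (e.g. variance-regularised) metric of the kind described above could be substituted. This is the conventional algorithm, not the paper's modified one, and the toy data are an assumption.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, distance=None, seed=0):
    """Plain fuzzy c-means.  'distance' may return a (c, n) matrix of squared
    cluster-to-point distances; the Euclidean default is used otherwise."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)          # centroid update
        if distance is None:
            D2 = ((X[None] - V[:, None]) ** 2).sum(-1)        # (c, n) squared distances
        else:
            D2 = distance(X, V, U)                            # custom metric hook
        D2 = np.maximum(D2, 1e-12)
        U = D2 ** (-1.0 / (m - 1))                            # membership update
        U /= U.sum(axis=0)
    return U, V

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
U, V = fcm(X, c=2)
print(np.round(V, 2))   # centroids near (0, 0) and (6, 6)
```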

11.
In this article, a cluster validity index and its fuzzification is described, which can provide a measure of goodness of clustering on different partitions of a data set. The maximum value of this index, called the PBM-index, across the hierarchy provides the best partitioning. The index is defined as a product of three factors, maximization of which ensures the formation of a small number of compact clusters with large separation between at least two clusters. We have used both the k-means and the expectation maximization algorithms as underlying crisp clustering techniques. For fuzzy clustering, we have utilized the well-known fuzzy c-means algorithm. Results demonstrating the superiority of the PBM-index in appropriately determining the number of clusters, as compared to three other well-known measures, the Davies-Bouldin index, Dunn's index and the Xie-Beni index, are provided for several artificial and real-life data sets.
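The crisp PBM-index has a closed form, PBM(K) = ((1/K)(E_1/E_K)D_K)^2, where E_1 is the total distance of points to the grand centroid, E_K the total within-cluster distance to cluster centres, and D_K the largest distance between two centres; larger values indicate a better partition. A small sketch on assumed toy data follows.

```python
import numpy as np
from sklearn.cluster import KMeans

def pbm_index(X, labels, centers):
    """PBM index for a crisp partition: ((1/K) * (E1/EK) * DK)**2."""
    K = len(centers)
    E1 = np.linalg.norm(X - X.mean(axis=0), axis=1).sum()
    EK = sum(np.linalg.norm(X[labels == k] - centers[k], axis=1).sum()
             for k in range(K))
    DK = max(np.linalg.norm(centers[i] - centers[j])
             for i in range(K) for j in range(i + 1, K))
    return ((E1 / EK) * DK / K) ** 2

# Compare K = 2 and K = 3 on two well-separated blobs (K = 2 should score higher).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
for k in (2, 3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(pbm_index(X, km.labels_, km.cluster_centers_), 2))
```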

12.
In some populations the AIDS/HIV incidence rate θ∈(0,∞) is altered in the middle of a data collection period due to preventive treatments imposed by health service agencies. The intervened Poisson (IP) model of [Comput. Programs Biomed. 17 (1983) 89; Biometrics 48 (1985) 559] was introduced as an appropriate way to analyze data of this type. However, the classical approach leading to the maximum likelihood (ML), moment (M) or minimum variance unbiased (MVU) estimator of θ is mathematically formidable and practically inconvenient when it comes to sequentially updating the estimate as new data arrive. Previous subjective Bayesian work has been done to overcome these issues. Hence, there is a need to devise a more practical empirical Bayesian technique to estimate θ, and this is done in this article. The results are illustrated using data on AIDS/HIV incidence in the state of Alabama. Advantages of the Bayesian intervened approach are cited.

13.
The fuzzy c-means (FCM) algorithm is an important clustering method in pattern recognition, and the fuzziness parameter m in the FCM algorithm is a key parameter that can significantly affect the clustering result. A cluster validity index (CVI) is a criterion function used to validate clustering results and thereby determine the optimal number of clusters for a data set. From the perspective of cluster validation, we propose a novel method to select the optimal value of m in FCM, using four well-known CVIs for fuzzy clustering, namely XB, VK, VT, and SC. In this method, the optimal value of m is determined when the CVIs reach their minimum values. Experimental results on four synthetic data sets and four real data sets demonstrate that a suitable range for m is [2, 3.5] and that the optimal interval is [2.5, 3].
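A sketch of the selection idea using one of the four indices (Xie-Beni): run FCM over a grid of m values and keep the m at which the index is smallest. The grid, cluster number and data are assumptions, and the compact FCM below repeats the standard updates shown in the earlier sketch.

```python
import numpy as np

def fcm(X, c, m, n_iter=100, seed=0):
    """Compact standard FCM (same centroid/membership updates as above)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        D2 = np.maximum(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)
        U = D2 ** (-1.0 / (m - 1))
        U /= U.sum(axis=0)
    return U, V

def xie_beni(X, U, V, m):
    """Xie-Beni index: fuzzy compactness over separation; smaller is better."""
    D2 = ((X[None] - V[:, None]) ** 2).sum(-1)
    sep = min(((V[i] - V[j]) ** 2).sum()
              for i in range(len(V)) for j in range(len(V)) if i != j)
    return ((U ** m) * D2).sum() / (len(X) * sep)

# Pick the fuzzifier m that minimises the index (grid and c = 2 are assumptions).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(6, 1, (150, 2))])
scores = {m: xie_beni(X, *fcm(X, c=2, m=m), m) for m in np.arange(1.5, 4.01, 0.5)}
print(min(scores, key=scores.get), scores)
```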

14.
We introduce a top news list model based on extremal shot noise with Poisson arrival flow. We find one- and multi-dimensional distributions of popularity of current news (at arbitrary time and at infinity), as well as distributions of places of news in a top list and their sojourn times in a stationary regime. We consider in detail the case where the popularity of a news item is Pareto distributed at the initial time and then decreases exponentially.
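For intuition, the sketch below simulates this model with assumed rate, Pareto exponent and decay parameters: items arrive as a Poisson flow, start with a Pareto-distributed popularity, decay exponentially, and the top list ranks the currently most popular items.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed parameters: arrival rate, Pareto tail index, exponential decay rate.
lam, alpha, gamma, horizon = 2.0, 2.5, 0.3, 1_000.0

n = rng.poisson(lam * horizon)
arrival = np.sort(rng.uniform(0, horizon, n))    # Poisson arrivals on [0, horizon]
v0 = (1 - rng.random(n)) ** (-1 / alpha)         # Pareto(alpha) initial popularity

def top_list(t, size=10):
    """Items ranked by current popularity v0 * exp(-gamma * age) at time t."""
    alive = arrival <= t
    pop = v0[alive] * np.exp(-gamma * (t - arrival[alive]))
    order = np.argsort(pop)[::-1][:size]
    return np.flatnonzero(alive)[order], pop[order]

ids, pops = top_list(horizon)
print("top-10 item ids:", ids)
print("their popularities:", np.round(pops, 3))
```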

15.
Normalized Cuts is a state-of-the-art spectral method for clustering. By applying spectral techniques, the data become easier to cluster, and k-means is then classically used. Unfortunately, the number of clusters must be set manually, and the method is very sensitive to initialization. Moreover, k-means tends to split large clusters, to merge small clusters, and to favor convex-shaped clusters. In this work we present a new clustering method which is parameterless, independent of the original data dimensionality and of the shape of the clusters. It only takes into account inter-point distances and it has no random steps. The combination of the proposed method with Normalized Cuts proved successful in our experiments.

16.
Sun Lin, Qin Xiaoying, Xu Jiucheng, Xue Zhan'ao. Journal of Software (软件学报), 2022, 33(4): 1390-1411
Density peak clustering (DPC) is a simple and effective method of clustering analysis. In practice, however, for data sets with large differences in inter-cluster density or with multiple density peaks inside a cluster, DPC has difficulty selecting the correct cluster centers; in addition, the point-assignment method in DPC suffers from a domino effect. To address these problems, a density peak clustering algorithm based on K-nearest neighbors (KNN) and an optimized assignment strategy is proposed. First, candidate cluster centers are determined from KNN, the local density of points, and boundary points; a path distance is defined to reflect the similarity between candidate cluster centers, and density and distance factors based on the path distance are proposed to quantify the possibility of a candidate being a true cluster center, from which the cluster centers are determined. Then, to improve the accuracy of point assignment, a similarity measure is constructed from shared nearest neighbors, high-density nearest neighbors, density differences, and distances between KNN, and the concepts of neighborhood, similar set, and similar domain are introduced to assist the assignment; an initial clustering result is determined from the similar domains and boundary points, and an intermediate clustering result is obtained from the cluster centers. Finally, based on the intermediate clustering result and the similar sets, each cluster is divided into multiple layers from the cluster center to the cluster boundary, and a point-assignment strategy is designed for each layer; for points in a given layer, an active value based on the similar domain and the active domain is proposed to determine the assignment order, and each point is assigned to the dominant cluster in its active domain, yielding the final clustering result. Simulations on 11 synthetic data sets and 27 real data sets ...
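For background, the sketch below shows the core density-peaks computation with a KNN-based local density (a generic DPC sketch with assumed parameters, not the KNN-and-optimized-assignment algorithm proposed in the paper): density rho, distance-to-higher-density delta, centers chosen by the largest rho * delta, and remaining points inheriting the label of their nearest higher-density neighbor.

```python
import numpy as np

def dpc_knn(X, k=10, n_clusters=2):
    """Generic density-peaks clustering with a KNN-based density estimate."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    knn_d = np.sort(D, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn_d ** 2).mean(axis=1))        # KNN-based local density
    order = np.argsort(rho)[::-1]                   # points by decreasing density
    delta = np.empty(len(X))
    parent = np.arange(len(X))
    for rank, i in enumerate(order):
        if rank == 0:
            continue
        higher = order[:rank]                       # points of higher density
        j = higher[np.argmin(D[i, higher])]
        delta[i], parent[i] = D[i, j], j
    delta[order[0]] = D.max() + 1.0                 # densest point gets the largest delta
    centres = np.argsort(rho * delta)[::-1][:n_clusters]
    labels = np.full(len(X), -1)
    labels[centres] = np.arange(n_clusters)
    for i in order:                                 # assign in decreasing-density order
        if labels[i] < 0:
            labels[i] = labels[parent[i]]
    return labels, centres

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])
labels, centres = dpc_knn(X)
print(centres, np.bincount(labels))
```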

17.
Document clustering using synthetic cluster prototypes
The use of centroids as prototypes for clustering text documents with the k-means family of methods is not always the best choice for representing text clusters due to the high dimensionality, sparsity, and low quality of text data. Especially in cases where we seek clusters with a small number of objects, the use of centroids may lead to poor solutions near bad initial conditions. To overcome this problem, we propose the idea of a synthetic cluster prototype that is computed by first selecting a subset of cluster objects (instances), then computing the representative of these objects and finally selecting important features. In this spirit, we introduce the MedoidKNN synthetic prototype that favors the representation of the dominant class in a cluster. These synthetic cluster prototypes are incorporated into the generic spherical k-means procedure, leading to a robust clustering method called k-synthetic prototypes (k-sp). Comparative experimental evaluation demonstrates the robustness of the approach, especially for small datasets and clusters overlapping in many dimensions, and its superior performance against traditional and subspace clustering methods.

18.
When representing DNA molecules as words, it is necessary to take into account the fact that a word u encodes basically the same information as its Watson-Crick complement θ(u), where θ denotes the Watson-Crick complementarity function. Thus, an expression which involves only a word u and its complement can still be considered as a repeating sequence. In this context, we define and investigate the properties of a special class of primitive words, called pseudo-primitive words relative to θ or simply θ-primitive words, which cannot be expressed as such repeating sequences. For instance, we prove the existence of a unique θ-primitive root of a given word, and we give some constraints forcing two distinct words to share their θ-primitive root. Also, we present an extension of the well-known Fine and Wilf theorem, for which we give an optimal bound.
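The central notions are easy to make concrete: θ is the antimorphic involution mapping a DNA word to its reverse complement, and a word is θ-primitive if it cannot be written as a concatenation of two or more blocks, each equal to some shorter word t or to θ(t). A brute-force illustrative check (the paper's results concern structural theory, not this enumeration):

```python
# Watson-Crick complementarity as an antimorphic involution and a
# brute-force theta-primitivity test over the DNA alphabet.
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(u: str) -> str:
    """Antimorphic involution: reverse the word and complement each letter."""
    return "".join(COMP[a] for a in reversed(u))

def is_theta_primitive(w: str) -> bool:
    n = len(w)
    for d in range(1, n // 2 + 1):      # candidate block length, at least 2 blocks
        if n % d:
            continue
        t, tt = w[:d], theta(w[:d])
        blocks = [w[i:i + d] for i in range(0, n, d)]
        if all(b in (t, tt) for b in blocks):
            return False                # w lies in {t, theta(t)}^+ with >= 2 factors
    return True

print(theta("ACGT"))                    # ACGT is its own theta-image
print(is_theta_primitive("ACGACG"))     # False: (ACG)(ACG)
print(is_theta_primitive("ACGCGT"))     # False: t = ACG, theta(t) = CGT
print(is_theta_primitive("ACGTTG"))     # True: no such decomposition exists
```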

19.
Major problems exist in both crisp and fuzzy clustering algorithms. The fuzzy c-means family of algorithms uses weights determined by a power m of inverse distances that remains fixed over all iterations and over all clusters, even though smaller clusters should have a larger m. Our method uses a different “distance” for each cluster that changes over the early iterations to fit the clusters. Comparisons show improved results. We also address other perplexing problems in clustering: (i) find the optimal number K of clusters; (ii) assess the validity of a given clustering; (iii) prevent the selection of seed vectors as initial prototypes from affecting the clustering; (iv) prevent the order of merging from affecting the clustering; and (v) permit the clusters to form more natural shapes rather than forcing them into normed balls of the distance function. We employ a relatively large number K of uniformly randomly distributed seeds and then thin them to leave fewer uniformly distributed seeds. Next, the main loop iterates by assigning the feature vectors and computing new fuzzy prototypes. Our fuzzy merging then merges any clusters that are too close to each other. We use a modified Xie-Beni validity measure as the goodness of clustering measure for multiple values of K in a user-interaction approach where the user selects two parameters (for eliminating clusters and merging clusters after viewing the results thus far). The algorithm is compared with the fuzzy c-means on the iris data and on the Wisconsin breast cancer data.

20.
Recent experimental studies have revealed that a large percentage of wireless links are lossy and unreliable for data delivery in wireless sensor networks (WSNs). Such findings raise new challenges for the design of clustering algorithms in WSNs in terms of data reliability and energy efficiency. In this paper, we propose distributed clustering algorithms for lossy WSNs with a mobile collector, where the mobile collector moves close to each cluster head to receive data directly and then uploads the collected data to the base station. We first consider constructing one-hop clusters in lossy WSNs, where all cluster members are within the direct communication range of their cluster heads. We formulate the problem as an integer program, aiming at maximizing the network lifetime, which is defined as the number of rounds of data collection until the first node dies. We then prove that the problem is NP-hard. After that, we propose a metric-based distributed clustering algorithm to solve the problem. We adopt a metric called selection weight for each sensor node that indicates both the link qualities around the node and its capability of being a cluster head. We further extend the algorithm to multi-hop clustering to achieve better scalability. We found that the performance of the one-hop clustering algorithm in small WSNs is very close to the optimal results obtained by mathematical tools. We have conducted extensive simulations for large WSNs, and the results demonstrate that the proposed clustering algorithms can significantly improve the data reception ratio, reduce the total energy consumption in the network and prolong the network lifetime compared to a typical distributed clustering algorithm, HEED, which does not consider lossy links.
