Similar Articles
 Found 20 similar articles (search time: 15 ms)
1.
Many practical problems in computer science require the knowledge of the most frequently occurring items in a data set. Current state-of-the-art algorithms for frequent items discovery are either fully centralized or rely on node hierarchies which are inflexible and prone to failures in massively distributed systems. In this paper we describe a family of gossip-based algorithms that efficiently approximate the most frequent items in large-scale distributed datasets. We show, both analytically and using real-world datasets, that our algorithms are fast, highly scalable, and resilient to node failures.
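The gossip-based approach can be illustrated with a minimal sketch. The function name `gossip_topk`, the synchronous round loop, and pairwise averaging of per-item counts are assumptions for illustration, not the paper's actual protocol family:

```python
import random
from collections import Counter

def gossip_topk(local_counts, k, rounds=50, seed=0):
    """Approximate the global top-k items by pairwise gossip averaging.

    local_counts: list of Counter objects, one per node (hypothetical setup).
    Each step, two random nodes average their per-item estimates; every
    node's vector thus converges toward the global mean frequency, whose
    top-k matches the top-k of the global sum.
    """
    rng = random.Random(seed)
    n = len(local_counts)
    est = [dict(c) for c in local_counts]
    for _ in range(rounds * n):
        a, b = rng.sample(range(n), 2)
        for item in set(est[a]) | set(est[b]):
            avg = (est[a].get(item, 0.0) + est[b].get(item, 0.0)) / 2.0
            est[a][item] = est[b][item] = avg
    # Any single node's estimate now approximates the global average.
    ranked = sorted(est[0].items(), key=lambda kv: -kv[1])
    return [item for item, _ in ranked[:k]]
```

Because pairwise averaging preserves the per-item sum across nodes, every node converges to the same ranking without any central coordinator or node hierarchy.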

2.
Smart services, one of the most intriguing areas of current Internet of Things (IoT) research, require improvement in terms of recognizing user activities. Sound is a useful medium for making decisions based on activity recognition in the smart home environment, which includes mobile devices such as sensors and actuators. Instead of visual sensors to recognize human activity, acoustic sensor data is acquired in an unobtrusive manner for greater privacy. However, multiuser activity provides a formidable challenge for acoustic data-based activity recognition systems because of the difficulty of identifying multiple sources of activity from among a variety of sounds. In our study, we propose a statistical method to detect the interval of interference, which is also known as the unexpected mesa, distinguishing activities based on the pre- and post-mesa intervals. The results suggest that the proposed method outperforms previously presented classification algorithms in terms of the accuracy of multiuser activity recognition. Future studies may utilize this method for improvement of existing smart home systems.
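A toy version of interval-of-interference detection on a 1-D energy signal follows; the mean-plus-z-standard-deviations threshold is an assumed stand-in for the paper's statistical test, used here only to show how a mesa splits the signal into pre- and post-mesa intervals:

```python
def find_mesa(signal, z=2.0):
    """Locate the interference interval ("unexpected mesa") in an energy signal.

    A minimal sketch: the mesa is taken to be the longest contiguous run of
    samples exceeding mean + z * std (an assumption, not the paper's exact
    statistic). Returns (start, end) indices with end exclusive, or None if
    no sample exceeds the threshold.
    """
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    thresh = mean + z * var ** 0.5
    best = None
    start = None
    for i, x in enumerate(signal + [float("-inf")]):  # sentinel flushes the last run
        if x > thresh:
            if start is None:
                start = i
        elif start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best
```

Samples before `start` and after `end` would then be classified separately as the pre- and post-mesa activities.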

3.
4.
Energy consumption in datacenters has recently become a major concern due to rising operational costs and scalability issues. Recent solutions to this problem propose the principle of energy proportionality, i.e., the amount of energy consumed by the server nodes must be proportional to the amount of work performed. For data parallelism and fault tolerance purposes, most common file systems used in MapReduce-type clusters maintain a set of replicas for each data block. A covering subset is a group of nodes that together contain at least one replica of the data blocks needed for performing computing tasks. In this work, we develop and analyze algorithms to maintain energy proportionality by discovering a covering subset that minimizes energy consumption while placing the remaining nodes in low-power standby mode in a data parallel computing cluster. Our algorithms can also discover covering subsets in heterogeneous computing environments. In order to allow more data parallelism, we generalize our algorithms so that they can discover a k-covering subset, i.e., a set of nodes that contain at least k replicas of the data blocks. Our experimental results show that we can achieve substantial energy savings without significant performance loss in diverse cluster configurations and working environments.
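The k-covering-subset idea can be sketched with a greedy set-cover heuristic. This is an illustrative assumption, not the paper's algorithm: the paper additionally minimizes energy and handles heterogeneous nodes, whereas plain greedy set cover only approximates the smallest covering subset:

```python
def covering_subset(replicas, k=1):
    """Greedy approximation of a k-covering subset of nodes.

    replicas: dict mapping node -> set of block ids it stores (hypothetical
    layout). Returns a set of nodes that together hold at least k replicas
    of every block, repeatedly picking the node that covers the largest
    remaining demand (classic greedy set cover, a ln(n)-approximation).
    """
    blocks = set().union(*replicas.values())
    need = {b: k for b in blocks}  # replicas still required per block
    chosen = set()
    while any(c > 0 for c in need.values()):
        node = max((n for n in replicas if n not in chosen),
                   key=lambda n: sum(1 for b in replicas[n] if need[b] > 0))
        gain = sum(1 for b in replicas[node] if need[b] > 0)
        if gain == 0:
            raise ValueError("fewer than k replicas exist for some block")
        chosen.add(node)
        for b in replicas[node]:
            if need[b] > 0:
                need[b] -= 1
    return chosen
```

Nodes outside the returned subset are the ones that could be placed in low-power standby without making any block unavailable.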

5.
The objective of this note is to present a solution to the decentralized estimation and control problem for linear discrete-time varying systems, composed of overlapping subsystems. The solution is based on the expansion-contraction framework of the inclusion principle. It is shown how decentralized estimation and control laws can be independently computed for the expanded system, and then contracted for implementation in the original system to satisfy the overlapping information structure constraint.

6.
Until recently, the aim of most text-mining work has been to understand major topics and clusters. Minor topics and clusters have been relatively neglected even though they may represent important information on rare events. We present a novel method for exploring overlapping clusters of heterogeneous sizes, which is based on vector space modeling, covariance matrix analysis, random sampling, and dynamic re-weighting of document vectors in massive databases. Our system addresses a combination of difficult issues in database analysis, such as synonymy and polysemy, identification of minor clusters, accommodation of cluster overlap, automatic labeling of clusters based on their document contents, and the user-controlled trade-off between speed of computation and quality of results. We conducted implementation studies with news articles from the Reuters and LA Times TREC data sets and artificially generated data with a known cluster structure to demonstrate the effectiveness of our system. Mei Kobayashi received a Bachelor's degree in Chemistry from Princeton and Master's and Ph.D. degrees in Pure and Applied Mathematics from UC Berkeley. She was a student intern in Frick Chemical Laboratory at Princeton, the Biochemical and Math-Physics divisions of Lawrence Berkeley Laboratories, and IBM Research. She has been a Researcher at IBM since 1988 and has been involved in projects ranging from inverse problems, airflow simulation and graphics to speech signal analysis using wavelets. Her most recent work has been on information retrieval, data mining, and unstructured information management. She has served on the Editorial Board of the Bulletin of Japan SIAM and Technical Program Committees of the SIAM Data Mining Conference, SIAM Text Mining Workshops, and Symposiums on Wavelets sponsored by the Japanese Ministry of Education. From 1996 to 1999, she was a Visiting Associate Professor at the Graduate School for Mathematical Sciences of the University of Tokyo. 
Masaki Aono received Bachelor's and Master's of Science degrees in Information Science from the University of Tokyo and a Ph.D. in Computer Science from Rensselaer Polytechnic Institute. He worked for IBM Research, Tokyo Research Laboratory from 1984 to 2003. He is currently a Professor in the Information and Computer Sciences Department at the Toyohashi University of Technology, where he teaches object-oriented programming, logic circuits, computer architecture, and knowledge data engineering. His current research interests include text and data mining, information extraction, the semantic web, and information visualization. His most recent work, on time-series data mining from human body bio-signals obtained by microsensors, has been selected to be part of the 21st Century Center of Excellence Program sponsored by the Japanese government. He has been a Japanese delegate to the ISO/IEC JTC1 SC24 Standards Committee since 1996.

7.
Outlier detection has attracted substantial attention in many applications and research areas; some of the most prominent applications are network intrusion detection and credit card fraud detection. Many of the existing approaches are based on calculating distances among the points in the dataset. These approaches cannot easily adapt to current datasets that usually contain a mix of categorical and continuous attributes, and may be distributed among different geographical locations. In addition, current datasets usually have a large number of dimensions. These datasets tend to be sparse, and traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose a fast distributed outlier detection strategy intended for datasets containing mixed attributes. The proposed method takes into consideration the sparseness of the dataset, and is experimentally shown to be highly scalable with the number of points and the number of attributes in the dataset. Experimental results show that the proposed outlier detection method compares very favorably with other state-of-the-art outlier detection strategies proposed in the literature and that the speedup achieved by its distributed version is very close to linear.
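The motivation for mixed-attribute scoring can be made concrete with a toy anomaly score. This is a hedged illustration, not the paper's algorithm: categorical attributes contribute the inverse frequency of each value (rare values score high), and continuous attributes contribute the absolute z-score, so no single Euclidean distance over mixed data is needed:

```python
from collections import Counter

def outlier_scores(records, cat_idx, num_idx):
    """Score each record for outlierness over mixed attributes.

    records: list of tuples; cat_idx / num_idx list the positions of
    categorical and continuous attributes (hypothetical interface).
    Higher score = more anomalous.
    """
    n = len(records)
    freq = [Counter(r[i] for r in records) for i in cat_idx]
    means, stds = [], []
    for i in num_idx:
        col = [r[i] for r in records]
        m = sum(col) / n
        v = sum((x - m) ** 2 for x in col) / n
        means.append(m)
        stds.append(v ** 0.5 or 1.0)  # guard against zero variance
    scores = []
    for r in records:
        s = sum(1.0 - freq[j][r[i]] / n for j, i in enumerate(cat_idx))
        s += sum(abs(r[i] - means[j]) / stds[j] for j, i in enumerate(num_idx))
        scores.append(s)
    return scores
```

Because both terms are simple per-attribute aggregates, each site in a distributed setting could compute partial frequencies and moments locally and merge them, which is the intuition behind the near-linear speedup claimed above.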

8.
Multimedia Tools and Applications - Undirected graphs and symmetric square matrices are frequently found in various domains. An example is character co-occurrence matrices in digital humanities....

9.
10.
Calculation algorithms for the realization of gradient methods based on the solution to direct and adjoint problems in weak formulations are proposed for a number of complex inverse problems of estimating the parameters of multicomponent elliptic-pseudoparabolic distributed systems. The proposed approach makes it unnecessary to construct Lagrange functionals in explicit form and to use Green functions.

11.
12.
Energy management for large-scale clusters has been the subject of significant research attention in recent years. The principle of energy proportionality states that we can save energy by activating only a subset of cluster nodes, in proportion to the current load. However, achieving energy proportionality in shared-nothing clusters is challenging, because the arbitrary deactivation of nodes would make some data become unavailable. In this paper, we propose a new algorithm, named popularity-based covering sets (PCS), to achieve energy proportionality in large-scale shared-nothing clusters. PCS determines the set of active nodes dynamically, in order to achieve the design goals of (a) guaranteeing a minimum level of availability for every data item so that any job can execute promptly, and (b) providing more replicas for popular data to mitigate contention on the data. This differs from previous studies, in which some data may become unavailable or every data item receives the same number of replicas. Furthermore, PCS is rack-aware, and thus it can reduce the energy consumption of power-hungry rack components. Experiment results indicate that PCS improves the overall energy savings by up to 62% compared to previous algorithms without significant performance loss.
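Design goals (a) and (b) can be sketched as a per-block replica target. The rule below is hypothetical, not PCS's exact formula: every block keeps at least a minimum number of active replicas for availability, and more popular blocks keep more of their copies on active nodes:

```python
import math

def replica_targets(popularity, total_replicas, min_avail=1):
    """Active replica count per block, in the spirit of PCS.

    popularity: dict mapping block id -> access share in [0, 1]
    (assumed input). Each block gets between min_avail and
    total_replicas active replicas, scaled by its popularity.
    """
    return {b: max(min_avail,
                   min(total_replicas, math.ceil(p * total_replicas)))
            for b, p in popularity.items()}
```

An active-node set would then be chosen so that each block appears on at least its target number of active nodes, preferably concentrated in few racks so that whole racks can be powered down.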

13.
Evolving clusters in gene-expression data
Clustering is a useful exploratory tool for gene-expression data. Although successful applications of clustering techniques have been reported in the literature, there is no method of choice in the gene-expression analysis community. Moreover, there are only a few works that deal with the problem of automatically estimating the number of clusters in bioinformatics datasets. Most clustering methods require the number k of clusters to be either specified in advance or selected a posteriori from a set of clustering solutions over a range of k. In both cases, the user has to select the number of clusters. This paper proposes improvements to a clustering genetic algorithm that is capable of automatically discovering an optimal number of clusters and its corresponding optimal partition based upon numeric criteria. The proposed improvements are mainly designed to enhance the efficiency of the original clustering genetic algorithm, resulting in two new clustering genetic algorithms and an evolutionary algorithm for clustering (EAC). The original clustering genetic algorithm and its modified versions are evaluated in several runs using six gene-expression datasets in which the right clusters are known a priori. The results illustrate that all the proposed algorithms perform well in gene-expression data, although statistical comparisons in terms of the computational efficiency of each algorithm point out that EAC outperforms the others. Statistical evidence also shows that EAC is able to outperform a traditional method based on multiple runs of k-means over a range of k.

14.
15.
16.
The paper discusses the separation of partially overlapping data packets by an antenna array in narrowband communication systems. This problem occurs in asynchronous communication systems and several transponder systems such as Radio Frequency Identification (RFID) for wireless tags, Automatic Identification System (AIS) for ships, and Secondary Surveillance Radar (SSR) and Automatic Dependent Surveillance-Broadcast (ADS-B) for aircraft. Partially overlapping data packets also occur as inter-cell interference in mutually unsynchronized communication systems. Arbitrary arrival times of the overlapping packets cause nonstationary scenarios and make it difficult to identify the signals using standard blind beamforming techniques. After selecting an observation interval, we propose subspace-based algorithms to suppress partially present (interfering) packets, as a preprocessing step for existing blind beamforming algorithms that assume stationary (fully overlapping) sources. The proposed algorithms are based on a subspace intersection, computed using a generalized singular value decomposition (GSVD) or a generalized eigenvalue decomposition (GEVD). In the second part of the paper, the algorithm is refined using a recently developed subspace estimation tool, the Signed URV algorithm, which is closely related to the GSVD but can be computed non-iteratively and allows for efficient subspace tracking. Simulation results show that the proposed algorithms significantly improve the performance of classical algorithms designed for block stationary scenarios in cases where asynchronous co-channel interference is present. An example on experimental data from the AIS ship transponder system confirms the effectiveness of the proposed algorithms in a real application.

17.
In this paper we address confidentiality issues in distributed data clustering, particularly the inference problem. We present the KDEC-S algorithm for distributed data clustering, which is shown to provide mining results while preserving the confidentiality of the original data. We also present a confidentiality framework with which we can state the confidentiality level of KDEC-S. The underlying idea of KDEC-S is to use an approximation of density estimation such that the original data cannot be reconstructed to a given extent.
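The underlying idea, sharing a coarsened density estimate instead of raw points, can be sketched in one dimension. The grid, bandwidth, and mode-finding rule below are illustrative assumptions; KDEC-S's actual approximation and confidentiality guarantees are more involved:

```python
import math

def local_density_grid(data, grid, bandwidth=1.0):
    """Coarsened Gaussian kernel density estimate sampled on a fixed grid.

    A site shares only these grid samples, never its raw points, so the
    originals cannot be reconstructed beyond grid/bandwidth resolution.
    """
    return [sum(math.exp(-((g - x) / bandwidth) ** 2 / 2) for x in data)
            for g in grid]

def global_cluster_modes(site_grids, grid):
    """Sum the sites' density grids and return grid points that are local maxima."""
    total = [sum(vals) for vals in zip(*site_grids)]
    return [grid[i] for i in range(1, len(total) - 1)
            if total[i] > total[i - 1] and total[i] > total[i + 1]]
```

Each mode of the summed density acts as a global cluster center; every site can then assign its own points to the nearest mode locally, without exchanging data.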

18.
Modern computing architectures based on distributed data centers (DCs) provide flexible management of network topology and functions in real time. This approach avoids the excess costs of expensive equipment and network maintenance. To provide resiliency of a DC's network infrastructure, effective fast-rerouting algorithms are needed. In this work we propose an adaptive rerouting algorithm for data flows in distributed DCs based on paired shifts of data.

19.
Handling of incomplete data sets using ICA and SOM in data mining
Based on independent component analysis (ICA) and self-organizing maps (SOM), this paper proposes an ISOM-DH model for handling incomplete data in data mining. When the data are dependent and non-Gaussian, this model can make full use of the information in the given data to estimate the missing values and can visualize the handled high-dimensional data. Compared with the mixture of principal component analyzers (MPCA), the mean method, and the standard SOM-based fuzzy map model, the ISOM-DH model can be applied to more cases, demonstrating its superiority. The correctness and reasonableness of the ISOM-DH model are also validated by the experiment carried out in this paper.
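The imputation step can be sketched with a deliberately simplified stand-in: ISOM-DH uses ICA and a self-organizing map, whereas the sketch below copies a missing component from the complete record closest in the observed dimensions. The function name and `None`-as-missing convention are assumptions:

```python
def impute_nearest(records, missing=None):
    """Fill missing values from the nearest complete record.

    records: list of numeric lists/tuples where a missing component equals
    `missing` (None by default). Distance is computed only over the
    dimensions observed in the incomplete record.
    """
    complete = [r for r in records if missing not in r]
    out = []
    for r in records:
        if missing not in r:
            out.append(list(r))
            continue
        obs = [i for i, v in enumerate(r) if v is not missing]
        best = min(complete,
                   key=lambda c: sum((r[i] - c[i]) ** 2 for i in obs))
        out.append([best[i] if v is missing else v for i, v in enumerate(r)])
    return out
```

A SOM-based variant would match the incomplete record to its best matching unit instead of a single neighbor, which smooths the estimate over a neighborhood of similar records.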

20.
The theoretical aspects of statistical inference with imprecise data, with a focus on random sets, are considered. In the setting of coarse data analysis, the imprecision and randomness in observed data are exhibited, and the relationship between probability and other types of uncertainty, such as belief functions and possibility measures, is analyzed. Coarsening schemes are viewed as models for perception-based information-gathering processes in which random fuzzy sets appear naturally. As an implication, fuzzy statistics is statistics with fuzzy data. That is, fuzzy sets are a new type of data and, as such, complementary to statistical analysis in the sense that they enlarge the domain of applications of statistical science.
