Similar Documents
20 similar documents were found for this query.
1.
Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well-known model for dimensionality reduction in statistical analysis, and computing it requires a large number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm that can compute PCA in one pass over a large data set based on summarization matrices. We also study how to integrate our algorithm with a DBMS; our solution is based on a combination of parallel data set summarization via user-defined aggregations and calling the MKL parallel variant of the LAPACK library to solve Singular Value Decomposition (SVD) in RAM. Our algorithm is theoretically shown to achieve linear speedup, linear scalability on data size, and quadratic time on dimensionality (but in RAM), spending most of the time on data set summarization, despite the fact that SVD has cubic time complexity on dimensionality. Experiments with large data sets on multicore CPUs show that our solution is much faster than the R statistical package and than solving PCA with SQL queries. Benchmarking on multicore CPUs and a parallel DBMS running on multiple nodes confirms linear speedup and linear scalability.
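The summarization idea in this abstract can be illustrated with a minimal single-machine sketch, assuming the usual choice of summarization matrices (the count n, the linear sum L, and the quadratic sum Q): each data block contributes additively to n, L, and Q, which is what makes the pass parallelizable, and PCA then reduces to an SVD of the small d × d covariance matrix held in RAM. Function names and the NumPy toolchain are illustrative, not the paper's actual DBMS/MKL implementation.

```python
import numpy as np

def summarize(chunks):
    """One pass over the data: accumulate n, L = sum(x), Q = sum(x x^T).
    Per-chunk sums combine by addition, so chunks can be processed in parallel."""
    n, L, Q = 0, None, None
    for X in chunks:                       # X is an (m, d) block of rows
        if L is None:
            d = X.shape[1]
            L, Q = np.zeros(d), np.zeros((d, d))
        n += X.shape[0]
        L += X.sum(axis=0)
        Q += X.T @ X
    return n, L, Q

def pca_from_summary(n, L, Q, k):
    """PCA from the summarization matrices: cov = Q/n - mean mean^T, then SVD."""
    mean = L / n
    cov = Q / n - np.outer(mean, mean)
    U, s, _ = np.linalg.svd(cov)           # SVD of the small d x d matrix in RAM
    return U[:, :k], s[:k]                 # top-k components and their variances

# toy usage: stream a data set in blocks
rng = np.random.default_rng(0)
blocks = (rng.normal(size=(1000, 5)) for _ in range(10))
components, variances = pca_from_summary(*summarize(blocks), k=2)
```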

2.
张德喜, 黄浩. 《计算机应用》, 2006, 26(8): 1884-1887
The EM algorithm is computationally intensive, and its efficiency drops when the data set is large. To address this, a hybrid EM algorithm based on a partial E-step is proposed; it reduces the computational load, improves the algorithm's ability to handle large data sets, and preserves the convergence properties of the EM algorithm. Finally, by applying the algorithm to large data sets, it is verified that the algorithm reduces the computational load.
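The abstract does not spell out how the partial E-step is organized, so the following is only a sketch of one common reading of the idea, assuming a Gaussian mixture: each E-step refreshes the responsibilities of a random fraction of the points and reuses cached values for the rest, while the M-step still uses all (partly stale) responsibilities. Parameter names and the refresh schedule are assumptions, not the paper's exact algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

def partial_em(X, k, iters=50, frac=0.2, seed=0):
    """Gaussian-mixture EM in which each E-step refreshes responsibilities
    for only a fraction of the points; cached values are reused for the rest."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # initial means: random points
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    w = np.full(k, 1.0 / k)
    R = np.full((n, k), 1.0 / k)                       # cached responsibilities
    for _ in range(iters):
        idx = rng.choice(n, int(frac * n), replace=False)   # partial E-step
        dens = np.column_stack(
            [w[j] * multivariate_normal.pdf(X[idx], mu[j], cov[j]) for j in range(k)])
        R[idx] = dens / dens.sum(axis=1, keepdims=True)
        Nk = R.sum(axis=0)                             # M-step over all cached R
        w = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        for j in range(k):
            Xc = X - mu[j]
            cov[j] = (R[:, j, None] * Xc).T @ Xc / Nk[j] + 1e-6 * np.eye(d)
    return w, mu, cov
```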

3.
Hierarchical pixel bar charts
Simple presentation graphics are intuitive and easy to use, but only show highly aggregated data. Bar charts, for example, show only a rather small number of data values, and x-y plots often suffer from a high degree of overlap. Presentation techniques are also usually chosen according to the data type: bar charts are used for categorical data and x-y plots for numerical data. We propose a combination of traditional bar charts and x-y plots that allows the visualization of large amounts of data with both categorical and numerical attributes. The categorical dimensions are used to partition the data into bars, and the numerical dimensions determine the ordering within the bars. The basic idea is to use the pixels within the bars to present the detailed information of the data records. Our so-called pixel bar charts retain the intuitiveness of traditional bar charts while applying the principle of x-y charts within the bars. In many applications a natural hierarchy is defined on the categorical dimensions, such as time, region, or product type. In hierarchical pixel bar charts, this hierarchy is exploited to split the bars for selected portions of the hierarchy. Our application to a number of real-world e-business and Web services data sets shows the wide applicability and usefulness of our new idea.

4.
Global telecommunication services create an enormous volume of real-time data. Long-distance voice networks, for example, can complete more than 250 million calls a day; wide-area data networks can support many hundreds of thousands of virtual circuits and millions of Internet protocol (IP) flows and Web server sessions. Unlike terabyte databases, which typically contain images or multimedia streams, telecommunication databases mainly contain numerous small records describing transactions and network status events. The data processing involved therefore differs markedly, both in the number of records and in the data items interpreted. To efficiently configure and operate these networks, as well as manage performance and reliability for the user, these vast data sets must be understandable. Increasingly, visualization proves key to achieving this goal. AT&T Infolab is an interdisciplinary project created in 1996 to explore how software, data management and analysis, and visualization can combine to attack information problems involving large-scale networks. The data Infolab collects daily reaches tens of gigabytes. The Infolab project Swift-3D uses interactive 3D maps with statistical widgets, topology diagrams, and pixel-oriented displays to abstract network data and let users interact with it. We have implemented a full-scale Swift-3D prototype, which generated the examples presented here.

5.
The interdisciplinary research presented in this study is based on a novel approach to clustering tasks and to the visualization of the internal structure of high-dimensional data sets. Following normalization, a pre-processing step performs dimensionality reduction on a high-dimensional data set, using an unsupervised neural architecture known as cooperative maximum likelihood Hebbian learning (CMLHL), which is characterized by its capability to preserve a degree of global ordering in the data. Subsequently, the self-organising map (SOM) is applied as a topology-preserving architecture for two-dimensional visualization of the internal structure of such data sets. This research studies the joint performance of these two neural models and their capability to preserve global ordering. Their effectiveness is demonstrated through a case study on a real-life, highly complex, high-dimensional spectroscopic data set characterized by its lack of reproducibility. The data under analysis are taken from an X-ray spectroscopic analysis of a rose window in a famous ancient Gothic Spanish cathedral. The main aim of this study is to classify each sample by its date and place of origin, so as to facilitate the restoration of these and other historical stained glass windows. Thus, having ascertained each sample's chemical composition and degree of conservation, the technique contributes to identifying the different areas and periods in which the stained glass panels were produced. The combined method proposed in this study is compared with a classical statistical model that uses principal component analysis (PCA) as a pre-processing step, and with other unsupervised models such as maximum likelihood Hebbian learning (MLHL) and the application of the SOM without a pre-processing step. In the latter case, a comparison of the convergence processes was performed to examine the efficacy of the CMLHL/SOM combined model.

6.
Interactive texture-based volume rendering for large data sets
To employ direct volume rendering, TRex uses parallel graphics hardware, software-based compositing, and high-performance I/O to provide near-interactive display rates for time-varying, terabyte-sized data sets. We present a scalable, pipelined approach for rendering data sets too large for a single graphics card. To do so, we take advantage of multiple hardware rendering units and parallel software compositing. The goals of TRex, our system for interactive volume rendering of large data sets, are to provide near-interactive display rates for time-varying, terabyte-sized, uniformly sampled data sets and to provide a low-latency platform for volume visualization in immersive environments. We consider 5 frames per second (fps) to be a near-interactive rate for normal viewing environments, while immersive environments have a lower bound of 10 fps. Using TRex in virtual reality environments requires low latency - around 50 ms per frame, or 100 ms per view update or stereo pair. To achieve lower-latency renderings, we either render smaller portions of the volume on more graphics pipes or subsample the volume so that each graphics pipe renders fewer samples per frame. Unstructured data sets must be resampled to appropriately leverage the 3D texture volume rendering method.

7.
Optimized fixed-size kernel models for large data sets
A modified active subset selection method based on quadratic Rényi entropy and a fast cross-validation scheme for fixed-size least squares support vector machines are proposed for classification and regression with an optimized tuning process. The kernel bandwidth of the entropy-based selection criterion is optimally determined according to the solve-the-equation plug-in method. A fast cross-validation method based on a simple updating scheme is also developed. The combination of these two techniques is suitable for handling large-scale data sets on standard personal computers. Finally, the performance on test data and the computational time of this fixed-size method are compared to those of standard support vector machines and ν-support vector machines, resulting in sparser models with lower computational cost and comparable accuracy.
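As a rough illustration of the entropy-based subset selection this abstract refers to, the sketch below keeps a fixed-size working set and accepts random swaps that increase a quadratic Rényi entropy estimate computed with a Gaussian kernel. The swap heuristic, the fixed bandwidth h, and all names are illustrative assumptions; the paper's method additionally determines the bandwidth with a solve-the-equation plug-in rule and couples the selected subset with an LS-SVM model.

```python
import numpy as np

def renyi_entropy(S, h):
    """Quadratic Renyi entropy estimate of working set S with a Gaussian
    kernel of bandwidth h: H = -log( mean_ij K(x_i, x_j) )."""
    D2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return -np.log(np.exp(-D2 / (2 * h ** 2)).mean())

def select_working_set(X, m, h, iters=2000, seed=0):
    """Keep a size-m subset of X and try random swaps, accepting a swap
    whenever it increases the entropy estimate (i.e. spreads the subset
    more evenly over the support of the input distribution)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), m, replace=False)
    best = renyi_entropy(X[idx], h)
    for _ in range(iters):
        j = rng.integers(len(X))
        if j in idx:
            continue                      # candidate is already in the working set
        trial = idx.copy()
        trial[rng.integers(m)] = j        # swap one member for the candidate
        H = renyi_entropy(X[trial], h)
        if H > best:
            idx, best = trial, H
    return idx
```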

8.
9.
We describe two highly scalable, parallel software volume-rendering algorithms - one renders unstructured grid volume data and the other renders isosurfaces. We designed one algorithm for distributed-memory parallel architectures to render unstructured grid volume data. We designed the other for shared-memory parallel architectures to directly render isosurfaces. Through the discussion of these two algorithms, we address the most relevant issues when using massively parallel computers to render large-scale volumetric data. The focus of our discussion is direct rendering of volumetric data.

10.
This paper describes problems, challenges, and opportunities for intelligent simulation of physical systems. Prototype intelligent simulation tools have been constructed for interpreting massive data sets from physical fields and for designing engineering systems. We identify the characteristics of intelligent simulation and describe several concrete application examples. These applications, which include weather data interpretation, distributed control optimization, and spatio-temporal diffusion-reaction pattern analysis, demonstrate that intelligent simulation tools are indispensable for the rapid prototyping of application programs in many challenging scientific and engineering domains.

11.
Executing different fingerprint-image matching algorithms on large data sets reveals that the match and non-match similarity scores have no specific underlying distribution function. Analyzing fingerprint-image matching algorithms on large data sets therefore requires a nonparametric approach that makes no assumptions about such irregularly discrete distribution functions. A precise receiver operating characteristic (ROC) curve based on the true accept rate (TAR) of the match similarity scores and the false accept rate (FAR) of the non-match similarity scores can be constructed. The area under such an ROC curve computed using the trapezoidal rule is equivalent to the Mann-Whitney statistic formed directly from the match and non-match similarity scores. Thereafter, the Z statistic formulated using the areas under the ROC curves, along with their variances and the correlation coefficient, is applied to test the significance of the difference between two ROC curves. Four examples from the extensive testing of commercial fingerprint systems at the National Institute of Standards and Technology are provided. The nonparametric approach presented in this article can also be employed in the analysis of other large biometric data sets.
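The equivalence this abstract relies on (trapezoidal-rule area under the empirical TAR/FAR curve equals the Mann-Whitney statistic normalized by the number of match/non-match pairs) is easy to check numerically; the snippet below is a small sketch of that check on synthetic scores, not the NIST evaluation code.

```python
import numpy as np

def auc_trapezoid(match, nonmatch):
    """Area under the empirical ROC curve (TAR vs. FAR) via the trapezoidal rule."""
    thresholds = np.unique(np.concatenate([match, nonmatch]))[::-1]
    tar = [(match >= t).mean() for t in thresholds]      # true accept rate
    far = [(nonmatch >= t).mean() for t in thresholds]   # false accept rate
    tar, far = [0.0] + tar + [1.0], [0.0] + far + [1.0]
    return np.trapz(tar, far)

def auc_mann_whitney(match, nonmatch):
    """Mann-Whitney form: P(match score > non-match score) + 0.5 * P(tie)."""
    gt = (match[:, None] > nonmatch[None, :]).mean()
    eq = (match[:, None] == nonmatch[None, :]).mean()
    return gt + 0.5 * eq

rng = np.random.default_rng(0)
match = rng.normal(1.0, 1.0, 400)       # similarity scores of genuine pairs
nonmatch = rng.normal(0.0, 1.0, 600)    # similarity scores of impostor pairs
print(auc_trapezoid(match, nonmatch), auc_mann_whitney(match, nonmatch))  # identical values
```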

12.
Outlier mining in large high-dimensional data sets
A new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large, high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are those points with the largest weights. The HilOut algorithm makes use of the notion of a space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates candidate outlier points and reduces this set at each iteration. If the size of this set becomes n, the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets, showing that the algorithm scales well in both cases.
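For reference, the weight definition in this abstract can be computed naively as below; this O(N²) sketch only illustrates the quantity that HilOut approximates and then refines, it is not the Hilbert-curve algorithm itself, and the names are illustrative.

```python
import numpy as np

def top_n_outliers(X, k, n):
    """Naive reference for the distance-based outlier definition:
    weight(p) = sum of distances from p to its k nearest neighbours;
    the n points with the largest weights are the outliers."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                   # exclude the point itself
    knn = np.sort(D, axis=1)[:, :k]               # k smallest distances per row
    weights = knn.sum(axis=1)
    return np.argsort(weights)[::-1][:n], weights

# toy usage: a dense cloud plus a handful of far-away points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal(8, 0.5, (5, 3))])
outlier_idx, w = top_n_outliers(X, k=10, n=5)
```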

13.
Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
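The grouping idea described here (assign whole groups of points to mixture components, then re-estimate from per-group sufficient statistics) can be sketched roughly as follows for a Gaussian mixture, with the groups given up front, e.g. from a kd-tree partition of the data. To keep the sketch short, responsibilities are evaluated at each group's mean and the covariance is fixed and spherical, which drops the within-group scatter term that the paper's exact lower bound accounts for, so this is an approximation of the idea rather than the authors' algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

def grouped_em(group_counts, group_sums, k, sigma2=1.0, iters=50, seed=0):
    """EM for a Gaussian mixture with fixed covariance sigma2*I, driven by
    per-group sufficient statistics (count and sum). One responsibility
    vector is computed per group and shared by all of the group's points,
    which is where the speedup comes from."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(group_counts, dtype=float)        # shape (G,)
    sums = np.asarray(group_sums, dtype=float)            # shape (G, d)
    means = sums / counts[:, None]                        # group means
    G, d = sums.shape
    mu = means[rng.choice(G, k, replace=False)]           # initial component means
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        dens = np.column_stack(
            [w[j] * multivariate_normal.pdf(means, mu[j], sigma2 * np.eye(d))
             for j in range(k)])                          # (G, k), evaluated at group means
        R = dens / dens.sum(axis=1, keepdims=True)        # group responsibilities
        Nk = R.T @ counts                                 # effective counts per component
        w = Nk / counts.sum()
        mu = (R.T @ sums) / Nk[:, None]                   # re-estimate from group sums
    return w, mu
```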

14.
Visualizing and segmenting large volumetric data sets
Current systems for segmenting and visualizing volumetric data sets characteristically require the user to possess technical sophistication in volume visualization techniques, thus restricting the potential audience of users. As large volumetric data sets become more common, segmentation and visualization tools need to deemphasize the technical aspects of visualization and let users exploit their content knowledge of the data set. This proves especially critical in an educational setting. In anatomical education, data sets such as the Visible Human Project provide significant learning opportunities, but students must have tools that let them apply, refine, and build on their anatomical knowledge without technical obstacles. I describe a software environment that uses immersive virtual reality technology to let users immediately apply their expert knowledge to exploring and visualizing volumetric data sets.

15.
The problem of determining whether clusters are present in a data set (i.e., assessment of cluster tendency) is an important first step in cluster analysis. The visual assessment of cluster tendency (VAT) tool has been successful in determining potential cluster structure of various data sets, but it can be computationally expensive for large data sets. In this article, we present a new scalable, sample-based version of VAT, which is feasible for large data sets. We include analysis and numerical examples that demonstrate the new scalable VAT algorithm.
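For context, the reordering at the heart of the original VAT tool (which the scalable version samples from) can be written compactly; this is a sketch of plain VAT under the standard Prim-like formulation, not the sample-based version of the article, and the matplotlib display step is just one possible choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def vat_order(R):
    """VAT reordering of a square dissimilarity matrix R: start from one end
    of the most dissimilar pair, then repeatedly append the object closest to
    the already-ordered set (a Prim-like ordering), so that cluster structure
    shows up as dark blocks along the diagonal of the reordered matrix."""
    N = R.shape[0]
    i, _ = np.unravel_index(np.argmax(R), R.shape)
    order, remaining = [i], sorted(set(range(N)) - {i})
    while remaining:
        sub = R[np.ix_(order, remaining)]
        _, c = np.unravel_index(np.argmin(sub), sub.shape)
        order.append(remaining.pop(c))
    order = np.array(order)
    return R[np.ix_(order, order)], order

# toy usage: two well-separated blobs produce two dark diagonal blocks
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (60, 2))])
R = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
reordered, _ = vat_order(R)
plt.imshow(reordered, cmap="gray")
plt.show()
```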

16.
P-AutoClass: scalable parallel clustering for mining large data sets
Data clustering is an important task in the area of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters such that items in the same cluster are more similar to each other than items in different clusters, according to some defined criteria. Clustering algorithms are computationally intensive, particularly when they are used to analyze large amounts of data. A possible approach to reducing the processing time is to implement clustering algorithms on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results for different numbers of processors and data sets are presented and compared with theoretical performance. In particular, the experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.

17.
To improve the accuracy of trees built from large-scale data sets, a data-sampling method is proposed: a decision tree is first pre-built on the data set, a breadth-first traversal is used to partition the data into subsets that satisfy predefined constraints, each subset is then randomly sampled at a fixed ratio, and finally the samples are merged into the target data set. Experiments that sample a UCI data set and train existing decision-tree algorithms show that this sampling method outperforms traditional random sampling, and that trees built on the samples it produces achieve higher accuracy.
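A loose sketch of this sampling idea using scikit-learn is shown below: a shallow preliminary tree is fitted, its leaves stand in for the constrained partition obtained by breadth-first traversal (a simplification of the description above), and a simple random sample of the same fraction is drawn inside every leaf. All parameter values and names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_guided_sample(X, y, frac=0.1, max_leaves=32, seed=0):
    """Fit a shallow preliminary tree, treat its leaves as the partition of
    the data set, and draw a simple random sample of fraction `frac` inside
    every leaf; the per-leaf samples are then merged into the output."""
    rng = np.random.default_rng(seed)
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=seed).fit(X, y)
    leaves = tree.apply(X)                        # leaf id for every record
    keep = []
    for leaf in np.unique(leaves):
        idx = np.flatnonzero(leaves == leaf)
        m = max(1, int(round(frac * len(idx))))   # sample size for this leaf
        keep.append(rng.choice(idx, m, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```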

18.
A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms to large data sets. The extension of non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix R_N to obtain an n × n sample matrix R_n; (iii) clustering R_n with literal NERF; and (iv) extending the clusters in R_n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, phases (i) and (ii), i.e., finding R_n, are not very practical because the sample size n often turns out to be roughly 50% of N, and this over-sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and by a related algorithm, sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data.

19.
Population models are widely applied in biomedical data analysis since they characterize both the average and individual responses of a population of subjects. In the absence of a reliable mechanistic model, one can resort to the Bayesian nonparametric approach that models the individual curves as Gaussian processes. This paper develops an efficient computational scheme for estimating the average and individual curves from large data sets collected in standardized experiments, i.e., with a fixed sampling schedule. It is shown that the overall scheme exhibits a “client-server” architecture. The server is in charge of handling and processing the collective data base of past experiments. The clients ask the server for the information needed to reconstruct the individual curve in a single new experiment. This architecture allows the clients to take advantage of the overall data set without violating possible privacy and confidentiality constraints and with negligible computational effort.
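The computational point of this abstract (a fixed sampling schedule lets the expensive part of a Gaussian-process reconstruction be done once, server-side, and reused for every new subject) can be illustrated with a bare-bones GP regression sketch; the squared-exponential kernel, the noise level, and the handling of the population mean are placeholder assumptions rather than the paper's model.

```python
import numpy as np

def rbf(a, b, ell=2.0, var=1.0):
    """Squared-exponential kernel between two vectors of time points."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# --- server side: depends only on the fixed sampling schedule -------------
t_sched = np.linspace(0, 10, 25)           # standardized sampling times
t_grid = np.linspace(0, 10, 200)           # grid on which curves are reconstructed
sigma2 = 0.1                               # measurement noise variance (assumed)
L = np.linalg.cholesky(rbf(t_sched, t_sched) + sigma2 * np.eye(len(t_sched)))
K_star = rbf(t_grid, t_sched)              # cross-covariance, also schedule-only

# --- client side: one cheap solve per new subject --------------------------
def individual_curve(y, prior_mean):
    """Posterior mean of one subject's curve given measurements y taken at the
    fixed schedule; prior_mean plays the role of the population average curve."""
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - prior_mean(t_sched)))
    return prior_mean(t_grid) + K_star @ alpha

# client usage: reconstruct one subject's curve from its 25 measurements
y_obs = np.sin(t_sched) + np.sqrt(sigma2) * np.random.default_rng(1).normal(size=len(t_sched))
curve = individual_curve(y_obs, prior_mean=lambda t: np.zeros_like(t))
```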

20.
Until recently, the study of terrain data gathered by satellites has meant the time-consuming analysis of hundreds of still images. An attempt to use computer animation to analyze such images more effectively is described. A simulated fly-by created by computer animation allows scientists to study many different perspective views of the data quickly. Using this technique, scientists can often detect features they might otherwise have missed.
