Similar Documents
20 similar documents were found for this query.
1.
Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well-known model for dimensionality reduction in statistical analysis, and computing it requires a large number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm that can compute PCA in one pass over a large data set based on summarization matrices. We also study how to integrate our algorithm with a DBMS; our solution is based on a combination of parallel data set summarization via user-defined aggregations and calling the MKL parallel variant of the LAPACK library to solve Singular Value Decomposition (SVD) in RAM. Our algorithm is theoretically shown to achieve linear speedup, linear scalability on data size, and quadratic time on dimensionality (but in RAM), spending most of the time on data set summarization, despite the fact that SVD has cubic time complexity on dimensionality. Experiments with large data sets on multicore CPUs show that our solution is much faster than the R statistical package and than solving PCA with SQL queries. Benchmarking on multicore CPUs and a parallel DBMS running on multiple nodes confirms linear speedup and linear scalability.
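The summarization idea in this abstract can be illustrated with a minimal single-machine sketch, assuming the usual choice of summarization matrices (the count n, the linear sum L, and the quadratic sum Q): each data block contributes additively to n, L, and Q, which is what makes the pass parallelizable, and PCA then reduces to an SVD of the small d × d covariance matrix held in RAM. Function names and the NumPy toolchain are illustrative, not the paper's actual DBMS/MKL implementation.

```python
import numpy as np

def summarize(chunks):
    """One pass over the data: accumulate n, L = sum(x), Q = sum(x x^T).
    Per-chunk sums combine by addition, so chunks can be processed in parallel."""
    n, L, Q = 0, None, None
    for X in chunks:                       # X is an (m, d) block of rows
        if L is None:
            d = X.shape[1]
            L, Q = np.zeros(d), np.zeros((d, d))
        n += X.shape[0]
        L += X.sum(axis=0)
        Q += X.T @ X
    return n, L, Q

def pca_from_summary(n, L, Q, k):
    """PCA from the summarization matrices: cov = Q/n - mean mean^T, then SVD."""
    mean = L / n
    cov = Q / n - np.outer(mean, mean)
    U, s, _ = np.linalg.svd(cov)           # SVD of the small d x d matrix in RAM
    return U[:, :k], s[:k]                 # top-k components and their variances

# toy usage: stream a data set in blocks
rng = np.random.default_rng(0)
blocks = (rng.normal(size=(1000, 5)) for _ in range(10))
components, variances = pca_from_summary(*summarize(blocks), k=2)
```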

2.
张德喜, 黄浩. 《计算机应用》, 2006, 26(8): 1884-1887
The EM algorithm is computationally intensive, and its efficiency drops when the data set is large. To address this, a hybrid EM algorithm based on a partial E-step is proposed; it reduces the computational load, improves the algorithm's ability to handle large data sets, and preserves the convergence properties of the EM algorithm. Finally, by applying the algorithm to large data sets, it is verified that the algorithm reduces the computational load.
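The abstract does not spell out how the partial E-step is organized, so the following is only a sketch of one common reading of the idea, assuming a Gaussian mixture: each E-step refreshes the responsibilities of a random fraction of the points and reuses cached values for the rest, while the M-step still uses all (partly stale) responsibilities. Parameter names and the refresh schedule are assumptions, not the paper's exact algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

def partial_em(X, k, iters=50, frac=0.2, seed=0):
    """Gaussian-mixture EM in which each E-step refreshes responsibilities
    for only a fraction of the points; cached values are reused for the rest."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # initial means: random points
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    w = np.full(k, 1.0 / k)
    R = np.full((n, k), 1.0 / k)                       # cached responsibilities
    for _ in range(iters):
        idx = rng.choice(n, int(frac * n), replace=False)   # partial E-step
        dens = np.column_stack(
            [w[j] * multivariate_normal.pdf(X[idx], mu[j], cov[j]) for j in range(k)])
        R[idx] = dens / dens.sum(axis=1, keepdims=True)
        Nk = R.sum(axis=0)                             # M-step over all cached R
        w = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        for j in range(k):
            Xc = X - mu[j]
            cov[j] = (R[:, j, None] * Xc).T @ Xc / Nk[j] + 1e-6 * np.eye(d)
    return w, mu, cov
```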

3.
Hierarchical pixel bar charts
Simple presentation graphics are intuitive and easy to use, but only show highly aggregated data. Bar charts, for example, show only a rather small number of data values, and x-y plots often suffer from a high degree of overlap. Presentation techniques are also usually chosen according to the data type: bar charts are used for categorical data and x-y plots for numerical data. We propose a combination of traditional bar charts and x-y plots that allows the visualization of large amounts of data with both categorical and numerical attributes. The categorical dimensions are used to partition the data into bars, and the numerical dimensions determine the ordering within the bars. The basic idea is to use the pixels within the bars to present the detailed information of the data records. Our so-called pixel bar charts retain the intuitiveness of traditional bar charts while applying the principle of x-y charts within the bars. In many applications a natural hierarchy is defined on the categorical dimensions, such as time, region, or product type. In hierarchical pixel bar charts, this hierarchy is exploited to split the bars for selected portions of the hierarchy. Our application to a number of real-world e-business and Web services data sets shows the wide applicability and usefulness of our new idea.

4.
Global telecommunication services create an enormous volume of real-time data. Long-distance voice networks, for example, can complete more than 250 million calls a day; wide-area data networks can support many hundreds of thousands of virtual circuits and millions of Internet protocol (IP) flows and Web server sessions. Unlike terabyte databases, which typically contain images or multimedia streams, telecommunication databases mainly contain numerous small records describing transactions and network status events. The data processing involved therefore differs markedly, both in the number of records and in the data items interpreted. To efficiently configure and operate these networks, as well as manage performance and reliability for the user, these vast data sets must be understandable. Increasingly, visualization proves key to achieving this goal. AT&T Infolab is an interdisciplinary project created in 1996 to explore how software, data management and analysis, and visualization can combine to attack information problems involving large-scale networks. The data Infolab collects daily reaches tens of gigabytes. The Infolab project Swift-3D uses interactive 3D maps with statistical widgets, topology diagrams, and pixel-oriented displays to abstract network data and let users interact with it. We have implemented a full-scale Swift-3D prototype, which generated the examples presented here.

5.
The interdisciplinary research presented in this study is based on a novel approach to clustering tasks and to the visualization of the internal structure of high-dimensional data sets. Following normalization, a pre-processing step performs dimensionality reduction on a high-dimensional data set, using an unsupervised neural architecture known as cooperative maximum likelihood Hebbian learning (CMLHL), which is characterized by its capability to preserve a degree of global ordering in the data. Subsequently, the self-organising map (SOM) is applied as a topology-preserving architecture for two-dimensional visualization of the internal structure of such data sets. This research studies the joint performance of these two neural models and their capability to preserve global ordering. Their effectiveness is demonstrated through a case study on a real-life, highly complex, high-dimensional spectroscopic data set characterized by its lack of reproducibility. The data under analysis are taken from an X-ray spectroscopic analysis of a rose window in a famous ancient Gothic Spanish cathedral. The main aim of this study is to classify each sample by its date and place of origin, so as to facilitate the restoration of these and other historical stained glass windows. Thus, having ascertained each sample's chemical composition and degree of conservation, the technique contributes to identifying the different areas and periods in which the stained glass panels were produced. The combined method proposed in this study is compared with a classical statistical model that uses principal component analysis (PCA) as a pre-processing step, and with other unsupervised models such as maximum likelihood Hebbian learning (MLHL) and the application of the SOM without a pre-processing step. In the latter case, a comparison of the convergence processes was performed to examine the efficacy of the CMLHL/SOM combined model.

6.
Interactive texture-based volume rendering for large data sets
To employ direct volume rendering, TRex uses parallel graphics hardware, software-based compositing, and high-performance I/O to provide near-interactive display rates for time-varying, terabyte-sized data sets. We present a scalable, pipelined approach for rendering data sets too large for a single graphics card. To do so, we take advantage of multiple hardware rendering units and parallel software compositing. The goals of TRex, our system for interactive volume rendering of large data sets, are to provide near-interactive display rates for time-varying, terabyte-sized, uniformly sampled data sets and to provide a low-latency platform for volume visualization in immersive environments. We consider 5 frames per second (fps) to be a near-interactive rate for normal viewing environments, while immersive environments have a lower bound of 10 fps. Using TRex in virtual reality environments requires low latency - around 50 ms per frame, or 100 ms per view update or stereo pair. To achieve lower-latency renderings, we either render smaller portions of the volume on more graphics pipes or subsample the volume so that each graphics pipe renders fewer samples per frame. Unstructured data sets must be resampled to appropriately leverage the 3D texture volume rendering method.

7.
Optimized fixed-size kernel models for large data sets
A modified active subset selection method based on quadratic Rényi entropy and a fast cross-validation scheme for fixed-size least squares support vector machines are proposed for classification and regression with an optimized tuning process. The kernel bandwidth of the entropy-based selection criterion is optimally determined according to the solve-the-equation plug-in method. A fast cross-validation method based on a simple updating scheme is also developed. The combination of these two techniques is suitable for handling large-scale data sets on standard personal computers. Finally, the performance on test data and the computational time of this fixed-size method are compared to those of standard support vector machines and ν-support vector machines, resulting in sparser models with lower computational cost and comparable accuracy.
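As a rough illustration of the entropy-based subset selection this abstract refers to, the sketch below keeps a fixed-size working set and accepts random swaps that increase a quadratic Rényi entropy estimate computed with a Gaussian kernel. The swap heuristic, the fixed bandwidth h, and all names are illustrative assumptions; the paper's method additionally determines the bandwidth with a solve-the-equation plug-in rule and couples the selected subset with an LS-SVM model.

```python
import numpy as np

def renyi_entropy(S, h):
    """Quadratic Renyi entropy estimate of working set S with a Gaussian
    kernel of bandwidth h: H = -log( mean_ij K(x_i, x_j) )."""
    D2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    return -np.log(np.exp(-D2 / (2 * h ** 2)).mean())

def select_working_set(X, m, h, iters=2000, seed=0):
    """Keep a size-m subset of X and try random swaps, accepting a swap
    whenever it increases the entropy estimate (i.e. spreads the subset
    more evenly over the support of the input distribution)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), m, replace=False)
    best = renyi_entropy(X[idx], h)
    for _ in range(iters):
        j = rng.integers(len(X))
        if j in idx:
            continue                      # candidate is already in the working set
        trial = idx.copy()
        trial[rng.integers(m)] = j        # swap one member for the candidate
        H = renyi_entropy(X[trial], h)
        if H > best:
            idx, best = trial, H
    return idx
```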

8.
9.
We describe two highly scalable, parallel software volume-rendering algorithms - one renders unstructured grid volume data and the other renders isosurfaces. We designed one algorithm for distributed-memory parallel architectures to render unstructured grid volume data. We designed the other for shared-memory parallel architectures to directly render isosurfaces. Through the discussion of these two algorithms, we address the most relevant issues when using massively parallel computers to render large-scale volumetric data. The focus of our discussion is direct rendering of volumetric data.

10.
This paper describes problems, challenges, and opportunities for intelligent simulation of physical systems. Prototype intelligent simulation tools have been constructed for interpreting massive data sets from physical fields and for designing engineering systems. We identify the characteristics of intelligent simulation and describe several concrete application examples. These applications, which include weather data interpretation, distributed control optimization, and spatio-temporal diffusion-reaction pattern analysis, demonstrate that intelligent simulation tools are indispensable for the rapid prototyping of application programs in many challenging scientific and engineering domains.

11.
Executing different fingerprint-image matching algorithms on large data sets reveals that the match and non-match similarity scores have no specific underlying distribution function. Analyzing fingerprint-image matching algorithms on large data sets therefore requires a nonparametric approach that makes no assumptions about such irregularly discrete distribution functions. A precise receiver operating characteristic (ROC) curve based on the true accept rate (TAR) of the match similarity scores and the false accept rate (FAR) of the non-match similarity scores can be constructed. The area under such an ROC curve computed using the trapezoidal rule is equivalent to the Mann-Whitney statistic formed directly from the match and non-match similarity scores. Thereafter, the Z statistic formulated using the areas under the ROC curves, along with their variances and the correlation coefficient, is applied to test the significance of the difference between two ROC curves. Four examples from the extensive testing of commercial fingerprint systems at the National Institute of Standards and Technology are provided. The nonparametric approach presented in this article can also be employed in the analysis of other large biometric data sets.
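The equivalence this abstract relies on (trapezoidal-rule area under the empirical TAR/FAR curve equals the Mann-Whitney statistic normalized by the number of match/non-match pairs) is easy to check numerically; the snippet below is a small sketch of that check on synthetic scores, not the NIST evaluation code.

```python
import numpy as np

def auc_trapezoid(match, nonmatch):
    """Area under the empirical ROC curve (TAR vs. FAR) via the trapezoidal rule."""
    thresholds = np.unique(np.concatenate([match, nonmatch]))[::-1]
    tar = [(match >= t).mean() for t in thresholds]      # true accept rate
    far = [(nonmatch >= t).mean() for t in thresholds]   # false accept rate
    tar, far = [0.0] + tar + [1.0], [0.0] + far + [1.0]
    return np.trapz(tar, far)

def auc_mann_whitney(match, nonmatch):
    """Mann-Whitney form: P(match score > non-match score) + 0.5 * P(tie)."""
    gt = (match[:, None] > nonmatch[None, :]).mean()
    eq = (match[:, None] == nonmatch[None, :]).mean()
    return gt + 0.5 * eq

rng = np.random.default_rng(0)
match = rng.normal(1.0, 1.0, 400)       # similarity scores of genuine pairs
nonmatch = rng.normal(0.0, 1.0, 600)    # similarity scores of impostor pairs
print(auc_trapezoid(match, nonmatch), auc_mann_whitney(match, nonmatch))  # identical values
```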

12.
Outlier mining in large high-dimensional data sets
A new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large, high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are those points with the largest weights. The HilOut algorithm makes use of the notion of a space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates candidate outlier points and reduces this set at each iteration. If the size of this set becomes n, the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets, showing that the algorithm scales well in both cases.
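For reference, the weight definition in this abstract can be computed naively as below; this O(N²) sketch only illustrates the quantity that HilOut approximates and then refines, it is not the Hilbert-curve algorithm itself, and the names are illustrative.

```python
import numpy as np

def top_n_outliers(X, k, n):
    """Naive reference for the distance-based outlier definition:
    weight(p) = sum of distances from p to its k nearest neighbours;
    the n points with the largest weights are the outliers."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                   # exclude the point itself
    knn = np.sort(D, axis=1)[:, :k]               # k smallest distances per row
    weights = knn.sum(axis=1)
    return np.argsort(weights)[::-1][:n], weights

# toy usage: a dense cloud plus a handful of far-away points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal(8, 0.5, (5, 3))])
outlier_idx, w = top_n_outliers(X, k=10, n=5)
```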

13.
Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
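The grouping idea described here (assign whole groups of points to mixture components, then re-estimate from per-group sufficient statistics) can be sketched roughly as follows for a Gaussian mixture, with the groups given up front, e.g. from a kd-tree partition of the data. To keep the sketch short, responsibilities are evaluated at each group's mean and the covariance is fixed and spherical, which drops the within-group scatter term that the paper's exact lower bound accounts for, so this is an approximation of the idea rather than the authors' algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

def grouped_em(group_counts, group_sums, k, sigma2=1.0, iters=50, seed=0):
    """EM for a Gaussian mixture with fixed covariance sigma2*I, driven by
    per-group sufficient statistics (count and sum). One responsibility
    vector is computed per group and shared by all of the group's points,
    which is where the speedup comes from."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(group_counts, dtype=float)        # shape (G,)
    sums = np.asarray(group_sums, dtype=float)            # shape (G, d)
    means = sums / counts[:, None]                        # group means
    G, d = sums.shape
    mu = means[rng.choice(G, k, replace=False)]           # initial component means
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        dens = np.column_stack(
            [w[j] * multivariate_normal.pdf(means, mu[j], sigma2 * np.eye(d))
             for j in range(k)])                          # (G, k), evaluated at group means
        R = dens / dens.sum(axis=1, keepdims=True)        # group responsibilities
        Nk = R.T @ counts                                 # effective counts per component
        w = Nk / counts.sum()
        mu = (R.T @ sums) / Nk[:, None]                   # re-estimate from group sums
    return w, mu
```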

14.
Visualizing and segmenting large volumetric data sets
Current systems for segmenting and visualizing volumetric data sets characteristically require the user to possess technical sophistication in volume visualization techniques, thus restricting the potential audience of users. As large volumetric data sets become more common, segmentation and visualization tools need to deemphasize the technical aspects of visualization and let users exploit their content knowledge of the data set. This proves especially critical in an educational setting. In anatomical education, data sets such as the Visible Human Project provide significant learning opportunities, but students must have tools that let them apply, refine, and build on their anatomical knowledge without technical obstacles. I describe a software environment that uses immersive virtual reality technology to let users immediately apply their expert knowledge to exploring and visualizing volumetric data sets.

15.
The problem of determining whether clusters are present in a data set (i.e., assessment of cluster tendency) is an important first step in cluster analysis. The visual assessment of cluster tendency (VAT) tool has been successful in determining potential cluster structure of various data sets, but it can be computationally expensive for large data sets. In this article, we present a new scalable, sample-based version of VAT, which is feasible for large data sets. We include analysis and numerical examples that demonstrate the new scalable VAT algorithm.
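For context, the reordering at the heart of the original VAT tool (which the scalable version samples from) can be written compactly; this is a sketch of plain VAT under the standard Prim-like formulation, not the sample-based version of the article, and the matplotlib display step is just one possible choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def vat_order(R):
    """VAT reordering of a square dissimilarity matrix R: start from one end
    of the most dissimilar pair, then repeatedly append the object closest to
    the already-ordered set (a Prim-like ordering), so that cluster structure
    shows up as dark blocks along the diagonal of the reordered matrix."""
    N = R.shape[0]
    i, _ = np.unravel_index(np.argmax(R), R.shape)
    order, remaining = [i], sorted(set(range(N)) - {i})
    while remaining:
        sub = R[np.ix_(order, remaining)]
        _, c = np.unravel_index(np.argmin(sub), sub.shape)
        order.append(remaining.pop(c))
    order = np.array(order)
    return R[np.ix_(order, order)], order

# toy usage: two well-separated blobs produce two dark diagonal blocks
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (60, 2))])
R = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
reordered, _ = vat_order(R)
plt.imshow(reordered, cmap="gray")
plt.show()
```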

16.
P-AutoClass: scalable parallel clustering for mining large data sets
Data clustering is an important task in the area of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters such that items in the same cluster are more similar to each other than items in different clusters, according to some defined criteria. Clustering algorithms are computationally intensive, particularly when they are used to analyze large amounts of data. A possible approach to reducing the processing time is to implement clustering algorithms on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results for different numbers of processors and data sets are presented and compared with theoretical performance. In particular, the experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.

17.
To improve the accuracy of trees built from large-scale data sets, a data-sampling method is proposed: a decision tree is first pre-built on the data set, a breadth-first traversal is used to partition the data into subsets that satisfy predefined constraints, each subset is then randomly sampled at a fixed ratio, and finally the samples are merged into the target data set. Experiments that sample a UCI data set and train existing decision-tree algorithms show that this sampling method outperforms traditional random sampling, and that trees built on the samples it produces achieve higher accuracy.
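A loose sketch of this sampling idea using scikit-learn is shown below: a shallow preliminary tree is fitted, its leaves stand in for the constrained partition obtained by breadth-first traversal (a simplification of the description above), and a simple random sample of the same fraction is drawn inside every leaf. All parameter values and names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_guided_sample(X, y, frac=0.1, max_leaves=32, seed=0):
    """Fit a shallow preliminary tree, treat its leaves as the partition of
    the data set, and draw a simple random sample of fraction `frac` inside
    every leaf; the per-leaf samples are then merged into the output."""
    rng = np.random.default_rng(seed)
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=seed).fit(X, y)
    leaves = tree.apply(X)                        # leaf id for every record
    keep = []
    for leaf in np.unique(leaves):
        idx = np.flatnonzero(leaves == leaf)
        m = max(1, int(round(frac * len(idx))))   # sample size for this leaf
        keep.append(rng.choice(idx, m, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```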

18.
A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms to large data sets. The extension of non-Euclidean relational fuzzy c-means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from an N × N relational matrix R_N to obtain an n × n sample matrix R_n; (iii) clustering R_n with literal NERF; and (iv) extending the clusters in R_n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, phases (i) and (ii), i.e., finding R_n, are not very practical because the sample size n often turns out to be roughly 50% of N, and this over-sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and by a related algorithm, sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate the over-sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data.

19.
Population models are widely applied in biomedical data analysis since they characterize both the average and individual responses of a population of subjects. In the absence of a reliable mechanistic model, one can resort to the Bayesian nonparametric approach that models the individual curves as Gaussian processes. This paper develops an efficient computational scheme for estimating the average and individual curves from large data sets collected in standardized experiments, i.e., with a fixed sampling schedule. It is shown that the overall scheme exhibits a “client-server” architecture. The server is in charge of handling and processing the collective data base of past experiments. The clients ask the server for the information needed to reconstruct the individual curve in a single new experiment. This architecture allows the clients to take advantage of the overall data set without violating possible privacy and confidentiality constraints and with negligible computational effort.
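The computational point of this abstract (a fixed sampling schedule lets the expensive part of a Gaussian-process reconstruction be done once, server-side, and reused for every new subject) can be illustrated with a bare-bones GP regression sketch; the squared-exponential kernel, the noise level, and the handling of the population mean are placeholder assumptions rather than the paper's model.

```python
import numpy as np

def rbf(a, b, ell=2.0, var=1.0):
    """Squared-exponential kernel between two vectors of time points."""
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# --- server side: depends only on the fixed sampling schedule -------------
t_sched = np.linspace(0, 10, 25)           # standardized sampling times
t_grid = np.linspace(0, 10, 200)           # grid on which curves are reconstructed
sigma2 = 0.1                               # measurement noise variance (assumed)
L = np.linalg.cholesky(rbf(t_sched, t_sched) + sigma2 * np.eye(len(t_sched)))
K_star = rbf(t_grid, t_sched)              # cross-covariance, also schedule-only

# --- client side: one cheap solve per new subject --------------------------
def individual_curve(y, prior_mean):
    """Posterior mean of one subject's curve given measurements y taken at the
    fixed schedule; prior_mean plays the role of the population average curve."""
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - prior_mean(t_sched)))
    return prior_mean(t_grid) + K_star @ alpha

# client usage: reconstruct one subject's curve from its 25 measurements
y_obs = np.sin(t_sched) + np.sqrt(sigma2) * np.random.default_rng(1).normal(size=len(t_sched))
curve = individual_curve(y_obs, prior_mean=lambda t: np.zeros_like(t))
```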

20.
Until recently, the study of terrain data gathered by satellites has meant the time-consuming analysis of hundreds of still images. An attempt to use computer animation to analyze such images more effectively is described. A simulated fly-by created by computer animation allows scientists to study many different perspective views of the data quickly. Using this technique, scientists can often detect features they might otherwise have missed.
