Similar Literature
20 similar documents found.
1.
The scatterplot matrix (SPLOM) is a well-established technique for visually exploring high-dimensional data sets. It is characterized by the number of scatterplots (plots) of which it consists, and this number grows quadratically with the number of dimensions in the data set. An SPLOM therefore scales very poorly, and its usefulness is restricted to a small number of dimensions. Several approaches already exist for exploring such ‘small’ SPLOMs, but they address the scalability problem only indirectly, without solving it. We therefore introduce a new greedy approach to manage ‘large’ SPLOMs with more than 100 dimensions. We establish a combined visualization and interaction scheme that produces intuitively interpretable SPLOMs by combining known quality measures, a pre-process reordering, and a perception-based abstraction. With this scheme, the user can interactively find large numbers of relevant plots in large SPLOMs.
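The scaling problem is easy to quantify: an n-dimensional data set yields n(n-1)/2 distinct pairwise plots, so 100 dimensions already produce 4,950 of them. A minimal Python sketch of the general idea of ranking plots by a quality measure; absolute Pearson correlation is merely a stand-in for the measures the paper actually combines:

```python
import numpy as np

def rank_splom_plots(X, top_k=10):
    """Rank the n*(n-1)/2 pairwise plots of a data matrix X (rows =
    observations, columns = dimensions) by a simple quality measure:
    absolute Pearson correlation (a stand-in for richer measures)."""
    n_dims = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)
    pairs = [(abs(corr[i, j]), i, j)
             for i in range(n_dims) for j in range(i + 1, n_dims)]
    pairs.sort(reverse=True)                 # most "interesting" plots first
    return pairs[:top_k]

X = np.random.rand(500, 100)                 # 100 dimensions -> 4950 plots
print(X.shape[1] * (X.shape[1] - 1) // 2)    # 4950
for score, i, j in rank_splom_plots(X, top_k=5):
    print(f"plot ({i},{j}): |r| = {score:.3f}")
```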

2.
Visual quality measures seek to algorithmically imitate human judgments of patterns such as class separability, correlation, or outliers. In this paper, we propose a novel data-driven framework for evaluating such measures. The basic idea is to take a large set of visually encoded data, such as scatterplots, with reliable human “ground truth” judgments, and to use this human-labeled data to learn how well a measure would predict human judgments on previously unseen data. Measures can then be evaluated based on predictive performance, an approach that is crucial for generalizing across datasets but has gained little attention so far. To illustrate our framework, we use it to evaluate 15 state-of-the-art class separation measures, using human ground truth data from 828 class separation judgments on color-coded 2D scatterplots.
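A minimal sketch of the evaluation idea, with synthetic stand-ins for both the measure scores and the human labels: treat a candidate measure's score per scatterplot as a predictor and estimate, by cross-validation, how well it predicts human judgments on plots it has not seen:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: one score per scatterplot from a candidate
# measure, and binary human "separable / not separable" judgments.
rng = np.random.default_rng(0)
measure_scores = rng.random(828)
human_labels = (measure_scores + 0.3 * rng.standard_normal(828)) > 0.5

# Predictive performance on unseen plots via 5-fold cross-validation:
clf = LogisticRegression()
acc = cross_val_score(clf, measure_scores.reshape(-1, 1),
                      human_labels, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {acc.mean():.3f}")  # higher = better measure
```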

3.
We propose ClustMe, a new visual quality measure to rank monochrome scatterplots based on cluster patterns. ClustMe is based on data collected from a human-subjects study in which 34 participants judged synthetically generated cluster patterns in 1000 scatterplots. We generated these patterns by carefully varying the free parameters of a simple Gaussian mixture model with two components, and asked the participants to count the number of clusters they could see (1, or more than 1). Based on the results, we form ClustMe by selecting the model that best predicts these human judgments among 7 different state-of-the-art merging techniques (DEMP). To quantitatively evaluate ClustMe, we conducted a second study in which 31 human subjects ranked 435 pairs of scatterplots of real and synthetic data in terms of cluster-pattern complexity. We use this data to compare ClustMe's performance to 4 other state-of-the-art clustering measures, including the well-known Clumpiness scagnostic. We found that, of all measures, ClustMe is in strongest agreement with the human rankings.
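A minimal sketch of the stimulus generation described above: sampling a monochrome scatterplot from a two-component Gaussian mixture. Only the component separation and mixing weight are varied here (the study varied more parameters); small separations tend to be perceived as a single cluster:

```python
import numpy as np

def gmm2_scatterplot(n=500, delta=3.0, weight=0.5, seed=0):
    """Sample 2D points from a two-component Gaussian mixture.
    `delta` controls the separation of the component means."""
    rng = np.random.default_rng(seed)
    n1 = rng.binomial(n, weight)             # component sizes
    c1 = rng.multivariate_normal([0, 0], np.eye(2), n1)
    c2 = rng.multivariate_normal([delta, 0], np.eye(2), n - n1)
    return np.vstack([c1, c2])

pts = gmm2_scatterplot(delta=1.0)  # small delta: likely judged as 1 cluster
print(pts.shape)                   # (500, 2)
```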

4.
A standard approach for visualizing multivariate networks is to use one or more multidimensional views (for example, scatterplots) for selecting nodes by various metrics, possibly coordinated with a node-link view of the network. In this paper, we present three novel approaches for achieving a tighter integration of these views through hybrid techniques for multidimensional visualization, graph selection and layout. First, we present the FlowVizMenu, a radial menu containing a scatterplot that can be popped up transiently and manipulated with rapid, fluid gestures to select and modify the axes of its scatterplot. Second, the FlowVizMenu can be used to steer an attribute-driven layout of the network, causing certain nodes of a node-link diagram to move toward their corresponding positions in a scatterplot while others can be positioned manually or by force-directed layout. Third, we describe a novel hybrid approach that combines a scatterplot matrix (SPLOM) and parallel coordinates called the Parallel Scatterplot Matrix (P-SPLOM), which can be used to visualize and select features within the network. We also describe a novel arrangement of scatterplots called the Scatterplot Staircase (SPLOS) that requires less space than a traditional scatterplot matrix. Initial user feedback is reported.

5.
Linear models are commonly used to identify trends in data. While it is an easy task to build linear models from pre-selected variables, it is challenging to select the best variables from a large number of alternatives. Most metrics for selecting variables are global in nature and thus not useful for identifying local patterns. In this work, we present an integrated framework with visual representations that allows the user to incrementally build and verify models in three model spaces that support local pattern discovery and summarization: model complementarity, model diversity, and model representivity. Visual representations are designed and implemented for each of the model spaces. Our visualizations enable the discovery of complementary variables, i.e., those that perform well in modeling different subsets of data points. They also support the isolation of local models based on a diversity measure. Furthermore, the system integrates a hierarchical representation to identify outlier local trends and local trends that share similar directions in model space. A case study on financial risk analysis is discussed, followed by a user study.
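A minimal sketch of the underlying notion of a local model, assuming a numeric dataset: fit a linear model per data subset and compare goodness of fit, which is what makes "complementary" variables (those that model different subsets well) visible. The subset split here is illustrative:

```python
import numpy as np

def local_fit(X, y, subset):
    """Least-squares fit of y ~ X on one data subset; returns R^2."""
    Xs = np.column_stack([X[subset], np.ones(subset.sum())])
    coef, *_ = np.linalg.lstsq(Xs, y[subset], rcond=None)
    resid = y[subset] - Xs @ coef
    ss_res = (resid ** 2).sum()
    ss_tot = ((y[subset] - y[subset].mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(1)
X = rng.random((200, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.1, 200)
lower = X[:, 0] < 0.5                        # two local subsets of the data
print(local_fit(X, y, lower), local_fit(X, y, ~lower))
```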

6.
Virologists are interested not only in point mutations in a genome but also in relationships between mutations. In this work, we present a design study to support the discovery of correlated mutation events (called co-occurrences) in populations of viral genomes. The key challenge is to identify potentially interesting pairs of events within the vast space of event combinations. We identify analyst requirements and develop a prototype through a participatory process. The key ideas of our approach are to use interest metrics for dynamic filtering that guides the viewer to interesting and relevant correlations of genome mutations, and to provide visual encodings designed to fit scientists' mental map of the data. We demonstrate the strength of our approach in virology-situated case studies and offer suggestions for extending our strategy to other sequence-based domains.
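A minimal sketch of pairwise co-occurrence scoring, with hypothetical genomes represented as sets of mutation identifiers; Jaccard similarity stands in for the paper's interest metrics, and the threshold mimics the dynamic filtering:

```python
from itertools import combinations

# Hypothetical input: each viral genome as a set of observed mutations.
genomes = [{"A23T", "G145C"}, {"A23T", "G145C", "T300G"},
           {"G145C"}, {"A23T", "G145C"}]

def jaccard(a, b, genomes):
    """Fraction of genomes carrying both mutations among those carrying either."""
    has_a = {i for i, g in enumerate(genomes) if a in g}
    has_b = {i for i, g in enumerate(genomes) if b in g}
    return len(has_a & has_b) / len(has_a | has_b)

mutations = sorted(set().union(*genomes))
pairs = {(a, b): jaccard(a, b, genomes)
         for a, b in combinations(mutations, 2)}
# Dynamic filtering: keep only pairs above an interest threshold.
print({p: round(s, 2) for p, s in pairs.items() if s >= 0.5})
```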

7.
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and classifiers learned from the major set are used to identify noise in the minor set. The drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it becomes either physically impossible or too time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be technically infeasible or expressly forbidden to download data from other sites (for security or privacy reasons). These approaches therefore have severe limitations in conducting effective global data cleansing on large, distributed datasets. In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed effort tries to identify and eliminate mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets, or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset and use these good rules to evaluate the whole dataset. For a given instance I_k, two error count variables are used to count the number of times it has been identified as noise by all data subsets; instances with higher error values have a higher probability of being mislabeled examples. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach. A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920–927.
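A minimal sketch of the local-analysis/global-incorporation idea, using off-the-shelf decision trees rather than the paper's rule learner: split the data into subsets, learn a classifier per subset, let every classifier vote on every instance, and flag instances by a majority or non-objection threshold:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def flag_noise(X, y, n_subsets=4, scheme="majority", seed=0):
    """Count, for each instance, how many subset-level classifiers
    disagree with its label, then flag by a threshold scheme."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    errors = np.zeros(len(X), dtype=int)     # error count per instance
    for part in np.array_split(idx, n_subsets):
        clf = DecisionTreeClassifier(max_depth=3).fit(X[part], y[part])
        errors += (clf.predict(X) != y)      # each local model votes globally
    if scheme == "majority":                 # flagged by more than half
        return errors > n_subsets / 2
    return errors == n_subsets               # "non-objection": flagged by all

X = np.random.rand(300, 5)
y = (X[:, 0] > 0.5).astype(int)
y[:10] = 1 - y[:10]                          # inject mislabeled examples
print(flag_noise(X, y).sum(), "instances flagged as noise")
```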

8.
Systems projecting a continuous n-dimensional parameter space to a continuous m-dimensional target space play an important role in science and engineering. If evaluating the system is expensive, however, an analysis is often limited to a small number of sample points. The main contribution of this paper is an interactive approach that enables a continuous analysis of a sampled parameter space with respect to multiple target values. We employ methods from statistical learning to predict results in real time at any user-defined point and its neighborhood. In particular, we describe techniques to guide the user to potentially interesting parameter regions, and we visualize the inherent uncertainty of predictions in 2D scatterplots and parallel coordinates. An evaluation describes a real-world scenario in the application context of car engine design and reports feedback from domain experts. The results indicate that our approach is suitable for accelerating a local sensitivity analysis of multiple target dimensions and for determining a sufficient local sampling density for interesting parameter regions.
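A minimal sketch of predicting at user-defined points with uncertainty, using Gaussian process regression as the statistical learning method; the paper does not prescribe this exact model, and the "expensive system" here is a hypothetical stand-in:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical expensive system, evaluated at only a few sample points.
def system(p):                               # p: (n, 2) parameter settings
    return np.sin(p[:, 0]) * np.cos(p[:, 1])

rng = np.random.default_rng(0)
P = rng.uniform(0, np.pi, (30, 2))           # sparse sampling of the space
gp = GaussianProcessRegressor(kernel=RBF()).fit(P, system(P))

query = np.array([[1.0, 2.0]])               # user-defined point
mean, std = gp.predict(query, return_std=True)
print(f"prediction {mean[0]:.3f} +/- {std[0]:.3f}")  # std drives uncertainty views
```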

9.
Multivariate Discretization for Set Mining
Many algorithms in data mining can be formulated as a set-mining problem, where the goal is to find conjunctions (or disjunctions) of terms that meet user-specified constraints. Set-mining techniques have largely been designed for categorical or discrete data, where variables can take on only a fixed number of values. However, many datasets also contain continuous variables, and a common way of dealing with these is to discretize them by breaking them into ranges. Most discretization methods are univariate and consider only a single feature at a time (sometimes in conjunction with a class variable). We argue that this is a suboptimal approach for knowledge discovery, as univariate discretization can destroy hidden patterns in the data. Discretization should instead consider the effects on all variables in the analysis: two regions X and Y should be in the same interval after discretization only if the instances in those regions have similar multivariate distributions (F_X = F_Y) across all variables and combinations of variables. We present a bottom-up merging algorithm to discretize continuous variables based on this rule. Our experiments indicate that the approach is feasible, that it does not destroy hidden patterns, and that it generates meaningful intervals. Received 14 November 2000 / Revised 1 February 2001 / Accepted in revised form 1 May 2001
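A minimal sketch of the merging rule: adjacent intervals of a continuous variable are merged only when the remaining variables look alike in the two intervals. A per-variable two-sample Kolmogorov-Smirnov test stands in here for a full multivariate comparison, and the bin count and significance level are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

def merge_intervals(x, others, n_bins=10, alpha=0.05):
    """Bottom-up merge of equal-width bins of x; two adjacent bins are
    merged when no other variable differs significantly between them."""
    edges = list(np.linspace(x.min(), x.max(), n_bins + 1))
    merged = True
    while merged and len(edges) > 2:
        merged = False
        for i in range(1, len(edges) - 1):
            left = (x >= edges[i - 1]) & (x < edges[i])
            right = (x >= edges[i]) & (x < edges[i + 1])
            if left.sum() < 5 or right.sum() < 5:
                continue
            similar = all(ks_2samp(v[left], v[right]).pvalue > alpha
                          for v in others)
            if similar:                      # distributions alike: F_X = F_Y
                del edges[i]
                merged = True
                break
    return edges

rng = np.random.default_rng(0)
x = rng.random(1000)
other = np.where(x < 0.5, 0.0, 5.0) + rng.standard_normal(1000)
print(merge_intervals(x, [other]))           # an edge should survive near 0.5
```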

10.
11.
Scatterplot matrices, or SPLOMs, provide a feasible method of visualizing and representing multi-dimensional data, especially for a small number of dimensions. For very high-dimensional data, we introduce a novel technique to summarize a SPLOM as a clustered matrix of glyphs, or Glyph SPLOM. Each glyph visually encodes a general measure of dependency strength, the distance correlation, and a logical dependency class based on the occupancy of the scatterplot quadrants. We present the Glyph SPLOM as a general alternative to the traditional correlation-based heatmap and the scatterplot matrix in two examples: demography data from the World Health Organization (WHO), and gene expression data from developmental biology. By using both dependency class and strength, the Glyph SPLOM illustrates high-dimensional data in more detail than a heatmap, but with more summarization than a SPLOM. More importantly, the summarization capabilities of the Glyph SPLOM allow for the assertion of “necessity” causal relationships in the data and the reconstruction of interaction networks in various dynamic systems.
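A minimal sketch of the glyphs' dependency-strength ingredient: the distance correlation of two variables, computed from double-centered pairwise distance matrices. Unlike Pearson correlation, it also detects nonlinear dependencies:

```python
import numpy as np

def distance_correlation(x, y):
    """Distance correlation of two 1D samples (Szekely-style dCor)."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])  # pairwise distance matrix
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A = centered(np.asarray(x, float))
    B = centered(np.asarray(y, float))
    dcov2 = (A * B).mean()                   # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

x = np.linspace(-1, 1, 200)
# Pearson correlation of x and x^2 is ~0 here, but dCor is clearly > 0:
print(round(distance_correlation(x, x ** 2), 3))
```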

12.
One of the common endeavours in engineering applications is outlier detection, which aims to identify inconsistent records in large amounts of data. Although outlier detection schemes from the data mining discipline are acknowledged as a viable solution for efficiently identifying anomalies in such data repositories, current outlier mining algorithms require domain parameters as input. These parameters are often unknown, difficult to determine, and vary across datasets with different cluster features. This paper presents a novel resolution-based outlier notion and a nonparametric outlier-mining algorithm that can efficiently identify and rank the top outliers of a wide variety of datasets. The algorithm generates reasonable outlier results by taking both local and global features of a dataset into account. Experiments are conducted using both synthetic datasets and a real-life construction equipment dataset from a large road building contractor. Comparison with current outlier mining algorithms indicates that the proposed algorithm is more effective and can be integrated into a decision support system to serve as a universal detector of potentially inconsistent records.

13.
The cycle plot is an established and effective visualization technique for identifying and comprehending patterns in periodic time series, such as trends and seasonal cycles. It also allows one to visually identify and contextualize extreme values and outliers from a different perspective. Unfortunately, it is limited to univariate data; for multivariate time series, patterns that exist across several dimensions are much harder or impossible to explore. We propose a modified cycle plot that uses a distance-based abstraction (the Mahalanobis distance) to reduce multiple dimensions to one overview dimension while retaining a representation similar to the original. Utilizing this distance-based cycle plot in an interactive exploration environment, we enhance the visual analytics capacity of cycle plots for multivariate outlier detection. To enable interactive exploration and interpretation of outliers, we employ coordinated multiple views that juxtapose a distance-based cycle plot with Cleveland's original cycle plots of the underlying dimensions. With our approach, it is possible to judge outlyingness with respect to the seasonal cycle in multivariate periodic time series.
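A minimal sketch of the distance-based abstraction, assuming the multivariate series is an (n, d) array: each time point is reduced to its Mahalanobis distance from the overall mean, yielding one overview dimension that can be drawn as an ordinary cycle plot:

```python
import numpy as np

def mahalanobis_series(X):
    """Reduce an (n, d) multivariate time series to one distance value
    per time point, relative to the series' mean and covariance."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # Quadratic form diff @ cov_inv @ diff per row:
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 3))            # e.g., 10 years of monthly data
X[37] += 6                                   # inject a multivariate outlier
d = mahalanobis_series(X)
print(d.argmax(), round(d.max(), 2))         # time point 37 stands out
```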

14.
Biochemical research often involves examining structural relationships in molecules, since scientists strongly believe in a causal relationship between structure and function. Traditionally, researchers have identified these patterns, or motifs, manually using domain expertise. However, with the massive influx of new biochemical data and the ability to gather data for very large molecules, there is great need for techniques that automatically and efficiently identify commonly occurring structural patterns in molecules. Previous automated substructure discovery approaches have each introduced variations of similar underlying techniques and have embedded domain knowledge. While doing so improves performance for the particular domain, it complicates extensibility to other domains. These approaches also do not address scalability or noise, which is critical for macromolecules such as proteins. In this paper, we present MotifMiner, a general framework for efficiently identifying common motifs in most scientific molecular datasets. The approach combines structure-based frequent-pattern discovery with search-space reduction and coordinate noise handling. We describe both the framework and several algorithms, and demonstrate the flexibility of our system by analyzing protein and drug biochemical datasets.

15.
As geospatial data grows explosively, there is great demand for incorporating data mining techniques into a geospatial context. Association rule mining is a core technique in data mining and a solid candidate for the associative analysis of large geospatial databases. In this article, we propose a geospatial knowledge discovery framework for automating the detection of multivariate associations based on a given areal base map. We investigate a series of geospatial preprocessing steps involving data conversion and classification so that traditional Boolean and quantitative association rule mining can be applied. Our framework has been integrated into GISs using a dynamic link library, automating both the preprocessing and data mining phases for greater ease of use. Experiments with real crime datasets quickly reveal interesting frequent patterns and multivariate associations, which demonstrates the robustness and efficiency of our approach.
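A minimal sketch of the preprocessing-then-mining pipeline, with hypothetical crime records: snap point events to grid cells of a base map, turn each occupied cell into a Boolean transaction, and count frequent attribute combinations (a tiny stand-in for a full association rule miner):

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: (x, y, attribute) point events.
events = [(0.1, 0.2, "burglary"), (0.15, 0.22, "vandalism"),
          (0.9, 0.8, "burglary"), (0.12, 0.18, "burglary"),
          (0.88, 0.79, "assault"), (0.11, 0.25, "vandalism")]

def to_transactions(events, cell=0.5):
    """Spatial discretization: one transaction (set of attributes)
    per occupied grid cell of the areal base map."""
    cells = {}
    for x, y, attr in events:
        cells.setdefault((int(x / cell), int(y / cell)), set()).add(attr)
    return list(cells.values())

transactions = to_transactions(events)
support = Counter(pair for t in transactions
                  for pair in combinations(sorted(t), 2))
print(support.most_common(3))                # frequent multivariate associations
```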

16.
Top-Rank-K Frequent Itemset (or Pattern) Mining (FPM) is an important data mining task in which the user decides on the number of top frequency ranks of patterns (itemsets) to mine from a transactional dataset. This problem does not require the minimum-support threshold parameter typically used in FPM; rather, algorithms solving the Top-Rank-K FPM problem are given K, the number of frequency ranks of itemsets required, and compute the threshold internally. This paper presents two declarative approaches to the Top-Rank-K Closed FPM problem. The first approach is Boolean Satisfiability-based (SAT-based): we propose an effective encoding for the problem along with an efficient algorithm employing this encoding. The second approach is CP-based, i.e., it utilizes Constraint Programming: a simple CP model is exploited in an innovative manner to mine the Top-Rank-K Closed FPM itemsets from transactional datasets. Both approaches are evaluated experimentally against other declarative and imperative algorithms. The proposed SAT-based approach significantly outperforms IM, another SAT-based approach, and outperforms the proposed CP approach on sparse and moderate datasets, whereas the latter excels on dense datasets. An extensive study has been conducted to assess the proposed approaches in terms of their feasibility, performance factors, and practicality of use.
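A minimal brute-force sketch of the task definition (not the SAT or CP encodings, and ignoring the closedness constraint): enumerate itemset supports, take the K highest distinct support values as the frequency ranks, and return the itemsets falling in those ranks. No minimum-support parameter is needed:

```python
from itertools import combinations

def top_rank_k_itemsets(transactions, k=2, max_len=3):
    """Brute-force Top-Rank-K frequent itemsets: the support threshold
    follows from the K highest distinct support values."""
    support = {}
    for t in transactions:
        for size in range(1, max_len + 1):
            for itemset in combinations(sorted(t), size):
                support[itemset] = support.get(itemset, 0) + 1
    top_ranks = sorted(set(support.values()), reverse=True)[:k]
    return {s: sorted(i for i, v in support.items() if v == s)
            for s in top_ranks}

data = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
# Rank 1 (support 3): singletons; rank 2 (support 2): the pairs.
print(top_rank_k_itemsets(data, k=2))
```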

17.
Time-series data are a common target for visual analytics, as they appear in a wide range of application domains. Typical tasks in analyzing time-series data include identifying cyclic behavior, outliers, trends, and periods of time that share distinctive shape characteristics. Many methods for visualizing time-series data exist, generally mapping the data values to positions or colors. While each can be used to perform a subset of the above tasks, none to date is a complete solution. In this paper we present a novel approach to time-series data visualization: creating multivariate data records out of short subsequences of the data and then using multivariate visualization methods to display and explore the data in the resulting shape space. We borrow ideas from text analysis, where the use of N-grams is a common approach to decomposing and processing unstructured text. By mapping each temporal N-gram to a glyph and then positioning the glyphs via PCA (essentially a projection in shape space), many different kinds of patterns in the sequence can be readily identified. Interactive selection via brushing, in conjunction with linking to other visualizations, provides a wide range of tools for exploring the data. We validate the usefulness of this approach with examples from several application domains and tasks, comparing our methods with traditional time-series visualizations.
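A minimal sketch of the shape-space construction: cut a univariate series into overlapping temporal N-grams, treat each N-gram as a multivariate record, and project the records to 2D with PCA; each projected point is where the corresponding glyph would be drawn. The mean-centering per window (so position encodes shape rather than level) is one plausible choice, not necessarily the paper's:

```python
import numpy as np
from sklearn.decomposition import PCA

def ngram_shape_space(series, n=8):
    """Overlapping length-n subsequences projected to 2D via PCA."""
    grams = np.lib.stride_tricks.sliding_window_view(series, n)
    grams = grams - grams.mean(axis=1, keepdims=True)  # keep shape, drop level
    return PCA(n_components=2).fit_transform(grams)

t = np.arange(400)
series = np.sin(t / 10) + 0.1 * np.random.default_rng(0).standard_normal(400)
coords = ngram_shape_space(series)
print(coords.shape)                          # (393, 2): one glyph per N-gram
```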

18.
Radial axes plots are multivariate visualization techniques that extend scatterplots in order to represent high-dimensional data as points on an observable display. Well-known methods include star coordinates and principal component biplots, which represent data attributes as vectors that define axes and produce linear dimensionality-reduction mappings. In this paper we propose a hybrid approach that bridges the gap between star coordinates and principal component biplots, which we call “adaptable radial axes plots”. It is based on solving convex optimization problems in which users can: (a) update the axis vectors interactively, as in star coordinates, while producing mappings that make it possible to estimate attribute values optimally through labeled axes, similarly to principal component biplots; (b) use different norms in order to explore additional nonlinear mappings of the data; and (c) include weights and constraints in the optimization problems for sorting the data along one axis. The result is a flexible technique that complements, extends, and enhances current radial methods for data analysis.
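A minimal sketch of the two mappings being bridged, assuming a data matrix X and user-chosen axis vectors V: star coordinates project as X V, and attribute values can be estimated back from the plot by least squares. This plain least-squares reading is a simplification of the paper's convex programs (which also support other norms, weights, and constraints):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # 5 attributes per data point
V = rng.standard_normal((5, 2))              # one 2D axis vector per attribute

P = X @ V                                    # star-coordinates projection

# Estimate attribute values back from the 2D embedding (least squares,
# i.e., the minimum-norm solution of V.T x = p for each plotted point):
X_hat, *_ = np.linalg.lstsq(V.T, P.T, rcond=None)
err = np.abs(X_hat.T - X).mean()
print(f"mean attribute-estimation error: {err:.3f}")
```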

19.
There is significant interest in the data mining and network management communities in improving existing techniques for clustering multivariate network traffic flow records, so that underlying traffic patterns can be inferred quickly. In this paper, we investigate the use of clustering techniques to identify interesting traffic patterns from network traffic data in an efficient manner. We develop a framework that handles mixed-type attributes, including numerical, categorical, and hierarchical attributes, for a one-pass hierarchical clustering algorithm. We demonstrate the improved accuracy and efficiency of our approach in comparison to previous work on clustering network traffic.
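A minimal sketch of one ingredient such a framework needs, a distance over mixed-type flow attributes: numerical fields range-normalized, categorical fields compared by equality, and hierarchical fields (such as IP prefixes) by shared depth. The field layout and weighting here are illustrative, not the paper's exact definitions:

```python
def mixed_distance(a, b, num_ranges):
    """Distance between two flow records with numerical, categorical,
    and hierarchical parts; each part contributes a value in [0, 1]."""
    d_num = sum(abs(x - y) / r for x, y, r in
                zip(a["num"], b["num"], num_ranges)) / len(num_ranges)
    d_cat = sum(x != y for x, y in zip(a["cat"], b["cat"])) / len(a["cat"])
    shared = 0
    for x, y in zip(a["hier"], b["hier"]):   # path from root downwards
        if x != y:
            break
        shared += 1
    d_hier = 1 - shared / len(a["hier"])
    return (d_num + d_cat + d_hier) / 3      # unweighted average of the parts

f1 = {"num": [80, 1500], "cat": ["tcp"], "hier": ["10", "0", "0", "1"]}
f2 = {"num": [443, 60], "cat": ["tcp"], "hier": ["10", "0", "3", "7"]}
print(round(mixed_distance(f1, f2, num_ranges=[65535, 9000]), 3))
```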

20.
Finding clusters in large datasets is a difficult task. Almost all computationally feasible methods are related to k-means and need a clear partition structure in the data, while most such datasets contain masking outliers and other deviations from the usual models of partitioning cluster analysis. It is possible to look for clusters informally using graphical tools like the grand tour, but the meaning and validity of such patterns is unclear. In this paper, a three-step approach is suggested. In the first step, data visualization methods like the grand tour are used to find cluster candidate subsets of the data. In the second step, reproducible clusters are generated from them by means of fixed point clustering, a method that finds a single cluster at a time based on the Mahalanobis distance. In the third step, the validity of the clusters is assessed using classification plots. The approach is applied to an astronomical dataset of spectra from the Hamburg/ESO survey.
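A minimal sketch of the second step, assuming the usual Mahalanobis formulation: starting from a candidate subset, repeatedly re-estimate mean and covariance and re-admit all points within a Mahalanobis threshold until the member set no longer changes (a fixed point). The threshold value is illustrative:

```python
import numpy as np

def fixed_point_cluster(X, start, threshold=3.0, max_iter=100):
    """Iterate membership updates until a fixed point is reached."""
    members = start.copy()
    for _ in range(max_iter):
        mu = X[members].mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X[members], rowvar=False))
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        new_members = d2 <= threshold ** 2   # re-admit points within distance
        if np.array_equal(new_members, members):
            return members                   # fixed point found
        members = new_members
    return members

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (50, 2))])
start = np.zeros(len(X), dtype=bool)
start[:30] = True                            # candidate subset from step one
print(fixed_point_cluster(X, start).sum())   # roughly the first cluster's size
```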
