Similar Documents
20 similar documents found (search time: 15 ms)
1.
2.
In this paper we consider the problem of identifying the most influential (or central) group of nodes (of some predefined size) in a network. Such a group has the largest value of betweenness centrality or one of its variants, for example, the length-scaled or the bounded-distance betweenness centralities. We demonstrate that this problem can be modelled as a mixed integer program (MIP) that can be solved for reasonably sized network instances using off-the-shelf MIP solvers. We also discuss interesting relations between the group betweenness and the bounded-distance betweenness centrality concepts. In particular, we exploit these relations in an algorithmic scheme to identify approximate solutions for the original problem of identifying the most central group of nodes. Furthermore, we generalize our approach for identification of not only the most central groups of nodes, but also central groups of graph elements that consist of either nodes or edges exclusively, or their combination according to some pre-specified criteria. If necessary, additional cohesiveness properties can also be enforced, for example, the targeted group should form a clique or a k-club. Finally, we conduct extensive computational experiments with different types of real-life and synthetic network instances to show the effectiveness and flexibility of the proposed framework. Even more importantly, our experiments reveal some interesting insights into the properties of influential groups of graph elements modelled using the maximum betweenness centrality concept or one of its variations.
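For intuition about what group betweenness measures, here is a brute-force sketch (not the paper's MIP formulation): the score of a group is the summed fraction of shortest paths between outside node pairs that pass through at least one group member. The graph and node labels below are illustrative.

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate every shortest s-t path in an unweighted graph (BFS + DAG walk)."""
    dist, preds, q = {s: 0}, {s: []}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], preds[v] = dist[u] + 1, [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)
    if t not in dist:
        return []
    paths = []
    def walk(v, suffix):
        if v == s:
            paths.append([s] + suffix)
        else:
            for p in preds[v]:
                walk(p, [v] + suffix)
    walk(t, [])
    return paths

def group_betweenness(adj, group):
    """Summed fraction of shortest paths between outside pairs crossing the group."""
    outside = [v for v in adj if v not in group]
    total = 0.0
    for s, t in combinations(outside, 2):
        paths = all_shortest_paths(adj, s, t)
        if paths:
            total += sum(any(v in group for v in p[1:-1]) for p in paths) / len(paths)
    return total

# A 5-node path graph 0-1-2-3-4: node 2 lies on every path crossing the middle.
path5 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

This exhaustive enumeration is exponential in the worst case, which is precisely why the paper resorts to a MIP formulation for larger instances.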

3.
The methods of cluster analysis are applied to ultrasonic testing data of welded joints. The methods of principal component analysis, K-means clustering, and support vector machines are considered. The application methodology and the results obtained are presented. The article was translated by the authors.

4.
Cluster analysis for gene expression data: a survey   (cited by 16: 0 self-citations, 16 by others)
DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms aimed specifically at gene expression data have recently been proposed. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.
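As a concrete instance of the partitioning idea the survey describes, a minimal Lloyd's k-means sketch on toy expression profiles (the profile values below are made up; real microarray data would be normalized first):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            # assign each point to its nearest center (squared Euclidean distance)
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[labels[i]].append(p)
        # recompute centers as cluster means; keep the old center if a cluster empties
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, labels

# Four hypothetical expression profiles over 4 samples: two co-expressed groups.
profiles = [(5.0, 5.1, 0.2, 0.1), (4.8, 5.3, 0.0, 0.3),
            (0.1, 0.2, 5.0, 4.9), (0.3, 0.0, 5.2, 5.1)]
centers, labels = kmeans(profiles, 2)
```

On well-separated profiles like these, the two co-expressed pairs end up in separate clusters regardless of initialization.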

5.
In cluster analysis, using validity indices to determine the correct number of clusters in a dataset is highly susceptible to noise, to the degree of separation between clusters, and to the clustering algorithm itself, so the correctness of the determined number of clusters is hard to guarantee. To overcome this problem, building on the data-reduction method of reference [1], validity indices are applied to both the original dataset and the reduced dataset to judge the correct number of clusters. Experiments show that this method increases the separation between clusters and effectively determines the optimal number of clusters for a dataset.
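To illustrate what a validity index measures, here is a toy between/within separation ratio (not one of the specific indices the paper evaluates): larger values indicate labelings whose clusters are tighter and farther apart.

```python
from itertools import combinations

def separation_index(points, labels):
    """Crude validity index: mean between-centroid distance divided by mean
    within-cluster point-to-centroid distance. Larger suggests better separation."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    centroids = {l: tuple(sum(d) / len(cl) for d in zip(*cl))
                 for l, cl in clusters.items()}
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    within = sum(dist(p, centroids[l]) for p, l in zip(points, labels)) / len(points)
    cents = list(centroids.values())
    npairs = max(1, len(cents) * (len(cents) - 1) // 2)
    between = sum(dist(a, b) for a, b in combinations(cents, 2)) / npairs
    return between / within

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = separation_index(pts, [0, 0, 1, 1])   # the true grouping
bad = separation_index(pts, [0, 1, 0, 1])    # a mixed grouping
```

The paper's point is that noise can flatten exactly this kind of ratio, which is why it evaluates the index on both the original and the reduced dataset.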

6.
Cosmological N-body simulations on parallel computers produce large datasets: gigabytes at each instant of simulated cosmological time, and hundreds of gigabytes over the course of a simulation. These large datasets require further analysis before they can be compared to astronomical observations. The “Halo World” tools include two methods for performing halo finding: identifying all of the gravitationally stable clusters in a point-sampled density field. One of these methods is a parallel implementation of the friends of friends (FOF) algorithm, widely used in the field of N-body cosmology. The new IsoDen method based on isodensity surfaces has been developed to overcome some of the shortcomings of FOF. Parallel processing is the only viable way of obtaining the necessary performance and storage capacity to carry out these analysis tasks. Ultimately, we must also plan to use disk storage as the only economically viable alternative for storing and manipulating such large data sets. Both IsoDen and friends of friends have been implemented on a variety of computer systems, with parallelism up to 512 processors, and successfully used to extract halos from simulations with up to 16.8 million particles.
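The friends-of-friends idea is simple enough to sketch serially (the paper's contribution is the parallel implementation; this toy version uses union-find and made-up 2-D particle positions):

```python
def friends_of_friends(points, link):
    """Link any two particles closer than `link`; the connected components
    of the resulting 'friendship' graph are the halos."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 < link * link:
                parent[find(i)] = find(j)
    halos = {}
    for i in range(n):
        halos.setdefault(find(i), []).append(i)
    return sorted(halos.values())

# Two clumps of particles: one chain of three, one pair far away.
particles = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
```

Note the chaining behavior the IsoDen method is designed to mitigate: particles 0 and 2 are farther apart than the linking length yet land in the same halo via particle 1.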

7.
A cluster hierarchy optimization algorithm based on multifractals   (cited by 2: 0 self-citations, 2 by others)
闫光辉, 李战怀, 党建武. Journal of Software (《软件学报》), 2008, 19(6): 1283-1300
Similarities of varying strength exist among a large number of initial clustering results, which makes it harder for users to understand and describe those results and in turn hinders subsequent data mining work. Traditional clustering algorithms, which focus on cluster shape and spatial adjacency, or which assume a globally uniform data density, have difficulty solving this problem in practice. To address it, a fractal-based cluster hierarchy optimization algorithm, FCHO (fractal-based cluster hierarchy optimization), is proposed. Building on multifractal theory, FCHO measures the similarity between initial clusters using the multifractal dimensions of the clusters and the degree to which the multifractal dimension changes when clusters are merged, and finally generates a cluster family tree that reflects the natural aggregation state of the data. In addition, the time and space complexity of the algorithm is briefly analyzed, and experiments on synthetic and benchmark datasets confirm its effectiveness.
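FCHO works with multifractal (generalized) dimensions; as a simpler illustration of the fractal-dimension machinery it builds on, here is a two-scale box-counting estimate of the plain box-counting dimension (the point set is illustrative):

```python
import math

def box_count_dimension(points, eps_coarse, eps_fine):
    """Two-scale box-counting estimate of fractal dimension:
    D ~ log(N_fine / N_coarse) / log(eps_coarse / eps_fine),
    where N(eps) is the number of grid boxes of side eps containing a point."""
    def occupied(eps):
        return len({tuple(c // eps for c in p) for p in points})
    return (math.log(occupied(eps_fine) / occupied(eps_coarse))
            / math.log(eps_coarse / eps_fine))

# 100 collinear points should look one-dimensional.
line = [(i, 0) for i in range(100)]
```

Intuitively, FCHO's merge criterion asks: does the combined point set fill space the same way the two clusters do separately? If merging barely changes the (multi)fractal dimension, the clusters are likely parts of one natural group.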

8.
Bio-chip data consist of high-dimensional attributes, with far more attributes than specimens. It is therefore difficult to estimate a covariance matrix from tens of thousands of genes measured on only a small number of samples. Feature selection and extraction are critical for removing noisy features and reducing dimensionality in microarray analysis. This study aims to fill the gap by developing a data mining framework with a proposed algorithm for cluster analysis of gene expression data, in which correlation coefficients are employed to arrange genes. Indeed, cluster analysis of microarray data can find coherent patterns of gene expression. The output is displayed as a table list for convenient survey. We adopt a breast cancer microarray dataset to demonstrate the practical viability of this approach.
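A minimal sketch of the correlation-based arrangement idea, assuming a greedy nearest-correlation ordering (the paper does not spell out its exact ordering rule, and the gene names and values below are hypothetical):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def arrange_genes(expr):
    """Greedy ordering: start anywhere, repeatedly append the unplaced gene
    most correlated with the last placed one."""
    names = list(expr)
    order = [names.pop(0)]
    while names:
        nxt = max(names, key=lambda g: pearson(expr[order[-1]], expr[g]))
        names.remove(nxt)
        order.append(nxt)
    return order

# Toy expression vectors: g2 tracks g1 exactly; g3 is anti-correlated with both.
expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
```

Placing co-expressed genes in adjacent rows is what makes the resulting table convenient to survey by eye.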

9.
《Data Processing》1986,28(1):6-9
Data distribution can simplify access to computer held information, but it is important to weigh up the advantages and disadvantages, for not all types of data benefit. The four basic schemes for data distribution are centralization, decentralization, partitioning and replication.

10.
The number of ontologies and semantic annotations available on the Web is constantly growing. This new type of complex and heterogeneous graph-structured data raises new challenges for the data mining community. In this paper, we present a novel method for mining association rules from semantic instance data repositories expressed in RDF/(S) and OWL. We take advantage of the schema-level (i.e. TBox) knowledge encoded in the ontology to derive appropriate transactions which will later feed traditional association rules algorithms. This process is guided by the analyst requirements, expressed in the form of query patterns. Initial experiments performed on semantic data of a biomedical application show the usefulness and efficiency of the approach.
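A minimal sketch of the triples-to-transactions step, assuming the analyst's query pattern reduces to a set of predicates of interest (the subject and predicate names below are hypothetical, and real patterns would be expressed as queries, not a plain filter):

```python
def triples_to_transactions(triples, pattern_predicates):
    """Build one transaction per subject from (s, p, o) triples, keeping only
    predicates named by the analyst's query pattern."""
    tx = {}
    for s, p, o in triples:
        if p in pattern_predicates:
            tx.setdefault(s, set()).add((p, o))
    return tx

triples = [("patient1", "hasDiagnosis", "flu"),
           ("patient1", "takesDrug", "aspirin"),
           ("patient1", "bornIn", "1970"),
           ("patient2", "hasDiagnosis", "flu")]
tx = triples_to_transactions(triples, {"hasDiagnosis", "takesDrug"})
```

Once instances are flattened into transactions like these, any off-the-shelf association rule miner can be applied, which is the division of labor the paper proposes.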

11.
In this article, the author describes issues of scale that arise at all stages of the visual exploration process and gives examples of some solutions to these problems that his group at Worcester Polytechnic Institute has experimented with over the past several years. While the work has predominantly focused on nonspatial multivariate data analysis, he feels that many of the concepts and approaches described are relevant to the visualization of other types of data and information.

12.
Finding localized associations in market basket data   (cited by 2: 0 self-citations, 2 by others)
In this paper, we discuss a technique for discovering localized associations in segments of the data using clustering. Often, the aggregate behavior of a data set may be very different from that of its localized segments. In such cases, it is desirable to design algorithms that are effective in discovering localized associations, because they expose customer patterns more specific than the aggregate behavior. This information may be very useful for target marketing. We present empirical results which show that the method is indeed able to find a significantly larger number of associations than can be discovered by analysis of the aggregate data.
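The effect is easy to demonstrate with support counting on toy baskets (the items and segment split below are invented; the paper derives segments by clustering rather than assuming them):

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Support counting for 1- and 2-itemsets, enough to show the effect."""
    counts = Counter()
    for t in transactions:
        for r in (1, 2):
            for iset in combinations(sorted(t), r):
                counts[iset] += 1
    n = len(transactions)
    return {iset for iset, c in counts.items() if c / n >= minsup}

segment = [{"beer", "diapers"}] * 3 + [{"milk"}]   # one customer segment
rest = [{"bread", "milk"}] * 4                     # remainder of the data
```

The pair ("beer", "diapers") has support 0.75 within the segment but only 0.375 over the full data, so a global pass at a 0.6 threshold never surfaces it: exactly the localized association the paper targets.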

13.
Finding trading patterns in stock market data   (cited by 1: 0 self-citations, 1 by others)
This article describes our design and evaluation of a multisensory human perceptual tool for the real-world task domain of stock market trading. The tool is complementary in that it displays different information to different senses - our design incorporates both a 3D visual and a 2D sound display. The results of evaluating the tool in a formal experiment are complex. The data mined in this case study is bid-and-ask data - also called depth-of-market data - from the Australian Stock Exchange. Our visual-auditory display is the bid-ask-landscape, which we developed through many iterations in close collaboration with an expert in the stock market domain. From this domain's perspective, the project's principal goal was to develop a tool to help traders uncover new trading patterns in depth-of-market data. In this article, we not only describe the design of the bid-ask-landscape but also report on a formal evaluation of this visual-auditory display. We tested nonexperts on their ability to use the tool to predict the future direction of stock prices.

14.
A set of 13 extensively used hemodynamic, ventilatory and gas-analysis variables is measured (on-line or off-line) on 200 patients in an intensive care unit (ICU) during the 6 h immediately following cardiac surgery. In order to identify both low- and high-risk patterns, a clustering method is applied to these data at three equidistant observation times. Application of the divergence criterion allows a quantitative evaluation of the diversity between the clusters identified, showing that the two patterns are really distinct in the 13-D space. The same criterion is then used to find possible subsets of variables capable of maintaining, in time, an effective separation power. The latter always include the cardiac index (CI), representative of cardiac performance, and two indices related to respiratory efficiency and metabolic rate, i.e., the carbon dioxide production index (VCO2I) and the arterio-venous oxygen difference (avO2D).

15.
16.
DNA microarrays make it possible to study simultaneously the expression of thousands of genes in a biological sample. Univariate clustering techniques have been used to discover target genes with differential expression between two experimental conditions. Because of possible loss of information due to use of univariate summary statistics, it may be more effective to use multivariate statistics. We present multivariate normal mixture model based clustering analyses to detect differential gene expression between two conditions. Deviating from the general mixture model and model-based clustering, we propose mixture models with specific mean and covariance structures that account for special features of two-condition microarray experiments. Explicit updating formulas in the EM algorithm for three such models are derived. The methods are applied to a real dataset to compare the expression levels of 1176 genes of rats with and without pneumococcal middle-ear infection to illustrate the performance and usefulness of this approach. About 10 genes and 20 genes are found to be differentially expressed under six-dimensional and bivariate modeling, respectively. Two simulation studies are conducted to compare the performance of univariate and multivariate methods. Depending on the data, neither method can always dominate the other. The results suggest that multivariate normal mixture models can be useful alternatives to univariate methods for detecting differential gene expression in exploratory data analysis.
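To make the EM mixture machinery concrete, here is a univariate two-component Gaussian mixture fitted by EM (a generic illustration: the paper derives updating formulas for structured multivariate models, not this unstructured case, and the data values are invented):

```python
import math

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=50):
    """EM for a two-component univariate Gaussian mixture."""
    mu1, mu2, s1, s2, pi = min(data), max(data), 1.0, 1.0, 0.5
    n = len(data)
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in data:
            p1 = pi * norm_pdf(x, mu1, s1)
            p2 = (1 - pi) * norm_pdf(x, mu2, s2)
            r.append(p1 / (p1 + p2))
        # M-step: weighted updates, with a small floor on the std deviations
        n1 = sum(r)
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / (n - n1)
        s1 = max(1e-3, (sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1) ** 0.5)
        s2 = max(1e-3, (sum((1 - ri) * (x - mu2) ** 2
                            for ri, x in zip(r, data)) / (n - n1)) ** 0.5)
        pi = n1 / n
    return mu1, mu2

expression = [0.0, 0.2, 0.1, 5.0, 5.2, 5.1]   # toy values from two regimes
mu_low, mu_high = em_two_gaussians(expression)
```

The paper's structured models differ in constraining the means and covariances to reflect the two-condition design, but the E-step/M-step alternation is the same.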

17.
The only way to attack strong cryptographic implementations is to attack the infrastructure upon which they are built. This infrastructure is most often the underlying operating system or middleware, but attacks can also be mounted directly against the hardware upon which the cryptographic implementation is being run. This issue's Crypto Corner describes some of the methods used to induce faults in systems and explains how such faults can be exploited to reveal secret information.

18.
Identifying the most frequent elements in a data stream is a well known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. The use of fast and small memory footprint algorithms is paramount when the number of individuals is very large. In many situations such analysis needs to be performed and kept up to date in near real time. Fortunately, approximate answers are usually adequate when dealing with this problem. This paper presents a new and innovative algorithm that addresses this problem by merging the commonly used counter-based and sketch-based techniques for top-k identification. The algorithm provides the top-k list of elements, their frequency and an error estimate for each frequency value. It also provides strong guarantees on the error estimate, order of elements and inclusion of elements in the list depending on their real frequency. Additionally the algorithm provides stochastic bounds on the error and expected error estimates. Telecommunications customers' behavior and voice-call data are used to present concrete results obtained with this algorithm and to illustrate improvements over previously existing algorithms.
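One classic counter-based ingredient of the kind the paper merges with sketches is Space-Saving (Metwally et al.); a minimal sketch, with an invented call stream:

```python
def space_saving(stream, k):
    """Space-Saving: keep at most k counters; on overflow, evict the minimum
    counter and let the new item inherit its count as an error bound."""
    counters = {}  # item -> (estimated count, max overestimation error)
    for x in stream:
        if x in counters:
            c, e = counters[x]
            counters[x] = (c + 1, e)
        elif len(counters) < k:
            counters[x] = (1, 0)
        else:
            victim = min(counters, key=lambda i: counters[i][0])
            c, _ = counters.pop(victim)
            # true count of x lies in [count - error, count]
            counters[x] = (c + 1, c)
    return counters

calls = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
top = space_saving(calls, 3)
```

Note how the per-item error bound arises: "d" inherits the evicted count of "c", so its reported count of 3 may overestimate its true count (1) by at most 2, exactly the kind of guarantee the abstract describes.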

19.
Presents a method for finding patterns in 3D graphs. Each node in a graph is an undecomposable or atomic unit and has a label. Edges are links between the atomic units. Patterns are rigid substructures that may occur in a graph after allowing for an arbitrary number of whole-structure rotations and translations as well as a small number (specified by the user) of edit operations in the patterns or in the graph. (When a pattern appears in a graph only after the graph has been modified, we call that appearance "approximate occurrence.") The edit operations include relabeling a node, deleting a node and inserting a node. The proposed method is based on the geometric hashing technique, which hashes node-triplets of the graphs into a 3D table and compresses the label-triplets in the table. To demonstrate the utility of our algorithms, we discuss two applications of them in scientific data mining. First, we apply the method to locating frequently occurring motifs in two families of proteins pertaining to RNA-directed DNA polymerase and thymidylate synthase and use the motifs to classify the proteins. Then, we apply the method to clustering chemical compounds pertaining to aromatic compounds, bicyclicalkanes and photosynthesis. Experimental results indicate the good performance of our algorithms and high recall and precision rates for both classification and clustering.

20.
Conventional data mining methods for finding frequent itemsets require considerable computing time to produce their results from a large data set. Due to this reason, it is almost impossible to apply them to an analysis task in an online data stream where a new transaction is continuously generated at a rapid rate. An algorithm for finding frequent itemsets over an online data stream should support flexible trade-off between processing time and mining accuracy. Furthermore, the most up-to-date resulting set of frequent itemsets should be available quickly at any moment. To satisfy these requirements, this paper proposes a data mining method for finding frequent itemsets over an online data stream. The proposed method examines each transaction one-by-one without any candidate generation process. The count of an itemset that appears in each transaction is monitored by a lexicographic tree resided in main memory. The current set of monitored itemsets in an online data stream is minimized by two major operations: delayed-insertion and pruning. The former is delaying the insertion of a new itemset in recent transactions until the itemset becomes significant enough to be monitored. The latter is pruning a monitored itemset when the itemset turns out to be insignificant. The number of monitored itemsets can be flexibly controlled by the thresholds of these two operations. As the number of monitored itemsets is decreased, frequent itemsets in the online data stream are more rapidly traced while they are less accurate. The performance of the proposed method is analyzed through a series of experiments in order to identify its various characteristics.  相似文献
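A simpler single-item relative of the paper's delayed-insertion/pruning scheme is Lossy Counting (Manku and Motwani): items are inserted with an error bound covering the occurrences that may have been missed before insertion, and insignificant counters are pruned at bucket boundaries. A minimal sketch with an invented stream:

```python
import math

def lossy_counting(stream, epsilon):
    """Lossy Counting over single items: count within epsilon*n of the truth."""
    width = math.ceil(1 / epsilon)       # bucket width
    counts = {}                          # item -> (count, max undercount)
    bucket = 1
    for i, x in enumerate(stream, 1):
        if x in counts:
            c, d = counts[x]
            counts[x] = (c + 1, d)
        else:
            # late insertion: occurrences missed earlier are bounded by bucket-1
            counts[x] = (1, bucket - 1)
        if i % width == 0:               # pruning at each bucket boundary
            for item, (c, d) in list(counts.items()):
                if c + d <= bucket:
                    del counts[item]
            bucket += 1
    return counts

stream = ["a"] * 4 + ["b"] + ["a"] * 3 + ["c", "d", "a"]
monitored = lossy_counting(stream, 0.2)
```

As in the paper, tightening the insertion/pruning thresholds shrinks the monitored set and speeds up tracing at the cost of accuracy; the paper extends the same trade-off from single items to itemsets via a lexicographic tree.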

