1.
To simplify the preprocessing steps of spatial frequent pattern mining and improve mining efficiency, this paper proposes FISA (fast intersect spatial Apriori), a mining algorithm that can take spatial vector and raster layers directly as input. The algorithm counts the support of predicate sets through layer intersection and area computation, and thereby mines frequent predicate sets and association rules. Compared with transaction-based spatial association rule mining algorithms, FISA requires no prior transactionization of the spatial data, and every result has a corresponding layer, which makes the results easy to visualize; compared with other mining algorithms based on spatial analysis, FISA supports both vector and raster formats and introduces a fast intersection method to ensure scalability. Experimental results show that the algorithm mines frequent patterns directly from spatial data efficiently and correctly.
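As a rough illustration of the layer-intersection idea (a minimal sketch with toy polygons, not the paper's FISA implementation), the support of a predicate set can be computed as the area where all corresponding layers overlap, normalized by the study-area extent; the shapely library is assumed:

```python
# Minimal sketch of support counting by layer intersection (toy data,
# not the FISA implementation described in the abstract).
from shapely.geometry import box

# Each "predicate" is represented by the region (layer) where it holds.
study_area = box(0, 0, 10, 10)
layers = {
    "near_road": box(0, 0, 6, 10),
    "high_slope": box(4, 0, 10, 10),
}

def support(predicates):
    """Support of a predicate set: area where all layers overlap,
    normalized by the study-area extent."""
    region = study_area
    for p in predicates:
        region = region.intersection(layers[p])
    return region.area / study_area.area

print(support(["near_road"]))                # 0.6
print(support(["near_road", "high_slope"]))  # 0.2
```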
2.
王炳雪 《计算机工程与应用》2011,47(28):128-130
Most existing temporal sequence mining methods segment and discretize a sequence in an ad hoc way, so the discretization result depends heavily on externally imposed segmentation parameters. To make the discretization depend more strongly on the original data, a fuzzy clustering method is applied to transform a continuous temporal evolution sequence into a fuzzy temporal evolution sequence; the support of fuzzy temporal evolution segments is used to identify frequent fuzzy temporal evolution patterns, and membership degrees are used to compute the support and confidence of association rules, making these two key measures more precise. A generation algorithm for the frequent fuzzy pattern set is given together with its complexity. A practical example demonstrates the effectiveness of the method.
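A minimal sketch of the membership-based support idea (my own simplification with invented membership values, not the paper's algorithm): each time point belongs to fuzzy states with a degree in [0, 1], and the support of a pattern accumulates the minimum membership across its states:

```python
import numpy as np

# Membership degrees of a series in two hypothetical fuzzy states,
# e.g. as produced by fuzzy c-means clustering (toy values).
mu_rising  = np.array([0.9, 0.8, 0.2, 0.1, 0.7])
mu_falling = np.array([0.1, 0.2, 0.8, 0.9, 0.3])

def fuzzy_support(pattern, length):
    """Support of a state pattern like (rising, falling): sum over
    positions of the minimum membership, normalized by series length."""
    total = 0.0
    for t in range(length - len(pattern) + 1):
        total += min(m[t + i] for i, m in enumerate(pattern))
    return total / length

print(fuzzy_support((mu_rising, mu_falling), 5))  # 0.26
```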
3.
《Expert Systems with Applications》2005,28(3):537-545
Most traditional approaches for annotating protein families are inefficient because of high-throughput sequences, complex analytic tools, and unordered literature, and their results cannot be reused. Here, we describe a framework, knowledge sharing for protein families (KSPF), that uses sequence pattern data mining and knowledge management to improve upon traditional approaches. It is divided into three modules: automation, retrieval, and refinement. The framework builds an environment that allows biological researchers to submit an unknown protein sequence and search for information on its sub-family. Once the sub-family protein category has been found, the related literature and knowledge records provided by previous users can be retrieved, and the possible functions of the protein can then be predicted from the literature and records. The proposed framework is applicable to all types of protein families. We describe the search for a plant lipid transfer protein (PLTP) using the framework: the resulting system, KS-PLTP, maps an unknown sequence to a sub-family of the PLTP knowledge base and predicts the sequence's possible function. The prediction rate of KS-PLTP reached 89.6%.
4.
Nongnuch Poolsawad, Lisa Moore, Chandrasekhar Kambhampati, John G. F. Cleland 《国际自动化与计算杂志》2014,11(2):162-179
This paper investigates the characteristics of a clinical dataset using a combination of feature selection and classification methods to handle missing values and to understand the underlying statistical characteristics of a typical clinical dataset. A large clinical dataset typically presents challenges such as missing values, high dimensionality, and unbalanced classes, which pose inherent problems for feature selection and classification algorithms. With most clinical datasets, an initial exploration is carried out and attributes with more than a certain percentage of missing values are eliminated; prognostic and diagnostic models are then developed with the help of missing value imputation, feature selection, and classification algorithms. This paper has two main conclusions: 1) Despite the nature and large size of clinical datasets, the choice of missing value imputation method does not affect final performance. What is crucial is that the dataset is an accurate representation of the clinical problem; the method of imputing missing values is not critical for developing classifiers and prognostic/diagnostic models. 2) Supervised learning has proven more suitable for mining clinical data than unsupervised methods. It is also shown that non-parametric classifiers such as decision trees give better results than parametric classifiers such as radial basis function networks (RBFNs).
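Conclusion 1) is easy to probe on toy data: the sketch below (assuming scikit-learn; dataset and parameters are invented) compares mean and k-NN imputation in front of the same decision tree:

```python
# Sketch comparing two imputation strategies before classification,
# in the spirit of conclusion 1) above; data and settings are toy.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan   # 20% missing completely at random

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X)
    score = cross_val_score(DecisionTreeClassifier(random_state=0),
                            X_imp, y, cv=5).mean()
    print(name, round(score, 3))   # scores typically land close together
```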
5.
Amol Ghoting, Srinivasan Parthasarathy, Matthew Eric Otey 《Data Mining and Knowledge Discovery》2008,16(3):349-364
Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
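RBRP itself is not reproduced here; the sketch below shows the brute-force distance-to-k-th-neighbor score that such algorithms accelerate, on invented data:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each point by the distance to its k-th nearest neighbor:
    the brute-force O(n^2) baseline that RBRP-style algorithms speed up."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.sort(d, axis=1)[:, k - 1]  # distance to the k-th neighbor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)), [[8.0, 8.0, 8.0, 8.0]]])
scores = knn_outlier_scores(X)
print("most outlying point:", np.argmax(scores))  # 50, the planted outlier
```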
6.
Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, the algorithms and the sizes of data considered have generally not been representative of the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation which can be used to index datasets several orders of magnitude larger than anything else considered in the literature. To demonstrate the utility of this representation, we constructed a simple tree-based index structure which facilitates fast exact search and orders-of-magnitude-faster approximate search. For example, with a database of one hundred million time series, the approximate search can retrieve high-quality nearest neighbors in slightly over a second, whereas a sequential scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate search effectiveness, and lower-bound comparisons between time series representations in a bit-constrained environment. We further show how to exploit the combination of exact and approximate search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real-world datasets containing tens of millions of time series.
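The paper's multi-resolution representation is not reproduced here, but the sketch below illustrates the kind of symbolic discretization (piecewise aggregate approximation plus Gaussian breakpoints, as in SAX-style representations) on which such indexable symbolic representations are commonly built; segment count and alphabet are arbitrary choices:

```python
import numpy as np

def symbolize(ts, n_segments=8, alphabet="abcd"):
    """SAX-style discretization: z-normalize, average over equal-width
    segments (PAA), then map each segment mean to a symbol via
    breakpoints chosen for a standard normal distribution."""
    ts = (ts - ts.mean()) / ts.std()
    paa = ts.reshape(n_segments, -1).mean(axis=1)
    breakpoints = np.array([-0.67, 0.0, 0.67])  # 4 equiprobable symbols
    return "".join(alphabet[i] for i in np.searchsorted(breakpoints, paa))

ts = np.sin(np.linspace(0, 2 * np.pi, 64))
print(symbolize(ts))  # an 8-symbol word capturing the coarse shape
```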
7.
Jundong Li, Aibek Adilmagambetov, Mohomed Shazan Mohomed Jabbar, Osmar R. Zaïane, Alvaro Osornio-Vargas, Osnat Wine 《GeoInformatica》2016,20(4):651-692
We intend to identify relationships between cancer cases and pollutant emissions by proposing a novel co-location mining algorithm. In this context, we specifically attempt to understand whether there is a relationship between the location of a child diagnosed with cancer and any chemical combinations emitted from facilities in that location. Co-location pattern mining aims to detect sets of spatial features frequently located in close proximity to each other. Most previous work in this domain is based on transaction-free, Apriori-like algorithms that depend on user-defined thresholds and are designed for Boolean data points. Due to the absence of a clear notion of transactions, it is nontrivial to use association rule mining techniques to tackle the co-location mining problem. Our proposed approach is based on a grid-based transactionization of the geographic space and is designed to mine datasets with extended spatial objects. It is also capable of incorporating uncertainty about the existence of features to model real-world scenarios more accurately. We eliminate the need for a global threshold by introducing a statistical test to validate the significance of candidate co-location patterns and rules. Experiments on both synthetic and real datasets reveal that our algorithm can detect a considerable number of statistically significant co-location patterns. In addition, we explain the data modelling framework used on real datasets of pollutants (PRTR/NPRI) and childhood cancer cases.
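A minimal sketch of grid-based transactionization (toy feature instances and cell size; not the paper's uncertainty-aware version): every grid cell collects the features falling inside it, producing market-basket style transactions:

```python
from collections import defaultdict

# Toy feature instances as (feature, x, y); the cell size is arbitrary.
instances = [("cancer_case", 1.2, 3.4), ("pollutant_A", 1.4, 3.1),
             ("pollutant_B", 7.9, 0.5), ("cancer_case", 7.1, 0.2)]

def transactionize(instances, cell=1.0):
    """Grid-based transactionization: each grid cell collects the set of
    features falling inside it, yielding market-basket style transactions."""
    cells = defaultdict(set)
    for feature, x, y in instances:
        cells[(int(x // cell), int(y // cell))].add(feature)
    return list(cells.values())

print(transactionize(instances))
# e.g. [{'cancer_case', 'pollutant_A'}, {'cancer_case', 'pollutant_B'}]
```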
8.
9.
10.
Mining sequential patterns from multidimensional sequence data
Chung-Ching Yu, Yen-Liang Chen 《IEEE Transactions on Knowledge and Data Engineering》2005,17(1):136-140
The problem addressed in this work is to discover frequently occurring sequential patterns in databases. Although much work has been devoted to this subject, to the best of our knowledge no previous research has been able to find sequential patterns in d-dimensional sequence data with d > 2. Without such a capability, many practical datasets would be impossible to mine. For example, an online stock-trading site may have a customer database where each customer visits the Web site over a series of days, each day comprises a series of sessions, and each session visits a series of Web pages. The data for each customer then forms a 3-dimensional list, where the first dimension is days, the second is sessions, and the third is visited pages. To mine sequential patterns from this kind of sequence data, two efficient algorithms are developed in this work.
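To make the 3-dimensional structure concrete, the sketch below (toy data; not the paper's algorithms) shows one customer's days, sessions, and pages as nested lists, with a naive ordered-subsequence test at the page level:

```python
# Toy 3-d sequence data for one customer: days -> sessions -> pages,
# matching the stock-trading example in the abstract.
customer = [
    [["home", "quote", "buy"], ["news"]],  # day 1: two sessions
    [["home", "portfolio"]],               # day 2: one session
]

def session_contains(session, pattern):
    """True if `pattern` occurs in `session` as an ordered subsequence."""
    it = iter(session)
    return all(page in it for page in pattern)

print(any(session_contains(s, ["home", "buy"])
          for day in customer for s in day))  # True (day 1, session 1)
```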
11.
Given a large set of data, a common data mining problem is to extract the frequent patterns occurring in this set. The idea presented in this paper is to extract a condensed representation of the frequent patterns, called the disjunction-bordered condensation (DBC), instead of extracting the whole frequent pattern collection. We show that this condensed representation can be used to regenerate all frequent patterns and their exact frequencies. Moreover, this regeneration can be performed without any access to the original data. Practical experiments show that the DBC can be extracted very efficiently even in difficult cases, and that this extraction together with the regeneration of the frequent patterns is much more efficient than the direct extraction of the frequent patterns themselves. We compared the DBC with another representation of frequent patterns previously investigated in the literature, called frequent closed sets. In nearly all experiments we have run, the DBC was extracted much more efficiently than frequent closed sets; in the other cases, the extraction times are very close.
12.
In this paper, we present a new approach to deriving groupings of mobile users based on their movement data. We assume that the user movement data are collected by logging location data emitted from mobile devices tracking users. We formally define a group pattern as a group of users that are within a distance threshold of one another for at least a minimum duration. To mine group patterns, we first propose two algorithms, namely AGP and VG-growth. In our first set of experiments, it is shown that when both the number of users and the logging duration are large, AGP and VG-growth are inefficient for mining group patterns of size two. We therefore propose a framework that summarizes user movement data before group pattern mining. In the second series of experiments, we show that the methods using location summarization significantly reduce the mining overhead for group patterns of size two. We conclude that the cuboid-based summarization methods give better performance when the summarized database is small compared to the original movement database. In addition, we evaluate the impact of parameters on the mining overhead.
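A brute-force version of the definition (invented trajectories and thresholds; unrelated to AGP or VG-growth) checks, for every user pair, whether they stay within the distance threshold for a minimum number of consecutive time steps:

```python
import numpy as np

def group_patterns(positions, users, eps=1.0, min_dur=3):
    """Naive size-2 group pattern check: report pairs of users within
    distance eps of each other for at least min_dur consecutive steps.
    positions: array of shape (timesteps, n_users, 2)."""
    n = positions.shape[1]
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(positions[:, i] - positions[:, j],
                                   axis=1) <= eps
            run = best = 0
            for c in close:                 # longest consecutive close run
                run = run + 1 if c else 0
                best = max(best, run)
            if best >= min_dur:
                found.append((users[i], users[j]))
    return found

rng = np.random.default_rng(2)
pos = rng.uniform(0, 10, (20, 3, 2))
pos[:, 1] = pos[:, 0] + 0.1                 # user 1 shadows user 0
print(group_patterns(pos, ["u0", "u1", "u2"]))  # expect [('u0', 'u1')]
```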
13.
Yang Peizhong, Wang Lizhen, Wang Xiaoxuan, Zhou Lihua 《Knowledge and Information Systems》2021,63(6):1365-1395
A co-location pattern indicates a group of spatial features whose instances are frequently located together in a proximate geographic area. Spatial co-location...
14.
Majed Sahli, Essam Mansour, Panos Kalnis 《The VLDB Journal: The International Journal on Very Large Data Bases》2014,23(6):871-893
Modern applications, including bioinformatics, time series, and web log analysis, require the extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) sequence. Existing approaches are either heuristics that are error-prone, or exact (also called combinatorial) methods that are extremely slow and therefore applicable only to very small sequences (i.e., on the order of megabytes). This paper presents ACME, a combinatorial approach that scales to gigabyte-long sequences and is the first to support supermaximal motifs. ACME is a versatile parallel system that can be deployed on desktop multi-core systems or on thousands of CPUs in the cloud. However, merely using more compute nodes does not guarantee efficiency, because of the related overheads. To this end, ACME introduces an automatic tuning mechanism that suggests the appropriate number of CPUs to utilize in order to meet the user's run-time constraints while minimizing the financial cost of cloud resources. Our experiments show that, compared to the state of the art, ACME supports sequences three orders of magnitude longer (e.g., DNA for the entire human genome); handles large alphabets (e.g., the English alphabet for Wikipedia); scales out to 16,384 CPUs on a supercomputer; and supports elastic deployment in the cloud.
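For contrast with ACME's scale, the exact motif-extraction problem itself fits in a few lines on toy input (sliding-window counting; motif length and support threshold are arbitrary):

```python
from collections import Counter

def frequent_motifs(sequence, length, min_count):
    """Brute-force motif extraction: count every substring of the given
    length with a sliding window and keep the frequent ones. Exact but
    only viable on toy inputs; systems like ACME exist to scale this."""
    counts = Counter(sequence[i:i + length]
                     for i in range(len(sequence) - length + 1))
    return {m: c for m, c in counts.items() if c >= min_count}

print(frequent_motifs("ACGTACGTTACGT", length=4, min_count=3))  # {'ACGT': 3}
```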
15.
The CLC (Combined Location Classification) error model provides indices for overall data uncertainty in thematic spatio-temporal datasets. It accounts for the two major sources of error in such datasets: location error and classification error. The model assumes independence between error components, while recent studies have revealed various degrees of correlation between error components in actual datasets. The goal of this study is to determine whether the likely violation of the model assumptions biases model predictions. A comprehensive algorithm was devised to simulate the entire process of error formation and propagation. Time series thematic maps were constructed, and modified maps were derived as realizations of underlying error patterns. Error rate and pattern (positive autocorrelation) were controlled for location error and for classification error, as were the magnitude of correlation between errors from different sources and the correlation between errors at different time steps. Very good agreement between model predictions and simulation results was found in the absence of correlation between time steps and between error types, while the inclusion of such correlations was shown to affect model fit slightly. Given our current knowledge of spatio-temporal error patterns in real data, the CLC error model can be used reliably to assess the overall uncertainty in thematic change detection analyses.
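A much-simplified sketch of such an error simulation (independent location and classification error on a random thematic grid; rates and sizes are invented, and the autocorrelation and cross-correlation controls of the study are not modeled):

```python
import numpy as np

def corrupt(truth, p_loc, p_class, n_classes, rng):
    """One realization of a thematic map under independent location and
    classification error. Location error: replace a cell's label with a
    random neighbor's; classification error: relabel a cell at random."""
    out = truth.copy()
    rows, cols = truth.shape
    for r in range(rows):
        for c in range(cols):
            if rng.random() < p_loc:
                dr, dc = rng.integers(-1, 2, size=2)
                out[r, c] = truth[(r + dr) % rows, (c + dc) % cols]
            if rng.random() < p_class:
                out[r, c] = rng.integers(n_classes)
    return out

rng = np.random.default_rng(3)
truth = rng.integers(3, size=(50, 50))
realization = corrupt(truth, p_loc=0.05, p_class=0.10, n_classes=3, rng=rng)
print("observed accuracy:", (realization == truth).mean())
```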
16.
17.
18.
Building on the time-series feature pattern mining method based on inter-relevant successive trees (IRST) proposed in reference [1], the concept of a time window is introduced to compensate for the shortcomings of IRST, an index model originally used in text retrieval, when applied to time series. The IRST structure and the mining algorithm are improved, remedying the defect that only tightly adjacent feature patterns could be mined. Experimental results show that the method can mine more feature patterns of greater practical value.
19.
In previous work, we have shown that a set of characteristics, defined as (code, frequency) pairs, can be derived from a protein family by the use of a signal-processing method. This method enables the location and extraction of sequence patterns by taking into account each (code, frequency) pair individually. In the present paper, we propose to extend this method in order to detect and visualize patterns by taking several pairs into account simultaneously. Two 'multifrequency' methods are described. The first is based on a rewriting of the sequences with new symbols which summarize the frequency information. The second is based on a clustering of the patterns associated with each pair. Both methods lead to the definition of significant consensus sequences. Some results obtained with calcium-binding proteins and serine proteases are also discussed.
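The (code, frequency) pairs suggest a Fourier-type analysis of numerically encoded sequences; below is a minimal sketch with numpy.fft, where the per-residue codes are an invented stand-in for the paper's encoding:

```python
import numpy as np

# Hypothetical numeric code per residue (a stand-in for the paper's codes).
code = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "G": -0.4, "K": -3.9}

def dominant_frequencies(seq, top=2):
    """Encode the sequence numerically and return the strongest non-zero
    DFT frequencies: candidate (code, frequency) pairs for this code."""
    x = np.array([code[a] for a in seq], dtype=float)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x))
    order = np.argsort(spectrum)[::-1]
    return [(freqs[i], spectrum[i]) for i in order[:top] if freqs[i] > 0]

# A sequence repeating with period 6 shows a peak near frequency 1/6.
print(dominant_frequencies("ACDEKGACDEKG"))
```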
20.
To improve software quality, static and dynamic defect-detection tools accept programming rules as input and detect their violations in software as defects. Because these programming rules are often poorly documented in practice, previous work developed various approaches that mine programming rules as frequent patterns from program source code and then use static or dynamic defect-detection techniques to detect pattern violations in the source code under analysis. However, these existing approaches often produce many false positives due to various factors. To reduce the false positives produced by such mining approaches, we develop a novel approach, called Alattin, that includes new mining algorithms and a technique for detecting neglected conditions based on our mining algorithms. Our new mining algorithms mine patterns in four formats: conjunctive, disjunctive, exclusive-disjunctive, and combinations of these. We show the benefits and limitations of these four pattern formats with respect to false positives and false negatives among detected violations by applying the mined patterns to the problem of detecting neglected conditions.
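The conjunctive, single-condition core of the idea fits in a short sketch (invented call-site data; Alattin's disjunctive and exclusive-disjunctive formats are not modeled): mine conditions that are frequent across the call sites of an API and flag sites missing them:

```python
from collections import Counter

# Toy call sites: the set of checks observed around each call to
# Iterator.next() (invented data, not Alattin's real input format).
call_sites = [
    {"hasNext"}, {"hasNext"}, {"hasNext"}, {"hasNext"},
    set(),                       # a site with no guarding check
]

def neglected_condition_sites(sites, min_support=0.6):
    """Flag call sites missing any condition that is frequent overall:
    the single-item, conjunctive core of the idea."""
    counts = Counter(c for s in sites for c in s)
    frequent = {c for c, n in counts.items()
                if n / len(sites) >= min_support}
    return [i for i, s in enumerate(sites) if not frequent <= s]

print(neglected_condition_sites(call_sites))  # [4]: missing hasNext check
```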