Found 20 similar articles (search time: 15 ms)
1.
In this paper we introduce a method called CL.E.D.M. (CLassification through ELECTRE and Data Mining), which employs aspects of the methodological framework of the ELECTRE I outranking method and aims at increasing the accuracy of existing data mining classification algorithms. In particular, the method chooses the best decision rules extracted from the training process of the data mining classification algorithms, and then assigns the classes that correspond to these rules to the objects that must be classified. Three well-known data mining classification algorithms are tested on five widely used databases to verify the robustness of the proposed method.
2.
3.
Chin-Feng Lee S. Wesley Changchien Wei-Tse Wang Jau-Ji Shen 《Information Systems Frontiers》2006,8(3):147-161
Data mining can dig out valuable information from databases to assist a business in approaching knowledge discovery and improving
business intelligence. Databases store large volumes of structured data. The amount of data increases due to advanced database technology
and extensive use of information systems. Despite the price drop of storage devices, it is still important to develop efficient
techniques for database compression. This paper develops a database compression method by eliminating redundant data, which
often exist in transaction databases. The proposed approach uses a data mining structure to extract association rules from
a database. Redundant data will then be replaced by means of compression rules. A heuristic method is designed to resolve
the conflicts of the compression rules. To prove its efficiency and effectiveness, the proposed approach is compared with
two other database compression methods.
Chin-Feng Lee is an associate professor with the Department of Information Management at Chaoyang University of Technology, Taiwan, R.O.C.
She received her M.S. and Ph.D. degrees in 1994 and 1998, respectively, from the Department of Computer Science and Information
Engineering at National Chung Cheng University. Her current research interests include database design, image processing and
data mining techniques.
S. Wesley Changchien is a professor with the Institute of Electronic Commerce at National Chung-Hsing University, Taiwan, R.O.C. He received a
BS degree in Mechanical Engineering (1989) and completed his MS (1993) and Ph.D. (1996) degrees in Industrial Engineering
at State University of New York at Buffalo, USA. His current research interests include electronic commerce, internet/database
marketing, knowledge management, data mining, and decision support systems.
Jau-Ji Shen received his Ph.D. degree in Information Engineering and Computer Science from National Taiwan University at Taipei, Taiwan
in 1988. From 1988 to 1994, he was the leader of the software group in Institute of Aeronautic, Chung-Sung Institute of Science
and Technology. He is currently an associate professor of information management department in the National Chung Hsing University
at Taichung. His research areas focus on the digital multimedia, database and information security. His current research areas
focus on data engineering, database techniques and information security.
Wei-Tse Wang received the B.A. (2001) and M.B.A (2003) degrees in Information Management at Chaoyang University of Technology, Taiwan,
R.O.C. His research interests include data mining, XML, and database compression.
4.
The credit card industry has been growing rapidly, and huge volumes of consumers’ credit data are therefore collected by the credit department of the bank. The credit scoring manager often evaluates the consumer’s credit with intuitive experience. However, with the support of a credit classification model, the manager can accurately evaluate the applicant’s credit score. Support Vector Machine (SVM) classification is currently an active research area and successfully solves classification problems in many domains. This study used three strategies to construct hybrid SVM-based credit scoring models that evaluate the applicant’s credit score from the applicant’s input features. Two credit datasets from the UCI database are selected as the experimental data to demonstrate the accuracy of the SVM classifier. Compared with neural networks, genetic programming, and decision tree classifiers, the SVM classifier achieved identical classification accuracy with relatively few input features. Additionally, by combining genetic algorithms with the SVM classifier, the proposed hybrid GA-SVM strategy can simultaneously perform feature selection and model parameter optimization. Experimental results show that SVM is a promising addition to the existing data mining methods.
5.
The aim of this study is to define the risk factors that are effective in Breast Cancer (BC) occurrence, and to construct a supportive model that will promote the cause-and-effect relationships among the factors that are crucial to public health. In this study, we utilize the Rule-Based Fuzzy Cognitive Map (RBFCM) approach, which can successfully represent knowledge and human experience, introducing concepts to represent the essential elements and the cause-and-effect relationships among the concepts to model the behavior of any system. A decision-making system is constructed to evaluate risk factors of BC based on information from oncologists. To construct the causal relationships, the weight matrix of the RBFCM is determined by combining the experts’ experience, expertise and views. The results of the proposed methodology will allow better insight into several root causes, with the help of which oncologists can improve their prevention and protection recommendations. The results showed that Social Class and Late Maternal Age can be seen as important modifiable factors, whereas Benign Breast Disease, Family History and Breast Density can be considered important non-modifiable risk factors. This study weighs the interrelations of the BC risk factors and enables a sensitivity analysis between the scenario studies and BC risk factors. A soft computing method is used to simulate the changes of a system over time and address “what if” questions to compare different case studies.
6.
Modeling adiabatic temperature rise during concrete hydration: A data mining approach (Cited by: 1; self-citations: 0; citations by others: 1)
Alexandre G. Evsukoff Eduardo M.R. Fairbairn Étore F. Faria Marcos M. Silvoso Romildo D. Toledo Filho 《Computers & Structures》2006,84(31-32):2351-2362
This paper presents a data mining approach for modeling the adiabatic temperature rise during concrete hydration. The model was developed based on experimental data obtained in the last thirty years for several mass concrete constructions in Brazil, including some of the largest hydroelectric power plants in operation in the world. The input of the model is a variable data set corresponding to the binder physical and chemical properties and concrete mixture proportions. The output is a set of three parameters that determine a function capable of describing the adiabatic temperature rise during concrete hydration. The comparison between experimental data and modeling results shows the accuracy of the proposed approach and that data mining is a potential tool to predict thermal stresses in the design of massive concrete structures.
7.
M. Wahde Z. Szallasi 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2006,10(4):338-345
There exist several methods for binary classification of gene expression data sets. However, in the majority of published
methods, little effort has been made to minimize classifier complexity. In view of the small number of samples available in
most gene expression data sets, there is a strong motivation for minimizing the number of free parameters that must be fitted
to the data. In this paper, a method is introduced for evolving (using an evolutionary algorithm) simple classifiers involving
a minimal subset of the available genes. The classifiers obtained by this method perform well, reaching 97% correct classification
of clinical outcome on training samples from the breast cancer data set published by van't Veer, and up to 89% correct classification
on validation samples from the same data set, easily outperforming previously published results.
8.
Searching for simplified farmers' crop choice models for integrated watershed management in Thailand: A data mining approach (Cited by: 1; self-citations: 0; citations by others: 1)
This study used the C4.5 data mining algorithm to model farmers' crop choice in two watersheds in Thailand. Previous attempts in the Integrated Water Resource Assessment and Management Project to model farmers' crop choice produced large sets of decision rules. In order to produce simplified models of farmers' crop choice, data mining operations were applied for each soil series in the study areas. The resulting decision trees were much smaller in size. Land type, water availability, tenure, capital, labor availability as well as non-farm and livestock income were found to be important considerations in farmers' decision models. Profitability was also found important although it was represented in approximate ranges. Unlike the general wisdom on farmers' crop choice, these decision trees came with threshold values and sequential order of the important variables. The decision trees were validated using the remaining unused set of data, and their accuracy in predicting farmers' decisions was around 84%. Because of their simple structure, the decision trees produced in this study could be useful to analysts of water resource management as they can be integrated with biophysical models for sustainable watershed management.
9.
In privacy-preserving data mining (PPDM), a widely used method for achieving data mining goals while preserving privacy is based on k-anonymity. This method, which protects subject-specific sensitive data by anonymizing it before it is released for data mining, demands that every tuple in the released table should be indistinguishable from no fewer than k subjects. The most common approach for achieving compliance with k-anonymity is to replace certain values with less specific but semantically consistent values. In this paper we propose a different approach for achieving k-anonymity by partitioning the original dataset into several projections such that each one of them adheres to k-anonymity. Moreover, any attempt to rejoin the projections results in a table that still complies with k-anonymity. A classifier is trained on each projection and subsequently, an unlabelled instance is classified by combining the classifications of all classifiers. Guided by classification accuracy and k-anonymity constraints, the proposed data mining privacy by decomposition (DMPD) algorithm uses a genetic algorithm to search for optimal feature set partitioning. Ten separate datasets were evaluated with DMPD in order to compare its classification performance with other k-anonymity-based methods. The results suggest that DMPD performs better than existing k-anonymity-based algorithms and there is no necessity for applying domain dependent knowledge. Using multiobjective optimization methods, we also examine the tradeoff between the two conflicting objectives in PPDM: privacy and predictive performance.
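As a rough illustration of the k-anonymity requirement described in this abstract, the following minimal Python sketch checks whether every combination of quasi-identifier values in a table occurs at least k times. The function name `is_k_anonymous`, the column names, and the toy records are assumptions made for illustration only; this is not the DMPD algorithm, merely the constraint that each of its projections is said to satisfy.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k rows."""
    # Count how many rows share each combination of quasi-identifier values.
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical toy table: 'age_range' and 'zip_prefix' are assumed quasi-identifiers.
records = [
    {"age_range": "30-39", "zip_prefix": "130", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_prefix": "130", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip_prefix": "148", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "148", "diagnosis": "diabetes"},
]

satisfies_2 = is_k_anonymous(records, ["age_range", "zip_prefix"], 2)
satisfies_3 = is_k_anonymous(records, ["age_range", "zip_prefix"], 3)
```

In this toy table each quasi-identifier combination is shared by two rows, so the check passes for k = 2 and fails for k = 3.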
10.
Multi-objective genetic algorithms based automated clustering for fuzzy association rules mining (Cited by: 1; self-citations: 0; citations by others: 1)
Researchers realized the importance of integrating fuzziness into association rules mining in databases with binary and quantitative
attributes. However, most of the earlier algorithms proposed for fuzzy association rules mining either assume that fuzzy sets
are given or employ a clustering algorithm, like CURE, to decide on fuzzy sets; for both cases the number of fuzzy sets is
pre-specified. In this paper, we propose an automated method to decide on the number of fuzzy sets and for the autonomous
mining of both fuzzy sets and fuzzy association rules. We achieve this by developing an automated clustering method based
on multi-objective Genetic Algorithms (GA); the aim of the proposed approach is to automatically cluster values of a quantitative
attribute in order to obtain large number of large itemsets in less time. We compare the proposed multi-objective GA based
approach with two other approaches, namely: 1) CURE-based approach, which is known as one of the most efficient clustering
algorithms; 2) Chien et al. clustering approach, which is an automatic interval partition method based on variation of density.
Experimental results on 100 K transactions extracted from the adult data of USA census in year 2000 showed that the proposed
automated clustering method exhibits good performance over both CURE-based approach and Chien et al.’s work in terms of runtime,
number of large itemsets and number of association rules.
11.
Multi-objective PSO algorithm for mining numerical association rules without a priori discretization
《Expert systems with applications》2014,41(9):4259-4273
In the domain of association rules mining (ARM) discovering the rules for numerical attributes is still a challenging issue. Most of the popular approaches for numerical ARM require a priori data discretization to handle the numerical attributes. Moreover, in the process of discovering relations among data, often more than one objective (quality measure) is required, and in most cases, such objectives include conflicting measures. In such a situation, it is recommended to obtain the optimal trade-off between objectives. This paper deals with the numerical ARM problem using a multi-objective perspective by proposing a multi-objective particle swarm optimization algorithm (i.e., MOPAR) for numerical ARM that discovers numerical association rules (ARs) in only one single step. To identify more efficient ARs, several objectives are defined in the proposed multi-objective optimization approach, including confidence, comprehensibility, and interestingness. Finally, by using the Pareto optimality the best ARs are extracted. To deal with numerical attributes, we use rough values containing lower and upper bounds to show the intervals of attributes. In the experimental section of the paper, we analyze the effect of operators used in this study, compare our method to the most popular evolutionary-based proposals for ARM and present an analysis of the mined ARs. The results show that MOPAR extracts reliable (with confidence values close to 95%), comprehensible, and interesting numerical ARs when attaining the optimal trade-off between confidence, comprehensibility and interestingness.
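The Pareto-optimality step mentioned in this abstract can be sketched as a simple non-dominated filter. The snippet below is a minimal Python illustration, assuming each rule is already scored on the three objectives (confidence, comprehensibility, interestingness) and that all three are to be maximized. The rule names and scores are invented for illustration, and the particle swarm search itself is not shown.

```python
def dominates(a, b):
    """True if vector a is no worse than b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the (name, objectives) pairs not dominated by any other candidate."""
    return [
        (name, obj)
        for name, obj in candidates
        if not any(dominates(other, obj) for _, other in candidates)
    ]

# Hypothetical rules scored on (confidence, comprehensibility, interestingness).
rules = [
    ("r1", (0.95, 0.60, 0.40)),
    ("r2", (0.90, 0.80, 0.30)),
    ("r3", (0.85, 0.55, 0.35)),  # worse than r1 on all three objectives
    ("r4", (0.70, 0.70, 0.90)),
]

front = pareto_front(rules)
front_names = [name for name, _ in front]
```

Here `r3` is dropped because `r1` beats it on every objective, while `r1`, `r2`, and `r4` each win on at least one objective and therefore remain on the front.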
12.
Frequent pattern mining: current status and future directions (Cited by: 10; self-citations: 2; citations by others: 10)
Frequent pattern mining has been a focused theme in data mining research for over a decade. Abundant literature has been dedicated
to this research and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset
mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structured pattern mining,
correlation mining, associative classification, and frequent pattern-based clustering, as well as their broad applications.
In this article, we provide a brief overview of the current status of frequent pattern mining and discuss a few promising
research directions. We believe that frequent pattern mining research has substantially broadened the scope of data analysis
and will have deep impact on data mining methodologies and applications in the long run. However, there are still some challenging
research issues that need to be solved before frequent pattern mining can claim a cornerstone approach in data mining applications.
The work was supported in part by the U.S. National Science Foundation NSF IIS-05-13678/06-42771 and NSF BDI-05-15813. Any
opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect
the views of the funding agencies.
13.
Mehmet Kaya 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2006,10(7):578-586
Association rules form one of the most widely used techniques to discover correlations among attributes in a database. So far,
some efficient methods have been proposed to obtain these rules with respect to an optimal goal, such as: to maximize the
number of large itemsets and interesting rules or the values of support and confidence for the discovered rules. This paper
first introduces optimized fuzzy association rule mining in terms of three important criteria: strongness, interestingness
and comprehensibility. Then, it proposes multi-objective Genetic Algorithm (GA) based approaches for discovering these optimized
rules. The optimization technique for a given criterion may take one of two forms. The first tries to determine
the appropriate fuzzy sets of quantitative attributes in a prespecified rule, which is also called a certain rule. The second
deals with finding both uncertain rules and their appropriate fuzzy sets. Experimental results conducted on a real data set
show the effectiveness and applicability of the proposed approach.
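For readers unfamiliar with the support and confidence values mentioned in this abstract, the following minimal Python sketch computes both for a single rule. It assumes ordinary (crisp) transactions represented as sets of items, not the fuzzy sets used in the paper; the function names and the toy transactions are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of the antecedent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

With these baskets, the rule bread -> milk holds in two of four transactions, and bread appears in three, so the confidence is the ratio of those two supports.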
14.
A novel approach for process mining based on event types (Cited by: 2; self-citations: 0; citations by others: 2)
Lijie Wen Jianmin Wang Wil M. P. van der Aalst Biqing Huang Jiaguang Sun 《Journal of Intelligent Information Systems》2009,32(2):163-190
Despite the omnipresence of event logs in transactional information systems (cf. WFM, ERP, CRM, SCM, and B2B systems), historic
information is rarely used to analyze the underlying processes. Process mining aims at improving this by providing techniques
and tools for discovering process, control, data, organizational, and social structures from event logs, i.e., the basic idea
of process mining is to diagnose business processes by mining event logs for knowledge. Given its potential and challenges
it is no surprise that recently process mining has become a vivid research area. In this paper, a novel approach for process
mining based on two event types, i.e., START and COMPLETE, is proposed. Information about the start and completion of tasks
can be used to explicitly detect parallelism. The algorithm presented in this paper overcomes some of the limitations of existing
algorithms such as the α-algorithm (e.g., short-loops) and therefore enhances the applicability of process mining.
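The idea that start and completion events can expose parallelism directly can be sketched as an interval-overlap check. The Python snippet below is only an illustration of that idea and not the mining algorithm from the paper. It assumes a hypothetical log format of `(case_id, task, event_type, timestamp)` tuples and that each task occurs at most once per case.

```python
from itertools import combinations

def parallel_pairs(event_log):
    """Return pairs of tasks whose START..COMPLETE intervals overlap within some case."""
    intervals = {}  # (case, task) -> [start_time, complete_time]
    for case, task, event_type, time in event_log:
        slot = intervals.setdefault((case, task), [None, None])
        slot[0 if event_type == "START" else 1] = time

    # Group the completed intervals by case.
    by_case = {}
    for (case, task), (start, end) in intervals.items():
        if start is not None and end is not None:
            by_case.setdefault(case, []).append((task, start, end))

    pairs = set()
    for tasks in by_case.values():
        for (a, start_a, end_a), (b, start_b, end_b) in combinations(tasks, 2):
            if start_a < end_b and start_b < end_a:  # the two intervals overlap
                pairs.add(frozenset((a, b)))
    return pairs

# Hypothetical log: A runs alone, then B and C run concurrently.
log = [
    (1, "A", "START", 0), (1, "A", "COMPLETE", 2),
    (1, "B", "START", 3), (1, "C", "START", 4),
    (1, "B", "COMPLETE", 6), (1, "C", "COMPLETE", 7),
]

parallel = parallel_pairs(log)
```

A log that records only completion events would show B before C here and could suggest a sequence; keeping the START events is what reveals the overlap.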
15.
Most incremental mining and online mining algorithms concentrate on finding association rules or patterns consistent with entire current sets of data. Users cannot easily obtain results from only the interesting portion of the data. This may hinder the use of mining in online decision support for multidimensional data. To provide ad-hoc, query-driven, and online mining support, we first propose a relation called the multidimensional pattern relation to structurally and systematically store context and mining information for later analysis. Each tuple in the relation comes from an inserted dataset in the database. We then develop an online mining approach called three-phase online association rule mining (TOARM) based on this proposed multidimensional pattern relation to support online generation of association rules under multidimensional considerations. The TOARM approach consists of three phases during which final sets of patterns satisfying various mining requests are found. It first selects and integrates related mining information in the multidimensional pattern relation, and then, if necessary, re-processes itemsets without sufficient information against the underlying datasets. Some implementation considerations for the algorithm are also stated in detail. Experiments on homogeneous and heterogeneous datasets were made and the results show the effectiveness of the proposed approach.
16.
One of the major challenges in data mining is the extraction of comprehensible knowledge from recorded data. In this paper, a coevolutionary-based classification technique, namely COevolutionary Rule Extractor (CORE), is proposed to discover classification rules in data mining. Unlike existing approaches where candidate rules and rule sets are evolved at different stages in the classification process, the proposed CORE coevolves rules and rule sets concurrently in two cooperative populations to confine the search space and to produce good rule sets that are comprehensive. The proposed coevolutionary classification technique is extensively validated upon seven datasets obtained from the University of California, Irvine (UCI) machine learning repository, which are representative artificial and real-world data from various domains. Comparison results show that the proposed CORE produces comprehensive and good classification rules for most datasets, which are competitive as compared with existing classifiers in literature. Simulation results obtained from box plots also unveil that CORE is relatively robust and invariant to random partition of datasets.
17.
Bilal Alataş Erhan Akin 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2006,10(3):230-237
In this paper, a genetic algorithm (GA) is proposed as a search strategy for not only positive but also negative quantitative
association rule (AR) mining within databases. Contrary to the methods used as usual, ARs are directly mined without generating
frequent itemsets. The proposed GA performs a database-independent approach that does not rely upon the minimum support and
the minimum confidence thresholds that are hard to determine for each database. Instead of randomly generated initial population,
uniform population that forces the initial population to be not far away from the solutions and distributes it in the feasible
region uniformly is used. An adaptive mutation probability, a new operator called uniform operator that ensures the genetic
diversity, and an efficient adjusted fitness function are used for mining all interesting ARs from the last population in
only a single run of the GA. The efficiency of the proposed GA is validated upon synthetic and real databases.
18.
An efficient approach to mining indirect associations (Cited by: 1; self-citations: 0; citations by others: 1)
Discovering association rules is one of the important tasks in data mining. While most of the existing algorithms are developed
for efficient mining of frequent patterns, it has been noted recently that some of the infrequent patterns, such as indirect
associations, provide useful insight into the data. In this paper, we propose an efficient algorithm, called HI-mine, based on a new data structure, called HI-struct, for mining the complete set of indirect associations between items. Our experimental results show that HI-mine's performance is significantly better than that of the previously developed algorithm for mining indirect associations on
both synthetic and real world data sets over practical ranges of support specifications.
19.
A crucial issue related to data mining on time-series is that of training period duration. The training horizon used impacts the nature of rules obtained and their predictability over time. Longer training horizons are generally sought, in order to discern sustained patterns with robust training data performance that extends well into the predictive period. However, in dynamic environments patterns that persist over time may be unavailable, and shorter-term patterns may hold higher predictive ability, albeit with shorter predictive periods. Such potentially useful shorter-term patterns may be lost when the training duration covers much longer periods. Too short a training duration can, of course, be susceptible to over-fitting to noise. We conduct experiments using different training horizons with daily data for the S&P500 index and report the sensitivity of the performance of the obtained rules with respect to the training durations. We show that while the performance of the rules in the training period is important for inducing the “best” rules, it is not indicative of their performance in the test period, and we propose alternative measures that can be used to help identify the appropriate training durations.
20.
From sequential pattern mining to structured pattern mining: A pattern-growth approach (Cited by: 10; self-citations: 0; citations by others: 10)
Jia-Wei Han Jian Pei Xi-Feng Yan 《Journal of Computer Science and Technology》2004,19(3):0-0
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studies have developed two major classes of sequential pattern mining methods: (1) a candidate generation-and-test approach, represented by (i) GSP, a horizontal format-based sequential pattern mining method, and (ii) SPADE, a vertical format-based method; and (2) a pattern-growth method, represented by PrefixSpan and its further extensions, such as gSpan for mining structured patterns. In this study, we perform a systematic introduction and presentation of the pattern-growth methodology and study its principles and extensions. We first introduce two interesting pattern-growth algorithms, FreeSpan and PrefixSpan, for efficient sequential pattern mining. Then we introduce gSpan for mining structured patterns using the same methodology. Their relative performance in l