Similar Documents
20 similar documents found.
1.
A program has been developed that derives classification rules from empirical observations and expresses these rules in a knowledge representation format called 'counting criteria'. Decision rules derived in this format are often more comprehensible than rules derived by existing machine learning programs such as AQ11. Use of the program is illustrated by inferring discrimination criteria for certain types of bacteria from their biochemical characteristics. The program may be useful for the conceptual analysis of data and for the automatic generation of prototype knowledge bases for expert systems.
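The abstract does not define the format precisely, but a counting criterion is commonly read as an "at least m of n features" rule. Below is a minimal sketch under that assumption; the bacterial features and the threshold are hypothetical, not taken from the paper.

```python
# Hedged sketch: a "counting criterion" fires when at least `threshold`
# of the listed binary features are present in an observation.
def counting_rule(observation, features, threshold):
    return sum(bool(observation.get(f, False)) for f in features) >= threshold

# Hypothetical discrimination criterion for a bacterial type:
# "at least 2 of {catalase, oxidase, motility} are positive".
sample = {"catalase": True, "oxidase": False, "motility": True}
print(counting_rule(sample, ["catalase", "oxidase", "motility"], threshold=2))  # True
```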

2.
Despite recent successes and advancements in artificial intelligence and machine learning, the domain remains under continuous challenge and guidance from phenomena and processes observed in the natural world. Humans remain unsurpassed in their efficiency at dealing with and learning from uncertain information arriving in a variety of forms, and an increasing number of robust learning and optimisation algorithms build their analytical engines on nature-inspired phenomena. The excellence of neural networks and kernel-based learning methods, and the emergence of particle-, swarm-, and social-behaviour-based optimisation methods, are just a few of many indications of a trend towards greater exploitation of nature-inspired models and systems. This work demonstrates how the simple concept of a physical field can be adopted to build a complete framework for supervised and unsupervised learning. Inspiration for artificial learning is drawn from the mechanics of physical fields observed at both micro and macro scales. Exploiting analogies between data and charged particles subjected to gravitational, electrostatic, and gas-particle fields, a family of new algorithms has been developed and applied to classification, clustering, and data condensation, while properties of the field are further used in a unique visualisation of classification and classifier-fusion models. The paper covers extensive pictorial examples and visual interpretations of the presented techniques, along with comparative testing on well-known real and artificial datasets.
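As a rough illustration of the data-as-particles analogy, here is a minimal gravitational-field classifier sketch: each training sample attracts a query point with strength proportional to the inverse squared distance, and the class exerting the strongest aggregate pull wins. This is an assumption-laden toy, not the paper's exact formulation.

```python
import numpy as np

def field_classify(X_train, y_train, x, eps=1e-9):
    # Inverse-square "pull" of every training sample on the query point x.
    d2 = np.sum((X_train - x) ** 2, axis=1) + eps
    pull = 1.0 / d2
    classes = np.unique(y_train)
    # Class whose members exert the strongest aggregate field wins.
    return classes[np.argmax([pull[y_train == c].sum() for c in classes])]

X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]])
y = np.array([0, 0, 1, 1])
print(field_classify(X, y, np.array([2.8, 3.1])))  # -> 1
```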

3.
The performance of eight machine learning classifiers was compared on three aphasia-related classification problems. The first problem contained naming data of aphasic and non-aphasic speakers tested with the Philadelphia Naming Test. The second included naming data of Alzheimer and vascular disease patients tested with the Finnish version of the Boston Naming Test. The third included aphasia test data of patients suffering from four different aphasic syndromes tested with the Aachen Aphasia Test. The first two data sets were small; the data used in the tests were therefore artificially generated from the original confrontation naming data of 23 and 22 subjects, respectively. The third set contained aphasia test data of 146 aphasic speakers and was used as such in the experiments. With the first and third data sets the classifiers could be used successfully for the task, while the results with the second data set were less encouraging. No single classifier performed exceptionally well on all data sets, suggesting that the classifier used for classifying aphasic data should be selected based on experiments performed with the data set at hand.

4.
Although machine learning algorithms have been used successfully in many problems, their serious practical use is hampered by the fact that they often cannot produce reliable and unbiased assessments of their predictions' quality. In the last few years, several approaches for estimating the reliability or confidence of individual predictions have emerged, many of them building upon the algorithmic theory of randomness: (in historical order) transduction-based confidence estimation, typicalness-based confidence estimation, and transductive reliability estimation. Unfortunately, they all have weaknesses: either they are tightly bound to particular learning algorithms, or the interpretation of the reliability estimates is not always consistent with statistical confidence levels. In this paper we describe the typicalness and transductive reliability estimation frameworks and propose a joint approach that compensates for these weaknesses by integrating typicalness-based confidence estimation and transductive reliability estimation into a joint confidence machine. The resulting confidence machine produces confidence values in the statistical sense. We perform a series of tests with several different machine learning algorithms in several problem domains, comparing our results with those of a proprietary method as well as with kernel density estimation. We show that the proposed method performs as well as proprietary methods and significantly outperforms density estimation methods. Matjaž Kukar is currently Assistant Professor in the Faculty of Computer and Information Science at the University of Ljubljana. His research interests include machine learning, data mining and intelligent data analysis, ROC analysis, cost-sensitive learning, reliability estimation, and latent structure analysis, as well as applications of data mining to medical and business problems.
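For intuition, here is a minimal typicalness sketch. It is an assumption on our part, using a standard 1-nearest-neighbour nonconformity score rather than the authors' machine: a tentative label receives a p-value equal to the fraction of calibration examples at least as nonconforming as the new one, which is a confidence value in the statistical sense.

```python
import numpy as np

def nonconformity(X, y, x_new, label):
    # Ratio of distance to the nearest same-label point vs. nearest
    # other-label point; large values mean "atypical for this label".
    same = np.linalg.norm(X[y == label] - x_new, axis=1).min()
    other = np.linalg.norm(X[y != label] - x_new, axis=1).min()
    return same / (other + 1e-12)

def loo_score(X, y, i):
    # Leave-one-out calibration score for training example i.
    mask = np.arange(len(X)) != i
    return nonconformity(X[mask], y[mask], X[i], y[i])

def p_value(cal_scores, score):
    # Fraction of calibration examples at least as nonconforming.
    return (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)

X = np.array([[0.0], [0.2], [1.0], [1.2]])
y = np.array([0, 0, 1, 1])
cal = np.array([loo_score(X, y, i) for i in range(len(X))])
for label in (0, 1):
    s = nonconformity(X, y, np.array([0.1]), label)
    print(f"label {label}: p = {p_value(cal, s):.2f}")  # high p -> credible label
```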

5.
Steels of different classes (austenitic, martensitic, pearlitic, etc.) have different applications and characteristic property ranges. In the present work two methods are used to predict steel class from composition and heat treatment parameters: the physically based Calphad method and a data-driven machine learning method. Both are applied to the same dataset, collected from open sources (mostly steels for high-temperature applications). A classification accuracy of 93.6% is achieved by the machine learning model, trained on the concentrations of three elements (C, Cr, Ni) and heat treatment parameters (heating temperatures). The Calphad method gives 76% accuracy based on temperature and cooling rate. The reasons for misclassification by both methods are discussed; it is shown that some errors are caused by ambiguity or inaccuracy in the data or by limitations of the models used, while for the remaining cases reasonable classification accuracy is demonstrated. We suggest that the machine learning classifier's advantage stems from the small variation in the data used, which indeed does not change the steel class: the properties of a steel should be insensitive to the details of the manufacturing process.
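The abstract does not name the learning algorithm, so the sketch below uses a random forest as a stand-in, with C/Cr/Ni contents and a heating temperature as inputs. The tiny dataset is illustrative only, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: [C wt%, Cr wt%, Ni wt%, heating temperature degC] (illustrative).
X = np.array([
    [0.05, 18.0, 9.0, 1050],   # low C, high Cr/Ni  -> austenitic
    [0.08, 17.5, 10.0, 1080],
    [0.20, 12.0, 0.5, 1000],   # mid C and Cr, no Ni -> martensitic
    [0.30, 13.0, 0.3, 980],
    [0.60, 1.0, 0.2, 880],     # high C, low alloy   -> pearlitic
    [0.70, 0.8, 0.1, 860],
])
y = ["austenitic", "austenitic", "martensitic",
     "martensitic", "pearlitic", "pearlitic"]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(clf.predict([[0.07, 18.2, 9.5, 1060]]))  # expected: austenitic
```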

6.
This paper describes a synergistic approach applicable to a wide variety of system control problems. The approach utilizes a machine learning technique, goal-directed conceptual aggregation (GDCA), to facilitate dynamic decision-making. The application domain employed is Flexible Manufacturing System (FMS) scheduling and control. Simulation serves the dual purpose of providing a realistic depiction of FMSs and acting as an engine for demonstrating the viability of a synergistic system involving incremental learning. The paper briefly reviews prior approaches to FMS scheduling and control and to machine learning, outlines the GDCA approach, provides a generalized architecture for dynamic control problems, and describes the implementation of the system as applied to FMS scheduling and control. It concludes with a discussion of the general applicability of the approach.

7.
The aim of this paper is to provide a composite likelihood approach to handling spatially correlated survival data using pairwise joint distributions. With e-commerce data, a recent question of interest in marketing research has been to describe spatially clustered purchasing behavior and to assess whether geographic distance is the appropriate metric for describing purchasing dependence. We present a model for the dependence structure of time-to-event data subject to spatial dependence, characterizing purchasing behavior in a motivating e-commerce example. We assume the Farlie-Gumbel-Morgenstern (FGM) distribution and model the dependence parameter as a function of geographic and demographic pairwise distances. For estimation of the dependence parameters we present pairwise composite likelihood equations, and we prove that the resulting estimators are consistent and asymptotically normal under certain regularity conditions in the increasing-domain framework of spatial asymptotic theory.
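To make the model concrete, the following is a hedged reconstruction of its building blocks; the notation and the functional form of the distance link are ours, not quoted from the paper. The FGM family couples two marginals through a single dependence parameter, which is then indexed by pairwise distances:

```latex
% FGM joint distribution for a pair of event times (T_i, T_j)
% with marginal distribution functions F_i and F_j:
H_{ij}(t_i, t_j) = F_i(t_i)\, F_j(t_j)
  \bigl[ 1 + \theta_{ij} \{1 - F_i(t_i)\}\{1 - F_j(t_j)\} \bigr],
  \qquad \theta_{ij} \in [-1, 1].

% Dependence driven by geographic and demographic pairwise distances
% (link function assumed for illustration):
\theta_{ij} = \theta\bigl( d^{\mathrm{geo}}_{ij}, d^{\mathrm{dem}}_{ij}; \beta \bigr).

% Pairwise composite log-likelihood maximised over the parameters:
c\ell(\beta) = \sum_{i < j} \log f_{ij}(t_i, t_j; \beta).
```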

8.
As Building Information Modeling (BIM) workflows become increasingly relevant to the different stages of a project's lifecycle, more data is produced and managed across it. The information and data accumulated in BIM-based projects present an opportunity for analysis and for extracting project knowledge from inception to the operation phase. In other industries, Machine Learning (ML) has proved an effective approach for automating processes and extracting useful insights from different types and sources of data. The rapid development of ML applications, the growing generation of BIM-related data in projects, and the differing needs for use of this data present serious challenges to adopting and effectively applying ML techniques to BIM-based projects in the Architecture, Engineering, Construction and Operations (AECO) industry. While research on the use of BIM data through ML has increased in the past decade, it is still in a nascent stage. To assess where the industry stands today, this paper carries out a systematic literature review (SLR) identifying and summarizing common emerging areas of application and utilization of ML within the context of BIM-generated data. The paper also identifies research gaps and trends and, based on the observed limitations, suggests prominent future research directions focusing on information architecture and data, application scalability, and human-information interaction.

9.
Instance selection aims at filtering noisy data (or outliers) out of a given training set, which not only reduces the need for storage space but can also ensure that a classifier trained on the reduced set performs similarly to or better than the baseline classifier trained on the original set. However, since there are numerous instance selection algorithms, no single one is best across problem domains: instance selection performance is algorithm- and dataset-dependent. One main reason is that it is very hard to define what the outliers are across different datasets. Notably, a specific instance selection algorithm may over-select, filtering out too many 'good' data samples and leaving a classifier that performs worse than the baseline. In this paper, we introduce a dual classification (DuC) approach that aims to deal with this potential drawback of over-selection. Specifically, after performing instance selection over a given training set, two classifiers are trained on the 'good' and 'noisy' sets, respectively, as identified by the instance selection algorithm. At prediction time, each test sample is compared for similarity against the data in the good and noisy sets, and this comparison routes the test sample to one of the two classifiers. Experiments conducted on 50 small-scale and 4 large-scale datasets demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.
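A minimal sketch of the routing idea described above, with assumed details: nearest-neighbour distance stands in for the similarity comparison, a simple threshold filter stands in for the instance selection algorithm, and k-NN classifiers are arbitrary placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def duc_predict(x, X_good, X_noisy, clf_good, clf_noisy):
    # Route the test sample to the classifier trained on the set it
    # most resembles (here: smallest nearest-neighbour distance).
    d_good = np.linalg.norm(X_good - x, axis=1).min()
    d_noisy = np.linalg.norm(X_noisy - x, axis=1).min()
    clf = clf_good if d_good <= d_noisy else clf_noisy
    return clf.predict(x.reshape(1, -1))[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
keep = np.abs(X[:, 0]) > 0.3              # stand-in for an instance selector
X_good, y_good = X[keep], y[keep]
X_noisy, y_noisy = X[~keep], y[~keep]
clf_good = KNeighborsClassifier(3).fit(X_good, y_good)
clf_noisy = KNeighborsClassifier(3).fit(X_noisy, y_noisy)
print(duc_predict(np.array([1.0, 0.0]), X_good, X_noisy, clf_good, clf_noisy))
```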

10.
Incremental learning techniques have been used extensively to address the data stream classification problem. The most important issue is to maintain a balance between accuracy and efficiency: the algorithm should provide good classification performance with a reasonable response time. This work introduces a new technique, named Similarity-based Data Stream Classifier (SimC), which achieves good performance through a novel insertion/removal policy that adapts quickly to the data tendency and maintains a small, representative set of examples and estimators that guarantees good classification rates. The method can also detect novel classes/labels during the running phase and remove useless ones that add no value to the classification process. Statistical tests were used to evaluate model performance from two points of view: efficacy (classification rate) and efficiency (online response time). Five well-known techniques and sixteen data streams were compared using Friedman's test, and the Nemenyi, Holm, and Shaffer tests were applied to determine which schemes differed significantly. The results show that SimC is very competitive in terms of (absolute and streaming) accuracy and classification/updating time, in comparison with several of the most popular methods in the literature.

11.
Software testing techniques and criteria are considered complementary, since they can reveal different kinds of faults and test distinct aspects of a program. Functional criteria, such as Category Partition, are difficult to automate and are usually applied manually. Structural and fault-based criteria generally provide measures to evaluate test sets, and the existing supporting tools produce a great deal of information, including input and produced output, structural coverage, mutation score, faults revealed, and so on. However, such information is not linked to the functional aspects of the software. In this work, we present an approach based on machine learning techniques to link test results from the application of different testing techniques. The approach groups test data into similar functional clusters and then, according to the tester's goals, generates classifiers (rules) with different uses, including selection and prioritization of test cases. The paper also presents results from experimental evaluations and illustrates such uses.
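A minimal sketch of the two-step pipeline, assuming k-means for the functional clustering and a shallow decision tree as the rule generator; the execution features (coverage, output size) and data are hypothetical stand-ins for the tool-produced information.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Step 1: group test executions into functional clusters.
rng = np.random.default_rng(1)
exec_features = rng.random((40, 2))   # e.g. [branch coverage, output size]
clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(exec_features)

# Step 2: learn readable rules mapping execution features to clusters;
# such rules could then drive test-case selection or prioritisation.
rules = DecisionTreeClassifier(max_depth=2, random_state=1)
rules.fit(exec_features, clusters)
print(export_text(rules, feature_names=["coverage", "output_size"]))
```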

12.
A hybrid machine learning approach to network anomaly detection
Zero-day cyber attacks such as worms and spyware are becoming increasingly widespread and dangerous, and existing signature-based intrusion detection mechanisms are often insufficient for detecting them. As a result, anomaly intrusion detection methods have been developed to cope with such attacks. Among the variety of anomaly detection approaches, the Support Vector Machine (SVM) is known to be one of the best machine learning algorithms for classifying abnormal behaviors. The soft-margin SVM is one of the well-known basic SVM methods using supervised learning; however, it is not appropriate for detecting novel attacks in Internet traffic, since it requires pre-acquired learning information for the supervised learning procedure, i.e., traffic separately labeled as normal or attack. The one-class SVM, in contrast, uses unsupervised learning to detect anomalies and does not require labeled information; its downside is a high false positive rate that makes it difficult to use in the real world. In this paper, we propose a new SVM approach, named Enhanced SVM, which combines these two methods in order to provide unsupervised learning with a low false alarm rate, similar to that of a supervised SVM approach. We use the following additional techniques to improve the performance of the proposed approach (referred to as the Anomaly Detector using Enhanced SVM). First, we create a profile of normal packets using a Self-Organized Feature Map (SOFM), enabling SVM learning without pre-existing knowledge. Second, we use a packet filtering scheme based on Passive TCP/IP Fingerprinting (PTF) to reject incomplete network traffic that violates the TCP/IP standard or the generation policy of well-known platforms. Third, a feature selection technique using a Genetic Algorithm (GA) is used to extract optimized information from raw Internet packets. Fourth, we use flows of packets based on temporal relationships during data preprocessing, so that the temporal relationships among the inputs are considered in SVM learning. Lastly, we demonstrate the effectiveness of the Enhanced SVM approach combined with SOFM, PTF, and GA on MIT Lincoln Lab datasets and on a live dataset captured from a real network. The experimental results are verified by m-fold cross-validation, and the proposed approach is compared with real-world Network Intrusion Detection Systems (NIDS).
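To illustrate the unsupervised half of the approach, here is a minimal one-class SVM sketch: the detector is fitted on (unlabeled, predominantly normal) traffic features and flags outliers as anomalies. The feature vectors are synthetic stand-ins, not the paper's SOFM/GA-derived features.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 4))   # profile of normal packet features
attack = rng.normal(5, 1, size=(10, 4))    # novel, previously unseen behaviour

# nu bounds the fraction of training points treated as outliers;
# tuning it trades detection rate against false positives.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)
print(detector.predict(attack))            # -1 = anomaly, +1 = normal
```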

13.
Industrialized building construction is an approach that integrates manufacturing techniques into construction projects to achieve improved quality, shortened project duration, and enhanced schedule predictability. Time savings result from carrying out factory operations and site preparation activities concurrently. In an industrialized building construction factory, accurate prediction of the production cycle time is crucial to reaping the advantage of improved schedule predictability, leading to enhanced production planning and control. Given the large amount of data generated by the daily operations of such a factory, the present study proposes a machine learning approach to estimating production time using (1) the physical characteristics of building components, (2) real-time tracking data gathered with a radio frequency identification (RFID) system, and (3) a set of engineered features constructed to capture the real-time loading conditions of the job shop. The results show a mean absolute percentage error of 11% and a correlation coefficient of 0.80 between the actual and predicted values when using random forest models. The results confirm the significant effect of including shop utilization features in model training and suggest that production time can be predicted reasonably well.
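A minimal sketch of the prediction set-up: a random forest regressor mapping component attributes and shop-loading features to cycle time, evaluated with the same error metric the study reports. The features and data below are hypothetical stand-ins for the RFID-derived ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(0)
# Hypothetical features: [panel area, number of openings, shop utilisation].
X = rng.random((200, 3))
y = 2.0 + 5.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(0, 0.2, 200)  # cycle time

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[:150], y[:150])
pred = model.predict(X[150:])
# The study reports ~11% MAPE on its real factory data.
print("MAPE:", mean_absolute_percentage_error(y[150:], pred))
```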

14.
Curated collections of models are essential for the success of Machine Learning (ML) and Data Analytics in Model-Driven Engineering (MDE). However, current datasets are either too small or not properly curated. In this paper, we present ModelSet, a dataset composed of 5,466 Ecore models and 5,120 UML models which have been manually labelled to support ML tasks. We describe the structure of the dataset, explain how to use the associated library to develop ML applications in Python, and present some applications that can be addressed using ModelSet. Tool website: https://github.com/modelset

15.
Ultra-precision machining (UPM) is an advanced manufacturing technology in increasing demand, which makes it necessary to minimize the environmental impact of its substantial resource consumption. Achieving sustainable UPM remains a challenge, as it involves complicated influencing relationships among relevant factors, such as energy consumption and human health, that affect sustainability performance. Some influencing relationships between pairs of parameters have not yet been fully studied; these are termed undiscussed two-parameter relationships. This paper therefore proposes a new topic discovery model, based on social network analysis (SNA) and a machine learning approach, to discover undiscussed two-parameter relationships with high potential value in the sustainable UPM research field. Using link prediction metrics obtained by SNA together with principal component analysis, the interactive relationships among the parameters of sustainable ultra-precision machining are determined in order to assess the potential value of undiscussed two-parameter topics. The k-means algorithm is then applied to classify the topics based on the similarity of the metric results, presenting the potential value distribution of undiscussed topics in sustainable UPM. From the metric results, the relationship between environmental damage and resource waste was found to be the most valuable potential two-parameter topic in the area of sustainable UPM. The paper also contributes by showing the potential value distribution of undiscussed two-parameter relationships and predicting the sustainable development trend in the UPM sector.
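A minimal sketch of the SNA step, with assumed details: a co-occurrence network of sustainable-UPM parameters is built from the literature, and an unconnected pair is scored with a standard link prediction metric (Jaccard coefficient here; the paper uses several). The edges are illustrative.

```python
import networkx as nx

# Hypothetical parameter co-occurrence network mined from the literature.
G = nx.Graph([
    ("energy consumption", "tool wear"),
    ("energy consumption", "surface quality"),
    ("environmental damage", "energy consumption"),
    ("environmental damage", "tool wear"),
    ("resource waste", "energy consumption"),
    ("resource waste", "tool wear"),
])

# Score an undiscussed (unconnected) parameter pair: a high score means
# the two parameters share many neighbours, suggesting a promising topic.
candidates = [("environmental damage", "resource waste")]
for u, v, score in nx.jaccard_coefficient(G, candidates):
    print(f"{u} -- {v}: {score:.2f}")
```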

16.
Class imbalance limits the performance of most learning algorithms, since they cannot cope with large differences between the number of samples in each class, resulting in low predictive accuracy over the minority class. Several papers have proposed algorithms aiming at more balanced performance; however, balancing the recognition accuracies for each class very often harms global accuracy, as the accuracy over the minority class increases while the accuracy over the majority class decreases. This paper proposes an approach to overcome this limitation: for each classification act, it chooses between the output of a classifier trained on the original skewed distribution and the output of a classifier trained with a learning method that addresses the problem of imbalanced data. The choice is driven by a parameter whose value maximizes, on a validation set, two objective functions: the global accuracy and the per-class accuracies. A series of experiments on ten public datasets with different proportions between the majority and minority classes shows that the proposed approach provides more balanced recognition accuracies than classifiers trained with traditional learning methods for imbalanced data, as well as larger global accuracy than classifiers trained on the original skewed distribution.

17.
We present a comparative study of the most popular machine learning methods applied to the challenging problem of customer churn prediction in the telecommunications industry. In the first phase of our experiments, all models were applied and evaluated using cross-validation on a popular public-domain dataset. In the second phase, the performance improvement offered by boosting was studied. To determine the most efficient parameter combinations, we performed a series of Monte Carlo simulations for each method over a wide range of parameters. Our results demonstrate the clear superiority of the boosted versions of the models over the plain (non-boosted) versions. The best overall classifier was SVM-POLY with AdaBoost, achieving an accuracy of almost 97% and an F-measure over 84%.
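A minimal sketch of the winning configuration, a polynomial-kernel SVM boosted with AdaBoost, on synthetic stand-in data rather than the churn dataset. The discrete SAMME variant is used because a plain SVC does not expose class probabilities; hyperparameters are placeholders (requires scikit-learn >= 1.2 for the `estimator` keyword).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Imbalanced synthetic stand-in for a churn dataset (~20% churners).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

model = AdaBoostClassifier(estimator=SVC(kernel="poly", degree=2),
                           algorithm="SAMME", n_estimators=25, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())
```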

18.
Traffic classification groups similar or related traffic data and is a mainstream data fusion technique in the field of network management and security. With the rapid growth of network users and the emergence of new networking services, network traffic classification has attracted increasing attention, and many new classification techniques have been developed and widely applied. However, the existing literature lacks a thorough survey that summarizes, compares, and analyzes recent advances in network traffic classification from a holistic perspective. This paper reviews existing network traffic classification methods from a new and comprehensive perspective, classifying them into five categories based on their representative classification features: statistics-based, correlation-based, behavior-based, payload-based, and port-based classification. A series of criteria is proposed for evaluating the performance of existing traffic classification methods. For each category, we analyze and discuss the details, advantages, and disadvantages of its methods and present the traffic features commonly used. Summaries of the investigation are offered to provide a holistic and specialized view of the state of the art. For convenience, we also discuss the most commonly used datasets and the traffic features adopted for traffic classification. Finally, we identify a list of open issues and future directions in this research field.

19.
A scalable, incremental learning algorithm for classification problems
In this paper a novel data mining algorithm, Clustering and Classification Algorithm-Supervised (CCA-S), is introduced. CCA-S enables the scalable, incremental learning of a non-hierarchical cluster structure from training data. This cluster structure serves as a function mapping the attribute values of new data to their target class, that is, classifying new data. CCA-S utilizes both the distance and the target class of training data points to derive the cluster structure. We first present the problems that many existing data mining algorithms for classification, such as decision trees and artificial neural networks, have with scalable, incremental learning. We then describe CCA-S and discuss its advantages in this setting. Results of applying CCA-S to several common classification data sets are presented and show that its classification performance is comparable to that of other data mining algorithms such as decision trees, artificial neural networks, and discriminant analysis.
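A minimal sketch in the spirit of CCA-S, not the published algorithm: a flat cluster structure is grown incrementally, where a sample starts a new cluster unless a cluster of its own class lies within an assumed `radius`, and prediction returns the class of the nearest cluster centre.

```python
import numpy as np

class IncrementalClusters:
    def __init__(self, radius=1.0):
        self.radius, self.centres, self.labels, self.counts = radius, [], [], []

    def learn_one(self, x, y):
        # Absorb x into a nearby same-class cluster, else open a new one.
        for i, (c, lab) in enumerate(zip(self.centres, self.labels)):
            if lab == y and np.linalg.norm(c - x) <= self.radius:
                self.counts[i] += 1
                self.centres[i] = c + (x - c) / self.counts[i]  # running mean
                return
        self.centres.append(np.asarray(x, float))
        self.labels.append(y)
        self.counts.append(1)

    def predict(self, x):
        d = [np.linalg.norm(c - x) for c in self.centres]
        return self.labels[int(np.argmin(d))]

model = IncrementalClusters(radius=1.0)
for x, y in [([0, 0], "a"), ([0.3, 0.1], "a"), ([5, 5], "b")]:
    model.learn_one(np.array(x, dtype=float), y)
print(model.predict(np.array([0.2, 0.2])))  # -> "a"
```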

20.
The derivation and discovery of physical dynamics inherent in big data is one of the major purposes of machine learning (ML) in modern natural science. In materials science, phase diagrams are often called 'road maps' for understanding the conditions of phase formation and transformation in any material system governed by the associated thermodynamics. In this paper, we report a numerical experiment investigating whether the underlying thermodynamics can be derived, with the help of ML, from big data consisting of local spatial composition and phase-distribution data. The artificial data analysed were created for an assumed steel composition using CALPHAD (calculation of phase diagrams) thermodynamics combined with an order-statistics-based sampling model. The hypothetical data acquisition procedure assumed in this numerical experiment is as follows: (i) obtain local composition and phase-distribution data for the same observation area using instruments such as an electron probe micro-analyser (EPMA) and electron backscatter diffraction (EBSD); and (ii) train a classification model based on an ML algorithm with the compositional data as input and the phase data as output. The accuracies of the reconstructed phase diagrams were estimated for three ML algorithms: support vector machine (SVM), random forest, and multilayer perceptron (MLP). The phase diagrams predicted using SVM and MLP are adequately consistent with those of the CALPHAD method. We also investigated regression performance on the continuous quantities involved in CALPHAD thermodynamics, such as the phase fractions of the body-centred cubic, face-centred cubic, and cementite phases. Compared with the ML algorithms, the CALPHAD method shows superior predictive performance, since it is based on a sophisticated physical model.
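A minimal sketch of step (ii): classifiers taking local composition as input and the observed phase as output, comparing the three algorithms named above. The compositions and phase labels below are synthetic stand-ins for the EPMA/EBSD-style data, not the paper's CALPHAD-generated set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((300, 2))                             # e.g. local [C, Mn] contents
y = np.where(X[:, 0] > 0.5, "cementite+bcc", "fcc")  # toy phase labels

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random forest": RandomForestClassifier(random_state=0),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(max_iter=2000, random_state=0)),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```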
