Similar literature
 20 similar documents found (search time: 31 ms)
1.
The paper presents a neural network based multi-classifier system for the identification of Escherichia coli promoter sequences in strings of DNA. As each gene in DNA is preceded by a promoter sequence, the successful location of an E. coli promoter leads to the identification of the corresponding E. coli gene in the DNA sequence. A set of 324 known E. coli promoters and a set of 429 known non-promoter sequences were encoded using four different encoding methods. The encoded sequences were then used to train four different neural networks. The classification results of the four individual neural networks were then combined through an aggregation function, which used a variation of the logarithmic opinion pool method. The weights of this function were determined by a genetic algorithm. The multi-classifier system was then tested on 159 known promoter sequences and 171 non-promoter sequences not contained in the training set. The results of this study show that the same data set, when presented to neural networks in different forms, can yield slightly varying results. They also show that when the opinions of several classifiers on the same input data are integrated within a multi-classifier system, the results are better than those of the individual neural networks. Our multi-classifier system outperforms other prediction systems for E. coli promoters developed so far.
Vasile Palade
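The aggregation step described in this abstract can be illustrated with the standard logarithmic opinion pool, which combines class probabilities through a weighted geometric mean. The sketch below is only an illustration under assumptions: the abstract says the authors used a variation of this method and tuned the weights with a genetic algorithm, neither of which is reproduced here. The function name, the example outputs, and the weights are all hypothetical.

```python
import math


def log_opinion_pool(probabilities, weights):
    """Combine per-classifier class distributions with a weighted geometric mean.

    probabilities: one distribution per classifier (lists of equal length).
    weights:       one non-negative weight per classifier.
    """
    eps = 1e-12  # guards against log(0)
    n_classes = len(probabilities[0])
    log_scores = [
        sum(w * math.log(max(p[c], eps)) for p, w in zip(probabilities, weights))
        for c in range(n_classes)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical outputs of four networks for one sequence: [promoter, non-promoter].
outputs = [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3], [0.4, 0.6]]
# Placeholder weights (assumption); the paper determines them with a genetic algorithm.
weights = [0.4, 0.2, 0.3, 0.1]
pooled = log_opinion_pool(outputs, weights)
```

Because the pool multiplies probabilities, a single confident vote against a class pulls the pooled score down more strongly than it would in a simple weighted average.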

2.
The artificial immune recognition system (AIRS) has been shown to be an efficient approach to tackling a variety of problems such as machine learning benchmark problems and medical classification problems. In this study, the resource allocation mechanism of AIRS was replaced with a new one based on fuzzy logic. The new system, named Fuzzy-AIRS, was used as a classifier in the classification of three well-known medical data sets, the Wisconsin breast cancer data set (WBCD), the Pima Indians diabetes data set and the ECG arrhythmia data set. The performance of the Fuzzy-AIRS algorithm was tested for classification accuracy, sensitivity and specificity values, confusion matrix, computation time and receiver operating characteristic curves. Also, the AIRS and Fuzzy-AIRS algorithms were compared with respect to the amount of resources required in the execution of the algorithm. The highest classification accuracy obtained from applying the AIRS and Fuzzy-AIRS algorithms using 10-fold cross-validation was, respectively, 98.53% and 99.00% for classification of WBCD; 79.22% and 84.42% for classification of the Pima Indians diabetes data set; and 100% and 92.86% for classification of the ECG arrhythmia data set. Hence, these results show that Fuzzy-AIRS can be used as an effective classifier for medical problems.

3.
As many structures of protein–DNA complexes have become known in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One of the reasons is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein–DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other SVM model predicts protein-binding nucleotides using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3%, and an MCC of 0.324. The other SVM model, which uses both DNA and protein sequences, achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5%, and an MCC of 0.329. Both in cross-validation and in independent testing, the second SVM model, which used both DNA and protein sequence data, performed better than the first model, which used DNA sequence data only.
To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone.
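The three scores reported in this abstract (accuracy, F-measure, and MCC) all derive from the counts of a binary confusion matrix. The following sketch shows how they relate. The counts in the usage example are invented for illustration and are not the paper's data, and the function name is hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F-measure, and Matthews correlation coefficient from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    # MCC is defined as 0 when any marginal total is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


# Made-up, perfectly balanced counts (assumption): 67 of 100 correct in each class.
scores = binary_metrics(tp=67, fp=33, tn=67, fn=33)
```

On a balanced set like this one, an accuracy of 67% corresponds to an MCC of 0.34, which is why an MCC near 0.3 can accompany accuracies in the high sixties.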

4.
The use of machine learning tools in biological data analysis is increasing gradually. This is mainly because the effectiveness of classification and recognition systems has improved a great deal, helping medical experts in diagnosis. In this paper, we investigate the performance of an artificial immune system based k-nearest neighbors algorithm, with and without cross-validation, on a class of imbalanced problems from the bioinformatics field. Furthermore, we used an unsupervised artificial immune system algorithm to reduce the dimension of the training data and the k-nearest neighbors algorithm for classification. The experiments conducted showed the effectiveness of the proposed scheme. By selecting the E. coli database, we could compare our classification accuracy with other methods presented in the literature. The proposed hybrid system produced much more accurate results than Horton and Nakai's proposal [P. Horton, K. Nakai, A probabilistic classification system for predicting the cellular localization sites of proteins, in: Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology, AAAI Press, St. Louis, 1996, pp. 109–115; P. Horton, K. Nakai, Better prediction of protein cellular localization sites with the k-nearest neighbors classifier, in: Proceedings of Intelligent Systems in Molecular Biology, Halkidiki, Greece, 1997, pp. 368–383]. Besides the accuracy improvement, one of the important aspects of the proposed methodology is its complexity. As the artificial immune system provides data reduction, the training complexity of the proposed system is considerably lower than that of the k-nearest neighbors classifier.

5.
The Artificial Immune Recognition System (AIRS) classification algorithm, which has an important place among classification algorithms in the field of Artificial Immune Systems, has shown effective and intriguing performance on the problems to which it has been applied. AIRS was previously applied to several medical classification problems, including Breast Cancer, Cleveland Heart Disease and Diabetes, and obtained very satisfactory results. AIRS has thus proved to be an efficient artificial intelligence technique in the medical field. In this study, the resource allocation mechanism of AIRS was replaced with a new one determined by fuzzy logic. This system, named Fuzzy-AIRS, was used as a classifier in the diagnosis of Breast Cancer and Liver Disorders, which are of great importance in medicine. The Breast Cancer and BUPA Liver Disorders datasets, taken from the University of California at Irvine (UCI) Machine Learning Repository, were classified using the 10-fold cross-validation method. The classification accuracies obtained were evaluated by comparing them with the classifiers reported on the UCI web site and with other systems applied to the same problems. The classification performance was also compared with that of AIRS with regard to classification accuracy, number of resources and classification time. Fuzzy-AIRS reached a classification accuracy of 98.51% for breast cancer and classified the Liver Disorders dataset with 83.36% accuracy. For both datasets, Fuzzy-AIRS obtained the highest classification accuracy according to the UCI web site. Besides this success, Fuzzy-AIRS gained an important advantage over AIRS in terms of classification time. In the experiments, the classification time of Fuzzy-AIRS was reduced by about 70% compared with AIRS for both datasets.
By reducing classification time as well as obtaining high classification accuracies on the applied datasets, the Fuzzy-AIRS classifier proved that it could be used as an effective classifier for medical problems.

6.
Text data mining is a process of exploratory data analysis. Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data. This paper describes a proposed k-Nearest Neighbor classifier that performs comparative cross-validation against the existing k-Nearest Neighbor classifier. The feasibility and the benefits of the proposed approach are demonstrated by means of a data mining problem, direct marketing, which has become an important application field of data mining. Comparative cross-validation involves estimation of accuracy by either stratified k-fold cross-validation or equivalent repeated random subsampling. While the proposed method may have a high bias, its performance (accuracy estimation in our case) may be poor due to a high variance. Thus the accuracy with the proposed k-Nearest Neighbor classifier was less than that with the existing k-Nearest Neighbor classifier, and the smaller the improvement in runtime, the larger the improvement in precision and recall. In our proposed method we have determined the classification accuracy and the prediction accuracy, where the prediction accuracy is comparatively high.
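Stratified k-fold cross-validation, mentioned in this abstract, splits the data so that every fold keeps roughly the same class proportions as the full set. A minimal sketch follows, assuming a round-robin assignment per class. The labels and the function name are hypothetical, and the abstract does not say how the authors implemented the split.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Return k lists of indices; each fold keeps class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced direct-marketing labels (assumption): 20% responders.
labels = ["respond"] * 20 + ["ignore"] * 80
folds = stratified_k_fold(labels, k=5)

for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A classifier would be fitted on train_idx and scored on test_idx here.
    print(held_out, len(train_idx), len(test_idx))
```

Stratification matters most for imbalanced problems such as this one, where a plain random split could leave a fold with very few examples of the minority class.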

7.
In this study, risk factors related to environmental (climatological) conditions that are associated with motor vehicle accidents on the Konya-Afyonkarahisar highway have been determined with the aid of Geographical Information Systems (GIS), using a combination of K-means clustering (KMC)-based attribute weighting (KMCAW) and classifier algorithms including artificial neural network (ANN) and adaptive network-based fuzzy inference system (ANFIS). The locations of the motor vehicle accidents were identified by the dynamic segmentation process in ArcGIS 9.0 from the traffic accident reports recorded by the District Traffic Agency. The traffic accident dataset obtained from this system comprises five attributes (day, temperature, humidity, weather conditions, and month of the accident) and 358 observations, including 179 without accident and 179 with accident. The proposed method comprises two stages. In the first stage, all attributes of the dataset are weighted using the KMCAW method. The aims of this weighting are both to increase the classification performance of the classifier algorithm and to transform the linearly non-separable traffic accident dataset into a linearly separable one. In the second stage, after the weighting process, the ANN and ANFIS classifier algorithms are used separately to classify each case as with accident or without accident. To evaluate the performance of the proposed method, the classification accuracy, sensitivity, specificity, and area under the Receiver Operating Characteristic (ROC) curve (AUC) have been used.
While the ANN and ANFIS classifiers obtained overall prediction accuracies of 53.93% and 38.76%, respectively, the combination of KMCAW and ANN and the combination of KMCAW and ANFIS achieved overall prediction accuracies of 74.15% and 55.06% in the prediction of traffic accidents. The experimental results demonstrate that the proposed attribute weighting method, KMCAW, is a robust and effective data pre-processing method for the prediction of traffic accidents on the Konya-Afyonkarahisar highway in Turkey.

8.
This paper presents a new method for differential diagnosis of erythemato-squamous diseases based on Genetic Algorithm (GA) wrapped Bayesian Network (BN) Feature Selection (FS). With this aim, a GA based FS algorithm combined in parallel with a BN classifier is proposed. The erythemato-squamous dataset contains six dermatological diseases defined by 34 features. In the GA–BN algorithm, the GA performs a heuristic search to find the most relevant feature model that increases the accuracy of the BN algorithm, using a 10-fold cross-validation strategy. The subsets of features are sequentially used to identify the six dermatological diseases via a BN fitted to the corresponding data. The algorithm, in this case, produces 99.20% classification accuracy in the diagnosis of erythemato-squamous diseases. The strength of the feature model generated for the BN is further tested with Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Simple Logistics (SL) and Functional Decision Tree (FT) classifiers. The resulting classification accuracies of these algorithms are 98.36%, 97.00%, 98.36% and 97.81%, respectively. By comparison, the BN algorithm's classification accuracy of 99.20% is a high diagnostic performance for erythemato-squamous diseases. The proposed algorithm makes no more than 3 misclassifications out of 366 instances. Furthermore, the FS power of the GA is also compared with two alternative search algorithms, i.e. Best First (BF) and Sequential Floating (SF). Taken together, the results show that the proposed GA–BN based FS and prediction strategy is very promising for the diagnosis of erythemato-squamous diseases.

9.
Constructing a model of gene expression using microarray techniques can reveal the regulation rules from the gene expression profiles. The S-system model makes it possible to analyze the dynamics of the regulatory system. However, with 2N(N + 1) parameters (called a set), an S-system model of an N-gene genetic network takes many iterations to reach convergent gene expression profiles. Mining the association between the gene expression profiles and the 2N(N + 1) parameters may provide information about the probability of convergence, instead of trying to obtain convergent gene expression profiles through many iterations. Based on this novel approach, a binary classifier with higher accuracy can be used to analyze and predict the convergence of the gene expression profiles from an initial set, reducing the search time of the inference problem. This paper applies popular data mining algorithms to the classification task and compares their accuracy rates on a dataset of 250 cases (176 training cases and 74 test cases). According to the decision rules of the chosen classifier, we can provide a convergence prediction of time-series gene expression profiles for a given set of parameters.
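The parameter count 2N(N + 1) in this abstract follows from the usual form of an S-system, in which each gene has one production term and one degradation term, each a rate constant times a product of power laws. The sketch below assumes that standard form, since the abstract does not write the equations out. The function names and the two-gene example values are hypothetical.

```python
def s_system_derivatives(x, alpha, beta, g, h):
    """dX_i/dt = alpha_i * prod_j X_j**g[i][j] - beta_i * prod_j X_j**h[i][j]."""
    n = len(x)
    derivatives = []
    for i in range(n):
        production = alpha[i]
        degradation = beta[i]
        for j in range(n):
            production *= x[j] ** g[i][j]
            degradation *= x[j] ** h[i][j]
        derivatives.append(production - degradation)
    return derivatives


def parameter_count(n):
    # alpha and beta contribute n values each; g and h contribute n * n each.
    return 2 * n + 2 * n * n


# Hypothetical two-gene network (assumption): each gene is produced in
# proportion to the other gene and degraded in proportion to itself.
rates = s_system_derivatives(
    x=[1.0, 2.0],
    alpha=[1.0, 1.0],
    beta=[1.0, 1.0],
    g=[[0, 1], [1, 0]],
    h=[[1, 0], [0, 1]],
)
```

Since 2N + 2N² = 2N(N + 1), a five-gene network already has 60 parameters, which is why searching for a convergent set by simulation alone is costly.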

10.
In this study, a hierarchical electroencephalogram (EEG) classification system for epileptic seizure detection is proposed. The system includes the following three stages: (i) representation of the original EEG signals by wavelet packet coefficients and feature extraction using the best basis-based wavelet packet entropy method, (ii) a cross-validation (CV) method together with a k-Nearest Neighbor (k-NN) classifier, used in the training stage for hierarchical knowledge base (HKB) construction, and (iii) in the testing stage, computation of classification accuracy and rejection rate using the top-ranked discriminative rules from the HKB. The data set is taken from a publicly available EEG database that aims to differentiate healthy subjects from subjects suffering from epilepsy. Experimental results show the efficiency of the proposed system. The best classification accuracy is about 100% via 2-, 5-, and 10-fold cross-validation, which indicates that the proposed method has potential for designing a new intelligent EEG-based diagnostic assistance system for early detection of electroencephalographic changes.

11.

New interaction paradigms combined with emerging technologies have led to the creation of diverse Natural User Interface (NUI) devices in the market. These devices enable the recognition of body gestures, allowing users to interact with applications in a more direct, expressive, and intuitive way. In particular, the Leap Motion Controller (LMC) device has been receiving plenty of attention from NUI application developers because it allows them to address limitations on gestures made with hands. Although this device is able to recognize the position of several parts of the hands, developers are still left with the difficult task of recognizing gestures. For this reason, several authors have approached this problem using machine learning techniques. We propose a classifier based on Approximate String Matching (ASM). In short, we encode the trajectories of the hand joints as character sequences using the K-means algorithm and then analyze these sequences with ASM. When using the K-means algorithm, we select the number of clusters for each part of the hands by considering the Silhouette Coefficient. Furthermore, we define other important factors to take into account for improving the recognition accuracy. For the experiments, we generated a balanced dataset including different types of gestures and then applied a cross-validation scheme. Experimental results showed the robustness of the approach in terms of recognizing different types of gestures, time spent, and allocated memory. In addition, our approach achieved higher performance rates than well-known algorithms in the current state of the art for gesture recognition.
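Once trajectories are encoded as character sequences, approximate string matching reduces gesture recognition to finding the stored sequence closest to the observed one. The sketch below uses the Levenshtein edit distance as the matching measure, which is an assumption: the abstract does not state which ASM measure the authors use. The template strings and gesture labels are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def classify(sequence, templates):
    """Return the label whose closest template has the smallest edit distance."""
    return min(
        templates,
        key=lambda label: min(edit_distance(sequence, t) for t in templates[label]),
    )


# Hypothetical cluster-label strings (assumption); each letter stands for one
# K-means cluster visited by a hand joint along its trajectory.
templates = {"swipe": ["AABBC", "ABBCC"], "circle": ["ABCDA", "ABCCDA"]}
label = classify("AABCC", templates)
```

Matching by edit distance tolerates small timing differences between repetitions of a gesture, since an extra or missing cluster symbol costs only one edit.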


12.
We present a comparative study of the most popular machine learning methods applied to the challenging problem of customer churn prediction in the telecommunications industry. In the first phase of our experiments, all models were applied and evaluated using cross-validation on a popular, public domain dataset. In the second phase, the performance improvement offered by boosting was studied. To determine the most efficient parameter combinations, we performed a series of Monte Carlo simulations for each method and for a wide range of parameters. Our results demonstrate the clear superiority of the boosted versions of the models over the plain (non-boosted) versions. The best overall classifier was the SVM-POLY using AdaBoost, with an accuracy of almost 97% and an F-measure over 84%.

13.
In this paper, a novel hybrid method that integrates an effective filter, maximum relevance minimum redundancy (MRMR), with a fast classifier, the extreme learning machine (ELM), is introduced for diagnosing erythemato-squamous (ES) diseases. In the proposed method, MRMR is employed as a feature selection tool for dimensionality reduction in order to further improve the diagnostic accuracy of the ELM classifier. The impact of the type of activation function, the number of hidden neurons and the size of the feature subsets on the performance of ELM has been investigated in detail. The effectiveness of the proposed method has been rigorously evaluated in terms of classification accuracy on the ES disease dataset, a benchmark dataset from the UCI machine learning database. Experimental results demonstrate that our method achieved a best classification accuracy of 98.89% and an average accuracy of 98.55% via 10-fold cross-validation. The proposed method might serve as a new candidate among powerful methods for diagnosing ES diseases.

14.
There is significant interest in the network management and industrial security community in identifying the “best” and most relevant features of network traffic in order to properly characterize user behaviour and predict future traffic. The ability to eliminate redundant features is an important Machine Learning (ML) task because it helps to identify the best features, which improves classification accuracy and reduces the computational complexity of constructing the classifier. In practice, feature selection (FS) techniques can be used as a preprocessing step to eliminate irrelevant features and as a knowledge discovery tool to reveal the “best” features in many soft computing applications. In this paper, we investigate the advantages and disadvantages of such FS techniques with newly proposed metrics (namely goodness, stability and similarity). We continue our efforts toward developing an integrated FS technique that is built on the key strengths of existing FS techniques. A novel way is proposed to identify the “best” features efficiently and accurately by first combining the results of some well-known FS techniques to find consistent features, and then using the proposed concept of support to select a smallest set of features that covers the data optimally. The empirical study over ten high-dimensional network traffic data sets demonstrates a significant gain in accuracy and improved run-time performance of a classifier compared to the individual results produced by some well-known FS techniques.

15.
This paper presents a novel method for the diagnosis of hepatitis disease. The proposed method is a hybrid that uses feature selection (FS) and an artificial immune recognition system (AIRS) with a fuzzy resource allocation mechanism. AIRS has shown effective performance on several problems, such as machine learning benchmark problems and medical classification problems like breast cancer, diabetes, and liver disorders classification. By hybridizing FS and AIRS with a fuzzy resource allocation mechanism, a method is obtained that solves this diagnosis problem through classification. The robustness of this method with regard to sampling variations is examined using a cross-validation method. We used the hepatitis disease dataset taken from the UCI machine learning repository. Via 10-fold cross-validation, our system obtained a classification accuracy of 92.59%, which is the highest reached so far and is very promising compared with the other classification applications in the literature for this problem. The sensitivity and specificity values obtained for the hepatitis disease dataset were 100% and 85%, respectively.

16.
To develop Human-centric Driver Assistance Systems (HDAS) for automatic understanding and characterizing of driver behaviors, an efficient feature extraction of driving postures based on the Geronimo–Hardin–Massopust (GHM) multiwavelet transform is proposed, and Multilayer Perceptron (MLP) classifiers with three layers are then exploited to recognize four pre-defined classes of driving postures. With features extracted from a driving posture dataset created at Southeast University (SEU), holdout and cross-validation experiments on driving posture classification are conducted with MLP classifiers and compared with the Intersection Kernel Support Vector Machines (IKSVMs), the k-Nearest Neighbor (kNN) classifier and the Parzen classifier. The experimental results show that feature extraction based on the GHM multiwavelet transform combined with an MLP classifier, using a softmax activation function in the output layer and a hyperbolic tangent activation function in the hidden layer, offers the best classification performance compared to the IKSVMs, kNN and Parzen classifiers. The experimental results also show that talking on a cellular phone is the most difficult of the four predefined classes to classify, with rates of 83.01% and 84.04% in the holdout and cross-validation experiments, respectively. These results show the effectiveness of the feature extraction approach using the GHM multiwavelet transform and MLP classifier in automatically understanding and characterizing driver behaviors for Human-centric Driver Assistance Systems (HDAS).

17.
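This paper presents a prediction study of 741 experimentally verified Escherichia coli Sigma70 promoter sequences. First, based on the interaction between RNA polymerase and DNA, a position scoring function was used to measure the conserved sites in the sequences. Then, according to the sequence characteristics of promoters, a dispersion index was used to measure the information content of the different bases in the sequences. Finally, multivariate nonlinear discriminant analysis was used to predict E. coli promoters. The results of 10-fold cross-validation show that the overall prediction accuracy exceeds 85%. Comparison with other algorithms shows that the algorithm we developed predicts E. coli promoters better.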

18.
The use of artificial intelligence methods in biological data analysis has increased recently, since the performance of classification and detection systems has improved considerably, helping medical experts in diagnosis. In this paper, we investigate the performance of an artificial immune system (AIS) based fuzzy k-NN algorithm, with and without cross-validation, on a class of imbalanced problems in bioinformatics. Furthermore, we apply an unsupervised AIS algorithm in a supervised manner, with a training stage for data reduction and a classification stage using the fuzzy k-NN algorithm. The experiments show the efficacy of the proposed method with promising results. Using the Escherichia coli and yeast databases, we compare the classification accuracy of the proposed method with those of other methods proposed in the literature. The proposed hybrid system produced much more accurate results than Horton and Nakai's method [P. Horton, K. Nakai, Better prediction of protein cellular localization sites with the k-nearest neighbors classifier, in: Proceedings of Intelligent Systems in Molecular Biology, Halkidiki, Greece, 1997, pp. 368–383]. Besides the improvement in classification accuracy, one of the important aspects of the proposed method is its complexity. As the proposed AIS method incorporates data reduction in the training stage, its training complexity is considerably lower than that of the k-NN classifier.

19.
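The identification of interface residues in protein interactions has wide applications in areas such as drug design and the metabolism of organisms. Based on the naive Bayes classifier's requirement of conditional independence among attributes, a feature model of protein interactions composed of protein sequence profiles and solvent accessible surface area was constructed. On a data set composed of representative protein heterocomplexes, the method achieved an accuracy of 68.1%, a correlation coefficient of 0.201, a specificity of 40.2%, and a sensitivity of 49.9%. These results are better than those of other methods and far better than those of random experiments. A three-dimensional visualization of the results further confirmed the effectiveness of the method.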

20.
In this paper, a classifier motivated by statistical learning theory, i.e., the support vector machine, with a new approach based on a multiclass directed acyclic graph, is proposed for the classification of four types of electrocardiogram signals. The motivation for selecting the Directed Acyclic Graph Support Vector Machine (DAGSVM) is to obtain a more accurate classifier at lower computational cost. Empirical mode decomposition followed by singular value decomposition has been used to compute the feature vector matrix. Further, fivefold cross-validation and particle swarm optimization have been used for optimal selection of the SVM model parameters to improve the performance of the DAGSVM. The proposed algorithm has been compared with two other classifiers, i.e., K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN). The DAGSVM yielded an average accuracy of 98.96%, against 95.83% and 96.66% for the KNN and the ANN, respectively. The results clearly confirm the superiority of the DAGSVM approach over the other classifiers.
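The computational saving of a DAG-based multiclass SVM comes from its decision procedure: each pairwise test eliminates one candidate class, so a k-class prediction needs only k - 1 binary evaluations. The sketch below shows that elimination walk in its generic form. The pairwise decision function is a toy nearest-centroid stand-in for trained binary SVMs, and the class labels and centroids are hypothetical.

```python
def dag_predict(sample, classes, pairwise_decide):
    """Walk a decision DAG: each pairwise test rules out one candidate class."""
    candidates = list(classes)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        winner = pairwise_decide(first, last, sample)
        if winner == first:
            candidates.pop()  # `last` is ruled out
        else:
            candidates.pop(0)  # `first` is ruled out
    return candidates[0]


# Toy stand-in for trained binary SVMs (assumption): one scalar feature and
# one centroid per hypothetical signal type.
centroids = {"type1": 0.0, "type2": 1.0, "type3": 2.0, "type4": 3.0}
calls = []


def nearest_centroid_decide(a, b, sample):
    calls.append((a, b))
    return a if abs(sample - centroids[a]) <= abs(sample - centroids[b]) else b


predicted = dag_predict(1.9, list(centroids), nearest_centroid_decide)
```

For four classes, three pairwise evaluations suffice per prediction, compared with six for one-versus-one voting over all class pairs.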
