1.
3.
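Using a Boosting algorithm with MLR as the base learner, a QSAR study was carried out on the anti-HIV-1 reverse transcriptase inhibitory activity of 79 thiocarbamate derivatives. Seven groups of descriptors calculated with the E-Dragon software served as the independent variables and the compounds' half-maximal effective concentration (EC50) as the dependent variable, giving seven original data sets; variables were selected with the PSO algorithm and MLR models were built. Of the MLR models built from the individual descriptor groups, only the RDF-descriptor model passed both external prediction and internal validation, so the RDF descriptors were used to build the Boosting-MLR prediction model for the anti-HIV-1 reverse transcriptase inhibitory activity of thiocarbamate derivatives. For the Boosting-MLR and MLR models respectively, the coefficient of determination R^2 was 0.728 and 0.741 on the training results and 0.718 and 0.667 on the prediction results, indicating that the Boosting-MLR model generalizes markedly better. Further stability validation showed that the Boosting-MLR model's predictions are highly stable.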
4.
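The semi-empirical quantum-chemical PM3 method was used to calculate the preferred conformations and the physicochemical and electronic-structure parameters of 106 HEPT compounds, and the influence of these parameters on HIV-1 reverse transcriptase inhibitory activity was discussed. Quantitative structure-activity relationship models relating the physicochemical and electronic-structure parameters to HIV-1 reverse transcriptase inhibitory activity were built using exhaustive regression, partial least squares, and artificial neural networks trained with a chaotic genetic algorithm. The results show that (1) a larger logP, a higher lowest unoccupied molecular orbital energy, and a smaller molecular weight all favor higher anti-HIV-1 activity; and (2) the compounds show higher anti-HIV-1 activity when the R1 substituent is of suitable size and the meta position of the C-6 phenyl ring carries a hydrophobic group.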
5.
Decision trees for hierarchical multi-label classification  Cited by: 3 (self-citations: 0, citations by others: 3)
Celine Vens Jan Struyf Leander Schietgat Sašo Džeroski Hendrik Blockeel 《Machine Learning》2008,73(2):185-214
Hierarchical multi-label classification (HMC) is a variant of classification where instances may belong to multiple classes
at the same time and these classes are organized in a hierarchy. This article presents several approaches to the induction
of decision trees for HMC, as well as an empirical study of their use in functional genomics. We compare learning a single
HMC tree (which makes predictions for all classes together) to two approaches that learn a set of regular classification trees
(one for each class). The first approach defines an independent single-label classification task for each class (SC). Obviously,
the hierarchy introduces dependencies between the classes. While they are ignored by the first approach, they are exploited
by the second approach, named hierarchical single-label classification (HSC). Depending on the application at hand, the hierarchy
of classes can be such that each class has at most one parent (tree structure) or such that classes may have multiple parents
(DAG structure). The latter case has not been considered before and we show how the HMC and HSC approaches can be modified
to support this setting. We compare the three approaches on 24 yeast data sets using as classification schemes MIPS’s FunCat
(tree structure) and the Gene Ontology (DAG structure). We show that HMC trees outperform HSC and SC trees along three dimensions:
predictive accuracy, model size, and induction time. We conclude that HMC trees should definitely be considered in HMC tasks
where interpretable models are desired.
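To make the tree and DAG settings mentioned in this abstract concrete, here is a minimal sketch of a class hierarchy in which a class may have several parents. It assumes the usual hierarchy constraint of HMC (an instance that belongs to a class also belongs to all of that class's ancestors). The function name, the `parents` mapping, and the toy class names are hypothetical and are not taken from the paper.

```python
def close_under_ancestors(labels, parents):
    """Return `labels` plus every ancestor reachable through `parents`.

    `parents` maps a class to the list of its parent classes; a class with
    more than one parent makes the hierarchy a DAG rather than a tree.
    """
    closed = set()
    stack = list(labels)
    while stack:
        cls = stack.pop()
        if cls not in closed:
            closed.add(cls)
            # Walk upward; classes without an entry are treated as roots.
            stack.extend(parents.get(cls, []))
    return closed


# Hypothetical toy DAG: "c" has two parents, "a" and "b".
parents = {"a": ["root"], "b": ["root"], "c": ["a", "b"]}
print(sorted(close_under_ancestors({"c"}, parents)))
```

With a tree-shaped hierarchy every list in `parents` has at most one element, and the same function applies unchanged.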
6.
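This study designed a new RNA extraction method that addresses the key problem of RNA being easily degraded and contaminated during extraction. Adding an RNase inhibitor eliminated degradation of RNA by endogenous and exogenous RNase; exploiting the poor solubility of DNA in low-salt solution (140 mmol·L^-1 NaCl) removed DNA contamination from the RNA extract; successive use of phenol and chloroform effectively removed contamination by proteins and phenolic compounds; and the antioxidants PVP and mercaptoethanol eliminated the effect of oxidative discoloration of endogenous phenolic compounds on the reverse transcription of viral RNA. With this method, total RNA of high purity, high yield, and good integrity can be obtained from fruit trees within 4 to 5 h, along with viral RNA of relatively strong reverse-transcription activity, while the cost of RNA extraction is reduced. The method is applicable to total RNA extraction from apple, grape, peach, cherry, and other fruit trees.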
7.
The TRPV1 (Transient Receptor Potential Vanilloid Type 1) receptor, a member of the Transient Receptor Potential Vanilloid subfamily of ion channels, occurs in the peripheral and central nervous system and plays a key role in the transmission of pain. Consequently, it has been a target for the discovery of several pain-relieving agents that have undergone clinical trials. Though several TRPV1 antagonists have progressed to become clinical candidates, many are known to cause temperature elevation in humans, halting their further advancement and signifying the need for new chemotypes. Different chemical classes of TRPV1 antagonists share three important features: an amide or an isostere flanked by an aromatic (or fused aromatic) ring with polar substitutions on one side, and a hydrophobic group on the other. Recent work identified new series of compounds with these and additional features, leading to improvement of properties and development of clinical candidates. Herein, we describe a 3D-QSAR model (n = 62; R^2 = 0.9 and Q^2 = 0.75) developed from the piperazinyl-aryl series of compounds. A novel 5-point pharmacophore model is shown to fit several diverse scaffolds, six clinical candidates, five pre-clinical candidates, and three lead compounds. The pharmacophore model can aid in finding new chemotypes as starting points that can be developed further.
8.
This paper addresses the classification problem for applications with extensive amounts of data and a large number of features. The learning system developed utilizes a hierarchical multiple classifier scheme and is flexible, efficient, highly accurate, and of low cost. The system has several novel features: (1) it uses a graph-theoretic clustering algorithm to group the training data into possibly overlapping clusters, each representing a dense region in the data space; (2) component classifiers trained on these dense regions are specialists whose probabilistic outputs are gated inputs to a super-classifier, and only those classifiers whose training clusters are most related to an unknown data instance send their outputs to the super-classifier; and (3) sub-class labelling is used to improve the classification of super-classes. The learning system achieves the goals of reducing the training cost and increasing the prediction accuracy compared to other multiple classifier algorithms. The system was tested on three large sets of data, two from the medical diagnosis domain and one from a forest cover classification problem. The results are superior to those obtained by several other learning algorithms.
9.
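With the development of information technology, text data is growing explosively, and effectively extracting useful information from large volumes of text is a problem worth studying. For this task, a text classification model based on hierarchical feature extraction is proposed. It considers both the sentence-level and the document-level semantic content of a text, using two neural network models in sequence to model the sentence-level and then the document-level semantics, thereby obtaining comprehensive features of the text on which classification is based. Experimental results show that the method extracts text features more accurately and achieves higher classification accuracy.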
10.
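A sentence group is a processing unit between the sentence and the paragraph. In the language concept space, a sentence group has three elements: domain, situation, and background, of which domain is the most fundamental. Once the domain of a sentence group is obtained, the situation framework can be determined, which is very important for both information extraction and text classification. The concept symbols of some words carry domain information, and a sentence's domain can be obtained by analyzing the semantic roles of words in the sentence along with their positions and frequencies. Based on domain relations, sentences with the same or similar domains can be merged to obtain sentence groups and their domains. Experiments show that the four common domain relations are handled well, but that the handling of dynamic words and the recognition of compound domains still need improvement.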
11.
Comparative Assessment of Software Quality Classification Techniques: An Empirical Case Study  Cited by: 4 (self-citations: 0, citations by others: 4)
Software metrics-based quality classification models predict a software module as either fault-prone (fp) or not fault-prone (nfp). Timely application of such models can assist in directing quality improvement efforts to modules that are likely to be fp during operations, thereby cost-effectively utilizing the software quality testing and enhancement resources. Since several classification techniques are available, a comparative study of some commonly used classification techniques can be useful to practitioners. We present a comprehensive evaluation of the relative performances of seven classification techniques and/or tools. These include logistic regression, case-based reasoning, classification and regression trees (CART), tree-based classification with S-PLUS, and the Sprint-Sliq, C4.5, and Treedisc algorithms. The expected cost of misclassification (ECM) is introduced as a single unified measure to compare the performances of different software quality classification models. A function of the costs of the Type I (an nfp module misclassified as fp) and Type II (an fp module misclassified as nfp) misclassifications, ECM is computed for different cost ratios. Evaluating software quality classification models in the presence of varying cost ratios is important, because the usefulness of a model is dependent on the system-specific costs of misclassifications. Moreover, models should be compared and preferred for cost ratios that fall within the range of interest for the given system and project domain. Software metrics were collected from four successive releases of a large legacy telecommunications system. A two-way ANOVA randomized-complete block design modeling approach is used, in which the system release is treated as a block, while the modeling method is treated as a factor.
It is observed that the predictive performances of the models are significantly different across the system releases, implying that in the software engineering domain prediction models are influenced by the characteristics of the data and the system being modeled. Multiple-pairwise comparisons are performed to evaluate the relative performances of the seven models for the cost ratios of interest to the case study. In addition, the performance of the seven classification techniques is also compared with a classification based on lines of code. The comparative approach presented in this paper can also be applied to other software systems.
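The abstract does not give the ECM formula, so the following is only a sketch under an assumed, commonly used form: the cost-weighted count of Type I and Type II misclassifications divided by the number of modules. The function name, the counts, and the cost ratios are hypothetical and chosen only to illustrate how the measure changes with the cost ratio.

```python
def expected_cost_of_misclassification(n_type1, n_type2, n_total,
                                       cost_type1, cost_type2):
    """Assumed form: ECM = (C_I * n_I + C_II * n_II) / N.

    n_type1: nfp modules misclassified as fp (Type I errors)
    n_type2: fp modules misclassified as nfp (Type II errors)
    n_total: total number of modules classified
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (cost_type1 * n_type1 + cost_type2 * n_type2) / n_total


# Hypothetical release: 1000 modules, 80 Type I and 20 Type II errors.
# The Type I cost is fixed at 1 so the Type II cost equals the cost ratio.
for ratio in (10, 20, 50):
    ecm = expected_cost_of_misclassification(80, 20, 1000, 1.0, float(ratio))
    print(f"C_II/C_I = {ratio}: ECM = {ecm:.2f}")
```

Because a Type II error usually costs far more than a Type I error, a model with few missed fp modules can rank better at high cost ratios even when it raises more false alarms.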
12.
Sanghamitra Bandyopadhyay 《Pattern recognition》2004,37(1):33-45
This article describes a clustering technique that can automatically detect any number of well-separated clusters which may be of any shape, convex and/or non-convex. This is in contrast to most other techniques which assume a value for the number of clusters and/or a particular cluster structure. The proposed technique is based on an iterative partitioning of the relative neighborhood graph, coupled with a post-processing step for merging small clusters. Techniques for improving the efficiency of the proposed scheme are implemented. The clustering scheme is able to detect outliers in data. It is also able to indicate the inherent hierarchical nature of the clusters present in a data set. Moreover, the proposed technique is also able to identify the situation when the data do not have any natural clusters at all. Results demonstrating the effectiveness of the clustering scheme are provided for several data sets.
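For readers unfamiliar with the relative neighborhood graph named in this abstract, a brute-force sketch of its standard definition follows: two points are joined unless some third point is strictly closer to both of them than they are to each other. This builds only the graph. The paper's iterative partitioning and cluster-merging steps are not described in the abstract and are not reproduced here. The function name and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def relative_neighborhood_graph(points):
    """Return the RNG edges as a set of index pairs (i, j) with i < j.

    Brute force, O(n^3): fine for illustration, too slow for large data.
    """
    edges = set()
    n = len(points)
    for i, j in combinations(range(n), 2):
        d_ij = dist(points[i], points[j])
        # The pair is blocked if a third point k is strictly closer to
        # both i and j than i and j are to each other.
        blocked = any(
            max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
            for k in range(n)
            if k != i and k != j
        )
        if not blocked:
            edges.add((i, j))
    return edges


# Three nearby points on a line plus one distant point.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(sorted(relative_neighborhood_graph(pts)))
```

In this toy input the distant point stays attached through a single long edge, which is the kind of edge a partitioning step could plausibly cut, although the criterion the paper actually uses is not stated in the abstract.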
13.
The primary aim of risk-based software quality classification models is to detect, prior to testing or operations, components that are most likely to be of high risk. Their practical usage as quality assurance tools is gauged by the prediction accuracy and cost-effectiveness of the models. Classifying modules into two risk groups is the more commonly practiced trend. Such models assume that all modules predicted as high-risk will be subjected to quality improvements. Due to the always-limited reliability improvement resources and the variability of the quality risk factor, a more focused classification model may be desired to achieve cost-effective software quality assurance goals. In such cases, calibrating a three-group (high-risk, medium-risk, and low-risk) classification model is more rewarding. We present an innovative method that circumvents the complexities, computational overhead, and difficulties involved in calibrating pure or direct three-group classification models. With the application of the proposed method, practitioners can utilize an existing two-group classification algorithm thrice in order to yield the three risk-based classes. An empirical approach is taken to investigate the effectiveness and validity of the proposed technique. Some commonly used classification techniques are studied to demonstrate the proposed methodology. They include the C4.5 decision tree algorithm, discriminant analysis, and case-based reasoning. For the first two, we compare the three-group model calibrated using the respective techniques with the one built by applying the proposed method. Any two-group classification technique can be employed by the proposed method, including those that do not provide a direct three-group classification model, e.g., logistic regression and certain binary classification trees, such as CART. Based on a case study of a large-scale industrial software system, it is observed that the proposed method yielded promising results.
For a given classification technique, the expected cost of misclassification of the proposed three-group models was generally significantly better than that of the technique's direct three-group model. In addition, the proposed method is also evaluated against an alternate indirect three-group classification method.
14.
W.M. Muir G.J.M. Rosa Z. Xu M. Fountain 《Computational statistics & data analysis》2009,53(5):1566-1576
The microarray is an important and powerful tool for prescreening of genes for further research. However, alternative solutions are needed to increase power in small microarray experiments. Use of traditional parametric and even non-parametric tests for such small experiments lacks power and has distributional problems. A mixture model is described that is performed directly on expression differences, assuming that genes in alternative treatments are expressed or not in all combinations: (i) not expressed in either condition, (ii) expressed only under the first condition, (iii) expressed only under the second condition, and (iv) expressed under both conditions, giving rise to 4 possible clusters with two treatments. The approach is termed a Mean-Difference-Mixture-Model (MD-MM) method. Accuracy and power of the MD-MM were compared to other commonly used methods, using simulations, microarray data, and quantitative real-time PCR (qRT-PCR). The MD-MM was found to be generally superior to other methods in most situations. The advantage was greatest in situations where there were few replicates, poor signal-to-noise ratios, or non-homogeneous variances.
15.
16.
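As a typical arid region, the Amu Darya delta has an ecological environment whose complexity and uniqueness are determined by drought stress and secondary salt stress, which makes remote-sensing land cover mapping difficult. In land use/land cover (LULC) remote-sensing image classification, samples that are plentiful, high-quality, and cheap, together with a fast and stable classifier, are the key to achieving high-accuracy classification efficiently. In some remote areas, LULC image classification still faces labeled samples that are spatially sparse and temporally discontinuous or even missing, as well as a high cost of manual collection. To address this, a new efficient method for automatic land cover updating was built by combining optimal tree ensembles with sample transfer. The method uses change detection to label samples on imagery contemporaneous with a historical product, transfers the past land cover labels to target imagery from the same source, and performs automatic land cover classification with an ensemble of optimum trees (OTE). Classification experiments in the Amu Darya delta show that the method can extract valid land cover labels and achieves automatic LULC classification updating with relatively high accuracy.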
17.
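Research on classifying authors' writing styles based on sentence-category features  Cited by: 1 (self-citations: 1, citations by others: 0)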
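Different writers' works have their own characteristics, which show up in vocabulary, sentence patterns, rhetorical devices, and other aspects. This work attempts to classify authors' writing styles using sentence-category features, which can further be used for author identification. Using the vector space model with sentence categories as features, the dimensionality of the sentence-category vector space is reduced by techniques such as decomposition of mixed sentence categories, feature weights are computed with the itc algorithm, classification is done with the KNN algorithm, and an ensemble decision technique is applied, yielding a classifier of authors' writing styles. The classifier's performance in classifying and distinguishing modern and contemporary novels by author writing style is acceptable, with room for further improvement.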