Similar Documents
20 similar documents found (search time: 878 ms)
1.
Open ontology learning is the process of extracting a domain ontology from a knowledge source in an unsupervised way. Due to its unsupervised nature, it requires filtering mechanisms to rate the importance and correctness of the extracted knowledge. This paper presents OntoCmaps, a domain-independent and open ontology learning tool that extracts deep semantic representations from corpora. OntoCmaps generates rich conceptual representations in the form of concept maps and proposes an innovative filtering mechanism based on metrics from graph theory. Our results show that metrics such as betweenness, PageRank, HITS and degree centrality outperform standard text-based metrics (TF-IDF, term frequency) for concept identification. We propose voting schemes based on these metrics that perform well in relationship identification, again yielding better results (in terms of precision and F-measure) than traditional metrics such as co-occurrence frequency. The approach is evaluated against a gold standard and compared to the ontology learning tool Text2Onto. The ontology generated by OntoCmaps is more expressive than the Text2Onto ontology, especially in its conceptual relationships, and leads to better results in terms of precision, recall and F-measure.
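The graph-metric idea in this abstract can be sketched with a toy term graph: candidate concepts are ranked by degree centrality and PageRank rather than TF-IDF. This is a minimal, pure-Python illustration with an invented graph, not the OntoCmaps implementation.

```python
# Illustrative sketch: rank candidate concepts in a term co-occurrence graph
# by graph metrics (degree centrality, PageRank). The graph is invented.

def degree_centrality(graph):
    # graph: dict mapping node -> set of neighbour nodes (undirected)
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def pagerank(graph, damping=0.85, iters=50):
    # Power iteration; assumes every node has at least one neighbour.
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        new = {}
        for v in graph:
            incoming = sum(rank[u] / len(graph[u]) for u in graph if v in graph[u])
            new[v] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

terms = {
    "ontology": {"learning", "concept", "corpus"},
    "learning": {"ontology", "concept"},
    "concept":  {"ontology", "learning", "map"},
    "corpus":   {"ontology"},
    "map":      {"concept"},
}
top = max(degree_centrality(terms), key=degree_centrality(terms).get)
print(top)  # prints "ontology" (tied with "concept" at the top of the ranking)
```

In the paper these rankings feed a filtering step; here they simply order the candidate terms.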

2.
Accurate and fast approaches to automatic ECG data classification are vital for the clinical diagnosis of heart disease. To this end, we propose a novel multistage algorithm that combines various procedures for dimensionality reduction, consensus clustering of randomized samples and fast supervised classification for processing large, high-dimensional ECG datasets. We carried out extensive experiments to study the effectiveness of the proposed multistage clustering and classification scheme using the precision, recall and F-measure metrics. We evaluated the performance of numerous combinations of dimensionality reduction methods, consensus functions and classification algorithms incorporated in our multistage scheme. The experiments demonstrate that the highest precision, recall and F-measure are achieved by combining the rank correlation coefficient for dimensionality reduction, the HBGF consensus function and the SMO classifier with a polynomial kernel.

3.
Validation of overlapping clustering: A random clustering perspective
As a widely used clustering validation measure, the F-measure has received increased attention in the field of information retrieval. In this paper, we reveal that the F-measure can give biased assessments of overlapping clustering results when it is used to validate data with different numbers of clusters (the incremental effect) or different prior probabilities of relevant documents (the prior-probability effect). We propose a new “IMplication Intensity” (IMI) measure, which is based on the F-measure and developed from a random clustering perspective. In addition, we carefully investigate the properties of IMI. Finally, experimental results on real-world data sets show that IMI significantly alleviates the biased incremental and prior-probability effects that are inherent to the F-measure.
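For reference, the F-measure this abstract critiques is the harmonic-style combination of precision and recall over a relevant set and a retrieved cluster. A minimal sketch with invented sets:

```python
# Minimal sketch of the set-based F-measure; the item sets are illustrative.

def f_measure(relevant, retrieved, beta=1.0):
    tp = len(relevant & retrieved)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

relevant = {1, 2, 3, 4}
retrieved = {2, 3, 4, 5, 6}
print(round(f_measure(relevant, retrieved), 3))  # 0.667
```

The paper's point is that this score shifts systematically as cluster counts or prior probabilities change, which motivates the IMI correction.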

4.
In this paper, the task of finding an appropriate classifier ensemble for named entity recognition is posed as a multiobjective optimization (MOO) problem. Our underlying assumption is that, instead of searching for the best-fitting feature set for a particular classifier, an ensemble of several classifiers trained on different feature representations can be more fruitful, but it is crucial to determine the subset of classifiers most suitable for the ensemble. We use three heterogeneous classifiers, namely maximum entropy, conditional random field, and support vector machine, to build a number of models based on various representations of the available features. The proposed MOO-based ensemble technique is evaluated for three resource-constrained languages, namely Bengali, Hindi, and Telugu. Evaluation yields recall, precision, and F-measure values of 92.21, 92.72, and 92.46%, respectively, for Bengali; 97.07, 89.63, and 93.20%, respectively, for Hindi; and 80.79, 93.18, and 86.54%, respectively, for Telugu. We also evaluate the proposed technique on the CoNLL-2003 shared task English data sets, which yields recall, precision, and F-measure values of 89.72, 89.84, and 89.78%, respectively. Experimental results show that the classifier ensemble identified by our MOO-based approach outperforms all the individual classifiers, two conventional baseline ensembles, and the classifier ensemble identified by a single-objective-based approach. In part of the paper, we also formulate feature selection for any classifier under the MOO framework and show that our proposed classifier ensemble attains superior performance to it.

5.
In this paper, we propose a simulated annealing (SA) based multiobjective optimization (MOO) approach to classifier ensembling. Several different versions of the objective functions are exploited. We hypothesize that the reliability of prediction of each classifier differs among the various output classes; thus, in an ensemble system, it is necessary to find the appropriate weight of vote for each output class in each classifier. Diverse classification methods, such as Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM), are used to build different models based on the various representations of the available features. One of the most important characteristics of our system is that the features are selected and developed mostly without using any deep domain knowledge and/or language-dependent resources. The proposed technique is evaluated for Named Entity Recognition (NER) in three resource-poor Indian languages, namely Bengali, Hindi and Telugu. Evaluation yields recall, precision and F-measure values of 93.95%, 95.15% and 94.55%, respectively, for Bengali; 93.35%, 92.25% and 92.80%, respectively, for Hindi; and 84.02%, 96.56% and 89.85%, respectively, for Telugu. Experiments also suggest that the classifier ensemble identified by the proposed MOO-based approach, optimizing the F-measure of named entity (NE) boundary detection, outperforms all the individual models, two conventional baseline models and three other MOO-based ensembles.
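The class-dependent voting idea above can be sketched as follows: each classifier carries a separate vote weight per output class, and the ensemble sums the weights of the classifiers voting for each label. The classifier names, labels, and weights below are invented; in the paper the weights are found by SA-based multiobjective search.

```python
# Hedged sketch of per-class weighted voting; all values are illustrative.

def weighted_vote(predictions, class_weights):
    # predictions: {classifier: predicted_label}
    # class_weights: {classifier: {label: vote weight for that label}}
    scores = {}
    for clf, label in predictions.items():
        scores[label] = scores.get(label, 0.0) + class_weights[clf].get(label, 0.0)
    return max(scores, key=scores.get)

preds = {"ME": "PER", "CRF": "LOC", "SVM": "PER"}
weights = {
    "ME":  {"PER": 0.4, "LOC": 0.9},
    "CRF": {"PER": 0.5, "LOC": 0.7},
    "SVM": {"PER": 0.8, "LOC": 0.3},
}
print(weighted_vote(preds, weights))  # "PER": 0.4 + 0.8 = 1.2 beats "LOC": 0.7
```

Note how the outcome depends on the per-class weights, not on a simple majority: two weak PER votes can still lose to one strongly weighted LOC vote.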

6.
As many structures of protein–DNA complexes have become known in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One reason is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein–DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other uses both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. On an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3% and an MCC of 0.324. The SVM model that uses both DNA and protein sequences achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. On an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5% and an MCC of 0.329. In both cross-validation and independent testing, the second SVM model, which uses both DNA and protein sequence data, showed better performance than the first model, which uses DNA sequence data only. To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone.
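The abstract reports Matthews correlation coefficients alongside accuracy and F-measure; MCC is computed from the four binary confusion counts. A minimal sketch with made-up counts (not the paper's numbers):

```python
# Illustrative MCC computation from binary confusion counts.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

print(round(mcc(tp=60, tn=50, fp=25, fn=20), 3))  # ≈ 0.418
```

Unlike accuracy, MCC stays near zero for a classifier that ignores a skewed class balance, which is why it is often reported for binding-site prediction where positives are rare.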

7.
This paper discusses semantic processing using the Hidden Vector State (HVS) model. The HVS model extends the basic discrete Markov model by encoding the context of each state as a vector. State transitions are then factored into a stack shift operation, similar to those of a push-down automaton, followed by a push of a new preterminal semantic category label. The key feature of the model is that it can capture hierarchical structure without the use of treebank data for training. Experiments have been conducted in the travel domain using the relatively simple ATIS corpus and the more complex DARPA Communicator Task. The results show that the HVS model can be robustly trained from only minimally annotated corpus data. Furthermore, when measured by its ability to extract attribute-value pairs from natural language queries in the travel domain, the HVS model outperforms a conventional finite-state semantic tagger by 4.1% in F-measure for ATIS and by 6.6% in F-measure for Communicator, suggesting that the benefit of the HVS model's ability to encode context increases as the task becomes more complex.

8.
Humans are exceptionally good at distinguishing salient objects from the background, and researchers have yet to develop a model that matches both their detection accuracy and their speed. In this paper we attempt to improve detection accuracy without incurring much additional computation time. The model exploits the fact that the maximal amount of information in an image is present at the corners and edges of objects. First, keypoints are extracted from the image using multi-scale Harris and multi-scale Gabor functions. The image is then roughly segmented into two regions, a salient region and a background region, by constructing a convex hull over these keypoints. Finally, the pixels of the two regions are treated as samples drawn from a multivariate kernel function whose parameters are estimated using the expectation-maximization algorithm, yielding a saliency map. The performance of the proposed model is evaluated in terms of precision, recall, F-measure, area under curve and computation time on six publicly available image datasets. Experimental results demonstrate that the proposed model outperforms existing state-of-the-art methods in terms of recall, F-measure and area under curve on all six datasets, and in precision on four datasets. The proposed method also takes comparatively less computation time than many existing methods.
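The coarse segmentation step described above can be sketched in isolation: build a convex hull over detected keypoints and split pixels into a candidate salient region (inside the hull) and background (outside). The keypoints below are hand-picked stand-ins for Harris/Gabor detections, and the hull code is a generic monotone-chain implementation, not the paper's.

```python
# Illustrative convex-hull segmentation over invented keypoints.

def cross(o, a, b):
    # z-component of (a - o) x (b - o)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in counter-clockwise order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, p):
    # p lies inside (or on) a CCW hull iff it is left of (or on) every edge.
    return all(cross(hull[i], hull[(i + 1) % len(hull)], p) >= 0
               for i in range(len(hull)))

keypoints = [(1, 1), (6, 1), (6, 5), (1, 5), (3, 3)]
hull = convex_hull(keypoints)
print(inside_hull(hull, (3, 3)), inside_hull(hull, (0, 0)))  # True False
```

In the model, the inside/outside split only seeds the kernel-density estimation; the final saliency map comes from the EM step.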

9.
In this paper, we propose a two-stage multiobjective-simulated annealing (MOSA)-based technique for named entity recognition (NER). At first, MOSA is used for feature selection under two statistical classifiers, viz. conditional random field (CRF) and support vector machine (SVM). Each solution on the final Pareto optimal front provides a different classifier. These classifiers are then combined together by using a new classifier ensemble technique based on MOSA. Several different versions of the objective functions are exploited. We hypothesize that the reliability of prediction of each classifier differs among the various output classes. Thus, in an ensemble system, it is necessary to find out the appropriate weight of vote for each output class in each classifier. We propose a MOSA-based technique to determine the weights for votes automatically. The proposed two-stage technique is evaluated for NER in Bengali, a resource-poor language, as well as for English. Evaluation results yield the highest recall, precision and F-measure values of 93.95, 95.15 and 94.55 %, respectively for Bengali and 89.01, 89.35 and 89.18 %, respectively for English. Experiments also suggest that the classifier ensemble identified by the proposed MOO-based approach optimizing the F-measure values of named entity (NE) boundary detection outperforms all the individual classifiers and four conventional baseline models.

10.
In this paper, a corpus-based thesaurus and WordNet were used to improve text categorization performance. We employed the k-NN and back-propagation neural network (BPNN) algorithms as classifiers. k-NN is a simple and well-known approach to categorization, and BPNNs have been widely used in categorization and pattern recognition. However, the standard BPNN has some generally acknowledged limitations, such as slow training and a tendency to become trapped in local minima. To alleviate these problems, two modified versions, Morbidity neurons Rectified BPNN (MRBP) and Learning Phase Evaluation BPNN (LPEBP), were considered and applied to text categorization. We conducted experiments on both the standard Reuters-21578 data set and the 20 Newsgroups data set. Experimental results showed that our proposed methods achieve high categorization effectiveness as measured by precision, recall and F-measure.

11.
Context: Quality assurance effort, especially testing effort, is frequently a major cost factor during software development. Consequently, one major goal is often to reduce testing effort. One promising way to improve the effectiveness and efficiency of software quality assurance is the use of data from early defect detection activities to provide a software testing focus. Studies indicate that using a combination of early defect data and other product data to focus testing activities outperforms the use of other product data only. One of the key challenges is that the use of data from early defect detection activities (such as inspections) to focus testing requires a thorough understanding of the relationships between these early defect detection activities and testing. An aggravating factor is that these relationships are highly context-specific and need to be evaluated for concrete environments. Objective: The underlying goal of this paper is to help companies get a better understanding of these relationships for their own environment, and to provide them with a methodology for finding such relationships. Method: This article compares three different strategies for evaluating assumed relationships between inspections and testing: a confidence counter, different quality classes, and the F-measure including precision and recall. Results: One result of this case-study-based comparison is that evaluations based on aggregated F-measures are more suitable for industry environments than evaluations based on a confidence counter; moreover, they provide more detailed insights into the validity of the relationships. Conclusion: We have confirmed that inspection results are suitable data for controlling testing activities. Evaluated knowledge about relationships between inspections and testing can be used in the integrated inspection and testing approach In2Test to focus testing activities, and product data can be used in addition. However, the assumptions have to be evaluated in each new context.

12.
We describe a mechanism called SpaceGlue for adaptively locating services based on the preferences and locations of users in a distributed and dynamic network environment. In SpaceGlue, services are bound to physical locations, and a mobile user accesses local services depending on the space he or she is currently visiting. SpaceGlue dynamically identifies the relationships between different spaces and links or “glues” spaces together depending on how previous users moved among them and used their services. Once spaces have been glued, users receive recommendations of remote services (i.e., services provided in a remote space) reflecting the preferences of the crowd of users visiting the area. The strengths of bonds are implicitly evaluated by users and adjusted by the system on the basis of their evaluation. SpaceGlue is an alternative to existing schemes such as data mining and recommender systems, and it is suitable for distributed and dynamic environments. The bonding algorithm for SpaceGlue incrementally computes the relationships or “bonds” between different spaces in a distributed way. We implemented SpaceGlue on the distributed network application platform Ja-Net and evaluated it by simulation to show that it adaptively locates services reflecting trends in user preferences. Using mutual information (MI) and the F-measure to indicate the level of such trends and the accuracy of service recommendation, the simulation results showed that (1) in SpaceGlue, the F-measure increases with the level of MI (i.e., the more significant the trends, the greater the F-measure); (2) SpaceGlue achieves better precision and F-measure than the “flooding” case (i.e., all service information is broadcast to everybody) and the “no glue” case by narrowing down, based on bonds, the appropriate partners to which recommendations are sent; and (3) SpaceGlue achieves a better F-measure with large numbers of spaces and users than the other cases (i.e., “flooding” and “no glue”).
Tomoko Itao is an alumna of NTT Network Innovation Laboratories.

13.
Applied Soft Computing, 2008, 8(2): 839-848
To deal with adjacent input fuzzy sets having overlapping information, non-additive fuzzy rules are formulated by defining their consequent as the product of a weighted input and a fuzzy measure. With the weighted input, a need arises for a corresponding fuzzy measure; this new concept facilitates the evolution of new fuzzy modeling. The fuzzy measures aggregate the information from the weighted inputs using the λ-measure, and the output of these rules takes the form of the Choquet fuzzy integral. The underlying non-additive fuzzy model is investigated for the identification of non-linear systems. The weighted input, which is the additive S-norm of the inputs and their membership functions, provides the strength of the rules, and the fuzzy densities required to compute fuzzy measures subject to the q-measure are the unknown functions to be estimated. The q-measure is a powerful way of simplifying the computation of the λ-measure that takes account of the interaction between the weighted inputs. Two applications, a real-life application to signature verification and forgery detection and a benchmark problem from a chemical plant, illustrate the utility of the proposed approach. The results are compared with those existing in the literature.

14.
Software developers, testers and customers routinely submit issue reports to software issue trackers to record the problems they face in using software. The issues are then directed to appropriate experts for analysis and fixing. However, submitters often misclassify an improvement request as a bug and vice versa, which costs valuable developer time. Hence, automated classification of the submitted reports would be of great practical utility. In this paper, we analyze how machine learning techniques may be used to perform this task. We apply different classification algorithms, namely naive Bayes, linear discriminant analysis, k-nearest neighbors, support vector machine (SVM) with various kernels, decision tree and random forest, separately to classify the reports from three open-source projects. We evaluate their performance in terms of F-measure, average accuracy and weighted average F-measure. Our experiments show that random forests perform best, while SVMs with certain kernels also achieve high performance.

15.
Document classification and summarization are very important for document text retrieval. Generally, humans can recognize fields such as “Sports” or “Politics” based on specific words, called Field Association (FA) words, in those document fields. The traditional method causes misleading redundant words (unnecessary words) to be registered, because the quality of the resulting FA words depends on learning data pre-classified by hand; recall and precision of document classification are therefore degraded if the hand-classified fields are ambiguous. We propose two criteria: deleting unnecessary words with low frequencies, and deleting unnecessary words using category information. Using the proposed criteria, unnecessary words can be deleted from the FA-word dictionary created by the traditional method. Experimental results showed that 25% of 38,372 FA-word candidates were identified as unnecessary and deleted automatically when the presented method was used. Furthermore, precision and F-measure were improved by 26% and 15%, respectively, compared with the traditional method.
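The first proposed criterion, dropping FA-word candidates whose corpus frequency falls below a threshold, can be sketched in a few lines. The candidate words, their counts, and the threshold are invented examples, not the paper's data.

```python
# Illustrative sketch of low-frequency pruning of FA-word candidates.

def prune_low_frequency(candidates, min_freq=3):
    # candidates: {word: frequency in the learning corpus}
    return {w: f for w, f in candidates.items() if f >= min_freq}

candidates = {"goalkeeper": 17, "election": 12, "xylophone": 1, "moreover": 2}
print(sorted(prune_low_frequency(candidates)))  # ['election', 'goalkeeper']
```

The paper's second criterion, filtering by category information, would further drop words that fail to discriminate between fields even when frequent.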

16.
Discovering the interactions between the persons mentioned in a set of topic documents can help readers construct the background of the topic and facilitate document comprehension. To discover person interactions, we need a detection method that can identify text segments containing information about the interactions. Information extraction algorithms then analyze the segments to extract interaction tuples and construct a network of person interaction. In this article, we define interaction detection as a classification problem. The proposed interaction detection method, called feature‐based interactive segment recognizer (FISER), exploits 19 features covering syntactic, context‐dependent, and semantic information in text to detect intra‐clausal and inter‐clausal interactive segments in topic documents. Empirical evaluations demonstrate that FISER outperformed many well‐known relation extraction and protein–protein interaction detection methods on identifying interactive segments in topic documents. In addition, the precision, recall, and F1‐score of the best feature combination are 72.9%, 55.8%, and 63.2%, respectively.

17.
A software birthmark refers to inherent characteristics of a program that can be used to identify it. In this paper, a method for detecting the theft of Java programs through a static software birthmark based on control flow information is proposed. The control flow information reflects the structural characteristics and the possible behaviors of a program during execution. Flow paths (FPs) and behaviors in Java programs are formally described here, and a set of behaviors of FPs is used as the software birthmark. Similarity is calculated by matching pairs of similar behaviors from two birthmarks. Experiments evaluated the proposed birthmark with respect to precision and recall, and performance was assessed by analyzing F-measure curves. The experimental results show that the proposed birthmark is a more effective measure than earlier approaches for detecting copied programs, even when such programs are aggressively modified.
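The comparison step can be sketched by treating each program's birthmark as a set of flow-path behaviours and scoring the overlap; plain Jaccard similarity is used here as a stand-in, since the paper's matching of similar behaviour pairs is richer. The behaviour strings are invented.

```python
# Hedged sketch: set-overlap similarity between two birthmarks,
# using Jaccard similarity as a simplified stand-in for behaviour matching.

def birthmark_similarity(bm_a, bm_b):
    if not bm_a and not bm_b:
        return 1.0
    return len(bm_a & bm_b) / len(bm_a | bm_b)

original = {"load;invoke;store", "branch;invoke", "load;return"}
suspect  = {"load;invoke;store", "branch;invoke", "new;throw"}
print(round(birthmark_similarity(original, suspect), 2))  # 0.5
```

A theft detector would flag pairs whose similarity exceeds a threshold even after code transformations alter some behaviours.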

18.
We present a comparative study of the most popular machine learning methods applied to the challenging problem of customer churn prediction in the telecommunications industry. In the first phase of our experiments, all models were applied and evaluated using cross-validation on a popular, public-domain dataset. In the second phase, the performance improvement offered by boosting was studied. To determine the most efficient parameter combinations, we performed a series of Monte Carlo simulations for each method over a wide range of parameters. Our results demonstrate the clear superiority of the boosted versions of the models over the plain (non-boosted) versions. The best overall classifier was SVM-POLY with AdaBoost, with an accuracy of almost 97% and an F-measure over 84%.

19.
Automatic Labeling of Chinese FrameNet Semantic Roles
Based on the Chinese FrameNet (CFN) semantic knowledge base developed at Shanxi University, semantic role labeling is converted into a word-sequence labeling problem via the IOB strategy, and a conditional random field model is used to study the automatic labeling of Chinese frame semantic roles. The model takes the word as the basic labeling unit, with features including the word, its part of speech, its position relative to the target word, the target word itself, and combinations of these. Several candidate windows are set for each feature, and their combinations form the model's feature templates; a near-optimal template selection method based on orthogonal arrays from statistics is given. All experiments were conducted on a corpus of 6,692 example sentences from 25 selected frames. For each frame, a model was trained on its example sentences, performing boundary identification and classification of semantic roles simultaneously, with 2-fold cross-validation. Given the target word in a sentence and the frame it belongs to, cross-validation over the 25 frames achieved precision, recall, and F1 values of 74.16%, 52.70%, and 61.62%, respectively.
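The IOB reduction used above turns each labeled role span into B-role / I-role tags over the word sequence, with O for everything else; the CRF then predicts one tag per word. A minimal sketch of the encoding with an invented English sentence and role spans:

```python
# Illustrative IOB encoding of semantic role spans over a token sequence.

def to_iob(tokens, spans):
    # spans: list of (start, end, role) with end exclusive, non-overlapping
    tags = ["O"] * len(tokens)
    for start, end, role in spans:
        tags[start] = "B-" + role
        for i in range(start + 1, end):
            tags[i] = "I-" + role
    return tags

tokens = ["The", "committee", "approved", "the", "new", "budget"]
spans = [(0, 2, "Agent"), (3, 6, "Theme")]
print(to_iob(tokens, spans))
# ['B-Agent', 'I-Agent', 'O', 'B-Theme', 'I-Theme', 'I-Theme']
```

Decoding reverses the mapping: contiguous B-/I- runs of the same role are read back as role spans, which is how boundary identification and classification happen in one pass.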

20.
This work proposes an extension of Bing Liu’s aspect-based opinion mining approach in order to apply it to the tourism domain. The extension addresses the fact that users refer differently to different kinds of products when writing reviews on the Web. Since Liu’s approach focuses on physical product reviews, it cannot be directly applied to the tourism domain, which presents features not considered by the model. Through a detailed study of online tourism product reviews, we identified these features and modeled them in our extension, proposing new and more complex NLP-based rules for the tasks of subjectivity and sentiment classification at the aspect level. We also address the tasks of opinion visualization and summarization, and propose new methods to help users digest the vast availability of opinions in an easy manner. Our work also includes a generic architecture for an aspect-based opinion mining tool, which we used to create a prototype and analyze opinions from TripAdvisor in the context of the tourism industry in Los Lagos, a Chilean administrative region also known as the Lake District. Results show that our extension performs better than Liu’s model in the tourism domain, improving both accuracy and recall for the tasks of subjectivity and sentiment classification. In particular, the approach is very effective in determining the sentiment orientation of opinions, achieving an F-measure of 92% for the task. However, on average, the algorithms were only capable of extracting 35% of the explicit aspect expressions using a non-extended approach for this task. Finally, results also showed the effectiveness of our design when applied to the industry’s specific issues in the Lake District: almost 80% of the users who tried our tool considered that it adds valuable information to their business.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号