Similar literature
1.
This study examines how widely used term weighting schemes differ by ordering terms according to their discriminative ability, using a recently developed framework that expresses term weights in terms of the ratio and the absolute difference of term occurrence probabilities. The ordering of terms is found to depend on the weighting scheme, which can be explained by the way each scheme uses term occurrence differences when generating term weights. It is then proposed that relevance frequency, which has been shown to provide the best scores on several datasets, can be improved by taking into account the way absolute difference values are used in other widely used schemes. Experimental results on two datasets show that improved F1 scores can be achieved.
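As a point of reference for the relevance frequency mentioned in this abstract, the sketch below assumes the form commonly given in the term weighting literature, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-class and negative-class documents containing the term. The abstract does not state this formula or the proposed improvement, so the function names and the formula itself are assumptions for illustration only.

```python
import math


def relevance_frequency(pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """Assumed form of relevance frequency: log2(2 + a / max(1, c)).

    a = positive-class documents containing the term,
    c = negative-class documents containing the term.
    """
    return math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))


def tf_rf(term_count: int, pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """Weight of a term in one document: raw term frequency times rf."""
    return term_count * relevance_frequency(pos_docs_with_term, neg_docs_with_term)


# A term seen in 6 positive and 3 negative documents, occurring twice in a document.
weight = tf_rf(2, 6, 3)
```

Note that this form depends only on the ratio a / c, which is one way to see why a scheme that also uses the absolute difference could order terms differently.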

2.

In binary text categorization, documents are generally distributed unevenly over the two classes, and resampling approaches have been shown to improve F1 scores. The improvement is mainly due to a gain in recall, while precision may deteriorate. Since precision is the primary concern in some applications, achieving higher F1 scores with a desired trade-off between precision and recall is important. In this study, we present an analytical comparison between the unanimity and majority voting rules. It is shown that the unanimity rule can provide better F1 scores than majority voting when an ensemble of high-recall but low-precision classifiers is considered. Category-based undersampling is then proposed to generate high-recall members. Experiments conducted on three datasets show that superior F1 scores can be achieved compared with the support vector machine (SVM)-based baseline system and with voting over a random undersampling-based ensemble.
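The contrast between the two voting rules can be illustrated with a toy ensemble. The sketch below is not the paper's analysis or its undersampling procedure; the three members and their predictions are made-up data chosen so that every member has perfect recall but low precision.

```python
from typing import Callable, List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Positive if more than half of the members vote positive."""
    return sum(votes) * 2 > len(votes)


def unanimity_vote(votes: Sequence[bool]) -> bool:
    """Positive only if every member votes positive."""
    return all(votes)


def combine(member_preds: List[List[bool]], rule: Callable[[Sequence[bool]], bool]) -> List[bool]:
    """Apply a voting rule document by document across all members."""
    return [rule(votes) for votes in zip(*member_preds)]


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: two positive documents followed by four negative ones.
y_true = [True, True, False, False, False, False]
members = [
    [True, True, True, True, False, False],
    [True, True, True, False, True, False],
    [True, True, False, True, True, False],
]

f1_majority = f1_score(y_true, combine(members, majority_vote))
f1_unanimity = f1_score(y_true, combine(members, unanimity_vote))
```

In this toy case each false positive is shared by only two of the three members, so majority voting keeps all of them while unanimity removes them without losing recall.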


3.
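In text categorization, current research on feature weighting has two shortcomings. First, feature weighting algorithms based on document frequency often ignore the term frequency of a feature. Second, the relationship between features and categories is not expressed accurately or fully. To address these two shortcomings, a new term-frequency-based, category-related feature weighting algorithm (CDF-AICF) is proposed. When measuring the weight of a feature, the algorithm considers the document frequency of the feature at each term frequency. To express the relationship between features and categories accurately, two new concepts are introduced: category-related document frequency (CDF) and average inverse class frequency (AICF), which represent a feature's ability to represent a category and to discriminate between categories, respectively. Finally, classification experiments on three datasets compare the algorithm with five other feature weighting methods. The results show that CDF-AICF achieves better classification performance than the other five methods.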

4.
Harun Uğuz 《Knowledge》2011,24(7):1024-1032
Text categorization is widely used to organize documents in digital form. With the growing number of digital documents, automated text categorization has become more promising over the last ten years. A major problem of text categorization is the large number of features, most of which are irrelevant noise that can mislead the classifier. Feature selection is therefore often used to reduce the dimensionality of the feature space and to improve performance. In this study, two-stage feature selection and feature extraction is used to improve the performance of text categorization. In the first stage, each term in the document is ranked by its importance for classification using the information gain (IG) method. In the second stage, genetic algorithm (GA) feature selection and principal component analysis (PCA) feature extraction are applied separately to the terms ranked in decreasing order of importance, and dimension reduction is carried out. Thus, during text categorization, terms of less importance are ignored and the feature selection and extraction methods are applied only to the terms of highest importance, which reduces the computational time and complexity of categorization. To evaluate the effectiveness of the dimension reduction methods in the proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 dataset collections. The experimental results show that the proposed model achieves high categorization effectiveness as measured by precision, recall and F-measure.
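The first stage described here, ranking terms by information gain, can be sketched as follows. This is a minimal illustration using the standard entropy-based definition of IG for term presence versus absence; the toy documents, labels and function names are assumptions, and the second stage (GA or PCA) is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs: List[Set[str]], labels: List[str], term: str) -> float:
    """IG of splitting the collection by presence or absence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_term, without_term) if part
    )
    return entropy(labels) - conditional


# Hypothetical toy collection: each document is a set of terms.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]

vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

A later stage would then keep only the top of `ranked` before applying a more expensive selection or extraction method.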

5.
This work aimed to improve our understanding of how the speech signal changes, or fails to change, when people are being deceptive. In particular, the study investigated whether speech cues are appropriate for detecting deception. Truthful, deceptive and control speech were elicited from ten speakers in an interview setting. The data were subjected to acoustic analysis, and results are presented on a range of speech parameters including fundamental frequency (f0), overall amplitude and mean vowel formants F1, F2 and F3. No significant correlation could be established between deceptiveness or truthfulness and any of the acoustic features examined. Directions for future work are highlighted.

6.
In modern theories of rewriting structures, hyper-sentential and hyper-algebraic extensions of language families have abstracted the imminent features of iterated parallel substitution. After introducing the concept of a (depth-bounded) translation, we show that for each language L hyper-algebraic over a natural family F there are F-translations Δ, ?D and languages L1, …, Lm hyper-sentential over F such that [equation missing in source], for some p, q ≥ 0. Two specializations of this result are given, when more assumptions are made about F. These are, firstly, a translation theorem and, secondly, an alphabetic homomorphism theorem for hyper-algebraic extensions (an alphabetic homomorphism is a letter-to-letter or letter-to-ε homomorphism).

7.
We study the problem of listing all closed sets of a closure operator σ that is a partial function on the power set of some finite ground set E, i.e., σ: F → F with F ⊆ P(E). A very simple divide-and-conquer algorithm is analyzed that correctly solves this problem if and only if the domain of the closure operator is a strongly accessible set system. Strong accessibility is a strict relaxation of greedoids as well as of independence systems. This algorithm turns out to have delay O(|E|(T_F + T_σ + |E|)) and space O(|E| + S_F + S_σ), where T_F, S_F, T_σ, and S_σ are the time and space complexities of checking membership in F and computing σ, respectively. In contrast, we show that the problem becomes intractable for accessible set systems. We relate our results to the data mining problem of listing all support-closed patterns of a dataset and show that there is a corresponding closure operator for all datasets if and only if the set system satisfies a certain confluence property.

8.
Regional radiometric-geological mapping of the outcropping basement complex in the Gabal Eteiqa area has been carried out through the application of the factor analysis technique. Three factor scores (F1, F2 and F3), which reflect the interrelation of the seven spectrometric variables (TC, eU, eTh, K, eU/eTh, eU/K and eTh/K), are sufficient to outline the different rock units. F1 outlines the highly radioactive rocks such as granodiorites, granites, ring complexes and acidic volcanics. The granodiorite and ring complexes are completely differentiated by F2 scores. The F3 values enable the granitic plutons to be divided into numerous subunits (e.g. G1, G2, G3 and G4). It is believed that the low radiometric level of G4 is due to the Quaternary wadi deposits that overlie the granites, an interpretation confirmed by aerial photomosaics.

9.
The gravimetric water content (GWC, %), a commonly used measure of leaf water content, describes the ratio of water to dry matter for each individual leaf. To date, the relationship between spectral reflectance and GWC in leaves is poorly understood due to the confounding effects of unpredictably varying water and dry matter ratios on spectral response. Few studies have attempted to estimate GWC from leaf reflectance spectra, particularly for a variety of species. This paper investigates the spectroscopic estimation of leaf GWC using continuous wavelet analysis applied to the reflectance spectra (350-2500 nm) of 265 leaf samples from 47 species observed in tropical forests of Panama. A continuous wavelet transform was performed on each of the reflectance spectra to generate a wavelet power scalogram compiled as a function of wavelength and scale. Linear relationships were built between wavelet power and GWC expressed as a function of dry mass (LWC_D) and fresh mass (LWC_F) in order to identify wavelet features (coefficients) that are most sensitive to changes in GWC. The derived wavelet features were then compared to three established spectral indices used to estimate GWC across a wide range of species. Eight wavelet features observed between 1300 and 2500 nm provided strong correlations with LWC_D, though correlations between spectral indices and leaf GWC were poor. In particular, two features captured amplitude variations in the broad shape of the reflectance spectra and three features captured variations in the shape and depth of dry matter (e.g., protein, lignin, cellulose) absorptions centered near 1730 and 2100 nm. The eight wavelet features used to predict LWC_D and LWC_F were not significantly different; however, the predictive models used to determine LWC_D and LWC_F differed. The most accurate estimates of LWC_D and LWC_F obtained from a single wavelet feature showed root mean square errors (RMSEs) of 28.34% (R² = 0.62) and 4.86% (R² = 0.69), respectively. Models using a combination of features resulted in a noticeable improvement in predicting LWC_D and LWC_F, with RMSEs of 26.04% (R² = 0.71) and 4.34% (R² = 0.75), respectively. These results provide new insights into the role of dry matter absorption features in the shortwave infrared (SWIR) spectral region for the accurate spectral estimation of LWC_D and LWC_F. This emerging spectral analytical approach can be applied to other complex datasets including a broad range of species, and may be adapted to estimate basic leaf biochemical elements such as nitrogen, chlorophyll, cellulose, and lignin.

10.
In this article we propose a data treatment strategy to generate new discriminative features, called compound-features (or c-features), for text classification. These c-features are composed of terms that co-occur in documents, without any restriction on the order of or distance between the terms within a document. The strategy precedes the classification task and enhances documents with discriminative c-features. The idea is that, when c-features are used in conjunction with single-features, the ambiguity and noise inherent in the bag-of-words representation are reduced. We use c-features composed of two terms in order to make their usage computationally feasible while improving classifier effectiveness. We test this approach with several classification algorithms and single-label multi-class text collections. Experimental results demonstrate gains in almost all evaluated scenarios, from the simplest algorithms such as kNN (13% gain in micro-average F1 on the 20 Newsgroups collection) to the most complex one, the state-of-the-art SVM (10% gain in macro-average F1 on the OHSUMED collection).
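The basic construction of two-term c-features described in this abstract, unordered pairs of terms that co-occur anywhere in a document, can be sketched as below. The sketch only enumerates candidate pairs; how the discriminative ones are selected is not described in the abstract and is not shown. The function name and the `+` separator are assumptions for illustration.

```python
from itertools import combinations
from typing import Iterable, Set


def add_compound_features(doc_terms: Iterable[str]) -> Set[str]:
    """Return the single terms of a document plus every unordered term pair.

    Order and distance between the terms are ignored, so a pair is built
    from the document's set of distinct terms.
    """
    singles = set(doc_terms)
    pairs = {"+".join(pair) for pair in combinations(sorted(singles), 2)}
    return singles | pairs


features = add_compound_features(["apple", "stock", "price", "apple"])
```

Sorting the terms before pairing gives each pair one canonical name regardless of the order in which the terms appear. A document with v distinct terms yields v(v-1)/2 pairs, which is why restricting c-features to two terms and filtering them matters for feasibility.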

11.
Algorithms for computing Coulomb-Bessel functions are considered, with emphasis on obtaining accurate values when the argument x is inside the classical turning point x_λ. Algorithms of Barnett et al. for the generalized Coulomb functions and their derivatives are discussed in the context of the phase integral formalism. Modified or alternative algorithms are considered that are designed to be valid for all values of argument x and index λ for the functions F_λ(x), G_λ(x). An algorithm for accelerating convergence of a power series by conversion to a continued fraction is presented, and is applied to the evaluation of spherical Bessel functions. An explicit formula for the integrand of the phase integral is presented for spherical Bessel functions. The methods considered need to be augmented by an efficient algorithm for computing the logarithmic derivative of G_0 + iF_0 for Coulomb functions when x is smaller than the charge parameter η.

12.
This paper introduces new methods for measuring the specificity of terms using inside and outside information. The specificity of a term is the quantity of domain-specific information contained in the term. Specific terms carry a larger quantity of domain information than general terms. Specificity is an important necessary condition for building hierarchical relations among terms: if t1 is a hyponym of t2 in a domain term hierarchy, then the specificity of t1 is greater than that of t2. As domain-specific terms are commonly compounds of a generic-level term and some modifiers, inside information is important for representing the meaning of terms. Outside contextual information is also used to compensate for the shortcomings of inside information. We propose an information-theoretic method to measure the quantity of information in terms. Experiments showed promising results, with a precision of 73.9% when applied to terms in the MeSH thesaurus.

13.
Entering information on a computer keyboard is a ubiquitous mode of expression and communication. We investigate whether typing behavior is connected to two factors: the cognitive demands of a given task and the demographic features of the typist. We utilize features based on keystroke dynamics, stylometry, and "language production", which are novel hybrid features that capture the dynamics of a typist's linguistic choices. Our study takes advantage of a large data set (~350 subjects) made up of relatively short samples (~450 characters) of free text. Experiments show that these features can recognize the cognitive demands of the task that an unseen typist is engaged in, and can classify his or her demographics with better than chance accuracy. We correctly distinguish High vs. Low cognitively demanding tasks with accuracy up to 72.39%. Detection of non-native speakers of English is achieved with F1=0.462 over a baseline of 0.166, while detection of female typists reaches F1=0.524 over a baseline of 0.442. Recognition of left-handed typists achieves F1=0.223 over a baseline of 0.100. Further analyses reveal that novel relationships exist between language production, as manifested through typing behavior, and both cognitive and demographic factors.

14.
Massive textual data management and mining usually rely on automatic text classification technology. Term weighting is a basic problem in text classification and directly affects classification accuracy. Since the traditional TF-IDF (term frequency & inverse document frequency) is not fully effective for text classification, researchers have proposed various alternatives. In this paper we make a comparative study of different term weighting schemes and propose a new term weighting scheme, TF-IGM (term frequency & inverse gravity moment), as well as its variants. TF-IGM incorporates a new statistical model to precisely measure the class distinguishing power of a term. In particular, it makes full use of the fine-grained term distribution across different classes of text. The effectiveness of TF-IGM is validated by extensive text classification experiments using SVM (support vector machine) and kNN (k nearest neighbors) classifiers on three commonly used corpora. The experimental results show that TF-IGM outperforms the well-known TF-IDF and the state-of-the-art supervised term weighting schemes. In addition, some new findings that differ from previous studies are obtained and analyzed in depth in the paper.
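The abstract does not give the TF-IGM formula. The sketch below assumes the form in which it is usually cited: a term's class frequencies are sorted in descending order as f_1 ≥ f_2 ≥ … ≥ f_m, igm = f_1 / Σ_r (f_r · r), and the weight is tf · (1 + λ · igm) with an adjustable coefficient λ. The formula, the default λ = 7 and the function names are all assumptions here, so treat this as an illustration of the idea rather than the paper's definition.

```python
from typing import Sequence


def igm(class_freqs: Sequence[float]) -> float:
    """Assumed inverse gravity moment: f_1 / sum(f_r * r) over descending ranks r."""
    ranked = sorted(class_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment else 0.0


def tf_igm(tf: float, class_freqs: Sequence[float], lam: float = 7.0) -> float:
    """Assumed TF-IGM weight: tf * (1 + lam * igm)."""
    return tf * (1.0 + lam * igm(class_freqs))


# A term concentrated in one class versus a term spread evenly over three classes.
concentrated = igm([10, 0, 0])
uniform = igm([5, 5, 5])
```

Under this assumed form, a term that occurs in a single class gets igm = 1, while a term spread evenly over m classes gets 2 / (m(m + 1)), so the measure rewards concentration across classes.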

15.
This work studies three variants of a three-machine flowshop problem with two operations per job to minimize makespan (F3/o = 2/Cmax). A set of n jobs is classified into three mutually exclusive families A, B and C, defined as the sets of jobs scheduled in machine sequence (M1M2), (M1M3) and (M2M3), respectively, where (MxMy) specifies the machine sequence for a job that is processed first on Mx and then on My. That is, jobs with the same route (machine sequence) are classified into the same family. Three variants of F3/o = 2/Cmax are studied. The first is F3/GT, no-idle, o = 2/Cmax, in which both the machine no-idle and the GT restrictions are considered. The GT assumption requires that all jobs in the same family are processed contiguously on a machine, and the machine no-idle assumption requires that all machines work continuously without idle time. The second is F3/GT, o = 2/Cmax, in which the machine no-idle restriction of the first variant is relaxed. The third is F3/no-idle, o = 2/Cmax, in which the GT assumption of the first variant is relaxed. Based on the dominance conditions developed, the optimal solution is derived in polynomial time for each variant. These results may narrow the gap between easy and hard cases of the general problem.

16.
The frequency moments of a sequence containing m_i elements of type i, 1 ≤ i ≤ n, are the numbers F_k = Σ_{i=1}^{n} m_i^k. We consider the space complexity of randomized algorithms that approximate the numbers F_k, when the elements of the sequence are given one by one and cannot be stored. Surprisingly, it turns out that the numbers F_0, F_1, and F_2 can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^{Ω(1)} space. Applications to data bases are mentioned as well.
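To make the definition concrete, the sketch below computes the frequency moments exactly, as the sum over types of m_i raised to the power k. It stores a counter per distinct element, so it is not one of the small-space randomized algorithms the abstract refers to; it only shows the quantity those algorithms approximate. The function name is an assumption.

```python
from collections import Counter
from typing import Hashable, Iterable


def frequency_moment(stream: Iterable[Hashable], k: int) -> int:
    """Exact k-th frequency moment: sum of m_i ** k over the distinct types i."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


stream = "abracadabra"  # counts: a=5, b=2, r=2, c=1, d=1
f0 = frequency_moment(stream, 0)  # number of distinct elements
f1 = frequency_moment(stream, 1)  # length of the sequence
f2 = frequency_moment(stream, 2)  # grows as the distribution becomes more skewed
```

The moment of order 0 counts distinct elements and the moment of order 1 is the sequence length, which is why these two have natural database interpretations.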

17.
This paper proposes a two-stage feedforward neural network (FFNN) based approach for modeling fundamental frequency (F0) values of a sequence of syllables. In this study, (i) linguistic constraints represented by positional, contextual and phonological features, (ii) production constraints represented by articulatory features and (iii) linguistic relevance tilt parameters are proposed for predicting intonation patterns. In the first stage, tilt parameters are predicted using linguistic and production constraints. In the second stage, F0 values of the syllables are predicted using the tilt parameters predicted from the first stage, and basic linguistic and production constraints. The prediction performance of the neural network models is evaluated using objective measures such as average prediction error (μ), standard deviation (σ) and linear correlation coefficient (γ_{X,Y}). The prediction accuracy of the proposed two-stage FFNN model is compared with other statistical models such as Classification and Regression Tree (CART) and Linear Regression (LR) models. The prediction accuracy of the intonation models is also analyzed by conducting listening tests to evaluate the quality of synthesized speech obtained after incorporation of intonation models into the baseline system. From the evaluation, it is observed that prediction accuracy is better for two-stage FFNN models, compared to the other models.

18.
Term weighting is a strategy that assigns weights to terms to improve the performance of sentiment analysis and other text mining tasks. In this paper, we propose a supervised term weighting scheme based on two basic factors, the importance of a term in a document (ITD) and the importance of a term for expressing sentiment (ITS), to improve the performance of sentiment analysis. For ITD, we explore three definitions based on term frequency. Seven statistical functions are then employed to learn the ITS of each term from training documents with category labels. Compared with previous unsupervised term weighting schemes that originated in information retrieval, our scheme can make full use of the available labeling information to assign appropriate weights to terms. We have experimentally evaluated the proposed method against the state-of-the-art method. The experimental results show that our method outperforms that method and produces the best accuracy on two of the three data sets.

19.
A new procedure is proposed to balance type I and II errors in significance testing for differential expression of individual genes. Suppose that a collection, F_k, of k lists of selected genes is available, each of them approximating by their content the true set of differentially expressed genes. For example, such sets can be generated by a subsampling counterpart of the delete-d-jackknife method controlling the per-comparison error rate for each subsample. A final list of candidate genes, denoted by S, is composed in such a way that its contents be closest in some sense to all the sets thus generated. To measure "closeness" of gene lists, we introduce an asymmetric distance between sets with its asymmetry arising from a generally unequal assignment of the relative costs of type I and type II errors committed in the course of gene selection. The optimal set S is defined as a minimizer of the average asymmetric distance from an arbitrary set S to all sets in the collection F_k. The minimization problem can be solved explicitly, leading to a frequency criterion for the inclusion of each gene in the final set. The proposed method is tested by resampling from real microarray gene expression data with artificially introduced shifts in expression levels of pre-defined genes, thereby mimicking their differential expression.
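One way a frequency criterion can arise from such a minimization is sketched below. The abstract does not define the asymmetric distance, so the sketch assumes a weighted set difference, d(S, F) = c1 · |S \ F| + c2 · |F \ S|, where c1 and c2 are hypothetical costs for a gene wrongly included and wrongly excluded. Under that assumption the average distance separates gene by gene: including a gene that appears in a fraction p of the lists costs c1 · (1 − p), excluding it costs c2 · p, so the gene is included when p > c1 / (c1 + c2).

```python
from collections import Counter
from typing import Hashable, Iterable, List, Set


def consensus_gene_list(
    gene_lists: List[Iterable[Hashable]],
    cost_type1: float = 1.0,
    cost_type2: float = 1.0,
) -> Set[Hashable]:
    """Include a gene when its selection frequency exceeds c1 / (c1 + c2).

    Assumes the distance d(S, F) = c1 * |S - F| + c2 * |F - S|; this form is
    an illustration and is not taken from the abstract.
    """
    k = len(gene_lists)
    counts = Counter(g for genes in gene_lists for g in set(genes))
    threshold = cost_type1 / (cost_type1 + cost_type2)
    return {g for g, c in counts.items() if c / k > threshold}


# Hypothetical lists of selected genes from three subsamples.
lists = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2"}]
balanced = consensus_gene_list(lists)
strict = consensus_gene_list(lists, cost_type1=3.0, cost_type2=1.0)
lenient = consensus_gene_list(lists, cost_type1=1.0, cost_type2=3.0)
```

With equal costs this reduces to a majority rule over the lists; raising the cost of false inclusions pushes the threshold toward unanimity.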

20.
In this study, we investigated how the anionic species of ionic liquids (ILs), namely perfluoroalkyltrifluoroborate anions ([CnF2n+1BF3], n = 0, 1, 2) and bis(perfluoroalkylsulfonyl)imide anions ([(CmF2m+1SO2)(CnF2n+1SO2)N], m, n = 0, 1, 2), affect electrochemical and electromechanical properties. 1-Ethyl-3-methylimidazolium (EMI+) was selected as the cation for the ILs. 1-Ethyl-3-methylimidazolium trifluoromethyltrifluoroborate (EMI[CF3BF3]), 1-ethyl-3-methylimidazolium pentafluoroethyltrifluoroborate (EMI[CF3CF2BF3]), 1-ethyl-3-methylimidazolium fluorosulfonyl(trifluoromethylsulfonyl)imide (EMI[FTA]) and 1-ethyl-3-methylimidazolium pentafluoroethylsulfonyl(trifluoromethylsulfonyl)imide (EMI[C1C2]) were synthesized according to the literature. The strains generated by the bucky-gel electrodes of the actuators containing EMI[CF3BF3] (in the high frequency range of 10-0.5 Hz) and EMI[CF3CF2BF3] (in the high frequency range of 1-0.5 Hz) are larger than that of the actuator containing EMI[BF4], that is, they respond quickly. At low frequencies (0.1-0.005 Hz), the strain generated with EMI[CF3CF2BF3] was larger than with the other ILs (EMI[CnF2n+1BF3] (n = 0, 1) and EMI[(CmF2m+1SO2)(CnF2n+1SO2)N] (m, n = 0, 1, 2)). The Young's moduli of the actuators containing EMI[CF3BF3] and EMI[CF3CF2BF3] were 145 and 110 MPa, respectively. The melting points of EMI[CF3BF3] and EMI[CF3CF2BF3] are lower than that of EMI[BF4]. Therefore, the trifluoromethyltrifluoroborate ([CF3BF3]) and pentafluoroethyltrifluoroborate ([CF3CF2BF3]) anions performed much better in actuators using the polymer-supported bucky-gel electrode containing the IL. These results suggest that the actuators perform well enough for practical applications (e.g. tactile displays).
