
1.

Using the absolute difference of term occurrence probabilities in binary text categorization

Hakan Altınçay Zafer Erenel 《Applied Intelligence》2012,36(1):148-160

In this study, the differences among widely used weighting schemes are studied by means of ordering terms according to their discriminative abilities using a recently developed framework which expresses term weights in terms of the ratio and absolute difference of term occurrence probabilities. Having observed that the ordering of terms is dependent on the weighting scheme under consideration, it is emphasized that this can be explained by the way different schemes use term occurrence differences in generating term weights. Then, it is proposed that the relevance frequency, which is shown to provide the best scores on several datasets, can be improved by taking into account the way absolute difference values are used in other widely used schemes. Experimental results on two different datasets have shown that improved F₁ scores can be achieved.

2.

Improving the precision-recall trade-off in undersampling-based binary text categorization using unanimity rule

Zafer Erenel Hakan Altınçay 《Neural computing & applications》2012,22(1):83-100

The distribution of documents over the two classes in a binary text categorization problem is generally uneven, and resampling approaches have been shown to improve F₁ scores in this setting. The improvement achieved is mainly due to the gain in recall, while precision may deteriorate. Since precision is the primary concern in some applications, achieving higher F₁ scores with a desired level of trade-off between precision and recall is important. In this study, we present an analytical comparison between unanimity and majority voting rules. It is shown that the unanimity rule can provide better F₁ scores compared to majority voting when an ensemble of high recall but low precision classifiers is considered. Then, category-based undersampling is proposed to generate high recall members. The experiments conducted on three datasets have shown that superior F₁ scores can be realized compared to the support vector machine (SVM)-based baseline system and voting over a random undersampling-based ensemble.
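To make the contrast between the two voting rules in the abstract above concrete, here is a minimal sketch. The ensemble votes and labels are invented toy data, chosen so that every member has high recall but low precision; the function names are illustrative and are not taken from the paper.

```python
from typing import List


def majority_vote(votes: List[int]) -> int:
    """Return 1 (positive) if more than half of the members vote positive."""
    return int(sum(votes) * 2 > len(votes))


def unanimity_vote(votes: List[int]) -> int:
    """Return 1 (positive) only if every member votes positive."""
    return int(all(v == 1 for v in votes))


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one row of member votes per document.
y_true = [1, 1, 0, 0, 0, 0]
ensemble_votes = [
    [1, 1, 1],  # positive document, found by all members
    [1, 1, 1],  # positive document, found by all members
    [1, 1, 0],  # negative document, two false alarms
    [1, 0, 1],  # negative document, two false alarms
    [0, 1, 1],  # negative document, two false alarms
    [0, 0, 0],  # negative document, no false alarms
]

f1_majority = f1_score(y_true, [majority_vote(v) for v in ensemble_votes])
f1_unanimity = f1_score(y_true, [unanimity_vote(v) for v in ensemble_votes])
```

On this toy data majority voting keeps the false alarms shared by any two members, while the unanimity rule discards them without losing a positive document. Real ensembles will not separate this cleanly.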


3.

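A category-dependent feature weighting algorithm based on term frequency

Zhang Ling (张羚) 《计算机应用研究》 (Application Research of Computers) 2017,34(2)

In text classification, current research on feature weighting has two shortcomings. First, in feature weighting algorithms based on document frequency, the document frequency often ignores the term frequency information of a feature. Second, the relationship between features and categories is not expressed accurately or fully. To address these two shortcomings, a new category-dependent feature weighting algorithm based on term frequency (CDF-AICF) is proposed. When measuring feature weights, the algorithm considers the document frequency of a feature at each term frequency. In addition, to express the relationship between features and categories accurately, two new concepts are introduced: category-dependent document frequency (CDF) and average inverse class frequency (AICF), which represent how well a feature represents a category and how well it discriminates between categories, respectively. Finally, classification experiments on three datasets compare the algorithm with five other feature weighting methods. The results show that CDF-AICF achieves better classification performance than the other five methods.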

4.

A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm (total citations: 1; self-citations: 0; citations by others: 1)

Harun Uğuz 《Knowledge》2011,24(7):1024-1032

Text categorization is widely used when organizing documents in a digital form. Due to the increasing number of documents in digital form, automated text categorization has become more promising in the last ten years. A major problem of text categorization is its large number of features. Most of those are irrelevant noise that can mislead the classifier. Therefore, feature selection is often used in text categorization to reduce the dimensionality of the feature space and to improve performance. In this study, two-stage feature selection and feature extraction is used to improve the performance of text categorization. In the first stage, each term within the document is ranked depending on its importance for classification using the information gain (IG) method. In the second stage, genetic algorithm (GA) and principal component analysis (PCA) feature selection and feature extraction methods are applied separately to the terms which are ranked in decreasing order of importance, and a dimension reduction is carried out. Thereby, during text categorization, terms of less importance are ignored, and feature selection and extraction methods are applied to the terms of highest importance; thus, the computational time and complexity of categorization is reduced. To evaluate the effectiveness of dimension reduction methods on our proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 dataset collections for text categorization. The experimental results show that the proposed model is able to achieve high categorization effectiveness as measured by precision, recall and F-measure.
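The first stage described above, ranking terms by information gain, can be sketched as follows. This is only an illustration of the standard IG definition for binary term presence, not the authors' implementation; the toy documents and function names are assumptions, and the second stage (GA and PCA) is not shown.

```python
import math
from typing import List, Set, Tuple

Doc = Tuple[Set[str], str]  # (terms present in the document, class label)


def entropy(counts: List[int]) -> float:
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs: List[Doc], term: str) -> float:
    """IG(term) = H(C) - P(t) H(C | t present) - P(not t) H(C | t absent)."""
    labels = sorted({label for _, label in docs})

    def class_counts(subset: List[Doc]) -> List[int]:
        return [sum(1 for _, lab in subset if lab == c) for c in labels]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    n = len(docs)
    return (
        entropy(class_counts(docs))
        - len(with_term) / n * entropy(class_counts(with_term))
        - len(without_term) / n * entropy(class_counts(without_term))
    )


def rank_terms(docs: List[Doc]) -> List[str]:
    """Terms in decreasing order of IG; ties are broken alphabetically."""
    vocab = set().union(*(terms for terms, _ in docs))
    return sorted(vocab, key=lambda t: (-information_gain(docs, t), t))


# Toy corpus (hypothetical): "goal" and "vote" separate the classes perfectly.
toy_docs: List[Doc] = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "team"}, "politics"),
    ({"vote", "match"}, "politics"),
]
ranking = rank_terms(toy_docs)  # ['goal', 'vote', 'match', 'team']
```

A later stage would keep only the top of this ranking before applying a more expensive selection or extraction method.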

5.

Detecting suspicious behaviour using speech: Acoustic correlates of deceptive speech – An exploratory investigation

Christin Kirchhübel David M. Howard 《Applied ergonomics》2013

The current work intended to enhance our knowledge of changes or lack of changes in the speech signal when people were being deceptive. In particular, the study attempted to investigate the appropriateness of using speech cues in detecting deception. Truthful, deceptive and control speech were elicited from ten speakers in an interview setting. The data were subjected to acoustic analysis and results are presented on a range of speech parameters including fundamental frequency (f₀), overall amplitude and mean vowel formants F₁, F₂ and F₃. A significant correlation could not be established between deceptiveness/truthfulness and any of the acoustic features examined. Directions for future work are highlighted.

6.

A decomposition theorem for hyper-algebraic extensions of language families

Jan van Leeuwen Derick Wood 《Theoretical computer science》1976,1(3):199-214

In modern theories of rewriting structures, hyper-sentential and hyper-algebraic extensions of language families have abstracted the imminent features of iterated parallel substitution. After introducing the concept of a (depth-bounded) translation, we show that for each language L hyper-algebraic over a natural family F there are F-translations Δ, D and languages L₁, …, L_m hyper-sentential over F such that [equation missing in source], for some p, q ≥ 0. Two specializations of this result are given, when more assumptions are made about F. These are, firstly, a translation theorem and, secondly, an alphabetic homomorphism theorem for hyper-algebraic extensions (an alphabetic homomorphism is a letter-to-letter or letter-to-ε homomorphism).

7.

Listing closed sets of strongly accessible set systems with applications to data mining (total citations: 2; self-citations: 0; citations by others: 2)

Mario Boley Tamás Horváth Axel Poigné 《Theoretical computer science》2010,411(3):691-700

We study the problem of listing all closed sets of a closure operator σ that is a partial function on the power set of some finite ground set E, i.e., σ:F→F with F⊆P(E). A very simple divide-and-conquer algorithm is analyzed that correctly solves this problem if and only if the domain of the closure operator is a strongly accessible set system. Strong accessibility is a strict relaxation of greedoids as well as of independence systems. This algorithm turns out to have delay O(|E|(T_F+T_σ+|E|)) and space O(|E|+S_F+S_σ), where T_F, S_F, T_σ, and S_σ are the time and space complexities of checking membership in F and computing σ, respectively. In contrast, we show that the problem becomes intractable for accessible set systems. We relate our results to the data mining problem of listing all support-closed patterns of a dataset and show that there is a corresponding closure operator for all datasets if and only if the set system satisfies a certain confluence property.

8.

Regional remapping of the basement complex outcrops, using factor analysis to spectrometric data, of the Gabal Eteiqa, Eastern Desert, Egypt

H. A. HUSSEIN S. I. RABIE S. H. ABDEL NABI 《International journal of remote sensing》2013,34(5):811-823

Regional radiometric-geological mapping of the outcropping basement complex in the Gabal Eteiqa area has been carried out through the application of the factor analysis technique.

Three factor scores (F₁, F₂ and F₃), which reflect the interrelation of the seven spectrometric variables (TC, eU, eTh, K, eU/eTh, eU/K and eTh/K), are sufficient to outline the different rock units. F₁ outlines the highly radioactive rocks such as granodiorites, granites, ring complexes and acidic volcanics. The granodiorite and ring complexes are completely differentiated by F₂ scores. The F₃ values enable the granitic plutons to be divided into numerous subunits (e.g. G₁, G₂, G₃ and G₄). It is believed that the low radiometric level of G₄ is due to the Quaternary wadi deposits that overlie the granites, an interpretation confirmed by aerial photomosaics.

9.

Spectroscopic determination of leaf water content using continuous wavelet analysis (total citations: 12; self-citations: 0; citations by others: 12)

T. Cheng A. Sánchez-Azofeifa 《Remote sensing of environment》2011,115(2):659-670

The gravimetric water content (GWC, %), a commonly used measure of leaf water content, describes the ratio of water to dry matter for each individual leaf. To date, the relationship between spectral reflectance and GWC in leaves is poorly understood due to the confounding effects of unpredictably varying water and dry matter ratios on spectral response. Few studies have attempted to estimate GWC from leaf reflectance spectra, particularly for a variety of species. This paper investigates the spectroscopic estimation of leaf GWC using continuous wavelet analysis applied to the reflectance spectra (350-2500 nm) of 265 leaf samples from 47 species observed in tropical forests of Panama. A continuous wavelet transform was performed on each of the reflectance spectra to generate a wavelet power scalogram compiled as a function of wavelength and scale. Linear relationships were built between wavelet power and GWC expressed as a function of dry mass (LWC_D) and fresh mass (LWC_F) in order to identify wavelet features (coefficients) that are most sensitive to changes in GWC. The derived wavelet features were then compared to three established spectral indices used to estimate GWC across a wide range of species. Eight wavelet features observed between 1300 and 2500 nm provided strong correlations with LWC_D, though correlations between spectral indices and leaf GWC were poor. In particular, two features captured amplitude variations in the broad shape of the reflectance spectra and three features captured variations in the shape and depth of dry matter (e.g., protein, lignin, cellulose) absorptions centered near 1730 and 2100 nm. The eight wavelet features used to predict LWC_D and LWC_F were not significantly different; however, predictive models used to determine LWC_D and LWC_F differed. The most accurate estimates of LWC_D and LWC_F obtained from a single wavelet feature showed root mean square errors (RMSEs) of 28.34% (R² = 0.62) and 4.86% (R² = 0.69), respectively.
Models using a combination of features resulted in a noticeable improvement predicting LWC_D and LWC_F with RMSEs of 26.04% (R² = 0.71) and 4.34% (R² = 0.75), respectively. These results provide new insights into the role of dry matter absorption features in the shortwave infrared (SWIR) spectral region for the accurate spectral estimation of LWC_D and LWC_F. This emerging spectral analytical approach can be applied to other complex datasets including a broad range of species, and may be adapted to estimate basic leaf biochemical elements such as nitrogen, chlorophyll, cellulose, and lignin.

10.

Word co-occurrence features for text classification

Fábio Figueiredo Leonardo Rocha Thierson Couto Thiago Salles Marcos André Gonçalves Wagner Meira Jr. 《Information Systems》2011

In this article we propose a data treatment strategy to generate new discriminative features, called compound-features (or c-features), for the sake of text classification. These c-features are composed of terms that co-occur in documents without any restrictions on order or distance between terms within a document. This strategy precedes the classification task, in order to enhance documents with discriminative c-features. The idea is that, when c-features are used in conjunction with single-features, the ambiguity and noise inherent to their bag-of-words representation are reduced. We use c-features composed of two terms in order to make their usage computationally feasible while improving the classifier effectiveness. We test this approach with several classification algorithms and single-label multi-class text collections. Experimental results demonstrated gains in almost all evaluated scenarios, from the simplest algorithms such as kNN (13% gain in micro-average F₁ in the 20 Newsgroups collection) to the most complex one, the state-of-the-art SVM (10% gain in macro-average F₁ in the collection OHSUMED).
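As a rough illustration of two-term c-features as described above, the sketch below enumerates every unordered pair of distinct terms that co-occur in a document, ignoring order and distance. It is an assumption-laden simplification: the paper keeps only discriminative c-features, and that selection step is not reproduced here. The function names and the `+` separator are made up for the example.

```python
from itertools import combinations
from typing import List, Set, Tuple


def pair_features(tokens: List[str]) -> Set[Tuple[str, str]]:
    """All unordered pairs of distinct terms that co-occur in one document."""
    vocab = sorted(set(tokens))  # sorting makes each pair canonical
    return set(combinations(vocab, 2))


def enrich(tokens: List[str]) -> List[str]:
    """Single terms followed by pair features, each pair joined with '+'."""
    singles = sorted(set(tokens))
    pairs = sorted("+".join(pair) for pair in pair_features(tokens))
    return singles + pairs


features = enrich(["bank", "river", "bank"])  # ['bank', 'river', 'bank+river']
```

The number of pairs grows quadratically with the number of distinct terms in a document, which is consistent with the abstract's remark that restricting c-features to two terms keeps the approach computationally feasible.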

11.

Algorithms for regular and irregular Coulomb and Bessel functions

R.K. Nesbet 《Computer Physics Communications》1984,32(4):341-347

Algorithms for computing Coulomb-Bessel functions are considered, with emphasis on obtaining accurate values when the argument x is inside the classical turning point x_λ. Algorithms of Barnett et al. for the generalized Coulomb functions and their derivatives are discussed in the context of the phase integral formalism. Modified or alternative algorithms are considered that are designed to be valid for all values of argument x and index λ for the functions F_λ(x), G_λ(x). An algorithm for accelerating convergence of a power series by conversion to a continued fraction is presented, and is applied to the evaluation of spherical Bessel functions. An explicit formula for the integrand of the phase integral is presented for spherical Bessel functions. The methods considered need to be augmented by an efficient algorithm for computing the logarithmic derivative of G₀ + iF₀ for Coulomb functions when x is smaller than the charge parameter η.

12.

Determining the specificity of terms using inside-outside information: a necessary condition of term hierarchy mining

Pum-Mo Ryu Key-Sun Choi 《Information Processing Letters》2006,100(2):76-82

This paper introduces new specificity measuring methods of terms using inside and outside information. Specificity of a term is the quantity of domain specific information contained in the term. Specific terms have a larger quantity of domain information than general terms. Specificity is an important necessary condition for building hierarchical relations among terms. If t₁ is a hyponym of t₂ in a domain term hierarchy, then the specificity of t₁ is greater than that of t₂. As domain specific terms are commonly compounds of the generic level term and some modifiers, inside information is important to represent the meaning of terms. Outside contextual information is also used to complement the shortcomings of inside information. We propose an information theoretic method to measure the quantity of terms. Experiments showed promising results with a precision of 73.9% when applied to terms in the MeSH thesaurus.

13.

Utilizing linguistically enhanced keystroke dynamics to predict typist cognition and demographics

《International journal of human-computer studies》2015

Entering information on a computer keyboard is a ubiquitous mode of expression and communication. We investigate whether typing behavior is connected to two factors: the cognitive demands of a given task and the demographic features of the typist. We utilize features based on keystroke dynamics, stylometry, and “language production”, which are novel hybrid features that capture the dynamics of a typist's linguistic choices. Our study takes advantage of a large data set (~350 subjects) made up of relatively short samples (~450 characters) of free text. Experiments show that these features can recognize the cognitive demands of the task that an unseen typist is engaged in, and can classify his or her demographics with better than chance accuracy. We correctly distinguish High vs. Low cognitively demanding tasks with accuracy up to 72.39%. Detection of non-native speakers of English is achieved with F₁=0.462 over a baseline of 0.166, while detection of female typists reaches F₁=0.524 over a baseline of 0.442. Recognition of left-handed typists achieves F₁=0.223 over a baseline of 0.100. Further analyses reveal that novel relationships exist between language production as manifested through typing behavior, and both cognitive and demographic factors.

14.

Turning from TF-IDF to TF-IGM for term weighting in text classification

《Expert systems with applications》2016

Massive textual data management and mining usually rely on automatic text classification technology. Term weighting is a basic problem in text classification and directly affects the classification accuracy. Since the traditional TF-IDF (term frequency & inverse document frequency) is not fully effective for text classification, various alternatives have been proposed by researchers. In this paper we make comparative studies on different term weighting schemes and propose a new term weighting scheme, TF-IGM (term frequency & inverse gravity moment), as well as its variants. TF-IGM incorporates a new statistical model to precisely measure the class distinguishing power of a term. Particularly, it makes full use of the fine-grained term distribution across different classes of text. The effectiveness of TF-IGM is validated by extensive experiments of text classification using SVM (support vector machine) and kNN (k nearest neighbors) classifiers on three commonly used corpora. The experimental results show that TF-IGM outperforms the famous TF-IDF and the state-of-the-art supervised term weighting schemes. In addition, some new findings different from previous studies are obtained and analyzed in depth in the paper.

15.

Three-machine flowshop with two operations per job to minimize makespan

Ling-Huey Su Cheng-Te Lin 《Computers & Industrial Engineering》2006

This work studies three variants of a three-machine flowshop problem with two operations per job to minimize makespan (F3/o = 2/C_max). A set of n jobs are classified into three mutually exclusive families A, B and C. The families A, B and C are defined as the sets of jobs that are scheduled in machine sequence (M₁, M₂), (M₁, M₃) and (M₂, M₃), respectively, where (M_x, M_y) specifies the machine sequence for a job that is processed first on M_x, and then on M_y. Specifically, jobs with the same route (machine sequence) are classified into the same family. Three variants of F3/o = 2/C_max are studied. First, F3/GT, no-idle, o = 2/C_max, in which both machine no-idle and GT restrictions are considered. The GT assumption requires that all jobs in the same family are processed contiguously on the machine and the machine no-idle assumption requires that all machines work continuously without idle time. Second, the problem F3/GT, o = 2/C_max, in which the machine no-idle restriction in the first variant is relaxed, is considered. Third, the problem F3/no-idle, o = 2/C_max with the GT assumption in the first variant relaxed is considered. Based on the dominance conditions developed, the optimal solution is polynomially derived for each variant. These results may narrow down the gap between easy and hard cases of the general problem.

16.

The Space Complexity of Approximating the Frequency Moments

《Journal of Computer and System Sciences》1999,58(1):137-147

The frequency moments of a sequence containing m_i elements of type i, 1 ≤ i ≤ n, are the numbers F_k = ∑_{i=1}^{n} m_i^k. We consider the space complexity of randomized algorithms that approximate the numbers F_k, when the elements of the sequence are given one by one and cannot be stored. Surprisingly, it turns out that the numbers F₀, F₁, and F₂ can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^Ω(1) space. Applications to data bases are mentioned as well.
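To make the definition of the frequency moments concrete, the sketch below computes F_k exactly from a stream by counting occurrences of each type. This is deliberately not the kind of algorithm the paper studies: it stores one counter per distinct type, whereas the abstract concerns randomized approximation in much less space. The function name is illustrative.

```python
from collections import Counter
from typing import Hashable, Iterable


def frequency_moment(stream: Iterable[Hashable], k: int) -> int:
    """Exact F_k: the sum over all types i of m_i ** k.

    m_i is the number of occurrences of type i in the stream. Types that
    never occur are not counted, so F_0 is the number of distinct types.
    """
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


# "abracadabra" has counts a=5, b=2, r=2, c=1, d=1.
f0 = frequency_moment("abracadabra", 0)  # 5 distinct types
f1 = frequency_moment("abracadabra", 1)  # 11, the stream length
f2 = frequency_moment("abracadabra", 2)  # 25 + 4 + 4 + 1 + 1 = 35
```

F₀ is the number of distinct elements, F₁ is the length of the sequence, and F₂ grows as the distribution becomes more skewed, which is why it is often read as a measure of how uneven the counts are.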

17.

Two-stage intonation modeling using feedforward neural networks for syllable based text-to-speech synthesis

V. Ramu Reddy K. Sreenivasa Rao 《Computer Speech and Language》2013,27(5):1105-1126

This paper proposes a two-stage feedforward neural network (FFNN) based approach for modeling fundamental frequency (F₀) values of a sequence of syllables. In this study, (i) linguistic constraints represented by positional, contextual and phonological features, (ii) production constraints represented by articulatory features and (iii) linguistic relevance tilt parameters are proposed for predicting intonation patterns. In the first stage, tilt parameters are predicted using linguistic and production constraints. In the second stage, F₀ values of the syllables are predicted using the tilt parameters predicted from the first stage, and basic linguistic and production constraints. The prediction performance of the neural network models is evaluated using objective measures such as average prediction error (μ), standard deviation (σ) and linear correlation coefficient (γ_X,Y). The prediction accuracy of the proposed two-stage FFNN model is compared with other statistical models such as Classification and Regression Tree (CART) and Linear Regression (LR) models. The prediction accuracy of the intonation models is also analyzed by conducting listening tests to evaluate the quality of synthesized speech obtained after incorporation of intonation models into the baseline system. From the evaluation, it is observed that prediction accuracy is better for two-stage FFNN models, compared to the other models.

18.

A study of supervised term weighting scheme for sentiment analysis

《Expert systems with applications》2014,41(7):3506-3513

Term weighting is a strategy that assigns weights to terms to improve the performance of sentiment analysis and other text mining tasks. In this paper, we propose a supervised term weighting scheme based on two basic factors: importance of a term in a document (ITD) and importance of a term for expressing sentiment (ITS), to improve the performance of analysis. For ITD, we explore three definitions based on term frequency. Then, seven statistical functions are employed to learn the ITS of each term from training documents with category labels. Compared with the previous unsupervised term weighting schemes originating from information retrieval, our scheme can make full use of the available labeling information to assign appropriate weights to terms. We have experimentally evaluated the proposed method against the state-of-the-art method. The experimental results show that our method outperforms that method and produces the best accuracy on two of three data sets.

19.

Balancing type one and two errors in multiple testing for differential expression of genes

Alexander Gordon Galina Glazko 《Computational statistics & data analysis》2009,53(5):1622-1629

A new procedure is proposed to balance type I and II errors in significance testing for differential expression of individual genes. Suppose that a collection, F_k, of k lists of selected genes is available, each of them approximating by their content the true set of differentially expressed genes. For example, such sets can be generated by a subsampling counterpart of the delete-d-jackknife method controlling the per-comparison error rate for each subsample. A final list of candidate genes, denoted by S^∗, is composed in such a way that its contents be closest in some sense to all the sets thus generated. To measure “closeness” of gene lists, we introduce an asymmetric distance between sets with its asymmetry arising from a generally unequal assignment of the relative costs of type I and type II errors committed in the course of gene selection. The optimal set S^∗ is defined as a minimizer of the average asymmetric distance from an arbitrary set S to all sets in the collection F_k. The minimization problem can be solved explicitly, leading to a frequency criterion for the inclusion of each gene in the final set. The proposed method is tested by resampling from real microarray gene expression data with artificially introduced shifts in expression levels of pre-defined genes, thereby mimicking their differential expression.
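The abstract does not give the asymmetric distance itself, so the following is only a sketch under an explicit assumption: the distance from a candidate set S to a list T is taken to be cost_fp · |S \ T| + cost_fn · |T \ S|. Under that assumed form, each gene contributes independently to the average distance: including a gene that appears in a fraction f of the k lists costs cost_fp · (1 − f), and excluding it costs cost_fn · f. The minimizer therefore includes a gene when f > cost_fp / (cost_fp + cost_fn), which is one possible frequency criterion. The names `consensus_set`, `cost_fp` and `cost_fn` are hypothetical.

```python
from collections import Counter
from typing import List, Set


def consensus_set(gene_lists: List[Set[str]], cost_fp: float, cost_fn: float) -> Set[str]:
    """Genes whose selection frequency exceeds cost_fp / (cost_fp + cost_fn).

    Assumes the distance d(S, T) = cost_fp * |S - T| + cost_fn * |T - S|;
    this form is an illustration, not taken from the paper. Ties at the
    threshold are resolved by excluding the gene.
    """
    k = len(gene_lists)
    counts = Counter(gene for genes in gene_lists for gene in genes)
    threshold = cost_fp / (cost_fp + cost_fn)
    return {gene for gene, count in counts.items() if count / k > threshold}


# Toy lists (hypothetical): frequencies are A = 1.0, B = 0.5, C = D = 0.25.
toy_lists = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}, {"A"}]

equal_costs = consensus_set(toy_lists, cost_fp=1.0, cost_fn=1.0)     # {'A'}
misses_costly = consensus_set(toy_lists, cost_fp=1.0, cost_fn=3.0)   # {'A', 'B'}
```

Raising the cost of missed genes lowers the threshold, so more genes enter the final list; with equal costs the rule reduces to a strict majority over the k lists.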

20.

High performance polymer actuator based on carbon nanotube-ionic liquid gel: Effect of ionic liquid

Naohiro Terasawa Ichiroh Takeuchi Hajime Matsumoto Ken Mukai Kinji Asaka 《Sensors and actuators. B, Chemical》2011,156(2):539-545

In this study, we investigated how the electrochemical and electromechanical properties of the actuators depend on the anionic species of the ionic liquids (ILs): perfluoroalkyltrifluoroborate anions ([C_nF_{2n+1}BF₃]⁻, n = 0, 1, 2) and bis(perfluoroalkylsulfonyl)imide anions ([(C_mF_{2m+1}SO₂)(C_nF_{2n+1}SO₂)N]⁻, m, n = 0, 1, 2). 1-Ethyl-3-methylimidazolium (EMI⁺) was selected as the cation for the ILs. 1-Ethyl-3-methylimidazolium trifluoromethyltrifluoroborate (EMI[CF₃BF₃]), 1-ethyl-3-methylimidazolium pentafluoroethyltrifluoroborate (EMI[CF₃CF₂BF₃]), 1-ethyl-3-methylimidazolium fluorosulfonyl(trifluoromethylsulfonyl)imide (EMI[FTA]) and 1-ethyl-3-methylimidazolium pentafluoroethylsulfonyl(trifluoromethylsulfonyl)imide (EMI[C₁C₂]) were synthesized according to the literature. The strains generated by the bucky-gel electrodes of the actuators containing EMI[CF₃BF₃] (in the high frequency range of 10-0.5 Hz) and EMI[CF₃CF₂BF₃] (in the high frequency range of 1-0.5 Hz) are larger than that of the actuator containing EMI[BF₄]; that is, they respond more quickly. At low frequencies (0.1-0.005 Hz), the strain generated with EMI[CF₃CF₂BF₃] was larger than those with the other ILs (EMI[C_nF_{2n+1}BF₃] (n = 0, 1) and EMI[(C_mF_{2m+1}SO₂)(C_nF_{2n+1}SO₂)N] (m, n = 0, 1, 2)). The Young's moduli of the actuators containing EMI[CF₃BF₃] and EMI[CF₃CF₂BF₃] were 145 and 110 MPa, respectively. The melting points of EMI[CF₃BF₃] and EMI[CF₃CF₂BF₃] are lower than that of EMI[BF₄]. Therefore, the trifluoromethyltrifluoroborate ([CF₃BF₃]⁻) and pentafluoroethyltrifluoroborate ([CF₃CF₂BF₃]⁻) anions performed much better in actuators using the polymer-supported bucky-gel electrode containing the IL. These results suggest that the actuators perform well enough for practical applications (e.g. tactile displays).