1.
There are three factors involved in text classification: the classification model, the similarity measure, and the document representation model. In this paper, we focus on document representation and demonstrate that the choice of document representation has a profound impact on the quality of the classifier. In our experiments, we have used the centroid-based text classifier, which is a simple and robust text classification scheme. We compare four different types of document representations: N-grams, single terms, phrases, and RDR, which is a logic-based document representation. The N-gram representation is a string-based representation with no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR is based on linguistic processing and represents documents as a set of logical predicates. We have experimented with many text collections and obtained similar results. Here, we base our arguments on experiments conducted on Reuters-21578. We show that RDR, the most complex representation, produces the most effective classifier on Reuters-21578, followed by the phrase approach.
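For readers unfamiliar with the centroid-based scheme used in this paper, the sketch below illustrates the idea with TF-IDF vectors standing in for the compared representations; the toy documents, labels, and scikit-learn usage are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch of a centroid-based text classifier: one centroid per class,
# documents assigned to the class with the most cosine-similar centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["grain exports rose", "oil prices fell", "wheat harvest up"]
train_labels = ["grain", "oil", "grain"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs).toarray()

# Each centroid is the mean of the TF-IDF vectors of its class's documents.
classes = sorted(set(train_labels))
centroids = np.vstack([X[[lbl == c for lbl in train_labels]].mean(axis=0)
                       for c in classes])

def classify(doc):
    """Assign the class whose centroid is most cosine-similar to the document."""
    v = vec.transform([doc]).toarray()
    return classes[int(cosine_similarity(v, centroids).argmax())]

print(classify("wheat and grain shipments"))  # -> "grain"
```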
2.
It has recently been suggested that assuming independence between labels is not suitable for real-world multi-label classification. To account for label dependencies, this paper proposes a supervised topic modeling algorithm, namely the labelset topic model (LsTM). Our algorithm uses two labelset layers to capture label dependencies. LsTM offers two major advantages over existing supervised topic modeling algorithms: it is straightforward to interpret, and it allows words to be assigned to combinations of labels rather than to a single label. We have performed extensive experiments on several well-known multi-label datasets. Experimental results indicate that the proposed model achieves performance on par with, and often exceeding, that of state-of-the-art methods, both qualitatively and quantitatively.
3.
Abdelkamel Tari, Islam Elgedawy, Abdelnasser Dahmani. Journal of Intelligent Information Systems, 2009, 32(3): 237-265
Nowadays, more and more companies and organizations implement their business services on the Internet, owing to the tremendous progress made recently in the field of Web services. It has become possible to publish, locate, and invoke applications across the Web. Thus, the ability to efficiently select and integrate at runtime services located on different sites on the Web is an important issue. When no single Web service can satisfy the user's request, it should be possible to combine existing services to meet it. This paper provides a dual-layered model for Web services, where the first model layer captures the high-level functional specifications (namely goals, achievement contexts, and external behaviours) and the second model layer captures the low-level functional specifications (namely interfaces). This model allows the service composition process to be performed on both high-level and low-level specifications. We also introduce composition operators (both high-level and low-level) to allow the composition of virtual services.
4.
Document representation models convert unstructured text into structured data and underpin many natural language processing tasks, yet current word-based models cannot represent documents directly. To address this problem, and building on the fact that a generative adversarial network (GAN) trains two neural networks adversarially and thereby learns the characteristics of the original data distribution well, this paper proposes a document representation model, WADM, which uses a denoising autoencoder as its discriminator network and obtains the distributed representation of a document directly from the autoencoder's hidden layer. Experiments show that WADM extracts document features accurately and has stronger document representation ability than word-based models.
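The sketch below shows only the denoising-autoencoder component described above, whose hidden layer yields a document representation; it is not the full adversarial WADM model, and the vocabulary size, hidden dimension, noise rate, and random stand-in data are assumptions.

```python
# Minimal sketch: a denoising autoencoder over TF-IDF-style document vectors;
# the hidden layer serves as the document embedding.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        noisy = x * (torch.rand_like(x) > 0.2).float()  # drop ~20% of the inputs
        h = self.encoder(noisy)                          # hidden layer = embedding
        return self.decoder(h), h

vocab_size = 5000
model = DenoisingAutoencoder(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

docs = torch.rand(32, vocab_size)        # stand-in for TF-IDF document vectors
for _ in range(100):
    recon, _ = model(docs)
    loss = loss_fn(recon, docs)          # reconstruct the clean input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

_, embeddings = model(docs)              # (32, 128) document representations
```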
5.
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs, and graph similarities are employed to position documents in the resulting vector space and to classify them. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
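A minimal sketch of the character n-gram graph construction follows, together with a simple containment-style graph similarity; the n-gram length, window size, and example strings are assumptions, and the paper's class-graph merging and vector-space positioning are omitted.

```python
# Minimal sketch of character n-gram graphs: nodes are n-grams, and n-grams that
# co-occur within a sliding window are connected by frequency-weighted edges.
from collections import Counter

def ngram_graph(text, n=3, window=3):
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i in range(len(grams)):
        for j in range(i + 1, min(i + window, len(grams))):
            edges[frozenset((grams[i], grams[j]))] += 1
    return edges

def containment_similarity(g1, g2):
    """Fraction of g1's edges that also appear in g2 (a simple graph similarity)."""
    if not g1:
        return 0.0
    return sum(1 for e in g1 if e in g2) / len(g1)

class_graph = ngram_graph("the quick brown fox jumps over the lazy dog")
doc_graph = ngram_graph("a quick brown dog")
print(containment_similarity(doc_graph, class_graph))
```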
6.
7.
The authors review a large number of document development systems for both text and graphics from the perspectives of source-language and direct-manipulation models. They describe the task domain and discuss the pros and cons of direct-manipulation techniques versus a programming-language source code, and of procedural versus declarative schemes. They then establish a framework for analyzing and designing multiple-representation systems. The central theme is that program constructs and visual feedback are complementary and that a hybrid approach would be most desirable.
8.
Automatic text classification offers a powerful solution to the increasingly serious problem of "information overload" on the Internet. For Chinese text classification, this work introduces ontology into the N-gram statistical text model and proposes a multi-index strategy based on domain concepts and valid word chains, together with corresponding weight computation and parameter smoothing methods. Experiments on real data sets show that the ontology-enhanced N-gram Chinese text classification model not only reduces the number of index terms but also improves classification accuracy.
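A minimal sketch of the indexing idea is given below, assuming a hypothetical list of ontology concepts; the paper's multi-index weighting and parameter smoothing schemes are not shown.

```python
# Minimal sketch: index a Chinese document by the ontology concepts it contains
# plus its character n-grams; the concept list is a hypothetical stand-in.
def index_terms(text, concepts, n=2):
    """Return domain-concept terms found in the text plus character n-grams."""
    found = [c for c in concepts if c in text]
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return found + ngrams

concepts = ["文本分类", "信息过载"]  # hypothetical domain-ontology concepts
print(index_terms("文本分类缓解信息过载", concepts))
```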
9.
Only humans can understand and comprehend the actual meaning that underlies natural written language, whereas machines can form semantic relationships only after humans have provided the parameters that are necessary to model the meaning. To enable computer models to access the underlying meaning in written language, accurate and sufficient document representation is crucial. Recently, word embedding approaches have drawn much attention in text mining research. One of the main benefits of such approaches is the use of global corpora with the generation of pre-trained word vectors. Although very effective, these approaches have their disadvantages. Relying only on pre-trained word vectors may neglect the local context and increase word ambiguity. In this study, a new approach, Content Tree Word Embedding (CTWE), is introduced to mitigate the risk of word ambiguity and inject local context into globally pre-trained word vectors. CTWE is essentially a framework for document representation built on word embedding feature learning. The content tree structure is learned locally from the training data and ultimately represents the local context. As the content tree is constructed, each word vector is updated based on its location in the tree. For the classification task, the results show an improvement in F-score and accuracy when using two deep learning-based word embedding approaches, namely GloVe and Word2Vec.
10.
11.
Information and Software Technology, 2006, 48(8): 687-695
The widespread availability of machine-understandable information on the Semantic Web offers opportunities to improve traditional search. In this paper, we propose a hybrid Web search architecture, HWS, which combines text search with semantic search to improve precision and recall. The components of HWS are described in detail, including several novel algorithms proposed to support hybrid Web search.
12.
The bag-of-words approach to text document representation typically results in vectors on the order of 5000–20,000 components as the representation of documents. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in the class discrimination of two popular such methods: Latent Semantic Indexing (LSI) and sequential feature selection according to some relevance criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating with both large and dense matrices, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. A drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance.
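A minimal scikit-learn sketch of the two-step transform described above follows, with TruncatedSVD playing the role of the intermediate LSI reduction; the dataset, categories, and dimensions are placeholders rather than the paper's Reuters-21578 setup.

```python
# Minimal sketch: TF-IDF -> intermediate LSI reduction -> LDA transform/classifier.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

categories = ["sci.space", "rec.autos", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    TruncatedSVD(n_components=100),   # intermediate reduction (LSI / random transform)
    LinearDiscriminantAnalysis(),     # supervised LDA transform used as the classifier
)
pipeline.fit(train.data, train.target)
print(pipeline.score(test.data, test.target))
```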
13.
This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR). The idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that were used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class; when a new instance arrives to be classified, it is projected onto the set of computed PCs corresponding to each class and then reconstructed using the same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error or divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
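A minimal sketch of the per-class PCA reconstruction idea follows; the synthetic feature vectors and component count are illustrative and do not reproduce the authors' email experiments.

```python
# Minimal PCADR-style sketch: fit one PCA per class, reconstruct a new document
# with each, and assign the class with the smallest reconstruction error.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_spam = rng.normal(loc=1.0, size=(200, 50))   # stand-in feature vectors per class
X_ham = rng.normal(loc=-1.0, size=(200, 50))

models = {
    "spam": PCA(n_components=5).fit(X_spam),
    "ham": PCA(n_components=5).fit(X_ham),
}

def classify(x):
    errors = {}
    for label, pca in models.items():
        recon = pca.inverse_transform(pca.transform(x.reshape(1, -1)))
        errors[label] = np.linalg.norm(x - recon)   # reconstruction error per class
    return min(errors, key=errors.get)

print(classify(rng.normal(loc=1.0, size=50)))      # -> most likely "spam"
```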
14.
We present a simple yet effective document classification approach that incorporates rationales elicited from annotators into the training of any off-the-shelf classifier. We empirically show on several document classification datasets that our classifier-agnostic approach, which makes no assumptions about the underlying classifier, can effectively incorporate rationales into the training of multinomial naïve Bayes, logistic regression, and support vector machines. In addition to being classifier-agnostic, our method has performance comparable to previous classifier-specific approaches developed for incorporating rationales and feature annotations. We also propose and evaluate an active learning method tailored specifically to the learning-with-rationales framework.
15.
To support machine understanding and machine reasoning over imprecise knowledge on the Semantic Web, the notion of an imprecise Semantic Web ontology is proposed. By analyzing the semantics of imprecision and of concepts, imprecision is delimited as comprising fuzziness and roughness, which originate respectively from the delineation and derivation processes the human mind uses when forming concepts. It is then pointed out that imprecise concepts are in fact fuzzy rough concepts, whereas imprecise partial-order relations do not actually exist, from which a set-expression model of the imprecise Semantic Web ontology is derived, namely the partially ordered set of fuzzy rough concepts. Two practical representations of this model are given: the fuzzy rough concept table and the fuzzy rough concept lattice. The latter has the desirable properties of reducibility and uniqueness, on the basis of which concise and well-formed OWL documents can be written, thereby supporting the actual operation of the Semantic Web.
16.
As is well known, the computational complexity of the mixed integer programming (MIP) problem is one of the main issues in model predictive control (MPC) of hybrid systems such as mixed logical dynamical systems. Several efficient MIP solvers, such as multi-parametric MIP solvers, have therefore been developed to cope with this problem. As an alternative approach to this issue, this paper addresses how a deterministic finite automaton, which is part of a hybrid system, should be expressed in order to efficiently solve the MIP problem to which the MPC problem is reduced. More specifically, a modeling method is proposed that represents a deterministic finite automaton in the form of a linear state equation with a smaller set of binary input variables and binary linear inequalities. After a motivating example is described, a derivation procedure for a linear state equation with linear inequalities representing a deterministic finite automaton is proposed in three steps: modeling via an implicit system, coordinate transformation to a linear state equation, and state-feedback binarization. Various significant properties of the proposed modeling are also presented through the proofs of the derivation procedure.
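For intuition only, here is a small worked example of the general modeling idea (not the paper's three-step construction): encode the automaton state and input as one-hot binary vectors, write the transition bilinearly, and linearize the binary products with standard mixed-logical-dynamical inequalities. The two-state automaton and its matrices below are assumed for illustration.

```latex
% Assumed example: one-hot state x(k) in {0,1}^2, one-hot input u(k) in {0,1}^2;
% input 1 sends every state to state 2, input 2 sends every state to state 1.
\[
  x(k+1) = A_1\, x(k)\, u_1(k) + A_2\, x(k)\, u_2(k),
  \qquad
  A_1 = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix},
  \quad
  A_2 = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}.
\]
% Each bilinear term z = x_i(k) u_j(k) of binary variables is made linear by
\[
  z \le x_i(k), \qquad z \le u_j(k), \qquad z \ge x_i(k) + u_j(k) - 1,
  \qquad z \in \{0,1\},
\]
% so the transition map becomes a linear state equation in the auxiliary binary
% variables z subject to binary linear inequalities.
```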
17.
Sparse representation-based classification (SRC) has achieved state-of-the-art results in face recognition. To address problems with SRC, a sparse neighbor representation classification method (SNRC) is proposed. Under the assumptions of the locally linear embedding method, SNRC performs classification through sparse neighbor representation. Experimental results on several data sets show that SNRC is suitable for nonlinearly distributed data and achieves good performance. Further analysis shows that SNRC is well suited to classifying low-dimensional data obtained by dimensionality reduction methods, especially data produced by neighborhood-preserving dimensionality reduction, and it has lower time complexity.
18.
Recent research on sparse signal representation has aimed at learning discriminative sparse models instead of purely reconstructive ones for classification tasks, for example sparse representation-based classification (SRC), which obtains state-of-the-art results in face recognition. In this paper, a new method is proposed in that direction. Under the assumption of locally linear embedding, the proposed method achieves the classification goal via sparse neighbor representation, combining the reconstruction property, sparsity, and discriminative power. Experiments on several data sets show that the proposed method is suitable for nonlinear data sets. Further, it is argued that the proposed method is well suited to the classification of low-dimensional data produced by dimensionality reduction methods, especially methods that obtain neighborhood-preserving low-dimensional embeddings, and it costs less time.
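A minimal sketch of the plain SRC baseline that SNRC builds on follows (not the SNRC algorithm itself): code a test sample sparsely over all training samples and assign the class whose coefficients yield the smallest reconstruction residual; the synthetic data and Lasso penalty are illustrative.

```python
# Minimal sketch of sparse representation-based classification (SRC).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (30, 20)), rng.normal(3, 1, (30, 20))])
y_train = np.array([0] * 30 + [1] * 30)

def src_classify(x):
    # Sparse code of x over the dictionary whose columns are training samples.
    coder = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
    coder.fit(X_train.T, x)
    coef = coder.coef_
    residuals = []
    for c in np.unique(y_train):
        mask = (y_train == c)
        recon = X_train[mask].T @ coef[mask]   # keep only class-c coefficients
        residuals.append(np.linalg.norm(x - recon))
    return int(np.argmin(residuals))

print(src_classify(rng.normal(3, 1, 20)))      # -> most likely class 1
```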
19.
周朴雄. 计算机工程与应用, 2008, 44(25): 155-156
To address the high computational complexity of the KNN algorithm in Web document classification, and unlike previous work that reduces this complexity by shrinking the training set or adopting fast algorithms, this paper takes a parallel approach and proposes a parallel algorithm on the Hyper-cube SIMD model. The time complexity of its key step is reduced from O(n²) to O(log n). Compared with the traditional serial algorithm, the proposed algorithm significantly improves classification speed.
20.
Existing one-class text classification algorithms require a large amount of repeated computation during incremental learning. This paper proposes a new one-class classification algorithm for text that, without degrading classification performance, effectively reduces the computation needed when new samples are added for learning, making it well suited to scenarios that require incremental learning. The method has been tested experimentally and achieved good results.