首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 171 毫秒
1.
一种应用向量聚合技术的KNN中文文本分类方法   总被引:3,自引:2,他引:3  
针对KNN文本分类方法中不考虑特征词关联的问题,提出一种改进方法.这种方法基于对体现词和类别问相关程度的CHI统计值分布的分析,应用向量聚合技术很好地解决了关联特征词的提取问题.其特点在于:聚合文本向量中相关联的特征词作为特征项,从而取代传统方法中一个特征词对应向量一维的做法,这样不但缩减了向量的维教,而且加强了特征项对文本分类的贡献.实验表明该方法明显提高了分类的准确率和召回率。  相似文献   

2.
基于向量空间模型的贝叶斯文本分类方法   总被引:2,自引:0,他引:2  
提出基于向量空间模型的贝叶斯文本分类方法。首先提取出文本训练集的特征词,建立特征向量空间模型。然后采用贝叶斯文本分类方法对未知类别文档进行分类。给出了贝叶斯文本分类方法过程的详细描述和文本分类的一个测试实例。  相似文献   

3.
传统的文本分类方法假设训练集与测试集中的特 征词服从相同的概率分布,但在实际应用中,以上假设存在偏差,会影响到最终的分类结果。针对这一情况,本文采用迁移学习,通过计算特征词的迁移量对训练集中向量空间模型进行修正,最终使训练集与测试集中特征词的分布概率趋于一致。将提出的方法应用于中文垃圾邮件过滤与中、英文网页分类中,在CHI统计特征选择基础上进行特征迁移,实验结果表明新方法可以有效消除特征词分布的差异性,使文本分类的各项指标明显提高。  相似文献   

4.
信息增益方法从整个训练集角度进行特征赋权,该模式不适合构造类别特征向量.通过改进的朴素贝叶斯方法选择类别特征用于构造类别向量,再利用词频信息改进信息增益模型用于文本特征选择,改善了信息增益模型对于中频词信息利用不足问题,提出一种基于类别的文本特征加权改进模型.随后的文本分类试验表明,提出的加权模型相比较于传统的信息增益方法具有较好的文本分类效果.  相似文献   

5.
基于主动学习支持向量机的文本分类   总被引:2,自引:0,他引:2       下载免费PDF全文
提出基于主动学习支持向量机的文本分类方法,首先采用向量空间模型(VSM)对文本特征进行提取,使用互信息对文本特征进行降维,然后提出主动学习算法对支持向量机进行训练,使用训练后的分类器对新的文本进行分类,实验结果表明该方法具有良好的分类性能。  相似文献   

6.
提出了一个基于级连神经网络(Cascade-Correlation Neural Network,CCNN)和SVD(Singular Value Decomposition)的文本分类新模型。该神经网络用级连相关算法来训练网络。大部分的文本分类系统用向量空间模型(Vector Space Model,VSM)来表现文档,然而这种方法需要很高的维度,并且考虑不到文本特征词间的语义隐含信息,因此分类效果不是太理想。引入SVD来学习和表现文本特征词,在降低特征维度的基础上,将文本特征的隐含信息表现出来。实验证明,在加快训练速度的基础上,提高了分类的精度。  相似文献   

7.
基于向量空间模型的文本分类方法的文本表示具有高纬度、高稀疏的特点,特征表达能力较弱,且特征工程依赖人工提取,成本较高。针对该问题,提出基于双通道词向量的卷积胶囊网络文本分类算法。将Word2Vec训练的词向量与基于特定文本分类任务扩展的语境词向量作为神经网络的2个输入通道,并采用具有动态路由机制的卷积胶囊网络模型进行文本分类。在多个英文数据集上的实验结果表明,双通道的词向量训练方式优于单通道策略,与LSTM、RAE、MV-RNN等算法相比,该算法具有较高的文本分类准确率。  相似文献   

8.
黄剑韬 《计算机应用》2011,31(Z2):67-69
为了降低基于向量空间模型(VSM)的文本分类方法的向量维数,并减少噪声对分类的影响,现利用商空间的粒度理论对基于VSM的分类模型进行改进,提出了一种基于商空间的新的VSM分类方法,该方法降低了基于VSM文本分类的向量维数,提高了不同文本之间的辨别能力.  相似文献   

9.
目前多数文本分类方法无法有效反映句子中不同单词的重要程度,且在神经网络训练过程中获得的词向量忽略了汉字本身的结构信息。构建一种GRU-ATT-Capsule混合模型,并结合CW2Vec模型训练中文词向量。对文本数据进行预处理,使用传统的词向量方法训练的词向量作为模型的第1种输入,通过CW2Vec模型训练得到的包含汉字笔画特征的中文词向量作为第2种输入,完成文本表示。利用门控循环单元分别提取2种不同输入的上下文特征并结合注意力机制学习文本中单词的重要性,将2种不同输入提取出的上下文特征进行融合,通过胶囊网络学习文本局部与全局之间的关系特征实现文本分类。在搜狗新闻数据集上的实验结果表明,GRU-ATT-Capsule混合模型相比TextCNN、BiGRU-ATT模型在测试集分类准确率上分别提高2.35和4.70个百分点,融合笔画特征的双通道输入混合模型相比单通道输入混合模型在测试集分类准确率上提高0.45个百分点,证明了GRU-ATT-Capsule混合模型能有效提取包括汉字结构在内的更多文本特征,提升文本分类效果。  相似文献   

10.
针对标题文本特征少、特征维度高和分布不均匀导致分类性能不佳的问题,该文提出了一种利用分类体系结构信息的双向特征选择算法,并在该方法基础上实现标题分类。该方法以具有严格层级关系的分类体系为应用前提,利用类别与词的同现和分布关系进行特征词和候选类别的双向选择,构建类别向量空间;通过分析标题文本特征词在层级类别向量空间的分布所表现出的类别语义信息,确定文本所在层级以及所在层级的候选类别;之后利用分类器对未能成功分类的标题进行分类。在人工标引数据集上的实验结果表明,该方法在不进行语料扩展和外部知识库添加的基础上仍可有效地确定文本所在层级,实现多级学科的分类;并可在识别类别语义信息的基础上,降低候选类别数目,提高分类效率。  相似文献   

11.
Even though there have been strong research activities about distributed virtual shared-memory (DVSM) systems, their architectures have been not widely used in current high-performance computing markets. The reason is that the previously introduced DVSM systems use conventional interconnection technologies like Ethernet, which incurs high execution overhead due to process interruption at data communication for memory consistency. In this paper, we present the DVSM architecture based on the next generation of an interconnection technique, the InfiniBand Architecture (IBA). Because the IBA supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations in hardware, we can minimize the communication overhead for memory consistency on the DVSM system. For characterizing multithreaded applications on our IBA-based DVSM system, we examined two different shared-memory programming models, i.e. SPMD and OpenMP benchmarks. We show that our DVSM to use full features of the IBA can improve the performance significantly over the IPoIB-based DVSM system in all benchmarks, and also comparable to the bus-based shared-memory multiprocessor system in some benchmarks.  相似文献   

12.
For the past decades computer engineers have focused on building high-performance and large-scale computer systems with low-cost. One of the examples is a distributed-memory computer system like a cluster, where fast processing nodes to use commodity processors are connected through a high speed network. But it is not easy to develop applications on this system, because a programmer must consider all data and control dependences between processes and program them explicitly. For alleviating this problem the distributed virtual shared-memory (DVSM) system has been proposed. It is well known that the performance of the DVSM system highly depends on the network’s performance and programming semantics, and currently its performance is very limited on a conventional network. Recently many advanced hardware-based interconnection technologies have been introduced, and one of them is the InfiniBand Architecture (IBA) which supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations. In this paper, we present the implementation of our InfiniBand-based DVSM system and analyze the performance of SPEC OMP benchmarks in detail by comparing with the DVSM based on the traditional network architecture and the hardware shared-memory multiprocessor (SMP) system. As experiment result, we show that our DVSM system to use full features of the IBA can improve the performance significantly over the IPoIB-based traditional system on the IBA, and furthermore the performance of one application on the IBA-based DVSM system is better than on the hardware SMP.  相似文献   

13.
目前大多数链路预测方法都是针对丢失链路的结构性预测,缺乏针对未来时刻网络链路的时序性预测,为此提出了一种基于频繁闭图关联规则的链路预测方法。将形式化后的动态网络划分为训练集和测试集,基于Apriori思想从训练集中提取频繁闭图,并根据频繁闭图的时间间隔建立时延分布矩阵,用于表征频繁闭图之间的时序关联规则,在此基础上预测测试集中的网络结构。将该方法运用于不同时间尺度下的AS级Internet动态网络中,结果表明,该方法能够以很高的精确率预测波动型动态网络的链路。  相似文献   

14.
Dynamic storage allocation is an important part of a large class of computer programs written in C and C + +. High-performance algorithms for dynamic storage allocation have been, and will continue to be, of considerable interest. This paper presents detailed measurements of the cost of dynamic storage allocation in 11 diverse C and C + + programs using five very different dynamic storage allocation implementations, including a conservative garbage collection algorithm. Four of the allocator implementations measured are publicly available on the Internet. A number of the programs used in these measurements are also available on the Internet to facilitate further research in dynamic storage allocation. Finally, the data presented in this paper is an abbreviated version of more extensive statistics that are also publicly available on the Internet.  相似文献   

15.
Yiu-Wing   《Computer Communications》2006,29(18):3710-3717
Internet telephony is promising for long-distance calls because of its low service charge and value-added functions. To provide Internet telephony to the general public, a service provider can operate a telephone gateway in each servicing city to bridge the local telephone network and the Internet, so that users can use telephones or fax machines to access this gateway for services. In this paper, we propose a dynamic bandwidth allocation scheme for two purposes: (1) each telephone gateway can fully utilize the available bandwidth to serve more telephone and fax sessions and (2) it can respond to the changing environments. We exploit three properties for dynamic bandwidth allocation. First, in a telephone session, each user usually alternates between speaking and listening. When a user is not speaking, she does not send any voice stream and hence the bandwidth can be dynamically released from this session for the other sessions. Second, voice traffic is elastic because it can be further compressed at the cost of a lower quality. Third, fax traffic is flexible because it can be temporarily delayed. We exploit these three properties to allocate bandwidth to telephone and fax sessions dynamically. When a telephone gateway adopts dynamic bandwidth allocation, it can serve more telephone and fax sessions while providing acceptably good quality-of-service (QoS), and it can give more stable QoS when the available bandwidth varies.  相似文献   

16.
万维网Web是Internet上广泛使用的一种服务,它为因特网用户提供了丰富多样的信息资源。随着Web的发展,初期的静态页面已不能满足用户的需求,活动和动态页面成为Web中不可缺少的内容。本文探讨了在Web应用开发中采用服务器端比较流行的ASP来实现动态页面的方法。  相似文献   

17.
Autonomous management of a multi-tier Internet service involves two critical and challenging tasks, one understanding its dynamic behaviors when subjected to dynamic workloads and second adaptive management of its resources to achieve performance guarantees. We propose a statistical machine learning based approach to achieve session slowdown guarantees of a multi-tier Internet service. Session slowdown is the relative ratio of a session’s total queueing delay to its total processing time. It is a compelling performance metric of session-based online transactions because it directly measures user-perceived relative performance and it is independent of the session length. However, there is no analytical model for session slowdown on multi-tier servers. We first conduct training to learn the statistical regression models that quantitatively capture an Internet service’s dynamic behaviors as relationships between various service parameters. Then, we propose a dynamic resource provisioning approach that utilizes the learned regression models to efficiently achieve session slowdown guarantee under dynamic workloads. The approach is based on the combination of offline training and online monitoring of the Internet service behavior. Simulations using the industry standard TPC-W benchmark demonstrate the effectiveness and efficiency of the regression based resource provisioning approach for session slowdown oriented performance guarantee of a multi-tier e-commerce application.  相似文献   

18.
The focus of this paper is on joint feature re-extraction and classification in cases when the training data set is small. An iterative semi-supervised support vector machine (SVM) algorithm is proposed, where each iteration consists both feature re-extraction and classification, and the feature re-extraction is based on the classification results from the previous iteration. Feature extraction is first discussed in the framework of Rayleigh coefficient maximization. The effectiveness of common spatial pattern (CSP) feature, which is commonly used in Electroencephalogram (EEG) data analysis and EEG-based brain computer interfaces (BCIs), can be explained by Rayleigh coefficient maximization. Two other features are also defined using the Rayleigh coefficient. These features are effective for discriminating two classes with different means or different variances. If we extract features based on Rayleigh coefficient maximization, a large training data set with labels is required in general; otherwise, the extracted features are not reliable. Thus we present an iterative semi-supervised SVM algorithm embedded with feature re-extraction. This iterative algorithm can be used to extract these three features reliably and perform classification simultaneously in cases where the training data set is small. Each iteration is composed of two main steps: (i) the training data set is updated/augmented using unlabeled test data with their predicted labels; features are re-extracted based on the augmented training data set. (ii) The re-extracted features are classified by a standard SVM. Regarding parameter setting and model selection of our algorithm, we also propose a semi-supervised learning-based method using the Rayleigh coefficient, in which both training data and test data are used. This method is suitable when cross-validation model selection may not work for small training data set. Finally, the results of data analysis are presented to demonstrate the validity of our approach. Editor: Olivier Chapelle.  相似文献   

19.
In this paper we present the Cataclysm server platform for handling extreme overloads in hosted Internet applications. The primary contribution of our work is to develop a low overhead, highly scalable admission control technique for Internet applications. Cataclysm provides several desirable features, such as guarantees on response time by conducting accurate size-based admission control, revenue maximization at multiple time-scales via preferential admission of important requests and dynamic capacity provisioning, and the ability to be operational even under extreme overloads. Cataclysm can transparently trade-off the accuracy of its decision making with the intensity of the workload allowing it to handle incoming rates of several tens of thousands of requests/second. We implement a prototype Cataclysm hosting platform on a Linux cluster and demonstrate the benefits of our integrated approach using a variety of workloads.  相似文献   

20.
Consistent dynamic PCA based on errors-in-variables subspace identification   总被引:4,自引:0,他引:4  
In this paper, we make a comparison between dynamic principal component analysis (PCA) and errors-in-variables (EIV) subspace model identification (SMI) and establish consistency conditions for the two approaches. We first demonstrate the relationship between dynamic PCA and SMI. Then we show that when process variables are corrupted by measurement noise dynamic PCA fails to give a consistent estimate of the process model in general whether or not process noise is present. We then propose an indirect dynamic PCA approach for the consistent estimate of the process model resorting to EIV SMI algorithms. Consistent dynamic PCA models are obtained with and without process disturbances. Additional features of the indirect approach include (i) easy determination of the number of lagged variables in the model; (ii) determination of the number of significant process disturbances; and (iii) consistent estimate of the dynamic PCA models with and without process disturbances. We conduct two simulation examples and an industrial case study to support our theoretical results, where the relationship between dynamic PCA and EIV SMI is numerically verified.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号