Similar literature
20 similar documents found (search time: 575 ms)
1.
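With the spread of the Internet and the growing demand for specific information, quickly retrieving information of a given category is a problem that search engines and web portals must solve. Web page classification is currently handled by machine-learning text classification algorithms, but traditional methods largely ignore the characteristics of the text data and offer an undifferentiated classification service. The system described here takes the characteristics of web page text into account and uses the page title as its entry point, providing both a fast classification path and an ordinary SVM-based path. Fast classification maps the title directly to a category label using a trained word-segmentation model. If fast classification fails, the page content is segmented with the Jieba tokenizer, word vectors are trained with word2vec, and an SVM then classifies the text according to the classification requirements. Offering these two services lets the system meet different needs.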

2.
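Research and implementation of text representation and real-time classification based on an N-gram Chinese character string model   Total citations: 4, self-citations: 0, citations by others: 4
This paper proposes a vector space representation model for text based on N-gram Chinese character string features and uses it to build a real-time text classification system. Compared with a vector space model that uses words as features, the new model relies on fast multi-keyword matching rather than complex computation such as word segmentation, so it can classify text in real time. Because feature extraction in the N-gram character string model needs no dictionary-based segmentation, it can also extract phrase structures that are not words; in special applications, such as identifying harmful information on the network, it can automatically extract some better feature terms. Experimental results show that the choice between simple multi-keyword matching and complex word segmentation has very little effect on classification quality. The study indicates that N-gram character string features and word features have essentially the same representational power for classification, while a classification system based on N-gram character strings can deliver several times the performance of a segmentation-based system. The paper also describes the automatic text classification system built on this model, including its architecture, feature extraction, and text similarity formula, and gives the evaluation method and experimental results.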
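
The N-gram character string representation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's system: it uses a plain sliding window instead of the fast multi-keyword matching the abstract mentions, and the names `char_ngrams`, `ngram_vector`, and the sample `vocabulary` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return every run of n consecutive characters in text (no segmentation)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_vector(text, vocabulary, n=2):
    """Count how often each n-gram in vocabulary occurs in text."""
    counts = Counter(char_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]


# Hypothetical feature list; a real system would select it from training data.
vocabulary = ["文本", "分类", "实时"]
vector = ngram_vector("文本实时分类和文本表示", vocabulary)
```

Because the window slides one character at a time, strings that are not dictionary words still become candidate features, which is the property the abstract highlights.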

3.
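For the problem of classifying text information, this paper proposes MMSOM, an algorithm that combines forward maximum matching word segmentation with a self-organizing map neural network. Forward maximum matching segmentation extracts information from the text automatically, and a normalization framework is defined for the keyword information. The quantified, normalized text information is fed to the neural network and combined with the segmentation results, so that information extraction and classification of the target objects are both automated. The algorithm is applied to classifying experts in the field of algal blooms, and the classification results demonstrate its feasibility and effectiveness.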
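
Forward maximum matching, the segmentation step named above, can be sketched as below. The dictionary, the `max_len` limit, and the function name are assumptions made for illustration; the abstract does not give the paper's dictionary or parameters.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text left to right, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary for the algal-bloom example domain.
dictionary = {"藻类", "水华", "专家", "分类"}
tokens = forward_max_match("藻类水华专家分类", dictionary)
```

Characters that match no dictionary entry fall through as single-character tokens, which keeps the function total on arbitrary input.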

4.
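In recent years, as the pace of life has quickened and the Internet has developed rapidly, people increasingly communicate through short texts on social platforms. Some users may post spam text that interferes with normal social interaction and pollutes the online environment. To address this problem, we propose a method for detecting spam text on social platforms based on TF-IDF and an improved BP neural network, which filters spam text on such platforms. First, a keyword data set is built through Jieba word segmentation and stop-word removal. Second, the weight of each keyword in the keyword vector that represents a text is computed and used to reduce the dimensionality of the text vector, yielding a feature vector. Finally, a BP neural network classifier classifies the short texts on this basis, detecting and filtering spam. Experimental results show that with 1000-dimensional text feature vectors the method reaches an average classification accuracy of 97.720%.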
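
As a rough illustration of the keyword-weighting step, the sketch below computes TF-IDF weights for already-segmented documents. It assumes one common variant (relative term frequency times the natural log of N over document frequency); the abstract does not state which formula the authors use, and `tf_idf` and the sample `docs` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (documents must be non-empty)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical pre-segmented short texts.
docs = [["免费", "领取", "红包"], ["今天", "天气", "不错"], ["免费", "天气", "预报"]]
weights = tf_idf(docs)
```

Keeping only the highest-weighted terms per document is one simple way to reduce the vector to a fixed dimension before it is passed to a classifier.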

5.
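Existing text similarity measures based on semantic knowledge rules suffer from high time complexity. To address this limitation, a text similarity measure based on a classification dictionary is proposed. The Chinese lexical analysis system ICTCLAS segments the text, TF×IDF extracts the keywords, the classification dictionary is traversed to obtain a code for each keyword, and the similarity between the original texts is measured by how close their keyword codes are. Similarity measures from two categories, those based on semantic knowledge rules and those based on statistics, serve as baselines, and each measure is evaluated through conventional clustering and KNN classification. Numerical experiments show that the new method achieves good results in both the clustering and the classification experiments and is more time-efficient than other similarity measures based on semantic analysis.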

6.
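Chinese web page classification based on statistical word segmentation   Total citations: 9, self-citations: 3, citations by others: 9
This paper applies a statistics-based two-character word segmentation method to Chinese web page classification. Without any prior lexicon, a table of two-character words is built from statistics, the text of each web page is segmented with it, and the pages are then classified. Text on the Internet varies considerably in wording style and type across genres and sources, new words appear constantly, and large amounts of text of the same type are easy to obtain as training corpora; all of this makes statistical segmentation feasible. Experiments test the effect of the statistically built two-character word table on Chinese web page classification. They show that, when the statistical threshold is chosen appropriately, segmenting with the constructed word table effectively improves classification accuracy. The paper also analyzes how single characters and segmented words affect text classification differently, and why.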

7.
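This paper proposes a vector space model that combines keyword features with co-occurring word-pair features. First, candidate keywords are extracted from the text through word segmentation and stop-word removal, and keyword features are selected by document frequency. Next, candidate co-occurring word pairs are formed from every pair of the selected keyword features, and support and confidence measures are defined to select the word-pair features. Finally, the keyword features and the word-pair features are combined to build the vector space model. Text classification experiments show that the proposed model has stronger classification ability.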
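
The abstract does not give its definitions of support and confidence, so the sketch below borrows the usual association-rule ones as an assumption: support is the fraction of documents containing both words, and confidence is the weaker of the two conditional frequencies. The thresholds, the function name, and the sample data are all hypothetical.

```python
from itertools import combinations


def cooccurrence_pairs(docs, keywords, min_support=0.5, min_confidence=0.6):
    """Select keyword pairs that co-occur often enough across documents."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)
    # Number of documents containing each keyword.
    df = {k: sum(k in d for d in doc_sets) for k in keywords}
    selected = []
    for a, b in combinations(sorted(keywords), 2):
        both = sum(a in d and b in d for d in doc_sets)
        if both == 0:
            continue
        support = both / n_docs
        # Assumed definition: the weaker direction of P(b|a) and P(a|b).
        confidence = min(both / df[a], both / df[b])
        if support >= min_support and confidence >= min_confidence:
            selected.append((a, b))
    return selected


# Hypothetical pre-segmented documents and keyword set.
docs = [["股票", "上涨", "市场"], ["股票", "市场", "下跌"],
        ["球队", "比赛", "市场"], ["股票", "上涨"]]
pairs = cooccurrence_pairs(docs, {"股票", "市场", "上涨"})
```

Each selected pair can then be appended to the keyword vocabulary as an extra dimension of the vector space model.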

8.
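How to organize massive amounts of information quickly and classify different texts effectively has become the bottleneck in obtaining valuable information. The Chinese text classification method proposed in this paper addresses real-time classification of information well and has performed well in practice. Because of the particular nature of Chinese text, the training texts are automatically segmented and reduced in dimensionality before the classifier is trained. Many texts can belong to more than one class, so the fuzzy c-prototypes algorithm is used for classification. Experiments show that the method performs well overall and can classify text quickly.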

9.
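This paper describes the design and implementation of a Chinese text classification system, covering its architecture, feature extraction, training algorithm, and classification algorithm in detail. It applies a statistics-based two-character word segmentation method to Chinese text classification and proposes a classification method based on the statistical properties of single-character and two-character words in Chinese. Without any prior lexicon, tables of single characters and two-character words are built from statistics, the text is segmented with them, and the text is then classified.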

10.
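Research and implementation of a real-time text classification system based on a suffix tree model   Total citations: 8, self-citations: 1, citations by others: 8
With network content analysis as the target application, this paper proposes a suffix-tree-based vector space model (VSM) for text and implements a text classification system on top of it. Compared with a word-based VSM, the model uses the fast matching of a suffix tree to obtain the vector representation of a text in real time, without complex computation such as word segmentation or feature extraction. The model also ensures that changes to the texts in the training set affect classification results in real time. Experimental results and algorithm analysis show that the text preprocessing time complexity of the system is O(N), far better than that of a segmentation-based system. In addition, because neither segmentation nor feature extraction is needed, the classification process does not depend on any particular language, making it a language-independent classification method.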

11.
Interval type-2 fuzzy sets (IT2 FSs) play a central role in fuzzy sets as models for words and in engineering applications of T2 FSs. These fuzzy sets are characterized by their footprints of uncertainty (FOU), which in turn are characterized by their boundaries: upper and lower membership functions (MFs). The centroid of an IT2 FS, which is an interval type-1 FS, provides a measure of the uncertainty in the IT2 FS. The main purpose of this paper is to quantify the centroid of a non-symmetric IT2 FS with respect to geometric properties of its FOU. This is very important because interval data collected from subjects about words suggest that the FOUs of most words are non-symmetrical. Using the results in this paper, it is possible to formulate and solve forward problems, i.e., to go from parametric non-symmetric IT2 FS models to data with associated uncertainty bounds. We provide some solutions to such problems for non-symmetrical triangular, trapezoidal, Gaussian and shoulder FOUs.
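
For a discretized FOU, the interval centroid can be computed numerically. The sketch below is not the paper's geometric bounds; it is a brute-force illustration that relies on the known property that each endpoint is reached at a switch point between the upper and lower MFs, and simply tries every switch point. The function name and the sample grades are assumptions.

```python
def it2_centroid(xs, lower, upper):
    """Return (c_left, c_right) for a sampled IT2 FS.

    xs is sorted ascending; lower[i] <= upper[i] are the lower and upper
    membership grades at xs[i], given as lists.
    """
    def weighted_mean(weights):
        total = sum(weights)
        if total == 0:
            return None  # all-zero weights define no centroid
        return sum(x * w for x, w in zip(xs, weights)) / total

    lefts, rights = [], []
    for k in range(len(xs) + 1):
        # Left endpoint: upper grades on the first k points, lower grades after.
        left = weighted_mean(upper[:k] + lower[k:])
        # Right endpoint: lower grades on the first k points, upper grades after.
        right = weighted_mean(lower[:k] + upper[k:])
        if left is not None:
            lefts.append(left)
        if right is not None:
            rights.append(right)
    return min(lefts), max(rights)


# Hypothetical symmetric triangular FOU sampled at five points.
xs = [0, 1, 2, 3, 4]
upper = [0.2, 0.6, 1.0, 0.6, 0.2]
lower = [0.1, 0.3, 0.5, 0.3, 0.1]
c_left, c_right = it2_centroid(xs, lower, upper)
```

The width `c_right - c_left` serves as the uncertainty measure the abstract refers to; it collapses to zero when the lower and upper MFs coincide.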

12.
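Computing with words for generalized interval type-2 fuzzy sets   Total citations: 3, self-citations: 1, citations by others: 2
Mo Hong, Wang Tao. Acta Automatica Sinica (《自动化学报》), 2012, 38(5): 707-715
An ordinary fuzzy set is a type-1 fuzzy set whose point values are two-dimensional, whereas a type-2 fuzzy set (T2 FS) has three-dimensional point values, which makes a T2 FS harder to understand and compute with than its type-1 counterpart. To help readers understand T2 FSs and to broaden their application, this paper proposes a definition of generalized interval type-2 fuzzy sets (GIT2 FS) and divides them into three classes: discrete, semi-discrete, and continuous. The corresponding mathematical expressions and extension-principle formulas are given for each class, and computing with words for GIT2 FSs is derived under two different fuzzy logic operators.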

13.
In Part 1 of this two-part paper, we bounded the centroid of a symmetric interval type-2 fuzzy set (T2 FS), and consequently its uncertainty, using geometric properties of its footprint of uncertainty (FOU). We then used these bounds to solve forward problems, i.e., to go from parametric interval T2 FS models to data. The main purpose of the present paper is to formulate and solve inverse problems, i.e., to go from uncertain data to parametric interval T2 FS models, which we call type-2 fuzzistics. Given interval data collected from people about a phrase, and the inherent uncertainties associated with that data, which can be described statistically using the first- and second-order statistics about the end-point data, we establish parametric FOUs such that their uncertainty bounds are directly connected to statistical uncertainty bounds. These results should find applicability in computing with words.

14.
Interval type-2 fuzzy sets (T2 FSs) play a central role in fuzzy sets as models for words and in engineering applications of T2 FSs. These fuzzy sets are characterized by their footprints of uncertainty (FOU), which in turn are characterized by their boundaries: upper and lower membership functions (MFs). In this two-part paper, we focus on symmetric interval T2 FSs for which the centroid (which is an interval type-1 FS) provides a measure of its uncertainty. Intuitively, we anticipate that geometric properties of the FOU, such as its area and the centers of gravity (centroids) of its upper and lower MFs, will be associated with the amount of uncertainty in such a T2 FS. The main purpose of this paper (Part 1) is to demonstrate that our intuition is correct and to quantify the centroid of a symmetric interval T2 FS, and consequently its uncertainty, with respect to such geometric properties. It is then possible, for the first time, to formulate and solve forward problems, i.e., to go from parametric interval T2 FS models to data with associated uncertainty bounds. We provide some solutions to such problems. These solutions are used in Part 2 to solve some inverse problems, i.e., to go from uncertain data to parametric interval T2 FS models (T2 fuzzistics).

15.
In the research domain of intelligent buildings and smart homes, modeling and optimization of thermal comfort and energy consumption are important issues. This paper presents a data-driven strategy based on a type-2 fuzzy method for modeling and optimizing thermal comfort words and energy consumption. First, we propose a methodology to convert interval survey data on thermal comfort words to interval type-2 fuzzy sets (IT2 FSs), which can reflect the inter-personal and intra-personal uncertainties contained in the intervals. This data-driven strategy includes three steps: survey data collection and pre-processing, ambiguity-preserving conversion of the survey intervals to their representative type-1 fuzzy sets (T1 FSs), and IT2 FS modeling. Then, using the IT2 FS models of thermal comfort words as antecedent parts, an evolving type-2 fuzzy model is constructed to reflect the energy consumption data observed online. Finally, a multiobjective optimization model is presented to recommend a reasonable temperature range that feels comfortable while reducing energy consumption. The proposed method can be used to realize a comfortable but energy-saving environment in smart homes or intelligent buildings.

16.
17.
Interval Type-2 Fuzzy Logic Systems Made Simple   Total citations: 9, self-citations: 0, citations by others: 9
To date, because of the computational complexity of using a general type-2 fuzzy set (T2 FS) in a T2 fuzzy logic system (FLS), most people only use an interval T2 FS, the result being an interval T2 FLS (IT2 FLS). Unfortunately, there is a heavy educational burden even to using an IT2 FLS. This burden has to do with first having to learn general T2 FS mathematics, and then specializing it to IT2 FSs. In retrospect, we believe that requiring a person to use T2 FS mathematics represents a barrier to the use of an IT2 FLS. In this paper, we demonstrate that it is unnecessary to take the route from general T2 FS to IT2 FS, and that all of the results that are needed to implement an IT2 FLS can be obtained using T1 FS mathematics. As such, this paper is a novel tutorial that makes an IT2 FLS much more accessible to all readers of this journal. We can now develop an IT2 FLS in a much more straightforward way.

18.
Early detection of ventricular fibrillation (VF) is crucial for the success of defibrillation therapy in automatic devices. A large number of detectors have been proposed based on temporal, spectral, and time-frequency parameters extracted from the surface electrocardiogram (ECG), always showing limited performance. Combining ECG parameters from different domains (time, frequency, and time-frequency) using machine learning algorithms has been used to improve detection efficiency. However, the potential use of a large number of parameters in machine learning schemes has raised the need for efficient feature selection (FS) procedures. In this study, we propose a novel FS algorithm based on support vector machine (SVM) classifiers and bootstrap resampling (BR) techniques. We define a backward FS procedure that relies on evaluating changes in SVM performance when removing features from the input space. This evaluation is achieved according to a nonparametric statistic based on BR. After simulation studies, we benchmark the performance of our FS algorithm on the AHA and MIT-BIH ECG databases. Our results show that the proposed FS algorithm outperforms the recursive feature elimination method in synthetic examples, and that the VF detector performance improves with the reduced feature set.

19.
The focus of this paper is the linguistic weighted average (LWA), where the weights are always words modeled as interval type-2 fuzzy sets (IT2 FSs), and the attributes may also (but do not have to) be words modeled as IT2 FSs; consequently, the output of the LWA is an IT2 FS. The LWA can be viewed as a generalization of the fuzzy weighted average (FWA) where the type-1 fuzzy inputs are replaced by IT2 FSs. This paper presents the theory, algorithms, and an application of the LWA. It is shown that finding the LWA can be decomposed into finding two FWAs. Since the LWA can model more uncertainties, it should have wide applications in distributed and hierarchical decision-making.

20.
This paper presents a very practical type-2-fuzzistics methodology for obtaining interval type-2 fuzzy set (IT2 FS) models for words, one that is called an interval approach (IA). The basic idea of the IA is to collect interval endpoint data for a word from a group of subjects, map each subject's data interval into a prespecified type-1 (T1) person membership function, interpret the latter as an embedded T1 FS of an IT2 FS, and obtain a mathematical model for the footprint of uncertainty (FOU) for the word from these T1 FSs. The IA consists of two parts: the data part and the FS part. In the data part, the interval endpoint data are preprocessed, after which data statistics are computed for the surviving data intervals. In the FS part, the data are used to decide whether the word should be modeled as an interior, left-shoulder, or right-shoulder FOU. Then, the parameters of the respective embedded T1 MFs are determined using the data statistics and uncertainty measures for the T1 FS models. The derived T1 MFs are aggregated using union, leading to an FOU for a word, and finally, a mathematical model is obtained for the FOU. In order that all researchers can either duplicate our results or use them in their research, the raw data used for our codebook examples, as well as a MATLAB M-file for the IA, have been put on the Internet at: http://sipi.usc.edu/~mendel.
