Similar Documents
20 similar documents found.
1.
Spam mail classification is considered a complex and error-prone task in the distributed computing environment. Various spam mail classification approaches are available, such as the naive Bayesian classifier, logistic regression, support vector machines, decision trees, recursive neural networks, and long short-term memory algorithms. However, they do not consider the document as a whole when analyzing spam mail content. These approaches use the bag-of-words method, which analyzes a large amount of text data and classifies features with the help of term frequency-inverse document frequency. Because there are many words in a document, these approaches consume a massive amount of resources and become infeasible when classifying multiple associated mail documents together. Thus, spam mail is not fully classified, and these approaches leave loopholes. We therefore propose a term frequency topic inverse document frequency model that considers the meaning of text data in a larger semantic unit by applying weights based on the document's topic. Moreover, the proposed approach reduces the sparsity problem through a frequency topic-inverse document frequency in singular value decomposition model. It also reduces the dimensionality, which ultimately strengthens document classification. Experimental evaluations show that the proposed approach classifies spam mail documents with higher accuracy using individual document-independent processing computation. Comparative evaluations show that it outperforms the logistic regression model in the distributed computing environment, with higher document word frequencies of 97.05%, 99.17% and 96.59%.
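The bag-of-words weighting this abstract builds on can be sketched in a few lines. This is plain TF-IDF, not the paper's topic-weighted variant, and the toy documents are invented for illustration:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return weights

docs = [["win", "money", "now"], ["meeting", "agenda", "now"]]
w = tfidf(docs)
# "money" occurs in only one document, so it receives a positive weight there;
# "now" occurs in every document, so its IDF (and weight) is zero
```

Terms shared by all documents get zero weight, which is exactly the discriminative behavior the abstract's approaches rely on.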

2.
Understanding a word in context relies on a cascade of perceptual and conceptual processes, starting with modality-specific input decoding, and leading to the unification of the word's meaning into a discourse model. One critical cognitive event, turning a sensory stimulus into a meaningful linguistic sign, is the access of a semantic representation from memory. Little is known about the changes that activating a word's meaning brings about in cortical dynamics. We recorded the electroencephalogram (EEG) while participants read sentences that could contain a contextually unexpected word, such as 'cold' in 'In July it is very cold outside'. We reconstructed trajectories in phase space from single-trial EEG time series, and we applied three nonlinear measures of predictability and complexity to each side of the semantic access boundary, estimated as the onset time of the N400 effect evoked by critical words. Relative to controls, unexpected words were associated with larger prediction errors preceding the onset of the N400. Accessing the meaning of such words produced a phase transition to lower entropy states, in which cortical processing becomes more predictable and more regular. Our study sheds new light on the dynamics of information flow through interfaces between sensory and memory systems during language processing.

3.
In bibliometric research, keyword analysis of publications provides an effective way not only to investigate the knowledge structure of research domains, but also to explore the developing trends within domains. To identify the most representative keywords, many approaches have been proposed. Most of them focus on using statistical regularities, syntax, grammar, or network-based characteristics to select representative keywords for the domain analysis. In this paper, we argue that the domain knowledge is reflected by the semantic meanings behind keywords rather than the keywords themselves. We apply the Google Word2Vec model, a model of a word distribution using deep learning, to represent the semantic meanings of the keywords. Based on this work, we propose a new domain knowledge approach, the Semantic Frequency-Semantic Active Index, similar to Term Frequency-Inverse Document Frequency, to link domain and background information and identify infrequent but important keywords. We adopt a semantic similarity measuring process before statistical computation to compute the frequencies of “semantic units” rather than keyword frequencies. Semantic units are generated by word vector clustering, while the Inverse Document Frequency is extended to include the semantic inverse document frequency; thus only words in the inverse documents with a certain similarity will be counted. Taking geographical natural hazards as the domain and natural hazards as the background discipline, we identify the domain-specific knowledge that distinguishes geographical natural hazards from other types of natural hazards. We compare and discuss the advantages and disadvantages of the proposed method in relation to existing methods, finding that by introducing the semantic meaning of the keywords, our method supports more effective domain knowledge analysis.
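Counting "semantic units" instead of raw keywords can be sketched with a greedy cosine-similarity grouping. The threshold, the toy vectors, and the greedy scheme are illustrative assumptions, not the paper's actual Word2Vec clustering:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_units(vectors, threshold=0.8):
    """Greedy grouping: each word joins the first unit whose
    representative (first) word is similar enough, else starts a new unit."""
    units = []
    for word, vec in vectors.items():
        for unit in units:
            if cosine(vectors[unit[0]], vec) >= threshold:
                unit.append(word)
                break
        else:
            units.append([word])
    return units

# invented 2-D "embeddings" for three hazard-related words
toy = {
    "flood":      [0.90, 0.10],
    "inundation": [0.88, 0.15],
    "earthquake": [0.10, 0.95],
}
units = semantic_units(toy)
# "flood" and "inundation" collapse into one semantic unit;
# "earthquake" forms its own, so unit frequencies replace word frequencies
```

Counting occurrences per unit rather than per surface form is what lets near-synonyms contribute to a single frequency, as the abstract describes.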

4.
We apply the semi-supervised recursive autoencoder (RAE) model to the sentiment classification task of Tibetan short text and obtain a better classification effect. The input of the semi-supervised RAE model is the word vector. We crawled a large amount of Tibetan text from the Internet, obtained Tibetan word vectors using Word2vec, and verified their validity through simple experiments. The value of the parameter α and the word vector dimension strongly affect model performance. The experimental results indicate that the model works best when α is 0.3 and the word vector dimension is 60. Our experiments also show the effectiveness of the semi-supervised RAE model for the Tibetan sentiment classification task and suggest the validity of the Tibetan word vectors we trained.

5.
刘保旗  林丽  郭主恩 《包装工程》2024,45(2):110-117
Objective: To address the high time cost of imagery experiments and the chance effects of small samples in traditional Kansei (affective) design research, user imagery cognition is extracted from existing online review text. Methods: First, a large-scale corpus of car-exterior review text was crawled, a semantic-analysis lexicon was built, and a word2vec word-vector model was constructed. Then, semantic relations within the lexicon were obtained from the model, and the semantic dispersion among high-frequency key adjectives was computed to build a space of representative imagery words. Finally, reviews were mapped into the imagery-word space through semantic quantitative matching, yielding large-scale users' salient imagery representations of each car model and the car-exterior matching results for specified imagery words. Results: The salient exterior imagery mined by this method showed no significant difference from, and a high correlation with, experimental results based on manual evaluation, demonstrating the method's validity. Conclusion: Mining user imagery cognition in this way exploits the existing large volume of user feedback, improves the efficiency of imagery analysis, and helps decision makers quickly understand consumers' affective knowledge of car exteriors, so that products better match market expectations across design iterations. Compared with related studies, semantic quantitative matching requires no dimensionality reduction or clustering of very high-dimensional vectors, avoiding the loss of semantic relations among word vectors that feature reduction may cause, and thus yields more accurate imagery-mining results.

6.
In this paper, we investigate whether a semantic representation of patent documents provides added value for a multi-dimensional visual exploration of a patent landscape compared to traditional approaches that use tf–idf (term frequency–inverse document frequency). Word embeddings from a pre-trained word2vec model created from patent text are used to calculate pairwise similarities in order to represent each document in the semantic space. Then, a hierarchical clustering method is applied to create several semantic aggregation levels for a collection of patent documents. For visual exploration, we have seamlessly integrated multiple interaction metaphors that combine semantics and additional metadata for improving hierarchical exploration of large document collections.
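The semantic aggregation levels come from hierarchical clustering over pairwise document similarities. A minimal single-linkage sketch in pure Python, using invented 2-D "document vectors" and cosine distance (the paper's actual linkage criterion and embeddings are not specified here):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, used as the pairwise document distance."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (nu * nv)

def single_linkage(points, names):
    """Return the merge order of single-linkage agglomerative clustering."""
    clusters = [[n] for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i][:], clusters[j][:]))
        clusters[i] = clusters[i] + clusters[j]   # merge, keep order stable
        del clusters[j]
    return merges

docs = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
order = single_linkage(docs, list(docs))
# the two near-parallel documents p1 and p2 merge first; p3 joins last
```

Each entry of the merge order corresponds to one aggregation level of the dendrogram that a landscape visualization can expose.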

7.
马全福 《包装工程》2021,42(18):274-281
Objective: Visuality and spatiality are the most important characteristics of narrative typography design; participating in the narrative process entirely in visual form is its essence. This work analyzes the visual narrative characteristics of written characters and develops a workable set of narrative design methods. Methods: All character types are comprehensively surveyed, examining the origins, development, and application forms of writing; closely attending to the visual, the visual narrative characteristics and functions of each character type are analyzed, and on that basis the narrative design methods for each type are studied. Conclusion: Constructing and interpreting narrative texts on a visual footing is the foundation of research on narrative typography and applies to all character types and their combinations. Narrative typography can be studied on two levels, narrative character design and the narrative design of typographic layout: the former mainly comprises three methods (form-meaning intertextuality, form-sound-meaning combination, and added thematic context), while the latter comprises two (contextual layout and image layout).

8.
Word vector representation is widely used in natural language processing tasks. Most word vectors are generated from probability models, and their bag-of-words features have two major weaknesses: they lose the ordering of the words and they ignore the semantics of the words. Recently, the neural-network language models CBOW and Skip-Gram were developed as continuous-space language models that represent words as high-dimensional real-valued vectors. These vector representations have demonstrated promising results in various NLP tasks because of their superiority in capturing syntactic and contextual regularities in language. In this paper, we propose a new strategy based on optimization over contiguous subsets of documents and a regression method for combining vectors, establishing two new models, CBOW-OR and SkipGram-OR, for word vector learning. Experimental results show that for some word pairs, the cosine distance obtained by the CBOW-OR (or SkipGram-OR) model is generally larger and more reasonable than that of CBOW (or Skip-Gram); the vector spaces of Skip-Gram and SkipGram-OR keep the same structural properties in Euclidean distance; and the SkipGram-OR model performs better overall at retrieving related word pairs. Both CBOW-OR and SkipGram-OR are inherently parallel models and can be expected to apply to large-scale information processing.

9.
Traditional topic models have been widely used for analyzing semantic topics in electronic documents. However, the topic words they acquire have obvious defects: poor readability and consistency, so that only domain experts can plausibly guess their meaning. In fact, phrases are the main unit by which people express semantics. This paper presents the Distributed Representation-Phrase Latent Dirichlet Allocation (DRPhrase LDA), a phrase topic model. Specifically, we enhance the semantic information of phrases via distributed representation in this model. The experimental results show that the topics acquired by our model are more readable and consistent than those of other similar topic models.

10.
11.
In English patent document information retrieval, multi-word terms (MWTs) are an important factor in determining how relevant a patent document is to a particular search query. Detecting the correct boundaries of these MWTs is no trivial task and is often complicated by the special writing style of the patent domain. In this paper we describe a method for detecting MWTs in patent sentences based on a deep learning method for detecting technical entities. On our annotated dataset of 22 patents, our method achieved an average precision of 0.75, an average recall of 0.74 and an average F1 score of 0.74. Further, we argue for the use of domain-specific word embedding resources and suggest that our model mostly learns whether individual words should be included in MWTs or not.

12.
The reviewer recommendation problem in research usually refers to inviting experts to comment on the quality of papers, proposals, etc. How to effectively and accurately recommend reviewers for submitted papers and proposals is a meaningful and still difficult task. At present, many unsupervised recommendation methods have been studied for this task. In this paper, a novel classification method named Word Mover's Distance-Constructive Covering Algorithm (WMD-CCA, for short) is proposed to solve the reviewer recommendation problem as a classification issue. A submission or a reviewer is described by tags such as keywords and research interests. First, each tag describing a submission or a reviewer is represented as a vector by a word embedding method. Second, the Word Mover's Distance (WMD, for short) is used to measure the minimum distances between submissions and reviewers. Papers usually carry research field information, and exploiting it can improve reviewer recommendation accuracy. So finally, the reviewer recommendation task is transformed into a classification problem solved by a supervised learning method, the Constructive Covering Algorithm (CCA, for short). Comparative experiments on 4 public datasets and a synthetic dataset from Baidu Scholar show that the proposed WMD-CCA effectively solves the reviewer recommendation task as a classification issue and improves recommendation accuracy.
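The WMD between two tag sets is an optimal-transport cost over word vectors. A common cheap stand-in is the "relaxed" WMD lower bound, where each word simply travels to its nearest neighbor in the other document. A sketch with invented toy vectors (this is the relaxation, not the full WMD used in the paper):

```python
import math

def euclid(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(doc_a, doc_b, vectors):
    """Lower bound on the Word Mover's Distance: every word in doc_a
    moves to its nearest word in doc_b, with no flow constraints."""
    return sum(min(euclid(vectors[w], vectors[u]) for u in doc_b)
               for w in doc_a) / len(doc_a)

# toy 2-D embeddings; in practice these come from a trained word embedding
vecs = {
    "obama":     [1.0, 0.0], "president": [0.9, 0.2],
    "speaks":    [0.0, 1.0], "greets":    [0.1, 0.9],
}
d = relaxed_wmd(["obama", "speaks"], ["president", "greets"], vecs)
# small distance: each word has a close semantic counterpart
```

The relaxed bound is asymmetric and looser than true WMD, but it preserves the intuition that two tag sets are close when each tag has a nearby counterpart.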

13.
Given the widespread use of social media in daily life, sentiment analysis has become an important field in pattern recognition and Natural Language Processing (NLP). In this field, users' feedback on a specific issue is evaluated and analyzed. Detecting emotions within text is therefore considered one of the important challenges of current NLP research. Emotions have been widely studied in psychology and behavioral science, as they are an integral part of human nature; they describe a state of mind with distinct behaviors, feelings, thoughts and experiences. The main objective of this paper is to propose a new model named BERT-CNN to detect emotions in text. The model combines Bidirectional Encoder Representations from Transformers (BERT) and Convolutional Neural Networks (CNN) for text classification: BERT trains the word semantic representation language model, the semantic vector is dynamically generated according to the word context, and the vector is then fed into the CNN to predict the output. A comparative study showed that the BERT-CNN model surpasses the state-of-the-art baseline performance of models in the literature on the SemEval-2019 Task 3 and ISEAR datasets, achieving an accuracy of 94.7% and an F1-score of 94% on SemEval-2019 Task 3, and an accuracy of 75.8% and an F1-score of 76% on ISEAR.
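The CNN stage in such a model slides filters over the sequence of contextual vectors and max-pools each filter's responses into one feature. A dependency-free sketch of a single filter, using toy 2-D embeddings and an invented kernel rather than trained BERT outputs:

```python
def conv1d_maxpool(seq, kernel):
    """Valid 1-D convolution of a filter over a sequence of word vectors,
    followed by max-over-time pooling, as in CNN text classifiers."""
    k = len(kernel)
    feats = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        feats.append(sum(w * x
                         for row, krow in zip(window, kernel)
                         for x, w in zip(row, krow)))
    return max(feats)

# toy contextual embeddings for a 4-token sentence (2 dimensions each)
sent = [[0.2, 0.1], [0.9, 0.8], [0.9, 0.7], [0.1, 0.0]]
kernel = [[1.0, 1.0], [1.0, 1.0]]   # width-2 filter summing both dimensions
feat = conv1d_maxpool(sent, kernel)
# the filter responds most strongly to the high-activation middle bigram
```

A real model runs many such filters of several widths in parallel and feeds the pooled features to a dense softmax layer.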

14.
Understanding semantic word shifts in scientific domains is essential for facilitating interdisciplinary communication. Using a dataset of published papers in the field of information retrieval (IR), this paper studies the semantic shifts of words in IR by mining per-word topic distributions over time. We propose that semantic word shifts occur not only over time, but also over topics. The shifts are examined from two perspectives, the topic level and the context level. According to the over-time word-topic distribution, stable and unstable words are recognized. The diverging and converging trends among the unstable words reveal characteristics of the topic evolution process. The context-level shifts are further detected by similarities between word vectors. Our work associates semantic word shifts with the evolution of topics, which facilitates a better understanding of semantic word shifts from both topics and contexts.

15.
A strategy for word-sense problems based on the vector space model and the maximum entropy model
For the word-sense problem of monosemous words, a vector space model incorporating trigger pairs was built to compute word-sense similarity, and words were clustered on that basis. For the word-sense problem of polysemous words, a maximum entropy model incorporating long-distance context was applied to supervised word sense disambiguation. To overcome the limited coverage and strong subjective influence of previous word-sense disambiguation evaluations, which relied on manually constructed sense-tagged test sentences, the models were evaluated directly in two practical applications: word clustering and word-segmentation ambiguity. The resolution of segmentation ambiguity reached an accuracy of 92%, and the word clustering results met the needs of further applications.

16.
Text classification has long been a crucial topic in natural language processing. Traditional text classification methods based on machine learning have many disadvantages, such as dimension explosion, data sparsity, and limited generalization ability. This paper presents an extensive study of deep learning text classification models, including Convolutional Neural Network-based (CNN-based), Recurrent Neural Network-based (RNN-based), and attention-mechanism-based models. Many studies have shown that text classification methods based on deep learning outperform traditional methods when processing large-scale and complex datasets, mainly because they avoid the cumbersome feature extraction process and achieve higher prediction accuracy on large sets of unstructured data. We also summarize the shortcomings of traditional text classification methods and describe the deep learning text classification process, including text preprocessing, distributed representation of text, construction of deep learning classification models, and performance evaluation.

17.
Text mining has become a major research area, in which text classification is the important task of finding relevant information in new documents. Accordingly, this paper presents a semantic word processing technique for text categorization that uses semantic keywords instead of treating the keywords in the documents as independent features; hence, the dimensionality of the search space can be reduced. The Back Propagation Lion algorithm (BPLion) is also proposed to overcome the problem of updating neuron weights. The proposed text classification methodology is evaluated on two datasets, 20 Newsgroups and Reuters. The performance of the proposed BPLion is analysed in terms of sensitivity, specificity, and accuracy, and compared with that of existing works. The results show that the proposed BPLion algorithm and semantic processing methodology classify documents with less training time and a higher classification accuracy of 90.9%.

18.
Xiaoling Sun  Kun Ding 《Scientometrics》2018,116(3):1735-1748
Knowledge memes are the cultural equivalent of genes and play an important role in the evolution of knowledge. In this paper, we identify and track scientific and technological knowledge memes and infer the relationship between science and technology at the micro level. The new carbon nanomaterial graphene is taken as an example, with publications and patents used as data sources representing science and technology respectively. Citation networks of publications and patents are constructed, on which a knowledge meme discovery algorithm identifies memes that play a key role in the evolution of scientific and technological knowledge. The diffusion and co-occurrence of knowledge memes are then shown, and a word embedding model is used to track the semantic change of the memes. The research can provide guidance for promoting knowledge innovation and making research policy.

19.
In recent years, many text summarization models based on pre-training methods have achieved very good results. However, in these models semantic deviations easily arise between the original input representation and the representation produced by the multi-layer encoder, which may cause inconsistencies between the generated summary and the source text. Bidirectional Encoder Representations from Transformers (BERT) improves the performance of many tasks in Natural Language Processing (NLP), but although BERT has a strong capability to encode context, it lacks fine-grained semantic representation. To solve these two problems, we propose a semantic supervision method based on Capsule Networks. First, we extract fine-grained semantic representations of both the input and the BERT encoding result with a Capsule Network. Second, we use the fine-grained semantic representation of the input to supervise that of the encoded result. We evaluated our model on a popular Chinese social media dataset (LCSTS): it achieved higher ROUGE scores (including R-1 and R-2) and outperformed baseline systems. Finally, a comparative study of model stability showed that our model is more stable.

20.
For the text commands produced by speech recognition in human-robot voice interaction, a deep learning text classification model is proposed that uses the initials and finals (声母/韵母) of Hanyu Pinyin as features. First, taking voice navigation control of a driverless vehicle as the human-robot interaction setting, the structure of its text commands is analyzed, and corpora of single-intent and complex-intent commands are built. Second, building on character-level classification features and the differences between Hanyu Pinyin and English words, a feature representation for Chinese text classification using pinyin initial and final characters is proposed. Then, gated recurrent units (GRU) replace conventional recurrent units to remedy their difficulty in capturing long-range temporal features; to extract higher-order features, shorten feature sequences, and speed up model convergence, a deep text classification model combining a convolutional neural network with a GRU recurrent network is built. Finally, to verify performance on both long- and short-sequence tasks, ten-fold cross-validation is performed on the two corpora and the proposed model is compared with other classification methods; the results show that it significantly improves classification accuracy.
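The initial/final feature extraction described above reduces to segmenting each romanized syllable at the initial boundary. A minimal rule-based sketch (the initial table and longest-match rule are standard pinyin conventions; toneless pinyin and the example syllables are illustrative assumptions):

```python
# Pinyin initials (声母), two-letter ones first so "zh" matches before "z"
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    """Split one toneless pinyin syllable into (initial, final);
    the initial is empty for zero-initial syllables such as 'an' or 'er'."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable

# e.g. "zhi neng dao hang" (智能导航, "intelligent navigation")
tokens = [split_pinyin(s) for s in ["zhi", "neng", "dao", "hang"]]
# → [("zh", "i"), ("n", "eng"), ("d", "ao"), ("h", "ang")]
```

The resulting initial/final characters form a vocabulary far smaller than Chinese characters, which is the feature-size advantage the abstract exploits.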


Copyright©北京勤云科技发展有限公司  京ICP备09084417号