Similar Documents
20 similar documents retrieved (search time: 15 ms).
1.
Short reviews of the film Better Days (《少年的你》) were crawled from Douban with Python. The review texts were cleaned, then segmented and stripped of stop words using a custom segmentation dictionary and stop-word dictionary, yielding a reasonably normalized corpus. The TF-IDF algorithm was used to extract keywords from the reviews, and an LDA topic model built on these keywords extracts the review topics quantitatively, revealing the audience's sentiment toward the film and the hot topics in the reviews. This provides decision support for consumers' purchasing behavior and suggests development directions for the product's providers.
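As a rough illustration of this pipeline, the sketch below segments reviews with jieba, pulls TF-IDF keywords, and fits a gensim LDA model; the file names, dictionary paths, and topic count are placeholders, not the paper's settings.

```python
# Sketch: segment Douban reviews, extract TF-IDF keywords, fit an LDA topic model.
# Assumes reviews.txt (one review per line), user_dict.txt and stopwords.txt exist.
import jieba
import jieba.analyse
from gensim import corpora, models

jieba.load_userdict("user_dict.txt")          # custom segmentation dictionary
stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())

reviews = [line.strip() for line in open("reviews.txt", encoding="utf-8") if line.strip()]
docs = [[w for w in jieba.lcut(r) if w not in stopwords and len(w) > 1] for r in reviews]

# TF-IDF keywords over the whole corpus (top 20)
keywords = jieba.analyse.extract_tags(" ".join(reviews), topK=20, withWeight=True)

# LDA over the segmented documents
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=5, passes=10)
for tid in range(5):
    print(lda.print_topic(tid, topn=8))
```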

2.
In recent years, the volume of information in digital form has increased tremendously owing to the increased popularity of the World Wide Web. As a result, the use of techniques for extracting useful information from large collections of data, and particularly documents, has become more necessary and challenging. Text clustering is such a technique; it consists of dividing a set of text documents into clusters (groups), so that documents within the same cluster are closely related, whereas documents in different clusters are as different as possible. Clustering depends on measuring the content (i.e., words) of a document in terms of relevance. Nevertheless, as documents usually contain a large number of words, some of them may be irrelevant to the topic under consideration or redundant, which can confuse and complicate the clustering process and make it less accurate. Accordingly, feature selection methods have been employed to reduce data dimensionality by selecting the most relevant features. In this study, we developed a text document clustering optimization model using a novel genetic frog-leaping algorithm that efficiently clusters text documents based on selected features. The proposed approach is based on two metaheuristic algorithms: a genetic algorithm (GA) and a shuffled frog-leaping algorithm (SFLA). The GA performs feature selection, and the SFLA performs clustering. To evaluate its effectiveness, the proposed approach was tested on a well-known text document dataset: the "20Newsgroup" dataset from the University of California, Irvine (UCI) Machine Learning Repository. Across multiple experiments, the proposed algorithm greatly improved text document clustering on the 20Newsgroup dataset compared with classical K-means clustering; however, this improvement requires longer computational time.
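The paper's SFLA clustering stage is not reproduced here; as a minimal sketch of the GA feature-selection idea, K-means stands in for SFLA and a silhouette score serves as fitness. Population size, rates, and the toy TF-IDF matrix are illustrative.

```python
# Sketch: binary-chromosome GA selects TF-IDF features; K-means (standing in for
# SFLA) clusters on the selected subset; silhouette score is the fitness.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # placeholder TF-IDF matrix: 200 docs x 50 terms
K, POP, GENS = 4, 20, 10

def fitness(mask):
    if mask.sum() < 2:
        return -1.0
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X[:, mask == 1])
    return silhouette_score(X[:, mask == 1], labels)

pop = rng.integers(0, 2, (POP, X.shape[1]))
for _ in range(GENS):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][: POP // 2]]   # truncation selection
    children = []
    for _ in range(POP - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, X.shape[1])                 # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.02              # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```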

3.
4.
Text mining has become a major research area, in which text classification is an important task for finding relevant information in new documents. Accordingly, this paper presents a semantic word processing technique for text categorization that uses semantic keywords instead of treating the keywords in documents as independent features; hence, the dimensionality of the search space can be reduced. The Back Propagation Lion algorithm (BPLion) is also proposed to overcome the problem of updating the neuron weights. The proposed text classification methodology is evaluated on two datasets, 20 Newsgroup and Reuters. The performance of BPLion is analysed in terms of sensitivity, specificity, and accuracy, and compared with existing works. The results show that the proposed BPLion algorithm and semantic processing methodology classify documents with less training time and a higher classification accuracy of 90.9%.
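The BPLion weight update is specific to the paper, so the sketch below illustrates only the semantic-keyword idea, with NLTK's WordNet standing in for the semantic resource: synonyms collapse to one synset feature, shrinking the feature space.

```python
# Sketch: map tokens to a synset representative so synonyms share one feature,
# reducing the bag-of-words dimensionality before classification.
import nltk
from nltk.corpus import wordnet as wn
nltk.download("wordnet", quiet=True)

def semantic_token(word):
    synsets = wn.synsets(word)
    # Use the first synset's name as the shared feature id; fall back to the word.
    return synsets[0].name() if synsets else word

docs = [["car", "automobile", "engine"], ["auto", "motor", "engine"]]
sem_docs = [[semantic_token(w) for w in d] for d in docs]
print(sem_docs)   # "car" and "automobile" collapse to the same synset id
```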

5.
Spam mail classification is considered a complex and error-prone task in distributed computing environments. Various spam classification approaches are available, such as the naive Bayes classifier, logistic regression, support vector machines, decision trees, recursive neural networks, and long short-term memory algorithms. However, they do not consider the document as a whole when analyzing spam content. These approaches use the bag-of-words method, which analyzes a large amount of text data and weights features with term frequency-inverse document frequency. Because a document contains many words, these approaches consume massive resources and become infeasible when classifying multiple associated mail documents together, so spam mail is not classified fully and loopholes remain. We therefore propose a term frequency topic inverse document frequency model that captures the meaning of text data in a larger semantic unit by applying weights based on the document's topic. Moreover, the proposed approach reduces the sparsity problem through a frequency topic-inverse document frequency in singular value decomposition model, and the resulting dimensionality reduction further strengthens document classification. Experimental evaluations show that the proposed approach classifies spam mail documents with higher accuracy using individual document-independent processing computation. Comparative evaluations show that it outperforms the logistic regression model in the distributed computing environment, with higher document word frequencies of 97.05%, 99.17% and 96.59%.
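A rough classical sketch of such a pipeline with scikit-learn: LDA topic confidence scales the TF-IDF matrix, truncated SVD reduces dimensionality, and logistic regression classifies. The topic-weighting rule here is an illustrative simplification of the paper's model, and the four toy mails are placeholders.

```python
# Sketch: TF-IDF scaled by per-document topic confidence, then SVD + classifier.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.linear_model import LogisticRegression

mails = ["win money now", "meeting at noon", "cheap pills offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]                      # 1 = spam

tfidf = TfidfVectorizer().fit_transform(mails)
counts = CountVectorizer().fit_transform(mails)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Weight each document's TF-IDF row by its dominant-topic probability.
weighted = tfidf.multiply(topics.max(axis=1).reshape(-1, 1)).tocsr()

reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(weighted)
clf = LogisticRegression().fit(reduced, labels)
print(clf.predict(reduced))
```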

6.
Data fusion is a multidisciplinary research area spanning several domains. It is used to attain minimum detection-error probability and maximum reliability with the help of data retrieved from multiple healthcare sources. The generation of huge quantities of data from medical devices has resulted in big data, for which data fusion techniques become essential. Securing medical data is a crucial issue in today's rapidly advancing computing world and can be addressed by Intrusion Detection Systems (IDS). Since a single modality is not adequate to attain a high detection rate, there is a need to merge diverse techniques using a decision-based multimodal fusion process. This research article therefore presents a new multimodal fusion-based IDS for securing healthcare data using Spark. The proposed model involves a decision-based fusion model with several stages: initialization, pre-processing, Feature Selection (FS), and multimodal classification for effective detection of intrusions. For FS, a chaotic Butterfly Optimization (BO) algorithm called CBOA is introduced. Though the classic BO algorithm offers effective exploration, it fails to achieve fast convergence; to improve the convergence rate, this work modifies the required parameters of the BO algorithm using chaos theory. Finally, to detect intrusions, a multimodal classifier is applied by incorporating three Deep Learning (DL)-based classification models. Hadoop MapReduce and Spark are also utilized to achieve faster parallel computation on big data. To validate the presented model, a series of experiments was performed on the benchmark NSLKDDCup99 dataset. The proposed model demonstrated effective results on this dataset, offering a maximum accuracy of 99.21%, precision of 98.93%, and detection rate of 99.59%, confirming its superiority.
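The full CBOA, Spark pipeline, and DL classifiers are beyond a snippet; the sketch below shows only the chaos idea, with a logistic map replacing uniform random draws in a toy feature-subset search. The data, fitness, and flip rule are all assumptions.

```python
# Sketch: a logistic chaotic map drives the perturbation step of a toy
# wrapper feature selection, mimicking how CBOA injects chaos into BO.
import numpy as np

def logistic_map(x0=0.7, n=1000, r=4.0):
    seq, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)          # classic chaotic regime at r = 4
        seq.append(x)
    return np.array(seq)

rng = np.random.default_rng(1)
X = rng.random((300, 20)); y = rng.integers(0, 2, 300)   # placeholder IDS data

def score(mask):   # toy fitness: correlation of kept features with the labels
    return abs(np.corrcoef(X[:, mask == 1].mean(axis=1), y)[0, 1]) if mask.any() else 0.0

chaos = logistic_map()
best = rng.integers(0, 2, X.shape[1])
best_s = score(best)
for t in range(200):
    cand = best.copy()
    idx = int(chaos[t] * X.shape[1]) % X.shape[1]   # chaotic index choice
    if chaos[(t + 500) % 1000] < 0.5:               # chaotic flip decision
        cand[idx] = 1 - cand[idx]
    s = score(cand)
    if s > best_s:
        best, best_s = cand, s
print("kept features:", np.flatnonzero(best), "score:", round(best_s, 3))
```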

7.
Text keywords are meaningful and important words in a document that provide a precise overview of its content and reflect the author's writing intention. Keyword extraction methods have received much attention, among them network-based methods. However, existing network-based keyword extraction methods only consider the connections between words in a document while ignoring the impact of sentences. Since a sentence is made of many words, and words affect one another within a sentence, neglecting the influence of sentences loses information. In this paper, we introduce a word network whose nodes represent the words in a document, and call any keyword extraction method based on such a word network a Word-net method. We then propose a new network model that considers the influence of sentences, and a new word-sentence method based on it. Experimental results demonstrate that our method outperforms the Word-net method, the classical term frequency-inverse document frequency (TF-IDF) method, the most-frequent method, and the TextRank method: the precision, recall, and F-measure of our results are 7.95, 8.27, and 6.54% higher, respectively, than the Word-net results, and the average precision is 17.56% higher than the TF-IDF result. A two-way analysis of variance validates the empirical analysis, indicating that the keyword extraction method and the number of keywords have statistically significant effects on the evaluation metric values.
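A simplified sketch of a sentence-aware word graph with networkx: intra-sentence co-occurrence edges are weighted by a sentence score (plain sentence length here, standing in for the paper's sentence influence), and PageRank ranks the words.

```python
# Sketch: build a word graph whose edges are weighted by the "importance"
# of the sentences in which the words co-occur, then rank with PageRank.
import itertools
import networkx as nx

sentences = [
    ["keyword", "extraction", "graph", "method"],
    ["graph", "ranks", "keyword", "candidates"],
    ["short", "sentence"],
]

G = nx.Graph()
for sent in sentences:
    weight = len(sent)                     # stand-in sentence influence score
    for u, v in itertools.combinations(set(sent), 2):
        w = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + weight)

scores = nx.pagerank(G, weight="weight")
print(sorted(scores, key=scores.get, reverse=True)[:3])
```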

8.
施琦, 胡威, 许德骅, 彭魏, 苏晨, 朱衡. 《包装工程》 (Packaging Engineering), 2021, 42(20): 152-158, 217
Objective: Against the backdrop of the current epidemic, a method of deeply integrating product design with e-commerce big data is proposed. Methods: First, a suitable data source is identified according to the product type, and a Python crawler is written to scrape two kinds of data from it: user reviews (text resources) and product main images (image resources). Then, text clustering analysis and machine learning are used to batch-process the two kinds of data into valid data, which are visualized as charts. Finally, the charts are analyzed comprehensively to guide the design team's work effectively. Conclusion: Big data analysis precisely mines current product trends under the epidemic and redefines users' psychological feedback on existing products in this context. This design method breaks through the limitations of designers' individual cognition, makes design schemes better match consumers' psychology under the epidemic, and follows market development. Children's toddler shoes are taken as an example to demonstrate the effectiveness of the method.
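A condensed sketch of the review-text branch, assuming the reviews are already scraped to a file: jieba segmentation, TF-IDF vectors, K-means clusters, and a bar chart of cluster sizes. The file name, feature cap, and cluster count are placeholders.

```python
# Sketch: cluster scraped Chinese reviews and chart the cluster sizes.
import jieba
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [line.strip() for line in open("reviews.txt", encoding="utf-8") if line.strip()]
segmented = [" ".join(jieba.lcut(r)) for r in reviews]     # space-joined for sklearn

X = TfidfVectorizer(max_features=2000).fit_transform(segmented)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

sizes = Counter(labels)                    # visualize cluster distribution
plt.bar(list(sizes.keys()), list(sizes.values()))
plt.xlabel("review cluster"); plt.ylabel("count")
plt.savefig("cluster_sizes.png")
```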

9.
A mechanical fault clustering diagnosis method based on an improved artificial fish swarm algorithm
陈安华, 周博, 张会福, 文宏. 《振动与冲击》 (Journal of Vibration and Shock), 2012, 31(17): 145-148
Developing new theories or methods for fast and accurate clustering diagnosis of mechanical fault signals is a research focus for many scholars. Because the artificial fish swarm optimization algorithm has a simple structure, good parallelism, and fast convergence, it is introduced here into mechanical fault diagnosis. Based on the basic principles of the artificial fish swarm algorithm, an improved tail-chasing clustering algorithm is proposed: a similarity factor and a clustering discriminant factor are defined, a mechanical fault clustering diagnosis model simulating the tail-chasing behavior of fish swarms is established, and the model is applied to cluster analysis of mechanical fault feature information. A case study demonstrates the effectiveness of the method.
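A toy sketch of the tail-chasing intuition only: each sample "follows" its most similar neighbor, and chains of follow links form clusters. The similarity form and threshold are illustrative stand-ins for the paper's similarity and discriminant factors.

```python
# Sketch: "tail-chasing" clustering -- each point follows its most similar
# neighbor; chains of follow links form the clusters.
import numpy as np
import networkx as nx
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

dist = cdist(X, X)
np.fill_diagonal(dist, np.inf)
sim = 1.0 / (1.0 + dist)                       # similarity factor (illustrative)

G = nx.Graph()
G.add_nodes_from(range(len(X)))
for i in range(len(X)):
    j = int(np.argmax(sim[i]))                 # the fish each sample "chases"
    if sim[i, j] > 0.3:                        # discriminant threshold (illustrative)
        G.add_edge(i, j)

clusters = list(nx.connected_components(G))
print(f"{len(clusters)} clusters, sizes: {[len(c) for c in clusters]}")
```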

10.
An extended latent Dirichlet allocation (LDA) model is presented in this paper for patent competitive-intelligence analysis. After part-of-speech tagging and the definition of noun-phrase extraction rules, technological words are extracted from patent titles and abstracts, which allows patent analysis to be performed at the content level. The LDA model is then used to identify underlying topic structures based on the latent relationships among the extracted technological words, enabling a review of research hot spots and directions within subclasses of patented technology in a given field. To extend the traditional LDA model, an institution-topic probability level is added to the original model, so that the distribution of directly competing enterprises and their technological positions can be identified within each topic. A case study is carried out on one of the core patented technologies in next-generation telecommunications, LTE. This empirical study reveals emerging hot spots in LTE technology and finds that the major companies in this field focus on different technological areas with different competitive positions.
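The institution-topic layer resembles an author-topic model, so the sketch below puts institutions in the author role of gensim's AuthorTopicModel; this is an analogue of the paper's extension, not its exact formulation, and the three toy patents are placeholders.

```python
# Sketch: institutions take the "author" role in gensim's author-topic model,
# approximating the paper's institution-topic probability layer.
from gensim import corpora
from gensim.models import AuthorTopicModel

patents = [["lte", "carrier", "aggregation"],
           ["antenna", "mimo", "beamforming"],
           ["lte", "handover", "scheduling"]]
inst2doc = {"CompanyA": [0, 2], "CompanyB": [1]}   # institution -> patent indices

dictionary = corpora.Dictionary(patents)
corpus = [dictionary.doc2bow(p) for p in patents]
model = AuthorTopicModel(corpus, author2doc=inst2doc, id2word=dictionary,
                         num_topics=2, random_state=0)

for inst in inst2doc:
    print(inst, model.get_author_topics(inst))     # each institution's topic mix
```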

11.
Spark is a memory-based distributed data processing framework, and memory allocation is a central question in Spark research. A good memory allocation scheme can effectively improve task-execution efficiency and memory-resource utilization. Aiming at the memory allocation problem in Spark 2.x, this paper optimizes the allocation strategy by analyzing the Spark memory model, existing cache replacement algorithms, and memory allocation methods, on the basis of minimizing the storage area and allocating the execution area on demand. The work includes two parts: cache replacement optimization and memory allocation optimization. First, in the storage area, the cache replacement algorithm is optimized according to the characteristics of RDD partitions, combined with PCA dimensionality reduction: four features of an RDD partition are selected, and when an RDD cache is replaced, only the two most important features are kept each time via PCA, thereby ensuring the generalization of the cache replacement strategy. Second, the memory allocation strategy of the execution area is optimized according to the memory requirements of tasks and the memory space of the storage area. A series of experiments in Spark on YARN mode verifies the effectiveness of the optimization algorithm and its improvement of cluster performance.
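A sketch of the PCA step alone: from four illustrative partition features, PCA keeps the two most informative components and the lowest-scoring partition becomes the eviction candidate. The feature set and scoring rule are assumptions, not Spark internals.

```python
# Sketch: rank cached RDD partitions for eviction using PCA over four
# illustrative features (size, recompute cost, use count, recency).
import numpy as np
from sklearn.decomposition import PCA

# rows = cached partitions, columns = [size_mb, compute_cost, use_count, recency]
parts = np.array([[128, 4.0, 9, 0.9],
                  [512, 1.5, 2, 0.2],
                  [ 64, 6.0, 7, 0.8],
                  [256, 2.0, 1, 0.1]], dtype=float)

Z = (parts - parts.mean(axis=0)) / parts.std(axis=0)   # standardize first
scores = PCA(n_components=2).fit_transform(Z).sum(axis=1)

evict = int(np.argmin(scores))       # lowest combined score -> evict first
print("evict partition", evict)
```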

12.
The meaning of a word includes a conceptual meaning and a distributional meaning. Word embeddings based on distribution suffer from insufficient conceptual semantic representation caused by data sparsity, especially for low-frequency words. In knowledge bases, manually annotated semantic knowledge is stable, and the essential attributes of words are accurately denoted. In this paper, we propose a Conceptual Semantics Enhanced Word Representation (CEWR) model, which computes the synset embedding and hypernym embedding of Chinese words based on the Tongyici Cilin thesaurus and aggregates them with distributed word representations, so that both distributional information and conceptual meaning are encoded in the word representation. We evaluate the CEWR model on two tasks: word similarity computation and short text classification. The Spearman correlation between the model's results and human judgement improves to 64.71%, 81.84%, and 85.16% on Wordsim297, MC30, and RG65, respectively. Moreover, CEWR improves the F1 score by 3% on the short text classification task. The experimental results show that CEWR represents words more informatively than distributed word embedding alone, which proves that conceptual semantics, especially hypernym information, is a good complement to distributed word representation.
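A toy sketch of the aggregation step with made-up 4-d vectors: a word's distributed vector is mixed with the mean of its synonym vectors and its hypernym vector. The mixing weights and the tiny thesaurus stand in for the paper's Cilin-derived values.

```python
# Sketch: fuse a distributed vector with synset and hypernym vectors
# (toy 4-d embeddings; weights alpha/beta/gamma are illustrative).
import numpy as np

emb = {"狗": np.array([1., 0., 0., 0.]), "犬": np.array([.9, .1, 0., 0.]),
       "动物": np.array([.5, .5, 0., 0.])}
synsets = {"狗": ["狗", "犬"]}          # stand-in for a Cilin synonym line
hypernyms = {"狗": ["动物"]}

def cewr(word, alpha=0.6, beta=0.25, gamma=0.15):
    dist = emb[word]
    syn = np.mean([emb[w] for w in synsets.get(word, [word])], axis=0)
    hyp = np.mean([emb[w] for w in hypernyms.get(word, [word])], axis=0)
    return alpha * dist + beta * syn + gamma * hyp

print(cewr("狗"))
```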

13.
Spark is the most popular in-memory processing framework for big data analytics, and memory is the crucial resource for workloads to achieve performance acceleration on it. Spark's existing memory capacity configuration approach statically configures the memory capacity of a workload based on user specifications. However, without deep knowledge of a workload's system-level characteristics, users in practice often conservatively overestimate its memory utilization and ask the resource manager to grant more memory than the workload actually needs, which severely wastes memory resources. To address this issue, SMConf, an automated memory capacity configuration solution for in-memory computing workloads on Spark, is proposed. SMConf is designed on the observation that, although there is no one-size-fits-all configuration, a one-size-fits-bunch configuration can be found for in-memory computing workloads. SMConf classifies typical Spark workloads into categories using metrics across the layers of the Spark system stack. For each workload category, an individual memory requirement model is learned from the workload's input data size and the strongly correlated configuration parameters. For an ad-hoc workload, SMConf matches its memory requirement signature to one of the workload categories using small-sized input data and determines its proper memory capacity with the corresponding memory requirement model. Experimental results demonstrate that, compared to the conservative default configuration, SMConf can reduce the memory provision to Spark workloads by up to 69% with only slight performance degradation, and can reduce the average turnaround time of Spark workloads by up to 55% in multi-tenant environments.
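A schematic sketch of the two steps under assumed data: nearest-centroid matching assigns an ad-hoc workload to a category by its metric signature, and that category's linear model maps input size to a memory estimate. The metrics, categories, and linear form are placeholders.

```python
# Sketch: match a workload's metric signature to a category, then predict
# its memory need from that category's input-size regression model.
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed per-category training data: input size (GB) -> peak memory (GB).
history = {
    "shuffle-heavy": (np.array([[1], [2], [4]]), np.array([3.0, 5.8, 11.5])),
    "cache-heavy":   (np.array([[1], [2], [4]]), np.array([1.2, 2.1, 4.0])),
}
models = {c: LinearRegression().fit(Xs, ys) for c, (Xs, ys) in history.items()}

# Assumed metric signatures (e.g., shuffle-bytes ratio, GC-time ratio).
centroids = {"shuffle-heavy": np.array([0.8, 0.3]), "cache-heavy": np.array([0.2, 0.1])}

def configure(signature, input_gb):
    cat = min(centroids, key=lambda c: np.linalg.norm(signature - centroids[c]))
    return cat, float(models[cat].predict([[input_gb]])[0])

print(configure(np.array([0.75, 0.25]), input_gb=8))   # -> shuffle-heavy estimate
```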

14.
Nowadays, the amount of web data is increasing rapidly, which presents a serious challenge to web monitoring. Text sentiment analysis, an important research topic in natural language processing, is a crucial task in the web monitoring area, but the accuracy of traditional sentiment analysis methods can degrade when dealing with massive data. Deep learning has been a hot research topic in artificial intelligence in recent years, and several research groups have studied sentiment analysis of English texts using deep learning methods; in contrast, relatively few works have so far considered Chinese text sentiment analysis in this direction. In this paper, a method for Chinese text sentiment analysis based on the convolutional neural network (CNN) is proposed in order to improve analysis accuracy. Because the feature values of the CNN after training are nonuniformly distributed, a method for normalizing the feature values is proposed. Moreover, the dimensions of the text features are optimized through simulations. Finally, a method for updating the learning rate during CNN training is presented to achieve better performance. Experimental results on typical datasets indicate that the accuracy of the proposed method is improved compared with traditional supervised machine learning methods, e.g., the support vector machine method.
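A compact PyTorch sketch approximating the two highlighted ideas: batch normalization over the pooled convolution features (for the nonuniform feature values) and a step decay of the learning rate. Vocabulary size, dimensions, and the schedule are illustrative.

```python
# Sketch: TextCNN for sentiment with BatchNorm on pooled conv features and a
# step-decay learning-rate schedule during training.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab=5000, dim=128, n_filters=100, classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, n_filters, k) for k in (3, 4, 5)])
        self.bn = nn.BatchNorm1d(3 * n_filters)   # normalize pooled features
        self.fc = nn.Linear(3 * n_filters, classes)

    def forward(self, x):                         # x: (batch, seq_len) token ids
        e = self.emb(x).transpose(1, 2)           # -> (batch, dim, seq_len)
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        return self.fc(self.bn(torch.cat(pooled, dim=1)))

model = TextCNN()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)
x = torch.randint(0, 5000, (8, 40)); y = torch.randint(0, 2, (8,))
for epoch in range(10):                           # toy training loop
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward(); opt.step(); sched.step()
```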

15.
The FastICA algorithm processes a batch of already-collected samples and is therefore unsuitable when the channel matrix varies. The natural-gradient Infomax method adjusts the separation matrix from each single observation, but it is only suitable for a single class of sources. The advantages and disadvantages of these algorithms are compared by simulation under both constant and time-varying channels. Meanwhile, to resolve the conflict between convergence speed and steady-state error in online algorithms, an improved variable-step-size algorithm is proposed, which links the step size to the degree of signal separation and adapts it according to the change in the similarity measure between signals. Simulations verify the practicality of the algorithm.
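A numpy sketch of the online natural-gradient update with an adaptive step: the step shrinks as the update term (a proxy for the separation-degree change the abstract describes) gets small. The mixing matrix and adaptation rule are illustrative, not the paper's.

```python
# Sketch: online natural-gradient ICA with a step size tied to how far the
# outputs are from independence (proxy for the paper's similarity measure).
import numpy as np

rng = np.random.default_rng(0)
n = 5000
S = np.vstack([rng.laplace(size=n), rng.laplace(size=n)])  # super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])                     # well-conditioned mixing
X = A @ S                                                  # mixed observations

W = np.eye(2)
for t in range(n):
    x = X[:, t:t + 1]
    y = W @ x
    grad = (np.eye(2) - np.tanh(y) @ y.T) @ W              # natural-gradient term
    mu = 0.01 * min(1.0, np.linalg.norm(grad))             # adaptive step size
    W += mu * grad

print("W @ A (should be near a scaled permutation):\n", W @ A)
```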

16.
姚永红, 张旭. 《声学技术》 (Technical Acoustics), 2022, 41(6): 923-928
This paper proposes an improved Polar Format Algorithm (PFA) method for spotlight multi-subarray synthetic aperture sonar imaging. A squint imaging model under non-"stop-and-go" conditions is established, an analytical expression of the signal from the time domain to the wavenumber domain is derived, and the signal processing flow is given. The method first compensates platform motion errors using the exact range history of the scene center and obtains a coarsely focused image via polar-format processing. Then, to compensate the residual space-variant phase errors at non-center points, block-wise autofocus is applied to the coarse image, improving the focusing of scene-edge points. Finally, a fully refocused image is obtained after sub-image mosaicking and geometric correction. Simulation and analysis show that the method improves azimuth performance metrics while accurately compensating platform motion errors, and is well suited to multi-subarray sonar imaging. The method remains robust under large motion errors and high squint angles.
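Only the first step, scene-center motion compensation, reduces to a short sketch: the measured range history to the scene center yields a two-way phase error that is conjugated out pulse by pulse. The geometry, wavelength, and data are made up.

```python
# Sketch: first-order motion compensation using the scene-center range history;
# each pulse is multiplied by a phase correcting the two-way range error.
import numpy as np

c, fc = 1500.0, 100e3                     # sound speed (m/s), center freq (Hz)
lam = c / fc
n_pulses, n_samples = 64, 256
rng = np.random.default_rng(0)

signal = (rng.standard_normal((n_pulses, n_samples))
          + 1j * rng.standard_normal((n_pulses, n_samples)))  # placeholder echoes
r_ideal = 100.0 + 0.01 * np.arange(n_pulses)              # nominal center range
r_meas = r_ideal + 0.002 * rng.standard_normal(n_pulses)  # with motion error

phase = np.exp(1j * 4 * np.pi / lam * (r_meas - r_ideal)) # two-way phase error
compensated = signal * phase[:, None].conj()              # remove it per pulse
```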

17.
Text classification has become an increasingly crucial topic in natural language processing. Traditional text classification methods based on machine learning have many disadvantages, such as dimension explosion, data sparsity, and limited generalization ability. This paper presents an extensive study of deep learning text classification models, including Convolutional Neural Network-based (CNN-based), Recurrent Neural Network-based (RNN-based), and attention mechanism-based models. Many studies have shown that text classification methods based on deep learning outperform traditional methods when processing large-scale and complex datasets, mainly because they avoid the cumbersome feature extraction process and achieve higher prediction accuracy on large sets of unstructured data. This paper also summarizes the shortcomings of traditional text classification methods and introduces the deep learning text classification process, including text preprocessing, distributed representation of text, classification model construction, and performance evaluation.
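As a miniature of the process the paper outlines (preprocess, distributed representation, model construction, evaluation), the sketch below trains a bag-of-embeddings classifier in PyTorch on four toy texts.

```python
# Sketch: the survey's pipeline in miniature -- tokenize, embed, classify, score.
import torch
import torch.nn as nn

texts = [["good", "movie"], ["bad", "plot"], ["great", "acting"], ["awful", "movie"]]
labels = torch.tensor([1, 0, 1, 0])
vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in t}))}

ids = [torch.tensor([vocab[w] for w in t]) for t in texts]
offsets = torch.tensor([0] + [len(t) for t in ids[:-1]]).cumsum(0)
flat = torch.cat(ids)

emb = nn.EmbeddingBag(len(vocab), 16, mode="mean")   # averaged word vectors
fc = nn.Linear(16, 2)                                # classification head
opt = torch.optim.Adam(list(emb.parameters()) + list(fc.parameters()), lr=0.05)

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(fc(emb(flat, offsets)), labels)
    loss.backward(); opt.step()

pred = fc(emb(flat, offsets)).argmax(1)
print("train accuracy:", (pred == labels).float().mean().item())
```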

18.
Objective: With the development of the Internet, user reviews are growing rapidly. Text analysis of this massive data, combined with the Kano model, is used to obtain more comprehensive user customization needs. Methods: A method based on mining big-data review text is proposed to obtain personalized customization needs for elderly walking canes. First, canes are divided into three product tiers, typical products are selected, and their user reviews are crawled. Second, correspondence analysis of the texts reveals the differences in user needs across tiers. Next, an LDA model combined with the Delphi expert method yields user-need clusters. Finally, the Kano model classifies the needs into levels, and Fisher's exact test checks the significance of the differences. Results: Must-be, one-dimensional (expected), and attractive (excitement) user needs for elderly canes are identified to guide the design of a personalized customization interface. Conclusion: The results show that combining big-data mining with the Kano model effectively obtains hierarchical personalized user needs and guides the construction of customization platforms, providing a scientific basis for designing product customization platforms.
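A sketch of the final statistical step: Fisher's exact test on an assumed 2x2 table of how often one need is mentioned in two product tiers. The counts are fabricated to show the call, not the paper's data.

```python
# Sketch: test whether a need (say, "non-slip tip") is mentioned significantly
# more often in reviews of one cane tier than another.
from scipy.stats import fisher_exact

#                 mentions  no-mentions   (assumed counts)
table = [[45, 155],        # premium-tier reviews
         [12, 188]]        # entry-tier reviews
odds, p = fisher_exact(table)
print(f"odds ratio = {odds:.2f}, p = {p:.4f}")   # small p -> tiers differ
```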

19.
In the era of big data, traditional regression models cannot deal with uncertain big data efficiently and accurately. To make up for this deficiency, this paper proposes a quantum fuzzy regression model, which uses fuzzy theory to describe the uncertainty in big data sets and uses quantum computing to exponentially improve the efficiency of data-set preprocessing and parameter estimation. Data envelopment analysis (DEA) is used to calculate the degree of importance of each data point, while the Harrow-Hassidim-Lloyd (HHL) algorithm and quantum swap circuits improve the efficiency of high-dimensional matrix computation. Applying the quantum fuzzy regression model to small-scale financial data shows that its accuracy is greatly improved compared with the quantum regression model; moreover, owing to quantum computing, the speed of handling high-dimensional data matrices is exponentially improved compared with the fuzzy regression model. The proposed model combines the advantages of fuzzy theory and quantum computing: it can efficiently process high-dimensional data matrices and complete parameter estimation with quantum computing while retaining the uncertainty in big data, making it a new model for efficient and accurate big data processing in uncertain environments.
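The quantum parts (HHL, swap circuits) do not reduce to a snippet, but the fuzzy-weighting step has a simple classical analogue: weighted least squares, with per-point weights standing in for the DEA importance scores.

```python
# Sketch: classical analogue of the fuzzy-weighted regression step --
# weighted least squares with per-point importance weights.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.random(50)])      # intercept + one feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 0.1, 50)
w = rng.uniform(0.5, 1.0, 50)                           # stand-in DEA importance

Wm = np.diag(w)
beta = np.linalg.solve(X.T @ Wm @ X, X.T @ Wm @ y)      # (X'WX)^-1 X'Wy
print("estimated coefficients:", beta)                  # ~ [2, 3]
```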

20.
Semi-supervised clustering improves learning performance by using a small number of labeled samples to assist the learning from unlabeled samples. This paper implements and compares unsupervised and semi-supervised clustering analyses of BOA-Argo ocean text data. Unsupervised K-Means and Affinity Propagation (AP) are two classical clustering algorithms; the Election-AP algorithm is proposed to handle the final cluster number in AP clustering, which has proved difficult to keep within a suitable range. Thermocline data in the BOA-Argo dataset are labeled according to the standard thermocline definition and used for semi-supervised cluster analysis. Several semi-supervised clustering algorithms were chosen for comparing learning performance: Constrained-K-Means, Seeded-K-Means, SAP (Semi-supervised Affinity Propagation), LSAP (Loose Seed AP) and CSAP (Compact Seed AP). To adapt to single labels, this paper improves the above algorithms into SCKM (improved Constrained-K-Means), SSKM (improved Seeded-K-Means), and SSAP (improved Semi-supervised Affinity Propagation) for semi-supervised clustering analysis of the data. A DSAP (Double Seed AP) semi-supervised clustering algorithm based on compact seeds is proposed, as the experimental data show that DSAP has a better clustering effect. The unsupervised and semi-supervised clustering results are used to analyze the potential patterns of the marine data.
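A small sketch of the seeded idea: labeled seed samples initialize the centroids and plain K-means proceeds from there; scikit-learn stands in for the paper's improved variants, and the two-blob data are synthetic.

```python
# Sketch: Seeded-K-Means -- centroids start at the labeled seeds' class means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
seed_idx = np.array([0, 1, 100, 101])        # a few labeled samples
seed_lab = np.array([0, 0, 1, 1])

init = np.vstack([X[seed_idx][seed_lab == k].mean(axis=0) for k in (0, 1)])
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```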
