Similar Documents
Found 20 similar documents; search took 15 ms.
1.

Twitter has become a trending microblogging and social media platform for news and discussion. The dramatic growth of the platform has been accompanied by an equally dramatic growth in spam. Supervised machine learning requires a labeled Twitter dataset, and preparing one by hand demands considerable human effort, so a semi-supervised technique for labeling newly collected datasets is desirable. This issue motivated us to propose an efficient approach for preparing labeled datasets, saving time and avoiding human error. The proposed approach relies on features readily available in real time, for better performance and wider applicability. This work collects a user's most recent tweets via Twitter streaming to build an up-to-date Twitter dataset, and then labels the tweets with a semi-supervised machine learning algorithm based on the self-training technique. A semi-supervised support vector machine and a semi-supervised decision tree serve as base classifiers in the self-training step. Further, the authors apply the K-means clustering algorithm to the tweets based on tweet content. The principled novel approach is an ensemble of semi-supervised and unsupervised learning, in which the semi-supervised algorithms proved more accurate than the unsupervised one. To assign labels effectively, the authors use majority voting: the label predicted by the majority of classifiers becomes the label assigned to the tweet. A maximum accuracy of 99.0% is reported in this paper using the majority voting classifier for spam labeling.
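The pipeline described above can be sketched with scikit-learn's self-training wrapper. This is a minimal illustration, not the authors' implementation: the synthetic feature matrix, seed-label counts, and cluster-alignment step are all invented stand-ins for the paper's tweet features.

```python
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Toy tweet features: two well-separated classes (ham=0, spam=1).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(3, 1, (40, 5))])
y = np.full(80, -1)            # -1 marks unlabeled tweets
y[:5], y[40:45] = 0, 1         # a few hand-labeled seeds per class

# Two semi-supervised base learners trained by self-training.
st_tree = SelfTrainingClassifier(DecisionTreeClassifier(random_state=0)).fit(X, y)
st_svm = SelfTrainingClassifier(SVC(probability=True, random_state=0)).fit(X, y)

# Unsupervised view: k-means over tweet-content features.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_labels = km.labels_.copy()
# Cluster ids are arbitrary; align them with the seed labels before voting.
if km_labels[:5].mean() > km_labels[40:45].mean():
    km_labels = 1 - km_labels

# Majority vote across the three predictors decides the final label.
votes = np.stack([st_tree.predict(X), st_svm.predict(X), km_labels]).astype(int)
final = np.array([np.bincount(v).argmax() for v in votes.T])
```

The vote makes the ensemble robust to any single predictor's mistakes, which is the intuition behind the high labeling accuracy the abstract reports.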


2.
Social media websites such as Facebook and Twitter have changed the way people communicate and make decisions. Various companies are therefore eager to use these media to build their reputation. In this paper, a reputation management system is proposed that measures the reputation of a given company from social media data, particularly Twitter tweets. Given the name of the company and its related tweets, the system determines whether a given tweet has a negative or positive impact on the company's reputation or product. The proposed method is based on an N-gram learning approach consisting of two steps: training and testing. In the training step, four profiles are built for each company: positive, negative, neutral, and irrelevant. 80% of the available tweets are used to build the companies' profiles; each profile contains the terms that appear in the company's tweets together with their frequencies. In the test step, performed on the remaining 20% of the dataset, each tweet is compared against all of the built profiles using a distance criterion to determine how the tweet affects the company's reputation. Evaluation of the proposed method indicates better efficiency and performance, in terms of recall and precision, than previous methods such as neural networks and the Bayesian method.
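A stripped-down version of the profile idea can be written in a few lines. This sketch uses unigrams rather than the paper's N-grams, an invented inverse-overlap distance, and toy tweets; it only illustrates "build per-class term-frequency profiles, then classify by nearest profile."

```python
from collections import Counter

def profile(tweets):
    """Term-frequency profile over a class's training tweets."""
    c = Counter()
    for t in tweets:
        c.update(t.lower().split())   # unigrams; the paper uses N-grams
    return c

def distance(tweet, prof):
    # Illustrative distance: more shared (weighted) terms -> smaller distance.
    overlap = sum(prof[w] for w in tweet.lower().split())
    return 1.0 / (1.0 + overlap)

train = {
    "positive": ["great service love this company", "love the new product"],
    "negative": ["terrible support awful product", "awful experience"],
}
profiles = {label: profile(tws) for label, tws in train.items()}

def classify(tweet):
    # Assign the profile at minimum distance from the tweet.
    return min(profiles, key=lambda lbl: distance(tweet, profiles[lbl]))
```

In the paper this comparison runs over four profiles per company (positive, negative, neutral, irrelevant); the two-class setup here is just for brevity.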

3.
Combining machine learning with social network analysis (SNA) can leverage vast amounts of social media data to better respond to crises. We present a case study using Twitter data from the March 2019 Nebraska floods in the United States, which caused over $1 billion in damage in the state and widespread evacuations of residents. We use a subset of machine learning, deep learning (DL), to classify text content of 11,982 tweets, and we integrate that with SNA to understand the structure of tweet interactions. Our DL approach pre‐trains our model with a DL language technique, BERT, and then trains the model using the standard training dataset to sort a dataset of tweets into classes tailored to crisis events. Several performance measures demonstrate that our two‐tiered trained model improves domain adaptation and generalization across different extreme weather event types. This approach identifies the role of Twitter during the damage containment stage of the flood. Our SNA identifies accounts that function as primary sources of information on Twitter. Together, these two approaches help crisis managers filter large volumes of data and overcome challenges faced by simple statistical models and other computational techniques to provide useful information during crises like flooding.

4.
Twitter has become an important data source for detecting events, especially for tracking detailed information about events in a specific domain. Previous studies on targeted-domain Twitter information extraction have used supervised learning techniques to identify domain-related tweets; however, the need for extensive manual labeling makes these supervised systems extremely expensive to build and maintain. Moreover, most existing works fail to consider spatiotemporal factors, which are essential attributes of target-domain events. In this paper, we propose a semi-supervised method for Automatic Targeted-domain Spatiotemporal Event Detection (ATSED) in Twitter. Given a target domain, ATSED first learns tweet labels from historical data and then detects ongoing events from real-time Twitter data streams. Specifically, an efficient label generation algorithm automatically derives tweet labels from domain-related news articles, a customized classifier built on tweets' distinguishing features analyzes the Twitter data, and a novel multinomial spatial-scan model identifies the geographical locations of detected events. Experiments on 305 million tweets demonstrate the effectiveness of this new approach.
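The label-generation step can be approximated with a simple nearest-article rule. This is a hedged sketch, not ATSED's actual algorithm: the articles, tweets, domain labels, and the 0.1 similarity threshold are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical domain-labeled news articles and unlabeled tweets.
articles = [("protest march downtown today", "civil-unrest"),
            ("flu outbreak reported in hospitals", "disease")]
tweets = ["huge protest march on main street", "flu cases rising fast",
          "great pizza for lunch"]

vec = TfidfVectorizer().fit([a for a, _ in articles] + tweets)
A = vec.transform([a for a, _ in articles])

def label(tweet, threshold=0.1):
    """Give the tweet its most similar article's label, or None if too far."""
    sims = cosine_similarity(vec.transform([tweet]), A)[0]
    best = sims.argmax()
    return articles[best][1] if sims[best] >= threshold else None
```

Tweets that match no article stay unlabeled, which is what lets a downstream classifier be trained without manual annotation.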

5.
Twitter provides search services to help people find users to follow by recommending popular users or friends of their friends. However, these services neither surface the most relevant users to follow nor provide a way to find the most interesting tweets for each user. Recently, collaborative filtering techniques that base recommendations on friend relationships in social networks have been widely investigated. However, since such techniques do not work well when friend relationships are sparse, we need to take advantage of as much other information as possible to improve the performance of recommendations. In this paper, we propose TWILITE, a recommendation system for Twitter based on probabilistic modeling with latent Dirichlet allocation, which recommends the top-K users to follow and the top-K tweets to read for each user. Our model captures the realistic process of posting tweets by generalizing an LDA model, as well as the process of connecting to friends by utilizing matrix factorization. We then develop an inference algorithm based on the variational EM algorithm for learning the model parameters. Based on the estimated parameters, we present effective personalized recommendation algorithms to find both the users to follow and the interesting tweets to read. A performance study with real-life datasets confirms the effectiveness of the proposed model and the accuracy of our personalized recommendations.
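The topic-based half of such a recommender can be sketched with scikit-learn's LDA. This is not TWILITE (which generalizes LDA and adds matrix factorization and variational EM); it only shows the underlying idea of ranking tweets by topic-mixture similarity to a user, with toy tweets and an invented reading history.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "new phone camera review", "best camera phone battery",
    "election debate tonight", "vote in the election debate",
    "camera lens tips", "debate results and votes",
]
X = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                     # per-tweet topic mixtures

user_history = [0, 1]                        # assumed: user read the camera tweets
user_vec = theta[user_history].mean(axis=0)  # user's topic preference

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank all tweets by topic similarity to the user's preference vector.
ranked = sorted(range(len(tweets)), key=lambda i: -cosine(theta[i], user_vec))
```

Taking the first K indices of `ranked` gives a top-K tweet recommendation; the full system additionally learns who to follow from the friend-link matrix.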

6.

Twitter has become a popular microblogging service that allows millions of active users to share news, emergent social events, personal opinions, and more. This produces a large amount of data every day, and managing tweets becomes extremely difficult. To categorize tweets and make them easier to search, users can embed hashtags in their tweets. However, hashtags are unrestricted, which leads to a very heterogeneous set of hashtags on Twitter and increases the difficulty of tweet categorization. In this paper, we propose a hashtag recommendation method based on analyzing tweet content, user characteristics, and currently popular hashtags on Twitter. The proposed method uses users' personal profiles to discover relevant hashtags. First, a combination of tweet content and user characteristics is used to find the top-k similar tweets. We exploit the content of historical tweets, previously used hashtags, and social interactions to build the user profiles. User characteristics help find close users and improve the accuracy of the similar-tweet search from which hashtag candidates are extracted. The set of candidate hashtags is then ranked by popularity over long and short periods. Experiments on tweet data show that the proposed method significantly improves the performance of hashtag recommendation systems.
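The first stage, extracting hashtag candidates from the top-k similar tweets, can be sketched as follows. This is a simplification under stated assumptions: similarity is plain TF-IDF cosine (no user characteristics), and ranking is by raw frequency among neighbors rather than the paper's long/short-period popularity scores.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical historical tweets paired with the hashtags they used.
history = [
    ("world cup final tonight", ["#worldcup", "#football"]),
    ("amazing goal in the world cup", ["#worldcup"]),
    ("new ai model released", ["#ai", "#ml"]),
]
texts = [t for t, _ in history]
vec = TfidfVectorizer().fit(texts)

def recommend(tweet, k=2, top=2):
    """Collect hashtags from the k most similar tweets, ranked by frequency."""
    sims = cosine_similarity(vec.transform([tweet]), vec.transform(texts))[0]
    nearest = sims.argsort()[::-1][:k]
    tags = Counter(tag for i in nearest for tag in history[i][1])
    return [t for t, _ in tags.most_common(top)]
```

Replacing the frequency count with time-windowed popularity, and the similarity with the profile-aware one, recovers the shape of the proposed method.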

7.
In emergencies, Twitter is an important platform for gaining real-time situational awareness, so information about Twitter users' locations is fundamental to understanding a disaster's effects. Location extraction, however, is a challenging task: most Twitter users do not share their location in their tweets. Various location-extraction methods have been proposed, drawing on fields such as statistics and machine learning. This study uses geo-tagged tweets to demonstrate the importance of location in disaster management by considering three cases. Tweets were collected using the keyword "earthquake" to determine the location of Twitter users, and were evaluated using the Latent Dirichlet Allocation (LDA) topic model and sentiment analysis based on machine learning classification algorithms: Multinomial and Gaussian Naïve Bayes, Support Vector Machine (SVM), Decision Tree, Random Forest, Extra Trees, Neural Network, k-Nearest Neighbors (kNN), Stochastic Gradient Descent (SGD), and Adaptive Boosting (AdaBoost). In total, 10 machine learning algorithms were applied to location-specific disaster-related tweets, aiming at a fast and correct response in a disaster situation, and the effectiveness of each algorithm was evaluated in order to select the right one. In addition, topic extraction via LDA was performed to understand the situation after a disaster. The results from the three cases indicate that the Multinomial Naïve Bayes and Extra Trees algorithms give the best results, with F-measure values over 80%. The study aims to provide a quick response to earthquakes by applying the aforementioned techniques.
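The classifier comparison in the study has roughly this shape. The sketch below uses three of the ten listed models on TF-IDF features; the tweets, labels, and the evaluate-on-training-data shortcut are illustrative stand-ins, not the study's protocol.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Toy disaster tweets with invented sentiment labels (0=negative, 1=positive).
tweets = ["earthquake destroyed homes", "strong earthquake felt downtown",
          "we are safe after the earthquake", "rescue teams doing great work",
          "buildings collapsed people trapped", "thankful everyone is okay"] * 5
labels = [0, 0, 1, 1, 0, 1] * 5

X = TfidfVectorizer().fit_transform(tweets)
scores = {}
for name, clf in [("MultinomialNB", MultinomialNB()),
                  ("ExtraTrees", ExtraTreesClassifier(random_state=0)),
                  ("SVM", SVC())]:
    # F-measure per classifier; the study compares such scores across 10 models.
    scores[name] = f1_score(labels, clf.fit(X, labels).predict(X))
```

A real comparison would use held-out data (e.g. cross-validation) rather than scoring on the training set as this toy loop does.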

8.
Twitter has become a major tool for spreading news, disseminating positions and ideas, and commenting on and analyzing current world events. However, with more than 500 million tweets flowing per day, efficient ways of collecting, storing, managing, mining, and visualizing all this information are necessary. This is especially relevant considering that Twitter has no way of indexing tweet contents, and that the only available categorization "mechanism" is the #hashtag, which is entirely dependent on a user's willingness to use it. This paper presents an intelligent platform and framework, named MISNIS (Intelligent Mining of Public Social Networks' Influence in Society), that addresses these issues and allows a non-technical user to easily mine a given topic from a very large tweet corpus and obtain relevant contents and indicators such as user influence or sentiment analysis. Compared to other existing platforms, MISNIS is an expert system that includes specifically developed intelligent techniques that: (1) circumvent the Twitter API restriction limiting access to 1% of all flowing tweets; while online, the platform has been able to collect more than 80% of all Portuguese-language tweets flowing in Portugal; and (2) intelligently retrieve most tweets related to a given topic even when the tweets contain neither the topic #hashtag nor user-indicated keywords; a 40% increase in the number of relevant tweets retrieved has been reported in real-world case studies. The platform is currently focused on Portuguese-language tweets posted in Portugal. However, most of the developed technologies are language independent (e.g., intelligent retrieval, sentiment analysis), and technically MISNIS can easily be expanded to cover other languages and locations.

9.
Social media sites such as Twitter and Facebook are heavily influenced by emotions, sentiments, and data in the modern era. Twitter, a widely used microblogging site where individuals share their thoughts as tweets, has become a major source for sentiment analysis. In recent years, demand has grown significantly for sentiment analysis that identifies and classifies opinions or expressions in text or tweets. People's opinions about a particular topic, situation, person, or product can be identified from sentences and divided into three categories: positive for good, negative for bad, and neutral for mixed or ambiguous opinions. The process of analyzing these sentiment categories and their changes is known as sentiment analysis. In this study, sentiment analysis was performed on a dataset of 90,000 tweets using both deep learning and machine learning methods. The deep-learning model, long short-term memory (LSTM), performed better than the machine learning approaches: LSTM achieved 87% accuracy, while the support vector machine (SVM) classifier achieved slightly worse results at 86%. On the binary positive/negative task, LSTM and SVM both achieved 90% accuracy.

10.
As a new type of social media, microblogs have gradually gained acceptance and use thanks to their high timeliness, dynamic topic tracking, and fast propagation. Filtering out microblog posts relevant to a given topic, so that users can follow the topic as it develops, has become a pressing problem. Because microblog posts are extremely short and carry little information and few features, topic-based filtering poses new challenges, and traditional text classification techniques no longer apply. This paper proposes a rule-learning algorithm for filtering based on information entropy, and uses the learned rules to filter microblog posts effectively. The algorithm evaluates candidate rules by their information entropy, while a randomized strategy based on simulated annealing keeps rule selection from being overly greedy. Experiments on about 90,000 annotated posts from Sina Weibo and about 3,000 topic-specific annotated posts from TREC 2011 show that, compared with the CPAR and SVM algorithms, the rules learned by the proposed algorithm achieve a higher F-measure in filtering.
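The entropy criterion used to score rules can be stated in a few lines. This sketch only shows the scoring function, not the rule learner or the simulated-annealing selection; the label strings are placeholders.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the label distribution among posts a rule matches.

    Lower entropy means a purer, and therefore better, filtering rule.
    """
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A rule matching only on-topic posts is perfectly pure (entropy 0), while a
# rule matching an even on-topic/off-topic mix is maximally impure for two
# classes (entropy 1 bit).
pure = entropy(["on-topic"] * 5)
mixed = entropy(["on-topic", "off-topic"])
```

In the paper this score drives the search over candidate rules, with simulated annealing occasionally accepting worse rules to escape greedy local optima.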

11.
In recent years, with increased opportunities to post content on social media, many users are experiencing information overload in relation to social media use. This study addresses how Japanese Twitter users suffering from information overload cope with their stress, focusing on two actions: (1) "unfriending" activities and (2) changes in tweet-processing methods. Objective data, such as the number of friends, were collected through Twitter's open Application Programming Interfaces (APIs), and subjective data, such as perceived information overload and tweet-processing methods, were collected through a web-based survey as a panel dataset (n = 778). The results demonstrate that although users experience information overload, they continue to increase their number of friends, and that users who experience information overload modify their usage habits to avoid seeing all received tweets. In short, users do not choose a strategy that reduces the absolute number of received tweets, but only one that changes how received tweets are processed.

12.
This paper describes a method for handling multi-class and multi-label classification problems based on the support vector machine formalism, applied to language identification in Twitter. The system was evaluated mainly on a Twitter dataset developed in the TweetLID workshop, which contains bilingual tweets written in the most commonly used Iberian languages (i.e., Spanish, Portuguese, Catalan, Basque, and Galician) as well as English. We address the following problems: (1) social media texts, for which we propose a suitable tokenization that handles the peculiarities of Twitter; (2) multilingual tweets: since a tweet can belong to more than one language, we need a multi-class, multi-label classifier; (3) similar languages, for which we study the main confusions; and (4) unbalanced classes, for which we propose a threshold-based strategy to favor classes with less data. We have also studied the use of Wikipedia and the addition of new tweets to increase the training dataset. Additionally, we tested our system on the Bergsma corpus, a collection of tweets in nine languages, focusing on confusable languages written in the Cyrillic, Arabic, and Devanagari alphabets. To our knowledge, we obtained the best results published on the TweetLID dataset and results in line with the best published on the Bergsma dataset.
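The multi-label SVM setup can be sketched with a one-vs-rest linear SVM over character n-grams, which is a standard formulation for language identification; the five toy sentences, the two-language label set, and the n-gram range are illustrative, not the paper's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = ["hola como estas amigo", "hello how are you friend",
         "gracias por todo", "thank you very much",
         "hola hello amigo friend"]          # last tweet is bilingual
labels = [{"es"}, {"en"}, {"es"}, {"en"}, {"es", "en"}]

# Multi-label targets as a binary indicator matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Character n-grams are robust to Twitter's noisy tokens.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
clf = OneVsRestClassifier(LinearSVC()).fit(vec.fit_transform(texts), Y)

# Each language's classifier fires independently, so a tweet can get
# zero, one, or several language labels.
pred = mlb.inverse_transform(clf.predict(vec.transform(["hola amigo"])))
```

The paper's threshold-based strategy for unbalanced classes would replace the default zero decision threshold of each per-language classifier with a class-specific one.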

13.
Twitter is a vibrant platform offering a quick and effective way to analyze users' perceptions of activities on social media. Many researchers and industry experts have turned to Twitter sentiment analysis to understand stakeholder groups. Sentiment analysis requires advanced approaches that combine data-level sentiment processing with various machine learning tools. This paper assesses sentiment analysis, using Naive Bayes and Support Vector Machine (SVM) classifiers, across multiple fields that influence people in real time, focusing on sentiment techniques over tweet-behaviour datasets in spheres such as healthcare and behaviour estimation. The results explore and validate statistical machine learning classifiers, reporting the accuracy percentages attained for positive, negative, and neutral tweets. In this work, we used a Twitter Application Programming Interface (API) account and implemented the sentiment analysis approach in Python for the computational measurement of users' perceptions, extracting a massive number of tweets and providing market value to the Twitter account owner. To distinguish the results in terms of performance evaluation, an error analysis examines the needs of various stakeholders, including social media analytics researchers, Natural Language Processing (NLP) developers, engineering managers, and experts involved in decision making.

14.
Deniz Kılınç. Software, 2019, 49(9): 1352-1364
Many data sources produce large volumes of data, and this Big Data nature requires new distributed processing approaches to extract valuable information. Real-time sentiment analysis is one of the most demanding research areas and requires powerful Big Data analytics tools such as Spark. Prior literature surveys have shown that, although there are many conventional sentiment analysis studies, only a few realize sentiment analysis in real time. One major factor affecting the quality of real-time sentiment analysis is confidence in the generated data; in clearer terms, it is a valuable research question whether the account that generates a sentiment is genuine. Since data generated by fake personalities may decrease the accuracy of the outcome, an intelligent service that can vet the source of the data is a key part of the analysis. In this context, we include a fake-account detection service in the proposed framework. Both the sentiment analysis and fake-account detection systems are trained and tested using the Naïve Bayes model from Apache Spark's machine learning library. The developed system consists of four integrated software components: (i) a machine learning and streaming service for sentiment prediction, (ii) a Twitter streaming service to retrieve tweets, (iii) a Twitter fake-account detection service to assess the owner of each retrieved tweet, and (iv) a real-time reporting and dashboard component to visualize the sentiment analysis results. The sentiment classification performance of the system is 86.77% in offline mode and 80.93% in real-time mode.

15.
Twitter and Reddit are two of the most popular social media sites in use today. In this paper, we study the use of machine learning and WordNet-based classifiers to generate an interest profile from a user's tweets and use it to recommend loosely related Reddit threads the reader is most likely to find interesting. We introduce a genre classification algorithm that uses a similarity measure derived from the WordNet lexical database for English to assign genres to nouns in tweets. The proposed algorithm generates a user's interest profile from their tweets based on a reference taxonomy of genres derived from the genre-tagged Brown Corpus, augmented with a technology genre. The top K genres of a user's interest profile can then be used to recommend subreddit articles in those genres. Experiments on real-life test cases collected from Twitter compare genre classification performance using the WordNet classifier and machine learning classifiers such as SVM, Random Forests, and an ensemble of Bayesian classifiers. Empirically, the two approaches yield similar results given a sufficient number of tweets, suggesting that machine learning algorithms as well as the WordNet ontology are viable tools for building a recommendation engine based on genre classification. One advantage of the WordNet approach is its simplicity: no learning is required. However, the WordNet classifier tends to have poor precision for users with very few tweets.

16.
Assessing the mechanisms underlying the virality of social network posts is of great value for many activities, such as advertising and viral marketing, influencing and promotion, and early monitoring and emergency response. Among social networks, Twitter.com is one of the most effective at propagating information in real time, and the propagation effectiveness of a post (i.e., tweet) is related to the number of times it has been retweeted. Different models have been proposed in the literature to understand a tweet's retweet proneness (its tendency or inclination to be retweeted). This paper goes a step further: several features extracted from Twitter data are analyzed to create predictive models of the degree of retweeting, i.e., the number of retweets a given tweet may get from the social network. We propose using classification trees with a recursive partitioning procedure for prediction, and compare the results, in terms of accuracy and processing time, with other methods. The Twitter data employed for the study were collected over the last 18 months using the Twitter Vigilance study and research platform of the DISIT Lab. The work was developed in the context of the European Commission RESOLUTE H2020 smart city project, in which the capacity to communicate information is fundamental for advertising, promoting civil protection alerts, and similar purposes.
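Retweet-count prediction with a recursive-partitioning tree can be sketched as a small regression example. The features (follower count, hashtag count, media flag) and the toy generative rule below are invented stand-ins for the features the paper extracts from Twitter data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
followers = rng.randint(10, 100000, 200)
hashtags = rng.randint(0, 5, 200)
has_media = rng.randint(0, 2, 200)
X = np.column_stack([followers, hashtags, has_media])

# Toy generative rule: retweets grow with follower count and media presence.
y = followers * 0.001 * (1 + has_media) + rng.poisson(2, 200)

# Recursive partitioning: the tree repeatedly splits the feature space and
# predicts the mean retweet count within each partition.
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
pred = model.predict([[50000, 2, 1]])   # expected retweets for a draft tweet
```

A shallow tree like this is fast to train and evaluate, which is relevant to the paper's comparison of accuracy against processing time.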

17.
Twitter is a worldwide social media platform where millions of people frequently express ideas and opinions on any topic. This widespread success makes the analysis of tweets an interesting and potentially lucrative task, since tweets are rarely objective and are a natural target for large-scale analysis. In this paper, we explore the idea of integrating two fundamental aspects of a tweet, its textual content and its underlying structural information, when addressing the tweet categorization task. Thus, we analyze not only the textual content of tweets but also the structural information provided by the relationships between tweets and users, and we propose different methods for effectively combining both kinds of feature models extracted from these knowledge sources. To test our approach, we address the specific task of determining the political opinion of Twitter users within their political context, observing that our most refined knowledge-integration approach performs remarkably better (about 5 points higher) than the classic text-based model.

18.
To improve a tweet on Twitter, we would like to estimate the effectiveness of a draft before it is sent. The total number of retweets of a tweet can be considered a measure of its effectiveness. To estimate the number of retweets for an author, we propose a procedure for learning a personalized model from his or her past tweets. We propose three types of new features based on tweet content: Entity, Pair, and Topic. Empirical results from seven authors indicate that the Pair and Topic features yield statistically significant improvements in the correlation coefficient between the estimates and the actual numbers of retweets. We study different combinations of the three feature types, and many combinations significantly improve the results further.

19.
Recently, Twitter has become a prominent part of social protest movement communication. This study examines Twitter as a new kind of citizen journalism platform emerging in aggregate in such "crisis" situations, through a case study of the use of Twitter in the 2011 Wisconsin labor protests. A corpus of more than 775,000 tweets tagged with #wiunion during the first 3 weeks of the protests provides the source for the analyses. Findings suggest significant differences between users who tweet via mobile devices, and thus may be present at protests, and those who tweet from computers. Mobile users post fewer URLs overall; however, when they do, they are more likely to link to traditional news sources and to provide additional hashtags for context. Over time, all link posting declines as users become better able to convey first-hand information. Notably, the results of most analyses change significantly when restricted to original tweets only, rather than including retweets.

20.
潘文雯, 赵洲, 俞俊, 吴飞. 《自动化学报》, 2021, 47(11): 2547-2556
Retweet prediction is a challenging problem on social media sites (SMS). This paper studies image retweet prediction on SMS, i.e., predicting the image-sharing behavior of users who retweet image tweets. Unlike existing studies, we first propose a heterogeneous Image Retweet Modeling (IRM) network that exploits three kinds of information: the content of image tweets the user previously retweeted, the user's subsequent connections on the SMS, and the preferences of the retweeted users. On this basis, we propose a text-guided multimodal neural network and construct a novel multi-faceted attention ranking network learning framework to learn joint image-tweet representations and user-preference representations for the prediction task. Extensive experiments on a large-scale Twitter dataset show that our method achieves better results than existing solutions.
