Similar Documents
 Found 20 similar documents (search time: 828 ms)
1.
Online news has become one of the major channels for Internet users to get news. News websites are overwhelmed daily with a flood of news articles: huge amounts of online news are generated and updated every day, and the processing and analysis of this large corpus of data is an important challenge. This challenge needs to be tackled using big data techniques, which process large volumes of data within limited run times. Also, since we are heading into a social-media data explosion, techniques such as text mining and social network analysis need to be taken seriously into consideration. In this work we focus on one of the most common daily activities: web news reading. News websites produce thousands of articles covering a wide spectrum of topics or categories, which can be considered a big data problem. In order to extract useful information, these news articles need to be processed using big data techniques. In this context, we present an approach for classifying huge amounts of different news articles into various categories (topic areas) based on the text content of the articles. Since these categories are constantly updated with new articles, our approach is based on Evolving Fuzzy Systems (EFS). The EFS can update, in real time, the model that describes a category according to the changes in the content of the corresponding articles. The novelty of the proposed system lies in the treatment of the web news articles used by these systems and in the implementation and adjustment of the systems for this task. Our proposal not only classifies news articles, but also creates human-interpretable models of the different categories. This approach has been successfully tested using real online news.
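The evolving-classification idea above can be illustrated with a much simpler incremental scheme. The sketch below is my own illustration, not the paper's EFS: each category keeps a term-frequency centroid that is updated online as new labelled articles arrive, so the category model tracks drifting content. All names and structure are hypothetical.

```python
from collections import Counter, defaultdict
import math

class IncrementalNewsClassifier:
    """Online centroid classifier: each category keeps a term-frequency
    centroid updated as new labelled articles arrive, loosely mirroring
    how an evolving model follows changing category content."""

    def __init__(self):
        self.centroids = defaultdict(Counter)  # category -> term counts
        self.counts = defaultdict(int)         # articles seen per category

    def update(self, category, tokens):
        # Fold the new article's terms into the category model.
        self.centroids[category].update(tokens)
        self.counts[category] += 1

    def classify(self, tokens):
        doc = Counter(tokens)

        def cosine(a, b):
            dot = sum(a[t] * b[t] for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        # Assign the article to the most similar category centroid.
        return max(self.centroids, key=lambda c: cosine(doc, self.centroids[c]))
```

Unlike a genuine EFS, this sketch has no fuzzy rules and never forgets old content, but it shows the core loop: classify, then refine the category model in place.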

2.
In an uncertain business environment, competitive intelligence requires peripheral vision to scan for and identify weak signals that can affect the future business environment. Weak signals are defined as imprecise and early indicators of impending important events or trends, which are considered key to formulating new potential business items. However, existing methods for discovering weak signals rely on the knowledge and expertise of experts, whose services are not widely available and tend to be costly; different experts may even produce different analysis results. Therefore, this paper presents a quantitative method that identifies weak signal topics by exploiting keyword-based text mining. The proposed method is illustrated using Web news articles related to solar cells. As a supportive tool for the expert-based approach, this method can be incorporated into long-term business planning processes to assist experts in identifying potential business items.
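One simple quantitative reading of "weak signal" is a keyword whose overall visibility is still low but whose usage is growing fast across time periods. The sketch below is a hypothetical illustration of that idea only; the thresholds and function names are mine, not the paper's method.

```python
def weak_signal_keywords(period_counts, freq_ceiling=5, growth_floor=1.5):
    """Flag keywords with low total frequency but fast growth across
    periods -- an illustrative 'low visibility, high growth' filter.

    period_counts: list of {keyword: count} dicts, one per time period.
    """
    keywords = set().union(*period_counts)
    signals = []
    for kw in keywords:
        series = [p.get(kw, 0) for p in period_counts]
        total = sum(series)
        if total == 0 or series[0] == 0:
            continue  # cannot compute a growth ratio from zero
        growth = series[-1] / series[0]
        # Low overall visibility plus fast growth => candidate weak signal.
        if total <= freq_ceiling and growth >= growth_floor:
            signals.append(kw)
    return sorted(signals)
```

A frequent, flat keyword ("silicon") is treated as an established topic; a rare but accelerating one ("perovskite") is surfaced for expert review.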

3.
Personalizing news content requires choosing the appropriate depth of personalization and assessing the extent to which readers’ explicit expressions of interest in general and specific news topics can be used as the basis for personalization. A preliminary survey examined 117 respondents’ attitudes towards news content personalization and their interest in various news topics and subtopics. A second survey examined 23 participants’ declared and actual interests. Participants preferred personalization based on general news topics. Declared interest in general news topics adequately predicted actual interest in some topics, while in others users’ interests differed between general news topics and subtopics. The variance in interest in items also differed among topics. Thus, different personalization methods should be used for different topics. For some, such as ‘Sports’, users show either high interest or no interest at all; in the latter case most articles related to the topic should be removed, with the exception of items covering unique events that may raise general interest. In other topics, such as ‘Science & Technology’, most users are interested in important articles even if they are not interested in the general news topic; here, the filtering technique should identify the important articles and present them to all readers. The results can be used to develop effective and simple personalization mechanisms which can be applied to the personalization of news, as well as to other domains.

4.
User profiling is an important step in solving the problem of personalized news recommendation. Traditional user profiling techniques often construct profiles of users based on static historical data accessed by users. However, due to the frequent updating of the news repository, it is possible that a user’s fine-grained reading preference evolves over time while his/her long-term interest remains stable. Therefore, it is imperative to reason about such preference evolution for user profiling in news recommenders. Besides, in content-based news recommenders, a user’s preference tends to be stable due to the mechanism of selecting content-wise similar news articles with respect to the user’s profile. To activate users’ reading motivations, a successful recommender needs to introduce “somewhat novel” articles to users. In this paper, we first provide an experimental study on the evolution of user interests in real-world news recommender systems, and then propose a novel recommendation approach in which the long-term and short-term reading preferences of users are seamlessly integrated when recommending news items. Given a hierarchy of newly published news articles, news groups that a user might prefer are differentiated using the long-term profile, and then, in each selected news group, a list of news items is chosen as the recommendation candidates based on the short-term user profile. We further propose to select news items from the user–item affinity graph using an absorbing random walk model to increase the diversity of the recommended news list. Extensive empirical experiments on a collection of news data obtained from various popular news websites demonstrate the effectiveness of our method.
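The paper diversifies the final list with an absorbing random walk on the user–item graph. As a lighter-weight stand-in for the same goal (relevant yet non-redundant recommendations), the sketch below uses greedy maximal-marginal-relevance re-ranking; this is explicitly a different, simpler technique, and all names are illustrative.

```python
def diversify(candidates, sim, k=3, lam=0.7):
    """Greedy MMR re-ranking: repeatedly pick the item that balances
    relevance against redundancy with already-selected items. A simple
    stand-in for absorbing-random-walk diversification.

    candidates: {item: relevance score}; sim(a, b) -> similarity in [0, 1].
    """
    selected = []
    pool = dict(candidates)
    while pool and len(selected) < k:
        def mmr(item):
            redundancy = max((sim(item, s) for s in selected), default=0.0)
            return lam * pool[item] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        del pool[best]
    return selected
```

With two near-identical top stories, the second pick skips the duplicate in favour of a less similar but still relevant item.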

5.
RSS news articles that are either partially or completely duplicated in content are easily found on the Internet these days, forcing Web users to sort through the articles to identify non-redundant information. This manual filtering process is time-consuming and tedious. In this paper, we present a new filtering and clustering approach, called FICUS, which starts by identifying and eliminating redundant RSS news articles using a fuzzy set information retrieval approach and then clusters the remaining non-redundant RSS news articles according to their degrees of resemblance. FICUS uses a tree hierarchy to organize clusters of RSS news articles. The contents of the respective clusters are captured by representative keywords from the RSS news articles in the clusters, so that searching for and retrieval of similar RSS news articles is fast and efficient. FICUS is simple, since it uses pre-defined word-correlation factors to determine related (words in) RSS news articles and filter redundant ones, and it is supported by well-known yet simple mathematical models, such as the standard deviation, the vector space model, and probability theory, to generate clusters of non-redundant RSS news articles. Experiments performed on (test sets of) RSS news articles on various topics, downloaded from different online sources, verify the accuracy of FICUS in eliminating redundant RSS news articles, clustering similar RSS news articles together, and segregating different RSS news articles in terms of their contents. In addition, further empirical studies show that FICUS outperforms well-known approaches adopted for clustering RSS news articles.
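The redundancy-filtering step can be illustrated with a plain resemblance measure. FICUS itself uses pre-computed word-correlation factors; the sketch below substitutes token-set Jaccard overlap as a simpler stand-in, with an illustrative threshold.

```python
def filter_redundant(articles, threshold=0.8):
    """Keep an article only if its token-set Jaccard resemblance to every
    already-kept article stays below the threshold. Jaccard overlap is a
    stand-in here for FICUS's word-correlation factors.

    articles: list of (title, token_list) pairs, in feed order.
    """
    kept = []
    for title, tokens in articles:
        words = set(tokens)

        def jaccard(other):
            union = words | other
            return len(words & other) / len(union) if union else 1.0

        # An article survives only if it resembles no kept article too much.
        if all(jaccard(set(t)) < threshold for _, t in kept):
            kept.append((title, tokens))
    return [title for title, _ in kept]
```

Exact and near-duplicates of an already-seen feed item are dropped, while unrelated items pass through.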

6.
Online news articles, as a new format of press release, have sprung up on the Internet. With their convenience and recency, more and more people prefer to read news online instead of in paper-format press releases. However, a gigantic number of news events might be released at a rate of hundreds, even thousands, per hour. A challenging problem is how to efficiently select specific news articles from a large corpus of newly published press releases to recommend to individual readers, where the selected news items should match the reader’s reading preference as much as possible. This issue is known as personalized news recommendation. Recently, personalized news recommendation has become a promising research direction, as the Internet provides fast access to real-time information from multiple sources around the world. Existing personalized news recommendation systems strive to adapt their services to individual users by virtue of both user and news content information. A variety of techniques have been proposed to tackle personalized news recommendation, including content-based systems, collaborative filtering systems, and hybrids of the two. In this paper, we provide a comprehensive investigation of existing personalized news recommenders. We discuss several essential issues underlying the problem of personalized news recommendation, and explore possible solutions for performance improvement. Further, we provide an empirical study on a collection of news articles obtained from various news websites, and evaluate the effect of different factors on personalized news recommendation. We hope our discussion and exploration will provide insights for researchers who are interested in personalized news recommendation.

7.
This research addresses the problem of analyzing the temporal dynamics of business organizations. In particular, we concentrate on inferring related businesses: are there groups of companies that are highly correlated through some measurement (metric)? We argue that business relationships derived from general literature (i.e., newspaper articles, news items, etc.) may help us create a network of related companies (business networks). On the other hand, the relative movement of stock prices can give us an indication of related companies (asset graphs). We also expect to see some relationships between these two kinds of networks. We adapt the asset graph construction approach from the literature for our asset graph implementations, and then define our methodology for business network construction. Finally, an introduction to the exploration of some relationships between the asset graphs and business networks is presented.
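A text-derived business network of the kind described above can be sketched as a co-mention graph: companies become nodes, and an edge is kept when two firms appear together in enough news items. This is an illustrative toy version, not the paper's methodology, and the threshold is hypothetical.

```python
from itertools import combinations
from collections import Counter

def business_network(articles, min_cooccur=2):
    """Build an undirected company network by counting co-mentions in
    news items; an edge survives when two firms co-occur in at least
    `min_cooccur` articles.

    articles: list of lists of company identifiers mentioned per article.
    """
    edges = Counter()
    for companies in articles:
        # Count each unordered company pair once per article.
        for a, b in combinations(sorted(set(companies)), 2):
            edges[(a, b)] += 1
    return {pair: n for pair, n in edges.items() if n >= min_cooccur}
```

The resulting weighted edge list can then be compared against an asset graph built from correlated stock-price movements.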

8.
There are large and growing textual corpora in which people express contrastive opinions about the same topic. This has led to an increasing number of studies on contrastive opinion mining. However, there are several notable issues with the existing studies. They mostly focus on mining contrastive opinions from multiple data collections, which need to be separated into their respective collections beforehand. In addition, existing models are opaque in terms of the relationship between the topics that are extracted and the sentences in the corpus which express those topics; this opacity does not help us understand the opinions expressed in the corpus. Finally, contrastive opinion is mostly analysed qualitatively rather than quantitatively. This paper addresses these matters and proposes a novel unified latent variable model (contraLDA), which mines contrastive opinions from both single and multiple data collections, extracts the sentences that project the contrastive opinion, and measures the strength of opinion contrastiveness towards the extracted topics. Experimental results show the effectiveness of our model in mining contrastive opinions, outperforming our baselines in extracting coherent and informative sentiment-bearing topics. We further show the accuracy of our model in classifying the topics and sentiments of textual data, comparing our results to five strong baselines.

9.
Recommending online news articles has become a promising research direction as the Internet provides fast access to real-time information from multiple sources around the world. Many online readers have their own reading preferences for news articles; however, a group of users might be interested in similar fascinating topics. It would be helpful to take the individual and group reading behavior into consideration simultaneously when recommending news items to online users. In this paper, we propose PENETRATE, a novel PErsonalized NEws recommendaTion framework using ensemble hieRArchical clusTEring to provide attractive recommendation results. Specifically, given a set of online readers, our approach initially separates readers into different groups based on their reading histories, where each user might be assigned to several groups. Once a collection of newly published news items is provided, we can easily construct a news hierarchy for each user group. When recommending news articles to a given user, the hierarchies of the multiple user groups that the user belongs to are merged into an optimal one. Finally, a list of news articles is selected from this optimal hierarchy, based on the user’s personalized information, as the recommendation result. Extensive empirical experiments on a set of news articles collected from various popular news websites demonstrate the efficacy of our proposed approach.

10.
Story clustering is a critical step for news retrieval, topic mining, and summarization. Nonetheless, the task remains highly challenging owing to the fact that news topics exhibit clusters of varying densities, shapes, and sizes. Traditional algorithms are found to be ineffective in mining these types of clusters. This paper offers a new perspective by exploring the pairwise visual cues deriving from near-duplicate keyframes (NDK) for constraint-based clustering. We propose a constraint-driven co-clustering algorithm (CCC), which utilizes the near-duplicate constraints built on top of text, to mine topic-related stories and the outliers. With CCC, the duality between stories and their underlying multimodal features is exploited to transform features in low-dimensional space with normalized cut. The visual constraints are added directly to this new space, while the traditional DBSCAN is revisited to capitalize on the availability of constraints and the reduced dimensional space. We modify DBSCAN with two new characteristics for story clustering: 1) constraint-based centroid selection and 2) adaptive radius. Experiments on TRECVID-2004 corpus demonstrate that CCC with visual constraints is more capable of mining news topics of varying densities, shapes and sizes, compared with traditional k-means, DBSCAN, and spectral co-clustering algorithms.
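To make the constraint idea concrete, here is a minimal DBSCAN variant in which must-link constraints (e.g. two stories sharing near-duplicate keyframes) are honoured by treating constrained partners as direct neighbours. This is a toy analogue of constraint-driven clustering, not the paper's CCC algorithm, which instead modifies centroid selection and the radius.

```python
def dbscan_with_must_links(points, dist, eps, min_pts, must_link=()):
    """DBSCAN where must-link pairs count as neighbours, pulling
    constrained points into the same cluster. Returns one label per
    point; -1 marks noise."""
    n = len(points)
    links = {i: set() for i in range(n)}
    for a, b in must_link:
        links[a].add(b)
        links[b].add(a)

    def neighbours(i):
        near = {j for j in range(n) if dist(points[i], points[j]) <= eps}
        return near | links[i]  # constraints act as extra adjacency

    labels = [None] * n  # None = unvisited, -1 = noise
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        labels[i] = cid
        queue = list(seeds - {i})
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid  # noise reclaimed as border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            more = neighbours(j)
            if len(more) >= min_pts:  # j is a core point: expand from it
                queue.extend(more - set(queue))
        cid += 1
    return labels
```

Without constraints, two well-separated groups form two clusters; a single must-link across the gap merges them, which is the behaviour the visual constraints induce on text-based story clusters.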

11.
Firms perform environmental surveillance to identify important events and their developments. To alleviate the stringent information processing and analysis requirements, automated methods are needed to discover, from online news articles, distinct episodes (stages) of an important event. We propose a time-adjoining frequent itemset-based method that incorporates essential temporal characteristics of news articles for event episode discovery. With a corpus of 1468 news articles pertaining to 248 episodes of 53 different events, we empirically evaluate the proposed method and include several prevalent techniques as benchmarks. The results show that our method outperforms the benchmark techniques consistently and significantly, attaining cluster recall, cluster precision, and F-measure values of 0.706, 0.593, and 0.584, respectively.

12.
Many daily activities present information in the form of a stream of text, and often people can benefit from additional information on the topic discussed. TV broadcast news can be treated as one such stream of text; in this paper we discuss finding news articles on the web that are relevant to news currently being broadcast. We evaluated a variety of algorithms for this problem, looking at the impact of inverse document frequency, stemming, compounds, history, and query length on the relevance and coverage of news articles returned in real time during a broadcast. We also evaluated several postprocessing techniques for improving the precision, including reranking using additional terms, reranking by document similarity, and filtering on document similarity. For the best algorithm, 84–91% of the articles found were relevant, with at least 64% of the articles being on the exact topic of the broadcast. In addition, a relevant article was found for at least 70% of the topics.
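The core of such a system is turning the most recent broadcast text into a short search query. A common way to do this, sketched below, is to rank terms in the recent window by tf-idf and keep the top few; the function names, window handling, and query length here are illustrative assumptions, not the paper's tuned configuration.

```python
import math
from collections import Counter

def build_query(window_tokens, doc_freq, n_docs, query_len=5):
    """Rank terms from the recent broadcast window by tf-idf and return
    the top `query_len` as a web search query.

    window_tokens: tokens from the last few spoken sentences.
    doc_freq: {term: number of corpus documents containing the term}.
    """
    tf = Counter(window_tokens)

    def tfidf(term):
        # Smoothed idf down-weights ubiquitous words like "the".
        idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    ranked = sorted(tf, key=tfidf, reverse=True)
    return ranked[:query_len]
```

Rare content words dominate the query even when common function words occur more often in the window, which is exactly the effect the idf factor studied in the paper provides.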

13.
Scalable algorithms for association mining
Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets, and then forming conditional implication rules among them. We present efficient algorithms for the discovery of frequent itemsets, which forms the compute-intensive phase of the task. The algorithms utilize the structural properties of frequent itemsets to facilitate fast discovery. The items are organized into a subset lattice search space, which is decomposed into small independent chunks or sublattices that can be solved in memory. Efficient lattice traversal techniques are presented which quickly identify all the long frequent itemsets and their subsets if required. We also present the effect of using different database layout schemes combined with the proposed decomposition and traversal techniques. We experimentally compare the new algorithms against the previous approaches, obtaining improvements of more than an order of magnitude for our test databases.
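The vertical-layout, lattice-decomposition style of mining described above can be sketched with a small Eclat-like routine: each itemset carries the set of transaction ids (its tidset) containing it, and extensions within a common prefix class are formed by intersecting tidsets. This is a minimal illustration of the general technique, not the paper's specific algorithms.

```python
def eclat(transactions, min_support):
    """Vertical frequent-itemset mining: itemsets are grown depth-first
    within prefix classes, and support is computed by tidset intersection.

    transactions: list of item collections; returns {itemset: support}.
    """
    # Build tidsets for single items (the vertical database layout).
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    items = sorted(i for i, t in tidsets.items() if len(t) >= min_support)
    frequent = {}

    def recurse(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Extend within the prefix class by intersecting tidsets.
            extensions = []
            for other, otids in candidates[i + 1:]:
                inter = tids & otids
                if len(inter) >= min_support:
                    extensions.append((other, inter))
            recurse(itemset, extensions)

    recurse((), [(i, tidsets[i]) for i in items])
    return frequent
```

Each prefix class is an independent sub-problem, which is what makes the sublattice decomposition in the paper amenable to in-memory processing.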

14.
Web news provides a quick and convenient means to create collections of large documents. The creation of a web news corpus has typically required the construction of a set of HTML parsing rules to identify content text. In general, these parsing rules are written manually and treat different web pages differently. We address this issue and propose a news content recognition algorithm that is language- and layout-independent. Our method first scans a given HTML document and roughly localizes a set of candidate news areas. Next, we apply a designed scoring function to rank the best content. To validate this approach, we evaluate the system’s performance using 1092 items of multilingual web news data covering 17 global regions and 11 distinct languages. We compare these data with nine published content extraction systems using standard settings. The results of this empirical study show that our method outperforms the second-best approach (Boilerpipe) by 6.04% and 10.79% with regard to the relative micro and macro F-measures, respectively. We also applied our system to monitor online RSS news distribution: it collected 0.4 million news articles from 200 RSS channels in 20 days, and a sample quality test shows that our method achieved 93% extraction accuracy for large news streams.
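The candidate-then-score pattern above can be illustrated with a crude text-density heuristic: article bodies carry lots of text and few links, while navigation and footers are mostly link text. The scoring function below is a hypothetical stand-in, not the paper's designed function.

```python
def pick_content_block(blocks):
    """Pick the page region most likely to be article content by a simple
    text-density score: total text length discounted by the fraction of
    that text inside links.

    blocks: list of (block_id, text_chars, link_text_chars).
    """
    def score(block):
        _, text, link_text = block
        if text == 0:
            return 0.0
        # Long, link-poor blocks score highest.
        return text * (1 - link_text / text)

    return max(blocks, key=score)[0]
```

A navigation panel (nearly all link text) and a footer (short) both score far below the article body, regardless of page language or layout, which is the intuition behind language- and layout-independent extraction.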

15.
We propose a novel four-step hybrid approach for the retrieval and composition of video newscasts based on information contained in different metadata sets. In the first step, we use conventional retrieval techniques to isolate video segments from the data universe using segment metadata. In the second step, retrieved segments are clustered into potential news items using a dynamic technique sensitive to the information contained in the segments. In the third step, we apply a transitive search technique to increase the recall of the retrieval system. In the final step, we increase recall performance by identifying segments possessing creation-time relationships. A quantitative analysis of the performance of the process on a newscast composition shows an increase in recall of 59 percent over the conventional keyword-based search technique used in the first step.

16.
The automatic extraction and recognition of news captions and annotations can be of great help in locating topics of interest in digital news video libraries. To achieve this goal, we present a technique, called Video OCR (Optical Character Reader), which detects, extracts, and reads text areas in digital video data. In this paper, we address the problems involved, describe the method by which Video OCR operates, and suggest applications for its use in digital news archives. To solve two problems of character recognition for video, low-resolution characters and extremely complex backgrounds, we apply an interpolation filter, multi-frame integration, and character extraction filters. Character segmentation is performed by a recognition-based segmentation method, and intermediate character recognition results are used to improve the segmentation. We also include a method for locating text areas using text-like properties, and a language-based postprocessing technique to increase word recognition rates. The overall recognition results are satisfactory for use in news indexing. Performing Video OCR on news video and combining its results with other video understanding techniques will improve the overall understanding of the news video content.

17.
18.
This paper presents integrated information mining techniques for broadcast TV news, utilizing techniques from the fields of acoustic, image, and video analysis to obtain information on news story titles, newscasters, and scene identification. The goal is to construct a compact yet meaningful abstraction of broadcast TV news, allowing users to browse through large amounts of data in a non-linear fashion with flexibility and efficiency. By adding acoustic analysis, a news program can be partitioned into news and commercial clips, with 90% accuracy on a data set of 400 h of TV news recorded off the air from July 2005 to August 2006. By applying speaker identification and/or image detection techniques, each news story can be segmented with a better accuracy of 95.92%. On-screen captions or subtitles are recognized by OCR techniques to produce the text title of each news story. The extracted title words can be used to link to, or navigate, related news content on the WWW. In combination with facial and scene analysis and recognition techniques, the OCR results can provide users with multimodal queries on specific news stories. Experimental results are presented and discussed covering system reliability, performance evaluation, and comparison.

19.
Little research has addressed news-driven public opinion on major flood-control events. This study examines the opinion formed by the news reports and public comments on the Toutiao (今日头条) platform during the passage of Yangtze River Flood No. 5 through Chongqing. A corpus built from the news reports and public comments is explored with natural language processing methods such as Chinese word segmentation and LDA topic modelling, and the topic-mining performance of the LDA model is improved by adding flood-control prior knowledge. Opinion development is examined across three stages (onset, peak, and decline), and hotspots, defined along the two dimensions of heat and topic, are analysed from the perspectives of both the news media and the public, yielding an analysis framework for mining news-driven public opinion on major flood-control events. The results show that such opinion is strongly time-dependent: heat and topics correlate closely with time, and the evolution of opinion hotspots largely tracks the progression of the flood itself. The findings can help government departments grasp opinion-evolution paths, issue early warnings, and retain the initiative in public discourse.

20.
To increase the commercial value and accessibility of their pages, most content sites tend to publish pages with intra-site redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intra-page informative structure of news Web sites in order to find and eliminate redundant information. Note that the intra-page informative structure is a subset of the original Web page, composed of a set of fine-grained and informative blocks; for pages in a news Web site, it contains only anchors linking to news pages or the bodies of news articles. We propose an intra-page informative structure mining system called WISDOM (Web Intra-page Informative Structure mining based on the Document Object Model) which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is built by expanding the set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validates WISDOM's practical applicability.
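A crude proxy for the information-theoretic scoring in WISDOM is recurrence: intra-site boilerplate (navigation, ads, copyright) tends to repeat verbatim across pages of a site, while informative blocks are page-specific. The sketch below is only that proxy, with an illustrative cutoff, not WISDOM's DOM-tree algorithm.

```python
from collections import Counter

def informative_blocks(pages, redundancy_ratio=0.5):
    """Keep only page blocks that do NOT recur across most pages of the
    site; recurrence frequency stands in for an information score.

    pages: list of sets of block fingerprints (e.g. hashed block text).
    Returns, per page, a sorted list of the retained fingerprints.
    """
    # How many pages each block fingerprint appears on.
    seen_on = Counter(b for page in pages for b in set(page))
    cutoff = redundancy_ratio * len(pages)
    return [sorted(b for b in page if seen_on[b] <= cutoff) for page in pages]
```

A navigation block present on every page is stripped everywhere, while each page's unique story block survives, mirroring the goal of isolating the intra-page informative structure.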


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号