Similar Documents
20 similar documents found (search time: 15 ms)
1.
News recommendation and user interaction are important features of many Web-based news services. The former helps users identify the most relevant news for further reading; the latter enables collaborative information sharing among users through the comments they post following news articles. This research marries these two features in an adaptive recommender system that uses reader comments to refine news recommendations as the discussion topic evolves. This turns the traditional "push-data" style of news recommendation into a "discussion moderator" that can intelligently assist online forums. In addition, to alleviate the problem of recommending essentially identical articles, the relationship (duplicate, generalization, or specialization) between recommended news articles and the original posting is investigated. Our experiments indicate that the proposed solutions provide an improved news recommendation service for forum-based social media.
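A minimal sketch of the comment-driven refinement idea described in this abstract (the tokenizer, decay weighting, and all names are illustrative assumptions, not the authors' implementation): keep an evolving topic profile, fold each reader comment into it with a decay on older terms, and re-score candidate articles against the updated profile.

```python
from collections import Counter

def tokenize(text):
    # crude whitespace tokenizer; a real system would normalize far more
    return [w.lower().strip(".,") for w in text.split()]

def update_topic(topic, comment, decay=0.8):
    # decay the old topic weights, then add the new comment's terms
    updated = Counter({t: w * decay for t, w in topic.items()})
    updated.update(tokenize(comment))
    return updated

def score(topic, article):
    # dot product of topic weights with the article's terms
    return sum(topic[t] for t in tokenize(article))

topic = Counter(tokenize("earthquake relief effort"))
topic = update_topic(topic, "donations for earthquake victims")
articles = ["earthquake donations surge", "local sports results"]
best = max(articles, key=lambda a: score(topic, a))
```

As the comment thread drifts, terms from new comments outweigh the decayed original posting, so the recommended articles follow the discussion rather than the headline.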

2.
Huge numbers of documents are generated on the Web, especially news articles and social media posts. Effectively organizing these evolving documents so that readers can easily browse or search them is a challenging task. Existing methods include classification, clustering, and chronological or geographical ordering, each of which provides only a partial view of the relations among news articles. To better exploit cross-document relations in organizing news articles, in this paper we propose a novel approach that organizes news archives by their near-duplicate relations. First, we use a sentence-level, statistics-based approach to near-duplicate copy detection that is language independent, simple, and effective; it is attractive because purely content-based approaches are usually time consuming and not robust to term substitutions. Second, by extracting cross-document relations into a block-sharing graph, we derive a near-duplicate clustering in which users can easily browse the archive and spot unnecessary repetition among documents. The experimental results show high efficiency and good accuracy of the proposed approach in detecting and clustering near-duplicate documents in news archives.
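A toy sketch of sentence-level near-duplicate detection in the spirit of this abstract (the hashing, sentence splitting, and threshold are assumptions, not the paper's method): fingerprint each sentence and compare documents by the fraction of shared fingerprints, which is cheap and insensitive to sentence reordering.

```python
import hashlib

def sentence_fingerprints(doc):
    # one fingerprint per (lowercased) sentence; whole-sentence hashing is
    # far cheaper to compare than full content similarity
    sents = [s.strip().lower() for s in doc.split(".") if s.strip()]
    return {hashlib.md5(s.encode()).hexdigest() for s in sents}

def near_duplicate(doc_a, doc_b, threshold=0.5):
    fa, fb = sentence_fingerprints(doc_a), sentence_fingerprints(doc_b)
    ratio = len(fa & fb) / max(1, min(len(fa), len(fb)))
    return ratio >= threshold, ratio

a = "Quake hits city. Rescue teams arrive. Markets dip."
b = "Quake hits city. Rescue teams arrive. Weather stays calm."
dup, ratio = near_duplicate(a, b)
```

Documents sharing enough sentence fingerprints can then be linked in a graph and clustered, as the block-sharing graph in the abstract suggests.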

3.
Many studies on developing technologies are published as articles, papers, or patents. We can analyze these documents to identify scientific and technological trends, and in this paper we consider document clustering as a method for such analysis. In general, documents are difficult to analyze directly because document data are not suited to statistical and machine learning methods, so we must first transform them into structured data using text mining techniques. The resulting structured data are very sparse and therefore hard to analyze. This study proposes a new method to overcome the sparsity problem in document clustering: a combined approach that uses dimension reduction and K-means clustering based on support vector clustering and the Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the University of California, Irvine machine learning repository. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
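A hedged sketch of the clustering core this abstract combines: plain K-means plus an average Silhouette score for judging cluster quality. The support-vector-clustering-based dimension reduction from the paper is not reproduced; the points below stand in for already-reduced document vectors.

```python
import math
import random

def mean(pts):
    return tuple(sum(x) / len(pts) for x in zip(*pts))

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centers[i]))].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def silhouette(clusters):
    # average silhouette over all points: (b - a) / max(a, b), where a is the
    # mean intra-cluster distance and b the mean distance to the nearest cluster
    scores = []
    for i, c in enumerate(clusters):
        for p in c:
            a = sum(math.dist(p, q) for q in c if q != p) / max(1, len(c) - 1)
            b = min(sum(math.dist(p, q) for q in o) / len(o)
                    for j, o in enumerate(clusters) if j != i and o)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = kmeans(points, 2)
quality = silhouette(clusters)
```

Running K-means for several values of k and keeping the one with the highest silhouette is a common way to use the measure the abstract mentions.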

4.
Online news has become one of the major channels through which Internet users get their news. News websites are overwhelmed daily with large numbers of articles; huge amounts of online news are generated and updated every day, and processing and analyzing this large corpus is an important challenge. It needs to be tackled with big data techniques that can process large volumes of data within limited run times, and since we are heading into a social-media data explosion, techniques such as text mining and social network analysis also need to be taken seriously. In this work we focus on one of the most common daily activities: web news reading. News websites produce thousands of articles covering a wide spectrum of topics or categories, which can be considered a big data problem. To extract useful information, these articles need to be processed with big data techniques. In this context, we present an approach for classifying huge numbers of news articles into categories (topic areas) based on their text content. Since these categories are constantly updated with new articles, our approach is based on Evolving Fuzzy Systems (EFS), which can update in real time the model that describes a category as the content of the corresponding articles changes. The novelty of the proposed system lies in how the web news articles are prepared for these systems and in the implementation and tuning of the systems for this task. Our proposal not only classifies news articles but also creates human-interpretable models of the different categories. The approach has been successfully tested on real online news.

5.
Online information is growing enormously day by day thanks to the World Wide Web. Search engines often provide users with abundant collections of articles, in particular news articles retrieved from different news sources reporting on the same event. In this work, we aim to produce high-quality multi-document news summaries by taking into account the generic components of a news story within a specific domain. We also present an effective method, named Genetic-Case Base Reasoning, to identify cross-document relations from un-annotated texts. Following that, we propose a new sentence scoring model based on fuzzy reasoning over the identified cross-document relations. The experimental findings show that the proposed approach performs better than conventional graph-based and cluster-based approaches.

6.
The synergistic application of CBR to IR
In this paper we discuss a hybrid approach combining Case-Based Reasoning (CBR) and Information Retrieval (IR) for the retrieval of full-text documents. Our hybrid CBR-IR approach takes as input a standard symbolic representation of a problem case and retrieves texts of relevant cases from a document collection dramatically larger than the case base available to the CBR system. Our system works by first performing a standard HYPO-style CBR analysis and then using the texts associated with certain important classes of cases found in this analysis to seed a modified version of INQUERY's relevance feedback mechanism, generating a query composed of individual terms or pairs of terms. Our approach provides two benefits: it extends the reach of CBR (for retrieval purposes) to much larger corpora, and it enables the injection of knowledge-based techniques into traditional IR. We describe our CBR-IR approach and report on ongoing experiments. This research was supported by NSF Grant no. EEC-9209623, State/Industry/University Cooperative Research on Intelligent Information Retrieval, Digital Equipment Corporation, and the National Center for Automated Information Research.

7.
8.
This paper provides a content analysis of studies in the field of cognition in e-learning published in five Social Sciences Citation Index (SSCI) journals (Computers and Education, British Journal of Educational Technology, Innovations in Education and Teaching International, Educational Technology Research & Development, and Journal of Computer Assisted Learning) from 2001 to 2005. Among the 1027 articles these journals published in that period, 444 were identified as related to cognition in e-learning. These articles were cross-analyzed by publication year, journal, research topic, and citation count. Furthermore, 16 highly cited articles across different topics were chosen for further analysis according to their research settings, participants, research design types, and research methods. The analysis of the 444 articles found that "Instructional Approaches," "Learning Environment," and "Metacognition" were the three most popular research topics, but the citation counts suggested that studies related to "Instructional Approaches," "Information Processing," and "Motivation" may have had a greater impact on subsequent research. Although questionnaires might still be the main method of gathering research data in e-learning cognition studies, a clear trend was observed toward more studies using learners' log files or online messages as data sources. The results provide educators and researchers with insights into research trends and patterns in cognition in e-learning.

9.
This paper presents three CBR systems that were developed over seven years in collaboration with two industrial partners. In this research, case-based reasoning (CBR) is used to compute the costs of construction projects. In contrast with previous work in the field, the focus is on choosing strategies that are compatible with user needs and characteristics. Comparing the three strategies reveals advantages and drawbacks while illustrating a "real-life" evolution of a CBR architecture in an industrial context. An important conclusion is that the way users perform tasks has a direct influence on the best architecture for the CBR system (e.g. transformational versus derivational analogy). Incremental development of strategies in the final system improves user interaction, expedites time-consuming tasks, and favours the identification of synergy between techniques such as CBR and data mining.

10.
Text Retrieval from Document Images Based on Word Shape Analysis
In this paper, we propose a method for text retrieval from document images using a similarity measure based on word shape analysis. We extract image features directly instead of using optical character recognition. Document images are segmented into word units, and features called vertical bar patterns are then extracted from these word units through local extrema point detection. All vertical bar patterns are used to build document vectors, and the pair-wise similarity of document images is obtained as the scalar product of the document vectors. Four corpora of news articles were used to test the validity of the method; the image-based similarities were compared with the results obtained on ASCII versions of the same documents using an N-gram algorithm for text documents.
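The scalar-product similarity step can be sketched as below; the string codes are purely hypothetical stand-ins for the per-word vertical-bar patterns the paper extracts from images, and the normalization is an assumption.

```python
import math
from collections import Counter

def similarity(v1, v2):
    # normalized scalar product (cosine) of two feature-count vectors
    dot = sum(w * v2[f] for f, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# toy stand-ins for per-word vertical-bar patterns extracted from images
doc1 = Counter(["|.|", "||.", "|.|", ".||"])
doc2 = Counter(["|.|", "||.", ".|.", ".||"])
doc3 = Counter(["...", "..|"])
```

Two documents sharing many shape patterns score near 1, documents with disjoint patterns score 0, with no character recognition involved.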

11.
Ranking plays an important role in contemporary information search and retrieval systems. Among existing ranking algorithms, link-analysis-based algorithms have proved effective for ranking documents retrieved from large-scale text repositories such as the current Web. Recent developments in the semantic Web have raised considerable interest in designing new ranking paradigms for various semantic search applications. While ranking methods exist in this context, they have not gained much popularity. In this article we introduce the "Rational Research" model, which reflects the search behaviour of a "rational" researcher in a scientific research environment, and propose the RareRank algorithm for ranking entities in semantic search systems; in particular, we elaborate the rationale behind the algorithm and its implementation. Experiments are performed using the RareRank algorithm, and the results are evaluated by domain experts using popular ranking performance measures. A comparison with existing link-based ranking algorithms reveals the benefits of the proposed method.

12.
Search engines are among the most popular, as well as the most useful, services on the web. There is a need, however, to cater to user preferences when supplying search results. We propose to maintain a search profile for each user, on the basis of which the search results are determined. This requires integrating techniques for measuring search quality, learning from user feedback, and biased rank aggregation. To measure web search quality, "user satisfaction" is gauged by the sequence in which a user picks up the results, the time spent at those documents, and whether the user prints, saves, bookmarks, e-mails, or copies-and-pastes a portion of a document. For rank aggregation, we adapt and evaluate classical fuzzy rank-ordering techniques for web applications, and also propose a few novel techniques that outperform the existing ones. A "user satisfaction"-guided web search procedure is also put forward. Learning from user feedback proceeds in such a way that documents consistently preferred by users improve in ranking. Integrating this work, we propose a personalized web search system.
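To make the rank-aggregation step concrete, here is a Borda count, a simple classical method used as a stand-in; the paper's fuzzy rank-ordering techniques are not reproduced here. Each input ranking awards points by position and the merged order follows total score.

```python
def borda_aggregate(rankings):
    # each ranking awards n-1 points to its top result, down to 0 for the last;
    # ties in total score are broken alphabetically for determinism
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0) + (n - 1 - pos)
    return sorted(scores, key=lambda d: (-scores[d], d))

rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
merged = borda_aggregate(rankings)
```

A biased variant, as the abstract suggests, would weight each ranking by a per-user satisfaction score before summing.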

13.
Web text representation underlies all Web text analysis and has a profound influence on its results. This paper proposes a multi-dimensional Web text representation method. Traditional representation methods generally extract features from text content alone, yet deeper document features and external features can also be used to represent text. We study three kinds of features: surface features, latent features, and social features. The surface and latent features can be extracted and learned from the text content, while the social features can be obtained by analyzing the interactions between documents and users. The proposed multi-dimensional representation is easy to use and can be applied in a variety of text analysis models. In our experiments, we extend two widely used text clustering algorithms, K-means and hierarchical agglomerative clustering, into multi-dimensional versions named MDKM and MDHAC. Extensive experiments demonstrate the effectiveness of the method, and the results of combining different features reveal some deeper findings.

14.
In this paper we present a study of structural features of handwriting extracted from three characters, "d", "y", and "f", and the grapheme "th". The features used are based on the standard features used by forensic document examiners. The feature extraction process is presented along with the results. The usefulness of the features was analyzed by searching for optimal feature sets using the wrapper method, with a neural network as the classifier and a genetic algorithm to search the space of feature sets. It is shown that most of the structural micro-features studied possess discriminative power, which justifies their use in forensic analysis of handwriting. The results also show that the grapheme possesses significantly higher discriminating power than any of the three single characters studied, supporting the opinion that a character's form is affected by its adjacent characters.
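A sketch of genetic-algorithm wrapper feature selection in the spirit of this abstract, with a stand-in fitness function in place of the paper's neural-network classifier accuracy (the informative-feature set and all parameters are assumptions).

```python
import random

def genetic_feature_search(n, fitness, pop_size=12, gens=30, seed=1):
    # elitist GA over feature bitmasks; the population starts from the full
    # feature set plus random masks, so the result never scores below "select all"
    rng = random.Random(seed)
    pop = [(1,) * n]
    pop += [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size - 1)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)              # one-point crossover
            child = list(p1[:cut] + p2[cut:])
            if rng.random() < 0.2:                 # point mutation
                i = rng.randrange(n)
                child[i] ^= 1
            children.append(tuple(child))
        pop = survivors + children
    return max(pop, key=fitness)

# stand-in for classifier accuracy: reward three "informative" features,
# lightly penalize every extra feature selected
INFORMATIVE = {0, 2, 5}
def fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    extras = sum(mask) - hits
    return hits - 0.1 * extras

best = genetic_feature_search(6, fitness)
```

In the wrapper setting of the paper, `fitness` would instead train and evaluate the neural network on the subset of features the mask selects.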

15.
Many daily activities present information in the form of a stream of text, and often people can benefit from additional information on the topic discussed. TV broadcast news can be treated as one such stream of text; in this paper we discuss finding news articles on the web that are relevant to news currently being broadcast. We evaluated a variety of algorithms for this problem, looking at the impact of inverse document frequency, stemming, compounds, history, and query length on the relevance and coverage of news articles returned in real time during a broadcast. We also evaluated several postprocessing techniques for improving the precision, including reranking using additional terms, reranking by document similarity, and filtering on document similarity. For the best algorithm, 84–91% of the articles found were relevant, with at least 64% of the articles being on the exact topic of the broadcast. In addition, a relevant article was found for at least 70% of the topics.
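A hedged sketch of the core retrieval loop this abstract evaluates: build a short query from the transcript's most distinctive terms (IDF-weighted over the candidate article corpus) and return the best-scoring article. The query-length cutoff, tokenization, and scoring are assumptions, not the evaluated algorithms.

```python
import math

def tokens(text):
    return text.lower().split()

def idf_table(docs):
    # idf over the candidate corpus; transcript terms appearing in no
    # candidate article are simply ignored when building the query
    n = len(docs)
    vocab = {t for d in docs.values() for t in tokens(d)}
    return {t: math.log(n / sum(t in tokens(d) for d in docs.values())) + 1
            for t in vocab}

def best_article(transcript, docs, query_len=4):
    idf = idf_table(docs)
    seen = [t for t in tokens(transcript) if t in idf]
    query = sorted(set(seen), key=lambda t: -idf[t])[:query_len]
    def doc_score(d):
        dt = set(tokens(d))
        return sum(idf[t] for t in query if t in dt)
    return max(docs, key=lambda name: doc_score(docs[name]))

docs = {
    "a": "earthquake strikes coastal city overnight",
    "b": "city council debates new budget",
    "c": "rescue crews search earthquake rubble",
}
transcript = "breaking news an earthquake has hit the city rescue crews respond"
top = best_article(transcript, docs)
```

The history and query-length factors studied in the paper would correspond to how much transcript is kept in the window and how many terms survive the cutoff.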

16.
17.
In this paper we discuss a multimedia news system that we have developed in the Multimedia Information Research Laboratory at the University of Ottawa and beyond. We focus on the feature set—that is, the tools and facilities associated with the system—explaining the functionality of each and giving real examples of the system in action. We then outline the architecture: the system consists of a production server for document authoring, a conferencing system for collaborative news article creation, a content database for authors, a hypernews database for hypermedia news documents, a news database server with aging and archiving, and user sites. The goal was to have all components of the system communicating on OCRInet, an R&D ATM network in the Ottawa region. We then turn to the challenges of representing and browsing large video objects, and to this end we introduce a novel solution that we call video-tiles. This video tool is an effective way to browse the large news videos that are frequently attached to our multimedia news articles.

18.
This work is about a real-world application of automated deduction: the management of documents (such as mathematical textbooks) in a readily available tool. In this Slicing Information Technology tool, documents are decomposed (sliced) into small units. A particular application task is to assemble a new document from such units selectively, based on the user's current interest and knowledge. We argue that this task can be naturally expressed in logic and that automated deduction technology can be exploited to solve it. More precisely, we rely on first-order clausal logic with a default negation principle, and we propose a model computation theorem prover as a suitable deduction mechanism. Beyond solving the task at hand, this work contributes to the quest for arguments in favor of automated deduction techniques in the real world, and we explain why we think automated deduction techniques are the best choice here.

19.
An unsolved problem in logic-based information retrieval is how to obtain logical representations for documents and queries automatically. This problem limits the impact of logical models for information retrieval because their full expressive power cannot be harnessed. In this paper we propose a method for producing logical document representations that goes beyond simplistic "bag-of-words" approaches. The suggested procedure adopts popular information retrieval heuristics, such as document length corrections and global term distribution. This work includes a report of several experiments applying partial document representations in the context of a propositional model of information retrieval. The benefits of this expressive framework, powered by the new logical indexing approach, become apparent in the evaluation.

20.
Text document clustering plays an important role in providing better document retrieval, document browsing, and text mining. Traditionally, clustering techniques do not consider semantic relationships between words, such as synonymy and hypernymy. To exploit semantic relationships, ontologies such as WordNet have been used to improve clustering results. However, WordNet-based clustering methods mostly rely on single-term analysis of text and do not perform any phrase-based analysis. In addition, these methods use synonymy to identify concepts and explore only hypernymy to calculate concept frequencies, ignoring other semantic relationships such as hyponymy. To address these issues, we combine the detection of noun phrases with the use of WordNet as background knowledge to explore better ways of representing documents semantically for clustering. First, based on noun phrases as well as single-term analysis, we exploit different document representation methods to analyze the effectiveness of hypernymy, hyponymy, holonymy, and meronymy. Second, we choose the most effective method and compare it with a WordNet-based clustering method proposed by others. The experimental results show that the effectiveness of the semantic relationships for clustering ranks, from highest to lowest: hypernymy, hyponymy, meronymy, and holonymy. Moreover, we found that noun phrase analysis improves the WordNet-based clustering method.
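The hypernymy-augmented representation the abstract describes can be sketched as follows; a tiny hand-written hypernym map stands in for WordNet so the example is self-contained, and the weight is an assumption. Adding each word's hypernym at a reduced weight lets documents about "dog" and "cat" overlap through the shared concept "animal".

```python
import math
from collections import Counter

# toy hypernym map standing in for WordNet (entries are illustrative only)
HYPERNYMS = {"dog": "animal", "cat": "animal", "car": "vehicle"}

def concept_vector(words, weight=0.5):
    # add each word's hypernym at a reduced weight alongside the word itself
    vec = Counter(words)
    for w in words:
        if w in HYPERNYMS:
            vec[HYPERNYMS[w]] += weight
    return vec

def cosine(v1, v2):
    dot = sum(w * v2[t] for t, w in v1.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(v1) * norm(v2)) if v1 and v2 else 0.0

plain = cosine(Counter(["dog", "barks"]), Counter(["cat", "purrs"]))
augmented = cosine(concept_vector(["dog", "barks"]), concept_vector(["cat", "purrs"]))
```

With plain term vectors the two documents are orthogonal; with the concept-augmented vectors they share the hypernym dimension, which is what makes semantically related documents cluster together.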


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号