Document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document clustering must consider individual preferences and needs to support personalization in document categorization. Most existing document-clustering techniques, generally anchored in pure content-based analysis, generate a single set of clusters for all individuals without tailoring to individuals' preferences and thus are unable to support personalization. The partial-clustering-based personalized document-clustering approach, which incorporates a target individual's partial clustering into the document-clustering process, has been proposed to facilitate personalized document clustering. However, given a collection of documents to be clustered, the individual might have categorized only a small subset of the collection into his or her personal folders. In this case, the small partial clustering would degrade the effectiveness of the existing personalized document-clustering approach for this particular individual. In response, we extend this approach and propose the collaborative-filtering-based personalized document-clustering (CFC) technique, which expands the size of an individual's partial clustering by considering those of other users with similar categorization preferences. Our empirical evaluation results suggest that, given a small partial clustering established by an individual, the proposed CFC technique generally achieves better clustering effectiveness for that individual than does the partial-clustering-based personalized document-clustering technique.

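Because it is simple, fast, and accurate, the Bayesian algorithm is widely used in spam filtering. Over time, however, concept drift degrades the accuracy of a Bayesian classifier. To address this problem, a client-side dynamic Bayesian learning algorithm based on user feedback is proposed; it learns new mail samples automatically and has low computational complexity. Experiments show that the method adapts well to concept drift, meets the need for personalized mail classification, and is highly practical.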
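
The feedback-driven learning summarized in this abstract can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything here is an assumption: a multinomial naive Bayes model with Laplace smoothing, the hypothetical class name `FeedbackNaiveBayes`, and the choice to treat each user correction as one new labeled sample. Because only counters are updated, each feedback step costs time proportional to the message length, which is one plausible reading of "low computational complexity".

```python
import math
from collections import Counter


class FeedbackNaiveBayes:
    """Multinomial naive Bayes whose counts are updated one message at a time.

    Hypothetical illustration; not the algorithm from the abstract.
    """

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}
        self.vocabulary = set()

    def learn(self, words, label):
        # Incremental update: only counters change, so no retraining pass is needed.
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def _log_score(self, words, label):
        total_docs = sum(self.doc_counts.values())
        # Laplace-smoothed class prior.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + len(self.LABELS)))
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocabulary) + 1
        for word in words:
            # Laplace-smoothed word likelihood; unseen words get a small nonzero mass.
            score += math.log((self.word_counts[label][word] + 1) / denominator)
        return score

    def classify(self, words):
        return max(self.LABELS, key=lambda label: self._log_score(words, label))

    def feedback(self, words, correct_label):
        # A user correction is treated as a fresh labeled sample (assumption).
        self.learn(words, correct_label)


model = FeedbackNaiveBayes()
model.learn(["cheap", "pills"], "spam")
model.learn(["meeting", "agenda"], "ham")
for _ in range(3):
    model.feedback(["newsletter"], "ham")  # the user keeps marking newsletters as wanted
print(model.classify(["newsletter"]))
```

A real client would also need to age out old counts to follow drift in both directions; that step is left out of this sketch.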

User profiling is an important step in solving the problem of personalized news recommendation. Traditional user profiling techniques often construct user profiles from static historical data accessed by users. However, because news repositories are updated frequently, a user's fine-grained reading preference may evolve over time while his/her long-term interest remains stable. Therefore, it is imperative to reason about such preference evolution when profiling users in news recommenders. Besides, in content-based news recommenders, a user's preference tends to be stable because the recommender selects news articles whose content is similar to the user's profile. To stimulate users' motivation to read, a successful recommender needs to introduce "somewhat novel" articles to users. In this paper, we first provide an experimental study on the evolution of user interests in real-world news recommender systems, and then propose a novel recommendation approach in which the long-term and short-term reading preferences of users are seamlessly integrated when recommending news items. Given a hierarchy of newly published news articles, news groups that a user might prefer are identified using the long-term profile, and then, in each selected news group, a list of news items is chosen as recommendation candidates based on the short-term user profile. We further propose to select news items from the user–item affinity graph using an absorbing random walk model to increase the diversity of the recommended news list. Extensive empirical experiments on a collection of news data obtained from various popular news websites demonstrate the effectiveness of our method.

In recent years, people have begun to pay more and more attention to the effect of news on financial instrument markets (i.e., the markets for trading financial instruments). Researchers in the financial domain have conducted many studies demonstrating the effect of different types of news on trade activities in financial instrument markets, such as volatility in trade price, trade volume, and trading frequency. In this paper, an ontology for knowledge about news regarding financial instruments is provided. The ontology contains two parts. The first part presents a hierarchy framework for the domain knowledge that primarily includes classes of news, classes of financial instrument market participants, classes of financial instruments, and primary relations between these classes. In the second part, a causal map is used to demonstrate how classes of news are causally related to classes of financial instruments. Finally, a case concerning the "9/11 American terror attack" is analyzed. The ontology serves three purposes: first, it supports a comprehensive understanding of knowledge about news in financial instrument markets; second, it helps build news-based trading models for these markets; third, it facilitates and supports the design and development of systems in this domain (e.g., systems that predict stock prices from news, or systems that help financial market participants search for relevant news).

Automatic classification of shots extracted from news videos plays an important role in news video segmentation, which is an essential step towards effective indexing of broadcasters' digital databases. In spite of the efforts reported by researchers in this field, no technique providing fully satisfactory performance has been presented until now. In this paper, we propose a multi-expert approach for unsupervised shot classification. The proposed multi-expert system (MES) combines three algorithms that are model-free and do not require a specific training phase. In order to assess the performance of the MES, we built a database significantly larger than those typically used in the field. Experimental results demonstrate the effectiveness of the proposed approach in terms of both shot classification and news story detection capability.

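The current combination of a "lazy-learning classifier" with a "dynamically updated training set" is not sufficient to solve the concept drift problem in dynamic text classification. Inspired by the basic idea of lazy classification and drawing on the strengths of the k-NN algorithm, the concept of a "lazy feature selection mode" for concept drift is proposed, together with a dynamic text classification algorithm based on this mode. Test results show that the new algorithm handles the existing difficulties well and offers high reliability and practicality.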

Traditional approaches to text data stream classification usually require the manual labeling of a number of documents, which is an expensive and time-consuming process. In this paper, to overcome this limitation, we propose to classify text streams by keywords without labeled documents so as to reduce the burden of manual labeling. We build our base text classifiers with the help of keywords and unlabeled documents to classify text streams, and use classifier ensemble algorithms to cope with concept drift in text data streams. Experimental results demonstrate that the proposed method can build good classifiers from keywords without manual labeling, and that the ensemble-based algorithm detects and adapts to concept drift in the streams well, performing better than the single-window algorithm.

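To address concept drift in text stream classification, with spam filtering as the application setting, a case-based reasoning (CBR) spam filtering algorithm that adapts to concept drift is proposed. The algorithm filters spam with CBR, studies case-base management techniques within the CBR process, and proposes a case-base revision algorithm based on penalty-based noise reduction and equivalence-based redundancy removal to adapt to concept drift. Experiments on real data sets verify that the proposed case revision algorithm improves spam filtering efficiency and handles concept drift in spam better.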


Recommender systems use machine-learning techniques to make predictions about resources. The medical field is one where much research is currently being conducted on the utility of recommender systems. In the last few years, the amount of healthcare-related information available online has increased tremendously. Patients nowadays are more aware and look for answers to healthcare problems online. This has resulted in a dire need for an effective, reliable online system that recommends, in limited time, the physician best suited to a particular patient. In this article, a hybrid doctor-recommender system is proposed that combines different recommendation approaches (content-based, collaborative, and demographic filtering) to tackle the issue of doctor recommendation effectively. The proposed system addresses personalization by analysing a patient's interests when selecting a doctor. It uses a novel adaptive algorithm to construct a doctor's ranking function. This ranking function is then used to translate patients' criteria for selecting a doctor into a numerical base rating, which is eventually used in the recommendation of doctors. The system has been evaluated thoroughly, and the results show that its recommendations are reasonable and can effectively fulfil patients' demand for reliable doctor selection.

The present paper introduces a context-aware recommendation system for journalists that enables the identification of similar topics across different sources. More specifically, a journalist-based recommendation system that can be automatically configured is presented to exploit news according to expert preferences. Contextual features of news are also taken into account due to their special nature: time, current user interests, location, and existing trends are combined with traditional recommendation techniques to provide an adaptive framework that deals with heterogeneous data and yields an enhanced collaborative filtering system. Since the Wesomender approach is able to generate context-aware recommendations in the journalism field, a quantitative evaluation comparing Wesomender results with the expectations of a team of experts is also performed. It shows that a context-aware adaptive recommendation engine can fulfil the needs of journalists' daily work when timely, primary information must be retrieved.

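In practical mail filtering, because of certain characteristics of spam itself, assigning a mail sample definitively to one class, as the traditional support vector machine (SVM) classification model does, is error-prone; judging class membership from an output probability is more reasonable. Following this idea, this paper proposes an optimization of the traditional SVM mail classifier: a probability is computed from the classification output and compared against a threshold to determine the class of the mail. Experiments show that the method is effective and feasible.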
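
The idea of thresholding a probability instead of taking the hard SVM decision can be sketched as follows. The abstract does not say how the probability is computed; this sketch assumes a Platt-style sigmoid over the raw SVM decision value, and the names `spam_probability` and `classify_mail` and the parameters `a`, `b`, and `threshold` are hypothetical. In practice `a` and `b` would be fitted on held-out data rather than fixed.

```python
import math


def spam_probability(decision_value, a=-1.0, b=0.0):
    """Map a raw SVM decision value to P(spam) with a sigmoid.

    Platt-style form 1 / (1 + exp(a * f + b)); a < 0 so that larger
    decision values give larger spam probabilities. The default a and b
    are placeholders, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def classify_mail(decision_value, threshold=0.5, a=-1.0, b=0.0):
    """Return (label, probability); the mail is spam only if P(spam) >= threshold."""
    probability = spam_probability(decision_value, a, b)
    label = "spam" if probability >= threshold else "ham"
    return label, probability


# A mail just on the spam side of the margin is no longer forced into "spam"
# when the threshold is raised to reduce false positives.
print(classify_mail(2.0, threshold=0.9))
print(classify_mail(3.0, threshold=0.9))
```

Raising the threshold trades missed spam for fewer legitimate mails being filtered, which is usually the cheaper error in mail filtering.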

Recommender systems in online shopping automatically select the most appropriate items for each user, thus shortening his/her product search time and adapting the selection as his/her particular preferences evolve over time. This adaptation process typically assumes that a user's interest in a given type of product always decreases with time from the moment of the last purchase. However, how much a user needs a product depends on both the nature of the item itself and the personal preferences of the user, and his/her interest may even increase over time after the purchase. Some existing approaches focus only on the first factor, missing the point that the influence of time can be very different for different users. To overcome this limitation, we present a filtering strategy that exploits the semantics formalized in an ontology in order to link items (and their features) to time functions. The novelty lies in the fact that the shapes of these functions are corrected by temporal curves built from the consumption stereotypes that each user fits best. Our preliminary experiments involving real users have revealed significant improvements in recommendation precision over previous time-driven filtering approaches.

An adaptive seamless streaming dissemination system for vehicular networks is presented in this work. An adaptive streaming system is established at each local server to prefetch and buffer stream data. Based on the current situation of the users and the environments in which they are located, the adaptive streaming system computes which parts of the stream data to prefetch for each user and stores them temporarily at the local server. Thus, users can download the prefetched stream data from the local servers instead of directly from the Internet, which avoids the video playback problems caused by network congestion. Several techniques, such as stream data prefetching, stream data forwarding, and adaptive dynamic decoding, are used to enhance adaptability to different users and environments and to achieve the best transmission efficiency. Fuzzy logic inference systems are used to determine whether a roadside base station (BS) or a vehicle can be chosen to transfer stream data for users. Considering the uneven deployment of BSs and vehicles, a bandwidth reservation mechanism for premium users is proposed to ensure the QoS of the stream data that premium users receive. A series of simulations was conducted, and the experimental results verify the effectiveness and feasibility of the proposed work.

It is challenging to use traditional data mining techniques for real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in the data streams. For novel class detection, we rely on the idea that data points belonging to the same class should be close to each other and far from data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of the proposed stream classification model against that of existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) Machine Learning Repository. The experimental results show that our approach is flexible and robust in novel class detection under concept drift and outperforms traditional classification models in challenging real-life data stream applications.
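
The separation test for novel class detection described in this abstract can be sketched in a few lines. The abstract does not define "well separated", so this sketch assumes each existing cluster is summarized by a centroid and a radius, and flags a point as novel when it lies outside every cluster's radius scaled by a hypothetical `slack` factor. The function name `is_novel` and the cluster representation are illustrative assumptions, not the paper's method.

```python
import math


def is_novel(point, clusters, slack=1.0):
    """Return True if `point` lies outside every existing cluster.

    clusters: iterable of (centroid, radius) pairs, where centroid has the
    same dimensionality as point. slack > 1 makes the test more conservative.
    """
    for centroid, radius in clusters:
        if math.dist(point, centroid) <= radius * slack:
            return False  # close enough to a known class region
    return True


known_clusters = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)]
print(is_novel((0.5, 0.5), known_clusters))
print(is_novel((10.0, 10.0), known_clusters))
```

A single outlying point is weak evidence of a new class; a practical detector would typically wait until several such points are also close to each other before declaring one.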

In this paper, we present a system that automatically translates Arabic text embedded in images into English. The system consists of three components: text detection from images, character recognition, and machine translation. We formulate text detection as a binary classification problem and apply gradient boosting tree (GBT), support vector machine (SVM), and location-based prior knowledge to improve the F1 score of text detection from 78.95% to 87.05%. The detected text images are processed by off-the-shelf optical character recognition (OCR) software. We employ an error correction model to post-process the noisy OCR output, and apply a bigram language model to reduce word segmentation errors. The translation module is tailored with a compact data structure for hand-held devices. The experimental results show substantial improvements in both word recognition accuracy and translation quality. For instance, in the experiment on the Arabic Transparent font, the BLEU score increases from 18.70 to 33.47 with the use of the error correction module.

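Research on Text Classification Based on Concept Space
1. Introduction. With the rapid growth of text information, especially the increase in online information on the Internet, automatic classification of text (web pages) has become a key technology of considerable practical value and a powerful means of organizing and managing data. Text classification methods fall into two categories: knowledge-based methods and statistics-based methods. A knowledge-based text classification system is applied to a specific domain and requires a knowledge base for that domain as support. Because of various problems in knowledge extraction, updating, maintenance, and self-learning, it is suitable ...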

This paper presents a new feature transformation technique applied to improve the screening accuracy of the automatic detection of pathological voices. The statistical transformation is based on Hidden Markov Models, obtaining a transformation and classification stage simultaneously and adjusting the parameters of the model with a criterion that minimizes the classification error. The original feature vectors are built using classic short-term noise parameters and mel-frequency cepstral coefficients. Compared with conventional approaches in the literature on automatic detection of pathological voices, the proposed feature space transformation technique demonstrates a significant improvement in performance without adding new features to the original input space. In view of the results, it is expected that this technique could provide good results in other areas such as speaker verification and/or identification.

We describe a technique (the adaptive creation of free lists) for dynamic storage allocation that is particularly suited to situations in which the distribution of sizes of blocks requested has one or more sharp peaks. We describe a particular dynamic storage allocation system and the environment in which it runs, and give the results of some experiments to determine the usefulness of the technique in this system. Our experiments also tested the efficacy of a technique suggested by Knuth for improving the performance of similar systems.
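
One way to picture the adaptive creation of free lists is the toy simulation below. The abstract does not describe the mechanism, so this is only a guess at its spirit: the allocator counts requests per block size and, once a size has been requested at least `hot_threshold` times, keeps freed blocks of that size on a dedicated free list for fast reuse. The class `AdaptiveFreeLists`, the threshold, and the bump-pointer stand-in for the general allocator are all hypothetical.

```python
from collections import Counter


class AdaptiveFreeLists:
    """Toy allocator that creates a per-size free list for frequently requested sizes."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold
        self.request_counts = Counter()  # size -> number of requests seen so far
        self.free_lists = {}             # size -> freed addresses of exactly that size
        self.next_address = 0            # bump pointer standing in for the general allocator

    def allocate(self, size):
        self.request_counts[size] += 1
        free_list = self.free_lists.get(size)
        if free_list:
            return free_list.pop()       # fast path: reuse a block of the exact size
        address = self.next_address      # slow path: carve fresh storage
        self.next_address += size
        return address

    def free(self, address, size):
        if self.request_counts[size] >= self.hot_threshold:
            # The size sits on a peak of the request distribution: keep the block.
            self.free_lists.setdefault(size, []).append(address)
        # Otherwise the block would return to the general pool (not modelled here).


allocator = AdaptiveFreeLists(hot_threshold=2)
first = allocator.allocate(16)
allocator.free(first, 16)        # size 16 is not yet "hot", so the block is not kept
second = allocator.allocate(16)  # second request makes size 16 hot
allocator.free(second, 16)
third = allocator.allocate(16)   # served from the dedicated free list
print(first, second, third)
```

The appeal of such a scheme is that free lists exist only for the sizes that actually peak in the workload, rather than being fixed in advance.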

This paper presents an innovative solution for modeling distributed adaptive systems in biomedical environments. We present an original TCBR-HMM (Text Case Based Reasoning-Hidden Markov Model) for biomedical text classification based on document content. The main goal is to propose a more effective classifier than current methods for this environment, where the model needs to be adapted to new documents in an iterative learning frame. To demonstrate this, we include a set of experiments performed on the OHSUMED corpus. Our classifier is compared with Naive Bayes and SVM techniques, which are commonly used in text classification tasks. The results suggest that the TCBR-HMM model is indeed more suitable for document classification. The model is empirically and statistically comparable to the SVM classifier and outperforms it in terms of time efficiency.

