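To this end, we propose a new fine-grained feature interaction network (FFIN) for rating prediction. First, instead of aggregating all of a user's reviews into a single unified vector, the model represents each review of a user and of an item separately, and builds a multi-level representation of each review text hierarchically through stacked dilated convolutions, capturing the multi-granularity semantic information in the reviews. Second, the model constructs fine-grained feature interactions between user and item reviews at every semantic level, which avoids the problem that single-granularity interaction overlooks information of secondary importance. Finally, because users' reviewing behavior is usually subjective and personalized, we do not use an attention mechanism to identify important information; instead, we identify high-order salient signals through a hierarchical structure similar to that used in image recognition and use them for the final rating prediction. We conducted extensive experiments on six real-world datasets with different characteristics from Amazon and Yelp. The results show that FFIN achieves significant gains in prediction accuracy over recently proposed state-of-the-art models. Further experimental analysis shows that multi-granularity feature interaction not only highlights the relevant information in reviews but also greatly improves the interpretability of rating prediction.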

Mobile apps (applications) have become a popular form of software, and user reviews of apps have become an important feedback resource. Users may raise issues in their reviews when they use apps, such as a functional bug, a network lag, or a request for a feature. Understanding these issues can help developers focus on users’ concerns, and help users evaluate similar apps for download or purchase. However, it is not known in advance which types of issues a review raises. Moreover, the number of user reviews is huge and the text of the reviews is unstructured and informal. In this paper, we analyze 3,902 user reviews from 11 mobile apps in a Chinese app store, 360 Mobile Assistant, and uncover 17 issue types. Then, we propose an approach, CSLabel, that labels user reviews based on the issue types they raise. CSLabel uses a cost-sensitive learning method to mitigate the effects of imbalanced data, and optimizes the setting of the support vector machine (SVM) classifier’s kernel function. Results show that CSLabel can correctly label reviews with a precision of 66.5%, a recall of 69.8%, and an F1 measure of 69.8%. In comparison with the state-of-the-art approach, CSLabel improves the precision by 14%, the recall by 30%, and the F1 measure by 22%. Finally, we apply our approach to two real scenarios: 1) we provide an overview of 1,076,786 user reviews from 1,100 apps in the 360 Mobile Assistant and 2) we find that some issue types have a negative correlation with users’ evaluation of apps.
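
The cost-sensitive idea can be illustrated with per-class weights that grow as a class becomes rarer. This is a minimal sketch, not CSLabel's actual cost setting (the abstract does not give it): it assumes the common "balanced" heuristic n_samples / (n_classes * class_count), and the issue-type names and counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumed heuristic for illustration; rare classes get larger weights,
    so misclassifying them costs more during training.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Hypothetical, imbalanced issue-type labels (not data from the paper).
labels = ["bug"] * 6 + ["network_lag"] * 3 + ["feature_request"] * 1
weights = balanced_class_weights(labels)
```

Weights of this kind are typically passed to a classifier as per-class misclassification costs, so that errors on rare issue types are penalized more heavily.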

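5G edge computing provides services close to the user side, and sensitive user information is concentrated at the edge, so illegal access by users or malicious behavior by legitimate users threatens the security of the whole edge network. This paper applies machine learning algorithms to the edge computing architecture and proposes a behavior-based user anomaly detection scheme. User behavior is modeled, with one-hot encoding and mutual information used for data preprocessing and feature selection, and the extreme gradient boosting algorithm is used to train a multi-class classifier that identifies users entering the campus; a user is judged anomalous depending on whether the identification result matches the user's identity. On this basis, an isolation forest model is trained on the historical behavior data of authorized users to detect whether the behavior of trusted users is anomalous. Together these steps identify unauthorized users in a small fixed campus and detect anomalous behavior by authorized users. Experimental results show that the scheme meets the time-complexity requirements of edge computing scenarios and effectively distinguishes different users, with a classification accuracy of 0.953 and a false alarm rate of only 0.01 on anomalous behavior samples.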
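
The preprocessing step can be sketched as follows: one-hot encode a categorical behavior attribute, then score each attribute by its mutual information with the user identity. This is an illustrative sketch only; the attribute names and values are hypothetical, and the paper's exact feature set is not given in the abstract.

```python
import math
from collections import Counter


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical behavior log: which device and time slot each user was seen with.
users = ["u1", "u1", "u2", "u2"]
device = ["phone", "phone", "laptop", "laptop"]   # fully determined by the user
time_slot = ["am", "pm", "am", "pm"]              # independent of the user

mi_device = mutual_information(device, users)
mi_time = mutual_information(time_slot, users)
encoded = one_hot("laptop", ["phone", "laptop"])
```

In this toy log the device attribute carries one full bit about the user while the time slot carries none, so a mutual-information filter would keep the former and drop the latter.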

Abstract. Providing a customized result set based upon a user preference is the ultimate objective of many content-based image retrieval systems. There are two main challenges in meeting this objective. First, there is a gap between the physical characteristics of digital images and the semantic meaning of the images. Second, different people may have different perceptions of the same set of images. To address both these challenges, we propose a model, named Yoda, that conceptualizes content-based querying as the task of soft classifying images into classes. These classes can overlap, and their members are different for different users. The “soft” classification is hence performed for each and every image feature, including both physical and semantic features. Subsequently, each image is ranked based on the weighted aggregation of its classification memberships. The weights are user-dependent, and hence different users obtain different result sets for the same query. Yoda employs a fuzzy-logic based aggregation function for ranking images. We show that, in addition to some performance benefits, fuzzy aggregation is less sensitive to noise and can support disjunctive queries, in contrast to the weighted-average aggregation used by other content-based image retrieval systems. Finally, since Yoda heavily relies on user-dependent weights (i.e., user profiles) for the aggregation task, we utilize the users' relevance feedback to improve the profiles using genetic algorithms (GA). Our learning mechanism requires fewer user interactions, and results in faster convergence to the user's preferences than other learning techniques. Correspondence to: Y.-S. Chen (E-mail: yishinc@usc.edu). This research has been funded in part by NSF grants EEC-9529152 (IMSC ERC) and IIS-0082826, NIH-NLM R01-LM07061, DARPA and USAF under agreement nr. F30602-99-1-0524, and unrestricted cash gifts from NCR, Microsoft, and Okawa Foundation.

With the flourishing of location-based social networks (LBSNs), location check-in records offer a rich information resource for mining. Among the locations visited by a user, those that attract relatively more visits from that user can support further mining and improve location-based services. It is therefore valuable to partition visited locations according to a user’s visiting frequency. The aim of our paper is to partition locations for individual users by means of machine learning classification, categorizing a location for a user as soon as he or she makes an initial check-in there. After extracting features for each initial check-in record, we evaluate the contribution of three feature categories. The results show that the contribution of different feature categories to classification varies, with social features contributing the least. Finally, we run a final test on the whole sample and compare the results with two baselines, each based on majority voting. The results generally outperform the baselines by a large margin, demonstrating the effectiveness of classification.

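Fraud detection in consumer finance is an important problem for both academia and industry. The currently popular approach applies machine learning methods to users' intrinsic features. With the emergence of organized group fraud, traditional machine learning methods become ineffective when fraudulent user samples are few and feature data are insufficient. Because users in a fraud group are strongly connected, this paper builds a user association network from call records between users and extracts graph features of user nodes through network statistics and the DeepWalk algorithm, making full use of the graph's topological structure and neighbor-node information. These graph features are combined with users' intrinsic features as input to a LightGBM model. Experimental results show that, after adopting graph representation learning, the AUC improves by 7.3% compared with using intrinsic user features alone.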
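
The first stage of DeepWalk, generating truncated uniform random walks over the user association network, can be sketched as below. This is only an illustration under stated assumptions: the call graph is a hypothetical toy example, and the second stage (training skip-gram embeddings on the walks) is omitted.

```python
import random


def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Generate truncated uniform random walks from every node.

    adj maps each node to a list of its neighbors. Walks stop early at
    nodes with no neighbors. The walks would then be fed to a skip-gram
    model as "sentences" to learn node embeddings (not shown here).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical call graph: an edge means two users have called each other.
call_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": []}
walks = random_walks(call_graph, walk_length=5, walks_per_node=2)
```

Nodes that co-occur in many walks, such as members of a tightly connected group, end up with similar embeddings, which is what makes the graph features useful alongside intrinsic user features.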

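The essence of detecting shilling attacks is to classify genuine and fake users. Existing detection algorithms are not robust against attacks that include selected items, such as popular attacks and segment attacks. To address this problem, this paper analyzes the rating distributions of genuine and fake users and, in combination with the ID3 decision tree, proposes Dispersion-C, a shilling attack detection algorithm based on user rating dispersion. The algorithm measures a user's rating dispersion with three features: the ratio of extreme ratings, the variance of ratings after removing extreme ratings, and the standard deviation of the user's ratings. These serve as the classification features of the ID3 decision tree algorithm, which selects features as splitting attributes according to their information gain and trains a classifier. Experimental results show that Dispersion-C detects all types of shilling attacks well and is robust.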
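
The three dispersion features can be sketched as follows. The abstract does not define them precisely, so this sketch assumes that "extreme" ratings are the lowest and highest values of a 1 to 5 scale and that population variance and standard deviation are used; the rating profiles are hypothetical.

```python
from statistics import pstdev, pvariance


def dispersion_features(ratings, low=1, high=5):
    """Return (extreme rating ratio, variance without extremes, rating std).

    Assumption for illustration: a rating is "extreme" if it equals the
    lowest or highest value of the rating scale.
    """
    extreme = [r for r in ratings if r in (low, high)]
    non_extreme = [r for r in ratings if r not in (low, high)]
    extreme_ratio = len(extreme) / len(ratings)
    # With no non-extreme ratings left, fall back to zero variance.
    trimmed_variance = pvariance(non_extreme) if non_extreme else 0.0
    return extreme_ratio, trimmed_variance, pstdev(ratings)


# Hypothetical rating profiles (not data from the paper).
suspicious = dispersion_features([5, 5, 5, 1, 3, 3])
ordinary = dispersion_features([2, 3, 4, 3])
```

A decision tree such as ID3 would then split on whichever of these three values yields the highest information gain on labeled genuine and fake profiles.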

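Feature selection has long been an important problem in machine learning and data mining. In multi-label learning tasks, each sample in a dataset is associated with multiple labels, and the labels are usually correlated with each other. In the analysis of high-dimensional multi-label data, researchers have proposed multi-label feature selection methods to reduce feature dimensionality and improve classification performance. This paper systematically reviews research progress in multi-label feature selection. After introducing multi-label classification and its evaluation criteria, it analyzes in detail three categories of multi-label feature selection methods, namely filter, wrapper, and embedded algorithms, and outlines prospects for future research.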

Training data plays an essential role in modern applications of machine learning. However, gathering labeled training data is time-consuming. Therefore, labeling is often outsourced to less experienced users, or completely automated. This can introduce errors, which compromise valuable training data, and lead to suboptimal training results. We thus propose a novel approach that uses the power of pretrained classifiers to visually guide users to noisy labels, and let them interactively check error candidates, to iteratively improve the training data set. To systematically investigate training data, we propose a categorization of labeling errors into three different types, based on an analysis of potential pitfalls in label acquisition processes. For each of these types, we present approaches to detect, reason about, and resolve error candidates, as we propose measures and visual guidance techniques to support machine learning users. Our approach has been used to spot errors in well-known machine learning benchmark data sets, and we tested its usability during a user evaluation. While initially developed for images, the techniques presented in this paper are independent of the classification algorithm, and can also be extended to many other types of training data.

Content-based e-mail spam filtering continues to be a challenging machine learning problem. Usually, the joint distribution of e-mails and labels changes from user to user and from time to time, and the training data are poor representatives of the true distribution. E-mail service providers have two options for automatic spam filtering at the service-side: a single global filter for all users or a personalized filter for each user. The practical usefulness of these options, however, depends upon the robustness and scalability of the filter. In this paper, we address these challenges by presenting a robust personalizable spam filter based on local and global discrimination modeling. Our filter exploits highly discriminating content terms, identified by their relative risk, to transform the input space into a two-dimensional feature space. This transformation is obtained by linearly pooling the discrimination information provided by each term for spam or non-spam classification. Following this local model, a linear discriminant is learned in the feature space for classification. We also present a strategy for personalizing the local and global models using unlabeled e-mails, without requiring users’ feedback. Experimental evaluations and comparisons are presented for global and personalized spam filtering, for varying distribution shift, for handling the problem of gray e-mails, on unseen e-mails, and with varying filter size. The results demonstrate the robustness and effectiveness of our filter and its suitability for global and personalized spam filtering at the service-side.

Motion-sensor-driven 3D intuitive gesture interaction
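To make gesture interaction less constrained by venue and lighting, this paper proposes a gesture recognition method that uses an accelerometer as the input device. The user is required to demonstrate each gesture only once, and techniques such as adding noise increase the degree of automation in generating training data. After preprocessing and feature extraction, the training data are used to train machine learning models (hidden Markov models and support vector machines). In experiments on a test set containing 70 gestures, the average recognition rate exceeds 90%. Two gesture-based human-computer interaction prototype systems, gesture-controlled slideshows and gesture dialing, were also developed, and the results show that the proposed method significantly improves the user experience in human-computer interaction.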

To meet the requirement of customised services for online communities, sentiment classification has been applied to unstructured online reviews to identify users’ opinions on certain products. The purpose of this article is to select features for sentiment classification of Chinese online reviews using techniques that perform well in traditional text classification. First, adjectives, adverbs and verbs are identified as the potential text features containing sentiment information. Then, four statistical feature selection methods, namely document frequency (DF), information gain (IG), chi-squared statistic (CHI) and mutual information (MI), are adopted to select features. After that, the Boolean weighting method is applied to set feature weights and construct a vector space model. Finally, a support vector machine (SVM) classifier is employed to predict the sentiment polarity of online reviews. Comparative experiments are conducted on Chinese online hotel reviews. The results indicate that the highest accuracy of the sentiment classification of Chinese online reviews is achieved by taking adjectives, adverbs and verbs together as the features. In addition, the feature selection methods perform differently on sentiment classification: DF performs the best, CHI follows and IG ranks last, whereas MI is not suitable for sentiment classification of Chinese online reviews. This conclusion will help improve the accuracy of sentiment classification and be useful for further research.
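
Two of the four selection scores (DF and CHI) and the Boolean weighting step can be sketched in a few lines. This is an illustrative sketch: the tokenized reviews are hypothetical English stand-ins for the Chinese hotel reviews, and the chi-squared score uses the standard 2x2 contingency form for a term and a class.

```python
def document_frequency(docs, term):
    """Number of documents that contain the term."""
    return sum(term in tokens for tokens in docs)


def chi_squared(docs, labels, term, positive):
    """Chi-squared statistic between a term and the class `positive`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if has_term and label == positive:
            a += 1          # in class, contains term
        elif has_term:
            b += 1          # not in class, contains term
        elif label == positive:
            c += 1          # in class, lacks term
        else:
            d += 1          # not in class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def boolean_vector(tokens, vocabulary):
    """Boolean weighting: 1 if the feature occurs in the review, else 0."""
    return [1 if term in tokens else 0 for term in vocabulary]


# Hypothetical tokenized reviews (sets of sentiment-bearing words).
docs = [{"clean", "good"}, {"good", "quiet"}, {"dirty", "noisy"}, {"dirty", "good"}]
labels = ["pos", "pos", "neg", "neg"]

chi_dirty = chi_squared(docs, labels, "dirty", "pos")
chi_good = chi_squared(docs, labels, "good", "pos")
df_good = document_frequency(docs, "good")
vector = boolean_vector(docs[3], ["clean", "dirty", "good"])
```

In this toy corpus "good" has the higher document frequency while "dirty" has the higher chi-squared score, which shows how the two criteria can rank the same features differently.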

Demographics prediction is an important component of user profile modeling. The accurate prediction of users’ demographics can help promote many applications, ranging from web search and personalization to behavior targeting. In this paper, we focus on how to predict users’ demographics, including “gender”, “job type”, “marital status”, “age” and “number of family members”, based on mobile data, such as users’ usage logs, physical activities and environmental contexts. The core idea is to build a supervised learning framework, where each user is represented as a feature vector and users’ demographics are considered as prediction targets. The most important step is to construct features from raw data so that supervised learning models can be applied. We propose a feature construction framework, CFC (contextual feature construction), where each feature is defined as the conditional probability of one user activity under the given contexts. Besides employing standard supervised learning models, we propose a regularized multi-task learning framework to model different kinds of demographics predictions collectively. We also propose a cost-sensitive classification framework for regression tasks, in order to benefit from the existing dimension reduction methods. Finally, due to the limited training instances, we employ ensemble methods to avoid overfitting. The experimental results show that the framework achieves classification accuracies on “gender”, “job” and “marital status” as high as 96%, 83% and 86%, respectively, and achieves Root Mean Square Error (RMSE) on “age” and “number of family members” as low as 0.69 and 0.66 respectively, under the leave-one-out evaluation.
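
The CFC definition, a feature as the conditional probability of an activity under a given context, can be sketched with simple counting. This is a minimal illustration; the context and activity names are hypothetical, and the paper's actual contexts, smoothing, and feature set are not specified in the abstract.

```python
from collections import Counter


def cfc_features(events, feature_keys):
    """Estimate P(activity | context) for each (activity, context) key.

    events is one user's log as a list of (context, activity) pairs.
    A context never observed for this user yields 0.0 (an assumption
    made here for illustration; no smoothing is applied).
    """
    context_counts = Counter(ctx for ctx, _ in events)
    pair_counts = Counter(events)
    features = []
    for activity, ctx in feature_keys:
        total = context_counts[ctx]
        features.append(pair_counts[(ctx, activity)] / total if total else 0.0)
    return features


# Hypothetical usage log for one user.
events = [
    ("evening", "game"),
    ("evening", "game"),
    ("evening", "news"),
    ("commute", "news"),
]
keys = [("game", "evening"), ("news", "commute"), ("game", "weekend")]
vector = cfc_features(events, keys)
```

Each user's vector built this way can then be fed to any standard supervised learner, with the demographic attribute as the prediction target.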

A hybrid-mode network traffic classification method
Hu Ting, Wang Yong, Tao Xiaoling. Journal of Computer Applications (《计算机应用》), 2010, 30(10): 2653-2655
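Traffic classification is one of the key parts of network management for meeting users' increasingly fine-grained quality-of-service requirements for Internet services. This paper analyzes and compares the current use, strengths, and weaknesses of three approaches: port-number matching, signature-field analysis, and machine learning classification based on flow statistics. Because any single method suffers from low accuracy or long classification time, a hybrid-mode network traffic classification scheme is proposed. The scheme combines port-number matching with machine learning classification and uses the self-organizing map network algorithm, whose output can be visualized, to classify network traffic at the application layer. Experiments show that the scheme effectively classifies network traffic by application type and that the classification results visualize well.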
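
The hybrid control flow can be sketched as a port lookup followed by a statistical fallback. Note the substitution: the paper uses a self-organizing map for the learning stage, whereas this sketch swaps in a much simpler nearest-centroid rule purely to show the two-stage structure. The centroid values and flow statistics are hypothetical.

```python
import math

# A few well-known port assignments used for the fast first stage.
WELL_KNOWN_PORTS = {25: "SMTP", 53: "DNS", 80: "HTTP", 443: "HTTPS"}


def classify_flow(dst_port, flow_stats, centroids):
    """Return (application label, stage that produced it).

    Stage 1: match the destination port against well-known ports.
    Stage 2: otherwise pick the nearest centroid in flow-statistics space
    (a stand-in for the self-organizing map used in the paper).
    """
    if dst_port in WELL_KNOWN_PORTS:
        return WELL_KNOWN_PORTS[dst_port], "port"
    label = min(centroids, key=lambda name: math.dist(flow_stats, centroids[name]))
    return label, "statistical"


# Hypothetical centroids: (mean packet size in bytes, mean inter-arrival time in s).
centroids = {"P2P": (900.0, 0.02), "streaming": (1300.0, 0.5)}

dns_result = classify_flow(53, (80.0, 0.1), centroids)
p2p_result = classify_flow(6881, (950.0, 0.05), centroids)
```

In practice the flow statistics would be normalized before any distance is computed, since packet sizes and inter-arrival times live on very different scales.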

Automatic sentiment classification of news using machine learning methods
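This paper studies the application of machine learning methods to the sentiment classification of news text, that is, judging whether a text is positive or negative. We use naive Bayes and maximum entropy methods to classify the sentiment of news and review corpora. Experiments show that machine learning methods also achieve good performance in sentiment-based text classification, with the highest accuracy reaching 90%. We also find that, for sentiment-based text classification, selecting words with semantic orientation as features, handling negation words correctly, and using binary feature weights improve classification accuracy. Overall, sentiment-based text classification is a more challenging task.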
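
The two findings about negation handling and binary feature weights can be illustrated with a small naive Bayes sketch. Everything here is an assumption for illustration: the negation list, the NOT_ prefix convention, the add-one smoothing, and the English toy corpus standing in for the Chinese news data.

```python
import math
from collections import Counter, defaultdict

NEGATIONS = {"not", "no", "never"}  # hypothetical negation word list


def binary_features(tokens):
    """Mark the token after a negation word with NOT_ and keep each feature once."""
    feats, negate = set(), False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            continue
        feats.add("NOT_" + tok if negate else tok)
        negate = False
    return feats


def train_nb(docs, labels):
    """Count binary feature occurrences per class."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        feats = binary_features(tokens)
        feat_counts[label].update(feats)
        vocab |= feats
    return class_counts, feat_counts, vocab


def predict_nb(model, tokens):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for feat in binary_features(tokens) & vocab:  # ignore unseen features
            score += math.log((feat_counts[label][feat] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


docs = [["good", "news"], ["strong", "growth"], ["not", "good"], ["bad", "loss"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
```

Without the NOT_ marking, "not good" would contribute the same "good" feature as a positive text, which is the kind of error that correct negation handling avoids.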

In this paper we propose a machine learning approach to classify melanocytic lesions as malignant or benign, using dermoscopic images. The lesion features used in the classification framework are inspired by the border, texture, color and structures used in popular dermoscopy algorithms that clinicians perform by visual inspection. The main weakness of dermoscopy algorithms is the selection of a set of weights and thresholds that appear not to be robust or independent of population. The use of machine learning techniques makes it possible to overcome this issue. The proposed method is designed and tested on an image database composed of 655 images of melanocytic lesions: 544 benign lesions and 111 malignant melanomas. After an image pre-processing stage that includes hair removal filtering, each image is automatically segmented using well-known image segmentation algorithms. Then, each lesion is characterized by a feature vector that contains shape, color and texture information, as well as local and global parameters. The detection of particular dermoscopic patterns associated with melanoma is also addressed, and its inclusion in the classification framework is discussed. The learning and classification stage is performed using AdaBoost with C4.5 decision trees. For the automatically segmented database, classification delivered a specificity of 77% for a sensitivity of 90%. The same classification procedure applied to images manually segmented by an experienced dermatologist yielded a specificity of 85% for a sensitivity of 90%.
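
The reported operating point can be made concrete with the definitions of sensitivity and specificity. The confusion counts below are not reported in the abstract; they are illustrative values chosen to be roughly consistent with 111 malignant and 544 benign lesions at a sensitivity of 90% and a specificity of 77%.

```python
def sensitivity(tp, fn):
    """Fraction of malignant lesions that are correctly flagged."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of benign lesions that are correctly cleared."""
    return tn / (tn + fp)


# Illustrative counts only (assumed, not taken from the paper):
# 100 of 111 melanomas detected, 419 of 544 benign lesions cleared.
sens = sensitivity(tp=100, fn=11)
spec = specificity(tn=419, fp=125)
```

Under these assumed counts, about 11 melanomas would be missed and about 125 benign lesions would be flagged for further examination.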

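Difficulty classification of piano scores is currently done mainly by hand, which is inefficient, while existing algorithms for automatically recognizing score difficulty levels fit the classes poorly. Therefore, unlike traditional approaches that treat difficulty-level recognition as a regression problem, this paper models it directly as a classification problem based on support vector machines. Because piano score classification is highly subjective and features are commonly correlated, we use metric learning theory to exploit the prior knowledge in scores labeled with difficulty levels and improve the Gaussian radial basis kernel function according to each feature's contribution to distinguishing difficulty. The result is a metric learning support vector machine classification algorithm, ML-SVM. On two score datasets with 9 and 4 difficulty classes, we compare ML-SVM with logistic regression, with support vector machines using linear, polynomial, and Gaussian radial basis kernel functions, and with each of these support vector machines combined with principal component analysis. The results show that the proposed algorithm achieves higher recognition accuracy than existing algorithms, 68.74% and 84.67% respectively. The proposed algorithm effectively improves the classification performance of the Gaussian-radial-basis-kernel support vector machine in this application.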
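
One way to let feature contributions reshape a Gaussian radial basis kernel is to weight each squared feature difference. The abstract does not give ML-SVM's exact kernel, so this sketch assumes a diagonal metric, k(x, y) = exp(-gamma * sum_i w_i (x_i - y_i)^2), with hypothetical weights standing in for the learned ones.

```python
import math


def weighted_rbf(x, y, weights, gamma=1.0):
    """Gaussian RBF kernel under a diagonal metric (assumed form).

    Features with larger weights make two scores look less similar when
    they differ on that feature; uniform weights recover the plain RBF.
    """
    squared_distance = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical weights: the first feature matters twice as much as the second.
weights = [2.0, 1.0]
k_same = weighted_rbf([1.0, 2.0], [1.0, 2.0], weights)
k_first = weighted_rbf([0.0, 0.0], [1.0, 0.0], weights)
k_second = weighted_rbf([0.0, 0.0], [0.0, 1.0], weights)
```

A kernel of this form can be supplied to a support vector machine in place of the standard Gaussian kernel, so that differences on discriminative features separate samples more strongly.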

Machine Learning for User Modeling
At first blush, user modeling appears to be a prime candidate for straightforward application of standard machine learning techniques. Observations of the user's behavior can provide training examples that a machine learning system can use to form a model designed to predict future actions. However, user modeling poses a number of challenges for machine learning that have hindered its application in user modeling, including: the need for large data sets; the need for labeled data; concept drift; and computational complexity. This paper examines each of these issues and reviews approaches to resolving them.

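For the task allocation problem in crowdsensing platforms, this paper proposes a task allocation model, T REA U LCM, that combines a task requirement feature extraction algorithm with a user label classification method. First, the task requirement feature extraction algorithm extracts the category keywords of a sensing task. Then, a classifier is obtained by training on the dataset with a multilinear neural network and multiple kernel learning, and the classifier predicts users' type labels. Finally, based on the task's category keywords together with spatial location information and user participation, the model selects the users who hold the task's category label and best satisfy the task requirements, and distributes the task to them. Simulation results show that the T REA U LCM task allocation model is feasible in terms of task matching degree and task allocation efficiency.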

