Similar literature
Found 20 similar documents; search time: 15 ms
1.
Hierarchical clustering of mixed data based on distance hierarchy   Total citations: 1, self-citations: 0, citations by others: 1
Data clustering is an important data mining technique that partitions data according to some similarity criterion. Many algorithms have been proposed for clustering numerical data, and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measuring of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular by integrating it with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical data and categorical data, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity.
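The idea of measuring categorical distance on a hierarchy can be illustrated with a minimal sketch. It is not the article's implementation: the category names are hypothetical, every link is assumed to have weight 1, and the distance between two values is taken to be the number of links on the tree path between them.

```python
# Hypothetical concept tree, stored as child -> parent.
# Assumption: every link has weight 1; the article's scheme may weight links differently.
PARENT = {
    "Coke": "Soft drink",
    "Pepsi": "Soft drink",
    "Coffee": "Hot drink",
    "Tea": "Hot drink",
    "Soft drink": "Drink",
    "Hot drink": "Drink",
}


def path_to_root(value):
    """Return the nodes from value up to the root, inclusive."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def hierarchy_distance(a, b):
    """Count the links on the tree path between a and b."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    steps_in_b = {node: i for i, node in enumerate(up_b)}
    # The first ancestor of a that is also an ancestor of b is the meeting point.
    for steps_a, node in enumerate(up_a):
        if node in steps_in_b:
            return steps_a + steps_in_b[node]
    raise ValueError("values do not share a root")
```

Under these assumptions, two values under the same parent (Coke and Pepsi) are closer than two values that only meet at the root (Coke and Coffee), which is the kind of graded similarity a plain equal/unequal comparison cannot express.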

2.
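孙翀 (Sun Chong), 卢炎生 (Lu Yansheng). 《计算机科学》 (Computer Science), 2012, 39(3): 170-173
Frequent pattern mining discovers patterns that occur frequently in data and is an important step in association rule mining. Parallel frequent pattern algorithms apply it in a parallel environment in order to mine massive data sets. Building on the implementation in the Apache Software Foundation's Mahout project, this paper proposes new optimization strategies for the counting and sorting phases and for the execution order of the algorithm. The optimized design stores the counting information in a distributed coordination system, taking full advantage of that system's high availability and its suitability for storing metadata. The design reduces the overhead of small files on the distributed file system (HDFS) while retaining its advantages, and it allows the counting and sorting processes to run in parallel, which reduces memory overhead on the compute nodes. The paper compares file system I/O overhead and analyzes the difficulties in implementing the design, laying a foundation for future work.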

3.
This study proposes an approach that enables online stores to offer customized marketing by segmenting their customers based on the customers’ psychographic data. By segmenting customers into a few groups with similar purchase intentions and identifying each group's value, online stores can concentrate on more profitable activities. To segment online customers, the approach draws on previous research into online purchasing behavior and employs the factors that affect customers’ intention to purchase on the Web. We integrated the clustering results of SOM (self-organizing map) and the k-means algorithm into a single model. According to the market segments produced by our approach, online stores can develop promotional marketing and offer personalized service to the e-customers who are more valuable and more promising.

4.
Analyzing bank databases for customer behavior management is difficult since bank databases are multi-dimensional, comprising monthly account records and daily transaction records. This study proposes an integrated data mining and behavioral scoring model to manage existing credit card customers in a bank. A self-organizing map neural network was used to identify groups of customers based on repayment behavior and on recency, frequency, and monetary (RFM) behavioral scoring predictors. It also classified bank customers into three major profitable groups. The resulting groups were then profiled by customer feature attributes determined using an Apriori association rule inducer. This study demonstrates that identifying customers with a behavioral scoring model helps reveal customer characteristics and facilitates marketing strategy development.

5.
Automatic generation of concept hierarchies using WordNet   Total citations: 1, self-citations: 1, citations by others: 1
This paper examines and proposes the automatic generation of concept hierarchies using WordNet. Existing research has mostly explored the utilization of concept hierarchies, but has not addressed the prohibitive cost incurred in building large hierarchies manually. Several studies have examined the automatic generation of concept hierarchies for numerical data from a database. However, very little is known about the automatic generation of concept hierarchies for nominal data from a database, which is the subject of this paper. We propose the WordNet library method, which first eliminates the ambiguity of the senses of nominal data values, then constructs the concept hierarchy by overlapping the hypernyms of the remaining senses, and lastly adjusts the resultant concept hierarchy to the preference of users. The proposed method is tested with a faculty employment database of a university. The automatic generation of hierarchies turns out to save the effort of experts or designers who build concept hierarchies, and yields a hierarchy that is built more objectively than one built manually.

6.
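A personalized recommendation algorithm based on concept hierarchy   Total citations: 8, self-citations: 0, citations by others: 8
熊馨 (Xiong Xin), 王卫平 (Wang Weiping), 叶跃祥 (Ye Yuexiang). 《计算机应用》 (Journal of Computer Applications), 2005, 25(5): 1006-1008, 1015
Collaborative filtering has been applied fairly successfully in personalized recommender systems, but as systems grow it faces a severe sparsity problem that limits recommendation quality. This paper proposes a concept hierarchy method to improve the user-item matrix. Using both transaction data and clickstream data, it combines the items chosen by similar users with the items recommended by multi-level association rules, and it shows good performance on sparse data sets.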
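Why a concept hierarchy helps with sparsity can be shown with a small sketch. It is an illustration under stated assumptions rather than the paper's algorithm: the item names, the `ITEM_CATEGORY` mapping, and the choice of cosine similarity are all hypothetical. Two users who bought different items from the same categories have no overlap at the item level, yet look similar once items are rolled up one level.

```python
import math
from collections import Counter

# Hypothetical one-level concept hierarchy: item -> category.
ITEM_CATEGORY = {
    "python_book": "programming",
    "java_book": "programming",
    "novel_a": "fiction",
    "novel_b": "fiction",
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * v.get(key, 0) for key, count in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def roll_up(item_counts):
    """Replace each item by its category and add up the counts."""
    rolled = Counter()
    for item, count in item_counts.items():
        rolled[ITEM_CATEGORY[item]] += count
    return dict(rolled)


# Hypothetical purchase histories with no item in common.
alice = {"python_book": 1, "novel_a": 1}
bob = {"java_book": 1, "novel_b": 1}

item_level = cosine(alice, bob)
category_level = cosine(roll_up(alice), roll_up(bob))
```

Here `item_level` is zero because the users share no item, while `category_level` is close to one because both bought one programming book and one novel.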

7.
Non-negative matrix factorization (NMF) has become a popular technique for finding low-dimensional representations of data. While the standard NMF can only be performed in the original feature space, one variant of NMF, named concept factorization, can be naturally kernelized and inherits all the strengths of NMF. To make use of label information, we propose a semi-supervised concept factorization technique called discriminative concept factorization (DCF) for data representation in this paper. DCF adopts a unified objective to combine the task of data reconstruction with the task of classification. These two tasks have mutual impacts on each other, which results in a concept factorization adapted to the classification task and a classifier built on the low-dimensional representations. Furthermore, we develop an iterative algorithm to solve the optimization problem through alternative convex programming. Experimental results on three real-world classification tasks demonstrate the effectiveness of DCF.

8.
Spatial attributes are important factors for predicting customer behavior. However, thorough studies on this subject have never been carried out. This paper presents a new idea that incorporates spatial predicates, which describe the spatial relationships between customer locations and surrounding objects, into customer attributes. More specifically, we developed two algorithms in order to achieve spatially enabled customer segmentation. First, a novel filtration algorithm is proposed that can select more relevant predicates from the huge number of spatial predicates than existing filtration algorithms. Second, since spatial predicates fundamentally involve some uncertainties, a rough set-based spatial data classification algorithm is developed to handle the uncertainties and therefore provide effective spatial data classification. A series of experiments were conducted, and the results indicate that our proposed methods are superior to existing methods for data classification.

9.
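孙翀 (Sun Chong), 卢炎生 (Lu Yansheng). 《计算机科学》 (Computer Science), 2013, 40(8): 165-171
A summary graph is obtained by assigning the nodes of an original graph to a number of groups and establishing relationships between the groups according to the original edges. The size of the summary graph can be set by the user, who can learn about the original graph by browsing a small summary graph. K-SGS is a new graph summarization algorithm based on node concept hierarchies; it removes the restriction on the summary-size parameter in the traditional K-SNAP algorithm. To do so, the algorithm introduces concept hierarchies over node attribute values, which makes node grouping during summarization more flexible: nodes with similar values can be merged, not only nodes with identical values. Besides the information loss of edges during summarization, K-SGS also considers the information loss of nodes. It models graph summarization as a multi-objective programming problem and solves it with two different strategies: the hierarchical sequence method and a unified evaluation function based on ranking. Algorithmically, a fast hierarchical clustering method is proposed that merges multiple clusters in each round, which improves efficiency. Experiments on data sets show that the new algorithm can produce summary graphs for a wide range of size parameters with good summary quality.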

10.
Clustering has been widely adopted in numerous applications, including pattern recognition, data analysis, image processing, and market research. When performing data mining, traditional clustering algorithms that use distance-based measurements to calculate the difference between data are unsuitable for non-numeric attributes such as nominal, Boolean, and categorical data. Applying an unsuitable similarity measurement in clustering may cause some valuable information embedded in the data attributes to be lost, and hence low-quality clusters will be created. This paper proposes a novel hierarchical clustering algorithm, referred to as MPM, for the clustering of non-numeric data. The goals of MPM are to retain the data features of interest while effectively grouping data objects into clusters with high intra-similarity and low inter-similarity. MPM achieves these goals through two principal methods: (1) the adoption of a novel similarity measurement which has the ability to capture the “characterized properties” of information, and (2) the application of matrix permutation and matrix participation partitioning to the results of the similarity measurement (constructed in the form of a similarity matrix) in order to assign data to appropriate clusters. This study also proposes a heuristic-based algorithm, the Heuristic_MPM, to reduce the processing times required for matrix permutation and matrix partitioning, which together constitute the bulk of the total MPM execution time. An erratum to this article is available at .

11.
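Attribute-oriented induction is applied to an online bookstore: concept hierarchy techniques are used to generalize users' access needs from their registration information, so that personalized services can be provided to users proactively and in real time. Experiments show that the method is useful for studying users' interests and preferences.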

12.
The more the telecom services marketing paradigm evolves, the more important it becomes to retain high-value customers. Traditional customer segmentation methods based on experience or ARPU (Average Revenue per User) consider neither customers’ future revenue nor the cost of servicing customers of different types. Therefore, it is very difficult to effectively identify high-value customers. In this paper, we propose a novel customer segmentation method based on customer lifecycle, which includes five decision models, i.e. current value, historic value, prediction of long-term value, credit and loyalty. Because long-term value, credit and loyalty are difficult to compute quantitatively, a decision tree method is used to extract the important parameters related to them. Then a judgment matrix formulated on the basis of the characteristics of the data and the experience of business experts is presented. Finally, a simple and practical customer value evaluation system is built. This model is applied to telecom operators in a province in China and good accuracy is achieved.

13.
ILP-based concept discovery in multi-relational data mining   Total citations: 1, self-citations: 0, citations by others: 1
Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency to store data in relational databases. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria, in order to cope with an intractably large search space and to be able to generate high-quality patterns. In this work, an ILP-based concept discovery method, namely Confidence-based Concept Discovery (C2D), is described in which strong declarative biases and user-defined specifications are relaxed. Moreover, this new method works directly on relational databases. In addition, a new confidence-based pruning is used in this technique. We also describe how to define and use aggregate predicates as background knowledge in the proposed method. In order to use aggregate predicates, we show how to handle numerical attributes by using comparison operators on them. Finally, we analyze the effect of incorporating unrelated facts for generating transitive rules on the proposed method. A set of experiments is conducted on real-world problems to test the performance of the proposed method.

14.
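A hybrid clustering method for automatic medical image segmentation   Total citations: 1, self-citations: 0, citations by others: 1
The quality and degree of automation of medical image segmentation have an important influence on computer-aided diagnosis, visualization, and related tasks. Since medical images have low contrast and are strongly affected by noise, a hybrid clustering method is proposed. After the image is preprocessed, the neighborhood feature vector of each pixel is fed into a self-organizing map (SOM) network for training. As the preliminary clustering result, the SOM's output prototype vectors are filtered according to the hits map and then further processed by agglomerative hierarchical clustering. After one cluster validity index and two image segmentation evaluation indices are compared, a quantitative image segmentation index is used to determine the optimal number of clusters; post-processing then yields the final segmentation. Analysis shows that the method is effective. Its shortcomings and directions for further research are also pointed out.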

15.
For streaming data that arrive continuously, such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of clustering. In this paper, a novel density-based hierarchical clustering scheme for streaming data is proposed in order to improve both accuracy and effectiveness; it is based on the agglomerative clustering framework. Traditionally, clustering algorithms for streaming data often use the cluster center to represent the whole cluster when conducting cluster merging, which may lead to unsatisfactory results. We argue that even if the data set is accessed only once, some parameters, such as the within-cluster variance, the intra-cluster density and the inter-cluster distance, can be calculated accurately. This may bring measurable benefits to the process of cluster merging. Furthermore, we employ a general framework that can incorporate different criteria and, given the same criteria, will produce similar clustering results for both streaming and non-streaming data. In experimental studies, the proposed method demonstrates promising results with reduced time and space complexity.
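The claim that variance can be computed accurately in a single pass can be illustrated with Welford's online algorithm. This is a standard technique swapped in for illustration; the abstract does not say which update rule the authors use, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's online algorithm).

    Assumption: one instance tracks one numeric attribute of one cluster.
    """

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics without storing it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Each value is read once and then discarded, so memory use stays constant no matter how long the stream runs.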

16.
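Automatic cell nucleus segmentation using a growing active contour model   Total citations: 3, self-citations: 0, citations by others: 3
胡敏 (Hu Min), 平西建 (Ping Xijian), 郭戈 (Guo Ge), 丁益洪 (Ding Yihong). 《计算机工程》 (Computer Engineering), 2006, 32(1): 37-39, 129
An improved active contour model is proposed for cell nucleus segmentation. After the seed point of each nucleus is detected by ultimate erosion, a growing snake model described in polar coordinates is established with each seed point as its center. A growth energy based on region similarity is added, which overcomes the traditional model's drawback that the initial contour must be placed near the true boundary. When the model is solved with a greedy algorithm, the search space is reduced from the usual 8-neighborhood to the two adjacent quantized points along the radial direction, which improves computational efficiency.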

17.
CAD mesh models have been widely employed in current CAD/CAM systems, where it is quite useful to recognize the features of the CAD mesh models. The first step of feature recognition is to segment the CAD mesh model into meaningful parts. Although there are many mesh segmentation methods in the literature, the majority of them are not suitable for CAD mesh models. In this paper, we design a mesh segmentation method based on clustering, dedicated to the CAD mesh model. Specifically, using the agglomerative clustering method, the given CAD mesh model is first clustered into sparse and dense triangle regions. Furthermore, the sparse triangle region is separated into planar regions, cylindrical regions, and conical regions by the Gauss map of the triangular faces and the Hough transformation; the dense triangle region is segmented by the mean shift operation performed on the mean curvature field defined on the mesh faces. Extensive empirical results demonstrate the effectiveness and efficiency of the proposed CAD mesh segmentation method.

18.
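To speed up the automatic generation of task hierarchies in hierarchical reinforcement learning, a parallel automatic hierarchy method based on a multi-agent system is proposed. The method takes the Option hierarchical reinforcement learning approach proposed by Sutton as its theoretical framework. First, multiple agents cooperatively explore the state space in parallel, and the results are clustered centrally to produce state subspaces; then the agents learn in parallel to generate the internal policy on each subspace, and finally the Options are generated. The algorithm is presented for the task of shortest-path planning between two points in a two-dimensional grid space with obstacles, and simulation experiments and analysis are carried out. The results show that the parallel method generates the task hierarchy markedly faster than previous serial automatic hierarchy methods. The method is applicable to problem domains such as space exploration, path planning, and pursuit-evasion.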

19.
This paper investigates the abilities of adaptive resonance theory (ART) neural networks as miners of hierarchical thematic structure in text collections. We present experimental results with binary ART1 on the benchmark Reuters-21578 corpus. Using both quantitative evaluation with the standard F1 measure and qualitative visualization of the hierarchy obtained with ART, we discuss how useful ART-built hierarchies would be to a user intending to use them as a means to find and access textual information. Our F1 results show that ART1 produces a hierarchical clustering whose quality exceeds that of k-means and of a hierarchical clustering algorithm. However, we identify several critical problem areas that would make it rather impractical to actually use such a hierarchy in a real-life environment. These predicaments point to the importance of semantic feature selection. Our main contribution is to test in detail the applicability of ART to the important domain of hierarchical document clustering, an application of Adaptive Resonance that had received little attention until now.
Louis Massey
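For readers unfamiliar with the F1 measure mentioned above, the following sketch computes it for one cluster against one reference class. The document identifiers are hypothetical, and how the paper aggregates F1 over all clusters and classes is not stated in the abstract, so that step is left out.

```python
def f1_score(cluster_docs, class_docs):
    """F1 of one cluster against one reference class.

    precision = share of the cluster that belongs to the class
    recall    = share of the class that the cluster recovers
    F1        = harmonic mean of precision and recall
    """
    cluster_docs = set(cluster_docs)
    class_docs = set(class_docs)
    overlap = len(cluster_docs & class_docs)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster_docs)
    recall = overlap / len(class_docs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical document ids: a cluster of 4 documents, a class of 6, 2 in common.
score = f1_score({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

In this example precision is 2/4 and recall is 2/6, so the harmonic mean comes out at 0.4.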

20.
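This paper first analyzes clustering algorithms. Then, taking small and medium-sized commercial wholesale enterprises as an example, it designs a customer segmentation model that reflects customer value and the quality of customer relationships, and applies K-Means clustering to carry out actual mining. It discusses how, even when small and medium-sized enterprises cannot provide complete data, effective customer segmentation can still be achieved as long as a reasonable segmentation model is designed and a suitable algorithm is chosen.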
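The K-Means step can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's segmentation model: the two features (a value score and a relationship score), the sample customers, and the deterministic choice of the first k points as initial centroids are all assumptions made so that the example is reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats.

    Assumption: the first k points serve as initial centroids, which keeps
    the run deterministic; practical code would use a smarter initialization.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical customers described by (value score, relationship score).
customers = [(1.0, 1.0), (9.0, 8.0), (1.5, 2.0), (8.0, 9.0), (1.0, 0.5), (9.5, 9.0)]
labels, centroids = kmeans(customers, k=2)
```

With this data the low-scoring and high-scoring customers end up in separate segments. Real features would normally be rescaled to comparable ranges first, since k-means is sensitive to units.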
