Found 20 similar documents; search took 15 ms
1.
Data clustering is an important data mining technique which partitions data according to some similarity criterion. Many algorithms have been proposed for clustering numerical data, and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measuring of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular by integrating it with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical data and categorical data, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity.
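The idea of measuring categorical distance through a hierarchy can be illustrated with a minimal sketch. This is a simplification under stated assumptions, not the article's exact scheme: the hierarchy, its link weights, the names `HIERARCHY`, `categorical_distance` and `mixed_distance`, and the normalization constants are all hypothetical. Here the distance between two categorical values is taken as the length of the tree path between them, and it is combined with a normalized numeric difference.

```python
from typing import Dict, Optional, Tuple

# Hypothetical concept hierarchy: child -> (parent, weight of the link to the parent).
# Values and weights are illustrative only; they are not taken from the article.
HIERARCHY: Dict[str, Tuple[Optional[str], float]] = {
    "Any": (None, 0.0),
    "Drink": ("Any", 1.0),
    "Snack": ("Any", 1.0),
    "Coke": ("Drink", 1.0),
    "Pepsi": ("Drink", 1.0),
    "Chips": ("Snack", 1.0),
}


def path_to_root(value: str) -> Dict[str, float]:
    """Map each ancestor of `value` (including itself) to its distance from `value`."""
    distances: Dict[str, float] = {}
    travelled = 0.0
    node: Optional[str] = value
    while node is not None:
        distances[node] = travelled
        parent, weight = HIERARCHY[node]
        travelled += weight
        node = parent
    return distances


def categorical_distance(a: str, b: str) -> float:
    """Length of the tree path between a and b via their lowest common ancestor."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    # With non-negative weights, the closest shared ancestor gives the shortest path.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def mixed_distance(x: Tuple[float, str], y: Tuple[float, str],
                   numeric_range: float, max_categorical: float) -> float:
    """Euclidean combination of a normalized numeric and a normalized categorical distance."""
    numeric = abs(x[0] - y[0]) / numeric_range
    categorical = categorical_distance(x[1], y[1]) / max_categorical
    return (numeric ** 2 + categorical ** 2) ** 0.5
```

With this toy hierarchy, "Coke" and "Pepsi" are closer to each other than "Coke" and "Chips", which is the kind of graded similarity that a plain equal/not-equal comparison of categorical values cannot express.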
2.
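Frequent pattern mining discovers patterns that occur frequently in data and is an important step in association rule mining. Parallel frequent pattern algorithms apply it in parallel environments in order to mine massive data sets. Building on the implementation in the Apache Software Foundation's Mahout project, new optimization strategies are proposed for the counting and sorting phases and for the execution order of the algorithm. The optimized design stores the count information in a distributed coordination system, taking full advantage of that system's high availability and its suitability for storing metadata. The design reduces the overhead of small files on the distributed file system (HDFS) while retaining its advantages, and it also allows the counting and sorting processes to run in parallel, reducing the memory overhead on compute nodes. File system I/O overhead is compared and the difficulties in implementing the design are analyzed, laying a foundation for future work.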
3.
This study proposes an approach that enables online stores to offer customized marketing by segmenting their customers based on customers’ psychographic data. Online stores can concentrate on more profitable activities by identifying customers’ value as they segment their customers into a few groups with similar intentions to purchase. To segment online customers, the approach builds on previous research that explains the purchasing behavior of online customers and employs the factors that affect customers’ intention to purchase on the Web. We integrated the clustering results of a self-organizing map (SOM) and the k-means algorithm into a single model. Online stores can develop promotional marketing and offer personalized service for the more valuable and more promising e-customers, according to the market segments presented by our approach.
4.
Analyzing bank databases for customer behavior management is difficult since bank databases are multi-dimensional, comprising monthly account records and daily transaction records. This study proposes an integrated data mining and behavioral scoring model to manage existing credit card customers in a bank. A self-organizing map neural network was used to identify groups of customers based on repayment behavior and on recency, frequency, and monetary (RFM) behavioral scoring predictors. It also classified bank customers into three major profitable groups. The resulting groups were then profiled by customer feature attributes determined using an Apriori association rule inducer. This study demonstrates that identifying customers with a behavioral scoring model helps reveal customer characteristics and facilitates marketing strategy development.
5.
Automatic generation of concept hierarchies using WordNet Total citations: 1, Self-citations: 1, Other citations: 1
This paper examines and proposes the automatic generation of concept hierarchies using WordNet. Existing research has mostly explored the utilization of concept hierarchies, but has not addressed the prohibitive cost incurred in building large hierarchies manually. Several studies have examined the automatic generation of concept hierarchies for numerical data from a database. However, very little is known about the automatic generation of concept hierarchies for nominal data from a database, which is the subject of this paper. We propose the WordNet library method, which first eliminates the ambiguity of the senses of nominal data values, then constructs the concept hierarchy by overlapping the hypernyms of the remaining senses, and lastly adjusts the resultant concept hierarchy to the preference of users. The proposed method is tested with a faculty employment database of a university. The automatic generation of hierarchies turns out to save the effort of experts or designers who build the concept hierarchies, and makes the hierarchy more objective than one built manually.
6.
7.
Non-negative matrix factorization (NMF) has become a popular technique for finding low-dimensional representations of data. While the standard NMF can only be performed in the original feature space, one variant of NMF, named concept factorization, can be naturally kernelized and inherits all the strengths of NMF. To make use of label information, we propose a semi-supervised concept factorization technique called discriminative concept factorization (DCF) for data representation in this paper. DCF adopts a unified objective to combine the task of data reconstruction with the task of classification. These two tasks have mutual impacts on each other, which results in a concept factorization adapted to the classification task and a classifier built on the low-dimensional representations. Furthermore, we develop an iterative algorithm to solve the optimization problem through alternative convex programming. Experimental results on three real-world classification tasks demonstrate the effectiveness of DCF.
8.
Spatially enabled customer segmentation using a data classification method with uncertain predicates
Spatial attributes are important factors for predicting customer behavior. However, thorough studies on this subject have never been carried out. This paper presents a new idea that incorporates spatial predicates describing the spatial relationships between customer locations and surrounding objects into customer attributes. More specifically, we developed two algorithms in order to achieve spatially enabled customer segmentation. First, a novel filtration algorithm is proposed that can select more relevant predicates from the huge amounts of spatial predicates than existing filtration algorithms. Second, since spatial predicates fundamentally involve some uncertainties, a rough set-based spatial data classification algorithm is developed to handle the uncertainties and therefore provide effective spatial data classification. A series of experiments were conducted and the results indicate that our proposed methods are superior to existing methods for data classification.
9.
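A summary graph is obtained by assigning the nodes of an original graph to a number of groups and establishing the relationships between groups according to the original edges. The size of the summary graph can be set by the user, who can obtain information about the original graph by browsing a small summary graph. K-SGS is a new graph summarization algorithm based on concept hierarchies of nodes; it removes the restriction on the summary-size parameter in the traditional K-SNAP algorithm. To do so, the algorithm introduces concept hierarchies over node attribute values, which makes node grouping during summarization more flexible: nodes with similar values can be merged, not only nodes with identical values. Besides the information loss of edges during summarization, K-SGS also considers the information loss of nodes. It models graph summarization as a multi-objective programming problem and solves it with two different strategies: the hierarchical sequence method and a unified evaluation function based on ranking. On the algorithmic side, a fast hierarchical clustering method is proposed that can merge multiple clusters in each round, which improves efficiency. Experiments on data sets show that the new algorithm can produce summary graphs for a wide range of size parameters with good summary quality.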
10.
Hewijin Christine Jiau, Yi-Jen Su, Yeou-Min Lin, Shang-Rong Tsai 《Journal of Intelligent Information Systems》2006,26(2):185-207
Clustering has been widely adopted in numerous applications, including pattern recognition, data analysis, image processing, and market research. When performing data mining, traditional clustering algorithms which use distance-based measurements to calculate the difference between data are unsuitable for non-numeric attributes such as nominal, Boolean, and categorical data. Applying an unsuitable similarity measurement in clustering may cause some valuable information embedded in the data attributes to be lost, and hence low quality clusters will be created. This paper proposes a novel hierarchical clustering algorithm, referred to as MPM, for the clustering of non-numeric data. The goals of MPM are to retain the data features of interest while effectively grouping data objects into clusters with high intra-similarity and low inter-similarity. MPM achieves these goals through two principal methods: (1) the adoption of a novel similarity measurement which has the ability to capture the “characterized properties” of information, and (2) the application of matrix permutation and matrix participation partitioning to the results of the similarity measurement (constructed in the form of a similarity matrix) in order to assign data to appropriate clusters. This study also proposes a heuristic-based algorithm, the Heuristic_MPM, to reduce the processing times required for matrix permutation and matrix partitioning, which together constitute the bulk of the total MPM execution time.
An erratum to this article is available at .
11.
12.
The more the telecom services marketing paradigm evolves, the more important it becomes to retain high-value customers. Traditional customer segmentation methods based on experience or ARPU (Average Revenue per User) consider neither customers’ future revenue nor the cost of servicing customers of different types. Therefore, it is very difficult to effectively identify high-value customers. In this paper, we propose a novel customer segmentation method based on customer lifecycle, which includes five decision models, i.e. current value, historic value, prediction of long-term value, credit and loyalty. Due to the difficulty of quantitative computation of long-term value, credit and loyalty, a decision tree method is used to extract important parameters related to long-term value, credit and loyalty. Then a judgment matrix formulated on the basis of the characteristics of the data and the experience of business experts is presented. Finally, a simple and practical customer value evaluation system is built. This model is applied to telecom operators in a province in China and good accuracy is achieved.
13.
ILP-based concept discovery in multi-relational data mining Total citations: 1, Self-citations: 0, Other citations: 1
Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency of storing data in relational databases. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria, in order to cope with intractably large search space and to be able to generate high-quality patterns. In this work, an ILP-based concept discovery method, namely Confidence-based Concept Discovery (C2D), is described in which strong declarative biases and user-defined specifications are relaxed. Moreover, this new method directly works on relational databases. In addition to this, a new confidence-based pruning is used in this technique. We also describe how to define and use aggregate predicates as background knowledge in the proposed method. In order to use aggregate predicates, we show how to handle numerical attributes by using comparison operators on them. Finally, we analyze the effect of incorporating unrelated facts for generating transitive rules on the proposed method. A set of experiments is conducted on real-world problems to test the performance of the proposed method.
14.
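A hybrid clustering method for automatic medical image segmentation Total citations: 1, Self-citations: 0, Other citations: 1
The quality and degree of automation of medical image segmentation have an important influence on computer-aided diagnosis, visualization, and related applications. To address the low contrast and heavy noise of medical images, a hybrid clustering method is proposed. After the image is preprocessed, the neighborhood feature vector of each pixel is fed into a self-organizing map (SOM) network for training. As the preliminary clustering result, the prototype vectors output by the SOM are filtered according to the hits map (Hits-Map) and then further processed by agglomerative hierarchical clustering. After comparing one clustering evaluation index and two image segmentation evaluation indices, a quantitative image segmentation index is adopted to determine the optimal number of clusters. Post-processing then yields the final segmentation result, and analysis shows that the method is effective. Its shortcomings and directions for further research are also pointed out.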
15.
For streaming data that arrive continuously such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of clustering. In this paper, a novel density-based hierarchical clustering scheme for streaming data is proposed in order to improve both accuracy and effectiveness; it is based on the agglomerative clustering framework. Traditionally, clustering algorithms for streaming data often use the cluster center to represent the whole cluster when conducting cluster merging, which may lead to unsatisfactory results. We argue that even if the data set is accessed only once, some parameters, such as the variance within cluster, the intra-cluster density and the inter-cluster distance, can be calculated accurately. This may bring measurable benefits to the process of cluster merging. Furthermore, we employ a general framework that can incorporate different criteria and, given the same criteria, will produce similar clustering results for both streaming and non-streaming data. In experimental studies, the proposed method demonstrates promising results with reduced time and space complexity.
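The claim that within-cluster variance can be computed accurately in a single pass can be illustrated with a standard technique. The sketch below uses Welford's online update together with the well-known pairwise merge formula for combining two summaries; it is not the paper's algorithm, it handles one-dimensional data only for brevity, and the class name `RunningStats` is a hypothetical choice.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass summary of a one-dimensional cluster (illustrative, not from the paper)."""
    n: int = 0          # number of points seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        """Welford's update: fold one new point into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        """Population variance of all points seen so far."""
        return self.m2 / self.n if self.n else 0.0

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two summaries as if their points had been seen together."""
        n = self.n + other.n
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)
```

Because `merge` needs only the three stored numbers per cluster, two clusters can be combined without revisiting any raw points, which is what makes such statistics usable when the stream is scanned only once.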
16.
17.
CAD mesh models have been widely employed in current CAD/CAM systems, where it is quite useful to recognize the features of the CAD mesh models. The first step of feature recognition is to segment the CAD mesh model into meaningful parts. Although there are many mesh segmentation methods in the literature, the majority of them are not suitable for CAD mesh models. In this paper, we design a mesh segmentation method based on clustering, dedicated to the CAD mesh model. Specifically, by the agglomerative clustering method, the given CAD mesh model is first clustered into sparse and dense triangle regions. Furthermore, the sparse triangle region is separated into planar regions, cylindrical regions, and conical regions by the Gauss map of the triangular faces and the Hough transformation; the dense triangle region is segmented by the mean shift operation performed on the mean curvature field defined on the mesh faces. Extensive empirical results demonstrate the effectiveness and efficiency of the CAD mesh segmentation method in this paper.
18.
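To speed up the automatic generation of task hierarchies in hierarchical reinforcement learning, a parallel automatic hierarchy-construction method based on a multi-agent system is proposed. The method takes the Option hierarchical reinforcement learning approach proposed by Sutton as its theoretical framework. First, multiple agents cooperate to explore the state space in parallel, and the results are clustered centrally to produce state subspaces. The agents then learn in parallel to generate the internal policy on each subspace, and finally the Options are generated. The algorithm is presented for the task of planning the shortest path between two points in a two-dimensional grid space with obstacles, and simulation experiments and analysis are carried out. The results show that the parallel automatic method generates the task hierarchy markedly faster than previous serial automatic methods. The method is applicable to problem domains such as space exploration, path planning, and pursuit-evasion.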
19.
Louis Massey 《Neural computing & applications》2009,18(3):261-273
This paper investigates the abilities of adaptive resonance theory (ART) neural networks as miners of hierarchical thematic structure in text collections. We present experimental results with binary ART1 on the benchmark Reuters-21578 corpus. Using both quantitative evaluation with the standard F1 measure and qualitative visualization of the hierarchy obtained with ART, we discuss how useful ART-built hierarchies would be to a user intending to use them as a means to find and access textual information. Our F1 results show that ART1 produces hierarchical clusterings whose quality exceeds that of k-means and of a hierarchical clustering algorithm. However, we identify several critical problem areas that would make it rather impractical to actually use such a hierarchy in a real-life environment. These predicaments point to the importance of semantic feature selection. Our main contribution is to test in detail the applicability of ART to the important domain of hierarchical document clustering, an application of Adaptive Resonance that had received little attention until now.
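For readers unfamiliar with the F1 measure mentioned in the abstract above, a minimal sketch of its standard definition follows: the harmonic mean of precision and recall. How the counts are obtained when comparing clusters with reference topics is not described in the abstract, so the function name `f1_score` and the count arguments are assumptions for illustration only.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        # No correct assignments: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 stays low unless precision and recall are both reasonably high, which is why it is often preferred to either value alone.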
20.
Wang Xing Zheng Cheng-zeng 《数字社区&智能家居》2008,(Z1)
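This paper first analyzes clustering algorithms. Then, taking small and medium-sized commercial wholesale enterprises as an example, it designs a customer segmentation model that reflects customer value and the quality of customer relationships, and applies the K-Means clustering method to carry out the actual mining. It explores how, even when small and medium-sized enterprises cannot provide complete data, effective customer segmentation can still be achieved as long as a reasonable segmentation model is designed and a suitable algorithm is selected.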
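As a rough illustration of the K-Means step, here is a minimal pure-Python sketch of Lloyd's algorithm. It is not the paper's implementation: the two features per customer (a value score and a relationship-quality score), the function name `k_means`, and the naive choice of the first k points as initial centroids are all assumptions made for the example.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_means(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster label for each point using Lloyd's algorithm."""
    # Naive deterministic initialization: the first k points serve as centroids.
    centroids: List[Point] = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical customers described by (value score, relationship-quality score).
customers = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
segments = k_means(customers, k=2)
```

In practice the features would be scaled to comparable ranges first, and the initial centroids would be chosen more carefully, since K-Means results depend on both.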