首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Personal name disambiguation is an important task in social network extraction, evaluation and integration of ontologies, information retrieval, cross‐document coreference resolution and word sense disambiguation. We propose an unsupervised method to automatically annotate people with ambiguous names on the Web using automatically extracted keywords. Given an ambiguous personal name, first, we download text snippets for the given name from a Web search engine. We then represent each instance of the ambiguous name by a term‐entity model (TEM), a model that we propose to represent the Web appearance of an individual. A TEM of a person captures named entities and attribute values that are useful to disambiguate that person from his or her namesakes (i.e., different people who share the same name). We then use group average agglomerative clustering to identify the instances of an ambiguous name that belong to the same person. Ideally, each cluster must represent a different namesake. However, in practice it is not possible to know the number of namesakes for a given ambiguous personal name in advance. To circumvent this problem, we propose a novel normalized cuts‐based cluster stopping criterion to determine the different people on the Web for a given ambiguous name. Finally, we annotate each person with an ambiguous name using keywords selected from the clusters. We evaluate the proposed method on a data set of over 2500 documents covering 200 different people for 20 ambiguous names. Experimental results show that the proposed method outperforms numerous baselines and previously proposed name disambiguation methods. Moreover, the extracted keywords reduce ambiguity of a name in an information retrieval task, which underscores the usefulness of the proposed method in real‐world scenarios.  相似文献   

2.
Categorical data clustering is a difficult and challenging task due to the special characteristic of categorical attributes: no natural order. Thus, this study aims to propose a two-stage method named partition-and-merge based fuzzy genetic clustering algorithm (PM-FGCA) for categorical data. The proposed PM-FGCA uses a fuzzy genetic clustering algorithm to partition the dataset into a maximum number of clusters in the first stage. Then, the merge stage is designed to select two clusters among the clusters that generated in the first stage based on its inter-cluster distances and merge two selected clusters to one cluster. This procedure is repeated until the number of clusters equals to the predetermined number of clusters. Thereafter, some particular instances in each cluster are considered to be re-assigned to other clusters based on the intra-cluster distances. The proposed PM-FGCA is implemented on ten categorical datasets from UCI machine learning repository. In order to evaluate the clustering performance, the proposed PM-FGCA is compared with some existing methods such as k-modes algorithm, fuzzy k-modes algorithm, genetic fuzzy k-modes algorithm, and non-dominated sorting genetic algorithm using fuzzy membership chromosomes. Adjusted Ranked Index (ARI), Normalized Mutual Information (NMI), and Davies–Bouldin (DB) index are selected as three clustering validation indices which are represented to both external index (i.e., ARI and NMI) and internal index (i.e., DB). Consequently, the experimental result shows that the proposed PM-FGCA outperforms the benchmark methods in terms of the tested indices.  相似文献   

3.
为降低传统FCM算法的计算复杂性,提高Web用户聚类的效果,文中提出了一种改进的基于特征属性的Web用户模糊聚类算法。首先通过用户访问页面的次数和时间建立Web用户兴趣度矩阵,并根据商品的特征属性值将Web用户兴趣度矩阵映射为用户对特征属性的偏好矩阵,从而有效降低数据稀疏性;然后以此为数据集,对传统的FCM算法进行了改进,将聚类中心分为活动和稳定两种,忽略稳定聚类中的距离计算以降低计算复杂性。最后通过仿真实验证实了新算法的有效性和可行性。  相似文献   

4.
为了实现Web服务请求数据的快速聚类,并提高聚类的准确率,提出一种基于增量式时间序列和任务调度的Web数据聚类算法,该算法进行了Web数据在时间序列上的聚类定义,并采用增量式时间序列聚类方法,通过数据压缩的形式降低Web数据的复杂性,进行基于服务时间相似性的时间序列数据聚类。针对Web集群服务的最佳服务任务调度问题,通过以服务器执行能力为标准来分配服务任务。实验仿真结果表明,相比基于网格的高维数据层次聚类算法和基于增量学习的多目标模糊聚类算法,提出的算法在聚类时间、聚类精度、服务执行成功率上均获得了更好的效果。  相似文献   

5.
传统的模糊c均值算法需要提前输入聚类个数,但输入错误的聚类数会产生错误的聚类结果。为此,提出一种基于人工免疫细胞膜型的模糊聚类算法。引入种群规模迭代与模糊聚类迭代相结合的双迭代思路,利用种群规模迭代指导聚类数的自动生成,在每次种群规模迭代中加入模糊聚类迭代,同时将克隆选择、抗体免疫抑制等操作融入计算过程。理论分析与仿真结果表明,该算法能搜寻到正确的聚类个数,具有较好的聚类效果。  相似文献   

6.
Web页面和客户群体的模糊聚类算法   总被引:17,自引:0,他引:17  
web日志挖掘在电子商务和个性化web等方面有着广泛的应用.文章介绍了一种web页面和客户群体的模糊聚类算法.在该算法中,首先根据客户对Web站点的浏览情况分别建立Web页面和客户的模糊集,在此基础上根据Max—Min模糊相似性度量规则构造相应的模糊相似矩阵,然后根据模糊相似矩阵直接进行聚类.实验结果表明该算法是有效的.  相似文献   

7.
提出了一种改进的基于对称点距离的蚂蚁聚类算法。该算法不再采用Euclidean距离来计算类内对象的相似性,而是使用新的对称点距离来计算相似性,在处理带有对称性质的数据集时,可以有效地识别给定数据集的聚类数目和合适的划分。在该算法中,用人工蚂蚁代表数据对象,根据算法给定的聚类规则来寻找最合适的聚类划分。最后用本算法与标准的蚂蚁聚类算法分别对不同的数据集进行了聚类实验。实验结果证实了算法的有效性。  相似文献   

8.
结合Web用户访问特点,针对Web用户访问路径聚类分析中普遍存在的对象类别不确定性现象进行了研究.结合模糊聚类和可能性聚类的特点,提出来一种新的用户访问路径的可能性模糊聚类算法.新方法通过定义相关的截集,自动地将对象分配到若干簇中,避免了人工干预,实现了交叉聚类的目的.新方法建立在leader聚类算法的框架上,只需要扫描数据集一遍使得算法效率大大提高.在标准数据集上的对比试验表明新算法不仅是有效的,而且效率较高.  相似文献   

9.
Major problems exist in both crisp and fuzzy clustering algorithms. The fuzzy c-means type of algorithms use weights determined by a power m of inverse distances that remains fixed over all iterations and over all clusters, even though smaller clusters should have a larger m. Our method uses a different “distance” for each cluster that changes over the early iterations to fit the clusters. Comparisons show improved results. We also address other perplexing problems in clustering: (i) find the optimal number K of clusters; (ii) assess the validity of a given clustering; (iii) prevent the selection of seed vectors as initial prototypes from affecting the clustering; (iv) prevent the order of merging from affecting the clustering; and (v) permit the clusters to form more natural shapes rather than forcing them into normed balls of the distance function. We employ a relatively large number K of uniformly randomly distributed seeds and then thin them to leave fewer uniformly distributed seeds. Next, the main loop iterates by assigning the feature vectors and computing new fuzzy prototypes. Our fuzzy merging then merges any clusters that are too close to each other. We use a modified Xie-Bene validity measure as the goodness of clustering measure for multiple values of K in a user-interaction approach where the user selects two parameters (for eliminating clusters and merging clusters after viewing the results thus far). The algorithm is compared with the fuzzy c-means on the iris data and on the Wisconsin breast cancer data.  相似文献   

10.
通过对Web日志的聚类分析,可以发现用户的群体特征,甚至可以预测用户将来的访问模式,进而为不同的用户群提供个性化服务。针对现有方法的一般缺陷,包括特征选择单一无法充分体现用户兴趣偏好和传统Hierarchical算法在用户聚类时存在的收敛效率低、易受用户访问多样性影响的问题,提出了基于多重特征的双层用户聚类方法。该方法采用多重特征对用户相似性进行度量,并在此基础上进行双层聚类。首先采用基于密度的DBSCAN算法来排除用户会话中的离群对象和发现不规则簇,然后再采用自底向上的Hierarchical方法对第一层的聚类结果进行聚类。实验结果表明,本文方法具有良好的稳定性和聚类效果。  相似文献   

11.
提出一个基于Web日志的web用户群体和站点URL聚类算法.使用用户浏览行为描述和用户浏览时间离散化方法建立了Web站点的用户事务矩阵,并在此基础上对Web用户群体和站点URL进行聚类.由于在聚类过程中同时考虑了用户对URL的浏览时间和访问次数,使算法的精度和效率都大大提高.同时,该算法能较好地处理类间重叠问题,使算法具有较好的实用性.最后对算法的有效性和可伸缩性进行了研究.  相似文献   

12.
针对传统模糊C均值聚类算法和基于K-means++优化聚类中心的模糊C均值算法存在初始聚类中心敏感、聚类速度收敛慢、聚类算法需要人为给定聚类数目等缺陷,受密度峰值聚类算法(Clustering by Fast Search and Find of Density Peaks,CFSFDP)的启发,提出了基于密度峰值算法优化的模糊C均值聚类算法,自适应产生初始聚类中心,确定聚类数目,并优化算法收敛过程。实验结果表明,改进后的算法与传统模糊聚类C均值算法相比能够准确地得到簇的数目,性能有明显的提高,并加快算法的收敛速度,达到相对更好的聚类效果。  相似文献   

13.
Web prefetching is an attractive solution to reduce the network resources consumed by Web services as well as the access latencies perceived by Web users. Unlike Web caching, which exploits the temporal locality, Web prefetching utilizes the spatial locality of Web objects. Specifically, Web prefetching fetches objects that are likely to be accessed in the near future and stores them in advance. In this context, a sophisticated combination of these two techniques may cause significant improvements on the performance of the Web infrastructure. Considering that there have been several caching policies proposed in the past, the challenge is to extend them by using data mining techniques. In this paper, we present a clustering-based prefetching scheme where a graph-based clustering algorithm identifies clusters of “correlated” Web pages based on the users’ access patterns. This scheme can be integrated easily into a Web proxy server, improving its performance. Through a simulation environment, using a real data set, we show that the proposed integrated framework is robust and effective in improving the performance of the Web caching environment.  相似文献   

14.
This article presents a multi-objective genetic algorithm which considers the problem of data clustering. A given dataset is automatically assigned into a number of groups in appropriate fuzzy partitions through the fuzzy c-means method. This work has tried to exploit the advantage of fuzzy properties which provide capability to handle overlapping clusters. However, most fuzzy methods are based on compactness and/or separation measures which use only centroid information. The calculation from centroid information only may not be sufficient to differentiate the geometric structures of clusters. The overlap-separation measure using an aggregation operation of fuzzy membership degrees is better equipped to handle this drawback. For another key consideration, we need a mechanism to identify appropriate fuzzy clusters without prior knowledge on the number of clusters. From this requirement, an optimization with single criterion may not be feasible for different cluster shapes. A multi-objective genetic algorithm is therefore appropriate to search for fuzzy partitions in this situation. Apart from the overlap-separation measure, the well-known fuzzy Jm index is also optimized through genetic operations. The algorithm simultaneously optimizes the two criteria to search for optimal clustering solutions. A string of real-coded values is encoded to represent cluster centers. A number of strings with different lengths varied over a range correspond to variable numbers of clusters. These real-coded values are optimized and the Pareto solutions corresponding to a tradeoff between the two objectives are finally produced. As shown in the experiments, the approach provides promising solutions in well-separated, hyperspherical and overlapping clusters from synthetic and real-life data sets. This is demonstrated by the comparison with existing single-objective and multi-objective clustering techniques.  相似文献   

15.
结合密度聚类和模糊聚类的特点,提出一种基于密度的模糊代表点聚类算法.首先利用密度对数据点成为候选聚类中心点的可能性进行处理,密度越高的点成为聚类中心点的可能性越大;然后利用模糊方法对聚类中心点进行确定;最后通过合并聚类中心点确定最终的聚类中心.所提出算法具有很好的自适应性,能够处理不同形状的聚类问题,无需提前规定聚类个数,能够自动确定真实存在的聚类中心点,可解释性好.通过结合不同聚类方法的优点,最终实现对数据的有效划分.此外,所提出的算法对于聚类数和初始化、处理不同形状的聚类问题以及应对异常值等方面具有较好的鲁棒性.通过在人工数据集和UCI真实数据集上进行实验,表明所提出算法具有较好的聚类性能和广泛的适用性.  相似文献   

16.
基于模糊聚类的文本挖掘算法   总被引:8,自引:3,他引:5       下载免费PDF全文
针对传统FCM算法对孤立点比较敏感,须预先指定聚类数目的缺陷,提出一种新的模糊聚类算法NSFCM,将其应用干文本挖掘中。NSFCM对数据对象的隶属度增加一个权值,以减少孤立点对聚类中心的影响。采用平均信息熵确定聚类数,通过密度函数获得初始聚类中心。仿真结果证明,该算法聚类的精度和执行效率均高于FCM算法,效果较好。  相似文献   

17.
提出一种新的动态模糊聚类的方法,针对传统的模糊聚类需要预先确定聚类数的问题,提出采用动态自组织映射神经网络来确定聚类数,并通过文本向量空间模型和TF-IDF方法来确定文本的特征向量,再将动态自组织映射神经网络得到的聚类数,用模糊C均值算法(FCM)函数处理,得到聚类的结果。该算法同仅用动态自组织映射神经网络算法的运行结果相比,具有运行聚类结果精度高的优点,模糊聚类更适合处理语义的多样性和文本归属的模糊性,实验验证了算法的有效性。  相似文献   

18.
Algorithms for clustering Web search results have to be efficient and robust. Furthermore they must be able to cluster a data set without using any kind of a priori information, such as the required number of clusters. Clustering algorithms inspired by the behavior of real ants generally meet these requirements. In this article we propose a novel approach to ant‐based clustering, based on fuzzy logic. We show that it improves existing approaches and illustrates how our algorithm can be applied to the problem of Web search results clustering. © 2007 Wiley Periodicals, Inc. Int J Int Syst 22: 455–474, 2007.  相似文献   

19.
模糊C均值聚类算法在算法初始化时需要人为设定聚类类别数、随机初始化聚类中心,致使该算法容易陷入局部最优值.为解决此类问题,在蚁群算法中引入信息素更新机制,使其输出的聚类中心更具全局优化的特征和较强鲁棒性的特点;用蚁群算法得到的聚类中心来初始化FCM算法的聚类中心,解决了FCM算法对初始聚类中心敏感的问题;使用结合熵信息与数据几何结构的聚类有效性评价方法对FCM算法和优化FCM算法进行评价,评价结果表明优化的FCM算法性能更优.在仿真实验中,利用提出的优化算法和FCM算法对自然图像、纹理图像和SAR图像进行分割实验,从图像分割的准确性和算法的实时性做对比实验,验证了优化算法的有效性.  相似文献   

20.
Clustering Web data is one important technique for extracting knowledge from the Web. In this paper, a novel method is presented to facilitate the clustering. The method determines the appropriate number of clusters and provides suitable representatives for each cluster by inference from a Bayesian network. Furthermore, by means of the Bayesian network, the contents of the Web pages are converted into vectors of lower dimensions. The method is also extended for hierarchical clustering, and a useful heuristic is developed to select a good hierarchy. The experimental results show that the clusters produced benefit from high quality.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号