Similar Documents
20 similar documents found.
1.
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
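A minimal sketch, in Python, of the two ideas this abstract highlights: the simple matching dissimilarity and the frequency-based mode update. The function names and NumPy-based loops are illustrative assumptions, not code from the paper.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two categorical objects differ."""
    return int(np.sum(a != b))

def update_mode(cluster):
    """Frequency-based update: per attribute, keep the most frequent category."""
    mode = []
    for j in range(cluster.shape[1]):
        values, counts = np.unique(cluster[:, j], return_counts=True)
        mode.append(values[np.argmax(counts)])
    return np.array(mode, dtype=object)

def k_modes(X, k, n_iter=20, seed=0):
    """Toy k-modes loop over a 2-D object array of categorical values."""
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each object to the mode with the smallest matching dissimilarity.
        labels = np.array([np.argmin([matching_dissimilarity(x, m) for m in modes])
                           for x in X])
        # Recompute each cluster's mode from attribute frequencies.
        for c in range(k):
            if np.any(labels == c):
                modes[c] = update_mode(X[labels == c])
    return labels, modes
```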

2.
With the rapid development of big data and artificial intelligence, the structured processing of multimedia data and content-based retrieval have attracted great attention. Faced with the massive high-dimensional feature vectors produced by structuring multimedia data, fast and accurate retrieval is a problem that artificial intelligence must solve when processing large-scale data. The recently proposed Hierarchical Navigable Small World (HNSW) graph retrieval algorithm has achieved the best performance on several public data sets, but it suffers from high memory overhead. Retrieval algorithms based on quantization coding, in contrast, compress the data set vectors and greatly reduce memory usage. Combining quantization coding with the hierarchical navigable small world graph, this paper proposes two quantization-based improvements to HNSW: HNSWSQ, which encodes vectors with scalar quantization, and HNSWPQ, which encodes vectors with product quantization. The two algorithms use different quantization strategies to store the encoded original vectors, reducing memory overhead, and then build an HNSW index to shorten retrieval time. HNSWSQ achieves recall and average retrieval time close to those of HNSW on several data sets while using far less memory. Experimental results show that the memory overhead of HNSWSQ on the SIFT-1M and GIST-1M data sets is 45.1% and 70.4% lower, respectively, than that of HNSW.
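A minimal sketch of the scalar-quantization step that gives HNSWSQ its memory savings: each float32 component is mapped to an 8-bit code, roughly a 4x reduction. The per-dimension min/max training scheme is a common convention and an assumption here, not necessarily the paper's exact encoder.

```python
import numpy as np

def train_sq(X):
    """Learn per-dimension minima and maxima over the training vectors."""
    return X.min(axis=0), X.max(axis=0)

def sq_encode(X, lo, hi):
    """Map each float component to an 8-bit code in [0, 255]."""
    scale = np.where(hi > lo, hi - lo, 1.0)
    return np.clip(np.round(255 * (X - lo) / scale), 0, 255).astype(np.uint8)

def sq_decode(codes, lo, hi):
    """Approximate reconstruction used when computing distances during search."""
    scale = np.where(hi > lo, hi - lo, 1.0)
    return lo + codes.astype(np.float32) * scale / 255.0

X = np.random.rand(1000, 128).astype(np.float32)  # stand-in for SIFT-like features
lo, hi = train_sq(X)
codes = sq_encode(X, lo, hi)   # 128 bytes per vector instead of 512
approx = sq_decode(codes, lo, hi)
```

Product quantization (the HNSWPQ variant) goes further: it splits each vector into subvectors and replaces each subvector with the index of its nearest centroid in a learned codebook.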

3.
In this state-of-the-art report we discuss relevant research works related to the visualization of complex, multi-variate data. We discuss how different techniques take effect at specific stages of the visualization pipeline and how they apply to multi-variate data sets being composed of scalars, vectors and tensors. We also provide a categorization of these techniques with the aim for a better overview of related approaches. Based on this classification we highlight combinable and hybrid approaches and focus on techniques that potentially lead towards new directions in visualization research. In the second part of this paper we take a look at recent techniques that are useful for the visualization of complex data sets either because they are general purpose or because they can be adapted to specific problems.  相似文献   

4.
Kernel Grower is an effective kernel clustering method with high clustering accuracy. A key problem in applying Kernel Grower, however, is that it runs slowly on large-scale data, which greatly restricts its use. This paper proposes a fast kernel clustering method for large-scale data that significantly speeds up Kernel Grower by using a fast approximate minimum enclosing ball algorithm; the computational complexity of the method is only linear in the number of samples. Simulation experiments on both artificial data sets and standard benchmarks demonstrate the effectiveness of the algorithm. An application of the method to real color image segmentation is also given.
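A minimal sketch of the geometric core of such a speedup: a Badoiu-Clarkson style approximation of the minimum enclosing ball, whose per-iteration cost is linear in the number of samples. Kernel Grower itself computes enclosing balls in a kernel-induced feature space, so this input-space version is only illustrative.

```python
import numpy as np

def approx_meb(X, n_iter=100):
    """Approximate the minimum enclosing ball of the rows of X.

    Each step moves the current center toward the farthest point with a
    shrinking step size 1/(t+1); cost per step is O(n) in the sample count.
    """
    c = X[0].astype(float)
    for t in range(1, n_iter + 1):
        far = X[np.argmax(np.linalg.norm(X - c, axis=1))]
        c = c + (far - c) / (t + 1)
    r = np.linalg.norm(X - c, axis=1).max()
    return c, r
```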

5.
Data reduction and classification based on space partitioning
The purpose of data reduction is to simplify a data set while preserving its useful classification structure. This paper proposes a data reduction and classification algorithm based on space partitioning. The algorithm maps the records of a conventional database into a multidimensional space, which turns data reduction into the problem of merging regions of same-class data in that space, or equivalently of separating regions of different-class data, and finally yields a series of partitioned regions that achieve both data reduction and classification. The method was evaluated on seven real-world data sets and compared with the results obtained by C4.5; the improvement is significant, and the resulting partition is unique.

6.
In current image recognition, most classification and recognition methods are built on the availability of large amounts of data: large data sets are fed into training, and classification is performed after sampling analysis and feature extraction. In the real world, however, most object classification problems do not come with large amounts of labeled data. To address image recognition on small-sample data sets, this paper first expands the data set with data augmentation, then maps images into a high-dimensional embedding space with a multi-layer convolutional neural network, and uses a prototypical network to obtain a prototype point for each class; a test image is classified according to its distances to the class prototype points in the embedding space. Experimental results show that the method achieves high recognition accuracy and strong robustness under few-shot conditions.
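A minimal sketch of the prototype-based classification step: class prototypes are mean embeddings of the support examples, and queries go to the nearest prototype. The embedding CNN is abstracted away, and all names are illustrative.

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Prototype of each class = mean of its support-set embeddings."""
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_emb, classes, protos):
    """Assign each query embedding to the nearest prototype (Euclidean)."""
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# embed() would be the multi-layer CNN mapping images into the embedding space:
# classes, protos = class_prototypes(embed(support_images), support_labels)
# predictions = classify(embed(query_images), classes, protos)
```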

7.
Existing one-phase Top-K high-utility itemset mining algorithms raise the minimum-utility threshold slowly and generate large numbers of candidate itemsets during iteration, causing excessive memory usage. To address these problems, this paper proposes RHUM, a Top-K high-utility mining algorithm based on a reusable list structure (R-list). The R-list is a new data structure that stores and quickly accesses itemset information, so no second database scan is needed during mining. The algorithm reuses memory to hold candidate-set information, preprocesses the data with an improved RSD threshold-raising strategy, and during the recursive search applies stricter pruning parameters while computing the utilities of multiple itemsets at once to shrink the search space. Experimental results on data sets of different types show that RHUM outperforms other one-phase algorithms in memory efficiency and remains stable as K varies.
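A minimal sketch of the generic threshold-raising mechanism that one-phase top-K miners such as RHUM depend on: keep the K best utilities seen so far in a min-heap, and use the smallest of them as the current pruning threshold. The R-list structure itself is not reproduced here.

```python
import heapq

class TopKThreshold:
    """Track the K highest utilities seen; heap[0] is the live threshold."""

    def __init__(self, k):
        self.k, self.heap = k, []

    def offer(self, utility):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, utility)
        elif utility > self.heap[0]:
            heapq.heapreplace(self.heap, utility)  # raises the threshold

    @property
    def min_utility(self):
        # Branches whose utility upper bound falls below this value can be
        # pruned from the recursive search.
        return self.heap[0] if len(self.heap) == self.k else 0
```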

8.
Research on a fast parallel clustering-partition algorithm for large-scale data
牛新征, 佘堑. 《计算机科学》, 2012, 39(1): 134-137, 151
As the amount of data processed in cluster analysis grows rapidly, the traditional K-Means clustering algorithm faces enormous challenges on large-scale data. To improve its efficiency, this paper starts from cluster-center initialization and communication patterns, among other aspects, and proposes improvements and concrete implementations for existing MPI-based parallel K-Means algorithms and Hadoop-based distributed K-Means cloud clustering algorithms. Experimental results show that the proposed algorithm greatly reduces communication and computation and executes efficiently. The results provide a basis for designing better fast parallel clustering-partition algorithms for large-scale data.

9.
Research on outlier mining techniques based on a random projection algorithm
Outlier mining in d-dimensional point sets is one of the current research focuses in data mining. Existing distance-based and nearest-neighbor-based approaches perform poorly on high-dimensional data, so this paper applies the angle-based outlier factor to high-dimensional outlier mining and proposes a new outlier mining scheme based on a random projection algorithm, which predicts the angle-based outlier factors of all data points in near-linear time. The method can also be run in a parallel environment for further speedup. A theoretical analysis of the approximation quality guarantees the reliability of the algorithm. Experimental results on synthetic and real data sets show that the method is efficient and highly scalable on very high-dimensional data sets.
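A minimal sketch of the exact angle-based outlier factor that such schemes approximate: the variance, over all pairs of other points, of distance-weighted angle terms at the query point; outliers see most other points within a narrow angular range and so get a small variance. The exact computation below is cubic in n; the random-projection approximation that brings this to near-linear time is not reproduced here.

```python
import numpy as np
from itertools import combinations

def abof(X, i):
    """Angle-based outlier factor of point i: low value suggests an outlier."""
    p = X[i]
    others = np.delete(X, i, axis=0)
    vals = []
    for a, b in combinations(range(len(others)), 2):
        u, v = others[a] - p, others[b] - p
        nu, nv = float(np.dot(u, u)), float(np.dot(v, v))
        if nu == 0.0 or nv == 0.0:
            continue  # skip duplicate points
        # Cosine-like term weighted by inverse squared distances.
        vals.append(np.dot(u, v) / (nu * nv))
    return float(np.var(vals))
```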

10.
Simulation research on performance evaluation based on CPN Tools
As a high-level Petri net, the colored Petri net introduces concepts such as time, color sets, and hierarchical structure, which make it well suited to simulating and analyzing the performance of large, complex systems. CPN Tools supports a variety of random probability distributions, can extract all kinds of data during simulation to generate different performance-analysis results, and supports continuous simulation and analysis, so real systems can be simulated more precisely. By defining data collectors for the CPN model during simulation in CPN Tools, more accurate performance-analysis reports can be obtained. This paper illustrates the characteristics of colored Petri nets and the simulation and performance-analysis methods of CPN Tools through a simple example: the simulation and performance analysis of a fast-food restaurant.

11.
Visualization techniques are becoming increasingly important for analyzing and exploring large multidimensional data sets. One of the most important is pixel-oriented visualization, whose basic principle is to map each data value in the data set to a pixel on the screen and to arrange these pixels fully according to certain rules, so that as many data objects as possible are presented on screen as familiar graphics and images. The recursive pattern technique is one pixel-oriented visualization technique; it is based on a simple back-and-forth arrangement, allows users to define the structure and set parameters, and is mainly suitable for data sets with a natural order. In stock data analysis, the recursive pattern technique makes it relatively easy to depict the price changes of stocks in a transaction database and to predict stock trends.
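A minimal sketch of the back-and-forth arrangement at the base of the recursive pattern technique: an ordered series (for example, daily stock prices) fills a pixel grid left to right, then right to left on the next row, so neighboring values stay spatially adjacent. Grid width and color mapping are user parameters; everything here is illustrative.

```python
import numpy as np

def serpentine_grid(values, width):
    """Lay an ordered series on a pixel grid, reversing every other row."""
    height = int(np.ceil(len(values) / width))
    grid = np.full((height, width), np.nan)
    for idx, v in enumerate(values):
        row, col = divmod(idx, width)
        if row % 2 == 1:              # odd rows run right to left
            col = width - 1 - col
        grid[row, col] = v
    return grid  # render e.g. with matplotlib imshow, color mapped to value
```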

12.
In today's knowledge-, service-, and cloud-based economy, an overwhelming amount of business-related data are being generated at a fast rate daily from a wide range of sources. These data increasingly show all the typical properties of big data: wide physical distribution, diversity of formats, nonstandard data models, and independently managed and heterogeneous semantics. In this context, there is a need for new scalable and process-aware services for querying, exploration, and analysis of process data in the enterprise because (1) process data analysis services should be capable of processing and querying large amounts of data effectively and efficiently and, therefore, have to be able to scale well with the infrastructure's scale and (2) the querying services need to enable users to express their data analysis and querying needs using process-aware abstractions rather than other lower-level abstractions. In this paper, we introduce ProcessAtlas, i.e., an extensible large-scale process data querying and analysis platform for analyzing process data in the enterprise. The ProcessAtlas platform offers an extensible architecture by adopting a service-based model so that new analytical services can be plugged into the platform. In ProcessAtlas, we present a domain-specific model for representing process knowledge, i.e., process-level entities, abstractions, and the relationships among them modeled as graphs. We provide services for discovering, extracting, and analyzing process data. We provide efficient mapping and execution of process-level queries into graph-level queries by using scalable process query services to deal with the process data size growth and with the infrastructure's scale. We have implemented ProcessAtlas as a MapReduce-based prototype and report on experiments performed on both synthetic and real-world datasets.

13.
Advances in metabolomics data processing
Metabolomics, following genomics and proteomics, is a scientific research focus that newly emerged in the 1990s. Metabolomics data are often very complex, so data processing has become one of the key technologies, and one of the bottlenecks, in metabolomics research. Following the thread from metabolomics data preprocessing, through data processing in metabolite target analysis and metabolic profiling analysis, to data processing in metabolic fingerprinting analysis, this paper reviews the latest advances in metabolomics data processing in China and abroad and introduces software for metabolomics data processing.

14.
Scientific data citation is a field that has gradually emerged and matured in recent years, and a series of related efforts has been carried out around the world. Geoscience, a discipline of extremely broad scope, has accumulated large amounts of scientific data across numerous research projects. This paper outlines the current state of, and problems in, China's scientific data sharing system and, taking the database of publications citing data from the Cold and Arid Regions Science Data Center (寒区旱区科学数据中心) as an example, analyzes how the center's data sets were cited from 2006 to 2013 (note: the 2013 data are incomplete). The results show that it takes about two years from a scientific data set's public release for it to be widely cited, after which citations grow substantially. Remote sensing data account for the largest share of the data sets, followed by land and meteorological data; in terms of citations, however, remote sensing and meteorological data are cited most, indicating strong user demand for these types of data.

15.
Many recent sensor devices are being equipped with flash memories due to their unique advantages: non-volatile storage, small size, shock-resistance, fast read access and power efficiency. The ability to store large amounts of data in sensor devices necessitates efficient indexing structures to locate required information. The challenge with flash memories is that they are unsuitable for maintaining dynamic data structures because of their specific read, write and wear constraints; this, combined with very limited data memory on sensor devices, prohibits the direct application of most existing indexing methods. In this paper we propose a suite of index structures and algorithms which permit us to efficiently support several types of historical online queries on flash-equipped sensor devices: temporally constrained aggregate queries, historical online sampling queries and pattern matching queries. We have implemented our methods using nesC and have run extensive experiments in TOSSIM, the simulation environment of TinyOS. Our experimental evaluation using trace-driven real world data sets demonstrates the efficiency of our indexing algorithms.

16.
Recently, emerging technologies related to various sensors, product identification, and wireless communication offer new opportunities for improving the efficiency of automotive maintenance operations, in particular for implementing predictive maintenance. The key point of predictive maintenance is to develop an algorithm that can analyze the degradation status of a vehicle and make predictive maintenance decisions. In this study, as a basis for implementing predictive maintenance of automotive engine oil, we propose an algorithm to determine the suitable change time of automotive engine oil by analyzing its degradation status with mission profile data. For this, we use several statistical methods such as factor analysis, discriminant and classification analysis, and regression analysis. We identify the main factors of the mission profile and of engine oil quality with factor analysis. Subsequently, with regression analysis, we specify relations between the main factors, considering two types of automotive mission profile: urban mode and highway mode. Based on these, we determine the proper change time of engine oil through discriminant and classification analysis. To evaluate the proposed approach, we carry out a case study and discuss the limitations of our approach.
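A minimal sketch of the statistical chain the abstract describes: factor analysis to extract main factors, regression to relate them to oil quality, and discriminant analysis for the change/no-change decision. The scikit-learn estimators and all variable names are illustrative assumptions; the synthetic data stands in for real mission-profile measurements.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X_profile = rng.random((200, 8))                      # mission-profile measurements
y_quality = X_profile @ rng.random(8) + 0.1 * rng.standard_normal(200)
y_change = (y_quality > np.median(y_quality)).astype(int)  # 1 = change the oil

factors = FactorAnalysis(n_components=3).fit_transform(X_profile)  # main factors
quality_model = LinearRegression().fit(factors, y_quality)   # factors -> quality
decision = LinearDiscriminantAnalysis().fit(factors, y_change)
print(decision.predict(factors[:5]))                  # change/no-change decisions
```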

17.
Mining the collection of frequent itemsets in large data sets is an important and highly challenging task in data mining, and many efficient algorithms have been developed for it. This paper proposes the idea of mining only a subset of the collection of frequent itemsets, which we call the collection of frequent disjunction-free sets, instead of the complete collection. We prove that from it the full collection of frequent itemsets and their supports can be recovered without reading the database. The paper also presents HOPE-II, an algorithm for mining the collection of disjunction-free sets, and experimental results demonstrate its efficiency. We compare it with another condensed representation, the frequent closed sets; almost all experimental results show that mining frequent itemsets via disjunction-free sets is more efficient than via closed sets.

18.
Two experimental comparisons of data flow and mutation testing are presented. These techniques are widely considered to be effective for unit-level software testing, but can only be analytically compared to a limited extent. We compare the techniques by evaluating the effectiveness of test data developed for each. We develop ten independent sets of test data for a number of programs: five to satisfy the mutation criterion and five to satisfy the all-uses data-flow criterion. These test sets are developed using automated tools, in a manner consistent with the way a test engineer might be expected to generate test data in practice. We use these test sets in two separate experiments. First we measure the effectiveness of the test data that was developed for one technique in terms of the other. Second, we investigate the ability of the test sets to find faults. We place a number of faults into each of our subject programs, and measure the number of faults that are detected by the test sets. Our results indicate that while both techniques are effective, mutation-adequate test sets are closer to satisfying the data flow criterion, and detect more faults.
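A minimal sketch of how mutation adequacy is measured in comparisons like this one: run a test set against a collection of program mutants and count the fraction killed. Mutant generation (systematic operator replacement by a tool) is stubbed out here as a hand-made list of variants.

```python
def mutation_score(tests, original, mutants):
    """Fraction of mutants killed; a mutant dies when some test input
    makes its output differ from the original program's output."""
    killed = sum(1 for m in mutants
                 if any(m(x) != original(x) for x in tests))
    return killed / len(mutants)

original = lambda x: x * 2 + 1
mutants = [lambda x: x * 2 - 1,   # '+' replaced by '-'
           lambda x: x + 2 + 1,   # '*' replaced by '+'
           lambda x: x * 2 + 1]   # an equivalent mutant: can never be killed
tests = [0, 1, -3, 10]
print(mutation_score(tests, original, mutants))   # 2/3 here
```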

19.
Weather radars produce large amounts of data and this has important implications for the archiving and analysis of data. The need for methods to deal with weather radar data sets will only increase as the United States National Weather Service (NWS) continues deployment of its 137 WSR-88D radars as part of the NEXRAD program. In this article we describe a compression and archiving strategy for weather radar data and present results for 62 days of reflectivity data from a radar operated by the NWS, as well as results for 60 days of reflectivity data for a radar operated by the Bureau of Meteorology in Australia. In their original format, these two sets require 60 GB of storage. In the format we describe, they require 4.8 GB and the data is portable across many platforms. The software for manipulating the converted data is simple, efficient, and easy to implement in C or Fortran. The savings in disk space and reduction in reading time compare favorably with what is attainable with deflation, the algorithm used in the popular gzip compression program. © 1997 John Wiley & Sons, Ltd.

20.
An active learning algorithm is devised for training Self-Organizing Feature Maps on large data sets. Active learning algorithms recognize that not all exemplars are created equal. Thus, the concepts of exemplar age and difficulty are used to filter the original data set such that training epochs are only conducted over a small subset of the original data set. The ensuing Hierarchical Dynamic Subset Selection algorithm introduces definitions for exemplar difficulty suited to an unsupervised learning context and, accordingly, appropriate self-organizing map (SOM) stopping criteria. The algorithm is benchmarked on several real world data sets with training set exemplar counts in the region of 30–500 thousand. Cluster accuracy is demonstrated to be at least as good as that from the original SOM algorithm while requiring a fraction of the computational overhead.
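A minimal sketch of the dynamic-subset-selection idea: bias each epoch's training sample toward exemplars that are currently difficult (large quantization error against the SOM codebook) and old (not selected recently). The scoring and sampling rule below is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def select_subset(X, codebook, age, size, rng):
    """Sample a training subset favoring difficult and long-unseen exemplars."""
    # Difficulty: distance from each exemplar to its best-matching unit.
    bmu_dist = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2).min(axis=1)
    score = bmu_dist * (1.0 + age)          # difficulty scaled by age
    p = score / score.sum()
    idx = rng.choice(len(X), size=size, replace=False, p=p)
    age += 1.0                              # everyone ages ...
    age[idx] = 0.0                          # ... except the exemplars just chosen
    return idx
```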
