首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 390 毫秒
1.
为解决谱聚类在大规模数据集上存在的计算耗时和无法聚类等性能瓶颈制约,提出了基于Spark技术的大规模数据集谱聚类的并行化算法。首先,通过单向循环迭代优化相似矩阵的构建,避免重复计算;然后,通过位置变换和标量乘法替换来优化Laplacian矩阵的构建与正规化,降低存储需求;最后,采用近似特征向量计算来进一步减少计算量。不同测试数据集上的实验结果表明:随着测试数据集的规模增加,所提算法的单向循环迭代和近似特征值计算的运行时间呈线性增长,增长缓慢,其近似特征向量计算与精确特征向量计算取得相近的聚类效果,并且算法在大规模数据集上表现出良好的可扩展性。在获得较好的谱聚类性能的基础上,改进算法提高了运行效率,有效缓解了谱聚类的计算耗时及无法聚类问题。  相似文献   

2.
3.
Support Vector Machine (SVM) regression is an important technique in data mining. The SVM training is expensive and its cost is dominated by: (i) the kernel value computation, and (ii) a search operation which finds extreme training data points for adjusting the regression function in every training iteration. Existing training algorithms for SVM regression are not scalable to large datasets because: (i) each training iteration repeatedly performs expensive kernel value computations, which is inefficient and requires holding the whole training dataset in memory; (ii) the search operation used in each training iteration considers the whole search space which is very expensive. In this article, we significantly improve the scalability and efficiency of SVM regression by exploiting the high performance of Graphics Processing Units (GPUs) and solid state drives (SSDs). Our key ideas are as follows. (i) To reduce the cost of repeated kernel value computations and avoid holding the whole training dataset in the GPU memory, we precompute all the kernel values and store them in the CPU memory extended by the SSD; together with an efficient strategy to read the precomputed kernel values, reusing precomputed kernel values with an efficient retrieval is much faster than computing them on-the-fly. This also alleviates the restriction that the training dataset has to fit into the GPU memory, and hence makes our algorithm scalable to large datasets, especially for large datasets with very high dimensionality. (ii) To enhance the performance of the frequently used search operation, we design an algorithm that minimizes the search space and the number of accesses to the GPU global memory; this optimized search algorithm also avoids branch divergence (one of the causes for poor performance) among GPU threads to achieve high utilization of the GPU resources. Our proposed techniques together form a scalable solution to the SVM regression which we call SIGMA. Our extensive experimental results show that SIGMA is highly efficient and can handle very large datasets which the state-of-the-art GPU-based algorithm cannot handle. On the datasets of size that the state-of-the-art GPU-based algorithm can handle, SIGMA consistently outperforms the state-of-the-art GPU-based algorithm by an order of magnitude and achieves up to 86 times speedup.  相似文献   

4.
现有的压缩感知MIMO-OFDM信道估计方法多采用正交匹配追踪算法及其改进的算法。针对该类算法重构大规模的数据存在计算复杂度高、存储量大等问题,提出了基于梯度追踪算法的MIMO-OFDM 稀疏信道估计方法。梯度追踪算法采用最速下降法对目标函数解最优解,即每步迭代时计算目标函数的搜索方向和搜索步长,并以此选择原子得到每次迭代重构值的最优解。本文使用梯度追踪算法对信道进行估计,并与传统的最小二乘估计算法、正交匹配追踪算法的性能和计算复杂度进行比较。仿真结果表明,梯度追踪算法能够保证较好的估计效果,减少了导频开销,降低了运算复杂度,提高了重构效率。  相似文献   

5.
支持向量机方法具有良好的分类准确率、稳定性与泛化性,在网络流量分类领域已有初步应用,但在面对大规模网络流量分类问题时却存在计算复杂度高、分类器训练速度慢的缺陷。为此,提出一种基于比特压缩的快速SVM方法,利用比特压缩算法对初始训练样本集进行聚合与压缩,建立具有权重信息的新样本集,在损失尽量少原始样本信息的前提下缩减样本集规模,进一步利用基于权重的SVM算法训练流量分类器。通过大规模样本集流量分类实验对比,快速SVM方法能在损失较少分类准确率的情况下,较大程度地缩减流量分类器的训练时间以及未知样本的预测时间,同时,在无过度压缩前提下,其分类准确率优于同等压缩比例下的随机取样SVM方法。本方法在保留SVM方法较好分类稳定性与泛化性能的同时,有效提升了其应对大规模流量分类问题的能力。  相似文献   

6.
The present work is intended to address two of the major difficulties that can be found when tackling the estimation of the local orientation of the data in a scene, a task which is usually accomplished by means of the computation of the structure tensor-based directional field. On one hand, the orientation information only exists in the non-homogeneous regions of the dataset, while it is zero in the areas where the gradient (i.e. the first-order intensity variation) remains constant. Due to this lack of information, there are many cases in which the overall shape of the represented objects cannot be precisely inferred from the directional field. On the other hand, the orientation estimation is highly dependent on the particular choice of the averaging window used for its computation (since a collection of neighboring gradient vectors is needed to obtain a dominant orientation), typically resulting in vector fields which vary from very irregular (thus yielding a noisy estimation) to very uniform (but at the expense of a loss of angular resolution). The proposed solution to both drawbacks is the regularization of the directional field; this process extends smoothly the previously computed vectors to the whole dataset while preserving the angular information of relevant structures. With this purpose, the paper introduces a suitable mathematical framework and deals with the d-dimensional variational formulation which is derived from it. The proposed formulation is finally translated into the frequency domain in order to obtain an increase of insight on the regularization problem, which can be understood as a low-pass filtering of the directional field. The frequency domain point of view also allows for an efficient implementation of the resulting iterative algorithm. Simulation experiments involving datasets of different dimensionality prove the validity of the theoretical approach.  相似文献   

7.
In many data stream mining applications, traditional density estimation methods such as kernel density estimation, reduced set density estimation can not be applied to the density estimation of data streams because of their high computational burden, processing time and intensive memory allocation requirement. In order to reduce the time and space complexity, a novel density estimation method Dm-KDE over data streams based on the proposed algorithm m-KDE which can be used to design a KDE estimator with the fixed number of kernel components for a dataset is proposed. In this method, Dm-KDE sequence entries are created by algorithm m-KDE instead of all kernels obtained from other density estimation methods. In order to further reduce the storage space, Dm-KDE sequence entries can be merged by calculating their KL divergences. Finally, the probability density functions over arbitrary time or entire time can be estimated through the obtained estimation model. In contrast to the state-of-the-art algorithm SOMKE, the distinctive advantage of the proposed algorithm Dm-KDE exists in that it can achieve the same accuracy with much less fixed number of kernel components such that it is suitable for the scenarios where higher on-line computation about the kernel density estimation over data streams is required.We compare Dm-KDE with SOMKE and M-kernel in terms of density estimation accuracy and running time for various stationary datasets. We also apply Dm-KDE to evolving data streams. Experimental results illustrate the effectiveness of the proposed method.  相似文献   

8.
提出了一种新的适用于不平衡数据集的Adaboost算法(ILAdaboost),该算法利用每一轮学习到的基分类器对原始数据集进行测试评估,并根据评估结果将原始数据集分成四个子集,然后在四个子集中重新采样形成平衡的数据集供下一轮基分类器学习,由于抽样过程中更加倾向于少数类和分错的多数类,故合成分类器的分界面会偏离少数类。该算法在UCI的10个典型不平衡数据集上进行实验,在保证多数类分类精度的同时提高了少数类的分类精度以及GMA。  相似文献   

9.
针对DBN算法训练时间复杂度高,容易过拟合等问题,受模糊理论启发,提出了一种基于模糊划分和模糊加权的集成深度信念网络,即FE-DBN(ensemble deep belief network with fuzzy partition and fuzzy weighting),用于处理大样本数据的分类问题。通过模糊聚类算法FCM将训练数据划分为多个子集,在各个子集上并行训练不同结构的DBN,将每个分类器的结果进行模糊加权。在人工数据集、UCI数据集上的实验结果表明,提出的FE-DBN比DBN精度均有所提升,具有更快的运行时间。  相似文献   

10.
对支持向量机的大规模训练问题进行了深入研究,提出一种类似SMO的块增量算法.该算法利用increase和decrease两个过程依次对每个输入数据块进行学习,避免了传统支持向量机学习算法在大规模数据集情况下急剧增大的计算开销.理论分析表明新算法能够收敛到近似最优解.基于KDD数据集的实验结果表明,该算法能够获得接近线性的训练速率,且泛化性能和支持向量数目与LIBSVM方法的结果接近.  相似文献   

11.
We present a novel formulation for quasi-supervised learning that extends the learning paradigm to large datasets. Quasi-supervised learning computes the posterior probabilities of overlapping datasets at each sample and labels those that are highly specific to their respective datasets. The proposed formulation partitions the data into sample groups to compute the dataset posterior probabilities in a smaller computational complexity. In experiments on synthetic as well as real datasets, the proposed algorithm attained significant reduction in the computation time for similar recognition performances compared to the original algorithm, effectively generalizing the quasi-supervised learning paradigm to applications characterized by very large datasets.  相似文献   

12.
The challenges of the classification for the large-scale and high-dimensional datasets are: (1) It requires huge computational burden in the training phase and in the classification phase; (2) it needs large storage requirement to save many training data; and (3) it is difficult to determine decision rules in the high-dimensional data. Nonlinear support vector machine (SVM) is a popular classifier, and it performs well on a high-dimensional dataset. However, it easily leads overfitting problem especially when the data are not evenly distributed. Recently, profile support vector machine (PSVM) is proposed to solve this problem. Because local learning is superior to global learning, multiple linear SVM models are trained to get similar performance to a nonlinear SVM model. However, it is inefficient in the training phase. In this paper, we proposed a fast classification strategy for PSVM to speed up the training time and the classification time. We first choose border samples near the decision boundary from training samples. Then, the reduced training samples are clustered to several local subsets through MagKmeans algorithm. In the paper, we proposed a fast search method to find the optimal solution for MagKmeans algorithm. Each cluster is used to learn multiple linear SVM models. Both artificial datasets and real datasets are used to evaluate the performance of the proposed method. In the experimental result, the proposed method prevents overfitting and underfitting problems. Moreover, the proposed strategy is effective and efficient.  相似文献   

13.
通过协同求解多个概念漂移问题并充分挖掘相关概念漂移问题中蕴含的有效信息,共享矢量链支持向量机(shared vector chain supported vector machines,SVC-SVM)在面向多任务概念漂移分类时表现出良好性能。然而实际应用中的概念漂移问题通常有较大的数据容量,较高的计算代价限制了SVC-SVM方法的推广能力。针对这个弱点,借鉴核心向量机的近线性时间复杂度的优势,提出了适于多任务概念漂移大规模数据的共享矢量链核心向量机(shared vector chain core vector machines,SVC-CVM)。SVC-CVM具有渐近线性时间复杂度的算法特点,同时又继承了SVC-SVM方法协同求解多个概念漂移问题带来的良好性能,实验验证了该方法在多任务概念漂移大规模数据集上的有效性和快速性。  相似文献   

14.
针对经典支持向量机在增量学习中的不足,提出一种基于云模型的最接近支持向量机增量学习算法。该方法利用最接近支持向量机的快速学习能力生成初始分类超平面,并与k近邻法对全部训练集进行约简,在得到的较小规模的精简集上构建云模型分类器直接进行分类判断。该算法模型简单,不需迭代求解,时间复杂度较小,有较好的抗噪性,能较好地体现新增样本的分布规律。仿真实验表明,本算法能够保持较好的分类精度和推广能力,运算速度较快。  相似文献   

15.
Training time of traditional multilayer perceptrons (MLPs) using back-propagation algorithm rises seriously with the problem scale. For multi-class problems, the convergence ratio is very low for training MLPs. The huge time-consuming and low convergence ratio greatly restricts the applications of MLPs on problems with tens and thousands of samples. To deal with these disadvantages, this paper proposes a fast BP network with dynamic sample selection (BPNDSS) method which can dynamically select the samples containing more contribution to the variation of the decision boundary for training after each iteration epoch. The proposed BPNDSS can significantly increase the training speed by only selecting a small subset of the whole samples. Moreover, two kinds of modular single-hidden-layer approaches are adopted to decompose a multi-class problem into multiple binary-class sub-problems, which result in the high rate of convergence. The experiments on Letter and MNIST handwritten recognition database show the effectiveness and the efficiency of BPNDSS. Moreover, BPNDSS results in comparable classification performance to the convolutional neural networks (CNNs), support vector machine, Adaboost, C4.5, and nearest neighbour algorithms. To further demonstrate the training speed improvement of the dynamic sample selection approach on large-scale datasets, we modify CNN to propose a dynamic sample selection CNN (DynCNN). Experiments on Image-Net dataset illustrate that DynCNN can result in similar performance to CNN, but consume less training time.  相似文献   

16.
在大数据环境背景下,传统机器学习算法多采用单机离线训练的方式,显然已经无法适应持续增长的大规模流式数据的变化。针对该问题,提出一种基于Flink平台的分布式在线集成学习算法。该方法基于Flink分布式计算框架,首先通过数据并行的方式对在线学习算法进行分布式在线训练;然后将训练出的多个子模型通过随机梯度下降算法进行模型的动态权重分配,实现对多个子模型的结果聚合;与此同时,对于训练效果不好的模型利用其样本进行在线更新;最后通过单机与集群环境在不同数据集上做实验对比分析。实验结果表明,在线学习算法结合Flink框架的分布式集成训练,能达到集中训练方式下的性能,同时大大提高了训练的时间效率。  相似文献   

17.
离群点检测算法在网络入侵检测、医疗辅助诊断等领域具有十分广泛的应用。针对LDOF、CBOF及LOF算法在大规模数据集和高维数据集的检测过程中存在的执行时间长及检测率较低的问题,提出了基于图上随机游走(BGRW)的离群点检测算法。首先初始化迭代次数、阻尼因子以及数据集中每个对象的离群值;其次根据对象之间的欧氏距离推导出漫步者在各对象之间的转移概率;然后通过迭代计算得到数据集中每个对象的离群值;最后将数据集中离群值最高的对象判定为离群点并输出。在UCI真实数据集与复杂分布的合成数据集上进行实验,将BGRW算法与LDOF、CBOF和LOF算法在执行时间、检测率和误报率指标上进行对比。实验结果表明,BGRW算法能够有效降低执行时间并在检测率及误报率指标上优于对比算法。  相似文献   

18.
大规模高维不平衡数据是异常检测中的重大挑战.单类支持向量机在处理不平衡数据方面非常有效,但不适合大规模高维数据,同时单类支持向量机的核函数对检测性能也具有重要的影响.文中提出了一个深度自编码器与单类支持向量机相结合的异常检测模型,深度自编码器不仅负责提取特征和降维,同时拟合出了一个自适应核函数.深度自编码器与单类支持向...  相似文献   

19.
支持向量数据描述(SVDD)是一种无监督学习算法,在图像识别和信息安全等领域有重要应用。坐标下降方法是求解大规模分类问题的有效方法,具有简洁的操作流程和快速的收敛速率。文中针对大规模SVDD提出一种高效的对偶坐标下降算法,算法每步迭代的子问题都可获得解析解,并可使用加速策略和简便运算减少计算量。同时给出3种子问题的选择方法,并分析对比各自优劣。实验对仿真和真实大规模数据库进行算法验证。与LibSVDD相比,文中方法更具优势,1。4s求解105样本规模的ijcnn文本库。  相似文献   

20.
带平衡约束的矩形布局问题源于卫星舱设备布局设计,属于组合优化问题。深度强化学习利用奖赏机制,通过数据训练实现高性能决策优化。针对布局优化问题,提出一种基于深度强化学习的新算法DAR及其扩展算法IDAR。DAR用指针网络输出定位顺序,再利用定位机制给出布局结果,算法的时间复杂度是O(n3);IDAR算法在DAR的基础上引入迭代机制,算法时间复杂度是O(n4),但能给出更好的结果。测试表明DAR算法具有较好的学习能力,用小型布局问题进行求解训练所获得的模型,能有效应用在大型问题上。在两个大规模典型算例的对照实验中,提出算法分别超出和接近目前最优解,具有时间和质量上的优势。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号