Similar Documents
1.
Many applications need to find the top-k most similar pairs of data in a given database. However, as the data volumes that these application domains must handle keep growing, computing such top-k similar pairs is very difficult. This paper proposes a MapReduce-based search scheme for the top-k most similar pairs. The scheme first splits all data pairs into multiple groups, for which an all-pairs grouping algorithm and a core-pairs grouping algorithm are proposed; the top-k similar pairs of each group are computed independently, and the k pairs with the highest similarity are then selected from the per-group results, thereby determining the global top-k similar pairs. Finally, experiments on synthetic and real data sets evaluate performance, and the results demonstrate the effectiveness and scalability of the proposed MapReduce algorithms.
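A minimal single-machine sketch of the group-then-merge idea follows (a hedged illustration in Python: the MapReduce plumbing and the paper's specific all-pairs and core-pairs grouping algorithms are not reproduced, and the round-robin grouping and cosine similarity are assumptions made for the example). Each pair of points falls into exactly one within-group or cross-group bucket, each bucket yields a local top-k, and the global answer is the top-k over the bucket winners, so the final merge is exact.

    import heapq
    from itertools import combinations

    def cosine(a, b):
        # Cosine similarity between two equal-length numeric vectors.
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def topk_within(group, k):
        # Local "reducer": top-k similar pairs inside one group.
        return heapq.nlargest(
            k, ((cosine(a, b), i, j)
                for (i, a), (j, b) in combinations(group, 2)))

    def topk_across(ga, gb, k):
        # Local "reducer": top-k pairs with one point in each group.
        return heapq.nlargest(
            k, ((cosine(a, b), i, j) for i, a in ga for j, b in gb))

    def topk_pairs(data, k, num_groups=4):
        # Round-robin grouping; every pair then belongs to exactly one
        # within-group or cross-group bucket.
        groups = [[] for _ in range(num_groups)]
        for idx, vec in enumerate(data):
            groups[idx % num_groups].append((idx, vec))
        candidates = []
        for gi in range(num_groups):
            candidates += topk_within(groups[gi], k)
            for gj in range(gi + 1, num_groups):
                candidates += topk_across(groups[gi], groups[gj], k)
        # Final merge: global top-k over the per-bucket winners.
        return heapq.nlargest(k, candidates)

    print(topk_pairs([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], k=2))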

2.
Because large-scale data is typically stored in a distributed fashion, finding its top-k most similar pairs is difficult. To address this problem, a MapReduce-based top-k similar pair query scheme is proposed. The scheme first splits all data pairs into multiple groups, for which an all-pairs grouping algorithm and a core-pairs grouping algorithm are proposed; the top-k similar pairs of each group are computed independently, and the k pairs with the highest similarity are then selected from the per-group results, thereby correctly determining the global top-k similar pairs. Finally, experiments on synthetic and real data sets evaluate performance by varying the number of similar pairs k and the number of machines s. The results show that increasing the number of machines s improves the running efficiency and scalability of the algorithm, whereas varying k has little effect on the MapReduce-based algorithm.

3.
Optimizing kNN-Based Collaborative Filtering with Similarity Support
Personalized recommender systems provide users with customized information based on their individual interests. Such systems are usually implemented with collaborative filtering, and a widely used classic model is the k-nearest-neighbor model based on user rating similarity. The kNN model requires precomputing the k nearest neighbors of each user or item: an overly large k incurs excessive computation and hurts the real-time responsiveness of recommendation, whereas an overly small k degrades recommendation accuracy. To solve this problem, this paper proposes a new nearest-neighbor measure, similarity support. Based on similarity support, several strategies are proposed that keep the k-nearest-neighbor sets at a reasonable size while preserving recommendation accuracy and density. Experimental results on a real large-scale data set show that, compared with traditional algorithms, the proposed strategies greatly reduce computational complexity while preserving recommendation accuracy.

4.
Nearest-neighbor queries are widely used in many fields, such as collaborative filtering, location-based services, and decision support systems. With the development and adoption of Web entity extraction, privacy-preserving data transformation, image recognition, and related techniques, uncertain text data has become common in many domains. The information-theoretic TF-IDF algorithm converts textual similarity matching into numeric vector computations, and is both rigorous and effective. However, the cosine distance over TF-IDF vectors is not a metric, which makes index construction difficult. This paper therefore studies cosine-similarity-based similarity queries over uncertain text data. By analyzing the properties of uncertain cosine similarity computation, a fast similarity computation method is proposed. By transforming the cosine distance computation, an improved index structure, the sMVP-tree (statistic multiple vantage point tree), is constructed, and a cosine-similarity computation method for uncertain data is given. Finally, kNN and RkNN query algorithms for distributed environments are proposed on top of this similarity computation method. Extensive experiments on real data verify the correctness and effectiveness of the algorithms.
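The TF-IDF weighting and cosine similarity the abstract builds on are standard and easy to state concretely; the sketch below shows only that textual baseline (the sMVP-tree index and the uncertain-data extensions are the paper's contribution and are not shown):

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs: list of token lists. Returns one sparse TF-IDF dict per doc.
        n = len(docs)
        df = Counter(t for doc in docs for t in set(doc))
        idf = {t: math.log(n / df[t]) for t in df}
        return [{t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
                for doc in docs]

    def cosine_sim(u, v):
        # Cosine similarity of two sparse vectors (dict: term -> weight).
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    docs = [["big", "data", "query"], ["query", "index"], ["big", "data"]]
    vecs = tfidf_vectors(docs)
    print(cosine_sim(vecs[0], vecs[2]))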

5.
The traditional k-nearest-neighbor algorithm kNN is widely used for data classification, but it involves much redundant computation, so classifying high-dimensional data takes considerable time. Meanwhile, in the landmark-based spectral clustering classification algorithms LC-kNN and RC-kNN, some of the points nearest to the current test point may be missing from the selected cluster, which lowers accuracy. To address these problems, a clustering-based annular (ring) k-nearest-neighbor algorithm is proposed. On top of a clustering algorithm, the method first groups highly similar training points into clusters, then places a ring-shaped filter centered at the current test point, and finally classifies the points inside the filter with kNN; the underlying clustering algorithm can be chosen freely as the situation requires. The algorithm was evaluated on six public data sets from the UCI repository. Experimental results show that the AkNN_E and AkNN_H algorithms reduce computation by 51% on average compared with kNN, while improving accuracy by 3% on average over LC-kNN and RC-kNN. Moreover, the algorithm remains effective on 10,000-dimensional data.
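A rough sketch of the cluster-then-filter-then-vote pipeline (hedged: scikit-learn's KMeans stands in for the freely chosen clustering algorithm, and the ring radii below are ad-hoc heuristics invented for the example, not the paper's AkNN_E/AkNN_H rules):

    import numpy as np
    from collections import Counter
    from sklearn.cluster import KMeans  # any clustering algorithm would do

    def ring_knn_predict(X_train, y_train, x, k=5, n_clusters=10):
        # 1) Cluster the training set (in practice this is done once,
        #    not per query; it is inlined here to keep the sketch short).
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_train)
        # 2) Ring filter centered at the test point: keep only training
        #    points whose distance to x falls within [r_in, r_out].
        d_centers = np.linalg.norm(km.cluster_centers_ - x, axis=1)
        r_out = np.sort(d_centers)[min(2, n_clusters - 1)]  # 3rd-closest center
        r_in = 0.0                                           # degenerate inner radius
        d = np.linalg.norm(X_train - x, axis=1)
        mask = (d >= r_in) & (d <= r_out)
        cand_d, cand_y = d[mask], y_train[mask]
        if len(cand_y) < k:   # fall back to plain kNN if the filter is too tight
            cand_d, cand_y = d, y_train
        # 3) Plain kNN majority vote over the filtered candidates only.
        nearest = cand_y[np.argsort(cand_d)[:k]]
        return Counter(nearest.tolist()).most_common(1)[0][0]

    X = np.random.rand(200, 5)
    y = (X[:, 0] > 0.5).astype(int)
    print(ring_knn_predict(X, y, np.random.rand(5)))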

6.
Because traditional internal clustering validity indices rely on distance-based similarity measures, their performance degrades as data dimensionality increases. To address this problem, an internal clustering validity index based on shared-nearest-neighbor (SNN) similarity is proposed. First, SNN similarity and the k-nearest-neighbor (kNN) method are used to estimate point densities and build a density-fused SNN similarity graph. Then, based on this graph, a maximum-flow algorithm computes intra-cluster similarity and inter-cluster separation, which are combined into the internal validity index. Tests on synthetic and real data sets show that, compared with nine traditional distance-based internal indices, the proposed index more accurately evaluates the best partition of a data set and predicts the best number of clusters. Its ability to handle complex cluster structures and high-dimensional data is therefore superior to the compared internal indices.
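The shared-nearest-neighbor similarity at the core of this index is a standard notion: two points are similar to the extent that their k-nearest-neighbor lists overlap. A minimal sketch of that building block (the density fusion and the max-flow computation of intra-cluster similarity and inter-cluster separation are omitted):

    import numpy as np

    def knn_lists(X, k):
        # The index set of each point's k nearest neighbors (self excluded).
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        return [set(row[1:k + 1]) for row in np.argsort(d, axis=1)]

    def snn_similarity(X, k):
        # SNN similarity: the number of neighbors two points share,
        # counted only when each point is in the other's kNN list
        # (a common sparsification rule).
        nbrs = knn_lists(X, k)
        n = len(X)
        s = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                if i in nbrs[j] and j in nbrs[i]:
                    s[i, j] = s[j, i] = len(nbrs[i] & nbrs[j])
        return s

    X = np.random.rand(8, 3)
    print(snn_similarity(X, k=3))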

7.
Neighborhood-based collaborative filtering considers only similarity, and traditional matrix factorization can produce negative values. To address both problems, a two-stage k-nearest-neighbor selection algorithm combining non-negative matrix factorization with item popularity is proposed. Incorporating item popularity into the similarity computation effectively alleviates the data sparsity problem. Experimental results show that the proposed algorithm achieves higher recommendation accuracy than classic algorithms.

8.
To improve the ability of the collaborative filtering algorithms in Mahout to handle big data, this work studies cloud computing platforms and proposes a MapReduce-based method for computing similarity. Four MapReduce jobs are designed to parallelize the log-likelihood similarity algorithm. Exploiting the characteristics of the algorithm, composite key pairs and a co-occurrence matrix are used to merge many small key-value pairs into large ones, reducing intermediate computation and communication overhead. Experimental results show that, compared with the single-machine similarity algorithm in Mahout, the Hadoop-based log-likelihood similarity algorithm achieves good speedup and scalability and improves the efficiency of the recommendation algorithm.
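The log-likelihood similarity in question is the log-likelihood ratio (the G² statistic over a 2×2 co-occurrence table) that Mahout uses. A single-machine sketch of the score itself (the four MapReduce jobs and the composite-key optimization are not reproduced):

    import math

    def g2(k11, k12, k21, k22):
        # G^2 over a 2x2 table: k11 = users who interacted with both
        # items, k12/k21 = with only one of them, k22 = with neither.
        # G^2 = 2 * sum k_ij * ln(k_ij * N / (row_i * col_j)).
        n = k11 + k12 + k21 + k22
        rows = (k11 + k12, k21 + k22)
        cols = (k11 + k21, k12 + k22)
        cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
                 (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
        return 2.0 * sum(k * math.log(k * n / (r * c))
                         for k, r, c in cells if k > 0)

    # Two items co-rated by 20 of 1000 users, rated alone by 30 and 40.
    print(g2(20, 30, 40, 910))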

9.
Collaborative filtering is widely used in personalized recommender systems. To make a user's nearest-neighbor set more precise and effective, an optimized collaborative filtering recommendation algorithm based on user interest degree and user features is proposed. It first groups users by computing their interest degree in items; it then uses a Bayesian method to analyze how strongly users with different features favor each item; finally, a new similarity measure computes the target user's nearest-neighbor set. Experiments show that the algorithm improves the effectiveness and accuracy of the nearest-neighbor set, and recommendation quality improves markedly over previous algorithms.

10.
Collaborative filtering is the most effective and widely used recommendation method in service recommender systems, and it focuses on improving recommendation accuracy. In the big data setting, however, user behavior data is both frequently updated and rapidly growing; traditional collaborative filtering must exhaustively search all data, so similarity computation is time-consuming, recommendation is inefficient, and users' real-time needs cannot be met. Quickly obtaining high-quality recommendations from big data has become a new requirement. To this end, a collaborative filtering algorithm based on locality-sensitive hashing (LSH) is proposed. The algorithm filters out the vast majority of dissimilar items, avoiding redundant similarity computation; it also hashes user behavior data into binary hash codes, thereby protecting user privacy. Finally, comparisons with mainstream algorithms on data sets of different sizes show that the proposed algorithm achieves a good trade-off between efficiency and accuracy.
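One standard way to realize this combination is random-hyperplane LSH (SimHash) for cosine similarity, sketched below under the assumption that user behavior is a numeric user-by-item matrix: each user is reduced to a short binary code, and only users who land in the same bucket are ever compared.

    import numpy as np
    from collections import defaultdict

    def signatures(X, n_bits=16, seed=0):
        # Random-hyperplane LSH: bit b is the sign of the projection
        # onto hyperplane b; vectors with high cosine similarity agree
        # on most bits.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((n_bits, X.shape[1]))
        return (X @ planes.T > 0).astype(np.uint8)

    def candidate_pairs(sigs):
        # Bucket users by their full binary code; only users sharing a
        # bucket become candidates for the exact similarity computation.
        buckets = defaultdict(list)
        for user, bits in enumerate(sigs):
            buckets[bits.tobytes()].append(user)
        return [b for b in buckets.values() if len(b) > 1]

    ratings = np.random.rand(100, 50)   # user-by-item behavior matrix
    print(candidate_pairs(signatures(ratings, n_bits=8)))

Storing only the binary codes, rather than the raw behavior vectors, is also what gives the privacy property mentioned above.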

11.
Dataspaces have recently been proposed to manage heterogeneous data, with features such as being partially unstructured, high-dimensional, and extremely sparse. The inverted index has previously been extended to retrieve dataspaces. In order to achieve more efficient access to dataspaces, in this paper we first introduce our survey of data features in real dataspaces. Based on the features observed in our study, several partitioning-based index approaches are proposed to accelerate query processing in dataspaces. Specifically, the vertical partitioning index utilizes partitions on tokens to merge and compress data; we can both reduce the number of I/O reads and avoid aggregation of data inside a compressed list. The horizontal partitioning index supports pruning partitions of tuples in top-k queries, so we can reduce the computation overhead of candidate tuples irrelevant to the query. Finally, we also propose a hybrid index with both vertical and horizontal partitioning. Extensive experimental results on real data sets demonstrate that our approaches outperform previous techniques and scale well with large data sizes.
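For reference, a toy inverted index over "attribute:value" tokens, i.e. the baseline structure these partitioning schemes extend (the vertical and horizontal partitioning and the list compression themselves are not shown):

    from collections import defaultdict

    def build_inverted_index(tuples):
        # tuples: id -> list of "attribute:value" tokens. The index maps
        # each token to the sorted posting list of tuple ids containing it.
        index = defaultdict(list)
        for tid, tokens in tuples.items():
            for tok in set(tokens):
                index[tok].append(tid)
        return {tok: sorted(ids) for tok, ids in index.items()}

    def query(index, tokens):
        # Conjunctive lookup: intersect posting lists, shortest first.
        lists = sorted((index.get(t, []) for t in tokens), key=len)
        result = set(lists[0])
        for lst in lists[1:]:
            result &= set(lst)
        return result

    data = {1: ["title:db", "year:2009"],
            2: ["title:db", "year:2010"],
            3: ["year:2009"]}
    idx = build_inverted_index(data)
    print(query(idx, ["title:db", "year:2009"]))   # -> {1}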

12.
It is advantageous to use implicit security for online data storage in a cloud computing environment. We describe the use of a data partitioning scheme for implementing such security, involving the roots of a polynomial in a finite field. The partitions are stored on randomly chosen servers on the network, and they must be retrieved to recreate the original data. Data reconstruction requires access to each server, its login password, and knowledge of the servers on which the partitions are stored. This scheme may also be used for data security in sensor networks and Internet voting protocols.
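The scheme described builds on the roots of a polynomial over a finite field; a closely related and widely known construction is Shamir-style k-of-n sharing, sketched below purely as an illustration (the prime and parameters are arbitrary): any k partitions recreate the data, and fewer reveal nothing.

    import random

    P = 2**127 - 1  # a Mersenne prime; all arithmetic is in GF(P)

    def split(secret, n, k):
        # The secret is the constant term of a random degree-(k-1)
        # polynomial; the n partitions are points (x, f(x)) on it.
        coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
        def f(x):
            return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
        return [(x, f(x)) for x in range(1, n + 1)]

    def combine(shares):
        # Lagrange interpolation at x = 0 recovers the constant term.
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (-xj) % P
                    den = den * (xi - xj) % P
            secret = (secret + yi * num * pow(den, P - 2, P)) % P
        return secret

    shares = split(secret=123456789, n=5, k=3)   # store on 5 servers
    print(combine(random.sample(shares, 3)))     # any 3 recover the data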

13.
Stream computing applications require minimum latency and high throughput to process real-time data efficiently. Data-intensive applications, in which large datasets must be moved across execution nodes, typically have strict latency requirements. In this paper, a stream-based data processing model is adopted to develop an algorithm for optimally partitioning the input data such that the inter-partition data flow remains minimal. The proposed algorithm improves the execution of data-intensive workflows in heterogeneous computing environments by partitioning the workflow and mapping each partition onto the available heterogeneous resources that offer minimum execution time. Minimum data movement between the partitions reduces the latency, which can be reduced further by applying advanced data parallelism techniques. In this paper, we apply a data parallelism technique to the bottleneck (most compute-intensive) task in each partition, which significantly reduces the latency. We study the effectiveness and performance of the proposed approach using synthesized workflows and real-world applications such as Montage and CyberShake. Our evaluation shows that the proposed algorithm provides schedules with approximately 12% lower latency and nearly 17% higher throughput compared with existing state-of-the-art algorithms.

14.
Skyline queries, together with other advanced query operators, are essential to help identify interesting data points buried within the huge amounts of data readily available these days. A skyline query retrieves the set of non-dominated data points in a multi-dimensional dataset. As computing infrastructures become increasingly pervasive and connected by readily available network services, data storage and management have inevitably become more distributed. Under such distributed environments, designing efficient skyline querying with quick response times and progressive return of answers faces new challenges. To address this, we propose a novel skyline query scheme termed MpSky. MpSky is based on a novel space partitioning scheme that employs the dependency relationships among data points on different servers. By grouping the points of each server using dependencies, we can qualify a skyline point by comparing it only with data on dependent servers, and parallelize the skyline computation among non-dependent partitions from different servers or individual servers. By controlling query propagation among partitions, we can generate skyline results progressively and prune partitions and points efficiently. Analytical and extensive simulation results show the effectiveness of the proposed scheme.
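The dominance test, and the classic block-nested-loop skyline that partitioned schemes such as MpSky parallelize and prune, are easy to state concretely; a minimal sketch assuming smaller is better on every dimension (the dependency-based space partitioning itself is not shown):

    def dominates(p, q):
        # p dominates q if p is at least as good in every dimension and
        # strictly better in at least one (smaller is better here).
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))

    def skyline(points):
        # Block-nested-loop skyline: keep a window of non-dominated points.
        window = []
        for p in points:
            if any(dominates(w, p) for w in window):
                continue                      # p is dominated, drop it
            window = [w for w in window if not dominates(p, w)] + [p]
        return window

    hotels = [(50, 3.0), (45, 4.0), (60, 1.5), (45, 3.5)]  # (price, distance)
    print(skyline(hotels))  # (45, 4.0) is dominated by (45, 3.5)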

15.
谢剑锋, 杜建洪. 《计算机工程》, 2004, 30(12): 147-148, 161
Data partitioning techniques divide a video stream into different parts for transmission. The traditional data partitioning algorithm uses frequency-domain layering based on DCT coefficients. The algorithm proposed in this paper instead requantizes the DCT coefficients: the requantized coefficients form the base-layer data, while the requantization residuals form the enhancement-layer data. Experiments show that, compared with frequency-domain layering, the algorithm delivers better picture quality.
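The requantization split is simple to state concretely; a toy sketch with illustrative coefficient and quantizer values (the base layer carries the coarse coefficients, the enhancement layer the residuals, and the two sum back to the original):

    def split_layers(coeffs, q_base):
        # Base layer: coarsely requantized coefficients.
        base = [round(c / q_base) for c in coeffs]
        # Enhancement layer: the requantization residual, chosen so that
        # base * q_base + residual reconstructs the original coefficient.
        residual = [c - b * q_base for c, b in zip(coeffs, base)]
        return base, residual

    dct = [-172, 31, -8, 4, 0, -1]       # toy DCT coefficients
    base, enh = split_layers(dct, q_base=16)
    print(base)                           # sent in the base layer
    print(enh)                            # sent in the enhancement layer
    print([b * 16 + r for b, r in zip(base, enh)])  # reconstructs dct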

16.
Replication in distributed computing systems provides improved availability and reliability in the event of site failure and network partitioning. However, if strict mutual consistency is required, transactions can be processed in at most one partition, thereby reducing availability. We present a consistency control algorithm that relaxes strict mutual consistency criteria and allows concurrent processing in all partitions. Inconsistency of data objects in different partitions is resolved at the time of merging the partitions when recovery occurs. The basis of our algorithm is a new merge mechanism that utilizes available semantic information about the data objects and transaction types. We present a formal proof of correctness of the algorithm. Results from a simulation model show that our algorithm performs better than a previously proposed approach that uses compensating transactions to sacrifice serializability of replicated data.

17.
Many RDF systems support reasoning with Datalog rules via materialisation, where all conclusions of the RDF data and the rules are precomputed and explicitly stored in a preprocessing step. As the amount of RDF data used in applications keeps increasing, processing large datasets often requires distributing the data across a cluster of shared-nothing servers. While numerous distributed query answering techniques are known, distributed materialisation is less well understood. In this paper, we present several techniques that facilitate scalable materialisation in distributed RDF systems. First, we present a new distributed materialisation algorithm that aims to minimise communication and synchronisation in the cluster. Second, we present two new algorithms for partitioning RDF data, both of which aim to produce tightly connected partitions without loading complete datasets into memory. We evaluate our materialisation algorithm against two state-of-the-art distributed Datalog systems and show that our technique offers competitive performance, particularly when the rules are complex. Moreover, we analyse in depth the effects of data partitioning on reasoning performance and show that our techniques offer performance comparable or superior to state-of-the-art min-cut partitioning, while computing the partitions requires considerably less time and memory.
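Materialisation is the precomputation of a Datalog fixpoint; a single-machine semi-naive sketch for a transitive-closure rule gives the flavor (the distributed communication and synchronisation machinery, and the partitioning algorithms, are the paper's contribution and are omitted):

    def materialise(edges):
        # Semi-naive evaluation of:  path(x, y) :- edge(x, y).
        #                            path(x, z) :- path(x, y), edge(y, z).
        # Only facts derived in the previous round (delta) are rejoined,
        # which avoids rederiving the same conclusions every iteration.
        path = set(edges)
        delta = set(edges)
        while delta:
            new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
            delta = new - path
            path |= delta
        return path

    print(sorted(materialise({("a", "b"), ("b", "c"), ("c", "d")})))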

18.
When a phenomenon is described by a parametric model and multiple datasets are available, a key problem in statistics is to discover which datasets are characterized by the same parameter values. Equivalently, one is interested in partitioning the family of datasets into blocks collecting data that are described by the same parameters. Because of noise, different partitions can be consistent with the data, in the sense that they are accepted by generalized likelihood ratio tests with a given confidence level. Given the fact that testing all possible partitions is a computationally unaffordable task, we propose an algorithm for finding all acceptable partitions while avoiding testing unnecessary ones. The core of our method is an efficient procedure, based on partial order relations on partitions, for computing all partitions that verify an upper bound on a monotone function. The reduction of the computational burden brought about by the algorithm is analyzed both theoretically and experimentally. Applications to the identification of switched systems are also presented.

19.
An error-resilient three-dimensional (3-D) mesh coding system is proposed in this paper. The encoder uses a shape-adaptive data partitioning scheme to alleviate the effect of error propagation. An input mesh surface is coarsely divided into smooth and detailed regions, and each region is further divided into partitions of similar sizes. Those partitions are then progressively compressed, and their joint boundaries are compressed using the boundary edge collapse rule. At the decoder, the boundary edge collapse rule facilitates the seamless assembly of the partitions. When no data is available for a partition due to transmission errors, we employ a concealment scheme based on the projection onto convex sets (POCS) theory. Simulation results demonstrate that the proposed algorithm reconstructs 3-D mesh surfaces faithfully even in severely error-prone environments.
