Similar Documents
20 similar documents found.
1.
Effective data management is essential when high-performance computing applications manipulate massive, distributed datasets. This paper proposes a new data management and optimization system for high-performance distributed computing. It comprises a metadata management system and a storage system, and provides an easy-to-use platform that performs storage-access optimization automatically. The platform's multi-storage-resource architecture satisfies both performance and storage-capacity requirements and can adaptively exploit current I/O optimization methods.

2.
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques.
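The confidence-driven, greedy scheduling idea in this abstract can be sketched as follows. This is a minimal illustration, not WebPut's implementation; `Candidate`, `propose`, and `execute` are hypothetical placeholders for the paper's imputation queries, query formulation, and web extraction steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Candidate:
    """A hypothetical web imputation query for one missing cell."""
    row: int
    column: str
    query: str               # e.g. an IE-style web search query
    confidence: float        # estimated probability that the answer is correct

def greedy_impute(table: List[Dict[str, Optional[str]]],
                  propose: Callable[[List[Dict[str, Optional[str]]], int, str], List[Candidate]],
                  execute: Callable[[Candidate], Optional[str]]) -> None:
    """Fill missing cells in order of decreasing estimated confidence.

    `propose` generates candidate queries for a missing cell from the values
    already known in the table; `execute` runs one query against the web and
    returns an extracted value (or None). Both are placeholders here.
    """
    while True:
        best: Optional[Candidate] = None
        for i, row in enumerate(table):
            for col, val in row.items():
                if val is None:
                    for cand in propose(table, i, col):
                        if best is None or cand.confidence > best.confidence:
                            best = cand
        if best is None:                     # nothing left to impute
            break
        value = execute(best)
        # Fill the cell even if the query fails, to guarantee termination;
        # a real system would fall back to the next-best candidate instead.
        table[best.row][best.column] = value if value is not None else ""
```

The point of the greedy order is that high-confidence fills complete tuples early, which in turn sharpens the queries proposed for the remaining cells.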

3.
Time series data mining has become an active research area due to the rapid proliferation of temporal-dependent applications. Dimensionality reduction and uncertainty handling play a pivotal role in extracting time series patterns. Most dimensionality reduction schemes are designed on the assumption that every class of samples follows a Gaussian distribution. When real data distributions lack this property, dimensionality reduction techniques cannot characterize the different classes well or measure data uncertainty accurately. In addition, applying an uncertainty measure uniformly to inconsistent time series samples may underestimate the source of uncertainty among the various sub-samples. This paper presents HUNT, an approach for Handling UNcertainty and missing value prediction in Time series. The proposed approach employs Adaptive Reservoir Filling for sampling the time series and a Discrepant Sample dependent Chebyshev inequality for handling the uncertainty. HUNT implements adaptive reservoir filling using discrepancy estimation over a statistical population and decides the reservoir size according to the variations in the data stream. The state of the statistical population ensures uncertainty handling over discrepant samples. The proposed approach replaces missing values with the support of the Mean-Mode imputation method. To select key features effectively, it applies both indirect and direct performance measures on the statistical samples. Finally, the proposed model generates fine-tuned statistical samples through segmentation to facilitate time series pattern matching. Experimental results demonstrate that HUNT significantly outperforms existing time series pattern matching approaches on the weather forecasting dataset, achieving 18% higher recall than the KSample approach and a 20% lower Mean Absolute Error (MAE) than the UG-Miner approach.
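A minimal sketch of the three building blocks named above, under simplifying assumptions: plain uniform reservoir sampling stands in for the adaptive reservoir filling, a per-sample Chebyshev bound stands in for the discrepant-sample variant, and mean imputation stands in for Mean-Mode.

```python
import random
import statistics
from typing import List, Optional, Sequence

def reservoir_sample(stream, k: int) -> list:
    """Uniform reservoir sampling; HUNT adapts k to stream variability,
    which is omitted in this sketch."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = x
    return reservoir

def chebyshev_outlier(x: float, sample: Sequence[float], max_prob: float = 0.05) -> bool:
    """Flag x as 'uncertain' if Chebyshev's bound P(|X - mu| >= k*sigma) <= 1/k^2
    says values this far from the mean occur with probability <= max_prob."""
    mu = statistics.mean(sample)
    sigma = statistics.pstdev(sample)
    if sigma == 0:
        return x != mu
    k = abs(x - mu) / sigma
    return k > 0 and (1.0 / (k * k)) <= max_prob

def impute_mean(series: List[Optional[float]]) -> List[float]:
    """Mean imputation for a numeric series (mode would be used for categorical data)."""
    observed = [v for v in series if v is not None]
    fill = statistics.mean(observed) if observed else 0.0
    return [v if v is not None else fill for v in series]
```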

4.
Hardware environment aside, the technique used to store XML data in an RDBMS largely determines the efficiency of relational XML query processing. Current relational XML storage methods fall into two broad categories: the model-mapping approach and the structure-mapping approach. With respect to XML query-processing efficiency, this paper discusses the strengths and weaknesses of the relevant XML storage methods and identifies two directions for future research on XML storage: multidimensional handling of path information and effective support for data modification.

5.
Multibiometric systems, which consolidate or fuse multiple sources of biometric information, typically provide better recognition performance than unimodal systems. While fusion can be accomplished at various levels in a multibiometric system, score-level fusion is commonly used as it offers a good trade-off between data availability and ease of fusion. Most score-level fusion rules assume that the scores pertaining to all the matchers are available prior to fusion. Thus, they are not well equipped to deal with the problem of missing match scores. While there are several techniques for handling missing data in general, the imputation scheme, which replaces missing values with predicted values, is preferred since it can be followed by a standard fusion scheme designed for complete data. In this work, the performance of the following imputation methods is compared in the context of multibiometric fusion: K-nearest neighbor (KNN) schemes, likelihood-based schemes, Bayesian-based schemes and multiple imputation (MI) schemes. Experiments on the MSU database assess the robustness of the schemes in handling missing scores at different missing rates. It is observed that the Gaussian mixture model (GMM)-based KNN imputation scheme results in the best recognition accuracy.
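For concreteness, a generic KNN imputation of missing match scores might look like the sketch below. This is not the paper's GMM-based variant; the Euclidean neighbour distance and the simple averaging are assumptions of the illustration.

```python
import math
from typing import List, Optional

def knn_impute(rows: List[List[Optional[float]]], k: int = 3) -> List[List[float]]:
    """Impute missing match scores from the k most similar complete score vectors.

    Similarity is Euclidean distance over the dimensions both rows have observed;
    the paper's best-performing variant models the scores with a Gaussian mixture
    instead, which is not reproduced here.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    complete = [r for r in rows if all(v is not None for v in r)]
    imputed = []
    for r in rows:
        if all(v is not None for v in r):
            imputed.append(list(r))
            continue
        neighbours = sorted(complete, key=lambda c: distance(r, c))[:k]
        filled = []
        for j, v in enumerate(r):
            if v is not None:
                filled.append(v)
            else:
                vals = [n[j] for n in neighbours]
                filled.append(sum(vals) / len(vals) if vals else 0.0)
        imputed.append(filled)
    return imputed
```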

6.
Tracing DoS attacks that employ source address spoofing is an important and challenging problem. Traditional traceback schemes provide spoofed-packet traceback capability either by augmenting the packets with partial path information (i.e., packet marking) or by storing packet digests or signatures at intermediate routers (i.e., packet logging). Such approaches require either a large number of attack packets to be collected by the victim to infer the paths (packet marking) or a significant amount of resources to be reserved at intermediate routers (packet logging). We adopt a hybrid traceback approach in which packet marking and packet logging are integrated in a novel manner, so as to achieve the best of both worlds: only a small number of attack packets is needed to conduct the traceback process, and only a small amount of resources must be allocated at intermediate routers for packet logging purposes. Based on this notion, two novel traceback schemes are presented. The first scheme, called distributed link-list traceback (DLLT), is based on the idea of preserving the marking information at intermediate routers in such a way that it can be collected using a link list-based approach. The second scheme, called probabilistic pipelined packet marking (PPPM), employs the concept of a "pipeline" for propagating marking information from one marking router to another so that it eventually reaches the destination. We evaluate the effectiveness of the proposed schemes against various performance metrics through a combination of analytical and simulation studies. Our studies show that the proposed schemes offer a drastic reduction in the number of packets required to conduct the traceback process and a reasonable saving in the storage requirement.
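As background for the marking side of such hybrid schemes, here is a toy simulation of classic probabilistic packet marking, where each router overwrites a single mark field with a fixed probability. It is deliberately simpler than DLLT or PPPM (no logging, no pipelining, no header-bit encoding); `MARK_PROBABILITY` and the router identifiers are illustrative.

```python
import random
from typing import List, Optional

MARK_PROBABILITY = 0.04   # a typical value in the PPM literature; illustrative only

class Packet:
    def __init__(self, payload: bytes):
        self.payload = payload
        self.mark: Optional[str] = None   # stands in for a small header field

def forward(packet: Packet, router_id: str) -> None:
    """Each router marks the packet with its identity with fixed probability.

    Real schemes squeeze the mark into a few header bits and may chain or log
    marks (as DLLT/PPPM do); here the mark is simply the router id.
    """
    if random.random() < MARK_PROBABILITY:
        packet.mark = router_id

def simulate(path: List[str], n_packets: int) -> dict:
    """Count which routers' marks survive at the victim; routers closer to the
    victim are over-represented, which path reconstruction must correct for."""
    counts = {r: 0 for r in path}
    for _ in range(n_packets):
        p = Packet(b"attack")
        for r in path:
            forward(p, r)
        if p.mark is not None:
            counts[p.mark] += 1
    return counts

print(simulate(["R1", "R2", "R3", "R4"], 10000))
```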

7.
Iterated importance sampling in missing data problems
Missing variable models are typical benchmarks for new computational techniques in that the ill-posed nature of missing variable models offers a challenging testing ground for these techniques. This was the case for the EM algorithm and the Gibbs sampler, and this is also true for importance sampling schemes. A population Monte Carlo scheme taking advantage of the latent structure of the problem is proposed. The potential of this approach and its specifics in missing data problems are illustrated in settings of increasing difficulty, in comparison with existing approaches. The improvement brought by a general Rao–Blackwellisation technique is also discussed.
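The basic ingredient the population Monte Carlo scheme builds on is self-normalized importance sampling over the missing (latent) values. A small sketch with a toy Gaussian target follows; the target, proposal, and sample size are illustrative assumptions, and the adaptive, iterated aspect of the paper's scheme is not reproduced.

```python
import math
import random

def self_normalized_is(log_target, proposal_sample, proposal_logpdf, n=10000):
    """Self-normalized importance sampling estimate of E[z] under the target.

    log_target: unnormalized log-density of the missing value given the observed
    data; proposal_sample/proposal_logpdf define the importance distribution.
    Normalizing constants cancel in the self-normalized ratio.
    """
    zs, ws = [], []
    for _ in range(n):
        z = proposal_sample()
        logw = log_target(z) - proposal_logpdf(z)
        zs.append(z)
        ws.append(math.exp(logw))
    total = sum(ws)
    return sum(w * z for w, z in zip(ws, zs)) / total

# Toy example: the 'missing' value z has target N(2, 1) (e.g. its conditional
# given the observed data), and we use a wider N(0, 3) proposal.
est = self_normalized_is(
    log_target=lambda z: -0.5 * (z - 2.0) ** 2,
    proposal_sample=lambda: random.gauss(0.0, 3.0),
    proposal_logpdf=lambda z: -0.5 * (z / 3.0) ** 2 - math.log(3.0),
    n=20000,
)
print(round(est, 2))   # close to 2.0
```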

8.
Wireless sensor networks are mainly used to collect information about target objects. Because their energy is extremely limited, distributed data storage and querying have attracted increasing attention. This paper proposes a new wavelet-based distributed storage scheme in which all information is wavelet-compressed and then spread evenly across the nodes, forming a spatial storage tree of wavelet coefficients. Simulation experiments show that the algorithm performs well for data management in wireless sensor networks: (1) a simplified wavelet transform eliminates extra computation and communication, greatly reducing the energy consumed by data management; (2) exploiting information correlation within and between sensor nodes effectively improves storage efficiency; and (3) the multi-resolution wavelet coding and the self-similarity of the coefficient storage tree support fast queries in both time and space.
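To see why wavelet compression suits smooth sensor readings, a single-level unnormalized Haar transform is enough: pairwise averages keep the trend, and pairwise differences (mostly near zero) can be thresholded before the coefficients are spread across nodes. The sketch below is a simplification, not the paper's coefficient storage tree; the threshold value is arbitrary.

```python
from typing import List, Tuple

def haar_step(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the (unnormalized) Haar transform: pairwise averages
    approximate the signal, pairwise differences carry the detail."""
    approx, detail = [], []
    for i in range(0, len(signal) - 1, 2):
        approx.append((signal[i] + signal[i + 1]) / 2.0)
        detail.append((signal[i] - signal[i + 1]) / 2.0)
    return approx, detail

def compress(signal: List[float], threshold: float = 0.05) -> Tuple[List[float], List[float]]:
    """Zero out small detail coefficients; smooth readings become very sparse."""
    approx, detail = haar_step(signal)
    sparse_detail = [d if abs(d) > threshold else 0.0 for d in detail]
    return approx, sparse_detail

readings = [20.1, 20.2, 20.2, 20.1, 20.4, 20.4, 23.0, 20.5]
print(compress(readings))   # only the abrupt jump keeps a nonzero detail
```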

9.
Sensor networks have been an attractive platform for pervasive computing and communication. However, they are vulnerable to attacks if deployed in hostile environments. Past research on sensor network security has focused on securing information in communication, but how to secure information in storage has been overlooked. Meanwhile, distributed data storage and retrieval have become popular for efficient data management in sensor networks, which makes the absence of schemes for securing stored information an even more severe problem. Therefore, we propose three evolutionary schemes, namely, the simple hash-based (SHB) scheme, the enhanced hash-based (EHB) scheme, and the adaptive polynomial-based (APB) scheme, to deal with the problem. All the schemes have the properties that only authorized entities can access data stored in the sensor network, and the schemes are resilient to a large number of sensor node compromises. The EHB and the APB schemes do not involve any centralized entity except for a few initialization or renewal operations, and thus support secure, distributed data storage and retrieval. The APB scheme further provides high scalability and flexibility, and hence is the most suitable among the three schemes for real applications. The schemes were evaluated through extensive analysis and TOSSIM-based simulations.

10.
Data imputation is a common practice encountered when dealing with incomplete data. Irrespective of the existing spectrum of techniques, the results of imputation are commonly numeric, meaning that once the data have been imputed they are not distinguishable from the original data that was available prior to imputation. In this study, the crux of the proposed approach is to develop a way of representing imputed (missing) entries as information granules and in this manner quantify the quality of the imputation process and the quality of the ensuing data. We establish a two-stage imputation mechanism in which we start with any method of numeric imputation and then form a granular representative of the missing value. In this sense, the approach could be regarded as an enhancement of the existing imputation techniques. Proceeding with the detailed imputation schemes, we discuss two ways of imputation. In the first one, imputation is realized for individual variables of data sets and afterwards enhanced by the buildup of information granules. In the second approach, we are concerned with the use of fuzzy clustering, Fuzzy C-Means (FCM), which helps establish a structure in the data and then use this information in the imputation process. The design of information granules invokes the fundamentals of Granular Computing, namely a principle of justifiable granularity and an allocation of information granularity. Numeric experiments concerned with a suite of publicly available data sets offer detailed insights into the main facets of the overall design process and deliver a parametric analysis of the methods.
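A rough sketch of the two-stage idea, under stated assumptions: stage one is plain mean imputation, and stage two widens each imputed value into an interval whose radius maximizes a crude coverage-times-specificity score over the observed data. This is only a stand-in for the principle of justifiable granularity, and the FCM-based variant is not shown.

```python
import statistics
from typing import List, Optional, Tuple

def granular_impute(values: List[Optional[float]]) -> List[Tuple[float, float]]:
    """Stage 1: mean imputation. Stage 2: wrap each imputed value in an interval
    [v - r, v + r], choosing r to maximize coverage * specificity over the
    observed data. A simplified stand-in for justifiable granularity."""
    observed = [v for v in values if v is not None]
    center = statistics.mean(observed)
    span = max(observed) - min(observed) or 1.0

    def score(r: float) -> float:
        coverage = sum(1 for v in observed if abs(v - center) <= r) / len(observed)
        specificity = max(0.0, 1.0 - r / span)
        return coverage * specificity

    radii = [span * i / 100.0 for i in range(1, 101)]
    best_r = max(radii, key=score)

    out = []
    for v in values:
        if v is not None:
            out.append((v, v))                       # observed: degenerate interval
        else:
            out.append((center - best_r, center + best_r))
    return out

print(granular_impute([4.1, None, 3.9, 4.4, None, 4.0, 12.0]))
```

A wide interval then signals a low-quality imputation, which is exactly the extra information the granular representation is meant to carry.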

11.
《Parallel Computing》2014,40(10):697-709
In order to run tasks in a parallel and load-balanced fashion, existing scientific parallel applications such as mpiBLAST introduce a data-initializing stage to move database fragments from shared storage to local cluster nodes. Unfortunately, with the exponentially increasing size of sequence databases in today’s big data era, such an approach is inefficient. In this paper, we develop a scalable data access framework (SDAFT) to solve the data movement problem for scientific applications that are dominated by “read” operations for data analysis. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. SDAFT consists of two interlocked components: (1) a data centric load-balanced scheduler (DC-scheduler) to enforce data-process locality and (2) a translation layer to translate conventional parallel I/O operations into HDFS I/O. By experimenting with our SDAFT prototype system on real-world databases and queries across a wide variety of computing platforms, we found that SDAFT can reduce I/O cost by a factor of 4–10 and double the overall execution performance as compared with existing schemes.
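The data-centric scheduling idea can be illustrated with a small sketch: given replica locations (as a DFS such as HDFS would report for each block), prefer dispatching a fragment's search task to a node that already holds the data, breaking ties by current load. The function and data names are hypothetical, not SDAFT's API.

```python
from typing import Dict, List

def schedule(fragments: Dict[str, List[str]], nodes: List[str]) -> Dict[str, str]:
    """Assign each database fragment to a compute node, preferring nodes that
    hold a local replica (data-process locality) and breaking ties by load.

    `fragments` maps fragment name -> nodes holding a replica, as a real system
    would learn from the DFS block-location metadata.
    """
    load = {n: 0 for n in nodes}
    plan: Dict[str, str] = {}
    for frag, replicas in fragments.items():
        local = [n for n in replicas if n in load]
        candidates = local if local else nodes          # fall back to a remote read
        chosen = min(candidates, key=lambda n: load[n])
        plan[frag] = chosen
        load[chosen] += 1
    return plan

fragments = {"frag-0": ["node1", "node3"], "frag-1": ["node2"],
             "frag-2": ["node1"], "frag-3": ["node2", "node3"]}
print(schedule(fragments, ["node1", "node2", "node3"]))
```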

12.
Querying Compressed Data in Data Warehouses
The large size of most data warehouses (typically hundreds of gigabytes to terabytes) results in non-trivial storage costs and makes compression techniques attractive. For the most part, page-level compression (as opposed to attribute or record level schemes) has been shown to achieve the greatest reductions in storage size for databases. A key issue with such schemes is how to quickly access the data to answer queries, since individual tuple boundaries are lost. In this paper we introduce an approach that aims to maintain the benefits of page-level compression (i.e., large reductions in storage size), while at the same time improving query performance through an efficient signature file indexing scheme. The approach uses an attribute-level signature generation method that exploits the value distribution of each attribute in a data warehouse. We provide an extensive theoretical analysis of this approach in which we compare our approach with a recently proposed indexing technique, encoded bitmapped indexing, along a number of important metrics including query processing, insertion, and storage costs. Results show that our approach is preferred in many situations that are likely to occur in practice. We have also implemented a prototype system which validates our analytical findings.
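A generic attribute-level signature index based on superimposed coding gives the flavour of the approach: hash each attribute value to a few bit positions, OR everything on a page into one signature, and use it to filter pages before any decompression. The bit widths and hashing choices below are illustrative and do not reflect the paper's distribution-aware signature generation.

```python
import hashlib
from typing import Iterable

SIG_BITS = 64          # signature width; illustrative
BITS_PER_VALUE = 3     # bits set per attribute value; illustrative

def value_signature(attr: str, value: str) -> int:
    """Set BITS_PER_VALUE pseudo-random bits determined by (attribute, value)."""
    sig = 0
    for i in range(BITS_PER_VALUE):
        h = hashlib.sha1(f"{attr}={value}#{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % SIG_BITS)
    return sig

def page_signature(records: Iterable[dict]) -> int:
    """Superimpose (OR) the signatures of every attribute value on the page."""
    sig = 0
    for rec in records:
        for attr, value in rec.items():
            sig |= value_signature(attr, str(value))
    return sig

def may_contain(page_sig: int, attr: str, value: str) -> bool:
    """Signature test: all bits of the query signature must be present.
    False positives are possible, false negatives are not, so only pages that
    pass the test need to be decompressed and scanned."""
    q = value_signature(attr, value)
    return (page_sig & q) == q

page = [{"region": "EU", "year": 2023}, {"region": "US", "year": 2024}]
sig = page_signature(page)
print(may_contain(sig, "region", "EU"), may_contain(sig, "region", "APAC"))
```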

13.
This paper presents three garbage collection schemes for causal message logging with independent checkpointing. The first scheme allows each process to autonomously remove useless log information from its volatile storage by piggybacking only a small amount of additional information, without requiring any extra messages or forced checkpoints. Additionally, it supports faster output commit than traditional schemes. The second scheme enables each process to remove a part of the log information in the storage if more empty space is required. It reduces the number of processes participating in the garbage collection by using the size of the log information of each process. The third scheme is a hybrid scheme having the advantages of the two proposed schemes. Simulation results show that the third scheme significantly reduces the garbage collection overhead compared with the traditional schemes regardless of the specific communication patterns of distributed applications.

14.
Many researchers approach the problem of programming distributed memory machines by assuming a global shared name space. Thus the user views the distributed memory of the machine as though it were shared. A major issue that arises at this point is how to manage the memory. When a processor accesses data stored in another processor's memory, data must be moved between the two processors. Once these data are retrieved from another processor's memory, several interesting issues are raised. Where should these data be stored locally? What transformations must be performed on the code to guarantee that the nonlocal accesses reference the correct memory location? What optimizations can be performed to reduce the time spent in accessing the nonlocal data? In this paper we examine various data migration mechanisms that allow an explicit and controlled mapping of data to memory. We describe, experimentally evaluate, and model a set of schemes for storing and retrieving off-processor array elements. The schemes are all based on using hash tables for efficient access to nonlocal data. The three different techniques evaluated are the basic hashed cache, partial enumeration, and full enumeration, the details of which are described in the paper. In all three schemes, nonlocal data are stored in hash tables—the difference lies in the amount of memory used by the schemes and the retrieval mechanisms for nonlocal data.
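A minimal sketch of the basic hashed-cache scheme described above, with the communication step abstracted behind a `fetch_remote` callback (hypothetical; a real implementation would send a message to the owning processor):

```python
from typing import Callable, Dict, List

class HashedCache:
    """Local cache of off-processor array elements, keyed by global index.

    The partial- and full-enumeration variants in the paper differ mainly in
    how much memory they trade for faster lookup; only the basic hashed cache
    is sketched here.
    """
    def __init__(self, local_lo: int, local_hi: int,
                 local_data: List[float],
                 fetch_remote: Callable[[int], float]):
        self.local_lo, self.local_hi = local_lo, local_hi
        self.local_data = local_data
        self.fetch_remote = fetch_remote
        self.cache: Dict[int, float] = {}

    def get(self, global_index: int) -> float:
        if self.local_lo <= global_index < self.local_hi:
            return self.local_data[global_index - self.local_lo]
        if global_index not in self.cache:            # miss: one remote fetch
            self.cache[global_index] = self.fetch_remote(global_index)
        return self.cache[global_index]

# Toy usage: this processor owns indices [100, 200); everything else is "remote".
owned = [float(i) for i in range(100, 200)]
cache = HashedCache(100, 200, owned, fetch_remote=lambda i: float(i))
print(cache.get(150), cache.get(42), cache.get(42))   # second access to 42 hits the cache
```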

15.
Data is often replicated in distributed systems to improve availability and performance. This replication is expensive in terms of disk storage since the existing schemes generally require full files to be stored at each site. In this paper, we present schemes which significantly reduce the storage requirements in replication based systems. These schemes use the coding method suggested by Rabin to store replicated data. The first scheme that we present is a modification of the simple voting algorithm and its quorum requirements. We then show how some of the extensions of the voting algorithm can also be modified to obtain storage-efficient schemes for managing such replication. We evaluate the availability offered by these schemes and show that the storage space required to achieve a certain availability is significantly lower than in the conventional schemes with full file replication. Since coding is used, these schemes also provide a high degree of data security.
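A small worked comparison, with illustrative numbers, of full-file replication against Rabin-style (n, m) information dispersal: each of the n sites stores only 1/m of the file, and the file is readable whenever at least m sites are up.

```python
from math import comb

def availability_full_replication(p: float, n: int) -> float:
    """File is readable if at least one of the n full copies is on an up site."""
    return 1.0 - (1.0 - p) ** n

def availability_dispersal(p: float, n: int, m: int) -> float:
    """Rabin-style (n, m) dispersal: readable if at least m of the n sites are up."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(m, n + 1))

p = 0.95                      # per-site availability (illustrative)
print("3 full copies:     storage 3.00x, availability",
      round(availability_full_replication(p, 3), 6))
print("(5, 3) dispersal:  storage %.2fx, availability" % (5 / 3),
      round(availability_dispersal(p, 5, 3), 6))
```

With these illustrative numbers, the (5, 3) dispersal needs about 1.67x the file size instead of 3x while keeping availability above 99.8%, which is the kind of trade-off the abstract refers to.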

16.
This survey concerns the role of data structures for compactly storing and representing various types of information in a localized and distributed fashion. Traditional approaches to data representation are based on global data structures, which require access to the entire structure even if the sought information involves only a small and local set of entities. In contrast, localized data representation schemes are based on breaking the information into small local pieces, or labels, selected in a way that allows one to infer information regarding a small set of entities directly from their labels, without using any additional (global) information. The survey concentrates mainly on combinatorial and algorithmic techniques, such as adjacency and distance labeling schemes and interval schemes for routing, and covers complexity results on various applications, focusing on compact localized schemes for message routing in communication networks.
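A classic example of such a localized scheme is interval labeling on trees: each node's label is its DFS entry/exit pair, and ancestry (the basic ingredient of tree routing) is decided from two labels alone, with no access to any global structure. The sketch below is a textbook construction, not one of the survey's specific schemes.

```python
from typing import Dict, List, Tuple

def interval_labels(tree: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Assign each node a [pre, post] DFS interval; u is an ancestor of v
    iff u's interval contains v's."""
    labels: Dict[str, Tuple[int, int]] = {}
    counter = [0]

    def dfs(u: str) -> None:
        start = counter[0]
        counter[0] += 1
        for child in tree.get(u, []):
            dfs(child)
        labels[u] = (start, counter[0])
        counter[0] += 1

    dfs(root)
    return labels

def is_ancestor(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    # Containment test uses only the two local labels.
    return a[0] <= b[0] and b[1] <= a[1]

tree = {"r": ["a", "b"], "a": ["c", "d"], "b": []}
L = interval_labels(tree, "r")
print(is_ancestor(L["r"], L["c"]), is_ancestor(L["a"], L["b"]))   # True False
```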

17.
To address the difficulty of efficiently obtaining information and discovering knowledge in large-scale, distributed, heterogeneous software development, this work introduces the Semantic Web into software engineering and builds fine-grained semantic links over multi-source heterogeneous data. It proposes methods for ontology construction and for link extraction and discovery, enabling the automatic construction of ontology-based linked data for software engineering. The method performs concept extraction and merging, instance resolution, and attribute disambiguation on the software engineering ontology, extracting complete, non-redundant linked data from structured software-repository datasets; it then uses three features (synonyms, verb-object phrases, and structural relations) together with natural language processing (NLP) and information retrieval (IR) techniques to discover latent linked data in software repositories. Experimental results show that the proposed method can automatically construct and merge software engineering ontologies from distributed software engineering datasets and effectively discover latent linked data with which to extend them; compared with the Baseline, Phraing, and O-CSTI methods, it significantly improves the recall, precision, and F-measure of linked-data discovery.

18.
Scientific datasets are often stored on distributed archival storage systems, because geographically distributed sensor devices store the datasets on their local machines and also because the size of scientific datasets demands a large amount of disk space. Multidimensional indexing techniques have been shown to greatly improve range query performance on large scientific datasets. In this paper, we discuss several ways of distributing a multidimensional index in order to speed up access to large distributed scientific datasets. This paper compares the designs, challenges, and problems of distributed multidimensional indexing schemes, and provides a comprehensive performance study of distributed indexing to provide guidelines for choosing a distributed multidimensional index for a specific data analysis application.

19.
Very fast and accurate 3-D capacitance extraction is essential for interconnect optimization in VLSI ultra-deep sub-micron (UDSM) designs. Parallel processing provides an approach to reducing the simulation turn-around time. This paper examines the parallelization of the well-known fast multipole-based 3-D capacitance extraction program FASTCAP, which employs new adaptive and preconditioning techniques. To account for the complicated data dependencies in the unstructured problems, we propose a novel generalized cost function model, which can be used to accurately measure the workload associated with each cube in the hierarchy. We then present two adaptive partitioning schemes, combined with efficient communication mechanisms with bounded buffer size, to reduce the parallel processing overhead. The overall load balance is achieved through balancing the load at each level of the multipole computation. We report detailed performance results on a variety of distributed memory parallel platforms, using standard benchmarks on 3-D capacitance extraction.
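The general pattern of cost-model-driven partitioning can be sketched as follows: estimate a per-cube workload from a simple cost function, then assign cubes greedily (largest first, to the least-loaded process). The cost weights and the greedy rule are illustrative assumptions, not FASTCAP's calibrated model or the paper's partitioning schemes.

```python
import heapq
from typing import Dict, List, Tuple

def estimate_cost(n_panels: int, n_neighbors: int, n_children: int,
                  w_direct: float = 1.0, w_m2l: float = 0.2, w_up: float = 0.05) -> float:
    """A generic cost model: direct interactions dominate, multipole-to-local
    translations and upward passes add smaller terms. Weights are illustrative."""
    return (w_direct * n_panels * n_panels
            + w_m2l * n_neighbors * n_panels
            + w_up * n_children * n_panels)

def partition(costs: Dict[str, float], n_procs: int) -> List[List[str]]:
    """Greedy largest-first assignment to the currently least-loaded process."""
    heap: List[Tuple[float, int]] = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(n_procs)]
    for cube, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(cube)
        heapq.heappush(heap, (load + cost, p))
    return assignment

cubes = {f"cube{i}": estimate_cost(n_panels=10 + 7 * i, n_neighbors=6, n_children=4)
         for i in range(12)}
print(partition(cubes, 3))
```

Balancing each level of the multipole computation separately, as the abstract describes, would amount to running such an assignment per level rather than once globally.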

20.
徐万山  张建标  袁艺林 《软件学报》2023,34(11):5392-5407
Symmetric searchable encryption (SSE) enables retrieval over encrypted data without leaking user privacy, and it has been widely studied and applied in cloud storage. In SSE schemes, however, a semi-honest or dishonest server may tamper with the data in files and return untrustworthy files to the user, so verifying these files is essential. Most existing verifiable SSE schemes perform verification locally at the user, where a malicious user may forge the verification result, so the fairness of verification cannot be guaranteed. Motivated by these observations, this paper proposes a blockchain-based verifiable dynamic symmetric searchable encryption scheme (VDSSE). VDSSE uses symmetric encryption to achieve forward security during dynamic updates. On this basis, it uses the blockchain to verify search results. For the verification process, a new verification tag, Vtag, is proposed; the accumulative property of Vtag allows verification information to be stored in compressed form, reducing its storage overhead on the blockchain and effectively supporting dynamic verification of the SSE scheme. Because the blockchain is tamper-resistant, the fairness of verification is guaranteed. Finally, experimental evaluation and security analysis of VDSSE confirm the feasibility and security of the scheme.
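To illustrate how an accumulative tag can keep on-chain verification state constant-size, here is a toy XOR-of-hashes accumulator: adds and deletes toggle a per-keyword value, and the client re-derives it from the returned identifiers. This is only an illustration of the storage-compression idea and is not the paper's Vtag construction; in particular, a plain XOR accumulator is not secure against adversarially chosen identifiers.

```python
import hashlib
from typing import Dict, Iterable

def h(file_id: str) -> int:
    return int.from_bytes(hashlib.sha256(file_id.encode()).digest(), "big")

class OnChainAccumulator:
    """One constant-size accumulator per (encrypted) keyword, updated on every
    add or delete; stands in for verification state kept on the blockchain."""
    def __init__(self) -> None:
        self.acc: Dict[str, int] = {}

    def update(self, keyword: str, file_id: str) -> None:
        # Add and delete are the same operation: XOR toggles membership.
        self.acc[keyword] = self.acc.get(keyword, 0) ^ h(file_id)

    def verify(self, keyword: str, result_ids: Iterable[str]) -> bool:
        expected = self.acc.get(keyword, 0)
        actual = 0
        for fid in set(result_ids):
            actual ^= h(fid)
        return actual == expected

chain = OnChainAccumulator()
for fid in ("doc1", "doc2", "doc3"):
    chain.update("encrypted-kw-42", fid)
chain.update("encrypted-kw-42", "doc2")                       # delete doc2
print(chain.verify("encrypted-kw-42", ["doc1", "doc3"]))      # True
print(chain.verify("encrypted-kw-42", ["doc1"]))              # False: tampered result
```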
