Similar documents
20 similar documents found
1.
This paper describes InfiniteDB, a computer-cluster-based parallel database management system (DBMS) developed by the authors. InfiniteDB aims to support data-intensive computing efficiently, in response to rapidly growing database sizes and the need for high-performance analysis of massive databases. It can run efficiently on computing systems composed of thousands of computers, such as cloud computing systems. It supports intra-query, inter-query, intra-operation, inter-operation and pipelined parallelism. It provides effective strategies for managing massive databases, including multiple data-declustering methods, declustering-aware algorithms for relational and other database operations, and an adaptive query optimization method. It also provides parallel data warehousing and data mining functions, a coordinator-wrapper mechanism for integrating heterogeneous information resources on the Internet, and fault-tolerant and resilient infrastructures. It has been used in many applications and has proved quite effective for data-intensive computing.
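The abstract does not give implementation details; as a loose, hypothetical illustration of the hash-based data declustering idea it mentions (node count, key choice and all names are invented here, not taken from InfiniteDB), a minimal Python sketch:

```python
# Minimal sketch of hash-based data declustering: spread the tuples of a
# relation across cluster nodes so relational operators can run in parallel.
# The node count and partitioning key are illustrative only.
from collections import defaultdict
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Map a partitioning key to a node id via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

def decluster(tuples, key_index=0):
    """Partition tuples by hashing the chosen attribute."""
    partitions = defaultdict(list)
    for row in tuples:
        partitions[node_for(str(row[key_index]))].append(row)
    return partitions

if __name__ == "__main__":
    rows = [("alice", 10), ("bob", 20), ("carol", 30), ("dave", 40)]
    for node, part in sorted(decluster(rows).items()):
        print(f"node {node}: {part}")
```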

2.

3.
Compared with public cloud computing, private cloud computing systems for tasks that are both data- and compute-intensive place higher demands on computational efficiency and system-management efficiency, and current public cloud systems are too complex and cumbersome for this purpose; a simple, easy-to-use private cloud implementation suited to data- and compute-intensive tasks is therefore needed. Drawing on the theory and implementation methods of public cloud computing, this paper proposes an implementation scheme for a private cloud computing system aimed at such tasks. The scheme describes a user's computation task with a job file that specifies the task's computation model and its input and output files; in view of the characteristics of a private cloud, it simplifies the MapReduce parallel processing framework of Google's cloud computing system into a more intuitive data computation model; and it automatically …
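The abstract is cut off before describing its simplified model in full; as a generic, hedged illustration of the MapReduce idea it builds on (not the authors' scheme), a minimal in-process word-count sketch:

```python
# Generic MapReduce word count, run sequentially in one process purely to
# illustrate the map -> shuffle -> reduce data flow; it is not the simplified
# framework described in the abstract.
from collections import defaultdict

def map_phase(document: str):
    """Emit (word, 1) pairs for each word in the document."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    """Group intermediate pairs by key and sum the counts."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

if __name__ == "__main__":
    docs = ["private cloud computing", "cloud computing for data intensive tasks"]
    intermediate = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(intermediate))
```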

4.
Volunteer computing systems offer high computing power to scientific communities for running large, data-intensive scientific workflows. However, these environments provide only a best-effort infrastructure for executing high-performance jobs. This work aims to schedule scientific, data-intensive workflows on a hybrid of volunteer computing systems and Cloud resources in order to improve the utilization of these environments and increase the percentage of workflows that meet their deadlines. The proposed workflow scheduling system partitions a workflow into sub-workflows so as to minimize data dependencies among them. These sub-workflows are then distributed over volunteer resources according to resource proximity and a load-balancing policy, and the execution time of each sub-workflow on the selected volunteer resources is estimated in this phase. If a sub-workflow would miss its sub-deadline because of long waiting times, it is re-scheduled onto public Cloud resources. This re-scheduling improves system performance by increasing the percentage of workflows that meet the deadline. The proposed Cloud-aware data-intensive scheduling algorithm increases the percentage of workflows that meet the deadline by 75% on average compared with executing the workflows on volunteer resources alone.
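A rough sketch of the deadline-driven re-scheduling decision described above; the estimation logic, speeds and names are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of a deadline-driven re-scheduling decision: a
# sub-workflow estimated to finish after its sub-deadline on volunteer
# resources is moved to (faster, paid) Cloud resources instead.
from dataclasses import dataclass

@dataclass
class SubWorkflow:
    name: str
    work_units: float    # abstract amount of computation
    sub_deadline: float  # seconds from now

def estimated_finish(sw: SubWorkflow, speed: float, waiting_time: float) -> float:
    """Very rough completion-time estimate: queue wait plus work/speed."""
    return waiting_time + sw.work_units / speed

def choose_target(sw: SubWorkflow, volunteer_speed=1.0, volunteer_wait=30.0,
                  cloud_speed=4.0, cloud_wait=5.0) -> str:
    """Keep the sub-workflow on volunteers unless it would miss its sub-deadline."""
    if estimated_finish(sw, volunteer_speed, volunteer_wait) <= sw.sub_deadline:
        return "volunteer"
    # Otherwise fall back to the Cloud, even if the deadline may still be missed.
    return "cloud"

if __name__ == "__main__":
    for sw in [SubWorkflow("preprocess", 20.0, 60.0), SubWorkflow("analyze", 200.0, 120.0)]:
        print(sw.name, "->", choose_target(sw))
```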

5.
As scientific research becomes more data intensive, there is an increasing need for scalable, reliable, and high-performance storage systems. Such data repositories must provide both data archival services and rich metadata, and cleanly integrate with large-scale computing resources. ROARS is a hybrid approach to distributed storage that provides both large, robust, scalable storage and efficient rich-metadata queries for scientific applications. In this paper, we present the design and implementation of ROARS, focusing primarily on the challenge of maintaining data integrity across long time scales. We evaluate the performance of ROARS on a storage cluster, comparing it to the Hadoop distributed file system and a centralized file server. We observe that ROARS has read and write performance that scales with the number of storage nodes, and integrity checking that scales with the size of the largest node. We demonstrate the ability of ROARS to function correctly through multiple system failures and reconfigurations. ROARS has been in production use for over three years as the primary data repository for a biometrics research lab at the University of Notre Dame.

6.
7.
Energy awareness is an important aspect of modern network and computing system design and management, especially for internet-scale networks and data-intensive large-scale distributed computing systems. The main challenge is to design and develop novel technologies, architectures and methods that reduce energy consumption in such infrastructures, which in turn reduces the total cost of running a network. Energy-aware network components, together with new control and optimization strategies, can save energy across the whole system by adapting network capacity and resources to the actual traffic load and demands while ensuring end-to-end quality of service. In this paper, we design and develop a two-level control framework for reducing power consumption in computer networks. Its implementation provides local control mechanisms at the network-device level and network-wide control strategies at the central control level. We also develop network-wide optimization algorithms that calculate the power settings of energy-consuming network components and energy-aware routing for the recommended network configuration. The utility and efficiency of the framework have been verified by simulation and by laboratory tests. The test cases were carried out on a number of synthetic as well as real network topologies, giving encouraging results. The paper concludes with well-justified recommendations for energy-aware computer network design.
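The abstract does not state the underlying optimization model; one illustrative (not the authors') formulation of energy-aware routing with on/off link power settings is:

```latex
% Illustrative formulation only, not the model used in the paper: route all
% demands while minimizing the power drawn by the links left switched on.
\begin{aligned}
\min_{x,\,y} \quad & \sum_{\ell \in L} P_\ell \, y_\ell
  && \text{total power of active links} \\
\text{s.t.} \quad & \sum_{\ell \in \delta^+(v)} x_\ell^{d} - \sum_{\ell \in \delta^-(v)} x_\ell^{d} = b_v^{d}
  && \forall v \in V,\ \forall d \in D \quad \text{(flow conservation)} \\
& \sum_{d \in D} x_\ell^{d} \le c_\ell \, y_\ell
  && \forall \ell \in L \quad \text{(capacity available only if the link is on)} \\
& x_\ell^{d} \ge 0, \qquad y_\ell \in \{0,1\}.
\end{aligned}
```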

8.
Non-Gaussian spatial data are common in many sciences, such as the environmental sciences, biology and epidemiology. Spatial generalized linear mixed models (SGLMMs) are flexible models for these types of data. Maximum likelihood estimation in SGLMMs is usually cumbersome because of the high-dimensional intractable integrals in the likelihood function, and therefore the most commonly used approach for estimating SGLMMs is Bayesian. This paper proposes a computationally efficient strategy to fit SGLMMs based on the data cloning (DC) method suggested by Lele et al. (2007). The method uses Markov chain Monte Carlo simulations from an artificially constructed distribution to calculate the maximum likelihood estimates and their standard errors. In this paper, the DC method is adapted and generalized to estimate SGLMMs, and some of its asymptotic properties are explored. The performance of the method is illustrated with simulated binary and Poisson count data and with data on car accidents in Mashhad, Iran. The focus is inference in SGLMMs for small and medium data sets.
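For orientation, the general idea behind data cloning (as introduced by Lele et al., 2007), stated generically rather than in this paper's SGLMM-specific form:

```latex
% General property of data cloning, not this paper's SGLMM-specific derivation:
% the "cloned" posterior pretends the data were observed K independent times,
\pi_K(\theta \mid y) \;\propto\; \bigl[L(\theta; y)\bigr]^{K}\,\pi(\theta),
% and, as K grows, it concentrates around the maximum likelihood estimate, so
% MCMC output on \pi_K recovers the MLE and its standard errors via
\hat{\theta}_{\mathrm{MLE}} \;\approx\; \mathrm{E}_{\pi_K}[\theta],
\qquad
\widehat{\mathrm{Var}}\bigl(\hat{\theta}_{\mathrm{MLE}}\bigr) \;\approx\; K\,\mathrm{Var}_{\pi_K}(\theta).
```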

9.
Sampling from a multimodal and high-dimensional target distribution poses a great challenge in Bayesian analysis. A new Markov chain Monte Carlo algorithm, Distributed Evolutionary Monte Carlo (DGMC), is proposed for real-valued problems; it combines attractive features of the distributed genetic algorithm and Markov chain Monte Carlo. The DGMC algorithm evolves a population of Markov chains through genetic operators to simulate the target distribution. Theoretical justification proves that the DGMC algorithm has the target distribution as its stationary distribution. The effectiveness of the algorithm is illustrated by simulating two multimodal distributions and by an application to a real data example.

10.
11.
Automated context aggregation and file annotation for PAN-based computing
This paper presents a method for automatically annotating files created on portable devices with contextual metadata. We achieve this through the combination of two system components. One is a context dissemination mechanism that allows devices in a personal area network (PAN) to maintain a shared aggregate contextual perception. The other is a storage management system that uses this context information to automatically decorate files created on personal devices with annotations. As a result, the user can flexibly browse and look up files generated on the move, based on the contextual situation at the time of their creation. Equally important, the user is relieved of the cumbersome task of manually providing annotations in an explicit fashion. This is especially valuable when generating files on the move using UI-restricted portable devices.
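A toy, hypothetical illustration of the annotation idea (the sidecar-file convention and all field names are invented here, not the mechanism of this system):

```python
# Toy illustration of context-based file annotation: when a file is created,
# write a sidecar JSON file next to it recording the device's current
# contextual situation. The sidecar convention and field names are invented.
import json
import time
from pathlib import Path

def current_context() -> dict:
    """Stand-in for the shared aggregate context a PAN would maintain."""
    return {
        "timestamp": time.time(),
        "location": "office-2F",                      # e.g. from a location beacon
        "nearby_devices": ["phone-anna", "laptop-bob"],
        "activity": "meeting",
    }

def create_annotated_file(path: str, payload: bytes) -> None:
    """Create the file and decorate it with the current context."""
    p = Path(path)
    p.write_bytes(payload)
    Path(str(p) + ".context.json").write_text(json.dumps(current_context(), indent=2))

if __name__ == "__main__":
    create_annotated_file("photo_0001.jpg", b"...binary data...")
```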

12.
This paper proposes and describes TLSA (tree-based layered sharing and aggregation), a system for sharing and aggregating computing resources based on a tree-shaped hierarchical structure. The TLSA system is composed of idle nodes in a peer-to-peer environment, organized into a B-tree-like hierarchy that automatically rebalances itself as nodes join and leave. The tree-shaped network topology is maintained by a self-organizing availability protocol, which keeps message traffic low and the processor load balanced. Through an internal resource-discovery protocol, a node can find the nearest suitable idle computing resources in the system to execute large numbers of subtasks. Simulation results show that, for large-scale subtasks, TLSA can locate idle resources in a very short time with message traffic no greater than O(log_m N); the system thus features low message traffic, decentralization, scalability and self-organization.

13.
The communication pattern of graph-computing applications is dominated by fine-grained point-to-point messages that are random in both time and space, but the network systems of existing high-performance computers handle large volumes of fine-grained communication poorly, which degrades overall performance. Although communication optimization at the application level can effectively improve the performance of graph-computing applications, it places a heavy burden on application developers. This paper therefore proposes and implements a structurally dynamic message-aggregation technique that builds a virtual topology to add intermediate points along communication paths, thereby improving the effectiveness of message aggregation …

14.
Power efficiency must be investigated at every level of a High Performance Computing (HPC) system because of the increasing computational demands of scientific and engineering applications. Focusing on the critical software-level design constraints of a parallel system composed of huge numbers of power-hungry components, we optimize HPC program design to achieve the best possible power performance on the target hardware platform. The power performance of a CUDA Processing Element (PE) is determined by hardware factors, including the power features of each component (CPU, GPU, main memory and PCI buses) and their interconnection architecture, and by software factors, including the algorithm design and the characteristics of the executable instructions it runs. In this paper, approaches to modelling and evaluating the power consumption of large-scale SIMD computation by CUDA PEs on multi-core and GPU platforms are introduced. The model provides design characteristic values at an early programming stage, giving programmers the environment information needed to choose the most power-efficient alternative. Based on the model, CPU dynamic frequency scaling (DFS) can be applied to the CUDA PE architecture, adjusting the CPU frequency to enhance the power efficiency of the entire PE without compromising its computing performance. The power model and the power-efficiency improvements of the new designs have been validated by measuring the new programs on a real GPU multiprocessing system.
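The paper's exact power model is not reproduced in the abstract; a generic component-wise energy model of the kind alluded to, with illustrative symbols rather than the authors' notation, might read:

```latex
% Generic component-wise energy model for one processing element (PE);
% the decomposition and symbols are illustrative, not the paper's model.
E_{\mathrm{PE}}
  \;=\; \sum_{c \,\in\, \{\mathrm{CPU},\,\mathrm{GPU},\,\mathrm{MEM},\,\mathrm{PCI}\}}
        \int_{0}^{T} P_c(t)\,\mathrm{d}t
  \;\approx\; \sum_{c} \bigl(P_c^{\mathrm{idle}}\,T + P_c^{\mathrm{busy}}\,t_c^{\mathrm{busy}}\bigr),
\qquad
P_{\mathrm{CPU}}^{\mathrm{dyn}} \;\propto\; C\,V^{2} f
\;\;\text{(motivating DFS: lower } f \text{ while the GPU is the bottleneck).}
```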

15.
In this study we tested a Bayesian model based on a conjugate gamma/Poisson pair associated with environmental variables derived from satellite data, namely sea surface temperature (SST) and its gradient fields from the Moderate Resolution Imaging Spectroradiometer (MODIS)/Terra, chlorophyll-a concentration from the Sea-viewing Wide Field-of-view Sensor (SeaWiFS)/SeaStar, and surface winds and Ekman pumping from SeaWinds/Quick Scatterometer (QuikSCAT), to predict weekly catch estimates of skipjack tuna in the South Brazil Bight. This was achieved by confronting the fishery data with the model estimates and regressing the results on the satellite data. The fishery data were expressed as an index of catch per unit effort (CPUE), calculated as the weight of fish caught (in tonnes) per fishing week, and were divided into two series: a historical series (1996–1998; 2001) and a validation year (2002). The model's CPUE estimates are in good agreement with the historical weekly CPUE, and the updated weekly estimates explained up to 62% of the weekly CPUE in 2002. In general, the best proxy for the Bayesian weekly estimates is the zonal SST gradient field. The results refine previous knowledge of the influence of SST on the occurrence of skipjack tuna.
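For reference, the standard gamma/Poisson conjugate update, stated generically and without the paper's satellite-covariate structure:

```latex
% Standard gamma/Poisson conjugate update (rate parameterization), stated
% generically; the paper's link to the satellite covariates is not shown.
\lambda \sim \mathrm{Gamma}(\alpha, \beta), \qquad
y_i \mid \lambda \stackrel{\text{iid}}{\sim} \mathrm{Poisson}(\lambda), \quad i = 1,\dots,n
\;\;\Longrightarrow\;\;
\lambda \mid y_{1:n} \sim \mathrm{Gamma}\!\Bigl(\alpha + \textstyle\sum_{i=1}^{n} y_i,\; \beta + n\Bigr),
\qquad
\mathrm{E}[\lambda \mid y_{1:n}] = \frac{\alpha + \sum_i y_i}{\beta + n}.
```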

16.
This paper focuses on the Bayesian posterior mean estimates (Bayes estimates) of the parameter set of Poisson hidden Markov models, in which the observation sequence is generated by a Poisson distribution whose parameter depends on the underlying discrete-time, time-homogeneous Markov chain. Although the most commonly used procedures for obtaining parameter estimates for hidden Markov models are versions of the expectation-maximization and Markov chain Monte Carlo approaches, this paper exhibits an algorithm for calculating the exact posterior mean estimates which, although still cumbersome, has polynomial rather than exponential complexity, and is a feasible alternative for small-scale models and data sets. The paper also presents simulation results comparing the posterior mean estimates obtained by this algorithm with the maximum likelihood estimates obtained by the expectation-maximization approach.
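For concreteness, a Poisson hidden Markov model in standard (not necessarily the paper's) notation:

```latex
% Standard Poisson hidden Markov model, written in generic notation.
X_t \in \{1,\dots,m\}, \qquad
\Pr(X_{t+1} = j \mid X_t = i) = a_{ij} \quad \text{(time-homogeneous chain)},
\qquad
Y_t \mid X_t = i \;\sim\; \mathrm{Poisson}(\lambda_i),
\qquad
\theta = \bigl(\{a_{ij}\},\, \{\lambda_i\},\, \text{initial law of } X_1\bigr).
```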

17.
We propose a novel framework for generating classification rules from relational data. It is a specialized version of a general framework for mining relational data defined in granular computing theory. In the framework proposed here, we define a method for deriving information granules from relational data; such granules are the basis for generating relational classification rules. In our approach we follow the granular computing idea of switching between different levels of granularity of the universe. Thanks to this, a granule-based relational data representation can easily be replaced by another and thereby adjusted to a given data mining task, e.g. classification. A generalized relational data representation, as defined in the framework, can be treated as the search space for generating rules, so the size of the search space may be significantly limited. Furthermore, our framework, unlike others, unifies not only the way the data and the rules to be derived are expressed and specified, but also, in part, the process of generating rules from the data: the rules can be obtained directly from the information granules or constructed based on them.

18.
To address the low performance of aggregation computations in distributed databases used for analytical applications, this paper takes MongoDB as a case study and proposes a method for improving database performance based on shard keys and indexes. First, the choice of shard-key field is guided by an analysis of the workload characteristics; this field must ensure that the data are laid out evenly across the shard nodes. Second, by studying the indexing efficiency of the distributed database, the method further improves computational performance by dropping the index on the queried field, which lets the hardware resources be used fully for the aggregation computation. Experimental results show that a high-cardinality, fine-grained shard key distributes the data evenly across the data nodes of the cluster, that abandoning the index in favour of full-collection scans effectively speeds up aggregation, and that the proposed optimization method effectively improves aggregation performance.
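A rough pymongo sketch of the two steps described above (hashed shard key, then dropping the index on the aggregated field), assuming a running sharded cluster; the database, collection and field names are invented:

```python
# Rough pymongo sketch of the two tuning steps mentioned in the abstract:
# (1) shard the collection on a high-cardinality hashed key so documents
#     spread evenly across shard nodes, and
# (2) drop the index on the aggregated field so the aggregation runs as a
#     full collection scan (the behaviour the abstract reports to be faster).
# Database, collection and field names are invented for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos router of the cluster

# (1) Enable sharding and shard on a hashed, high-cardinality key.
client.admin.command("enableSharding", "analytics")
client.admin.command("shardCollection", "analytics.events",
                     key={"user_id": "hashed"})

coll = client["analytics"]["events"]

# (2) Drop any index covering the field used by the aggregation.
for index in coll.list_indexes():
    if "amount" in index["key"]:
        coll.drop_index(index["name"])

# Run the aggregation (group by region, sum amounts).
pipeline = [{"$group": {"_id": "$region", "total": {"$sum": "$amount"}}}]
for doc in coll.aggregate(pipeline):
    print(doc)
```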

19.
Research on a self-learning algorithm for Bayesian network structure oriented towards context-aware computing
Based on a detailed analysis of the characteristics of context in context-aware computing, this paper proposes a general self-learning method for Bayesian network structure oriented towards context-aware computing. Given sufficient instance data, the method automatically learns the relationships among contexts in context-aware computing and builds a Bayesian network structure that is used to derive high-level context from low-level context. By effectively exploiting the hierarchical nature of context in context-aware computing, the method substantially optimizes Bayesian network structure learning. Analysis shows that the method significantly reduces the time complexity of the Bayesian network learning process.

20.
Bayesian networks have received much attention in the recent literature. In this article, we propose an approach to learning Bayesian networks using the stochastic approximation Monte Carlo (SAMC) algorithm. Our approach has two nice features. First, it possesses a self-adjusting mechanism and thus essentially avoids the local-trap problem suffered by conventional MCMC simulation-based approaches to learning Bayesian networks. Second, it falls into the class of dynamic importance sampling algorithms: network features can be inferred by dynamically weighted averaging of the samples generated in the learning process, and the resulting estimates can have much lower variation than single-model-based estimates. The numerical results indicate that our approach can mix much faster over the space of Bayesian networks than conventional MCMC simulation-based approaches.
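As background on the dynamically weighted averaging mentioned above, the generic SAMC estimator (not notation specific to this article) re-weights each sample by the adaptive factor of its subregion:

```latex
% Generic SAMC re-weighting: x_t are the samples, J(x_t) the index of the
% subregion containing x_t, \theta_t^{(J(x_t))} the adaptive log-weight of
% that subregion at iteration t, and h the network feature of interest.
\widehat{\mathrm{E}}_{\pi}\bigl[h(x)\bigr]
  \;=\;
  \frac{\sum_{t=1}^{T} e^{\theta_t^{(J(x_t))}}\, h(x_t)}
       {\sum_{t=1}^{T} e^{\theta_t^{(J(x_t))}}}.
```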
