1.
One-Versus-All (OVA) classification is a classifier construction method in which a k-class prediction task is decomposed into k two-class sub-problems. One base model is constructed for each sub-problem, and the base models are then combined into a single model. Aggregate modeling is the general process of constructing several base models that are then combined into one model for prediction; in essence, OVA classification is a form of aggregate modeling. This paper reports studies conducted to establish whether OVA classification can provide predictive performance gains when large volumes of data are available for modeling, as is commonly the case in data mining. It is demonstrated that, firstly, OVA modeling can be used to increase the total amount of training data used while keeping each base model's training set much smaller than the total amount of available training data. Secondly, OVA models created from large datasets provide a higher level of predictive performance than single k-class models. Thirdly, boosted OVA base models can provide higher predictive performance than un-boosted OVA base models. Fourthly, when the combination algorithm for base model predictions is able to resolve tied predictions, the resulting aggregate models provide a higher level of predictive performance.
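A minimal sketch of the OVA idea described above, not the paper's exact setup: k binary base models are trained on "class c vs. rest" relabelings and combined by taking the class whose base model is most confident, which also resolves ties. The logistic-regression base learner and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_informative=8, n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classes = np.unique(y_tr)
base_models = {}
for c in classes:
    # Each sub-problem sees a binary relabeling of the training set; in the
    # paper's setting one could subsample here so that each base training set
    # stays much smaller than the full data.
    base_models[c] = LogisticRegression(max_iter=1000).fit(X_tr, y_tr == c)

# Aggregate prediction: the class whose base model gives the highest
# "positive" probability wins, which also breaks ties between base models.
scores = np.column_stack([base_models[c].predict_proba(X_te)[:, 1] for c in classes])
y_pred = classes[scores.argmax(axis=1)]
print("OVA accuracy:", (y_pred == y_te).mean())
```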
2.
Gijs Rennen 《Structural and Multidisciplinary Optimization》2009,38(6):545-569
When building a Kriging model, the general intuition is that using more data will always result in a better model. However, we show that when a large non-uniform dataset is available, using a uniform subset of it can have several advantages: it reduces the time needed to fit the model, avoids numerical inaccuracies, and improves robustness with respect to errors in the output data. We furthermore describe several new and existing methods for selecting a uniform subset. These methods are tested and compared on several artificial datasets and one real-life dataset. The comparison shows how the selected subsets affect different aspects of the resulting Kriging model. As none of the subset selection methods performs best on all criteria, the best method to choose depends on how the different aspects are weighted. The comparison made in this paper can help the user make a well-informed choice.
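As an illustration of what "selecting a uniform subset" can look like, the sketch below uses greedy maximin (farthest-point) selection, a common space-filling heuristic. It is an assumed example, not necessarily one of the specific methods compared in the paper.

```python
import numpy as np

def maximin_subset(X, k, seed=0):
    """Greedily pick k points that spread out over the input space."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]             # start from a random point
    d = np.linalg.norm(X - X[idx[0]], axis=1)     # distance to the chosen set
    for _ in range(k - 1):
        nxt = int(d.argmax())                     # farthest point from the set so far
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)

X = np.vstack([np.random.randn(900, 2) * 0.2,          # dense cluster
               np.random.uniform(-3, 3, (100, 2))])    # sparse background
subset = maximin_subset(X, k=50)
# A Kriging / Gaussian-process model would then be fitted on X[subset]
# instead of the full, non-uniform X.
```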
3.
Asil Oztekin 《Information Systems Frontiers》2018,20(2):223-238
This study aims to determine the future share net inflows and outflows of Exchange Traded Funds (ETFs). Net flows are closely related to investor perception of the future and past performance of mutual funds. The net flows of Exchange Traded Funds, however, are expected to be less related to overall fund performance and based instead on the characteristics that make a fund attractive to an individual investor. To explore the relationship between investors' perception of ETFs and subsequent net flows, this study is designed to shed light on the multifaceted linkages between fund characteristics and net flows. A meta-classification predictive modeling approach designed for use with large data sets is presented, and its implementation and results are discussed. Fifteen carefully selected attributes of each fund, those most likely to contribute to fund inflows and outflows, are used in the analyses. The large data set calls for a robust, systematic approach to identifying the fund attributes that best predict future inflows and outflows. The predictive performance of the proposed decision analytic methodology was assessed via 10-fold cross-validation, which yielded very promising results.
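The abstract does not name the base learners, so the sketch below only illustrates the general pattern of a meta-classifier evaluated with 10-fold cross-validation: several base models are combined by soft voting over fifteen (here synthetic) fund attributes. The attributes and estimators are stand-ins, not the study's models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 15 synthetic "fund attributes"; the target stands in for inflow vs. outflow.
X, y = make_classification(n_samples=2000, n_features=15, random_state=1)

meta = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=5)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="soft",                      # combine base models by averaged probabilities
)
scores = cross_val_score(meta, X, y, cv=10)   # 10-fold cross-validation
print("mean accuracy:", scores.mean())
```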
4.
Robert Van Dam Irene Langkilde-Geary Dan Ventura 《Knowledge and Information Systems》2013,35(3):525-552
The ADtree, a data structure useful for caching sufficient statistics, has been successfully adapted to grow lazily when memory is limited and to update sequentially with an incrementally updated dataset. However, even these modified forms of the ADtree still exhibit inefficiencies in terms of both space usage and query time, particularly on datasets with very high dimensionality and with high-arity features. We propose four modifications to the ADtree, each of which can be used to improve size and query time under specific types of datasets and features. These modifications also provide an increased ability to precisely control how an ADtree is built and to tune its size given external memory or speed requirements.
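To make the "cache sufficient statistics" idea concrete, here is a toy count-caching tree in the spirit of the ADtree: counts of all attribute-value prefixes (in a fixed attribute order) are precomputed so that conjunctive count queries over that prefix order are answered from the cache. It ignores the most-common-value trick, lazy expansion and the paper's four modifications.

```python
from collections import defaultdict

def build_count_tree(rows):
    """rows: list of tuples of discrete feature values, all the same length."""
    tree = defaultdict(int)
    for row in rows:
        prefix = ()
        tree[prefix] += 1                 # count of the empty query (all rows)
        for value in row:
            prefix = prefix + (value,)
            tree[prefix] += 1             # count of rows matching this value prefix
    return tree

rows = [(0, 1, 1), (0, 1, 0), (1, 0, 1), (0, 1, 1)]
tree = build_count_tree(rows)
print(tree[(0, 1)])   # rows with feature0 == 0 and feature1 == 1  -> 3
print(tree[()])       # total number of rows                        -> 4
```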
5.
6.
Kernel matching pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of basis functions selected from a kernel-induced dictionary. Here we propose a modification of the kernel matching pursuit algorithm that aims to make the method practical for large datasets. Starting from an approximating algorithm, the weak greedy algorithm, we introduce a stochastic method for reducing the search space at each iteration. We then study the implications of using an approximate algorithm and show how one can control the trade-off between accuracy and resource requirements. Finally, we present experiments performed on a large dataset that support our approach and illustrate its applicability.
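A minimal sketch of the idea, assuming an RBF dictionary and least-squares back-fitting: at each iteration only a random subset of candidate kernel centres is examined (the stochastic, "weak greedy" search-space reduction), the centre most correlated with the residual is added, and the weights are refit. Hyperparameters and the refitting strategy are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def rbf(X, centers, gamma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def stochastic_kmp(X, y, n_atoms=20, n_candidates=50, seed=0):
    rng = np.random.default_rng(seed)
    chosen, w = [], None
    residual = y.astype(float).copy()
    for _ in range(n_atoms):
        cand = rng.choice(len(X), size=n_candidates, replace=False)   # reduced search space
        K = rbf(X, X[cand])
        scores = np.abs(K.T @ residual) / (np.linalg.norm(K, axis=0) + 1e-12)
        chosen.append(cand[int(scores.argmax())])                     # best-correlated atom
        Kc = rbf(X, X[chosen])
        w, *_ = np.linalg.lstsq(Kc, y, rcond=None)                    # refit all weights
        residual = y - Kc @ w
    return np.array(chosen), w

X = np.random.randn(1000, 2)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
centers, w = stochastic_kmp(X, y)
print("training accuracy:", (np.sign(rbf(X, X[centers]) @ w) == y).mean())
```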
7.
Gaussian process classification is a supervised learning algorithm that has attracted wide attention in the machine learning community in recent years. Under a Gaussian process prior, the algorithm maximizes the posterior probability to obtain a prediction for a new sample together with the probability of that prediction. Taking the characteristics of image data into account, this paper proposes a method for applying Gaussian processes to image classification and, on this basis, a scheme for ranking images. Experiments on a public image dataset, with a support vector machine classifier used for comparison, confirm the method's effectiveness and offer a useful direction for improving image classification techniques.
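A minimal sketch of Gaussian process classification with probability-based ranking; the "image features" here are synthetic stand-ins for whatever descriptors would be extracted from the images, and the kernel choice is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)  # stand-in image features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0)).fit(X_tr, y_tr)
proba = gpc.predict_proba(X_te)[:, 1]          # probability of the positive class
print("accuracy:", gpc.score(X_te, y_te))

ranking = np.argsort(-proba)                   # rank images by predicted class probability
print("top-5 test images for the positive class:", ranking[:5])
```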
8.
Crime activities are geospatial phenomena and as such are geospatially, thematically and temporally correlated. We analyze crime datasets in conjunction with socio-economic and socio-demographic factors to discover co-distribution patterns that may contribute to the formation of crime. We propose a graph-based dataset representation that allows us to extract patterns from heterogeneous areal aggregated datasets and to visualize the resulting patterns efficiently. We demonstrate our approach on real crime datasets and provide a comparison with other techniques.
9.
In data warehousing applications, numerous OLAP queries involve the processing of holistic aggregators such as the top n, the median and quantiles. In this paper, we present a novel approach called dynamic bucketing to efficiently evaluate these aggregators. We partition data into equi-width buckets and further partition dense buckets into sub-buckets as needed by allocating and reclaiming memory space. The bucketing process dynamically adapts to the order and distribution of the input datasets. The histograms of the buckets and sub-buckets are stored in our new data structure called structure trees. A recent selection algorithm based on regular sampling is generalized and its analysis extended. We have also compared our new algorithms with this generalized algorithm and several other recent algorithms. Experimental results show that our new algorithms significantly outperform prior ones not only in runtime but also in accuracy.
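A simplified two-pass sketch of bucketed quantile estimation: a first pass builds an equi-width histogram and flags dense buckets, a second pass refines dense buckets into sub-buckets, and the quantile is read off the cumulative histogram. The paper's method is one-pass and adapts its buckets dynamically (the structure tree); this only illustrates the bucket/sub-bucket idea, with thresholds chosen for illustration.

```python
import numpy as np

def bucketed_quantile(values, q, n_buckets=64, n_sub=16, dense_threshold=0.05):
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_buckets + 1)
    counts, _ = np.histogram(values, bins=edges)
    dense = np.nonzero(counts > dense_threshold * len(values))[0]

    # Refine each dense bucket into finer sub-buckets.
    fine_edges = [edges]
    for b in dense:
        fine_edges.append(np.linspace(edges[b], edges[b + 1], n_sub + 1))
    all_edges = np.unique(np.concatenate(fine_edges))
    counts, _ = np.histogram(values, bins=all_edges)

    # Read the q-quantile from the cumulative histogram: midpoint of the
    # (sub-)bucket that contains the target rank.
    target = q * len(values)
    b = min(int(np.searchsorted(np.cumsum(counts), target)), len(counts) - 1)
    return 0.5 * (all_edges[b] + all_edges[b + 1])

data = np.random.lognormal(size=100_000)
print("estimated median:", bucketed_quantile(data, 0.5))
print("exact median:    ", np.median(data))
```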
10.
João P. Papa Alexandre X. Falcão Victor Hugo C. de Albuquerque João Manuel R.S. Tavares 《Pattern recognition》2012,45(1):512-520
Today's data acquisition technologies produce large datasets with millions of samples for statistical analysis. This creates a tremendous challenge for pattern recognition techniques, which need to become more efficient without losing effectiveness. We have tried to circumvent the problem by reducing it to the fast computation of an optimum-path forest (OPF) in a graph derived from the training samples. In this forest, each class may be represented by multiple trees rooted at some representative samples. The forest is a classifier that assigns to a new sample the label of its most strongly connected root. The methodology has been successfully used with different graph topologies and learning techniques. In this work, we focus on one of the supervised approaches, which has offered considerable advantages over Support Vector Machines and Artificial Neural Networks in handling large datasets. We propose (i) a new algorithm that speeds up classification and (ii) a solution that reduces the training set size with negligible effect on classification accuracy, further increasing efficiency. Experimental results show improvements with respect to our previous approach and advantages over other existing methods, which make the new method a valuable contribution for large dataset analysis.
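A heavily simplified OPF sketch, for orientation only: prototypes are taken here as one arbitrary sample per class (the real algorithm derives them from a minimum spanning tree between classes), training computes each sample's optimum-path cost from the prototypes under the f_max path cost, and a new sample receives the label of the training node that offers it the cheapest extension of an optimum path. None of the paper's speed-ups appear here.

```python
import heapq
import numpy as np

def train_opf(X, y, prototypes):
    n = len(X)
    cost = np.full(n, np.inf)
    label = y.copy()
    cost[prototypes] = 0.0
    heap = [(0.0, int(p)) for p in prototypes]
    done = np.zeros(n, dtype=bool)
    while heap:
        c, s = heapq.heappop(heap)
        if done[s]:
            continue
        done[s] = True
        d = np.linalg.norm(X - X[s], axis=1)
        new_cost = np.maximum(c, d)                  # f_max path cost through s
        for t in np.nonzero(new_cost < cost)[0]:
            cost[t] = new_cost[t]
            label[t] = label[s]                      # conquered by s's tree
            heapq.heappush(heap, (float(cost[t]), int(t)))
    return cost, label

def classify_opf(X_train, cost, label, X_new):
    preds = []
    for x in X_new:
        d = np.linalg.norm(X_train - x, axis=1)
        t = int(np.argmin(np.maximum(cost, d)))      # cheapest optimum-path extension
        preds.append(label[t])
    return np.array(preds)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])
y = np.array([0] * 150 + [1] * 150)
prototypes = np.array([0, 150])                      # one (arbitrary) prototype per class
cost, label = train_opf(X, y, prototypes)
print("training accuracy:", (classify_opf(X, cost, label, X) == y).mean())
```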
11.
Quentin Baert Anne-Cécile Caron Maxime Morge Jean-Christophe Routier 《Knowledge and Information Systems》2018,54(3):591-615
MapReduce is a design pattern for processing large datasets distributed over a cluster. Its performance depends on the data structure and the runtime environment. Indeed, data skew can yield an unfair task allocation, but even when the initial allocation produced by the partition function is well balanced, an unfair allocation can occur during the reduce phase due to the heterogeneous performance of nodes. For these reasons, we propose an adaptive multi-agent system. In our approach, the reducer agents interact during the job, and task reallocation is based on negotiation in order to decrease the workload of the most loaded reducer and thus the runtime. In this paper, we propose and evaluate two negotiation strategies. Finally, we experiment with our multi-agent system on real-world datasets over a heterogeneous runtime environment.
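A toy simulation of negotiation-based task reallocation among reducer agents: the most loaded reducer repeatedly proposes to delegate one of its tasks, and the least loaded reducer accepts if that lowers its peer's load below its own. The proposal strategy and cost model are invented for illustration, not the paper's two negotiation strategies.

```python
def negotiate(workloads):
    """workloads: list of lists of task costs, one list per reducer agent."""
    def load(i):
        return sum(workloads[i])

    improved = True
    while improved:
        improved = False
        most = max(range(len(workloads)), key=load)
        least = min(range(len(workloads)), key=load)
        if most == least or not workloads[most]:
            break
        task = min(workloads[most])                  # smallest delegable task
        if load(least) + task < load(most):          # proposal accepted
            workloads[most].remove(task)
            workloads[least].append(task)
            improved = True
    return workloads

workloads = [[9, 7, 5, 4], [2], [3, 1]]              # skewed initial allocation
negotiate(workloads)
print(workloads, "-> makespan:", max(sum(w) for w in workloads))
```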
12.
E-mail foldering, or e-mail classification into user-predefined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e. the number of folders), the different number of e-mails per class state, and the fact that this is a dynamic problem, in the sense that e-mails arrive in our mail folders following a time-line. Perhaps because of these problems, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem and propose a new method based on learning and sampling probability distributions. Our experiments on a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For the sake of completeness, our experimental study also compares this with another standard balancing method (SMOTE) and with other classifiers.
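A minimal sketch of rebalancing an imbalanced corpus before training multinomial naive Bayes. The paper's method learns and samples probability distributions rather than simply duplicating messages, so this random-oversampling version (on synthetic bag-of-words counts) only illustrates why balancing helps, not how the proposed algorithm works.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def random_oversample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c in classes:
        idx = np.nonzero(y == c)[0]
        keep.append(rng.choice(idx, size=target, replace=True))   # duplicate up to majority size
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Synthetic bag-of-words counts for 3 folders with 500 / 50 / 10 e-mails,
# each folder favouring a different block of "words".
rng = np.random.default_rng(0)
rates = np.ones((3, 200))
rates[0, :20] += 3.0
rates[1, 20:40] += 3.0
rates[2, 40:60] += 3.0
X = np.vstack([rng.poisson(rates[c], size=(n, 200)) for c, n in [(0, 500), (1, 50), (2, 10)]])
y = np.array([0] * 500 + [1] * 50 + [2] * 10)

Xb, yb = random_oversample(X, y)
print("accuracy on the original corpus:", MultinomialNB().fit(Xb, yb).score(X, y))
```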
13.
ReliefF has proved to be a successful feature selector, but it is computationally expensive when handling large datasets. We present an optimization using Supervised Model Construction, which improves starter selection. Effectiveness has been evaluated using 12 UCI datasets and a clinical diabetes database. Experiments indicate that, compared with ReliefF, the proposed method improves computational efficiency whilst maintaining classification accuracy. On the clinical dataset (20,000 records with 47 features), feature selection via Supervised Model Construction (FSSMC) reduced processing time by 80% compared to ReliefF and maintained accuracy for the Naive Bayes, IB1 and C4.5 classifiers.
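For reference, here is the basic (binary-class) Relief weight update, the kind of unoptimized baseline the paper accelerates: for sampled instances, features that differ from the nearest miss gain weight and features that differ from the nearest hit lose weight. This is a sketch of plain Relief, not of the proposed FSSMC starter selection.

```python
import numpy as np

def relief(X, y, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.choice(len(X), size=n_samples, replace=False):
        d = np.abs(X - X[i]).sum(axis=1)
        d[i] = np.inf                                 # exclude the instance itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, d, np.inf))    # nearest neighbour of the same class
        miss = np.argmin(np.where(diff, d, np.inf))   # nearest neighbour of another class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples

X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)               # only the first two features matter
print(np.round(relief(X, y), 3))
```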
14.
15.
《Information Systems》2005,30(5):333-348
The tree index structure is a traditional method for searching for similar data in large datasets. It is based on the presupposition that most sub-trees are pruned during the search, so that the number of page accesses is reduced. However, time-series datasets generally have very high dimensionality and, because of the so-called dimensionality curse, pruning effectiveness is reduced in high dimensions. Consequently, the tree index structure is not a suitable method for time-series datasets. In this paper, we propose a two-phase (filtering and refinement) method for searching time-series datasets. In the filtering step, a quantized time-series representation is used to construct a compact file which is scanned to filter out irrelevant series. A small set of candidates is passed to the second step for refinement. In this step, we introduce an effective index compression method named grid-based datawise dimensionality reduction (DRR) which attempts to preserve the characteristics of the time-series. An experimental comparison with existing techniques demonstrates the utility of our approach.
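A sketch of the filter-and-refine pattern the paper builds on, using a piecewise-aggregate (PAA) representation whose distance lower-bounds the true Euclidean distance: cheap lower bounds prune most series, and exact distances are computed only for survivors. The paper's grid-based datawise dimensionality reduction (DRR) is a different representation; this only illustrates the two-phase search it plugs into.

```python
import numpy as np

def paa(X, n_segments):
    return X.reshape(X.shape[0], n_segments, -1).mean(axis=2)

def filter_and_refine(db, query, n_segments=8):
    seg_len = db.shape[1] // n_segments
    db_c, q_c = paa(db, n_segments), paa(query[None, :], n_segments)[0]
    lower = np.sqrt(seg_len * ((db_c - q_c) ** 2).sum(axis=1))   # lower bound on true distance

    best, best_dist = -1, np.inf
    for i in np.argsort(lower):                  # visit most promising series first
        if lower[i] >= best_dist:                # remaining candidates cannot win
            break
        d = np.linalg.norm(db[i] - query)        # exact distance (refinement step)
        if d < best_dist:
            best, best_dist = i, d
    return best, best_dist

db = np.random.randn(10_000, 128).cumsum(axis=1)     # random-walk time series
query = db[1234] + 0.1 * np.random.randn(128)
print(filter_and_refine(db, query))
```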
16.
17.
The watershed transform from markers is a very popular image segmentation operator. The image foresting transform (IFT) watershed is a common method for computing the watershed transform from markers using a priority queue, but it can consume too much memory when applied to three-dimensional medical datasets. This is a considerable limitation on the applicability of the IFT watershed, as the size of medical datasets keeps increasing at a faster pace than physical memory technology develops. This paper presents the O-IFT watershed, a new type of IFT watershed based on the O-Buffer framework, and introduces an efficient data representation which considerably reduces the memory consumption of the algorithm. In addition, this paper introduces the O-Buckets, a new implementation of the priority queue which further reduces the memory consumption of the algorithm. The new O-IFT watershed with O-Buckets allows the application of the watershed transform from markers to large medical datasets.
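For context, a minimal IFT watershed from markers on a 2-D image, using a standard heap-based priority queue with the f_max path cost and absolute intensity differences as arc weights (a common simplification). This is the memory-hungry baseline; the paper's O-IFT and O-Buckets optimizations are not shown.

```python
import heapq
import numpy as np

def ift_watershed(image, markers):
    """markers: array of the same shape, 0 = unmarked, >0 = seed label."""
    cost = np.full(image.shape, np.inf)
    label = markers.copy()
    done = np.zeros(image.shape, dtype=bool)
    heap = []
    for p in zip(*np.nonzero(markers)):
        cost[p] = 0.0
        heapq.heappush(heap, (0.0, p))
    h, w = image.shape
    while heap:
        c, (i, j) = heapq.heappop(heap)
        if done[i, j]:
            continue
        done[i, j] = True
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < h and 0 <= nj < w and not done[ni, nj]:
                new_cost = max(c, abs(float(image[ni, nj]) - float(image[i, j])))
                if new_cost < cost[ni, nj]:          # cheaper optimum path found
                    cost[ni, nj] = new_cost
                    label[ni, nj] = label[i, j]
                    heapq.heappush(heap, (new_cost, (ni, nj)))
    return label

img = np.zeros((64, 64)); img[:, 32] = 255           # a bright ridge splits the image
markers = np.zeros((64, 64), dtype=int)
markers[32, 5], markers[32, 60] = 1, 2               # one seed on each side
print(np.unique(ift_watershed(img, markers), return_counts=True))
```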
18.
Command and control (C&C) speech recognition allows users to interact with a system by speaking commands or asking questions restricted to a fixed grammar containing pre-defined phrases. Whereas C&C interaction has been commonplace in telephony and accessibility systems for many years, only recently have mobile devices had the memory and processing capacity to support client-side speech recognition. Given the personal nature of mobile devices, statistical models that can predict commands based in part on past user behavior hold promise for improving C&C recognition accuracy. For example, if a user calls a spouse at the end of every workday, the language model could be adapted to weight the spouse more than other contacts during that time. In this paper, we describe and assess statistical models learned from a large population of users for predicting the next user command of a commercial C&C application. We explain how these models were used for language modeling, and evaluate their performance in terms of task completion. The best performing model achieved a 26% relative reduction in error rate compared to the base system. Finally, we investigate the effects of personalization on performance at different learning rates via online updating of model parameters based on individual user data. Personalization significantly increased the relative reduction in error rate by an additional 5%.
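An illustrative sketch of the general idea of blending a population-level command model with online per-user counts, conditioned on time of day (as in the "call spouse at the end of the workday" example). The class, learning rate, features and smoothing below are invented for illustration and are not the models evaluated in the paper.

```python
from collections import defaultdict

class PersonalizedCommandModel:
    def __init__(self, population_probs, learning_rate=0.1):
        self.population = population_probs             # command -> population prior probability
        self.lr = learning_rate
        # time-of-day bucket (6-hour blocks) -> command -> observed count for this user
        self.user_counts = defaultdict(lambda: defaultdict(float))

    def update(self, command, hour):
        self.user_counts[hour // 6][command] += 1.0     # online update after each utterance

    def prob(self, command, hour):
        counts = self.user_counts[hour // 6]
        total = sum(counts.values())
        user_p = counts[command] / total if total else 0.0
        weight = min(self.lr * total, 0.8)              # trust user data more as it accumulates
        return (1 - weight) * self.population.get(command, 1e-6) + weight * user_p

model = PersonalizedCommandModel({"call spouse": 0.05, "check weather": 0.20})
for _ in range(10):
    model.update("call spouse", hour=17)                # user calls spouse every workday at 5 pm
print(model.prob("call spouse", hour=17), model.prob("call spouse", hour=9))
```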
19.
Ashish Sharma Rajiv K. Kalia Aiichiro Nakano Priya Vashishta 《Computer Physics Communications》2004,163(1):53-64
A scalable and portable code named Atomsviewer has been developed to interactively visualize a large atomistic dataset consisting of up to a billion atoms. The code uses a hierarchical view frustum-culling algorithm based on the octree data structure to efficiently remove atoms outside of the user's field of view. Probabilistic and depth-based occlusion-culling algorithms then select atoms that have a high probability of being visible. Finally, a multiresolution algorithm is used to render the selected subset of visible atoms at varying levels of detail. Atomsviewer is written in C++ and OpenGL, and it has been tested on a number of architectures including Windows, Macintosh, and SGI. Atomsviewer has been used to visualize tens of millions of atoms on a standard desktop computer and, in its parallel version, up to a billion atoms.
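A minimal sketch of hierarchical view-frustum culling with an octree over points: each node stores an axis-aligned bounding box, a whole subtree is rejected as soon as its box lies entirely outside one frustum plane, and only surviving leaves contribute atoms to the render list. The "frustum" here is a crude pair of half-spaces; the occlusion culling and multiresolution rendering described above are not shown.

```python
import numpy as np

class OctreeNode:
    def __init__(self, points, lo, hi, leaf_size=64):
        self.lo, self.hi, self.points, self.children = lo, hi, points, []
        if len(points) > leaf_size:
            mid = 0.5 * (lo + hi)
            for octant in range(8):
                sel_lo = np.where([(octant >> k) & 1 for k in range(3)], mid, lo)
                sel_hi = sel_lo + 0.5 * (hi - lo)
                mask = np.all((points >= sel_lo) & (points < sel_hi), axis=1)
                if mask.any():
                    self.children.append(OctreeNode(points[mask], sel_lo, sel_hi, leaf_size))

def box_outside_plane(lo, hi, normal, offset):
    # The box is fully outside if even its most "positive" corner is behind the plane.
    corner = np.where(normal > 0, hi, lo)
    return normal @ corner + offset < 0

def visible_points(node, planes):
    if any(box_outside_plane(node.lo, node.hi, n, d) for n, d in planes):
        return np.empty((0, 3))                       # whole subtree culled
    if not node.children:
        return node.points                            # leaf: atoms handed to the renderer
    return np.vstack([visible_points(c, planes) for c in node.children])

pts = np.random.uniform(-1, 1, (100_000, 3))
tree = OctreeNode(pts, lo=np.full(3, -1.0), hi=np.full(3, 1.0))
planes = [(np.array([1.0, 0.0, 0.0]), 0.0), (np.array([0.0, 0.0, 1.0]), 0.0)]  # x > 0 and z > 0
print(len(visible_points(tree, planes)), "of", len(pts), "atoms survive culling")
```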
Program summary
Title of program: Atomsviewer
Catalogue identifier: ADUM
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADUM
Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
Computer for which the program is designed and others on which it has been tested: 2.4 GHz Pentium 4/Xeon processor, professional graphics card; Apple G4 (867 MHz)/G5, professional graphics card
Operating systems under which the program has been tested: Windows 2000/XP, Mac OS 10.2/10.3, SGI IRIX 6.5
Programming languages used: C++, C and OpenGL
Memory required to execute with typical data: 1 gigabyte of RAM
High speed storage required: 60 gigabytes
No. of lines in the distributed program including test data, etc.: 550 241
No. of bytes in the distributed program including test data, etc.: 6 258 245
Number of bits in a word: Arbitrary
Number of processors used: 1
Has the code been vectorized or parallelized: No
Distribution format: tar gzip file
Nature of physical problem: Scientific visualization of atomic systems
Method of solution: Rendering of atoms using computer graphic techniques, culling algorithms for data minimization, and levels-of-detail for minimal rendering
Restrictions on the complexity of the problem: None
Typical running time: The program is interactive in its execution
Unusual features of the program: None
References: The conceptual foundation and subsequent implementation of the algorithms are found in [A. Sharma, A. Nakano, R.K. Kalia, P. Vashishta, S. Kodiyalam, P. Miller, W. Zhao, X.L. Liu, T.J. Campbell, A. Haas, Presence—Teleoperators and Virtual Environments 12 (1) (2003)].
20.
Gregory Todd Williams Jesse Weaver Medha Atre James A. Hendler 《Journal of Web Semantics》2010,8(4):365-373
With a huge amount of RDF data available on the web, the ability to find and access relevant information is crucial. Traditional approaches to storing, querying, and reasoning fall short when faced with web-scale data. We present a system that combines the computational power of large clusters, for large-scale reasoning and data access, with an efficient data structure for storing and querying the accessed data on a traditional personal computer or other resource-constrained device. We present results of using this system to load the 2009 Billion Triples Challenge dataset, materialize RDFS inferences, extract an "interesting" subset of the data using a large cluster, and further analyze the extracted data using a personal computer, all in the order of tens of minutes.
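To illustrate what "materialize RDFS inferences" means, here is a tiny in-memory sketch of two RDFS rules (transitive rdfs:subClassOf closure and type propagation along it). Web-scale materialization as described above distributes this kind of computation over a cluster; the example triples are invented.

```python
from collections import defaultdict

triples = {
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
    ("ex:alice", "rdf:type", "ex:Student"),
}

# Transitive closure of rdfs:subClassOf (rule rdfs11).
superclasses = defaultdict(set)
for s, p, o in triples:
    if p == "rdfs:subClassOf":
        superclasses[s].add(o)
changed = True
while changed:
    changed = False
    for c in list(superclasses):
        for sup in list(superclasses[c]):
            new = superclasses.get(sup, set()) - superclasses[c]
            if new:
                superclasses[c] |= new
                changed = True

# Materialize inferred rdf:type triples (rule rdfs9).
inferred = {(s, "rdf:type", sup)
            for s, p, o in triples if p == "rdf:type"
            for sup in superclasses.get(o, ())}
print(sorted(inferred - triples))
```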