1.
One-Versus-All (OVA) classification is a classifier construction method in which a k-class prediction task is decomposed into k two-class sub-problems. One base model is constructed for each sub-problem, and the base models are then combined into a single model. Aggregate modeling is the general process of constructing several base models that are then combined into one model for prediction; in essence, OVA classification is a form of aggregate modeling. This paper reports studies conducted to establish whether OVA classification can provide predictive performance gains when large volumes of data are available for modeling, as is commonly the case in data mining. It is demonstrated that, firstly, OVA modeling can be used to increase the total amount of training data used while keeping each base model's training set much smaller than the total amount of available training data. Secondly, OVA models created from large datasets provide a higher level of predictive performance than single k-class models. Thirdly, boosted OVA base models can provide higher predictive performance than un-boosted OVA base models. Fourthly, when the combination algorithm for base model predictions is able to resolve tied predictions, the resulting aggregate models provide a higher level of predictive performance.
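A minimal sketch of the OVA idea described above, not the paper's exact setup: k binary base models are trained on "class c vs. rest" relabelings and combined by taking the class whose base model is most confident, which also resolves ties. The logistic-regression base learner and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_informative=8, n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classes = np.unique(y_tr)
base_models = {}
for c in classes:
    # Each sub-problem sees a binary relabeling of the training set; in the
    # paper's setting one could subsample here so that each base training set
    # stays much smaller than the full data.
    base_models[c] = LogisticRegression(max_iter=1000).fit(X_tr, y_tr == c)

# Aggregate prediction: the class whose base model gives the highest
# "positive" probability wins, which also breaks ties between base models.
scores = np.column_stack([base_models[c].predict_proba(X_te)[:, 1] for c in classes])
y_pred = classes[scores.argmax(axis=1)]
print("OVA accuracy:", (y_pred == y_te).mean())
```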
2.
Gijs Rennen 《Structural and Multidisciplinary Optimization》2009,38(6):545-569
When building a Kriging model, the general intuition is that using more data will always result in a better model. However, we show that when a large non-uniform dataset is available, using a uniform subset of it can have several advantages: it reduces the time needed to fit the model, avoids numerical inaccuracies, and improves robustness with respect to errors in the output data. We furthermore describe several new and existing methods for selecting a uniform subset. These methods are tested and compared on several artificial datasets and one real-life dataset. The comparison shows how the selected subsets affect different aspects of the resulting Kriging model. As none of the subset selection methods performs best on all criteria, the best method to choose depends on how the different aspects are weighted. The comparison made in this paper can help the user make a well-informed choice.
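As an illustration of what "selecting a uniform subset" can look like, the sketch below uses greedy maximin (farthest-point) selection, a common space-filling heuristic. It is an assumed example, not necessarily one of the specific methods compared in the paper.

```python
import numpy as np

def maximin_subset(X, k, seed=0):
    """Greedily pick k points that spread out over the input space."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]             # start from a random point
    d = np.linalg.norm(X - X[idx[0]], axis=1)     # distance to the chosen set
    for _ in range(k - 1):
        nxt = int(d.argmax())                     # farthest point from the set so far
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)

X = np.vstack([np.random.randn(900, 2) * 0.2,          # dense cluster
               np.random.uniform(-3, 3, (100, 2))])    # sparse background
subset = maximin_subset(X, k=50)
# A Kriging / Gaussian-process model would then be fitted on X[subset]
# instead of the full, non-uniform X.
```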
3.
Asil Oztekin 《Information Systems Frontiers》2018,20(2):223-238
This study aims to determine the future share net inflows and outflows of Exchange Traded Funds (ETFs). Net flows are closely related to investor perception of the future and past performance of mutual funds. The net flows of Exchange Traded Funds, however, are expected to be less related to overall fund performance and based instead on the characteristics that make a fund attractive to an individual investor. To explore the relationship between investors' perception of ETFs and subsequent net flows, this study is designed to shed light on the multifaceted linkages between fund characteristics and net flows. A meta-classification predictive modeling approach designed for use with large data sets is presented, and its implementation and results are discussed. Fifteen carefully selected attributes of each fund, those most likely to contribute to fund inflows and outflows, are used in the analyses. The large data set calls for a robust, systematic approach to identifying the fund attributes that best predict future inflows and outflows. The predictive performance of the proposed decision analytic methodology was assessed via 10-fold cross-validation, which yielded very promising results.
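The abstract does not name the base learners, so the sketch below only illustrates the general pattern of a meta-classifier evaluated with 10-fold cross-validation: several base models are combined by soft voting over fifteen (here synthetic) fund attributes. The attributes and estimators are stand-ins, not the study's models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 15 synthetic "fund attributes"; the target stands in for inflow vs. outflow.
X, y = make_classification(n_samples=2000, n_features=15, random_state=1)

meta = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=5)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="soft",                      # combine base models by averaged probabilities
)
scores = cross_val_score(meta, X, y, cv=10)   # 10-fold cross-validation
print("mean accuracy:", scores.mean())
```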
4.
Robert Van Dam Irene Langkilde-Geary Dan Ventura 《Knowledge and Information Systems》2013,35(3):525-552
The ADtree, a data structure useful for caching sufficient statistics, has been successfully adapted to grow lazily when memory is limited and to update sequentially with an incrementally updated dataset. However, even these modified forms of the ADtree still exhibit inefficiencies in terms of both space usage and query time, particularly on datasets with very high dimensionality and with high-arity features. We propose four modifications to the ADtree, each of which can be used to improve size and query time under specific types of datasets and features. These modifications also provide an increased ability to precisely control how an ADtree is built and to tune its size given external memory or speed requirements.
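To make the "cache sufficient statistics" idea concrete, here is a toy count-caching tree in the spirit of the ADtree: counts of all attribute-value prefixes (in a fixed attribute order) are precomputed so that conjunctive count queries over that prefix order are answered from the cache. It ignores the most-common-value trick, lazy expansion and the paper's four modifications.

```python
from collections import defaultdict

def build_count_tree(rows):
    """rows: list of tuples of discrete feature values, all the same length."""
    tree = defaultdict(int)
    for row in rows:
        prefix = ()
        tree[prefix] += 1                 # count of the empty query (all rows)
        for value in row:
            prefix = prefix + (value,)
            tree[prefix] += 1             # count of rows matching this value prefix
    return tree

rows = [(0, 1, 1), (0, 1, 0), (1, 0, 1), (0, 1, 1)]
tree = build_count_tree(rows)
print(tree[(0, 1)])   # rows with feature0 == 0 and feature1 == 1  -> 3
print(tree[()])       # total number of rows                        -> 4
```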
5.
6.
Kernel matching pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of basis functions selected from a kernel-induced dictionary. Here we propose a modification of the kernel matching pursuit algorithm that aims to make the method practical for large datasets. Starting from an approximating algorithm, the weak greedy algorithm, we introduce a stochastic method for reducing the search space at each iteration. We then study the implications of using an approximate algorithm and show how one can control the trade-off between accuracy and resource requirements. Finally, we present experiments performed on a large dataset that support our approach and illustrate its applicability.
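A minimal sketch of the idea, assuming an RBF dictionary and least-squares back-fitting: at each iteration only a random subset of candidate kernel centres is examined (the stochastic, "weak greedy" search-space reduction), the centre most correlated with the residual is added, and the weights are refit. Hyperparameters and the refitting strategy are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def rbf(X, centers, gamma=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def stochastic_kmp(X, y, n_atoms=20, n_candidates=50, seed=0):
    rng = np.random.default_rng(seed)
    chosen, w = [], None
    residual = y.astype(float).copy()
    for _ in range(n_atoms):
        cand = rng.choice(len(X), size=n_candidates, replace=False)   # reduced search space
        K = rbf(X, X[cand])
        scores = np.abs(K.T @ residual) / (np.linalg.norm(K, axis=0) + 1e-12)
        chosen.append(cand[int(scores.argmax())])                     # best-correlated atom
        Kc = rbf(X, X[chosen])
        w, *_ = np.linalg.lstsq(Kc, y, rcond=None)                    # refit all weights
        residual = y - Kc @ w
    return np.array(chosen), w

X = np.random.randn(1000, 2)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])
centers, w = stochastic_kmp(X, y)
print("training accuracy:", (np.sign(rbf(X, X[centers]) @ w) == y).mean())
```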
7.
Gaussian process classification is a supervised learning algorithm that has attracted wide attention in the machine learning community in recent years. Under a Gaussian process prior, the algorithm maximizes the posterior probability to obtain a prediction for a new sample together with the probability of that prediction. Taking the characteristics of image data into account, this paper proposes a method for applying Gaussian processes to image classification and, on this basis, a scheme for ranking images. Experiments on a public image dataset, with a support vector machine classifier used for comparison, confirm the method's effectiveness and offer a useful direction for improving image classification techniques.
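A minimal sketch of Gaussian process classification with probability-based ranking; the "image features" here are synthetic stand-ins for whatever descriptors would be extracted from the images, and the kernel choice is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)  # stand-in image features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0)).fit(X_tr, y_tr)
proba = gpc.predict_proba(X_te)[:, 1]          # probability of the positive class
print("accuracy:", gpc.score(X_te, y_te))

ranking = np.argsort(-proba)                   # rank images by predicted class probability
print("top-5 test images for the positive class:", ranking[:5])
```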
8.
Crime activities are geospatial phenomena and as such are geospatially, thematically and temporally correlated. We analyze crime datasets in conjunction with socio-economic and socio-demographic factors to discover co-distribution patterns that may contribute to the formation of crime. We propose a graph-based dataset representation that allows us to extract patterns from heterogeneous areal aggregated datasets and to visualize the resulting patterns efficiently. We demonstrate our approach on real crime datasets and provide a comparison with other techniques.
9.
In data warehousing applications, numerous OLAP queries involve the processing of holistic aggregators such as the top n, the median and quantiles. In this paper, we present a novel approach called dynamic bucketing to efficiently evaluate these aggregators. We partition data into equi-width buckets and further partition dense buckets into sub-buckets as needed by allocating and reclaiming memory space. The bucketing process dynamically adapts to the order and distribution of the input datasets. The histograms of the buckets and sub-buckets are stored in our new data structure called structure trees. A recent selection algorithm based on regular sampling is generalized and its analysis extended. We have also compared our new algorithms with this generalized algorithm and several other recent algorithms. Experimental results show that our new algorithms significantly outperform prior ones not only in runtime but also in accuracy.
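A simplified two-pass sketch of bucketed quantile estimation: a first pass builds an equi-width histogram and flags dense buckets, a second pass refines dense buckets into sub-buckets, and the quantile is read off the cumulative histogram. The paper's method is one-pass and adapts its buckets dynamically (the structure tree); this only illustrates the bucket/sub-bucket idea, with thresholds chosen for illustration.

```python
import numpy as np

def bucketed_quantile(values, q, n_buckets=64, n_sub=16, dense_threshold=0.05):
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_buckets + 1)
    counts, _ = np.histogram(values, bins=edges)
    dense = np.nonzero(counts > dense_threshold * len(values))[0]

    # Refine each dense bucket into finer sub-buckets.
    fine_edges = [edges]
    for b in dense:
        fine_edges.append(np.linspace(edges[b], edges[b + 1], n_sub + 1))
    all_edges = np.unique(np.concatenate(fine_edges))
    counts, _ = np.histogram(values, bins=all_edges)

    # Read the q-quantile from the cumulative histogram: midpoint of the
    # (sub-)bucket that contains the target rank.
    target = q * len(values)
    b = min(int(np.searchsorted(np.cumsum(counts), target)), len(counts) - 1)
    return 0.5 * (all_edges[b] + all_edges[b + 1])

data = np.random.lognormal(size=100_000)
print("estimated median:", bucketed_quantile(data, 0.5))
print("exact median:    ", np.median(data))
```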
10.
João P. Papa Alexandre X. Falcão Victor Hugo C. de Albuquerque João Manuel R.S. Tavares 《Pattern recognition》2012,45(1):512-520
Today's data acquisition technologies produce large datasets with millions of samples for statistical analysis. This creates a tremendous challenge for pattern recognition techniques, which need to become more efficient without losing effectiveness. We have tried to circumvent the problem by reducing it to the fast computation of an optimum-path forest (OPF) in a graph derived from the training samples. In this forest, each class may be represented by multiple trees rooted at some representative samples. The forest is a classifier that assigns to a new sample the label of its most strongly connected root. The methodology has been successfully used with different graph topologies and learning techniques. In this work, we focus on one of the supervised approaches, which has offered considerable advantages over Support Vector Machines and Artificial Neural Networks in handling large datasets. We propose (i) a new algorithm that speeds up classification and (ii) a solution that reduces the training set size with negligible effect on classification accuracy, further increasing efficiency. Experimental results show improvements with respect to our previous approach and advantages over other existing methods, which make the new method a valuable contribution for large dataset analysis.
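A heavily simplified OPF sketch, for orientation only: prototypes are taken here as one arbitrary sample per class (the real algorithm derives them from a minimum spanning tree between classes), training computes each sample's optimum-path cost from the prototypes under the f_max path cost, and a new sample receives the label of the training node that offers it the cheapest extension of an optimum path. None of the paper's speed-ups appear here.

```python
import heapq
import numpy as np

def train_opf(X, y, prototypes):
    n = len(X)
    cost = np.full(n, np.inf)
    label = y.copy()
    cost[prototypes] = 0.0
    heap = [(0.0, int(p)) for p in prototypes]
    done = np.zeros(n, dtype=bool)
    while heap:
        c, s = heapq.heappop(heap)
        if done[s]:
            continue
        done[s] = True
        d = np.linalg.norm(X - X[s], axis=1)
        new_cost = np.maximum(c, d)                  # f_max path cost through s
        for t in np.nonzero(new_cost < cost)[0]:
            cost[t] = new_cost[t]
            label[t] = label[s]                      # conquered by s's tree
            heapq.heappush(heap, (float(cost[t]), int(t)))
    return cost, label

def classify_opf(X_train, cost, label, X_new):
    preds = []
    for x in X_new:
        d = np.linalg.norm(X_train - x, axis=1)
        t = int(np.argmin(np.maximum(cost, d)))      # cheapest optimum-path extension
        preds.append(label[t])
    return np.array(preds)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])
y = np.array([0] * 150 + [1] * 150)
prototypes = np.array([0, 150])                      # one (arbitrary) prototype per class
cost, label = train_opf(X, y, prototypes)
print("training accuracy:", (classify_opf(X, cost, label, X) == y).mean())
```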
11.
Quentin Baert Anne-Cécile Caron Maxime Morge Jean-Christophe Routier 《Knowledge and Information Systems》2018,54(3):591-615
MapReduce is a design pattern for processing large datasets distributed over a cluster. Its performance depends on the data structure and the runtime environment. Indeed, data skew can yield an unfair task allocation, but even when the initial allocation produced by the partition function is well balanced, an unfair allocation can occur during the reduce phase due to the heterogeneous performance of nodes. For these reasons, we propose an adaptive multi-agent system. In our approach, the reducer agents interact during the job, and task reallocation is based on negotiation in order to decrease the workload of the most loaded reducer and thus the runtime. In this paper, we propose and evaluate two negotiation strategies. Finally, we experiment with our multi-agent system on real-world datasets over a heterogeneous runtime environment.
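A toy simulation of negotiation-based task reallocation among reducer agents: the most loaded reducer repeatedly proposes to delegate one of its tasks, and the least loaded reducer accepts if that lowers its peer's load below its own. The proposal strategy and cost model are invented for illustration, not the paper's two negotiation strategies.

```python
def negotiate(workloads):
    """workloads: list of lists of task costs, one list per reducer agent."""
    def load(i):
        return sum(workloads[i])

    improved = True
    while improved:
        improved = False
        most = max(range(len(workloads)), key=load)
        least = min(range(len(workloads)), key=load)
        if most == least or not workloads[most]:
            break
        task = min(workloads[most])                  # smallest delegable task
        if load(least) + task < load(most):          # proposal accepted
            workloads[most].remove(task)
            workloads[least].append(task)
            improved = True
    return workloads

workloads = [[9, 7, 5, 4], [2], [3, 1]]              # skewed initial allocation
negotiate(workloads)
print(workloads, "-> makespan:", max(sum(w) for w in workloads))
```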
12.
E-mail foldering, or e-mail classification into user-predefined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e. the number of folders), the different number of e-mails per class state, and the fact that this is a dynamic problem, in the sense that e-mails arrive in our mail folders following a time-line. Perhaps because of these problems, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem and propose a new method based on learning and sampling probability distributions. Our experiments on a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For the sake of completeness, our experimental study also compares this with another standard balancing method (SMOTE) and with other classifiers.
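A minimal sketch of rebalancing an imbalanced corpus before training multinomial naive Bayes. The paper's method learns and samples probability distributions rather than simply duplicating messages, so this random-oversampling version (on synthetic bag-of-words counts) only illustrates why balancing helps, not how the proposed algorithm works.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def random_oversample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c in classes:
        idx = np.nonzero(y == c)[0]
        keep.append(rng.choice(idx, size=target, replace=True))   # duplicate up to majority size
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Synthetic bag-of-words counts for 3 folders with 500 / 50 / 10 e-mails,
# each folder favouring a different block of "words".
rng = np.random.default_rng(0)
rates = np.ones((3, 200))
rates[0, :20] += 3.0
rates[1, 20:40] += 3.0
rates[2, 40:60] += 3.0
X = np.vstack([rng.poisson(rates[c], size=(n, 200)) for c, n in [(0, 500), (1, 50), (2, 10)]])
y = np.array([0] * 500 + [1] * 50 + [2] * 10)

Xb, yb = random_oversample(X, y)
print("accuracy on the original corpus:", MultinomialNB().fit(Xb, yb).score(X, y))
```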
13.
ReliefF has proved to be a successful feature selector, but it is computationally expensive when handling large datasets. We present an optimization using Supervised Model Construction, which improves starter selection. Effectiveness has been evaluated using 12 UCI datasets and a clinical diabetes database. Experiments indicate that, compared with ReliefF, the proposed method improves computational efficiency whilst maintaining classification accuracy. On the clinical dataset (20,000 records with 47 features), feature selection via Supervised Model Construction (FSSMC) reduced processing time by 80% compared to ReliefF and maintained accuracy for the Naive Bayes, IB1 and C4.5 classifiers.
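For reference, here is the basic (binary-class) Relief weight update, the kind of unoptimized baseline the paper accelerates: for sampled instances, features that differ from the nearest miss gain weight and features that differ from the nearest hit lose weight. This is a sketch of plain Relief, not of the proposed FSSMC starter selection.

```python
import numpy as np

def relief(X, y, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.choice(len(X), size=n_samples, replace=False):
        d = np.abs(X - X[i]).sum(axis=1)
        d[i] = np.inf                                 # exclude the instance itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, d, np.inf))    # nearest neighbour of the same class
        miss = np.argmin(np.where(diff, d, np.inf))   # nearest neighbour of another class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples

X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)               # only the first two features matter
print(np.round(relief(X, y), 3))
```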
14.
15.
《Information Systems》2005,30(5):333-348
The tree index structure is a traditional method for searching for similar data in large datasets. It is based on the presupposition that most sub-trees are pruned during the search, so that the number of page accesses is reduced. However, time-series datasets generally have very high dimensionality and, because of the so-called dimensionality curse, pruning effectiveness is reduced in high dimensions. Consequently, the tree index structure is not a suitable method for time-series datasets. In this paper, we propose a two-phase (filtering and refinement) method for searching time-series datasets. In the filtering step, a quantized time-series representation is used to construct a compact file which is scanned to filter out irrelevant series. A small set of candidates is passed to the second step for refinement. In this step, we introduce an effective index compression method named grid-based datawise dimensionality reduction (DRR) which attempts to preserve the characteristics of the time-series. An experimental comparison with existing techniques demonstrates the utility of our approach.
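A sketch of the filter-and-refine pattern the paper builds on, using a piecewise-aggregate (PAA) representation whose distance lower-bounds the true Euclidean distance: cheap lower bounds prune most series, and exact distances are computed only for survivors. The paper's grid-based datawise dimensionality reduction (DRR) is a different representation; this only illustrates the two-phase search it plugs into.

```python
import numpy as np

def paa(X, n_segments):
    return X.reshape(X.shape[0], n_segments, -1).mean(axis=2)

def filter_and_refine(db, query, n_segments=8):
    seg_len = db.shape[1] // n_segments
    db_c, q_c = paa(db, n_segments), paa(query[None, :], n_segments)[0]
    lower = np.sqrt(seg_len * ((db_c - q_c) ** 2).sum(axis=1))   # lower bound on true distance

    best, best_dist = -1, np.inf
    for i in np.argsort(lower):                  # visit most promising series first
        if lower[i] >= best_dist:                # remaining candidates cannot win
            break
        d = np.linalg.norm(db[i] - query)        # exact distance (refinement step)
        if d < best_dist:
            best, best_dist = i, d
    return best, best_dist

db = np.random.randn(10_000, 128).cumsum(axis=1)     # random-walk time series
query = db[1234] + 0.1 * np.random.randn(128)
print(filter_and_refine(db, query))
```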
16.
17.
The watershed transform from markers is a very popular image segmentation operator. The image foresting transform (IFT) watershed is a common method for computing the watershed transform from markers using a priority queue, but it can consume too much memory when applied to three-dimensional medical datasets. This is a considerable limitation on the applicability of the IFT watershed, as the size of medical datasets keeps increasing at a faster pace than physical memory technology develops. This paper presents the O-IFT watershed, a new type of IFT watershed based on the O-Buffer framework, and introduces an efficient data representation which considerably reduces the memory consumption of the algorithm. In addition, this paper introduces the O-Buckets, a new implementation of the priority queue which further reduces the memory consumption of the algorithm. The new O-IFT watershed with O-Buckets allows the application of the watershed transform from markers to large medical datasets.
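For context, a minimal IFT watershed from markers on a 2-D image, using a standard heap-based priority queue with the f_max path cost and absolute intensity differences as arc weights (a common simplification). This is the memory-hungry baseline; the paper's O-IFT and O-Buckets optimizations are not shown.

```python
import heapq
import numpy as np

def ift_watershed(image, markers):
    """markers: array of the same shape, 0 = unmarked, >0 = seed label."""
    cost = np.full(image.shape, np.inf)
    label = markers.copy()
    done = np.zeros(image.shape, dtype=bool)
    heap = []
    for p in zip(*np.nonzero(markers)):
        cost[p] = 0.0
        heapq.heappush(heap, (0.0, p))
    h, w = image.shape
    while heap:
        c, (i, j) = heapq.heappop(heap)
        if done[i, j]:
            continue
        done[i, j] = True
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < h and 0 <= nj < w and not done[ni, nj]:
                new_cost = max(c, abs(float(image[ni, nj]) - float(image[i, j])))
                if new_cost < cost[ni, nj]:          # cheaper optimum path found
                    cost[ni, nj] = new_cost
                    label[ni, nj] = label[i, j]
                    heapq.heappush(heap, (new_cost, (ni, nj)))
    return label

img = np.zeros((64, 64)); img[:, 32] = 255           # a bright ridge splits the image
markers = np.zeros((64, 64), dtype=int)
markers[32, 5], markers[32, 60] = 1, 2               # one seed on each side
print(np.unique(ift_watershed(img, markers), return_counts=True))
```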
18.
Command and control (C&C) speech recognition allows users to interact with a system by speaking commands or asking questions restricted to a fixed grammar containing pre-defined phrases. Whereas C&C interaction has been commonplace in telephony and accessibility systems for many years, only recently have mobile devices had the memory and processing capacity to support client-side speech recognition. Given the personal nature of mobile devices, statistical models that can predict commands based in part on past user behavior hold promise for improving C&C recognition accuracy. For example, if a user calls a spouse at the end of every workday, the language model could be adapted to weight the spouse more than other contacts during that time. In this paper, we describe and assess statistical models learned from a large population of users for predicting the next user command of a commercial C&C application. We explain how these models were used for language modeling, and evaluate their performance in terms of task completion. The best performing model achieved a 26% relative reduction in error rate compared to the base system. Finally, we investigate the effects of personalization on performance at different learning rates via online updating of model parameters based on individual user data. Personalization significantly increased the relative reduction in error rate by an additional 5%.
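An illustrative sketch of the general idea of blending a population-level command model with online per-user counts, conditioned on time of day (as in the "call spouse at the end of the workday" example). The class, learning rate, features and smoothing below are invented for illustration and are not the models evaluated in the paper.

```python
from collections import defaultdict

class PersonalizedCommandModel:
    def __init__(self, population_probs, learning_rate=0.1):
        self.population = population_probs             # command -> population prior probability
        self.lr = learning_rate
        # time-of-day bucket (6-hour blocks) -> command -> observed count for this user
        self.user_counts = defaultdict(lambda: defaultdict(float))

    def update(self, command, hour):
        self.user_counts[hour // 6][command] += 1.0     # online update after each utterance

    def prob(self, command, hour):
        counts = self.user_counts[hour // 6]
        total = sum(counts.values())
        user_p = counts[command] / total if total else 0.0
        weight = min(self.lr * total, 0.8)              # trust user data more as it accumulates
        return (1 - weight) * self.population.get(command, 1e-6) + weight * user_p

model = PersonalizedCommandModel({"call spouse": 0.05, "check weather": 0.20})
for _ in range(10):
    model.update("call spouse", hour=17)                # user calls spouse every workday at 5 pm
print(model.prob("call spouse", hour=17), model.prob("call spouse", hour=9))
```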
19.
Ashish Sharma Rajiv K. Kalia Aiichiro Nakano Priya Vashishta 《Computer Physics Communications》2004,163(1):53-64
A scalable and portable code named Atomsviewer has been developed to interactively visualize a large atomistic dataset consisting of up to a billion atoms. The code uses a hierarchical view frustum-culling algorithm based on the octree data structure to efficiently remove atoms outside of the user's field of view. Probabilistic and depth-based occlusion-culling algorithms then select atoms that have a high probability of being visible. Finally, a multiresolution algorithm is used to render the selected subset of visible atoms at varying levels of detail. Atomsviewer is written in C++ and OpenGL, and it has been tested on a number of architectures including Windows, Macintosh, and SGI. Atomsviewer has been used to visualize tens of millions of atoms on a standard desktop computer and, in its parallel version, up to a billion atoms.
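A minimal sketch of hierarchical view-frustum culling with an octree over points: each node stores an axis-aligned bounding box, a whole subtree is rejected as soon as its box lies entirely outside one frustum plane, and only surviving leaves contribute atoms to the render list. The "frustum" here is a crude pair of half-spaces; the occlusion culling and multiresolution rendering described above are not shown.

```python
import numpy as np

class OctreeNode:
    def __init__(self, points, lo, hi, leaf_size=64):
        self.lo, self.hi, self.points, self.children = lo, hi, points, []
        if len(points) > leaf_size:
            mid = 0.5 * (lo + hi)
            for octant in range(8):
                sel_lo = np.where([(octant >> k) & 1 for k in range(3)], mid, lo)
                sel_hi = sel_lo + 0.5 * (hi - lo)
                mask = np.all((points >= sel_lo) & (points < sel_hi), axis=1)
                if mask.any():
                    self.children.append(OctreeNode(points[mask], sel_lo, sel_hi, leaf_size))

def box_outside_plane(lo, hi, normal, offset):
    # The box is fully outside if even its most "positive" corner is behind the plane.
    corner = np.where(normal > 0, hi, lo)
    return normal @ corner + offset < 0

def visible_points(node, planes):
    if any(box_outside_plane(node.lo, node.hi, n, d) for n, d in planes):
        return np.empty((0, 3))                       # whole subtree culled
    if not node.children:
        return node.points                            # leaf: atoms handed to the renderer
    return np.vstack([visible_points(c, planes) for c in node.children])

pts = np.random.uniform(-1, 1, (100_000, 3))
tree = OctreeNode(pts, lo=np.full(3, -1.0), hi=np.full(3, 1.0))
planes = [(np.array([1.0, 0.0, 0.0]), 0.0), (np.array([0.0, 0.0, 1.0]), 0.0)]  # x > 0 and z > 0
print(len(visible_points(tree, planes)), "of", len(pts), "atoms survive culling")
```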
Program summary
Title of program: Atomsviewer
Catalogue identifier: ADUM
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADUM
Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
Computer for which the program is designed and others on which it has been tested: 2.4 GHz Pentium 4/Xeon processor, professional graphics card; Apple G4 (867 MHz)/G5, professional graphics card
Operating systems under which the program has been tested: Windows 2000/XP, Mac OS 10.2/10.3, SGI IRIX 6.5
Programming languages used: C++, C and OpenGL
Memory required to execute with typical data: 1 gigabyte of RAM
High speed storage required: 60 gigabytes
No. of lines in the distributed program including test data, etc.: 550 241
No. of bytes in the distributed program including test data, etc.: 6 258 245
Number of bits in a word: Arbitrary
Number of processors used: 1
Has the code been vectorized or parallelized: No
Distribution format: tar gzip file
Nature of physical problem: Scientific visualization of atomic systems
Method of solution: Rendering of atoms using computer graphic techniques, culling algorithms for data minimization, and levels-of-detail for minimal rendering
Restrictions on the complexity of the problem: None
Typical running time: The program is interactive in its execution
Unusual features of the program: None
References: The conceptual foundation and subsequent implementation of the algorithms are found in [A. Sharma, A. Nakano, R.K. Kalia, P. Vashishta, S. Kodiyalam, P. Miller, W. Zhao, X.L. Liu, T.J. Campbell, A. Haas, Presence—Teleoperators and Virtual Environments 12 (1) (2003)].
20.
Gregory Todd Williams Jesse Weaver Medha Atre James A. Hendler 《Journal of Web Semantics》2010,8(4):365-373
With a huge amount of RDF data available on the web, the ability to find and access relevant information is crucial. Traditional approaches to storing, querying, and reasoning fall short when faced with web-scale data. We present a system that combines the computational power of large clusters, for large-scale reasoning and data access, with an efficient data structure for storing and querying the accessed data on a traditional personal computer or other resource-constrained device. We present results of using this system to load the 2009 Billion Triples Challenge dataset, materialize RDFS inferences, extract an "interesting" subset of the data using a large cluster, and further analyze the extracted data using a personal computer, all in the order of tens of minutes.
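To illustrate what "materialize RDFS inferences" means, here is a tiny in-memory sketch of two RDFS rules (transitive rdfs:subClassOf closure and type propagation along it). Web-scale materialization as described above distributes this kind of computation over a cluster; the example triples are invented.

```python
from collections import defaultdict

triples = {
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
    ("ex:alice", "rdf:type", "ex:Student"),
}

# Transitive closure of rdfs:subClassOf (rule rdfs11).
superclasses = defaultdict(set)
for s, p, o in triples:
    if p == "rdfs:subClassOf":
        superclasses[s].add(o)
changed = True
while changed:
    changed = False
    for c in list(superclasses):
        for sup in list(superclasses[c]):
            new = superclasses.get(sup, set()) - superclasses[c]
            if new:
                superclasses[c] |= new
                changed = True

# Materialize inferred rdf:type triples (rule rdfs9).
inferred = {(s, "rdf:type", sup)
            for s, p, o in triples if p == "rdf:type"
            for sup in superclasses.get(o, ())}
print(sorted(inferred - triples))
```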