20 similar documents found
1.
One-Versus-All (OVA) classification is a classifier construction method in which a k-class prediction task is decomposed into k two-class sub-problems. One base model is constructed for each sub-problem, and the base models are then combined into a single model. Aggregate model implementation is the process of constructing several base models which are then combined into a single model for prediction; in essence, OVA classification is a form of aggregate modeling. This paper reports studies conducted to establish whether OVA classification can provide predictive performance gains when large volumes of data are available for modeling, as is commonly the case in data mining. The paper demonstrates that, first, OVA modeling can be used to increase the amount of training data while keeping each base model's training set much smaller than the total amount of available training data. Second, OVA models created from large datasets provide a higher level of predictive performance than single k-class models. Third, boosted OVA base models can provide higher predictive performance than un-boosted OVA base models. Fourth, when the combination algorithm for base model predictions is able to resolve tied predictions, the resulting aggregate models provide a higher level of predictive performance.
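A minimal sketch of the one-versus-all decomposition described above, assuming scikit-learn as the modeling library: one binary base model is trained per class, and ties between base models are resolved by comparing their decision scores. The base learner, dataset and tie-breaking rule are illustrative placeholders, not the paper's exact setup.

```python
# One-versus-all sketch: one 2-class base model per class, combined by taking the
# base model with the highest decision score (which also resolves tied predictions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_classes=4, n_informative=6, random_state=0)
classes = np.unique(y)

base_models = {}
for c in classes:
    model = LogisticRegression(max_iter=1000)
    model.fit(X, (y == c).astype(int))          # class c versus the rest
    base_models[c] = model

def predict_ova(X_new):
    scores = np.column_stack([base_models[c].decision_function(X_new) for c in classes])
    return classes[np.argmax(scores, axis=1)]   # highest score wins, ties resolved

print("training accuracy:", (predict_ova(X) == y).mean())
```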
2.
Gijs Rennen. Structural and Multidisciplinary Optimization, 2009, 38(6): 545-569
When building a Kriging model, the general intuition is that using more data will always result in a better model. However, we show that when a large non-uniform dataset is available, using a uniform subset can have several advantages. Reducing the time needed to fit the model, avoiding numerical inaccuracies, and improving robustness with respect to errors in the output data are some of the aspects that can be improved by using a uniform subset. We furthermore describe several new and existing methods for selecting a uniform subset. These methods are tested and compared on several artificial datasets and one real-life dataset. The comparison shows how the selected subsets affect different aspects of the resulting Kriging model. As none of the subset selection methods performs best on all criteria, the best method to choose depends on how the different aspects are valued. The comparison made in this paper can be used to help the user make a good choice.
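One plausible way to pick an approximately uniform subset from a large non-uniform dataset is a greedy maximin rule: repeatedly add the point farthest from the points selected so far. This sketch is only one possible selection rule and is not necessarily among the methods compared in the paper.

```python
# Greedy maximin ("space-filling") subset selection for a large non-uniform dataset.
import numpy as np

def maximin_subset(X, k, seed=0):
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]                 # random starting point
    dist = np.linalg.norm(X - X[selected[0]], axis=1)      # distance to the subset
    for _ in range(k - 1):
        idx = int(np.argmax(dist))                         # farthest remaining point
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(X - X[idx], axis=1))
    return np.array(selected)

X = np.random.default_rng(1).normal(size=(5000, 3))        # non-uniform input data
subset = maximin_subset(X, k=200)                          # indices to fit Kriging on
print(subset.shape)
```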
3.
Asil Oztekin. Information Systems Frontiers, 2018, 20(2): 223-238
This study aims to determine the future net share inflows and outflows of Exchange Traded Funds (ETFs). Net flows are closely related to investor perception of the future and past performance of mutual funds. The net flows for Exchange Traded Funds are expected to be less related to overall fund performance and based instead on the characteristics of the fund that make it attractive to an individual investor. In order to explore the relationship between investors' perception of ETFs and subsequent net flows, this study is designed to shed light on the multifaceted linkages between fund characteristics and net flows. A meta-classification predictive modeling approach is designed for use with large data sets, and its implementation and results are discussed. A careful selection of fifteen attributes from each fund, the most likely contributors to fund inflows and outflows, is used in the analyses. The large data set calls for a robust, systematic approach to identifying the attributes of the funds that best predict future inflows and outflows. The predictive performance of the proposed decision analytic methodology was assessed via 10-fold cross-validation, which yielded very promising results.
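A minimal sketch of assessing a meta-classifier with 10-fold cross-validation on fifteen fund attributes. The synthetic data, the voting ensemble used as the meta-classifier, and the base learners are stand-ins; the paper's actual meta-classification design is not reproduced here.

```python
# 10-fold cross-validation of a simple voting meta-classifier on synthetic data
# with fifteen "fund attribute" features (placeholders for the real attributes).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=15, n_informative=8, random_state=0)
meta = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5)),
    ("rf", RandomForestClassifier(n_estimators=100)),
])
scores = cross_val_score(meta, X, y, cv=10)        # 10-fold cross-validation
print(scores.mean(), scores.std())
```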
4.
Robert Van Dam, Irene Langkilde-Geary, Dan Ventura. Knowledge and Information Systems, 2013, 35(3): 525-552
The ADtree, a data structure useful for caching sufficient statistics, has been successfully adapted to grow lazily when memory is limited and to update sequentially with an incrementally updated dataset. However, even these modified forms of the ADtree still exhibit inefficiencies in both space usage and query time, particularly on datasets with very high dimensionality and high-arity features. We propose four modifications to the ADtree, each of which can be used to improve size and query time for specific types of datasets and features. These modifications also provide an increased ability to precisely control how an ADtree is built and to tune its size given external memory or speed requirements.
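To make the role of the ADtree concrete, the toy cache below computes counts of attribute-value conjunctions (the sufficient statistics) lazily and memoizes them, in the spirit of a lazily grown ADtree. It is an illustrative simplification, not the ADtree structure or any of the four proposed modifications.

```python
# Lazy, memoized count cache for attribute-value conjunctions over categorical data.
from functools import lru_cache
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 4, size=(10000, 8))        # 8 categorical features, arity 4

@lru_cache(maxsize=None)
def count(query):
    # query is a tuple of (attribute_index, value) pairs, e.g. ((0, 1), (3, 2)).
    mask = np.ones(len(data), dtype=bool)
    for attr, value in query:
        mask &= data[:, attr] == value
    return int(mask.sum())

print(count(((0, 1), (3, 2))))                    # computed on first use
print(count(((0, 1), (3, 2))))                    # served from the cache afterwards
```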
5.
6.
Kernel matching pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of basis functions selected from a kernel-induced dictionary. Here we propose a modification of the kernel matching pursuit algorithm that aims at making the method practical for large datasets. Starting from an approximating algorithm, the weak greedy algorithm, we introduce a stochastic method for reducing the search space at each iteration. We then study the implications of using an approximate algorithm and show how one can control the trade-off between accuracy and resource requirements. Finally, we present experiments performed on a large dataset that support our approach and illustrate its applicability.
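The sketch below illustrates the stochastic search-space reduction idea: at each greedy iteration only a random subsample of the kernel dictionary is scanned for the basis function most correlated with the residual. The Gaussian kernel, the subsample size and the toy target are assumptions for illustration, not the paper's configuration.

```python
# Kernel matching pursuit with a stochastic reduction of the dictionary search space.
import numpy as np

def rbf(X, centers, gamma=0.5):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def stochastic_kmp(X, y, n_terms=20, candidates=50, seed=0):
    rng = np.random.default_rng(seed)
    residual = y.astype(float).copy()
    centers, weights = [], []
    for _ in range(n_terms):
        idx = rng.choice(len(X), size=candidates, replace=False)   # random sub-dictionary
        basis = rbf(X, X[idx])                                     # candidate basis functions
        corr = basis.T @ residual
        norms = (basis ** 2).sum(axis=0)
        best = int(np.argmax(corr ** 2 / norms))                   # best fit to the residual
        w = corr[best] / norms[best]
        residual -= w * basis[:, best]
        centers.append(X[idx[best]])
        weights.append(w)
    return np.array(centers), np.array(weights)

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = np.sign(X[:, 0] * X[:, 1])                                     # toy discriminant target
centers, weights = stochastic_kmp(X, y)
print("training accuracy:", (np.sign(rbf(X, centers) @ weights) == y).mean())
```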
7.
Gaussian process classification is a supervised learning method that has attracted wide attention in machine learning in recent years. Under a Gaussian process prior, the algorithm maximizes the posterior probability to obtain, for a new sample, both a predicted value and the probability associated with that value. Taking the characteristics of image data into account, this paper proposes a method for applying Gaussian processes to image classification, and on that basis presents a scheme for ranking images. Experiments on public image datasets, together with a comparison against a support vector machine classifier, confirm the effectiveness of the method and provide a useful reference for improving image classification techniques.
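A minimal sketch of Gaussian process image classification with a confidence-based ranking, assuming scikit-learn's GaussianProcessClassifier and the small built-in digits dataset as stand-ins for the paper's data and model.

```python
# Gaussian process classification of image vectors, with images ranked by the
# posterior probability of their predicted class.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

digits = load_digits(n_class=3)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

gpc = GaussianProcessClassifier(kernel=RBF(length_scale=10.0), random_state=0)
gpc.fit(X_train, y_train)

proba = gpc.predict_proba(X_test)                 # posterior class probabilities
pred = proba.argmax(axis=1)
ranking = np.argsort(-proba.max(axis=1))          # most confident images first
print("accuracy:", (pred == y_test).mean(), "top-ranked test images:", ranking[:5])
```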
8.
Crime activities are geospatial phenomena and as such are geospatially, thematically and temporally correlated. We analyze crime datasets in conjunction with socio-economic and socio-demographic factors to discover co-distribution patterns that may contribute to the formulation of crime. We propose a graph based dataset representation that allows us to extract patterns from heterogeneous areal aggregated datasets and visualize the resulting patterns efficiently. We demonstrate our approach with real crime datasets and provide a comparison with other techniques.
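As a very small illustration of analyzing co-distribution across areal units, the snippet below treats attributes as graph nodes and links two attributes whose values are strongly correlated across areas. The synthetic data and the correlation threshold are assumptions; the paper's graph-based representation and pattern extraction are considerably richer.

```python
# Toy co-distribution graph: nodes are attributes measured per areal unit,
# edges connect attribute pairs whose values are strongly correlated across areas.
import numpy as np

rng = np.random.default_rng(0)
n_areas = 200
income = rng.normal(50, 10, n_areas)
unemployment = 20 - 0.2 * income + rng.normal(0, 1, n_areas)
burglary = 5 + 0.5 * unemployment + rng.normal(0, 1, n_areas)
attributes = {"income": income, "unemployment": unemployment, "burglary": burglary}

edges = []
names = list(attributes)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = np.corrcoef(attributes[a], attributes[b])[0, 1]
        if abs(r) > 0.5:                          # threshold for a co-distribution edge
            edges.append((a, b, round(float(r), 2)))
print(edges)
```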
9.
In data warehousing applications, numerous OLAP queries involve the processing of holistic aggregators such as computing the top n, the median, quantiles, etc. In this paper, we present a novel approach called dynamic bucketing to efficiently evaluate these aggregators. We partition data into equi-width buckets and further partition dense buckets into sub-buckets as needed by allocating and reclaiming memory space. The bucketing process dynamically adapts to the order and distribution of the input datasets. The histograms of the buckets and sub-buckets are stored in our new data structure called structure trees. A recent selection algorithm based on regular sampling is generalized and its analysis extended. We have also compared our new algorithms with this generalized algorithm and several other recent algorithms. Experimental results show that our new algorithms significantly outperform prior ones not only in runtime but also in accuracy.
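The sketch below shows the basic bucketing idea on one holistic aggregator, the median: values are counted into equi-width buckets, the bucket containing the median is located from the histogram, and only that bucket's values are materialized for exact selection. The paper's dynamic bucketing splits dense buckets and adapts memory during a single pass; this two-pass version is just an illustration.

```python
# Locate the median via equi-width bucket counts, then refine inside one bucket.
import numpy as np

def bucketed_median(values, n_buckets=64):
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_buckets + 1)
    counts, _ = np.histogram(values, bins=edges)

    target = (len(values) - 1) // 2                       # rank of the lower median
    cum = np.cumsum(counts)
    b = int(np.searchsorted(cum, target + 1))             # bucket holding that rank

    inside = values[(values >= edges[b]) & (values <= edges[b + 1])]
    skipped = cum[b - 1] if b > 0 else 0                  # values in earlier buckets
    return np.sort(inside)[target - skipped]

data = np.random.default_rng(0).lognormal(size=100001)   # skewed input distribution
print(bucketed_median(data), np.median(data))             # the two values agree
```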
10.
João P. Papa, Alexandre X. Falcão, Victor Hugo C. de Albuquerque, João Manuel R. S. Tavares. Pattern Recognition, 2012, 45(1): 512-520
Today's data acquisition technologies produce large datasets with millions of samples for statistical analysis. This creates a tremendous challenge for pattern recognition techniques, which need to become more efficient without losing effectiveness. We have tried to circumvent the problem by reducing it to the fast computation of an optimum-path forest (OPF) in a graph derived from the training samples. In this forest, each class may be represented by multiple trees rooted at representative samples. The forest is a classifier that assigns to a new sample the label of its most strongly connected root. The methodology has been successfully used with different graph topologies and learning techniques. In this work, we focus on one of the supervised approaches, which has offered considerable advantages over Support Vector Machines and Artificial Neural Networks in handling large datasets. We propose (i) a new algorithm that speeds up classification and (ii) a solution that reduces the training set size with negligible effects on classification accuracy, further increasing efficiency. Experimental results show the improvements with respect to our previous approach and the advantages over other existing methods, making the new method a valuable contribution for large dataset analysis.
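For orientation, the snippet below is a compact, simplified optimum-path forest classifier: prototypes are approximated as, for each class, the training sample closest to another class (the original method derives prototypes from a minimum spanning tree), the path cost is the maximum arc weight along the path, and a test sample receives the label of the root it is most strongly connected to. It is a didactic sketch, not the paper's optimized algorithms.

```python
# Simplified optimum-path forest: f_max path cost, Dijkstra-like conquering,
# classification by the most strongly connected tree root.
import numpy as np
from scipy.spatial.distance import cdist

def opf_fit(X, y):
    d = cdist(X, X)
    prototypes = []
    for c in np.unique(y):
        inside, outside = np.where(y == c)[0], np.where(y != c)[0]
        # Approximate prototype: the class sample nearest to any other class.
        prototypes.append(inside[np.argmin(d[np.ix_(inside, outside)].min(axis=1))])
    cost = np.full(len(X), np.inf)
    label = y.copy()
    cost[prototypes] = 0.0
    done = np.zeros(len(X), dtype=bool)
    for _ in range(len(X)):                        # conquer samples in cost order
        s = int(np.argmin(np.where(done, np.inf, cost)))
        done[s] = True
        new_cost = np.maximum(cost[s], d[s])       # f_max cost of paths through s
        better = (~done) & (new_cost < cost)
        cost[better] = new_cost[better]
        label[better] = label[s]
    return X, cost, label

def opf_predict(model, X_new):
    X, cost, label = model
    path = np.maximum(cost[None, :], cdist(X_new, X))
    return label[np.argmin(path, axis=1)]          # label of the best connection

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print("training accuracy:", (opf_predict(opf_fit(X, y), X) == y).mean())
```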
11.
Quentin Baert, Anne-Cécile Caron, Maxime Morge, Jean-Christophe Routier. Knowledge and Information Systems, 2018, 54(3): 591-615
MapReduce is a design pattern for processing large datasets distributed over a cluster. Its performance is linked to the data structure and the runtime environment. Indeed, data skew can yield an unfair task allocation, but even when the initial allocation produced by the partition function is well balanced, an unfair allocation can occur during the reduce phase due to the heterogeneous performance of the nodes. For these reasons, we propose an adaptive multi-agent system. In our approach, the reducer agents interact during the job, and task reallocation is based on negotiation in order to decrease the workload of the most loaded reducer and thus the runtime. In this paper, we propose and evaluate two negotiation strategies. Finally, we evaluate our multi-agent system with real-world datasets over a heterogeneous runtime environment.
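The toy simulation below captures the flavour of negotiation-based reallocation: reducers with heterogeneous speeds let the most loaded one offer its largest task to the least loaded one, and the transfer is accepted only if it lowers the estimated makespan. The numbers, the offer-largest-task rule and the acceptance criterion are illustrative assumptions, not the two strategies studied in the paper.

```python
# Toy negotiation between reducer agents with heterogeneous speeds.
import random

random.seed(0)
speeds = [1.0, 0.5, 2.0]                                   # heterogeneous nodes
tasks = [[random.uniform(1, 10) for _ in range(20)] for _ in speeds]

def finish_time(i):
    return sum(tasks[i]) / speeds[i]

for _ in range(200):                                       # negotiation rounds
    loads = [finish_time(i) for i in range(len(speeds))]
    src, dst = loads.index(max(loads)), loads.index(min(loads))
    if src == dst or not tasks[src]:
        break
    task = max(tasks[src])                                 # offer the largest task
    new_src = loads[src] - task / speeds[src]
    new_dst = loads[dst] + task / speeds[dst]
    if max(new_src, new_dst) < loads[src]:                 # accept only if makespan drops
        tasks[src].remove(task)
        tasks[dst].append(task)
    else:
        break                                              # no improving move left

print([round(finish_time(i), 1) for i in range(len(speeds))])
```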
12.
E-mail foldering, or e-mail classification into user-predefined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e. the number of folders), the different number of e-mails per class state, and the fact that this is a dynamic problem, in the sense that e-mails arrive in our mail folders following a time-line. Perhaps because of these problems, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem, and propose a new method based on learning and sampling probability distributions. Our experiments on a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For the sake of completeness, our experimental study also compares this with another standard balancing method (SMOTE) and with other classifiers.
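A minimal sketch of the "balance first, then classify" idea on a toy corpus: minority folders are randomly oversampled to the size of the largest folder before training Naive Bayes Multinomial. The paper's method learns and samples probability distributions rather than plainly resampling, and the tiny corpus here is purely illustrative.

```python
# Oversample minority folders, then train Naive Bayes Multinomial on word counts.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample

emails = ["budget meeting monday", "quarterly budget report", "lunch on friday",
          "trade confirmation attached", "gas trading desk update",
          "weekend plans", "family dinner", "project deadline budget"]
folders = np.array(["work", "work", "personal", "trading", "trading",
                    "personal", "personal", "work"])

vec = CountVectorizer()
X = vec.fit_transform(emails)
largest = int(max(np.sum(folders == f) for f in np.unique(folders)))

X_parts, y_parts = [], []
for f in np.unique(folders):
    idx = np.where(folders == f)[0]
    idx = resample(idx, n_samples=largest, random_state=0)   # oversample this folder
    X_parts.append(X[idx])
    y_parts.extend([f] * largest)

clf = MultinomialNB().fit(vstack(X_parts), y_parts)
print(clf.predict(vec.transform(["budget for the project"])))
```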
13.
ReliefF has proved to be a successful feature selector, but it is computationally expensive when handling a large dataset. We present an optimization using Supervised Model Construction which improves starter selection. Effectiveness has been evaluated using 12 UCI datasets and a clinical diabetes database. Experiments indicate that, compared with ReliefF, the proposed method improves computational efficiency whilst maintaining classification accuracy. On the clinical dataset (20,000 records with 47 features), feature selection via Supervised Model Construction (FSSMC) reduced the processing time by 80% compared to ReliefF and maintained accuracy for the Naive Bayes, IB1 and C4.5 classifiers.
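For reference, a compact ReliefF for numeric features is sketched below: sampled instances look up their k nearest hits and misses and shift feature weights toward features that separate the classes. It shows only the baseline ReliefF weight update; FSSMC's supervised starter selection is not reproduced, and the simplification (no class-prior weighting of misses) is an assumption.

```python
# Compact ReliefF weight update for numeric features (two-class toy data).
import numpy as np
from scipy.spatial.distance import cdist

def relieff(X, y, n_samples=100, k=5, seed=0):
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12         # per-feature value range
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)
    weights = np.zeros(X.shape[1])
    for i in rng.choice(len(X), size=n_samples, replace=False):
        hits = np.where(y == y[i])[0]
        misses = np.where(y != y[i])[0]
        hits = hits[np.argsort(d[i, hits])[:k]]           # k nearest same-class
        misses = misses[np.argsort(d[i, misses])[:k]]     # k nearest other-class
        weights -= np.abs(X[i] - X[hits]).mean(axis=0) / span
        weights += np.abs(X[i] - X[misses]).mean(axis=0) / span
    return weights / n_samples

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)   # only feature 0 matters
print(np.round(relieff(X, y), 3))                             # weight of feature 0 stands out
```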
14.
15.
Information Systems, 2005, 30(5): 333-348
The tree index structure is a traditional method for searching for similar data in large datasets. It is based on the assumption that most sub-trees can be pruned during the search, so that the number of page accesses is reduced. However, time-series datasets generally have very high dimensionality, and because of the so-called curse of dimensionality, pruning effectiveness is reduced in high dimensions. Consequently, the tree index structure is not a suitable method for time-series datasets. In this paper, we propose a two-phase (filtering and refinement) method for searching time-series datasets. In the filtering step, a quantized representation of the time series is used to construct a compact file, which is scanned to filter out irrelevant sequences. A small set of candidates is passed to the second step for refinement. In this step, we introduce an effective index compression method named grid-based datawise dimensionality reduction (DRR), which attempts to preserve the characteristics of the time series. An experimental comparison with existing techniques demonstrates the utility of our approach.
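The filter-and-refine flow can be sketched with a piecewise-aggregate reduction whose distance is a known lower bound of the Euclidean distance: the reduced representation prunes most series, and only the survivors are checked exactly. The PAA representation used here is a stand-in for the paper's grid-based datawise dimensionality reduction (DRR).

```python
# Two-phase (filter, then refine) 1-nearest-neighbour search over time series.
import numpy as np

def paa(x, segments):
    return x.reshape(segments, -1).mean(axis=1)            # piecewise aggregate

def search(database, query, segments=8):
    n = len(query)
    reduced = np.array([paa(x, segments) for x in database])
    # Filtering: a distance in the reduced space that lower-bounds the Euclidean distance.
    lower = np.sqrt(n / segments) * np.linalg.norm(reduced - paa(query, segments), axis=1)
    best, best_dist = None, np.inf
    for i in np.argsort(lower):                             # most promising first
        if lower[i] >= best_dist:
            break                                           # remaining candidates pruned
        dist = np.linalg.norm(database[i] - query)          # refinement: exact distance
        if dist < best_dist:
            best, best_dist = int(i), dist
    return best, best_dist

rng = np.random.default_rng(0)
database = rng.normal(size=(5000, 128)).cumsum(axis=1)      # random-walk time series
query = database[42] + rng.normal(scale=0.1, size=128)
print(search(database, query))                              # finds series 42
```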
16.
17.
Advances in Engineering Software, 1999, 30(6): 389-400
This article describes algorithms and data structures for the fast construction of three-dimensional triangulations from large sets of scattered data points. The triangulations have a guaranteed error bound, i.e. all the data points lie within a pre-specified distance from the triangulation. Three different methods for choosing triangulation vertices are presented, based on interpolation and on L2- and L∞-optimization of the error over subsets of the data points. The main focus of this article is on devising a simple and fast algorithm for constructing an approximating triangulation of a very large set of points. We propose the use of adapted dynamic data structures and extensive caching of information to speed up the computation, and show how the method can be extended to approximate multiple dependent datasets in higher-dimensional approximation problems.
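A small sketch of greedy, interpolation-based vertex selection with a guaranteed error bound, written for 2.5D scattered data and relying on scipy for the triangulation and linear interpolation: points are inserted one by one, worst error first, until every data point lies within a tolerance of the piecewise-linear surface. The dimension, seeding rule and tolerance are assumptions, and the paper's dynamic data structures and caching are not reproduced.

```python
# Greedy insertion of the worst-error point until the error bound `tol` is met.
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)
xy = rng.uniform(0, 1, size=(2000, 2))
z = np.sin(3 * xy[:, 0]) * np.cos(3 * xy[:, 1])               # scattered height data

# Seed with the data points nearest to the domain corners.
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
selected = list({int(np.argmin(((xy - c) ** 2).sum(axis=1))) for c in corners})

tol = 0.05
for _ in range(len(xy)):
    surface = LinearNDInterpolator(xy[selected], z[selected])  # current triangulation
    err = np.abs(surface(xy) - z)
    err = np.where(np.isnan(err), np.inf, err)                 # points outside the hull
    worst = int(np.argmax(err))
    if err[worst] <= tol:
        break                                                  # error bound satisfied
    selected.append(worst)

print(len(selected), "vertices approximate", len(xy), "points within", tol)
```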
18.
Task assignment in distributed server systems concerns the policy that assigns the tasks arriving at these systems, in order to improve the response time. These tasks generally have the property that a tiny fraction (about 3%) of them, the large tasks, accounts for half (50%) of the total load. This property creates additional problems: the large tasks make the load difficult to balance among the servers, and the small tasks are delayed by the large ones when they share the same queue. In this paper, we propose a new policy for Web clusters, called Partitioning Large Tasks (PLT), which deals with these problems, particularly under high traffic demand and high variability of task sizes. PLT partitions each large task into fragments that are processed in parallel and complete at the same time, improving the mean response time, and it separates the small tasks from the large ones so that they are not delayed. Performance tests show a significant improvement of PLT over existing task assignment policies.
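The fragment-sizing idea can be illustrated with a small calculation: given each server's current backlog and speed, fragments of a large task are sized so that they all complete at the same time. The numbers are hypothetical, and the sketch assumes every server receives a positive share (true when the task is large enough relative to the backlogs); the full PLT policy also keeps small tasks in separate queues.

```python
# Size the fragments of a large task so that all servers finish at the same time T.
import numpy as np

def split_large_task(task_size, backlogs, speeds):
    # Server i finishes its fragment at backlog_i + fragment_i / speed_i = T,
    # and the fragments must add up to the task size:
    #   sum_i (T - backlog_i) * speed_i = task_size.
    T = (task_size + (backlogs * speeds).sum()) / speeds.sum()
    fragments = (T - backlogs) * speeds
    return T, fragments

backlogs = np.array([2.0, 5.0, 1.0])          # seconds of queued work per server
speeds = np.array([1.0, 1.0, 2.0])            # service rates
T, fragments = split_large_task(40.0, backlogs, speeds)
print(round(T, 2), np.round(fragments, 2), fragments.sum())   # T = 12.25, fragments sum to 40
```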
19.
20.
Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either numeric or categorical attributes. However, datasets with mixed types of attributes are common in real-life data mining applications. In this article, we present two algorithms that extend the Squeezer algorithm to domains with mixed numeric and categorical attributes. The performance of the two algorithms has been studied on real and artificially generated datasets. Comparisons with other clustering algorithms illustrate the superiority of our approaches.
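A one-pass, Squeezer-style sketch for mixed data is given below: each record joins the existing cluster whose similarity exceeds a threshold, or starts a new cluster otherwise. The similarity (a fixed blend of categorical matches and a normalized numeric distance), the threshold, and the toy records are assumptions; the two algorithms in the article define cluster summaries and similarities differently.

```python
# One-pass clustering of records with mixed categorical and numeric attributes.
import numpy as np

def mixed_similarity(record, cluster, num_range):
    cat_r, num_r = record
    cat_sims, num_sims = [], []
    for cat_c, num_c in cluster:
        cat_sims.append(np.mean([a == b for a, b in zip(cat_r, cat_c)]))
        num_sims.append(1 - np.abs(np.array(num_r) - np.array(num_c)).mean() / num_range)
    return 0.5 * np.mean(cat_sims) + 0.5 * np.mean(num_sims)

def squeezer_mixed(records, threshold=0.7, num_range=100.0):
    clusters = []
    for record in records:
        sims = [mixed_similarity(record, c, num_range) for c in clusters]
        if sims and max(sims) >= threshold:
            clusters[int(np.argmax(sims))].append(record)   # join the best cluster
        else:
            clusters.append([record])                       # start a new cluster
    return clusters

records = [(("red", "suv"), (20.0,)), (("red", "suv"), (25.0,)),
           (("blue", "sedan"), (80.0,)), (("blue", "sedan"), (78.0,)),
           (("red", "sedan"), (22.0,))]
print([len(c) for c in squeezer_mixed(records)])            # cluster sizes
```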