Similar Documents
20 similar documents found (search time: 31 ms)
1.
Learning from data that are too big to fit into memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data by varying the value of n (the number of super parents), and is therefore especially appropriate for learning from large quantities of data. The memory requirement of AnDE, however, increases combinatorially with the number of attributes and the parameter n. In large-data learning, the number of attributes is often large, and a high n is also desirable to achieve low-bias classification. In order to achieve the lower bias of AnDE with higher n but with a smaller memory requirement, we propose a memory-constrained selective AnDE algorithm that involves two passes of learning through the training examples. The first pass performs attribute selection on super parents according to the available memory, whereas the second learns an AnDE model with super parents restricted to the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A\(n'\)DE, where \(n' = n-1\), while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data; numerical data sets need to be discretized first.
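A minimal sketch of the two-pass idea, assuming a mutual-information ranking for the first pass; the abstract only says that selection is driven by the available memory, so both the ranking criterion and the memory estimate below are illustrative placeholders, not the paper's method.

```python
import itertools
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X;Y) estimated from two aligned lists of categorical values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_super_parents(rows, labels, n_super, memory_budget):
    """Pass 1: rank attributes by MI with the class and keep as many top
    attributes as the (crude, assumed) memory estimate allows."""
    n_attrs = len(rows[0])
    ranked = sorted(range(n_attrs),
                    key=lambda a: mutual_information([r[a] for r in rows], labels),
                    reverse=True)
    selected = []
    for a in ranked:
        candidate = selected + [a]
        # stand-in memory estimate: one count table per super-parent combination
        cells = len(list(itertools.combinations(candidate, n_super))) * 1000
        if cells > memory_budget:
            break
        selected = candidate
    return selected

def learn_counts(rows, labels, super_parents, n_super):
    """Pass 2: accumulate joint counts (class, super-parent values, child value)."""
    counts = Counter()
    for row, y in zip(rows, labels):
        for sp in itertools.combinations(super_parents, n_super):
            sp_vals = tuple(row[a] for a in sp)
            for child, v in enumerate(row):
                if child in sp:
                    continue
                counts[(y, sp, sp_vals, child, v)] += 1
    return counts
```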

2.
Some supervised tasks are presented with a numerical output, but decisions have to be made in a discrete, binarised way according to a particular cutoff. This binarised regression task is a very common situation that requires its own analysis, different from regression, classification, and ordinal regression. We first investigate the application cases in terms of the information about the distribution and range of the cutoffs and distinguish six possible scenarios, some of which are more common than others. Next, we study two basic approaches: the retraining approach, which discretises the training set whenever the cutoff is available and learns a new classifier from it, and the reframing approach, which learns a regression model and sets the cutoff when this becomes available during deployment. In order to assess the binarised regression task, we introduce context plots featuring error against cutoff. Two special cases are of interest, the \( UCE \) and \( OCE \) curves: the area under the former is the mean absolute error, while the area under the latter is a new metric that lies between a ranking measure and a residual-based measure. A comprehensive evaluation of the retraining and reframing approaches is performed using a repository of binarised regression problems created for this purpose, concluding that neither method is clearly better than the other, except when the size of the training data is small.
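A minimal sketch of the two approaches with scikit-learn, assuming a numeric target y and a single cutoff known either at training time (retraining) or only at deployment (reframing); the data, models, and names are illustrative, not the paper's benchmark.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)
cutoff = 0.0

# Retraining: discretise the target at the cutoff and fit a classifier.
clf = LogisticRegression().fit(X, (y >= cutoff).astype(int))
retrain_pred = clf.predict(X)

# Reframing: fit one regression model, apply the cutoff only at deployment.
reg = LinearRegression().fit(X, y)
reframe_pred = (reg.predict(X) >= cutoff).astype(int)
```

Note that retraining must fit a new classifier for every cutoff value, whereas reframing fits a single regressor once and only thresholds its predictions.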

3.
A 3D binary image I can be naturally represented by a combinatorial-algebraic structure called a cubical complex, denoted Q(I), whose basic building blocks are vertices, edges, square faces and cubes. In Gonzalez-Diaz et al. (Discret Appl Math 183:59–77, 2015), we presented a method to “locally repair” Q(I) to obtain a polyhedral complex P(I) (whose basic building blocks are vertices, edges, specific polygons and polyhedra), homotopy equivalent to Q(I) and such that its boundary surface is a 2D manifold. P(I) is called a well-composed polyhedral complex over the picture I. In addition, we developed a new codification system for P(I), encoding the geometric information of the cells of P(I) in the form of a 3D grayscale image, and the boundary face relations of the cells of P(I) in the form of a set of structuring elements. In this paper, we build upon Gonzalez-Diaz et al. (2015) and prove that, to retrieve the topological and geometric information of P(I), it is enough to store just one 3D point per polyhedron, so that neither the grayscale image nor the set of structuring elements is needed. From this “minimal” codification of P(I), we finally present a method to compute the 2-cells in the boundary surface of P(I).

4.
The Drebin dataset (in: NDSS, 2014) is the most widely distributed academic dataset of Android malware, and therefore the most used dataset in research papers on Android malware detection; the research community uses it for evaluating and comparing algorithms. We discovered that 49.35% of the samples in this dataset have at least one other sample that is a repackaged version containing exactly the same sequence of opcodes. In all cases, the only differences between the original malware and the duplicates are the embedded resources and some strings in the code. When assessing the performance of malware detectors or classifiers, a part of the dataset is held out for testing, so a major part of the testing set ends up being the same samples that were used in the training set. This situation can lead us, the research community, to overrate the performance of the algorithms we are designing; in the worst case, it leads to wrong conclusions and wrong directions for future research. We therefore conduct an experiment in which we test several classification algorithms on the Drebin dataset with and without the duplicates. Our results show that, depending on the classifier, the full dataset can lead to an inaccuracy that is moderately (124%) to strongly (172%) underrated, and that the relative ranking of the algorithms is modified. Finally, we provide the list of unique malware samples from the Drebin dataset, available on GitHub.
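A minimal sketch of the duplicate-removal step, assuming each sample has already been disassembled into its ordered opcode sequence; hashing the sequence keeps one representative per identical sequence. Sample ids and opcodes below are made up for illustration.

```python
import hashlib

def deduplicate_by_opcodes(samples):
    """samples: dict mapping sample_id -> list of opcode strings.
    Returns the ids of one representative per distinct opcode sequence."""
    seen = {}
    for sample_id, opcodes in samples.items():
        digest = hashlib.sha256("\n".join(opcodes).encode()).hexdigest()
        seen.setdefault(digest, sample_id)   # keep the first sample with this sequence
    return sorted(seen.values())

unique_ids = deduplicate_by_opcodes({
    "mal_001": ["invoke-virtual", "move-result", "return"],
    "mal_002": ["invoke-virtual", "move-result", "return"],   # repackaged duplicate
    "mal_003": ["const-string", "invoke-static", "return-void"],
})
print(unique_ids)   # ['mal_001', 'mal_003']
```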

5.
The problem of determining the maximum mutual information I(X; Y) and minimum joint entropy H(X, Y) of a pair of discrete random variables X and Y is considered under the condition that the probability distribution of X is fixed and the error probability Pr{Y ≠ X} takes a given value ε, 0 ≤ ε ≤ 1. Precise values for these quantities are found, which in several cases allows us to obtain explicit formulas for both the maximum information and the minimum entropy in terms of the probability distribution of X and the parameter ε.
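For orientation, the two extremal problems are connected by the standard identities below: with the distribution of X (and hence H(X)) fixed, maximizing I(X;Y) amounts to minimizing H(X|Y), while minimizing H(X,Y) amounts to minimizing H(Y|X). These are textbook identities, not the paper's closed-form results.

$$I(X;Y) = H(X) - H(X \mid Y), \qquad H(X,Y) = H(X) + H(Y \mid X).$$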

6.
Recent years have witnessed the rapid growth of text data, and thus the increasing importance of in-depth analysis of text data for various applications. Text data are often organized in a database with documents labeled by attributes such as time and location. Different documents manifest different topics, the topics of the documents may change along the attributes of the documents, and such changes have been the subject of research in the past. However, previous analysis techniques, such as topic detection and tracking, topic lifetime, and burstiness, all focus on the topic behavior of the documents in a given attribute range without contrasting it with the documents in the overall range. This paper introduces the concept of unique topics, referring to those topics that appear frequently only within a small range of documents but not in the whole range. These unique topics may reflect some unique characteristics of documents in this small range that are not found outside of it. The paper presents an efficient pruning-based algorithm that, for a user-given set of keywords and a user-given attribute, finds the maximal ranges along the given attribute and their unique topics that are highly related to the given keyword set. Thorough experiments show that the algorithm is effective in various scenarios.

7.
We obtain the conditions for the emergence of the swarm intelligence effect in an interactive game of restless multi-armed bandit (rMAB). A player competes with multiple agents. Each bandit has a payoff that changes with a probability \(p_c\) per round. The agents and the player choose one of three options: (1) Exploit (a good bandit), (2) Innovate (asocial learning for a good bandit among \(n_I\) randomly chosen bandits), and (3) Observe (social learning for a good bandit). Each agent has two parameters \((c, p_{obs})\) to specify the decision: (i) c, the threshold value for Exploit, and (ii) \(p_{obs}\), the probability of Observe in learning. The parameters \((c, p_{obs})\) are uniformly distributed. We determine the optimal strategies for the player using complete knowledge about the rMAB, show in which regions of the \((p_c, n_I)\) space social or asocial learning is more optimal, and define the swarm intelligence effect. We conduct a laboratory experiment (67 subjects) and observe the swarm intelligence effect only if \((p_c, n_I)\) are chosen so that social learning is far more optimal than asocial learning.
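A minimal sketch of an agent's per-round decision rule as described above; how payoffs are represented and how the threshold is compared are illustrative assumptions, not the paper's exact model.

```python
import random

def agent_decision(c, p_obs, known_best_payoff):
    """One round of an agent's choice in the rMAB game.
    c: threshold for Exploit; p_obs: probability of Observe when not exploiting."""
    if known_best_payoff is not None and known_best_payoff >= c:
        return "exploit"        # keep playing a bandit already known to be good
    if random.random() < p_obs:
        return "observe"        # social learning: copy another agent's good bandit
    return "innovate"           # asocial learning: search n_I randomly chosen bandits

# Agents' parameters (c, p_obs) are drawn uniformly, as in the setup above.
agents = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(10)]
actions = [agent_decision(c, p_obs, known_best_payoff=0.4) for c, p_obs in agents]
```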

8.
The authors solve problems of finding the greatest lower bounds for the probability F(\(\upsilon\)) − F(u), 0 < u < \(\upsilon\) < ∞, over the set of distribution functions F(x) of nonnegative random variables with unimodal density with mode m, u < m < \(\upsilon\), and with the first two moments fixed.

9.
In the projective plane PG(2, q), a subset S of a conic C is said to be almost complete if it can be extended to a larger arc in PG(2, q) only by the points of C \ S and by the nucleus of C when q is even. We obtain new upper bounds on the smallest size t(q) of an almost complete subset of a conic, in particular,
$$t(q) < \sqrt{q(3\ln q + \ln\ln q + \ln 3)} + \sqrt{\frac{q}{3\ln q}} + 4 \sim \sqrt{3q\ln q}, \qquad t(q) < 1.835\sqrt{q\ln q}.$$
The new bounds are used to extend the set of pairs (N, q) for which it is proved that every normal rational curve in the projective space PG(N, q) is a complete (q+1)-arc, or equivalently, that no \([q+1, N+1, q-N+1]_q\) generalized doubly-extended Reed–Solomon code can be extended to a \([q+2, N+1, q-N+2]_q\) maximum distance separable code.

10.
An outer-connected dominating set in a graph G = (V, E) is a set of vertices D ⊆ V such that every vertex v ∉ D is adjacent to some vertex in D and the subgraph induced by V ∖ D is connected. The outer-connected dominating set problem is to find an outer-connected dominating set with the minimum number of vertices, denoted by \(\tilde {\gamma }_{c}(G)\). In this paper, we determine \(\tilde {\gamma }_{c}(S(n,k))\), \(\tilde {\gamma }_{c}(S^{+}(n,k))\), \(\tilde {\gamma }_{c}(S^{++}(n,k))\), and \(\tilde {\gamma }_{c}(S_{n})\), where S(n, k), S^+(n, k), S^{++}(n, k), and S_n are Sierpiński-like graphs.
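A minimal checker for the definition above, assuming an undirected graph given as an adjacency dict; it verifies domination of V ∖ D and connectivity of the subgraph induced by V ∖ D. This only illustrates the definition, not the paper's formulas for Sierpiński-like graphs.

```python
from collections import deque

def is_outer_connected_dominating_set(adj, D):
    """adj: dict vertex -> set of neighbours; D: set of vertices."""
    V = set(adj)
    outside = V - D
    # every vertex outside D must have a neighbour in D
    if any(not (adj[v] & D) for v in outside):
        return False
    # the subgraph induced by V \ D must be connected (vacuously true if empty)
    if not outside:
        return True
    start = next(iter(outside))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u] & outside:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == outside

cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
print(is_outer_connected_dominating_set(cycle4, {0, 2}))  # False: {1, 3} is disconnected
print(is_outer_connected_dominating_set(cycle4, {0, 1}))  # True
```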

11.
Combining the ν-Twin Support Vector Regression (ν-TWSVR) with rough set theory, we propose an efficient Rough ν-Twin Support Vector Regression, called Rough ν-TWSVR for short. We construct a pair of optimization problems that are motivated by and mathematically derived from the related ν-TWSVR (Rastogi et al., Appl Intell 46(3):670–683, 2017) and Rough ν-SVR (Zhao et al., Expert Syst Appl 36(6):9793–9798, 2009). Rough ν-TWSVR not only utilizes more data information rather than only the extreme data points used in the ν-TWSVR, but also lets different points affect the regressor differently depending on their positions. The method implements structural risk minimization and automatically controls accuracies according to the structure of the data sets. In addition, two ε parameters are used to construct the rough tube for the upper (lower)-bound Rough ν-TWSVR instead of the single ε in the upper (lower)-bound ν-TWSVR. Moreover, this rough tube, consisting of a positive region, a boundary region, and a negative region, makes the feasible set of the Rough ν-TWSVR larger than that of the ν-TWSVR, so that the objective function of the Rough ν-TWSVR is no larger than that of the ν-TWSVR. Rough ν-TWSVR improves the generalization performance of the ν-TWSVR, especially for data sets with outliers. Experimental results on toy examples and benchmark data sets confirm the validity and applicability of the proposed Rough ν-TWSVR.

12.
This paper introduces α-systems of differential inclusions on a bounded time interval [t_0, ϑ] and defines α-weakly invariant sets in [t_0, ϑ] × ℝ^n, where ℝ^n is the phase space of the differential inclusions. We study the problems connected with bringing the motions (trajectories) of the differential inclusions from an α-system to a given compact set M ⊂ ℝ^n at the moment ϑ (the approach problems). The issues of extracting the solvability set W ⊂ [t_0, ϑ] × ℝ^n in the problem of bringing the motions of an α-system to M and the issues of calculating the maximal α-weakly invariant set W_c ⊂ [t_0, ϑ] × ℝ^n are also discussed. The notion of the quasi-Hamiltonian of an α-system (α-Hamiltonian) is proposed, which seems important for the problems of bringing the motions of the α-system to M.

13.
We present methods to construct transitive partitions of the set E^n of all binary vectors of length n into codes. In particular, we show that for all n = 2^k − 1, k ≥ 3, there exist transitive partitions of E^n into perfect transitive codes of length n.

14.
Continuous visible nearest neighbor query processing in spatial databases   (Total citations: 1, self-citations: 0, citations by others: 1)
In this paper, we identify and solve a new type of spatial query, called continuous visible nearest neighbor (CVNN) search. Given a data set P, an obstacle set O, and a query line segment q in a two-dimensional space, a CVNN query returns a set of \({\langle p, R\rangle}\) tuples such that \({p \in P}\) is the nearest neighbor to every point r along the interval \({R \subseteq q}\) and p is visible to r. Note that p may be NULL, meaning that all points in P are invisible to all points in R due to the obstruction of some obstacles in O. In contrast to the existing continuous nearest neighbor query, CVNN retrieval considers the impact of obstacles on the visibility between objects, which is ignored by most spatial queries. We formulate the problem, analyze its unique characteristics, and develop efficient algorithms for exact CVNN query processing. Our methods (1) utilize conventional data-partitioning indices (e.g., R-trees) on both P and O, (2) tackle the CVNN search by performing a single query for the entire query line segment, and (3) only access the data points and obstacles relevant to the final query result by employing a suite of effective pruning heuristics. In addition, several interesting variations of CVNN queries are introduced and can be supported by our techniques, which further demonstrates the flexibility of the proposed algorithms. A comprehensive experimental evaluation using both real and synthetic data sets has been conducted to verify the effectiveness of the proposed pruning heuristics and the performance of the proposed algorithms.

15.
Given a road network G = (V, E), where V and E denote the sets of vertices and edges in G, a set of points of interest P, and a query point q residing in G, the reverse furthest neighbors (Rfn_R) query in road networks fetches a set of points p ∈ P that take q as their furthest neighbor compared with all points in P ∪ {q}. This is the monochromatic Rfn_R (Mrfn_R) query. Another interesting version is the bichromatic reverse furthest neighbor (Brfn_R) query: given two sets of points P and Q, and a query point q ∈ Q, a Brfn_R query fetches a set of points p ∈ P that take q as their furthest neighbor compared with all points in Q. This paper presents efficient algorithms for both Mrfn_R and Brfn_R queries, which utilize landmarks and partitioning-based techniques. Experiments on real datasets confirm the efficiency and scalability of the proposed algorithms.
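A brute-force sketch of the monochromatic query for intuition only; it uses Euclidean distance instead of road-network (shortest-path) distance, and the landmark- and partitioning-based algorithms of the paper exist precisely to avoid this quadratic scan.

```python
from math import dist   # Python 3.8+

def monochromatic_rfn(P, q):
    """Return the points p in P whose furthest point among P ∪ {q} is q."""
    result = []
    for p in P:
        others = [x for x in P if x != p] + [q]
        furthest = max(others, key=lambda x: dist(p, x))
        if furthest == q:
            result.append(p)
    return result

P = [(0, 0), (1, 0), (2, 0), (9, 0)]
print(monochromatic_rfn(P, q=(10, 0)))   # [(0, 0), (1, 0), (2, 0)]
```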

16.
The addition sequence problem asks: given a set of numbers X = {n_1, n_2, . . . , n_m}, what is the minimal number of additions needed to compute all m numbers starting from 1? This problem is NP-complete. In this paper, we present a branch and bound algorithm that uses a new strategy to generate an addition sequence with a minimal number of elements for a set X. We then improve the generation by generalizing some results on addition chains (the case m = 1) to addition sequences and by finding what we call a presumed upper bound for each n_j, 1 ≤ j ≤ m, in the search tree.
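A minimal sketch for the single-target case (m = 1), using iterative-deepening DFS over star chains (each new term adds the previous term to an earlier one) with a simple doubling bound for pruning; this only illustrates the branch-and-bound flavour, not the paper's full addition-sequence algorithm, and the function name is made up.

```python
def shortest_addition_chain(target):
    """Iterative-deepening DFS for a shortest star addition chain ending at target."""
    def dfs(chain, limit):
        last = chain[-1]
        if last == target:
            return chain
        if len(chain) > limit:
            return None
        # bound: even doubling on every remaining step cannot reach the target
        if last * (2 ** (limit + 1 - len(chain))) < target:
            return None
        for x in reversed(chain):            # try the largest jumps first
            s = last + x
            if last < s <= target:
                found = dfs(chain + [s], limit)
                if found:
                    return found
        return None

    limit = 0
    while True:
        found = dfs([1], limit)
        if found:
            return found
        limit += 1

print(shortest_addition_chain(15))   # a 5-addition chain such as [1, 2, 4, 5, 10, 15]
```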

17.
The performance of a linear error-detecting code in a symmetric memoryless channel is characterized by its probability of undetected error, which is a function of the channel symbol error probability involving basic parameters of the code and its weight distribution. However, the weight distribution is known for relatively few codes, since its computation is an NP-hard problem. It is therefore useful to have criteria for properness and goodness in error detection that do not involve the code weight distribution. In this work we give two such criteria. We show that a binary linear code C of length n and its dual code C⊥ of minimum code distance d are proper for error detection whenever d ≥ ⌊n/2⌋ + 1, and that C is proper in the interval [(n + 1 − 2d)/(n − d); 1/2] whenever ⌊n/3⌋ + 1 ≤ d ≤ ⌊n/2⌋. We also provide examples, mostly of Griesmer codes and their duals, that satisfy the above conditions.
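For reference, the quantity being characterized is the standard undetected-error probability of a binary linear code on a binary symmetric channel with symbol error probability p, where A_i denotes the number of codewords of weight i; a code is commonly called proper when this function is nondecreasing on [0, 1/2]. This is the textbook expression, not a result specific to the paper.

$$P_{ue}(C, p) = \sum_{i=1}^{n} A_i \, p^{i} (1-p)^{n-i}.$$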

18.
Self-Organizing Networks (SON) add automation to the operation and maintenance of mobile networks. Self-healing is the SON function that performs automated troubleshooting. Among other functions, self-healing performs automatic diagnosis (or root cause analysis), that is, the task of identifying the most probable fault causes in problematic cells. For training the automatic diagnosis functionality based on decision-support systems, supervised learning algorithms usually extract the knowledge from a training set made up of solved troubleshooting cases. However, the lack of such sets of real solved cases is the bottleneck in the design of realistic diagnosis systems. In this paper, the properties of such troubleshooting cases and training sets are studied. Subsequently, a method based on model fitting is proposed to extract a statistical model that can be used to generate vectors that emulate the network behavior in the presence of faults. These emulated vectors can then be used to evaluate novel diagnosis systems. In order to evaluate the feasibility of the proposed approach, an LTE fault dataset has been modeled, based both on the analysis of real cases collected over two months and on a network simulator. In addition, the obtained baseline model can be very useful for the research community in the area of automatic diagnosis.

19.
A Multi Secret Sharing (MSS) scheme is an efficient method of transmitting more than one secret securely. In an (n, n)-MSS scheme, n secrets are used to create n shares, and all n shares are required for reconstruction. In state-of-the-art schemes, n secrets are used to construct n or n + 1 shares, but partial secret information can be recovered from fewer than n shares. There is therefore a need for an efficient and secure (n, n)-MSS scheme that satisfies the threshold property. In this paper, we propose three different (n, n)-MSS schemes: the first and second use Boolean XOR, and the third uses modular arithmetic. For quantitative analysis, similarity metrics as well as structural and differential measures are considered. The proposed scheme using modular arithmetic performs better than those using Boolean XOR. The proposed (n, n)-MSS schemes outperform existing techniques in terms of security, time complexity, and randomness of shares.
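A minimal sketch of the basic Boolean-XOR (n, n) construction that such schemes build on, shown here for a single secret (all n shares are required; any n−1 shares reveal nothing); the paper's multi-secret variants are more involved than this.

```python
import os
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def share_secret(secret: bytes, n: int):
    """Split `secret` into n shares; all n are needed to reconstruct."""
    shares = [os.urandom(len(secret)) for _ in range(n - 1)]
    shares.append(reduce(xor_bytes, shares, secret))   # last share fixes the XOR sum
    return shares

def reconstruct(shares):
    return reduce(xor_bytes, shares)

shares = share_secret(b"attack at dawn", n=4)
assert reconstruct(shares) == b"attack at dawn"
```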

20.
We consider the problem of determining the maximum and minimum of the Rényi divergence D_λ(P||Q) and D_λ(Q||P) for two probability distributions P and Q of discrete random variables X and Y, provided that the probability distribution P and the parameter α of the α-coupling between X and Y are fixed, i.e., provided that Pr{X = Y} = α.
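For reference, the Rényi divergence of order λ between discrete distributions P and Q is defined as below (the limit λ → 1 recovers the Kullback-Leibler divergence); the definition is standard, and the paper's contribution is the extremal values under the α-coupling constraint.

$$D_\lambda(P \,\|\, Q) = \frac{1}{\lambda - 1}\,\log \sum_{x} P(x)^{\lambda}\, Q(x)^{1-\lambda}, \qquad \lambda > 0,\ \lambda \neq 1.$$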

