首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior and in particular how different features influence the model prediction. This is important when interpreting the behavior of complex models or asserting that certain problematic attributes (such as race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the data set, without knowing how the models work. Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all. Our approach does not require the black-box model to be retrained. This is important if, for example, the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence such as feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available data sets and models. We also validate our procedure using techniques from interpretable learning and feature selection, as well as against other black-box auditing procedures. To further demonstrate the effectiveness of this technique, we use it to audit a black-box recidivism prediction algorithm.  相似文献   

2.
Distance automata are automata weighted over the semiring \((\mathbb {N}\cup \{\infty \},\min , +)\) (the tropical semiring). Such automata compute functions from words to \(\mathbb {N}\cup \{\infty \}\). It is known from Krob that the problems of deciding ‘ fg’ or ‘ f=g’ for f and g computed by distance automata is an undecidable problem. The main contribution of this paper is to show that an approximation of this problem is decidable. We present an algorithm which, given ε>0 and two functions f,g computed by distance automata, answers “yes” if f≤(1?ε)g, “no” if f≦?g, and may answer “yes” or “no” in all other cases. The core argument behind this quasi-decision procedure is an algorithm which is able to provide an approximated finite presentation of the closure under products of sets of matrices over the tropical semiring. Lastly, our theorem of affine domination gives better bounds on the precision of known decision procedures for cost automata, when restricted to distance automata.  相似文献   

3.
This paper presents a framework for compositional nonblocking verification of discrete event systems modelled as extended finite-state machines (EFSM). Previous results are improved to consider general conflict-equivalence based abstractions of EFSMs communicating both via shared variables and events. Performance issues resulting from the conversion of EFSM systems to finite-state machine systems are avoided by operating directly on EFSMs, deferring the unfolding of variables into state machines as long as possible. Several additional methods to abstract EFSMs and remove events are also presented. The proposed algorithm has been implemented in the discrete event systems tool Supremica, and the paper presents experimental results for several large EFSM models that can be verified faster than by previously used methods.  相似文献   

4.
The top-k query on uncertain data set has been a very hot topic these years, and there have been many studies on uncertain top-k queries. Unfortunately, most of the existing algorithms only consider centralized processing environments, and they are not suitable for the large-scale data. In this paper, it is the first attempt to process probabilistic threshold top-k queries (an important uncertain top-k query, PT-k for short) in a distributed environment. We propose 3 efficient algorithms. The serial distributed approach adopts a new method, which only requires a few amount of calculations, to serially process PT-k queries in distributed environments. The global sorting first algorithm for PT-k query processing (GSP) is designed for improving the computation speed. In GSP, a distributed sorting operation is performed, and then we compute the candidates for PT-k queries in parallel. The query results can be computed by using a novel incremental method which can reduce the number of calculations. The local filtering first algorithm for PT-k query processing is designed for reducing the network overhead. Specifically, several filtering strategies are proposed to filter out redundant data locally, and then the incremental method in GSP is used to process the PT-k queries. Finally, the effectiveness of our proposed algorithms is verified through a series of experiments.  相似文献   

5.
Why-not and why questions can be posed by database users to seek clarifications on unexpected query results. Specifically, why-not questions aim to explain why certain expected tuples are absent from the query results, while why questions try to clarify why certain unexpected tuples are present in the query results. This paper systematically explores the why-not and why questions on reverse top-k queries, owing to its importance in multi-criteria decision making. We first formalize why-not questions on reverse top-k queries, which try to include the missing objects in the reverse top-k query results, and then, we propose a unified framework called WQRTQ to answer why-not questions on reverse top-k queries. Our framework offers three solutions to cater for different application scenarios. Furthermore, we study why questions on reverse top-k queries, which aim to exclude the undesirable objects from the reverse top-k query results, and extend the framework WQRTQ to efficiently answer why questions on reverse top-k queries, which demonstrates the flexibility of our proposed algorithms. Extensive experimental evaluation with both real and synthetic data sets verifies the effectiveness and efficiency of the presented algorithms under various experimental settings.  相似文献   

6.
The problem of kNN (k Nearest Neighbor) queries has received considerable attention in the database and information retrieval communities. Given a dataset D and a kNN query q, the k nearest neighbor algorithm finds the closest k data points to q. The applications of kNN queries are board, not only in spatio-temporal databases but also in many areas. For example, they can be used in multimedia databases, data mining, scientific databases and video retrieval. The past studies of kNN query processing did not consider the case that the server may receive multiple kNN queries at one time. Their algorithms process queries independently. Thus, the server will be busy with continuously reaccessing the database to obtain the data that have already been acquired. This results in wasting I/O costs and degrading the performance of the whole system. In this paper, we focus on this problem and propose an algorithm named COrrelated kNN query Evaluation (COKE). The main idea of COKE is an “information sharing” strategy whereby the server reuses the query results of previously executed queries for efficiently processing subsequent queries. We conduct a comprehensive set of experiments to analyze the performance of COKE and compare it with the Best-First Search (BFS) algorithm. Empirical studies indicate that COKE outperforms BFS, and achieves lower I/O costs and less running time.  相似文献   

7.
Providing top-k typical relevant keyword queries would benefit the users who cannot formulate appropriate queries to express their imprecise query intentions. By extracting the semantic relationships both between keywords and keyword queries, this paper proposes a new keyword query suggestion approach which can provide typical and semantically related queries to the given query. Firstly, a keyword coupling relationship measure, which considers both intra- and inter-couplings between each pair of keywords, is proposed. Then, the semantic similarity of different keyword queries can be measured by using a semantic matrix, in which the coupling relationships between keywords in queries are reserved. Based on the query semantic similarities, we next propose an approximation algorithm to find the most typical queries from query history by using the probability density estimation method. Lastly, a threshold-based top-k query selection method is proposed to expeditiously evaluate the top-k typical relevant queries. We demonstrate that our keyword coupling relationship and query semantic similarity measures can capture the coupling relationships between keywords and semantic similarities between keyword queries accurately. The efficiency of query typicality analysis and top-k query selection algorithm is also demonstrated.  相似文献   

8.
We study two natural extensions of Constraint Satisfaction Problems (CSPs). Balance-Max-CSP requires that in any feasible assignment each element in the domain is used an equal number of times. An instance of Hard-Max-CSP consists of soft constraints and hard constraints, and the goal is to maximize the weight of satisfied soft constraints while satisfying all the hard constraints. These two extensions contain many fundamental problems not captured by CSPs, and challenge traditional theories about CSPs in a more general framework. Max-2-SAT and Max-Horn-SAT are the only two nontrivial classes of Boolean CSPs that admit a robust satisfibiality algorithm, i.e., an algorithm that finds an assignment satisfying at least (1 ? g(ε)) fraction of constraints given a (1 ? ε)-satisfiable instance, where g(ε) → 0 as ε → 0, and g(0) = 0. We prove the inapproximability of these problems with balance or hard constraints, showing that each variant changes the nature of the problems significantly (in different ways). For instance, deciding whether an instance of 2-SAT admits a balanced assignment is NP-hard, and for Max-2-SAT with hard constraints, it is hard to find a constant-factor approximation even on (1 ? ε)-satisfiable instances (in particular, the version with hard constraints does not admit a robust satisfiability algorithm). We also study hardness results for a certain CSP over a larger domain capturing ordering constraints: we show that hard constraints rule out constant-factor approximation algorithms. All our hardness results are almost optimal — they completely rule out algorithms with certain properties, or can be matched by simple extensions to existing algorithms.  相似文献   

9.
Continuous visible nearest neighbor query processing in spatial databases   总被引:1,自引:0,他引:1  
In this paper, we identify and solve a new type of spatial queries, called continuous visible nearest neighbor (CVNN) search. Given a data set P, an obstacle set O, and a query line segment q in a two-dimensional space, a CVNN query returns a set of \({\langle p, R\rangle}\) tuples such that \({p \in P}\) is the nearest neighbor to every point r along the interval \({R \subseteq q}\) as well as p is visible to r. Note that p may be NULL, meaning that all points in P are invisible to all points in R due to the obstruction of some obstacles in O. In contrast to existing continuous nearest neighbor query, CVNN retrieval considers the impact of obstacles on visibility between objects, which is ignored by most of spatial queries. We formulate the problem, analyze its unique characteristics, and develop efficient algorithms for exact CVNN query processing. Our methods (1) utilize conventional data-partitioning indices (e.g., R-trees) on both P and O, (2) tackle the CVNN search by performing a single query for the entire query line segment, and (3) only access the data points and obstacles relevant to the final query result by employing a suite of effective pruning heuristics. In addition, several interesting variations of CVNN queries have been introduced, and they can be supported by our techniques, which further demonstrates the flexibility of the proposed algorithms. A comprehensive experimental evaluation using both real and synthetic data sets has been conducted to verify the effectiveness of our proposed pruning heuristics and the performance of our proposed algorithms.  相似文献   

10.
Given a graph and degree upper bounds on vertices, the BDMST problem requires us to find a minimum cost spanning tree respecting the given degree bounds. This problem generalizes the Travelling Salesman Path Problem (TSPP), even in unweighted graphs, and so we expect that it is necessary to relax the degree constraints to get efficient algorithms. Könemann and Ravi (Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing, pp. 537–546, 2000; Proceedings of the Thirty-Fifth ACM Symposium on Theory of Computing, pp. 389–395, 2003) give bicriteria approximation algorithms for the problem using local search techniques of Fischer (Technical Report 14853, Cornell University, 1993). Their algorithms find solutions which make a tradeoff of the approximation factor for the cost of the resulting tree against the factor by which degree constraints are violated. In particular, they give an algorithm which, for a graph with a spanning tree of cost C and degree B, and for parameters b,w>1, produces a tree whose cost is at most wC and whose degree is at most \(\frac{w}{w-1}bB+\log_{b}n.\) A primary contribution of Könemann and Ravi is to use a Lagrangean relaxation to formally relate the BDMST problem to what we call the MDMST problem, which is the problem of finding an MST of minimum degree in a graph. In their solution to the MDMST problem, they make central use of a local-search approximation algorithm of Fischer.In this paper, we give the first approximation algorithms for the BDMST problem—both our algorithms find trees of optimal cost. We achieve this improvement using a primal-dual cost bounding methodology from Edmonds’ weighted matching algorithms which was not previously used in this context. In order to follow Edmonds’ approach, we develop algorithms for a variant of the MDMST problem in which there are degree lower bound requirements. This variant may be of independent interest; in particular, our results extend to a generalized version of the BDMST problem in which both upper and lower degree bounds are given.First we give a polynomial-time algorithm that finds a tree of optimal cost and with maximum degree at most \(\frac{b}{2-b}B+O(\log_{b}n)\) for any b∈(1,2). We also give a quasi-polynomial-time approximation algorithm which produces a tree of optimal cost C and maximum degree at most B+O(log?n/log?log?n). That is, the error is additive as well as restricted just to the degree. This further improvement in degree is obtained by using augmenting-path techniques that search over a larger solution space than Fischer’s local-search algorithm.  相似文献   

11.
In this paper, we propose an efficient solution for processing continuous range spatial keyword queries over moving spatio-textual objects (namely, CRSK-mo queries). Major challenges in efficient processing of CRSK-mo queries are as follows: (i) the query range is determined based on both spatial proximity and textual similarity; thus a straightforward spatial proximity based pruning of the search space is not applicable as any object far from a query location with a high textual similarity score can still be the answer (and vice versa), (ii) frequent location updates may invalidate a query result, and thus require frequent re-computing of the result set for any object updates. To address these challenges, the key idea of our approach is to exploit the spatial and textual upper bounds between queries and objects to form safe zones (at the client-side) and buffer regions (at the server-side), and then use these bounds to quickly prune objects and queries through smart in-memory data structures. We conduct extensive experiments with a synthetic dataset that verify the effectiveness and efficiency of our proposed algorithm.  相似文献   

12.
k-nearest neighbor (k-NN) queries are well-known and widely used in a plethora of applications. However, in the original definition of k-NN queries there is no concern regarding diversity of the answer set with respect to the user’s interests. For instance, travelers may be looking for touristic sites that are close to where they are, but that would also lead them to see different parts of the city. Likewise, if one is looking for restaurants close by, it may be more interesting to learn about restaurants of different categories or ethnicities which are nonetheless relatively close. The interesting novel aspect of this type of query is that there are two competing criteria to be optimized: closeness and diversity. We propose two approaches that leverage the notion of linear skyline queries in order to find the k diverse nearest neighbors within a radius r from a given query point, or (k, r)-DNNs for short. Our proposed approaches return a relatively small set containing all optimal solutions for any linear combination of the weights a user could give to the two competing criteria, and we consider three different notions of diversity: spatial, categorical and angular. Our experiments, varying a number of parameters and exploring synthetic and real datasets, in both Euclidean space and road networks, respectively, show that our approaches are several orders of magnitude faster than a straightforward approach.  相似文献   

13.
In this article, we present the construction of an interactive multilevel focus+context visualization framework for the navigation and exploration of large-scale 2D and 3D images. The presented framework utilizes a balanced multiresolution technique supported by a balanced wavelet transform (BWT). It extends the mode of focus+context visualization, where spatially separate magnification of regions of interest (ROIs) is performed, as opposed to in-place magnification. Each resulting visualization scenario resembles a tree structure, where the root constitutes the main context, each non-root internal node plays the dual roles of both focus and context, and each leaf solely represents a focus. Our developed prototype supports interactive manipulation of the visualization hierarchy, such as addition and deletion of ROIs and desired changes in their resolutions at any level of the hierarchy on the fly. We describe the underlying data structure efficiently support such operations. Changes in the spatial locations of query windows defining the ROIs trigger on-demand reconstruction queries. We explain in detail how to efficiently process such reconstruction queries within the hierarchy of details (wavelet coefficients) contained in the BWT in order to ensure real-time feedback. As the BWT need only be constructed once in a preprocessing phase on the server-side and robust on-demand reconstruction queries require minimal data communication overhead, our presented framework is a suitable candidate for efficient web-based visualization of complex large-scale imagery. We also discuss the performance characteristics of our proposed framework from various aspects, such as time and space complexities and achieved frame rates.  相似文献   

14.
15.
Efficiently answering reachability queries, which checks whether one vertex can reach another in a directed graph, has been studied extensively during recent years. However, the size of the graph that people are facing and generating nowadays is growing so rapidly that simple algorithms, such as BFS and DFS, are no longer feasible. Although Refined Online Search algorithms can scale to large graphs, they all suffer from the false positive problem. In this paper, we analyze the cause of false positive and propose an efficient High Dimensional coordinate generating method based on Graph Dominance Drawing (HD-GDD) to answer reachability queries in linear or even constant time. We conduct experiments on different graph structures and different graph sizes to fully evaluate the performance and behavior of our proposal. Empirical results demonstrate that our method outperforms state-of-the-art algorithms and can handle extensive graphs.  相似文献   

16.
In this paper we propose a machine learning technique for real-time robot path planning for an autonomous robot in a planar environment with obstacles where the robot possess no a priori map of its environment. Our main insight in this paper is that a robot’s path planning times can be significantly reduced if it can refer to previous maneuvers it used to avoid obstacles during earlier missions, and adapt that information to avoid obstacles during its current navigation. We propose an online path planning algorithm called LearnerRRT that utilizes a pattern matching technique called Sample Consensus Initial Alignment (SAC-IA) in combination with an experience-based learning technique to adapt obstacle boundary patterns encountered in previous environments to the current scenario followed by corresponding adaptations in the obstacle-avoidance paths. Our proposed algorithm LearnerRRT works as a learning-based reactive path planning technique which enables robots to improve their overall path planning performance by locally improving maneuvers around commonly encountered obstacle patterns by accessing previously accumulated environmental information. We have conducted several experiments in simulations and hardware to verify the performance of the LearnerRRT algorithm and compared it with a state-of-the-art sampling-based planner. LearnerRRT on average takes approximately 10% of the planning time and 14% of the total time taken by the sampling-based planner to solve the same navigation task based on simulation results and takes only 33% of the planning time, 46% of total time and 95% of total distance compared to the sampling-based planner based on our hardware results.  相似文献   

17.
Dimensionality reduction has been attracted extensive attention in machine learning. It usually includes two types: feature selection and subspace learning. Previously, many researchers have demonstrated that the dimensionality reduction is meaningful for real applications. Unfortunately, a large mass of these works utilize the feature selection and subspace learning independently. This paper explores a novel supervised feature selection algorithm by considering the subspace learning. Specifically, this paper employs an ? 2,1?norm and an ? 2,p ?norm regularizers, respectively, to conduct sample denoising and feature selection via exploring the correlation structure of data. Then this paper uses two constraints (i.e. hypergraph and low-rank) to consider the local structure and the global structure among the data, respectively. Finally, this paper uses the optimizing framework to iteratively optimize each parameter while fixing the other parameter until the algorithm converges. A lot of experiments show that our new supervised feature selection method can get great results on the eighteen public data sets.  相似文献   

18.
Data streams management has attracted the attention of many researchers during the recent years. The reason is that numerous devices generate huge amounts of data demanding an efficient processing scheme for delivering high quality applications. Data are reported through streams and stored into a number of partitions. Separation techniques facilitate the parallel management of data while intelligent methods are necessary to manage these multiple instances of data. Progressive analytics over huge amounts of data could be adopted to deliver partial responses and, possibly, to save time in the execution of applications. An interesting research domain is the efficient management of queries over multiple partitions. Usually, such queries demand responses in the form of ordered sets of objects (e.g., top-k queries). These ordered sets include objects in a ranked order and require novel mechanisms for deriving responses based on partial results. In this paper, we study a setting of multiple data partitions and propose an intelligent, uncertainty driven decision making mechanism that aims to respond to streams of queries. Our mechanism delivers an ordered set of objects over a number of partial ordered subsets retrieved by each partition of data. We envision that a number of query processors are placed in front of each partition and report progressive analytics to a Query Controller (QC). The QC receives queries, assigns the task to the underlying processors and decides the right time to deliver the final ordered set to the application. We propose an aggregation model for deriving the final ordered set of objects and a Fuzzy Logic (FL) inference process. We present a Type-2 FL system that decides when the QC should stop aggregating partial subsets and return the final response to the application. We report on the performance of the proposed mechanism through the execution of a large set of experiments. Our results deal with the throughput of the QC, the quality of the final ordered set of objects and the time required for delivering the final response.  相似文献   

19.
In this paper, we show that the problem of deterministically factoring multivariate polynomials reduces to the problem of deterministic polynomial identity testing. Specifically, we show that given an arithmetic circuit (either explicitly or via black-box access) that computes a multivariate polynomial f, the task of computing arithmetic circuits for the factors of f can be solved deterministically, given a deterministic algorithm for the polynomial identity testing problem (we require either a white-box or a black-box algorithm, depending on the representation of f).Together with the easy observation that deterministic factoring implies a deterministic algorithm for polynomial identity testing, this establishes an equivalence between these two central derandomization problems of arithmetic complexity.Previously, such an equivalence was known only for multilinear circuits (Shpilka & Volkovich, 2010).  相似文献   

20.
We study the k-level uncapacitated facility location problem (k-level UFL) in which clients need to be connected with paths crossing open facilities of k types (levels). In this paper we first propose an approximation algorithm that for any constant k, in polynomial time, delivers solutions of cost at most α k times OPT, where α k is an increasing function of k, with \(\lim _{k\to \infty } \alpha _{k} = 3\). Our algorithm rounds a fractional solution to an extended LP formulation of the problem. The rounding builds upon the technique of iteratively rounding fractional solutions on trees (Garg, Konjevod, and Ravi SODA’98) originally used for the group Steiner tree problem. We improve the approximation ratio for k-level UFL for all k ≥ 3, in particular we obtain the ratio equal 2.02, 2.14, and 2.24 for k = 3,4, and 5.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号