Similar Literature
20 similar documents retrieved (search time: 46 ms)
1.
2.
Exploring statistical correlations for image retrieval
Bridging the cognitive gap in image retrieval has been an active research direction in recent years, and a key challenge is to get enough training data to learn the mapping functions from low-level feature spaces to high-level semantics. In this paper, image regions are classified into two types: key regions representing the main semantic contents and environmental regions representing the contexts. We attempt to leverage the correlations between the two types of regions to improve the performance of image retrieval. A Context Expansion approach is explored to take advantage of such correlations by expanding the key regions of the queries with highly correlated environmental regions according to an image thesaurus. The thesaurus serves both as a mapping function between image low-level features and concepts and as a store of the statistical correlations between different concepts. It is constructed through a data-driven approach that uses Web data (images and their surrounding textual annotations) as the training data source to learn the region concepts and to explore the statistical correlations. Experimental results on a database of 10,000 general-purpose images show the effectiveness of our proposed approach in improving both search precision (i.e., filtering out irrelevant images) and recall (i.e., retrieving relevant images whose context may vary). Several major factors that affect the performance of our approach are also studied.
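Below is a minimal sketch of the Context Expansion idea in Python, assuming a toy in-memory thesaurus of correlation scores between region concepts; all names (`THESAURUS`, `expand_query`, the threshold) are illustrative, not from the paper.

```python
# A minimal sketch of Context Expansion, assuming a toy "thesaurus" that
# stores correlation scores between region concepts (illustrative data).

THESAURUS = {
    # key-region concept -> {environmental concept: correlation score}
    "tiger": {"grass": 0.8, "tree": 0.5, "water": 0.2},
    "boat":  {"water": 0.9, "sky": 0.4},
}

def expand_query(key_concepts, threshold=0.5):
    """Expand the key regions of a query with highly correlated
    environmental concepts drawn from the thesaurus."""
    expanded = set(key_concepts)
    for concept in key_concepts:
        for env, score in THESAURUS.get(concept, {}).items():
            if score >= threshold:
                expanded.add(env)
    return expanded

print(expand_query({"tiger"}))   # {'tiger', 'grass', 'tree'} (order may vary)
```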

3.
We investigate the expressive power of the typed λ-calculus when expressing computations over finite structures, i.e., databases. We show that the simply typed λ-calculus can express various database query languages such as the relational algebra, fixpoint logic, and the complex object algebra. In our embeddings, inputs and outputs are λ-terms encoding databases, and a program expressing a query is a λ-term which type-checks when applied to an input and reduces to an output. Our embeddings have the additional property that PTIME-computable queries are expressible by programs that, when applied to an input, reduce to an output in a PTIME sequence of reduction steps. Under our database input-output conventions, all elementary queries are expressible in the typed λ-calculus, and the PTIME queries are expressible in the order-5 (order-4) fragment of the typed λ-calculus (with equality).
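As an informal illustration of the database-as-λ-term convention, the following Python sketch uses Church-encoded lists, with Python's untyped lambdas standing in for the paper's typed λ-terms: a relation is its own fold, and a selection query is a λ-term that, applied to an encoded relation, reduces to the encoded answer. This is an analogy only, not the paper's actual typed encoding.

```python
# Church-encoded lists: a list is its own fold (illustration only).
nil  = lambda c: lambda n: n
cons = lambda h, t: lambda c: lambda n: c(h, t(c)(n))

def to_pylist(xs):                     # decode a Church list for printing
    return xs(lambda h, acc: [h] + acc)([])

# A relational selection sigma_p is itself a lambda-term: applied to an
# encoded relation, it reduces to the encoded answer relation.
select = lambda p: lambda xs: xs(lambda h, acc: cons(h, acc) if p(h) else acc)(nil)

r = cons(1, cons(2, cons(3, nil)))     # a unary relation {1, 2, 3}
print(to_pylist(select(lambda v: v > 1)(r)))   # [2, 3]
```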

4.
5.
We study the computational complexity of auditing finite attributes in databases that allow statistical queries. Given a database that supports statistical queries, the auditing problem is to check whether an attribute can be completely determined from a given set of statistical information. Some restricted cases of this problem have been investigated earlier; e.g., the complexity of statistical sum queries is known from the work of Kleinberg et al. (J. Comput. System Sci. 66 (2003) 244–253). We characterize all classes of statistical queries for which the auditing problem is polynomial-time solvable. We also prove that the problem is coNP-complete in all other cases, under a plausible conjecture on the complexity of constraint satisfaction problems (CSPs). The characterization is based on the complexity of certain CSPs whose exact complexity is known in many cases. This result is obtained by exploiting connections between auditing and constraint satisfaction, and by using certain algebraic techniques. We also study a generalization of the auditing problem that asks whether a set of statistical information implies that an attribute is restricted to K or fewer distinct values. We characterize all classes of polynomial-time solvable problems in this case, too.
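For the special case of sum queries, auditing has a standard linear-algebra reading: a value x_i is fully determined by the answered queries exactly when the unit vector e_i lies in the row space of the query matrix. A hedged sketch of that check (illustrative, not the paper's notation):

```python
# Sum-query auditing as a row-space membership test (a standard
# linear-algebra view of the sum-query case; names are illustrative).
import numpy as np

def determines_value(A, i):
    """Return True if x_i is pinned down by the sum queries in matrix A."""
    e = np.zeros(A.shape[1]); e[i] = 1.0
    # e is in the row space of A iff some combination of rows reproduces it
    coeff, *_ = np.linalg.lstsq(A.T, e, rcond=None)
    return np.allclose(A.T @ coeff, e)

# queries: x0+x1, x1+x2, x0+x2 -> every xi is determined by differencing
A = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)
print([determines_value(A, i) for i in range(3)])   # [True, True, True]
```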

6.
A common constraint on distributed data is that the database cannot be moved to other network sites due to computational costs, data size, or privacy considerations. All of the existing distributed algorithms for computing k-nearest neighbours (k-NN) are designed for horizontally partitioned data, or for the special case of vertically partitioned data where different sites contain different attributes for a common set of entities. In this article, we present a framework, including a general model and a decomposable algorithm, for computing k-NN in d-dimensional space across horizontally and vertically partitioned data in the most general situation, in which existing distributed databases want to cooperate for k-NN. The key is to obtain valid results with minimum information disclosure. The proposed algorithm preserves the privacy of the data at individual sites by requiring transmission of only minimal information to other sites. The computation is performed by exchanging a minimum number of higher-level summaries, so that even if these are captured by an intruder, no actual data tuples can ever be revealed.
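The following is a toy sketch of the general principle rather than the paper's exact protocol: each site discloses only a small summary, its k smallest distances to the query, and a coordinator merges the summaries, so no raw tuples ever leave a site.

```python
# Privacy-preserving distributed k-NN, horizontally partitioned case:
# only k distance values (a high-level summary) leave each site.
import heapq

def local_summary(site_points, query, k):
    dists = [sum((a - b) ** 2 for a, b in zip(p, query)) for p in site_points]
    return heapq.nsmallest(k, dists)        # only k numbers are transmitted

def global_knn_distances(sites, query, k):
    merged = [d for site in sites for d in local_summary(site, query, k)]
    return heapq.nsmallest(k, merged)

sites = [[(0, 0), (5, 5)], [(1, 1), (9, 9)]]     # two cooperating sites
print(global_knn_distances(sites, (0, 0), k=2))  # [0, 2] (squared distances)
```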

7.
An approach for multivariate statistical process control based on multiway locality preserving projections (LPP) is presented. The recently developed LPP is a linear dimensionality reduction technique that preserves the neighborhood structure of the data set. Compared with PCA, it captures the intrinsic structure of the observed data and finds more meaningful low-dimensional information hidden in the high-dimensional observations. In this study, LPP is used to extract the intrinsic geometrical structure of the process data. Hotelling's T² (D) and squared prediction error (SPE, or Q) statistic charts for on-line monitoring are then presented, and the contribution plots of these two statistical indices are used for fault diagnosis. Moreover, a moving-window technique is used for the implementation of on-line monitoring. A case study was carried out with data from industrial penicillin fed-batch cultivations. For comparison, the results obtained with MPCA are also presented. It is concluded that multiway LPP (MLPP) outperforms conventional MPCA. Finally, the robustness of MLPP monitoring is analyzed by adding noise to the original data.
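A hedged sketch of the two monitoring statistics on a generic projection model: in the paper the loadings would come from multiway LPP, while here `P` is just any orthonormal matrix and all variable names are illustrative.

```python
# T^2 (D) and SPE (Q) statistics for a generic linear projection model.
import numpy as np

def t2_statistic(x, P, score_var):
    t = P.T @ x                       # project into the reduced space
    return float(np.sum(t ** 2 / score_var))      # Hotelling's T^2 (D)

def spe_statistic(x, P):
    x_hat = P @ (P.T @ x)             # reconstruction from the model
    return float(np.sum((x - x_hat) ** 2))        # squared prediction error (Q)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # toy training data
P, _, _ = np.linalg.svd(X.T @ X)                  # stand-in loadings, not LPP
P = P[:, :2]
score_var = np.var(X @ P, axis=0)                 # training score variances
x_new = rng.normal(size=4)
print(t2_statistic(x_new, P, score_var), spe_statistic(x_new, P))
```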

8.
Conventional relational database management systems fail to address three features of statistical data management in a biomedical/clinical database, namely: (1) statistical and medical data (SMD) require a great deal of space and need to be stored in a reduced form with minimal duplication; indeed, SMD have many derived/calculated and summary statistics that make the number of attributes in a relation (i.e., a set of records) grow rapidly and dynamically; (2) most SMD have hierarchical structures that are difficult to manage using the relational data model, since SMD are stored in separate relations for duplication and space considerations; and (3) the management of SMD is made easier if it is possible to reorganize relations or group data, a capability lacking in conventional relational database management systems. In this paper, we (1) introduce five extended relational operators, (lattice) NEST, (lattice) UNNEST, MERGE, SPREAD, and GEN, to reorganize relations; (2) integrate the extended operators with conventional relational algebra and introduce the concept of the lattice relational model; and (3) give applications of the extended relational operators and the lattice relational model to the problems of statistical data manipulation in medical databases.
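A toy Python sketch of plain NEST/UNNEST on relations represented as lists of dicts, to show the kind of reorganization the extended operators provide; the paper's lattice variants are richer than this.

```python
# Plain NEST/UNNEST on a flat relation (simplified illustration).
from collections import defaultdict

def nest(relation, group_keys, nested_name):
    groups = defaultdict(list)
    for row in relation:
        key = tuple(row[k] for k in group_keys)
        groups[key].append({k: v for k, v in row.items() if k not in group_keys})
    return [dict(zip(group_keys, key), **{nested_name: rows})
            for key, rows in groups.items()]

def unnest(relation, nested_name):
    return [dict({k: v for k, v in row.items() if k != nested_name}, **inner)
            for row in relation for inner in row[nested_name]]

r = [{"patient": 1, "test": "HbA1c", "value": 6.1},
     {"patient": 1, "test": "LDL",   "value": 120}]
nested = nest(r, ["patient"], "results")   # one row per patient
assert unnest(nested, "results") == r      # UNNEST inverts NEST here
```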

9.
Let us consider the following situation: \(t\) entities (e.g., hospitals) hold different databases containing different records for the same type of confidential (e.g., medical) data. They want to deliver a protected version of this data to third parties (e.g., pharmaceutical researchers), preserving in some way both the utility and the privacy of the original data. This can be done by applying a statistical disclosure control (SDC) method. One possibility is that each entity protects its own database individually, but this strategy provides less utility and privacy than a collective strategy in which the entities cooperate, by means of a distributed protocol, to produce a global protected dataset. In this paper, we investigate the problem of distributed protocols for SDC protection methods. We propose a simple, efficient and secure distributed protocol for the specific SDC method of rank shuffling. We run some experiments to evaluate the quality of this protocol and to compare the individual and collective strategies for protecting a distributed database. With respect to other distributed versions of SDC methods, the new protocol provides either more security or more efficiency, as we discuss throughout the paper.
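A single-site sketch of rank swapping, the SDC family to which rank shuffling belongs: values are sorted by rank and each is swapped with a partner at most p ranks away. The paper's contribution is a distributed protocol for doing this jointly across entities; the simplified sketch below only illustrates the underlying transformation, with illustrative parameter names.

```python
# A simplified, single-site rank-swapping transformation (not the paper's
# distributed protocol): swap values whose ranks are at most p apart.
import random

def rank_swap(values, p, seed=0):
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])  # rank order
    swapped = list(values)
    i = 0
    while i + 1 < len(order):
        j = min(i + rng.randint(1, p), len(order) - 1)  # partner within p ranks
        a, b = order[i], order[j]
        swapped[a], swapped[b] = swapped[b], swapped[a]
        i = j + 1
    return swapped

print(rank_swap([37, 5, 12, 90, 41, 8], p=2))
```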

10.
Today's database systems must deal with uncertainty in the data they store. Consequently, there is a strong need for mining probabilistic databases. Because probabilistic data in first-normal-form relations is redundant, existing mining techniques are inadequate for discovering patterns in probabilistic databases. This paper designs a new strategy for identifying potentially useful patterns in probabilistic databases. A dependent rule is identified in a probabilistic database, represented in the form X → Y with conditional probability matrix M_{Y|X}. The method uses instance selection to increase efficiency, enabling us to reduce the search space. We evaluated the proposed technique, and our experimental results demonstrate that the approach is effective and efficient.
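A small sketch of estimating the conditional probability matrix M_{Y|X} for a candidate rule X → Y from probability-weighted tuples; names are illustrative, and the paper's instance-selection step for pruning the search is omitted.

```python
# Estimate M_{Y|X} from probabilistic tuples (x, y, prob).
from collections import defaultdict

def conditional_matrix(tuples):
    """Return {x: {y: P(y|x)}} from an iterable of (x, y, prob)."""
    joint = defaultdict(float)
    marginal = defaultdict(float)
    for x, y, p in tuples:
        joint[(x, y)] += p
        marginal[x] += p
    return {x: {y: joint[(x, y)] / marginal[x]
                for (x2, y) in joint if x2 == x}
            for x in marginal}

data = [("rain", "wet", 0.6), ("rain", "dry", 0.1), ("sun", "dry", 0.3)]
print(conditional_matrix(data))
# {'rain': {'wet': 0.857..., 'dry': 0.142...}, 'sun': {'dry': 1.0}}
```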

11.
International Journal of Computer Mathematics, 2012, 89(12): 1265–1271

A statistical database is said to be secure if no sensitive statistics are disclosed. An important problem in the security of statistical databases is to maximize the number of disclosed statistics while ensuring that the database remains secure. In this paper we present a new result for the case of range queries in 1-dimensional statistical databases and k-compromise, where k is odd.
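A tiny worked example of the underlying differencing attack: two overlapping range sums compromise a single value (a k = 1 compromise, and 1 is odd).

```python
# Compromise by differencing two range-sum statistics.
x = [3, 7, 2, 9]          # secret 1-dimensional database
q1 = sum(x[0:3])          # disclosed statistic over range [0, 2]: 12
q2 = sum(x[1:3])          # disclosed statistic over range [1, 2]: 9
print(q1 - q2)            # 3 == x[0], so x[0] is compromised
```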

12.
We propose the f-divergences of two probability distributions as measures of the organization of a probabilistic system with respect to its probabilistic uncertainty. A probabilistic system consists of stochastic objects on which statistically dependent random variables are defined. Using Shannon's f-divergence for the organization of a probabilistic system, we express it in terms of the probability distributions of the element random variables and their statistical linkages. We then find the maximum entropy of a probabilistic system when the statistical linkages between its elements are given as input data. We show that an important class of physical statistical systems can be described by probabilistic systems of Gibbsian type.
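A minimal sketch of an f-divergence, D_f(P‖Q) = Σ_i q_i f(p_i/q_i); choosing f(t) = t log t gives the Kullback–Leibler divergence associated with Shannon-type organization measures.

```python
# Generic f-divergence; f(t) = t*log(t) yields the KL divergence.
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i), for strictly positive q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

kl = lambda t: t * math.log(t) if t > 0 else 0.0
p = [0.5, 0.5]
q = [0.9, 0.1]
print(f_divergence(p, q, kl))   # KL(P||Q) ~ 0.5108
```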

13.
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied extensively in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel frequent-pattern tree (FP-tree) structure, an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, the FP-tree, which avoids costly, repeated database scans; (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets; and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.
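A compact sketch of the FP-tree construction step (the mining by pattern-fragment growth is omitted): infrequent items are dropped, each transaction is sorted by descending support, and transactions sharing prefixes share tree paths.

```python
# FP-tree construction: frequent items only, support-descending order,
# shared prefixes stored once with counts.
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    support = Counter(i for t in transactions for i in set(t))
    frequent = {i for i, c in support.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-support[i], i))   # descending support
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

tree = build_fp_tree([["a", "b", "c"], ["a", "b"], ["b", "c"], ["a"]],
                     min_support=2)
print({i: n.count for i, n in tree.children.items()})   # {'a': 3, 'b': 1}
```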

14.
Sven, Sebastian, Thu. Data & Knowledge Engineering, 2009, 68(10): 1128–1155
We establish search algorithms from the area of propositional logic as invaluable tools for semantic knowledge acquisition in the conceptual database design phase. The acquisition of such domain knowledge is crucial for the quality of the target database. Integrity constraints are conditions that capture the semantics of the application domain under consideration; they restrict the databases to those considered meaningful to the application at hand. In practice, the decision to specify a constraint is very important and extremely challenging. We show how techniques from propositional logic can be utilised to offer decision support for specifying Boolean and multivalued dependencies between properties of entities and relationships in conceptual databases. In particular, we use a search version of SAT solvers to semi-automatically generate sample databases for this class of dependencies in Entity-Relationship models. The sample databases enable design participants to judge, justify, convey and test their understanding of the semantics of the future database. Indeed, the participants' decision to specify a dependency explicitly is reduced to their decision whether there is some sample database that they can accept as a future database instance.
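A hedged sketch of the decision-support idea for the functional-dependency case: search for a two-row sample database that satisfies the already-specified dependencies but violates a candidate one; if designers accept the sample as a plausible instance, the candidate should not be enforced. A real system would call a SAT solver, but brute force over two-row agreement patterns suffices for this illustration, and all names are illustrative.

```python
# Find a two-row counterexample: rows agree on lhs of the candidate FD but
# differ on its rhs, while all specified FDs still hold on the pair.
from itertools import product

def two_row_counterexample(attrs, specified_fds, candidate_fd):
    lhs, rhs = candidate_fd
    for agree in product([True, False], repeat=len(attrs)):
        eq = dict(zip(attrs, agree))          # do the two rows agree on attr?
        if all(eq[a] for a in lhs) and not eq[rhs]:       # violates candidate
            if all(not all(eq[a] for a in L) or eq[R]     # satisfies the rest
                   for L, R in specified_fds):
                return eq
    return None

attrs = ["emp", "dept", "mgr"]
print(two_row_counterexample(attrs, [(["dept"], "mgr")], (["mgr"], "dept")))
# {'emp': True, 'dept': False, 'mgr': True}: two rows share mgr, differ on dept
```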

15.
We consider random graphs, and their extensions to random structures, with edge probabilities of the form βn^(−α), where n is the number of vertices, α and β are fixed, and α > 1 (α > arity − 1 for structures of higher arity). We consider conjunctive properties over these random graphs and investigate the problem of computing their asymptotic conditional probabilities. This provides a novel approach to dealing with uncertainty in databases, with applications to data privacy and other database problems. This work was partially supported by the grants NSF SEIII 0513877, NSF 61-2252, and NSF IIS-0428168.
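A short sketch of sampling one of these sparse random graphs, with each edge present independently with probability βn^(−α):

```python
# Sample a random graph with edge probability beta * n**(-alpha).
import random

def sample_graph(n, alpha, beta, seed=0):
    rng = random.Random(seed)
    p = min(1.0, beta * n ** (-alpha))
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

g = sample_graph(n=1000, alpha=1.5, beta=2.0)
print(len(g))   # expected ~ beta * n**(2 - alpha) / 2 edges, here about 31
```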

16.
17.
The evaluability of queries on a statistical database containing joinable tables connected by an intersection hypergraph is considered. A characterization of evaluable queries is given, which allows one to define polynomial-time procedures both for testing evaluability and for evaluating queries. These results are useful in designing an 'informed query system' for statistical databases which promotes an integrated use of stored information. Such a query system allows the user to formulate a query involving attributes from several joinable tables as if they were all contained in a single universal table.

18.
The problems of efficient data storage and data retrieval are important issues in the design of image database systems. A data structure called the 2-D string, which represents symbolic pictures while preserving spatial knowledge, was proposed by Chang et al.; it allows a natural way to construct iconic indexes for pictures. We propose a data structure, the 2-D B-string, to characterize the spatial knowledge embedded in images. It is powerful enough to describe images with partly or completely overlapping objects without the need to partition objects. When an image database contains a large volume of complex images, the processing time for retrieval is tremendous, so it is essential to develop efficient access methods. In this paper, access methods, to different extents of precision, for retrieving images encoded as 2-D B-strings are proposed. A signature file, acting as a spatial filter for the image database, is based on disjoint coding and superimposed coding techniques. It provides an efficient way to retrieve images in image databases.
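A minimal sketch of a superimposed-coding signature file: each spatial feature sets a few hashed bits in a fixed-width signature, and a query filters out images whose signatures cannot contain the query's bits (false positives are possible, false negatives are not). The parameters and feature strings are illustrative, not the paper's encoding.

```python
# Superimposed coding: OR together a few hashed bits per feature.
import hashlib

WIDTH, BITS_PER_FEATURE = 64, 3          # illustrative parameters

def signature(features):
    sig = 0
    for f in features:
        h = hashlib.sha1(f.encode()).digest()
        for k in range(BITS_PER_FEATURE):
            sig |= 1 << (h[k] % WIDTH)
    return sig

def may_match(image_sig, query_sig):
    return image_sig & query_sig == query_sig   # all query bits must be set

db = {"img1": signature(["house<left>tree", "tree<above>car"]),
      "img2": signature(["boat<left>dock"])}
q = signature(["house<left>tree"])
print([name for name, s in db.items() if may_match(s, q)])   # ['img1'] likely
```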

19.
Sequential Pattern Mining in Multi-Databases via Multiple Alignment
To efficiently find global patterns in a multi-database, information in each local database must first be mined and summarized at the local level; only the summarized information is then forwarded to the global mining process. However, conventional sequential pattern mining methods based on support cannot summarize the local information and are ineffective for global pattern mining from multiple data sources. In this paper, we present an alternative local mining approach for finding sequential patterns in the local databases of a multi-database. We propose the theme of approximate sequential pattern mining, roughly defined as identifying patterns approximately shared by many sequences. Approximate sequential patterns can effectively summarize and represent the local databases by identifying the underlying trends in the data. We present a novel algorithm, ApproxMAP, to mine approximate sequential patterns, called consensus patterns, from large sequence databases in two steps. First, sequences are clustered by similarity. Then, consensus patterns are mined directly from each cluster through multiple alignment. We conduct an extensive and systematic performance study over synthetic and real data. The results demonstrate that ApproxMAP is effective and scalable in mining large sequence databases with long patterns. Hence, ApproxMAP can efficiently summarize a local database and reduce the cost of global mining. Furthermore, we present an elegant and uniform model to identify both high-vote sequential patterns and exceptional sequential patterns from the collection of these consensus patterns from each local database.
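A toy sketch of only the final consensus step, assuming the sequences in a cluster have already been multiple-aligned to equal length (gaps as None); ApproxMAP's clustering and the alignment itself are omitted, and the strength threshold is illustrative.

```python
# Derive a consensus pattern from pre-aligned sequences by per-column vote.
from collections import Counter

def consensus(aligned, min_strength=0.5):
    pattern = []
    for column in zip(*aligned):
        items = Counter(i for i in column if i is not None)
        if items:
            item, votes = items.most_common(1)[0]
            if votes / len(aligned) >= min_strength:   # column is strong enough
                pattern.append(item)
    return pattern

aligned = [["a", "b", None, "c"],
           ["a", "b", "x",  "c"],
           ["a", None, "x", "c"]]
print(consensus(aligned))   # ['a', 'b', 'x', 'c']
```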

20.
The inaccessible set consists of those elements in a statistical database which cannot be compromised if queries of a certain fixed type are used. We define the size of the inaccessible set as a measure of the security of statistical databases; this measure has the property that it reflects changes in the type of the queries in a very direct way. We then derive several of these measures for various types of queries.
