期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Indexing the solution space: a new technique for nearest neighborsearch in high-dimensional space

Berchtold S. Keim D.A. Kriegel H.-P. Seidl T. 《Knowledge and Data Engineering, IEEE Transactions on》2000,12(1):45-57

Similarity search in multimedia databases requires an efficient support of nearest-neighbor search on a large set of high-dimensional points as a basic operation for query processing. As recent theoretical results show, state of the art approaches to nearest-neighbor search are not efficient in higher dimensions. In our new approach, we therefore precompute the result of any nearest-neighbor search which corresponds to a computation of the Voronoi cell of each data point. In a second step, we store conservative approximations of the Voronoi cells in an index structure efficient for high-dimensional data spaces. As a result, nearest neighbor search corresponds to a simple point query on the index structure. Although our technique is based on a precomputation of the solution space, it is dynamic, i.e., it supports insertions of new data points. An extensive experimental evaluation of our technique demonstrates the high efficiency for uniformly distributed as well as real data. We obtained a significant reduction of the search time compared to nearest neighbor search in other index structures such as the X-tree 相似文献

2.

Evaluation of thermodynamic properties from alloy phase diagram with miscibility gap using non-random two-liquid equation

《Calphad》2002,26(3):327-340

The non-random two-liquid equation has been applied to evaluate the thermodynamic properties of the liquid solution at elevated temperatures in a binary alloy system with a liquid phase miscibility gap. Only upon making use of the phase equilibrium data at the critical and monotectic points of the miscibility gap from a T-X phase diagram and thermochemical data, the parameters needed for the evaluation, i. e., (g₁₂ - g₂₂), g₂₁ - (g₁₁) and a of the non-random two-liquid solution approach, can be determined. The evaluation of thermodynamic properties was carried out numerically for three binary alloy systems, i. e., Al-Pb, Zn-Pb and Ga-Hg systems. The application of the non-random two-liquid equation to these three binary alloy systems shows that the evaluated results are close to the available experimental measurements. 相似文献

3.

An anchor-based spectral clustering method

Qin Zhang Guo-qiang Zhong Jun-yu Dong 《浙江大学学报:C卷英文版》2018,19(11):1385-1396

Spectral clustering is one of the most popular and important clustering methods in pattern recognition, machine learning, and data mining. However, its high computational complexity limits it in applications involving truly large-scale datasets. For a clustering problem with n samples, it needs to compute the eigenvectors of the graph Laplacian with O(n³) time complexity. To address this problem, we propose a novel method called anchor-based spectral clustering (ASC) by employing anchor points of data. Specifically, m (m ? n) anchor points are selected from the dataset, which can basically maintain the intrinsic (manifold) structure of the original data. Then a mapping matrix between the original data and the anchors is constructed. More importantly, it is proved that this data-anchor mapping matrix essentially preserves the clustering structure of the data. Based on this mapping matrix, it is easy to approximate the spectral embedding of the original data. The proposed method scales linearly relative to the size of the data but with low degradation of the clustering performance. The proposed method, ASC, is compared to the classical spectral clustering and two state-of-the-art accelerating methods, i.e., power iteration clustering and landmark-based spectral clustering, on 10 real-world applications under three evaluation metrics. Experimental results show that ASC is consistently faster than the classical spectral clustering with comparable clustering performance, and at least comparable with or better than the state-of-the-art methods on both effectiveness and efficiency. 相似文献

4.

Overall and pairwise segregation tests based on nearest neighbor contingency tables 总被引：1，自引：0，他引：1

Elvan Ceyhan 《Computational statistics & data analysis》2009,53(8):2786-2808

Multivariate interaction between two or more classes (or species) has important consequences in many fields and may cause multivariate clustering patterns such as spatial segregation or association. The spatial segregation occurs when members of a class tend to be found near members of the same class (i.e., near conspecifics) while spatial association occurs when members of a class tend to be found near members of the other class or classes. These patterns can be studied using a nearest neighbor contingency table (NNCT). The null hypothesis is randomness in the nearest neighbor (NN) structure, which may result from-among other patterns-random labeling (RL) or complete spatial randomness (CSR) of points from two or more classes (which is called the CSR independence, henceforth). New versions of overall and cell-specific tests based on NNCTs (i.e., NNCT-tests) are introduced and compared with Dixon’s overall and cell-specific tests and various other spatial clustering methods. Overall segregation tests are used to detect any deviation from the null case, while the cell-specific tests are post hoc pairwise spatial interaction tests that are applied when the overall test yields a significant result. The distributional properties of these tests are analyzed and finite sample performance of the tests are assessed by an extensive Monte Carlo simulation study. Furthermore, it is shown that the new NNCT-tests have better performance in terms of Type I error and power estimates. The methods are also applied on two real life data sets for illustrative purposes. 相似文献

5.

Subsampling techniques and the Jackknife methodology in the estimation of the extremal index

M. Ivette Gomes Andreia Hall 《Computational statistics & data analysis》2008,52(4):2022-2041

For a sequence of independent, identically distributed random variables any limiting point process for the time normalized exceedances of high levels is a Poisson process. However, for stationary dependent sequences, under general local and asymptotic dependence restrictions, any limiting point process for the time normalized exceedances of high levels is a compound Poisson process, i.e., there is a clustering of high exceedances, where the underlying Poisson points represent cluster positions, and the multiplicities correspond to the cluster sizes. For such classes of stationary sequences there exists the extremal indexθ, 0?θ?1, directly related to the clustering of exceedances of high values. The extremal index θ is equal to one for independent, identically distributed sequences, i.e., high exceedances appear individually, and θ>0 for “almost all” cases of interest. The estimation of the extremal index through the use of the Generalized Jackknife methodology, possibly together with the use of subsampling techniques, is performed. Case studies in the fields of environment and finance will illustrate the performance of the new extremal index estimator comparatively to the classical one. 相似文献

6.

Fuzzy logic approaches to structure preserving dimensionalityreduction

Pal N.R. Eluri V.K. Mandal G.K. 《Fuzzy Systems, IEEE Transactions on》2002,10(3):277-286

Sammon's (1969) nonlinear projection method is computationally prohibitive for large data sets, and it cannot project new data points. We propose a low-cost fuzzy rule-based implementation of Sammon's method for structure preserving dimensionality reduction. This method uses a sample and applies Sammon's method to project it. The input data points are then augmented by the corresponding projected (output) data points. The augmented data set thus obtained is clustered with the fuzzy c-means (FCM) clustering algorithm. Each cluster is then translated into a fuzzy rule to approximate the Sammon's nonlinear projection scheme. We consider both Mamdani-Assilian and Takagi-Sugeno models for this. Different schemes of parameter estimation are considered. The proposed schemes are applied on several data sets and are found to be quite effective to project new points, i.e., such systems have good predictability 相似文献

7.

Determinants of pull-based development in the context of continuous integration

Yue Yu Gang Yin Tao Wang Cheng Yang Huaimin Wang 《中国科学:信息科学(英文版)》2016,59(8):080104

The pull-based development model, widely used in distributed software teams on open source communities, can efficiently gather the wisdom from crowds. Instead of sharing access to a central repository, contributors create a fork, update it locally, and request to have their changes merged back, i.e., submit a pull-request. On the one hand, this model lowers the barrier to entry for potential contributors since anyone can submit pull-requests to any repository, but on the other hand it also increases the burden on integrators, who are responsible for assessing the proposed patches and integrating the suitable changes into the central repository. The role of integrators in pull-based development is crucial. They must not only ensure that pull-requests should meet the project’s quality standards before being accepted, but also finish the evaluations in a timely manner. To keep up with the volume of incoming pull-requests, continuous integration (CI) is widely adopted to automatically build and test every pull-request at the time of submission. CI provides extra evidences relating to the quality of pull-requests, which would help integrators to make final decision (i.e., accept or reject). In this paper, we present a quantitative study that tries to discover which factors affect the process of pull-based development model, including acceptance and latency in the context of CI. Using regression modeling on data extracted from a sample of GitHub projects deploying the Travis-CI service, we find that the evaluation process is a complex issue, requiring many independent variables to explain adequately. In particular, CI is a dominant factor for the process, which not only has a great influence on the evaluation process per se, but also changes the effects of some traditional predictors. 相似文献

8.

Projective reconstruction of ellipses from multiple images

F. Mai^{Author Vitae} Y.S. Hung Author Vitae Author Vitae 《Pattern recognition》2010,43(3):545-556

This paper presents a new approach for reconstructing 3D ellipses (including circles) from a sequence of 2D images taken by uncalibrated cameras. Our strategy is to estimate an ellipse in 3D space by reconstructing N(≥5) 3D points (called representative points) on it, where the representative points are reconstructed by minimizing the distances from their projections to the measured 2D ellipses on different images (i.e., 2D reprojection error). This minimization problem is transformed into a sequence of minimization sub-problems that can be readily solved by an algorithm which is guaranteed to converge to a (local) minimum of the 2D reprojection error. Our method can reconstruct multiple 3D ellipses simultaneously from multiple images and it readily handles images with missing and/or partially occluded ellipses. The proposed method is evaluated using both synthetic and real data. 相似文献

9.

Distributed data clustering in sensor networks

Ittay Eyal Idit Keidar Raphael Rom 《Distributed Computing》2011,24(5):207-222

Low overhead analysis of large distributed data sets is necessary for current data centers and for future sensor networks. In such systems, each node holds some data value, e.g., a local sensor read, and a concise picture of the global system state needs to be obtained. In resource-constrained environments like sensor networks, this needs to be done without collecting all the data at any location, i.e., in a distributed manner. To this end, we address the distributed clustering problem, in which numerous interconnected nodes compute a clustering of their data, i.e., partition these values into multiple clusters, and describe each cluster concisely. We present a generic algorithm that solves the distributed clustering problem and may be implemented in various topologies, using different clustering types. For example, the generic algorithm can be instantiated to cluster values according to distance, targeting the same problem as the famous k-means clustering algorithm. However, the distance criterion is often not sufficient to provide good clustering results. We present an instantiation of the generic algorithm that describes the values as a Gaussian Mixture (a set of weighted normal distributions), and uses machine learning tools for clustering decisions. Simulations show the robustness, speed and scalability of this algorithm. We prove that any implementation of the generic algorithm converges over any connected topology, clustering criterion and cluster representation, in fully asynchronous settings. 相似文献

10.

共用数据导向的分布式系统失效恢复缺陷检测

下载免费PDF全文

高钰王栋戴千旺窦文生魏峻《软件学报》2023,34(12):5578-5596

分布式系统的可靠性和可用性至关重要.然而,不正确的失效恢复机制及其实现会引发失效恢复缺陷,威胁分布式系统的可靠性和可用性.只有发生在特定时机的节点失效才会触发失效恢复缺陷,因此,检测分布式系统中的失效恢复缺陷具有挑战性.提出了一种新方法 Deminer来自动检测分布式系统中的失效恢复缺陷.在大规模分布式系统中观察到,同一份数据(即共用数据)可能被一组I/O写操作存储到不同位置(如不同的存储路径或节点).而打断这样一组共用数据写操作执行的节点失效更容易触发失效恢复缺陷.因此, Deminer以共用数据的使用为指导,通过自动识别和注入这类容易引发故障的节点失效来检测失效恢复缺陷.首先, Deminer追踪目标系统的一次正确执行中关键数据的使用.然后, Deminer基于执行轨迹识别使用共用数据的I/O写操作对,并预测容易引发错误的节点失效注入点.最后, Deminer通过测试预测的节点失效注入点以及检查故障征兆来暴露和确认失效恢复缺陷.实现了Deminer原型工具,并在4个流行的开源分布式系统ZooKeeper、HBase、YARN和HDFS的最新版本上进行了验证.实验结果表明Demine... 相似文献

11.

Estimation of 3D structure and motion from image corners

Yuncai LiuAuthor Vitae Xiaoyun Zhang^{Author Vitae} 《Pattern recognition》2003,36(6):1269-1277

This paper deals with motion estimation from image corner correspondences in two cases: the orthogonal corner and the general corner with known space angles. The contribution of the paper is in three folds: First, the three-dimensional structure of a corner is recovered easily from its image by introducing a new coordinate system; second, it is shown that the one corner and two points correspondences over two views are sufficient to uniquely determine the motion, i.e., the rotation and translation; third, experiments using both simulated data and real images are conducted, which present good results. 相似文献

12.

A density invariant approach to clustering

Kashyap Manish Bhattacharya Mahua 《Neural computing & applications》2017,28(7):1695-1713

Organizing data into sensible groups is called as ‘data clustering.’ It is an open research problem in various scientific fields. Neither a universal solution nor an absolute strategy for its evaluation exists in the literature. In this context, through this paper, we make following three contributions: (1) A new method for finding ‘natural groupings’ or clusters in the data set is presented. For this, a new term ‘vicinity’ is coined. Vicinity captures the idea of density together with spatial distribution of data points in feature space. This new notion has a potential to separate various type of clusters. In summary, the approach presented here is non-convex admissive (i.e., convex hulls of the clusters found can intersect which is desirable for non-convex clusters), cluster proportion and omission admissive (i.e., duplicating a cluster arbitrary number of times or deleting a cluster does not alter other cluster’s boundaries), scale covariant, consistent (shrinking within cluster distances and enlarging inter-cluster distances does not affect the clustering results) but not rich (does not generates exhaustive partitions of the data) and density invariant. (2) Strategy for automatic detection of various tunable parameters in the proposed ‘Vicinity Based Cluster Detection’ (VBCD) algorithm is presented. (3) New internal evaluation index called ‘Space-Density Index’ (SDI) for the clustered results (by any method) is also presented. Experimental results reveal that VBCD captures the idea of ‘natural groupings’ better than the existing approaches. Also, SDI evaluation scheme provides a better judgment as compared to earlier internal cluster validity indices.

相似文献

13.

In situ evaluation of recommender systems: Framework and instrumentation

M. Funk A. Rozinat E. Karapanos A.K. Alves de Medeiros A. Koca 《International journal of human-computer studies》2010,68(8):525-547

This paper deals with the evaluation of the recommendation functionality inside a connected consumer electronics product in prototype stage. This evaluation is supported by a framework to access and analyze data about product usage and user experience. The strengths of this framework lie in the collection of both objective data (i.e., “What is the user doing with the product?”) and subjective data (i.e., “How is the user experiencing the product?”), which are linked together and analyzed in a combined way. The analysis of objective data provides insights into how the system is actually used in the field. Combined with the subjective data, personal opinions and evaluative judgments on the product quality can be then related to actual user behavior. In order to collect these data in a most natural context, remote data collection allows for extensive user testing within habitual environments. We have applied our framework to the case of an interactive TV recommender system application to illustrate that the user experience of recommender systems can be evaluated in real-life usage scenarios. 相似文献

14.

A New Representation and Algorithm for Constructing Convex Hulls in Higher Dim ensional Spaces

下载免费PDF全文

L Wei Liang Youdong 《计算机科学技术学报》1992,7(1):1-5

This paper presents a new and simple scheme to describe the convex hull in R^d,which only uses three kinds of the faces of the convex hull.i.e.,the d-1-faces,d-2-faces and 0-faces.Thus,we develop and efficient new algorithm for constructing the convex hull of a finite set of points incrementally.This algorithm employs much less storage and time than that of the previously-existing approaches.The analysis of the runniing time as well as the storage for the new algorithm is also theoretically made.The algorithm is optimal in the worst case for even d. 相似文献

15.

Building semantic trees from XML documents

《Journal of Web Semantics》2016

The distributed nature of the Web, as a decentralized system exchanging information between heterogeneous sources, has underlined the need to manage interoperability, i.e., the ability to automatically interpret information in Web documents exchanged between different sources, necessary for efficient information management and search applications. In this context, XML was introduced as a data representation standard that simplifies the tasks of interoperation and integration among heterogeneous data sources, allowing to represent data in (semi-) structured documents consisting of hierarchically nested elements and atomic attributes. However, while XML was shown most effective in exchanging data, i.e., in syntactic interoperability, it has been proven limited when it comes to handling semantics, i.e., semantic interoperability, since it only specifies the syntactic and structural properties of the data without any further semantic meaning. As a result, XML semantic-aware processing has become a motivating challenge in Web data management, requiring dedicated semantic analysis and disambiguation methods to assign well-defined meaning to XML elements and attributes. In this context, most existing approaches: (i) ignore the problem of identifying ambiguous XML elements/nodes, (ii) only partially consider their structural relationships/context, (iii) use syntactic information in processing XML data regardless of the semantics involved, and (iv) are static in adopting fixed disambiguation constraints thus limiting user involvement. In this paper, we provide a new XML Semantic Disambiguation Framework titled XSDFdesigned to address each of the above limitations, taking as input: an XML document, and then producing as output a semantically augmented XML tree made of unambiguous semantic concepts extracted from a reference machine-readable semantic network. XSDF consists of four main modules for: (i) linguistic pre-processing of simple/compound XML node labels and values, (ii) selecting ambiguous XML nodes as targets for disambiguation, (iii) representing target nodes as special sphere neighborhood vectors including all XML structural relationships within a (user-chosen) range, and (iv) running context vectors through a hybrid disambiguation process, combining two approaches: concept-basedand context-based disambiguation, allowing the user to tune disambiguation parameters following her needs. Conducted experiments demonstrate the effectiveness and efficiency of our approach in comparison with alternative methods. We also discuss some practical applications of our method, ranging over semantic-aware query rewriting, semantic document clustering and classification, Mobile and Web services search and discovery, as well as blog analysis and event detection in social networks and tweets. 相似文献

16.

Diverse and proportional size-<Emphasis Type="Italic">l</Emphasis> object summaries using pairwise relevance

Georgios J. Fakas Zhi Cai Nikos Mamoulis 《The VLDB Journal The International Journal on Very Large Data Bases》2016,25(6):791-816

The abundance and ubiquity of graphs (e.g., online social networks such as Google\(+\) and Facebook; bibliographic graphs such as DBLP) necessitates the effective and efficient search over them. Given a set of keywords that can identify a data subject (DS), a recently proposed keyword search paradigm produces a set of object summaries (OSs) as results. An OS is a tree structure rooted at the DS node (i.e., a node containing the keywords) with surrounding nodes that summarize all data held on the graph about the DS. OS snippets, denoted as size-l OSs, have also been investigated. A size-l OS is a partial OS containing l nodes such that the summation of their importance scores results in the maximum possible total score. However, the set of nodes that maximize the total importance score may result in an uninformative size-l OSs, as very important nodes may be repeated in it, dominating other representative information. In view of this limitation, in this paper, we investigate the effective and efficient generation of two novel types of OS snippets, i.e., diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. Namely, besides the importance of each node, we also consider its pairwise relevance (similarity) to the other nodes in the OS and the snippet. We conduct an extensive evaluation on two real graphs (DBLP and Google\(+\)). We verify effectiveness by collecting user feedback, e.g., by asking DBLP authors (i.e., the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate the quality of the snippets that they produce. 相似文献

17.

Correcting evaluation bias of relational classifiers with network cross validation

Jennifer Neville Brian Gallagher Tina Eliassi-Rad Tao Wang 《Knowledge and Information Systems》2012,30(1):31-55

Recently, a number of modeling techniques have been developed for data mining and machine learning in relational and network domains where the instances are not independent and identically distributed (i.i.d.). These methods specifically exploit the statistical dependencies among instances in order to improve classification accuracy. However, there has been little focus on how these same dependencies affect our ability to draw accurate conclusions about the performance of the models. More specifically, the complex link structure and attribute dependencies in relational data violate the assumptions of many conventional statistical tests and make it difficult to use these tests to assess the models in an unbiased manner. In this work, we examine the task of within-network classification and the question of whether two algorithms will learn models that will result in significantly different levels of performance. We show that the commonly used form of evaluation (paired t-test on overlapping network samples) can result in an unacceptable level of Type I error. Furthermore, we show that Type I error increases as (1) the correlation among instances increases and (2) the size of the evaluation set increases (i.e., the proportion of labeled nodes in the network decreases). We propose a method for network cross-validation that combined with paired t-tests produces more acceptable levels of Type I error while still providing reasonable levels of statistical power (i.e., 1−Type II error). 相似文献

18.

“VANILLA” malware: vanishing antiviruses by interleaving layers and layers of attacks

Botacin Marcus de Geus Paulo Lício Grégio André 《Journal in Computer Virology》2019,15(4):233-247

Malware are persistent threats to any networked systems. Recent years increase in multi-core, distributed systems created new opportunities for malware authors to exploit such capabilities. In particular, the distributed execution of a malware in multiple cores may be used to evade currently widespread single-core-based detectors (e.g., antiviruses, or AVs) and malware analysis solutions that are unable to correlate data from multiple sources. In this paper, we propose a technique for distributing the malware functions in several distinct “vanilla” processes to show that AVs can be easily evaded. Therefore, our technique allows malware to interleave of layers of attacks to remain undetected by current AVs. Our goal is to expose a real menace and to discuss it so as to provide insights for the development of better AVs. We discuss the role of distributed and multicore-based malware in current and future threat scenarios with practical examples that we specially crafted for testing (e.g., a distributed sample synchronized via cache side channels). We (i) review multi-threaded/processed implementation issues (from kernel and userland) and present a multi-core-based monitoring solution; (ii) present strategies for code distribution, exemplified via DLL injectors, and discuss their weak and strong points; and (iii) evaluate how real security solutions perform when exposed to distributed malware. We converted real, serial malware to parallel code and showed that current AVs are not fully able to detect multi-core malware.

相似文献

19.

Verifying a quantitative relaxation of linearizability via refinement

Kiran Adhikari James Street Chao Wang Yang Liu Shaojie Zhang 《International Journal on Software Tools for Technology Transfer (STTT)》2016,18(4):393-407

The recent years have seen increasingly widespread use of highly concurrent data structures in both multi-core and distributed computing environments, thereby escalating the priority for verifying their correctness. Quasi linearizability is a quantitative variation of the standard linearizability correctness condition to allow more implementation freedom for performance optimization. However, ensuring that the implementation satisfies the quantitative aspect of this new correctness condition is often an arduous task. In this paper, we propose the first automated method for formally verifying quasi linearizability of the implementation model of a concurrent data structure with respect to its sequential specification. The method is based on checking a relaxed version of the refinement relation between the implementation model and the specification model through explicit state model checking. Our method can directly handle concurrent systems where each thread or process makes infinitely many method calls. Furthermore, unlike many existing verification methods, it does not require the user to supply annotations of the linearization points. We have implemented the new method in the PAT verification framework. Our experimental evaluation shows that the method is effective in verifying the new quasi linearizability requirement and detecting violations. 相似文献

20.

MIND: An approach to optimize communication time via middleware tuning

《Information Systems》2019

Minimizing the communication time due to the transfer over a network of the intermediary results produced during the execution of a distributed query is a fundamental problem in distributed database management systems. We take a new look at this problem by investigating the relationship between the communication time and a remote data access middleware. We focus on two middleware parameters that are manually tuned by database administrators or programmers: the fetch size (i.e., the number of tuples that are communicated at once) and the message size (i.e., the size of the buffer at the middleware level). We present an experimental study which shows that these parameters have a crucial impact on the communication time. Then, we propose the MIND framework, which tunes the aforementioned middleware parameters, while adapting to different queries (that may vary in terms of selectivity) and networks (that may vary in terms of bandwidth). The main technical contributions of MIND are (i) a communication time estimation function that takes into account the middleware parameters, the size of the query result and the network environment, and (ii) an iterative optimization algorithm to find the fetch size and the message size that allow a good trade-off between low resource consumption and low communication time. We conclude with an experimental study that emphasizes the effectiveness of the MIND framework. 相似文献