Wang Yan, Hou Zhe, Huang Yanhong, Shi Jianqi, Zhang Gelin. Journal of Software (《软件学报》), 2022, 33(7): 2482-2498
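Today, more and more social decisions, including legal and financial ones, are made with the help of machine learning models. For such decisions, algorithmic fairness is extremely important. Indeed, one of the purposes of introducing machine learning in these settings is to avoid or reduce the bias present in human decision-making. However, datasets often contain sensitive features or may carry historical bias, which can lead machine learning algorithms to produce biased models. Because feature selection is important to tree-based models, they are susceptible to the influence of sensitive attributes. This paper proposes a method based on probabilistic model checking to formally verify the fairness of decision trees and tree ensemble models. The fairness problem is transformed into a probabilistic verification problem: a PCSP# model is constructed for the algorithmic model and solved with the PAT model checker, and model fairness is measured under differently defined fairness metrics. Based on this method, the tool FairVerify was developed and used to verify different fairness metrics on multiple classifiers built on different datasets and compound sensitive attributes, showing good performance. Compared with existing distribution-based verifiers, the method offers higher scalability and robustness.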

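Wireless ad hoc networks often adopt a clustered network structure to improve network performance, and most of these structures are non-overlapping. First, the advantages and disadvantages of overlapping and non-overlapping clustering structures are compared. Then, the characteristics of the overlapping clustering structure are studied in detail, with particular attention to efficient communication between adjacent clusters. Finally, simulation experiments analyze the performance of several typical clustering algorithms under the overlapping clustering strategy and verify the effectiveness of the algorithms.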

Knowledge discovery through directed probabilistic topic models: a survey (cited by: 1; self-citations: 0; citations by others: 1)
Graphical models have become the basic framework for topic-based probabilistic modeling. Models with latent variables, in particular, have proved effective in capturing hidden structures in data. In this paper, we survey an important subclass, Directed Probabilistic Topic Models (DPTMs), which have soft clustering abilities, and their applications to knowledge discovery in text corpora. From an unsupervised learning perspective, “topics are semantically related probabilistic clusters of words in text corpora; and the process for finding these topics is called topic modeling”. In topic modeling, a document consists of different hidden topics, and the topic probabilities provide an explicit representation of a document that smooths the data at the semantic level. Topic modeling has been an active area of research during the last decade. Many models have been proposed to handle the problems of modeling text corpora with different characteristics, for applications such as document classification, hidden association finding, expert finding, community discovery and temporal trend analysis. We present basic concepts, advantages and disadvantages in chronological order, a classification of existing models into categories, their parameter estimation and inference algorithms, and measures for evaluating model performance. We also discuss their applications, open challenges and future directions in this dynamic area of research.

Small-footprint LiDAR (Light Detection And Ranging) has been proposed as an effective tool for measuring detailed biophysical characteristics of forests over broad spatial scales. However, by itself LiDAR yields only a sample of the true 3D structure of a forest. In order to extract useful forestry-relevant information, these data must be interpreted using mathematical models and computer algorithms that infer or estimate specific forest metrics. For these outputs to be useful, algorithms must be validated and/or calibrated using a sub-sample of ‘known’ metrics measured using more detailed, reliable methods such as field sampling. In this paper we describe a novel method for delineating and deriving metrics of individual trees from LiDAR data based on watershed segmentation. Because of the costs involved in collecting both LiDAR data and field samples for validation, we use synthetic LiDAR data to validate and assess the accuracy of our algorithm. The synthetic LiDAR data are generated using a simple geometric model of Loblolly pine (Pinus taeda) trees and a simulation of LiDAR sampling. Our results suggest that point densities greater than 2, and preferably greater than 4, points per m² are necessary to obtain accurate forest inventory data from Loblolly pine stands. However, the results also demonstrate that the detection errors (i.e. the accuracy and biases of the algorithm) are intrinsically related to the structural characteristics of the forest being measured. We argue that experiments with synthetic data are directly useful to forest managers in guiding the design of operational forest inventory studies. In addition, we argue that the development of LiDAR simulation models, and experiments with the data they generate, represent a fundamental and useful approach to designing, improving and exploring the accuracy and efficiency of LiDAR algorithms.

Topic models are generative probabilistic models which have been applied to information retrieval to automatically organize and provide structure to a text corpus. Topic models discover topics in the corpus, which represent real-world concepts by frequently co-occurring words. Recently, researchers found topics to be effective tools for structuring various software artifacts, such as source code, requirements documents, and bug reports. This research also hypothesized that using topics to describe the evolution of software repositories could be useful for maintenance and understanding tasks. However, research has yet to determine whether these automatically discovered topic evolutions describe the evolution of source code in a way that is relevant or meaningful to project stakeholders, and thus it is not clear whether topic models are a suitable tool for this task. In this paper, we take a first step towards evaluating topic models in the analysis of software evolution by performing a detailed manual analysis of the source code histories of two well-known and well-documented systems, JHotDraw and jEdit. We define and compute various metrics on the discovered topic evolutions and manually investigate how and why the metrics evolve over time. We find that the large majority (87%–89%) of topic evolutions correspond well with actual code change activities by developers. We are thus encouraged to use topic models as tools for studying the evolution of a software system.

Clustering is an important research topic that has practical applications in many fields. It has been demonstrated that fuzzy clustering, using algorithms such as the fuzzy C-means (FCM), has clear advantages over crisp and probabilistic clustering methods. Like most clustering algorithms, however, FCM and its derivatives need the number of clusters in the given data set as one of their initializing parameters. The main goal of this paper is to develop an effective fuzzy algorithm for automatically determining the number of clusters. After a brief review of the relevant literature, we present a new algorithm for determining the number of clusters in a given data set and a new validity index for measuring the “goodness” of clustering. Experimental results and comparisons are given to illustrate the performance of the new algorithm.

A clustering ensemble combines, in a consensus function, the partitions generated by a set of independent base clusterers. This study investigates the use of both particle swarm clustering (PSC) and ensemble pruning (i.e., selective reduction of base partitions) using evolutionary techniques in the design of the consensus function. In the proposed ensemble, PSC plays two roles. First, it is used as a base clusterer. Second, it is employed in the consensus function, arguably the most challenging element of the ensemble. The proposed consensus function exploits a representation for the base partitions that makes cluster alignment unnecessary, allows for the combination of partitions with different numbers of clusters, and supports both disjoint and overlapping (fuzzy, probabilistic, and possibilistic) partitions. Results on both synthetic and real-world data sets show that the proposed ensemble can produce statistically significantly better partitions, in terms of the validity indices used, than the best base partition available in the ensemble. In general, a small number of selected base partitions (below 20% of the total) yields the best results. Moreover, results produced by the proposed ensemble compare favorably to those of state-of-the-art clustering algorithms, and especially to swarm-based clustering ensemble algorithms.

Computer Communications, 2007, 30(14-15): 2826-2841
The past few years have witnessed increased interest in the potential use of wireless sensor networks (WSNs) in applications such as disaster management, combat field reconnaissance, border protection and security surveillance. Sensors in these applications are expected to be remotely deployed in large numbers and to operate autonomously in unattended environments. To support scalability, nodes are often grouped into disjoint and mostly non-overlapping clusters. In this paper, we present a taxonomy and general classification of published clustering schemes. We survey different clustering algorithms for WSNs, highlighting their objectives, features, complexity, etc. We also compare these clustering algorithms based on metrics such as convergence rate, cluster stability, cluster overlapping, location-awareness and support for node mobility.

Du Hui, Chen Yunfang, Zhang Wei. Computer Science (《计算机科学》), 2017, 44(Z6): 29-32, 47
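Topic models use fast machine learning algorithms to extract low-dimensional topic representations from high-dimensional, sparse word data, thereby clustering the words of documents. Estimating the parameters of topic models is an important research task in this field. This paper describes in detail the probabilistic latent semantic analysis model, the latent Dirichlet allocation model, and the basic parameter estimation methods for topic models, and experimentally compares the perplexity of the models.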

Shan Jing, Shen Derong, Kou Yue, Nie Tiezheng, Yu Ge. Journal of Software (《软件学报》), 2017, 28(2): 326-340
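With the rapid development of social networks, the problem of information propagation has attracted wide attention because of its broad application prospects, and influence maximization is a research hotspot within it. Influence maximization aims to select, before the propagation process begins, the nodes that maximize the expected influence as the initial nodes of propagation, and it mostly adopts probability-based models such as the independent cascade model. However, most existing solutions to influence maximization treat the propagation process as automatic and overlook the role that the social networking platform can play in it. In addition, probability-based models have several problems: for example, they cannot guarantee effective propagation of information and cannot adapt to dynamically changing network structures. This paper therefore proposes a method for selecting propagation hotspots based on overlapping community search. The method uses an iterative promotion model to select influence-maximizing nodes step by step according to feedback on user behavior, so that the social networking platform can fully exert control over the propagation process. It also proposes a new way of measuring node influence based on overlapping community structure, and propagation hotspots are selected according to this measure. Two exact algorithms for the problem, a basic method and an optimized method, are proposed, together with an approximate algorithm. Extensive experiments verify the efficiency of the exact and approximate algorithms, the accuracy of the approximate algorithm, and the effectiveness of the iterative hotspot selection method.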

The Topic Detection task is focused on discovering the main topics addressed by a series of documents (e.g., news reports, e-mails, tweets). Topics, defined in this way, are expected to be thematically similar, cohesive and self-contained. This task has been broadly studied from the point of view of clustering and probabilistic techniques. In this work, we propose for this task the application of Formal Concept Analysis (FCA), an exploratory technique for data analysis and organization. In particular, we extend the FCA-based topic detection methods in the literature by applying the stability concept to topic selection. The hypothesis is that FCA will enable better organization of the data, and stability a better selection of topics based on this organization, thus better fulfilling the task requirements by improving the quality and accuracy of the topic detection process. In addition, the proposed FCA-based methodology is able to cope with some well-known drawbacks of clustering and probabilistic methodologies, such as the need to set a predefined number of clusters or the difficulty in dealing with topics with complex generalization-specialization relationships. To test this hypothesis, FCA is compared to other established techniques: Hierarchical Agglomerative Clustering (HAC) and Latent Dirichlet Allocation (LDA). To allow this comparison, these approaches have been implemented by the authors in a novel experimental framework. The quality of the topics detected by the different approaches, in terms of their suitability for the topic detection task, is evaluated by means of internal clustering validity metrics. This evaluation demonstrates that FCA generates cohesive clusters, which are less subject to changes in cluster granularity. Driven by the quality of the detected topics, FCA achieves the best general outcome, improving the experimental results for the Topic Detection Task at the RepLab 2013 campaign.

This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive, and it allows the weight of each item in a cluster to be changed dynamically according to the occurrences of items. Second, we develop a clustering algorithm based on the weighted coverage density measure that is fast, memory-efficient, and scalable for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific clustering evaluation metrics are critical to capturing the transactional semantics in clustering analysis. Our SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods. These self-configuring methods can automatically tune the two important parameters of our clustering algorithms: (1) the candidates of the best number K of clusters; and (2) the application of two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.

Given the pervasive nature of malicious mobile code (viruses, worms, etc.), developing statistical/structural models of code execution is of considerable importance. We investigate using probabilistic suffix trees (PSTs) and associated suffix automata (PSAs) to build models of benign application behavior with the goal of subsequently being able to detect malicious applications as anything that deviates therefrom. We describe these probabilistic suffix models and present new generic analysis and manipulation algorithms. The models and the algorithms are then used in the context of API (i.e., system call) sequences realized by Windows XP applications. The analysis algorithms, when applied to traces (i.e., sequences of API calls) of benign and malicious applications, aid in choosing an appropriate modeling strategy in terms of distance metrics and consequently provide classification measures in terms of sequence-to-model matching. We give experimental results based on classification of unobserved traces of benign and malicious applications against a suffix model trained solely from traces generated by a small set of benign applications.

Empirical validation of software metrics suites to predict fault proneness in object-oriented (OO) components is essential to ensure their practical use in industrial settings. In this paper, we empirically validate three OO metrics suites for their ability to predict software quality in terms of fault-proneness: the Chidamber and Kemerer (CK) metrics, Abreu's Metrics for Object-Oriented Design (MOOD), and Bansiya and Davis' Quality Metrics for Object-Oriented Design (QMOOD). Some CK class metrics have previously been shown to be good predictors of initial OO software quality. However, the other two suites have not been heavily validated except by their original proposers. Here, we explore the ability of these three metrics suites to predict fault-prone classes using defect data for six versions of Rhino, an open-source implementation of JavaScript written in Java. We conclude that the CK and QMOOD suites contain similar components and produce statistical models that are effective in detecting error-prone classes. We also conclude that the class components in the MOOD metrics suite are not good class fault-proneness predictors. Analyzing multivariate binary logistic regression models across six Rhino versions indicates these models may be useful in assessing quality in OO classes produced using modern highly iterative or agile software development processes.

Data clustering has been proven to be an effective method for discovering structure in medical datasets. The majority of clustering algorithms produce exclusive clusters, meaning that each sample can belong to one cluster only. However, most real-world medical datasets have inherently overlapping information, which could be best explained by overlapping clustering methods that allow one sample to belong to more than one cluster. One of the simplest and most efficient overlapping clustering methods is known as overlapping k-means (OKM), which is an extension of the traditional k-means algorithm. Being an extension of the k-means algorithm, the OKM method also suffers from sensitivity to the initial cluster centroids. In this paper, we propose a hybrid method that combines the k-harmonic means and overlapping k-means algorithms (KHM-OKM) to overcome this limitation. The main idea behind the KHM-OKM method is to use the output of the KHM method to initialize the cluster centers of the OKM method. We have tested the proposed method using the FBCubed metric, which has been shown to be the most effective measure for evaluating overlapping clustering algorithms regarding homogeneity, completeness, rag bag, and cluster size-quantity tradeoff. According to results from ten publicly available medical datasets, the KHM-OKM algorithm outperforms the original OKM algorithm and can be used as an efficient method for clustering medical datasets.

Multi-document automatic summarization based on the LDA topic model (cited by: 3; self-citations: 0; citations by others: 3)
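In recent years, representing the multi-document summarization problem with probabilistic topic models has attracted researchers' attention. LDA (latent Dirichlet allocation) is one of the representative probabilistic generative models among topic models. This paper proposes an LDA-based summarization method. The method uses perplexity to determine the number of topics in the LDA model, uses Gibbs sampling to obtain the topic probability distributions of sentences and the word probability distributions of topics, determines the importance of each topic by summing the topic weights over sentences, and proposes two different sentence-weighting models based on the topic probability distributions and sentence probability distributions in the LDA model. The experiments use the ROUGE evaluation standard and compare against the state-of-the-art SumBasic method and two other LDA-based multi-document summarization methods on the generic multi-document summarization test set DUC2002. The results show that the proposed LDA-based method outperforms SumBasic on every ROUGE measure and also has an advantage over the other LDA-based summarization methods.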
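As a rough illustration of the perplexity-based choice of topic count mentioned in the abstract above, the following minimal sketch computes perplexity from per-token log-probabilities and picks the candidate topic count with the lowest value. The function names and the "lowest perplexity wins" rule are assumptions for illustration; the paper's actual selection procedure is not given in the abstract.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of held-out text: exp(-(1/N) * sum of log p(w_n)).

    token_log_probs: natural-log probabilities the model assigns to each
    held-out token (hypothetical input; how they are obtained from an
    LDA model is outside this sketch).
    """
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / n)


def pick_num_topics(perplexity_by_k):
    """Return the candidate topic count K with the lowest perplexity.

    perplexity_by_k: dict mapping K to a perplexity value. Choosing the
    minimum is an assumed selection rule, not necessarily the paper's.
    """
    if not perplexity_by_k:
        raise ValueError("need at least one candidate K")
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

A model that assigns every token probability 1/4 has perplexity 4, which gives a quick sanity check of the formula.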

Scheduling stochastic workloads is a difficult task. In order to design efficient scheduling algorithms for such workloads, one needs a good in-depth knowledge of basic random scheduling strategies. This paper analyzes the distribution of sequential jobs and the system behavior in heterogeneous computational grid environments where the brokering is done in such a way that each computing element has a probability of being chosen that is proportional to its number of CPUs and (new from the previous paper) its relative speed. We provide the asymptotic behavior for several metrics (queue sizes, slowdowns, etc.) or, in some cases, an approximation of this behavior. We study these metrics for a variety of workload configurations (load, distribution, etc.). We compare our probabilistic analysis to simulations in order to validate our results. These results provide a good understanding of the system behavior for each metric proposed. This enables us to design advanced and efficient algorithms for more complex cases.
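The brokering rule described in the abstract above can be sketched as follows, assuming that "proportional to its number of CPUs and its relative speed" means proportional to their product. That reading, and the function names, are assumptions for illustration and are not confirmed by the abstract.

```python
import random


def selection_probabilities(elements):
    """Probability of each computing element being chosen by the broker.

    elements: list of (num_cpus, relative_speed) pairs.
    Assumption: the weight of an element is num_cpus * relative_speed.
    """
    weights = [cpus * speed for cpus, speed in elements]
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return [w / total for w in weights]


def choose_element(elements, rng=random):
    """Draw the index of one computing element according to those probabilities."""
    probs = selection_probabilities(elements)
    return rng.choices(range(len(elements)), weights=probs, k=1)[0]
```

Under this reading, a 4-CPU element of speed 1.0, a 2-CPU element of speed 2.0 and an 8-CPU element of speed 0.5 all carry the same weight, so each is chosen with probability 1/3.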

Detecting topics from Twitter streams has become an important task, as it is used in various fields including natural disaster warning, user opinion assessment, and traffic prediction. In this article, we outline different types of topic detection techniques and evaluate their performance. We group the topic detection techniques into five categories: clustering, frequent pattern mining, exemplar-based, matrix factorization, and probabilistic models. For clustering, we discuss and evaluate nine techniques: sequential k-means, spherical k-means, kernel k-means, scalable kernel k-means, incremental batch k-means, DBSCAN, spectral clustering, document pivot clustering, and Bngram. For matrix factorization, we analyze five techniques: sequential Latent Semantic Indexing (LSI), stochastic LSI, Alternating Least Squares (ALS), Rank-one Downdate (R1D), and Column Subset Selection (CSS). Additionally, we evaluate several other techniques in the frequent pattern mining, exemplar-based, and probabilistic model categories. Results on three Twitter datasets show that Soft Frequent Pattern Mining (SFM) and Bngram achieve the best term precision, while CSS achieves the best term recall and topic recall in most cases. Moreover, exemplar-based topic detection obtains a good balance between term recall and term precision, while achieving good topic recall and running time.

The multimodel approach was recently developed to deal with the issues of complex systems modeling and control. Despite its success in different fields, it still faces several design problems, in particular the determination of the number and parameters of the different models representative of the system, as well as the choice of an adequate method of validity computation for deducing the multimodel output. In this paper, a new approach for complex systems modeling based on both neural and fuzzy clustering algorithms is proposed, which aims to derive different models describing the system over the whole operating domain. The implementation of this approach requires two main steps. The first step consists in determining the structure of the model base. For this, the number of models must first be determined by using a neural network and Rival Penalized Competitive Learning (RPCL). The different operating clusters are then selected using two different clustering algorithms (K-means and fuzzy K-means). The second step is a parametric identification of the different models in the base, using the clustering results to estimate model orders and parameters. This step ends with a validation procedure which aims to confirm the efficiency of the proposed modeling by using an adequate method of validity computation. The proposed approach is implemented and tested with two nonlinear systems. The results obtained are satisfactory and show good precision, which is strongly related to the dispersion of the data and the clustering method used.
