首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Given the pervasive nature of malicious mobile code (viruses, worms, etc.), developing statistical/structural models of code execution is of considerable importance. We investigate using probabilistic suffix trees (PSTs) and associated suffix automata (PSAs) to build models of benign application behavior with the goal of subsequently being able to detect malicious applications as anything that deviates therefrom. We describe these probabilistic suffix models and present new generic analysis and manipulation algorithms. The models and the algorithms are then used in the context of API (i.e., system call) sequences realized by Windows XP applications. The analysis algorithms, when applied to traces (i.e., sequences of API calls) of benign and malicious applications, aid in choosing an appropriate modeling strategy in terms of distance metrics and consequently provide classification measures in terms of sequence-to-model matching. We give experimental results based on classification of unobserved traces of benign and malicious applications against a suffix model trained solely from traces generated by a small set of benign applications.  相似文献   

2.
3.
Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below the acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.  相似文献   

4.
Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.  相似文献   

5.
以说话人识别中的背景模型为基础,根据模型中的各个高斯分量,构造出说话人特征空间,将长度不一样的语句映射成为空间中大小相同的向量,且经过相关矩阵进行规整后,采用线性支持向量机进行说话人识别。借鉴几种常见的特征规整方式,结合语句映射后的向量,提出四种不同的规整方法:均值/方差规整、权重规整、WLOG规整和球形规整,并与概率序列核进行比较研究。根据语音特征向量序列中相邻的特征向量的前后转移关系,结合提出的概率序列核,构造出转移概率序列核。实验在NIST2001库上进行,结果表明概率序列核模型识别性能接近经典的UBM-MAP模型,将这两类模型得分进行融合,可非常明显地提高识别性能,进一步融合转移概率序列核后,性能还可提高19.1%。  相似文献   

6.
Latent Dirichlet allocation (LDA) is one of the major models used for topic modelling. A number of models have been proposed extending the basic LDA model. There has also been interesting research to replace the Dirichlet prior of LDA with other pliable distributions like generalized Dirichlet, Beta-Liouville and so forth. Owing to the proven efficiency of using generalized Dirichlet (GD) and Beta-Liouville (BL) priors in topic models, we use these versions of topic models in our paper. Furthermore, to enhance the support of respective topics, we integrate mixture components which gives rise to generalized Dirichlet mixture allocation and Beta-Liouville mixture allocation models respectively. In order to improve the modelling capabilities, we use variational inference method for estimating the parameters. Additionally, we also introduce an online variational approach to cater to specific applications involving streaming data. We evaluate our models based on its performance on applications related to text classification, image categorization and genome sequence classification using a supervised approach where the labels are used as an observed variable within the model.  相似文献   

7.
Statistical topic models for multi-label document classification   总被引:2,自引:0,他引:2  
Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A?drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies.  相似文献   

8.
The Journal of Supercomputing - Twitter is a popular social network for people to share views or opinions on various topics. Many people search for health topics through Twitter; thus, obtaining a...  相似文献   

9.
Given a parametric Markov model, we consider the problem of computing the rational function expressing the probability of reaching a given set of states. To attack this principal problem, Daws has suggested to first convert the Markov chain into a finite automaton, from which a regular expression is computed. Afterwards, this expression is evaluated to a closed form function representing the reachability probability. This paper investigates how this idea can be turned into an effective procedure. It turns out that the bottleneck lies in the growth of the regular expression relative to the number of states (n Θ(log n)). We therefore proceed differently, by tightly intertwining the regular expression computation with its evaluation. This allows us to arrive at an effective method that avoids this blow up in most practical cases. We give a detailed account of the approach, also extending to parametric models with rewards and with non-determinism. Experimental evidence is provided, illustrating that our implementation provides meaningful insights on non-trivial models.  相似文献   

10.
This paper introduces a novel probabilistic method for robot based object segmentation. The method integrates knowledge of the robot’s motion to determine the shape and location of objects. This allows a robot with no prior knowledge of its workspace to isolate objects against their surroundings by moving them and observing their visual feedback. The main contribution of the paper is to improve upon current methods by allowing object segmentation in changing environments and moving backgrounds. The approach allows optimal values for the algorithm parameters to be estimated. Empirical studies against alternatives demonstrate clear improvements in both planar and three dimensional motion.  相似文献   

11.
Multi-label text classification is an increasingly important field as large amounts of text data are available and extracting relevant information is important in many application contexts. Probabilistic generative models are the basis of a number of popular text mining methods such as Naive Bayes or Latent Dirichlet Allocation. However, Bayesian models for multi-label text classification often are overly complicated to account for label dependencies and skewed label frequencies while at the same time preventing overfitting. To solve this problem we employ the same technique that contributed to the success of deep learning in recent years: greedy layer-wise training. Applying this technique in the supervised setting prevents overfitting and leads to better classification accuracy. The intuition behind this approach is to learn the labels first and subsequently add a more abstract layer to represent dependencies among the labels. This allows using a relatively simple hierarchical topic model which can easily be adapted to the online setting. We show that our method successfully models dependencies online for large-scale multi-label datasets with many labels and improves over the baseline method not modeling dependencies. The same strategy, layer-wise greedy training, also makes the batch variant competitive with existing more complex multi-label topic models.  相似文献   

12.
Human behavior recognition is one important task of image processing and surveillance system. One main challenge of human behavior recognition is how to effectively model behaviors on condition of unconstrained videos due to tremendous variations from camera motion,background clutter,object appearance and so on. In this paper,we propose two novel Multi-Feature Hierarchical Latent Dirichlet Allocation models for human behavior recognition by extending the bag-of-word topic models such as the Latent Dirichlet Allocation model and the Multi-Modal Latent Dirichlet Allocation model. The two proposed models with three hierarchies including low-level visual features,feature topics,and behavior topics can effectively fuse two different types of features including motion and static visual features,avoid detecting or tracking the motion objects,and improve the recognition performance even if the features are extracted with a great amount of noise. Finally,we adopt the variational EM algorithm to learn the parameters of these models. Experiments on the YouTube dataset demonstrate the effectiveness of our proposed models.  相似文献   

13.
文本主题引发的情感反馈与用户特征之间具有一定的关联。为了充分挖掘用户特征的价值以提高情感预测的准确度,在双层主题模型MSTM和SLTM的基础上,增加了对用户特征信息的采样层,进而提出了基于用户特征的“用户-主题-情感”三层主题模型UMSTM和USLTM。通过三层模型与基础模型在最高情感命中率以及情感概率预测相关系数的对比实验,来检验用户特征对情感预测产生的效果与影响。实验验证了UMSTM和USLTM在以上两种指标中,相对于MSTM和SLTM均有提高。  相似文献   

14.
15.
Jiang  Haixin  Zhou  Rui  Zhang  Limeng  Wang  Hua  Zhang  Yanchun 《World Wide Web》2019,22(6):2545-2560

In LDA model, independence assumptions in the Dirichlet distribution of the topic proportions lead to the inability to model the connections between topics. Some researchers have attempted to break them and thus obtained more powerful topic models. Following this strategy, by using an association matrix to measure the association between latent topics, we develop an associated topic model (ATM), in which consecutive sentences are considered important and the topic assignments for words are jointly determined by the association matrix and the sentence level topic distributions, instead of the document-specific topic distributions only. This approach gives a more realistic modeling of latent topic connections where the presence of a topic may be connected with the presence of another. We derive a collapsed Gibbs sampling algorithm for inference and parameter estimation for the ATM. The experimental results demonstrate that the ATM gives a more practical interpretation and is capable of learning more associated topics.

  相似文献   

16.
Topic models are generative probabilistic models which have been applied to information retrieval to automatically organize and provide structure to a text corpus. Topic models discover topics in the corpus, which represent real world concepts by frequently co-occurring words. Recently, researchers found topics to be effective tools for structuring various software artifacts, such as source code, requirements documents, and bug reports. This research also hypothesized that using topics to describe the evolution of software repositories could be useful for maintenance and understanding tasks. However, research has yet to determine whether these automatically discovered topic evolutions describe the evolution of source code in a way that is relevant or meaningful to project stakeholders, and thus it is not clear whether topic models are a suitable tool for this task.In this paper, we take a first step towards evaluating topic models in the analysis of software evolution by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, JHotDraw and jEdit. We define and compute various metrics on the discovered topic evolutions and manually investigate how and why the metrics evolve over time. We find that the large majority (87%–89%) of topic evolutions correspond well with actual code change activities by developers. We are thus encouraged to use topic models as tools for studying the evolution of a software system.  相似文献   

17.
18.
19.
Software developers insert logging statements in their source code to record important runtime information; such logged information is valuable for understanding system usage in production and debugging system failures. However, providing proper logging statements remains a manual and challenging task. Missing an important logging statement may increase the difficulty of debugging a system failure, while too much logging can increase system overhead and mask the truly important information. Intuitively, the actual functionality of a software component is one of the major drivers behind logging decisions. For instance, a method maintaining network communications is more likely to be logged than getters and setters. In this paper, we used automatically-computed topics of a code snippet to approximate the functionality of a code snippet. We studied the relationship between the topics of a code snippet and the likelihood of a code snippet being logged (i.e., to contain a logging statement). Our driving intuition is that certain topics in the source code are more likely to be logged than others. To validate our intuition, we conducted a case study on six open source systems, and we found that i) there exists a small number of “log-intensive” topics that are more likely to be logged than other topics; ii) each pair of the studied systems share 12% to 62% common topics, and the likelihood of logging such common topics has a statistically significant correlation of 0.35 to 0.62 among all the studied systems; and iii) our topic-based metrics help explain the likelihood of a code snippet being logged, providing an improvement of 3% to 13% on AUC and 6% to 16% on balanced accuracy over a set of baseline metrics that capture the structural information of a code snippet. Our findings highlight that topics contain valuable information that can help guide and drive developers’ logging decisions.  相似文献   

20.
Identifying influential speakers in multi-party conversations has been the focus of research in communication, sociology, and psychology for decades. It has been long acknowledged qualitatively that controlling the topic of a conversation is a sign of influence. To capture who introduces new topics in conversations, we introduce SITS—Speaker Identity for Topic Segmentation—a nonparametric hierarchical Bayesian model that is capable of discovering (1) the topics used in a set of conversations, (2) how these topics are shared across conversations, (3) when these topics change during conversations, and (4) a speaker-specific measure of “topic control”. We validate the model via evaluations using multiple datasets, including work meetings, online discussions, and political debates. Experimental results confirm the effectiveness of SITS in both intrinsic and extrinsic evaluations.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号