首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 188 毫秒
1.
异常检测是机器学习与数据挖掘的热点研究领域之一, 主要应用于故障诊断、入侵检测、欺诈检测等领域. 当前已有很多有效的相关研究工作, 特别是基于隔离森林的异常检测方法, 但在处理高维数据时仍然存在许多困难. 提出了一种新的k近邻隔离森林的异常检算法: k-nearest neighbor based isolation forest (KNIF). 该方法采用超球体作为隔离工具, 利用第k近邻的方法来构建隔离森林, 并构建基于距离的异常值计算方法. 通过充分实验表明KNIF方法能有效地进行复杂分布环境下的异常检测, 并能适应不同分布形式的应用场景.  相似文献   

2.
尚敬文  王朝坤  辛欣  应翔 《软件学报》2017,28(3):648-662
社区结构是复杂网络的一个重要特征,社区发现对研究网络结构有重要的应用价值.k-均值等经典聚类算法是解决社区发现问题的一类基本方法.然而,在处理网络的高维矩阵时,使用这些经典聚类方法得到的社区往往不够准确.提出一种基于深度稀疏自动编码器的社区发现算法CoDDA,尝试提高使用这些经典方法处理高维邻接矩阵进行社区发现的准确性.首先,提出基于跳数的处理方法,对稀疏的邻接矩阵进行优化处理.得到的相似度矩阵不仅能反映网络拓扑结构中相连节点间的相似关系,同时能反映不相连节点间的相似关系.接着,基于无监督深度学习方法,构建深度稀疏自动编码器,对相似度矩阵进行特征提取,得到低维的特征矩阵.与邻接矩阵相比,特征矩阵对网络拓扑结构有更强的特征表达能力.最后,使用k-均值算法对低维特征矩阵聚类得到社区结构.实验结果显示,与6种典型的社区发现算法相比,CoDDA算法能够发现更准确的社区结构.同时,参数实验结果显示,CoDDA算法发现的社区结构比直接使用高维邻接矩阵的基本k-均值算法发现的社区结构更为准确.  相似文献   

3.
An investigation is conducted on two well-known similarity-based learning approaches to text categorization: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. After identifying the weakness and strength of each technique, a new classifier called the kNN model-based classifier (kNN Model) is proposed. It combines the strength of both kNN and Rocchio. A text categorization prototype, which implements kNN Model along with kNN and Rocchio, is described. An experimental evaluation of different methods is carried out on two common document corpora: the 20-newsgroup collection and the ModApte version of the Reuters-21578 collection of news stories. The experimental results show that the proposed kNN model-based method outperforms the kNN and Rocchio classifiers, and is therefore a good alternative for kNN and Rocchio in some application areas. This work was partly supported by the European Commission project ICONS, project no. IST-2001-32429.  相似文献   

4.
Modelling and solving the intrusion detection problem in computer networks   总被引:1,自引:0,他引:1  
Rachid   《Computers & Security》2004,23(8):687-696
We introduce a novel anomaly intrusion detection method based on a Within-Class Dissimilarity, WCD. This approach functions by using an appropriate metric WCD to measure the distance between an unknown user and a known user defined respectively by their profile vectors. First of all, each user performs a set of commands (events) on a given system (Unix for example). The events vector of a given user profile is a binary vector, such that an element of this vector is equal to “1” if an event happens, and to “0” otherwise. In addition to this, each user's class k has a typical profile defined by the vector Pk, in order to test if a new user i defined by its profile vector Pi belongs to the same class k or not. The Pk vector is a weighted events vector Ek, such that each weight represents the number of occurrences of an event ek. If the “distance” dki (measured by a dissimilarity parameter) between an unknown profile Pi and a known profile Pk is reasonable according to a given threshold and to some constraints, then there is no intrusion. Else, the user i is suspicious. A simple example illustrates the WCD procedure. A survey of intrusion detection methods is presented.Our proposed method based on clustering users and using simple statistical formulas is very easy for implementation.  相似文献   

5.
We present in this paper an approach that is composed of two techniques that respectively tackle the issues of detecting corporate cyber scanning and clustering distributed reconnaissance activity. The first employed technique is based on a non-attribution anomaly detection approach that focuses on what is being scanned rather than who is performing the scanning. The second technique adopts a statistical time series approach that is rendered by observing the correlation status of a traffic signal to perform the identification and clustering. To empirically validate both techniques, we utilize and examine two real network traffic datasets and implement two experimental environments. The first dataset comprises of unsolicited one-way telescope/darknet traffic while the second dataset has been captured in our lab through a customized setup. The results show, on one hand, that for a class C network with 250 active hosts and 5 monitored servers, the training period of the proposed detection technique required a stabilization time of less than 1 s and a state memory of 80 bytes. Moreover, in comparison with Snort’s sfPortscan technique, it was able to detect 4215 unique scans and yielded zero false negative. On the other hand, the proposed clustering technique is able to correctly identify and cluster the scanning machines with high accuracy even in the presence of legitimate traffic. We further validate this clustering technique by formulating the presented scenario as a machine learning problem. Specifically, we compare our proposed technique with an unsupervised data clustering technique that adopts the k-means and the expectation maximization approach. The results authenticate our clustering technique rendering it feasible for adoption.  相似文献   

6.
An online incremental method of vision-only loop-closure detection for long-term robot navigation is proposed. The method is based on the scheme of direct feature matching which has recently become more efficient than the Bag-of-Words scheme in many challenging environments. The contributions of the paper are the application of hierarchical k-means to speed-up feature matching time and the improvement of the score calculation technique used to determine the loop-closing location. As a result, the presented method runs quickly in long term while retaining high accuracy. The searching cost has been markedly reduced. The proposed method requires no any off-line dictionary generation processes. It can start anywhere at any times. The evaluation has been done on standard well-known datasets being challenging in perceptual aliasing and dynamic changes. The results show that the proposed method offers high precision-recall in large-scale different environments with real-time computation comparing to other vision-only loop-closure detection methods.  相似文献   

7.
ABSTRACT

Sniffers are program that are used to capture network packets illegally. It is a malicious activity performed by network users, and because of this network security is at risk. Detection of sniffers is an essential task to maintaining network security. Man in the middle (MiM) intrusion detection, switched network sniffer detection based on IP packet routing, and ARP watch detection techniques are used to detect a sniffer in switch Local Area Network (LAN) environments.

This paper highlights the shortcomings of previous techniques that detect sniffers in switch LAN environments. We propose a new sniffer detection technique based on Internet Protocol (IP) packet routing that removes the shortcomings of previous techniques. Experimental results demonstrate that the proposed detection technique shows a significance improvement in network performance.  相似文献   

8.
Clustering is a popular data analysis and data mining technique. A popular technique for clustering is based on k-means such that the data is partitioned into K clusters. However, the k-means algorithm highly depends on the initial state and converges to local optimum solution. This paper presents a new hybrid evolutionary algorithm to solve nonlinear partitional clustering problem. The proposed hybrid evolutionary algorithm is the combination of FAPSO (fuzzy adaptive particle swarm optimization), ACO (ant colony optimization) and k-means algorithms, called FAPSO-ACO–K, which can find better cluster partition. The performance of the proposed algorithm is evaluated through several benchmark data sets. The simulation results show that the performance of the proposed algorithm is better than other algorithms such as PSO, ACO, simulated annealing (SA), combination of PSO and SA (PSO–SA), combination of ACO and SA (ACO–SA), combination of PSO and ACO (PSO–ACO), genetic algorithm (GA), Tabu search (TS), honey bee mating optimization (HBMO) and k-means for partitional clustering problem.  相似文献   

9.
Intrusion detection is a necessary step to identify unusual access or attacks to secure internal networks. In general, intrusion detection can be approached by machine learning techniques. In literature, advanced techniques by hybrid learning or ensemble methods have been considered, and related work has shown that they are superior to the models using single machine learning techniques. This paper proposes a hybrid learning model based on the triangle area based nearest neighbors (TANN) in order to detect attacks more effectively. In TANN, the k-means clustering is firstly used to obtain cluster centers corresponding to the attack classes, respectively. Then, the triangle area by two cluster centers with one data from the given dataset is calculated and formed a new feature signature of the data. Finally, the k-NN classifier is used to classify similar attacks based on the new feature represented by triangle areas. By using KDD-Cup ’99 as the simulation dataset, the experimental results show that TANN can effectively detect intrusion attacks and provide higher accuracy and detection rates, and the lower false alarm rate than three baseline models based on support vector machines, k-NN, and the hybrid centroid-based classification model by combining k-means and k-NN.  相似文献   

10.
This note presents a simplification and generalization of an algorithm for searchingk-dimensional trees for nearest neighbors reported by Friedmanet al [3]. If the distance between records is measured usingL 2 , the Euclidean norm, the data structure used by the algorithm to determine the bounds of the search space can be simplified to a single number. Moreover, because distance measurements inL 2 are rotationally invariant, the algorithm can be generalized to allow a partition plane to have an arbitrary orientation, rather than insisting that it be perpendicular to a coordinate axis, as in the original algorithm. When ak-dimensional tree is built, this plane can be found from the principal eigenvector of the covariance matrix of the records to be partitioned. These techniques and others yield variants ofk-dimensional trees customized for specific applications.It is wrong to assume thatk-dimensional trees guarantee that a nearest-neighbor query completes in logarithmic expected time. For smallk, logarithmic behavior is observed on all but tiny trees. However, for largerk, logarithmic behavior is achievable only with extremely large numbers of records. Fork = 16, a search of ak-dimensional tree of 76,000 records examines almost every record.  相似文献   

11.
王洪亚  杨利宏  刘晓强 《软件学报》2016,27(12):3051-3066
相似连接算法在数据清理、数据集成和重复网页检测等领域有着广泛的应用.现有相似连接算法有两种类型:基于相似度阈值的相似连接和Top-k相似连接.Top-k连接算法非常适合于相似度阈值未知的应用场景,目前最为有效的Top-k相似连接算法是Xiao等人提出的Topk-join.为了解决Topk-join中存在的性能问题,提出了一种Top-k相似连接算法Opt-join,该算法将Token批处理技术集成在现有的事件驱动框架中,以降低前缀事件的处理代价;通过置换哈希查找与过滤操作的执行位置来降低哈希查找代价,并理论证明了该置换的正确性.实验结果表明:与Topk-join算法相比,Opt-join取得了1.28倍~3.09倍的性能提升.实验数据还显示:随着数据长度的增加或k值的增长,Opt-join的性能优势有不断增加的趋势.  相似文献   

12.
Text categorization refers to the task of assigning the pre-defined classes to text documents based on their content. k-NN algorithm is one of top performing classifiers on text data. However, there is little research work on the use of different voting methods over text data. Also, when a huge number of training data is available online, the response speed slows down, since a test document has to obtain the distance with each training data. On the other hand, min–max-modular k-NN (M3-k-NN) has been applied to large-scale text categorization. M3-k-NN achieves a good performance and has faster response speed in a parallel computing environment. In this paper, we investigate five different voting methods for k-NN and M3-k-NN. The experimental results and analysis show that the Gaussian voting method can achieve the best performance among all voting methods for both k-NN and M3-k-NN. In addition, M3-k-NN uses less k-value to achieve the better performance than k-NN, and thus is faster than k-NN in a parallel computing environment. The work of K. Wu and B. L. Lu was supported in part by the National Natural Science Foundation of China under the grants NSFC 60375022 and NSFC 60473040, and the Microsoft Laboratory for Intelligent Computing and Intelligent Systems of Shanghai Jiao Tong University.  相似文献   

13.
一种基于彩色编码技术的基序发现算法   总被引:2,自引:0,他引:2  
王建新  黄元南  陈建二 《软件学报》2007,18(6):1298-1307
从DNA序列中发现基序是生物计算中的一个重要问题,序列条数K=20包含基序用例的序列条数k=16的(l,d)-(K-k)问题(记作(l,d)-(20-16)问题)是目前生物学家十分关注的基序发现问题.针对该问题提出了一种基于彩色编码技术的SDA(sample-driven algorithm)搜索算法--彩色编码基序搜索算法(color coding motif finding algorithm,简称CCMF算法).它利用彩色编码技术将该问题转化为(l,d)-(16-16)问题,再采用分治算法和分支定界法来求解.在解决将(l,d)-(20-16)问题转化为(l,d)-(16-16)问题时,CCMF算法利用彩色编码技术将4 845个组合降低到403个着色,这将极大地提高算法的整体运行效率.使用模拟数据和生物数据进行测试的结果表明,CCMF算法能够快速发现所有(l,d)-(20-16)问题的基序模型和基序用例,具有优于其他算法的综合性能评价,能够用于真实的基序发现问题.同时,通过修改着色方案,CCMF算法可以用于求解一般的(l,d)-(K-k)问题,其中,kK.  相似文献   

14.
李鸣鹏  高宏  邹兆年 《软件学报》2014,25(4):797-812
研究了基于图压缩的k可达查询处理,提出了一种支持k可达查询的图压缩算法k-RPC及无需解压缩的查询处理算法,k-RPC算法在所有基于等价类的支持k-reach查询的图压缩算法中是最优的.由于k-RPC算法是基于严格的等价关系,因此进一步又提出了线性时间的近似图压缩算法k-GRPC.k-GRPC算法允许从原始图中删除部分边,然后使用k-RPC获得更好的压缩比.提出了线性时间的无需解压缩的查询处理算法.真实数据上的实验结果表明,对于稀疏的原始图,两种压缩算法的压缩比分别可以达到45%,对于稠密的原始图,两种压缩算法的压缩比分别可以达到75%和67%;与在原始图上直接进行查询处理相比,两种基于压缩图的查询处理算法效率更好,在稀疏图上的查询效率可以提高2.5倍.  相似文献   

15.
The K nearest neighbors approach is a viable technique in time series analysis when dealing with ill-conditioned and possibly chaotic processes. Such problems are frequently encountered in, e.g., finance and production economics. More often than not, the observed processes are distorted by nonnormal disturbances, incomplete measurements, etc. that undermine the identification, estimation and performance of multivariate techniques. If outliers can be duly recognized, many crisp statistical techniques may perform adequately as such. Geno-mathematical programming provides a connection between statistical time series theory and fuzzy regression models that may be utilized e.g., in the detection of outliers. In this paper we propose a fuzzy distance measure for detecting outliers via geno-mathematical parametrization. Fuzzy KNN is connected as a linkable library to the genetic hybrid algorithm (GHA) of the author, in order to facilitate the determination of the LR-type fuzzy number for automatic outlier detection in time series data. We demonstrate that GHA[Fuzzy KNN] provides a platform for automatically detecting outliers in both simulated and real world data.  相似文献   

16.
Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported performing well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345–366], to document clustering. If two documents are similar enough, they are considered as neighbors of each other. And the link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link involve the global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that our proposed methods can significantly improve the performance of document clustering in terms of accuracy without increasing the execution time much.  相似文献   

17.
利用传统的k匿名技术在社会网络中进行隐私保护时会存在聚类准则单一、图中数据信息利用不足等问题. 针对该问题, 提出了一种利用Kullback-Leibler (KL)散度衡量节点1-邻居图相似性的匿名技术(anonymization techniques for measuring the similarity of node 1-neighbor graph based on Kullback-Leibler divergence, SNKL). 根据节点1-邻居图分布的相似性对原始图节点集进行划分, 按照划分好的类进行图修改, 使修改后的图满足k匿名, 完成图的匿名发布. 实验结果表明, SNKL方法与HIGA方法相比在聚类系数上的改变量平均降低了17.3%, 同时生成的匿名图与原始图重要性节点重合度保持在95%以上. 所提方法在有效保证隐私的基础上, 可以显著的降低对原始图结构信息的改变.  相似文献   

18.
Given a graph with a source and a sink node, the NP-hard maximum k-splittable s,t-flow (M k SF) problem is to find a flow of maximum value from s to t with a flow decomposition using at most k paths. The multicommodity variant of this problem is a natural generalization of disjoint paths and unsplittable flow problems. Constructing a k-splittable flow requires two interdepending decisions. One has to decide on k paths (routing) and on the flow values for the paths (packing). We give efficient algorithms for computing exact and approximate solutions by decoupling the two decisions into a first packing step and a second routing step. Usually the routing is considered before the packing. Our main contributions are as follows: (i) We show that for constant k a polynomial number of packing alternatives containing at least one packing used by an optimal M k SF solution can be constructed in polynomial time. If k is part of the input, we obtain a slightly weaker result. In this case we can guarantee that, for any fixed ε>0, the computed set of alternatives contains a packing used by a (1−ε)-approximate solution. The latter result is based on the observation that (1−ε)-approximate flows only require constantly many different flow values. We believe that this observation is of interest in its own right. (ii) Based on (i), we prove that, for constant k, the M k SF problem can be solved in polynomial time on graphs of bounded treewidth. If k is part of the input, this problem is still NP-hard and we present a polynomial time approximation scheme for it.  相似文献   

19.
On-line hierarchical clustering   总被引:1,自引:0,他引:1  
Most of the techniques used in the literature for hierarchical clustering are based on off-line operation. The main contribution of this paper is to propose a new algorithm for on-line hierarchical clustering by finding the nearest k objects to each introduced object so far and these nearest k objects are continuously updated by the arrival of a new object. By final object, we have the objects and their nearest k objects which are sorted to produce the hierarchical dendogram. The results of the application of the new algorithm on real and synthetic data and also using simulation experiments, show that the new technique is quite efficient and, in many respects, superior to traditional off-line hierarchical methods.  相似文献   

20.
This paper presents a new class of (malicious) codes denoted k-ary codes. Instead of containing the whole instructions composing the program’s action, this type of codes is composed of k distinct parts which constitute a partition of the entire code. Each of these parts contains only a subset of the instructions. When considered alone (e.g. by an antivirus) every part cannot be distinguished from a normal uninfected program while their respective action combined according to different possible modes results in the offensive behaviour. In this paper, we presents a formalisation of this type of codes by means of Boolean functions and give their detailed taxonomy. We first show that classical malware are just a particular instance of this general model then we specifically address the case of k-ary codes. We give some complexity results about their detection based on the interaction between the different parts. As a general result, the detection is proved to be NP-complete.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号