Similar Literature
20 similar documents found.
1.
Ordination is a powerful method for analysing complex data sets but has been largely ignored in sequence analysis. This paper shows how to use principal coordinates analysis to find low-dimensional representations of distance matrices derived from aligned sets of sequences. The method takes a matrix of Euclidean distances between all pairs of sequences and finds a coordinate space where the distances are exactly preserved. The main problem is to find a measure of distance between aligned sequences that is Euclidean. The simplest distance function is the square root of the percentage difference (as measured by identities) between two sequences, where one ignores any positions in the alignment where there is a gap in any sequence. If one does not ignore positions with a gap, the distances cannot be guaranteed to be Euclidean, but the deleterious effects are trivial. Two examples of using the method are shown. A set of 226 aligned globins was analysed, and the resulting ordination very successfully represents the known patterns of relationship between the sequences. In the other example, a set of 610 aligned 5S rRNA sequences was analysed. Sequence ordinations complement phylogenetic analyses; they should not be viewed as a complete alternative.
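A minimal sketch, assuming equal-length aligned strings with '-' as the gap character, of the computation described above: pairwise square-root-of-percentage-difference distances followed by classical principal coordinates analysis (double centring plus eigendecomposition). Function names and the toy data are illustrative, not from the paper.

    import numpy as np

    def sqrt_percent_distance(a, b):
        # Positions with a gap in either sequence are ignored, as the abstract suggests.
        pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
        mismatches = sum(1 for x, y in pairs if x != y)
        return (100.0 * mismatches / len(pairs)) ** 0.5

    def pcoa(seqs, k=2):
        # Classical PCoA: double-centre the squared distance matrix and use the
        # leading eigenvectors (scaled by the square roots of the eigenvalues).
        n = len(seqs)
        D = np.array([[sqrt_percent_distance(s, t) for t in seqs] for s in seqs])
        J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
        B = -0.5 * J @ (D ** 2) @ J                # Gower's double centring
        w, v = np.linalg.eigh(B)
        order = np.argsort(w)[::-1][:k]            # largest eigenvalues first
        return v[:, order] * np.sqrt(np.maximum(w[order], 0.0))

    print(pcoa(["ACGTCA", "ACGTTA", "TCGTTA", "TCGATA"], k=2))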

2.
Model management systems are computerised systems that facilitate the management of large numbers of decision models used in organizations. Model selection and sequencing in a model management system is the problem of processing a given model base in order to arrive at a sequence of models that can be executed to produce a set of required outputs (goal). Prior solution approaches do not attempt to solve this problem such that the goal is achieved while best meeting the objectives of the user. Instead, research to date has typically provided the first sequence of models that satisfies the goal, without attempting to optimise the objectives of the user. This restricts the applicability of many existing approaches to problems with unique solutions or to situations where users exhibit no preference among the candidate model sequences (i.e. solutions). In many real-world problems, however, multiple solutions may exist and users may prefer a certain solution over the others, based on a variety of criteria such as solution cost, accuracy and so on. In this paper, we present an architecture based on the concept of blackboard control that solves the model selection and sequencing problem while attempting to optimise the objectives of the user. We also discuss the applicability of the proposed approach for solving other problems encountered in the area of model management.

3.
We consider a feature selection problem where the decision-making objective is to minimize overall misclassification cost by selecting relevant features from a training dataset. We propose a two-stage solution approach for solving the misclassification cost minimizing feature selection (MCMFS) problem. Additionally, we propose a maximum-margin genetic algorithm (MMGA) that maximizes the margin of separation between classes by taking into account all examples, as opposed to maximizing the margin of separation using a few support vectors. Feature selection is carried out by either an exhaustive or a heuristic simulated annealing approach in the first stage, and cost-sensitive classification using either the MMGA or cost-sensitive support vector machines (SVM) in the second stage. Using simulated and real-world data sets and different misclassification cost matrices, we test our two-stage approach for solving the MCMFS problem. Our results indicate that feature selection plays an important role when misclassification cost asymmetries increase, and that the MMGA shows equal or better performance than the SVM.

4.
Software Tools for DNA Sequence Design
The design of DNA sequences is a key problem for implementing molecular self-assembly with nucleic acid molecules. These molecules must meet several physical, chemical and logical requirements, mainly to avoid mishybridization. Since manual selection of proper sequences is too time-consuming for more than a handful of molecules, the aid of computer programs is advisable. In this paper two software tools for designing DNA sequences are presented, the DNASequenceGenerator and the DNASequenceCompiler. Both employ a measure of sequence dissimilarity based on the uniqueness of overlapping subsequences and a graph-based algorithm for sequence generation. Other sequence properties such as melting temperature or forbidden subsequences are also taken into account, but secondary structure errors and equilibrium chemistry are not. Fields of application are DNA computing and DNA-based nanotechnology. In the second part of this paper, sequences generated with the DNASequenceGenerator are compared to those from several publications of other groups, an example application for the DNASequenceCompiler is presented, and the advantages and disadvantages of the presented approach are discussed.
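The two tools themselves are not reproduced here; the following is only an illustrative sketch, under assumed names and parameters, of the dissimilarity criterion the abstract mentions: a candidate strand is accepted when none of its overlapping k-mers (or their reverse complements) already occurs in the pool of accepted sequences.

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def kmers(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def reverse_complement(seq):
        return seq.translate(COMPLEMENT)[::-1]

    def is_unique(candidate, pool, k=6):
        # Hypothetical acceptance test: every k-mer of the candidate must be absent
        # from all previously accepted sequences and their reverse complements.
        used = set()
        for s in pool:
            used |= kmers(s, k) | kmers(reverse_complement(s), k)
        return kmers(candidate, k).isdisjoint(used)

    pool = ["ACGTACGTAA"]
    print(is_unique("TTGGCCAATT", pool))   # True: no 6-mer is shared with the pool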

5.
The reconstruction of founder genetic sequences of a population is a relevant issue in evolutionary biology research. The problem consists in finding a biologically plausible set of genetic sequences (founders), which can be recombined to obtain the genetic sequences of the individuals of a given population. The reconstruction of these sequences can be modelled as a combinatorial optimisation problem in which one has to find a set of genetic sequences such that the individuals of the population under study can be obtained by recombining founder sequences, minimising the number of recombinations. This problem is called the founder sequence reconstruction problem. Solving this problem can contribute to research in understanding the origins of specific genotypic traits. In this paper, we present large neighbourhood search algorithms to tackle this problem. The proposed algorithms combine a stochastic local search with a branch-and-bound algorithm devoted to neighbourhood exploration. The developed algorithms are thoroughly evaluated on three different benchmark sets and they establish the new state of the art for realistic problem instances.

6.
DNA computation exploits the computational power inherent in molecules for information processing. However, in order to perform the computation correctly, a set of good DNA sequences is crucial. A lot of work has been carried out on designing good DNA sequences to achieve reliable molecular computation. In this article, the ant colony system (ACS) is introduced as a new tool for DNA sequence design. In this approach, DNA sequence design is modeled as a path-finding problem, which consists of four nodes, to enable the implementation of the ACS. The results of the proposed approach are compared with other methods such as the genetic algorithm.

7.
汪美玲  周翔  陶秋铭  赵琛 《软件学报》2015,26(9):2326-2338
Tag clouds are a popular mechanism by which social websites provide description and navigation of online resources. Tag selection, that is, choosing a limited set of representative tags from a large number of tags, is the core task in building a tag cloud. The diversity of the tag selection result is an important factor affecting user satisfaction. Information coverage and tag dissimilarity are the two main angles from which diversity is introduced into tag selection. To further improve the information coverage and tag dissimilarity of the selection result, three tag selection methods are proposed. In each method, an objective function is defined to simultaneously quantify the information coverage and tag dissimilarity of a tag set, and an approximation algorithm is designed to solve the corresponding maximization problem; the approximation ratio of each algorithm is also analysed. Using tagging data sets from the CiteULike and Last.fm websites, the proposed methods are compared with existing methods. Experimental results show that the proposed methods perform well in terms of both information coverage and tag dissimilarity.
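The paper's objective functions and approximation algorithms are not given in the abstract, so the sketch below only illustrates the underlying idea of balancing information coverage against tag similarity with a greedy heuristic; the gain formula, the weight alpha and all names are assumptions.

    def select_tags(tag_to_items, k, similarity, alpha=0.5):
        # Greedy selection: repeatedly add the tag with the best marginal gain,
        # where gain = newly covered items minus a penalty for similarity to the
        # tags already chosen.
        chosen, covered = [], set()
        candidates = set(tag_to_items)
        while candidates and len(chosen) < k:
            def gain(t):
                coverage_gain = len(tag_to_items[t] - covered)
                sim_penalty = max((similarity(t, c) for c in chosen), default=0.0)
                return coverage_gain - alpha * sim_penalty
            best = max(candidates, key=gain)
            chosen.append(best)
            covered |= tag_to_items[best]
            candidates.remove(best)
        return chosen

    tags = {"rock": {1, 2, 3}, "metal": {2, 3}, "jazz": {4, 5}}
    sim = lambda a, b: 1.0 if {a, b} == {"rock", "metal"} else 0.0
    print(select_tags(tags, 2, sim))   # ['rock', 'jazz']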

8.
This paper describes a generic algorithm for finding restriction sites within DNA sequences. The 'genericity' of the algorithm is made possible through the use of set theory. Basic elements of DNA sequences, i.e. nucleotides (bases), are represented as sets, and DNA sequences, whether specific, ambiguous or even protein-coding, are represented as sequences of those sets. The set intersection operation demonstrates its ability to perform pattern matching correctly on various DNA sequences. The performance analysis showed that the degree of complexity of the pattern matching is reduced from exponential to linear. An example is given to show the actual and potential restriction sites, derived by the generic algorithm, in the DNA sequence template coding for a synthetic calmodulin.
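A small sketch of the set-theoretic matching idea: each base, including IUPAC ambiguity codes, is represented as a set of nucleotides, and a position matches when the intersection of the two sets is non-empty. The code table is partial and the example pattern is illustrative.

    IUPAC = {
        "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
        "R": {"A", "G"}, "Y": {"C", "T"}, "W": {"A", "T"}, "S": {"C", "G"},
        "K": {"G", "T"}, "M": {"A", "C"}, "N": {"A", "C", "G", "T"},
    }

    def find_sites(sequence, pattern):
        # Return start positions where the (possibly ambiguous) recognition pattern
        # can match the (possibly ambiguous) sequence.
        hits = []
        for i in range(len(sequence) - len(pattern) + 1):
            window = sequence[i:i + len(pattern)]
            if all(IUPAC[s] & IUPAC[p] for s, p in zip(window, pattern)):
                hits.append(i)
        return hits

    # GANTC (HinfI-style pattern) against a template containing an ambiguous base
    print(find_sites("AAGACTCGTNTC", "GANTC"))   # [2]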

9.
The rapid increase of available DNA, protein, and other biological sequences has made the problem of discovering meaningful patterns from sequences an important task for Bioinformatics research. Among all types of patterns defined in the literature, the most challenging one is to find repeating patterns with gap constraints. In this article, we identify a new research problem for mining approximate repeating patterns (ARPs) with gap constraints, where the appearance of a pattern is subject to an approximate match, which is very common in biological sequences. To solve the problem, we propose an ArpGap (ARP mining with Gap constraints) algorithm with three major components for ARP mining: (1) a data-driven pattern generation approach to avoid generating unnecessary candidates for validation; (2) a back-tracking pattern search process to discover approximate occurrences of a pattern under user-specified gap constraints; and (3) an Apriori-like deterministic pruning approach to progressively prune patterns and cease the search process if necessary. Experimental results on synthetic and real-world protein sequences assert that ArpGap is efficient in terms of memory consumption and computational cost. The results further suggest that the proposed method is practical for discovering approximate patterns in protein sequences, where the sequence length is usually several hundred to one thousand and the pattern length is relatively short.

10.
Computers & Geosciences, 2003, 29(6): 741-751
Some applications use data formats (e.g. the STL file format) where a set of triangles is used to represent the surface of a 3D object and it is necessary to reconstruct the triangular mesh with adjacency information. This is a lengthy process for large data sets, as the time complexity of this process is O(N log N), where N is the number of triangles. Triangular mesh reconstruction is a general problem and relevant algorithms can be used in GIS and DTM systems as well as in CAD/CAM systems. Many algorithms rely on space subdivision techniques, while hash functions offer a more effective solution to the reconstruction problem. Hash data structures are widely used throughout the field of computer science. A hash table can be used to speed up the process of triangular mesh reconstruction, but the speed strongly depends on the properties of the hash function. Nevertheless, the design or selection of a hash function for data sets with unknown properties is a serious problem. This paper describes a new hash function, presents the properties obtained for large data sets, and discusses the validity of the reconstructed surface. Experimental results confirmed the theoretical considerations and the advantages of using a hash function for mesh reconstruction.
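The paper's hash function itself is not reproduced here; the sketch below only illustrates why hashing helps, using Python's built-in dictionary hashing as a stand-in: identical vertex coordinates fall into the same bucket, so shared edges, and hence triangle adjacency, can be recovered from an STL-like triangle soup in roughly linear expected time.

    from collections import defaultdict

    def build_adjacency(triangles):
        # triangles: list of 3-tuples of (x, y, z) vertex coordinates.
        # Returns, for every undirected edge, the indices of the triangles sharing it.
        vertex_id = {}                      # coordinate tuple -> vertex index
        def vid(v):
            return vertex_id.setdefault(v, len(vertex_id))
        edge_to_tris = defaultdict(list)    # edge (vertex id pair) -> triangle indices
        for t_index, (a, b, c) in enumerate(triangles):
            ia, ib, ic = vid(a), vid(b), vid(c)
            for e in ((ia, ib), (ib, ic), (ic, ia)):
                edge_to_tris[tuple(sorted(e))].append(t_index)
        return edge_to_tris

    tris = [((0, 0, 0), (1, 0, 0), (0, 1, 0)),
            ((1, 0, 0), (1, 1, 0), (0, 1, 0))]
    adj = build_adjacency(tris)
    print([e for e, ts in adj.items() if len(ts) == 2])   # the one shared edge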

11.
In many information processing tasks, large numbers of unlabelled samples are easy to obtain, but labelling them is very time-consuming and laborious. Active learning, an important approach in machine learning, reduces the cost of manual annotation by selecting the most informative samples for labelling. However, most existing active learning algorithms are classifier-based supervised methods, and such algorithms are not applicable to sample selection when no label information is available at all. To address this problem, drawing on the algorithmic ideas of optimal experimental design and combining them with adaptive sparse neighbourhood reconstruction theory, an active learning algorithm based on adaptive sparse neighbourhood reconstruction is proposed. The algorithm adaptively chooses the neighbourhood size according to the different distributions in each region of the data set, carries out the search for neighbourhood points and the computation of reconstruction coefficients simultaneously, and can, without any label information, select the samples that best represent the distribution structure of the sample set. Experiments on synthetic and real-world data sets show that, at the same annotation cost, the active learning algorithm based on adaptive sparse neighbourhood reconstruction achieves higher classification accuracy and robustness.

12.
Single nucleotide polymorphism (SNP) in human genomes is considered to be highly associated with complex genetic diseases. As a consequence, obtaining all SNPs from human populations is one of the primary goals of recent studies on human genomics. The two sequences of SNPs in diploid human organisms are called haplotypes. In this paper, the problem of haplotype reconstruction from SNP fragments with and without genotype information is studied. Minimum error correction (MEC) is an important model for this problem but is only effective when the error rate of the fragments is low. MEC/GI, as an extension of the MEC model, employs the related genotype information besides the SNP fragments and therefore results in a more accurate inference. We introduce neural network-based algorithmic approaches and experimentally show that our methods are fast and accurate. In particular, compared with a feed-forward (back-propagation-like) neural network, our approach is faster, more accurate, and also suitable for solving the MEC model.
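For reference, the sketch below spells out the MEC objective mentioned above: given a bipartition of SNP fragments into two haplotype groups, the cost is the number of fragment entries that must be flipped so that each group is consistent with a single haplotype. The fragment encoding ('0'/'1' alleles, '-' for no coverage) and the majority consensus rule are standard assumptions; the paper's neural approach is not shown.

    def mec_cost(fragments, partition):
        # fragments: strings over {'0', '1', '-'}; partition: 0/1 label per fragment.
        total = 0
        for side in (0, 1):
            group = [f for f, p in zip(fragments, partition) if p == side]
            if not group:
                continue
            for col in range(len(group[0])):
                calls = [f[col] for f in group if f[col] != '-']
                ones = sum(c == '1' for c in calls)
                # the consensus keeps the majority allele; minority entries count as errors
                total += min(ones, len(calls) - ones)
        return total

    frags = ["0011-", "00-1-", "-1100", "11-00"]
    print(mec_cost(frags, [0, 0, 1, 1]))   # 0: this bipartition is already consistent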

13.
Sequence segmentation has gained popularity in bioinformatics and particularly in studying DNA sequences. Information theoretic models have been used to provide accurate solutions for the segmentation of DNA sequences. Existing dynamic programming approaches provide an optimal solution to the segmentation problem. However, their quadratic time complexity prohibits their applicability to long sequences. In this paper, we propose a parallel approach to improve the performance of a quasilinear sequence segmentation algorithm. The target segmentation technique is a divide-and-conquer recursive algorithm that is based on information theory principles and models. We present three parallel implementations that aim at reducing the segmentation time. The first implementation uses the multithreading capabilities of CPUs. The second one is a hybrid implementation that exploits the synergy between the CPU and the multithreading power of GPUs. The third implementation is a variation of the hybrid approach that uses the concept of unified memory between the CPU and the GPU instead of the standard memory copy approach. We demonstrate the applicability of the parallel implementations by testing them on real DNA sequences and on randomly generated sequences with different lengths and different numbers of unique elements. The results show that the hybrid CPU-GPU approach outperforms the sequential implementation with a speedup of up to 5.9X, while the CPU parallel implementation provides a speedup of only 1.7X.

14.
Most algorithms for surface reconstruction from sample points rely on computationally demanding operations to derive the reconstruction. In this paper we introduce an innovative approach for generating 3D piecewise linear approximations from sample points that relies strongly on topological information, thus reducing the computational cost and numerical instabilities typically associated with geometric computations. Discrete Morse theory provides the basis for a topological framework that supports a robust reconstruction algorithm capable of handling multiple components at low computational cost. We describe the proposed approach and introduce the reconstruction algorithm, called TSR (topological surface reconstructor). Some reconstruction results are presented and the performance of TSR is compared with that of other reconstruction approaches for some standard point sets.

15.
The design of reliable DNA sequences is crucial in many engineering applications which depend on DNA-based technologies, such as nanotechnology or DNA computing. In these cases, two of the most important properties that must be controlled to obtain reliable sequences are self-assembly and self-complementary hybridization. These processes have to be restricted to avoid undesirable reactions, because in the specific case of DNA computing, undesirable reactions usually lead to incorrect computations. Therefore, it is important to design robust sets of sequences which provide efficient and reliable computations. The design of reliable DNA sequences involves heterogeneous and conflicting design criteria that do not fit traditional optimization methods. In this paper, DNA sequence design has been formulated as a multiobjective optimization problem and a novel multiobjective approach based on swarm intelligence has been proposed to solve it. Specifically, a multiobjective version of the Artificial Bee Colony metaheuristic (MO-ABC) is developed to tackle the problem. MO-ABC takes into consideration six different conflicting design criteria to generate reliable DNA sequences that can be used for bio-molecular computing. Moreover, in order to verify the effectiveness of the novel multiobjective proposal, formal comparisons with the well-known multiobjective standard NSGA-II (fast non-dominated sorting genetic algorithm) were performed. After a detailed study, the results indicate that our artificial swarm intelligence approach obtains satisfactory, reliable DNA sequences. Two multiobjective indicators were used in order to compare the developed algorithms: hypervolume and set coverage. Finally, other relevant works published in the literature were also studied to validate our results. In this respect, the conclusion that can be drawn is that the novel approach proposed in this paper obtains very promising DNA sequences that significantly surpass previously published results.

16.
There is currently an abundance of vision algorithms which, provided with a sequence of images that have been acquired from sufficiently close successive 3D locations, are capable of determining the relative positions of the viewpoints from which the images have been captured. However, very few of these algorithms can cope with unordered image sets. This paper presents an efficient method for recovering the position and orientation parameters corresponding to the viewpoints of a set of panoramic images for which no a priori order information is available, along with certain structure information regarding the imaged environment. The proposed approach assumes that all images have been acquired from a constant height above a planar ground and operates sequentially, employing the Levenshtein distance to deduce the spatial proximity of image viewpoints and thus determine the order in which images should be processed. The Levenshtein distance also provides matches between imaged points, from which their corresponding environment points can be reconstructed. Image matching with the aid of the Levenshtein distance forms the crux of an iterative process that alternates between image localization from multiple reconstructed points and point reconstruction from multiple image projections, until all views have been localized. Periodic refinement of the reconstruction with the aid of bundle adjustment distributes the reconstruction error among images. The approach is demonstrated on several unordered sets of panoramic images obtained in indoor environments.
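The Levenshtein distance referred to above is the classic dynamic programming edit distance; a minimal sketch follows, applied to hypothetical per-image label strings, with smaller distances read as spatially closer viewpoints.

    def levenshtein(a, b):
        # Classic O(len(a) * len(b)) dynamic programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,              # deletion
                               cur[j - 1] + 1,           # insertion
                               prev[j - 1] + (ca != cb)  # substitution
                               ))
            prev = cur
        return prev[-1]

    # hypothetical label strings extracted from two panoramic images
    print(levenshtein("abcdef", "abXdef"))   # 1 -> likely neighbouring viewpoints
    print(levenshtein("abcdef", "uvwxyz"))   # 6 -> probably far apart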

17.
Tandem repeats are difficult fragments in genome assembly: because of the similarity between their repeat units and the uncertainty of their copy numbers, a read can be mapped to multiple candidate positions during sequence alignment, and quickly and accurately selecting the correct alignment position is a challenge. Existing methods use seeds (short sequences selected from the sequencing reads) to locate and extend candidate alignment positions, but the characteristics of tandem repeats are not considered when choosing seeds. This paper therefore proposes a position-filtering method for tandem repeat alignment, which filters alignment results by computing the similarity of rare k-mer (length-k subsequence) sequences. In addition, a strategy of merging rare k-mers is adopted to accelerate the computation, and an edit-distance-based fuzzy lookup is used to increase the density of the filtering information. Experimental results show that, while improving the recall and precision of the alignment results on simulated data sets, the method is about twice as fast as existing methods and exhibits good parallel speedup.
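A rough sketch, under stated assumptions, of the rare-k-mer filtering idea: k-mers that occur rarely in the reference are extracted from the read, and each candidate position is scored by how many of them its window shares. The thresholds and names are illustrative, and the merging and edit-distance fuzzy-lookup steps of the paper are omitted.

    from collections import Counter

    def rare_kmers(read, ref_kmer_counts, k, max_count):
        return {read[i:i + k] for i in range(len(read) - k + 1)
                if ref_kmer_counts.get(read[i:i + k], 0) <= max_count}

    def filter_positions(read, reference, candidates, k=8, max_count=2):
        # Score each candidate window by the rare k-mers it shares with the read
        # and keep only the best-scoring positions.
        ref_counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
        rare = rare_kmers(read, ref_counts, k, max_count)
        scores = {}
        for pos in candidates:
            window = reference[pos:pos + len(read)]
            window_kmers = {window[i:i + k] for i in range(len(window) - k + 1)}
            scores[pos] = len(rare & window_kmers)
        best = max(scores.values(), default=0)
        return [p for p, s in scores.items() if s == best]

    ref = "ACGT" * 50 + "TTTTGGGGCCCCAAAA" + "ACGT" * 50
    print(filter_positions("TTTTGGGGCCCCAAAA", ref, candidates=[0, 100, 200]))   # [200]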

18.
Research on a fast image matching method based on a grey genetic algorithm
To address the slow speed and poor noise resistance of image matching, grey relational theory is combined with a genetic algorithm to propose a robust, fast image matching method, the GGA (Grey Genetic Algorithm) method. The method first determines the parameter space of the problem, and obtains multiple initial candidate positions by encoding the parameter space and initialising the population. It then uses the histogram information of the template image and of the current search sub-image to construct a reference sequence and a comparison sequence respectively, and takes the grey relational degree between the two sequences as the fitness function. On this basis, the initial population evolves generation by generation, through selection, crossover and mutation, towards the promising regions of the search space and approaches the best matching position. Experimental results show that the GGA method makes full use of the small-sample property of grey relational theory and the computational parallelism of the genetic algorithm; while maintaining a given matching accuracy, its real-time performance and robustness are clearly improved.
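A small sketch of the grey relational degree used above as the fitness function, with the template histogram as the reference sequence and the candidate sub-image histogram as the comparison sequence; the distinguishing coefficient rho = 0.5 is a conventional choice assumed here, and the histograms are assumed to be directly comparable.

    def grey_relational_degree(reference, comparison, rho=0.5):
        # Grey relational coefficient per element, averaged into a single degree;
        # 1.0 means the two sequences are identical.
        deltas = [abs(r - c) for r, c in zip(reference, comparison)]
        d_min, d_max = min(deltas), max(deltas)
        if d_max == 0:
            return 1.0
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
        return sum(coeffs) / len(coeffs)

    template_hist = [10, 40, 30, 20]
    subimage_hist = [12, 38, 29, 21]
    print(grey_relational_degree(template_hist, subimage_hist))   # ~0.83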

19.
Inferring consensus structure from nucleic acid sequences
This paper presents an unsupervised inference method for determining higher-order structure from sequence data. The method is general, but in this paper it is applied to nucleic acid sequences in determining the secondary (2-D) and tertiary (3-D) structure of the macromolecule. The method evaluates the position-position interdependence of the sequence using an information measure known as expected mutual information. The expected mutual information is calculated for each pair of positions and the chi-square test is used to screen statistically significant position pairs. In the calculation of expected mutual information, an unbiased probability estimator is used to overcome the problem associated with zero observations in conserved sites. A selection criterion based on known structural constraints is then applied to the strongest interdependent position pairs, yielding the position pairs most indicative of secondary and tertiary interactions. The method has been tested using tRNA and 5S rRNA sequences with very good results.
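A simplified sketch of the position-pair screening described above: mutual information between two alignment columns, with the G-statistic 2*N*MI compared against a chi-square threshold. The paper's unbiased probability estimator is not reproduced, and the threshold in the example is illustrative.

    import math
    from collections import Counter

    def mutual_information(col_i, col_j):
        # Plug-in estimate of MI (in nats) between two alignment columns.
        n = len(col_i)
        pi, pj = Counter(col_i), Counter(col_j)
        pij = Counter(zip(col_i, col_j))
        mi = 0.0
        for (a, b), nab in pij.items():
            mi += (nab / n) * math.log(nab * n / (pi[a] * pj[b]))
        return mi

    def significant(col_i, col_j, threshold):
        # The G-statistic 2*N*MI is asymptotically chi-square distributed.
        return 2.0 * len(col_i) * mutual_information(col_i, col_j) > threshold

    col5 = list("AAGGCC")
    col20 = list("UUCCGG")                    # perfectly covarying with col5
    print(mutual_information(col5, col20))    # ln(3), about 1.10
    print(significant(col5, col20, threshold=9.49))   # illustrative cutoff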

20.
We present an approach to adaptively select time steps from time-dependent volume data sets for an integrated and comprehensive visualization. This reduced set of time steps not only saves cost, but also makes it possible to show both the spatial structure and the temporal development in one combined rendering. Our selection optimizes the coverage of the complete data on the basis of a minimum-cost flow-based technique to determine meaningful distances between time steps. As optimal solutions of both the involved transport and selection problems are prohibitively expensive, we present new approaches that are significantly faster with only minor deviations. We further propose an adaptive scheme for the progressive incorporation of new time steps. An interactive volume raycaster produces an integrated rendering of the selected time steps, and their computed differences are visualized in a dedicated chart to provide additional temporal similarity information. We illustrate and discuss the utility of our approach by means of different data sets from measurements and simulation.
