Similar Documents
20 similar documents found.
1.
2.
The monitoring of the expression profiles of thousands of genes has proved particularly promising for biological classification. DNA microarray data have recently been used to develop classification rules, particularly for cancer diagnosis. However, microarray data present major challenges due to their complex, multiclass nature and the overwhelming number of variables characterizing gene expression profiles. A regularized form of sliced inverse regression (REGSIR) is proposed; it allows the simultaneous development of classification rules and the selection of the genes that contribute most to classification accuracy. The method is illustrated on several publicly available microarray data sets, and an extensive comparison with other classification methods is reported. REGSIR performance is comparable with the best available classification methods, and with appropriate feature selection it can be improved considerably.
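The flavour of a SIR-style method can be sketched compactly. Below is a minimal, generic sketch of sliced inverse regression for classification with a ridge-style regularizer on the covariance (one slice per class); the exact REGSIR estimator in the paper may differ, and all parameter values here are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def regularized_sir(X, y, n_directions=2, reg=1e-3):
    """Sliced inverse regression for classification: one slice per class.

    The ridge term `reg` stabilizes the covariance when p >> n, as with
    microarray data. A generic sketch, not the exact REGSIR estimator.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n + reg * np.eye(p)   # regularized covariance
    # Between-slice covariance of the class-conditional means.
    M = np.zeros((p, p))
    for cls in np.unique(y):
        idx = y == cls
        mu = Xc[idx].mean(axis=0)
        M += idx.mean() * np.outer(mu, mu)
    # Leading generalized eigenvectors of M w.r.t. cov span the
    # effective dimension-reduction directions.
    evals, evecs = eigh(M, cov)
    return evecs[:, np.argsort(evals)[::-1][:n_directions]]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))             # 200 samples, 500 "genes"
y = rng.integers(0, 3, size=200)
Z = X @ regularized_sir(X, y)               # projected data, shape (200, 2)
print(Z.shape)
```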

3.
The volume of publicly available data in biomedicine is constantly increasing. However, these data are stored in different formats and on different platforms. Integrating these data will enable us to facilitate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology (NCBO), we have developed the Resource Index – a growing, large-scale ontology-based index of more than twenty heterogeneous biomedical resources. The resources come from a variety of repositories maintained by organizations from around the world. We use a set of over 200 publicly available ontologies contributed by researchers in various domains to annotate the elements in these resources. We use the semantics that the ontologies encode, such as different properties of classes, the class hierarchies, and the mappings between ontologies, in order to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics “under the hood.”
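As a concrete illustration of how class hierarchies can improve search, the sketch below expands a query term to its subclasses before matching annotations. The toy `is_a` fragment and resource annotations are invented for illustration, not NCBO data.

```python
# Minimal sketch of hierarchy-based query expansion, the kind of
# semantics the Resource Index exploits. All data here is hypothetical.
from collections import deque

is_a = {  # child -> parent relations of a toy ontology fragment
    "melanoma": "skin cancer",
    "skin cancer": "cancer",
    "leukemia": "cancer",
}

annotations = {  # resource element -> ontology classes annotated on it
    "GEO:GSE100": {"melanoma"},
    "GEO:GSE200": {"leukemia"},
    "PubMed:42": {"cancer"},
}

def descendants(term):
    """All classes whose is_a chain reaches `term`, plus `term` itself."""
    found = {term}
    frontier = deque([term])
    while frontier:
        parent = frontier.popleft()
        for child, par in is_a.items():
            if par == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def search(term):
    """Return elements annotated with `term` or any of its subclasses."""
    expansion = descendants(term)
    return [el for el, classes in annotations.items() if classes & expansion]

print(search("cancer"))  # all three elements match after expansion
```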

4.
Recently, a large number of social networks have been made publicly available. In parallel, several definitions and methods have been proposed to protect users’ privacy when these data are publicly released. Some of them were adapted from relational-data anonymization techniques, which are more mature than network anonymization techniques. In this paper we survey privacy-preserving techniques, focusing on graph-modification methods, which alter the graph’s structure and release the entire anonymized network. These methods allow researchers and third parties to apply any graph-mining process to the anonymized data, from local to global knowledge extraction.
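To make the graph-modification idea concrete, here is a minimal sketch of one family such surveys cover, random edge perturbation: delete a fraction of edges and add the same number of random non-edges before releasing the whole graph. The fraction and data set are illustrative.

```python
# Minimal sketch of random edge perturbation, one graph-modification
# family: the whole (perturbed) graph is released for mining.
import random
import networkx as nx

def perturb(graph, fraction=0.1, seed=0):
    rng = random.Random(seed)
    g = graph.copy()
    n_flip = int(fraction * g.number_of_edges())
    # Delete a random sample of existing edges...
    for u, v in rng.sample(list(g.edges()), n_flip):
        g.remove_edge(u, v)
    # ...and add the same number of random non-edges, preserving size.
    nodes = list(g.nodes())
    while g.number_of_edges() < graph.number_of_edges():
        u, v = rng.sample(nodes, 2)
        if not g.has_edge(u, v):
            g.add_edge(u, v)
    return g

g = nx.karate_club_graph()
anon = perturb(g, fraction=0.2)
print(g.number_of_edges(), anon.number_of_edges())  # same edge count
```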

5.
A comparison of decision tree ensemble creation techniques   (Total citations: 3; self-citations: 0; citations by others: 3)
We experimentally evaluate bagging and seven other randomization-based approaches to creating an ensemble of decision tree classifiers. Statistical tests were performed on experimental results from 57 publicly available data sets. When cross-validation comparisons were tested for statistical significance, the best method was statistically more accurate than bagging on only eight of the 57 data sets. Alternatively, examining the average ranks of the algorithms across the group of data sets, we find that boosting, random forests, and randomized trees are statistically significantly better than bagging. Because our results suggest that using an appropriate ensemble size is important, we introduce an algorithm that decides when a sufficient number of classifiers has been created for an ensemble. Our algorithm uses the out-of-bag error estimate, and is shown to result in an accurate ensemble for those methods that incorporate bagging into the construction of the ensemble.
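The stopping idea can be sketched with scikit-learn's warm-start random forest, which exposes the out-of-bag error such a rule relies on. This is a generic reconstruction under assumed step sizes and patience, not the authors' exact procedure.

```python
# Minimal sketch: grow the forest until the out-of-bag error has not
# improved for `patience` rounds. Thresholds here are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25, warm_start=True, oob_score=True, random_state=0,
)
best_err, best_size, patience, stale = 1.0, 25, 5, 0
for size in range(25, 301, 25):
    forest.set_params(n_estimators=size)
    forest.fit(X, y)                       # warm_start adds new trees
    err = 1.0 - forest.oob_score_          # out-of-bag error estimate
    if err < best_err:
        best_err, best_size, stale = err, size, 0
    else:
        stale += 1
    if stale >= patience:                  # no improvement: stop growing
        break

print(f"chose {best_size} trees, OOB error {best_err:.3f}")
```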

6.
Microarray technology has made it possible to monitor the expression levels of many genes simultaneously across a number of experimental conditions. Fuzzy clustering is an important tool for analyzing microarray gene expression data. In this article, a real-coded simulated annealing based fuzzy clustering method with variable-length configuration (VSA) is developed and combined with a popular artificial neural network (ANN) based classifier. The idea is to refine the clustering produced by VSA using the ANN classifier to obtain improved clustering performance. The proposed technique is used to cluster three publicly available real-life microarray data sets. Its superior performance is demonstrated by comparison with some widely used existing clustering algorithms, and a statistical significance test is conducted to establish that the improvement is significant. Finally, the biological relevance of the clustering solutions is established.
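A minimal sketch of the annealing core follows, assuming a fixed number of clusters and the standard fuzzy c-means objective; the paper's VSA additionally varies the configuration length and refines the result with the ANN classifier, and all numeric settings below are illustrative.

```python
# Minimal sketch: simulated annealing over cluster centres with the
# fuzzy c-means objective J_m as the energy.
import numpy as np

def fcm_objective(X, centers, m=2.0):
    """Fuzzy c-means objective J_m and the membership matrix U."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    u = 1.0 / (d ** (2 / (m - 1)))
    u /= u.sum(axis=1, keepdims=True)
    return float(((u ** m) * d ** 2).sum()), u

def anneal(X, k=3, steps=2000, t0=1.0, cooling=0.995, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    energy, _ = fcm_objective(X, centers)
    t = t0
    for _ in range(steps):
        cand = centers + rng.normal(scale=0.1, size=centers.shape)
        e_cand, _ = fcm_objective(X, cand)
        # Metropolis rule: accept improvements, sometimes accept worse.
        if e_cand < energy or rng.random() < np.exp((energy - e_cand) / t):
            centers, energy = cand, e_cand
        t *= cooling
    return centers, fcm_objective(X, centers)[1]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])
centers, U = anneal(X, k=3)
print(centers.round(2))   # roughly the three true cluster centres
```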

7.
Conventional approaches to modeling human mobility patterns often focus on human activity and movement dynamics in regular daily life and cannot capture changes in human movement dynamics in response to large-scale events. With the rapid advancement of information and communication technologies, many researchers have adopted alternative data sources (e.g., cell phone records, GPS trajectory data) from private data vendors to study human movement dynamics in response to large-scale natural or societal events. Big geosocial data such as georeferenced tweets are publicly available and evolve dynamically as real-world events happen, making them more likely to capture the real-time sentiments and responses of populations. However, precisely geolocated geosocial data are scarce and biased toward urban population centers. In this research, we developed a big geosocial data analytical framework for extracting human movement dynamics in response to large-scale events from publicly available georeferenced tweets. The framework includes a two-stage data collection module that collects data in a targeted fashion to mitigate the data scarcity of georeferenced tweets; in addition, a variable bandwidth kernel density estimation (VB-KDE) approach fuses georeference information at different spatial scales, further augmenting the signals of human movement dynamics contained in georeferenced tweets. To correct for the sampling bias of georeferenced tweets, we adjusted the number of tweets for different spatial units (e.g., county, state) by population. To demonstrate the performance of the proposed framework, we chose the 2017 Great American Eclipse, an astronomical event observed nationwide across the United States, and studied the human movement dynamics in response to it; the framework can, however, easily be applied to other types of large-scale events such as hurricanes or earthquakes.
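To illustrate the VB-KDE step, the sketch below gives each tweet a kernel bandwidth according to a (hypothetical) geolocation precision level and sums the kernels on a grid; the bandwidth values and precision labels are assumptions, not the paper's calibration.

```python
# Minimal sketch of variable-bandwidth KDE: each georeferenced tweet
# contributes a Gaussian kernel whose width reflects how precisely it
# is geolocated (GPS point vs. city vs. county centroid).
import numpy as np

BANDWIDTH_KM = {"gps": 1.0, "city": 10.0, "county": 50.0}  # illustrative

def vb_kde(points, precisions, grid):
    """Sum per-point Gaussian kernels with precision-dependent bandwidths."""
    density = np.zeros(len(grid))
    for (x, y), prec in zip(points, precisions):
        h = BANDWIDTH_KM[prec]
        d2 = ((grid - (x, y)) ** 2).sum(axis=1)
        density += np.exp(-d2 / (2 * h ** 2)) / (2 * np.pi * h ** 2)
    return density / len(points)

# Toy data: coordinates in km on a local plane, mixed precision levels.
rng = np.random.default_rng(0)
tweets = rng.normal(50, 15, size=(300, 2))
precisions = rng.choice(["gps", "city", "county"], size=300,
                        p=[0.2, 0.5, 0.3])
gx, gy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
density = vb_kde(tweets, precisions, grid)
print(density.reshape(50, 50).max())
```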

8.
Knowledge sharing can be hindered by barriers that prevent the free flow of information, especially across organizational and other boundaries. Information produced at one location may therefore be unavailable to entities elsewhere, even when sharing it would be beneficial. This often leads to 'reinventing the wheel', wasted investment in duplicated resources, and ultimately the development of knowledge silos. Information technologies can address this problem, as they lower the barriers to knowledge sharing and increase collaboration. Knowledge sharing and collaborative technologies are particularly important for Small Island Developing States (SIDS) within regions exposed to similar environmental and economic issues that can hinder their development. Although each SIDS may have Knowledge Resources that it uses to address its own issues, collaborating and sharing these resources would help tackle regional issues collectively. Yet even where there is a willingness to share and entities have been established to foster collaboration, there is a void in the tools and technologies needed to support collaboration and the sharing of resources. This paper describes research that helps fill this void by designing and developing a technological solution, a Knowledge Broker, for identifying and sharing Knowledge Resources that may be spread across various locations (e.g. countries). The Design Science Research methodology was used to develop the Knowledge Broker architecture, which provides a single point of access to the knowledge resources within a particular domain. A critical component of the Knowledge Broker is a common, online, interactive vocabulary of the domain of interest, which provides the terms used to describe and search for the available knowledge resources. The Knowledge Broker was evaluated using informed arguments and an illustrative scenario in the Comprehensive Disaster Management domain in the Caribbean region. The initial evaluations reported in this paper indicate that the Knowledge Broker has the potential to increase the efficiency of solving regional issues through the sharing of knowledge resources.

9.
The problem of optimally placing data on disks (ODP) to maximize disk-access performance has long been recognized as important. Solutions have been reported for some widely available disk technologies, such as magnetic CAV and optical CLV disks. However, for important newer technologies, such as multizoned magnetic disks, no formal solution to the ODP problem had been reported. In this paper, we first identify the fundamental characteristics of disk-device technologies that influence the solution to the ODP problem. We then develop a comprehensive solution that covers all currently available disk technologies and show how it reduces to the known solutions for existing technologies, thus contributing a solution to the ODP problem for multizoned disks. Our analytical solution has been validated through simulations and through its reduction to the known solutions for particular disks. Finally, we study how the solution for multizoned disks is affected by disk and data characteristics.
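To convey the flavour of the problem, here is a minimal greedy sketch that places the most frequently accessed data in the fastest (outermost) zones of a multizoned disk. The paper derives a formal analytical solution; the zone and file parameters below are invented.

```python
# Minimal sketch of frequency-based placement on a multizoned disk:
# hottest data goes to the zones with the highest transfer rates.
zones = [  # (zone id, capacity in blocks, transfer rate MB/s), fastest first
    ("outer", 1000, 60.0),
    ("middle", 1000, 45.0),
    ("inner", 1000, 30.0),
]

# Hypothetical files: (name, size in blocks, accesses per hour).
files = [("logs", 500, 120), ("db", 800, 900), ("archive", 900, 2)]

def place(files, zones):
    placement = {}
    free = {zid: cap for zid, cap, _ in zones}
    # Greedy: most frequently accessed file first, into the fastest
    # zone that still has room for it (files that fit nowhere are skipped).
    for name, size, freq in sorted(files, key=lambda f: -f[2]):
        for zid, _, _ in zones:
            if free[zid] >= size:
                placement[name] = zid
                free[zid] -= size
                break
    return placement

print(place(files, zones))
# {'db': 'outer', 'logs': 'middle', 'archive': 'inner'}
```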

10.
In-Memory Databases (IMDBs), such as SAP HANA, enable new levels of database performance by removing the disk bottleneck and by compressing data in memory. This improved performance means that reports and analytic queries can now be processed on demand, so the goal is to provide near real-time responses to compute- and data-intensive analytic queries. To facilitate this, much work has investigated the use of acceleration technologies within the database context. While current research into the application of these technologies has yielded positive results, it has tended to focus on single database tasks or on isolated single-user requests. This paper uses SHEPARD, a framework for managing accelerated tasks across shared heterogeneous resources, to introduce acceleration into an IMDB. Results show how, using SHEPARD, multiple simultaneous user queries all receive speed-up from a shared pool of accelerators. Results also show that offloading analytic tasks onto accelerators can indirectly benefit other database workloads by reducing contention for CPU resources.
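The shared-pool idea can be sketched with a simple device queue: concurrent query tasks borrow an accelerator when one is free and fall back to the CPU otherwise. SHEPARD's actual task management is considerably richer; the device names here are hypothetical.

```python
# Minimal sketch of a shared accelerator pool serving concurrent queries.
import queue
import threading
import time

pool = queue.Queue()
for device in ["fpga0", "gpu0", "gpu1"]:    # hypothetical device names
    pool.put(device)

def run_query(qid):
    try:
        device = pool.get_nowait()          # try to borrow an accelerator
    except queue.Empty:
        device = "cpu"                      # pool exhausted: run on CPU
    try:
        time.sleep(0.1)                     # stand-in for the analytic task
        print(f"query {qid} ran on {device}")
    finally:
        if device != "cpu":
            pool.put(device)                # return the device to the pool

threads = [threading.Thread(target=run_query, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```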

11.
Dynamic storage allocation is an important part of a large class of computer programs written in C and C++. High-performance algorithms for dynamic storage allocation have been, and will continue to be, of considerable interest. This paper presents detailed measurements of the cost of dynamic storage allocation in 11 diverse C and C++ programs using five very different dynamic storage allocation implementations, including a conservative garbage collection algorithm. Four of the allocator implementations measured are publicly available on the Internet. A number of the programs used in these measurements are also available on the Internet to facilitate further research in dynamic storage allocation. Finally, the data presented in this paper is an abbreviated version of more extensive statistics that are also publicly available on the Internet.

12.
13.
Many tasks related to sentiment analysis rely on sentiment lexicons: lexical resources containing information about the emotional implications of words (e.g., whether a word's sentiment orientation is positive or negative). In this work, we present an automatic method for building lemma-level sentiment lexicons, which has been applied to obtain lexicons for English, Spanish, and the three other official languages of Spain. Our lexicons are multi-layered, allowing applications to trade off between the number of available words and the accuracy of the estimations. Our evaluations show high accuracy values in all cases. As a previous step to the lemma-level lexicons, we built a synset-level lexicon for English similar to SentiWordNet 3.0, one of the most widely used sentiment lexicons today. We made several improvements to the original SentiWordNet 3.0 building method that, according to our evaluations, yield significantly better estimations of positivity and negativity. The resource containing all the lexicons, ML-SentiCon, is publicly available.
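A minimal sketch of how a layered, lemma-level lexicon is consumed: pick a layer that trades coverage against estimation accuracy, then average the polarities of the lemmas found in a text. The tiny two-layer lexicon below is invented, not ML-SentiCon data.

```python
# Minimal sketch of layered lexicon lookup. Polarity values are invented.
layers = {
    # layer 1: few words, high-confidence polarity estimates
    1: {"good": 0.8, "bad": -0.7},
    # layer 2: more coverage, noisier estimates
    2: {"good": 0.8, "bad": -0.7, "okay": 0.1, "awful": -0.6},
}

def score(text, layer=2):
    """Mean polarity of the lemmas found in the chosen lexicon layer."""
    lexicon = layers[layer]
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(score("the food was good but the service was awful"))
# ~0.1: mean of 0.8 and -0.6 on layer 2
```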

14.
3D texture classification under varying viewpoint and illumination has been a vivid research topic, and many methods have been developed. It is crucial that these methods be compared using an unbiased evaluation methodology. The most frequently employed methodologies use images from the Columbia–Utrecht Reflectance and Texture Database. These methodologies construct training and test sets that are disjoint in the imaging parameters, but do not separate them spatially, because they use images of the same surface patch for both. We perform a series of experiments showing that this practice leads to overestimation of classifier performance and distorts experimental findings. To correct it, we accurately register the images across all imaging conditions and split the surface patches into parts; training and testing are then done on spatially disjoint parts. We show that this methodology gives a more realistic assessment of classifier performance. The sample annotations for all images are publicly available.
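The corrected methodology reduces to a simple rule, sketched below: split every registered surface patch into spatially disjoint halves and never test on pixels used for training, even though each imaging condition appears on both sides.

```python
# Minimal sketch of a spatially disjoint train/test split of one
# registered surface patch.
import numpy as np

def spatially_disjoint_split(image):
    """Return (train_part, test_part): left and right halves of a patch."""
    h, w = image.shape[:2]
    return image[:, : w // 2], image[:, w // 2 :]

# Toy "registered patch under one imaging condition".
patch = np.arange(64 * 64).reshape(64, 64)
train_part, test_part = spatially_disjoint_split(patch)
assert train_part.shape == (64, 32) and test_part.shape == (64, 32)
```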

15.
王立杰  李萌  蔡斯博  李戈  谢冰  杨芙清 《软件学报》2012,23(6):1335-1349
With the continuing maturation and growth of Web service technology, a large number of public Web services have appeared on the Internet. When software systems are built from Web services, textual descriptions (such as overviews and usage instructions) help service consumers identify, understand, and use the services intuitively and effectively. Most existing work focuses on extracting such information from a Web service's WSDL file for service discovery or retrieval, yet our survey shows that the WSDL files of most Web services on the Internet contain little or none of this information. We therefore propose a Web-search-based method that enriches Web services with textual descriptions drawn from information sources beyond their WSDL files. We collect web pages containing identifying features of the target Web service, extract candidate text snippets from these pages, use information retrieval techniques to compute the relevance of each snippet to the target service, and select the highest-scoring snippets as the service's additional textual description. Experiments on real data from the Internet show that relevant web pages can be found for about 51% of Web services on the Internet, and textual descriptions can be added for about 88% of those. The collected Web services and their textual descriptions have been publicly released.
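The relevance computation can be sketched with standard TF-IDF cosine similarity, one plausible instance of the information retrieval techniques the paper mentions; the service profile and snippets below are invented.

```python
# Minimal sketch: score candidate snippets against what is known about
# the target Web service (here, its name and WSDL terms), keep the best.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

service_profile = "WeatherForecast service getTemperature getHumidity city"
snippets = [
    "This service returns temperature and humidity forecasts for a city.",
    "Sign up now for our newsletter and special offers.",
    "WeatherForecast exposes getTemperature over SOAP.",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform([service_profile] + snippets)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for snip, s in sorted(zip(snippets, scores), key=lambda p: -p[1]):
    print(f"{s:.2f}  {snip}")   # the newsletter snippet scores near zero
```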

16.
Sentiment lexicons and word embeddings constitute well-established sources of information for sentiment analysis in online social media. Although their effectiveness has been demonstrated in state-of-the-art sentiment analysis and related tasks in the English language, such publicly available resources are much less developed and evaluated for the Greek language. In this paper, we tackle the problems arising when analyzing text in such an under-resourced language. We present and make publicly available a rich set of such resources, ranging from a manually annotated lexicon, to semi-supervised word embedding vectors and annotated datasets for different tasks. Our experiments using different algorithms and parameters on our resources show promising results over standard baselines; on average, we achieve a 24.9% relative improvement in F-score on the cross-domain sentiment analysis task when training the same algorithms with our resources, compared to training them on more traditional feature sources, such as n-grams. Importantly, while our resources were built with the primary focus on the cross-domain sentiment analysis task, they also show promising results in related tasks, such as emotion analysis and sarcasm detection.

17.
Android apps share resources, such as sensors, cameras, and the Global Positioning System, that are subject to specific usage policies whose correct implementation is left to programmers. Failing to satisfy these policies may cause resource leaks; that is, apps may acquire but never release resources. This can have various consequences, such as apps that are unable to use resources, or resources that remain unnecessarily active and waste battery. Researchers have proposed several techniques to detect and fix resource leaks. However, the unavailability of public benchmarks of faulty apps makes comparison between techniques difficult, if not impossible, and forces researchers to build their own data sets to verify the effectiveness of their techniques, making their work burdensome. The aim of our work is to define a public benchmark of Android apps affected by resource leaks. The resulting benchmark, called AppLeak, is publicly available on GitLab and includes faulty apps, versions with bug fixes (when available), test cases that automatically reproduce the leaks, and additional information that may help researchers in their tasks. Overall, the benchmark includes a body of 40 faults that can be exploited to evaluate and compare both static and dynamic analysis techniques for resource leak detection.

18.
This paper presents a data envelopment analysis (DEA)-based decision-support methodology that has been implemented and is being used by a not-for-profit organization, Fe y Alegría, which runs 439 Bolivian schools reaching over 160,000 disadvantaged students in that poverty-stricken Latin American nation. Bolivia is a poor country with the highest percentage of indigenous population and the lowest per capita income in South America, and as such its inhabitants are in dire need of effective educational resources to help them out of poverty. The DEA-based methodology described in this paper has offered an objective way to compare network schools among themselves and with out-of-network schools, providing a deeper understanding of school efficiency levels in the face of scarce resources and allowing for sharing of best practices across the network. The paper introduces the educational environment in Bolivia, presents the DEA model, describes the decision-support methodology, and provides two examples of its use. The first example compares Fe y Alegría secondary schools with out-of-network secondary schools using publicly available data, and the second compares Fe y Alegría secondary schools among themselves using a proprietary database. The paper also comments on lessons learned and on the need for broad consensus-building and organization-wide buy-in for successful adoption and maximum impact.
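For readers unfamiliar with DEA, the classic input-oriented CCR formulation conveys the idea; the paper's exact model may differ. For a school $o$ among $n$ schools, with inputs $x_{ij}$ (resources of school $j$) and outputs $y_{rj}$ (attainment measures), efficiency is

$$
\theta_o^{*} \;=\; \min_{\theta,\,\lambda}\; \theta
\quad \text{s.t.} \quad
\sum_{j=1}^{n} \lambda_j x_{ij} \;\le\; \theta\, x_{io} \;\;\forall i, \qquad
\sum_{j=1}^{n} \lambda_j y_{rj} \;\ge\; y_{ro} \;\;\forall r, \qquad
\lambda_j \ge 0 .
$$

A school with $\theta_o^{*} = 1$ is efficient relative to its peers; $\theta_o^{*} < 1$ means a convex combination of peer schools could produce at least its outputs using proportionally fewer inputs.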

19.
The most complete proteome of human lenses to date has been compiled using 2-D LC-MS/MS analysis of foetal, aged normal, and advanced nuclear cataract lenses. A total of 231 proteins were identified across all lens groups, including 112 proteins that had not been reported previously. Proteins were grouped according to their PANTHER molecular function classification to facilitate comparisons. Previously unreported N-terminal acetylation was detected in a number of proteins, in most cases associated with the prior removal of a methionine residue; this pattern of proteolysis may indicate that methionine aminopeptidase activity is present in human lenses. Acetylation is likely to aid the stability of proteins that are present in the lens for many decades. Protein sequences were also used to interrogate the three publicly available human lens cDNA libraries. Surprisingly, 84 of the proteins we identified were not present in the cDNA libraries.

20.
The environmental impact of aviation is enormous given that in the US alone there are nearly 6 million commercial flights per year. This situation has driven numerous policy and procedural measures to help develop environmentally friendly technologies that are safe and affordable and reduce the environmental impact of aviation. However, many of these technologies require significant initial investment in newer aircraft fleets and modifications to existing regulations, both long and costly enterprises. We propose an anomaly detection method based on Virtual Sensors to help detect overconsumption of fuel in aircraft. It relies only on data recorded during flight by most existing commercial aircraft, significantly reducing the cost and complexity of implementing the method. The Virtual Sensors developed here are ensemble-learning regression models that detect fuel overconsumption from instantaneous measurements of the aircraft state; this approach requires no additional information about standard operating procedures or other encoded domain knowledge. We present experimental results on three data sets and compare five different Virtual Sensors algorithms. The first two data sets are publicly available and consist of data from a flight simulator and from a real-world turbine disk; both contain seeded faults, that is, faults deliberately injected into the system, and we show the ability to detect these anomalies with high accuracy. The third data set comes from a real-world fleet of 84 jet aircraft, where we show the ability to detect fuel overconsumption, which can have a significant environmental and economic impact. To the best of our knowledge, this is the first study of its kind in the aviation domain.
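The Virtual Sensor idea reduces to residual monitoring, sketched below: a regression ensemble predicts fuel flow from other instantaneous state parameters, and measurements that exceed the prediction by more than a threshold are flagged. The data, features, and alarm level are synthetic stand-ins, not the paper's models.

```python
# Minimal sketch of a Virtual Sensor for fuel-overconsumption detection.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic instantaneous state: altitude, airspeed, and a weight proxy.
X = rng.normal(size=(2000, 3))
fuel = (2.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2]
        + rng.normal(scale=0.1, size=2000))

train, test = slice(0, 1500), slice(1500, 2000)
model = GradientBoostingRegressor(random_state=0).fit(X[train], fuel[train])

# Alarm threshold calibrated from residuals on normal data.
threshold = 3 * (fuel[train] - model.predict(X[train])).std()

# Seed overconsumption faults into the test flights, then flag samples
# whose measured fuel flow exceeds the virtual sensor's prediction.
measured = fuel[test].copy()
measured[::50] += 1.0
alarms = measured - model.predict(X[test]) > threshold
print(f"flagged {alarms.sum()} of {alarms.size} samples")
```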
