Similar literature
 Found 20 similar documents; search took 10 ms
1.
In this work, we review and empirically evaluate five different raw methods of text representation that allow automatic processing of Wikipedia articles. The main contribution of the article, an evaluation of approaches to text representation for machine learning tasks, indicates that the text representation is fundamental to achieving good categorization results. The analysis shows that the representation sets a baseline whose shortcomings cannot be compensated for even by sophisticated machine learning algorithms. It confirms the thesis that proper data representation is a prerequisite for high-quality data analysis results. The text representations were evaluated within the Wikipedia repository by examining classification parameters observed during automatic reconstruction of human-made categories. For that purpose, we use a classifier based on support vector machines, extended with multilabel and multiclass functionality. During classifier construction we observed parameters such as learning time, representation size, and classification quality, which allow us to draw conclusions about the text representations. For the experiments presented in the article, we use data sets created from Wikipedia dumps. We describe our software, called Matrix’u, which allows a user to build computational representations of Wikipedia articles. The software is the second contribution of our research, because it is a universal tool for converting Wikipedia from a human-readable form to a form that can be processed by a machine. Results generated using Matrix’u can be used in a wide range of applications that involve Wikipedia data.

2.
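Coordinated classification of Web pages using the Naive Bayes method   Total citations: 41, self-citations: 0, citations by others: 41
Fan Yan, Zheng Cheng, Wang Qingyi, Cai Qingsheng, Liu Jie. Journal of Software (《软件学报》), 2001, 12(9): 1386-1392
The WWW holds an enormous wealth of information, and effectively discovering useful information in this vast volume is a problem that urgently needs solving; correct classification of Web pages is at its core. Targeting the structural features of hypertext, the paper proposes a method that uses Naive Bayes to coordinate two classifications, one based on the textual information of a hypertext page and one based on its structural information. Experiments verify that, compared with classifying hypertext using either method alone, the combined classification method effectively improves classification accuracy.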
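
The abstract does not give the combination rule, so the following is only a sketch of one plausible reading: two classifiers, one over text features and one over structural features, each produce a per-class posterior, and the two are fused under the Naive Bayes assumption that the views are conditionally independent given the class. The function name, class labels, and probability values are hypothetical, and all probabilities are assumed to be nonzero.

```python
import math

def combine_posteriors(prior, p_text, p_struct):
    """Fuse two per-class posteriors, assuming the text view and the
    structure view are conditionally independent given the class."""
    scores = {}
    for c in prior:
        # P(c | text, struct) is proportional to P(c | text) * P(c | struct) / P(c)
        scores[c] = math.log(p_text[c]) + math.log(p_struct[c]) - math.log(prior[c])
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

# Hypothetical two-class example
prior = {"sports": 0.5, "travel": 0.5}
p_text = {"sports": 0.6, "travel": 0.4}     # posterior from the text-only classifier
p_struct = {"sports": 0.7, "travel": 0.3}   # posterior from the structure-only classifier
fused = combine_posteriors(prior, p_text, p_struct)
best = max(fused, key=fused.get)
```

When both views lean toward the same class, the fused posterior is more confident than either view alone, which is the kind of gain the abstract reports for the combined classifier.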

3.
WEBSOM is a recently developed neural method for exploring full-text document collections, for information retrieval, and for information filtering. In WEBSOM the full-text documents are encoded as vectors in a document space, much as in earlier information retrieval methods, but the document space is formed in an unsupervised manner using the Self-Organizing Map algorithm. In this article the document representations that WEBSOM creates are shown to be computationally efficient approximations of the results of a certain probabilistic model. The probabilistic model incorporates information about the similarity of use of different words to take into account their semantic relations.

4.
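Content-based feature extraction for Web pages   Total citations: 5, self-citations: 1, citations by others: 5
This article studies content-based feature extraction techniques for Chinese Web pages, describing how the word segmentation dictionary is built and how features are extracted from the page body text, markup information, and hyperlink information. Experimental results on travel-related Web pages show that the method works well in application.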

5.
The stability of linear systems defined by ordinary differential equations with constant or periodic coefficients can be assessed from the spectral radius of their transition matrix. In classical applications of this theory, the transition matrix is explicitly computed first, then its eigenvalues are evaluated; if the largest eigenvalue is larger than unity, the system is unstable. The proposed implicit transition matrix approach extracts the dominant eigenvalues of the transition matrix using the Arnoldi algorithm, without the explicit computation of this matrix. As a result, the proposed implicit method yields stability information at a far lower computational cost than that of the classical approach, and is ideally suited for stability computations of systems involving a large number of degrees of freedom. Examples of application of the proposed methodology to flexible multi-body systems are presented that demonstrate its accuracy and computational efficiency.
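
The matrix-free idea can be illustrated with a minimal sketch. It is not the Arnoldi algorithm from the abstract: plain power iteration is swapped in as a simpler stand-in that recovers only a single real dominant eigenvalue. The transition matrix is never formed; only its action on a vector is computed, by integrating the system over one period. The 2-state system, the period, and all function names are hypothetical.

```python
import math

def rhs(x):
    # Hypothetical linear system x' = A x; A is symmetric so its eigenvalues are real
    return [-1.0 * x[0] + 0.5 * x[1], 0.5 * x[0] - 2.0 * x[1]]

def propagate(x, period=1.0, steps=100):
    """Apply the transition matrix to x by RK4 integration over one period."""
    h = period / steps
    for _ in range(steps):
        k1 = rhs(x)
        k2 = rhs([xi + 0.5 * h * ki for xi, ki in zip(x, k1)])
        k3 = rhs([xi + 0.5 * h * ki for xi, ki in zip(x, k2)])
        k4 = rhs([xi + h * ki for xi, ki in zip(x, k3)])
        x = [xi + h / 6.0 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
    return x

def dominant_multiplier(apply_phi, dim, iters=60):
    """Estimate the magnitude of the dominant eigenvalue by power iteration,
    using only matrix-vector products supplied by apply_phi."""
    x = [1.0 / math.sqrt(dim)] * dim  # unit-norm starting vector
    estimate = 0.0
    for _ in range(iters):
        y = apply_phi(x)
        estimate = math.sqrt(sum(v * v for v in y))  # growth factor, since |x| == 1
        x = [v / estimate for v in y]
    return estimate

rho = dominant_multiplier(propagate, dim=2)
is_stable = rho < 1.0  # spectral radius below unity means the system is stable
```

Each iteration costs one integration over the period, whereas building the full transition matrix costs one integration per degree of freedom. That difference is the source of the savings the abstract describes for large systems.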

6.
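Stability analysis of power supply systems for massively parallel computers   Total citations: 1, self-citations: 0, citations by others: 1
Power supply systems for massively parallel computers usually adopt a distributed power architecture, in which system stability design is both the difficult point and the key point. This paper proposes a practical input/output impedance matching method, analyzes the converters' input and output impedances in detail, plots Bode diagrams of the impedance characteristics, and determines the output capacitance on the bus from simulation results. Testing and operation of the power supply system show that placing suitable capacitors on the bus can solve the system stability problem.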

7.
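A survey of focused crawler technology   Total citations: 50, self-citations: 1, citations by others: 50
Zhou Lizhu, Lin Ling. Journal of Computer Applications (《计算机应用》), 2005, 25(9): 1965-1969
The rapid growth of the Internet poses a huge challenge for finding and discovering information on the World Wide Web. For the topic- or domain-specific queries that most users submit, traditional general-purpose search engines often fail to return satisfactory result pages. To overcome this shortcoming, research on topic-oriented focused crawlers was proposed, and focused crawlers have since become one of the hot topics in Web research. This paper surveys that research: it gives the basic concept of the focused crawler, outlines its working principle, and, based on the current state of research, systematically introduces and analyzes in depth the key techniques of focused crawlers (crawl target description, Web page analysis algorithms, Web page search strategies, and so on). On this basis it proposes some future research directions for focused crawlers, including crawler technology for data analysis and mining, topic description and definition, discovery of related resources, Web data cleaning, and expansion of the search space.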

8.
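Open-source intelligence is a new data source for counter-terrorism research; its content is very rich, and the techniques for acquiring and analyzing it are maturing. Research results in counter-terrorism based on open-source intelligence have already shown great application prospects. Taking "East Turkestan" separatist activities as the research object, this paper uses a Web crawler to obtain relevant text data from the World Wide Web, applies text analysis to extract four elements involved in those activities (people, organizations, times, and places), and builds a multi-mode meta-network from the relationships among these concepts. First, meta-network decomposition is used to split the multi-mode meta-network into one-mode subnetworks and bipartite subnetworks, and centrality analysis of each subnetwork identifies the importance of each type of node; the centrality indicators of the subnetworks are then combined into an importance composite index (ICI) for the four node types: people, organizations, times, and places. Next, k-shell decomposition is applied directly to the multi-mode meta-network to identify its core nodes. Comparative analysis shows that the results of this study agree well with the actual situation.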
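
The k-shell decomposition step can be sketched as follows. This is a minimal illustration of the standard peeling procedure on an undirected graph, not the authors' implementation; the node labels (p for person, o for organization, t for time, l for place) and the edge list are hypothetical.

```python
from collections import defaultdict

def k_shell(edges):
    """Return {node: shell index} for an undirected graph given as edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    shell = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Collect nodes whose current degree is at most k
        peel = [n for n in remaining if degree[n] <= k]
        if not peel:
            k += 1  # nothing left to peel at this level; move to the next shell
            continue
        while peel:
            n = peel.pop()
            if n not in remaining:
                continue  # already removed via an earlier queue entry
            remaining.discard(n)
            shell[n] = k
            for m in adj[n]:
                if m in remaining:
                    degree[m] -= 1
                    if degree[m] <= k:
                        peel.append(m)  # removal pulled this neighbour into shell k
    return shell

# Hypothetical meta-network fragment
edges = [("p1", "p2"), ("p2", "o1"), ("p1", "o1"), ("p1", "t1"), ("t1", "l1")]
shells = k_shell(edges)
top = max(shells.values())
core = sorted(n for n, s in shells.items() if s == top)
```

Nodes in the highest shell form the densely connected core. A node with high degree can still land in a low shell if its neighbours are peripheral, which is why the procedure complements the degree-based centrality measures used in the first stage.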

9.
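Construction and analysis of the corpus for the Third Chinese Opinion Analysis Evaluation (COAE2011)   Total citations: 1, self-citations: 0, citations by others: 1
Text opinion analysis has become one of the hot research topics in natural language processing. To further promote research on Chinese opinion analysis, the Information Retrieval Technical Committee of the Chinese Information Processing Society of China held the Third Chinese Opinion Analysis Evaluation (COAE2011). This evaluation focuses on the influence of domain and context on Chinese opinion analysis. This paper describes the construction of the COAE2011 evaluation corpus and how it supports the evaluation: it first describes how the corpus was acquired, including domain selection and media distribution, then details the annotation principles and methods, and finally analyzes, based on the evaluation results, how domain and context factors affect opinion orientation. The COAE2011 corpus will provide strong resource support for Chinese opinion analysis.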

10.
The Knowledge Society is increasing the demand for tools to manage the didactic knowledge stored in Learning Objects Repositories and needed by teachers to generate courseware. There is still a lack of automated tools for the analysis and retrieval of learning resources from such repositories. Here we propose the use of the OLAP technique to help teachers specify a didactic ontology with which to perform quantitative and qualitative analysis of Internet-based Learning Objects Repositories. The related system is presented, together with a case study based on real repositories.

11.
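Development of an electronic text retrieval system for the Secret History of the Mongols (《元朝秘史》)   Total citations: 2, self-citations: 0, citations by others: 2
This paper briefly introduces the documentary background of the 13th-century 《元朝秘史》 and the complex textual form unique to the original. Through content analysis and layout analysis of the text, it designs a development plan for an electronic retrieval system for the work. The main problem solved is restoring the original's three-lines-as-one-unit display format; the system can also search and compile statistics separately on different parts: the Chinese-character phonetic transcription of the original, the Chinese translation, the interlinear Chinese glosses, and the phonetic and grammatical annotations. Search output includes what researchers value most: the traditional scholarly section number, the volume and page number, and the exact position in the electronic text. In addition, the system applies context retrieval to search terms, so the output text includes part of the context surrounding each term. The system basically meets the application needs of historical, literary, and linguistic research.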

12.
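Because of the large number of synonyms and associated words, the text feature space used in text mining cannot accurately express text semantics and suffers from high-dimensional computational complexity. This paper uses latent semantic analysis and association rule mining to build sets of synonyms and associated words, which are used to reduce synonyms and associated words in the text feature space, lowering information redundancy and improving mining efficiency. The corresponding algorithms are described, and the experimental results are satisfactory.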

13.
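Automatic parsing of news video based on multimodal analysis   Total citations: 1, self-citations: 0, citations by others: 1
Wang Weiqiang, Gao Wen. Journal of Software (《软件学报》), 2001, 12(9): 1271-1278
The paper proposes a method for automatically parsing news video by combining information from multiple modalities, such as vision, sound, and text, and discusses in detail the extraction of audio features and the algorithm that integrates multimodal information to parse news video. Using multiple modalities effectively compensates for the shortcomings of segmenting news items with image analysis alone, so the method adapts to a wider range of news item formats during segmentation. On a test data set of 184,100 frames, the system achieved 95.1% recall and 93.3% precision in detecting news item boundaries. The experimental results demonstrate the effectiveness and robustness of the method.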

14.
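High-precision turntables are often used to test the performance of high-precision measuring instruments or as stable platforms for laser scanners, and turntable stability directly affects an instrument's measurement accuracy. A method is proposed for calibrating turntable stability with a digital camera: while the turntable moves, a camera mounted on it continuously captures photographs, and the camera's spatial position and attitude are computed on photogrammetric principles, thereby determining the turntable's motion trajectory. Experimental results show that the method can effectively detect the stability of turntable motion.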

15.
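To set controller parameters correctly, a graphical stability analysis method is proposed for active queue management (AQM) systems. The TCP/AQM system model is transformed into a second-order system with time delay, so that the stability of its closed-loop system can be characterized by a characteristic quasi-polynomial. In the complex plane, using the inverse Nyquist curve of the plant and the negative frequency characteristic line of the controller, a necessary and sufficient criterion for closed-loop stability is given. The relationship between network parameters and the bounds on the proportional gain of PID controllers that stabilize the AQM system is studied. Simulations were carried out in Matlab and Network Simulator, and the experimental results verify the effectiveness of the method. A comparison of the stability regions of different PID controllers further shows that the method is less conservative. Its advantages are low computational complexity and an intuitive display in the complex plane.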

16.
17.
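Zhai Cong, Wu Weitiao. Acta Automatica Sinica (《自动化学报》), 2020, 46(8): 1738-1747
Random fluctuations in the road environment and in dense traffic flow are causes of traffic disturbances. Considering the effects of vehicle honking in the road environment and of driver heterogeneity, this paper introduces the concept of a critical density at which honking occurs, builds a more realistic lattice hydrodynamic model, and reveals the mechanism that triggers traffic flow instability in unsaturated traffic states. In the linear stability analysis, the stability condition of the model is obtained by the perturbation method; the nonlinear stability problem is studied using the reductive perturbation method, and the kink-antikink solitary wave obtained by solving the mKdV equation describes how density waves propagate near the critical point. Simulation results show that the new lattice model with the honk effect is more stable than the Nagatani model, while a larger critical density has a negative effect on traffic flow stability. Compared with earlier microscopic models, the model can explain the natural conditions under which honking occurs, namely where density is high and flow is low; driver characteristics also significantly affect traffic flow stability.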

18.
The ICL Distributed Array Processor (DAP) is an SIMD array processor containing a large, 2-D array of bit serial processing elements. The architecture of the DAP makes it well suited to data processing applications where searching operations must be carried out on large numbers of data records. This paper discusses the use of the DAP for two such applications, these being the scanning of serial text files and the clustering of a range of types of database. The processing efficiency of the DAP, when compared with a serial processor, is greatest when fixed length records are processed.

19.
Recently, biology has been confronted with large multidimensional gene expression data sets where the expression of thousands of genes is measured over dozens of conditions. The patterns in gene expression are frequently explained retrospectively by underlying biological principles. Here we present a method that uses text analysis to help find meaningful gene expression patterns that correlate with the underlying biology described in scientific literature. The main challenge is that the literature about an individual gene is not homogeneous and may address many unrelated aspects of the gene. In the first part of the paper we present and evaluate the neighbor divergence per gene (NDPG) method that assigns a score to a given subgroup of genes indicating the likelihood that the genes share a biological property or function. To do this, it uses only a reference index that connects genes to documents, and a corpus including those documents. In the second part of the paper we present an approach, optimizing separating projections (OSP), to search for linear projections in gene expression data that separate functionally related groups of genes from the rest of the genes; the objective function in our search is the NDPG score of the positively projected genes. A successful search, therefore, should identify patterns in gene expression data that correlate with meaningful biology. We apply OSP to a published gene expression data set; it discovers many biologically relevant projections. Since the method requires only numerical measurements (in this case expression) about entities (genes) with textual documentation (literature), we conjecture that this method could be transferred easily to other domains. The method should be able to identify relevant patterns even if the documentation for each entity pertains to many disparate subjects that are unrelated to each other.

20.
Behaviour analysis should form an integral part of the software development process. This is particularly important in the design of concurrent and distributed systems, where complex interactions can cause unexpected and undesired system behaviour. We advocate the use of a compositional approach to analysis. The software architecture of a distributed program is represented by a hierarchical composition of subsystems, with interacting processes at the leaves of the hierarchy. Compositional reachability analysis (CRA) exploits the compositional hierarchy for incrementally constructing the overall behaviour of the system from that of its subsystems. In the Tracta CRA approach, both processes and properties reflecting system specifications are modelled as state machines. Property state machines are composed into the system and violations are detected on the global reachability graph obtained. The property checking mechanism has been specifically designed to deal with compositional techniques. Tracta is supported by an automated tool compatible with our environment for the development of distributed applications.
