Similar Documents
20 similar documents found.
1.
Because document images have complex layouts and unevenly distributed object sizes, and existing detection algorithms rarely consider multimodal information and global dependencies, a multimodal document-image object detection method based on vision and text is proposed. First, fusion strategies for multimodal features are explored: to exploit textual features, the text sequence information in the image is converted into a two-dimensional representation; after an initial fusion of textual and visual features, the result is fed into a backbone network to extract multi-scale features, and textual features are injected repeatedly during extraction to achieve deep fusion of the multimodal features. To ensure detection accuracy for both small and large objects, a pyramid network is designed whose lateral connections concatenate upsampled feature maps with bottom-up feature maps along the channel dimension, propagating high-level semantic information and low-level detail. Experimental results on the large public dataset PubLayNet show a detection accuracy of 95.86%, higher than that of other detection methods. The method not only achieves deep fusion of multimodal features but also enriches the fused multimodal information, yielding good detection performance.
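A minimal sketch (PyTorch assumed; the module name and channel sizes are illustrative, not taken from the paper) of the pyramid's lateral connection, which upsamples the high-level map and concatenates it with the bottom-up map along the channel dimension:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatLateralFPN(nn.Module):
    """Pyramid stage that fuses a high-level map with a bottom-up map
    by channel concatenation (rather than the usual element-wise add)."""
    def __init__(self, top_ch, lateral_ch, out_ch):
        super().__init__()
        # 1x1 conv mixes the concatenated channels back down to out_ch
        self.fuse = nn.Conv2d(top_ch + lateral_ch, out_ch, kernel_size=1)

    def forward(self, top, lateral):
        # Upsample the coarser, semantically rich map to the lateral's size
        top_up = F.interpolate(top, size=lateral.shape[-2:], mode="nearest")
        # Channel-wise concatenation carries both high-level semantics
        # and low-level detail; the 1x1 conv then fuses them
        return self.fuse(torch.cat([top_up, lateral], dim=1))

# Toy usage with illustrative shapes
stage = ConcatLateralFPN(top_ch=256, lateral_ch=128, out_ch=256)
p_low = stage(torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32))
print(p_low.shape)  # torch.Size([1, 256, 32, 32])
```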

2.
Because condition attributes differ in their distribution across samples and in the subjective characteristics they reflect, each sample corresponds to a local mapping of the true situation. A correspondence between sample knowledge and information in rough set theory is established, and a method for deriving a reduced decision table from an attribute reduct is given. Continuous attributes are handled with a post-discretization strategy, achieving a dynamic trade-off between discretization efficiency and information loss. The concept of relative-value conditional mutual information is proposed to measure the relevance of the condition attributes within a single sample, which makes full use of the available data when handling incomplete information systems. Even with insufficient prior knowledge, new rules can be constructed through active learning and added to the knowledge base. This extends the range of application of rough set theory; experimental results on UCI machine learning datasets and a worked example demonstrate the soundness and effectiveness of the algorithm.
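For context, a hedged sketch of plain conditional mutual information on discrete attributes; the paper's relative-value, per-sample variant is not specified here, but it refines this standard quantity:

```python
from collections import Counter
from math import log2

def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) for discrete sequences, in bits."""
    n = len(xs)
    p_xyz = Counter(zip(xs, ys, zs))
    p_xz = Counter(zip(xs, zs))
    p_yz = Counter(zip(ys, zs))
    p_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in p_xyz.items():
        p = c / n
        # p(x,y,z) * p(z) / (p(x,z) * p(y,z)) inside the log
        cmi += p * log2(p * (p_z[z] / n) /
                        ((p_xz[(x, z)] / n) * (p_yz[(y, z)] / n)))
    return cmi

# Toy decision table: condition attributes a, b and decision d
a = [0, 0, 1, 1, 0, 1]
b = [0, 1, 0, 1, 1, 0]
d = [0, 1, 0, 1, 1, 0]
print(conditional_mutual_information(b, d, a))  # how much b tells us about d given a
```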

3.
Over the past decade, text categorization has become an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can degrade when the texts are too complex, i.e., ambiguous. One alternative is to use semantic-based approaches that process textual documents according to their meaning. Furthermore, research in text categorization has mainly focused on "flat" texts, whereas many documents are now semi-structured, especially in XML format. In this paper, we propose a semantic kernel for semi-structured biomedical documents. The semantic meanings of words are extracted using the Unified Medical Language System (UMLS) framework. The kernel, with an SVM classifier, has been applied to a text categorization task on a medical corpus of free-text documents. The results show that the semantic kernel outperforms the linear kernel and the naive Bayes classifier. Moreover, this kernel was ranked in the top 10 of 44 classification methods at the 2007 Computational Medicine Center (CMC) Medical NLP International Challenge.
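A minimal sketch of how such a kernel can be plugged into an SVM (scikit-learn assumed; the concept vectors and relatedness matrix are toy values, not UMLS output). Documents are compared as K(a, b) = a^T S b, so related concepts, not only identical ones, contribute to similarity:

```python
import numpy as np
from sklearn.svm import SVC

# Toy concept-count vectors per document (rows), e.g. as produced by a
# UMLS concept mapper; values here are illustrative.
X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 2]], dtype=float)
y = np.array([0, 1, 0, 1])

# Concept-to-concept relatedness matrix S (symmetric PSD for a valid
# kernel); the identity would reduce this to the plain linear kernel.
S = np.array([[1.0, 0.2, 0.5],
              [0.2, 1.0, 0.1],
              [0.5, 0.1, 1.0]])

def semantic_kernel(A, B):
    # K(a, b) = a^T S b: similarity flows through related concepts
    return A @ S @ B.T

clf = SVC(kernel=semantic_kernel).fit(X, y)
print(clf.predict(X))
```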

4.
In high-performance computing job scheduling systems, many scheduling algorithms rely on accurate estimates of job runtimes, in particular backfilling algorithms such as EASY, and using the runtimes supplied by users often degrades scheduling performance. We propose GA-Sim, a job runtime prediction algorithm that combines classification with instance-based learning and addresses underestimation while also pursuing predictive accuracy. Numerical experiments on two real scheduling logs show that GA-Sim achieves higher prediction accuracy than the IRPA and TRIP algorithms while lowering the underestimation rate. We analyze the experimental results in depth and give recommendations for choosing an appropriate prediction algorithm under different circumstances.
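A hedged sketch of the instance-based half of such a predictor (GA-Sim's actual features, classification step, and weighting are not reproduced): predict from the k most similar historical jobs, using a high quantile to curb underestimation:

```python
import numpy as np

# Historical jobs: (requested_cpus, requested_walltime_s, user_id) -> runtime.
# The feature choice is illustrative, not GA-Sim's real feature set.
hist_X = np.array([[16, 3600, 1], [16, 3600, 1], [64, 7200, 2], [8, 1800, 3]], float)
hist_t = np.array([1200.0, 1500.0, 6900.0, 300.0])

def predict_runtime(job, k=2, quantile=0.9):
    """Instance-based estimate: take the k most similar past jobs and
    return a high quantile of their runtimes to limit underestimation."""
    d = np.linalg.norm((hist_X - job) / (hist_X.std(axis=0) + 1e-9), axis=1)
    nearest = np.argsort(d)[:k]
    # A high quantile (rather than the mean) trades a little accuracy
    # for fewer harmful underestimates in backfilling.
    return np.quantile(hist_t[nearest], quantile)

print(predict_runtime(np.array([16, 3600, 1], float)))  # ~1470 s
```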

5.
6.
This paper presents a method for computing a thesaurus from a text corpus, combined with a revised back-propagation neural network (BPNN) learning algorithm, for document categorization. An automatically constructed thesaurus is a data structure built by extracting the relatedness between words. Neural networks are an efficient approach to document categorization, but the conventional BPNN suffers from slow learning and is easily trapped in local minima; we use a revised algorithm that overcomes these problems. A well-constructed thesaurus has been recognized as a valuable tool for effective document categorization, since it overcomes a weakness of bag-of-words categorization, which ignores the relationships between words. To investigate the effectiveness of our method, we conducted experiments on the standard Reuters-21578 collection. The experimental results show that the proposed model achieves higher categorization effectiveness as measured by precision, recall and F-measure.
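A minimal sketch of one simple way to construct such a thesaurus from co-occurrence (the paper's exact relatedness extraction is not reproduced):

```python
from collections import defaultdict
from itertools import combinations

docs = [["neural", "network", "learning"],
        ["network", "training", "learning"],
        ["stock", "market", "trading"]]

# Co-occurrence thesaurus: words are "related" if they appear in the
# same document; raw counts could be replaced by PMI or cosine weights.
cooc = defaultdict(lambda: defaultdict(int))
for doc in docs:
    for w1, w2 in combinations(set(doc), 2):
        cooc[w1][w2] += 1
        cooc[w2][w1] += 1

def related(word, min_count=1):
    return sorted(w for w, c in cooc[word].items() if c >= min_count)

print(related("network"))  # ['learning', 'neural', 'training']
```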

7.
In this paper, the problem of automatically classifying documents into a set of given topics is considered. The proposed method uses latent semantic analysis to retrieve semantic dependencies between words, and documents are classified on the basis of these dependencies. The results of experiments performed on the standard TREC (Text REtrieval Conference) test data set confirm the attractiveness of this approach. The relatively low computational complexity of the method at the classification stage makes it applicable to the classification of document streams.
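A minimal sketch of LSA-based classification with off-the-shelf components (scikit-learn assumed; the corpus and topic labels are toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

docs = ["stocks fell on market fears", "the team won the match",
        "shares rally as markets recover", "players scored in the game"]
topics = ["finance", "sports", "finance", "sports"]

# LSA: TF-IDF followed by truncated SVD projects documents into a
# low-dimensional latent semantic space; classification then happens
# in that space (nearest centroid here; any classifier would do).
model = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2),
                      NearestCentroid())
model.fit(docs, topics)
print(model.predict(["markets slid again"]))  # ['finance'] expected
```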

8.
In this paper, a survey of work on word sense disambiguation is presented, and the method used in the Texterra system [1] is described. The method is based on the calculation of the semantic relatedness of Wikipedia concepts. A comparison of the proposed method with existing word sense disambiguation methods on various document collections is given.

9.
Automatic web page summarization based on topic segmentation
陈志敏, 沈洁, 林颖, 周峰. 《计算机应用》, 2006, 26(3): 641-644.
An automatic summarization method guided by web page structure is proposed. When parsing the page source, the document's structural information is used to build a DOM tree, on whose basis the document is segmented into topics; the value of HTML tags for topic-word extraction and sentence-importance computation is fully exploited. Finally, taking each topic block as a unit, sentence weights are adjusted according to inter-sentence similarity and the summary is generated dynamically. Experimental results show that the method effectively addresses unbalanced summary coverage across a document and reduces redundancy in the summary.
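A hedged sketch of redundancy-aware sentence selection within a topic block (MMR-style; the paper's exact tag-based weighting is not reproduced):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The DOM tree exposes the page structure.",
             "Topic blocks are derived from the DOM tree.",
             "Sentence weights are adjusted by similarity.",
             "Football scores were also on the page."]
base_weight = np.array([0.9, 0.8, 0.7, 0.3])  # e.g. from tags and topic words

# Greedy selection: each picked sentence penalizes the remaining ones
# by their similarity to it, reducing redundancy in the summary.
sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
weights, summary = base_weight.copy(), []
for _ in range(2):              # summary length: 2 sentences
    best = int(np.argmax(weights))
    summary.append(sentences[best])
    weights -= 0.5 * sim[best]  # down-weight similar sentences
    weights[best] = -np.inf     # never pick the same sentence twice
print(summary)
```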

10.
This paper presents a discussion of rough set theory from the textural point of view. A texturing is a family of subsets of a given universal set U satisfying certain conditions, which are generally basic properties of the power set. Suitable morphisms between texture spaces are given by direlations, defined as pairs (r, R) where r is a relation and R is a corelation. It is observed that presections are natural generalizations of rough sets; more precisely, if (r, R) is a complemented direlation, then the inverse of the relation r (respectively, of the corelation R) is actually a lower approximation operator (respectively, an upper approximation operator).
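For context, the classical approximation operators that the textural presections generalize (standard rough set definitions for an equivalence relation R on U, not the paper's textural notation):

```latex
% Classical rough approximations of A \subseteq U under an
% equivalence relation R, with [x]_R the equivalence class of x:
\[
  \underline{R}(A) = \{\, x \in U : [x]_R \subseteq A \,\}, \qquad
  \overline{R}(A) = \{\, x \in U : [x]_R \cap A \neq \emptyset \,\}
\]
```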

11.
Information systems are widely used in all business areas. These systems typically integrate a set of functionalities that implement business rules and maintain databases. Users interact with these systems through user interfaces (UIs). Each UI is usually composed of menus from which the user selects the desired functionality, thereby accessing a new UI that corresponds to that feature; a system therefore normally contains multiple UIs. However, keeping these UIs consistent, both visually (organisation, component style, etc.) and behaviorally, is usually difficult. This problem also appears in software product lines, where patterns to guide the construction and maintenance of UIs would be desirable. One possible way of defining such patterns is model-driven engineering (MDE). In MDE, models are defined at different levels, where the bottom level is called a metamodel. The metamodel determines the main characteristics of the models at the upper levels, serving as a guideline: each new level must adhere to the rules defined by the lower levels, so if anything changes at a lower level, the changes are propagated to the levels above it. The goal of this work is to define and validate a metamodel for modeling the UIs of software systems, thereby allowing the definition of interface patterns and supporting system evolution. We build this metamodel on a graph structure, because a UI can be naturally represented as a graph in which each UI component is a vertex and edges represent dependencies between components; moreover, graph theory supports a great number of operations and transformations that are useful for UIs. The metamodel was defined by investigating patterns that occur in the UIs of a sample of information systems containing different UI types. To validate the metamodel, we built complete UI models of one new system and of four existing real systems. This shows not only the expressive power of the metamodel but also its versatility, since the validation covered different types of systems (a desktop system, a web system, a mobile system, and a multiplatform system) and demonstrated that the proposed approach can be used not only to build new models but also to describe existing ones through reverse engineering.

12.
Text categorization is an important research area of text mining. Its original purpose is to recognize, understand and organize different types of texts or documents. General categorization approaches are treated as supervised learning, which infers similarity among a collection of categorized texts for training purposes. Existing categorization approaches are not content-oriented and are constrained to the single-word level. This paper introduces an innovative content-oriented text categorization approach named CogCate. Inspired by cognitive situation models, CogCate exploits a human cognitive procedure in categorizing texts: in addition to traditional statistical analysis at the word level, it applies lexical/semantic analysis, which ensures the accuracy of categorization. Evaluation experiments have verified the performance of CogCate; at the same time, CogCate remarkably reduces the time and effort spent on software training and on maintenance of text collections. Our work attests that interdisciplinary research efforts benefit text categorization.

13.
Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on the term frequency or document frequency of a word; we focus on the relative importance of these frequencies for feature selection metrics. The document-frequency-based metrics discriminative power measure and GINI index were examined together with term frequency for this purpose. The metrics were compared and analyzed on the Reuters-21578 dataset. Experimental results revealed that the term-frequency-based metrics may be useful, especially for smaller feature sets. Two characteristics of term-frequency-based metrics were observed by analyzing the scatter of features among classes and the rate at which the information in the data was covered; these characteristics may contribute to their superior performance for smaller feature sets.
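A minimal sketch of one common document-frequency formulation of the GINI index for feature selection (toy data; the paper's exact variants are not reproduced):

```python
from collections import Counter

# Documents as (class, word set); toy data for illustration.
docs = [("pos", {"good", "great", "fun"}), ("pos", {"good", "fun"}),
        ("neg", {"bad", "boring"}), ("neg", {"bad", "dull", "fun"})]

def gini_index(word):
    """Document-frequency GINI index of a feature: the sum over classes
    of P(class | word)^2; higher means more discriminative."""
    containing = [c for c, words in docs if word in words]
    if not containing:
        return 0.0
    counts = Counter(containing)
    return sum((n / len(containing)) ** 2 for n in counts.values())

for w in ["good", "fun", "bad"]:
    print(w, gini_index(w))  # good 1.0, fun ~0.56, bad 1.0
```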

14.
15.
Document image processing is a crucial process in office automation. It begins at the OCR phase and continues with the difficulties of document analysis and understanding. This paper presents a hybrid and comprehensive approach to document structure analysis: hybrid in the sense that it makes use of layout (geometric) as well as textual features of a given document. These features form the basis of potential conditions, which in turn are used to express fuzzily matched rules of an underlying rule base. Rules can be formulated on features observed within one specific layout object, but they can also express dependencies between different layout objects. In addition to its rule-driven analysis, which allows easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g., lists).

16.
Entailment for measure-based belief structures can extend the range of possible probability values of variables on a space and thereby obtain more information about the variables. However, if the variable space comes from intuitionistic fuzzy sets, classical entailment for measure-based belief structures does not apply. To deal with this situation, we propose entailment for intuitionistic fuzzy sets based on generalized belief structures, applying entailment for measure-based belief structures to the space made up of the non-membership degree, membership degree and hesitancy degree of a given intuitionistic fuzzy set. Numerical examples demonstrate the effectiveness and flexibility of the proposed entailment model. The experimental results indicate that the proposed algorithm can efficiently extend the range of possible probability values of variables of the space and obtain more information from intuitionistic fuzzy sets.
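A hedged sketch of one simple reading of the setup, where membership acts as a lower probability bound and membership plus hesitancy as an upper bound (illustrative only, not the paper's full entailment model):

```python
# An intuitionistic fuzzy element assigns each x a membership mu and a
# non-membership nu with mu + nu <= 1; the slack is the hesitancy.
# Reading mu as belief and mu + hesitancy as plausibility yields the
# interval of possible probabilities that entailment then refines.
ifs = {"x1": (0.6, 0.3), "x2": (0.2, 0.5), "x3": (0.5, 0.5)}

for x, (mu, nu) in ifs.items():
    hesitancy = 1.0 - mu - nu
    print(f"{x}: P in [{mu:.2f}, {mu + hesitancy:.2f}] (hesitancy {hesitancy:.2f})")
# x1: P in [0.60, 0.70], x2: P in [0.20, 0.50], x3: P in [0.50, 0.50]
```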

17.
To address the incomplete coverage, by existing Simulink model rule-checking tools, of domestic standards in specific fields such as flight control, 41 modeling guidelines were designed and a common parsing and checking framework based on metamodel theory was proposed; on this basis, the Simulink model rule-checking tool SimREG was implemented. The method performs static rule checking of Simulink models without compilation: during the mapping from the Simulink model to the metamodel, the model information needed to check each guideline is extracted and the model is restructured into a directed graph; during traversal, each node in the graph is analyzed against the selected guidelines to complete the check. SimREG implements checks for all 41 modeling guidelines and achieved better results in comparative experiments against three representative rule-checking tools. By applying metamodel theory to the rule checking of Simulink models, SimREG checks faster while achieving lower false-negative and false-positive rates.
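A minimal sketch of compile-free, graph-based rule checking in the same spirit (the block types and the sample guideline are illustrative, not SimREG's actual guideline set):

```python
# The model is a directed graph of typed blocks; each rule is a
# predicate applied to every node during traversal.
model = {
    "In1":   {"type": "Inport",  "out": ["Gain1"]},
    "Gain1": {"type": "Gain",    "out": ["Out1"], "name": "gain block"},
    "Out1":  {"type": "Outport", "out": []},
}

def rule_no_spaces_in_names(node_id, node):
    name = node.get("name", node_id)
    if " " in name:
        return f"{node_id}: block name '{name}' contains spaces"

def check(model, rules):
    violations = []
    for node_id, node in model.items():  # visit every node once
        for rule in rules:
            msg = rule(node_id, node)
            if msg:
                violations.append(msg)
    return violations

print(check(model, [rule_no_spaces_in_names]))
# ["Gain1: block name 'gain block' contains spaces"]
```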

18.
19.
As one of the most fundamental yet important methods of data clustering, the center-based partitioning approach clusters a dataset into k subsets, each represented by a centroid or medoid. In this paper, we propose a new medoid-based k-partitions approach called Clustering Around Weighted Prototypes (CAWP), which works with a similarity matrix. In CAWP, each cluster is characterized by multiple objects with different representative weights. With this new cluster representation scheme, CAWP aims to simultaneously produce clusters of improved quality and a set of ranked representative objects for each cluster. An efficient algorithm is derived to alternately update the clusters and the representative weights of objects with respect to each cluster. An annealing-like optimization procedure is incorporated to alleviate the local-optimum problem, yielding better clustering results and making the algorithm less sensitive to parameter settings. Experimental results on benchmark document datasets show that CAWP achieves favorable effectiveness and efficiency in clustering and also provides useful information for cluster-specific analysis.
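A hedged toy alternation in the spirit of CAWP (its actual update rules and annealing procedure are not reproduced): assign objects to the most similar medoid, weight each object by its mean within-cluster similarity, and re-pick medoids by weight:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((6, 6)); S = (S + S.T) / 2; np.fill_diagonal(S, 1.0)

def cluster_with_weights(S, medoids, iters=5):
    """Toy alternation over a similarity matrix S: assignment, then
    representative weights, then medoid update."""
    for _ in range(iters):
        labels = np.argmax(S[:, medoids], axis=1)        # assignment step
        weights = np.zeros(len(S))
        for k in range(len(medoids)):
            members = np.where(labels == k)[0]
            w = S[np.ix_(members, members)].mean(axis=1)  # representativeness
            weights[members] = w
            medoids[k] = members[np.argmax(w)]            # new medoid
    # ranked representative objects per cluster fall out of the weights
    return labels, weights, medoids

labels, weights, medoids = cluster_with_weights(S, [0, 3])
print(labels, np.round(weights, 2), medoids)
```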

20.
This study develops an ontology building process for extracting concept tags and hierarchies from a textual corpus. Though humans have been creating ontologies for many years, efficient ontology building processes for textual corpora remain extremely ad hoc. Several issues have been identified, including how to recognize terminology in textual documents, how to assign concept tags to that terminology, and how to derive hierarchies among the concepts. The proposed approach combines extraction techniques to produce an ontology prototype for editors. Empirical feedback indicates that this elicitation synergy is productive during the early stages of building, and that it is especially useful for ontology editors who lack reference models of the working domain and whose major knowledge sources are textual corpora.
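A minimal sketch of one classic hierarchy heuristic, document subsumption, that such a process can use (the paper combines several extraction techniques; only this one is illustrated): term A is placed above term B if (almost) every document mentioning B also mentions A.

```python
# Toy corpus: each document is a set of extracted terms.
docs = [{"vehicle", "car"}, {"vehicle", "car", "sedan"},
        {"vehicle", "truck"}, {"vehicle"}]

def subsumes(a, b, threshold=1.0):
    """True if term a appears in at least `threshold` of b's documents."""
    b_docs = [d for d in docs if b in d]
    covered = sum(1 for d in b_docs if a in d)
    return bool(b_docs) and covered / len(b_docs) >= threshold

terms = {"vehicle", "car", "sedan", "truck"}
edges = [(a, b) for a in terms for b in terms
         if a != b and subsumes(a, b) and not subsumes(b, a)]
print(sorted(edges))
# [('car', 'sedan'), ('vehicle', 'car'), ('vehicle', 'sedan'), ('vehicle', 'truck')]
```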
