Similar literature
1.
2.
3.
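Existing methods for extracting text content from PDF documents (such as the one used by the PDFBox library) are slow. They cannot meet the demands of content analysis on high-speed networks, nor can they process, in a streaming fashion, PDF packets that have only partially arrived over the network. To address this, a PDF text extraction method based on automata theory is proposed. By building a hierarchical keyword automaton, the method can quickly extract text content from both complete and incomplete PDF documents. Experiments on Chinese and English PDF document datasets show that the automaton-based method takes only 17% to 37% of the time required by the PDFBox method.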
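
The abstract does not describe how the hierarchical keyword automaton is built, so the following is only a minimal sketch of the general idea of streaming keyword matching: an Aho-Corasick automaton whose state persists between chunks, so a keyword split across two packets is still found. The class name, the `feed` method, and the choice of `stream` and `endstream` as keywords are assumptions made for illustration, not the paper's design.

```python
from collections import deque


class StreamingKeywordMatcher:
    """Aho-Corasick automaton whose state survives across chunk boundaries.

    Hypothetical helper for illustration; not the automaton from the paper.
    """

    def __init__(self, keywords):
        self.goto = [{}]   # goto[state][byte] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # keywords recognised on entering each state
        for kw in keywords:
            state = 0
            for b in kw:
                if b not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][b] = len(self.goto) - 1
                state = self.goto[state][b]
            self.out[state].append(kw)
        # Breadth-first pass to fill in the failure links.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for b, t in self.goto[s].items():
                queue.append(t)
                f = self.fail[s]
                while f and b not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(b, 0)
                self.out[t] = self.out[t] + self.out[self.fail[t]]
        self.state = 0     # current automaton state, kept between chunks
        self.offset = 0    # number of bytes consumed so far

    def feed(self, chunk):
        """Consume one chunk of bytes; return (end_offset, keyword) pairs."""
        hits = []
        for b in chunk:
            while self.state and b not in self.goto[self.state]:
                self.state = self.fail[self.state]
            self.state = self.goto[self.state].get(b, 0)
            self.offset += 1
            for kw in self.out[self.state]:
                hits.append((self.offset, kw))
        return hits
```

Feeding `b"xx endstr"` and then `b"eam yy"` to a matcher built with `[b"stream", b"endstream"]` reports both keywords only on the second call, because the partial match is carried over in `self.state`.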

4.
EDCMS: A Content Management System for Engineering Documents   Total citations: 1, self-citations: 0, citations by others: 1
Engineers often need to find the right pieces of information by sifting through long engineering documents, a tiring and time-consuming job. To address this issue, researchers are increasingly devoting their attention to new ways of helping information users, including engineers, to access and retrieve document content. The research reported in this paper explores how to use the key technologies of document decomposition (the study of document structure), document mark-up (with eXtensible Markup Language (XML), HyperText Markup Language (HTML), and Scalable Vector Graphics (SVG)), and a faceted classification mechanism. Document content extraction is implemented in Java. An Engineering Document Content Management System (EDCMS) developed in this research demonstrates that information providers can make document content more accessible to information users, including engineers. The main features of EDCMS are: 1) It enables users, especially engineers, to access and retrieve information at the content level rather than the document level. In other words, it provides the right pieces of information to answer specific questions, so that engineers do not need to waste time sifting through the whole document. 2) Users can access engineering document content via both the data and the metadata of a document. 3) Users can access and retrieve content objects, i.e., text, images, and graphics (including engineering drawings), via multiple views and at different granularities based on decomposition schemes. Experiments with EDCMS have been conducted on semi-structured documents, a CAD/CAM textbook, and a set of project posters in the engineering design domain. Experimental results show that the system provides information users with a powerful solution for accessing document content.

5.
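Two-dimensional (2D) barcodes are increasingly used in direct part marking, but a 2D barcode marked on a cylindrical part can become unreadable because of warping distortion. To correct this distortion, the paper first analyzes how the distortion of a 2D barcode on a cylindrical surface arises. It then builds a distortion-correction model for 2D barcodes on cylinders, based on a calibrated machine-vision setup and the principle of perspective projection. Finally, it estimates the radius of the cylindrical part from image information and substitutes it into the correction model to obtain the corrected result. Experimental results show that the method corrects the warping distortion of 2D barcodes on cylinders of arbitrary radius well.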
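
The correction model itself is not given in the abstract. As a rough illustration of why the radius matters, the sketch below uses a deliberately simplified orthographic view instead of the perspective projection the paper relies on: a point seen at horizontal offset x from the cylinder axis lies at arc length R·asin(x/R) on the surface. The function name `unwrap_x` and the orthographic assumption are illustrative only.

```python
import math


def unwrap_x(x, radius):
    """Map a horizontal image offset to arc length on the cylinder surface.

    Simplifying assumption (not the paper's model): orthographic projection,
    with x measured from the projected cylinder axis in the same units as
    the radius.
    """
    if abs(x) > radius:
        raise ValueError("point lies outside the cylinder silhouette")
    return radius * math.asin(x / radius)
```

Near the axis the mapping is almost the identity, while near the silhouette edge a small step in x corresponds to a much larger step along the surface, which is the compression a reader has to undo.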

6.
7.
Restoring warped document images through 3D shape modeling   Total citations: 2, self-citations: 0, citations by others: 2
Scanning a document page from a thick bound volume often results in two kinds of distortions in the scanned image, i.e., shade along the "spine" of the book and warping in the shade area. In this paper, we propose an efficient restoration method based on the discovery of the 3D shape of a book surface from the shading information in a scanned document image. From a technical point of view, this shape from shading (SFS) problem in real-world environments is characterized by 1) a proximal and moving light source, 2) Lambertian reflection, 3) nonuniform albedo distribution, and 4) document skew. Taking all these factors into account, we first build practical models (consisting of a 3D geometric model and a 3D optical model) for the practical scanning conditions to reconstruct the 3D shape of the book surface. We next restore the scanned document image using this shape based on deshading and dewarping models. Finally, we evaluate the restoration results by comparing our estimated surface shape with the real shape as well as the OCR performance on original and restored document images. The results show that the geometric and photometric distortions are mostly removed and the OCR results are improved markedly.

8.
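Although the one-dimensional intra-frame string matching (ISC) algorithm considerably improves screen content coding, it linearizes the 2D image one coding unit (CU) at a time, which splits adjacent image regions apart and leaves their spatial correlation unexploited. To address this, a two-dimensional intra-frame string matching (2D ISC) algorithm is proposed. With almost no additional memory consumption in the encoder or decoder, the algorithm uses a dictionary coding tool within the reconstruction buffer of High Efficiency Video Coding (HEVC) to search for and match arbitrary 2D shapes for the pixels in the current CU, unconstrained by CU boundaries. It also introduces color quantization preprocessing and adaptive selection between horizontal and vertical search order to further improve coding performance. Under the common test conditions, for typical screen content in the all intra (AI), random access (RA), and low delay (LB) configurations, lossless coding saves up to 46.5%, 34.8%, and 25.4% of the bit rate, respectively, compared with HEVC, and lossy coding saves up to 34.0%, 37.2%, and 23.9%. Compared with the one-dimensional ISC algorithm, lossless coding saves up to 18.3%, 13.9%, and 11.0%, and lossy coding saves up to 19.8%, 20.5%, and 10.4%. The experimental results demonstrate the feasibility and effectiveness of the algorithm.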

9.
As XML documents contain both content and structure information, taking advantage of the document structure in the retrieval process can help to identify relevant information units more accurately. In this paper, we describe an information retrieval (IR) approach dealing with queries composed of content and structure conditions. The XFIRM model we propose is designed to be as flexible as possible in processing such queries. It is based on a complete query language, derived from XPath, and on a relevance value propagation method. This paper aims at evaluating the functions used in the propagation process, and particularly the use of the distance between nodes as a parameter. The proposed method is evaluated within the INEX evaluation initiative. Results show that our proposal achieves relatively high precision.
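
The abstract names distance between nodes as a parameter of the propagation but does not give the function. One plausible form, shown here purely as an assumption, scores each inner node as the sum of its descendant leaf scores, each damped by `alpha` raised to the number of edges separating them. The names `propagate`, `tree`, `leaf_scores`, and `alpha` are hypothetical and are not taken from XFIRM.

```python
def propagate(tree, leaf_scores, alpha=0.6):
    """Propagate leaf relevance scores up an XML-like tree.

    tree: dict mapping a node name to the list of its children.
    leaf_scores: dict mapping a leaf name to its relevance score.
    Assumed rule (illustrative only): score(n) = sum(alpha**dist * leaf score).
    """
    scores = {}

    def visit(node):
        # Returns (distance, score) pairs for every leaf below this node.
        children = tree.get(node, [])
        if not children:
            score = leaf_scores.get(node, 0.0)
            scores[node] = score
            return [(0, score)]
        below = []
        for child in children:
            below.extend((d + 1, s) for d, s in visit(child))
        scores[node] = sum(alpha ** d * s for d, s in below)
        return below

    all_children = {c for kids in tree.values() for c in kids}
    for root in set(tree) - all_children:
        visit(root)
    return scores
```

With `alpha` below 1, a small element that directly contains relevant leaves can outscore a large ancestor, which matches the goal of returning elements rather than whole documents.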

10.
Multimedia Tools and Applications - In certain platforms, such as Google Docs, documents are adapted for specific mobile device types, and the installation of their applications is required....

11.
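To address word segmentation in offline handwritten Uyghur text-line images, a clustering algorithm that fuses FCM with K-means is proposed. The algorithm sorts distances into two classes: within-word distances and between-word distances. Based on the clustering result, text regions are merged to obtain segmentation points; the text between segmentation points is then labeled by connected component and colored. The experiments used 50 offline handwritten Uyghur text images written by different people, containing 536 lines and 4,002 words in total, and the correct segmentation rate reached 80.68%. The results show that the method overcomes the segmentation difficulty caused by irregular spacing between words in handwritten Uyghur, as well as some cases of overlapping words, and that it can process large handwritten text images as a whole.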
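
The two-class split of distances can be illustrated with a plain one-dimensional 2-means clustering of gap widths. This is a simplified stand-in, not the FCM and K-means fusion the paper proposes, and the function name `split_gaps` and the pixel gap values in the usage note are made up for the example.

```python
def split_gaps(gaps, iters=20):
    """Return indices of gaps classified as between-word gaps.

    Plain 1D 2-means on gap widths (an illustrative simplification):
    the small cluster is treated as within-word spacing, the large
    cluster as between-word spacing.
    """
    lo, hi = float(min(gaps)), float(max(gaps))
    for _ in range(iters):
        small = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
        large = [g for g in gaps if abs(g - lo) > abs(g - hi)]
        if not large:
            break  # all gaps look alike; no word boundary can be inferred
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids have converged
        lo, hi = new_lo, new_hi
    threshold = (lo + hi) / 2
    return [i for i, g in enumerate(gaps) if g > threshold]
```

For gap widths `[2, 3, 2, 12, 3, 2, 11, 2]`, the function returns `[3, 6]`, marking the two wide gaps as candidate word boundaries.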

12.
13.
The distributed nature of the Web, as a decentralized system exchanging information between heterogeneous sources, has underlined the need to manage interoperability, i.e., the ability to automatically interpret information in Web documents exchanged between different sources, which is necessary for efficient information management and search applications. In this context, XML was introduced as a data representation standard that simplifies the tasks of interoperation and integration among heterogeneous data sources, allowing data to be represented in (semi-)structured documents consisting of hierarchically nested elements and atomic attributes. However, while XML has been shown to be most effective in exchanging data, i.e., in syntactic interoperability, it has proven limited when it comes to handling semantics, i.e., semantic interoperability, since it only specifies the syntactic and structural properties of the data without any further semantic meaning. As a result, XML semantic-aware processing has become a motivating challenge in Web data management, requiring dedicated semantic analysis and disambiguation methods to assign well-defined meaning to XML elements and attributes. In this context, most existing approaches: (i) ignore the problem of identifying ambiguous XML elements/nodes, (ii) only partially consider their structural relationships/context, (iii) use syntactic information in processing XML data regardless of the semantics involved, and (iv) are static in adopting fixed disambiguation constraints, thus limiting user involvement. In this paper, we provide a new XML Semantic Disambiguation Framework, titled XSDF, designed to address each of the above limitations. It takes an XML document as input and produces as output a semantically augmented XML tree made of unambiguous semantic concepts extracted from a reference machine-readable semantic network.
XSDF consists of four main modules for: (i) linguistic pre-processing of simple/compound XML node labels and values, (ii) selecting ambiguous XML nodes as targets for disambiguation, (iii) representing target nodes as special sphere neighborhood vectors including all XML structural relationships within a (user-chosen) range, and (iv) running context vectors through a hybrid disambiguation process that combines two approaches, concept-based and context-based disambiguation, allowing the user to tune disambiguation parameters according to her needs. The experiments conducted demonstrate the effectiveness and efficiency of our approach in comparison with alternative methods. We also discuss some practical applications of our method, ranging over semantic-aware query rewriting, semantic document clustering and classification, mobile and Web services search and discovery, as well as blog analysis and event detection in social networks and tweets.

14.
Relevance feedback (RF) is a technique that enriches an initial query according to the user's feedback. The goal is to express the user's needs more precisely. Some open issues arise when considering semi-structured documents such as XML documents. They are mainly related to the form of XML documents, which mix content and structure information, and to the new granularity of information. Indeed, the main objective of XML retrieval is to select relevant elements in XML documents instead of whole documents. Most of the RF approaches proposed in XML retrieval are simple adaptations of traditional RF to the new granularity of information. They usually enrich queries by adding terms extracted from relevant elements instead of terms extracted from whole documents. In this article, we describe a new RF approach that takes advantage of two sources of evidence: the content and the structure. We propose to use query term proximity to select terms to be added to the initial query and to use generic structures to express structural constraints. Both sources of evidence are used in different combined forms. Experiments were carried out within the INEX evaluation campaign, and the results show the effectiveness of our approaches.

15.
郑珂  马骏  陈明 《微机发展》2008,18(5):25-27
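This paper describes the need for, and the current state of, transforming and extending 2D concept lattice diagrams into 3D space. Based on a study and analysis of the traditional layered positioning layout method for concept lattice diagrams, it proposes and implements a new automatic layout algorithm for concept lattices in 3D space, whose basic features are large numbers of parallelograms and directed line segments. It also describes a mechanism, based on this algorithm, for reconstructing 2D concept lattice diagrams in 3D. The approach effectively solves the problem of excessive horizontal spread of nodes and reduces line crossings, achieves good 3D visualization of complex concept lattice diagrams, and provides a sound basis for knowledge discovery and knowledge processing.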

16.
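This paper describes the need for, and the current state of, transforming and extending 2D concept lattice diagrams into 3D space. Based on a study and analysis of the traditional layered positioning layout method for concept lattice diagrams, it proposes and implements a new automatic layout algorithm for concept lattices in 3D space, whose basic features are large numbers of parallelograms and directed line segments. It also describes a mechanism, based on this algorithm, for reconstructing 2D concept lattice diagrams in 3D. The approach effectively solves the problem of excessive horizontal spread of nodes and reduces line crossings, achieves good 3D visualization of complex concept lattice diagrams, and provides a sound basis for knowledge discovery and knowledge processing.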

17.
A novel text line extraction technique is presented for multi-skewed document images of handwritten English or Bengali text. It assumes that hypothetical water flows, from both the left and right sides of the image frame, are obstructed by the characters of text lines. The stripes of area left unwetted on the image frame are finally labelled for extraction of text lines. The success rates of the technique, as observed experimentally, are 90.34% and 91.44% for handwritten Bengali and English document images, respectively. The work may contribute significantly to the development of applications related to optical character recognition of Bengali/English text.

18.
Automatic ontology-based knowledge extraction from Web documents   Total citations: 4, self-citations: 0, citations by others: 4
To bring the Semantic Web to life and provide advanced knowledge services, we need efficient ways to access and extract knowledge from Web documents. Although Web page annotations could facilitate such knowledge gathering, annotations are rare and will probably never be rich or detailed enough to cover all the knowledge these documents contain. Manual annotation is impractical and unscalable, and automatic annotation tools remain largely undeveloped. Specialized knowledge services therefore require tools that can search and extract specific knowledge directly from unstructured text on the Web, guided by an ontology that details what type of knowledge to harvest. An ontology uses concepts and relations to classify domain knowledge. Other researchers have used ontologies to support knowledge extraction, but few have explored their full potential in this domain. The paper considers the Artequakt project which links a knowledge extraction tool with an ontology to achieve continuous knowledge support and guide information extraction. The extraction tool searches online documents and extracts knowledge that matches the given classification structure. It provides this knowledge in a machine-readable format that will be automatically maintained in a knowledge base (KB). Knowledge extraction is further enhanced using a lexicon-based term expansion mechanism that provides extended ontology terminology.

19.
International Journal on Document Analysis and Recognition (IJDAR) - The automation of document processing has recently gained attention owing to its great potential to reduce manual work. Any...

20.