首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
PDF文件信息的抽取与分析   总被引:5,自引:0,他引:5  
李珍  田学东 《计算机应用》2003,23(12):145-147
PDF文件网络信息抽取的重要资源。通过对PDF文件结构的分析,针对最流行的线性PDF文件,在论述如何从源代码中取出正文内容字符串流并进行解码的基础上,对从解码后的字符串流中提取出文本及其相关的字体、字号和换行等文本信息进行了详细的讨论。这将有助于根据需要进一步抽取PDF文件信息。  相似文献   

2.
As collections of archived digital documents continue to grow the maintenance of an archive, and the quality of reproduction from the archived format, become important long‐term considerations. In particular, Adobe's portable document format (PDF) is now an important ‘final form’ standard for archiving and distributing electronic versions of technical documents. It is important that all embedded images in the PDF, and any fonts used for text rendering, should at the very minimum be easily readable on screen. Unfortunately, because PDF is based on PostScript technology, it allows the embedding of bitmap fonts in Adobe Type 3 format as well as higher‐quality outline fonts in TrueType or Adobe Type 1 formats. Bitmap fonts do not generally perform well when they are scaled and rendered on low‐resolution devices such as workstation screens. The work described here investigates how a plug‐in to Adobe Acrobat enables bitmap fonts to be substituted by corresponding outline fonts using a checksum matching technique against a canonical set of bitmap fonts, as originally distributed. The target documents for our initial investigations are those PDF files produced by LATEX systems when set up in a default (bitmap font) configuration. For all bitmap fonts where recognition exceeds a certain confidence threshold replacement fonts in Adobe Type 1 (outline) format can be substituted with consequent improvements in file size, screen display quality and rendering speed. The accuracy of font recognition is discussed together with the prospects of extending these methods to bitmap‐font PDF files from sources other than LATEX. Copyright © 2003 John Wiley & Sons, Ltd.  相似文献   

3.
An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of both machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features along with preprocessing and post-processing techniques for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application adaptable fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification are increased by 11.52 and 10.65 %, compared with our previously proposed formula identification approach.  相似文献   

4.
在.NET平台上应用COM进行文档网上发布   总被引:1,自引:0,他引:1  
人们经常利用办公软件制作各种文档,某些情况下需要将制作的办公文档在网上发布.为了能够批量地将办公文档发布到网页上,提出一种应用COM组件接口将文档转化为PDF格式文件,然后再将PDF文件转化为图片,把文档发布到网上的方法,使得信息的利用率大大提高,展示更容易.方法的实用性和扩展性强,具有批量处理文档的能力,能极大地提高工作效率,具有较好的应用价值.  相似文献   

5.
Flash Paper软件能将当前主流的文档转换为SWF或PDF文档,便于在网络上浏览和打印。本文介绍了Flash Paper这个软件的特点、使用方法,并详细说明了如何在教学中运用这个软件辅助制作电子教案、电子课件、网页、电子书籍。  相似文献   

6.
刘友继  孙星明  罗纲 《计算机工程》2006,32(17):230-232
通过分析格式化文件PDF(Portable Document Format)文档的数据结构,提出了一种新的基于PDF文档结构的大容量信息隐藏算法。将秘密信息预处理后伪装成合法PDF对象的形式,以文件流的操作方式嵌入到载体文件中,并满足嵌入的信息不影响文件在阅读器、编辑器与打印机中的输出。实验实现了线性化PDF文档的信息隐藏与检测。理论分析与实验结果均表明,该算法具有较大的信息隐藏容量、很快的隐藏与检测速度及依赖于加密算法和密钥的安全性。  相似文献   

7.
苟孟洛 《计算机安全》2014,(5):12-13,18
随着互联网的高速发展和办公自动化的日益普及,PDF(portable document format)文件已经成为全球电子文档分发的开放式标准,由于PDF文档的高实用性和普遍适应性,使其成为有针对性钓鱼攻击的有效载体。恶意代码对计算机的严重破坏性,检测和防止含有恶意代码的PDF文档已日益成为计算机安全领域的重要目标。通过从文档中提取特征数据,提出了一个基于机器学习算法的恶意PDF检测框架,最后并通过实验验证了其检测模型的有效性。  相似文献   

8.
模糊测试是当前检测程序错误的最主流、最有效的手段之一.模糊测试工具首先对种子文件进行变异,生成大量新输入文件,然后挑选新输入来执行目标程序,以触发程序中潜在的漏洞.当前对模糊测试的研究多着眼于改进变异算法,提高生成的新文件对目标程序代码的覆盖,忽略了备用种子文件的筛选策略对提高模糊测试覆盖率与测试效率的的贡献.针对该问题,我们提出了基于覆盖频率的种子文件筛选策略,在每次执行目标程序时,我们记录程序执行中覆盖过的路径边;根据边被执行次数的多少,我们将这些边分为低频边和高频边;对于包含了更多低频边且执行效率高的种子文件,我们给予更高的优先级.我们在模糊测试工具American Fuzzy Lop (AFL)实现了对应的算法,实验表明我们的算法有效提高了模糊测试的效率和代码覆盖率.  相似文献   

9.
Efforts to improve application reliability can be irrelevant if the reliability of the underlying operating system on which the application resides is not seriously considered. An important first step in improving the reliability of an operating system is to gain insights into why and how the bugs originate, contributions of the different modules to the bugs, their distribution across severities, the different ways in which the bugs may be resolved and the impact of bug severities on their resolution times. To acquire this insight, we conducted an extensive analysis of the publicly available bug data on the Linux kernel over a period of seven years. We also justify and explain the statistical bug occurrence trends observed from the data, using the architecture of the Linux kernel as an anchor. The statistical analysis of the Linux bug data suggests that the Linux kernel may draw significant benefits from the continual reliability improvement efforts of its developers. These efforts, however, are disproportionately targeted towards popular configurations and hardware platforms, due to which the reliability of these configurations may be better than those that are not commonly used. Thus, a key finding of our study is that it may be prudent to restrict to using common configurations and platforms when using open source systems such as Linux in applications with stringent reliability expectations. Finally, our study of the architectural properties of the bugs suggests that the dependence among the modules rather than the unreliabilities of the individual modules is the primary cause of the bugs and their impact on system reliability.  相似文献   

10.
Inconsistent identifiers make it difficult for developers to understand source code. In particular, large software systems written by several developers can be vulnerable to identifier inconsistency. Unfortunately, it is not easy to detect inconsistent identifiers that are already used in source code. Although several techniques have been proposed to address this issue, many of these techniques can result in false alarms since such techniques do not accept domain words and idiom identifiers that are widely used in programming practice. This paper proposes an approach to detecting inconsistent identifiers based on a custom code dictionary. It first automatically builds a Code Dictionary from the existing API documents of popular Java projects by using an Natural Language Processing (NLP) parser. This dictionary records domain words with dominant part-of-speech (POS) and idiom identifiers. This set of domain words and idioms can improve the accuracy when detecting inconsistencies by reducing false alarms. The approach then takes a target program and detects inconsistent identifiers of the program by leveraging the Code Dictionary. We provide CodeAmigo, a GUI-based tool support for our approach. We evaluated our approach on seven Java based open-/proprietary- source projects. The results of the evaluations show that the approach can detect inconsistent identifiers with 85.4 % precision and 83.59 % recall values. In addition, we conducted an interview with developers who used our approach, and the interview confirmed that inconsistent identifiers frequently and inevitably occur in most software projects. The interviewees then stated that our approach can help to better detect inconsistent identifiers that would have been missed through manual detection.  相似文献   

11.
PDF文件中可识别图像的提取   总被引:3,自引:3,他引:0  
PDF(portable documentformat)文件是用于电子文档分发的理想格式,是全球电子文档分发的开放式标准.从PDF文件中提取可供识别的图像,有利于图像识别和信息处理.详细介绍了一种从PDF文件中提取可识别图像的方案.  相似文献   

12.
Memory‐related program failures in multithreaded programs can be caused by a variety of bugs. Concurrency bugs can occur due to unexpected or incorrect thread interleavings during execution. Other kinds of memory bugs, such as buffer overflows and uninitialized reads, may also occur in multithreaded as well as single‐threaded programs. Most prior techniques for isolating these bugs are specialized, addressing only one type of concurrency bug or certain types of other memory bugs. The memory corruption caused by these bugs can also undergo significant propagation during program execution. When a program failure finally occurs due to memory corruption, the true root cause of the failure may be effectively concealed as significant portions of memory may have become corrupted. We propose a general framework that can isolate the root cause of any failure in a multithreaded program that involves memory corruption and reveals at least a subset of this memory corruption. This includes three important types of concurrency bugs—data races, atomicity violations, and order violations—as well as other kinds of memory bugs. To account for propagation of memory corruption, our approach uses a dynamic technique called ‘execution suppression’ that iteratively reveals memory corruption in a failing execution to isolate the true root cause of the failure. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

13.
在当今的计算机世界里,Microsoft Word的Doc格式和Adobe Acrobat的PDF格式是使用率最高的两种文档格式,其中PDF是Internet上进行电子文档发行和数字化传播的理想格式。文章介绍了一些日常使用PDF文档的体会,希望有助于读者更好地应用PDF文档。  相似文献   

14.
聚类技术能将大规模数据按照数据的相似性划分成用户可迅速理解的簇.从而使用户更快地了解大量文档中所包含的内容。因此.聚类技术成为搜索引擎中不可或缺的部分和研究热点。Web上的AJAX应用和PowerPoint文件等弱链接文档由于缺乏足够的超链接信息,导致搜索该类文档时.排序结果不佳。针对该问题.给出一个弱链接文档的搜索引擎框架,并重点描述一个基于网页搜索结果的弱链接文档排序算法.基于聚类的弱链接文档排序算法利用聚类算法从高质量的网页搜索结果中提取与查询相关的主题.并根据主题的相关网页的排名确定该主题的重要性.根据识别的带权重的主题计算弱链接文档的排序值。实验结果表明该算法能够为弱链接文档产生较好的排序结果.  相似文献   

15.
基于PDF文档作为掩体的信息隐写方法   总被引:1,自引:0,他引:1  
日前应用极为广泛的PDF文档,发现了其中存在可以用作信息隐写的隐密信道。通过采用以一定的冗余换取安全性的策略,并使用基于混沌模型的随机选择隐写单元的方法,使隐写系统满足Kerckhoffs原理。分析和实验结果表明,该文隐写方法可嵌入任意大小的信息,并保持在PDF阅读器中显示的透明性,具有较好的简单实用性。  相似文献   

16.
现有的从PDF文档抽取文本内容的方法(如PDFBox类库采用的方法)处理速度较低,无法满足高速网络中内容分析的需求,也不能对网络中部分到达的PDF数据包进行流式的处理。为此,提出了基于自动机理论的PDF文本内容抽取方法。该方法通过建立具有层次的关键字自动机,可以快速地抽取完整PDF文档和不完整PDF文档中的文本内容。在中文和英文PDF文档数据集下的实验结果表明,基于自动机理论的PDF文本内容抽取方法耗时仅为PDFBox方法的17%~37%。  相似文献   

17.
As developers face an ever-increasing pressure to engineer secure software, researchers are building an understanding of security-sensitive bugs (i.e. vulnerabilities). Research into mining software repositories has greatly increased our understanding of software quality via empirical study of bugs. Conceptually, however, vulnerabilities differ from bugs: they represent an abuse of functionality as opposed to insufficient functionality commonly associated with traditional, non-security bugs. We performed an in-depth analysis of the Chromium project to empirically examine the relationship between bugs and vulnerabilities. We mined 374,686 bugs and 703 post-release vulnerabilities over five Chromium releases that span six years of development. We used logistic regression analysis, ranking analysis, bug type classifications, developer experience, and vulnerability severity metrics to examine the overarching question: are bugs and vulnerabilities in the same files? While we found statistically significant correlations between pre-release bugs and post-release vulnerabilities, we found the association to be weak. Number of features, source lines of code, and pre-release security bugs are, in general, more closely associated with post-release vulnerabilities than any of our non-security bug categories. In further analysis, we examined sub-types of bugs, such as stability-related bugs, and the associations did not improve. Even the files with the most severe vulnerabilities (by measure of CVSS or bounty payouts) did not show strong correlations with number of bugs. These results indicate that bugs and vulnerabilities are empirically dissimilar groups, motivating the need for security engineering research to target vulnerabilities specifically.  相似文献   

18.
动态测试用例生成技术是一类新兴的软件测试技术。由于使用该类技术无需任何人工干预,也无需验证人员具备任何专业知识,同时该类技术能够无误地发现程序错误,越来越多的研究者采用该技术查找预发布的二进制级软件错误。然而,已有的该类技术及其实现系统不具有可重定向性,只能处理面向某种特定指令集体系结构(ISA)的二进制代码,进行测试用例的生成与查错。本文提出了一种全新的指令集体系结构无关的二进制级动态测试用例生成技术,以及实现该技术的系统Hunter。与已有的动态测试用例生成技术不同,Hunter具有极强的可重定向性,可对任何指令集体系结构的二进制代码进行查错,定向地为其生成指向不同执行路径的测试用例。Hunter定义了一套元指令集体系结构(MetaISA),将在二进制代码执行过程中收集到的所有执行信息映射为MetaISA,并对生成的MetaISA序列进行符号化执行、约束收集、约束求解以及测试用例生成,从而使整个过程与ISA无关。我们实现了Hunter,将其重定向至32位x86、PowerPC和Sparc ISA,并使用该系统为6个含有已知错误的测试程序查错。实验结果表明,由于MetaISA的引入,只需很小的开销,Hunter系统即可容易且有效地重定向至不同的ISA,并且Hunter能够有效地发现面向32位x86、PowerPC和Sparc ISA编写的二进制应用中隐藏极深的错误。  相似文献   

19.
20.
Maps are one of the most valuable documents for gathering geospatial information about a region. Yet, finding a collection of diverse, high-quality maps is a significant challenge because there is a dearth of content-specific metadata available to identify them from among other images on the Web. For this reason, it is desirous to analyze the content of each image. The problem is further complicated by the variations between different types of maps, such as street maps and contour maps, and also by the fact that many high-quality maps are embedded within other documents such as PDF reports. In this paper, we present an automatic method to find high-quality maps for a given geographic region. Not only does our method find documents that are maps, but also those that are embedded within other documents. We have developed a Content-Based Image Retrieval (CBIR) approach that uses a new set of features for classification in order to capture the defining characteristics of a map. This approach is able to identify all types of maps irrespective of their subject, scale, and color in a highly scalable and accurate way. Our classifier achieves an F1-measure of 74%, which is an 18% improvement over the previous work in the area.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号