Similar Documents
 A total of 20 similar documents were found (search time: 15 ms).
1.
Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger and short message service data, and they adversely affect off-the-shelf text mining methods. Most techniques address this problem with supervised methods that make use of hand-labeled corrections. These, however, require human-generated labels and corrections that are very expensive and time consuming to obtain because of the multilinguality and complexity of the corruptions. While we do not champion unsupervised methods over supervised ones when quality of results is the singular concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate a parallel labeled corpus. A generative-model-based unsupervised technique is presented that maps non-standard words to their corresponding conventional, frequent forms. A hidden Markov model (HMM) over a “subsequencized” representation of words is used, where a word is represented as a bag of weighted subsequences. The approximate maximum likelihood inference algorithm is such that the training phase involves clustering over vectors rather than the customary and expensive dynamic programming (Baum–Welch algorithm) over sequences that HMMs normally require. A principled transformation of the maximum-likelihood-based “central clustering” cost function of Baum–Welch into a “pairwise similarity” based clustering is proposed. This transformation makes it possible to apply “subsequence kernel” based methods that model delete and insert corruptions well. The novelty of this approach lies in the fact that the expensive Baum–Welch iterations required for HMMs can be avoided through an approximation of the log-likelihood function and by establishing a connection between the log-likelihood and a pairwise distance. Anecdotal evidence of efficacy is provided on public and proprietary data.
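As a rough illustration of the “bag of weighted subsequences” idea described above, the sketch below builds such a bag for a word and scores word pairs with a simple subsequence-kernel-style similarity. The subsequence length and gap-decay parameter are illustrative assumptions, not values taken from the paper, and this is not the authors’ actual algorithm.

```python
from itertools import combinations
from collections import defaultdict

def weighted_subsequences(word, k=2, lam=0.5):
    """Represent `word` as a bag of length-k character subsequences.

    Each subsequence is weighted by lam ** span, where span is the distance
    covered in the original word, so gappy subsequences (insert/delete noise)
    contribute less, as in subsequence kernels.
    """
    bag = defaultdict(float)
    for idxs in combinations(range(len(word)), k):
        span = idxs[-1] - idxs[0] + 1
        subseq = "".join(word[i] for i in idxs)
        bag[subseq] += lam ** span
    return dict(bag)

def similarity(w1, w2, k=2, lam=0.5):
    """Unnormalised subsequence-kernel-style similarity between two words."""
    b1 = weighted_subsequences(w1, k, lam)
    b2 = weighted_subsequences(w2, k, lam)
    return sum(v * b2.get(s, 0.0) for s, v in b1.items())

# A noisy SMS-style form stays close to its conventional spelling.
print(similarity("tomorrow", "tomorow"))   # higher
print(similarity("tomorrow", "morning"))   # lower
```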

2.
Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world where a reference collection is given. This article investigates the question of whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed “unmasking”, can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
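A minimal sketch of the chunking and style-model steps of such an analysis chain, using a few hand-picked surface features and a z-score outlier test in place of the paper’s one-class classifier and “unmasking” stage; the features, chunk size, and threshold are assumptions for illustration only.

```python
import re
import statistics

def style_features(chunk):
    """Very small stylometric feature vector for a text chunk."""
    words = re.findall(r"[A-Za-z']+", chunk)
    sentences = [s for s in re.split(r"[.!?]+", chunk) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0, 0.0)
    avg_word_len = sum(map(len, words)) / len(words)
    avg_sent_len = len(words) / len(sentences)
    type_token = len(set(w.lower() for w in words)) / len(words)
    return (avg_word_len, avg_sent_len, type_token)

def suspicious_chunks(text, chunk_size=2000, z_cut=2.0):
    """Flag chunks whose style deviates strongly from the document average."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    feats = [style_features(c) for c in chunks]
    flagged = set()
    for dim in range(3):
        values = [f[dim] for f in feats]
        mean = statistics.mean(values)
        sd = statistics.pstdev(values) or 1.0
        flagged.update(i for i, v in enumerate(values) if abs(v - mean) / sd > z_cut)
    return sorted(flagged)
```

In the chain described in the abstract, chunks flagged by such a style model would then be passed on to the one-class classifier and the unmasking meta-learner rather than reported directly.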

3.
A method to obtain a code representation of handwritten signatures is described, and an algorithm for signature verification based on such representations is proposed. Results of tests to determine efficient methods of image compression for the purpose of signature verification are presented. Konstantin Alekseev. Born 1979. Received a Master’s degree in engineering and technology (radio engineering) in 2002. Currently a post-graduate student at St. Petersburg State Electrotechnical University “LETI”, chair of television and video. Scientific interests: digital image processing and pattern recognition. Author of three papers. Svetlana Egorova. Born 1931. Graduated from St. Petersburg State Electrotechnical University “LETI” in 1955; received a Candidate’s degree (Eng.) in 1965; since 1968 a senior lecturer at the chair of television and video, St. Petersburg State Electrotechnical University “LETI”. Scientific interests: optical and digital image processing and compression methods in signal processing. Author of 141 papers.

4.
Methods are proposed for synchronizing the interaction of the digital devices of distributed systems using a common center that relays the signals from the devices. They are mostly intended to perform operations such as “all-to-all,” “all-to-one,” and “one-to-all.” The center substantially accelerates synchronization and improves the efficiency of the communication facilities interconnecting the devices.

5.
We introduce a new abstract model of database query processing, finite cursor machines, that incorporates certain data streaming aspects. The model describes quite faithfully what happens in so-called “one-pass” and “two-pass” query processing. Technically, the model is described in the framework of abstract state machines. Our main results are upper and lower bounds for processing relational algebra queries in this model, specifically, queries of the semijoin fragment of the relational algebra.
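For intuition about the “one-pass” style of processing the model captures, the sketch below computes the semijoin R ⋉ S with a single forward cursor over each relation, assuming both are sorted on the join attribute. This is our own illustrative example, not a formal finite cursor machine from the paper.

```python
def semijoin_sorted(r, s):
    """Compute the semijoin R ⋉ S on the first attribute, assuming both
    relations are sorted by that attribute and each is scanned with one
    forward cursor (no backtracking), in the spirit of one-pass processing."""
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            result.append(r[i])  # r[i] has at least one join partner in S
            i += 1
    return result

R = [(1, "a"), (2, "b"), (4, "c")]
S = [(2, "x"), (3, "y"), (4, "z")]
print(semijoin_sorted(R, S))  # [(2, 'b'), (4, 'c')]
```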

6.
7.
The detection and correction of false friends—also called real-word errors—is a notoriously difficult problem. On realistic data, the break-even point for automatic correction has so far not been reached: the number of additional infelicitous corrections outnumbered the useful corrections. We present a new approach in which we first compute a profile of the error channel for the given text. During the correction process, the profile (1) helps to restrict attention to a small set of “suspicious” lexical tokens of the input text where it is “plausible” to assume that the token represents a false friend. In this way, recognition of false friends is improved. Furthermore, the profile (2) helps to isolate the “most promising” correction suggestion for “suspicious” tokens. Using conventional word trigram statistics for disambiguation, we obtain a correction method that can be successfully applied to unrestricted text. In experiments on OCR documents, we show significant accuracy gains from fully automatic correction of false friends.
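A toy sketch of the word-trigram disambiguation step: candidate corrections for a “suspicious” token are ranked by a smoothed trigram score of the surrounding context. The tiny count table, smoothing constant, and vocabulary size are invented for illustration and do not come from the paper.

```python
import math
from collections import Counter

# Invented toy trigram counts; a real system would estimate these from a corpus.
TRIGRAM_COUNTS = Counter({
    ("the", "boat", "sailed"): 12,
    ("a", "warm", "coat"): 8,
})

def trigram_score(w1, w2, w3, alpha=0.5, vocab=10000):
    """Add-alpha smoothed trigram log score (for ranking only, illustrative)."""
    count = TRIGRAM_COUNTS[(w1, w2, w3)]
    total = sum(TRIGRAM_COUNTS.values())
    return math.log((count + alpha) / (total + alpha * vocab))

def best_correction(left, token, right, candidates):
    """Pick the candidate that makes the surrounding trigram most probable."""
    return max(candidates, key=lambda c: trigram_score(left, c, right))

# "coat" is a false friend for "boat" in this context.
print(best_correction("the", "coat", "sailed", ["coat", "boat"]))  # boat
```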

8.
Making use of the Kerr theorem for shear-free null congruences and of Newman’s representation for a virtual charge “moving” in complex space-time, we obtain an axisymmetric, time-dependent generalization of the Kerr congruence, with a singular ring that uniformly contracts to a point and then expands to infinity. Electromagnetic and complex eikonal field distributions are naturally associated with the obtained congruence, with the electric charge necessarily being a unit (“elementary”) charge.

9.
In order to be able to draw inferences about real-world phenomena from a representation expressed in a digital computer, it is essential that the representation have a rigorously correct algebraic structure. It is also desirable that the underlying algebra be familiar and provide a close modelling of those phenomena. The fundamental problem addressed in this paper is that, since computers do not support real-number arithmetic, the algebraic behaviour of the representation may not be correct and cannot directly model a mathematical abstraction of space based on real numbers. This paper describes a basis for the robust geometrical construction of spatial objects in computer applications using a complex called the “Regular Polytope”. In contrast to most other spatial data types, this definition supports a rigorous logic within a finite digital arithmetic. The definition of connectivity proves to be non-trivial, and alternatives are investigated. It is shown that these alternatives satisfy the relations of a region connection calculus (RCC) as used for qualitative spatial reasoning, and thus bring the rigor of that reasoning to geographical information systems. They also form what can reasonably be termed a “Finite Boolean Connection Algebra”. The rigorous and closed nature of the algebra ensures that these primitive functions and predicates can be combined to any desired level of complexity, and thus provide a useful toolkit for data retrieval and analysis. The paper argues for a model with two- and three-dimensional objects; these have been coded in Java and implement a full set of topological and connectivity functions, which is shown to be complete and rigorous.

10.
Conclusion: The program is adequate testimony that the I.M.L.—M.I.R. system can handle complicated musical procedures, and furthermore that the present computer staff format can easily be modified to print “normal” music symbols once music type-bars can be added to the printer.

11.
“There will always (I hope) be print books, but just as the advent of photography changed the role of painting or film changed the role of theater in our culture, electronic publishing is changing the world of print media. To look for a one-to-one transposition to the new medium is to miss the future until it has passed you by.”—Tim O’Reilly (2002). It is not hard to envisage that publishers will leverage subscribers’ information, interest groups’ shared knowledge, and other sources to enhance their publications. While this enhances the value of the publication through more accurate and personalized content, it also brings a new set of challenges to the publisher. Content is now web-driven and handled in a truly automated system; that is, no designer “re-touch” intervention is envisaged. This paper introduces an exploratory mapping strategy to allocate web-driven content in a highly graphical publication such as a traditional magazine. Two major aspects of the mapping are covered, which enable different levels of flexibility and address different content-flowing strategies. The last contribution is an evaluation of existing standards that could potentially leverage this work to incorporate flexible mapping and, subsequently, composition capabilities. The work published here is an extended version of the article presented at the Eighth ACM Symposium on Document Engineering in fall 2008 (Giannetti 2008).

12.
This paper proposes a “reading” of the church of San Lorenzo in Turin, designed by Guarino Guarini, through the philosophical notion of “fold” introduced by Gilles Deleuze. The paper consists of two parts. The first part explores the notion of “fold” in architecture and in philosophy, examining its use in the theory of Baroque architecture, the range of this new tool in contemporary architectural practice, and the fold as a fundamental condition for understanding the Baroque era. The second part applies the notion of the fold as a philosophical and conceptual framework for the “reading” of the chapel.

13.
The current work focuses on the implementation of a robust multimedia application for watermarking digital images, based on an innovative spread spectrum analysis algorithm for watermark embedding and on a content-based image retrieval technique for watermark detection. Existing highly robust watermark algorithms apply “detectable watermarks,” for which a detection mechanism checks whether the watermark exists (a Boolean decision) based on a watermarking key. The problem is that detecting a watermark in a digital image library containing thousands of images requires the detection algorithm to try all the keys on the digital images, which is inefficient for very large image databases. On the other hand, “readable” watermarks may prove weaker but are easier to detect, as only the detection mechanism is required. The proposed watermarking algorithm combines the advantages of both “detectable” and “readable” watermarks. The result is a fast and robust multimedia application that can cast readable multibit watermarks into digital images. The watermarking application is capable of hiding 2^14 different keys in digital images and casting multiple zero-bit watermarks onto the same coefficient area while maintaining a sufficient level of robustness.
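A bare-bones sketch of additive spread-spectrum embedding and correlation-based detection of a multibit watermark in a coefficient vector. The key-seeded ±1 carriers and the strength parameter are generic textbook choices assumed here, not the paper’s actual embedding or CBIR-based detection scheme.

```python
import numpy as np

def embed_multibit(coeffs, bits, key, strength=0.05):
    """Add one key-seeded pseudo-random +/-1 carrier per watermark bit."""
    rng = np.random.default_rng(key)
    marked = coeffs.astype(float).copy()
    for bit in bits:
        carrier = rng.choice([-1.0, 1.0], size=coeffs.shape)
        marked += strength * (1.0 if bit else -1.0) * carrier
    return marked

def detect_bits(marked, n_bits, key):
    """Recover the bits by correlating against the same key-seeded carriers."""
    rng = np.random.default_rng(key)
    return [int(np.dot(marked, rng.choice([-1.0, 1.0], size=marked.shape)) > 0)
            for _ in range(n_bits)]

coeffs = np.random.default_rng(0).normal(size=4096)  # stand-in for transform coefficients
bits = [1, 0, 1, 1, 0, 0, 1, 0]
print(detect_bits(embed_multibit(coeffs, bits, key=1234), len(bits), key=1234))
# expected, with high probability, to recover the embedded bit string
```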

14.
This paper investigates the prospects of Rodney Brooks’ proposal for AI without representation. It turns out that the supposedly characteristic features of “new AI” (embodiment, situatedness, absence of reasoning, and absence of representation) are all present in conventional systems: “new AI” is just like old AI. Brooks’ proposal boils down to the architectural rejection of central control in intelligent agents—which, however, turns out to be crucial. Some more recent cognitive science suggests that we might do well to dispose of the image of intelligent agents as central representation processors. If this paradigm shift is achieved, Brooks’ proposal for cognition without representation appears promising for full-blown intelligent agents—though not for conscious agents.

15.
Trust structures     
A general formal model for trust in dynamic networks is presented. The model is based on the trust structures of Carbone, Nielsen and Sassone: a domain-theoretic generalisation of Weeks’ framework for credential-based trust management systems, e.g., KeyNote and SPKI. Collections of mutually referring trust policies (so-called “webs” of trust) are given a precise meaning in terms of an abstract domain-theoretic semantics. A complementary concrete operational semantics is provided using the well-known I/O-automaton model. The operational semantics is proved to adhere to the abstract semantics, effectively providing a distributed algorithm allowing principals to compute the meaning of a “web” of trust policies. Several techniques allowing sound and efficient distributed approximation of the abstract semantics are presented and proved correct. BRICS: Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation.

16.
A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. The recognition approach labels text without using any a priori model. Labelling is based on part-of-speech (PoS) tagging, which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammatical categories and then reduced to canonical forms corresponding to the article fields “title” and “authors”. Non-labelled tokens are integrated into one field or the other either by applying PoS correction rules or by using a structure model generated from well-detected articles. The designed prototype performs very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals comprising 2,020 articles, together with a 93.0% rate of correct field extraction. Received April 5, 2000 / Revised February 19, 2001
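A much simplified sketch of the primary labelling idea: tokens of an OCR’d ToC line are tagged with small dictionaries and simple patterns, then grouped into “authors” and “title” fields. The mini-dictionaries and rules here are invented stand-ins for the paper’s PoS tagging and correction rules.

```python
import re

# Invented mini-dictionaries standing in for the paper's specific lexica.
FIRST_NAMES = {"john", "maria", "wei", "anne"}
NAME_MARKERS = {"and", ",", "&"}

def label_toc_line(line):
    """Split an OCR'd ToC line into a rough 'authors' / 'title' pair."""
    tokens = re.findall(r"[A-Za-z][A-Za-z.\-]*|,|&", line)
    authors, title = [], []
    in_authors = True
    for tok in tokens:
        low = tok.lower()
        looks_like_name = (low in FIRST_NAMES or low in NAME_MARKERS
                           or re.fullmatch(r"[A-Z][a-z]+|[A-Z]\.", tok) is not None)
        if in_authors and looks_like_name:
            authors.append(tok)
        else:
            in_authors = False
            title.append(tok)
    return {"authors": " ".join(authors), "title": " ".join(title)}

print(label_toc_line("John Smith and Maria Rossi  A study of document layout analysis"))
```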

17.
To date, most of the focus regarding digital preservation has been on replicating copies of the resources to be preserved from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies to new media) and migrating (conversion to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, due to the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs, this approach is unsuitable for web-scale preservation. The result is that difficult decisions need to be made as to what is saved and what is not. We provide an overview of our ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages, and we examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving approaches; rather, they focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the future. We characterize the preservation approaches by the level of effort required of the web administrator: web sites are reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments (“shared infrastructure preservation”); and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL representations of web pages (“web server enhanced preservation”).
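An illustrative sketch of the lexical-signature step behind “just-in-time preservation”: the few most distinctive terms of a page are selected by TF-IDF and could then be submitted as a search query to relocate the same or a similar page. The weighting scheme and signature length are common choices assumed here, not details taken from the article.

```python
import math
import re
from collections import Counter

def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF weight.

    `corpus` is a list of other documents used only to estimate IDF.
    """
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    doc_terms = tokenize(doc)
    tf = Counter(doc_terms)
    n_docs = len(corpus) + 1
    df = Counter()
    for other in corpus:
        for term in set(tokenize(other)):
            df[term] += 1

    def weight(term):
        return tf[term] * math.log(n_docs / (1 + df[term]))

    return sorted(set(doc_terms), key=weight, reverse=True)[:k]

page = "lazy preservation reconstructs lost web sites from search engine caches"
others = ["search engines crawl and cache web pages",
          "digital archives curate copies of web resources"]
print(lexical_signature(page, others))  # the page's most distinctive terms (tie order may vary)
```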

18.
Summary: Equivalence is a fundamental notion for the semantic analysis of algebraic specifications. In this paper the notion of “crypt-equivalence” is introduced and studied with respect to two “loose” approaches to the semantics of an algebraic specification T: the class of all first-order models of T and the class of all term-generated models of T. Two specifications are called crypt-equivalent if for one specification there exists a predicate logic formula which implicitly defines an expansion (by new functions) of every model of that specification in such a way that the expansion (after forgetting unnecessary functions) is homologous to a model of the other specification, and if vice versa there exists another predicate logic formula with the same properties for the other specification. We speak of “first-order crypt-equivalence” if this holds for all first-order models, and of “inductive crypt-equivalence” if this holds for all term-generated models. Characterizations and structural properties of these notions are studied. In particular, it is shown that first-order crypt-equivalence is equivalent to the existence of explicit definitions and that, in the case of “positive definability”, two first-order crypt-equivalent specifications admit the same categories of models and homomorphisms. Similarly, two specifications which are inductively crypt-equivalent via sufficiently complete implicit definitions determine the same associated categories. Moreover, crypt-equivalence is compared with other notions of equivalence for algebraic specifications: in particular, it is shown that first-order crypt-equivalence is strictly coarser than “abstract semantic equivalence” and that inductive crypt-equivalence is strictly finer than “inductive simulation equivalence” and “implementation equivalence”.

19.
According to John Haugeland, the capacity for “authentic intentionality” depends on a commitment to constitutive standards of objectivity. One of the consequences of Haugeland’s view is that a neurocomputational explanation cannot be adequate to understand “authentic intentionality”. This paper gives grounds to resist such a consequence. It provides the beginning of an account of authentic intentionality in terms of neurocomputational enabling conditions. It argues that the standards, which constitute the domain of objects that can be represented, reflect the statistical structure of the environments where brain sensory systems evolved and develop. The objection that I equivocate on what Haugeland means by “commitment to standards” is rebutted by introducing the notion of “florid, self-conscious representing”. Were the hypothesis presented plausible, computational neuroscience would offer a promising framework for a better understanding of the conditions for meaningful representation.

20.
Comprehension is the goal of reading. However, students often encounter reading difficulties due to a lack of background knowledge and proper reading strategies. Unfortunately, print text provides very limited assistance to one’s reading comprehension through its static knowledge representations such as symbols, charts, and graphs. Integrating digital materials and reading strategies into paper-based reading activities may bring opportunities for learners to make meaning of the print material. In this study, QR codes were adopted in association with mobile technology to deliver supplementary materials and questions to support students’ reading. QR codes were printed on the paper materials to provide direct access to digital materials and scaffolded questions. Smartphones were used to scan the printed QR codes to fetch pre-designed digital resources and scaffolded questions over the Internet. A quasi-experiment was conducted to evaluate the effectiveness of direct access to the digital materials prepared by the instructor using QR codes and that of scaffolded questioning in improving students’ reading comprehension. The results suggested that direct access to digital resources using QR codes does not significantly influence students’ reading comprehension; however, the reading strategy of scaffolded questioning significantly improves students’ understanding of the text. The survey showed that most students agreed that the integrated print-and-digital-material-based learning system benefits English reading comprehension but may not be as efficient as expected. The implications of the findings shed light on future improvement of the system.
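A minimal sketch of how a printed QR code can point at a supplementary digital resource, assuming the third-party Python qrcode package; the URL and file name are hypothetical and are not taken from the study.

```python
# Requires the third-party "qrcode" package (pip install qrcode[pil]).
import qrcode

# Hypothetical URL of a supplementary reading resource or scaffolded question.
resource_url = "https://example.edu/reading/unit3/scaffolded-question-1"

img = qrcode.make(resource_url)   # encode the URL as a QR code image
img.save("unit3_question1.png")   # print this image next to the relevant text passage
```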

