Similar documents
 20 similar documents found (search time: 15 ms)
1.
The search for a reliable expression to measure an author's lexical richness has constituted many statisticians' holy grail over the last decades in their attempt to solve some controversial authorship attributions. The greatest effort has been devoted to finding a formula grounded on the computation of tokens, word types, most-frequent word(s), hapax legomena, hapax dislegomena, etc., such that it would characterize a text successfully, independently of its length. In this line, Yule's K and Zipf's Z seem to be generally accepted by scholars as reliable measures of lexical repetition and lexical richness, computing content and function words altogether. Given the latter's higher frequency, function words prove to be more reliable identifiers when computed in isolation in PCA- and Delta-based attribution studies, and their ratio to content words also measures the functional density of a text. In this paper, we aim to show that each constant serves to measure a specific feature and that, as such, they complement one another, since a text that is supposedly rich in terms of its lemmas does not necessarily have to be characterized by low functional density, and vice versa. For this purpose, an annotated corpus of the West Saxon Gospels (WSG) and Apollonius of Tyre (AoT) has been used along with a huge raw corpus.
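Yule's K mentioned in this abstract has a standard closed form, K = 10⁴ (Σᵢ i²Vᵢ − N)/N², where N is the token count and Vᵢ the number of word types occurring exactly i times. A minimal sketch of that computation follows; the toy text is invented for illustration and is not taken from the paper's corpora:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's characteristic K, a length-independent measure of lexical
    repetition: higher K means more repetition (lower lexical richness)."""
    n = len(tokens)
    type_freqs = Counter(tokens)              # frequency of each word type
    vi = Counter(type_freqs.values())         # V_i: number of types occurring i times
    m2 = sum(i * i * v for i, v in vi.items())
    return 10_000 * (m2 - n) / (n * n)

tokens = "the cat sat on the mat and the cat slept".split()
print(yules_k(tokens))  # prints 800.0
```

Because K depends only on the frequency spectrum Vᵢ normalized by N², it stays roughly stable as a text grows, which is why the abstract treats it as length-independent.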

2.
This paper considers the question of authorship attribution techniques when faced with a pastiche. We ask whether the techniques can distinguish the real thing from the fake, or whether the author can fool the computer. If the latter, is this because the pastiche is good, or because the technique is faulty? Using a number of mainly vocabulary-based techniques, Gilbert Adair's pastiche of Lewis Carroll, Alice Through the Needle's Eye, is compared with the original 'Alice' books. The standard measures of lexical richness, Yule's K and Orlov's Z, both distinguish Adair from Carroll, though Z also distinguishes the two originals from each other. A principal component analysis based on word frequencies finds that the main differences are not due to authorship. A discriminant analysis based on word usage and lexical richness successfully distinguishes the pastiche from the originals. Weighted cusum tests were also unable to distinguish the two authors in a majority of cases. As a cross-validation, we made similar comparisons with control texts: another children's story from the same era, and other work by Carroll and Adair. The implications of these findings are discussed.

3.
4.
The statement 'Results of most non-traditional authorship attribution studies are not universally accepted as definitive' is explicated. A variety of problems in these studies are listed and discussed: studies governed by expediency; a lack of competent research; flawed statistical techniques; corrupted primary data; lack of expertise in allied fields; a dilettantish approach; inadequate treatment of errors. Various solutions are suggested: construct a correct and complete experimental design; educate the practitioners; study style in its totality; identify and educate the gatekeepers; develop a complete theoretical framework; form an association of practitioners. This revised version was published online in July 2006 with corrections to the Cover Date.

5.
Markov chains are used as a formal mathematical model for sequences of elements of a text. This model is applied for authorship attribution of texts. As elements of a text, we consider sequences of letters or sequences of grammatical classes of words. It turns out that the frequencies of occurrences of letter pairs and pairs of grammatical classes in a Russian text are rather stable characteristics of an author and, apparently, they could be used in disputed authorship attribution. A comparison of results for various modifications of the method using both letters and grammatical classes is given. Experimental research involves 385 texts of 82 writers. In the Appendix, the research of D.V. Khmelev is described, where data compression algorithms are applied to authorship attribution.
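The letter-pair idea described above can be sketched as follows: train a first-order Markov model on each candidate author and attribute the disputed text to whichever model assigns it the higher likelihood. The toy texts, add-one smoothing, and alphabet size here are illustrative assumptions, not the authors' exact procedure (which was applied to Russian text):

```python
import math
from collections import Counter

def bigram_model(text):
    """First-order Markov model over characters: counts of letter pairs
    and of pair-initial letters."""
    pairs = Counter(zip(text, text[1:]))
    starts = Counter(text[:-1])
    return pairs, starts

def log_likelihood(text, model, alphabet=27):
    """Log-probability of a text under a character-bigram model, with
    add-one smoothing over an assumed alphabet size."""
    pairs, starts = model
    return sum(
        math.log((pairs[(a, b)] + 1) / (starts[a] + alphabet))
        for a, b in zip(text, text[1:])
    )

# Attribute a disputed text to the candidate whose model fits it best.
author_a = bigram_model("the quick brown fox jumps over the lazy dog " * 20)
author_b = bigram_model("zzzz xq qx zzzz " * 50)
disputed = "the lazy fox"
print(log_likelihood(disputed, author_a) > log_likelihood(disputed, author_b))
# prints True: the disputed text's letter pairs match author A's habits
```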

6.
The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool, using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly chosen group and performs better than a lexically based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.

7.
Authorship Attribution with Support Vector Machines
In this paper we explore the use of text-mining methods for the identification of the author of a text. We apply the support vector machine (SVM) to this problem: as it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs, and adjectives and replaced them with grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even when the author wrote about different topics.
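The setup described above, a linear SVM over the raw frequency vector of all words with no feature selection, can be sketched with scikit-learn (an assumption; the paper's own implementation and its German newspaper corpus are not reproduced here, and the texts and author labels below are invented toy data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "the market rose sharply as investors bought stocks today",
    "investors sold their stocks and the market fell sharply",
    "whisk the eggs with sugar then fold in the flour gently",
    "the recipe needs flour sugar butter and four fresh eggs",
]
train_authors = ["A", "A", "B", "B"]

# Frequency vector over the full vocabulary: no feature selection.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# A linear SVM copes well with this very high-dimensional sparse input.
clf = LinearSVC()
clf.fit(X, train_authors)

test = vectorizer.transform(["stocks rose and investors bought more"])
print(clf.predict(test)[0])  # prints A
```

With a realistic corpus the feature space would run to hundreds of thousands of word forms, which is exactly the regime the abstract says the SVM handles without pruning.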

8.
9.
10.
Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier usin...

11.
Authorship identification is the analysis of an author's individual writing style. Although this task has been studied extensively in many languages, for Chinese the research has not yet reached the domain of classical poetry. Tang poems are at once discontinuous in their imagery and holistic as wholes; to accommodate both characteristics, this paper proposes a two-channel Capsule-Transformer ensemble model. The upper-channel Capsule model extracts features while reducing information loss, and so better captures the semantic features of each individual image in a Tang poem; the lower-channel Transformer model uses multi-head self-attention to fully learn the deep semantic information jointly conveyed by all the images. Experiments show that the proposed model is well suited to Tang-poem authorship identification. Through error analysis, and in view of the particularities of Tang-poem text, the paper also discusses the open problems of this task and the directions and challenges for future research.

12.
The basic assumption of quantitative authorship attribution is that the author of a text can be selected from a set of possible authors by comparing the values of textual measurements in that text to their corresponding values in each possible author's writing sample. Over the past three centuries, many types of textual measurements have been proposed, but never before have the majority of these measurements been tested on the same data set. A large-scale comparison of textual measurements is crucial if current techniques are to be used effectively and if new and more powerful techniques are to be developed. This article presents the results of a comparison of thirty-nine different types of textual measurements commonly used in attribution studies, in order to determine which are the best indicators of authorship. Based on the results of these tests, a more accurate approach to quantitative authorship attribution is proposed, which involves the analysis of many different textual measurements.

13.
14.
15.
16.
17.
This paper describes how traditional and non-traditional methods were used to identify seventeen previously unknown articles that we believe to be by Stephen Crane, published in the New-York Tribune between 1889 and 1892. The articles, printed without byline in what was at the time New York City's most prestigious newspaper, report on activities in a string of summer resort towns on New Jersey's northern shore. Scholars had previously identified fourteen shore reports as Crane's; these possible attributions more than double that corpus. The seventeen articles confirm how remarkably early Stephen Crane set his distinctive writing style and artistic agenda. In addition, the sheer quantity of the articles from the summer of 1892 reveals how vigorously the twenty-year-old Crane sought to establish himself in the role of professional writer. Finally, our discovery of an article about the New Jersey National Guard's summer encampment reveals another way in which Crane immersed himself in nineteenth-century military culture, and helps to explain how a young man who had never seen a battle could write so convincingly of war in his soon-to-come masterpiece, The Red Badge of Courage. We argue that the joint interdisciplinary approach employed in this paper should be the way in which attributional research is conducted.

18.
19.
Slicing is an important technique for analyzing concurrent programs. Targeting the shared-variable communication mechanism of multithreaded programs, this work constructs a program reachability graph on top of the basic program information extracted by the analysis tool CodeSurfer, builds a concurrent program dependence graph whose nodes are (program state, statement) pairs, and implements a prototype slicing system based on the reachability graph. Preliminary experimental results show that, compared with traditional slicing methods, slicing based on the program reachability graph effectively handles the non-transitivity of dependence relations and yields more precise slices of concurrent programs.

20.
Zhang Yang and Jiang Minghu. Acta Automatica Sinica (自动化学报), 2021, 47(11): 2501–2520
Authorship identification is an interdisciplinary field that infers the author of an unknown text from known texts. Traditional studies usually relied on experiential knowledge from literature or linguistics, whereas modern studies mainly quantify an author's writing style with mathematical methods. In recent years, with the development of cognitive science, systems science, and information technology, authorship identification has attracted growing attention from researchers. This paper surveys the methods and ideas of modern research in the field from the perspective of computational linguistics. It first briefly reviews the history of authorship identification; it then details stylistic features, attribution methods, and the multiple levels of research in the field; next it introduces the relevant evaluation campaigns, datasets, and evaluation metrics; finally, it points out open problems in the field and, in light of them, analyzes and forecasts development trends in authorship identification.
