This paper describes how traditional and non-traditional methods were used to identify seventeen previously unknown articles that we believe to be by Stephen Crane, published in the New-York Tribune between 1889 and 1892. The articles, printed without byline in what was at the time New York City's most prestigious newspaper, report on activities in a string of summer resort towns on New Jersey's northern shore. Scholars had previously identified fourteen shore reports as Crane's; these possible attributions more than double that corpus. The seventeen articles confirm how remarkably early Stephen Crane set his distinctive writing style and artistic agenda. In addition, the sheer quantity of the articles from the summer of 1892 reveals how vigorously the twenty-year-old Crane sought to establish himself in the role of professional writer. Finally, our discovery of an article about the New Jersey National Guard's summer encampment reveals another way in which Crane immersed himself in nineteenth-century military culture and helps to explain how a young man who had never seen a battle could write so convincingly of war in his soon-to-come masterpiece, The Red Badge of Courage. We argue that the joint interdisciplinary approach employed in this paper should be the way in which attributional research is conducted.

This paper considers the problem of quantifying literary style and looks at several variables which may be used as stylistic “fingerprints” of a writer. A review of work done on the statistical analysis of “change over time” in literary style is then presented, followed by a look at a specific application area, the authorship of Biblical texts.

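张洋 (Zhang Yang), 江铭虎 (Jiang Minghu). 《自动化学报》 (Acta Automatica Sinica), 2021, 47(11): 2501-2520
Authorship identification is an interdisciplinary field that infers the author of an unknown text from texts of known authorship. Traditional research usually relied on empirical knowledge from literature or linguistics, whereas modern research relies mainly on mathematical methods to quantify an author's writing style. In recent years, with the development of cognitive science, systems science, and information technology, authorship identification has attracted growing attention from researchers. This paper surveys the methods and ideas of modern authorship identification research, mainly from the perspective of computational linguistics. First, it briefly introduces the history of authorship identification. It then describes in detail stylistic features, identification methods, and the multi-level research in the field. Next, it introduces related evaluation campaigns, datasets, and evaluation metrics. Finally, it points out open problems in the field and, in light of them, analyzes and forecasts the development trends of authorship identification.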

In And Then There Were None, Ward Elliott and Robert Valenza report on the work of the Shakespeare Clinic (Claremont McKenna Colleges, 1987–1995). Working from popular theories that William Shakespeare is not the true author of the plays and poems ascribed to him, Elliott and Valenza cast a broad net to find another writer whose distinctive linguistic features match those of the Shakespeare canon. A regime of 51 tests was designed by which to compare Shakespeare's drama with 79 non-Shakespearean (or at least noncanonical) plays. Success rates at or near 100% are reported for the Elliott-Valenza tests in distinguishing Shakespeare from non-Shakespeare. A smaller battery of tests was designed for distinguishing Shakespeare poems from nondramatic texts by other poets, with similar success rates being reported. But many of the Elliott-Valenza tests are deeply flawed, both in their design and execution. Donald Foster is the Jean Webster Professor of Dramatic Literature in the Dept. of English at Vassar College.

A key word with regard to a sub-corpus is a word whose frequency in that sub-corpus is significantly higher than expected under the hypothesis that its use and the variable part of the corpus are mutually independent. A study in literary statistics almost invariably includes a chapter devoted to key words. However, a strong attack has recently been launched upon the way stylometry has been modelling texts since the classical works of Herdan, Guiraud or Muller. In fact statistical modelling seems as valid in stylistics as in any other field of the humanities and social sciences. What is questionable is the fact that many studies in literary statistics are more satisfied with the easy identification of monsters, i.e. literary phenomena unexplained by wrong models, than with the laborious search for models fitting the textual data well. A short examination of the mentioned controversy and the quantitative analysis of an example provided by Laclos' novel Les Liaisons dangereuses endeavour to support this argument. Christian Delcourt is a senior lecturer in the Department of Romance Philology at the University of Liège.
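The key-word definition above can be illustrated with a minimal sketch. It assumes a hypergeometric independence model with a normal approximation, which is one common choice in lexical statistics; the abstract does not say which test the author uses, and the function name `keyword_z_scores` and the one-sided cut-off of 1.645 are illustrative assumptions.

```python
import math
from collections import Counter


def keyword_z_scores(sub_tokens, rest_tokens):
    """Z-score of each word's frequency in the sub-corpus.

    Assumption: under independence, the count of a word in the sub-corpus
    follows a hypergeometric law (F occurrences spread over N tokens, of
    which n belong to the sub-corpus).
    """
    sub = Counter(sub_tokens)
    total = sub + Counter(rest_tokens)
    n = len(sub_tokens)            # size of the sub-corpus
    N = n + len(rest_tokens)       # size of the whole corpus
    p = n / N
    scores = {}
    for word, F in total.items():
        expected = F * p
        variance = F * p * (1 - p) * (N - F) / (N - 1)
        if variance > 0:           # skip degenerate cases
            scores[word] = (sub[word] - expected) / math.sqrt(variance)
    return scores


# Toy data: "letter" is over-represented in the sub-corpus.
sub_corpus = ["letter"] * 8 + ["the"] * 10
rest = ["letter"] * 2 + ["the"] * 40 + ["war"] * 20
z = keyword_z_scores(sub_corpus, rest)
key_words = [w for w, s in z.items() if s > 1.645]  # illustrative threshold
```

A word is flagged only when its observed count exceeds the expected count by more than chance would plausibly allow, which is the "significantly higher than expected" condition in the definition.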

Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. Moreover, we show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text and a certain robustness score associated with an attribution.
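One way to read "similarity-based methods along with multiple randomized feature sets" is sketched below. The details are assumptions rather than the authors' stated procedure: character 4-grams as features, cosine similarity, and a vote share used as the robustness score, with the attribution withheld when that share falls below a hypothetical threshold.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Counts of overlapping character n-grams (assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity restricted to the given feature subset."""
    dot = sum(a[f] * b[f] for f in features)
    na = math.sqrt(sum(a[f] ** 2 for f in features))
    nb = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (na * nb) if na and nb else 0.0


def attribute(anonymous, known, rounds=100, fraction=0.4, threshold=0.8, seed=0):
    """Return (author or None, vote share of the most frequent winner)."""
    rng = random.Random(seed)
    anon = char_ngrams(anonymous)
    profiles = {author: char_ngrams(text) for author, text in known.items()}
    vocab = sorted(set(anon).union(*profiles.values()))
    k = max(1, int(fraction * len(vocab)))
    votes = Counter()
    for _ in range(rounds):
        subset = rng.sample(vocab, k)   # one randomized feature set
        best = max(profiles, key=lambda au: cosine(anon, profiles[au], subset))
        votes[best] += 1
    author, count = votes.most_common(1)[0]
    score = count / rounds
    # Answer "don't know" (None) when the winner is not robust enough.
    return (author if score >= threshold else None), score


known_texts = {
    "A": "the cat sat on the mat and the cat ate the rat " * 5,
    "B": "quantum flux capacitors require zirconium plating " * 5,
}
guess, robustness = attribute("the cat sat on the mat " * 3, known_texts)
```

Returning `None` below the threshold is what allows for the case in which the true author is not among the candidates.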

Foster's critique of our work is overdrawn and has left our findings 99.9% intact.

This paper considers the question of authorship attribution techniques when faced with a pastiche. We ask whether the techniques can distinguish the real thing from the fake, or can the author fool the computer? If the latter, is this because the pastiche is good, or because the technique is faulty? Using a number of mainly vocabulary-based techniques, Gilbert Adair's pastiche of Lewis Carroll, Alice Through the Needle's Eye, is compared with the original 'Alice' books. Standard measures of lexical richness, Yule's K and Orlov's Z both distinguish Adair from Carroll, though Z also distinguishes the two originals. A principal component analysis based on word frequencies finds that the main differences are not due to authorship. A discriminant analysis based on word usage and lexical richness successfully distinguishes the pastiche from the originals. Weighted cusum tests were also unable to distinguish the two authors in a majority of cases. As a cross-validation, we made similar comparisons with control texts: another children's story from the same era, and other work by Carroll and Adair. The implications of these findings are discussed.
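For readers unfamiliar with Yule's K, here is a minimal sketch of the standard definition, K = 10^4 (Σ i² V_i − N) / N², where N is the number of tokens and V_i is the number of word types that occur exactly i times. Tokenisation by whitespace is an assumption made for brevity; the abstract does not describe the authors' preprocessing, and Orlov's Z is not shown.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's characteristic K; higher values mean more repetition."""
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    freq = Counter(tokens)               # word -> number of occurrences
    spectrum = Counter(freq.values())    # i -> number of types seen i times
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


k_varied = yules_k("a b c d".split())      # every word used once
k_repetitive = yules_k("a a b b".split())  # every word used twice
```

K is zero when no word repeats and grows as the vocabulary becomes more repetitive.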

In "Response to Elliott and Valenza, 'And Then There Were None'" (1996), Donald Foster has taken strenuous issue with our Shakespeare Clinic's final report, which concluded that none of the testable Shakespeare claimants, and none of the Shakespeare Apocrypha poems and plays – including Funeral Elegy by W.S. – match Shakespeare. Though he seems to accept most of our exclusions – notably excepting those of the Elegy and A Lover's Complaint – he believes that our methodology is nonetheless fatally flawed by "worthless figures ... wrong more often than right," "rigorous cherry-picking," "playing with a stacked deck," and "conveniently exil[ing] ... inconvenient data." He describes our tests as "foul vapor" and "methodological madness." We believe that this criticism is seriously overdrawn, and that our tests and conclusions have emerged essentially intact. By our count, he claims to have found 21 errors of consequence in our report. Only five of these claims, all trivial, have any validity at all. If fully proved, they might call for some cautions and slight refinements for five of our 54 tests, but in no case would they come close to invalidating the questioned test. The remaining 49 tests are wholly intact. Total erosion of our findings from the Foster critique could amount, at most, to half of one percent. None of his accusations of cherry-picking, deck-stacking, and evidence-ignoring are substantiated.

Authorship Attribution with Support Vector Machines (cited 1 time in total: 0 self-citations, 1 by others)
In this paper we explore the use of text-mining methods for the identification of the author of a text. We apply the support vector machine (SVM) to this problem: because it is able to cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs and adjectives and replaced them by grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even if the author wrote about different topics.

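李扬 (Li Yang), 张伟 (Zhang Wei), 彭晨 (Peng Chen). 《计算机应用》 (Journal of Computer Applications), 2020, 40(2): 473-478
The authorship identification task aims to determine the author of a document. However, existing authorship identification methods are all target-independent, meaning that they assume no constraining conditions when predicting authorship, which does not match real-world situations. To address authorship identification under such constraints, a target-dependent authorship identification method, TDAA, is proposed. First, the product ID associated with a user review is used as the constraining information. Second, to make the text modeling process more general, BERT is used to extract pre-trained features of the review text. Then, a convolutional neural network (CNN) is used for deep text feature extraction. Finally, to fuse the two kinds of information, two different fusion schemes are discussed. Experimental results on the Amazon movie review (Amazon Movie_and_TV) and CD review (CDs_and_Vinyl_5) datasets show that the proposed method improves precision by 4% to 5% over the comparison methods.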

Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier usin...

The statement, "Results of most non-traditional authorship attribution studies are not universally accepted as definitive," is explicated. A variety of problems in these studies are listed and discussed: studies governed by expediency; a lack of competent research; flawed statistical techniques; corrupted primary data; lack of expertise in allied fields; a dilettantish approach; inadequate treatment of errors. Various solutions are suggested: construct a correct and complete experimental design; educate the practitioners; study style in its totality; identify and educate the gatekeepers; develop a complete theoretical framework; form an association of practitioners. This revised version was published online in July 2006 with corrections to the Cover Date.

This study investigates the writing style change of two Turkish authors, Çetin Altan and Yaşar Kemal, in their old and new works using respectively their newspaper columns and novels. The style markers are the frequencies of word lengths in both text and vocabulary, and the rate of usage of most frequent words. For both authors, t-tests and logistic regressions show that the length of the words in new works is significantly longer than that of the old. The principal component analyses graphically illustrate the separation between old and new texts. The works are correctly categorized as old or new with 75 to 100% accuracy and 92% average accuracy using discriminant analysis-based cross validation. The results imply that a larger time gap may have a positive impact on separation and categorization. For Altan a regression analysis demonstrates a decrease in average word length as the age of his column increases. One interesting observation is that for one word each author has similar preference changes over time.
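The word-length comparison can be sketched as follows. This is an illustration under assumptions, not the study's procedure: the abstract says only that t-tests were used, so Welch's unequal-variance statistic is a swapped-in choice, and the helper names and toy sentences are hypothetical.

```python
import math
import re
import statistics


def word_lengths(text):
    """Lengths of alphabetic words; the pattern also matches letters such as Ç and ş."""
    return [len(w) for w in re.findall(r"[^\W\d_]+", text)]


def welch_t(a, b):
    """Welch's t statistic for the difference of two sample means."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))


# Toy stand-ins for an author's early and late writing.
old_text = "the cat sat on a mat by the sea"
new_text = "extraordinary circumstances necessitate comprehensive investigation"
t_stat = welch_t(word_lengths(new_text), word_lengths(old_text))
```

A large positive `t_stat` corresponds to the abstract's finding that words in the newer works are significantly longer; a real analysis would also convert the statistic to a p-value.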

A method is analysed and developed in which specific consecutive pairs of words (i.e. collocations) are deduced in order to distinguish between the works of different authors. The approach is designed for use in conjunction with two earlier proposals which are based, respectively, on the first word in every speech and on all the remaining words spoken on stage. Treating two plays of known authorship as anonymous, the new approach to collocations is found to assign each correctly from a group of five dramatists. For further verification, the technique is applied to Acts III, IV and V of Pericles. In accordance with literary scholarship, Shakespeare is selected as the author from a group of seven contemporaneous playwrights. Tests based on collocations show that Wilkins is more likely than either Chapman or the mature Shakespeare to have been the (main) writer of Acts I and II of Pericles, thus confirming the result obtained from both previous studies. Wilfrid Smith has a Ph.D. in Control Theory. He is a Fellow of the British Institute of Mathematics and its Applications and is at present a reader in the University of Ulster. His work is in the field of authorship of early English plays and has been recognized by an entry in the ninth edition of Who's Who in the World.
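As a rough illustration of attribution by consecutive word pairs, the sketch below uses a deliberately simple rule that is an assumption, not the method of the paper: a pair counts as a marker for an author if it appears in that author's known text and in no other candidate's, and the anonymous text is assigned to the author whose markers it uses most often. All function names and the toy lines are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Counts of consecutive word pairs, case-folded."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def marker_collocations(known):
    """Pairs found in one candidate's text and in no other candidate's."""
    counts = {author: bigrams(text) for author, text in known.items()}
    markers = {}
    for author, own in counts.items():
        others = set().union(*(set(c) for a, c in counts.items() if a != author))
        markers[author] = set(own) - others
    return markers


def assign(anonymous, known):
    """Return (best author, score per author) for an anonymous text."""
    anon = bigrams(anonymous)
    markers = marker_collocations(known)
    scores = {a: sum(anon[pair] for pair in m) for a, m in markers.items()}
    return max(scores, key=scores.get), scores


known_plays = {
    "X": "i pray you tell me i pray you go",
    "Y": "tell me now by my troth tell me now",
}
best, scores = assign("i pray you hear me by and by", known_plays)
```

On texts of realistic length the raw counts would need normalising by text size, and a tie would need an explicit rule; both are omitted here.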

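A Study of Authorship Identification for Modern and Contemporary Literary Works (cited 3 times in total: 0 self-citations, 3 by others)
This work mainly uses the SVM statistical machine learning model to study authorship identification for the works of eight representative figures of modern and contemporary Chinese literature. A variety of word-based statistics were selected as identification features, and a training method based on low-density, multiple features was adopted, achieving excellent identification performance on cross-genre works.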

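Research on Authorship Identification Methods Based on Semantic Analysis (cited 5 times in total: 0 self-citations, 5 by others)
Authorship identification is a widely applicable line of research. Its key problem is to extract from a work the features that represent its linguistic style and, based on these style features, to assess the stylistic similarity between works. Traditional identification methods mainly examine features that represent style, such as an author's word choice, sentence construction, and paragraph organization; among them, analysis based on the frequencies of punctuation marks and the most common function words is fairly widely accepted. Drawing on stylistics theory and using the HowNet knowledge base, this paper proposes a new similarity assessment method based on lexical semantic analysis that makes effective use of words other than function words and achieves good identification performance.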

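Authorship identification is the analysis of an author's personal writing style. Although this task has been widely studied in many languages, for Chinese the research has not yet addressed classical poetry. Tang poetry is characterized by both leaps between images and overall coherence; to account for both, this paper proposes a dual-channel Cap-Transformer ensemble model. The upper channel, a Capsule model, reduces information loss while extracting features and better captures the semantic features of each image in a Tang poem; the lower channel, a Transformer model, uses multi-head self-attention to fully learn the deep semantic information jointly reflected by all the images in a poem. Experiments show that the proposed model is suitable for the Tang poetry authorship identification task. Through error analysis, and in view of the particular nature of Tang poetry texts, the paper also discusses the current problems of the task, future research directions, and the challenges ahead.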

Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non‐natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n‐grams) or software metrics; and the classification technique that exploits those features, either information retrieval ranking or machine learning. The results of existing studies, however, are not directly comparable as all use different test beds and evaluation methodologies, making it difficult to assess which approach is superior. This paper summarises all previous techniques to source code authorship attribution, implements feature sets that are motivated by the literature, and applies information retrieval ranking methods or machine classifiers for each approach. Importantly, all approaches are tested on identical collections from varying programming languages and author types. Our conclusions are as follows: (i) ranking and machine classifier approaches are around 90% and 85% accurate, respectively, for a one‐in‐10 classification problem; (ii) the byte‐level n‐gram approach is best used with different parameters to those previously published; (iii) neural networks and support vector machines were found to be the most accurate machine classifiers of the eight evaluated; (iv) use of n‐gram features in combination with machine classifiers shows promise, but there are scalability problems that still must be overcome; and (v) approaches based on information retrieval techniques are currently more accurate than approaches based on machine learning. Copyright © 2012 John Wiley & Sons, Ltd.
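A byte-level n-gram approach with ranking might look like the sketch below. It is written in the spirit of profile-intersection methods and is not taken from the paper: the n-gram length of 6, the profile size of 1500, and the scoring by intersection size are all assumed parameters, and the abstract itself notes that the best parameters differ from previously published ones.

```python
from collections import Counter


def profile(source: bytes, n: int = 6, size: int = 1500) -> set:
    """The `size` most frequent byte n-grams of a source file."""
    grams = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in grams.most_common(size)}


def rank_authors(query: bytes, samples: dict, n: int = 6, size: int = 1500) -> list:
    """Rank candidate authors by how many profile n-grams they share with the query."""
    q = profile(query, n, size)
    scores = {author: len(q & profile(src, n, size)) for author, src in samples.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples: one author writes C-style loops, the other Python.
samples = {
    "alice": b"for (int i = 0; i < n; ++i) { total += values[i]; }\n" * 3,
    "bob": b"def add(xs):\n    return sum(x for x in xs)\n" * 3,
}
ranking = rank_authors(b"for (int j = 0; j < n; ++j) { total += values[j]; }\n", samples)
```

Working on raw bytes keeps layout habits such as spacing and brace placement in the profile, which is part of what makes the features author-specific.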

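A new method for assessing writing-style similarity is proposed. It exploits the characteristic ways in which different authors control sentence rhythm in order to distinguish their writing styles and thereby identify authorship. The method builds a rhythm feature matrix model to describe the sentence rhythm of a text, and uses a dot-product similarity algorithm and an improved KL distance algorithm to measure the differences between rhythm feature matrices. Experiments show that the method achieves high accuracy in identifying the authors of literary works.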
