1.
A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
2.
Joseph Rudman 《Language Resources and Evaluation》1997,31(4):351-365
The statement, "Results of most non-traditional authorship attribution studies are not universally accepted as definitive," is explicated. A variety of problems in these studies are listed and discussed: studies governed by expediency; a lack of competent research; flawed statistical techniques; corrupted primary data; lack of expertise in allied fields; a dilettantish approach; inadequate treatment of errors. Various solutions are suggested: construct a correct and complete experimental design; educate the practitioners; study style in its totality; identify and educate the gatekeepers; develop a complete theoretical framework; form an association of practitioners.
This revised version was published online in July 2006 with corrections to the Cover Date.
3.
Kjell Härnqvist, Ulf Christianson, Daniel Ridings, Jan-Gunnar Tingsell 《Computers and the Humanities》2003,37(2):179-204
Responses in personal interviews about education and career with 415 Swedish men and women (age 34) form the basis of a speech corpus with 1.8 million words. The vocabulary is described by means of two sets of variables. One is based on the number of tokens and types, word length and sectioning of the running text. The other set divides the corpus into grammatical categories. Both sets of variables are related to a number of background variables such as gender, socioeconomic background, education, and indicators of verbal proficiency at age 13 and 32. This possibility to study the relationship between vocabulary and a broad set of respondent characteristics is a unique feature of this corpus.
4.
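This study uses an SVM statistical machine learning model to identify the authors of works by eight representative figures of modern and contemporary Chinese literature. Several vocabulary-based statistics were selected as identification features, and a training method based on low-density multiple features was adopted. The approach achieved excellent performance in authorship identification across genres.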
5.
This paper considers the question of authorship attribution techniques when faced with a pastiche. We ask whether the techniques can distinguish the real thing from the fake, or can the author fool the computer? If the latter, is this because the pastiche is good, or because the technique is faulty? Using a number of mainly vocabulary-based techniques, Gilbert Adair's pastiche of Lewis Carroll, Alice Through the Needle's Eye, is compared with the original 'Alice' books. Standard measures of lexical richness, Yule's K and Orlov's Z both distinguish Adair from Carroll, though Z also distinguishes the two originals. A principal component analysis based on word frequencies finds that the main differences are not due to authorship. A discriminant analysis based on word usage and lexical richness successfully distinguishes the pastiche from the originals. Weighted cusum tests were also unable to distinguish the two authors in a majority of cases. As a cross-validation, we made similar comparisons with control texts: another children's story from the same era, and other work by Carroll and Adair. The implications of these findings are discussed.
6.
The measure of lexical repetition constitutes one of the variables used to determine the lexical richness of literary texts, a value further employed in authorship attribution studies. Although most of the constants for lexical richness actually depend on text length, Yule's characteristic is considered to be highly reliable for being text length independent. It is not the aim of this paper to question the validity of K as a measure of the lexical repeat-rate, nor to evaluate its usefulness in authorship studies, but to review the most accurate procedure to calculate its value in the light of the lack of standardization found in the specific literature. At the same time, the peculiar calculation of Yule's K by TACT is explained. Our study suggests that standardization will certainly help improve the studies where K is employed.
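For orientation, here is a minimal sketch of one commonly cited formulation of Yule's K, K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where N is the number of tokens and V_i is the number of word types occurring exactly i times. The function name `yules_k` and the whitespace tokenization in the usage line are assumptions made for illustration. The abstract above notes that calculation procedures vary across the literature, so this is not necessarily the procedure that paper recommends, nor the one used by TACT.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K under the assumed formulation
    K = 10^4 * (sum_i i^2 * V_i - N) / N^2.

    tokens: a sequence of word tokens (tokenization is up to the caller).
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot compute Yule's K for an empty text")
    type_freqs = Counter(tokens)              # word type -> frequency
    spectrum = Counter(type_freqs.values())   # frequency i -> V_i
    s2 = sum(i * i * v_i for i, v_i in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


# Hypothetical usage with naive whitespace tokenization.
print(yules_k("a a b c".split()))
```

A text in which every token is a distinct type gives K = 0 under this formulation, and K grows as more tokens are repetitions of the same types.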
7.
8.
Jeff Collins, David Kaufer, Pantelis Vlachos, Brian Butler, Suguru Ishizaki 《Computers and the Humanities》2004,38(1):15-36
In author attribution studies function words or lexical measures are often used to differentiate the authors' textual fingerprints. These studies can be thought of as quantifying the texts, representing the text with measured variables that stand for specific textual features. The resulting quantifications, while proven useful for statistically differentiating among the texts, bear no resemblance to the understanding a human reader – even an astute one – would develop while reading the texts. In this paper we present an attribution study that, instead, characterizes the texts according to the representational language choices of the authors, similar to a way we believe close human readers come to know a text and distinguish its rhetorical purpose. From our automated quantification of The Federalist papers, it is clear why human readers find it impossible to distinguish the authorship of the disputed papers. Our findings suggest that changes occur in the processes of rhetorical invention when undertaken in collaborative situations. This points to a need to re-evaluate the premise of autonomous authorship that has informed attribution studies of The Federalist case.
9.
10.
Donald W. Foster 《Computers and the Humanities》1996,30(3):247-255
In And Then There Were None, Ward Elliot and Robert Valenza report on the work of the Shakespeare Clinic (Claremont McKenna Colleges, 1987–1995). Working from popular theories that William Shakespeare is not the true author of the plays and poems ascribed to him, Elliot and Valenza cast a broad net to find another writer whose distinctive linguistic features match those of the Shakespeare canon. A regime of 51 tests was designed whereby to compare Shakespeare's drama with 79 non-Shakespearean (or at least noncanonical) plays. Success rates at or near 100% are reported for the Elliot-Valenza tests in distinguishing Shakespeare from non-Shakespeare. A smaller battery of tests was designed for distinguishing Shakespeare poems from nondramatic texts by other poets, with similar success rates being reported. But many of the Elliot-Valenza tests are deeply flawed, both in their design and execution. Donald Foster is the Jean Webster Professor of Dramatic Literature in the Dept. of English at Vassar College.
11.
Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier usin...
12.
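To identify the true author of source code in a corpus, a neural network model, CPNN, that combines code coupling with program dependence graph (PDG) features is proposed for source code authorship identification. First, the coupling of the code is computed from features extracted from the source code, such as parameters, fan-in, and fan-out. Second, control and data dependencies are extracted from the converted program dependence graph, preprocessing techniques convert the PDG features into small instances with frequency details, and an inverse document frequency technique is used to amplify each...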
13.
14.
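The HSK is an international standardized test of Chinese language proficiency. The 650 "default words" listed in the appendix of the new HSK syllabus were mostly added by manual enumeration based on expert knowledge. Building on resources such as 《现代汉语词典》 (Contemporary Chinese Dictionary) and 《现代汉语语法信息词典》 (Grammatical Knowledge-base of Contemporary Chinese), this paper uses knowledge engineering methods and iteratively applies vocabulary-level analogy rules, such as character-deletion defaults and combination defaults. The aim is to make the implicit knowledge in the analogy process explicit and the scattered knowledge systematic, so that every step of vocabulary-level analogy follows stated rules and evidence. On this basis the paper completes a systematic analogy of vocabulary levels based on the new HSK syllabus. The analogy results are then filtered against the Chinese morphological knowledge base constructed for this purpose, yielding analogized levels for 23,762 words. Finally, a statistical analysis of the results shows that this work can help the new HSK vocabulary syllabus play a better guiding role in Chinese vocabulary leveling and text difficulty grading, and can serve as a reference for developing teaching vocabulary syllabi in other fields.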
15.
16.
Louis Milic 《Computers and the Humanities》1991,25(6):393-400
This paper attempts to assess the progress made in computational stylistics during the course of the past twenty-five years. First, we discuss some theoretical notions of style, and then we sketch certain trends that emerge from relevant articles appearing in a variety of publications including conference proceedings and academic journals (other than CHum). The conclusion is that progress has been mixed. Louis T. Milic is professor emeritus of English at Cleveland State University and secretary-treasurer of the Dictionary Society of North America. He has been active in quantitative stylistics since the 1960s and has recently completed work on the second of two period corpora, the Century of Prose Corpus.
17.
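Most current online latent Dirichlet allocation (LDA) algorithms rely on a fixed vocabulary. In practice the vocabulary often fails to match the corpus being processed, which limits the usefulness of the model. To address this, within the belief propagation (BP) framework, the topic-word distribution is made to follow a Dirichlet process and the formulas are re-derived, so that the vocabulary is empty before the model runs and newly discovered words are added to it continually during processing. Experiments show that this dynamic-vocabulary algorithm not only makes the vocabulary fit the corpus more closely, but also outperforms fixed-vocabulary LDA models on two metrics: perplexity and the mutual information index.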
18.
David I. Holmes 《Computers and the Humanities》1994,28(2):87-106
This paper considers the problem of quantifying literary style and looks at several variables which may be used as stylistic fingerprints of a writer. A review of work done on the statistical analysis of change over time in literary style is then presented, followed by a look at a specific application area, the authorship of Biblical texts. David Holmes is a Principal Lecturer in Statistics at the University of the West of England, Bristol with specific responsibility for co-ordinating the research programmes in the Department of Mathematical Sciences. He has taught literary style analysis to humanities students since 1983 and has published articles on the statistical analysis of literary style in the Journal of the Royal Statistical Society, History and Computing, and Literary and Linguistic Computing. He presented papers at the ACH/ALLC conferences in 1991 and 1993.
19.
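Vocabulary was once regarded as the "Cinderella" of foreign language teaching. Only in recent years has its importance in language development gradually been recognized, and much research has since been done on vocabulary acquisition. Vocabulary acquisition is divided into intentional learning and incidental acquisition. Incidental vocabulary acquisition is considered an important route to mastering a large number of words and has received great attention, and reading is widely recognized as one of its main channels. Overall, research on this topic in China is still limited, whereas research abroad is abundant, mostly empirical, and offers much to draw on. Moreover, most related studies in China either simply replicate empirical studies from abroad or discuss and speculate on theories from abroad. This paper therefore focuses on research abroad on incidental vocabulary acquisition through reading, so that readers can learn the current state of that field.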
20.
M. W. A. Smith 《Computers and the Humanities》1989,23(2):113-129
A method is analysed and developed in which specific consecutive pairs of words (i.e. collocations) are deduced in order to distinguish between the works of different authors. The approach is designed for use in conjunction with two earlier proposals which are based, respectively, on the first word in every speech and on all the remaining words spoken on stage. Treating two plays of known authorship as anonymous, the new approach to collocations is found to assign each correctly from a group of five dramatists. For further verification, the technique is applied to Acts III, IV and V of Pericles. In accordance with literary scholarship, Shakespeare is selected as the author from a group of seven contemporaneous playwrights. Tests based on collocations show that Wilkins is more likely than either Chapman or the mature Shakespeare to have been the (main) writer of Acts I and II of Pericles, thus confirming the result obtained from both previous studies.
Wilfrid Smith has a Ph.D. in Control Theory. He is a Fellow of the British Institute of Mathematics and its Applications and is at present a reader in the University of Ulster. His work is in the field of authorship of early English plays and has been recognized by an entry in the ninth edition of Who's Who in the World.