Similar literature
 20 similar documents found; search time 15 ms
1.
How Variable May a Constant be? Measures of Lexical Richness in Perspective   (Total citations: 1; self-citations: 0; citations by others: 1)
A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures has been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
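The dependence of vocabulary-based measures on text length can be illustrated with the simplest such measure, the type-token ratio. The following is only a minimal sketch, not the paper's method: the function names `type_token_ratio` and `ttr_trajectory` and the toy sentence are assumptions made for illustration, and tokens are assumed to be already split on whitespace.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of word tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


def ttr_trajectory(tokens, step):
    """TTR of growing prefixes of the text, sampled every `step` tokens."""
    return [
        (n, type_token_ratio(tokens[:n]))
        for n in range(step, len(tokens) + 1, step)
    ]


# Toy text (hypothetical): repeated words make the ratio fall as the prefix grows.
sample = "the cat sat on the mat and the dog sat on the rug".split()
trajectory = ttr_trajectory(sample, step=4)
for n, ttr in trajectory:
    print(n, round(ttr, 3))
```

On this toy input the ratio falls as the prefix grows, which is the kind of systematic variation with text length that the abstract describes for real texts.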

2.
The statement, "Results of most non-traditional authorship attribution studies are not universally accepted as definitive," is explicated. A variety of problems in these studies are listed and discussed: studies governed by expediency; a lack of competent research; flawed statistical techniques; corrupted primary data; lack of expertise in allied fields; a dilettantish approach; inadequate treatment of errors. Various solutions are suggested: construct a correct and complete experimental design; educate the practitioners; study style in its totality; identify and educate the gatekeepers; develop a complete theoretical framework; form an association of practitioners. This revised version was published online in July 2006 with corrections to the Cover Date.

3.
Responses in personal interviews about education and career with 415 Swedish men and women (age 34) form the basis of a speech corpus with 1.8 million words. The vocabulary is described by means of two sets of variables. One is based on the number of tokens and types, word length and sectioning of the running text. The other set divides the corpus into grammatical categories. Both sets of variables are related to a number of background variables such as gender, socioeconomic background, education, and indicators of verbal proficiency at age 13 and 32. This possibility to study the relationship between vocabulary and a broad set of respondent characteristics is a unique feature of this corpus.

4.
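A Study of Authorship Identification for Modern and Contemporary Literary Works   (Total citations: 3; self-citations: 0; citations by others: 3)
Using a support vector machine (SVM) statistical machine-learning model, this study investigates authorship identification for the works of eight representative figures of modern and contemporary Chinese literature. A variety of word-based statistics were selected as identification features, and a training method based on low-density multiple features was adopted, achieving excellent performance in identifying the authors of works across genres.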

5.
This paper considers the question of authorship attribution techniques when faced with a pastiche. We ask whether the techniques can distinguish the real thing from the fake, or can the author fool the computer? If the latter, is this because the pastiche is good, or because the technique is faulty? Using a number of mainly vocabulary-based techniques, Gilbert Adair's pastiche of Lewis Carroll, Alice Through the Needle's Eye, is compared with the original 'Alice' books. Standard measures of lexical richness, Yule's K and Orlov's Z both distinguish Adair from Carroll, though Z also distinguishes the two originals. A principal component analysis based on word frequencies finds that the main differences are not due to authorship. A discriminant analysis based on word usage and lexical richness successfully distinguishes the pastiche from the originals. Weighted cusum tests were also unable to distinguish the two authors in a majority of cases. As a cross-validation, we made similar comparisons with control texts: another children's story from the same era, and other work by Carroll and Adair. The implications of these findings are discussed.

6.
The measure of lexical repetition constitutes one of the variables used to determine the lexical richness of literary texts, a value further employed in authorship attribution studies. Although most of the constants for lexical richness actually depend on text length, Yule's characteristic is considered to be highly reliable for being text length independent. It is not the aim of this paper to question the validity of K as a measure of the lexical repeat-rate, nor to evaluate its usefulness in authorship studies, but to review the most accurate procedure to calculate its value in the light of the lack of standardization found in the specific literature. At the same time, the peculiar calculation of Yule's K by TACT is explained. Our study suggests that standardization will certainly help improve the studies where K is employed.
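As an illustration of what such a calculation involves, here is a minimal sketch of one commonly quoted formulation of Yule's K, K = 10^4 · (Σ i²·V(i, N) − N) / N², where N is the number of tokens and V(i, N) is the number of word types occurring exactly i times. Since the abstract itself points to a lack of standardization, this should be read as one variant under stated assumptions, not as the procedure the paper recommends and not as TACT's calculation. The function name `yules_k` is hypothetical, and tokenization is assumed to have been done already.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K in the common 10^4 * (S2 - N) / N^2 formulation (an assumption)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    type_freqs = Counter(tokens)             # word type -> frequency
    spectrum = Counter(type_freqs.values())  # frequency i -> number of types V(i, N)
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


# A text with no repeated words has K = 0; repetition raises K.
k_no_repeats = yules_k(["a", "b", "c"])
k_with_repeat = yules_k(["a", "a", "b"])
print(k_no_repeats, k_with_repeat)
```

Different sources scale or normalize the sum differently, so values are only comparable when computed with the same formula and the same tokenization.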

7.
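Vocabulary growth research can analyze how the type-token ratio (TTR) of texts changes across periods. This paper takes the Chinese government work reports from 1954 to 2018 as its corpus, analyzes the curves of word tokens and word types in the texts, and explores the relationship between the lexical richness of the reports and policy. The corpus is first word-segmented, and the better-fitting Heaps model is then selected for prediction on the basis of curve-fitting quality. Taking China's "Five-Year Plans" as the basic time period, for each period the model pre...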
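A Heaps-type model describes vocabulary growth as V(N) = K · N^β, with V the number of word types seen after N tokens. The abstract is truncated and does not state how the fit was performed, so the following is only a sketch under assumptions: it fits K and β by ordinary least squares on the log-log curve, assumes the text is already word-segmented, and uses the hypothetical names `vocabulary_growth` and `fit_heaps`.

```python
import math


def vocabulary_growth(tokens):
    """Return (N, V(N)) pairs: tokens read so far and distinct types seen."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Synthetic check (hypothetical data): points lying exactly on V = 2 * N^0.5.
synthetic = [(n, 2 * n ** 0.5) for n in (1, 4, 9, 16, 100)]
k, beta = fit_heaps(synthetic)
print(round(k, 6), round(beta, 6))
```

For a real corpus, `vocabulary_growth(tokens)` would supply the curve; fitting in log space weights small N heavily, so a nonlinear fit on the raw curve may be a reasonable alternative.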

8.
In author attribution studies function words or lexical measures are often used to differentiate the authors' textual fingerprints. These studies can be thought of as quantifying the texts, representing the text with measured variables that stand for specific textual features. The resulting quantifications, while proven useful for statistically differentiating among the texts, bear no resemblance to the understanding a human reader, even an astute one, would develop while reading the texts. In this paper we present an attribution study that, instead, characterizes the texts according to the representational language choices of the authors, similar to a way we believe close human readers come to know a text and distinguish its rhetorical purpose. From our automated quantification of The Federalist papers, it is clear why human readers find it impossible to distinguish the authorship of the disputed papers. Our findings suggest that changes occur in the processes of rhetorical invention when undertaken in collaborative situations. This points to a need to re-evaluate the premise of autonomous authorship that has informed attribution studies of The Federalist case.

9.
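Whether Water Margin (《水浒传》) was written by a single author or jointly, and how Shi Nai'an and Luo Guanzhong are related, has long been disputed. This paper roughly groups the authorship dispute into five scenarios: written by Shi Nai'an; written by Luo Guanzhong; written by Shi and continued by Luo; written by Luo and continued by someone else; written by Shi and revised by Luo. Taking Luo Guanzhong's 《平妖传》 as a reference, it uses hypothesis testing, text clustering, text classification, and fluctuation-based stylometry, combined with an analysis of the text's content, to examine the writing style of Water Margin and to provide a reference for determining its authorship. The results show that only the scenario of Luo as author with a later continuation is highly likely, that is, the first 70 chapters were written by Luo Guanzhong and the rest was continued by others; the other four scenarios are all unlikely.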

10.
    
In And Then There Were None, Ward Elliot and Robert Valenza report on the work of the Shakespeare Clinic (Claremont McKenna Colleges, 1987–1995). Working from popular theories that William Shakespeare is not the true author of the plays and poems ascribed to him, Elliot and Valenza cast a broad net to find another writer whose distinctive linguistic features match those of the Shakespeare canon. A regime of 51 tests was designed whereby to compare Shakespeare's drama with 79 non-Shakespearean (or at least noncanonical) plays. Success rates at or near 100% are reported for the Elliot-Valenza tests in distinguishing Shakespeare from non-Shakespeare. A smaller battery of tests was designed for distinguishing Shakespeare poems from nondramatic texts by other poets, with similar success rates being reported. But many of the Elliot-Valenza tests are deeply flawed, both in their design and execution. Donald Foster is the Jean Webster Professor of Dramatic Literature in the Dept. of English at Vassar College.

11.
Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier usin...

12.
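To find the true author of source code in a corpus, a neural network model, CPNN, that combines code coupling with program dependence graph (PDG) features is proposed for identifying source-code authors. First, the coupling of the code is computed from features extracted from the source code, such as parameters, fan-in, and fan-out. Second, control and data dependencies are extracted from the converted program dependence graph, a preprocessing technique is applied to convert the PDG features into small instances with frequency details, and the inverse document frequency technique is used to amplify each ... in the source code...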

13.
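Image Search Based on a Vocabulary Tree   (Total citations: 2; self-citations: 0; citations by others: 2)
陈赟, 沈一帆. 《计算机工程》 (Computer Engineering), 2010, 36(6): 189-191
To address the low recall and slow matching speed of content-based image search, fuzzy quantization is applied on top of a vocabulary tree. SIFT features extracted from an image are fuzzily quantized into visual words via the vocabulary tree, so that each image is represented as a vector, and image similarity is measured by comparing these vectors. Experimental results show that the method effectively shortens response time and improves the recall of search results.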

14.
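The HSK is an international standardized test of Chinese language proficiency. The 650 "default words" listed in the appendix to the new HSK syllabus were mostly added by manual enumeration based on expert knowledge. Building on resources such as 《现代汉语词典》 (Contemporary Chinese Dictionary) and 《现代汉语语法信息词典》 (Grammatical Knowledge-base of Contemporary Chinese), this paper applies knowledge-engineering methods and iteratively uses vocabulary-level inference rules such as character-removal default and combination default. It aims to make the implicit knowledge in the inference process explicit and the scattered knowledge systematic, so that every step of vocabulary-level inference follows stated rules and evidence, and it completes a systematic inference of vocabulary levels based on the new HSK syllabus. The inferred results were then filtered using a purpose-built knowledge base of Chinese morphology, yielding inferred levels for 23,762 words. Finally, a statistical analysis of the results indicates that this work can help the new HSK vocabulary syllabus better guide Chinese vocabulary grading and text difficulty grading, and may also serve as a reference for developing teaching vocabulary syllabi in other fields.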

15.
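Vocabulary learning is the foundation of learning English. Traditional memory models use rote methods that have users memorize vocabulary on fixed time cycles; planning with these static models is complicated and does not help users memorize vocabulary effectively. To address this, an intelligent vocabulary memory model is proposed. Starting from the biological memory process, it uses a power function to quantify the Ebbinghaus memory curve, tracks the learning state of each word with that curve, reminds the user to review a word when it is on the verge of being forgotten, and dynamically adjusts the curve. Experimental results show that, compared with traditional memory models, the model produces precise review plans for users, reduces the time needed to master vocabulary by 37.04%, and achieves higher memory efficiency.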
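The scheduling idea described above (a power-function forgetting curve per word, a reminder when recall is about to drop, and an adjustment after each review) can be sketched as follows. The abstract does not give the paper's actual function or parameters, so everything here is a hypothetical assumption chosen for illustration: the retention form R(t) = (1 + t/s)^(-d), the threshold of 0.8, the decay of 0.5, the doubling rule, and the names `retention`, `hours_until_review`, and `update_stability`.

```python
def retention(t_hours, stability, decay=0.5):
    """Assumed power-law forgetting curve: R(t) = (1 + t / stability) ** -decay."""
    return (1.0 + t_hours / stability) ** (-decay)


def hours_until_review(stability, threshold=0.8, decay=0.5):
    """Time at which the assumed curve falls to `threshold` (solve R(t) = threshold)."""
    return stability * (threshold ** (-1.0 / decay) - 1.0)


def update_stability(stability, recalled, growth=2.0):
    """Hypothetical adjustment rule: flatten the curve after a successful
    review, steepen it after a failure (never below 1 hour)."""
    if recalled:
        return stability * growth
    return max(1.0, stability / growth)


# One word tracked through a successful review (all numbers are illustrative).
stability = 10.0
first_gap = hours_until_review(stability)
stability = update_stability(stability, recalled=True)
second_gap = hours_until_review(stability)
print(first_gap, second_gap)
```

With these assumed parameters the gap before the next review grows after each successful recall, which is the dynamic behaviour the abstract contrasts with fixed review cycles.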

16.
This paper attempts to assess the progress made in computational stylistics during the course of the past twenty-five years. First, we discuss some theoretical notions of style, and then we sketch certain trends that emerge from relevant articles appearing in a variety of publications including conference proceedings and academic journals (other than CHum). The conclusion is that progress has been mixed. Louis T. Milic is professor emeritus of English at Cleveland State University and secretary-treasurer of the Dictionary Society of North America. He has been active in quantitative stylistics since the 1960s and has recently completed work on the second of two period corpora, the Century of Prose Corpus.

17.
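张健伟, 严建峰, 刘晓升, 杨璐. 《计算机科学》 (Computer Science), 2016, 43(12): 120-124, 134
Most current online latent Dirichlet allocation (LDA) algorithms are based on a fixed vocabulary. In practice the vocabulary often does not match the corpus being processed, which limits the usefulness of the model. To address this, within the framework of the belief propagation (BP) algorithm, the topic-word distribution is made to follow a Dirichlet process and the formulas are re-derived, so that the vocabulary is empty before the model runs and newly discovered words are added to it continually during processing. Experiments show that this dynamic-vocabulary algorithm not only makes the vocabulary fit the corpus more closely, but also outperforms fixed-vocabulary LDA models on two metrics, perplexity and the mutual information index.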

18.
This paper considers the problem of quantifying literary style and looks at several variables which may be used as stylistic fingerprints of a writer. A review of work done on the statistical analysis of change over time in literary style is then presented, followed by a look at a specific application area, the authorship of Biblical texts. David Holmes is a Principal Lecturer in Statistics at the University of the West of England, Bristol with specific responsibility for co-ordinating the research programmes in the Department of Mathematical Sciences. He has taught literary style analysis to humanities students since 1983 and has published articles on the statistical analysis of literary style in the Journal of the Royal Statistical Society, History and Computing, and Literary and Linguistic Computing. He presented papers at the ACH/ALLC conferences in 1991 and 1993.

19.
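刘珊. 《办公自动化》 (Office Automation), 2010, (24): 32-36
Vocabulary was once regarded as the "Cinderella" of foreign language teaching. Only in recent years has its importance in language development gradually been recognized, and much research has since been done on vocabulary acquisition. Vocabulary acquisition is divided into intentional learning and incidental acquisition. Incidental vocabulary acquisition is considered an important way of mastering a large number of words and has received considerable attention, and reading is widely recognized as one of its main channels. Overall, research in this area is still limited in China, whereas studies abroad are numerous and mostly empirical, offering much to draw on. Moreover, most related domestic studies either simply repeat foreign empirical studies or discuss foreign theories speculatively. This paper therefore focuses on foreign research on incidental vocabulary acquisition through reading, so that readers can learn about the current state of this field abroad.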

20.
A method is analysed and developed in which specific consecutive pairs of words (i.e. collocations) are deduced in order to distinguish between the works of different authors. The approach is designed for use in conjunction with two earlier proposals which are based, respectively, on the first word in every speech and on all the remaining words spoken on stage. Treating two plays of known authorship as anonymous, the new approach to collocations is found to assign each correctly from a group of five dramatists. For further verification, the technique is applied to Acts III, IV and V of Pericles. In accordance with literary scholarship, Shakespeare is selected as the author from a group of seven contemporaneous playwrights. Tests based on collocations show that Wilkins is more likely than either Chapman or the mature Shakespeare to have been the (main) writer of Acts I and II of Pericles, thus confirming the result obtained from both previous studies. Wilfrid Smith has a Ph.D. in Control Theory. He is a Fellow of the British Institute of Mathematics and its Applications and is at present a reader in the University of Ulster. His work is in the field of authorship of early English plays and has been recognized by an entry in the ninth edition of Who's Who in the World.
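The raw material of such a method, counts of consecutive word pairs, is easy to illustrate. The abstract does not describe how the distinguishing pairs are selected or how the statistical test is built, so the following is only a sketch of the counting step plus a deliberately naive overlap score. The names `collocations` and `shared_collocations` and the toy sentences are hypothetical, and the score is not the paper's test.

```python
from collections import Counter


def collocations(tokens):
    """Count consecutive word pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def shared_collocations(anonymous_tokens, candidate_tokens):
    """Naive overlap score: number of bigram occurrences common to both texts."""
    overlap = collocations(anonymous_tokens) & collocations(candidate_tokens)
    return sum(overlap.values())


# Toy comparison (hypothetical texts, not real play data).
anonymous = "to be or not to be".split()
candidate_a = "not to be".split()
candidate_b = "or else to go".split()
score_a = shared_collocations(anonymous, candidate_a)
score_b = shared_collocations(anonymous, candidate_b)
print(score_a, score_b)
```

A real attribution study would need to normalize for text length and restrict attention to pairs that actually discriminate between authors; this sketch does neither.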
