Similar Literature (20 records found)
1.
How Variable May a Constant be? Measures of Lexical Richness in Perspective
A well-known problem in the domain of quantitative linguistics and stylistics concerns the evaluation of the lexical richness of texts. Since the most obvious measure of lexical richness, the vocabulary size (the number of different word types), depends heavily on the text length (measured in word tokens), a variety of alternative measures have been proposed which are claimed to be independent of the text length. This paper has a threefold aim. Firstly, we have investigated to what extent these alternative measures are truly textual constants. We have observed that in practice all measures vary substantially and systematically with the text length. We also show that in theory, only three of these measures are truly constant or nearly constant. Secondly, we have studied the extent to which these measures tap into different aspects of lexical structure. We have found that there are two main families of constants, one measuring lexical richness and one measuring lexical repetition. Thirdly, we have considered to what extent these measures can be used to investigate questions of textual similarity between and within authors. We propose to carry out such comparisons by means of the empirical trajectories of texts in the plane spanned by the dimensions of lexical richness and lexical repetition, and we provide a statistical technique for constructing confidence intervals around the empirical trajectories of texts. Our results suggest that the trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.
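The length dependence that motivates the paper is easy to see directly from the simplest measure it mentions, the type-token ratio. A minimal sketch (illustrative only; the whitespace tokenization and the toy text are assumptions, not the paper's data):

```python
def type_token_ratio(tokens):
    """Vocabulary size (number of types) divided by text length (tokens)."""
    return len(set(tokens)) / len(tokens)

def ttr_curve(tokens, step=1000):
    """TTR measured on prefixes of increasing length."""
    return [(n, type_token_ratio(tokens[:n]))
            for n in range(step, len(tokens) + 1, step)]

# toy, highly repetitive "text": the TTR falls steadily as the prefix grows
text = ("the cat sat on the mat and the dog slept by the door " * 500).split()
for n, ttr in ttr_curve(text, step=2000):
    print(f"{n:6d} tokens  TTR = {ttr:.4f}")
```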

2.
This paper considers the question of authorship attribution techniques when faced with a pastiche. We ask whether the techniques can distinguish the real thing from the fake, or can the author fool the computer? If the latter, is this because the pastiche is good, or because the technique is faulty? Using a number of mainly vocabulary-based techniques, Gilbert Adair's pastiche of Lewis Carroll, Alice Through the Needle's Eye, is compared with the original 'Alice' books. Standard measures of lexical richness, Yule's K and Orlov's Z, both distinguish Adair from Carroll, though Z also distinguishes the two originals. A principal component analysis based on word frequencies finds that the main differences are not due to authorship. A discriminant analysis based on word usage and lexical richness successfully distinguishes the pastiche from the originals. Weighted cusum tests were also unable to distinguish the two authors in a majority of cases. As a cross-validation, we made similar comparisons with control texts: another children's story from the same era, and other work by Carroll and Adair. The implications of these findings are discussed.
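As a rough illustration of the word-frequency principal component analysis mentioned above, a sketch along these lines could be used; the chunk size, number of features, and input texts are placeholders, not the study's actual settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

def chunks(text, size=2000):
    """Split a text into consecutive chunks of `size` word tokens."""
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]

def word_frequency_pca(texts_by_label, top_n=50, n_components=2):
    labels, pieces = [], []
    for label, text in texts_by_label.items():
        for piece in chunks(text):
            labels.append(label)
            pieces.append(piece)
    vec = CountVectorizer(max_features=top_n)          # most frequent words overall
    counts = vec.fit_transform(pieces).toarray().astype(float)
    rel = counts / counts.sum(axis=1, keepdims=True)   # relative frequencies
    scores = PCA(n_components=n_components).fit_transform(rel)
    return labels, scores

# labels, scores = word_frequency_pca({"carroll": alice_text, "adair": needle_text})
# Plotting the first two components shows whether chunks group by author or by book.
```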

3.
4.
Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. Moreover, we show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
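The similarity-based attribution over multiple randomized feature sets can be sketched roughly as follows; the subsampling rate, cosine similarity, and acceptance threshold are illustrative assumptions, not the authors' published parameters:

```python
import random
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def attribute(anon_vec, candidate_vecs, n_trials=100, frac=0.5, threshold=0.9):
    """anon_vec: numpy feature vector of the anonymous text.
    candidate_vecs: dict author -> numpy feature vector of the same length.
    Returns (author, score) if one candidate wins at least `threshold` of the
    randomized trials, otherwise (None, score of the best candidate)."""
    n_features = len(anon_vec)
    wins = {a: 0 for a in candidate_vecs}
    for _ in range(n_trials):
        # pick a random subset of the features for this trial
        idx = np.array(random.sample(range(n_features), int(frac * n_features)))
        best = max(candidate_vecs,
                   key=lambda a: cosine(anon_vec[idx], candidate_vecs[a][idx]))
        wins[best] += 1
    top = max(wins, key=wins.get)
    score = wins[top] / n_trials       # a robustness score for the attribution
    return (top, score) if score >= threshold else (None, score)
```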

5.
Zhang Yang, Jiang Minghu. Acta Automatica Sinica, 2021, 47(11): 2501-2520
Authorship identification is an interdisciplinary field that infers the author of an unknown text from known texts. Traditional research is usually based on empirical knowledge from literary studies or linguistics, whereas modern research relies mainly on mathematical methods to quantify an author's writing style. In recent years, with the development of cognitive science, systems science, and information technology, authorship identification has attracted increasing attention from researchers. This paper surveys the methods and ideas of modern research in authorship identification mainly from the perspective of computational linguistics. First, the development of authorship identification is briefly reviewed. Then, stylometric features, authorship identification methods, and the multiple levels of research in this field are described in detail. Next, evaluations, datasets, and evaluation metrics related to authorship identification are introduced. Finally, some open problems in the field are pointed out, and the development trends of authorship identification are analyzed and discussed in light of these problems.

6.
Sentence Boundary Identification in Biomedical Literature Based on Contextual Word-Form Features
Addressing the characteristics of biomedical literature and the special requirements of information extraction, a sentence boundary identification algorithm based on contextual word-form features and supervised learning is proposed. Unlike sentence boundary identification algorithms designed for general written English, the proposed algorithm uses no special auxiliary word lists or syntactic-level feature information; it relies only on the word forms of the surrounding words as the basis for sentence boundary identification and disambiguation. Using these features, a maximum entropy recognizer and a support vector machine recognizer were designed and evaluated on Medline abstracts, achieving an accuracy above 99%. The experimental results show that the maximum entropy and support vector machine methods perform similarly on sentence boundary disambiguation. They also show that, for sentence boundary identification in biomedical literature, using only lexical-level features, without auxiliary word lists or syntactic-level information such as part of speech, can still reach the performance that other algorithms achieve on general written English with the help of auxiliary word lists and part-of-speech information.
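A minimal sketch of the idea of classifying candidate boundaries from the word forms of the neighbouring tokens only, using logistic regression as a stand-in maximum entropy model; the feature template and the toy training examples are assumptions, not the paper's setup:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def word_form_features(prev_token, next_token):
    """Only word-form information about the neighbouring tokens is used."""
    return {
        "prev": prev_token.lower(),
        "next": next_token.lower(),
        "prev_is_short": len(prev_token) <= 3,
        "prev_is_capitalized": prev_token[:1].isupper(),
        "next_is_capitalized": next_token[:1].isupper(),
        "next_is_digit": next_token[:1].isdigit(),
    }

# (token before the period, token after the period, is a sentence boundary?)
train = [
    ("cells", "The", True), ("al", "reported", False),
    ("Fig", "2", False), ("mice", "We", True),
    ("vitro", "In", True), ("e.g", "the", False),
]
X = [word_form_features(p, n) for p, n, _ in train]
y = [label for _, _, label in train]

clf = make_pipeline(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))
clf.fit(X, y)
# toy prediction; with real Medline training data this mirrors the described setup
print(clf.predict([word_form_features("protein", "These")]))
```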

7.
Vocabulary growth research can analyze how the TTR (type-token ratio) of a text changes over different periods. This paper takes the Chinese government work reports from 1954 to 2018 as its corpus, analyzes the growth curves of word tokens and word types in the texts, and explores the relationship between the lexical richness of the government work reports and policy. The paper first segments the corpus into words and then, based on goodness of fit, selects the better-fitting Heaps model for prediction. Taking China's "Five-Year Plan" as the basic time period, the models for each period ...
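The Heaps model mentioned above relates vocabulary size V to text length N as V(N) = K·N^β. A rough sketch of fitting it to a vocabulary-growth curve (the token stream and starting values are placeholders, and the reports would first need word segmentation):

```python
import numpy as np
from scipy.optimize import curve_fit

def vocabulary_growth(tokens, step=500):
    """Number of distinct word types seen after every `step` tokens."""
    seen, points = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return np.array(points, dtype=float)

def heaps(n, k, beta):
    """Heaps' law: vocabulary size as a power function of text length."""
    return k * n ** beta

# tokens = segment(report_text)        # assumed pre-segmented token list
# pts = vocabulary_growth(tokens)
# (k, beta), _ = curve_fit(heaps, pts[:, 0], pts[:, 1], p0=(10.0, 0.5))
# print(f"V(N) ≈ {k:.1f} * N^{beta:.3f}")
```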

8.
The measure of lexical repetition constitutes one of the variables used to determine the lexical richness of literary texts, a value further employed in authorship attribution studies. Although most of the constants for lexical richness actually depend on text length, Yule's characteristic K is considered highly reliable because it is independent of text length. It is not the aim of this paper to question the validity of K as a measure of the lexical repeat rate, nor to evaluate its usefulness in authorship studies, but to review the most accurate procedure for calculating its value in light of the lack of standardization found in the literature. At the same time, the peculiar calculation of Yule's K by TACT is explained. Our study suggests that standardization will certainly help improve the studies where K is employed.
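For reference, one widely cited formulation of Yule's K is K = 10⁴ · (Σᵢ i²·V(i) − N) / N², where V(i) is the number of word types occurring exactly i times and N is the number of tokens. A small sketch of that textbook version (the paper's point is precisely that implementations such as TACT's differ, so this should not be read as the definitive procedure):

```python
from collections import Counter

def yules_k(tokens):
    """Textbook Yule's K: 10^4 * (sum_i i^2 * V(i) - N) / N^2."""
    freqs = Counter(tokens)                 # word type -> number of occurrences
    n = sum(freqs.values())                 # N, total number of tokens
    spectrum = Counter(freqs.values())      # i -> V(i), types occurring i times
    s2 = sum(i * i * vi for i, vi in spectrum.items())
    return 10_000 * (s2 - n) / (n * n)

print(yules_k("the cat sat on the mat and the dog sat".split()))
```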

9.
The Middle Dutch Arthurian romance Roman van Walewein ('Romance of Gawain') is attributed in the text itself to two authors, Penninc and Vostaert. Very little quantitative research into this dual authorship has been done. This article describes our progress in applying different non-traditional authorship attribution methods to the text of Walewein. After providing an introduction to the romance and an overview of earlier research, we evaluate previous statements on authorship and stylistics by applying both Yule's measure of lexical richness and Burrows's Delta. To find out whether these new methods would confirm or even enhance our present knowledge about the differences between the two authors, we applied an adapted version of John Burrows's Delta procedure. The adapted version seems to be able to distinguish the double authorship of the romance. It also helps us to confirm some and to reject other earlier statements about the position in the text where the second author started his work.
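Burrows's Delta in its standard form is the mean absolute difference of z-scored relative frequencies of the most frequent words; a rough sketch of that baseline (the article's adapted version is not reproduced here, and the word count and inputs are placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def delta_matrix(texts, n_words=150):
    """texts: dict name -> text. Returns (names, pairwise Delta matrix)."""
    names = list(texts)
    vec = CountVectorizer(max_features=n_words)          # most frequent words
    counts = vec.fit_transform([texts[n] for n in names]).toarray().astype(float)
    rel = counts / counts.sum(axis=1, keepdims=True)      # relative frequencies
    z = (rel - rel.mean(axis=0)) / (rel.std(axis=0) + 1e-12)  # z-scores per word
    d = np.zeros((len(names), len(names)))
    for i in range(len(names)):
        for j in range(len(names)):
            d[i, j] = np.mean(np.abs(z[i] - z[j]))         # Delta distance
    return names, d

# names, d = delta_matrix({"penninc_part": text_a, "vostaert_part": text_b,
#                          "control": text_c})
```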

10.
Statistical information on a substantial corpus of representative Spanish texts is needed in order to determine the significance of data about individual authors or texts by means of comparison. This study describes the organization and analysis of a 150,000-word corpus of 30 well-known twentieth-century Spanish authors. Tables show the computational results of analyses involving sentences, segments, quotations, and word length. The article explains the considerations that guided content, selection, and sample size, and describes special editing needed for the input of Spanish text. Separate sections highlight and comment upon some of the findings. The corpus and the tables provide objective data for studies of homogeneity and heterogeneity. The format of the tables permits others to add to the original 30 authors, organize the results by categories, or use the cumulative results for normative comparisons. Estelle Irizarry is Professor of Spanish at Georgetown University and author of 20 books and annotated editions dealing with Hispanic literature, art, and hoaxes. Her latest book, an edition of Infortunios de Alonso Ramirez, treats the disputed authorship of Spanish America's first novel. She is Courseware Editor of CHum.

11.
Imitative texts of high quality are of some importance to students of attribution, especially those who use computational methods. The authorship of such texts is always likely to be difficult to demonstrate. In some cases, the identity of the author is a question of interest to literary scholars. Even when that is not so, students of attribution face a challenge. If we cannot distinguish between original and imitation in such cases, we must always concede that an imitator may have been at work. Shamela (1741) has always been regarded as a brilliant parody. When it is subjected to our standard common-words tests of authorship, it yields mixed results. A new procedure, in which special word-lists are established according to a predetermined set of rules, proves more effective. It needs, however, to be tried in other cases.

12.
In author attribution studies, function words or lexical measures are often used to differentiate the authors' textual fingerprints. These studies can be thought of as quantifying the texts, representing the text with measured variables that stand for specific textual features. The resulting quantifications, while proven useful for statistically differentiating among the texts, bear no resemblance to the understanding a human reader, even an astute one, would develop while reading the texts. In this paper we present an attribution study that, instead, characterizes the texts according to the representational language choices of the authors, similar to the way we believe close human readers come to know a text and distinguish its rhetorical purpose. From our automated quantification of The Federalist papers, it is clear why human readers find it impossible to distinguish the authorship of the disputed papers. Our findings suggest that changes occur in the processes of rhetorical invention when undertaken in collaborative situations. This points to a need to re-evaluate the premise of autonomous authorship that has informed attribution studies of The Federalist case.

14.
Markov chains are used as a formal mathematical model for sequences of elements of a text. This model is applied for authorship attribution of texts. As elements of a text, we consider sequences of letters or sequences of grammatical classes of words. It turns out that the frequencies of occurrences of letter pairs and pairs of grammatical classes in a Russian text are rather stable characteristics of an author and, apparently, they could be used in disputed authorship attribution. A comparison of results for various modifications of the method using both letters and grammatical classes is given. Experimental research involves 385 texts of 82 writers. In the Appendix, the research of D.V. Khmelev is described, where data compression algorithms are applied to authorship attribution.
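A minimal sketch of the letter-pair version of the model: estimate first-order transition probabilities per author and attribute the disputed text to the chain that gives it the highest log-likelihood. The alphabet, add-one smoothing, and preprocessing are illustrative assumptions:

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def letter_pairs(text):
    chars = [c for c in text.lower() if c in ALPHABET]
    return list(zip(chars, chars[1:]))

def train_chain(text):
    """Transition counts for a first-order Markov chain over letters."""
    pairs = Counter(letter_pairs(text))
    totals = Counter()
    for (a, _), c in pairs.items():
        totals[a] += c
    return pairs, totals

def log_likelihood(text, chain):
    """Add-one smoothed log-likelihood of the text under an author's chain."""
    pairs, totals = chain
    v = len(ALPHABET)
    return sum(math.log((pairs[(a, b)] + 1) / (totals[a] + v))
               for a, b in letter_pairs(text))

def attribute(anonymous, known_texts):
    chains = {author: train_chain(t) for author, t in known_texts.items()}
    return max(chains, key=lambda a: log_likelihood(anonymous, chains[a]))

# print(attribute(disputed_text, {"author_a": sample_a, "author_b": sample_b}))
```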

15.
Kevin Burns. Information Sciences, 2006, 176(11): 1570-1589
Bayesian inference provides a formal framework for assessing the odds of hypotheses in light of evidence. This makes Bayesian inference applicable to a wide range of diagnostic challenges in the field of chance discovery, including the problem of disputed authorship that arises in electronic commerce, counter-terrorism and other forensic applications. For example, when two documents are so similar that one is likely to be a hoax written from the other, the question is: Which document is most likely the source and which document is most likely the hoax? Here I review a Bayesian study of disputed authorship performed by a biblical scholar, and I show that the scholar makes critical errors with respect to several issues, namely: Causal Basis, Likelihood Judgment and Conditional Dependency. The scholar’s errors are important because they have a large effect on his conclusions and because similar errors often occur when people, both experts and novices, are faced with the challenges of Bayesian inference. As a practical solution, I introduce a graphical system designed to help prevent the observed errors. I discuss how this decision support system applies more generally to any problem of Bayesian inference, and how it differs from the graphical models of Bayesian Networks.
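The odds form of Bayes' rule that underlies such a study is: posterior odds = prior odds × product of likelihood ratios, which assumes the pieces of evidence are conditionally independent given each hypothesis (exactly the kind of assumption the review warns about). A toy sketch with made-up numbers:

```python
def posterior_odds(prior_odds, likelihood_ratios):
    """prior_odds: P(A is the source) / P(B is the source) before the evidence.
    likelihood_ratios: P(e | A is source) / P(e | B is source), one per piece of
    evidence, assumed conditionally independent given each hypothesis."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

# made-up prior and likelihood ratios, purely for illustration
odds = posterior_odds(prior_odds=1.0, likelihood_ratios=[3.0, 0.8, 2.5])
prob_a = odds / (1 + odds)
print(f"posterior odds = {odds:.2f}, P(A is the source) = {prob_a:.2f}")
```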

16.
This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, the error rate is reduced by approximately 50 percent for single writer data and by approximately 25 percent for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.
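The general idea of combining recognizer scores with a statistical language model can be sketched with a simple add-one-smoothed bigram model rescoring candidate transcriptions; the training text, candidate scores, and weighting below are placeholders, not the system described above:

```python
import math
from collections import Counter

def train_bigram_lm(text):
    """Return a function giving the add-one smoothed bigram log-probability."""
    words = text.lower().split()
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words)
    vocab = len(unigrams)
    def log_prob(sentence):
        toks = sentence.lower().split()
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:]))
    return log_prob

lm = train_bigram_lm("the quick brown fox jumps over the lazy dog the dog sleeps")

# candidate transcriptions with made-up optical scores from the recognizer
candidates = [("the dog sleeps", -2.1), ("the bog sleeps", -1.9)]
best = max(candidates, key=lambda c: c[1] + 0.8 * lm(c[0]))  # weighted combination
print(best[0])   # the language model overrides the slightly better optical score
```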

17.
Social media text is characteristically non-standard, and existing natural language processing tools are not very effective when applied directly to it; moreover, based on ...

18.
Modelling word or species frequency count data through zero-truncated Poisson mixture models allows one to interpret the model mixing distribution as the distribution of the word or species frequencies of the vocabulary or population. As a consequence, estimates of their mixing density can be used as a fingerprint of the style of the author in his texts or of the ecosystem in its samples. Definitions of measures of the evenness and of the diversity within a vocabulary or population are given, and the novelty of these definitions is explained. It is then proposed that the measures of the evenness and of the diversity of a vocabulary or population be approximated through the expectation of these measures under the word or species frequency distribution. That leads to the assessment of the lack of diversity through measures of the variability of the mixing frequency distribution estimates described above.
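The zero-truncated Poisson building block has pmf P(X = k) = e^(−λ) λ^k / (k! (1 − e^(−λ))) for k ≥ 1, and its maximum-likelihood λ solves λ / (1 − e^(−λ)) = x̄, the mean of the observed counts. A sketch of a single-component fit to a word-frequency spectrum (the paper works with mixtures, which are not reproduced here):

```python
import math
from collections import Counter
from scipy.optimize import brentq

def ztp_pmf(k, lam):
    """P(X = k) for a Poisson distribution truncated at zero (k >= 1)."""
    return math.exp(-lam) * lam ** k / (math.factorial(k) * (1 - math.exp(-lam)))

def fit_ztp(frequencies):
    """MLE of lambda: solves lam / (1 - exp(-lam)) = mean of the observed counts."""
    mean = sum(frequencies) / len(frequencies)
    return brentq(lambda lam: lam / (1 - math.exp(-lam)) - mean, 1e-9, 1e3)

tokens = ("to be or not to be that is the question "
          "whether tis nobler in the mind to suffer").split()
freqs = list(Counter(tokens).values())     # occurrences per word type (all >= 1)
lam = fit_ztp(freqs)
print(f"lambda = {lam:.3f}, P(freq = 1) = {ztp_pmf(1, lam):.3f}")
```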

19.
On using partial supervision for text categorization
We discuss the merits of building text categorization systems by using supervised clustering techniques. Traditional approaches for document classification on a predefined set of classes are often unable to provide sufficient accuracy because of the difficulty of fitting a manually categorized collection of documents in a given classification model. This is especially the case for heterogeneous collections of Web documents which have varying styles, vocabulary, and authorship. Hence, we investigate the use of clustering in order to create the set of categories and its use for classification of documents. Completely unsupervised clustering has the disadvantage that it has difficulty in isolating sufficiently fine-grained classes of documents relating to a coherent subject matter. We use the information from a preexisting taxonomy in order to supervise the creation of a set of related clusters, though with some freedom in defining and creating the classes. We show that the advantage of using partially supervised clustering is that it is possible to have some control over the range of subjects that one would like the categorization system to address, but with a precise mathematical definition of how each category is defined. An extremely effective way then to categorize documents is to use this a priori knowledge of the definition of each category. We also discuss a new technique to help the classifier distinguish better among closely related clusters.

20.
Authorship attribution of text documents is a “hot” domain in research; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem, that is, we attempt to complete the task of authorship attribution using unsupervised machine learning methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We have built a corpus of 6,079 responsa, composed by five authors who lived mainly in the 20th century and containing almost 10 M words. Clustering tasks were performed with two, three, four, or five authors. Clustering has been performed using three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered words (FFW) excluding function words, and words with the highest variance values (HVW); and two unsupervised machine learning methods: K-means and Expectation Maximization (EM). The best clustering tasks with two, three, or four authors achieved results above 98%, and the improvement rates were above 40% in comparison to the “majority” (baseline) results. The EM method has been found to be superior to K-means for the discussed tasks. FW was found to be the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding supports previous findings about the usefulness of function words for other tasks, such as authorship attribution using supervised ML methods, and genre and sentiment classification.
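A rough sketch of the clustering setup described above, using normalized frequencies of the most frequent words as features and both K-means and EM (here a Gaussian mixture); the corpus, feature count, and evaluation step are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_by_author(documents, n_authors, n_words=300):
    """Cluster documents by normalized most-frequent-word frequencies."""
    vec = CountVectorizer(max_features=n_words)         # FW list: keeps function words
    counts = vec.fit_transform(documents).toarray().astype(float)
    rel = counts / counts.sum(axis=1, keepdims=True)    # normalized frequencies
    km_labels = KMeans(n_clusters=n_authors, n_init=10,
                       random_state=0).fit_predict(rel)
    em_labels = GaussianMixture(n_components=n_authors, covariance_type="diag",
                                random_state=0).fit_predict(rel)  # EM clustering
    return km_labels, em_labels

# km, em = cluster_by_author(responsa_texts, n_authors=5)
# The cluster labels would then be aligned with the true authors to score accuracy.
```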
