Similar Documents
20 similar documents found (search time: 31 ms)
1.
Authorship attribution of text documents is a “hot” domain in research; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem, that is, we attempt the task of authorship attribution using unsupervised machine learning methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We have built a corpus of 6,079 responsa, composed by five authors who lived mainly in the 20th century and containing almost 10 M words. Clustering tasks were performed for two, three, four, or five authors. Clustering has been performed using three kinds of word lists: most frequent words (FW) including function words (stopwords), most frequent filtered words (FFW) excluding function words, and words with the highest variance values (HVW); and two unsupervised machine learning methods: K-means and Expectation Maximization (EM). The best clustering tasks, for two, three, or four authors, achieved results above 98%, and the improvement rates were above 40% in comparison to the “majority” (baseline) results. The EM method has been found to be superior to K-means for the discussed tasks. FW has been found to be the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding supports previous findings about the usefulness of function words for other tasks, such as authorship attribution using supervised ML methods, and genre and sentiment classification.
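A minimal sketch of the unsupervised setup described above, assuming plain-text documents and using scikit-learn's KMeans and GaussianMixture (an EM-fitted model) over normalized frequencies of the most frequent words; the placeholder corpus, the 500-word cutoff, and the value of k are illustrative, not the paper's actual settings.

```python
# Sketch: cluster documents by normalized frequencies of the most frequent
# words (FW list, function words included), comparing K-means with an
# EM-based mixture model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

docs = ["text of responsum one ...", "text of responsum two ...", "..."]  # placeholder corpus

vectorizer = CountVectorizer(max_features=500)   # keep the 500 most frequent words
counts = vectorizer.fit_transform(docs)
freqs = normalize(counts, norm="l1")             # per-document relative frequencies

k = 2                                            # presumed number of authors
km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(freqs)
em_labels = GaussianMixture(n_components=k, random_state=0).fit_predict(freqs.toarray())
```

Comparing `km_labels` and `em_labels` against the known authors would give the kind of cluster-purity evaluation reported in the abstract.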

2.
From the ordinal sum theorem for t-subnorms, Jenei introduced a new left-continuous t-norm (called the RDP t-norm) by revising the drastic product t-norm. In this paper, propositional and predicate calculi generated by the RDP t-norm and its residuum are introduced, and the corresponding formal systems RDP and RDP∀, which are schematic extensions of Esteva and Godo’s MTL and MTL∀, respectively, are presented; standard completeness for RDP and RDP∀ is proved. In addition, a new formula defining the standard disjunctive in RDP is given. In the original version of this article, part of it appeared to be text from a previously published article by other authors. This was not so, and the author would like to make clear that he was quoting from earlier work of his own in comparison with the work of the other authors. To facilitate this, it was decided that this addendum should be made available.
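For context, a minimal statement (in standard notation, not quoted from the paper) of the drastic product t-norm that the RDP construction revises; the precise form of Jenei's left-continuous revision and of its residuum should be taken from the cited work.

```latex
T_{D}(x,y)=
\begin{cases}
  y & \text{if } x = 1,\\
  x & \text{if } y = 1,\\
  0 & \text{otherwise,}
\end{cases}
\qquad x,y\in[0,1].
```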

3.
We present a quantitative analysis of 442 pieces of fiction published between 5 October 1992 and 17 September 2001 in the New Yorker magazine. We address two independent questions using the same data set. First, we examine whether changes in the Executive Editor or Fiction Editor are associated with significant changes in the type of fiction published at the New Yorker. Second, we examine whether New Yorker authors write fiction more often than not about characters with whom they share demographic traits. We find that changes in Fiction Editor at the New Yorker are associated with numerous significant, quantifiable changes in the magazine's fiction and that these effects are greater than those associated with a change in the New Yorker's Executive Editor. We also find that authors of New Yorker fiction write significantly more often than not about protagonists who share their race, gender, and country of origin and who are within or below their age range. The same is true of secondary characters except in the case of gender.

4.
The measure of lexical repetition constitutes one of the variables used to determine the lexical richness of literary texts, a value further employed in authorship attribution studies. Although most of the constants for lexical richness actually depend on text length, Yule’s characteristic K is considered to be highly reliable because it is independent of text length. It is not the aim of this paper to question the validity of K as a measure of the lexical repeat-rate, nor to evaluate its usefulness in authorship studies, but to review the most accurate procedure for calculating its value in the light of the lack of standardization found in the specific literature. At the same time, the peculiar calculation of Yule’s K by TACT is explained. Our study suggests that standardization will certainly help improve the studies where K is employed.
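For reference, the expression usually given for Yule's characteristic, where N is the number of tokens and V(i,N) the number of word types occurring exactly i times, is shown below; implementations (including TACT's) differ mainly in scaling and in how the frequency spectrum is handled, which is the standardization issue the paper addresses. This is the textbook form, not quoted from the paper.

```latex
K = 10^{4}\;\frac{\sum_{i} i^{2}\,V(i,N)\;-\;N}{N^{2}}
```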

5.
Recently prominent readers in Shakespeare have embraced Warren B. Austin's 1969 computer-based study which concludes that Henry Chettle wrote Greene's Groatsworth of Wit. However, Austin's study is flawed primarily because Austin excludes hosts of data related to his conclusion, while also misreading the data that continuously point to Robert Greene as Groatsworth's author. Austin studies just five of Greene's thirty-two known prose works and rules out studying many of Greene's words on the sole basis of their subject matter, words like repent that connect Greene to the writing of Groatsworth. Austin is also silent about Chettle's stated role as copyist and overseer in preparing and printing Groatsworth. Prominent in this discussion are the six ‘Greene plus’ words Austin identifies, but does not analyse, that appear often in Groatsworth and Greene's other prose writings, but never in Chettle. Especially important are the forty-one rare and unique words presented here that Austin excludes from his study and which constitute direct evidence of Greene's hand in writing the complete text of Groatsworth. Nor does Austin study the orthography of Groatsworth, which differs significantly from Chettle's Kind Harts Dreame and suggests different authors for each work. Austin's findings should, therefore, be set aside, while renewed consideration is given to the lexical and orthographical evidence presented in this article that continues to identify Greene as Groatsworth's author, that is, as someone familiar enough with Shakespeare's early theatre practices to criticize them.

6.
This paper is a case study for examining how a small-corpus-based approach can contribute to research in stylistics. Specifically, we have built small corpora of the two Alice books and retrieved, using the WordSmith Tools suite, first, verbs of saying and their adverbials to elucidate how Alice speaks to others in the stories, and secondly, modifiers of ‘Alice’ to get the images of the main character. An analysis of these data reveals that Alice's role in each book is quite distinct: an unexpected visitor thrown into the passive state in Wonderland and an active explorer in Looking-Glass. These findings objectively serve to reinforce our argument over what Alice is called through the perusal of the texts. Alice's roles in the two books are thus interactively supported by the small-corpus-based approach and the non-corpus-based approach, which may explore the validity of the interfaced approach, the collaborative work of quantitative processing and qualitative speculation.
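A minimal, self-contained sketch of the kind of retrieval described above (the original study used WordSmith Tools); the file name and the regular expressions are illustrative assumptions and would miss many real patterns.

```python
# Sketch: collect reporting verbs used around "Alice" in a plain-text copy
# of one of the Alice books.
import re
from collections import Counter

with open("alice_in_wonderland.txt", encoding="utf-8") as f:   # placeholder file name
    text = f.read()

# e.g.  ..." said Alice   /   ... Alice replied timidly
after_quote = re.findall(r"[\"'’]\s*(\w+)\s+Alice\b", text)
after_name = re.findall(r"\bAlice\s+(said|replied|cried|thought|added|went on)\b", text)
print(Counter(after_quote + after_name).most_common(10))
```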

7.
Markov chains are used as a formal mathematical model for sequences of elements of a text. This model is applied for authorship attribution of texts. As elements of a text, we consider sequences of letters or sequences of grammatical classes of words. It turns out that the frequencies of occurrences of letter pairs and pairs of grammatical classes in a Russian text are rather stable characteristics of an author and, apparently, they could be used in disputed authorship attribution. A comparison of results for various modifications of the method using both letters and grammatical classes is given. Experimental research involves 385 texts of 82 writers. In the Appendix, the research of D.V. Khmelev is described, where data compression algorithms are applied to authorship attribution.
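A minimal sketch of the letter-pair (first-order Markov) idea described above, with one training text per candidate author, add-alpha smoothing, and attribution by highest log-likelihood; the smoothing and the decision rule are illustrative choices, not the authors' exact procedure.

```python
# Sketch: attribute a text to the author whose letter-bigram Markov model
# assigns it the highest log-likelihood.
import math
from collections import Counter

def bigram_counts(text):
    text = text.lower()
    return Counter(zip(text, text[1:]))

def log_likelihood(text, counts, alpha=1.0):
    # P(b | a) estimated from the author's pair counts with add-alpha smoothing.
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c
    vocab = {ch for pair in counts for ch in pair} | set(text.lower())
    ll = 0.0
    for a, b in zip(text.lower(), text.lower()[1:]):
        ll += math.log((counts[(a, b)] + alpha) / (totals[a] + alpha * len(vocab)))
    return ll

authors = {"A": "training text by author A ...", "B": "training text by author B ..."}
models = {name: bigram_counts(txt) for name, txt in authors.items()}
disputed = "text of unknown authorship ..."
best = max(models, key=lambda name: log_likelihood(disputed, models[name]))
```

The same scheme works over sequences of grammatical classes instead of letters once the text has been tagged.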

8.
Designing Web-applications is considerably different for mobile computers (handhelds, Personal Digital Assistants) than for desktop computers. The screen size and system resources are more limited and end-users interact differently. Consequently, detecting handheld browsers on the server side and delivering pages optimized for a small client form factor is inevitable. The authors discuss their experiences during the design and development of an application for medical research, which was designed for both mobile and personal desktop computers. The investigations presented in this paper highlight some ways in which Web content can be adapted to make it more accessible to mobile computing users. As a result, the authors summarize their experiences in design guidelines and provide an overview of those factors which have to be taken into consideration when designing software for mobile computers. “The old computing is about what computers can do, the new computing is about what people can do” (Leonardo’s laptop: human needs and the new computing technologies, MIT Press, 2002).
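A minimal sketch of the kind of server-side detection discussed above, assuming a WSGI-style environ dictionary and an illustrative (far from complete) list of User-Agent substrings; the template names are placeholders.

```python
# Sketch: choose a page template based on the client's User-Agent header.
MOBILE_MARKERS = ("Windows CE", "PalmOS", "Mobile", "PDA")  # illustrative substrings

def select_template(environ):
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if any(marker.lower() in user_agent.lower() for marker in MOBILE_MARKERS):
        return "handheld.html"   # reduced layout for small screens
    return "desktop.html"        # full desktop layout
```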

9.
Design and Analysis of a Parallel String Matching Algorithm for Distributed Memory (cited 7 times: 0 self-citations, 7 by others)
陈国良  林洁  顾乃杰 《软件学报》 (Journal of Software), 2000, 11(6): 771-778
Research on parallel string matching algorithms has largely concentrated on the PRAM (parallel random access machine) model, while work on more realistic models remains comparatively weak. This paper applies the technique of parallelizing an optimal sequential algorithm: exploiting the periodicity properties of the pattern string, it parallelizes an improved KMP (Knuth-Morris-Pratt) algorithm and presents a simple, efficient, and highly scalable distributed string matching algorithm, with a computation complexity of O(n/p+m) and a communication complexity of O(u log p
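A minimal sequential sketch of the KMP matching that the paper parallelizes (failure-function construction plus left-to-right scan); the distributed decomposition across p processors is not reproduced here.

```python
# Sketch: standard sequential KMP string matching, the serial basis of the
# parallel algorithm described above.
def failure_function(pattern):
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    fail, k, matches = failure_function(pattern), 0, []
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)   # start index of a full match
            k = fail[k - 1]
    return matches
```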

10.
Imitative texts of high quality are of some importance to students of attribution, especially those who use computational methods. The authorship of such texts is always likely to be difficult to demonstrate. In some cases, the identity of the author is a question of interest to literary scholars. Even when that is not so, students of attribution face a challenge. If we cannot distinguish between original and imitation in such cases, we must always concede that an imitator may have been at work. Shamela (1741) has always been regarded as a brilliant parody. When it is subjected to our standard common-words tests of authorship, it yields mixed results. A new procedure, in which special word-lists are established according to a predetermined set of rules, proves more effective. It needs, however, to be tried in other cases.

11.
Ever since its initial publication four hundred years ago, thousands of editions, most often illustrated, have been published of Cervantes' masterpiece, Don Quixote. Imagery has become an integral part of the reception and interpretation of the text. To date, a comprehensive collection of these images, the textual iconography of the Quixote, has not been published. We report in this paper on overcoming two key obstacles: limitations on the availability of materials and limitations due to the technical and financial characteristics of print-based dissemination. Our digital iconography makes a rich artistic tradition accessible to readers for the first time, and reveals a wealth of information about the historical, cultural, and literary contexts into which the Quixote has been placed.

12.
The search for a reliable expression to measure an author's lexical richness has constituted many statisticians' holy grail over the last decades in their attempt to solve some controversial authorship attributions. The greatest effort has been devoted to finding a formula grounded on the computation of tokens, word-types, most-frequent-word(s), hapax legomena, hapax dislegomena, etc., such that it would characterize a text successfully, independent of its length. In this line, Yule's K and Zipf's Z seem to be generally accepted by scholars as reliable measures of lexical repetition and lexical richness, computing content and function words altogether. Given the latter's higher frequency, they prove to be more reliable identifiers when computed in isolation in p.c.a. and Delta-based attribution studies, and their ratio to the former also measures the functional density of a text. In this paper, we aim to show that each constant serves to measure a specific feature and, as such, they are thought to complement one another, since a supposedly rich text (in terms of its lemmas) does not necessarily have to be characterized by low functional density, and vice versa. For this purpose, an annotated corpus of the West Saxon Gospels (WSG) and Apollonius of Tyre (AoT) has been used along with a huge raw corpus.
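A minimal sketch of the raw quantities these constants are built from, assuming whitespace tokenization and an illustrative (not exhaustive) function-word list; functional density is taken here as function-word tokens over all tokens, one reasonable reading of the ratio mentioned above.

```python
# Sketch: basic lexical-richness quantities -- tokens, types, hapax legomena,
# hapax dislegomena, and functional density.
from collections import Counter

FUNCTION_WORDS = {"the", "and", "of", "to", "a", "in", "that", "it", "he", "she"}  # illustrative

def lexical_profile(text):
    tokens = text.lower().split()
    counts = Counter(tokens)
    profile = {
        "tokens": len(tokens),
        "types": len(counts),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "hapax_dislegomena": sum(1 for c in counts.values() if c == 2),
    }
    profile["functional_density"] = (
        sum(counts[w] for w in FUNCTION_WORDS) / len(tokens) if tokens else 0.0
    )
    return profile
```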

13.
In reference (Foundation of specification. Journal of Logic and Computation, 15, 951–974, 2005), the author introduces a core specification theory (CST) in order to provide a logical framework for the design and exploration of specification languages. In this article, we formulate two highly expressive extensions of CST. The first (CSTU) is CST + a universe of types, and the second (CSTUS) permits specifications themselves to be data items. Finally, we shall explore their metamathematical properties and, in particular, provide an interpretation into first-order arithmetic.

14.
The Corpus of Electronic Texts (CELT) project at University College Cork is an on-line corpus of multilingual texts that are encoded in TEI-conformant SGML/XML. As of September 2006, the corpus has 9.3 million words online. Over the last five years, doctoral work carried out at the project has focused on the development of lexicographical resources spanning the years c. AD 700–1700, and on the development of tools to integrate the corpus with these resources. This research has been further complemented by the Linking Dictionaries and Text project, a North–South Ireland collaboration between the University of Ulster, Coleraine, and University College Cork. The Linking Dictionaries and Text project will reach completion in October 2006. This article focuses on CELT's latest research project, the Digital Dinneen project, which aims to create an integrated edition of Patrick S. Dinneen's Foclóir Gaedhilge agus Béarla (Irish-English Dictionary). In this article, the newly developed research infrastructure, that is, the culmination of the doctoral research carried out at CELT and the Linking Dictionaries and Text collaboration, will be described, and ways that the Digital Dinneen will be integrated into this infrastructure established. Finally, avenues of future research will be pointed to.

15.
Statistical information on a substantial corpus of representative Spanish texts is needed in order to determine the significance of data about individual authors or texts by means of comparison. This study describes the organization and analysis of a 150,000-word corpus of 30 well-known twentieth-century Spanish authors. Tables show the computational results of analyses involving sentences, segments, quotations, and word length. The article explains the considerations that guided content, selection, and sample size, and describes special editing needed for the input of Spanish text. Separate sections highlight and comment upon some of the findings. The corpus and the tables provide objective data for studies of homogeneity and heterogeneity. The format of the tables permits others to add to the original 30 authors, organize the results by categories, or use the cumulative results for normative comparisons.
Estelle Irizarry is Professor of Spanish at Georgetown University and author of 20 books and annotated editions dealing with Hispanic literature, art, and hoaxes. Her latest book, an edition of Infortunios de Alonso Ramirez, treats the disputed authorship of Spanish America's first novel. She is Courseware Editor of CHum.

16.
This article focuses on the pendulum-like change in the way people read and use text, which was triggered by the introduction of new reading and writing technologies in human history. The paper argues that textual features, which characterized the ancient pre-print writing culture, disappeared with the establishment of the modern-day print culture and have been “revived” in the digital post-modern era. This claim is based on the analysis of four cases which demonstrate this textual-pendulum swing: (1) The swing from concrete iconic-graphic representation of letters and words in the ancient alphabet to abstract phonetic representation of text in modern eras, and from written abstract computer commands “back” to the concrete iconic representation in graphic user interfaces of the digital era; (2) The swing from scroll reading in the pre-print era to page or book reading in the print era and “back” to scroll reading in the digital era; (3) The swing from a low level of authorship in the pre-print era to a strong authorship perception in the print era, and “back” to a low degree of authorship in the digital era; (4) The swing from synchronic representation of text in both visual and audio formats during the pre-print era to a visual representation only in print, and “back” to a synchronic representation in many environments of the digital era. We suggest that the print culture, which is usually considered the natural and preferred textual environment, should be regarded as the exception.

17.
In author attribution studies function words or lexical measures are often used to differentiate the authors' textual fingerprints. These studies can be thought of as quantifying the texts, representing the text with measured variables that stand for specific textual features. The resulting quantifications, while proven useful for statistically differentiating among the texts, bear no resemblance to the understanding a human reader – even an astute one – would develop while reading the texts. In this paper we present an attribution study that, instead, characterizes the texts according to the representational language choices of the authors, similar to a way we believe close human readers come to know a text and distinguish its rhetorical purpose. From our automated quantification of The Federalist papers, it is clear why human readers find it impossible to distinguish the authorship of the disputed papers. Our findings suggest that changes occur in the processes of rhetorical invention when undertaken in collaborative situations. This points to a need to re-evaluate the premise of autonomous authorship that has informed attribution studies of The Federalist case.

18.
URICA! II is an interactive collation system for the Apple Macintosh family of personal computers. It is designed to facilitate the collation of texts for the purpose of determining textual variants. The URICA! II system provides several utilities which support the collation and conflation of texts. The collation utilities allow a user to compare interactively two text files and record their differences in a variant data file. If desired, the computer can resolve simple variants automatically, involving the user only to resolve the more difficult variants. The conflation utility combines the variant data files from multiple collations into a single file which lists all of the textual variants, keyed against a common master text. Michael L. Hilton is an assistant professor of Computer Science at the University of South Carolina. His research interests include the design of computer hardware, speech processing, and stylometric analysis of literary texts. An earlier version of this paper appeared in the Proceedings of the 11th International Conference on Computers and the Humanities.
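URICA! II itself is interactive, but a minimal sketch of the underlying pairwise collation step, recording variants between a master text and one witness with Python's standard difflib, could look like this; the file names and the variant record layout are placeholders.

```python
# Sketch: record textual variants between a master text and one witness,
# keyed against the master's line numbers.
import difflib

def collate(master_lines, witness_lines):
    variants = []
    matcher = difflib.SequenceMatcher(None, master_lines, witness_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append({
                "master_lines": (i1 + 1, i2),     # 1-based range in the master text
                "master_text": master_lines[i1:i2],
                "witness_text": witness_lines[j1:j2],
                "kind": tag,                       # replace / delete / insert
            })
    return variants

with open("master.txt") as m, open("witness.txt") as w:   # placeholder files
    diffs = collate(m.read().splitlines(), w.read().splitlines())
```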

19.
There is much active research into the design of automated bidding agents, particularly for environments that involve multiple decoupled auctions. These settings are complex partly because an agent’s strategy depends on information about other bidders’ interests. When bidders’ valuation distributions are not known ex ante, machine learning techniques can be used to approximate them from historical data. It is a characteristic feature of auctions, however, that information about some bidders’ valuations is systematically concealed. This occurs in the sense that some bidders may fail to bid at all because the asking price exceeds their valuations, and also in the sense that a high bidder may not be compelled to reveal her valuation. Ignoring these “hidden bids” can introduce bias into the estimation of valuation distributions. To overcome this problem, we propose an EM-based algorithm. We validate the algorithm experimentally using agents that react to their environments both decision-theoretically and game-theoretically, using both synthetic and real-world (eBay) datasets. We show that our approach estimates bidders’ valuation distributions and the distribution over the true number of bidders significantly more accurately than more straightforward density estimation techniques. Editors: Amy Greenwald and Michael Littman. An earlier version of this work was presented at the Workshop on Game-Theoretic and Decision-Theoretic Agents (GTDT) 2005, Edinburgh, Scotland.
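The paper's algorithm also estimates the distribution over the true number of bidders, but a minimal sketch of the core censoring idea, EM for a single normal valuation distribution when bids below the asking price are never observed, might look like the following; the normal model, the fixed iteration count, and the initialization are illustrative assumptions, not the authors' specification.

```python
# Sketch: EM for a normal valuation distribution when bidders with valuations
# below the asking price never bid (left-censored observations).
import numpy as np
from scipy.stats import norm

def em_censored_normal(observed_bids, n_hidden, ask_price, n_iter=100):
    v = np.asarray(observed_bids, dtype=float)
    n = len(v) + n_hidden
    mu, sigma = v.mean(), v.std() + 1e-6            # crude initialization
    for _ in range(n_iter):
        # E-step: conditional moments of a valuation known only to lie below ask_price.
        alpha = (ask_price - mu) / sigma
        lam = norm.pdf(alpha) / norm.cdf(alpha)
        e_v = mu - sigma * lam                       # E[V | V < ask_price]
        var_v = sigma**2 * (1 - alpha * lam - lam**2)
        e_v2 = var_v + e_v**2                        # E[V^2 | V < ask_price]
        # M-step: closed-form update from observed and expected sufficient statistics.
        mu = (v.sum() + n_hidden * e_v) / n
        sigma = np.sqrt((np.sum(v**2) + n_hidden * e_v2) / n - mu**2)
    return mu, sigma
```

Each E-step replaces every hidden valuation with the moments of a normal distribution truncated above at the asking price; the M-step is then the usual closed-form normal update.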

20.
Estimating the relative frequencies of linguistic features is a fundamental task in linguistic computation. As the amount of text or speech that is available from a given user of the language typically varies greatly, and the sample sizes tend to be small, the most straightforward methods do not always give the most informative answers. Bootstrap and Bayesian methods provide techniques for handling the uncertainty in small samples. We describe these techniques for estimating frequencies from small samples, and show how they can be applied to the study of linguistic change. As a test case, we use the introduction of the pronoun you as subject in the data provided by the Corpus of Early English Correspondence (c. 1410–1681).
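A minimal sketch of the bootstrap half of this approach, giving a percentile interval for the relative frequency of a feature (for instance, subject you) in a small sample; the sample counts are invented for illustration, and a Beta-prior posterior would be the Bayesian analogue.

```python
# Sketch: bootstrap percentile interval for the relative frequency of a
# linguistic feature in a small sample (1 = feature present, 0 = absent).
import random

def bootstrap_frequency(observations, n_boot=10000, level=0.95, seed=0):
    rng = random.Random(seed)
    n = len(observations)
    estimates = sorted(
        sum(rng.choice(observations) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = estimates[int((1 - level) / 2 * n_boot)]
    hi = estimates[int((1 + level) / 2 * n_boot) - 1]
    return sum(observations) / n, (lo, hi)

# e.g. 7 subject-"you" tokens out of 20 second-person subjects in one writer's letters
sample = [1] * 7 + [0] * 13
point, interval = bootstrap_frequency(sample)
```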
