Similar literature
Found 20 similar documents (search time: 46 ms)
1.
Dictionary markup is one of the concerns of the Text Encoding Initiative (TEI), an international project for text encoding. In this paper, we investigate ways to use and extend the TEI encoding scheme for the markup of Korean dictionary entries. Since the TEI suggestions for dictionary markup are aimed mainly at Western-language dictionaries, we must cope with the problems that arise in encoding Korean dictionary entries. We extend and modify the TEI encoding scheme in the way the TEI itself suggests. We also restrict the content model so that the encoded dictionary can be viewed as a database as well as a computerized version of an originally printed dictionary. This revised version was published online in July 2006 with corrections to the Cover Date.

2.
A dramatic work may be seen either as an event or as a text; the TEI guidelines make it possible to encode a dramatic work in either way, but do not attempt to solve the difficult problem of doing both at once. The basic element of a dramatic work, when seen as a text, is the speech; the guidelines also provide elements for encoding other familiar parts of dramatic texts (such as stage directions and cast lists), as well as for encoding analytic information on various aspects of texts and performances that is not normally included in printed dramatic texts. There are often other formal structures in dramatic works that intersect with the structure of speeches (metrical structures, for example); we discuss approaches for encoding these structures.
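
The view of a play as a text built from speeches can be illustrated with a small sketch. The fragment below uses the TEI element names `sp`, `speaker`, `l`, and `stage`; the scene itself, the `who` values, and the helper name `speeches` are invented for illustration, and the TEI namespace is omitted for brevity, so this is a sketch under those assumptions rather than an encoding taken from the paper.

```python
import xml.etree.ElementTree as ET

# Invented two-speech scene; only the element names come from TEI.
TEI_FRAGMENT = """
<div type="scene">
  <stage type="entrance">Enter two servants.</stage>
  <sp who="#first">
    <speaker>First Servant</speaker>
    <l>Who knocks so late upon the gate?</l>
  </sp>
  <sp who="#second">
    <speaker>Second Servant</speaker>
    <l>A stranger, sir, and no friend of ours.</l>
  </sp>
</div>
"""


def speeches(xml_text):
    """Return (speaker, verse lines) pairs, one per <sp> element."""
    root = ET.fromstring(xml_text)
    result = []
    for sp in root.iter("sp"):
        speaker = sp.findtext("speaker", default="")
        lines = [line.text or "" for line in sp.findall("l")]
        result.append((speaker, lines))
    return result


def stage_directions(xml_text):
    """Return the text of every <stage> element in document order."""
    root = ET.fromstring(xml_text)
    return [stage.text or "" for stage in root.iter("stage")]


if __name__ == "__main__":
    for speaker, lines in speeches(TEI_FRAGMENT):
        print(speaker, lines)
    print(stage_directions(TEI_FRAGMENT))
```

Because each verse line sits inside a speech here, the metrical unit and the speech unit nest cleanly; the harder case the abstract mentions, where a verse line is shared between speakers, is not handled by this sketch.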

3.
A dramatic work may be seen either as an event or as a text; the TEI guidelines make it possible to encode a dramatic work in either way, but do not attempt to solve the difficult problem of doing both at once. The basic element of a dramatic work, when seen as a text, is the speech; the guidelines also provide elements for encoding other familiar parts of dramatic texts (such as stage directions and cast lists), as well as for encoding analytic information on various aspects of texts and performances that is not normally included in printed dramatic texts. There are often other formal structures in dramatic works that intersect with the structure of speeches (metrical structures, for example); we discuss approaches for encoding these structures. John Lavagnino is a graduate student in English and American Literature at Brandeis University. His fields of interest include Renaissance drama, modern literature, textual scholarship, and electronic textuality. He is Electronics Editor of The Collected Works of Thomas Middleton (forthcoming from Oxford University Press). Elli Mylonas is a Lead Project Analyst for the Scholarly Technology Group at Brown University. Formerly she was the Managing Editor of the Perseus Project. Her areas of interest are Roman poetry, textual markup and SGML, and hypertext. The work described in this paper is the outcome of the discussions of the Performance Working Group, whose members are Elli Mylonas (chair), Rosanne G. Potter, John Lavagnino, and Lou Burnard. The authors wish to thank the other two members for their contributions.

4.
The TEI Header plays a vital role in the documentation and interchange of TEI-conformant electronic texts. Moreover, this role is becoming increasingly important as more people follow the recommendations set out in TEI P3, and libraries, archives, and electronic text centres seek to share their holdings of electronic texts. However, the fact that TEI P3 allows for flexibility in the structure and content of TEI Headers has meant that divergent practices have begun to emerge within the numerous projects and initiatives creating TEI texts. With this in mind, the Oxford Text Archive hosted a one-day colloquium of leading TEI exponents, at which invited participants were encouraged to share their views and expertise on creating TEI Headers, and work together to develop some recommendations towards good practice.

5.
This paper discusses the basic design of the encoding scheme described by the Text Encoding Initiative's Guidelines for Electronic Text Encoding and Interchange (TEI document number TEI P3, hereafter simply P3 or the Guidelines). It first reviews the basic design goals of the TEI project and their development during the course of the project. Next, it outlines some basic notions relevant for the design of any markup language and uses those notions to describe the basic structure of the TEI encoding scheme. It also describes briefly the core tag set defined in chapter 6 of P3, and the default text structure defined in chapter 7 of that work. The final section of the paper attempts an evaluation of P3 in the light of its original design goals, and outlines areas in which further work is still needed. C. M. Sperberg-McQueen is a Senior Research Programmer at the academic computer center of the University of Illinois at Chicago; his interests include medieval Germanic languages and literatures and the theory of electronic text markup. Since 1988 he has been editor in chief of the ACH/ACL/ALLC Text Encoding Initiative. Lou Burnard is Director of the Oxford Text Archive at Oxford University Computing Services, with interests in electronic text and database technology. He is European Editor of the Text Encoding Initiative's Guidelines.

6.
This paper describes the multilingual text editor MtScript developed in the framework of the MULTEXT project. MtScript enables the use of many different writing systems in the same document (Latin, Arabic, Cyrillic, Hebrew, Chinese, Japanese, etc.). Editing functions enable the insertion or deletion of text zones even if they have opposite writing directions. In addition, the languages in the text can be marked, customized keyboard input rules can be associated with each language and different character coding systems (one or two bytes) can be combined. MtScript is based on a portable environment (Tcl/Tk). MtScript version 1.1 has been developed under Unix/X-Windows (Solaris, Linux systems) and other versions are planned to be ported to the Windows and Macintosh environments. The current 1.1 version presents several limits that will be fixed in future versions, such as the justification of bi-directional texts, printing support, and text import/export support. Future versions will use SGML and TEI norms, which offer ways of encoding multilingual texts and are to a large extent meant for interchange.

7.
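Current English grammatical error correction (GEC) models often ignore syntactic knowledge in the text that would help correction, which limits their correction ability. To address this problem, an English GEC model based on differential fusion of syntactic features is proposed. First, the proposed syntactic encoder can generate dependency graph and constituency tree information directly from text without supervision, and can also fuse these two heterogeneous syntactic structures and encode them into a high-dimensional syntactic representation. Second, to exploit both the semantic and the syntactic information in the text, the differential fusion module first uses differential regularization to strengthen the semantic encoder's capture of semantic features that the syntactic encoder fails to generate, and then uses co-attention to further fuse the syntactic and semantic representations into the output features of the Transformer encoder, which are finally fed to the decoder to generate grammatically correct text. In comparative experiments on the CoNLL-2014 English error correction task dataset, the method's precision and F0.5 score exceed those of the GEC model based on the Copy-Augmented Transformer, with F0.5 improving by 5.2 percentage points; the syntactic knowledge also mitigates the problem of scarce annotated data, giving better text correction results.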

8.
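A discussion of TEI markup specifications for Tibetan corpora   Total citations: 1, self-citations: 0, citations by others: 1
In language information processing, the processing of large-scale real text has become a research focus. The markup of Tibetan corpora plays an important role in research on Chinese-Tibetan-English machine translation, information retrieval, text data mining, and dictionary compilation. To facilitate data exchange and sharing, this paper works with TEI-encoded Tibetan corpus material and gives a systematic and comprehensive discussion of how to mark up the attribute information and structural information of texts in a Tibetan corpus.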

9.
10.
The TEI Guidelines provide little detail on how to encode a text within the physical structures of the book in which it is contained. This paper describes the physical structures of an early printed book and presents two methods for encoding a text within that structure through use of the TEI elements and .

11.
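Research on syntax-aware text representation in the vector space model   Total citations: 1, self-citations: 1, citations by others: 0
To strengthen the semantic descriptiveness of terms in the vector space model (VSM) and to overcome the drawback that the semantic units in the VSM are mutually independent, a phrase-based method for describing feature granularity is proposed. Starting from the representation of the text and the way feature terms are organized, the method identifies base phrases by syntactic rules, builds a relation tree between the features and the head verb, and replaces the words in the bag-of-words (BOW) representation with base phrases. Experimental results show that a text representation using base phrases improves classification performance, increases the connections between terms, overcomes the mutual independence of feature terms, and still maintains good classification results when the number of features is small.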

12.
In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. For sn-grams, the neighbors are taken by following syntactic relations in syntactic trees, not by taking words as they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams bring syntactic knowledge into machine learning methods; however, the text must be parsed beforehand to construct them. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As a baseline we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
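
As a rough illustration of the construction described in this abstract, the sketch below builds sn-grams by following head-to-dependent arcs in a dependency tree instead of surface word order. The function name `syntactic_ngrams`, the index-based `HEADS` encoding, and the hand-written parse of a four-word sentence are assumptions made for illustration and are not taken from the paper; a real pipeline would obtain the tree from a parser, and only simple downward paths are covered here.

```python
from typing import Dict, List, Sequence, Tuple

# Hand-written dependency parse of "the cat eats fish" (an assumption,
# not parser output). HEADS[i] is the index of the head of WORDS[i];
# -1 marks the root.
WORDS = ["the", "cat", "eats", "fish"]
HEADS = [1, 2, -1, 2]


def syntactic_ngrams(
    words: Sequence[str], heads: Sequence[int], n: int
) -> List[Tuple[str, ...]]:
    """Return every head-to-dependent path of exactly n words."""
    # Invert the head array into a head -> dependents mapping.
    children: Dict[int, List[int]] = {i: [] for i in range(len(words))}
    for dependent, head in enumerate(heads):
        if head >= 0:
            children[head].append(dependent)

    def paths_from(node: int, length: int) -> List[List[int]]:
        # A path of length 1 is the node itself; longer paths extend
        # downward through each dependent in turn.
        if length == 1:
            return [[node]]
        return [
            [node] + rest
            for child in children[node]
            for rest in paths_from(child, length - 1)
        ]

    return [
        tuple(words[i] for i in path)
        for start in range(len(words))
        for path in paths_from(start, n)
    ]


if __name__ == "__main__":
    print(syntactic_ngrams(WORDS, HEADS, 2))
    print(syntactic_ngrams(WORDS, HEADS, 3))
```

For this sentence the surface bigrams would be (the, cat), (cat, eats), (eats, fish), whereas the syntactic bigrams pair each head with its dependents, so "eats" is paired with "cat" and with "fish" regardless of word order.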

13.
This paper reports on a method for exploiting a bitext as the primary linguistic information source for the design of a generation environment for specialized bilingual documentation. The paper discusses such issues as Text Encoding Initiative (TEI), proposals for specialized corpus tagging, text segmentation and alignment of translation units and their allocation into translation memories, Document Type Definition (DTD), abstraction from tagged texts, and DTD deployment for bilingual text generation. The parallel corpus used for experimentation has two main features:

14.
This paper explores theoretical and practical aspects of intertextuality, in relation to the highly interpretative <intertextuality> tag within the SGML tagset developed by the Orlando Project for its history of women's writing in the British Isles. Arguing that the concept of intertextuality is both crucial to and poses particular challenges to the creation of an encoding scheme for literary historical text, it outlines the ways in which the project's tags address broader issues of intertextuality. The paper then describes the specific <intertextuality> tag in detail, and argues on the basis of provisional results drawn from the Orlando Project's textbase that despite the impossibility of tracking intertextuality exhaustively or devising a tagset that completely disambiguates the concept, this tag provides useful pathways through the textbase and valuable departure points for further inquiry. Finally, the paper argues that the challenges to notions of rigour posed by the concept of intertextuality can help us fruitfully to examine some of the suppositions (gendered and other) that we bring to electronic text markup.

15.
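王景慧 (Wang Jinghui), 卢玲 (Lu Ling). 《计算机应用研究》 (Application Research of Computers), 2023, 40(5): 1410-1415+1440
Chinese entity relation extraction mostly processes text as character sequences, which suffers from insufficient semantic representation of characters and from semantic forgetting over long character sequences; both limit the recall of long-distance entities. To address this, a relation-oriented extraction method that incorporates dependency syntax information is proposed. The input layer takes a character sequence and a word sequence based on synonym representations as input. The encoder encodes the text with a long short-term memory network (LSTM) and adds global dependency information, which is used to produce the representation of the relation gate. The decoder adds dependency type information and, under the action of the relation gate, decodes with a bidirectional long short-term memory network (BiLSTM) to obtain entity-relation triples. On the Chinese datasets SanWen, FinRE, DuIE, and IPRE, the method improves F1 over the baseline methods by 5.84%, 2.11%, 2.69%, and 0.39%, respectively. Ablation experiments show that the proposed representations of global dependency information and of dependency type information both improve extraction performance, and that performance on long sentences and long-distance entities is also consistently better than the baselines.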

16.
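A study of the Russian units corresponding to Chinese clauses   Total citations: 1, self-citations: 0, citations by others: 1
This paper annotates the Russian units that correspond to Chinese clauses in Chinese-Russian parallel texts and analyzes them statistically. First, the Russian text is segmented in alignment with the Chinese clause segmentation, giving the Russian corresponding units. Second, the Russian corresponding units are annotated grammatically. Finally, the Russian corresponding units are analyzed on the basis of the annotated corpus. The study finds that: (1) sentence constituents are common (74.85%) and full sentences are few (25.15%); (2) units with a single predicative core are the most common (69.04%), units with no predicative core come next (27.63%), and units with multiple predicative cores are few (3.33%); (3) among units with a single predicative core, simple predicates are the most frequent (31.84%); among units with no predicative core, verb phrases are the most frequent (51.26%); and among units with multiple predicative cores, subordinate complex sentences are the most frequent (47.92%).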

17.
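To improve the accuracy of semantic similarity computation for short texts, a new method is proposed. The text is split into sentence units, and each sentence undergoes syntactic dependency analysis. Similarity between sentences is built on similarity between words. When computing the semantic similarity of words, a new feature of words, their sentiment, is taken into account, and a combined word sense disambiguation method is proposed that draws on a word's part of speech and its context; word semantic similarity is then computed from the Hownet semantic dictionary. The semantic similarities between the words of two sentences are combined in a weighted average according to sentence structure to give the semantic similarity of the sentences, and finally the semantic similarity of the short texts is computed with a new method, the binary set method. The accuracy of word similarity and of short-text similarity reached 87.63% and 93.77%, respectively. The experimental results show that the method improves the accuracy of short-text semantic similarity.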

18.
The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents over the internet and intranets. HTML allows the author to organize the text into subparts delimited by special tags; these subparts are then visualized by the HTML browser in distinct ways, i.e. with distinct typographical formats. In this paper a model for indexing HTML documents is proposed which exploits the role of tags in encoding the importance of their delimited text. Central to our model is a method to compute the significance degree of a term in a document by weighting the term instances according to the tags in which they occur. The indexing model proposed is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The weighted representation is contextual in the sense that the set of numerical weights assigned to the various tags and the respective text depends not only on the tags themselves but also on the particular document considered. By means of the contextual weighted representation our indexing model reflects not only the general syntactic structure of the HTML language but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex non-linear weighted representation and has a number of novel and interesting features.
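
The core step of weighting term instances by the tags that enclose them can be sketched as follows. This is a deliberately simplified, non-contextual variant: the weights in `TAG_WEIGHTS` are fixed, invented values, whereas the model in the abstract derives its weights from the particular document. The names `TagWeightedIndexer` and `term_significance` are likewise assumptions for illustration, not the paper's method.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Invented fixed weights; the paper's weights are document-dependent.
TAG_WEIGHTS = {"title": 1.0, "h1": 0.8, "b": 0.5}
DEFAULT_WEIGHT = 0.2


class TagWeightedIndexer(HTMLParser):
    """Accumulate a per-term score from the tags enclosing each term."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag so unclosed tags such as
        # <br> do not leave the stack out of step.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Each term instance takes the strongest weight among the tags
        # currently enclosing it.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[term] += weight


def term_significance(html_text):
    """Return term scores normalized so the top term has score 1.0."""
    indexer = TagWeightedIndexer()
    indexer.feed(html_text)
    indexer.close()
    top = max(indexer.scores.values(), default=1.0)
    return {term: score / top for term, score in indexer.scores.items()}


SAMPLE = (
    "<html><head><title>Markup</title></head><body>"
    "<h1>Markup weights</h1><p>Plain <b>weights</b> text</p>"
    "</body></html>"
)

if __name__ == "__main__":
    print(term_significance(SAMPLE))
```

Taking the maximum weight over the enclosing tags is one possible design choice; summing or multiplying the weights of nested tags would be equally easy to try in `handle_data`.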

19.
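Research on the effective use of syntactic information in document retrieval   Total citations: 1, self-citations: 0, citations by others: 1
Using term dependencies to improve the bag-of-words model has long been a popular topic in text retrieval. Existing methods for defining term dependencies fall into two main classes: lexical-level dependencies, which define term dependency from statistical proximity information, and syntactic-level dependencies, which define term dependency from syntactic structure. Although existing research shows that using term dependencies can significantly improve retrieval performance over the bag-of-words model, the two classes of term dependency have not been compared systematically. How to use syntactic information effectively when improving the representation of documents and queries with term dependencies, and which syntactic information is most useful for text retrieval, remain open questions. To this end, for document representation, the performance of term dependencies defined from proximity information is compared with that of term dependencies defined from syntactic information; for query representation, the performance of term dependencies defined from syntactic information at different levels is compared. To compare systematically the effect of these term dependencies on retrieval performance, a retrieval model is proposed that is built on the language model, takes smoothing as its guiding idea, and can readily incorporate both classes of term dependency. Experiments on TREC corpora show that, for document representation, syntactic relations make no clear difference compared with statistical proximity relations. For query representation, partial syntactic information based on noun and proper-noun phrases is more effective than the other syntactic information.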

20.
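车冰倩 (Che Bingqian), 周栋 (Zhou Dong). 《计算机应用》 (Journal of Computer Applications), 2021, 41(4): 976-983
Recommending suitable tags for a text is an effective way to organize and use text content better, and most current tag recommendation methods work mainly by mining the text content. However, most data does not exist in isolation; for example, the word co-occurrence relations between the texts of a corpus can form a complex network structure. Previous research shows that the network structure information between texts and the content information of a text can each summarize the semantics of the same text from a different angle, and that the information extracted from the two sides can complement and explain each other. On this basis, a tag recommendation method is proposed that models text network structure information and text content information at the same time. The method first uses a graph convolutional network (GCN) to extract the structural information of the network between texts, then uses a recurrent neural network (RNN) to extract the text content information, and finally uses an attention mechanism to combine the two for tag recommendation. Compared with baseline methods such as a GCN-based tag recommendation method and a tag recommendation method based on a topical attention long short-term memory (TLSTM) neural network, the proposed method performs better. For example, on the Mathematics Stack Exchange dataset, the precision, recall, and F1 of the proposed method are 2.3%, 3.8%, and 7.0% higher, respectively, than those of the best baseline method.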
