Similar literature
Found 20 similar documents (search time: 875 ms)
1.
This research begins by distinguishing a small number of “central” languages from the “noncentral languages”, where centrality is measured by the extent to which a given language is supported by natural language processing tools and research. We analyse the conditions under which noncentral language projects (NCLPs) and central language projects are conducted. We establish a number of important differences which have far-reaching consequences for NCLPs. In order to overcome the difficulties inherent in NCLPs, traditional research strategies have to be reconsidered. Successful styles of scientific cooperation, such as those found in open-source software development or in the development of Wikipedia, provide alternative views of how NCLPs might be designed. We elaborate the concepts of free software and software pools and argue that NCLPs, in their own interests, should embrace an open-source approach for the resources they develop and pool these resources together with other similar open-source resources. The expected advantages of this approach are so important that we suggest that funding organizations make it a sine qua non condition of project contracts.

2.
Language identification, the task of determining the language in which a given text is written, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinguishing similar languages, (2) detecting multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task helped us shed some light on the shortcomings of state-of-the-art language identification systems, and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study the aforementioned issues in language identification within a common setting that enables results to be compared with one another.

3.
In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre 632 “Information Structure”. These include deeply annotated data collections of 25 sub-Saharan languages that are described together with their annotation scheme, as well as the corpus tool ANNIS, which provides unified access to a broad variety of annotations created with a range of different tools. With the application of ANNIS to several African data collections, we illustrate its suitability for the purpose of language documentation, distributed access, and the creation of data archives.

4.
Far more work has been done on natural language processing for Western languages than for their Eastern counterparts, particularly South Asian languages. Western languages are termed resource-rich languages: core linguistic resources, e.g. corpora, WordNet, dictionaries, gazetteers, and associated tools, are customarily available for them. Most South Asian languages are low-resource languages. Urdu, for example, is a South Asian language that is among the widely spoken languages of the subcontinent, yet because of resource scarcity not enough work has been conducted for it. The core objective of this paper is to present a survey of the linguistic resources that exist for Urdu language processing (ULP), to highlight different tasks in ULP, and to discuss the available state-of-the-art techniques. In doing so, the paper attempts to describe in detail the recent increase in interest and progress made in ULP research. Initially, the available datasets for Urdu are discussed. The characteristics of Urdu, resource sharing between Hindi and Urdu, and Urdu orthography and morphology are described. Pre-processing activities such as stop-word removal, diacritics removal, normalization, and stemming are illustrated. State-of-the-art research on tasks such as tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, parsing, and the development of WordNet is reviewed. In addition, the impact of ULP on application areas such as information retrieval, classification, and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize the ULP work so that it can provide a platform for future ULP research activities.

5.
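KGDB: A Knowledge Graph Database Management System with a Unified Model and Language   Cited by: 2 (self-citations: 0, citations by others: 2)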
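Knowledge graphs are an important cornerstone of artificial intelligence. They currently have two main data models, RDF graphs and property graphs, and several query languages exist on top of these two models: SPARQL is the query language for RDF graphs, and Cypher is the main query language for property graphs. Over the past decade, different communities have developed separate data management methods for RDF graphs and for property graphs, and the lack of a unified data model and query language has limited the wider application of knowledge graphs. KGDB (Knowledge Graph Database) is a knowledge graph database management system with a unified model and language: (1) based on the relational model, it proposes a unified storage scheme that supports efficient storage of both RDF graphs and property graphs and meets the requirements of knowledge graph data storage and query workloads; (2) it uses a clustering method based on characteristic sets to solve the problem of storing untyped triples; (3) it achieves interoperability between the two knowledge graph query languages SPARQL and Cypher, so that both can operate on the same knowledge graph. Extensive experiments on real-world and synthetic datasets show that, compared with existing knowledge graph database management systems, KGDB not only provides more efficient storage management but also achieves higher query efficiency. On average, KGDB saves 30% of storage space compared with gStore and Neo4j. Experiments on basic graph pattern queries show that its query speed on real-world datasets is generally higher than that of gStore and Neo4j, with a speedup of up to two orders of magnitude.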
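The abstract does not spell out how the characteristic-set clustering in point (2) works. As a rough illustration only, the sketch below assumes the common definition of a characteristic set (the set of predicates attached to a subject) and groups subjects that share the same set, so that each group could be stored in one relational table. The function name and the toy triples are hypothetical and are not taken from KGDB.

```python
from collections import defaultdict


def characteristic_sets(triples):
    """Group subjects by their characteristic set (assumed here to be
    the set of predicates that occur with each subject)."""
    predicates_by_subject = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates_by_subject[subject].add(predicate)

    clusters = defaultdict(list)
    for subject, predicates in predicates_by_subject.items():
        # Subjects with identical predicate sets land in the same cluster.
        clusters[frozenset(predicates)].append(subject)
    return {cs: sorted(subjects) for cs, subjects in clusters.items()}


# Hypothetical untyped triples: no rdf:type statements are present.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("bob", "knows", "alice"),
    ("doc1", "title", "T"),
    ("doc1", "year", "Y"),
]

clusters = characteristic_sets(triples)
for cs, subjects in sorted(clusters.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(cs), "->", subjects)
```

In this toy input, `alice` and `bob` share the predicates `knows` and `name` and so fall into one cluster, while `doc1` forms its own. A real system would presumably also merge near-identical sets to limit the number of tables; that step is omitted here.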

6.
(DNA) computing by carving   Cited by: 1 (self-citations: 0, citations by others: 1)
Inspired by the experiments reported recently in the emerging area of DNA computing, we consider a somewhat unusual type of computation strategy: generate a (large) set of candidate solutions of a problem, then remove the non-solutions such that what remains is the set of solutions. We call this a computation by carving. This leads both to a speculation with possibly important consequences and to interesting theoretical computer science (formal language) questions. The speculation is that in this way we can “compute” non-recursively enumerable languages, because the family of recursively enumerable languages is not closed under complementation. The formal language theory questions concern sequences of languages with certain regularities, needed as languages to be extracted from the total language of candidate solutions of a problem. Specifically, we consider sequences of languages obtained by starting from a given regular language and iteratively applying to it a given finite state sequential transducer (a gsm). Computing by carving with respect to such a sequence of languages can identify all context-sensitive languages and can also lead to non-recursively enumerable languages (but not all recursively enumerable languages can be obtained in this way). In practical circumstances, the carving process should be finite, hence, in general, approximations of the desired language are obtained. We also briefly discuss this aspect.
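The carving strategy described above can be illustrated with a finite approximation. The sketch below is a simplification under stated assumptions: the transducer is modelled as a deterministic, length-nondecreasing Python function (a one-state gsm that doubles each letter), and the candidate set is bounded by a maximum word length. The names `carve` and `double` are hypothetical.

```python
def carve(candidates, start_language, transducer, max_len):
    """Remove from `candidates` every word in L0, g(L0), g(g(L0)), ...

    Assumes `transducer` never shortens a word, so any word longer than
    `max_len` can be dropped without losing later short images.
    """
    remaining = set(candidates)
    current = set(start_language)
    seen = set()
    while current:
        remaining -= current          # carve out the current language
        seen |= current
        images = {transducer(word) for word in current}
        current = {w for w in images if len(w) <= max_len and w not in seen}
    return remaining


def double(word):
    """A one-state gsm (a morphism): rewrite each letter x as xx."""
    return "".join(ch * 2 for ch in word)


max_len = 10
candidates = {"a" * n for n in range(max_len + 1)}   # a^0 .. a^10
remaining = carve(candidates, {"a"}, double, max_len)
print(sorted(len(w) for w in remaining))
```

Starting from L0 = {a}, the carved languages are {a}, {aa}, {aaaa}, {a^8}, so what remains are the words whose length is not a power of two. The length bound is what makes the process finite, matching the abstract's remark that in practice only approximations of the desired language are obtained.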

7.
We describe methods for improving the performance of statistical machine translation (SMT) between four linguistically different languages, i.e., Chinese, English, Japanese, and Korean, by using morphosyntactic knowledge. For the purpose of reducing the translation ambiguities and generating grammatically correct and fluent translation output, we address the use of shallow linguistic knowledge, that is: (1) enriching a word with its morphosyntactic features, (2) obtaining shallow linguistically-motivated phrase pairs, (3) iteratively refining word alignment using filtered phrase pairs, and (4) building a language model from morphosyntactically enriched words. Previous studies reported that the introduction of syntactic features into SMT models resulted in only a slight improvement in performance in spite of the heavy computational expense; however, this study demonstrates the effectiveness of morphosyntactic features when reliable, discriminative features are used. Our experimental results show that word representations that incorporate morphosyntactic features significantly improve the performance of the translation model and language model. Moreover, we show that refining the word alignment using fine-grained phrase pairs is effective in improving system performance.

8.
In this paper we consider networks of evolutionary processors as language generating and computational devices. When the filters are regular languages one gets the computational power of Turing machines with networks of size at most six, depending on the underlying graph. When the filters are defined by random context conditions, we obtain an incomparability result with the families of regular and context-free languages. Despite their simplicity, we show how the latter networks might be used for solving an NP-complete problem, namely the “3-colorability problem”, in linear time and linear resources (nodes, symbols, rules). Received: 26 September 2002 / 22 January 2003. Work supported by the Generalitat de Catalunya, Direcció General de Recerca (PIV2001-50).

9.
Fundamentals of control flow in workflows   Cited by: 2 (self-citations: 0, citations by others: 2)
Although workflow management emerged as a research area well over a decade ago, little consensus has been reached as to what should be essential ingredients of a workflow specification language. As a result, the market is flooded with workflow management systems, based on different paradigms and using a large variety of concepts. The goal of this paper is to establish a formal foundation for control-flow aspects of workflow specification languages that assists in understanding fundamental properties of such languages, in particular their expressive power. Workflow languages can be fully characterized in terms of the evaluation strategy they use, the concepts they support, and the syntactic restrictions they impose. A number of results pertaining to this classification will be proven. This should not only aid those developing workflow specifications in practice, but also those developing new workflow engines. Received 16 January 2001 / 13 November 2002. This research is supported by an ARC SPIRT grant “Component System Architecture for an Open Distributed Enterprise Management System with Configurable Workflow Support” between QUT and Mincom.

10.
Previous work on semantics-based multi-stage programming (MSP) language design focused on homogeneous designs, where the generating and the generated languages are the same. Homogeneous designs simply add a hygienic quasi-quotation and evaluation mechanism to a base language. An apparent disadvantage of this approach is that the programmer is bound to both the expressivity and performance characteristics of the base language. This paper proposes a practical means to avoid this by providing specialized translations from subsets of the base language to different target languages. This approach preserves the homogeneous “look” of multi-stage programs, and, more importantly, the static guarantees about the generated code. In addition, compared to an explicitly heterogeneous approach, it promotes reuse of generator source code and systematic exploration of the performance characteristics of the target languages. To illustrate the proposed approach, we design and implement a translation to a subset of C suitable for numerical computation, and show that it preserves static typing. The translation is implemented, and evaluated with several benchmarks. The implementation is available in the online distribution of MetaOCaml.

11.
Recently there has been a great deal of interest in higher-order syntax which seeks to extend standard initial algebra semantics to cover languages with variable binding. The canonical example studied in the literature is that of the untyped λ-calculus which is handled as an instance of the general theory of binding algebras, cf. Fiore et al. [13]. Another important syntactic construction is that of explicit substitutions which are used to model local definitions and to implement reduction in the λ-calculus. The syntax of a language with explicit substitutions does not form a binding algebra as an explicit substitution may bind an arbitrary number of variables. Thus explicit substitutions are a natural test case for the further development of the theory and applications of syntax with variable binding. This paper shows that a language containing explicit substitutions and a first-order signature Σ is naturally modelled as the initial algebra of the $\mathrm{Id} + F_\Sigma \circ \_ + \_ \circ \_$ endofunctor. We derive a similar formula for adding explicit substitutions to the untyped λ-calculus and then show these initial algebras provide useful datatypes for manipulating abstract syntax by implementing two reduction machines. We also comment on the apparent lack of modularity in syntax with variable binding as compared to first-order languages.

12.
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word English-Swahili parallel corpus. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.

13.
Combinatorial property testing deals with the following relaxation of decision problems: Given a fixed property and an input x, one wants to decide whether x satisfies the property or is “far” from satisfying it. The main focus of property testing is in identifying large families of properties that can be tested with a certain number of queries to the input. In this paper we study the relation between the space complexity of a language and its query complexity. Our main result is that for any space complexity $s(n) \le \log n$ there is a language with space complexity $O(s(n))$ and query complexity $2^{\Omega(s(n))}$. Our result has implications with respect to testing languages accepted by certain restricted machines. Alon et al. [FOCS 1999] have shown that any regular language is testable with a constant number of queries. It is well known that any language in space o(log log n) is regular, thus implying that such languages can be so tested. It was previously known that there are languages in space O(log n) that are not testable with a constant number of queries and Newman [FOCS 2000] raised the question of closing the exponential gap between these two results. A special case of our main result resolves this problem as it implies that there is a language in space O(log log n) that is not testable with a constant number of queries. It was also previously known that the class of testable properties cannot be extended to all context-free languages. We further show that one cannot even extend the family of testable languages to the class of languages accepted by single counter machines.

14.
In speech recognition research, because of the variety of languages, corresponding speech recognition systems need to be constructed for different languages. Especially in a dialect speech recognition system, there are many special words and oral language features. In addition, dialect speech data is very scarce. Therefore, constructing a dialect speech recognition system is difficult. This paper constructs a speech recognition system for Sichuan dialect by combining a hidden Markov model (HMM) and a deep long short-term memory (LSTM) network. Using the HMM-LSTM architecture, we created a Sichuan dialect dataset and implemented a speech recognition system for this dataset. Compared with the deep neural network (DNN), the LSTM network can overcome the problem that the DNN only captures the context of a fixed number of information items. Moreover, to identify polyphone and special pronunciation vocabularies in Sichuan dialect accurately, we collect all the characters in the dataset and their common phoneme sequences to form a lexicon. Finally, this system yields an 11.34% character error rate on the Sichuan dialect evaluation dataset. As far as we know, it is the best performance for this corpus at present.
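The abstract reports a character error rate (CER) without defining it. The sketch below assumes the standard definition: the character-level Levenshtein distance between each reference and hypothesis, summed over the evaluation set and divided by the total number of reference characters. The function names and sample strings are hypothetical, and this is not the authors' scoring script.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance over characters (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """CER over (reference, hypothesis) pairs, assuming corpus-level pooling."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total


# Hypothetical transcripts: one substitution and one deletion in 7 characters.
pairs = [("abcd", "abed"), ("xyz", "xy")]
print(f"CER = {character_error_rate(pairs):.2%}")
```

Pooling errors over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the score. Which convention the paper uses is not stated in the abstract.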

15.
Automatic lemmatization is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma (base form) to each word in a running text is not trivial, since, for instance, nouns inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, since word-forms cannot be matched against a morphological lexicon. This paper discusses a machine learning approach to the automatic lemmatization of unknown words in Slovene texts. We decompose the problem of learning to perform lemmatization into two subproblems: learning to perform morphosyntactic tagging of words in a text, and learning to perform morphological analysis, which produces the lemma from the word-form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn morphosyntactic tagging and a first-order decision list learning system is used to learn rules for morphological analysis. We train the tagger on a manually annotated corpus consisting of 100,000 running words. We train the analyzer on open-class inflecting Slovene words, namely nouns, adjectives, and main verbs, together being characterized by more than 400 different morphosyntactic tags. The training set for the analyzer consists of a morphological lexicon containing 15,000 lemmas. We evaluate the learned model on word lists extracted from a corpus of Slovene texts containing 500,000 words, and show that our morphological analysis module achieves 98.6% accuracy, while the combination of the tagger and analyzer is 92.0% accurate on unknown inflecting Slovene words.

16.
Substructural type systems are designed from the insight inspired by the development of linear and substructural logics. Substructural type systems promise to control the usage of computational resources statically, and thus detect more program errors at an early stage than traditional type systems do. In the past decade, substructural type systems have been deployed in the design of novel programming languages, such as Vault. This paper presents a general typing theory for substructural type systems. First, we define a universal semantic framework for substructural types by interpreting them as characteristic intervals composed of type qualifiers. Based on this framework, we present the design of a substructural calculus λSL with subtyping relations. After giving syntax, typing rules and operational semantics for λSL, we prove the type safety theorem. The new calculus λSL can guarantee many more safety invariants than traditional lambda calculus, which is demonstrated by showing that the λSL calculus can serve as an idealized typed intermediate language, and defining a type-preserving translation from ordinary typed lambda calculus into λSL.

17.
We define three operations on strings and languages suggested by the process of gene assembly in hypotrichous ciliates. This process is considered to be a prime example of DNA computing in vivo. This paper is devoted to some computational aspects of these operations from a formal language point of view. The closure of the classes of regular and context-free languages under these operations is settled. Then, we consider the ld-macronuclear language of a given language L, which consists of all ld-macronuclear strings obtained from the strings of L by iteratively applying the loop-direct repeat-excision. Finally, we discuss some open problems and further directions of research. Rudolf Freund: He received his master's and doctoral degrees in computer science from the Vienna University of Technology, Austria, in 1980 and 1982, respectively. In 1986, he received his master's degree in mathematics and physics from the University of Vienna, Austria. In 1988 he joined the Vienna University of Technology in Austria, where he became an Associate Professor in September 1995. He has given various lectures in theoretical computer science, especially on formal languages and automata. His research interests include array and graph grammars, regulated rewriting, infinite words, syntactic pattern recognition, neural networks, and especially models and systems for biological computing. In these fields he is the author of more than sixty scientific papers. Carlos Martín-Vide: He is Professor and Head of the Research Group on Mathematical Linguistics at Rovira i Virgili University, Tarragona, Spain. His specialities are formal language theory and mathematical linguistics. His most recent edited volume is Where Mathematics, Computer Science, Linguistics and Biology Meet (Kluwer, 2001, with V. Mitrana). He has published 150 papers in conference proceedings and journals such as Acta Informatica, BioSystems, Computational Linguistics, Computers and Artificial Intelligence, Information Processing Letters, Information Sciences, International Journal of Computer Mathematics, New Generation Computing, Publicationes Mathematicae Debrecen, and Theoretical Computer Science. He is the editor-in-chief of the journal Grammars (Kluwer), and the chairman of the 1st International PhD School in Formal Languages and Applications (2001–2003). Victor Mitrana, Ph.D.: He is Professor of Computer Science at the Faculty of Mathematics, University of Bucharest. He received his MSc and PhD from the University of Bucharest in 1986 and 1993, respectively. In 1999 he was awarded the “Gheorghe Lazar” Prize for Mathematics of the Romanian Academy. His research interests include formal language theory and applications, combinatorics on words, computational models inspired by biology, and mathematical linguistics. In these areas, he has published three books and more than 100 papers, and has edited two books. He is an associate editor of “The Korean Journal of Computational and Applied Mathematics” and an editor of “Journal of Universal Computer Science”.

18.

Speech provides a natural way for human–computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level in terms of resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 h of speech from a single speaker, on which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a MOS of 4.03. The obtained results are comparable to related works covering the English language and the state of the art in European Portuguese.


19.
It is well known that every boundary (linear) eNCE graph language that is connected and degree-bounded by a constant is in LOGCFL (NLOG). The present paper proves an upper bound on the complexity of boundary (linear) eNCE graph languages whose component and degree bounds are functions in the size of the graph. As a corollary of this analysis, the known LOGCFL and NLOG results stated above hold even if “connected” is replaced by “component-bounded by log n.” Received: 15 January 1997 / 17 January 2001

20.
Visibly pushdown languages are an interesting subclass of deterministic context-free languages that can model nonregular properties of interest in program analysis. This class properly contains typical classes of parenthesized languages such as “parenthesis”, “bracketed”, “balanced” and “input-driven” languages. It is closed under boolean operations and has decidable decision problems such as emptiness, inclusion and universality. We study the membership problem for visibly pushdown languages, and show that it can be solved in time linear in both the size of the input grammar and the length of the input word. The algorithm relies on a reduction to the reachability problem for game graphs. We also discuss the time complexity of the membership problem for the class of balanced languages, which is the largest among those cited above. Besides the intrinsic theoretical interest, we further motivate our main result by showing an application to the validation of XML documents against Schema and Document Type Definitions (DTDs). Work partially supported by funds for the research from MIUR 2006, grant “Metodi Formali per la verifica di sistemi chiusi ed aperti”, Università di Salerno. A preliminary version of this paper was published in the Proceedings of the 4th International Symposium Automated Technology for Verification and Analysis (ATVA 2006), Lecture Notes in Computer Science 4218, pp. 96–109, 2006.
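To make the “visibly pushdown” idea concrete, the sketch below checks membership in one very small visibly pushdown language: well-matched words over two bracket pairs with arbitrary internal symbols. It is not the grammar-based algorithm of the paper, which reduces membership to reachability in game graphs. It only illustrates that the input symbol alone decides whether the stack is pushed, popped, or left untouched, which gives a single left-to-right pass. The alphabet partition and the function name are assumptions made for this example.

```python
# Assumed alphabet partition: call symbols push, return symbols pop,
# and every other symbol is internal (no stack action).
CALLS = {"(": ")", "[": "]"}
RETURNS = set(CALLS.values())


def accepts_well_matched(word):
    """Return True if every call symbol is closed by its matching return."""
    stack = []
    for symbol in word:
        if symbol in CALLS:
            stack.append(CALLS[symbol])       # push the expected return symbol
        elif symbol in RETURNS:
            if not stack or stack.pop() != symbol:
                return False                   # unmatched or mismatched return
        # internal symbols: the stack is left unchanged
    return not stack                           # no pending calls may remain


for sample in ["(a[b])c", "(]", "((", ""]:
    print(repr(sample), accepts_well_matched(sample))
```

Each symbol is processed once with a constant number of stack operations, so the check runs in time linear in the length of the word. XML validation has the same shape, with opening tags as calls and closing tags as returns.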
