Similar Documents
20 similar documents found (search time: 31 ms)
1.
In this paper, we present a method of human-computer interaction (HCI) through 3D air-writing. Our proposed method offers a natural way of interacting without pen and paper. Online text is drawn in the air by 3D fingertip gestures within the field of view of a Leap Motion sensor. The text consists of a single continuous stroke, so the gaps that normally separate adjacent words are usually absent; this makes the system quite different from conventional 2D writing with pen and paper. We have collected a dataset comprising 320 Latin sentences and use a heuristic to segment 3D words from sentences. We first present a methodology to segment continuous 3D strokes into lines of text by finding large gaps between the end of one line and the start of the next; this is followed by segmentation of the text lines into words. In the next phase, a Hidden Markov Model (HMM) based classifier recognizes the 3D sequences of segmented words. We use both dynamic and simple features for classification. We recorded an overall accuracy of 80.3% in word segmentation, and recognition accuracies of 92.73% and 90.24% when testing with dynamic and simple features, respectively. The results show that the Leap Motion device can be a low-cost yet useful solution for inputting text naturally compared with conventional systems. In the future, the system may be extended to work on cluttered gestures.
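As a rough illustration of the gap-based segmentation step, here is a minimal Python sketch that splits a continuous 3D stroke wherever the jump between consecutive fingertip samples is unusually large; the threshold value and data layout are assumptions for illustration, not the paper's actual parameters.

```python
import numpy as np

def segment_stroke(points, gap_threshold=40.0):
    """Split a continuous 3D stroke into word segments.

    points: (N, 3) array of fingertip positions from the sensor.
    gap_threshold: jump distance (sensor units) treated as a word
                   boundary; the value here is illustrative.
    """
    jumps = np.linalg.norm(np.diff(points, axis=0), axis=1)
    boundaries = np.where(jumps > gap_threshold)[0] + 1
    return np.split(points, boundaries)

# Toy usage: two clusters of samples separated by one large jump.
stroke = np.array([[0, 0, 0], [1, 0, 0], [2, 1, 0],
                   [80, 0, 0], [81, 1, 0]], dtype=float)
print(len(segment_stroke(stroke)))  # -> 2
```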

2.
3.
Ancient Chinese is dominated by monosyllabic words, among which polysemy is especially prominent, posing a real challenge for modern readers trying to understand classical texts. To support the analysis and discrimination of word senses in ancient Chinese, this study designs sense-division principles for polysemous ancient Chinese words based on the linguistic facts reflected in traditional dictionaries and corpora, organizes sense-level knowledge for commonly used monosyllabic words, and annotates corpora containing polysemous words accordingly. The current corpus contains 38,700 annotated items covering more than 1.176 million characters, enriching the language resources available for ancient Chinese. Experiments show that a word sense discrimination algorithm based on this corpus and the BERT language model reaches an accuracy of about 80%. Taking diachronic sense evolution analysis and sense-family induction as case studies, the paper further explores applications of the corpus and word sense disambiguation techniques in fields such as linguistic research and lexicography.
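To make the BERT-based sense discrimination concrete, here is a hedged Python sketch using the Hugging Face transformers library. The bert-base-chinese checkpoint and the nearest-prototype classifier are my assumptions for illustration, not necessarily the authors' setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the paper does not specify this model or classifier.
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def char_embedding(sentence, char_index):
    """Contextual embedding of one character in a classical sentence.
    Assumes one token per character, which holds for most Chinese
    text under this tokenizer."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    return hidden[char_index + 1]  # +1 skips the [CLS] token

def predict_sense(sentence, char_index, sense_prototypes):
    """Pick the sense whose prototype vector (e.g., the mean embedding
    of annotated examples) is most similar to this context."""
    v = char_embedding(sentence, char_index)
    scores = {sense: torch.cosine_similarity(v, p, dim=0).item()
              for sense, p in sense_prototypes.items()}
    return max(scores, key=scores.get)
```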

4.
As fundamental language databases and knowledge bases, corpora are the foundation on which various natural language processing methods are built. With the wide application of statistical methods in NLP, corpus construction has become an important research topic. Automatic word segmentation is an indispensable, fundamental step in syntactic analysis, and its performance directly affects parsing. Based on a statistical analysis of 850 KB of Tibetan text and a study of the distributional and grammatical properties of Tibetan words, this paper introduces a model for a dictionary-based Tibetan word segmentation system and presents the structure of the segmentation dictionary, a case-marker-based chunking algorithm, and a restoration algorithm. The system lays a foundation for research on Tibetan input methods, electronic Tibetan dictionaries, Tibetan character and word frequency statistics, search engine design and implementation, machine translation, network information security, Tibetan corpus construction, and Tibetan semantic analysis.
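The paper's segmenter additionally involves case-marker chunking and restoration, which the following sketch omits; this is only a generic forward-maximum-matching baseline for dictionary-based segmentation, shown in Python as a point of reference.

```python
def fmm_segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character/syllable.
    max_len is an illustrative cap on word length."""
    words, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + L] in dictionary or L == 1:
                words.append(text[i:i + L])
                i += L
                break
    return words
```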

5.
Research on Identifying Transliterated Foreign Names Using an Improved Word Segmentation Method   Cited by: 2 (self-citations: 0, external citations: 2)
This paper first introduces the language model underlying dictionary-based word segmentation algorithms and one such algorithm: maximum word frequency segmentation. It analyzes this language model and identifies why it cannot handle out-of-vocabulary words. To address this, the paper proposes introducing a dynamic dictionary, combining the maximum word frequency algorithm with a local frequency method to identify transliterated foreign names among out-of-vocabulary words. Finally, a system implementation is presented.
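A minimal sketch of maximum word frequency segmentation: dynamic programming picks the segmentation whose words have the highest product of dictionary frequencies. The frequency floor for unknown single characters and the 8-character word-length cap are illustrative assumptions; the paper's dynamic-dictionary and local-frequency steps are not shown.

```python
import math

def max_freq_segment(text, freq):
    """Maximize the sum of log word frequencies (equivalently, the
    product of frequencies). Unknown single characters get a count
    of 1 so every position stays reachable."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best log-score up to position i
    back = [0] * (n + 1)             # start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):   # assumed max word length 8
            w = text[j:i]
            c = freq.get(w, 1 if i - j == 1 else 0)
            if c > 0 and best[j] + math.log(c) > best[i]:
                best[i], back[i] = best[j] + math.log(c), j
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

A dynamic dictionary, in this picture, would simply add newly detected transliterated-name fragments to `freq` so that later sentences segment them as single units.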

6.
This paper presents a new method for assisting individuals with no background in linguistics in creating monolingual dictionaries such as those used by the morphological analysers of many natural language processing applications. The involvement of non-expert users is especially critical for under-resourced languages, which either lack a skilled workforce or cannot afford to recruit one. Adding a word to a morphological dictionary usually requires identifying its stem along with the inflection paradigm that can be used to generate all the word forms of the new entry. Our method works under the assumption that average speakers of a language can successfully answer the polar question "is x a valid form of the word w to be inserted?", where x represents tentative alternative (inflected) forms of the new word w. The experiments show that with a small number of polar questions the correct stem and paradigm can be obtained from non-experts with high success rates. We study the impact of different heuristic and probabilistic approaches on the actual number of questions.
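The elimination loop behind the polar questions can be sketched as follows; the paradigm format, the alphabetical probe choice, and the toy suffix data are my assumptions (the paper studies smarter heuristic and probabilistic probe selection).

```python
def forms(stem, suffixes):
    return {stem + s for s in suffixes}

def infer_entry(word, paradigms, ask):
    """paradigms: {name: list of suffixes, suffixes[0] = citation suffix}.
    ask(form) -> bool answers "is `form` a valid form of the new word?".
    Returns the surviving (stem, paradigm_name) pair."""
    cands = [(word[:len(word) - len(s[0])], name)
             for name, s in paradigms.items() if word.endswith(s[0])]
    while len(cands) > 1:
        sets = {c: forms(c[0], paradigms[c[1]]) for c in cands}
        # Probe a form that some candidates generate and others do not.
        diff = set.union(*sets.values()) - set.intersection(*sets.values())
        if not diff:                 # candidates generate identical forms
            break
        probe = sorted(diff)[0]      # simplest choice of question
        answer = ask(probe)
        cands = [c for c in cands if (probe in sets[c]) == answer]
    return cands[0] if cands else None

# Toy usage with two invented paradigms sharing a citation suffix:
paradigms = {"decl-a": ["a", "as", "ai"], "decl-o": ["a", "os", "oi"]}
valid = {"mara", "maras", "marai"}
print(infer_entry("mara", paradigms, ask=lambda f: f in valid))
# -> ('mar', 'decl-a')
```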

7.
Stemming is the process of reducing a derived or inflected word to its root or stem by stripping all its affixes. It has been used as a pre-processing step in applications such as information retrieval, machine translation, and text summarization to increase efficiency. Stemming algorithms have been developed for languages such as English, Arabic, Turkish, Malay, and Amharic, but until now no algorithm has existed for stemming text in Hausa, a Chadic language spoken in West Africa. To address this need, we propose stemming Hausa text using affix-stripping rules and reference lookup. We stemmed Hausa text using 78 affix-stripping rules applied in 4 steps and a reference look-up consisting of 1500 Hausa root words. The over-stemming index, under-stemming index, stemmer weight, word stemmed factor, correctly stemmed words factor, and average words conflation factor were calculated to determine the effect of reference look-up on the strength and accuracy of the stemmer. We observed that reference look-up helped reduce both over-stemming and under-stemming errors, increased accuracy, and tends to reduce the strength of an affix-stripping stemmer. The rationale behind the approach is discussed and directions for future research are identified.
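A stripped-down sketch of affix stripping with reference look-up; the rules and root list below are invented placeholders, not the paper's 78 Hausa rules or its 1500-word list, and the real stemmer applies its rules in 4 ordered steps.

```python
# Placeholder data, clearly NOT the paper's actual rules or roots.
SUFFIX_RULES = ["anci", "ce", "ta", "n"]      # tried longest-first
REFERENCE_ROOTS = {"gida", "yaro", "tafi"}    # known root words

def stem(word):
    """Strip a suffix, but accept the result only if the reference
    look-up confirms it: this check is what curbs over-stemming."""
    if word in REFERENCE_ROOTS:
        return word
    for suf in sorted(SUFFIX_RULES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            cand = word[:-len(suf)]
            if cand in REFERENCE_ROOTS:
                return cand
    return word

print(stem("gidan"))  # -> "gida"
```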

8.
Motivated by the need to automatically index and analyse the huge number of documents in Ottoman divan poetry, and to discover new knowledge that preserves this heritage and keeps it alive, we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without prior knowledge of the words. Exploiting the fact that divans exist in multiple copies (versions) written by different writers in different styles, and that word segmentation in some versions may be considerably easier than in others, we segment the difficult versions (which may be impossible to handle with traditional techniques) using information carried over from a simpler version. One version of a document serves as the source dataset and another version of the same document as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset, thereby detecting word boundaries. We present this idea of cross-document word matching for the novel task of segmenting historical documents into words, propose a matching scheme based on possible combinations of sub-word sequences, and improve the performance of simple features by considering words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising for finding word boundaries and for retrieving words across documents.

9.
As was shown earlier, for a linear differential-algebraic system A₁y′ + A₀y = 0 with a selected part of the unknowns (entries of a column vector y), it is possible to construct a differential system ỹ′ = Bỹ, where the column vector ỹ is formed by some entries of y, together with a linear algebraic system by means of which the selected entries not contained in ỹ can be expressed in terms of the selected entries included in ỹ. In this paper, the sizes of the differential and algebraic systems obtained are studied. Conditions are established under which the size of the algebraic system is determined unambiguously and the size of the differential system is minimal.
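A tiny worked example (mine, not taken from the paper) fixes the idea:

```latex
% A 2x2 illustration of the reduction described above:
% the system A_1 y' + A_0 y = 0 with
\[
A_1=\begin{pmatrix}1&0\\0&0\end{pmatrix},\qquad
A_0=\begin{pmatrix}1&0\\-1&1\end{pmatrix},\qquad
y=\begin{pmatrix}y_1\\y_2\end{pmatrix},
\]
% i.e. the equations  y_1' + y_1 = 0  and  -y_1 + y_2 = 0.
% Choosing \tilde{y} = (y_1) gives the differential system
\[
\tilde{y}' = B\tilde{y},\qquad B=(-1),
\]
% while the algebraic system expresses the remaining entry:
\[
y_2 = y_1.
\]
% Here the algebraic system has exactly one equation and the
% differential system has the minimal size, one.
```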

10.
The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. We propose that Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics, has great potential for bootstrapping NLP tools for resource-poor languages. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human readable form. In this paper, we describe the expansion of the ODIN resource—a database containing many thousands of instances of IGT for over a thousand languages. We enrich the original IGT data by adding word alignment and syntactic structure. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we adopt and extend a new XML format for IGT, called Xigt. We also develop two packages for manipulating IGT data: one, INTENT, enriches raw IGT automatically, and the other, XigtEdit, is a graphical IGT editor.
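For readers unfamiliar with IGT, here is a minimal illustration of its tier structure and the positional word alignment that enrichment builds on. The glossed sentence is a standard textbook-style toy, not an ODIN record, and the dict layout below is my own shorthand, not the Xigt format.

```python
# One IGT instance: language line, morpheme glosses, free translation.
igt = {
    "lang":  ["ni-c-chihui-lia", "in", "no-piltzin", "ce", "calli"],
    "gloss": ["1SG-3SG-make-APPL", "DET", "1SG-son", "one", "house"],
    "trans": "I make my son a house",
}

def word_alignment(igt):
    """Language and gloss tiers align one-to-one by position; this
    implicit alignment is what makes structural projection possible."""
    return list(zip(igt["lang"], igt["gloss"]))

for word, gloss in word_alignment(igt):
    print(f"{word}\t{gloss}")
```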

11.
In the literature on logics of imperfect information it is often stated, incorrectly, that the Game-Theoretical Semantics of Independence-Friendly (IF) quantifiers captures the idea that the players of semantical games are forced to make some moves without knowledge of the moves of other players. We survey here the alternative semantics for IF logic that have been suggested in order to enforce this “epistemic reading” of sentences. We introduce some new proposals, and a more general logical language which distinguishes between “independence from actions” and “independence from strategies”. New semantics for IF logic can be obtained by choosing embeddings of the set of IF sentences into this larger language. We compare all the semantics proposed and their purported game-theoretical justifications, and disprove a few claims that have been made in the literature.

12.
Currently, the most popular Chinese word segmentation methods are machine learning approaches based on statistical models. Statistical methods are generally trained on manually annotated sentence-level corpora, but they often ignore the manually compiled dictionary information accumulated over many years. This information is especially valuable in cross-domain settings, where sentence-level annotated resources for the target domain are scarce. How to exploit dictionary information fully and effectively in statistical models is therefore a problem well worth studying. Recent work on this question falls roughly into two categories according to how the dictionary information is incorporated: one incorporates dictionary features into character-based sequence labeling models, and the other incorporates such features into word-based beam search models. This paper compares the two approaches and further combines them. Experiments show that after combining the two, dictionary information is exploited more fully, yielding better performance in both in-domain and cross-domain tests.
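The first category, dictionary features in a character-based tagger, can be sketched as follows; the feature template is a simplified illustration, not the paper's.

```python
def dict_features(sentence, i, dictionary, max_len=4):
    """Dictionary features for character i: for every dictionary word
    covering position i, record whether this character begins (B),
    ends (E), or is inside (M) the word, plus the word length.
    These strings would feed a CRF alongside character n-gram features."""
    feats = []
    n = len(sentence)
    for L in range(2, max_len + 1):
        for start in range(max(0, i - L + 1), min(i + 1, n - L + 1)):
            if sentence[start:start + L] in dictionary:
                pos = ("B" if i == start
                       else "E" if i == start + L - 1 else "M")
                feats.append(f"dict_{pos}_{L}")
    return feats
```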

13.
Chinese word segmentation systems trained on annotated news-domain corpora suffer a marked performance drop when applied across domains. Since large-scale annotated corpora for a target domain are hard to obtain, this paper proposes a domain adaptation method combining active learning with n-gram statistical features. The method statistically analyzes the differences between target-domain text and the existing annotated corpus, selects for manual annotation a small corpus containing the largest number of previously unseen linguistic phenomena, and then trains a target-domain segmenter using n-gram statistics drawn from large-scale raw text. Using a CRF model, the effectiveness of the method is verified on a scientific-literature domain of one million sentences, with a manually annotated 300-sentence scientific-literature corpus as the evaluation set. Experimental results show that on this test corpus, the segmenter trained with active learning improves on all evaluation metrics.
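A hedged sketch of the selection step: rank unlabeled sentences by how many of their n-grams never occur in the labeled data, and annotate the highest-ranked ones first. The scoring function is a simplified stand-in for the paper's criterion.

```python
def select_for_annotation(unlabeled, labeled_ngrams, k=100, n=2):
    """Pick the k unlabeled sentences containing the most character
    n-grams unseen in the already-labeled corpus.

    unlabeled: list of sentence strings from the target domain.
    labeled_ngrams: set of n-grams extracted from the labeled corpus.
    """
    def unseen_count(sent):
        grams = {sent[i:i + n] for i in range(len(sent) - n + 1)}
        return sum(1 for g in grams if g not in labeled_ngrams)
    return sorted(unlabeled, key=unseen_count, reverse=True)[:k]
```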

14.
Adverbial constructions such as v gotovom vide and na vsyakii sluchai, which are an intrinsic resource of the Russian language alongside secondary prepositions of the v vide and na sluchai type, are discussed. The availability and linguistic features of these constructions show that grammaticalization in modern Russian covers not only individual units but also word combinations.

15.
We present a method for measuring similarity between source codes. We approach this task from a machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of latent semantic analysis to this task. To evaluate the proposed approach we developed a corpus of around 10,000 source codes written in the Karel programming language, solving 100 different tasks. The results show that the highest classification accuracy is achieved when using a Support Vector Machines classifier, applying latent semantic analysis, and selecting word trigrams as features.
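The winning configuration maps naturally onto a scikit-learn pipeline; the sketch below is an assumed reconstruction with illustrative hyperparameters, not the authors' code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Word trigrams -> truncated SVD (a standard way to apply LSA) ->
# linear SVM. The component count and tokenization are assumptions.
clf = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(3, 3),
                    token_pattern=r"\S+"),
    TruncatedSVD(n_components=100),
    LinearSVC(),
)
# Usage: codes is a list of source-code strings, tasks the task id
# each one solves.
#   clf.fit(codes, tasks)
#   clf.predict(new_codes)
```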

16.
Existing emotion lexicons typically annotate emotion words with category and intensity, but cannot distinguish between a word's emotional expression and the emotional cognition it triggers. Moreover, annotating word entries directly often leaves the emotion labels ambiguous because the words themselves are semantically ambiguous. Building on an analysis of how individual emotions arise and shift, this paper establishes a text emotion computation framework based on "stimulus cognition and reflexive expression", analyzes the functions and properties of emotion-related words within this framework, and explores a new way of constructing emotion lexicons. First, word sense information from HowNet is introduced so that each word is split into multiple sense-specific entries for annotation, reducing ambiguity in the emotion labels. Second, a word's mode of emotional expression is distinguished from its emotional cognition result: emotion categories and intensities observed from these different perspectives are annotated separately, and the types of emotional expression and emotional cognition are further subclassified. The result is a preliminary new emotion lexicon with a clear framework, rich emotion information, and low ambiguity.
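One way to picture a sense-level entry with the expression/cognition split is the following Python structure; the field names and example values are my own illustration, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class EmotionEntry:
    word: str                                # surface form
    sense_id: str                            # HowNet sense, so that the
                                             # senses of one word separate
    expression: Optional[Tuple[str, float]]  # emotion the word expresses
    cognition: Optional[Tuple[str, float]]   # emotion the word evokes

# Illustrative entry: a word that expresses no emotion itself but
# tends to evoke one in the reader (category, intensity).
entry = EmotionEntry("噩耗", "bad-news-sense-1", None, ("sadness", 0.8))
```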

17.
A speech recognition system extracts the textual information present in speech. In the present work, a speaker-independent isolated word recognition system has been developed for Kannada, a south Indian language. For European languages such as English, a large amount of speech recognition research has been carried out, but significantly less work has been reported for Indian languages such as Kannada, and no standard speech corpus is readily available. In the present study, a speech database was developed by recording utterances of a regional Kannada news corpus from different speakers. The recognition system was implemented using the Hidden Markov Tool Kit. Two pronunciation dictionaries, one phone-based and one syllable-based, were built in order to design and evaluate phone-level and syllable-level sub-word acoustic models. Experiments were carried out and the results analysed while varying the number of Gaussian mixtures in each state of the monophone Hidden Markov Model (HMM); context-dependent triphone HMM models were also built for the same Kannada speech corpus and the recognition accuracies comparatively analysed. Mel-frequency cepstral coefficients, along with their first and second derivative coefficients, are used as feature vectors and are computed in acoustic front-end processing. Overall word recognition accuracies of 60.2% and 74.35% were obtained for the monophone and triphone models, respectively. The study shows a good improvement in the accuracy of the isolated-word Kannada speech recognition system when using triphone HMM models compared with monophone HMM models.
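The acoustic front end described here (13 MFCCs plus their deltas and delta-deltas, giving 39-dimensional frame vectors) is commonly computed as below; the paper uses the Hidden Markov Tool Kit, so this librosa version is only an assumed equivalent for illustration.

```python
import librosa
import numpy as np

def mfcc_features(path, n_mfcc=13):
    """39-dimensional frame vectors: MFCCs plus first and second
    derivatives. The 16 kHz sampling rate is an assumption."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T  # shape: (frames, 39)
```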

18.
As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both types of approaches. Through the use of a suffix tree and mutual information (MI) together with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training data are available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg outperforms the benchmarks in both precision and recall on the domain-specific corpus and achieves comparable performance on the general corpus.
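The MI-based token merging can be sketched as a single pass over a token stream: adjacent tokens with high pointwise mutual information are joined, and the joined strings become candidate new words for the dictionary. The threshold and one-pass design are illustrative simplifications of IASeg's procedure.

```python
import math
from collections import Counter

def merge_by_mi(tokens, threshold=3.0):
    """Join adjacent token pairs whose PMI exceeds a threshold."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def pmi(a, b):
        # bi[(a, b)] >= 1 whenever this pair is actually adjacent.
        return math.log((bi[(a, b)] * total) / (uni[a] * uni[b]))

    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pmi(tokens[i], tokens[i + 1]) > threshold:
            merged.append(tokens[i] + tokens[i + 1])  # candidate new word
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```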

19.
This paper introduces a word-segmented and part-of-speech-tagged corpus of archaic Chinese based on the text of the Huainanzi (《淮南子》), along with its construction process. The corpus was built by automatic word segmentation and POS tagging combined with manual correction; the automatic stage uses domain adaptation methods to optimize the tagging models, significantly improving performance on both segmentation and POS tagging. The paper analyzes the lexical characteristics of archaic Chinese and, on that basis, describes several explicit lexical-morphological features that are applied in our automatic segmentation and tagging, proving especially helpful to the POS tagging system. Errors arising in automatic segmentation and tagging are summarized and analyzed, and finally the lexical and POS distribution of the whole corpus is described. The proposed method was validated during the annotation of the Huainanzi and provides a reference for future extension to other classical Chinese resources. At the same time, the resulting Huainanzi corpus is itself a useful resource for future research on classical Chinese.

20.
This paper presents an agent-based model of the emergence and evolution of a language system for Boolean coordination. The model assumes the agents have cognitive capacities for invention, adoption, abstraction, repair, and adaptation, a common lexicon for basic concepts, and the ability to construct complex concepts using recursive combinations of basic concepts and logical operations such as negation, conjunction, or disjunction. It also supposes that the agents initially have neither a lexicon for logical operations nor the ability to express logical combinations of basic concepts through language. Our experiments show that a language system for Boolean coordination emerges through self-organisation of the agents' linguistic interactions, as the agents adapt their preferences for vocabulary, syntactic categories, and word order to those they observe being used most often by other agents. Such a language system allows the unambiguous communication of higher-order logic terms representing logical combinations of basic properties with non-trivial recursive structure, and our experiments indicate it can be reliably transmitted across generations. Furthermore, the conceptual and linguistic systems and the simplification and repair operations of the proposed agent-based model are more general than those defined in previous works: they allow simulating the emergence and evolution of a language system not only for the Boolean coordination of basic properties, but also for the Boolean coordination of higher-order logic terms of any Boolean type, which can represent the meaning of nouns, sentences, verbs, adjectives, adverbs, prepositions, prepositional phrases, and subexpressions not traditionally analysed as forming constituents, using linguistic devices such as syntactic categories, word order, and function words.
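As a loose intuition for the self-organisation mechanism only (not the paper's Boolean-coordination model), here is a bare-bones naming game in which agents align on whichever word they observe succeeding most often; all parameters are arbitrary.

```python
import random

AGENTS, CONCEPT, ROUNDS = 20, "AND", 2000
vocab = [{CONCEPT: []} for _ in range(AGENTS)]   # each agent's word list

for _ in range(ROUNDS):
    s, h = random.sample(range(AGENTS), 2)       # speaker, hearer
    if not vocab[s][CONCEPT]:                    # invention
        vocab[s][CONCEPT] = [f"w{random.randrange(10_000)}"]
    word = vocab[s][CONCEPT][0]
    if word in vocab[h][CONCEPT]:                # success: both align
        vocab[s][CONCEPT] = [word]
        vocab[h][CONCEPT] = [word]
    else:                                        # failure: hearer adopts
        vocab[h][CONCEPT].append(word)

# Typically converges to a single shared word for the concept.
print({w for v in vocab for w in v[CONCEPT]})
```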
