Similar Documents
 20 similar documents found (search time: 906 ms)
1.
2.
We examine various logics that combine knowledge, awareness, and change of awareness. An agent can become aware of propositional variables, but also of other agents or of herself. The dual operation to becoming aware, forgetting, can also be modelled. Our proposals are based on a novel notion of structural similarity that we call awareness bisimulation, the obvious notion of modal similarity for structures encoding knowledge and awareness.

3.
MatchSim: a novel similarity measure based on maximum neighborhood matching
Measuring object similarity in a graph is a fundamental data-mining problem in various application domains, including Web linkage mining, social network analysis, information retrieval, and recommender systems. In this paper, we focus on the neighbor-based approach, built on the intuition that "similar objects have similar neighbors", and propose a novel similarity measure called MatchSim. Our method recursively defines the similarity between two objects as the average similarity of the maximum-matched similar neighbor pairs between them. We show that MatchSim conforms to the basic intuition of similarity and can therefore overcome the counterintuitive contradiction in SimRank. Moreover, MatchSim can be viewed as an extension of the traditional neighbor-counting scheme that takes the similarities between neighbors into account, leading to higher flexibility. We present the MatchSim score computation process and prove its convergence. We also analyze its time and space complexity and suggest two accelerating techniques: (1) a simple pruning strategy and (2) an approximation algorithm for maximum matching computation. Experimental results on real-world datasets show that although our method is computationally less efficient, it outperforms classic methods in terms of accuracy.
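The core recursive step described above — score a pair of nodes by the best one-to-one pairing of their neighbors — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses brute-force enumeration for the maximum matching (the paper suggests an approximation algorithm for large graphs), and the toy graph and initial similarity matrix are invented for the example.

```python
from itertools import permutations

def matchsim_step(sim, neighbors):
    """One MatchSim iteration: sim(a,b) = average similarity of the
    maximum-weight matching between the neighbor sets of a and b."""
    new_sim = {}
    nodes = list(neighbors)
    for a in nodes:
        for b in nodes:
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            na, nb = neighbors[a], neighbors[b]
            if not na or not nb:
                new_sim[(a, b)] = 0.0
                continue
            # Brute-force maximum matching: try every injective
            # assignment of the smaller neighbor set into the larger.
            small, large = (na, nb) if len(na) <= len(nb) else (nb, na)
            best = max(
                sum(sim.get((u, v), sim.get((v, u), 0.0))
                    for u, v in zip(small, perm))
                for perm in permutations(large, len(small))
            )
            # Normalize by the larger neighborhood size.
            new_sim[(a, b)] = best / max(len(na), len(nb))
    return new_sim

# Toy graph: nodes a and b share the single neighbor x.
neighbors = {"a": ["x"], "b": ["x"], "x": ["a", "b"]}
sim0 = {(u, v): 1.0 if u == v else 0.0
        for u in neighbors for v in neighbors}
sim1 = matchsim_step(sim0, neighbors)
print(sim1[("a", "b")])  # shared neighbor x matches itself -> 1.0
```

After one step, `a` and `b` already score 1.0 because their entire neighborhoods match; iterating `matchsim_step` to a fixed point yields the converged scores the abstract refers to.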

4.
The K-Nearest Neighbor (K-NN) search problem is to find the K closest, most similar objects to a given query. K-NN search is essential for many applications, such as information retrieval and visualization, machine learning and data mining. The exponential growth of data makes approximate approaches to this problem necessary. Permutation-based indexing is one of the most recent techniques for approximate similarity search. Objects are represented by permutation lists ordering their distances to a set of selected reference objects, following the idea that two neighboring objects share the same surroundings. In this paper, we propose a novel quantized representation of permutation lists, with its related data structure, for effective retrieval on single-core and multicore architectures. Our novel permutation-based indexing strategy is built to be fast, memory-efficient and scalable. This is experimentally demonstrated in comparison to existing proposals using several large-scale datasets of millions of documents and of different dimensions.
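The basic permutation-list idea above can be sketched in a few lines: rank a fixed set of reference objects by distance to each object, then compare objects through their rankings (the Spearman footrule is one common choice; the paper's quantized representation is a refinement not shown here). The 1-D points and reference set are invented for illustration.

```python
def permutation_list(obj, references, dist):
    """Rank reference objects by distance to obj (closest first);
    return position[i] = rank of reference i in obj's ordering."""
    order = sorted(range(len(references)),
                   key=lambda i: dist(obj, references[i]))
    position = [0] * len(references)
    for rank, i in enumerate(order):
        position[i] = rank
    return position

def footrule(p, q):
    """Spearman footrule distance between two permutation lists."""
    return sum(abs(a - b) for a, b in zip(p, q))

# 1-D toy example with three reference points.
refs = [0.0, 5.0, 10.0]
d = lambda x, y: abs(x - y)
p1 = permutation_list(1.0, refs, d)   # closest reference: 0.0
p2 = permutation_list(1.5, refs, d)   # same ordering as p1
p3 = permutation_list(9.0, refs, d)   # closest reference: 10.0
print(footrule(p1, p2), footrule(p1, p3))  # 0 4
```

Neighboring objects (1.0 and 1.5) induce identical permutations, while a distant object (9.0) reverses the ranking, which is exactly the "same surroundings" intuition the abstract describes.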

5.
This paper presents an approach for automatic grading of essays. Student essays are compared against a model, or key, essay provided by the teacher. The similarity between a student essay and the model essay is measured by the cosine of their contained angle in an n-dimensional semantic space. The model essay is preprocessed by removing stopwords, extracting keywords, assigning weights to keywords to reflect their importance, and finally linking every keyword to a subject-oriented synonym list. The student essay, by comparison, is preprocessed by removing stopwords and then extracting keywords. The keywords extracted from the model essay and from the student essays, together with the weights provided by the teacher, are used to build feature vectors for the teacher and student essays. The assigned grade depends on the similarity between these vectors, calculated using the cosine formula. A simulator was implemented to test the viability of the proposed approach. It was fed with university-level student essays gathered from a database management course over three semesters. The results were very encouraging, and the agreement between the auto-grader and a human grader was as good as the agreement between human graders.
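The pipeline above (stopword removal, keyword weighting, cosine similarity) can be sketched as follows. The stopword list, teacher weights, and example sentences are invented for illustration, and the synonym-list step is omitted.

```python
import math
import re

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def keyword_vector(text, weights=None):
    """Bag-of-keywords vector after stopword removal; optional
    teacher-provided weights boost important keywords."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0.0) + (weights or {}).get(t, 1.0)
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

model = "A transaction is an atomic unit of work in a database"
answer = "A transaction is an atomic work unit executed by a database"
score = cosine(keyword_vector(model, {"transaction": 2.0, "atomic": 2.0}),
               keyword_vector(answer))
print(round(score, 2))  # 7/sqrt(77) -> 0.8
```

A grade can then be derived by mapping the cosine score onto the grading scale, e.g. `grade = round(score * 100)`.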

6.
7.
All distance learning participants (students, professors, instructors, mentors, tutors and others) would like to know how well students have assimilated the study materials being taught. The analysis and assessment of the knowledge students have acquired over a semester are an integral part of the independent studies process at the most advanced universities worldwide. A formal test or exam during the semester would cause needless stress for students. To resolve this problem, the authors of this article have developed a Biometric and Intelligent Self-Assessment of Student Progress (BISASP) System. The research results obtained are comparable with those of other similar studies. The article ends with two case studies demonstrating practical operation of the BISASP System. The first case study analyses the interdependencies between microtremors, stress and student marks. The second compares the marks assigned to students during e-self-assessment, prior to the e-test and during the e-test. The dependence, determined in the second case study, between the marks students scored in the real examination and the marks based on their self-evaluation is statistically significant (significance > 0.99). The original contribution of this article, compared to previously published research results, is as follows: the BISASP System is superior to traditional self-assessment systems due to its use of voice stress analysis and a special algorithm, which permits a more detailed analysis of the knowledge attained by a student.

8.
In recent years, researchers have paid increasing attention to data mining in practical applications. Aimed at the problem of symptom classification in traditional Chinese medicine, this paper proposes a novel computing model that uses the similarities among attributes of high-dimensional data to compute the similarity between tuples. The model treats data attributes as basis vectors in m dimensions and each tuple as the sum of all its attribute vectors. Based on a priori concept-similarity information among attributes, it introduces a novel distance algorithm to compute the similarity distance between any pair of attribute vectors. In this way, computing the similarity between tuples reduces to formulas over attribute vectors and their projections onto each other, and the similarity between any pair of tuples can be worked out by evaluating these vectors and formulas. The paper also presents a novel classification algorithm based on this similarity computing model and successfully applies it to symptom classification in traditional Chinese medicine. The efficiency of the algorithm is demonstrated by extensive experiments.

9.
The traditional problem of similarity search requires finding, within a set of points, those closest to a query point q according to a distance function d. In this paper we introduce the novel problem of metric information filtering (MIF): in this scenario, each point x_i comes with its own distance function d_i, and the task is to efficiently determine those points that are close enough, according to d_i, to a query point q. MIF can be seen as an extension both of the similarity search problem and of approaches currently used in content-based information filtering, since in MIF user profiles (points) and new items (queries) are compared using arbitrary, personalized metrics. We introduce the basic concepts of MIF and provide alternative resolution strategies aimed at reducing processing costs. Our experimental results show that the proposed solutions are indeed effective in reducing evaluation costs.
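The MIF setting above — each profile carries its own metric d_i and its own matching radius — can be sketched naively as a linear scan (the paper's contribution is precisely the index structures that avoid this scan; the profiles, metrics, and thresholds below are invented for the example).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Profile:
    point: tuple          # user profile as a feature vector
    dist: Callable        # the profile's personal metric d_i
    threshold: float      # "close enough" radius for this profile

def metric_filter(profiles, query):
    """Return the profiles whose own metric judges the query item
    close enough (naive MIF: one distance evaluation per profile)."""
    return [p for p in profiles if p.dist(p.point, query) <= p.threshold]

# Two profiles at the origin, with different personal metrics.
l1 = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
linf = lambda x, y: max(abs(a - b) for a, b in zip(x, y))

profiles = [
    Profile((0.0, 0.0), l1, 1.5),
    Profile((0.0, 0.0), linf, 0.5),
]
matches = metric_filter(profiles, (1.0, 0.4))
print(len(matches))  # L1 distance 1.4 <= 1.5 matches; L-inf 1.0 > 0.5 does not
```

The same query item matches one profile and not the other purely because their metrics differ, which is the personalization that distinguishes MIF from ordinary similarity search.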

10.
Automatic short-answer grading is a key problem in smart education. The main reasons current automatic grading is inaccurate are: (1) pre-defined reference answers cannot cover the diversity of student responses; (2) the match between a student answer and the reference answer is not characterized accurately. To address these problems, this paper selects representative student answers via clustering and maximum similarity to build a more complete set of reference answers, covering as many response variations as possible; on this basis, an attention-based deep neural network model is used to better characterize the match between student answers and reference answers. Experimental results on relevant datasets show that the proposed model effectively improves grading accuracy.
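The first step described here — pick representative student answers to enlarge the reference set — can be sketched with a crude greedy stand-in for the paper's clustering-plus-maximum-similarity selection (the similarity function, threshold 0.5, and sample answers are all invented for the example; the paper's neural matching model is not shown).

```python
def jaccard(a, b):
    """Word-overlap similarity between two short answers."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def pick_representatives(answers, k):
    """Greedily pick up to k representatives: each pick maximizes
    total similarity to the not-yet-covered answers, then removes
    the answers it covers (similarity >= 0.5)."""
    chosen = []
    remaining = list(answers)
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda a: sum(jaccard(a, b) for b in remaining))
        chosen.append(best)
        remaining = [b for b in remaining
                     if jaccard(best, b) < 0.5 and b != best]
        if not remaining:
            break
    return chosen

answers = [
    "photosynthesis converts light energy into chemical energy",
    "plants convert light energy into chemical energy",
    "it makes food for the plant",
]
reps = pick_representatives(answers, 2)
print(len(reps))  # 2: one per distinct way of answering
```

The two selected representatives cover both phrasings in the toy set, so a grader matching against them sees more of the answer diversity than a single teacher-written reference would.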

11.
In this paper we introduce VideoGraph, a novel non-linear representation of the scene structure of a video. Unlike the classical linear sequential organization, VideoGraph condenses the video content along the timeline by structuring it into scenes, materialized as a two-dimensional graph that enables non-linear exploration of the scenes and their transitions. To construct VideoGraph, we adopt a sub-shot-induced method to evaluate the spatio-temporal similarity between shot segments of a video. The scene structure is then derived by grouping similar shots and identifying the valid transitions between scenes. The final stage represents the scene structure as a graph with respect to the scene transition topology. VideoGraph provides a condensed representation at the scene level and facilitates non-linear browsing of videos. Experimental results demonstrate its effectiveness and efficiency in exploring and accessing video content.

12.
DNA sequence comparison by a novel probabilistic method
This paper proposes a novel method for comparing DNA sequences. Using a graphical representation, we construct probability distributions over DNA sequences. These probability distributions can then be used for similarity studies via the symmetrised Kullback-Leibler divergence. After presenting our method, we test it on six DNA sequences taken from the threonine operons of Escherichia coli K-12 and Shigella flexneri. Our approach is then used to study the evolution of primates using mitochondrial DNA data, allowing us to reconstruct a phylogenetic tree for primate evolution. In addition, we use our technique to analyze the classification and phylogeny of the Tomato Yellow Leaf Curl Virus (TYLCV) based on its whole genome sequences. These examples show that large volumes of DNA sequences can be handled more easily and more quickly by our approach than by existing multiple-alignment methods. Moreover, our method, unlike other approaches, requires no human intervention, because it can be applied automatically.
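The comparison step above can be sketched as follows. Note the distribution used here is a plain smoothed nucleotide-frequency distribution, a simplified stand-in for the paper's graphical-representation-derived distributions; the three short sequences are invented for the example.

```python
import math
from collections import Counter

def base_distribution(seq, alphabet="ACGT", eps=1e-6):
    """Smoothed nucleotide frequency distribution; eps keeps the
    KL divergence finite when a base is absent."""
    counts = Counter(seq)
    total = len(seq)
    return [(counts[b] + eps) / (total + eps * len(alphabet))
            for b in alphabet]

def sym_kl(p, q):
    """Symmetrised Kullback-Leibler divergence KL(p||q) + KL(q||p)."""
    kl = lambda u, v: sum(a * math.log(a / b) for a, b in zip(u, v))
    return kl(p, q) + kl(q, p)

s1 = "ACGTACGTAC"
s2 = "ACGAACGAAC"   # all T's replaced: composition shifted
s3 = "ACGTACGTAG"   # one substitution: composition nearly unchanged
d12 = sym_kl(base_distribution(s1), base_distribution(s2))
d13 = sym_kl(base_distribution(s1), base_distribution(s3))
print(d13 < d12)  # the single-substitution sequence is closer -> True
```

Pairwise divergences computed this way can feed directly into standard distance-based tree builders (e.g. neighbor joining) to obtain a phylogeny, which is the workflow the abstract applies to primate mitochondrial DNA.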

13.
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for data sets with short strings (average string length no larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.

14.
We propose a novel knowledge-based technique for inter-document similarity computation, called Context Semantic Analysis (CSA). Several specialized approaches built on top of specific knowledge bases (e.g. Wikipedia) exist in the literature, but CSA differs from them in that it is designed to be portable to any RDF knowledge base. In fact, our technique relies on a generic RDF knowledge base (e.g. DBpedia or Wikidata) to extract a Semantic Context Vector, a novel model for representing the context of a document, which CSA exploits to compute inter-document similarity effectively. Moreover, we show how CSA can be applied effectively in the Information Retrieval domain. Experimental results show that: (i) for the general task of inter-document similarity, CSA outperforms baselines built on top of traditional methods and achieves performance similar to approaches built on top of specific knowledge bases; (ii) for Information Retrieval tasks, enriching documents with context (i.e., employing the Semantic Context Vector model) improves the result quality of the state-of-the-art technique that employs similar semantic enrichment.

15.
The similarity search problem has received considerable attention in the database research community. In sensor network applications, this problem is even more important due to the imprecision of sensor hardware and the variation of environmental parameters. Traditional similarity search mechanisms are both ill-suited and inefficient for these highly energy-constrained sensors. One difficulty is that it is hard to predict which sensor holds the most similar (or closest) data item, so many or even all sensors need to send their data to the query node for further comparison. In this paper, we propose a similarity search algorithm (SSA), a novel framework based on the concept of the Hilbert curve over a data-centric storage structure, for efficiently processing similarity search queries in sensor networks. SSA avoids the need to collect data from all sensors in the network when searching for the most similar data item. The performance study reveals that this mechanism is highly efficient and significantly outperforms previous approaches in processing similarity search queries.

16.
One promise of current information retrieval systems is the capability to identify risk groups for certain diseases and pathologies through automatic analysis of vast Electronic Medical Records repositories. However, the complexity and degree of specialization of the language used by experts in this context make this task both challenging and complex. In this work, we introduce a novel experimental study to evaluate the performance of two semantic similarity metrics (Path and Intrinsic IC-Path, both widely accepted in the literature) in a real-life information retrieval situation. To achieve this goal, and given the lack of methodologies for this context in the literature, we propose a straightforward information retrieval system for the biomedical field based on the UMLS Metathesaurus and on semantic similarity metrics. In contrast with previous studies, which focus on testbeds with limited and controlled sets of concepts, we use a large amount of information (101,712 medical documents extracted from the TREC Medical Records Track 2011). Our results show that in real-life cases both metrics display similar performance: Path (F-measure = 0.430) and Intrinsic IC-Path (F-measure = 0.427). We therefore suggest that the use of Intrinsic IC-Path is not justified in real scenarios.
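The Path metric mentioned above is conventionally defined as 1 / (1 + shortest-path length) between two concepts in an is-a hierarchy. A minimal sketch, using an invented toy concept graph in place of the UMLS Metathesaurus:

```python
from collections import deque

def shortest_path_len(graph, a, b):
    """BFS shortest-path length in an undirected concept graph."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # disconnected concepts

def path_similarity(graph, a, b):
    """Path metric: 1 / (1 + shortest path length)."""
    d = shortest_path_len(graph, a, b)
    return 1.0 / (1 + d) if d is not None else 0.0

# Toy is-a hierarchy (edges stored in both directions).
edges = [("disease", "infection"), ("infection", "pneumonia"),
         ("disease", "neoplasm")]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

print(path_similarity(graph, "pneumonia", "infection"))  # 0.5
print(path_similarity(graph, "pneumonia", "neoplasm"))   # 0.25
```

Intrinsic IC-Path additionally weights the path by information content estimated from the hierarchy itself; the abstract's finding is that, at this corpus scale, that extra machinery did not pay off.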

17.
In the specialized literature, there are many approaches for capturing textual measures: textual similarity, textual readability and textual sentiment. This paper proposes a new sentiment similarity measure between pairs of words using a fuzzy-based approach in which words are treated as single-valued neutrosophic sets. We build our study with the aid of the lexical resource SentiWordNet 3.0, as our intended scope is to design a new word-level similarity measure calculated from the sentiment scores of the words involved. Our study pays particular attention to polysemous words, because these words are a real challenge for any application that processes natural language data. To our knowledge, this approach is quite new in the literature, and the results obtained give us hope for further investigation.
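The idea of scoring word pairs from sentiment triples can be sketched as follows. This is a deliberately simplified stand-in for the paper's neutrosophic construction: words are reduced to (positive, negative, objective) triples, the similarity is one minus their normalized L1 distance, and the triples below are made up for the example rather than taken from SentiWordNet 3.0.

```python
def sentiment_similarity(w1, w2):
    """Similarity between two words represented as (positive,
    negative, objective) score triples: 1 - mean absolute
    component difference."""
    dist = sum(abs(a - b) for a, b in zip(w1, w2)) / 3.0
    return 1.0 - dist

# Illustrative SentiWordNet-style triples (invented for the example).
good = (0.75, 0.0, 0.25)
great = (0.75, 0.0, 0.25)
awful = (0.0, 0.875, 0.125)

print(sentiment_similarity(good, great))        # identical triples -> 1.0
print(sentiment_similarity(good, awful) < 0.5)  # opposite polarity -> True
```

For polysemous words, one would first have to aggregate or select among the per-sense triples a resource like SentiWordNet provides, which is exactly the difficulty the abstract highlights.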

18.
Link-based similarity measures play a significant role in many graph-based applications. Consequently, measuring node similarity in a graph is a fundamental problem of graph data mining. Personalized PageRank (PPR) and SimRank (SR) have emerged as the most popular and influential link-based similarity measures. Recently, a novel link-based similarity measure, Penetrating Rank (P-Rank), which enriches SR, was proposed. In practice, PPR, SR and P-Rank scores are calculated by iterative methods, and as the number of iterations increases, so does the overhead of the calculation. The ideal solution is for computation within the minimum number of iterations to suffice to guarantee a desired accuracy. However, the existing upper bounds are too coarse to be useful in general. We therefore focus in this paper on designing accurate and tight upper bounds for PPR, SR, and P-Rank. Our upper bounds are built on the following intuition: the smaller the difference between two consecutive iteration steps, the smaller the difference between the theoretical and iterative similarity scores. Furthermore, we demonstrate the effectiveness of our upper bounds in the scenario of top-k similar-node queries, where they help accelerate query processing. We also run a comprehensive set of experiments on real-world data sets to verify the effectiveness and efficiency of our upper bounds.
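The consecutive-step intuition above is easy to see in code: iterate PPR by power iteration and stop when the L1 gap between successive iterates falls below a tolerance. This is a generic sketch of the iterative setting the paper analyzes, not the paper's bound itself; the adjacency list is invented for the example.

```python
def ppr(adj, source, alpha=0.15, tol=1e-8):
    """Power iteration for personalized PageRank. Stops when the L1
    difference between consecutive iterates drops below tol -- the
    same consecutive-step gap that tight error bounds are built on."""
    n = len(adj)
    p = [1.0 / n] * n
    iters = 0
    while True:
        # Restart mass goes to the source node.
        q = [alpha * (1.0 if i == source else 0.0) for i in range(n)]
        for u in range(n):
            out = adj[u]
            if out:
                share = (1 - alpha) * p[u] / len(out)
                for v in out:
                    q[v] += share
            else:
                q[source] += (1 - alpha) * p[u]  # dangling node: restart
        gap = sum(abs(a - b) for a, b in zip(p, q))
        p, iters = q, iters + 1
        if gap < tol:
            return p, iters

# Tiny directed graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
adj = [[1, 2], [2], [0]]
scores, iters = ppr(adj, source=0)
print(round(sum(scores), 6), scores[0] > scores[2])
```

Because the iteration is a contraction with factor (1 - alpha), the consecutive-step gap geometrically bounds the remaining distance to the fixed point, which is why a small gap certifies a small error.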

19.
Information Systems, 2004, 29(5): 405-420
This paper discusses the effective processing of similarity search that supports time warping in large sequence databases. Time warping enables sequences with similar patterns to be found even when they are of different lengths. Prior methods for processing similarity search with time warping failed to employ multi-dimensional indexes without false dismissal, since the time warping distance does not satisfy the triangle inequality. They have to scan the entire database, and thus suffer serious performance degradation in large databases. Another method, which employs the suffix tree and does not assume any distance function, also shows poor performance due to the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to enhance search performance in large databases without permitting any false dismissal. To attain this goal, we have devised a new distance function, Dtw-lb, which consistently underestimates the time warping distance and satisfies the triangle inequality. Dtw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as its indexing attributes and Dtw-lb as its distance function. We prove that our method incurs no false dismissal. To verify the superiority of our method, we have performed extensive experiments. The results reveal that our method achieves a significant speedup, up to 43 times faster on a data set containing real-world S&P 500 stock data sequences, and up to 720 times on data sets containing very large volumes of synthetic data sequences. The performance gain increases: (1) as the number of data sequences increases, (2) as the average length of data sequences increases, and (3) as the tolerance in a query decreases.
Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.
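One classic way to build a warping-invariant 4-tuple feature vector and a lower-bound distance of the kind described above is (first, last, max, min) with a max-of-differences bound. This sketch follows that well-known feature-filtering construction with an L-infinity-style DTW; the exact definition in the paper may differ, and the sequences are invented for the example.

```python
def features(s):
    """4-tuple (first, last, max, min): invariant to time warping,
    since repeating elements changes none of the four values."""
    return (s[0], s[-1], max(s), min(s))

def d_lb(s, q):
    """Lower-bound distance on the feature vectors. With the
    max-aggregated DTW below it never exceeds the true distance:
    first/last elements are always aligned, and the global max/min
    differences are forced on some alignment."""
    return max(abs(a - b) for a, b in zip(features(s), features(q)))

def dtw(s, q):
    """DTW with |x - y| local cost, max-aggregated along the path,
    so it is directly comparable to the L-infinity lower bound."""
    INF = float("inf")
    n, m = len(s), len(q)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - q[j - 1])
            D[i][j] = max(cost,
                          min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]))
    return D[n][m]

s = [1.0, 3.0, 5.0, 4.0]
q = [1.0, 1.0, 4.0, 6.0, 3.0]
print(d_lb(s, q) <= dtw(s, q))  # lower bound never exceeds true distance -> True
```

In the indexed search, candidates whose feature-vector distance to the query already exceeds the tolerance are pruned without computing the expensive DTW, and the lower-bound property guarantees no false dismissal.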

20.
Authorship Identification Based on Semantic Analysis
Authorship identification is a widely applied line of research. Its key problems are extracting, from a work, features that represent the author's stylistic register, and using those style features to estimate the stylistic similarity between works. Traditional identification methods mainly examine features representing literary style such as word choice, sentence construction, and paragraph organization; among them, analyses based on the frequencies of punctuation marks and of the most common function words are widely accepted. Based on stylistics theory and the HowNet knowledge base, this paper proposes a new similarity estimation method based on lexical semantic analysis, which makes effective use of words beyond function words and achieves good identification performance.
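The traditional baseline this abstract builds on — compare function-word frequency profiles with a similarity measure — can be sketched as follows. The function-word list and the three text snippets are invented for the example; the paper's HowNet-based semantic extension is not shown.

```python
import math
import re

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it"]

def style_vector(text):
    """Relative frequencies of common function words: a crude
    content-independent fingerprint of writing style."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n = max(len(tokens), 1)
    return [tokens.count(w) / n for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

known = "It is the end of the story and the moral of it is plain."
disputed = "The end of the tale is near and the moral is plain to see."
other = "Run fast. Jump high. Win big."
print(cosine(style_vector(known), style_vector(disputed)) >
      cosine(style_vector(known), style_vector(other)))  # True
```

Attribution then amounts to assigning the disputed text to the candidate author whose style vector it is most similar to; the paper's contribution is enriching these vectors with lexical semantics from HowNet.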


Copyright©北京勤云科技发展有限公司  京ICP备09084417号