20 similar documents found; search took 46 ms
1.
Discovering news topics in microblogs based on latent topic analysis and text clustering  Cited by: 1 (self-citations: 0, other citations: 1)
We propose a method for discovering news topics in large-scale collections of short microblog texts. Latent topic analysis is used to address the problem of measuring similarity between short texts. Within each time window, the microblog posts most likely to discuss news events are selected according to characteristics of news; those posts are then clustered with a hybrid two-layer scheme combining K-means and hierarchical clustering, from which news topics are detected. The method copes well with the data sparsity and sheer volume of short microblog texts. Experiments demonstrate the effectiveness of the algorithm.
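The clustering step can be illustrated with a minimal sketch. This is not the paper's two-layer method: it shows only a plain K-means pass over per-post topic vectors (the toy data and the deterministic initialization are assumptions for the example), which the paper would follow with hierarchical clustering.

```python
def kmeans(points, k, iters=10):
    """Plain K-means over topic-distribution vectors (one per microblog post)."""
    # deterministic initialization: spread the initial centers across the data
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each post to the nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # recompute each center as the mean of its cluster
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# two toy "topic distributions": posts about topic 0 vs. posts about topic 1
posts = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15), (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
centers, clusters = kmeans(posts, 2)
```

In the paper's pipeline the input vectors would come from latent topic analysis, which is what makes cosine or Euclidean comparisons meaningful for very short texts.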
2.
In this paper, we discuss how the idea of design patterns can be used in the context of the World Wide Web, for both designing and implementing web sites or more complex information systems. We first motivate our work by discussing the most outstanding problems in designing Web-based information systems. Then we briefly introduce design patterns and show how they are used to record and reuse design information. We next present some simple though powerful design patterns and show known uses in the WWW. Finally, we outline a process for building applications by combining a design methodology (OOHDM) with design patterns.
3.
《Knowledge》2006,19(3):164-171
Due to its open nature, vast amounts of new information are continuously posted to the Web. Consequently, at any time there may be hot issues (emerging topics) under discussion in any information area on the Web. However, it is not practical for users to browse the Web manually all the time to track these changes. We therefore need an Emerging Topic Tracking System (ETTS) to act as an information agent that detects changes in the information area of our interest and regularly generates a summary of changes. This summary reports the latest, most discussed issues and thus reveals the emerging topics in the particular information area. With this system, we will be ‘all time aware’ of the latest trends in the WWW information space.
4.
Michael Johnson Farshad Fotouhi Sorin Drăghici Ming Dong Duo Xu 《Multimedia Tools and Applications》2004,24(2):155-188
This paper describes our research into a query-by-semantics approach to searching the World Wide Web. This research extends existing work, which had focused on a query-by-structure approach for the Web. We present a system that allows users to request documents containing not only specific content information, but also to specify that documents be of a certain type. The system captures and utilizes structure information as well as content during a distributed query of the Web. The system also allows the user the option of creating their own document types by providing the system with example documents. In addition, although the system still gives users the option of dynamically querying the web, the incorporation of a document database has improved the response time involved in the search process. Based on extensive testing and validation presented herein, it is clear that a system that incorporates structure and document semantic information into the query process can significantly improve search results over the standard keyword search.
5.
6.
Nikolaj Tatti 《Data mining and knowledge discovery》2014,28(5-6):1429-1454
Discovering the underlying structure of a given graph is one of the fundamental goals in graph mining. Given a graph, we can often order vertices in a way that neighboring vertices have a higher probability of being connected to each other. This implies that the edges form a band around the diagonal in the adjacency matrix. Such structure may arise, for example, if the graph was created over time: each vertex had an active time interval during which the vertex was connected with other active vertices. The goal of this paper is to model this phenomenon. To this end, we formulate an optimization problem: given a graph and an integer \(K\), we want to order graph vertices and partition the ordered adjacency matrix into \(K\) bands such that bands closer to the diagonal are more dense. We measure the goodness of a segmentation using the log-likelihood of a log-linear model, a flexible family of distributions containing many standard distributions. We divide the problem into two subproblems: finding the order and finding the bands. We show that discovering bands can be done in polynomial time with isotonic regression, and we also introduce a heuristic iterative approach. For discovering the order we use the Fiedler order accompanied by a simple combinatorial refinement. We demonstrate empirically that our heuristic works well in practice.
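The band-discovery subproblem reduces to isotonic regression, which can be solved with the classic pool-adjacent-violators algorithm (PAVA). The sketch below is a generic PAVA for fitting a non-increasing sequence (band densities should not increase away from the diagonal); it illustrates the tool the paper invokes, not the paper's full segmentation algorithm, and the density values are invented.

```python
def pava_nonincreasing(y):
    """Fit the closest (least-squares) non-increasing sequence to y using the
    pool-adjacent-violators algorithm, applied to the reversed sequence."""
    blocks = []  # list of (mean, weight) blocks, kept non-decreasing in mean
    for v in reversed(y):
        blocks.append((v, 1))
        # merge adjacent blocks while they violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append(((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2))
    fitted = [m for m, w in blocks for _ in range(w)]
    return fitted[::-1]

# observed band densities moving away from the diagonal (toy numbers)
densities = [0.9, 0.5, 0.75, 0.2]
monotone = pava_nonincreasing(densities)   # [0.9, 0.625, 0.625, 0.2]
```

The two out-of-order middle values are pooled into their mean, which is exactly the least-squares projection onto non-increasing sequences.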
7.
9.
《IEEE transactions on pattern analysis and machine intelligence》2006,32(7):454-466
One of the challenging problems for software developers is guaranteeing that a system as built is consistent with its architectural design. In this paper, we describe a technique that uses runtime observations about an executing system to construct an architectural view of the system. In this technique, we develop mappings that exploit regularities in system implementation and architectural style. These mappings describe how low-level system events can be interpreted as more abstract architectural operations and are formally defined using Colored Petri Nets. In this paper, we describe a system, called DiscoTect, that uses these mappings and we introduce the DiscoSTEP mapping language and its formal definition. Two case studies showing the application of DiscoTect suggest that the tool is practical to apply to legacy systems and can dynamically verify conformance to a preexisting architectural specification.
10.
Wei-Min Shen 《International Journal of Intelligent Systems》1992,7(7):623-635
Knowledge bases open new horizons for machine learning research. One challenge is to design learning programs to expand the knowledge base using the knowledge that is currently available. This article addresses the problem of discovering regularities in large knowledge bases that contain many assertions in different domains. The article begins with a definition of regularities and gives the motivation for such a definition. It then outlines a framework that attempts to integrate induction with knowledge. Although the implementation of the framework currently uses only a statistical method for confirming hypotheses, its application to a real knowledge base has shown some encouraging and interesting results. © 1992 John Wiley & Sons, Inc.
11.
Dongyi Wang Jidong Ge Hao Hu Bin Luo Liguo Huang 《Expert systems with applications》2012,39(15):11970-11978
The aim of process mining is to discover a process model from the event log recorded by an information system. The typical steps of a process mining algorithm are: (1) generating event traces from the event log, (2) analyzing the event traces to obtain ordering relations between tasks, and (3) generating a process model from those ordering relations. The first two steps can be very time consuming when millions of events and thousands of event traces are involved. This paper presents a novel algorithm (the λ-algorithm) that almost eliminates these two steps, so as to improve the performance of process mining. First, we retrieve the event multiset (the input data of the algorithm, marked as MS), which records the frequency of each event but ignores their order when extracted from the event log; each event in the event multiset carries information about its post-activities. Second, we obtain ordering relations from the event multiset; these comprise causal dependency, potential parallelism, and non-potential parallelism. Finally, we discover a process model from the ordering relations. The complexity of the λ-algorithm depends only on the number of event classes (the set of distinct events in the event log), which significantly improves on the performance of existing process mining algorithms and is expected to be more practical for real-world process mining based on event logs. The algorithm is also able to detect SWF-nets, short loops, and most implicit dependencies (generated by non-free-choice constructs).
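As a toy illustration of the multiset idea (not the λ-algorithm itself, whose internals the abstract only outlines), the sketch below builds an event multiset that keeps frequencies but discards order, and derives causal versus potentially parallel relations from direct-succession counts; the log and event names are invented.

```python
from collections import Counter

# hypothetical event log: each trace is one case's sequence of events
log = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "b", "c", "d"],
]

# event multiset: frequency of each event class, orders ignored
multiset = Counter(e for trace in log for e in trace)

# direct-succession counts: the raw material for ordering relations
succ = Counter((t[i], t[i + 1]) for t in log for i in range(len(t) - 1))

# x -> y is a causal dependency if x directly precedes y but never vice versa;
# x and y are potentially parallel if both orders are observed
causal = {(x, y) for (x, y) in succ if (y, x) not in succ}
parallel = {(x, y) for (x, y) in succ if (y, x) in succ}
```

Here "b" and "c" come out as potentially parallel because both orders occur in the log, while "a" causally precedes both.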
12.
Discovering Social Networks from Event Logs  Cited by: 5 (self-citations: 0, other citations: 5)
Process mining techniques allow for the discovery of knowledge based on so-called “event logs”, i.e., a log recording the execution of activities in some business process. Many information systems provide such logs, e.g., most WFM, ERP, CRM, SCM, and B2B systems record transactions in a systematic way. Process mining techniques typically focus on performance and control-flow issues. However, event logs typically also record the performer, e.g., the person initiating or completing some activity. This paper focuses on mining social networks using this information. For example, it is possible to build a social network based on the hand-over of work from one performer to the next. By combining concepts from workflow management and social network analysis, it is possible to discover and analyze social networks. This paper defines metrics, presents a tool, and applies these to a real event log within the setting of a large Dutch organization.
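The hand-over-of-work metric mentioned above can be sketched in a few lines: count, per case, how often one performer's activity is directly followed by another performer's. The event log below is invented for illustration; the paper's tool computes this and several related metrics on real logs.

```python
from collections import Counter

# each trace: (activity, performer) pairs ordered by execution time
log = [
    [("register", "ann"), ("check", "bob"), ("decide", "carol")],
    [("register", "ann"), ("check", "carol"), ("decide", "bob")],
]

# hand-over of work: the performer of an activity passes the case
# to the performer of the directly following activity
handover = Counter(
    (trace[i][1], trace[i + 1][1])
    for trace in log
    for i in range(len(trace) - 1)
    if trace[i][1] != trace[i + 1][1]
)
# handover now holds the weighted edges of the social network,
# e.g. ann -> bob with weight 1
```

Thresholding or normalizing these counts yields a weighted directed graph ready for standard social network analysis.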
13.
Evelina Lamma Fabrizio Riguzzi Sergio Storari Paola Mello Anna Nanetti 《New Generation Computing》2003,21(2):123-133
A huge amount of data is collected daily from clinical microbiology laboratories. These data concern the resistance or susceptibility of bacteria to tested antibiotics. Almost all microbiology laboratories follow standard antibiotic testing guidelines which suggest antibiotic test execution methods and result interpretation and validation (among them, those published annually by NCCLS). Guidelines basically specify, for each species, the antibiotics to be tested, how to interpret the results of tests, and a list of exceptions regarding particular antibiotic test results. Even though these standards are well established, they do not consider peculiar features of a given hospital laboratory, which possibly influence the antimicrobial test results and the subsequent validation process.

In order to improve and better tailor the validation process, we have applied knowledge discovery techniques, and data mining in particular, to microbiological data with the purpose of discovering new validation rules, not yet included in the NCCLS guidelines, but considered plausible and correct by the experts we interviewed. In particular, we applied the knowledge discovery process in order to find (association) rules relating the susceptibility or resistance of a bacterium to different antibiotics. This approach is not antithetic but complementary to that based on the NCCLS rules: it proved very effective in validating some of them, and also in extending that compendium. In this respect, the newly discovered knowledge has led microbiologists to become aware of new correlations among some antimicrobial test results which were previously unnoticed. Last but not least, the newly discovered rules, taking into account the history of the considered laboratory, are better tailored to the hospital situation; this is very important since some resistances to antibiotics are specific to particular, local hospital environments.
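A minimal sketch of the association-rule step: compute support and confidence for rules of the form "result for antibiotic A implies result for antibiotic B". The isolates, antibiotic names, and thresholds below are invented; the paper applies a full knowledge discovery process to real laboratory data.

```python
from itertools import combinations

# hypothetical isolates: the set of test outcomes observed for each bacterium
rows = [
    {"oxacillin=R", "penicillin=R", "erythromycin=S"},
    {"oxacillin=R", "penicillin=R"},
    {"oxacillin=R", "penicillin=R", "erythromycin=R"},
    {"oxacillin=S", "penicillin=S"},
]

def rules(rows, min_support=0.5, min_conf=0.9):
    """Return (lhs, rhs) pairs whose support and confidence clear the thresholds."""
    n = len(rows)
    items = sorted({i for r in rows for i in r})
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            cover = [r for r in rows if lhs in r]          # rows containing lhs
            both = [r for r in cover if rhs in r]          # rows containing both
            if cover and len(both) / n >= min_support \
                    and len(both) / len(cover) >= min_conf:
                found.append((lhs, rhs))
    return found

found = rules(rows)
```

On this toy data only the two rules linking oxacillin resistance and penicillin resistance survive; in the paper, such surviving rules were reviewed by microbiologists before being adopted for validation.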
Evelina Lamma, Ph.D.: She received her degree in Electrical Engineering from the University of Bologna in 1985, and her Ph.D. in Computer Science in 1990. Her research activity centers on logic programming languages, artificial intelligence and agent-based programming. She was co-organizer of the 3rd International Workshop on Extensions of Logic Programming (ELP92), held in Bologna in February 1992, and of the 6th Italian Congress on Artificial Intelligence, held in Bologna in September 1999. She is a member of the Italian Association for Artificial Intelligence (AI*IA), associated with ECCAI. Currently, she is Full Professor at the University of Ferrara, where she teaches Artificial Intelligence and Foundations of Computer Science.
Fabrizio Riguzzi, Ph.D.: He is Assistant Professor at the Department of Engineering of the University of Ferrara, Italy. He received his Laurea from the University of Bologna in 1999. He joined the Department of Engineering of the University of Ferrara in 1999. He has been a visiting researcher at the University of Cyprus and at the New University of Lisbon. His research interests include data mining (and in particular methods for learning from multirelational data), machine learning, belief revision, genetic algorithms and software engineering.
Sergio Storari: He received his degree in Electrical Engineering from the University of Ferrara in 1998. His research activity centers on artificial intelligence, knowledge-based systems, data mining and multi-agent systems. He is a member of the Italian Association for Artificial Intelligence (AI*IA), associated with ECCAI. Currently, he is in the third year of his Ph.D. course on “Study and application of Artificial Intelligence techniques for medical data analysis” at DEIS, University of Bologna.
Paola Mello, Ph.D.: She received her degree in Electrical Engineering from the University of Bologna in 1982, and her Ph.D. in Computer Science in 1988. Her research activity centers on knowledge representation, logic programming, artificial intelligence and knowledge-based systems. She was co-organizer of the 3rd International Workshop on Extensions of Logic Programming (ELP92), held in Bologna in February 1992, and of the 6th Italian Congress on Artificial Intelligence, held in Bologna in September 1999. She is a member of the Italian Association for Artificial Intelligence (AI*IA), associated with ECCAI. Currently, she is Full Professor at the University of Bologna, where she teaches Artificial Intelligence and Foundations of Computer Science.
Anna Nanetti: She received a degree in Biological Sciences from the University of Bologna in 1974. Currently, she is an Academic Researcher in the Microbiology section of the Clinical, Specialist and Experimental Medicine Department of the Faculty of Medicine and Surgery, University of Bologna.
14.
Dawei Zhou Arun Karthikeyan Kangyang Wang Nan Cao Jingrui He 《Data mining and knowledge discovery》2017,31(2):400-423
Nowadays, massive graph streams are produced by various real-world applications, such as financial fraud detection, sensor networks, and wireless networks. In contrast to the high volume of data, it is usually the case that only a small percentage of nodes within the time-evolving graphs might be of interest to people. Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing techniques for RCD are designed for static data sets, and are thus not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are constructed upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models, enabling a “time-flexible” RCD. Moreover, to deal with cases where the exact priors of the minority classes are not available, we further propose a modified version named BIRD-LI based on BIRD. Besides, we also identify a critical task in RCD named query distribution, which aims to allocate the limited budget among multiple time steps such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.
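The density intuition behind such techniques can be illustrated in one dimension: a rare class often forms a tight clump, producing a sharp spike in local density. The sketch below (toy data and a fixed radius, not the paper's SIRD/BIRD algorithms) queries the highest-density point first and thereby hits a rare example.

```python
# toy 1-D data: a spread-out majority class plus a tight rare clump
majority = [i * 0.1 for i in range(50)]     # spacing 0.1
rare = [10.0, 10.01, 10.02]                 # spacing 0.01, far from the majority
data = majority + rare

def local_density(x, data, radius=0.05):
    """Number of points within `radius` of x (including x itself)."""
    return sum(1 for y in data if abs(x - y) <= radius)

# majority points have density 1 at this radius; the rare clump has 3,
# so a rare example is selected first for labeling
query = max(data, key=lambda x: local_density(x, data))
```

Incremental variants of this idea update the density estimates as the graph evolves instead of recomputing them from scratch at every time step.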
15.
Michael Jamieson Yulia Eskin Afsaneh Fazly Suzanne Stevenson Sven J. Dickinson 《Computer Vision and Image Understanding》2012,116(7):842-853
We address the problem of automatically learning the recurring associations between the visual structures in images and the words in their associated captions, yielding a set of named object models that can be used for subsequent image annotation. In previous work, we used language to drive the perceptual grouping of local features into configurations that capture small parts (patches) of an object. However, model scope was poor, leading to poor object localization during detection (annotation), and ambiguity was high when part detections were weak. We extend and significantly revise our previous framework by using language to drive the perceptual grouping of parts, each a configuration in the previous framework, into hierarchical configurations that offer greater spatial extent and flexibility. The resulting hierarchical multipart models remain scale, translation and rotation invariant, but are more reliable detectors and provide better localization. Moreover, unlike typical frameworks for learning object models, our approach requires no bounding boxes around the objects to be learned, can handle heavily cluttered training scenes, and is robust in the face of noisy captions, i.e., where objects in an image may not be named in the caption, and objects named in the caption may not appear in the image. We demonstrate improved precision and recall in annotation over the non-hierarchical technique and also show extended spatial coverage of detected objects.
16.
Discovering Frequent Closed Partial Orders from Strings  Cited by: 2 (self-citations: 0, other citations: 2)
Jian Pei Haixun Wang Jian Liu Ke Wang Jianyong Wang Philip S. Yu 《Knowledge and Data Engineering, IEEE Transactions on》2006,18(11):1467-1481
Mining knowledge about ordering from sequence data is an important problem with many applications, such as bioinformatics, Web mining, network management, and intrusion detection. For example, if many customers follow a partial order in their purchases of a series of products, the partial order can be used to predict other related customers' future purchases and develop marketing campaigns. Moreover, some biological sequences (e.g., microarray data) can be clustered based on the partial orders shared by the sequences. Given a set of items, a total order of a subset of items can be represented as a string. A string database is a multiset of strings. In this paper, we identify a novel problem of mining frequent closed partial orders from strings. Frequent closed partial orders capture the nonredundant and interesting ordering information from string databases. Importantly, mining frequent closed partial orders can discover meaningful knowledge that cannot be disclosed by previous data mining techniques. However, the problem of mining frequent closed partial orders is challenging. To tackle the problem, we develop Frecpo (for frequent closed partial order), a practically efficient algorithm for mining the complete set of frequent closed partial orders from large string databases. Several interesting pruning techniques are devised to speed up the search. We report an extensive performance study on both real data sets and synthetic data sets to illustrate the effectiveness and the efficiency of our approach.
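The support-counting core of the problem can be sketched as follows: with the minimum support set to 100%, collect every pairwise order that holds in all strings of the database. This is only the starting point; Frecpo additionally restricts the result to *closed* partial orders (e.g., dropping redundant transitive edges) and scales to large databases. The string database here is invented.

```python
from itertools import permutations

# string database: each string is a total order over a subset of items
db = ["abcd", "acbd", "abdc"]

def precedes(s, a, b):
    """True if item a occurs before item b in string s."""
    return a in s and b in s and s.index(a) < s.index(b)

items = sorted(set("".join(db)))
# pairwise order relations supported by every string (support = 100%)
edges = {(a, b) for a, b in permutations(items, 2)
         if all(precedes(s, a, b) for s in db)}
# note: ("a", "d") is transitive given ("a", "b") and ("b", "d"); a closed
# partial-order representation would drop it via transitive reduction
```

Lowering the support threshold and enumerating closed orders rather than single edges is where the algorithmic difficulty, and the paper's pruning techniques, come in.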
17.
In many large e-commerce organizations, multiple data sources are often used to describe the same customers, thus it is important to consolidate data of multiple sources for intelligent business decision making. In this paper, we propose a novel method that predicts the classification of data from multiple sources without class labels in each source. We test our method on artificial and real-world datasets, and show that it can classify the data accurately. From the machine learning perspective, our method removes the fundamental assumption of providing class labels in supervised learning, and bridges the gap between supervised and unsupervised learning.
18.
Discovering Frequent Agreement Subtrees from Phylogenetic Data  Cited by: 1 (self-citations: 0, other citations: 1)
We study a new data mining problem concerning the discovery of frequent agreement subtrees (FASTs) from a set of phylogenetic trees. A phylogenetic tree, or phylogeny, is an unordered tree in which the order among siblings is unimportant. Furthermore, each leaf in the tree has a label representing a taxon (species or organism) name, whereas internal nodes are unlabeled. The tree may have a root, representing the common ancestor of all species in the tree, or may be unrooted. An unrooted phylogeny arises due to the lack of sufficient evidence to infer a common ancestor of the taxa in the tree. The FAST problem addressed here is a natural extension of the maximum agreement subtree (MAST) problem widely studied in the computational phylogenetics community. The paper establishes a framework for tackling the FAST problem for both rooted and unrooted phylogenetic trees using data mining techniques. We first develop a novel canonical form for rooted trees together with a phylogeny-aware tree expansion scheme for generating candidate subtrees level by level. Then, we present an efficient algorithm to find all FASTs in a given set of rooted trees, through an Apriori-like approach. We show the correctness and completeness of the proposed method. Finally, we discuss the extensions of the techniques to unrooted trees. Experimental results demonstrate that the proposed methods work well, and are capable of finding interesting patterns in both synthetic data and real phylogenetic trees.
19.
Discovering colored Petri nets from event logs  Cited by: 1 (self-citations: 0, other citations: 1)
A. Rozinat R. S. Mans M. Song W. M. P. van der Aalst 《International Journal on Software Tools for Technology Transfer (STTT)》2008,10(1):57-74
Process-aware information systems typically log events (e.g., in transaction logs or audit trails) related to the actual execution of business processes. Analysis of these execution logs may reveal important knowledge that can help organizations to improve the quality of their services. Starting from a process model, which can be discovered by conventional process mining algorithms, we analyze how data attributes influence the choices made in the process based on past process executions using decision mining, also referred to as decision point analysis. In this paper we describe how the resulting model (including the discovered data dependencies) can be represented as a Colored Petri Net (CPN), and how further perspectives, such as the performance and organizational perspectives, can be incorporated. We also present a CPN Tools Export plug-in implemented within the ProM framework. Using this plug-in, simulation models in ProM obtained via a combination of various process mining techniques can be exported to CPN Tools. We believe that the combination of automatic discovery of process models using ProM and the simulation capabilities of CPN Tools offers an innovative way to improve business processes. The discovered process model describes reality better than most hand-crafted simulation models. Moreover, the simulation models are constructed in such a way that it is easy to explore various redesigns.
A. Rozinat’s research was supported by the IOP program of the Dutch Ministry of Economic Affairs. M. Song’s research was supported by the Technology Foundation STW.
20.
The integration of data mining techniques with data warehousing is gaining popularity because the two disciplines complement each other in extracting knowledge from large datasets. However, the majority of approaches focus on applying data mining as a front-end technology to mine data warehouses. Surprisingly, little progress has been made in incorporating mining techniques into the design of data warehouses. While methods such as data clustering applied to multidimensional data have been shown to enhance the knowledge discovery process, a number of fundamental issues remain unresolved with respect to the design of multidimensional schema. These relate to automated support for the selection of informative dimension and fact variables in high-dimensional and data-intensive environments, an activity which may challenge the capabilities of human designers on account of the sheer scale of data volume and the number of variables involved. In this research, we propose a methodology that selects a subset of informative dimension and fact variables from an initial set of candidates. Our experimental results, conducted on three real-world datasets taken from the UCI machine learning repository, show that the knowledge discovered from the schema we generated was more diverse and informative than that obtained by the standard approach of mining the original data without our multidimensional structure imposed on it.
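A minimal sketch of the variable-filtering idea (the toy columns and the entropy criterion are assumptions for illustration; the paper's methodology is more elaborate): rank candidate dimension variables by how much information their values carry, and drop uninformative ones such as constant columns.

```python
import math
from collections import Counter

# hypothetical candidate dimension variables and their column values
columns = {
    "region":   ["north", "south", "north", "east", "west", "south"],
    "currency": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
    "product":  ["p1", "p2", "p3", "p1", "p2", "p3"],
}

def entropy(values):
    """Shannon entropy (bits) of a column's value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# rank candidates by entropy; a constant column carries no information
ranked = sorted(columns, key=lambda c: entropy(columns[c]), reverse=True)
selected = [c for c in ranked if entropy(columns[c]) > 0]   # drops "currency"
```

A real schema-design pipeline would combine such per-variable scores with dependencies among variables before proposing dimension and fact candidates.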