Similar Documents
 20 similar documents found; search time: 15 ms.
1.
Besides traditional domains (e.g., resource allocation, data mining applications), algorithms for medoid computation and related problems will play an important role in numerous emerging fields, such as location-based services and sensor networks. Since the k-medoid problem is NP-hard, all existing work deals with approximate solutions on relatively small datasets. This paper aims at efficient methods for very large spatial databases, motivated by: (1) the high and ever-increasing availability of spatial data, and (2) the need for novel query types and improved services. The proposed solutions exploit the intrinsic grouping properties of a data partition index in order to read only a small part of the dataset. Compared to previous approaches, we achieve results of comparable or better quality at a small fraction of the CPU and I/O costs (seconds as opposed to hours, and tens of node accesses instead of thousands). In addition, we study medoid-aggregate queries, where k is not known in advance, but we are asked to compute a medoid set that leads to an average distance close to a user-specified value. Similarly, medoid-optimization queries aim at minimizing both the number of medoids k and the average distance. We also consider the max version of the aforementioned problems, where the goal is to minimize the maximum (instead of the average) distance between any object and its closest medoid. Finally, we investigate bichromatic and weighted medoid versions for all query types, as well as maximum-capacity and dynamic medoids.
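For readers unfamiliar with the underlying problem, the following is a minimal PAM-style k-medoids sketch in Python. It only illustrates basic medoid computation on a small in-memory dataset, not the paper's index-based algorithm for very large spatial databases; dataset size, k and the stopping rule are arbitrary choices for the example.

    import numpy as np

    def k_medoids(points, k, iters=20, seed=0):
        """Plain PAM-style k-medoids for a small in-memory dataset.
        Illustrative only: the paper instead exploits a data partition index
        so that only a small part of the dataset has to be read."""
        rng = np.random.default_rng(seed)
        n = len(points)
        dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(iters):
            labels = np.argmin(dist[:, medoids], axis=1)   # assign to nearest medoid
            new_medoids = medoids.copy()
            for j in range(k):
                members = np.where(labels == j)[0]
                if members.size == 0:
                    continue
                # The new medoid minimizes the total distance within its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        labels = np.argmin(dist[:, medoids], axis=1)
        avg_dist = dist[np.arange(n), medoids[labels]].mean()
        return medoids, avg_dist

    pts = np.random.default_rng(1).random((200, 2))
    medoid_ids, avg = k_medoids(pts, k=3)
    print(medoid_ids, round(avg, 3))

The average distance returned here is the quantity that the paper's medoid-aggregate and medoid-optimization queries steer toward a user-specified value or minimize jointly with k.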

2.
Twitter has become a major tool for spreading news, for disseminating positions and ideas, and for commenting on and analyzing current world events. However, with more than 500 million tweets flowing per day, it is necessary to find efficient ways of collecting, storing, managing, mining and visualizing all this information. This is especially relevant considering that Twitter has no way of indexing tweet contents, and that the only available categorization “mechanism” is the #hashtag, which is totally dependent on a user's willingness to use it. This paper presents an intelligent platform and framework, named MISNIS (Intelligent Mining of Public Social Networks’ Influence in Society), that addresses these issues and allows a non-technical user to easily mine a given topic from a very large tweet corpus and obtain relevant contents and indicators such as user influence or sentiment analysis. When compared to other existing similar platforms, MISNIS is an expert system that includes specifically developed intelligent techniques that: (1) circumvent the Twitter API restrictions that limit access to 1% of all flowing tweets; when online, the platform has been able to collect more than 80% of all Portuguese-language tweets flowing in Portugal; (2) intelligently retrieve most tweets related to a given topic even when the tweets do not contain the topic #hashtag or user-indicated keywords; a 40% increase in the number of retrieved relevant tweets has been reported in real-world case studies. The platform is currently focused on Portuguese-language tweets posted in Portugal. However, most of the developed technologies are language independent (e.g., intelligent retrieval, sentiment analysis), and technically MISNIS can easily be expanded to cover other languages and locations.

3.
Semi-supervised graph clustering: a kernel approach
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.
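As a rough illustration of the connection the abstract describes, here is a small weighted kernel k-means sketch in Python on a precomputed kernel matrix, with a pairwise constraint folded into the kernel as an ad hoc penalty. It follows the general idea rather than the authors' exact objective; the linear kernel, the penalty constant and the toy data are assumptions made for the example.

    import numpy as np

    def weighted_kernel_kmeans(K, k, w=None, iters=30, seed=0):
        """Weighted kernel k-means on a precomputed kernel matrix K (n x n)."""
        n = K.shape[0]
        w = np.ones(n) if w is None else np.asarray(w, dtype=float)
        labels = np.random.default_rng(seed).integers(0, k, size=n)
        for _ in range(iters):
            dist = np.full((n, k), np.inf)
            for c in range(k):
                mask = labels == c
                if not mask.any():
                    continue
                wc, sc = w[mask], w[mask].sum()
                # ||phi(x_i) - m_c||^2 expanded in terms of kernel entries.
                cross = (K[:, mask] * wc).sum(axis=1) / sc
                within = (wc[:, None] * wc[None, :] * K[np.ix_(mask, mask)]).sum() / sc**2
                dist[:, c] = np.diag(K) - 2.0 * cross + within
            new_labels = dist.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels

    # Two Gaussian blobs with a hypothetical cannot-link constraint folded
    # into a linear kernel (in general a diagonal shift may be needed to keep
    # the modified kernel positive semidefinite).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
    K = X @ X.T
    K[0, 30] -= 5.0
    K[30, 0] -= 5.0
    print(weighted_kernel_kmeans(K, k=2))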

4.
Major challenges of clustering geo-referenced data include identifying arbitrarily shaped clusters, properly utilizing spatial information, coping with diverse extrinsic characteristics of clusters and supporting region discovery tasks. The goal of region discovery is to identify interesting regions in geo-referenced datasets based on a domain expert’s notion of interestingness. Almost all agglomerative clustering algorithms focus only on the first challenge. The goal of the proposed work is to develop agglomerative clustering frameworks that deal with all four challenges. In particular, we propose a generic agglomerative clustering framework for geo-referenced datasets (GAC-GEO) that generalizes agglomerative clustering by allowing for three plug-in components. GAC-GEO agglomerates neighboring clusters, maximizing a plug-in fitness function that captures the notion of interestingness of clusters. It enhances typical agglomerative clustering algorithms in two ways: fitness functions support task-specific clustering, whereas generic neighboring relationships increase the number of merging candidates. We also demonstrate that existing agglomerative clustering algorithms can be considered as specific cases of GAC-GEO. We evaluate the proposed framework on an artificial dataset and two real-world applications involving region discovery. The experimental results show that GAC-GEO is capable of identifying arbitrarily shaped hotspots for different data mining tasks.
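To make the plug-in idea concrete, here is a hedged Python sketch of a generic agglomerative loop driven by interchangeable fitness and neighborhood components. The fitness and neighborhood functions below are hypothetical stand-ins, not the paper's actual plug-ins.

    import numpy as np
    from itertools import combinations

    def agglomerate(points, fitness, are_neighbors, min_clusters=1):
        """Generic agglomerative loop: repeatedly merge the neighboring pair
        of clusters whose merge yields the largest fitness gain."""
        clusters = [[i] for i in range(len(points))]
        while len(clusters) > min_clusters:
            best, best_gain = None, 0.0
            for a, b in combinations(range(len(clusters)), 2):
                if not are_neighbors(points[clusters[a]], points[clusters[b]]):
                    continue
                gain = (fitness(points[clusters[a] + clusters[b]])
                        - fitness(points[clusters[a]])
                        - fitness(points[clusters[b]]))
                if gain > best_gain:
                    best, best_gain = (a, b), gain
            if best is None:          # no merge improves fitness: stop
                break
            a, b = best
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return clusters

    # Hypothetical plug-ins: reward large, compact clusters; only allow merges
    # of clusters whose centroids are close to each other.
    fitness = lambda pts: len(pts) ** 2 / (1.0 + pts.std(axis=0).sum())
    near = lambda A, B: np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)) < 0.3
    pts = np.random.default_rng(2).random((40, 2))
    print([len(c) for c in agglomerate(pts, fitness, near)])

Swapping in a different fitness function changes the notion of interestingness, and a different neighborhood predicate changes which clusters are merge candidates, which is the generality the framework aims for.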

5.
6.
Information Systems and e-Business Management - Sentiment analysis is an emerging field that helps in understanding the sentiments of users on microblogging sites. Many sentiment analysis...

7.
8.
Twitter has recently emerged as a popular microblogging service with 284 million monthly active users around the world. A part of the 500 million tweets posted on Twitter every day are personal observations of the immediate environment. If provided with time and location information, these observations can be seen as sensory readings for monitoring and localizing objects and events of interest. Location information on Twitter, however, is scarce, with less than 1% of tweets having associated GPS coordinates. Current research on Twitter location inference mostly focuses on city-level or coarser inference and cannot provide accurate results for fine-grained locations. We propose an event monitoring system for Twitter that emphasizes local events, called SNAF (Sense and Focus). The system filters personal observations posted on Twitter and infers the location of each report. Our extensive experiments with real Twitter data show that the proposed observation filtering approach achieves about a 22% improvement over existing filtering techniques, and our location inference approach can increase location accuracy by up to 36% within a 3 km error range. By aggregating the observation reports with location information, our prototype event monitoring system can detect real-world events, in many cases earlier than news reports.

9.
The objective of this paper is to explain our approach, called “Work Flow Methodology for Analysis and Conceptual Data Base Design of Large Scale Computer Based Information System”. Through the different steps of the methodology, and in light of the definition of a dynamic adaptive system, the user fills in a number of forms that relate the topological dimension to the time dimension for each application of a given system. In addition, we obtain the “Unit Subschema”, which defines the responsibilities for issuing information and the authorization for receiving it at the proper time. Finally, we apply our methodology to the Registration System at Kuwait University.

10.
Bej, Saptarshi; Davtyan, Narek; Wolfien, Markus; Nassar, Mariam; Wolkenhauer, Olaf. Machine Learning, 2021, 110(2): 279-301.
Machine Learning - The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority...
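For context, this is a minimal SMOTE usage sketch with the third-party imbalanced-learn package. It only shows standard SMOTE oversampling, not the corrective technique the paper proposes; the synthetic dataset and parameter values are arbitrary choices for the example.

    # Requires: pip install scikit-learn imbalanced-learn
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # An imbalanced two-class problem (roughly 5% minority class).
    X, y = make_classification(n_samples=2000, n_features=10,
                               weights=[0.95, 0.05], random_state=0)
    print("before:", Counter(y))

    # SMOTE interpolates between a minority sample and one of its k nearest
    # minority neighbours; the choice of k_neighbors affects how far the
    # synthetic samples spread toward the majority region, which is where
    # the over-generalization discussed in the abstract comes from.
    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_res))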

11.
In recent years, web mining has become one of the most widely used research areas for finding patterns in web pages. Web content mining is defined as the process of extracting useful information from web pages. For this purpose, existing work proposed a Block Acquiring Page Segmentation (BAPS) technique, which removes irrelevant information while retrieving content, and employed the Tag-Annotation-Demand (TAD) re-ranking methodology to generate personalized image results. The major disadvantage of these techniques is that they fail to retrieve images and web page contents together. To overcome this issue, this paper integrates the TAD and BAPS techniques for combined image and web page content retrieval. Two main steps are involved: uploading data to the server database and extracting content from the database. Furthermore, Semantic Annotation Based Clustering (SABC) is applied to the image data and Semantic Based Clustering (SBC) to the web page content. The main aim of the proposed work is to accurately retrieve both images and web pages. In experiments, the performance of the proposed SABC technique is evaluated and analyzed in terms of computation time, precision and recall.

12.
In this paper, a hybrid system and a hierarchical neural-net approach are proposed to solve the automatic labeling problem in unsupervised clustering. The first method applies non-neural clustering algorithms directly to the output of a neural net, and the second is based on a multilayer organization of neural units. Both methods are a substantial improvement over the most important unsupervised neural algorithms in the literature. Experimental results illustrate the clustering performance of the systems.

13.
Class-imbalanced data refers to data in which the numbers of samples in different classes differ greatly. AUC (area under the ROC curve) is an important metric for measuring classifier performance on imbalanced data. Since AUC is non-differentiable, researchers have proposed many surrogate pairwise loss functions to optimize it. The number of pairwise losses equals the product of the numbers of positive and negative samples, and the large number of positive-negative pairs with small pairwise losses degrades classifier performance. To address this problem, a weighted pairwise loss function, WPLoss, is proposed: by assigning higher loss weights to positive-negative pairs with larger pairwise losses, it reduces the influence of the many pairs with small pairwise losses and thereby improves classifier performance. Experimental results on the 20 Newsgroups and Reuters-21578 datasets verify the effectiveness of WPLoss and show that it can improve classifier performance on imbalanced data.
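As an illustration of the weighting idea (a sketch under assumptions, not necessarily the paper's exact WPLoss definition), the following Python function weights each positive-negative pair's hinge loss by a power of that same loss, so the many easy pairs contribute little to the total:

    import numpy as np

    def weighted_pairwise_loss(scores_pos, scores_neg, gamma=2.0, margin=1.0):
        """Sketch of a weighted pairwise surrogate for AUC: pairs with a larger
        hinge loss receive a larger weight, so the numerous easy
        positive-negative pairs are down-weighted. gamma and margin are
        hypothetical parameters for this illustration."""
        # Pairwise hinge loss for every (positive, negative) score pair.
        diff = margin - (scores_pos[:, None] - scores_neg[None, :])
        hinge = np.maximum(diff, 0.0)
        # Weight grows with the pair's own loss (gamma controls the emphasis).
        weights = hinge ** gamma
        return (weights * hinge).sum() / max(weights.sum(), 1e-12)

    rng = np.random.default_rng(0)
    pos = rng.normal(1.0, 1.0, 50)     # classifier scores for positive samples
    neg = rng.normal(0.0, 1.0, 500)    # many more negatives (imbalanced data)
    print(round(weighted_pairwise_loss(pos, neg), 4))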

14.
Information & Management, 2002, 40(2): 133-146.
Information quality (IQ) is critical in organizations. Yet, despite a decade of active research and practice, the field lacks comprehensive methodologies for its assessment and improvement. Here, we develop such a methodology, which we call AIM quality (AIMQ), to form a basis for IQ assessment and benchmarking. The methodology is illustrated through its application to five major organizations. The methodology encompasses a model of IQ, a questionnaire to measure IQ, and analysis techniques for interpreting the IQ measures. We develop and validate the questionnaire and use it to collect data on the status of organizational IQ. These data are used to assess and benchmark IQ for the four quadrants of the model. The analysis techniques are applied to analyze the gap between an organization and best practices, as well as the gaps between IS professionals and information consumers. The results of these techniques are useful for determining the best areas for IQ improvement activities.

15.
Chien-Hsing Wu. Knowledge, 2002, 15(8): 507-514.
Many approaches to granulization have been presented for knowledge discovery. However, the inconsistent tuples that exist in granulized datasets are hardly ever revealed. In this paper, we develop a tuple consistency recognition model (TCRM) to efficiently detect inconsistent tuples in granulized datasets. The main outputs of the model include the discovered inconsistent tuples and the processing time consumed. We further conducted an empirical test in which eighteen continuous real-life datasets, granulized by the equal-width interval technique using the embedded S-Plus histogram binning algorithm (SHBA) and the largest binning size algorithm (LBSA), were diagnosed. The results are remarkable: almost 40% of the granulized datasets contain inconsistent tuples, and 22% have a proportion of inconsistent tuples exceeding 20%.
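To illustrate what an inconsistent tuple is after granulization, here is a hedged Python sketch (an illustration of the concept, not the paper's TCRM): tuples that fall into the same equal-width granule on every attribute but carry different class labels are flagged as inconsistent.

    from collections import defaultdict
    import numpy as np

    def inconsistent_tuples(X, y, bins=5):
        """Return indices of tuples whose equal-width bin codes coincide with
        those of a tuple carrying a different class label."""
        X = np.asarray(X, dtype=float)
        lo, hi = X.min(axis=0), X.max(axis=0)
        width = np.where(hi > lo, (hi - lo) / bins, 1.0)
        codes = np.minimum(((X - lo) / width).astype(int), bins - 1)

        labels_per_granule = defaultdict(set)
        for row, label in zip(map(tuple, codes), y):
            labels_per_granule[row].add(label)
        bad = {g for g, labels in labels_per_granule.items() if len(labels) > 1}
        return [i for i, row in enumerate(map(tuple, codes)) if row in bad]

    rng = np.random.default_rng(0)
    X = rng.random((200, 2))
    y = rng.integers(0, 2, 200)
    idx = inconsistent_tuples(X, y, bins=3)
    print(f"{len(idx)} of {len(X)} tuples are inconsistent after binning")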

16.
SCTP: a proposed standard for robust Internet data transport
Caro, A.L., Jr.; Iyengar, J.R.; Amer, P.D.; Ladha, S.; Heinz, G.J., II; Shah, K.C. Computer, 2003, 36(11): 56-63.
The stream control transmission protocol (SCTP) is an evolving general purpose Internet transport protocol designed to bridge the gap between TCP and UDP. SCTP evolved from a telephony signaling protocol for IP networks and is now a proposed standard with the Internet Engineering Task Force. Like TCP, SCTP provides a reliable, full-duplex connection and mechanisms to control network congestion. However, SCTP expands transport layer possibilities beyond TCP and UDP, offering new delivery options that are particularly desirable for telephony signaling and multimedia applications.
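As a rough illustration of SCTP's TCP-like one-to-one mode, here is a minimal Python sketch using only the standard socket module. It assumes a Linux kernel with SCTP support (otherwise socket.IPPROTO_SCTP may be unavailable or the connection may fail), and SCTP-specific features such as multi-streaming would need a dedicated binding such as pysctp; the address and port are arbitrary.

    import socket

    # One-to-one style SCTP association: the API mirrors TCP, but the
    # transport underneath is SCTP (requires kernel SCTP support).
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    srv.bind(("127.0.0.1", 9999))
    srv.listen(1)

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    cli.connect(("127.0.0.1", 9999))
    conn, _ = srv.accept()

    cli.sendall(b"hello over SCTP")
    print(conn.recv(1024))

    for s in (cli, conn, srv):
        s.close()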

17.
This paper analyzes the role of situational information as an antecedent of terrorists’ opportunistic decision making in the volatile and extreme environment of the Mumbai terrorist attack. We especially focus on how the Mumbai terrorists monitored and utilized situational information to mount attacks against civilians. Situational information, broadcast through live media and Twitter, contributed to the terrorists’ decision-making process and, as a result, increased the effectiveness of hand-held weapons in accomplishing their goal. Using a framework drawn from Situation Awareness (SA) theory, this paper aims to (1) analyze the content of Twitter postings of the Mumbai terror incident, (2) expose the vulnerabilities of Twitter as a participatory emergency reporting system in the terrorism context, and (3) suggest, based on the content analysis of Twitter postings, a conceptual framework for analyzing information control in the context of terrorism.

18.
Multimedia Systems - Human activity recognition has been a significant goal of computer vision since its inception and has developed considerably in recent years. Recent approaches to this...

19.
AADL (Architecture Analysis and Design Language) concentrates on the modeling and analysis of application system architectures. It is popular for its simple syntax, powerful functionality and extensibility, and has been widely applied to embedded systems. However, AADL is not sufficient for modeling cyber-physical systems (CPS), mainly because it cannot be used to model continuous dynamic behaviors. This paper proposes an approach to constructing a new sublanguage of AADL, called AADL+, to facilitate modeling not only the discrete and continuous behaviors of CPS but also the interaction between cyber components and physical components. The syntax and semantics of the sublanguage are provided to describe system behaviors. Moreover, we develop a plug-in for OSATE (Open Source AADL Tool Environment) for modeling CPS; the plug-in supports syntax checking and simulation of the system model by linking with Modelica. Finally, the AADL+ annex is successfully applied to model a lunar rover control system.

20.
Similarity search is important in information-retrieval applications where objects are usually represented as vectors of high dimensionality. This paper proposes a new dimensionality-reduction technique and an indexing mechanism for high-dimensional datasets. For each data vector, the proposed technique drops the dimensions whose coordinates are less than a critical value. This flexible, datawise dimensionality reduction helps improve indexing mechanisms for high-dimensional datasets with skewed distributions in all coordinates. To apply the proposed technique to information retrieval, a CVA file (compact VA file), a revised version of the VA file, is developed. By using a CVA file, the size of index files is reduced further, while the tightness of the index bounds is preserved as much as possible. The effectiveness is confirmed on synthetic and real data.
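The datawise idea can be sketched in a few lines of Python (an illustration in the spirit of the description, not the paper's CVA file format): each vector keeps only the coordinates at or above a critical value, and distances computed over the kept coordinates are valid lower bounds that can be used for filtering.

    import numpy as np

    def datawise_reduce(vectors, critical):
        """Keep, per vector, only the coordinates >= critical as
        (index, value) pairs; each vector keeps a different subset of
        dimensions (flexible, datawise reduction)."""
        reduced = []
        for v in vectors:
            idx = np.where(v >= critical)[0]
            reduced.append((idx, v[idx]))
        return reduced

    def lower_bound_dist(query, entry):
        """Euclidean distance restricted to the kept coordinates; dropping
        squared terms can only underestimate, so this is a lower bound on
        the true distance and can prune candidates during search."""
        idx, vals = entry
        return float(np.linalg.norm(query[idx] - vals))

    rng = np.random.default_rng(0)
    data = rng.random((1000, 64))
    index = datawise_reduce(data, critical=0.8)
    kept = sum(len(i) for i, _ in index) / data.size
    print(f"kept {kept:.1%} of the coordinates")
    q = rng.random(64)
    print(round(lower_bound_dist(q, index[0]), 3))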

