Similar Documents
20 similar documents found (search time: 15 ms)
1.
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to prevent the tremendous computational overload that may arise in large and diverse document collections such as pages downloaded from the World Wide Web. In order to reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, prefixed number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but on the other hand it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small, prefixed number, i.e., an arbitrary new incoming vector which is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than a single vector. These features are the main reasons for the high-quality performance of the proposed algorithms. We describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, detecting anomalous documents downloaded from the Internet by users with abnormal information interests.
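The abstract gives no implementation details, so the following is only a minimal sketch of the crisp variant it describes: each document is a sparse key-phrase-to-weight map, and an incoming vector either joins the most similar existing cluster (by cosine similarity) or, if no centroid is similar enough, starts a new one. The similarity threshold and the centroid-update rule are assumptions made for the example, not taken from the paper.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse key-phrase -> weight maps."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def incremental_cluster(docs, threshold=0.3):
    """Crisp clustering without a prefixed number of clusters:
    a vector not similar enough to any centroid starts a new cluster."""
    centroids, members = [], []
    for doc in docs:
        sims = [cosine(doc, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < threshold:
            centroids.append(dict(doc))       # new cluster seeded by this vector
            members.append([doc])
        else:
            members[best].append(doc)
            merged = defaultdict(float)        # naive centroid update (assumption)
            for d in members[best]:
                for k, w in d.items():
                    merged[k] += w / len(members[best])
            centroids[best] = dict(merged)
    return centroids, members

docs = [{"cluster": 0.9, "analysis": 0.4},
        {"cluster": 0.8, "document": 0.5},
        {"malware": 0.9, "download": 0.7}]
print(incremental_cluster(docs)[1])   # first two docs grouped, third starts a new cluster
```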

2.
A number of studies report that ICT sectors are responsible for up to 10% of worldwide power consumption and that a substantial share of this amount is due to the Internet infrastructure. To accommodate the traffic in peak hours, Internet Service Providers (ISPs) have overprovisioned their networks, with the result that most links and devices are under-utilized most of the time. Under-utilized links and devices may thus be put into a sleep state to save power, and this can be achieved by properly routing traffic flows. In this paper, we address the design of a joint admission control and routing scheme aiming at maximizing the number of admitted flow requests while minimizing the number of nodes and links that need to stay active. We assume an online routing paradigm, where flow requests are processed one by one, with no knowledge of future flow requests. Each flow request has requirements in terms of bandwidth and m additive measures (e.g., delay, jitter). We develop a new routing algorithm, E2-MCRA, which searches for a feasible path for a given flow request that requires the least number of nodes and links to be turned on. The basic concepts of E2-MCRA are look-ahead, a depth-first search approach, and a path length defined as a function of the available bandwidth, the additive QoS constraints, and the current status (on/off) of the nodes and links along the path. Finally, we present the results of the simulation studies we conducted to evaluate the performance of the proposed algorithm.
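As an illustration of the kind of search the abstract describes, here is a hedged sketch of a depth-first, feasibility-pruned path search that prefers links which are already active. The graph encoding, the single additive delay constraint, and the wake-up cost model are assumptions for the example, not the actual E2-MCRA specification.

```python
def find_path(graph, active, src, dst, bw_req, delay_max):
    """graph: {u: [(v, bandwidth, delay)]}; active: set of links currently on.
    Return a path satisfying the bandwidth and delay constraints that wakes up
    as few sleeping links as possible among the branches explored."""
    best = {"path": None, "wakeups": float("inf")}

    def dfs(node, path, delay, wakeups):
        if wakeups >= best["wakeups"]:
            return                                    # look-ahead style pruning
        if node == dst:
            best.update(path=list(path), wakeups=wakeups)
            return
        for v, bw, d in graph.get(node, []):
            if v in path or bw < bw_req or delay + d > delay_max:
                continue                              # infeasible branch
            extra = 0 if (node, v) in active else 1   # cost of waking a link
            path.append(v)
            dfs(v, path, delay + d, wakeups + extra)
            path.pop()

    dfs(src, [src], 0, 0)
    return best["path"]

graph = {"a": [("b", 10, 2), ("c", 10, 1)], "b": [("d", 10, 2)], "c": [("d", 10, 5)]}
print(find_path(graph, active={("a", "b"), ("b", "d")}, src="a", dst="d",
                bw_req=5, delay_max=10))   # -> ['a', 'b', 'd'], no link woken up
```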

3.
Rather than a document that is constantly being written, as in the wiki approach, the Living Document (LD) is a document that also acts as a document router, operating by means of structured and organized social tagging and using existing ontologies. It offers an environment where users can manage papers and related information, share their knowledge with their peers, and discover hidden associations amongst the shared knowledge. The LD builds upon both the Semantic Web, which values the integration of well-structured data, and the Social Web, which aims to facilitate interaction amongst people by means of user-generated content. In this vein, the LD is similar to a social networking system, with users as central nodes in the network, with the difference that interaction is focused on papers rather than people. Papers, with their ability to represent research interests, expertise, affiliations, and links to web-based tools and databanks, are the central axis for interaction amongst users. To support this, we have also implemented a novel web prototype that enables researchers to accomplish three activities central to the Semantic Web vision: organizing, sharing and discovering. Availability: http://www.scientifik.info/livingdocument.

4.
Entangled cloud storage (Aspnes et al., ESORICS 2004) enables a set of clients to "entangle" their files into a single clew to be stored by a (potentially malicious) cloud provider. The entanglement makes it impossible to modify or delete a significant part of the clew without affecting all files encoded in the clew. A clew keeps the files in it private but still lets each client recover his own data by interacting with the cloud provider; no cooperation from other clients is needed. At the same time, the cloud provider is discouraged from altering or overwriting any significant part of the clew, as this would imply that none of the clients can recover their files. We put forward the first simulation-based security definition for entangled cloud storage, in the framework of universal composability (Canetti, 2001). We then construct a protocol satisfying our security definition, relying on an entangled encoding scheme based on privacy-preserving polynomial interpolation; entangled encodings were originally proposed by Aspnes et al. as useful tools for the purpose of data entanglement. As a contribution of independent interest, we revisit the security notions for entangled encodings, putting forward stronger definitions than previous work (which, for instance, did not consider collusion between clients and the cloud provider). Protocols for entangled cloud storage find application in the cloud setting, where clients store their files on a remote server and need to be assured that the cloud provider will not modify or delete their data illegitimately. Current solutions, e.g., those based on Provable Data Possession and Proofs of Retrievability, require the server to be challenged regularly to provide evidence that the clients' files are stored at a given time. Entangled cloud storage provides an alternative approach where any single client operates implicitly on behalf of all others, i.e., as long as one client's files are intact, the entire remote database continues to be safe and unblemished.
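To make the encoding idea concrete, below is a toy illustration (emphatically not the paper's construction) of entanglement via polynomial interpolation over a prime field: each client's datum becomes a point, the clew is the unique interpolating polynomial, and no point can survive a change to the polynomial. The field size and the point assignment are arbitrary choices made for the sketch.

```python
P = 2_147_483_647                      # a Mersenne prime, large enough for the demo

def poly_mul_linear(b, x0):
    """Multiply polynomial b (coefficients, low to high degree) by (x - x0) mod P."""
    out = [0] * (len(b) + 1)
    for k, c in enumerate(b):
        out[k] = (out[k] - x0 * c) % P
        out[k + 1] = (out[k + 1] + c) % P
    return out

def interpolate(points):
    """Coefficients of the unique polynomial of degree < len(points)
    passing through all points, via Lagrange interpolation mod P."""
    n = len(points)
    coeffs = [0] * n
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = poly_mul_linear(basis, xj)
                denom = denom * (xi - xj) % P
        scale = yi * pow(denom, -1, P) % P
        for k, c in enumerate(basis):
            coeffs[k] = (coeffs[k] + scale * c) % P
    return coeffs

def evaluate(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

files = {1: 4242, 2: 1717, 3: 9999}          # client id -> file content (toy integers)
clew = interpolate(list(files.items()))      # the shared "clew"
print(all(evaluate(clew, x) == y for x, y in files.items()))   # True: each client recovers its point
```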

5.
Our study is motivated by the need to enable quality of service (QoS), congestion control and fair rate allocation for all end applications. We propose a new approach to address these needs, which differs from the current practice whereby end applications pursue their own rate control using TCP. Our approach comprises a network rate management protocol (RMP) that controls the rate of all flows (at an aggregate level based on routes) subject to QoS requirements. The RMP control also facilitates a new TCP sliding-window congestion control based on the fair target rates computed by the RMP. Each non-TCP aggregate flow is policed by its respective edge router, and each TCP flow adapts its window size so as to achieve the RMP-suggested fair target rate. The stability analysis of the new TCP congestion control is performed in a linearly scalable framework, which is less restrictive than a fluid model. We show that our proposed control is linearly scalable and establish its global asymptotic stability under arbitrary and variable information time lags, a.k.a. totally asynchronous conditions. The stability and viability of our control are verified by two means. One is a simulation of a network comprising 74 core links and up to 768 flows, each using its own access link. The simulation is also used to compare our control with the congestion control algorithms used in Fast, Vegas and Reno TCPs. The second verification means is an actual implementation of the control in the Linux kernel and its experimentation in a WAN testbed network comprising six routers and long-haul links running UDP flows as well as CUBIC, N-RENO and C-TCP flows. Our experiments demonstrate that our approach can guarantee fair rates for all flows and QoS to premium flows.
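As a rough illustration of the window-adaptation idea, the sketch below moves a congestion window toward a fair target rate suggested by an RMP-like controller. The update rule (closing a fixed fraction of the gap each RTT), the segment size, and all constants are assumptions made for the example, not the controller analyzed in the paper.

```python
def adapt_window(current_window, target_rate_bps, rtt_s, mss_bytes=1460, gain=0.5):
    """Move the congestion window (in segments) toward target_rate * RTT."""
    target_window = (target_rate_bps * rtt_s) / (8 * mss_bytes)
    return max(1.0, current_window + gain * (target_window - current_window))

w = 10.0
for rtt_sample in [0.05, 0.05, 0.06, 0.05]:        # RTT samples in seconds
    w = adapt_window(w, target_rate_bps=8_000_000, rtt_s=rtt_sample)
    print(round(w, 1))                             # window converges toward ~34 segments
```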

6.
It has been demonstrated that object-oriented frameworks can bring many kinds of advantages to application developers. To gain these advantages, application developers have to follow the framework-based development process. One step of the process is to integrate new components for framework extension; this is defined as a framework extension task in this work. In this task, application developers have to (1) retrieve examples, (2) acquire necessary documents, which are defined as the documents containing example adaptation information, and (3) adapt examples. Currently, acquiring necessary documents requires a lot of time because it is achieved by manually searching the Internet. Although there are many approaches to correctly acquiring those documents, the focus is never on time reduction. To satisfy this new criterion, we identify the following challenging issues: (1) the dynamics of the valid document version, and (2) the uncertainty of the relevant necessary documents. The first issue is that the valid document version varies according to the framework version under which the retrieved example is workable. The second is that the relevant necessary documents cannot be determined until a specific necessary document is specified. To resolve these two issues, a Self-adaptive Document link provision system, named SeaDoc, is presented in this work. SeaDoc resolves the dynamics by dynamically constructing document links with the corresponding valid document version. SeaDoc also resolves the uncertainty by adaptively selecting highly relevant document links. The experimental results show that SeaDoc reduces the time by 73% and 83% compared with two other approaches.

7.
A control method is proposed for constructing quality assessments of scientific and technical documents in natural languages, based on formalizing perceptions of a document's content-related context. A method is provided for using models of documents that characterize their subject and content, alongside bibliometric and scientometric data and indicators, to identify both the objective and the subjective (authors' and readers') content-related context of the analyzed document. An outlook is given as to how the context analysis of scientific and technical documents, taking into account quantitative measures of quality (information capacity, significance, and independence of content) as well as traditional bibliometric and scientometric indicators (the document's citation index and the journal's impact factor), provides for an objective assessment of the document's quality.

8.
9.
The storage and retrieval of multimedia data is a crucial problem in multimedia information systems due to the huge storage requirements. It is necessary to provide an efficient methodology for indexing multimedia data for rapid retrieval. The aim of this paper is to introduce a methodology to represent, simplify, store, retrieve and reconstruct an image from a repository. An algebraic representation of the spatio-temporal relations present in a document is constructed from an equivalent graph representation and used to index the document. We use this representation to simplify and later reconstruct the complete index. This methodology has been tested by implementing a prototype system called Simplified Modeling to Access and ReTrieve multimedia information (SMART). Experimental results show that the complexity of an index of a 2D document is O(n(n−1)/k) with k ≥ 2, as opposed to the O(n(n−1)/2) known so far. Since k depends on the number of objects in an image, more complex documents have a lower overall complexity.

10.
Information Systems, 2002, 27(7): 459–486
XML is spreading as a standard for semistructured documents on the Web, so the ability to query XML documents that are linked by XML links is becoming an important goal. In this paper we present XML-GLrec, an extended version of XML-GL, the graphical query language for XML documents. XML-GL allows information to be extracted and restructured from XML-specified WWW documents. We extend XML-GL in the following directions: (i) XML-GLrec can represent simple XML links, so that whole XML-specified WWW sites can be queried in a simple and intuitive way; (ii) XML-GLrec improves the expressive power of XML-GL, in which only transitive closure can be expressed, by allowing generic recursion; (iii) finally, we permit the user to specify queries more easily by allowing sequences of nested queries, in the same way as in SQL.

11.
The extent and complexity of environmental data management needs have increased significantly over the past several years. As environmental regulations increase, and compliance solutions are evaluated, the need to monitor and document performance becomes increasingly important. Regulatory complexities and volumes of monitoring data mandate the use of detailed procedures to assure compliance in accounting for wastes produced and disposed of. A commercially available comprehensive environmental data management system offers a solution, particularly in times when personnel availability for in-house software development is limited. The modular ECOTRAC™ system enables users to customize a system to their data management needs. Based on dBASE III™, the system offers the flexibility to meet specific needs without extensive programming or computer knowledge. Standard reports allow consistent and timely reporting to management and regulatory authorities. Case studies demonstrate efficiencies gained through the use of commercially available environmental data management software for microcomputers. ECOTRAC™ software has proven useful in a variety of industry applications and has been favorably received by independent technical reviewers.

12.
Rumor spreading in social networks
Social networks are an interesting class of graphs likely to become of increasing importance in the future, not only theoretically, but also for their probable applications to ad hoc and mobile networking. Rumor spreading is one of the basic mechanisms for information dissemination in networks; its relevance stems from its simplicity of implementation and its effectiveness. In this paper, we study the performance of rumor spreading in the classic preferential attachment model of Bollobás et al., which is considered to be a valuable model for social networks. We prove that, in these networks: (a) the standard PUSH-PULL strategy delivers the message to all nodes within O(log² n) rounds with high probability; (b) by themselves, PUSH and PULL require polynomially many rounds. (These results are under the assumption that m, the number of new links added with each new node, is at least 2. If m=1, the graph is disconnected with high probability, so no rumor spreading strategy can work.) Our analysis is based on a careful study of some new properties of preferential attachment graphs which could be of independent interest.
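For readers who want to experiment with the protocol itself, here is a small simulation sketch of the standard synchronous PUSH-PULL strategy on an arbitrary graph; the preferential attachment construction and the analysis are omitted, and the toy topology, tie handling, and round structure are assumptions made for the example.

```python
import random

def push_pull_rounds(adj, start):
    """adj: {node: [neighbors]}. Return the number of synchronous rounds until
    every node is informed. Each round every node contacts one random
    neighbor; informed nodes PUSH the rumor, uninformed nodes PULL it."""
    informed = {start}
    rounds = 0
    while len(informed) < len(adj):
        newly = set()
        for u in adj:
            v = random.choice(adj[u])
            if u in informed and v not in informed:
                newly.add(v)          # PUSH
            if u not in informed and v in informed:
                newly.add(u)          # PULL
        informed |= newly
        rounds += 1
    return rounds

# Toy example: a small ring with long-range chords (not a preferential attachment graph).
n = 32
adj = {i: [(i - 1) % n, (i + 1) % n, (i + n // 2) % n] for i in range(n)}
print(push_pull_rounds(adj, start=0))
```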

13.
This paper presents recent advancements in and applications of TOUGH-FLAC, a simulator for multiphase fluid flow and geomechanics. The TOUGH-FLAC simulator links the TOUGH family of multiphase fluid and heat transport codes with the commercial FLAC3D geomechanical simulator. The most significant new TOUGH-FLAC development in the past few years is a revised architecture, enabling a more rigorous and tight coupling procedure with improved computational efficiency. The applications presented in this paper are related to modeling of crustal deformations caused by deep underground fluid movements and pressure changes as a result of both industrial activities (the In Salah CO2 Storage Project and The Geysers Geothermal Field) and natural events (the 1960s Matsushiro Earthquake Swarm). Finally, the paper provides some perspectives on the future of TOUGH-FLAC in light of its applicability to practical problems and the need for high-performance computing capabilities for field-scale problems, such as industrial-scale CO2 storage and enhanced geothermal systems. It is concluded that, despite some limitations in fully adapting a commercial code such as FLAC3D to specialized research and computational needs, TOUGH-FLAC is likely to remain a pragmatic simulation approach, with an increasing number of users in both academia and industry.

14.
Most World-Wide Web information servers provide simple browsing access to collections of static text or hypertext files. This paper describes several interactive World-Wide Web servers that produce information displays and documents dynamically rather than just providing access to static files. The PARC Map Viewer uses a geographic database to create and display maps of any part of the world on demand. The Digital Tradition folk music server provides access to a large database of song lyrics and melodies. These applications take advantage of the multimedia capabilities of the World-Wide Web to deliver graphical and audio content as well as formatted text. Hypertext links are used not only for navigation, but also for setting search and presentation parameters. In these applications, the HTML format and the HTTP protocol are used like a user-interface toolkit to provide not only document retrieval but also a complete custom user interface specialized for the application.
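The following is a hedged, self-contained sketch of the general pattern the paper describes: a server that generates an HTML document on demand and embeds hyperlinks carrying presentation parameters, so that navigation doubles as the user interface. The parameter name and the zoom logic are invented for the example and are not taken from the PARC Map Viewer.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class DynamicDoc(BaseHTTPRequestHandler):
    def do_GET(self):
        q = parse_qs(urlparse(self.path).query)
        zoom = int(q.get("zoom", ["1"])[0])
        # Links in the generated page re-invoke the server with new parameters,
        # so following a link changes the presentation rather than fetching a file.
        body = (f"<html><body><h1>View at zoom {zoom}</h1>"
                f'<a href="/?zoom={zoom + 1}">zoom in</a> '
                f'<a href="/?zoom={max(1, zoom - 1)}">zoom out</a>'
                f"</body></html>").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), DynamicDoc).serve_forever()
```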

15.
Although next-generation sequencing applications are becoming dominant in molecular genetics, there are still many institutions that want to utilize their legacy sequencers as much as possible. An important concern in sequencing services is the quality of the trace files presented to the customers. In this respect, the quality of the trace files should be screened and low-quality files should be handled differently before they reach customers. The quality scores already present in the trace files provide some useful information; however, by incorporating auxiliary information we can improve the reliability of these scores. To this end, we used a feature-based supervised classification strategy, which requires sets of training and testing trace files whose qualities are determined manually. We tested several machine learning algorithms, namely k-nearest neighbors, Naive Bayes, Support Vector Machines and Random Forest, on a public DNA trace repository. Our results indicate that the Random Forest method, with only 4 simple features, provides a classification accuracy of 94.68% with a high level of agreement (Kappa = 0.8679).
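A hedged sketch of the overall pipeline follows: four simple per-trace features feed a Random Forest classifier, and accuracy and Cohen's kappa are reported. The feature names, the synthetic data, and the train/test split are assumptions; only the general strategy mirrors the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: mean quality score, fraction of low-quality bases,
# trace length, mean peak spacing (synthetic stand-ins, not the paper's features).
X = rng.normal(size=(200, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("kappa:", cohen_kappa_score(y_te, pred))
```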

16.
17.
Three methods are introduced for generating complete scans of multidimensional spaces. The traditional method is to use a raster (typically generated by nested iteration), which generates points at the maximum resolution and fills the space slowly. New methods are needed because in many applications it is desirable for the scanned points to be distributed throughout the space and for the resolution to increase with the number of points scanned. Three simple methods are introduced in this paper. Two of the methods are members of a class of methods in which the reverse-bit-order operator maps points from "R(esolution)-space" to the desired space. In "R-space" the distance from the origin determines the resolution level of the scanned point. The two scans occupy points in such a way that a distance measure such as the L1 norm or the L∞ norm increases with the progress of the scan. The third method uses iteration of primitive polynomials modulo 2 to generate a nonrepeating sequence of binary numbers which eventually fills the space. This method is the most computationally efficient, but the L∞ norm method generates partial scans which completely sample the space at intermediate levels of resolution. Applications are expected in scientific visualization, graphics rendering, multicriterion optimization, and progressive image transmission.
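To illustrate the third method, here is a minimal sketch of scanning with a primitive polynomial modulo 2 (a Galois LFSR) that visits every non-zero state exactly once before repeating. The particular 16-bit polynomial, the seed, and the split of the state into two 8-bit coordinates are assumptions made for the example.

```python
def lfsr_scan(seed=0xACE1, taps=0xB400):
    """Yield (x, y) points of a 256x256 grid in LFSR order.
    taps encodes the primitive polynomial x^16 + x^14 + x^13 + x^11 + 1;
    the all-zero point (0, 0) is the one state the sequence never visits."""
    state = seed
    while True:
        yield state >> 8, state & 0xFF          # split the state into coordinates
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps
        if state == seed:                       # full period of 2**16 - 1 reached
            return

points = list(lfsr_scan())
print(len(points), len(set(points)))            # 65535 65535: all points distinct
```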

18.
Hierarchical N-body methods, which are based on a fundamental insight into the nature of many physical processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their nonuniform, dynamically changing characteristics and their need for long-range communication. In this paper, we study the partitioning and scheduling techniques required to obtain effective parallel performance on applications that use a range of hierarchical N-body methods. To obtain representative coverage, we first examine applications that use the two best methods known for classical N-body problems: the Barnes-Hut method and the fast multipole method. Then, we examine a recent hierarchical method for radiosity calculations in computer graphics, which applies the hierarchical N-body approach to a problem with very different characteristics. We find that straightforward decomposition techniques which an automatic scheduler might implement do not scale well, because they are unable to simultaneously provide load balancing and data locality. However, all the applications yield very good parallel performance if appropriate partitioning and scheduling techniques are implemented by the programmer. For the applications that use the Barnes-Hut and fast multipole methods, simple yet effective partitioning techniques can be developed by exploiting some key insights into both the methods and the classical problems that they solve. Using a novel partitioning technique, even relatively small problems achieve 45-fold speedups on a 48-processor Stanford DASH machine (a cache-coherent, shared address space multiprocessor) and 118-fold speedups on a 128-processor simulated architecture. The very different characteristics of the radiosity application require a different partitioning/scheduling approach to be used for it; however, it too yields very good parallel performance.
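As a loose illustration of cost-based partitioning in this setting (the paper's own partitioning schemes are not reproduced here), the sketch below splits an ordered list of particles into contiguous zones of roughly equal accumulated work, which is one simple way to combine load balance with locality. The cost values and the ordering are assumptions made for the example.

```python
def costzones(costs, num_procs):
    """Split the index range [0, len(costs)) into num_procs contiguous zones
    whose summed costs are as even as a single greedy sweep allows."""
    total = sum(costs)
    target = total / num_procs
    zones, start, acc = [], 0, 0.0
    for i, c in enumerate(costs):
        acc += c
        # Close the current zone once it reaches its share, keeping zones for the rest.
        if acc >= target and len(zones) < num_procs - 1:
            zones.append((start, i + 1))
            start, acc = i + 1, 0.0
    zones.append((start, len(costs)))
    return zones

costs = [1, 3, 2, 2, 5, 1, 1, 4, 2, 3]        # per-particle work measured in the last step
print(costzones(costs, num_procs=3))           # -> [(0, 4), (4, 8), (8, 10)]
```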

19.
Tries for approximate string matching
Tries offer text searches with costs that are independent of the size of the document being searched, and so are important for large documents requiring spelling checkers, case insensitivity, and limited approximate regular expression matching on secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small files, only at k=2. For larger files, complexity arguments indicate that tries will outperform the linear methods for larger values of k. The indexes combine suffixes and so are compact in storage. When the text itself does not need to be stored, as in a spelling checker, we even obtain negative overhead: 50% compression. We discuss a variety of applications and extensions, including best match (for spelling checkers), case insensitivity, and limited approximate regular expression matching.
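A rough sketch of the idea follows: words stored in a trie are searched recursively with an edit budget k covering substitutions, insertions, and deletions (transpositions are omitted here), and branches are abandoned once the budget is exhausted, so the cost does not grow with the size of the indexed text. This illustrates the general technique, not the paper's exact algorithm.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                                   # end-of-word marker
    return root

def search(node, pattern, k, prefix="", out=None):
    """Collect trie words within edit distance k of pattern."""
    if out is None:
        out = set()
    if k < 0:
        return out
    if "$" in node and len(pattern) <= k:
        out.add(prefix)                    # remaining pattern chars count as deletions
    for ch, child in node.items():
        if ch == "$":
            continue
        if pattern and ch == pattern[0]:
            search(child, pattern[1:], k, prefix + ch, out)      # exact match
        if pattern:
            search(child, pattern[1:], k - 1, prefix + ch, out)  # substitution
        search(child, pattern, k - 1, prefix + ch, out)          # insertion
    if pattern:
        search(node, pattern[1:], k - 1, prefix, out)            # deletion
    return out

trie = build_trie(["cat", "car", "cart", "dog"])
print(sorted(search(trie, "cat", k=1)))      # ['car', 'cart', 'cat']
```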

20.
The distributed nature of the Web, as a decentralized system exchanging information between heterogeneous sources, has underlined the need to manage interoperability, i.e., the ability to automatically interpret information in Web documents exchanged between different sources, which is necessary for efficient information management and search applications. In this context, XML was introduced as a data representation standard that simplifies the tasks of interoperation and integration among heterogeneous data sources, allowing data to be represented in (semi-)structured documents consisting of hierarchically nested elements and atomic attributes. However, while XML has been shown to be most effective in exchanging data, i.e., in syntactic interoperability, it has proven limited when it comes to handling semantics, i.e., semantic interoperability, since it only specifies the syntactic and structural properties of the data without any further semantic meaning. As a result, XML semantic-aware processing has become a motivating challenge in Web data management, requiring dedicated semantic analysis and disambiguation methods to assign well-defined meaning to XML elements and attributes. In this context, most existing approaches: (i) ignore the problem of identifying ambiguous XML elements/nodes, (ii) only partially consider their structural relationships/context, (iii) use syntactic information in processing XML data regardless of the semantics involved, and (iv) are static in adopting fixed disambiguation constraints, thus limiting user involvement. In this paper, we provide a new XML Semantic Disambiguation Framework, titled XSDF, designed to address each of the above limitations, taking as input an XML document and producing as output a semantically augmented XML tree made of unambiguous semantic concepts extracted from a reference machine-readable semantic network. XSDF consists of four main modules for: (i) linguistic pre-processing of simple/compound XML node labels and values, (ii) selecting ambiguous XML nodes as targets for disambiguation, (iii) representing target nodes as special sphere neighborhood vectors including all XML structural relationships within a (user-chosen) range, and (iv) running context vectors through a hybrid disambiguation process combining two approaches, concept-based and context-based disambiguation, and allowing the user to tune disambiguation parameters to her needs. Our experiments demonstrate the effectiveness and efficiency of our approach in comparison with alternative methods. We also discuss some practical applications of our method, ranging over semantic-aware query rewriting, semantic document clustering and classification, mobile and Web services search and discovery, as well as blog analysis and event detection in social networks and tweets.
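As a simplified illustration of context-based sense selection (one of the two approaches the hybrid process combines), the sketch below scores candidate senses of an XML node label by overlap between a gloss and the labels found in the node's structural neighborhood. The sense inventory, the bag-of-words comparison, and the neighborhood handling are assumptions made for the example, not XSDF internals.

```python
def bag(text):
    """Lower-cased bag of words for a label or gloss."""
    return set(text.lower().split())

def disambiguate(node_label, context_labels, sense_inventory):
    """Pick the sense whose gloss overlaps most with the XML node's
    structural context (labels of nodes within the chosen range)."""
    context = set()
    for label in context_labels:
        context |= bag(label)
    best_sense, best_score = None, -1
    for sense, gloss in sense_inventory.get(node_label, {}).items():
        score = len(bag(gloss) & context)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

senses = {"bank": {"bank#finance": "institution money deposit loan",
                   "bank#river": "sloping land beside a river water"}}
print(disambiguate("bank", ["account", "loan", "customer"], senses))
# -> 'bank#finance'
```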
