Similar literature
 Found 20 similar documents; search took 31 ms
1.
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting the data thus revealed, matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.

2.
3.
4.
Automatic vehicle classification is an important area of research for intelligent transportation, traffic surveillance and security. A working image-based vehicle classification system is proposed in this paper. The first component, vehicle detection, is implemented by applying histogram of oriented gradient features and an SVM classifier. The second component, vehicle classification, which is the emphasis of this paper, is accomplished by a hybrid model composed of clustering and a kernel autoassociator (KAA). The KAA model is a generalization of auto-associative networks, trained to recall the inputs through a kernel subspace. As an effective one-class classification strategy, KAA has been proposed to implement classification with rejection, showing a balanced error–rejection trade-off. With a large number of training samples, however, the training of KAA becomes problematic due to the difficulties involved in directly creating the kernel matrix. As a solution, a hybrid model consisting of a self-organizing map (SOM) and KAA has been proposed to first acquire prototypes and then construct the KAA model, which has been proven efficient in internet intrusion detection. The hybrid model is further studied in this paper, with several clustering algorithms compared, including k-means clustering, SOM and Neural Gas. Experimental results using more than 2,500 images from four types of vehicles (bus, light truck, car and van) demonstrated the effectiveness of the hybrid model. The proposed scheme offers an accuracy of over 95% with a rejection rate of 8% and a reliability of over 98% with a rejection rate of 20%. This shows promising potential for real-world applications.

5.
In the analysis of a newspaper page an important step is the clustering of various text blocks into logical units, i.e., into articles. We propose three algorithms based on text processing techniques to cluster articles in newspaper pages. Based on the complexity of the three algorithms and experiments on actual pages from the Italian newspaper L'Adige, we select one of the algorithms as the preferred choice to solve the textual clustering problem.

6.
We report progress on the NL versus UL problem.
  • We show that counting the number of s-t paths in graphs where the number of s-v paths for any v is bounded by a polynomial can be done in FUL: the unambiguous log-space function class. Several new upper bounds follow from this including $\mathrm{ReachFewL} \subseteq \mathrm{UL}$ and $\mathrm{LFew} \subseteq \mathrm{UL}^{\mathrm{FewL}}$.
  • We investigate the complexity of min-uniqueness—a central notion in studying the NL versus UL problem. In this regard we revisit the class OptL[log n] and introduce UOptL[log n], an unambiguous version of OptL[log n]. We investigate the relation between UOptL[log n] and other existing complexity classes.
  • We consider the unambiguous hierarchies over UL and UOptL[log n]. We show that the hierarchy over UOptL[log n] collapses. This implies that $\mathrm{ULH} \subseteq \mathrm{L}^{\mathrm{promiseUL}}$, thus collapsing the UL hierarchy.
  • We show that the reachability problem over graphs embedded on 3 pages is complete for NL. This contrasts with the reachability problem over graphs embedded on 2 pages, which is log-space equivalent to the reachability problem in planar graphs and hence is in UL.

    7.
    8.
    Forms are our gates to the Web. They enable us to access the deep content of Web sites. Automatic form understanding provides applications, ranging from crawlers over meta-search engines to service integrators, with a key to this content. Yet, it has received little attention other than as a component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL advances the state of the art: For form labeling, it combines features from the text, structure, and visual rendering of a Web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern Web forms, OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect (> 97% accuracy in the evaluation domains) form interpretations. Yet, the effort to produce a domain schema is very low, as we provide a Datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of OPAL’s form interpretations through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms with no error, yet is implemented with only a handful of translation rules.

    9.
    Page replacement algorithms for main memory are crucial to system performance in modern operating systems. When memory is full, a page replacement algorithm exploits the temporal locality and frequency of page references to evict the page that is least likely to be accessed in the near future. Subsequently, loading the majority of data directly from memory improves performance by reducing the I/O waits of accessing slow storage. Replacement algorithms that maximize the hit ratio while incurring as little overhead as possible have been studied continuously. In this paper, we propose a time-shift least recently used (TSLRU) algorithm that converts frequency information of page references into temporal locality. Frequent accesses of a page are thus recognized and accumulated in terms of time. Moreover, pages being loaded into memory for the first time are not necessarily the most recently used pages. As a result, one-pass pages are evicted sooner in our algorithm than in the traditional LRU algorithm. Our performance evaluations show that TSLRU outperforms conventional page replacement algorithms on both artificial and real application traces. For example, the hit ratio of TSLRU exceeds that of ARC by 4.17% and that of LRU by 5.91% on normally distributed workloads. Moreover, TSLRU outperforms ARC by over 2% on half of the application traces tested.
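To make the baseline concrete, the following is a minimal sketch of how the hit ratio of a conventional LRU policy can be measured on a page-reference trace. It is not the TSLRU algorithm from the abstract, whose time-shift rule is not described here; the function name `lru_hit_ratio` and the example trace are assumptions chosen for illustration.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity):
    """Replay a page-reference trace through a plain LRU cache; return the hit ratio."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # a hit refreshes the page's recency
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return hits / len(trace) if trace else 0.0


# Hypothetical trace: one hot page (1) interleaved with one-pass pages.
trace = [1, 2, 1, 3, 1, 4, 1, 5, 1, 6]
print(lru_hit_ratio(trace, capacity=2))
```

In this trace every one-pass page enters the cache as the most recently used entry, which is the behaviour the abstract says TSLRU is designed to avoid.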

    10.
    11.
    In this paper, we introduce a new problem termed query reverse engineering (QRE). Given a database \(D\) and a result table \(T\), the output of some known or unknown query \(Q\) on \(D\), the goal of QRE is to reverse-engineer a query \(Q'\) such that the output of query \(Q'\) on database \(D\) (denoted by \(Q'(D)\)) is equal to \(T\) (i.e., \(Q(D)\)). The QRE problem has useful applications in database usability, data analysis, and data security. In this work, we propose a data-driven approach, TALOS (Tree-based classifier with At Least One Semantics), that is based on a novel dynamic data classification formulation, and we extend the approach to efficiently support the three key dimensions of the QRE problem: whether the input query is known/unknown, supporting different query fragments, and supporting multiple database versions.
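The acceptance condition \(Q'(D) = T\) can be illustrated with a small sketch that tests candidate queries against a target table. This only checks the condition; it does not reproduce TALOS or its classifier-based search. The table `emp`, the candidate queries, and the helper `reproduces` are hypothetical, and comparing results as multisets of rows is an assumption about the intended notion of equality.

```python
import sqlite3
from collections import Counter


def reproduces(conn, candidate_sql, target_rows):
    """Return True if the candidate query's output equals T as a multiset of rows."""
    rows = conn.execute(candidate_sql).fetchall()
    return Counter(rows) == Counter(target_rows)


# Hypothetical database D with a single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("ann", "it", 90), ("bob", "it", 60), ("cat", "hr", 80), ("dan", "hr", 50)],
)

# Result table T produced by some unknown query Q.
T = [("ann",), ("cat",)]

# Hypothetical candidates for Q'.
candidates = [
    "SELECT name FROM emp WHERE dept = 'it'",
    "SELECT name FROM emp WHERE salary > 70",
]
matching = [q for q in candidates if reproduces(conn, q, T)]
print(matching)
```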

    12.
    Regularized multiple-criteria linear programming (RMCLP) is a powerful new classification method that has been used in various real-life data mining problems. In this paper, a new Universum-regularized multiple-criteria linear programming model (called $\mathfrak{U}$-RMCLP), a useful extension of RMCLP, is proposed and applied for the first time to the railway safety field. Experiments on public datasets show that $\mathfrak{U}$-RMCLP achieves better results than the original model. Furthermore, experimental results on datasets from the trouble of moving freight car detection system (TFDS) indicate that the accuracy of $\mathfrak{U}$-RMCLP reaches up to 91%, which will be of great help to TFDS.

    13.
    14.
    The Hamiltonian Cycle problem is the problem of deciding whether an n-vertex graph G has a cycle passing through all vertices of G. This problem is a classic NP-complete problem. Finding an exact algorithm that solves it in ${\mathcal {O}}^{*}(\alpha^{n})$ time for some constant α<2 was a notorious open problem until very recently, when Björklund presented a randomized algorithm that uses ${\mathcal {O}}^{*}(1.657^{n})$ time and polynomial space. The Longest Cycle problem, in which the task is to find a cycle of maximum length, is a natural generalization of the Hamiltonian Cycle problem. For a claw-free graph G, finding a longest cycle is equivalent to finding a closed trail (i.e., a connected even subgraph, possibly consisting of a single vertex) that dominates the largest number of edges of some associated graph H. Using this translation we obtain two deterministic algorithms that solve the Longest Cycle problem, and consequently the Hamiltonian Cycle problem, for claw-free graphs: one algorithm that uses ${\mathcal {O}}^{*}(1.6818^{n})$ time and exponential space, and one algorithm that uses ${\mathcal {O}}^{*}(1.8878^{n})$ time and polynomial space.

    15.
    16.
    17.
    Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, complexity required to generate them, and evolution over time. Our least complex single method results in a rediscovery rate of almost 70% of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate of up to 77%. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.

    18.
    Although the earliest-deadline-first (EDF) policy is known to be optimal for preemptive real-time task scheduling in uniprocessor systems, the schedulability analysis problem has recently been shown to be $\mathit{co}\mathcal{NP}$ -hard. Therefore, approximation algorithms, and in particular, approximations based on resource augmentation have attracted a lot of attention for both uniprocessor and multiprocessor systems. Resource augmentation based approximations assume a certain speedup of the processor(s). Using the notion of approximate demand bound function (dbf), in this paper we show that for uniprocessor systems the resource augmentation factor is at most $\frac{2e-1}{e} \approx1.6322$ , where e is the Euler number. We approximate the dbf using a linear approximation when the analysis interval length of interest is larger than the relative deadline of the task. For identical multiprocessor systems with M processors and constrained-deadline task sets, we show that the deadline-monotonic partitioning (that has been proposed by Baruah and Fisher) with the approximate dbf leads to an approximation factor of $\frac{3e-1}{e}-\frac{1}{M} \approx 2.6322-\frac{1}{M}$ with respect to resource augmentation. We also show that the corresponding factor is $3-\frac{1}{M}$ for arbitrary-deadline task sets. The best known results so far were $3-\frac{1}{M}$ for constrained-deadline tasks and $4-\frac {2}{M}$ for arbitrary-deadline ones. Our tighter analysis exploits the structure of the approximate dbf directly and uses the processor utilization violations (which were ignored in all previous analysis) for analyzing resource augmentation factors. We also provide concrete input instances to show that the lower bound on the resource augmentation factor for uniprocessor systems—using the above approximate dbf—is 1.5, and the corresponding bound is 2.5 for identical multiprocessor systems with an arbitrary order of fitting and a large number of processors. 
Further, we also provide a polynomial-time approximation scheme (PTAS) to derive near-optimal solutions under the assumption that the ratio of the maximum relative deadline to the minimum relative deadline of tasks is a constant, which is a more relaxed assumption compared to the assumptions required for deriving such a PTAS in the past.
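As an illustration of the demand bound function mentioned in this abstract, the sketch below uses the definitions that are standard in the real-time scheduling literature: the exact dbf of a sporadic task, and a linear approximation that applies once the interval length reaches the relative deadline. Whether this matches the paper's exact approximation is an assumption, and the function names and task parameters are hypothetical.

```python
def dbf(wcet, deadline, period, t):
    """Exact demand bound function of one sporadic task over an interval of length t."""
    if t < deadline:
        return 0
    # Count the jobs whose release time and deadline both fit inside the interval.
    return ((t - deadline) // period + 1) * wcet


def approx_dbf(wcet, deadline, period, t):
    """Linear upper bound on dbf, used once t reaches the relative deadline."""
    if t < deadline:
        return 0.0
    # One full job plus demand growing at the task's utilization wcet / period.
    return wcet + (wcet / period) * (t - deadline)


# Hypothetical task: execution time 2, relative deadline 5, period 10.
for t in (4, 5, 14, 15, 25):
    print(t, dbf(2, 5, 10, t), approx_dbf(2, 5, 10, t))
```

The approximation never underestimates the exact demand, so a schedulability test built on it is safe but pessimistic; the resource augmentation factor quantified in the abstract bounds the processor speedup needed to compensate.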

    19.
    Bufferless Network-on-Chip (NoC) has emerged in recent years as an interesting option for NoC design because it can save considerable router power and area. However, bufferless NoC only works well under low-to-medium load because it becomes more easily congested as the message injection rate increases. In this paper, we propose a novel distributed source-throttling congestion control mechanism, called Cbufferless, that relieves the effect of congestion in bufferless NoC under high load. The proposed strategy uses a novel congestion detection and control mechanism that computes the average deflection rate of routed flits and throttles message injection in a distributed manner. With the new mechanism, the congestion information can be obtained directly inside each node, which allows the mechanism to be fully distributed without requiring any transmission of global congestion information among neighbor routers and within a router. Simulation results show that the proposed mechanism improves system throughput by up to ~30% and ~15.5%, and reduces energy consumption by up to ~40% and ~19%, compared with the baseline and injection-rate-throttling bufferless NoCs, respectively, while keeping message latency lower under congested load.

    20.
    We give partial results on the factorization conjecture on codes proposed by Schützenberger. We consider a family of finite maximal codes $C$ over the alphabet $A = \{a, b\}$ and we prove that the factorization conjecture holds for these codes. This family contains $(p,4)$-codes, where a $(p,4)$-code $C$ is a finite maximal code over $A$ such that each word in $C$ has at most four occurrences of $b$ and $a^p \in C$, for a prime number $p$. We also discuss the structure of these codes. The obtained results once again show relations between factorizations of finite maximal codes and factorizations of finite cyclic groups.
