Human–Object Interaction (HOI) detection is important to human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics in different HOI categories, an approach that ignores the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we refine features for HOI detection to be polysemy-aware through the use of two novel modules: namely, Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights important elements in human and object appearance features for each HOI category to be identified; moreover, LPFA augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module, which guides PD-Net to make decisions based on feature types deemed more important according to the language priors. Third, we propose to relieve the verb polysemy problem through sharing verb classifiers for semantically similar HOI categories. Furthermore, to expedite research on the verb polysemy problem, we build a new benchmark dataset named HOI-VerbPolysemy (HOI-VP), which includes common verbs (predicates) that have diverse semantic meanings in the real world. Finally, through deciphering the visual polysemy of verbs, our approach is demonstrated to outperform state-of-the-art methods by significant margins on the HICO-DET, V-COCO, and HOI-VP databases. Code and data in this paper are available at https://github.com/MuchHair/PD-Net.
相似文献C-Mantec neural network constructive algorithm Ortega (C-Mantec neural network algorithm implementation on MATLAB. https://github.com/IvanGGomez/CmantecPaco, 2015) creates very compact architectures with generalization capabilities similar to feed-forward networks trained by the well-known back-propagation algorithm. Nevertheless, constructive algorithms suffer much from the problem of overfitting, and thus, in this work the learning procedure is first analyzed for networks created by this algorithm with the aim of trying to understand the training dynamics that will permit optimization possibilities. Secondly, several optimization strategies are analyzed for the position of class separating hyperplanes, and the results analyzed on a set of public domain benchmark data sets. The results indicate that with these modifications a small increase in prediction accuracy of C-Mantec can be obtained but in general this was not better when compared to a standard support vector machine, except in some cases when a mixed strategy is used.
相似文献The computational complexity of solving nonlinear support vector machine (SVM) is prohibitive on large-scale data. In particular, this issue becomes very sensitive when the data represents additional difficulties such as highly imbalanced class sizes. Typically, nonlinear kernels produce significantly higher classification quality to linear kernels but introduce extra kernel and model parameters which requires computationally expensive fitting. This increases the quality but also reduces the performance dramatically. We introduce a generalized fast multilevel framework for regular and weighted SVM and discuss several versions of its algorithmic components that lead to a good trade-off between quality and time. Our framework is implemented using PETSc which allows an easy integration with scientific computing tasks. The experimental results demonstrate significant speed up compared to the state-of-the-art nonlinear SVM libraries. Reproducibility: our source code, documentation and parameters are available at https://github.com/esadr/mlsvm.
相似文献Discriminative correlation filter (DCF) played a dominant role in visual tracking tasks in early years. However, with the recent development of deep learning, the Siamese based networks begin to prevail. Unlike DCF, most Siamese network based tracking methods take the first frame as the reference, while ignoring the information from the subsequent frames. As a result, these methods may fail under unforeseeable situations (e.g. target scale/size changes, variant illuminations, occlusions etc.). Meanwhile, other deep learning based tracking methods learn discriminative filters online, where the training samples are extracted from a few fixed frames with predictable labels. However, these methods have the same limitations as Siamese-based trackers. The training samples are prone to have cumulative errors, which ultimately lead to tracking loss. In this situation, we propose SiamET, a Siamese-based network using Resnet-50 as its backbone with enhanced template module. Different from existing methods, our templates are acquired based on all historical frames. Extensive experiments have been carried out on popular datasets to verify the effectiveness of our method. It turns out that our tracker achieves superior performances than the state-of-the-art methods on 4 challenging benchmarks, including OTB100, VOT2018, VOT2019 and LaSOT. Specifically, we achieve an EAO score of 0.480 on VOT2018 with 31 FPS. Code is available at https://github.com/yu-1238/SiamET
相似文献Information on social media is multi-modal, most of which contains the meaning of sarcasm. In recent years, many people have studied the problem of sarcasm detection. Many traditional methods have been proposed in this field, but the study of deep learning methods to detect sarcasm is still insufficient. It is necessary to comprehensively consider the information of the text,the changes of the tone of the audio signal,the facial expressions and the body posture in the image to detect sarcasm. This paper proposes a multi-level late-fusion learning framework with residual connections, a more reasonable experimental data-set split and two model variants based on different experimental settings. Extensive experiments on the MUStARD show that our methods are better than other fusion models. In our speaker-independent experimental split, the multi-modality has a 4.85% improvement over the single-modality, and the Error rate reduction has an improvement of 11.8%. The latest code will be updated to this URL later: https://github.com/DingNing123/m_fusion
相似文献Question answer selection in the Chinese medical field is very challenging since it requires effective text representations to capture the complex semantic relationships between Chinese questions and answers. Recent approaches on deep learning, e.g., CNN and RNN, have shown their potential in improving the selection quality. However, these existing methods can only capture a part or one-side of semantic relationships while ignoring the other rich and sophisticated ones, leading to limited performance improvement. In this paper, a series of neural network models are proposed to address Chinese medical question answer selection issue. In order to model the complex relationships between questions and answers, we develop both single and hybrid models with CNN and GRU to combine the merits of different neural network architectures. This is different from existing works that can onpy capture partial relationships by utilizing a single network structure. Extensive experimental results on cMedQA dataset demonstrate that the proposed hybrid models, especially BiGRU-CNN, significantly outperform the state-of-the-art methods. The source codes of our models are available in the GitHub (https://github.com/zhangyuteng/MedicalQA-CNN-BiGRU).
相似文献Deep learning has been extensively researched in the field of document analysis and has shown excellent performance across a wide range of document-related tasks. As a result, a great deal of emphasis is now being placed on its practical deployment and integration into modern industrial document processing pipelines. It is well known, however, that deep learning models are data-hungry and often require huge volumes of annotated data in order to achieve competitive performances. And since data annotation is a costly and labor-intensive process, it remains one of the major hurdles to their practical deployment. This study investigates the possibility of using active learning to reduce the costs of data annotation in the context of document image classification, which is one of the core components of modern document processing pipelines. The results of this study demonstrate that by utilizing active learning (AL), deep document classification models can achieve competitive performances to the models trained on fully annotated datasets and, in some cases, even surpass them by annotating only 15–40% of the total training dataset. Furthermore, this study demonstrates that modern AL strategies significantly outperform random querying, and in many cases achieve comparable performance to the models trained on fully annotated datasets even in the presence of practical deployment issues such as data imbalance, and annotation noise, and thus, offer tremendous benefits in real-world deployment of deep document classification models. The code to reproduce our experiments is publicly available at https://github.com/saifullah3396/doc_al.
相似文献Forecasting of multivariate time series data, for instance the prediction of electricity consumption, solar power production, and polyphonic piano pieces, has numerous valuable applications. However, complex and non-linear interdependencies between time steps and series complicate this task. To obtain accurate prediction, it is crucial to model long-term dependency in time series data, which can be achieved by recurrent neural networks (RNNs) with an attention mechanism. The typical attention mechanism reviews the information at each previous time step and selects relevant information to help generate the outputs; however, it fails to capture temporal patterns across multiple time steps. In this paper, we propose using a set of filters to extract time-invariant temporal patterns, similar to transforming time series data into its “frequency domain”. Then we propose a novel attention mechanism to select relevant time series, and use its frequency domain information for multivariate forecasting. We apply the proposed model on several real-world tasks and achieve state-of-the-art performance in almost all of cases. Our source code is available at https://github.com/gantheory/TPA-LSTM.
相似文献Person re-identification (Re-ID) in real-world scenarios suffers from various degradations, e.g., low resolution, weak lighting, and bad weather. These degradations hinders identity feature learning and significantly degrades Re-ID performance. To address these issues, in this paper, we propose a degradation invariance learning framework for robust person Re-ID. Concretely, we first design a content-degradation feature disentanglement strategy to capture and isolate task-irrelevant features contained in the degraded image. Then, to avoid the catastrophic forgetting problem, we introduce a memory replay algorithm to further consolidate invariance knowledge learned from the previous pre-training to improve subsequent identity feature learning. In this way, our framework is able to continuously maintain degradation-invariant priors from one or more datasets to improve the robustness of identity features, achieving state-of-the-art Re-ID performance on several challenging real-world benchmarks with a unified model. Furthermore, the proposed framework can be extended to low-level image processing, e.g., low-light image enhancement, demonstrating the potential of our method as a general framework for the various vision tasks. Code and trained models will be available at: https://github.com/hyk1996/Degradation-Invariant-Re-ID-pytorch.
相似文献Music genre classification based on visual representation has been successfully explored over the last years. Recently, there has been increasing interest in attempting convolutional neural networks (CNNs) to achieve the task. However, most of the existing methods employ the mature CNN structures proposed in image recognition without any modification, which results in the learning features that are not adequate for music genre classification. Faced with the challenge of this issue, we fully exploit the low-level information from spectrograms of audio and develop a novel CNN architecture in this paper. The proposed CNN architecture takes the multi-scale time-frequency information into considerations, which transfers more suitable semantic features for the decision-making layer to discriminate the genre of the unknown music clip. The experiments are evaluated on the benchmark datasets including GTZAN, Ballroom, and Extended Ballroom. The experimental results show that the proposed method can achieve 93.9%, 96.7%, 97.2% classification accuracies respectively, which to the best of our knowledge, are the best results on these public datasets so far. It is notable that the trained model by our proposed network possesses tiny size, only 0.18M, which can be applied in mobile phones or other devices with limited computational resources. Codes and model will be available at https://github.com/CaifengLiu/music-genre-classification.
相似文献