Similar Literature
20 similar documents retrieved (search time: 156 ms)
1.
郑影  李大辉 《计算机科学》2014,41(2):270-275
Social media are platforms and tools that people use to share opinions, insights, ideas, and experiences, and they have grown into a new medium with significant influence. Microblogs, as an important part of social media, play a major role in spreading information. Information extraction over microblog content aims to pull valuable structured information out of the noisy, fragmented, unstructured free text of microblog posts, so that information can be obtained from them effectively. This paper proposes a factor-graph-based event extraction method for microblogs that accurately extracts the events reflected in posts. Experiments verify that the method outperforms other approaches in both performance and accuracy.

2.
A Fully Automatic Method for Generating Wrappers for Web Page Information Extraction   (cited: 4; self-citations: 2; citations by others: 4)
Web page information extraction has attracted wide attention in recent years, and how to obtain the main data from large numbers of Web pages as quickly and accurately as possible has become a key research focus in this field. This paper proposes a fully automatic method for generating Web information extraction wrappers. Exploiting the structured, hierarchical nature of page design templates, the method applies a Web link classification algorithm and a page structure separation algorithm to extract the individual information units of a page and output the corresponding wrapper, which can then automatically extract information from pages of the same type. Experimental results show that the method automatically extracts both strictly structured and loosely structured information from Web pages, with very high extraction accuracy.
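As a rough illustration of what an extraction wrapper amounts to (this is not the paper's generation algorithm), the Python sketch below applies a hand-written wrapper, reduced to a record XPath plus field XPaths over a hypothetical result-list template, to pages that share that template:

```python
# Minimal sketch, not the paper's wrapper-generation algorithm: a "wrapper"
# reduced to a record XPath plus field XPaths, applied to a template-based page.
from lxml import html

# Hypothetical wrapper for a result-list template.
WRAPPER = {
    "record": "//div[@class='item']",
    "fields": {"title": ".//h2/a/text()", "price": ".//span[@class='price']/text()"},
}

def apply_wrapper(page_source, wrapper):
    tree = html.fromstring(page_source)
    records = []
    for node in tree.xpath(wrapper["record"]):
        record = {name: (node.xpath(xp) or [""])[0].strip()
                  for name, xp in wrapper["fields"].items()}
        records.append(record)
    return records

sample = """<div id="results">
  <div class="item"><h2><a>Widget A</a></h2><span class="price">9.99</span></div>
  <div class="item"><h2><a>Widget B</a></h2><span class="price">19.99</span></div>
</div>"""
print(apply_wrapper(sample, WRAPPER))
```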

3.
Extracting structured information from vulnerability reports is of great significance for security research. Security researchers often need to filter large-scale CVE data according to specific criteria, or to run automated analysis and testing on vulnerabilities; however, existing CVE databases contain only unstructured textual descriptions and incomplete auxiliary information. Extracting structured information from the description text helps researchers organize and analyze CVEs more effectively. This work summarizes seven core elements contained in vulnerability descriptions, builds a model for structured extraction, casts the extraction as a sequence labeling task, and constructs a dataset to train the model. Experiments show that the model extracts the various kinds of key information from CVE text with high accuracy.
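To make the sequence-labeling framing concrete, here is a minimal Python sketch of how a CVE description could be annotated with BIO tags and folded back into structured fields; the element names and example sentence are illustrative placeholders, not the paper's seven elements or data:

```python
# Minimal sketch (not the paper's implementation): framing CVE description
# parsing as BIO sequence labeling. Element names are illustrative placeholders.
description = "Buffer overflow in Foo 1.2 allows remote attackers to execute arbitrary code"
tokens = description.split()

# One BIO tag per token; B-/I- prefixes mark the start/inside of an element span.
tags = [
    "B-VULN_TYPE", "I-VULN_TYPE",        # "Buffer overflow"
    "O",
    "B-PRODUCT", "B-VERSION",            # "Foo", "1.2"
    "O",
    "B-ATTACKER", "I-ATTACKER",          # "remote attackers"
    "O",
    "B-IMPACT", "I-IMPACT", "I-IMPACT",  # "execute arbitrary code"
]
assert len(tokens) == len(tags)

def spans(tokens, tags):
    """Group BIO tags back into (element_type, text) spans."""
    out, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_type:
                out.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_toks.append(tok)
        else:
            if cur_type:
                out.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_type:
        out.append((cur_type, " ".join(cur_toks)))
    return out

print(spans(tokens, tags))
```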

4.
To address the problem that information extraction from semi-structured text is too coarse-grained to support effective semantic analysis of the results, this paper proposes a domain-oriented method for secondary extraction of structured information based on pattern matching. Taking semi-structured text presented as Web documents as its input, the method performs domain recognition on the coarse-grained extraction results and loads the corresponding domain lexicon. Based on the part of speech of each role in a pattern, it maps the pattern roles to words in the segmented word sequence and extracts structured information from that sequence, providing support for accurate semantic analysis. Experiments show that the method yields more accurate extraction results.

5.
To address the difficulty of extracting structured data from financial announcements efficiently and quickly, this paper proposes an information extraction method based on document structure and a Bi-LSTM-CRF network model. A document structure tree generation algorithm is defined, and rules are used to extract the required node information from the tree; local sentence rules based on trigger words are built to extract the information sentences that contain structured field values; field extraction is then treated as a sequence labeling problem, a domain knowledge lexicon is added during word segmentation, and a Bi-LSTM-CRF neural network model is built to recognize field values. Experimental results show that the method supports structured information extraction for multiple types of announcements, with average F1 scores above 91% for both information sentence extraction and field extraction, verifying its feasibility and practicality in a production setting.
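As a generic illustration of the field-recognition step (not the paper's configuration), the sketch below defines a Bi-LSTM-CRF tagger in PyTorch using the third-party pytorch-crf package; the vocabulary size, tag count, and dimensions are placeholder values:

```python
# A minimal Bi-LSTM-CRF tagger sketch; dimensions and tag set are placeholders,
# and the third-party `pytorch-crf` package is assumed (pip install pytorch-crf).
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=5000, num_tags=9, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None, mask=None):
        emissions = self.fc(self.lstm(self.emb(tokens))[0])
        if tags is not None:                          # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequence
```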

6.
王锟 《福建电脑》2008,(3):133-133,144
Web information extraction structures the information contained in HTML text and extracts the useful parts. This paper proposes a Web information extraction method: semi-structured HTML pages are cleaned and converted into structured XHTML, and the SQL/XML language of DB2 9 is then used to extract the Web information. Experiments show that the method accurately identifies data blocks and correctly extracts the information within them.

7.
Information extraction is an important area of data mining. Text information extraction refers to extracting specified information from free text and storing it as structured data in a knowledge base for user queries or further processing. Person attribute extraction is an important foundation for building intelligent people-oriented search engines, and structured information is also a data format that computers can understand. The authors propose a method for automatically acquiring person attributes from encyclopedia articles: the part-of-speech information of attribute values is used to locate them in the encyclopedia free text, extraction rules are discovered statistically, and person attribute information is then obtained from the encyclopedia text by rule matching. Experiments show that the method effectively extracts person attribute information from encyclopedia text, and the extracted results can be used to build a knowledge base of person attributes.
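The rule-matching step can be pictured with a small sketch in which hand-written regular expressions stand in for the statistically discovered, POS-based rules; the patterns, attribute names, and biography text are invented for illustration:

```python
# Minimal sketch: regex patterns standing in for the paper's statistically
# discovered, POS-based rules; attribute names and text are illustrative only.
import re

PATTERNS = {
    "birth_date": r"born on ([A-Z][a-z]+ \d{1,2}, \d{4})",
    "birth_place": r"born .*?in ([A-Z][A-Za-z ,]+?)\)",
    "occupation": r"is an? ([a-z ]+?)(?: who|\.)",
}

def extract_attributes(text):
    attrs = {}
    for name, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            attrs[name] = m.group(1).strip()
    return attrs

bio = ("Jane Doe (born on March 3, 1970 in Springfield, Ohio) is a computer "
       "scientist who studies information extraction.")
print(extract_attributes(bio))
```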

8.
With the rapid development of Internet technology, online information is growing geometrically, and automatically extracting structured information from this massive volume of online unstructured text has become an important research topic. This paper studies a Web information extraction algorithm based on hidden Markov models (HMMs), focusing on how HMMs should be applied to text information extraction and how the data should be labeled. Several improvements to the application of HMMs in text information extraction are proposed, an HMM-based Web information extraction model is built, and the data obtained after extraction are analyzed and compared, verifying the effectiveness of the improved algorithm.
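For readers unfamiliar with the HMM framing, the sketch below shows the core of such an extractor: each hidden state corresponds to a field label, and Viterbi decoding assigns the most probable label sequence to a token sequence. The states, vocabulary, and probabilities are toy values, not parameters learned in the paper:

```python
# Minimal sketch: Viterbi decoding for an HMM-based extractor.
# States, vocabulary, and probabilities are toy values for illustration.
import math

states = ["TITLE", "PRICE", "OTHER"]
start_p = {"TITLE": 0.5, "PRICE": 0.2, "OTHER": 0.3}
trans_p = {
    "TITLE": {"TITLE": 0.5, "PRICE": 0.3, "OTHER": 0.2},
    "PRICE": {"TITLE": 0.1, "PRICE": 0.5, "OTHER": 0.4},
    "OTHER": {"TITLE": 0.3, "PRICE": 0.3, "OTHER": 0.4},
}
emit_p = {  # P(token | state); unseen tokens get a small smoothing value below
    "TITLE": {"camera": 0.6, "lens": 0.3},
    "PRICE": {"$199": 0.7, "$49": 0.2},
    "OTHER": {"buy": 0.4, "now": 0.4},
}

def viterbi(tokens):
    def logp(p):
        return math.log(p if p > 0 else 1e-6)
    # V[t][s]: best log-probability of any path ending in state s at position t
    V = [{s: logp(start_p[s]) + logp(emit_p[s].get(tokens[0], 1e-6)) for s in states}]
    back = []
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((ps, V[t - 1][ps] + logp(trans_p[ps][s])) for ps in states),
                key=lambda x: x[1])
            V[t][s] = score + logp(emit_p[s].get(tokens[t], 1e-6))
            back[t - 1][s] = prev
    # Backtrack from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["camera", "$199", "buy", "now"]))
```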

9.
Information extraction techniques are used to obtain information of high interest from unstructured text data, and event extraction is a challenging research direction within this field. The goal of event extraction is to extract the key elements that describe an event from unstructured text and present them in a structured form. Treating event extraction as a sequence labeling task, this work first uses the pretrained ALBERT model to learn features, then introduces a conditional random field (CRF) model to improve sequence labeling performance, and finally recognizes and classifies event types and event arguments. Experimental results on the ACE 2005 benchmark corpus show that, compared with existing models, the ALBERT-CRF model improves both recall and F1 on trigger identification and classification.
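As a minimal sketch of the feature-learning step (not the paper's training setup), the snippet below loads a pretrained ALBERT encoder for token classification with Hugging Face Transformers; the checkpoint name and label count are placeholders, and the CRF decoding layer described in the abstract is omitted:

```python
# Sketch only: ALBERT emissions for token classification; checkpoint and
# num_labels are placeholders, and the CRF layer from the abstract is omitted.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModelForTokenClassification.from_pretrained("albert-base-v2", num_labels=9)

sentence = "An explosion occurred near the station on Tuesday."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # (1, seq_len, num_labels) emission scores
pred = logits.argmax(dim=-1)          # per-token label ids (a CRF would decode jointly)
print(pred.shape)
```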

10.
Chinese Semantic Relation Extraction Based on a Unified Syntactic and Entity-Semantic Tree   (cited: 1; self-citations: 0; citations by others: 1)
This paper proposes a convolution tree kernel based method for Chinese entity semantic relation extraction. By adding entity semantic information such as entity type, mention type, and GPE role to the structured information of a relation instance, it constructs a unified syntactic and entity-semantic relation tree that effectively captures both structured information and entity semantic information, thereby improving the performance of Chinese semantic relation extraction. Experiments on relation detection and relation extraction over the ACE RDC 2005 Chinese benchmark corpus show that the method significantly improves Chinese semantic relation extraction, with a best F1 of 67.0 on the major relation types, indicating that structured syntactic information and entity semantic information are complementary for Chinese semantic relation extraction.

11.
Social media has become an important source of information and a medium for following and spreading trends, news, and ideas all over the world. Although determining the subjects of individual posts is important for extracting users' interests from social media, this task is nontrivial because posts are highly contextualized and informal and have limited length. To address this problem, we propose a user modeling framework that maps the content of texts in social media to relevant categories in news media. In our framework, the semantic gaps between social media and news media are reduced by using Wikipedia as an external knowledge base. We map term-based features from a short text and a news category into Wikipedia-based features such as Wikipedia categories and article entities. A user's microposts are thus represented in a rich feature space of words. Experimental results show that our proposed method using Wikipedia-based features outperforms other existing methods of identifying users' interests from social media.
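A rough sketch of the idea (not the authors' pipeline): map the terms of a short post and of a news category into a shared space of Wikipedia categories via a term-to-category lookup, then compare the two with cosine similarity. The lookup table below is a hypothetical stand-in for a real Wikipedia-derived knowledge base:

```python
# Minimal sketch: map micropost terms and news-category terms into
# Wikipedia-category features and compare them; the lookup is hypothetical.
import math
from collections import Counter

TERM_TO_WIKI_CATS = {
    "striker": ["Association football", "Sports"],
    "goal": ["Association football", "Sports"],
    "transfer": ["Association football", "Labour economics"],
    "football": ["Association football", "Sports"],
    "league": ["Sports leagues", "Sports"],
}

def wiki_features(terms):
    feats = Counter()
    for term in terms:
        feats.update(TERM_TO_WIKI_CATS.get(term.lower(), []))
    return feats

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

post = "What a goal by the new striker!".replace("!", "").split()
news_category = ["football", "league", "transfer"]
print(cosine(wiki_features(post), wiki_features(news_category)))
```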

12.
Digital inequality is one of the most critical issues in the "information age", yet few studies have examined social inequality in information resources and digital use patterns. In rural areas, information and communication technology (ICT) facilities alone cannot guarantee that users can easily access information technology and overcome the so-called "digital divide." This research aims to discover the psychological factors that influence ICT adoption behavior, and to confirm whether "information literacy" and "digital skills" have moderating effects in the research model. Using a survey of 875 participants and a structural equation modeling approach, we find that task characteristics and social interaction improve media richness, media experience, and media technostress, which in turn enhance ICT adoption behavior. The proposed theoretical model shows that the impact of ICT adoption behavior is moderated by information literacy and digital skills. The findings of this research can offer guidelines for policy makers and educators who evaluate a community's ICT adoption behavior so as to provide proper access to ICT and promote its visibility by incorporating ICT in educational activities.

13.
A large number of texts are rapidly generated as streaming data in social media. Since it is difficult to process such text streams with limited memory in real time, researchers are resorting to text stream compression and sampling to obtain a small portion of valuable information from the streams. In this study, we investigate the crucial question of how to use less memory space to store more valuable texts, so as to maintain the global information of the stream. First, we propose a text stream sampling framework based on compressed sensing theory, which can sample a text stream with a lightweight framework to reduce space consumption while still retaining the most valuable texts. We then develop a query word-based retrieval task as well as a topic detection and evolution analysis task on the sampled stream to evaluate the performance of the framework in retaining valuable information. The framework is evaluated from several aspects using two representative social media datasets, including compression ratio, runtime, information retention rate, and efficiency of the text analysis tasks. Experimental results demonstrate that the proposed framework outperforms baseline methods and is able to complete the text analysis tasks with promising results.
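The compressed-sensing intuition behind such a framework can be sketched as follows (this is not the authors' sampling algorithm): a short text is a sparse bag-of-words vector, so it can be stored as a small number of random measurements and recovered by sparse reconstruction, here with scikit-learn's orthogonal matching pursuit:

```python
# Minimal compressed-sensing sketch, not the authors' sampling algorithm:
# a sparse bag-of-words vector is compressed by a random measurement matrix
# and recovered with orthogonal matching pursuit. Sizes are illustrative.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
vocab_size, n_measurements, n_words_in_post = 1000, 100, 8

# A short post touches only a few vocabulary entries, so x is sparse.
x = np.zeros(vocab_size)
x[rng.choice(vocab_size, size=n_words_in_post, replace=False)] = rng.integers(1, 4, n_words_in_post)

phi = rng.normal(size=(n_measurements, vocab_size)) / np.sqrt(n_measurements)
y = phi @ x                       # compressed representation stored for the stream

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_words_in_post, fit_intercept=False)
omp.fit(phi, y)                   # sparse recovery from the measurements
x_hat = omp.coef_
print("recovery error:", np.linalg.norm(x - x_hat))
```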

14.
Examining the particular value of each platform for big data would be difficult because of the variety of social media forms and sizes. Because social media can be used to analyze large groups of individuals both objectively and subjectively, it is among the most effective tools for this task. There are numerous sources of big data within organizations, and social media can be identified by the interaction and communication it facilitates. Using social media has become a daily occurrence in modern society, and this frequent use generates data that demonstrates the importance of researching the relationship between big data and social media, not least because so many internet users are also active on social media. We conducted a systematic literature review (SLR) to identify 42 articles published between 2018 and 2022 that examined the significance of big data in social media and upcoming issues in this field. We also discuss the potential benefits of utilizing big data in social media. Our analysis discovered open problems and future challenges, such as high-quality data, information accessibility, speed, natural language processing (NLP), and enhancing prediction approaches. As shown by our investigation of evaluation metrics for big data in social media, the distribution reveals that 24% of studies relate to data-trace, 12% to execution time, 21% to accuracy, 6% to cost, 10% to recall, 11% to precision, 11% to F1-score, and 5% to run-time complexity.

15.

People increasingly use microblogging platforms such as Twitter during natural disasters and emergencies. Research studies have revealed the usefulness of the data available on Twitter for several disaster response tasks. However, making sense of social media data is a challenging task for several reasons, such as the limitations of available tools for analysing high-volume and high-velocity data streams and the need to deal with information overload. To address such limitations, in this work we first show that textual and imagery content on social media provide complementary information useful for improving situational awareness. We then explore ways in which various Artificial Intelligence techniques from the Natural Language Processing and Computer Vision fields can exploit such complementary information generated during disaster events. Finally, we propose a methodological approach that combines several computational techniques effectively in a unified framework to help humanitarian organisations in their relief efforts. We conduct extensive experiments using textual and imagery content from millions of tweets posted during the three major disaster events in the 2017 Atlantic Hurricane season. Our study reveals that the distributions of various types of useful information can inform crisis managers and responders and facilitate the development of future automated systems for disaster management.

16.
Nowadays, many e-commerce websites allow users to log in with their existing social networking accounts. When a new user comes to an e-commerce website, it is interesting to study whether information from external social media platforms can be utilized to alleviate the cold-start problem. In this paper, we focus on a specific task in cross-site information sharing, i.e., leveraging the text posted by a user on a social media platform (termed social text) to infer his/her purchase preference for product categories on an e-commerce platform. To solve this task, a key problem is how to effectively represent the social text in a way that its information can be utilized on the e-commerce platform. We study two major kinds of text representation methods for predicting cross-site purchase preference, including shallow textual features and deep textual features learned by deep neural network models. We conduct extensive experiments on a large linked dataset, and our experimental results indicate that it is promising to utilize the social text for predicting purchase preference. Specifically, the deep neural network approach shows a more powerful predictive ability when the number of categories becomes large.
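A minimal sketch of the shallow-feature baseline described here (not the paper's full setup): a user's social text is turned into TF-IDF features and a linear classifier predicts a preferred product category. The texts and category labels are invented for illustration:

```python
# Shallow-feature baseline sketch: TF-IDF over social text, linear classifier
# predicting a product category. Texts and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

social_texts = [
    "new trail run this morning, legs are sore",
    "finally finished that sci-fi trilogy, what a ride",
    "meal prepped five dinners for the week",
    "10k personal best today, shoes held up great",
]
preferred_category = ["sports", "books", "kitchen", "sports"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(social_texts, preferred_category)

print(model.predict(["long run planned for the weekend"]))  # e.g. ['sports']
```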

17.
18.
Rumor detection has become an emerging and active research field in recent years. At its core is modeling the rumor characteristics inherent in rich information, such as propagation patterns in the social network and semantic patterns in post content, and differentiating them from the truth. However, existing works on rumor detection fall short in modeling heterogeneous information, either using only a single information source (e.g., the social network or post content) or ignoring the relations among multiple sources (e.g., fusing social and content features via simple concatenation). Therefore, they may have drawbacks in comprehensively understanding rumors and detecting them accurately. In this work, we explore contrastive self-supervised learning on heterogeneous information sources, so as to reveal their relations and characterize rumors better. Technically, we supplement the main supervised detection task with an auxiliary self-supervised task, which enriches post representations via post self-discrimination. Specifically, given two heterogeneous views of a post (i.e., representations encoding social patterns and semantic patterns), the discrimination is done by maximizing the mutual information between different views of the same post compared to that of other posts. We devise cluster-wise and instance-wise approaches to generate the views and conduct the discrimination, considering different relations among information sources. We term this framework self-supervised rumor detection (SRD). Extensive experiments on three real-world datasets validate the effectiveness of SRD for automatic rumor detection on social media.
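The view-discrimination objective can be illustrated with a generic InfoNCE-style contrastive loss: for each post, its social-view and content-view embeddings form the positive pair, and the other posts in the batch act as negatives. This is a sketch of the stated objective, not the SRD implementation:

```python
# Generic InfoNCE-style contrastive loss between two views of each post
# (social-pattern embedding vs. semantic-pattern embedding); a sketch of the
# objective described in the abstract, not the SRD implementation.
import torch
import torch.nn.functional as F

def info_nce(view_a, view_b, temperature=0.1):
    """view_a, view_b: (batch, dim) embeddings; row i of each is the same post."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))         # positives lie on the diagonal
    # Cross-entropy pulls a post's two views together, away from other posts.
    return F.cross_entropy(logits, targets)

social_view = torch.randn(8, 64)    # e.g. encoded propagation-tree features
content_view = torch.randn(8, 64)   # e.g. encoded post text features
print(info_nce(social_view, content_view))
```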

19.
Research on recommendation systems has gained a considerable amount of attention over the past decade as the number of online users and the amount of online content continue to grow at an exponential rate. With the evolution of the social web, people generate and consume data in real time using online services such as Twitter, Facebook, and web news portals. With a rapidly growing online community, web-based retail systems and social media sites have to process several million user requests per day. Generating quality recommendations from this vast amount of data is itself a very challenging task. Nevertheless, as opposed to web-based retailers such as Amazon and Netflix, the above-mentioned social networking sites face an additional challenge when generating recommendations, as their contents change very rapidly. Therefore, providing fresh information in the least amount of time is a major objective of such recommender systems. Although collaborative filtering is a widely used technique in recommendation systems, generating the recommendation model with this approach is a costly task and is often done offline. Hence, it is difficult to use collaborative filtering in the presence of dynamically changing contents, as such systems require frequent updates to the recommendation model to maintain the accuracy and freshness of the recommendations. The parallel processing power of graphics processing units (GPUs) can be used to process large volumes of data with dynamically changing contents in real time and to accelerate the recommendation process for social media data streams. In this paper, we address the issue of rapidly changing contents and propose a parallel on-the-fly collaborative filtering algorithm using GPUs to facilitate frequent updates to the recommendation model. We use a hybrid similarity calculation method that combines item–item collaborative filtering with item category information and temporal information. The experimental results on real-world datasets show that the proposed algorithm outperformed several existing online CF algorithms in terms of accuracy, memory consumption, and runtime. It was also observed that the proposed algorithm scaled well with the data rate and the data volume, and generated recommendations in a timely manner.
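For reference, a plain NumPy sketch of the item–item similarity computation that underlies such collaborative filtering; the GPU parallelization and the category and temporal terms the paper adds to its hybrid similarity are omitted, and the rating matrix is invented:

```python
# Plain NumPy sketch of item-item collaborative filtering: cosine similarity
# between item columns plus a weighted-average score prediction. The GPU
# parallelization and the category/temporal terms from the paper are omitted.
import numpy as np

# rows = users, columns = items; 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)      # item-item cosine similarity
np.fill_diagonal(sim, 0.0)                    # ignore self-similarity

def predict(user, item):
    rated = R[user] > 0
    weights = sim[item, rated]
    return float(weights @ R[user, rated] / weights.sum())

print(predict(user=0, item=2))   # predicted rating of item 2 for user 0
```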

20.
During a crisis, citizens reach for their smartphones to report, comment on, and explore information surrounding the crisis. These actions often involve social media, and this data forms a large repository of real-time, crisis-related information. Law enforcement agencies and other first responders see this information as having untapped potential; that is, it has the capacity to extend their situational awareness beyond the scope of a usual command and control centre. Despite this potential, the sheer volume of social media data, the speed at which it arrives, and its unstructured nature mean that making sense of this data is not a trivial task, and one that is not yet satisfactorily solved, both in crisis management and beyond. Therefore we propose a multi-stage process to extract meaning from this data that will provide relevant and near real-time information to command and control to assist in decision support. This process begins with the capture of real-time social media data, the development of specific LEA- and crisis-focused taxonomies for categorisation and entity extraction, the application of formal concept analysis for aggregation and corroboration, and the presentation of this data via map-based and other visualisations. We demonstrate that this novel use of formal concept analysis in combination with context-based entity extraction has the potential to inform law enforcement and/or humanitarian responders about ongoing crisis events using social media data, in the context of the 2015 Nepal earthquake.
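To make the formal concept analysis step concrete, here is a small sketch that derives the formal concepts of a binary object–attribute context; the context below (tweets × extracted terms) is invented for illustration and is not the paper's LEA/crisis taxonomy:

```python
# Small sketch of deriving formal concepts from an invented object-attribute
# context (tweets x extracted terms); illustrative only, not the paper's taxonomy.
context = {
    "tweet1": {"earthquake", "Kathmandu", "casualties"},
    "tweet2": {"earthquake", "Kathmandu", "rescue"},
    "tweet3": {"earthquake", "aftershock"},
    "tweet4": {"rescue", "Kathmandu"},
}
ALL_ATTRS = frozenset(a for attrs in context.values() for a in attrs)

def extent(attributes):
    """Objects that carry every attribute in the given set."""
    return {o for o, attrs in context.items() if set(attributes) <= attrs}

def intent(objects):
    """Attributes shared by every object in the given set."""
    return set.intersection(*(context[o] for o in objects)) if objects else set(ALL_ATTRS)

# Every intersection of object attribute sets is a closed intent (a concept).
intents = {ALL_ATTRS}
for attrs in context.values():
    intents |= {frozenset(existing & attrs) for existing in intents}

for i in sorted(intents, key=len):
    ext = extent(i)
    assert intent(ext) == set(i)          # (extent, intent) is a formal concept
    print(sorted(ext), "->", sorted(i))
```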
