首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 78 毫秒
1.
基于DOM修剪的藏文Web信息提取   总被引:1,自引:0,他引:1       下载免费PDF全文
随着互联网的普及和藏文信息技术的不断发展,出现了大量的藏文网站。该文根据藏文“音节点”的特征识别藏文网页并进行抓取。在建立DOM树的基础上,分析网页的链接、非链接文本与主题信息块之间的相关度。通过语义修剪算法提取藏文主题信息。经测试证实,该算法在藏文网页识别和藏文主题信息提取中具有较好的适应性。  相似文献   

2.
面对当前大量的文本数据信息,如何帮助人们准确定位所需信息,成为文本挖掘领域的一个研究趋势。通过将文本分类和聚类方法应用于信息检索-—对网页文本进行聚类,提出了基于超链接信息的Web文本自动聚类模型。利用结构挖掘技术获得主题领域的多个权威网页作为初始聚类中心,通过去除超链接信息中的噪声和多余链接得到网站的简明拓扑结构,并结合内容挖掘,动态调整聚类中心,最终将网页聚成各主题下的不同子类别。  相似文献   

3.
一种基于特征符号的网页主题信息抽取方法   总被引:1,自引:0,他引:1       下载免费PDF全文
王舒  朱敏  张明  牛颢  赵瑜 《计算机应用研究》2009,26(12):4539-4541
随着Internet网络的日益普及,Web上的海量数据给文本挖掘尤其是网页主题提取带来了更多的挑战,现有的文本提取方法在保证高准确率的同时无法满足Web挖掘方法的通用性。通过对Web网页结构进行研究,对网页生成树模型进行了改进,找到网页结构的通用规则,提出一种基于特征符号的提取方法CECS(content extraction characteristic symbols),结合相关度对网页主题内容进行提取。实验证明,所提算法具有很高的准确性和通用性。  相似文献   

4.
针对传统主题爬虫的不足, 提出一种基于主题相关概念和网页分块的主题爬虫。先通过主题分类树获取主题相关概念集合, 然后结合主题描述文档构建主题向量来描述主题; 下载网页后引入网页分块来穿越“灰色隧道”; 采用文本内容和链接结构相结合的策略计算候选链接优先级, 并在HITS算法的基础上提出了R-HITS算法计算链接结构对候选链接优先级的贡献。实验结果表明, 利用该方法实现的主题爬虫查准率达66%、信息量总和达53%, 在垂直搜索引擎和舆情分析应用方面有更好的搜索效果。  相似文献   

5.
在基于链接的概率隐含语义分析的基础上提出一种融合文本链接的增量方法进行主题建模。首先在原有网页集上进行主题建模;然后随着网页的结构和内容动态变化,利用一种合理的更新机制更新模型参数,从而高效快速地处理在线网页流的动态变化。此外,提出一个自适应非对称学习方法融合文本与链接模态的隐含主题。对于每个网页,它在两种模态上的主题分布通过加权进行融合,而权值由该网页的特征词分布的熵值确定。由于融合之后的概率结构合理地关联了链接模态和文本模态的信息,故能得到很好的建模效果。两种类型的数据集上的实验结果显示该算法可以有效地节省时间,并对网页分类有较大性能的提高,此外还提供了由本文模型生成的主题显示结果。  相似文献   

6.
网页主题挖掘对自然语言处理如网页文本分类、文摘自动生成、信息融合等具有重要意义.挖掘网页主题可以帮助用户更好地理解网页内容.尽管已有一些从普通文本中挖掘概念的工作,但其很少考虑单词所属标签和位置对单词权重的影响,且没有工作给出上述两种影响因子的计算方法.借助WordNet,将网页主题从词语扩展到概念层次,提出了使用词性标注和词义消歧确定网页中单词词义并充分利用标签影响因子和位置影响因子对网页正文文本特征进行权重修正的主题概念挖掘方法,给出了两种影响因子的计算公式.在DMOZ数据集上的实验结果表明,修正权重可以明显提高主题挖掘精度,最高可达到0.95.  相似文献   

7.
为了获得大量用于机器翻译研究的汉维(维吾尔)文语料,提出一种从网页中自动获取主题信息的方法。考虑到有主题网页中主题信息分布相对集中、文本密度较高,并且这类网页中大量的噪音信息是由链接引入的,提出的算法首先将链接分为噪音链接和非噪音链接,并在源码中删除噪音链接的锚文本和非噪音链接的HTML标签,然后利用容器标签将源码划分为若干部分并删除文本长度和文本密度均小于各自阈值的源码块。针对汉维网页做了实验,实验结果表明,算法在设置合适的阈值的情况下良好率达到90%以上。  相似文献   

8.
由于Web上网页的急剧增加,信息搜索与挖掘越来越引起人们的重视。然而网页中除了主题信息外,还有噪声信息,从而网页净化技术受到越来越多的研究人员的关注,并提出了各种算法。借鉴人工免疫系统在计算机网络入侵检测中的应用,提出了一种基于AIS的网页去噪算法。同时对网页中的数据进行了异常识别的非线性研究。  相似文献   

9.
聚类技术能将大规模数据按照数据的相似性划分成用户可迅速理解的簇.从而使用户更快地了解大量文档中所包含的内容。因此.聚类技术成为搜索引擎中不可或缺的部分和研究热点。Web上的AJAX应用和PowerPoint文件等弱链接文档由于缺乏足够的超链接信息,导致搜索该类文档时.排序结果不佳。针对该问题.给出一个弱链接文档的搜索引擎框架,并重点描述一个基于网页搜索结果的弱链接文档排序算法.基于聚类的弱链接文档排序算法利用聚类算法从高质量的网页搜索结果中提取与查询相关的主题.并根据主题的相关网页的排名确定该主题的重要性.根据识别的带权重的主题计算弱链接文档的排序值。实验结果表明该算法能够为弱链接文档产生较好的排序结果.  相似文献   

10.
韦莎  朱焱 《计算机应用》2016,36(3):735-739
针对因Web中存在由正常网页指向垃圾网页的链接,导致排序算法(Anti-TrustRank等)检测性能降低的问题,提出了一种主题相似度和链接权重相结合,共同调节网页非信任值传播的排序算法,即主题链接非信任排序(TLDR)。首先,运用隐含狄利克雷分配(LDA)模型得到所有网页的主题分布,并计算相互链接网页间的主题相似度;其次,根据Web图计算链接权重,并与主题相似度结合,得到主题链接权重矩阵;然后,利用主题链接权重调节非信任值传播,改进Anti-TrustRank和加权非信任值排序(WATR)算法,使网页得到更合理的非信任值;最后,将所有网页的非信任值进行排序,通过划分阈值检测出垃圾网页。在数据集WEBSPAM-UK2007上进行的实验结果表明,与Anti-TrustRank和WATR相比,TLDR的SpamFactor分别提高了45%和23.7%,F1-measure(阈值取600)分别提高了3.4个百分点和0.5个百分点, spam比例(前三个桶)分别提高了15个百分点和10个百分点。因此,主题与链接权重相结合的TLDR算法能有效提高垃圾网页检测性能。  相似文献   

11.
Abstract This paper describes an approach to the design of interactive multimedia materials being developed in a European Community project. The developmental process is seen as a dialogue between technologists and teachers. This dialogue is often problematic because of the differences in training, experience and culture between them. Conditions needed for fruitful dialogue are described and the generic model for learning design used in the project is explained.  相似文献   

12.
European Community policy and the market   总被引:1,自引:0,他引:1  
Abstract This paper starts with some reflections on the policy considerations and priorities which are shaping European Commission (EC) research programmes. Then it attempts to position the current projects which seek to capitalise on information and communications technologies for learning in relation to these priorities and the apparent realities of the marketplace. It concludes that while there are grounds to be optimistic about the contribution EC programmes can make to the efficiency and standard of education and training, they are still too technology driven.  相似文献   

13.
融合集成方法已经广泛应用在模式识别领域,然而一些基分类器实时性能稳定性较差,导致多分类器融合性能差,针对上述问题本文提出了一种新的基于多分类器的子融合集成分类器系统。该方法考虑在度量层融合层次之上通过对各类基多分类器进行动态选择,票数最多的类别作为融合系统中对特征向量识别的类别,构成一种新的自适应子融合集成分类器方法。实验表明,该方法比传统的分类器以及分类融合方法识别准确率明显更高,具有更好的鲁棒性。  相似文献   

14.
Development of software intensive systems (systems) in practice involves a series of self-contained phases for the lifecycle of a system. Semantic and temporal gaps, which occur among phases and among developer disciplines within and across phases, hinder the ongoing development of a system because of the interdependencies among phases and among disciplines. Such gaps are magnified among systems that are developed at different times by different development teams, which may limit reuse of artifacts of systems development and interoperability among the systems. This article discusses such gaps and a systems development process for avoiding them.  相似文献   

15.
This paper presents control charts models and the necessary simulation software for the location of economic values of the control parameters. The simulation program is written in FORTRAN, requires only 10K of main storage, and can run on most mini and micro computers. Two models are presented - one describes the process when it is operating at full capacity and the other when the process is operating under capacity. The models allow the product quality to deteriorate to a further level before an existing out-of-control state is detected, and they can also be used in situations where no prior knowledge exists of the out-of-control causes and the resulting proportion defectives.  相似文献   

16.
Going through a few examples of robot artists who are recognized worldwide, we try to analyze the deepest meaning of what is called “robot art” and the related art field definition. We also try to highlight its well-marked borders, such as kinetic sculptures, kinetic art, cyber art, and cyberpunk. A brief excursion into the importance of the context, the message, and its semiotics is also provided, case by case, together with a few hints on the history of this discipline in the light of an artistic perspective. Therefore, the aim of this article is to try to summarize the main characteristics that might classify robot art as a unique and innovative discipline, and to track down some of the principles by which a robotic artifact can or cannot be considered an art piece in terms of social, cultural, and strictly artistic interest. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008  相似文献   

17.
《计算机科学》2007,34(4):148-148
Recent years have seen rapid advances in various grid-related technologies, middleware, and applications. The GCC conference has become one of the largest scientific events worldwide in grid and cooperative computing. The 6th international conference on grid and cooperative computing (GCC2007) Sponsored by China Computer Federation (CCF),Institute of Computing Technology, Chinese Academy of Sciences (ICT) and Xinjiang University ,and in Cooperation with IEEE Computer Soceity ,is to be held from August 16 to 18, 2007 in Urumchi, Xinjiang, China.  相似文献   

18.
Although there are many arguments that logic is an appropriate tool for artificial intelligence, there has been a perceived problem with the monotonicity of classical logic. This paper elaborates on the idea that reasoning should be viewed as theory formation where logic tells us the consequences of our assumptions. The two activities of predicting what is expected to be true and explaining observations are considered in a simple theory formation framework. Properties of each activity are discussed, along with a number of proposals as to what should be predicted or accepted as reasonable explanations. An architecture is proposed to combine explanation and prediction into one coherent framework. Algorithms used to implement the system as well as examples from a running implementation are given.  相似文献   

19.
为了设计一种具有低成本、低功耗、易操作、功能强且可靠性高的煤矿井下安全分站,针对煤矿安全生产实际,文章提出了采用MCS-51系列单片机为核心、具有CAN总线通信接口的煤矿井下安全监控分站的设计方案;首先给出煤矿井下安全监控分站的整体构架设计,然后着重阐述模拟量输入信号处理系统的设计过程,最后说明单片机最小系统及其键盘、显示、报警、通信等各个组成部分的设计;为验证设计方案的可行性与有效性,使用Proteus软件对设计内容进行仿真验证,设计的煤矿井下安全监控分站具有瓦斯、温度等模拟量参数超标报警功能和电机开停、风门开闭等开关量指示功能;仿真结果表明:设计的煤矿井下安全监控分站具有一定的实际应用价值.  相似文献   

20.
This paper provides the author's personal views and perspectives on software process improvement. Starting with his first work on technology assessment in IBM over 20 years ago, Watts Humphrey describes the process improvement work he has been directly involved in. This includes the development of the early process assessment methods, the original design of the CMM, and the introduction of the Personal Software Process (PSP)SM and Team Software Process (TSP){SM}. In addition to describing the original motivation for this work, the author also reviews many of the problems he and his associates encountered and why they solved them the way they did. He also comments on the outstanding issues and likely directions for future work. Finally, this work has built on the experiences and contributions of many people. Mr. Humphrey only describes work that he was personally involved in and he names many of the key contributors. However, so many people have been involved in this work that a full list of the important participants would be impractical.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号