Text-to-Chinese-painting Method Based on Multi-domain VQGAN
Cite this article: SUN Ze-Long, YANG Guo-Xing, WEN Jing-Yuan, FEI Nan-Yi, LU Zhi-Wu, WEN Ji-Rong. Text-to-Chinese-painting Method Based on Multi-domain VQGAN[J]. Journal of Software, 2023, 34(5): 2116-2133
Authors: SUN Ze-Long  YANG Guo-Xing  WEN Jing-Yuan  FEI Nan-Yi  LU Zhi-Wu  WEN Ji-Rong
Affiliation: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China; School of Information, Renmin University of China, Beijing 100872, China
Funding: National Natural Science Foundation of China (61976220, 61832017); Beijing Universities Outstanding Young Scientist Program (BJJWZYJH012019100020098)
Abstract: With the emergence of generative adversarial networks (GANs), synthesizing images from textual descriptions has recently become an active research area. However, existing textual descriptions are mostly in English, and the generated objects are mostly faces, flowers, birds, etc.; few studies target Chinese text and Chinese paintings. Moreover, text-to-image generation usually requires a large number of annotated image-text pairs, making dataset construction expensive. With the emergence and advance of multimodal pre-training, the generation process of a GAN can be guided in an optimization-based way, which greatly reduces the demand for datasets and computational resources. This study proposes a multi-domain VQGAN model to simultaneously generate Chinese paintings in multiple domains, and uses the multimodal pre-trained model WenLan to compute the distance loss between the generated image and the textual description; semantic consistency between image and text is achieved by optimizing the latent-space variables input to the multi-domain VQGAN. Ablation experiments compare the FID and R-precision metrics of multi-domain VQGAN variants with different structures, and a user study is carried out. The results show that the complete multi-domain VQGAN model outperforms the original VQGAN model in both image quality and text-image semantic consistency.

Keywords: text-to-image generation  multi-domain generation  Chinese painting generation
Received: 2022-04-16
Revised: 2022-05-29

Text-to-Chinese-painting Method Based on Multi-domain VQGAN
SUN Ze-Long,YANG Guo-Xing,WEN Jing-Yuan,FEI Nan-Yi,LU Zhi-Wu,WEN Ji-Rong. Text-to-Chinese-painting Method Based on Multi-domain VQGAN[J]. Journal of Software, 2023, 34(5): 2116-2133
Authors:SUN Ze-Long  YANG Guo-Xing  WEN Jing-Yuan  FEI Nan-Yi  LU Zhi-Wu  WEN Ji-Rong
Affiliation:Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China;School of Information, Renmin University of China, Beijing 100872, China
Abstract:With the development of generative adversarial networks (GANs), synthesizing images from textual descriptions has become an active research area. However, textual descriptions used for image generation are often in English, and the generated objects are mostly faces, flowers, birds, etc. Few studies have been conducted on the generation of Chinese paintings with Chinese descriptions. Text-to-image generation often requires an enormous number of labeled image-text pairs, and the cost of dataset production is high. With the advance in multimodal pre-training, the GAN generation process can be guided in an optimized way, which significantly reduces the demand for datasets and computational resources. In this study, a multi-domain vector quantization generative adversarial network (VQGAN) model is proposed to simultaneously generate Chinese paintings in multiple domains. Furthermore, a multimodal pre-trained model WenLan is used to calculate the distance loss between generated images and textual descriptions. The semantic consistency between images and texts is achieved by optimization of the hidden space variables input into multi-domain VQGAN. Finally, an ablation experiment is conducted to compare different variants of multi-domain VQGAN in terms of the FID and R-precision metrics, and a user study is carried out. The results demonstrate that the complete multi-domain VQGAN model outperforms the original VQGAN model in terms of image quality and text-image semantic consistency.
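The guidance scheme the abstract describes (keeping the generator frozen and running gradient descent only on its latent input to minimize an image-text distance) can be sketched in miniature. The code below is illustrative only: `decode` and the squared-L2 `distance` are hypothetical stand-ins for the frozen multi-domain VQGAN decoder and the WenLan image-text distance, replaced by simple differentiable linear-algebra surrogates so the loop is self-contained and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins (assumptions, not the paper's models): a linear map plays the
# role of the frozen VQGAN decoder, and a fixed vector plays the role of
# the WenLan text embedding of the prompt.
A = rng.standard_normal((8, 4))   # "decoder": latent -> image features
t = rng.standard_normal(8)        # "text embedding" of the description

def decode(z):
    # Frozen generator: its parameters (A) are never updated.
    return A @ z

def distance(z):
    # Surrogate image-text distance (squared L2 between embeddings).
    d = decode(z) - t
    return float(d @ d)

def grad(z):
    # Analytic gradient of the distance with respect to the latent z only.
    return 2.0 * A.T @ (decode(z) - t)

# Optimize the latent variables alone, as in latent-space guidance.
z = rng.standard_normal(4)
losses = [distance(z)]
for _ in range(200):
    z -= 0.01 * grad(z)           # gradient step on z, not on the decoder
    losses.append(distance(z))

print(f"image-text distance: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because only `z` is updated, the loop needs no labeled image-text pairs and no generator training, which is the source of the resource savings the abstract attributes to pre-training-guided generation.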
Keywords:text-to-image generation  multi-domain generation  Chinese painting generation
