
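Analysis method for delivery-related tobacco crime based on spatio-temporal data features
Cite this article: 乔浪超, 王进录, 高宝红, 杨新刚, 冯文涛, 许荣垚, 卫毅然, 刘伟. Analysis method for delivery-related tobacco crime based on spatio-temporal data features [J]. 中国烟草学报, 2023, 29(1): 116-126.
Authors: 乔浪超, 王进录, 高宝红, 杨新刚, 冯文涛, 许荣垚, 卫毅然, 刘伟
Affiliation: 1. Monopoly Supervision and Administration Division, Shaanxi Provincial Company, China National Tobacco Corporation, 19 Yannan 4th Road, Xi'an, Shaanxi 710061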
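Abstract: [Objective] To use big data and artificial intelligence techniques to study a method, based on delivery big data, for analyzing the new type of "Internet + delivery" tobacco-related crime. [Methods] Chinese word segmentation was used to preprocess the delivery big data. A new concept, the "delivery spatio-temporal pattern", was proposed, and its time-domain and frequency-domain statistics were computed as spatio-temporal features. Feature selection and dimensionality reduction methods were used to derive an optimal feature subset from the spatio-temporal feature set, and the performance of tobacco-related crime analysis models built by combining the optimal features with different classifier algorithms was compared. [Results] (1) The proposed spatio-temporal features can distinguish tobacco-related from non-tobacco-related delivery data. The random forest and GBDT classifiers performed best overall, exceeding 0.94 on accuracy, positive predictive value, negative predictive value, and other metrics. (2) Models built on the optimal features achieved prediction results close to those of the initial-feature models, while the optimal features required only 40% of the storage of the original feature data. (3) The optimal features selected by the CFS feature selection method provide a basis for interpreting the results of the tobacco-related prediction model. (4) Preliminary experiments show that the method can meet the real-time requirements of delivery-related tobacco crime analysis. [Conclusion] Spatio-temporal features computed from the "delivery spatio-temporal pattern", combined with classifiers, can distinguish tobacco-related from non-tobacco-related delivery data.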

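Keywords: delivery-related tobacco crime; delivery spatio-temporal pattern; time series analysis; feature selection and dimensionality reduction; machine learning
Received: 2021-08-26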
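
The abstract describes computing time-domain and frequency-domain statistics from a delivery time series. A minimal sketch of that idea follows, assuming the series is a daily parcel count for one sender/receiver address pair. The function names, the particular statistics chosen, and the sample data are illustrative assumptions and are not taken from the paper's feature set.

```python
import cmath
import math
from statistics import mean, pstdev


def time_domain_features(counts):
    """Basic time-domain statistics of a daily parcel-count series."""
    return {
        "mean": mean(counts),
        "std": pstdev(counts),
        "max": max(counts),
        "active_days": sum(1 for c in counts if c > 0),
    }


def frequency_domain_features(counts):
    """Spectral statistics from a naive discrete Fourier transform (DFT)."""
    n = len(counts)
    avg = mean(counts)
    centered = [c - avg for c in counts]  # remove the constant (DC) component
    # Power at frequency bins 1 .. n//2; bin k corresponds to a period of n/k days.
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:  # a constant series has no spectral content
        return {"dominant_period": None, "spectral_entropy": 0.0}
    dominant_bin = power.index(max(power)) + 1
    probs = [p / total for p in power if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return {"dominant_period": n / dominant_bin, "spectral_entropy": entropy}


# Hypothetical data: four weeks of daily counts with a repeating weekly shape.
series = [4, 3, 1, 0, 0, 1, 3] * 4
features = {**time_domain_features(series), **frequency_domain_features(series)}
print(features)
```

For this sample series the dominant period comes out as 7 days, reflecting the weekly shipping rhythm. A real pipeline would more likely use an FFT library than the naive DFT shown here, which is kept only so the sketch has no dependencies.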

Express-related counterfeit cigarette criminality analysis based on spatio-temporal data features
Affiliation: 1. Division of Monopoly Administration, Shaanxi Branch of China National Tobacco Corporation, Xi'an 710061, China; 2. School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
Abstract:  Background  This study investigates express-related counterfeit cigarette crime using big data and artificial intelligence technology.  Methods  In the pre-processing stage, a Chinese word segmentation method was applied to the original data. A novel concept, the "spatio-temporal pattern of delivery and receiving address", was then introduced: a time series built from express package delivery and receiving frequency data within a time span. Spatio-temporal data features were computed from this pattern using time-domain and frequency-domain statistics. Next, a CFS (correlation-based feature selection) or PCA (principal component analysis) algorithm was applied to the initial spatio-temporal feature pool to determine an optimal feature cluster. The express-related counterfeit cigarette criminality analysis model was then trained and optimized, and the performance of models using different classifiers was compared.  Results  (1) All four classifier models used in the experiments (random forest, logistic regression, gradient boosting decision tree, and a long short-term memory deep neural network) achieved satisfactory accuracy, positive predictive value (PPV), and negative predictive value (NPV), which implies that the proposed spatio-temporal data features can discriminate cigarette-related from normal express data. The decision tree-based classifiers, random forest and gradient boosting decision tree, yielded the highest accuracy, PPV, and NPV, all greater than 0.94. (2) Prediction models using the optimal feature cluster determined by the CFS or PCA algorithm all performed slightly worse than the model using the initial feature pool, while the optimal feature cluster required only 40 percent of the storage space of the initial feature pool. (3) The CFS method can pick out an optimal feature cluster from the initial feature pool, which supports the interpretability of the model's prediction results. (4) Preliminary experimental results showed that the proposed prediction model can meet the real-time requirements of express-related counterfeit cigarette criminality analysis.  Conclusion  Classifiers combined with the spatio-temporal data features computed from the "spatio-temporal pattern of delivery and receiving address" can discriminate counterfeit cigarette-related express packages from normal express packages.
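
The results above are reported as accuracy, PPV, and NPV. As a reminder of how these three metrics relate to a binary confusion matrix, here is a small sketch. The labels are made-up illustrative data (1 = cigarette-related, 0 = normal), and the helper names are assumptions, not code from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Return accuracy, positive predictive value, and negative predictive value."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(y_true),
        # PPV: of the parcels flagged as cigarette-related, the share that truly are.
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        # NPV: of the parcels predicted normal, the share that truly are.
        "npv": tn / (tn + fn) if (tn + fn) else nan,
    }


# Hypothetical labels for eight parcels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

PPV and NPV depend on the share of cigarette-related parcels in the evaluation set, so values such as 0.94 are only comparable across datasets with a similar class balance.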