    A survey of model compression for deep neural networks

    LI Jiang-yun, ZHAO Yi-kai, XUE Zhuo-er, CAI Zheng, LI Qing

    Citation: LI Jiang-yun, ZHAO Yi-kai, XUE Zhuo-er, CAI Zheng, LI Qing. A survey of model compression for deep neural networks[J]. Chinese Journal of Engineering, 2019, 41(10): 1229-1239. doi: 10.13374/j.issn2095-9389.2019.03.27.002

    doi: 10.13374/j.issn2095-9389.2019.03.27.002
    Funding:

    National Natural Science Foundation of China (61671054)

    Beijing Natural Science Foundation (4182038)

    More information
      Corresponding author:

      LI Qing, E-mail: Liqing@ies.ustb.edu.cn

    • CLC number: TP183

    • Abstract: In recent years, deep neural networks have repeatedly advanced the state of the art on tasks such as computer vision and natural language processing, and they have become one of the most closely watched research directions. Despite this remarkable performance, deep network models remain difficult to deploy on hardware-constrained embedded or mobile devices because of their huge parameter counts and the resulting storage and computation costs. Studies have found that deep models based on convolutional neural networks are inherently over-parameterized and contain parameters that contribute nothing to the final result, which provides the theoretical basis for model compression. How to reduce model size while preserving accuracy has therefore become a topic of intense interest. This paper categorizes and summarizes the results and progress achieved by researchers at home and abroad in model compression in recent years, evaluates the advantages and disadvantages of each class of methods, and discusses the open problems and future directions of model compression.
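    To make the parameter-redundancy argument above concrete, the following is a minimal sketch, assuming PyTorch, of unstructured magnitude pruning in the spirit of Han et al. [13]; the function name, the toy network, and the 50% sparsity level are illustrative choices, not the paper's own method. Weights whose absolute value falls below a data-dependent threshold are zeroed, and the surviving weights would normally be fine-tuned afterwards.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of every Conv2d/Linear layer.

    A sketch of unstructured magnitude pruning (cf. [13]); a real pipeline
    would keep the binary masks and fine-tune the remaining weights.
    """
    for layer in model.modules():
        if isinstance(layer, (nn.Conv2d, nn.Linear)):
            w = layer.weight.data
            k = int(w.numel() * sparsity)          # number of weights to drop
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > threshold).float()   # 1 = keep, 0 = prune
            w.mul_(mask)                           # prune in place

# Illustrative usage: prune roughly half of the weights of a tiny model.
net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(),
                    nn.Linear(16 * 30 * 30, 10))
magnitude_prune(net, sparsity=0.5)
density = sum((m.weight != 0).float().mean().item()
              for m in net.modules() if isinstance(m, (nn.Conv2d, nn.Linear))) / 2
print(f"remaining non-zero fraction ~ {density:.2f}")
```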

       

    • Figure 1. Pruning a filter and its corresponding feature map[21]
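    The operation in Fig. 1 can be sketched as follows, assuming PyTorch; the helper name and the example layer sizes are illustrative. Filters are ranked by the L1 norm of their kernels (cf. [21]), the weakest ones are removed, and the matching input channels of the next convolution are dropped so the two layers remain compatible.

```python
import torch
import torch.nn as nn

def prune_filters_l1(conv: nn.Conv2d, next_conv: nn.Conv2d, num_prune: int):
    """Remove the `num_prune` filters of `conv` with the smallest L1 norm,
    together with the matching input channels of `next_conv` (cf. Fig. 1, [21])."""
    # L1 norm of each output filter: weight shape is (out_ch, in_ch, kH, kW).
    l1 = conv.weight.data.abs().sum(dim=(1, 2, 3))
    keep = torch.argsort(l1, descending=True)[: conv.out_channels - num_prune]
    keep, _ = torch.sort(keep)

    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()

    # The next layer loses the input channels fed by the removed filters.
    new_next = nn.Conv2d(len(keep), next_conv.out_channels, next_conv.kernel_size,
                         next_conv.stride, next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        new_next.bias.data = next_conv.bias.data.clone()
    return new_conv, new_next

# Illustrative usage: drop the 8 weakest filters of the first of two layers.
c1, c2 = nn.Conv2d(3, 32, 3, padding=1), nn.Conv2d(32, 64, 3, padding=1)
c1, c2 = prune_filters_l1(c1, c2, num_prune=8)
```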

    • Figure 2. Weight sharing by scalar quantization (top) and centroid fine-tuning (bottom)[14]
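    A minimal sketch of the weight-sharing step in Fig. 2, assuming PyTorch; the cluster count, iteration count, and linear centroid initialization are illustrative choices. Scalar k-means groups one layer's weights into a small codebook so that only the codebook and a low-bit index per weight need to be stored (cf. [14]); the centroid fine-tuning shown in the bottom half of the figure is omitted here.

```python
import torch

def kmeans_quantize(weight: torch.Tensor, n_clusters: int = 16, n_iter: int = 20):
    """Scalar k-means over one layer's weights (cf. Fig. 2, [14]).

    Returns (codebook, indices): the layer can then be stored as a small
    float codebook plus a low-bit index per weight instead of full floats.
    """
    flat = weight.flatten()
    # Linear initialization of the centroids between the extreme weights.
    centroids = torch.linspace(flat.min().item(), flat.max().item(), n_clusters)
    for _ in range(n_iter):
        # Assign every weight to its nearest centroid (scalar quantization).
        idx = torch.argmin((flat[:, None] - centroids[None, :]).abs(), dim=1)
        # Move each centroid to the mean of the weights assigned to it.
        for c in range(n_clusters):
            members = flat[idx == c]
            if members.numel() > 0:
                centroids[c] = members.mean()
    # Final assignment against the updated codebook.
    idx = torch.argmin((flat[:, None] - centroids[None, :]).abs(), dim=1)
    return centroids, idx

# Illustrative usage: 16 shared values -> 4-bit indices for a 256x256 matrix.
w = torch.randn(256, 256)
codebook, indices = kmeans_quantize(w)
w_shared = codebook[indices].reshape(w.shape)   # dequantized weights
```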

    • Figure 3. Training a student network using hints[59]: (a) teacher and student networks; (b) hints training; (c) knowledge distillation
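    The two training signals behind Fig. 3 can be sketched as follows, assuming PyTorch; the temperature, the loss weighting, and the 1x1-convolution regressor are illustrative. Stage (b) matches a student "guided" layer to a teacher "hint" layer through a small regressor [59], and stage (c) trains the student on the teacher's softened logits mixed with the ground-truth labels [58].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hint_loss(student_feat, teacher_feat, regressor):
    """Stage (b) of Fig. 3: L2 match between a teacher hint layer and a
    student guided layer, mapped through a learned regressor (cf. [59])."""
    return F.mse_loss(regressor(student_feat), teacher_feat)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Stage (c) of Fig. 3: KL divergence against the teacher's softened
    outputs at temperature T, mixed with cross-entropy on hard labels (cf. [58])."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Illustrative usage with random tensors standing in for real network outputs.
s_feat, t_feat = torch.randn(8, 64, 16, 16), torch.randn(8, 128, 16, 16)
regressor = nn.Conv2d(64, 128, kernel_size=1)   # adapts student width to teacher width
s_logits, t_logits = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = hint_loss(s_feat, t_feat, regressor) + distillation_loss(s_logits, t_logits, labels)
```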

    • [1] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436 doi: 10.1038/nature14539
      [2] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks//Advances in Neural Information Processing Systems. Lake Tahoe, 2012: 1097
      [3] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J/OL]. ArXiv Preprint (2015-04-10)[2019-03-22]. https://arxiv.org/abs/1409.1556
      [4] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, 2015: 1
      [5] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, 2016: 770
      [6] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, 2017: 4700
      [7] Le Q V, Ngiam J, Coates A, et al. On optimization methods for deep learning//Proceedings of the 28th International Conference on International Conference on Machine Learning. Omnipress, 2011: 265
      [8] Han Y F, Jiang T H, Ma Y P, et al. Compression of deep neural networks. Comput Appl Res, 2018, 35(10): 2894 doi: 10.3969/j.issn.1001-3695.2018.10.003
      [9] Setiono R, Liu H. Neural-network feature selector. IEEE Trans Neural Networks, 1997, 8(3): 654 doi: 10.1109/72.572104
      [10] LeCun Y, Denker J S, Solla S A, et al. Optimal brain damage//Advances in Neural Information Processing Systems. Denver, 1989: 598
      [11] Hassibi B, Stork D G, Wolff G J. Optimal brain surgeon and general network pruning//IEEE International Conference on Neural Networks. San Francisco, 1993: 293
      [12] Hassibi B, Stork D G. Second order derivatives for network pruning: optimal brain surgeon//Advances in Neural Information Processing Systems. Denver, 1993: 164
      [13] Han S, Pool J, Tran J, et al. Learning both weights and connections for efficient neural network// Advances in Neural Information Processing Systems. Montreal, 2015: 1135
      [14] Han S, Mao H, Dally W J. Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding[J/OL]. ArXiv Preprint (2016-02-15)[2019-03-22]. https://arxiv.org/abs/1510.00149
      [15] Srinivas S, Subramanya A, Venkatesh Babu R. Training sparse neural networks//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops. Hawaii, 2017: 138
      [16] Anwar S, Hwang K, Sung W. Structured pruning of deep convolutional neural networks. ACM J Emerg Technol Comput Syst, 2017, 13(3): 32 doi: 10.1145/3005348
      [17] Wen W, Wu C P, Wang Y D, et al. Learning structured sparsity in deep neural networks//Advances in Neural Information Processing Systems. Barcelona, 2016: 2074
      [18] Lin S H, Ji R R, Li Y C, et al. Toward compact ConvNets via structure-sparsity regularized filter pruning. IEEE Trans Neural Networks Learn Syst, 2019: 1. http://www.ncbi.nlm.nih.gov/pubmed/30990448
      [19] Guo Y W, Yao A B, Chen Y R. Dynamic network surgery for efficient DNNs//Advances in Neural Information Processing Systems. Barcelona, 2016: 1379
      [20] Jia H P, Xiang X S, Fan D, et al. DropPruning for model compression[J/OL]. ArXiv Preprint (2018-12-05)[2019-03-22]. https://arxiv.org/abs/1812.02035
      [21] Li H, Kadav A, Durdanovic I, et al. Pruning filters for efficient convnets[J/OL]. ArXiv Preprint (2017-03-10)[2019-03-22]. https://arxiv.org/abs/1608.08710
      [22] Hu H Y, Peng R, Tai Y W, et al. Network trimming: a data-driven neuron pruning approach towards efficient deep architectures[J/OL]. ArXiv Preprint (2016-07-12)[2019-03-22]. https://arxiv.org/abs/1607.03250
      [23] Tian Q, Arbel T, Clark J J. Deep LDA-pruned nets for efficient facial gender classification//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops. Hawaii, 2017: 10
      [24] Luo J H, Wu J X, Lin W Y. ThiNet: a filter level pruning method for deep neural network compression//Proceedings of the IEEE International Conference on Computer Vision. Venice, 2017: 5058
      [25] He Y, Kang G L, Dong X Y, et al. Soft filter pruning for accelerating deep convolutional neural networks[J/OL]. ArXiv Preprint (2018-08-21)[2019-03-22]. https://arxiv.org/abs/1808.06866
      [26] He Y H, Zhang X Y, Sun J. Channel pruning for accelerating very deep neural networks[J/OL]. ArXiv Preprint (2017-08-21)[2019-03-22]. https://arxiv.org/abs/1707.06168
      [27] Hu Y M, Sun S Y, Li J Q, et al. Multi-loss-aware channel pruning of deep networks[J/OL]. ArXiv Preprint (2019-02-27)[2019-03-22]. https://arxiv.org/abs/1902.10364
      [28] Zhuang Z W, Tan M K, Zhuang B H, et al. Discrimination-aware channel pruning for deep neural networks[J/OL]. ArXiv Preprint (2019-01-14)[2019-03-22]. https://arxiv.org/abs/1810.11809
      [29] He Y H, Han S. ADC: automated deep compression and acceleration with reinforcement learning[J/OL]. ArXiv Preprint (2019-01-16)[2019-03-22]. https://arxiv.org/abs/1802.03494v1
      [30] Appuswamy R, Nayak T, Arthur J, et al. Structured convolution matrices for energy-efficient deep learning[J/OL]. ArXiv Preprint (2016-06-08)[2019-03-22]. https://arxiv.org/abs/1606.02407
      [31] Sindhwani V, Sainath T N, Kumar S. Structured transforms for small-footprint deep learning[J/OL]. ArXiv Preprint (2015-10-06)[2019-03-22]. https://arxiv.org/abs/1510.01722
      [32] Cheng Y, Yu F X, Feris R S, et al. An exploration of parameter redundancy in deep networks with circulant projections[J/OL]. ArXiv Preprint (2015-10-27)[2019-03-22]. https://arxiv.org/abs/1502.03436
      [33] Chen W L, Wilson J T, Tyree S, et al. Compressing neural networks with the hashing trick//Proceedings of the 32nd International Conference on Machine Learning. Lille, 2015: 2285
      [34] Shi L, Feng S K, Zhu Z F. Functional hashing for compressing neural networks[J/OL]. ArXiv Preprint (2016-05-20)[2019-03-22]. https://arxiv.org/abs/1605.06560
      [35] Wu J R, Wang Y, Wu Z Y, et al. Deep k-Means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions[J/OL]. ArXiv Preprint (2018-06-24)[2019-03-22]. https://arxiv.org/abs/1806.09228
      [36] Lu Z Y, Sindhwani V, Sainath T N. Learning compact recurrent neural networks[J/OL]. ArXiv Preprint (2016-04-09)[2019-03-22]. https://arxiv.org/abs/1604.02594
      [37] Jin X J, Yang Y Z, Xu N, et al. WSNet: compact and efficient networks through weight sampling[J/OL]. ArXiv Preprint (2018-05-22)[2019-03-22]. https://arxiv.org/abs/1711.10067
      [38] Yang Y Z, Jojic N, Huan J. FSNet: Compression of deep convolutional neural networks by filter summary[J/OL]. ArXiv Preprint (2019-02-13)[2019-03-22]. https://arxiv.org/abs/1902.03264
      [39] Gupta S, Agrawal A, Gopalakrishnan K, et al. Deep learning with limited numerical precision[J/OL]. ArXiv Preprint (2015-02-09)[2019-03-22]. https://arxiv.org/abs/1502.02551
      [40] Jacob B, Kligys S, Chen B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018: 2704
      [41] Courbariaux M, Bengio Y, David J P. BinaryConnect: training deep neural networks with binary weights during propagations//Advances in Neural Information Processing Systems. Canada, 2015: 3123
      [42] Courbariaux M, Hubara I, Soudry D, et al. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1[J/OL]. ArXiv Preprint (2016-03-17)[2019-03-22]. https://arxiv.org/abs/1602.02830
      [43] Rastegari M, Ordonez V, Redmon J, et al. XNOR-Net: ImageNet classification using binary convolutional neural networks[J/OL]. ArXiv Preprint (2016-08-02)[2019-03-22]. https://arxiv.org/abs/1603.05279
      [44] Li Z F, Ni B B, Zhang W J, et al. Performance guaranteed network acceleration via high-order residual quantization//Proceedings of the IEEE International Conference on Computer Vision. Venice, 2017: 2584
      [45] Hwang K, Sung W. Fixed-point feedforward deep neural network design using weights+1, 0, and -1//2014 IEEE Workshop on Signal Processing Systems (SiPS). Belfast, 2014: 1
      [46] Hou L, Yao Q M, Kwok J T. Loss-aware binarization of deep networks[J/OL]. ArXiv Preprint (2018-05-10)[2019-03-22]. https://arxiv.org/abs/1611.01600
      [47] Lee H, Battle A, Raina R, et al. Efficient sparse coding algorithms//Advances in Neural Information Processing Systems. Canada, 2007
      [48] Gudovskiy D A, Rigazio L. ShiftCNN: generalized low-precision architecture for inference of convolutional neural networks[J/OL]. ArXiv Preprint (2017-06-07)[2019-03-22]. https://arxiv.org/abs/1706.02393
      [49] Meller E, Finkelstein A, Almog U, et al. Same, same but different-recovering neural network quantization error through weight factorization[J/OL]. ArXiv Preprint (2019-02-05)[2019-03-22]. https://arxiv.org/abs/1902.01917
      [50] Xu Y H, Zhang S, Qi Y Y, et al. DNQ: Dynamic network quantization[J/OL]. ArXiv Preprint (2018-12-06)[2019-03-22]. https://arxiv.org/abs/1812.02375
      [51] Golub G H, Reinsch C. Singular value decomposition and least squares solutions//Linear Algebra. Springer, Berlin, 1971: 134
      [52] Jaderberg M, Vedaldi A, Zisserman A. Speeding up convolutional neural networks with low rank expansions[J/OL]. ArXiv Preprint (2014-05-15)[2019-03-22]. https://arxiv.org/abs/1405.3866
      [53] Kim Y D, Park E, Yoo S, et al. Compression of deep convolutional neural networks for fast and low power mobile applications[J/OL]. ArXiv Preprint (2016-02-24)[2019-03-22]. https://arxiv.org/abs/1511.06530
      [54] Denil M, Shakibi B, Dinh L, et al. Predicting parameters in deep learning//Advances in Neural Information Processing Systems. Lake Tahoe, 2013: 2148
      [55] Calvi G G, Moniri A, Mahfouz M, et al. Tucker tensor layer in fully connected neural networks[J/OL]. ArXiv Preprint (2019-03-14)[2019-03-22]. https://arxiv.org/abs/1903.06133
      [56] Buciluǎ C, Caruana R, Niculescu-Mizil A. Model compression//Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, 2006: 535
      [57] Ba J, Caruana R. Do deep nets really need to be deep?//Advances in Neural Information Processing Systems. Canada, 2014: 2654
      [58] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J/OL]. ArXiv Preprint (2015-03-09)[2019-03-22]. https://arxiv.org/abs/1503.02531
      [59] Romero A, Ballas N, Kahou S E, et al. FitNets: hints for thin deep nets[J/OL]. ArXiv Preprint (2015-03-27)[2019-03-22]. https://arxiv.org/abs/1412.6550
      [60] Chen T Q, Goodfellow I, Shlens J. Net2Net: accelerating learning via knowledge transfer[J/OL]. ArXiv Preprint (2016-04-23)[2019-03-22]. https://arxiv.org/abs/1511.05641
      [61] Li Z Z, Hoiem D. Learning without forgetting. IEEE Trans Pattern Anal Mach Intell, 2018, 40(12): 2935 doi: 10.1109/TPAMI.2017.2773081
      [62] Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer[J/OL]. ArXiv Preprint (2017-02-12)[2019-03-22]. https://arxiv.org/abs/1612.03928
      [63] Mirzadeh S I, Farajtabar M, Li A, et al. Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher[J/OL]. ArXiv Preprint (2019-02-09)[2019-03-22]. https://arxiv.org/abs/1902.03393
      [64] Liu X, Wang X G, Matwin S. Improving the interpretability of deep neural networks with knowledge distillation[J/OL]. ArXiv Preprint (2018-12-28)[2019-03-22]. https://arxiv.org/abs/1812.10924
      [65] Yang C L, Xie L X, Su C, et al. Snapshot distillation: Teacher-student optimization in one generation[J/OL]. ArXiv Preprint (2018-12-01)[2019-03-22]. https://arxiv.org/abs/1812.00123
      [66] Wang J, Bao W D, Sun L C, et al. Private model compression via knowledge distillation[J/OL]. ArXiv Preprint (2018-11-13)[2019-03-22]. https://arxiv.org/abs/1811.05072
      [67] Lee S H, Kim D H, Song B C. Self-supervised knowledge distillation using singular value decomposition[J/OL]. ArXiv Preprint (2018-07-18)[2019-03-22]. https://arxiv.org/abs/1807.06819
      [68] Lan X, Zhu X T, Gong S G. Knowledge distillation by on-the-fly native ensemble[J/OL]. ArXiv Preprint (2018-09-08)[2019-03-22]. https://arxiv.org/abs/1806.04606
      [69] Liu Y J, Che W X, Zhao H P, et al. Distilling knowledge for search-based structured prediction[J/OL]. ArXiv Preprint (2018-05-29)[2019-03-22]. https://arxiv.org/abs/1805.11224
      [70] Li Q Q, Jin S Y, Yan J J. Mimicking very efficient network for object detection//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017: 6356
      [71] Chen G B, Choi W, Yu X, et al. Learning efficient object detection models with knowledge distillation//Advances in Neural Information Processing Systems. Long Beach, 2017: 742
      [72] Wang T, Yuan L, Zhang X P, et al. Distilling object detectors with fine-grained feature imitation//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 4933
      [73] Liu Y F, Chen K, Liu C, et al. Structured knowledge distillation for semantic segmentation//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 2604
      [74] He T, Shen C H, Tian Z, et al. Knowledge adaptation for efficient semantic segmentation//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 578
      [75] Iandola F N, Han S, Moskewicz M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size[J/OL]. ArXiv Preprint (2016-11-04)[2019-03-22]. https://arxiv.org/abs/1602.07360
      [76] Howard A G, Zhu M L, Chen B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[J/OL]. ArXiv Preprint (2017-04-17)[2019-03-22]. https://arxiv.org/abs/1704.04861
      [77] Sandler M, Howard A, Zhu M L, et al. MobileNetV2: inverted residuals and linear bottlenecks//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018: 4510
      [78] Howard A, Sandler M, Chu G, et al. Searching for MobileNetV3[J/OL]. ArXiv Preprint (2019-08-24)[2019-10-10]. https://arxiv.org/abs/1905.02244
      [79] Hu J, Shen L, Sun G. Squeeze-and-excitation networks//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018: 7132
      [80] Zhang X Y, Zhou X Y, Lin M X, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018: 6848
      [81] Ma N N, Zhang X Y, Zheng H T, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design// Proceedings of the European Conference on Computer Vision. Munich, 2018: 116
      [82] Chollet F. Xception: Deep learning with depthwise separable convolutions//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017: 1251
      [83] Xie S N, Girshick R, Dollar P, et al. Aggregated residual transformations for deep neural networks//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017: 1492
      [84] Gao H Y, Wang Z Y, Ji S W. ChannelNets: Compact and efficient convolutional neural networks via channel-wise convolutions//Advances in Neural Information Processing Systems. Montreal, 2018: 5197
      [85] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[J/OL]. ArXiv Preprint (2016-08-23)[2019-03-22]. https://arxiv.org/abs/1602.07261
      [86] Tan M X, Chen B, Pang R M, et al. MnasNet: Platform-aware architecture search for mobile//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 2820
      [87] Tan M X, Le Q V. EfficientNet: Rethinking model scaling for convolutional neural networks[J/OL]. ArXiv Preprint (2019-06-10)[2019-10-10]. https://arxiv.org/abs/1905.11946
      [88] Liu C X, Chen L C, Schroff F, et al. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 82
      [89] Huang G, Chen D L, Li T H, et al. Multi-scale dense networks for resource efficient image classification[J/OL]. ArXiv Preprint (2018-06-07)[2019-03-22]. https://arxiv.org/abs/1703.09844
      [90] Shelhamer E, Rakelly K, Hoffman J, et al. Clockwork convnets for video semantic segmentation[J/OL]. ArXiv Preprint (2016-08-11)[2019-03-22]. https://arxiv.org/abs/1608.03609
    Publication history
    • Received:  2019-03-27
    • Published:  2019-10-01
