• Indexed by Engineering Index (EI)
    • Chinese Core Journal
    • Source journal for Chinese Science and Technology Paper Statistics
    • Source journal of the Chinese Science Citation Database (CSCD)


Research on automatic speech recognition based on a DL–T and transfer learning

    ZHANG Wei, LIU Chen, FEI Hong-bo, LI Wei, YU Jing-hu, CAO Yi

    ZHANG Wei, LIU Chen, FEI Hong-bo, LI Wei, YU Jing-hu, CAO Yi. Research on automatic speech recognition based on a DL–T and transfer learning[J]. Chinese Journal of Engineering, 2021, 43(3): 433-441. doi: 10.13374/j.issn2095-9389.2020.01.12.001


    doi: 10.13374/j.issn2095-9389.2020.01.12.001
    Funding: National Natural Science Foundation of China (51375209); Six Talent Peaks Project of Jiangsu Province (ZBZZ–012); Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX18_0630, KYCX18_1846)
      Corresponding author: caoyi@jiangnan.edu.cn

    • CLC number: TN912.3

    • Abstract: To address the high prediction error rate and slow convergence of the RNN–T (Recurrent Neural Network–Transducer) in speech recognition, this paper proposes an acoustic modeling method based on a DL–T. First, the RNN–T acoustic model is introduced. Second, combining a DenseNet with LSTM networks, a new acoustic modeling method, the DL–T, is proposed; it extracts high-dimensional information from raw speech to strengthen feature reuse and alleviates the gradient problem to ease the propagation of deep information, so that it achieves both a low prediction error rate and fast convergence. Then, to further improve the accuracy of the acoustic model, a transfer learning method suited to the DL–T is proposed. Finally, to verify the above methods, speech recognition experiments were conducted with the DL–T acoustic model on the Aishell–1 dataset. The results show that, relative to the RNN–T, the DL–T reduces the prediction error rate by 12.52%, with a final character error rate of 10.34%. The DL–T therefore significantly improves both the prediction error rate and the convergence speed of the RNN–T.
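The DenseNet-style feature reuse that the abstract credits for the DL–T's improvements can be sketched in a few lines. This is only a minimal illustration of a dense block, in which each layer receives the concatenation of the block input and all preceding layer outputs; the layer count, growth rate, and plain linear + ReLU layers are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """Minimal dense block over feature vectors: each layer sees the
    concatenation of the block input and all previous layer outputs,
    and contributes growth_rate new features (DenseNet-style reuse)."""
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=-1)           # feature reuse
        w = rng.standard_normal((inp.shape[-1], growth_rate)) * 0.1
        features.append(np.maximum(inp @ w, 0.0))         # linear + ReLU layer
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 40))   # 4 speech frames, 40-dim filterbank features
y = dense_block(x, num_layers=3, growth_rate=12, rng=rng)
print(y.shape)                     # channels grow as 40 + 3 * 12
```

Because every layer's output is carried forward by concatenation, gradients reach early layers through short paths, which is the property the abstract invokes for faster convergence.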

       

    • Figure  1.  Acoustic model structure of the RNN–T

      Figure  2.  Model structure of DenseNet

      Figure  3.  Encoder network structure of the DL–T

      Figure  4.  Structure of the transfer learning method

      Figure  5.  Curves of the baseline model: (a) loss curve in the initial training stage; (b) loss curve in the transfer learning stage; (c) prediction error rate curve in the initial training stage; (d) prediction error rate curve in the transfer learning stage

      Figure  6.  Curves of the DenseNet–LSTM–Transducer: (a) loss curves of different acoustic models in the initial training stage; (b) loss curves of different acoustic models in the transfer learning stage; (c) prediction error rate curves of different acoustic models in the initial training stage; (d) prediction error rate curves of different acoustic models in the transfer learning stage

      Table  1.  Experimental results of the RNN–T baseline (CER, %)

      Acoustic model | Initial model (Dev / Test) | TL (Dev / Test) | TL+LM (Dev / Test)
      RNN-T[15]      |       –  /  –       |       –  /  –       | 10.13 / 11.82
      E3D1           | 17.69 / 18.92       | 14.42 / 16.31       | 12.07 / 13.57
      E4D1           | 15.03 / 17.39       | 13.66 / 15.58       | 11.25 / 13.07
      E5D1           | 19.62 / 22.35       | 14.14 / 16.22       | 11.89 / 13.53
      E4D2           | 12.12 / 14.54       | 10.74 / 12.74       |  9.13 / 10.65

      Table  2.  Experimental results of the DL–T (CER, %)

      Acoustic model | Initial model (Dev / Test) | TL (Dev / Test) | TL+LM (Dev / Test)
      SA–T[15]       |       –  /  –       |       –  /  –       |  9.21 / 10.46
      LAS[28]        |       –  /  –       |       –  /  –       |   –   / 10.56
      DE3D1          | 15.17 / 17.31       | 13.78 / 15.92       | 11.85 / 13.52
      DE4D1          | 13.70 / 15.84       | 12.78 / 14.80       | 11.21 / 12.95
      DE5D1          | 15.92 / 18.38       | 13.46 / 15.30       | 11.57 / 13.90
      DE4D2          | 11.23 / 13.45       | 10.69 / 12.55       |  8.80 / 10.34

      Table  3.  Effects of different language model weights on the acoustic model (CER, %)

      LM weight | Dev CER | Test CER
      0.2       |  8.91   | 10.47
      0.3       |  8.80   | 10.34
      0.4       |  8.89   | 10.45
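The weight sweep in Table 3 matches the common shallow-fusion formulation, in which the language model log-probability is scaled by a weight before being added to the acoustic score. The sketch below is a generic rescoring scheme under that assumption, not the paper's decoder, and the candidate transcripts and their scores are made up for illustration:

```python
def fused_score(am_logprob, lm_logprob, lm_weight):
    """Shallow fusion: acoustic log-probability plus a weighted language
    model log-probability; lm_weight plays the role swept in Table 3."""
    return am_logprob + lm_weight * lm_logprob

# Two hypothetical candidate transcripts with (acoustic, LM) log-probabilities.
candidates = {
    "hyp_a": (-12.0, -9.0),   # acoustically better
    "hyp_b": (-13.0, -6.0),   # more fluent under the LM
}

for w in (0.2, 0.3, 0.4):
    best = max(candidates, key=lambda h: fused_score(*candidates[h], w))
    print(w, best)
```

With these made-up scores the best hypothesis flips between w = 0.3 and w = 0.4, which is why a small sweep like the one in Table 3 is worthwhile: too small a weight ignores fluency, too large a weight overrides the acoustics.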
    • [1] Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag, 2012, 29(6): 82
      [2] Graves A, Mohamed A, Hinton G E. Speech recognition with deep recurrent neural networks // 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, 2013: 6645
      [3] Seltzer M L, Ju Y C, Tashev I, et al. In-car media search. IEEE Signal Process Mag, 2011, 28(4): 50
      [4] Yu D, Deng L. Analytical Deep Learning: Speech Recognition Practice. Yu K, Qian Y M, Translated. 5th ed. Beijing: Publishing House of Electronic Industry, 2016
      [5] Peddinti V, Wang Y M, Povey D, et al. Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process Lett, 2018, 25(3): 373
      [6] Povey D, Cheng G F, Wang Y M, et al. Semi-orthogonal low-rank matrix factorization for deep neural networks // Conference of the International Speech Communication Association. Hyderabad, 2018: 3743
      [7] Xing A H, Zhang P Y, Pan J L, et al. SVD-based DNN pruning and retraining. J Tsinghua Univ Sci Technol, 2016, 56(7): 772
      [8] Graves A, Fernandez S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks // Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, 2006: 369
      [9] Zhang Y, Pezeshki M, Brakel P, et al. Towards end-to-end speech recognition with deep convolutional neural networks // Conference of the International Speech Communication Association. California, 2016: 410
      [10] Zhang W, Zhai M H, Huang Z L, et al. Towards end-to-end speech recognition with deep multipath convolutional neural networks // 12th International Conference on Intelligent Robotics and Applications. Shenyang, 2019: 332
      [11] Zhang S L, Lei M. Acoustic modeling with DFSMN-CTC and joint CTC-CE learning // Conference of the International Speech Communication Association. Hyderabad, 2018: 771
      [12] Dong L H, Xu S, Xu B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition // IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, 2018: 5884
      [13] Graves A. Sequence transduction with recurrent neural networks // Proceedings of the 29th International Conference on Machine Learning. Edinburgh, 2012: 235
      [14] Rao K, Sak H, Prabhavalkar R. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer // 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Okinawa, 2017
      [15] Tian Z K, Yi J Y, Tao J H, et al. Self-attention transducers for end-to-end speech recognition // Conference of the International Speech Communication Association. Graz, 2019: 4395
      [16] Bu H, Du J Y, Na X Y, et al. Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline[J/OL]. arXiv preprint (2017-09-16)[2019-10-10]. http://arxiv.org/abs/1709.05522
      [17] Battenberg E, Chen J T, Child R, et al. Exploring neural transducers for end-to-end speech recognition // 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Okinawa, 2017: 206
      [18] Williams R J, Zipser D. Gradient-based learning algorithms for recurrent networks and their computational complexity // Back-propagation: Theory, Architectures and Applications. 1995: 433
      [19] Huang G, Liu Z, Maaten L V D, et al. Densely connected convolutional networks // IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017: 4700
      [20] Cao Y, Huang Z L, Zhang W, et al. Urban sound event classification with the N-order dense convolutional network. J Xidian Univ Nat Sci, 2019, 46(6): 9
      [21] Zhang S, Gong Y H, Wang J J. The development of deep convolutional neural networks and its application in computer vision. Chin J Comput, 2019, 42(3): 453
      [22] Zhou F Y, Jin L P, Dong J. Review of convolutional neural networks. Chin J Comput, 2017, 40(6): 1229 doi: 10.11897/SP.J.1016.2017.01229
      [23] Yi J Y, Tao J H, Liu B, et al. Transfer learning for acoustic modeling of noise robust speech recognition. J Tsinghua Univ Sci Technol, 2018, 58(1): 55
      [24] Xue J B, Han J Q, Zheng T R, et al. A multi-task learning framework for overcoming the catastrophic forgetting in automatic speech recognition[J/OL]. arXiv preprint (2019-04-17)[2019-10-10]. https://arxiv.org/abs/1904.08039
      [25] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality // Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. Canada, 2013: 3111
      [26] Povey D, Ghoshal A, Boulianne G, et al. The Kaldi speech recognition toolkit // IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Big Island, 2011
      [27] Paszke A, Gross S, Chintala S, et al. Automatic differentiation in PyTorch // 31st Conference on Neural Information Processing Systems. Long Beach, 2017
      [28] Shan C, Weng C, Wang G, et al. Component fusion: learning replaceable language model component for end-to-end speech recognition system // IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, 2019: 5361
    Publication history
    • Received: 2020-01-12
    • Published online: 2022-10-14
    • Issue date: 2021-03-26
