ABSTRACT
Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
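The "post-processing to transform their outputs into label sequences" that CTC eliminates from training is the many-to-one collapse mapping: a framewise path of network outputs is reduced to a label sequence by merging consecutive repeats and then deleting blanks. A minimal sketch of that collapse (function name and blank token are illustrative choices, not from this paper):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a framewise label path into a label sequence:
    merge consecutive repeated symbols, then drop the blank."""
    out = []
    prev = None
    for symbol in path:
        # Emit a symbol only when it differs from its predecessor
        # (repeat merging) and is not the blank token.
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return out

# Repeats separated by a blank are kept as distinct labels,
# which is how CTC represents doubled labels such as "aa".
print(ctc_collapse(list("-aa-bb")))  # ['a', 'b']
print(ctc_collapse(list("a-a")))     # ['a', 'a']
```

Because many paths collapse to the same labelling, the CTC objective sums the probabilities of all such paths for the target sequence, which is what lets the network be trained on unsegmented data.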