About the project

Implementation of end-to-end Automatic Speech Recognition (ASR) architectures.

January 25, 2024
My Role

Automatic Speech Recognition

Implemented end-to-end automatic speech recognition architectures based on CTC1, Listen, Attend and Spell (LAS)2, and LAS-CTC3.

The models were trained and tested on a subset of the HarperValleyBank Dataset4, which is hosted here. The dataset is used to train models that predict each spoken character.
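Character-level prediction requires mapping transcripts to integer labels over a fixed character vocabulary. A minimal sketch of such an encoding (the token set and helper names here are illustrative assumptions, not the project's actual vocabulary):

```python
# Hypothetical character vocabulary: a blank for CTC, start/end markers,
# space, lower-case letters, and apostrophe. The real project may differ.
VOCAB = ["<blank>", "<sos>", "<eos>", " "] + list("abcdefghijklmnopqrstuvwxyz'")
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def encode(text):
    """Map a lower-cased transcript to integer character labels,
    silently dropping characters outside the vocabulary."""
    return [CHAR_TO_IDX[ch] for ch in text.lower() if ch in CHAR_TO_IDX]

def decode(ids):
    """Inverse mapping; skips the special tokens (blank, <sos>, <eos>)."""
    return "".join(VOCAB[i] for i in ids if i >= 3)
```

For example, `decode(encode("hi there"))` round-trips back to the original lower-case string.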


  • Feature Extraction
    • Uses Librosa to extract log-mel spectrograms from WAV audio
    • Character encoding
  • Training end-to-end ASR
    • Multiple implementations of ASR model architectures, including attention-based models
    • Regularization of attention-based network to respect CTC alignments (LAS-CTC)
    • Utilizes Lightning Trainer API
    • Training-process logging and visualization with Wandb
    • Teacher-forcing
  • Decoding
    • Greedy decoding
    • Imposes a CTC objective on decoding
    • CTC collapsing rules (merge repeated symbols, remove blanks)
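Greedy CTC decoding, as listed above, takes the argmax symbol per frame and then applies the CTC rules: collapse consecutive repeats, then drop blanks. A minimal framework-free sketch (the vocabulary and blank index are illustrative assumptions):

```python
def ctc_greedy_decode(logits, vocab, blank=0):
    """Greedy CTC decoding over per-frame scores.

    logits: sequence of frames, each a list of per-symbol scores.
    vocab:  symbol strings indexed to match the score positions.
    blank:  index of the CTC blank symbol (assumed 0 here).
    """
    # Frame-wise best symbol indices (the greedy path).
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for idx in best:
        # CTC rules: skip repeats of the previous index, then skip blanks.
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)
```

For instance, a greedy path `a a _ a b b` collapses to `aab`: the repeated `a` frames merge, the blank separates the two `a`s so both survive, and the repeated `b` frames merge.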
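The LAS-CTC regularization mentioned above follows the multi-task recipe of Kim et al.3: the attention decoder and a CTC head are trained jointly, and their losses are interpolated. A minimal sketch (the weight `lam` and its default value are illustrative assumptions):

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.2):
    """Multi-task objective for LAS-CTC training.

    lam interpolates between the CTC loss (lam=1) and the
    attention loss (lam=0); 0.2 is a hypothetical default.
    """
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

The CTC term constrains the attention-based decoder toward monotonic, left-to-right alignments, which is the "respect CTC alignments" behavior listed above.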

Model Run Report

Model run report obtained from Wandb

View project on GitHub


  1. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, A. Graves et al.
  2. Listen, Attend and Spell, W. Chan et al.
  3. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, S. Kim et al.
  4. CS224S: Spoken Language Processing