At the forefront of innovation in the Responsible Artificial Intelligence Lab, one of our PhD students is addressing the challenge of accent recognition in automatic speech recognition (ASR) systems, particularly for low-resource, non-native English accents such as Ghanaian English. Because of limited data and inadequate model representation, ASR systems exhibit high word error rates (WER) on non-native accents.
To overcome this, we propose a Convolutional Neural Network-based Masked Spectrogram Reconstruction (CNN-MSR) framework for accent classification and similarity assessment. The model extracts features from randomly masked spectrograms and integrates Much Lower Frame Rate (mLFR) processing to improve computational efficiency while preserving accent-specific information.
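The published work defines the exact architecture; purely to illustrate the masked-spectrogram-reconstruction and frame-rate-reduction ideas, the sketch below uses a toy PyTorch CNN. The masking ratio, chunk size, frame-stacking factor, and layer sizes are all illustrative assumptions, not the configuration used in the study.

```python
import torch
import torch.nn as nn

def random_time_mask(spec, mask_ratio=0.15, chunk=8):
    """Zero out random time chunks of a (batch, mels, frames) log-mel spectrogram.
    The ratio and chunk size here are illustrative, not the published values."""
    masked = spec.clone()
    b, _, t = spec.shape
    n_chunks = max(1, int(mask_ratio * t / chunk))
    for i in range(b):
        for _ in range(n_chunks):
            start = torch.randint(0, max(1, t - chunk), (1,)).item()
            masked[i, :, start:start + chunk] = 0.0
    return masked

def lower_frame_rate(spec, stack=3):
    """Stack consecutive frames into the channel dimension, reducing the
    frame rate (and downstream compute) by the stacking factor."""
    b, m, t = spec.shape
    t = (t // stack) * stack
    spec = spec[:, :, :t]
    return spec.reshape(b, m, t // stack, stack).permute(0, 1, 3, 2).reshape(b, m * stack, t // stack)

class MaskedSpectrogramCNN(nn.Module):
    """Toy CNN that reconstructs the unmasked spectrogram from a masked input;
    the pooled encoder features feed a small accent-classification head."""
    def __init__(self, in_ch, hidden=128, n_accents=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(hidden, in_ch, kernel_size=3, padding=1)
        self.classifier = nn.Linear(hidden, n_accents)

    def forward(self, masked_spec):
        z = self.encoder(masked_spec)
        recon = self.decoder(z)                   # trained to match the unmasked input
        logits = self.classifier(z.mean(dim=-1))  # mean-pooled features -> accent logits
        return recon, logits
```

A training loop would then combine a reconstruction loss (for example, MSE between `recon` and the unmasked, frame-rate-reduced spectrogram) with a cross-entropy loss on `logits`; whether these objectives are optimized jointly or in stages is a design choice specified in the paper, not here.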
The CNN-MSR model achieves a classification accuracy of 90.71%, surpassing traditional approaches. It significantly reduces the WER for non-native speakers to 0.5 and effectively distinguishes native from non-native accents, as demonstrated through t-SNE and UMAP visualizations and cosine similarity analysis. Additionally, it addresses the class imbalance in the non-native Ghanaian accent data using stratified splitting, proportional oversampling, and weighted sampling.
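To make the imbalance handling concrete, here is a minimal sketch of stratified splitting combined with frequency-weighted sampling, using scikit-learn and PyTorch. The arrays and dataset are placeholders, and the study's proportional oversampling step is not reproduced exactly.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder data: feature vectors X and accent labels y (e.g. 0 = native, 1-2 = non-native groups).
X = np.random.randn(1000, 40).astype("float32")
y = np.random.randint(0, 3, size=1000)

# Stratified split keeps each accent's proportion identical in train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Weighted sampling: draw each example with probability inversely proportional
# to its class frequency, so rare accents are seen as often as common ones.
class_counts = np.bincount(y_tr)
sample_weights = 1.0 / class_counts[y_tr]
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(y_tr),
    replacement=True,
)

train_ds = TensorDataset(torch.from_numpy(X_tr), torch.from_numpy(y_tr))
train_loader = DataLoader(train_ds, batch_size=32, sampler=sampler)
```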
Our findings underscore the need for larger and more diverse datasets to improve non-native accent classification. Future research will expand the dataset and explore transformer-based masked language models to enhance performance, fostering more inclusive and adaptable ASR systems for multilingual environments.

Amina Salifu (PhD)