TY - GEN
T1 - Closing the Loop on Speech to Music Translation
T2 - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2025
AU - Krishnan, Gopika
AU - Drabek, Julia
AU - Anantapadmanabhan, Akshay
AU - Ganguli, Kaustuv Kanti
AU - Guedes, Carlos
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - This paper presents a pipeline to convert spoken Konnakol sequences, a South Indian vocal percussion language, into synthetic rhythmic sequences performed on the mridangam. We fine-tune the Whisper speech-to-text model on Konnakol data, enabling accurate transcription of spoken sequences, despite the small size of our dataset (approximately 15 minutes). The transcriptions are rhythmically encoded in a format that is compatible with the Konnakol Typewriter, a web application that converts these sequences into mridangam audio. Additionally, these transcriptions serve as input for a Markov model, which generates new rhythmic sequences that can also be processed through the Konnakol Typewriter to produce mridangam audio. Whisper's performance is impressive with very low error rates, making it an ideal tool for this task. This pipeline not only facilitates the transcription of Konnakol but also opens possibilities for creating educational tools, preserving cultural heritage, and generating data for rhythm-based applications. Future work will focus on refining the process to improve accuracy and versatility.
KW - Automatic Speech Recognition (ASR)
KW - Carnatic Music
KW - Konnakol Transcription
KW - Machine Learning
KW - Markov Chain Generation
UR - http://www.scopus.com/inward/record.url?scp=105007802991&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105007802991&partnerID=8YFLogxK
U2 - 10.1109/ICASSPW65056.2025.11011256
DO - 10.1109/ICASSPW65056.2025.11011256
M3 - Conference contribution
AN - SCOPUS:105007802991
T3 - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2025 - Workshop Proceedings
BT - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2025 - Workshop Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 6 April 2025 through 11 April 2025
ER -