Personal Speech Synthesis:
Try and guess which one of these recordings is me and which is Lyrebird. Easy, right? I’ve got a ways to go
Without getting too in-depth about a project I’m aiming to complete within a few years, I’d like to update you all on my progress synthesizing my own voice. The main goal for this branch of the project is to create my own personification of a personal assistant. This isn’t easy, given the complexity of the voices behind Siri and Alexa, but I’m confident that with enough training time and data I can achieve at least similar results, only with my own voice. Anyway, with all that out of the way, here’s where I’m at:
If we’re talking about completely finished results, technically I haven’t finished anything. Through one of my favorite podcasts, RadioLab, I came across a startup called Lyrebird that uses AI to deconstruct parts of your speech from hundreds of recordings and slowly piece together a copy of your voice. It’s not great, especially because they currently limit the service to 300 recordings (about an hour of audio), but it does start to sound a little like me, even through the very robotic tone. Included in this post is a recording I compiled through Lyrebird and one I did of myself; check it out. This would suffice for testing out what I need to build the personal assistant, except that they don’t have a public API yet (otherwise I’d let you all make me say whatever you wanted through this post). This means that, other than manually typing out what I want it to say, I can’t implement it in anything like an assistant. So, I kept looking around and found something else that might work.
Google DeepMind has done some amazing work in the last few years, but their WaveNet research really blew me away. I’ll link to their original paper here. What they’ve done since then is build upon this using TensorFlow to preprocess and train a model on any amount of recordings you feed it, stopping only at a defined step count. The process and technology are pretty amazing, and I’m specifically looking at using their Tacotron 2 implementation. Various versions exist on GitHub. You can hear some of the examples here; they really blow my mind. So, I’ve installed all the needed dependencies and will start training a model for my own voice! As soon as I record about 24 hours of my voice… hah, easier said than done. So, this project is going to take a lot longer than I expected. Luckily, my GTX 1080 has a CUDA compute capability of 6.1, so if I leave it training for a day or two, once I do all those pesky recordings, it should sound pretty great.
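To get a feel for how much recording that actually means, here’s a quick back-of-the-envelope sketch. It assumes clips roughly the length Lyrebird uses, working from their stated limit of 300 recordings for about an hour of audio (so roughly 12 seconds per clip); the exact clip length a Tacotron 2 dataset wants will vary, so treat these numbers as a rough estimate only.

```python
# Rough estimate of how many short clips a given amount of audio works out to.
# Assumption: Lyrebird-style clips, where 300 recordings ~= 1 hour of audio,
# i.e. about 12 seconds per clip.
SECONDS_PER_CLIP = 3600 / 300  # ~12 s per clip (assumed from Lyrebird's limit)

def clips_needed(hours_of_audio: float,
                 seconds_per_clip: float = SECONDS_PER_CLIP) -> int:
    """Number of clips required to cover a given amount of audio."""
    return round(hours_of_audio * 3600 / seconds_per_clip)

print(clips_needed(1))   # Lyrebird's current cap: 300 clips
print(clips_needed(24))  # the 24-hour Tacotron target: 7200 clips
```

So the jump from Lyrebird’s hour to a 24-hour dataset is on the order of thousands of recordings, which is why I expect this stage to take a while.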
For now, that’s pretty much all I have, so updates to come!