In this video, I'll show you how to use Elevenlabs: text-to-speech, speech-to-speech, designing voices, and cloning your voice. I'm also going to show you everything I do to get the best results from this amazing speech synthesis tool.
What is Elevenlabs?
In case you don't know, Elevenlabs is an AI speech synthesis tool that lets you generate speech from text and transform voice recordings into a realistic AI voice. I think Elevenlabs is genuinely one of the most realistic AI voice generators out there in 2024.
Pricing and Plans
It's also super cheap. You can try it for free without an account, but you quickly hit usage limits. If you create a free account, the limits are a little bigger. Honestly, though, it's so cheap that I recommend starting on the Starter Plan, which includes 10 custom voices and 30,000 characters (about 30 minutes of voiceover, according to their estimator).
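As a rough sanity check on that estimate: 30,000 characters for about 30 minutes works out to roughly 1,000 characters per minute of voiceover. The sketch below just applies that ratio. The 1,000 figure is derived from the estimator quote above and is an assumption, not an official Elevenlabs number; real durations depend on the voice and its pacing.

```python
# Ratio implied by the Starter Plan estimate quoted above:
# 30,000 characters ~= 30 minutes, i.e. about 1,000 characters per minute.
# This is a rough assumption, not an official Elevenlabs figure.
CHARS_PER_MINUTE = 30_000 / 30


def estimate_minutes(num_chars: int, chars_per_minute: float = CHARS_PER_MINUTE) -> float:
    """Roughly estimate minutes of voiceover for a given character count."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / chars_per_minute


if __name__ == "__main__":
    print(estimate_minutes(30_000))  # 30.0
    print(estimate_minutes(4_500))   # 4.5
```

Treat the result as a ballpark for planning how many scripts fit into a month's allowance, not as an exact duration.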
Commercial License Included
On top of that, the Starter Plan includes the commercial license, which means you can use it in paid projects. Plus, it's about a dollar for the first month and then $5 a month after that, which again is super cheap: the price of a coffee.
Upgrading to the Creator’s Plan
Later on, if you do start hitting limits (which I don't think you will at first), you can just upgrade to the Creator's Plan.
Common Misconceptions About Elevenlabs
Now, most people I speak to about Elevenlabs don’t actually fully understand what it is. Most people just think it’s a simple text-to-speech generator. But it’s much more than that.
Understanding Context in AI Speech
The Elevenlabs AI actually understands context, which means that if you write something in the style of a book, the AI will try to work out how to perform a given passage from the context of the writing itself.
Emotions and Voice Acting
On top of that, it’s also got a bunch of settings that can be used to achieve a wide range of emotions. It’s more like a voice actor than just a regular text-to-speech generator.
Navigating the Elevenlabs Interface
Let me show you on the computer. Hey YouTube, there’s a link to sign up to Elevenlabs in the description. It is an affiliate link, but it won’t cost you anything extra and it’ll help Alec make more high-quality videos just like this one.
Default Tool: Speech Synthesis
Once you’re in your account, the default tool is the speech synthesis tool. This is where you can generate voiceovers from text. So basically, it’s the text-to-speech tool.
Key Features and Settings
At the top, under Tasks, you'll notice two options: text-to-speech and speech-to-speech. We're going to cover speech-to-speech in just a little bit.
Three Important Drop-down Menus
Now, down here in the settings section, we've got three drop-down menus. These are probably the three most important settings, so take your time customizing them.
Pre-Made Voices
The first drop-down menu is where you can choose from a bunch of different pre-made male and female voices.
Voice Tags and Their Meanings
If I open it up, you can see I've got loads of different options on the left here. We can click one to preview it:
“Never mistake motion for action.”
And then you’ll notice it’s got a name and then a few tags.
- Accent tags: The first tag, the purple one, is the accent (e.g., American, Irish, British English, Italian).
- Tone or style tags: The second tag is the tone or style of the voice (e.g., whispering, calm, well-rounded).
- Recommended use case tags: The third tag is the recommended use case (e.g., meditation, ASMR, narration, news presenter).
Famous Elevenlabs Voices
If you spend a lot of time on social media, I bet you’ve actually heard this voice right here:
“Allow the world to live as it chooses and allow yourself to live as you choose.”
That's probably one of the most famous Elevenlabs voices.
Exploring Unique Voices
But you've also got voices like Arnold, which I think sounds like:
“What worries you masters you.”
And if we scroll down a little bit (where is he?), there's one you'll recognize:
“Maybe life isn’t about finding yourself; life is about creating yourself.”
Maybe it's just me, and I'm sure Elevenlabs didn't do this on purpose, but it's interesting.
Voice Settings
Next, you've got voice settings. This looks a little complicated at first, but don't worry: I'm going to explain each setting and give some recommendations on what you should do.
Stability Slider
You've got three sliders. The first one is stability:
- The more you slide it to the right, the more stable the voice becomes, meaning the output stays more consistent across regenerations. However, it can make the voice sound a little monotone.
- The more you slide it to the left, the more unstable and variable it becomes. More variability can make the speech more expressive, with outputs varying a lot between regenerations, but it can also lead to instabilities.
- As you can see, there's a red zone at the low end, so always try to keep stability at 30% or above. If you're generating long chunks of text, I recommend staying on the more stable side so your renders are more consistent. But if you're doing short one-liners or short video content, get a little experimental and see what crazy results you can get from the less stable settings.
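If you ever drive Elevenlabs from a script rather than the web app, the same rule of thumb fits in a tiny helper. This is a minimal sketch, assuming stability is expressed on a 0 to 1 scale (so the 30% floor becomes 0.30). The 500-character cut-off and the 0.75 and 0.35 values are hypothetical picks for illustration, not Elevenlabs defaults.

```python
# Floor from the walkthrough: stay out of the red zone below 30% stability.
MIN_STABILITY = 0.30


def choose_stability(
    text: str,
    long_text_chars: int = 500,  # hypothetical cut-off between "short" and "long" text
    long_value: float = 0.75,    # illustrative: favour consistency for long narration
    short_value: float = 0.35,   # illustrative: allow more expression for one-liners
) -> float:
    """Pick a stability value (0 to 1) following the rule of thumb above."""
    value = long_value if len(text) >= long_text_chars else short_value
    # Clamp into [MIN_STABILITY, 1.0] so an override can never land in the red zone.
    return max(MIN_STABILITY, min(1.0, value))


if __name__ == "__main__":
    print(choose_stability("Never mistake motion for action."))  # 0.35
    print(choose_stability("word " * 200))                        # 0.75
```

The clamp matters only when you override the defaults: even if you pass a very low value to get wilder results, the helper keeps you at or above the 30% floor.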
Clarity and Similarity Enhancement
Next up, we’ve got clarity and similarity enhancement. This dictates how closely the AI should adhere to the original voice when attempting to replicate it.
Managing Original Audio Quality
- If the original audio is poor quality and the similarity slider is set high, the AI may reproduce some of the unwanted background noise when trying to mimic the voice from the recording.
- For text-to-speech, this is okay because we're choosing from pre-made voices, so similarity can almost always be set high.
- I tend to leave this on the default. Basically, if your captured audio is good, set it high; if it isn't as good, try lowering it a little.
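Putting the two sliders together, here's a hedged sketch of the kind of request body you might send if you use the Elevenlabs API instead of the web app. It assumes the text-to-speech endpoint accepts a JSON body with `text` and a `voice_settings` object containing `stability` and `similarity_boost` on a 0 to 1 scale; check the current Elevenlabs API reference before relying on those names. The 0.85 and 0.5 similarity values are illustrative, and the code only builds the JSON payload without making any network call.

```python
import json


def choose_similarity(source_audio_is_clean: bool) -> float:
    """High similarity for clean audio (including pre-made voices), lower for noisy recordings.

    0.85 and 0.5 are illustrative values, not Elevenlabs defaults.
    """
    return 0.85 if source_audio_is_clean else 0.5


def build_tts_body(text: str, stability: float, source_audio_is_clean: bool = True) -> dict:
    """Build a text-to-speech request body.

    The field names (text, voice_settings, stability, similarity_boost) are an
    assumption based on the Elevenlabs REST API; verify them against the docs.
    """
    if not text:
        raise ValueError("text must not be empty")
    if not 0.0 <= stability <= 1.0:
        raise ValueError("stability must be between 0 and 1")
    return {
        "text": text,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": choose_similarity(source_audio_is_clean),
        },
    }


if __name__ == "__main__":
    # Pre-made voice for text-to-speech: treat the source audio as clean.
    body = build_tts_body("What worries you masters you.", stability=0.5)
    print(json.dumps(body, indent=2))
```

Keeping the similarity choice in its own function mirrors the advice above: pre-made voices and clean recordings get a high value, while noisy captured audio gets a lower one so background noise is less likely to be copied.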