The AWS example is $4 per million char. The starter of this service is $7.92. And the cheapest option is $2 per million which is 2x cheaper, not 8x. Yes the AWS (and google) neural voices are $16 but then the front page voice sampled is not from the neural AWS Polly but standard (Matthew IIRC).
Shady, really shady. It's a shame, I would not mind a good competitor to Polly/Google.
Plus this service starts at $249/month for the first non-"evaluation" plan compared to AWS & GCP which are fully Pay as you go. That's a complete deal breaker for anyone who wants to use it for a side project.
Yes, that's true. We're not targeting side projects at the moment as we do not have the infrastructure to do so. I hope to be able to support smaller developers (like myself) soon. Step-by-step!
I’d say businesses that already spend $$$ on a 3rd party TTS service and know how expensive it is or have content but haven’t pulled the trigger due to the prohibitive cost. I wrote this on the Product Hunt page:
> “… read aloud apps (e.g. Pocket, Speechify, etc), UGC platforms (e.g. Medium, SubStack, etc), publications (e.g. Bloomberg, NY Times, etc), and e-learning platforms (e.g. Duolingo, Pearson, etc). We’re also interested in providing discounts to non-profits (e.g. Wikipedia, etc)”
I'm going to argue yours is not better. I preferred the AWS example, which was no doubt cherry picked, over all of your voices. Especially during the transition between sentences there is an audible glitch in the audio, whereas AWS transitioned smoothly.
I get lots of weird random artifacts in Unreal. Sometimes it's a weird "vibrato" on some words, some words are unrealistically raspy. Plus it's a bit too sibilant in general compared to AWS. Clicking the "Redo" button fixes it but then other artifacts crop up in other places, which leads me to believe this is completely fixable.
And what do you mean by cherry-picked? I don't think the Amazon example is cherry picked, you can type your own custom text there! They're just using the AWS API themselves. Maybe I misunderstood you.
Anyway, it's a bit unfair because Amazon probably spent millions in their product, but I wouldn't exactly call it better...
Sure, that's why I put "arguably" as I honestly know that AWS is really good. But I did pick the best AWS voice which is the most popular one amongst many read aloud apps, though! Anyways, I'll try to implement speaker selection for AWS so that the comparison is more fair. I barely finished our multi-speaker feature before launching on PH.
You should probably edit the title - some amount of puffery is reasonable when showcasing one's own work but if you're so wide of the mark, the only feedback you'll get is about how your representations are inaccurate. Which is what's happening.
I think both are pretty good, but to my ears the UnrealSpeech ones all have very sharp, grating "s" sounds. The AWS voice is much smoother in this regard. Is this something that can be configured to a degree or would you have to post-process it with a de-esser? Because I can't imagine listening to those voices for anything longer than the example text.
The demo and copy are great. Not everyone will agree it is "better" because that's subjective but your website does a good job of telling people it is better and the demo is good enough that some people will accept what they see/read/hear. :clap:
Your heading makes it seem like you want the direct comparison to AWS polly so maybe add a table to directly compare different aspects of your product vs aws that make it better. Sound quality is just one attribute to compare. What about SDKs, limits, code samples, use cases, more nitty gritty sound comparison details, etc.
Pricing should also be more transparent if you want to compare aws to yours - what maths did you do to get 8x cheaper because at first glance that is misleading.
I agree on the table idea to more candidly and clearly compare other aspects. This launch/experiment was that the quality/cost would be the main factors, and we’d go from there, iterating and customizing it to work for early customers.
The 8x math is per 1M characters. I do see that since we’re charging a subscription, it may not be a fair comparison. But the minimum commitment is so small that I thought it wouldn’t matter for the customers I’m targeting right now. I do think it can be misleading because people might expect pay-as-you-go.
We aren’t able to provide pay-as-you-go right now, so I’ll look into updating the copy or how we communicate the subscription model!
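For anyone following the "8x" argument, here is a minimal sketch of the per-million-character arithmetic, using only the prices quoted by commenters in this thread (not checked against any official price list). The function and constant names are made up for illustration. Whether the like-for-like row is the right one depends on whether the demo voice is a standard or a neural Polly voice, which commenters here dispute.

```python
# Per-million-character prices in USD, as quoted by commenters in this thread.
# These are assumptions taken from the discussion, not official price lists.
AWS_STANDARD = 4.00
AWS_NEURAL = 16.00
UNREAL_STARTER = 7.92
UNREAL_CHEAPEST = 2.00


def times_cheaper(competitor_price, our_price):
    """Return how many times cheaper our_price is than competitor_price."""
    return competitor_price / our_price


# Most favorable framing: AWS neural price vs. the cheapest Unreal tier.
best_case = times_cheaper(AWS_NEURAL, UNREAL_CHEAPEST)

# Like-for-like if the demo voice is a standard (non-neural) Polly voice.
like_for_like = times_cheaper(AWS_STANDARD, UNREAL_CHEAPEST)

# Starter tier vs. AWS standard: a value below 1 means the starter tier costs more.
starter_vs_standard = times_cheaper(AWS_STANDARD, UNREAL_STARTER)

print(f"best case:           {best_case:.2f}x")
print(f"like for like:       {like_for_like:.2f}x")
print(f"starter vs standard: {starter_vs_standard:.2f}x")
```

The 8x figure only appears when the neural price is set against the cheapest tier; the other two pairings give 2x and roughly 0.5x.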
I definitely recognized the voice of real people alive today in your example set. I assume it's some kind of a trained ai you feed hours of content to for them to determine the speech pattern, question is, are you paying royalties to those individuals for their contributions?
I'm not. I'll try to reach out and figure out a license of some sort. I suspect the royalties we could pay out are probably negligible. I think it might be safest for me to get rid of the "recognizable" voices and simply create new synthetic voices that are not of real people.
I'm pretty sure tech like this is similar to self-driving tech like 6-10 years ago in the sense that there are no laws addressing it. Like no one wrote a law saying "a driver must be in the driver's seat of a car" ahead of time. Youtube has already reinstated a Jay-Z audio deepfake that was originally taken down.
I agree, I think entirely synthetic voices will be the way a lot of services like this can operate in the future. Unfortunately I haven't seen much research in this area. Guess it's outside the typical "take a dataset and optimize the hell out of it" realm of a lot of ML research since the synthetic voice will not exist in any dataset ahead of time.
Been thinking a lot about how to accomplish this myself for a similar product I'm building, glad to hear someone else is thinking about it too!
Would it be possible to mix multiple people/voices in a training set? Or does that confuse the AI? Could be interesting to create a real but kind of not real model. If that makes sense…
Plenty of the internet lawyers came out to rabidly defend the right of Github to pirate data to feed into Copilot, so I wouldn't be that worried about IP. I would be more worried about picking the wrong voices, such as those with strong political connotations.
That's a great insight. I'll look into the Github/Copilot more. I can't code without Copilot anymore and that piques my interest. But yeah, we def need politically neutral voices.
on the flip side, I found the "professor" voice endearing. I'm not necessarily a fan (although not a hater either), but I thought it was:
A. Impressive display of capability
B. A very clever choice as it's recognizable but not as universal as say Obama's voice or Joe Rogan's would be
C. Brilliant marketing
I probably wouldn't offer it as a "real" voice for use in bulk through the API due to the legal concerns, but on the marketing page it's really cool and I would hang on to it. Plus if you get sued it would be great publicity :-)
Haha, I really appreciate this feedback. Everything I wanted to hear.
Frankly, trying to fight against the Goliath, with 0 marketing budget, and I’m desperately hoping to create noise. Breaking a rule or two is something AWS can’t do at their scale.
On the bulk offering, I do have a clear path forward. In short, I can create new synthetic voices. It’s like those this-person-does-not-exist images but for voice. “unreal speech”
This is a typical brute behavior. It's not about breaking a rule or two for the sake of success, it's about doing it while riding on people's backs. You think that stealing someone's voice is ok for a small startup company because you don't have the budget to know better while this is not the issue at hand. You obviously know this is wrong and are trying to get away with it riding on people's good intentions.
This is blunt and clear immoral behavior and you are fully aware of it. You just think that you have the right to reach success not by your own skills, but by stepping on other people, and until they complain you'll keep doing it.
PS: Even if Amazon does the same thing it's no excuse for you. Look up tu quoque fallacy.
Alright, Mr. Noble. You're either naive, narrow-minded, unable to see the big picture, or all of the above.
Bringing others down will never be an antidote to not building or achieving anything in your life. Your time will be better spent if you focus on bringing yourself up and others around you.
I think you want to get a rise out of someone over the Internet, with your identity hidden, to get attention. Hence, this will be my last comment-- but if you want to turn it around, I'd be down to help.
That name might cause you some grief, as you're shortening it to "[logo] UNREAL".
I personally think it's fine, but I'd expect some people finding it searching for speech synthesis for the unreal engine and then getting outraged about getting "tricked".
Not sure what to do about it though. It just jumped to the top of my mind while looking at the landing page.
Got it, I appreciate the feedback. I didn't really think about people searching for speech synthesis for unreal engine. I'll try to address this somehow!
Yeah, ours is obviously an early version developed with far less resources in less time. But I'm hoping that the 6-8x lower price point is a selling point (considering the quality is relatively comparable, and it'll get better)
The only sample I found arguably better than AWS was Female B. The others were too close to an uncanny valley (Professor, Entrepreneur), or the cadence was all over the place (Male A and Male B).
I spend a lot of time listening to AWS's text-to-speech. It can be distracting at times. But with Unreal Speech's text-to-speech, I'd lose focus incredibly quickly and focus on the issues with cadence or general weirdness (Professor's pitch change is too abrupt, and weirdly gets caught on the word used which throws off the cadence of natural speech).
Honestly, I'm super happy that you found one voice arguably better than AWS.
I might argue, though, that you'd get used to our voices quickly once you start using them frequently. I've had this feedback before from someone who was very used to AWS's monotonous voice, and he actually changed his opinion after listening to a couple of articles. Previously, I built https://audioread.com which got me to talk to a bunch of users.
I think the random generator might be broken.... It spit out some offensive stuff that maybe you don't want on a business site, though it did give me a laugh lol...
> What the fk did you just fking say about me, you little btch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Qaeda, and I have over 300 confirmed kills.
dammit to hell (in a disappointed but congratulatory voice), I'm building basically the exact same product as audioread.com right now. I thought it was a great idea and was shocked I couldn't find any implementations yet :-(
kudos to you on having a wonderful idea :-D Guess I'll move on to my next idea
I'm not a big fan of NLP, personally, as I think it's kind of creepy how easy it is to deepfake a voice these days given a very moderate amount of content that can be scraped online, but I like the 'disrupt the giant' attitude you have.
> Yo! Let’s at least chat and brainstorm about it. Would you want to ping me on twitter? @automationism
But honestly, THIS is why I keep coming back to HN: rather than take an adversarial route and needlessly bicker, entrepreneurs should be collaborating and utilizing each other's strengths.
Well done, I now want to see your progress as your product matures! Who knows, I (an AI/ML student) might have a need for your services yet.
For every voice sample that Unreal offers on the front page comparison, I found that the Unreal voice mashed up words (or tried to speak syllables too quickly), most noticeably on "synthesis." I did notice some more natural-like pauses and cadence, but the clarity of speech wasn't as good as the AWS example. For anything I could think that I would want AI-narrated content, I would prefer clarity over natural sounding nuances.
This is fantastic feedback! The "mashed up words" are something I'm trying to fix. And I totally agree that the AWS voice is much more clear. Maybe we can get to clear & natural before AWS?
Having "Better" in the title of your landing page probably isn't the best idea. This is such subjective territory that it's very difficult to say that one voice sounds better than another unless you have a blind poll with at least a few hundred people saying they preferred your product over AWS.
It doesn't matter if you can argue it's better; it only matters if you can show that the vast majority of your target demographic prefers your product over AWS, which I couldn't find any evidence of.
I respectfully disagree. I think it's great to have on the landing page. "Better" is a pretty clearly subjective term so I immediately know it's somebody's opinion. The fact that you can directly compare them also makes it so you can decide for yourself within seconds.
The "8x cheaper than AWS" feels deceptive though. The pricing is not apples to apples so it's only 8x cheaper at the most favorable point in the graph (when spending $1,000 a month and using exactly the number of characters offered). For me that's outrageously more expensive than AWS. It's more like 8x more expensive than AWS rather than 8x cheaper, which is such a dramatic swing I wondered if I was misreading the pricing somehow. My usage on Polly last month was about $75 but the month before was $5 and this month will be closer to $20.
It's a real shame because after hearing the output I was ready to move everything over. I should have looked at pricing before getting excited, but I took the 8x claim at face value.
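To put numbers on the "8x more expensive" impression, here is a rough sketch that compares a flat monthly minimum against the three pay-as-you-go bills mentioned above. It assumes the $249/month figure quoted elsewhere in the thread for the first non-evaluation plan, and it assumes that plan's character quota would cover this usage (the quota is not stated here). The names are illustrative only.

```python
# Assumption from the thread: the cheapest non-evaluation plan is a flat
# $249/month. Its character quota is not given, so this sketch assumes the
# commenter's usage would fit inside it.
PLAN_MIN_MONTHLY = 249.00

# Monthly Polly bills (USD) quoted in the comment above.
polly_bills = [75.00, 5.00, 20.00]


def cost_ratio(payg_bill, flat_fee=PLAN_MIN_MONTHLY):
    """Return how many times more the flat plan costs than a pay-as-you-go bill."""
    return flat_fee / payg_bill


# Ratio for each individual month.
ratios = [round(cost_ratio(bill), 2) for bill in polly_bills]

# Ratio over the whole three-month period.
overall = round(PLAN_MIN_MONTHLY * len(polly_bills) / sum(polly_bills), 2)

for bill, ratio in zip(polly_bills, ratios):
    print(f"${bill:.2f} on Polly vs ${PLAN_MIN_MONTHLY:.2f} flat: {ratio}x more expensive")
print(f"over all three months: {overall}x more expensive")
```

Under those assumptions the flat plan comes out about 7.5x more expensive across the three months, which is in line with the "more like 8x more expensive" reading for low, spiky usage.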
I use a lot of AWS Polly neural and have listened to a hundred or so hours of Polly output (building an MVP/prototype). The cost of AWS is high (for me as an independent dev) but I've tried several other services and none have been as good at making natural speech (the kind one could tolerate for an audiobook for example).
If this was cheaper or if I could buy a small time license and run it on my own hardware (which takes the heavy costs away from Unreal) I would totally do that. Alternatively, I'd be willing to pay close to AWS pricing for a pay as you go, with a one-at-a-time rate-limit in place (to avoid the scaling/provisioning challenges on the ops side). I know it's not likely to happen, but just wanted to throw it out there.
Right off the bat, the third sentence looks much more like what my data looks like, and the kind of text I wanted people to use it for. Where is the line from?
It seems like a very bad idea to use two famous public speakers' voices as training data when you clearly do not have permission to do so (no credit to be found, you simply renamed Jordan Peterson as "Professor"). It's hard to imagine that working out well as you try to sell API access. Seems like a legal nightmare waiting to happen.
Not directly related to this API, but I’ve begun wondering if the next revolution in speech synthesis is to integrate with a natural language model like gpt-3 in order to gain semantic awareness, and use that context to produce emotional expressiveness and inflection that is attuned to the meaning and tone of the text.
Imho, the next revolution in speech synthesis will come from using guided diffusion models, leveraging the recent breakthrough in image synthesis (Dall-e), to generate spectrograms (spectrograms are images).
This slower generative approach would make it possible to produce large, high-quality, enriched audio datasets with parametric text, timings and emotion.
Then you use these datasets to bootstrap the existing traditional architectures in a supervised fashion to make generation faster.
The usual problem of text-to-speech is that you have to go from a low-information space (aka text) to a high-information space (aka sound). Training is therefore ill-defined because one input text can have several correct sounds. But once you have an enriched text with inflections, parameters and a speaker embedding, the mapping becomes one enriched text to one exact audio, and the training becomes well-defined and easy.
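One way to make "ill-defined" concrete is to assume a plain squared-error regression loss (the comment does not name a loss, so this is only an illustration). Here $x$ is the text, $y$ the audio, and $z$ a hypothetical enrichment vector (inflection, timing, speaker embedding):

```latex
% The squared-error minimizer is the conditional mean of the target.
f^{*} = \arg\min_{f}\; \mathbb{E}\,\lVert y - f(x) \rVert^{2}
\quad\Longrightarrow\quad
f^{*}(x) = \mathbb{E}[\, y \mid x \,]

% If the audio is fully determined by the enriched input, y = g(x, z),
% the conditional mean collapses to a single target.
\mathbb{E}[\, y \mid x, z \,] = g(x, z)
```

With text alone, the optimum is an average over every valid rendition of the same sentence, which tends to sound blurred. Once $z$ pins down the rendition, there is a single target left to fit.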
Cool - I probably shouldn’t be surprised that the smart minds in machine learning and natural language processing are well ahead of me, an interested lay person!
Any good links or HN comment threads that you’d recommend?
Oh, got it. You mean it doesn't pronounce "$99/mo" correctly? Yeah, there are quite a few cases like this where we don't translate symbols, abbreviations, etc quite right, yet. But unfortunately, in this case, I spelled out "$99 per month" but it still sounds awkward.
There was a technical issue with how it sounded, but yes, the underlying point was that you need some kind of pay-as-you-go model or something if you can't offer a free tier.
Don't get me wrong it's pretty cool tech if you trained your own model.
I'm also having a hard time coming up with a use case. Outside of spam and games, I'm having a hard time seeing where generative AI is useful at all anyway.
The service is for businesses. For them, tens of millions of characters per month is not much at all. And they'd integrate the API and provide services to end-users/consumers. I can see how one could find it disingenuous, though, although I don't 100% agree.
I have to say that this is the most insane thing I've seen on the internet this week.
And I'm not talking about the quality of your product. I mean that using Jordan Peterson without consent or reference is not bold, as others are saying, but is verging on criminal.
> thanks for a great question. Candidly speaking, I’d say we’re taking a “move fast” and “seek forgiveness later” approach. The plan is a) to try to apologize and get a license if there are demands and if that doesn’t work b) create new synthetic voices that are not of real people.
This infuriates me to no end. People with that kind of mentality fuck up the entire startup ecosystem for everyone.
Apparently the creator has been down this road before and got told off. Apparently a firm "stop using people's voices without permission" was not sufficient.
So that makes this intentional and the excuse of "I didn't know any better" no longer plays.
Same thing. I immediately recognized Gary Vee. And I imagine that using generated audio that sounds exactly like him without his permission is a gray area at best.
You're intentionally missing the point. This is wrong and you know it. Impressions are different than building a model that can say anything using someone's voice, especially if they're recognizable and/or relative public figures.
It's like someone selling stolen credit cards and then claiming they're doing nothing wrong because people who are buying them are the ones committing the crimes.
Hmm, I've been in the space for a bit, and I think it's not unsafe to say I picked the best voice AWS provides. I could've implemented a multi-speaker feature for AWS, but I just didn't get a chance. I did try IBM, but it sounds worse than AWS?
If you build a product, no matter what, you have to be honest with yourself, and imho most of the neural voices from Azure sound better than your example. They may miss some of the timbre of your voices but the timbre comes from the examples you fed it... tbh it's not much better than doing it yourself with something like https://github.com/neonbjb/tortoise-tts
Well, sure, I mean MS is a 2T company with 180K employees. So I wouldn't be too surprised if theirs sounds better than mine. The tortoise tts repo seems pretty random though. Are you trying to promote something of your own or something? haha
English only, no SSML. This was probably the least minimal MVP I've built so far, though. Gotta still iterate. SSML is prob more important than non-English for now.
I actually like the Gary V one. I think using voices that people are very familiar with is helpful for showing the ability of the product. It allows me to A-B the voice with how I know Gary's voice is in my head.
Just pick somebody completely neutral with no political slant whatsoever. There will unfortunately always be people - on each side - who are rather sensitive.
Gary Vaynerchuk (GaryVee) wrote Crush It and is pretty synonymous with hustle culture. He loves the grind and promotes that as the way to move up in the world, if people so choose.
I guess people get upset when someone on the internet (that they could easily ignore, just as you have) tells them to work harder.
He also loves promoting arbitrage plays... so I kind of blame him for the insane secondhand markets on a ton of normal stuff. Everyone is buying up all the stock and trying to flip it for a profit so they can be like GaryVee at garage sales. This does bother me. I went to go buy a new pair of shoes I bought 3 years ago for $90, but no one has them. I looked on eBay and they are going for $600. That's madness. It's not just Gary's fault though, it's the companies that opt for "drops" and hype over scale and actually meeting demand. But that's a whole different rabbit hole.
Both people have a lot of fans, but also get a lot of hate from a particular demographic. I assume the majority is indifferent.
I don't think the point was that the voices are controversial (that can be debated) but rather the questionable legality of using essentially a deep fake of a celebrity's voice and selling it.
I was hoping to like this but the pricing is a total deal-breaker for me.
Personally, AWS' and Google's pay-as-you-go plan win every time regardless of the "better" and "cheaper" claims IMO. You must introduce pay-as-you-go plans if you really want some good traction. I really like the Entrepreneur voice (the professor sounds like Jordan Peterson)
One more thing: I think that in the comparison you have sampled the AWS non-neural voice while using the price of the neural voice for the price comparison. This does not sound like a fair comparison to me (but please correct me if I'm wrong and I'll edit my comment).
I agree and I'm saddened that it's a deal breaker for you. Unfortunately, we do not have infrastructure to offer pay-as-you-go (actually pretty challenging ops). We're forced to target a smaller niche today and expand in the future (i.e. offer pay-as-you-go).
But for those who are already spending, let's say, $250+ a month on TTS, this is a sweet deal. They are my initial target customers.
It seems more straightforward to me in cases of imagery, where there is less room for ambiguity. How different does a voice have to sound to not "be" another person's voice? I believe the laws on the books today are intended to deal explicitly with the actual person's voice recorded by some microphone. I think once you move into the realm of "this is a voice, generated from a model trained on audio samples of some person's voice" then it becomes unclear whether existing laws apply. If you further add into the equation that some models are trained on several speakers then it gets even muddier I think. If you advertise "this is person X" then I think it becomes problematic, but for different reasons, since at that point you're using the person's name to advertise your product.
IDK what you mean about biometric data being protected. I'm pretty sure there's no law stopping me from pulling your fingerprint off of a coffee mug you left at Starbucks, creating a high resolution scan of it, and posting it at godmode2019-fingerprint.com
EDIT: after some quick Googling it looks like there are some biometric privacy laws on the books in certain U.S. states that would prevent something like godmode2019-fingerprint.com, but it does not appear to be comprehensive across the US. Not sure about other countries.
Cheaper is good, but the quality is much worse and your first demo is very obviously Jordan Peterson, who has a famously annoying voice and some would say personality.
EDIT: OP clarified the Random feature uses previously submitted inputs. So, this was unintentional. It's just unlucky I hit that particular sample lol
The voices sound fine (some are worse than AWS, but some are indeed better). However, as a queer person, putting in sample texts like these[1] kinda put me off from your product no matter how good they are. IMO, that's absolutely uncalled for.
On a professional note, it's very immature (and silly even?) to use a product page to voice hostile idiosyncratic political opinions in general (regardless of whether I agree with them or not).
You're of course entitled to your opinions, and welcome to market your product however you want, though; I'm not trying to encroach on that.
Hey, thanks for the comment. The "sample" must've been pulled from the "random" feature. It basically shows texts other people have tried. This feature might have been a bad idea. I'm sorry if it offended you in any way. Not sure if I have the ability to moderate the content. Do you think it's best if I remove the "random" feature altogether?
You need to make it pull from Wikipedia or some other semi-moderated source, else your Random button is going to turn into Microsoft's Tay real quick. I also notice your Professor voice is pretty clearly Jordan Peterson, which some people may have a problem with.
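A minimal sketch of that suggestion, assuming the Random button can be pointed at a hand-vetted pool instead of raw user submissions. The pool contents and function names here are hypothetical, and nothing is known about how the site actually stores samples.

```python
import random

# Hypothetical curated pool. In practice this could be vetted excerpts from a
# semi-moderated source such as Wikipedia, as suggested above.
CURATED_SAMPLES = [
    "Speech synthesis is the artificial production of human speech.",
    "A spectrogram is a visual representation of the frequencies in a signal.",
    "Text normalization expands symbols and abbreviations into spoken words.",
]


def random_sample(rng=random):
    """Return a demo text drawn only from the curated pool."""
    return rng.choice(CURATED_SAMPLES)


def record_user_text(text, pending_review):
    """Queue user-submitted text for manual review instead of publishing it."""
    pending_review.append(text)


# User input never reaches the public pool without a human approving it.
pending = []
record_user_text("some text a visitor typed", pending)
print(random_sample())
```

The point of the design is that user input and publicly displayed text live in separate lists, so a troll's submission can only ever end up in the review queue.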
This feature is definitely a bad idea. People have been putting stupid stuff into text-to-speech since the late 90s with iMac G3s.
If you're looking to position yourself as an inclusive company, don't regurgitate text put in by previous users. Because idiots on the internet will idiot. And that idiocy now is cosigned with your company name and logo.
If you're trying to sell this product and showing text that random people from the internet have inputted, then yes this is a terrible idea. It's only a matter of time before "Hitler was right" is displayed on your site. Do you want the trollings of dorks associated with your brand?
Can someone explain what the hell "Vichy" means in that context? My own brain and search engines seem to immediately jump to Vichy France, but that's just nonsensical unless there's a really contrived analogy.
I'm a homosexual person that doesn't give a flip about pride month or anything that's "gay culture", but that's overtly hostile. That person needs mental help.