Hacker News
Apple Books digital narration (authors.apple.com)
300 points by alienreborn on Jan 5, 2023 | 234 comments



The "Helena" sample contains a pretty good test case of this system's ability to guess where emphasis and pauses go:

> I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild.

(I checked in an online sample of the ebook: there is no punctuation in this sentence.)

Unfortunately the AI completely faceplants, placing an enormous pause right in the middle of the phrase "all places wild". It actually changes the meaning of the text, making it sound more like "...the beauty he saw in all places. Wild!"

I wonder if any of these AI speech synthesis tools come with an editing tool that you could use to tell it not to put the pause there.


I guess this is more a case of garbage in, garbage out. The original sentence itself is not well-structured. But people who write like that and don't edit it won't care how the AI reads it.

I don't feel like this is a product for carefully producing audiobooks, but to create them by the pound, so to speak. I'd say it's a move for the "make your own business through audiobooks" people [1] -- very strange for Apple.

[1]: I didn't know this audience existed until I saw this video on it (and the cons that happen) from Dan Olson: https://www.youtube.com/watch?v=biYciU1uiUw


To me this feels like the automated video slideshows that Apple Photos (and undoubtedly Google Photos) makes for you. Perfectly fine, but indeed even on a casual watch you notice mistakes/imperfections you'd never make if you were producing such a slideshow manually.

But that's the thing...is "perfection" worth 3 hours of video editing for something you casually consume?

I think almost any audiobook listener will vastly prefer a serviceable but imperfect audiobook when compared with no audiobook at all.


I find those slideshows unintentionally hilarious - ten photos of my kid interspersed with a flash photo of the back of the washing machine and some cable that's hanging down underneath my car, an accidental screen grab of a text message, all treated with the same importance and jaunty library soundtrack.


This made me LOL, as I’ve had this happen many times. But much less so in the past 2 years, around the time iOS Photos started integrating with Apple Music.


Actually now you come to mention it I haven't noticed it very recently - maybe I've stopped looking at those suggested slideshows, maybe I've turned a setting off without realising it.


> But that's the thing...is "perfection" worth 3 hours of video editing for something you casually consume?

Any book of any length took countless hours to write and edit. Yes, I think it's worth a bit of extra time for a human to go through and read the thing aloud.

If the alternative really is no audiobook at all... okay, I guess something is better than nothing. But on the whole, I'd like publishers to just record more audiobooks, and I'm concerned this technology will result in fewer "real" audiobooks being produced.


Agreed. I bought an old Kindle for traveling because I could use its TTS to listen to ebooks. Now that's a common feature of most reader apps, yet at the time it was rare and even newer Kindles had dropped TTS.


I can understand that sentence fine, I don't see anything wrong with it.


Writer here, it's a terrible sentence.

"I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild."

"Deep currents of spiritual yearning" is both cliched and unspecific / unclear.

"That" is superfluous.

It's unclear how Bill is expressing the beauty he sees, and the sentence structure implies he's somehow responsible for the natural beauty.

"All places wild" is wilfully awkward and anachronistic.

There are countless better ways to say the same thing. For example the tone would be similar and the sentence more concise just to say:

'When Bill spoke of the beauty of nature, I could sense it inspired spiritual feelings in him.'

Or simply: Natural beauty inspired in Bill a yearning for connection to something spiritual.

Neither are great - because the central thought is unclear. Writing is to a large extent the process of expressing a thought or feeling. Clarifying exactly what one wants to say is a central part of writing and editing. The author seems to have failed to clearly define their idea or emotion, so its expression is decorated rather than clarified however you phrase it.


Could you do Hemingway, McCarthy and Joyce next? It's untenable they wrote 'unspecific / unclear / superfluous / awkward / anachronistic' language such that our robots cannot render to speech.


They write literary fiction, though. The selected passage is listed under "Nonfiction/Self-development"


"The Distant Shore: Stories of Love and Faith in the Afterlife"

It's a book of stories about her dead husband. Non-fiction, yes, but it's not exactly "How to Lose 10 Pounds in 10 Days".


You might be right with some of these criticisms about the ideas and style of the sentence, although I think most of these come down to taste and context.

Grammatically, the sentence is easy to parse and a native English reader would understand how to say it out loud.


Native English speaker here: I flubbed the first time I read it - specifically the 'all places wild' bit at the end.


> Writing is to a large extent the process of expressing a thought or feeling.

Agree.

> Clarifying exactly what one wants to say is a central part of writing and editing.

In professional/technical writing, sure? Authors have been known to have different writing styles, and the style may be ambiguous or convoluted to intensify the thoughts or feelings the writer is trying to convey.

PS everyone posting here is a “writer”


It doesn’t matter if the sentence is terrible writing. It can still be easily read out loud correctly, but not by this computer program.


Context for the sentence: https://books.google.com/books?id=-v4GEAAAQBAJ&pg=PT20&lpg=P...

> Linda Hale Bucklin made a pact with her husband, Bill, to communicate after death. Here she shares her personal experience and other stories of love and faith in the afterlife.

Bill died after 40 years of marriage.

The rest of the paragraph:

> Bill loved and found solace in nature. Often, we voiced our wonder of the sight of a full moon peeking above the hills of Stinson, rising heavily until it pushed free of the horizon, a perfect circle in the dark sky. On nights with a new moon, we would walk to the end of the beach to find our favorite constellation, the Pleiades. I know Bill carried within him deep currents of spiritual yearning that he found easiest to express through the beauty he saw in all places wild.


Mind if I ask where you're from? While not really adding anything or altering the meaning (so technically it could be called superfluous), removing it here to my British brain makes the sentence sound lazy and I was definitely taught in school that it should be there. I do also agree that the sentence is terribly written in general though.


Sure. I'm Irish. I've been a professional writer for about thirteen years. Formality of sentence construction is context dependent. In prose there's usually little purpose and no solid grammatical rule for retaining superfluous words. It depends on the pace and rhythm of the piece, which in turn dictate interesting things like the reader's perception of time. For example you can make time flow faster by using brief, truncated, staccato sentences. Or stretch it out with more formal, grandiloquent sentences. You can also convey the informality of a relationship, or even the physical structure of space or an object in much the same way. These are elements of voice - the character or tone of the sentence or paragraph, which is contextualised by the overall piece. I'm not articulating theory here - more trying to convey how I and other writers intuitively learn to play with prose.


Fantastic answer, thank you. Thinking about it, it's literally decades since I last wrote creative prose, maybe my formal writing muscle memory was just kicking in here.


Reader here, it's a perfectly fine sentence.


> Writing is to a large extent the process of expressing a thought or feeling

Neither of which we believe a machine to be capable of conceiving.


> "That" is superfluous.

When I write, I always use the trick of reading the sentence without the "that"... if it still makes sense, then you don't need "that". Mostly.


I don't find anything terrible about it. I'd probably break it into two sentences but that's mostly down to my style.


I also thought it was a bad sentence, but thank you for giving me solid reasons why that is so.


Thanks for this. Can you do Go Dog Go next?


>But people who write like that and don't edit it won't care how the AI reads it.

Now human language, limited as it already is, is to be humbled before machines that humans have also invented. In our inability to create a machine capable of doing cognitively what humans do, we prefer that humans function as if they had been lobotomized, in deference to our crude machines.

We have built god, and god is stupid, and we bow before him because god has been created, once again, in the image of man.


"Our algorithm isn't stupid, it's the literature which is wrong" is not something I expected to read today.


Have you ever read a written passage out loud and failed because the text was too contrived and the nature of the intonation only became apparent after re-reading the sentence multiple times?

I have, plenty of times. And in most of those times I got really angry at the author for writing "wrong".

So yes, to me saying that the literature is wrong is nothing unexpected.


It depends. Is it technical writing? Then I agree with you. But literature is meant to make you feel a certain way, to paint a mood or a picture, or even to experiment with language.

The "AI" has trouble with that because it's fundamentally a machine for recombining already existing information.


It’s more like “the algorithm is stupid, so it can’t fix bad literature in its brain like humans do”


> Unfortunately the AI completely faceplants, placing an enormous pause right in the middle of the phrase "all places wild". It actually changes the meaning of the text, making it sound more like "...the beauty he saw in all places. Wild!"

I disagree for 2 reasons:

1. There's a perfectly fine reason to put a pause between "places" and "wild": to put emphasis on "wild". Bill doesn't see beauty in all places, but specifically in all wild places.

2. Interpreting the narration as "[...] all places. Wild!" is farfetched because the narrator pronounces "wild" very calmly and softly.

I agree the pause is a bit too long, but I was expecting way worse when I read your comment about how "the AI completely faceplants".


I disagree with both reasons.

1. A long pause between "places" and "wild", to me, signals their dis-association, that "wild" does not link with "places". However, the lack of punctuation in the written text implies the phrase "all places wild", the "all wild places" you refer to. I'm with the GP here, the AI didn't convey the meaning I'd expect from the text.

2. Also, the preceding text seems to discuss a certain "ineffability", a spiritual/magic in the world that seems diffuse, broad and subtle. With that context, pronouncing "wild" calmly and softly ties it to the earlier ineffability rather than the more discrete "places". Again, this reinforces the "...places. Wild!" interpretation. I am very impressed the AI used two or more modes of expression (time, tone) to express a feeling... but I disagree that's what the text held.

Maybe the AI's smarter than me.


> A long pause between "places" and "wild", to me, signals their dis-association, that "wild" does not link with "places"

But that long pause is still way shorter than the pauses at actual periods! Look at the waveform[0]: the ovals are the periods, the rectangle is the pause between "places" and "wild". I guess due to the length of the pauses at the actual periods, my brain automatically discards the possibility of "all places. Wild!" and then the best interpretation clearly is "all places wild" for me.

But hey, the fact that at least two people interpreted it differently says something. Maybe this was more of a faceplant than I initially realized.

[0] https://imgur.com/a/j9CtZFZ
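
For anyone who wants to repeat the comparison with numbers instead of eyeballing a waveform, here is a minimal sketch. It assumes the audio has already been decoded to mono float samples in [-1.0, 1.0]; the function name `find_pauses` and the default threshold and minimum duration are made up for illustration, and a serious tool would more likely use windowed RMS energy than a per-sample test.

```python
def find_pauses(samples, sample_rate, threshold=0.02, min_duration=0.15):
    """Return (start_seconds, duration_seconds) for each quiet run.

    A run counts as a pause when |sample| stays below `threshold`
    for at least `min_duration` seconds. Both defaults are guesses.
    """
    pauses = []
    run_start = None  # index where the current quiet run began, if any
    for i, s in enumerate(samples):
        quiet = abs(s) < threshold
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            duration = (i - run_start) / sample_rate
            if duration >= min_duration:
                pauses.append((run_start / sample_rate, duration))
            run_start = None
    # Handle a quiet run that lasts until the end of the clip.
    if run_start is not None:
        duration = (len(samples) - run_start) / sample_rate
        if duration >= min_duration:
            pauses.append((run_start / sample_rate, duration))
    return pauses
```

Comparing the duration found between "places" and "wild" against the durations at real sentence ends would give the same ratio as the picture, though it still says nothing about how a listener hears it.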


Empirical evidence of a pause and its length is not indicative of how the typical listener will interpret it.


Maybe this is just me, but I find it really unnatural to pause in the middle of a short phrase like "all things wild". I'd emphasize "wild" by putting stress on it, not by pausing.

But this excerpt is the end of a paragraph that begins with "Bill loved and found solace in nature." and describes taking walks and looking at the moon. This doesn't support emphasizing "wild" because the author has already established that information; the important part of the sentence is "deep spiritual yearning", or maybe "easiest to express", since the author then goes on to discuss how, after he died, Bill expressed himself from beyond the grave in other ways.

I could kind of understand the AI not quite getting the emphasis right, since that's a judgement call that requires a lot of context from the rest of the book. But breaking up "all places wild" like the sample does suggests that it doesn't understand the basic grammar of the sentence.


I'm guessing that the main reason they require you to go through their "preferred partners" is that their job is to insert annotations that the speech generator needs to make it sound good.

I wonder if this is because it's difficult work, or if the tools aren't user friendly enough to put in the hands of untrained users. If it's the latter I suppose that sooner or later we won't need to go through the partners.
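
Nothing on the page says what format Apple or its partners actually use, so the following is only a sketch of what such an annotation could look like. It assumes W3C SSML, where `<break strength="none"/>` asks the synthesizer not to place a prosodic break at that point; whether any given engine honors it varies. The helper name `glue_phrase` is hypothetical.

```python
from xml.sax.saxutils import escape


def glue_phrase(text, phrase):
    """Wrap `text` in <speak> and mark `phrase` as pause-free.

    Inserts <break strength="none"/> between the words of `phrase`,
    which in SSML requests that no prosodic break be produced there.
    Only the first occurrence of `phrase` is annotated.
    """
    if phrase not in text:
        raise ValueError("phrase not found in text")
    before, after = text.split(phrase, 1)
    # Escape each word so characters like & or < stay valid XML.
    glued = ' <break strength="none"/> '.join(escape(w) for w in phrase.split())
    return f"<speak>{escape(before)}{glued}{escape(after)}</speak>"
```

Calling `glue_phrase("the beauty he saw in all places wild.", "all places wild")` produces markup with a no-break hint on either side of "places".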


The sample is taken from a full audiobook that is currently available for sale, so you'd think they would've put an annotation on "all places wild" if they had that ability.


Perhaps this bug could be a feature?

  A panda walks into a café. He orders a sandwich, eats it, then draws a gun and fires two shots in the air.

  "Why?" asks the confused waiter, as the panda makes towards the exit. The panda produces a badly punctuated wildlife manual and tosses it over his shoulder.

  "I'm a panda," he says at the door. "Look it up."

  The waiter turns to the relevant entry in the manual and, sure enough, finds an explanation.

  "Panda. Large black-and-white bear-like mammal, native to China. Eats, shoots & leaves."
https://en.wikipedia.org/wiki/Eats,_Shoots_&_Leaves


By normal sentence construction it would be all wild places. It’s a good test as you say, the author is having fun with grammar to give you the idea that it’s a subset of places rather than “wild places”, so I would expect it to be written with a link between places and wild, “all places-wild.”


I've seen this construction used in a lot of places (like the name of "All Things Digital", the predecessor to Re/code), but I have never heard of anyone putting a dash between the last two words.


And might I add, Helena does not sound like a Soprano given the speech tone. (Also, smoking is bad mmmmkay)


A number of books I've been reading aloud leave out the comma after a prepositional phrase (and friends), and it totally throws off my cadence.


Will we see this level of voice synthesis in the public domain? Maybe I am out of touch but I found those examples very impressive - more impressive than the jobs vs rogan demo a few months back.

But I am also saddened at a future where all this is locked up in corporate hands - obviously there is money needed and (licensed) data needed too which Apple can get at.

Honestly I would rather eschew the ethics of it and just consume any and all voice data (youtube, podcasts, existing audiobooks, radio) that has transcripts available, perhaps because I assume corpos are already doing this, if it means we can have a free and open data model that people can run at home, maybe that makes me evil.


> saddened at a future where all this is locked up in corporate hands

I would guess this rolls out from big companies first because the first version is always the most difficult. It’s only going to get easier to do and I would totally expect end-user controlled TTS systems to get better and eventually exceed the capabilities of this version from Apple. Of course Apple isn’t going to sit still, so they will continue to improve as well.

Are there examples from ten or twenty years ago of a technology that big companies had locked up that never made it out to end users? What we have might lag, but it seems like this stuff only ever gets easier to do.


Linux is a prime example of a technology that was at first far behind its proprietary counterparts but eventually dominated and nearly extinguished all non-free competitors.


For very specific uses. Linux is not a good option for general computers used by everyday people. That experience is still owned by large corporations.


There are only two cases when you can't use Linux as your daily driver:

    * You must use proprietary software not supporting Linux
    * You must use a device with proprietary drivers not supporting Linux

For me, it works flawlessly, since I avoid both cases.


as AI requires larger and larger data sets and processing power (energy, money), how will public domain catch up to their wealth accumulation? it's not like linux where chipping away at functionality and incremental UX adds up sufficiently over time


My guess is that such "digital narration" is on the brink of becoming available as a service to authors and publishers, with Amazon and Apple trying to get ahead of it by selling a product that can only be published on their platforms then. They are surely able to undercut human narration, but even on digital narration they have an unfair advantage of earning a share on every sale as well.

Will be interesting to see how this develops. Either independent digital narration becomes competitive enough that a publisher simply gets it done once and then sells it on all platforms, or this new platform-exclusive model is so disruptive that it becomes even less economic to produce an Audiobook, effectively making Audiobooks exclusive to Apple and Amazon/Audible (and whoever else has such a digital narration engine).


I'm curious what the economics are here, because getting someone to just read a book can't possibly be that expensive, right? I'm not talking celebrity voice-over work, but just getting anybody with a passable voice to sit down and spend a day or so reading a book into a microphone? Does that really cost that much more than having someone sit down and listen to the whole AI-generated audio book to do QA? And then there's all of the engineers who have to work on the project, the hardware to run all of it, etc. Seems like if it takes months to do and has to be QA'd anyways, it can't be that much more cost effective. Now, if it's completely computer-generated and can be a push-button feature? Sure, that makes sense. I wonder how close they are to that.


You need to take into account that nobody can perfectly read a book first time. I've done a reasonable amount of short-fiction narration for audio magazines, and even with many hours' worth of script reading under my belt I still fumble every third or fourth sentence. So I need to re-read those, and then I (or somebody else) need to find those errors and edit them out, replacing them with the fixed audio. Then people need to listen to the entire file at least once to ensure the whole thing makes sense and I didn't leave out a sentence somewhere or something.

Audiobook narration is one of those things that is remarkably labour intensive, certainly much more than I'd have guessed before getting into it.


This was demonstrated vividly when Andy Serkis read The Hobbit unabridged live for charity then did a separate recording as a commercial release. Both were great, but you could see how many minor flaws were in the live recording that would have had to be re-recorded in the commercial version. (I’ve no idea how long it took to record the Lord of the Rings which he did a year later)


If you're talking about using a trained voice actor, yes, they will cost an hourly rate to do this, which is reflective of the care and training they put into their craft. One should also expect that they don't simply press record, and then do an entire read through. They will go back and try different takes on segments.

If we're not talking about a trained voice actor, you may want to look at something like LibriVox, which is entirely volunteer run on public domain works, and while the efforts are appreciated, the quality is noticeably different.


As an Audible customer I can tell you clearly that I often buy audiobooks by a particular narrator (discovering new authors) because the narrator is so good.

I mean, Siri is quite good at reading texts (I imagine that's a huge training corpus) but I think we'll be in "uncanny valley" for quite a few years.

It's possible that the public just gets used to that.


Same! I remember being really impressed by Justine Eyre in something, and then being a little disappointed to find that 99% of her work is romance novels.


This is achievable with a very modest (say 50-100k) budget for voice actors and compute. Less if you’re happier with lower quality. Speech synthesis is probably one of the few areas in ML that’s trivially accessible to smaller orgs.

Even Stable Diffusion was only 600k which is hardly outside the reach of a startup. The only ridiculously expensive models reserved for the big end of town are the GPT3 etc language models, and I fully expect the data/compute requirements to come down considerably in the near future.


This is already easily achievable on consumer hardware, you can train something like Tacotron 2 + WaveRNN on your own computer to achieve similar, if not better results. Check out this repo:

https://github.com/coqui-ai/TTS

You can also clone someone's voice by finetuning a pretrained LJSpeech model and training a vocoder from scratch, I've had great success with as little as 15 minutes of speech.


It really isn't though. The level of fidelity that Apple is demonstrating in those samples is very impressive. You can generate fine voices with little work using repos like that one, but to get to the level Apple has takes a lot of work.

EDIT: "fine voices" not "find voices"


> You can also clone someone's voice by finetuning a pretrained LJSpeech model and training a vocoder from scratch, I've had great success with as little as 15 minutes of speech.

Are you able to point to any articles to help get started with this please?


Unfortunately, I'm not aware of any beginner friendly tutorials.

The way I learned it was just by experimenting with various GitHub repositories (e.g. https://github.com/fatchord/WaveRNN or the one I linked earlier) but it takes a lot of trial and error. Might do a writeup at some point if I have time.


Check my other comment in this thread, you might try our dataset. :)


Will check it out, thanks


Not from the companies selling audiobooks IMHO.

They will use this technology to save money on human speakers. If they release it into the public domain we'll end up with ebooks that can read themselves aloud and they'll lose part of their income from audiobooks.

My Samsung phone can read ebooks with one of Samsung's voices right now, but it does an awful job at pauses. Basically, no commas. With a good voice I could turn each one of my ebooks into an audiobook.


I don't really think I could listen to an AI voice reading a book. Perhaps some technical stuff, but not fiction. Even the difference between a mediocre vs good voice actor is huge, and can mean the difference between finishing an audio book or stopping after a chapter.

Edit: to be very specific, a really good voice actor will take on different voices depending on which character is speaking, and will act out scenes realistically. I honestly can't imagine any AI being able to do that.


There is https://commonvoice.mozilla.org/en though I’m not sure where and how it is being used.


Common Voice is more about building a dataset of how people talk, especially with accents.

While there is/was a voice synthesis project at Mozilla it was rudimentary like 3 years ago


You can play with it on https://uberduck.ai/ and they have a very active Discord!


What exactly is "Open Source" about uberduck? It looks like a proprietary tts saas to me; no links to a git repo and the "developer" section just shows how to get an API key and hit their service.


"Once your request is submitted, it takes one to two months to process the book and conduct quality checks."

My guess is that these generated voices are far from perfect and someone has to go in and crank the algorithm to get a fair number of passages to not sound strange.

Even in the example Helena there is a word at the end of a sentence that sounds like it should be in the middle and has a bit of weirdness to it. Still, very impressive, I think better than I remember Amazon Polly sounding.


Why is it that we still can't have a perfect or near-perfect text-to-speech given all the astonishing advances in ML taking place? Is TTS an area nobody is really interested in or is it harder than generating beautiful pictures and sophisticated writings?

This thing by Apple already sounds way better than the best I heard previously (NextUp Ivona) but it is not an instant-result offline tool yet and that's sad.


It's an extremely hard problem that lots of people are working on.

The trick is that we have "pretty good" results for TTS as-is, but it has significant shortcomings that are more visible in certain use cases. The operative word is "prosody" - the cadence, rhythm, and pauses that are natural when speaking that are heavily dependent on context and content.

Prosody is incredibly important to making natural utterances - TTS models that do not model prosody end up sounding very "flat", which describes most of the heavily used TTS engines out there right now. This is less glaring for short responses like what you would get from a voice assistant, but becomes a huge grating problem when you try to do long-form text reading.

The trick with prosody is that it often requires information and context not contained in the text to be read. You would apply a different rhythm and stresses to a horror story than you would to a conference keynote speech, for example. It also requires a more sophisticated understanding of the content of text rather than simply its constituent words, in order to figure out proper stresses and pauses.

All of this is eminently solvable (as demonstrated here with the book voices) but is... rather difficult. I suspect we're not terribly close to a product where you can just feed it raw text (without annotating or otherwise providing additional data as context) and get a great result.


I wonder how effective it would be to feed the book to some other AI model first that reads the whole thing and figures out the necessary context that it can then go back and feed into the TTS model


I wanted to make a human-like reading feature for our language-learning software. Training a model isn't too hard using something like https://github.com/coqui-ai/TTS.

The weak link was the available free/open datasets. You needed a single speaker with a pleasant voice, 20hrs+ material from varied sources, recorded in a good recording environment with a good mic etc. For English, the go-to was LJSpeech, which doesn't fulfill all these requirements. I say 'was', as I haven't followed developments recently.

Last year we decided to make our own dataset with an Irish woman, Jenny. She has a soft Irish lilt.

Never got around to training the model, but I will upload the raw audio and prompts here in a few hours (need to pay my internet bill in town..):

https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...



This is great! Thanks for sharing. How much did this cost you?


Are visual generative models really that more advanced, or could this simply be an artifact of their usage?

With generative visual art, people usually spend considerable time fine-tuning the results, and we don't get to see all the prompts that didn't work out (except if the failure is notable in some way).

Try e.g. illustrating a book, but using only your first prompt for each image. I think the quality would be in the same ballpark as having Siri narrate the corresponding audiobook.


You’re describing the effects of familiarity with a subject.

Stable Diffusion / Midjourney etc look really pretty to the average person but on closer inspection they rarely hold up out of the box. If you’re an experienced artist you pick up on all the flaws right away.

ChatGPT and Copilot are similar. The answers seem confident, but the more familiar you are with the domain of the answer, the quicker you see how flawed the results are.

Now going back to TTS. You’ve spent your whole life knowing what speech sounds like. Unlike those other models that require an extra level of domain knowledge, everyone innately knows the sound of humans speaking. So you’re effectively, and subconsciously, a domain expert.

This is essentially the uncanny valley effect but for other areas.


ChatGPT and Stable Diffusion aren't perfect. They still produce weird responses or visual artifacts sometimes. But, it can be easy to move past these idiosyncrasies.

I think the brain is just more sensitive to speech, because inflection and tone are key parts of communication. So even subtle artifacts in the generated voice are really obvious and annoying.

Plus, as another commenter mentioned, books are long. An issue in 1 out of 10,000 words will be enough to break immersion.
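
To put rough numbers on the 1-in-10,000 figure, here is a back-of-the-envelope sketch. It assumes glitches hit each word independently at the same rate, which is a simplification, and the 90,000-word book length in the usage note is an assumed round number, not something from the thread. The function names are made up for illustration.

```python
def expected_glitches(words, error_rate):
    """Average number of glitches in a text of `words` words."""
    return words * error_rate


def chance_of_glitch(words, error_rate):
    """Probability of at least one glitch, assuming independent per-word errors."""
    return 1.0 - (1.0 - error_rate) ** words
```

For an assumed 90,000-word book at a rate of 1 in 10,000, `expected_glitches(90_000, 1e-4)` comes to about 9, and `chance_of_glitch(90_000, 1e-4)` is above 99.9%, so a listener is all but guaranteed to hit at least one.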


I don't find it easy to look past their idiosyncrasies at all although they can produce impressive results with fiddling and luck.

Listening to these samples, they're still robotic sounding to me just listening for 10 seconds. I can't imagine wanting to listen to a whole book like this given the option of listening to an even modestly-competent voice actor.


My uneducated opinion on the matter is that we are more tolerant of subtle errors in pictures and writings than we are in sounds. Subtle variations of tone can change the meaning of a conversation that words on paper just can't convey.


As a person who has listened to a number of non-fiction books narrated by Microsoft Sam I don't really mind "subtle variations of tone" :-) This Apple thing will already satisfy me if they release it as an offline app for converting plain text files into audio files.


The pictures have weird limbs and the writing has errors. A book is long therefore there will be a lot of issues.


Because to understand intonation and rhythm you need to perfectly understand context and emotions. I don’t doubt these things will be added soon enough, so I expect perfect reading by the end of this year and perfect reading in anyone’s voice with a few samples in 2024.


> Why is it that we still can't have a perfect or near-perfect text-to-speech

Define perfect ;) Two different people will read the same text slightly (or not slightly) differently.

A great example is this brilliant and funny rendition of "To be or not to be" by Tim Minchin, Benedict Cumberbatch, Judi Dench, David Tennant and others. Sorry for the Facebook link, but it's very hard to find this video anywhere: https://www.facebook.com/watch/?v=585252039999241


> is it harder than generating beautiful pictures and sophisticated writings

I think one difference between pictures and audio is that pictures are two-dimensional and we can't take in the whole image at once. This makes it easy to overlook flaws without careful inspection. And I find that although there has been some amazing AI-generated art, there are still a lot of rough edges and tweaking required to get really clean images.

As far as writing goes, I suspect that the rules of written language are easier to learn and violations easier to overlook than with generated audio.


murf dot ai has near perfect tts. I think we had a major AI breakthrough in the last couple of years


"Digitally narrated titles are a valuable complement to professionally narrated audiobooks"

Yeah, right. What a lame attempt to deflect the (fully warranted) criticism that this will put audiobook narrators out of work.


I think the cat is already out of the bag, and the death of human narration is imminent.

The fight now seems to be whether this transformation happens only in production, or whether companies like Apple succeed in breaking the total Audiobook price apart into "license" and "production", buying only the license and having the production done on their proprietary servers.

Overall, I agree it's inevitable that this results in a sharp decline in professionally narrated Audiobooks...


> Overall, I agree it's inevitable that this results in a sharp decline in professionally narrated Audiobooks...

Or, it will increase demand for audiobooks so much that more humans are needed to create top-notch audio.


I don't know how badly narrated Audiobooks can increase the demand for Audiobooks as a whole.

The only scenario I could imagine is a narration language where Audiobooks didn't exist so far for economic reasons (i.e. low population). Digital narration could bring down production costs to the point of making it economic, basically creating the audiobook market for this language.

But then, if the narration is bad (which it likely is because TTS is worse in minor languages), I don't know how many users could be converted to pay a premium for a better human narration. Also here I think it's more likely that funding will be used to improve the narration engine as a whole instead of going back to hiring humans and renting a studio for each book...


One of the things I like is when the narrator has to suppress a laugh during a funny passage, or can express a character's anger or frustration.

Until AI is so good that it can mimic emotion, I think there will be a market for human narrators. Of course it will be smaller than what it is now, but I think people will specialize.


> One of the things I like is when the narrator has to suppress a laugh during a funny passage, or can express a character's anger or frustration.

I doubt there are big issues for an AI to verbally mimic emotion. Placing emotion correctly in a long narration might be tricky if there are no indicators in the text, but I'm sure there will be a convenient self-service authoring tool where the Author/Publisher can adjust the emotion with a slider if he wants to finetune the result...

> Until AI is so good that it can mimic emotion, I think there will be a market for human narrators. Of course it will be smaller than what it is now, but I think people will specialize.

A smaller market means higher cost per-unit, so higher prices per Audiobook. If the publisher needs to meet a specific price (i.e. to be listed on flat-fee audiobook-portals) he might be forced to produce digital narration as a default, which means the market for an additional "premium human narration" will have to prove itself first.

I doubt that such a bar will be reached in most cases. It's more likely that people complaining about bad narration will put pressure on AI-engines to improve, but not form a market where a critical mass will pay an additional $20 for human narration...


They won't be able to. Not enough people will reject computer-generated narration and insist on real narrators for them to remain employed.


Sad for the narrators, but good for the world.

There are so many books I have that don’t have an audiobook version because the economics just aren’t there.

This is an easy way that technology can expand human experience.

Even in situations where the author reads the book, I expect it will be cheaper to train an AI to sound like the author than to put the author in a studio for 50 hours (or whatever).

I thought it was a really dumb ruling when Amazon was forced to remove the text to speech function from kindle.

I also think that screen readers are hobbled to avoid this legal issue. I want to send any text through a narrator bot and have it read it to me. There is zero need to compensate anyone other than the developer who writes the AI (and hopefully it will have open source versions donated by developers).

If I’ve bought a book, I should be able to use it as I like.


To be fair, a lot of narrators are really not doing that good of a job. I frequently hear audiobooks that have been rushed through production - mispronounced words, strange cadence, and overacting. I'll take what I heard in that demo over ACX crap any day.

The current iteration of this technology is not competing with truly great narrators, like Tom Hanks or Jim Dale.


This is as valid a criticism as complaining the phonograph will put musicians out of work. Time marches on, adapt.


I'm mixed on that because while I appreciate the craft of a professional narrator, I support a group of users who are mostly blind and there's a constant tradeoff between availability and the quality of an audio book. People value good recordings – people often have favorite narrators and will select books based on that, sometimes even outside their normal interests (which wasn't something I'd previously appreciated) – but if it's something you want to read, having it now versus a year from now matters.


Some narrators never should have had their jobs to begin with. I'm viscerally angry at the narrator of "Permutation City" on Audible. Such a great book that could not have a more bored, disinterested narrator who clearly doesn't understand the text he's narrating.

An AI TTS engine at this level would do a far better job of it than that particular dude.


Huh? The professionally narrated audiobooks don’t get memory hole’d from the earth because Apple announced this service. The sentence you quoted is intended to emphasize that professionally narrated audiobooks will continue to be available on the platform.


Can they just train a model to narrate for them and fix the sections where the model makes mistakes?

TBH, human narrators on Audible sometimes just read the stuff aloud


So happy to hear Apple putting African American voices front and center for this initiative. Along with Google’s push to make camera lenses / computational photography more accurate for darker skin (and Apple following suit) this feels like a real step forward for inclusion.


Voices have a race?


Speakers of African American English have distinct timbre, rhythm, and cadences in their speech. There has long been a lack of these distinct features in TTS. Apple appears to have added a “Black voice” (though to be clear, there are speakers of African American English who are not Black) in 2021:

https://www.consumerreports.org/digital-assistants/apples-ne...


Often there are clear voice and accent differences. Are you saying there aren't? That would be a very strange statement.


Not so strange! ... at least when it comes to the legal community[1] (pdf).

For those around during the O.J. Simpson trial, this was a very, very contentious topic during the trial!

[1]https://cpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/3/2151/fi...


I see your point. And of course it's not a 100% accurate thing, but you know what I mean. Being admissible in court as evidence and furthering representation of a group are two completely different things.


Accent depends on where you were born/raised, not on your race.


For the purposes of this discussion, this is a distinction without a difference. “Black voice” is a term of art, meant to convey that the voices in question were trained on features commonly associated with African American English. The voices in the example are US English voices.

Race, ethnicity, accent, language — all of these are complex topics with plenty of nuances. Short of writing a dissertation, no simple reply on a web forum will appropriately capture them. The good news is, we don’t need to, since most Americans are familiar with African American English, whether they know the term itself or not, because African Americans have constituted a distinct culture in the US for all of the country’s history.


This answer might have been reasonable if the US were the only place on earth. When I hear Giannis Antetokounmpo speak English, for example, he sounds exactly like a Spanish/Greek person speaking English even though the native population of these places is white.


The US is Apple’s primary market and the focus of this campaign.


Are you saying that voices across all races, cultures etc sound the same?


No, it depends on the culture. Both whites and blacks raised in East London will have similar accents for example.

Since we are strawmanning, are you saying that cultures and races have a 1-to-1 relationship?


I seriously doubt you can tell someone's genetics (their race) based on their voice. On the other hand, you can tell a huge amount about the culture in which they were raised.


The OG Kindle with keyboard used to have text to speech too.

It was killed by publishers who wanted to charge separately for audiobooks.

If Apple has somehow managed to get the licensing for this, I might consider buying from Apple Books in the future.


The product is actually directed at authors, offering them to have an Audiobook produced which is "digitally narrated by Apple Books".

The Author still needs to hold the rights for Audiobook production, and he needs to license a third party to produce an Audiobook (no matter if human or "digitally" narrated).

I guess that's why this is aimed at "independent Authors", to circumvent negotiating Apple's rev.share and exclusivity for that production with established publishers...


> It was killed by publishers who wanted to charge separately for audiobooks.

Any sources on this?


Amazon decides Kindle speech isn’t worth copyright fight (2009) — https://arstechnica.com/gadgets/2009/03/amazon-backs-off-on-...

See also the recent lawsuit covering the other direction, automatic transcription of Audible books. https://www.geekwire.com/2020/amazon-owned-audible-major-pub...


Thanks. I never really used Kindle Speech on my 3G Kindle, but was curious why it was suddenly gone in later versions.


You are never buying from Apple Books, it's the usual DRMed crap, more like a rental. Amazon had gotten a lot of flak, but Apple is not any better in this respect...


If DRM works offline, it's not a rental. It's not desirable, but don't call it a rental, that just moves focus away from what the real problems are here.


Even if it works offline, it probably won't continue to work if you need to switch to a new device after their DRM servers are turned off.

DRMed content can never truly be purchased.


Most Apple and Amazon Books are DRM encumbered, but not all. AFAIK, there isn’t a way to tell before you buy the book except by choosing books from publishers that don’t use DRM on any of their titles.


licensing as in?


License/copyright for written form of book is different than the read form.

Author might sell the book rights to company X and audiobook rights to company Y. Company X can't do a text to speech version of their book without infringing on Y. Y can't do speech to text of their version without angering X.

Licenses are fun!


I guess I kind of wish they would just offer the AI narration as a feature of Apple ebooks. Such that, if you buy the book, you can have ebooks read to you by your phone. I am really just buying books off audible with the subscription they offer. There are some books (tech books that is) that are offered as audio books and I gobble those up. There are, however, many more epub/digital books that I never buy not because I'm uninterested in the content but only because I don't have the time to sit down. I assume that for said books the audience isn't large enough (and may never be) to merit anyone ever recording the audiobook.

There are certain books that I think I'll always buy the non-AI variant because narrators can bring more than natural reading, they sometimes bring different characters (sometimes more feminine, more baritone, more stereotypical accents) -- and I would melt if AI could do that kind of voice acting.


Amazon at one time tried to add voice reading to Kindle books. Authors were absolutely livid. Audiobooks are a significant income source, and taking that away from authors is going to make authors decline to sell digital books on your platform. Apple is doing this right by making it an author's choice.


I totally see this point — I’m making a separate one for ebooks that aren’t getting purchased because they haven’t been (and probably never will be) narrated by a real human.


I agree with you. Apple making this an author choice avoids some authors being angry while enabling more sales for lower-volume books that, as you point out, will otherwise not have a spoken version.


Yeah that would be cool and in a free-er market more friendly to healthily competitive innovation diversity, we'd already have natural-sounding narration built into every browser and reader; which would be an accessibility UX boon. But the publisher oligopoly wouldn't stand for it and there's not really much of an incentive for the marketplace monopsony-monopoly janus/jani to bake it into their products for free or a flat fee, even if the lawsuits from rights-holders standing to lose out on audiobook sales would be worth swatting away.


This should be a feature available for any text document. The existing iOS text-to-speech is almost barely adequate, but not really.


I use the iOS Speech Accessibility feature to listen to ebooks and it works great.


that is a good feature, it just seemed like the reading I'm hearing off of the samples for these audiobooks is a tad less robotic.


I'm skeptical given the state of the art.

There is way more good audio content out there than I have the time/interest to listen to and I can't believe I'm that atypical. And a book is a relatively big listening time commitment. I'll happily pay a few dollars more for a good human narrator.


A couple of comments from a narrator who's worked through ACX before.

First, the last few years have seen a race to the bottom for narrator rates, since during the pandemic it was recognized that it's a job that can be easily done from home, literally from anywhere in the world.

Accordingly, the up-front cost for an average quality 10 hour book is only about $1,500, and can be turned around in under two weeks from a human. If you get a really good and well known narrator, it's still only about $4,000 (and you'll probably get it quicker).

Also, they're going to be competing against revenue share models from Amazon/Audible, which basically means it costs the author nothing up front. Amazon's bite out of audiobook sales is absurdly high (60%), so other companies definitely could improve on that (and some are). It's mostly a fight against Audible's brand at this point.

But back to AI: AI narration is going to have to compete against humans willing to do a lot of work for very little pay. I'm honestly not sure the compute and QA costs will be competitive. And frankly, even if it is cheaper, it's not as if those savings will be passed back to the customer.

If you'd like to look at how little it can cost to get a human to do voiceover work, check out fiverr.com and look for voice actors and narrators.


Thanks for the insights.

That doesn't really surprise me. On the flip side, I can get high quality transcriptions for $1/minute (given good audio quality).

People, even those with better than average talent at some things, just often aren't that expensive. I suspect the same is true for some of the generative AI tasks that people are all excited about--new grad English majors are pretty cheap, especially if they can be assisted by search/generative AI.


Fully agree with this. I could understand TTS for quickly converting articles to audio (and of course for visually impaired ppl), but for books the current state of this tech doesn't interest me. The qualities I want from a good narrator aren't in these samples (correct emphasis within a sentence, variable pacing dependent on context). For fiction books, good narrators will change timbre and accents depending on who is speaking in the text, not clear if they tried to achieve this at all (could have potential to use a different digital voice entirely).

I hope that the results from this type of production are clearly labeled as computer generated in the store. I don't think putting "AB Apple Books" is clear or sufficient, for someone that doesn't know about this tech "AB" sort of looks like a placeholder for some unnamed human.


I tend to agree for the current product that Apple is releasing. IMO this technology starts to get interesting for books once folks can generate audiobooks for titles that do not have an audiobook (and likely never will due to publisher disinterest). When I first got into audiobooks I wanted to go back and listen to one of my favorite books and it wasn't available :/. I also see certain audiobooks described as "unlistenable" because of something the reader does.


I often find books I want to read that don't have audio versions or the audio version is for a different translation than what I want to read. So if you are looking for specific things to read the (eventual) use of this type of technology to open up some of those in audio format seems useful.

(But totally agree with you that this isn't going to replace a good human narrator.)


Indeed. But this option will only be available if a critical mass is also willing to pay a few dollars more for human narration.

In times of flat-fee Audiobook platforms the pressure to bring down audiobook production costs will only increase, funding a full-fledged Audiobook production for each book will only become harder to justify.

Moreover, looking at what Apple describes here, they seemingly want to establish digital narration (quality) as a metric for competition between Audiobook marketplaces, not publishers. So if this works out, the major platforms will compete on digital narration and publishers will have less incentive to actually produce an Audiobook with human narrators...


That's fair and it's true of a lot of AI/ML versions of content. I still paid for human transcriptions of podcasts when I was doing them because the time needed to clean up the ML versions just wasn't a good return. But the day will certainly come when that calculus changes.

I know nothing about the economics of audiobooks. And will note that there are free public domain audio books already https://librivox.org/. But TTS will improve and, at a minimum, improved TTS will be a benefit for people who can't read for various reasons.


Well, an intent of Apple seems to be to break the price of an Audiobook into license and production cost, take control of the production using AI and pay only the publishing license, instead of having to buy the rights to sell an Audiobook as a separate work of Art (because in the end, their engine will create the work of Art from the written word).

Sadly I don't see how this will make Audiobooks any better than human narration could. It's more about streaming platforms taking more control over the content and have experienced people train their proprietary TTS engine along the way.

Just to avoid confusion on Librivox: They offer Audiobooks of works which are already in the public domain (so not only the Audiobook is in public domain, also the rights for the book have already expired). So it's a platform allowing people to make free narration of already-free content.


I have a bunch of books on my "to read" list, that still don't have a narration. I would happily listen to an AI version as an alternative.


Or you could use the command `say` on the command line on any current mac to get good-enough text-to-speech.

See full script here: https://gist.github.com/ivanistheone/de3ccb244224d101bb93320... and this doc explains how you can set up a keyboard shortcut to turn any text selection into an audio book https://docs.google.com/document/d/1mApa60zJA8rgEm6T6GF0yIem...
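
For anyone who wants to try the basic idea without opening the gist, here is a minimal sketch, assuming macOS (where `say` ships by default). The function names `audio_name` and `text_to_audio` and the file name `notes.txt` are made up for illustration and are not taken from the linked script; `-f` (input file) and `-o` (output file) are regular `say` options.

```shell
#!/usr/bin/env bash
# Sketch: turn a plain-text file into an audio file with macOS `say`.
# Function and file names are hypothetical, not from the linked gist.

# Derive the output name: notes.txt -> notes.aiff
audio_name() {
  printf '%s\n' "${1%.txt}.aiff"
}

# Read the text file aloud into an AIFF file (macOS only).
text_to_audio() {
  local input="$1"
  if ! command -v say >/dev/null 2>&1; then
    echo "say not found; this sketch assumes macOS" >&2
    return 1
  fi
  say -f "$input" -o "$(audio_name "$input")"
}

# Example: text_to_audio notes.txt   # would write notes.aiff
```

The result is an AIFF file, so getting an mp3 like the sample would need a separate conversion step, which this sketch leaves out.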

Here is a sample if you want to hear what it sounds like: https://minireference.com/static/tmp/constructive_feedback.m...

which is the audio from this blog post https://productivityhub.org/2019/04/19/how-to-deliver-constr...

IMHO, the computer generated voice like Alex (the default voice on macOS) sounds better because it doesn't try to do inflections or add human character when it is reading. The real-world narrators (voice actors) seem to add too much "character" into their reading, which distracts me from the story/content. The only exception is when the narration is done by the author, in which case I'd consider the narration as part of the work.


I personally find that lack of character and inflections has completely turned me off of audiobooks in favor of podcasting. The typical monotone audio narration causes me to zone out into other thoughts and I find myself rewinding or just turning it off.


I've experienced that too, but only for "bad writing."

I'm normally able to follow narrative (both fiction and non-fiction) that has something to teach, and also enjoy listening to classic literature no problem...

But sometimes I'm reading a long article from the internet and I experience what you describe (losing track of what author is saying, having to rewind to get the point). After a while, I realize it's not the computer's fault, but the article is just very low content (e.g. some authors just pile on words, emotions, opinions without a coherent narrative or point). Recently I noticed I'm able to detect GPT-generated text this way too... words without content or message.

Perhaps the monotone TTS can be a test for the "meaning" contents of a text.


If you're still interested, give graphic audio a try. They're full-cast (usually a different reader for each character) high production quality audiobooks. They cost accordingly too though.

https://www.graphicaudio.net/


TIL. This is an interesting capability of the command line. Have any more fun ones? (at least fun to a CL noob)


Here is another script, `getmp3.sh`, that you can use to download an .mp3 file from any YouTube music video:

   #!/usr/bin/env bash
   # Usage: getmp3.sh <video-url>
   echo "Downloading mp3 from $1"
   # -x extracts the audio track; --audio-format mp3 converts it to mp3
   yt-dlp -x --audio-format mp3 "$1"

You'll need to install yt-dlp (https://github.com/yt-dlp/yt-dlp#installation) before you can use it. As you can see, the "script" just adds the options `-x` (extract audio) and `--audio-format mp3` (convert to mp3 at the end).


I haven't figured out how to effectively search my HN favorites, else I'd probably be able to find a few more of these, but this was discussed recently:

https://git.herrbischoff.com/awesome-macos-command-line/abou...


I'm not sure how I feel about the quality of this. It... drones. The samples are really bad. It's not that the voices sound robotic. The reading is boring. If this was tested on me without prior knowledge, I'd say "not sure if human or not. But it's a bad reading either way".

Edit: Adding a few more details to my thoughts to say why it's boring. Good narration is so much more than correct pauses. Pacing. Emotion around words like death and life. Ensuring that sentences don't repeatedly end on the same inflection tone. Modulation of rhythm. None of that is there.

The last time I ran into this was when a known person started a youtube channel where they put together the script and the video and then used an AI to narrate the script. I assumed it was an AI because I figured that's how said acquaintance would have managed the budget. But it was incredibly tedious to listen to. You can see this in work here (https://www.youtube.com/watch?v=yWVvmKpCBDg). Has the same feel of the Apple digital narration. I don't know how I could listen to that easily for over an hour.


I have listened to a fair amount of fan fiction read aloud with what seems to be the default Siri voice in the fanfiction.net app. Like watching something with subtitles, you don't hear the drone after a while. It does put a lot more emphasis on the quality of the writing, which can be rough with fanfic.


Good to see mainstream accessibility work on high-quality text to speech.

On iOS/macOS, VoiceDream has offered flexible apps with voices in multiple languages and accents since 2012, e.g. for reading PDFs, web, non-DRM ePub books and scanned text, https://www.voicedream.com/about/.


Mitchell sounds exactly like Ray Porter. I wonder if he trained a model with them or they did it without his direct approval


> Mitchell sounds exactly like Ray Porter.

"Mitchell sounds like Ray Porter" is more accurate. The accent is completely different, so they don't sound exactly alike. My first impression was that Mitchell sounds like Clay Jenkinson,[1] but more a cross between Jenkinson and a male newscaster I can't place who is probably retired now, but who also narrated documentaries.

[1] https://www.youtube.com/watch?v=d8UoL0AOL3k&t=1m49s


I reached out to him on Twitter [0] asking exactly this.

According to a Reddit comment [1] it is, but they haven’t posted their source.

[0] https://twitter.com/pwnies/status/1610857711008370688?s=46&t...

[1] https://reddit.com/r/apple/comments/103iogu/_/j305eby/?conte...


I don’t think you got the correct Ray Porter.

Correct one is https://twitter.com/Ray__Porter


Doh. Thank you - didn't realize it was the double underscore.


Yea I immediately recognized him, hope he's getting paid for this.


He likely won’t be able to comment without breaching an NDA — my recollection is that the guy who voiced the original SIRI got in all sorts of trouble for trying to capitalise on it.


Would love to see one day AI used to read annotated text with multiple voices, so each person in a novel gets his/her voice and also narrative voice. Would be epic and actually better than most audio books read by a single person attempting to pretend to speak in different voices.

Was always frustrated that Kindle was barred from reading books, it is such a natural progression of capabilities. Leave up to the buyer to decide if they want to pay for the person, but default TTS should be allowed for all books, such that if I read book at home and then can continue listening during a walk.


Whoah the Madison voice sounds _exactly_ like Julia Whelan, who is a real audiobook narrator. I have listened to many articles on Audm (narrated news articles) using her voice. I wonder if she had a part in this?


The Mitchell voice also sounds almost exactly like Ray Porter, another real audiobook narrator.


I heard that and it was uncanny how much it sounded like Ray Porter to me


If she did she’ll likely be NDA’d up to her neck to prevent her ever admitting it publicly.


I’m very interested to see if/how the model can figure out to produce a different voice when a character is speaking and how to keep the same voice for each character across the whole book consistent. Especially the second problem is not trivial at all from my understanding of how neural networks work.



The male sample has many pauses that are distractingly long. Pretty interesting.


While TTS has broad application, I am skeptical about Apple’s process being able to compete with the best narrators.

I have my biases. My wife and I have licenses to listen to about 500 Audible audio books and in the best of them I feel like I have a human to human relationship with the narrator that is similar to a relationship with the author.

I have mostly worked on deep learning projects over the last eight years, so I appreciate the tech as an engineering tool, but I think it is important to view tech as a servant to human experience.


Not every book can afford the best narrators. A good one charges somewhere in the ballpark of $300-500 a finished hour. So for an average novel that's like $3000-5000. Not all writers can afford that, so this is a cheaper alternative.

It's like a Lexus vs. a Hyundai.
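
A quick sanity check of the per-hour arithmetic above, as a sketch: the numbers only line up if an "average novel" runs about 10 finished hours, which is an assumption and not something stated in the comment. The function name `narration_cost` is made up.

```shell
#!/usr/bin/env bash
# Sketch: narration cost = rate per finished hour * finished hours.
# Assumes a 10-hour book (hypothetical length); integer dollars only.
narration_cost() {
  local rate_per_hour="$1" finished_hours="$2"
  echo $(( rate_per_hour * finished_hours ))
}

echo "low:  \$$(narration_cost 300 10)"
echo "high: \$$(narration_cost 500 10)"
```

A longer book scales the total linearly, so a 20-hour title at the same rates would cost twice as much.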


Best narrators? Agreed. But as someone who did some recording for a local radio station years ago, it’s an incredibly time-intensive project to record a book.


Can it pronounce ”Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo” ?

https://en.m.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buff...


The biggest failing for me:

They don't appear to be making any attempt to have the narration use inflections for different characters. This is probably fine for nonfiction books, but for fiction books, it can make it really hard to follow when a narrator does this, at least for me.


I understand this sentiment. I've been an audible subscriber since 2002 and have listened to hundreds of audiobooks, mostly fiction. The ability of the narrator to provide distinct, interesting voices for each character figures prominently into my enjoyment of a book. This technology sounds fantastic, and will likely enable pleasant narration of texts that would otherwise never have it, but I don't think it's likely to replace professional narrators for a large number of cases.


In 2019 I did an "Ask HN: When will text-to-speech replace narrators"

Most answers did not age well I would say.

https://news.ycombinator.com/item?id=20931541


The answers seem to have aged quite well! Apple specifically say that this is designed as a complement to human narration and not a replacement: some aspect of that is Apple protecting relationships with the audiobook industry, but it’s also fair to say that the humanness of human narration is still unmatched by text-to-speech. For many people, high quality text-to-speech will be good enough to enjoy but it doesn’t seem likely to be an audiobook replacement today.

That said, I suspect it’s less of a technology issue and more cultural: the current generation growing up on robot audio will have different expectations to previous generations, the lack of humanness probably isn’t an issue for them, so even if the technology can never exactly replicate the humanness of audiobooks, it may not matter.

(The text-to-speech on TikTok, for example, is often lampooned by people not of the TikTok generation for being disconcerting and annoying, whereas young people seem to have no problem with it… and that voice is much more artificial sounding than these Apple examples).


Also as far as predictions go, I predict that a few years from now there’ll be research and think pieces about the impact learning from artificial voices has on young children.


If this is your standard, the answer was 2020 or maybe earlier. https://news.ycombinator.com/item?id=34271329

I don't think publishers will stop hiring professional narrators for the bestsellers (instead of the long tail that has been text-only until recently) for a few years more.


I have to say that a number of audiobooks I've bought recently have been totally spoilt by the narration - and the AI voices on offer here all sound better than the humans involved in those books.

I also remember reading, a couple of years ago, that Apple was working on improving the voices for Siri - resulting in me thinking "surely Apple has more important things to work on to improve Siri". I guess this was what they were actually aiming for.


It's so true! It's such a tragedy when a really good book is ruined by bad narration. All of the Hitchhiker's Guide books (except for the first one, which is masterfully read by Stephen Fry) are a good example, but there are tons of others. There's usually only one version as well, so you can't even shop around for a version with a better narrator.


If you can find it, look for the BBC Radio version of Hitchhiker's, from the late 70s and early 80s. Each version has a slightly different variation on the story and the radio one is my favourite of the lot.

My friend had a copy of the scripts from the radio series and we used to use it as inspiration - "open the scripts at a random page and our band name will be the first thing that Marvin says ... Zootlewurdle - perfect!"

I've got them on a set of CDs somewhere, but have nothing that will play them...


I've learned to enjoy particularly robotic TTS narrations because after a while the voice becomes associated with my internal voice, and it feels like I'm "reading" the book, whereas elaborately produced audiobooks are experientially closer to enjoying a show. I also feel less hesitation about listening at 3x speed. With practice, I can start layering personality onto different characters with neutral TTS voices.


> I guess this was what they were actually aiming for.

I would imagine it was the other direction—that this is a way for them to test out their improvements to Siri.


Is it just a coincidence that only a few days back I saw an article or video about how low a payment Audible makes to its authors?

It was a famous author too.

And now this announcement.


Likely referring to Brandon Sanderson's recent comments[0].

Edit: Previous discussion on HN: [1]

[0] https://winteriscoming.net/2022/12/30/brandon-sanderson-blas...

[1] https://news.ycombinator.com/item?id=34104204


“Mitchell” is clearly Ray Porter, an absolutely phenomenal voice actor/narrator. He’s done a range of audiobooks across many genres, and anything he does is a pleasure to listen to.

I sure hope that he negotiated a gigantic amount for his data/training set provided to Apple, as this tech sounds like it’s getting advanced enough to obviate a giant chunk of the narration business overnight.


Just replied with this to someone else. Instantly recognizable as Ray! Phenomenal narrator. He and Nick Podehl are my favorites.


I hope Amazon does this too, but they probably won't because it'll cannibalize Audible. This is a good move from Apple. Take the credit!


The first versions of Amazon's Kindle did this. Then they got mired in lawsuits over it from the book industry.


Why the lawsuits? I would say: offer it with a big warning that this is automated and might be bad quality. Those books that have narrated versions can come with a "buy extra" button for a professionally narrated version, for which I’m probably more than happy to pay. As an avid reader with 3 kids, it’s nice to be able to switch between audio and text book, depending on my availability and context. As such, if no professionally narrated version is available, an automated one is way better than nothing.

Also, having the progress sync between digital text book and audio version is a great UX improvement!


> Why the lawsuits?

Because the contracts between Amazon and the publishers do not permit this kind of work.

Like, this is (legally) a pretty open and shut issue. Amazon has a license from the publisher for the book that permits a certain range of activities that are contemplated by the contract: showing short excerpts for marketing for example.

IP licenses are carefully constructed, often because the rights are sold to different parties. You may for example license a book to adapt into a movie, but the contract would likely forbid you from adapting it for a TV show. The publisher may sell those two rights separately to two different parties.

And these contracts likely either specifically forbid constructing an audiobook (automated or otherwise) from the original book, or at least do not contemplate it. That is a clear source of lawsuits where Amazon is likely clearly in the wrong.

Another reply here mentions that the publishers do not "like" that Amazon went ahead and did this without consultation. That may very well be true - but more importantly (and constructively) it's likely that Amazon's behavior is specifically forbidden in the contract they voluntarily signed on to with the publisher.


The lawsuits were because Amazon just did it, without any author/publisher buy-in. They just said "you can convert this ebook using our TTS with the Kindle" and the publishers did *not* like that. IMO that is how this should work: any text anywhere should be convertible to audio.


It’s only making it more difficult anyway, right? Ebooks can still be converted using an external program, many of which have GUIs. But I guess that’s as far as they can make use of liability laws.


I had the same question, but Apple seems to offer the narration to authors so they can choose whether a book comes with text to speech.


I'm curious about the thinking behind "mysteries and thrillers, and science fiction and fantasy are not currently supported." Is it because these genres use more non-standard words? Or is there more risk of the Apple voice being used to narrate something inappropriate?


I'm very surprised they make this a feature for authors, rather than a feature for users.

As a user-focussed feature, it could read any ebook out loud, and would differentiate Apple Books from any other audiobook platform.

I guess it's aimed at authors, because then they can charge the author for the 'narration' service....


> I'm very surprised they make this a feature for authors, rather than a feature for users.

Licensing. Audiobooks are a different license from ebooks, and trying to narrate an audiobook will infringe the licensing terms.


Apple is big enough they could just tell authors "here is our new narration feature for users. If you don't like it, pull your books off our platform.".

No author is going to win a Twitter flame war because they don't want a 'speak it out loud' button provided by Apple on their ebooks.

Besides - all ebooks on Apple platforms already support this via:

Settings → General → Accessibility → VoiceOver → ON turns on the VoiceOver feature

This is just a higher quality version of the same.


Your view is a bit myopic as you seem to assume Apple can just throw its weight at it and that winning in the US market would be the same as winning globally.

Books are one of the oldest media around and as a consequence most jurisdictions have fairly extensive and specific laws around them and their authors' and publishers' rights. In many cases copyright itself is ultimately based on laws created to deal with authors and publishers.

Infamously, Amazon tried to snub book pricing laws and lost. Google got into hot water with newspaper publishers because its news app violated laws originally written for citing physical newspapers.

This is like suggesting Spotify just ignore the RIAA or Netflix should just stream all content in all countries, licensing restrictions be damned.


> Apple is big enough they could just tell authors "here is our new narration feature for users. If you don't like it, pull your books off our platform.".

You assume that it is authors who sell books on Apple's platform (or on any platform).

Let me introduce you to a couple of chunky boys:

- Penguin Random House https://en.wikipedia.org/wiki/Penguin_Random_House

- HarperCollins https://en.wikipedia.org/wiki/HarperCollins

- Simon & Schuster https://en.wikipedia.org/wiki/Simon_%26_Schuster


It’s not big enough to tell the major publishers that. Especially if it wants mainstream titles on its relatively small ebook platform.


Apple is big enough to have the EU dictate what port to ship their phones with, too. The regulatory landscape requires careful navigating.


Apple would absolutely get sued by publishers, but, I don't think that Apple providing a high quality narration tool with their phone which can be used on any ebook would infringe licensing terms. It's not like they'd be saying "Here is this specific title for you to read/buy/rent", they'd be releasing a tool with the power to do that.


> I don't think that Apple providing a high quality narration tool with their phone which can be used on any ebook would infringe licensing terms.

Oh, it definitely would. This produces a derivative work besides anything else.

> It's not like they'd be saying "Here is this specific title for you to read/buy/rent", they'd be releasing a tool with the power to do that.

That's exactly what they will be doing from the point of view of copyright law.


Citation needed?

TTS tools exist on all computing platforms these days. The difference with this recent release is just that it sounds better. Apple would not be creating a derivative work, whoever uses it would be.

EDIT: It has occurred to me that they may have signed a deal as part of Apple Books that says they won't release tools of this nature. I don't know if that is the case, but that isn't the scenario I'm talking about. Just the case where there is a tool that can do high quality TTS. I do not believe that violates any copyright since it's a tool.


TTS tools are required for accessibility etc. However, "convert your ebook into audiobook with this amazing voice over" will run into issues. Amazon tried it: https://www.wsj.com/articles/SB123419309890963869


Yeah, I'm familiar with that. Amazon backed down from the fight. So there was no answer to the legal question.

EDIT: Also, since I can't get through the WSJ paywall, here is Amazon's complete statement from the time:

Quote:

Here is the full text of Amazon’s statement:

Kindle 2’s experimental text-to-speech feature is legal: no copy is made, no derivative work is created, and no performance is being given. Furthermore, we ourselves are a major participant in the professionally narrated audiobooks business through our subsidiaries Audible and Brilliance. We believe text-to-speech will introduce new customers to the convenience of listening to books and thereby grow the professionally narrated audiobooks business.

Nevertheless, we strongly believe many rights holders will be more comfortable with the text-to-speech feature if they are in the driver’s seat.

Therefore, we are modifying our systems so that rights holders can decide on a title by title basis whether they want text-to-speech enabled or disabled for any particular title. We have already begun to work on the technical changes required to give authors and publishers that choice. With this new level of control, publishers and authors will be able to decide for themselves whether it is in their commercial interests to leave text-to-speech enabled. We believe many will decide that it is.

Customers tell us that with Kindle, they read more, and buy more books. We are passionate about bringing the benefits of modern technology to long-form reading.


I regularly use my MacBook for narration. I look forward to this being better-adapted for books.


I am surprised it’s taking so long for big tech companies to roll this out. I suppose there’s too much money in audiobooks? Perhaps licensing issues? If anyone can get a “celebrity reads aloud” feature that would be Apple and that could be big.


Yes, licensing. Narrated books are a separate publishing license, with its own production-cost model. So a tech company like Apple that already has the rights to sell the written words is still not licensed to publish a narrated version. To sell an Audiobook, they have to acquire the license for a separate product, which so far includes the produced Audiobook itself.

It seems Apple is trying to get the audiobook license directly from those authors who didn't sign the license away yet, undercutting production cost for the Audiobook with "digital narration" and then earning more money per sale...

I guess we're going to see human narration die very fast now, at least for some common languages, and tech companies want to ensure that they can split license from production cost, instead of being forced to buy the "whole" Audiobook...


TBH it will depend on how those narrations turn out to be.

Cherry-picking is easy, but I paid for this, it needs to be human quality throughout.


True, but I expect the critical mass of the target group will remain people who frequently listen to Audiobooks --> Those people are more likely to be subscribed to an Audiobook service --> They are not paying a per-audiobook price but a flat fee --> The flat fee for the whole catalog is lower than buying one audiobook.

I would expect this target-group to access a mix of human and digitally narrated books during the transition to digital narration, with best-selling books still being narrated by humans. Users may then complain about the quality of the digital narration, but will keep using such services as the price-expectation is now set.

--> Competition for better digital narration engines will likely drive evolution of the engine and authoring tools, further increasing the pressure on publishers to justify the bottom line of per-book Audiobook production costs.

> Cherry-picking is easy, but I paid for this, it needs to be human quality throughout

That's a really interesting aspect. If the Audiobook delivers the content with a human voice but still not engaging enough, how many listeners would put the blame on the narration rather than the book itself... ("I like this new song of Metallica, but I don't like how they sing it")


The quality needs to be very high for it not to be jarring. I stay with podcasts if the host has a "good radio voice". ("Not even all human voices are good enough for me.") It's just a very intimate medium to have someone's voice right in your head. If the voices have annoying quirks, those audiobooks will not be loved.

Yes, it needs to be tens of hours of perfectly good narration.


I have used TTS for books for years. It might be jarring the first time but you get used to it very quickly, I don't even think about it. I actually prefer it to my Audible books usually because it doesn't ever do anything to annoy me like some narrators, and I can understand it at whatever speeds. There are some dialogue heavy books where I have to read along though to be sure of who is talking.


I think you’re underestimating state of the art in this area. You can do amazing things with just a few minutes of readings.


No. You are vastly overestimating it. There is a reason there is no broadly available TTS service like there is for text-to-image. Anyone who says you can clone a voice in a few minutes is not talking about human-quality.


I’ve done a few models on my own. Stephen Fry is a relatively easy one. This is from 2020 so I am sure state of the art is far better now.


I'm not saying you can't do it, I'm saying it likely does not sound good enough for the average person to listen to for a long time.

Got a sample?


I cannot put into words how much I want Silmarillion read by Scarlett Johansson.


I want, on Spotify, early faery tales and parables of dragon-slayers and dragons as read aloud by Joe Rogan, Michael Bisping, and Tom Aspinall to kids of the lower political-economy.


It’s the spaces between sentences and between pauses that need work. Usually a reader will take a breath or finish exhaling. Instead Apple’s audio drops to 0 dB. It sounds unnatural. Mechanical.


Tech doesn't seem that great? Google demo'd Duplex in 2018 and it was so good at voice synthesis that people were arguing about whether or not it's ethical to not disclaim you're talking to AI.


It’s odd they label the voices as “soprano” and “baritone,” because they don’t sound like it.

I suspect it’s to avoid labeling the speakers as “male” and “female.” What a joke.


Woah, the Madison voice is quite clearly Julia Whelan.


Lucasfilm licensed James Earl Jones's iconic voice in perpetuity when he retired from acting (come on, there is only ONE Darth Vader voice) - no doubt he and his estate in the future will get nice annual royalty cheques from Mr Mouse.

I wonder how this works for her.


Mitchell sounds like Alan Rickman, I felt like I was hearing Snape reading the sentence. I like it


I listened and am rather confident it's Ray Porter. It's uncanny. I instantly recognized it (I have listened to a number of books narrated by him).


100%, I noticed it immediately.


It would be great if these voices were Siri options. The new Siri choices are quite bad…


What accent is used by the first voice? It creeps me out slightly, some kind of rz sounds...


Can I plug a standard ebook in and get digital narration? Or just Apple Books?


Neither. The author of the book has to utilize these.


So at first I thought Apple had managed to undo the whole nonsense that book publishers strong-armed Amazon into doing where they can turn off TTS narration to make you buy the audiobook. But instead this seems to just be "hey if you want to use TTS instead of a paid narrator, you can". Already kinda shitty, but there's extra shit cherries on top: the resulting recording cannot be used on other book platforms. Only Apple Books and the DRM nonsense that killed public libraries.

So it's also platform capitalist moat building - i.e. a scheme to deprive Amazon Audible of audiobooks. The more publishers opt to use Apple Books digital narration instead of paying a narrator, the fewer audiobooks will be available on Audible. And yes, you are allowed to still pay a narrator and distribute that recording on Audible, but... if you could do that, then obviously you wouldn't bother with Apple's TTS system.

Of course, the flipside of this is that Amazon refuses to bother with copyright enforcement for books not on Audible. Cory Doctorow found this out the hard way[0]. If you do not license your work to Amazon, Amazon will pay someone else to copy it, and for some reason DMCA 512 protects them[1]. So I can see this winding up being a functionally unused service anyway.

[0] https://www.audible.com/pd/Why-None-of-My-Books-Are-Availabl...

[1] To be clear, I do not oppose DMCA 512; I just don't think DRM-bearing audiobook services that charge money should be allowed to disclaim copyright liability. DMCA 512 and 1201 should be mutually exclusive.



