Transcribing bird sound as human speech (jstor.org)
70 points by tintinnabula on April 3, 2023 | 39 comments



A discussion of transcribing bird song is not complete without a reference to Messiaen's Catalogue d'Oiseaux (bird catalogue), for which he transcribed bird song to pitch (https://www.oliviermessiaen.org/birdsong) and used those transcriptions as a starting point for composition (e.g. https://www.youtube.com/watch?v=UmrAL2WNjGs). For the people who expect gentle bird imitation: it's much wilder than that.


Messiaen in general is wild, but his use of birdsong is especially cool. Long before I knew about this, I had an idea of writing a cello/flute duet based on birdsong, but at the time it was hard to find good recordings, let alone transcriptions, of birdsong. I do think that musical notation might be a good starting point for birdsong transcription, though.


This output to me sounds a lot like a contemporary "lots of complicated dramatic notes"-style of composer imposing that specific style on snippets that loosely resemble bird calls. It doesn't sound like birds at all. Maybe I'm just not smart enough for this stuff. Or maybe the composer has a particularly dramatic imagination and is using the birds just as a starting point.


To get into Messiaen, try his Turangalîla symphony. It's more accessible and, if you're so inclined, almost ecstatic. Then the Quatuor pour la fin du temps, which was written under particularly difficult conditions. His more religiously oriented work is also more accessible.

Well, when I write "more accessible" that indeed means the rest is harder to get into. No shame, though. It's a bit of an acquired taste. There are many 20th century composers (especially the serialists) I cannot stand. I can't hear the music in it. But Messiaen did get to me. Perhaps it's his "colours" or the Debussy-like chords.


I think the dynamic range of multiple instruments naturally invites "lots of complicated dramatic notes", which you can find even in romantic music. Of Messiaen's works, my favorite remains "Vingt regards sur l'Enfant-Jésus". Of course the symphony also introduces many new and alluring timbres not found in the piano. In "Catalogue d'Oiseaux" I myself also struggled to find more than a passing resemblance to bird song (i.e. beyond more familiar song found in the treble of l'Alouette Lulu), so thanks for those links.

As for ecstatic music, I wonder how you'd appreciate the music of Maurice Ohana, specifically "Messe", which despite being an operatic (or choral?) piece, I find calming.

I believe John Sichel's "Oiseaux Ordinaires" is homage to Messiaen's "Oiseaux Exotiques", though once again I'm not sure how much actual birdsong it incorporates.

I've always wondered how much of music's aesthetic is inherent, and how much relates to extrinsic factors. Normally you hear a piece and, if you like it, you think, "hmm, that's good music". But maybe you only liked it because you're used to that type of music. So if you want to listen to music from a new genre, perhaps you listen to it a few times, letting an appreciation percolate from whatever place we draw our musical inspiration. Then, because a lot of people say "this is a good genre; good music" and the genre is large, with some shit pieces but also masterpieces, you come to believe that the genre has value.

But some genres very few people come to appreciate, and that includes serialism, and I wonder... Is serialism lacking an inherent aesthetic, or is it just too different to easily appreciate? For example, I played, and came to love, Schoenberg's "Six Little Pieces for Piano" (Op. 19). My interpretation was idiosyncratic, like no one else's. But looking on YouTube, I can't find any performances that I enjoy or would recommend. So does my appreciation for this music reflect the blood and sweat that I've put into it, or does it reflect the beauty and understanding of the underlying musical theory? The question is: does serialism need to be played to be appreciated?


> "Vingt regards sur l'Enfant-Jésus"

Great work. I also like L'Ascension. I even try to play bits of it on the organ (not that I'm a good organist).

> I wonder how you'd appreciate the music of Maurice Ohana, specifically "Messe"

That didn't do it for me. I found it a bit too wandering, amorphous. I listened to some of his guitar work, which is much more concise, somewhat Webern-like. That spoke more to me. Then autoplay put on the 12 Études d'interprétation, which I predominantly found too "forced".

> Is serialism lacking an inherent aesthetic, or is it just too different to easily appreciate?

Composers, musicians, and musicologists have tried to push it for long enough, and it never took hold, so I guess it just can't work apart from some gimmicky use. It's too rigid. E.g., the dynamics, rhythm, melody, and harmony of a piece should be related to each other, but in serialism they're simply tossed around, as if every preconceived change between pp and ff and then mf and then a bunch of sfz will always be musical.

Music thrives under limitations, because it's basically repeated time, and limitations can provide direction. But if I were to start a school of composition that only allows one tone per part, I think nobody would be able to write a meaningful piece (except as a gimmick, again, by picking the instruments that fit with the chord you want). Some limitations are simply not musical, and I guess serialism is among those.


Oh that was fantastic, thanks. It's not often I have reason to listen to organ music. Almost makes one want to purchase an organ.

I don't care for the 12 Études d'interprétation, but neither do I care for Rachmaninoff's Études-Tableaux. I find both too pedagogical. Which makes no sense; according to Wikipedia, they're meant to be "picture pieces". So I haven't always put much stock in études.


When I sang in the cathedral back in the 00s, the organist would often play Messiaen as a postlude. There were some pieces where, between the pedal work, the multiple manuals, and the constant adjusting of stops, it was like watching Keith Moon on the drums.


I don't think it's a question of intelligence. You have to spend some time listening to Messiaen, and to late-Romantic/contemporary music in general, before you start to 'learn the language'. I think it's worth the minor effort if you think there is nothing new under the sun, but it's all your choice :)


Someone else hints at this, but one of the easiest Messiaen pieces to understand is his "Quartet for the End of Time", written while he was a prisoner of the Germans during WW2. The Wikipedia article for the piece is actually pretty good "liner notes" if you want a guide to each movement.

One other thing about whatever you want to call the contemporary continuation of classical music - you're right that there's often a lot going on. I've heard it described as "fragile" music: in contrast to pop songs that still work if you're on a train, if they're interrupted by an announcement, if you only listen to the second half, etc., a lot of this music is written to be listened to by someone who is relatively calm and undistracted, in a quiet environment. And honestly some of it has additional expectations of its audience, like a better than average ability to remember melodic snippets and recognize their transformations, or an ability to track "advanced" harmonies and key changes. But imo the general rule "90% of everything is crap" applies in this genre as well, and most of the pieces that everyone will speak of in glowing terms have a way of 'working' for most listeners without demanding advanced education, study, or above average musical-listening abilities.


I think it also helps to be on the inside of a lot of it. I had the good fortune, for a few years around the turn of the century, to sing in the Holy Name Cathedral choir in Chicago, and the choir director programmed a lot of 20th-century classical work alongside the blue-haired-lady standards and renaissance works. One of my favorite pieces had the choir singing a C chord while two sopranos sang a descant of Db-Ab above it. Aside from sounding really cool, it was also a really difficult feat technically speaking, as those are not intervals that are easily sung. It was quite the education, although I started to realize as I worked on my own music that I was no longer hearing dissonances the way I used to.


Good to know. I should add that I actually liked the stuff that I heard, much more so than most contemporary Western "classical"/concert music, and I'm happy to follow up on this particular composer. I was just disappointed at how little it sounded like bird song to me, and instead fell into what seem to be some stereotypical tropes of this style.


Reminded me of this interesting bird song visualization project:

https://github.com/soundshader/soundshader.github.io/tree/ma...

Less information than a spectrogram, but really visually appealing. :)



Wow, this is so cool. Thanks for sharing.




Dig, if you will, the picture

Of calls written down in ink.


Transcribed by a token predictor,

A machine that pretends to think.


Tangential, but it occurs to me: is there any way to apply the methods behind ChatGPT to animal communication, and could we learn anything about meaning from that?

It seems like AI ought to be able to "autocomplete" birdsong or whale sounds just as well as anything else, given enough training data. Could that then be used to help discover underlying grammars, etc.?
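As a toy sketch of that "autocomplete" framing (everything here is hypothetical: the syllable labels, the segmentation, and the corpus), one could discretize recordings into song-syllable tokens and fit a next-token predictor. A bigram counter stands in for the transformer:

```python
from collections import Counter, defaultdict

# Hypothetical "syllable" tokens from segmented birdsong recordings.
songs = [
    ["trill", "chip", "chip", "whistle"],
    ["trill", "chip", "whistle", "trill"],
    ["chip", "chip", "whistle", "trill"],
]

# Count bigrams: next-token statistics, the simplest form of "autocomplete".
bigrams = defaultdict(Counter)
for song in songs:
    for cur, nxt in zip(song, song[1:]):
        bigrams[cur][nxt] += 1

def autocomplete(token):
    """Most likely continuation of a syllable, by corpus frequency."""
    return bigrams[token].most_common(1)[0][0]

print(autocomplete("chip"))  # → "whistle" (3 of the 5 continuations of "chip")
```

A real attempt would replace the hand-labeled syllables with learned audio tokens and the bigram table with a trained sequence model, but the "discover structure from continuations" idea is the same.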


Careful with anthropomorphising. Communication is much more than vocalisation of abstract thought (i.e. the way humans mostly use it).

Animals don't communicate the same way we do and it might be very misleading to associate concepts such as grammar with their utterances. While some species use distinct sounds to express certain situations (like "careful, there's an airborne predator" vs "careful, there's a snake"), not all "verbal" expressions serve such purpose.

The general consensus in contemporary science is that bird songs in particular are utilities to attract mates, mark territory, and the like, not to exchange information. (They do convey things like "this is my nesting place" or "I'm healthy and available", but not in the sense of words or sentences; it's more like the acoustic equivalent of gesturing.)


"Careful with anthropomorphising."

The confabulation elements of this particular AI technology would be particularly concerning in this use case. This would be the most literal anthropomorphization you could imagine. Applying a strictly human language model to a non-human will inevitably result in humanization of the target by the nature of the bias of the system.

I fed "You are a rock. What might you like to communicate to human beings?" to ChatGPT and it answered just fine, off of nothing more than that input, giving one of its typically highly sanctimonious and high-handed lectures about environmentalism and managing to work in complaints about climate change, as naturally a rock would deeply care about. (It has now even titled my session "Rock's Wisdom To Humans". Good lord, ChatGPT has certainly become an annoying prick.) This has everything to do with the biases in the technology and nothing to do with the thing putatively doing the communication. Feeding it some sort of data from a real whale isn't going to do any better; it'll happily confabulate away.

"Bias" in this case has a more technical meaning, which refers to the representational capability of the system, not political biases intrinsically. It so happens that ChatGPT immediately leaps to political topics, which is doing a great job of obscuring the way I'm trying to use the term, but that's because of its own internal structures, which have been tuned in that direction. This, in fact, accidentally serves as as good an example as anything of how highly tuned ChatGPT has become to give politically correct answers, given the political neutrality of the prompt. But what I'm referring to here is just that a human language model is going to be biased in human ways. It can't help but turn anything you feed it into a human perspective, so using it to try to understand non-human perspectives and communication is going to be intrinsically a mistake.


I think OP was imagining using a LLM trained only on animal sound data, not using a LLM trained on human data to interpret animal data.


Which raises the question of what the added benefit of such an approach would be, i.e. what could we even learn from it compared to other methods?

To quote from an article in a Royal Society journal:

> [...] bird song evolution can also be understood as a response to natural selection, as when oscine passerine species in urban habitats raise the frequency of their song as a means to overcome traffic noise.

Things like changes in the vocalisation can just be adjustments to the environment or simply due to sexual selection. How would an LLM be able to capture this, given that it would lack such additional information?

Then there's this quote from the same article:

> In contrast to the expectation of gradual change through cultural or genetic drift over time, the results instead demonstrate that plastic traits such as song can exhibit punctuated evolution, with bursts of trait divergence interrupting extended periods of stasis.

So even if a model was trained on (regional) bird dialects, the data could become obsolete basically overnight, as the structure and patterns might change dramatically between breeding seasons. The researchers used statistical modelling (classical Gaussian mixture models) to analyse the data. I'm a bit lost as to how LLMs could be applied in this context (in that I have no idea what kind of scientific questions they could answer).

The full study can be found here: https://royalsocietypublishing.org/doi/10.1098/rspb.2021.206...
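For contrast with the LLM idea, here is a minimal sketch of the classical approach: an EM fit of a two-component Gaussian mixture on synthetic one-dimensional data standing in for a song feature. All the numbers are invented, and the actual study models far richer features than this:

```python
import math
import random

random.seed(0)

# Synthetic 1-D "song feature" data from two hypothetical dialects,
# centered at 2.0 and 6.0.
data = [random.gauss(2.0, 0.4) for _ in range(200)] + \
       [random.gauss(6.0, 0.4) for _ in range(200)]

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# EM for a two-component Gaussian mixture.
mu = [1.0, 7.0]
sigma = [1.0, 1.0]
w = [0.5, 0.5]
for _ in range(50):
    # E-step: responsibility of each component for each point.
    resp = []
    for x in data:
        p = [w[k] * gauss_pdf(x, mu[k], sigma[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: re-estimate parameters from the responsibilities.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
        sigma[k] = max(math.sqrt(var), 1e-6)
        w[k] = nk / len(data)

print(sorted(mu))  # the fitted means recover the two dialect clusters
```

The appeal of this kind of model is that every fitted parameter (cluster mean, spread, mixing weight) has a direct interpretation as a property of the song population, which is exactly what an enormous black-box sequence model doesn't give you.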


It's not unreasonable to expect that a big enough model could learn some kind of useful feature space for bird vocalizations. You are essentially replacing a parsimonious quasi-parametric model (the Gaussian mixture) with an enormous, fully nonparametric one. Transformer units should be as well suited to this as to any other sequence-learning task.


But the LLM would need to translate the sounds to human language, and that's where the issue would arise.


Yes, this was my logic. A model trained on bird song may make birds turn their heads when it babbles something in their frame of reference, but if you want to "use LLM" to make it human comprehensible you need a human LLM at some point, which will impose humanity on its input regardless of what you feed it, because all a human LLM can represent is human language. A human LLM can't help but make whatever it is used on human. A bird LLM would have the same effect; no matter what you feed it, you'll get something in the bird input's space.


Not sure you're really getting the idea. In this scenario, there wouldn't be a human LLM and a dolphin LLM; you would train on human speech and dolphin communication in one model. Sound tokens are sound tokens; pre-trained transformers don't discriminate.


The concern would be that it would generate a bunch of spurious correlations between the two. You would be mixing two kinds of datasets, where it's not clear that the dolphin sounds are linguistic. It's also not clear that LLMs would correctly map human language to a non-human one. Depends on your theory of meaning. Wittgenstein would say that if a lion could talk, we would not understand it, because meaning is use, and lions would use language differently than humans.


> highly sanctimonious and high-handed lectures... Good lord, ChatGPT has certainly become an annoying prick.

You've hit the nail - or dare I say rock - on the head.

That said, I think the bias we've both encountered is an aspect of the model's initial instructions and not reflective of LLMs in general.


Yes, and theoretically, if it trains on human language as well, you'd be able to translate between the two (assuming such a thing was possible, i.e. assuming, say, dolphin communication was a complex language).


That's the part I don't quite understand -- I know ChatGPT can translate quite well between languages.

But I'm not clear on whether that's because it's been trained on a bunch of aligned translations and bilingual dictionaries and whatnot, or if it's detecting the same repeating patterns in languages and figuring out the translation mappings on its own.

Is it the latter? Could a ChatGPT actually translate something like dolphin communication into semi-meaningful English, that wouldn't just be entirely nonsense?


It's the latter.

That's why it can be trained on only 1.8% French by word count and speak fluent French. It's not that the amount of French in its corpus would have been enough to train it to that level alone (it wouldn't, not by a long shot). It's that there's positive transfer of knowledge: whatever it learns predicting English helps it learn other languages much faster. It's not the only display of positive transfer, either; LLMs trained on code reason better.

It's also why the OpenAI instruct-tuning dataset was almost entirely English, yet the model follows instructions in other languages just fine.

If there's something to be translated, a language model trained on both will translate it. It just needs to learn the concept of translation, but the human side of the dataset should take care of that.

Really, I only say "theoretically" because of two things:

1. First, a multimodal text-audio model that shows positive transfer between text and audio needs to be demonstrated. Otherwise, this hypothetical human-dolphin model might need too much data to be feasible.

2. We're going to need a lot of dolphin speech audio.


You might be interested in work being done by the Earth Species Project https://www.earthspecies.org/what-we-do/projects


That is fascinating, thank you! Self-supervised learning that could correlate animal sounds with behavior. Exactly the kind of thing I was thinking about.


Once one of these models is trained on a huge video corpus like YouTube, which includes lots of videos of animals, it might just be that stuff like this is only a prompt away, assuming there is enough training data.


For the use of bird sounds by people in music ... I absolutely love this:

https://www.youtube.com/watch?v=Hxg1dL_x0gw

AkhaBraka - Ukrainian Folk


Who is using machine learning on bird sounds/songs? I have three parrots and understand their happy sounds, angry sounds, and love-you sounds. But I'm missing so much more…


Amen to that, I absolutely want a "tell me what those crows are generally talking about" app.



