My take on the classical parts of it, as a classical pianist.
Overall: stylistic coherency on the scale of ~15 seconds. Better than anything I've heard so far. Seems to have an attachment to pedal notes.
Mozart: I would say Mozart's distinguishing characteristic as a composer is that every measure "sounds right". Even without knowing the piece, you can usually tell when a performer has made a mistake and deviated from the score. The Mozart samples sound... wrong. There are parallel 5ths everywhere.
Bach: (I heard a Bach sample in the live concert) - It had roughly the right consistency in the melody, but zero counterpoint, which is Bach's defining feature. Conditioning maybe not strong enough?
Rachmaninoff: Known for lush musical textures and hauntingly beautiful melodies. The samples got the texture approximately right, although I would describe them as murky rather than lush. No melody to be heard.
Funny, as someone with the useless superpower of knowing an enormous number of TV Show themes, I could hear all sorts of riffs on such themes ranging from Magnum PI to Sesame Street.
Overall, IMO it wildly gyrates from the best I have ever heard all the way to the return of Microsoft's thematically lifeless Songsmith without warning...
This is very interesting to me as an amateur pianist trying to learn Bach. Could you give an explanation of what good and bad contrapuntal playing sound like or direct me to some references?
I think this is far, far beyond any algorithmic composition system ever made before. It displays an impressive understanding of the fundamentals of music theory.
Most previous attempts at neural net composition restricted the training set to one style of music or even one composer, which is pretty silly if you understand how neural nets work. It was obvious to me that if you used a very large network, chose the right input representation, and, most importantly, used a complete dataset of all available music, you would get great results. That's exactly what OpenAI has done here.
It is still lacking some longer term structure in the music it generates (e.g. ABA form). But I think simply scaling up further (model and dataset size both) could fix that without any breakthroughs. This seems to be OpenAI's bread and butter now: taking existing techniques and scaling them up with a few tweaks. (To be clear, I don't mean to minimize what they've done at all. "Simply" scaling up is not so simple in reality.)
What might still need some breakthroughs is applying the same technique to raw audio instead of MIDI. Perhaps what is needed is a more expressive symbolic representation than MIDI. I'm imagining an architecture with three parts: a transcription network to produce a symbolic representation (perhaps embedding vectors instead of MIDI), something like this MuseNet for the middle, and a synthesis network to translate back to raw audio. This would be analogous to gluing together a speech recognizer, text processing network, and a speech synthesizer. Such a system could generate much more natural sounding music, even perhaps with lyrics.
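A rough sketch of that three-stage pipeline as I imagine it; every class name below is made up for illustration, and none of it corresponds to actual MuseNet code:

```python
# Hypothetical three-stage pipeline: transcription -> symbolic model -> synthesis.
# These are interface stubs only; the point is the shape of the system.

class TranscriptionNet:
    """Raw audio -> sequence of symbolic embeddings (a richer stand-in for MIDI)."""
    def encode(self, audio):
        raise NotImplementedError

class SymbolicModel:
    """The MuseNet-like middle: predicts a continuation of the symbolic sequence."""
    def continue_sequence(self, tokens, n_new):
        raise NotImplementedError

class SynthesisNet:
    """Symbolic embeddings -> raw audio, analogous to a speech synthesizer."""
    def decode(self, tokens):
        raise NotImplementedError

def generate_continuation(audio_prompt, n_new=1024):
    tokens = TranscriptionNet().encode(audio_prompt)
    continuation = SymbolicModel().continue_sequence(tokens, n_new)
    return SynthesisNet().decode(continuation)
```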
Aiva is way better than this honestly, and I mean the pure AI part of it, which (AIUI) only really generates monotimbral, piano-like music (the orchestration in the full pieces they release was - by their own admission - done by humans, last I looked into it). You can actually tell that Aiva involves real serendipitous ("out-of-sample") creativity of the sort that AIs (and good human composers!) are best at. (But the openly-available models I mentioned elsewhere in the thread are still a bit better, TBH. At least IMHO, and when it comes to rewarding active listening.)
Long-term structure
MuseNet uses the recompute and optimized kernels of Sparse Transformer to train a 72-layer network with 24 attention heads—with full attention over a context of 4096 tokens. This long context may be one reason why it is able to remember long-term structure in a piece.
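For a rough sense of that scale, here is a dense-attention stand-in with the quoted depth, heads, and context length. The hidden size isn't given in the quote, so the d_model below is an arbitrary guess, and the real model also relies on Sparse Transformer's recompute and sparse kernels, which this plain PyTorch encoder lacks:

```python
# Stand-in for the quoted scale: 72 layers, 24 heads, 4096-token context.
# d_model is an assumption (it must be divisible by the number of heads).
import torch.nn as nn

n_layer, n_head, n_ctx, d_model = 72, 24, 4096, 384

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_head, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=n_layer)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_layer} layers, {n_head} heads, context {n_ctx}: ~{n_params / 1e6:.0f}M parameters")
```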
The problem with the scaling approach is that we don't have nearly as many samples per artist as we have "text pointed to by links on Reddit", as used in GPT-2 (OpenAI's text generation model). I would expect that they can't go as far with it as they otherwise could.
It's MIDI. It's meant to be notation, not a realistic recording. I suppose the background "machine gun" piano notes are meant to be a bassy synth combined with 90s electronic music drums. So please do not focus that much on the audio or performance part of it, but on the composition part of it.
I do agree they could have trained it on the importance of velocity, though. (That neural net, and most young music students out there too, heh)
Interesting question. I would assume yes. There should be enough data for things like sustain pedal and velocity dynamics. Music people can be "crazy" and add a lot of detail to their work. It is pretty common to find MIDI renditions of pieces that contain all kinds of extra information, from pitch-shift cues for guitar solos, to legato for string instruments (when the finger slides across the string, instead of "jumping" from position to position), to even volume automation for fade in/out effects. Classical musical notation also contains it, though in a way more limited form than the 0-127 range of MIDI, in the piano/forte dynamics markings.
It might have been stripped out to reduce complexity, or maybe it was inadequate and created odd-sounding results. Maybe the MIDI library they used doesn't do a good job of portraying it and it is actually there! Who knows!
The opening sounds a lot like Fisher's Hornpipe [1]. And then at 0:26 and even more so at 0:40 it sounds less and less like bluegrass. The weird instrumentation makes it especially strange.
Interesting. I'm not somewhere I can test this out, and my first thought was to wonder how well MuseNet could do bluegrass. One of my favorite things to look for on YouTube is bluegrass covers of popular songs.
Another interesting musical form to see how well MuseNet does would be boogie woogie.
This seems incredibly applicable to musical scores in movies. I can imagine a product where the editor/director/someone inputs a handful of variables (mood, genre, instruments, etc) and timing requests (crescendo beginning at 30s and ending at 75s, calm period from 90s - 120s, etc) and out comes a musical score for the movie that matches up with their scene editing.
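Such a request could be as simple as a cue sheet; the field names below are invented for illustration and don't correspond to any real product:

```python
# Hypothetical cue sheet an editor might hand to a score-generation tool.
cue_sheet = {
    "genre": "orchestral",
    "mood": "tense",
    "instruments": ["strings", "low brass", "timpani"],
    "cues": [
        {"type": "crescendo", "start_s": 30, "end_s": 75},
        {"type": "calm", "start_s": 90, "end_s": 120},
    ],
}
```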
With hundreds of thousands of semi-professional and amateur musicians competing on the Internet for their 15 minutes of fame, getting a film or computer game score just isn't a real problem.
Besides, this is already doable, without neural networks. E.g. Karma[1]. The fact that most "AI music" enthusiasts don't know/care about such systems is a clear indicator that they aren't really interested in music or helping creators and just want to shove AI in yet another niche where it doesn't belong, then pat themselves on the back for contributing to "progress".
Further commodification of what's left of our popular culture isn't something anyone should be excited about. I just can't wait for the endless wave of trite "style blends" and the inevitable "oh, but all human music is garbage anyway" justification.
Also, some people don't seem to be aware of it either, but there is already a full-fledged genre of generative music where artists use anything from analog circuits[2] to randomized or algorithmic sequencers/arpeggiators[3] to custom-built digital devices[4] for creating entire tracks. Of course, the point of generative music usually isn't to replace the artist, it's to shift their focus.
This isn't chess or go, where you either win or you don't. Chopin composed for a reason, and it wasn't just an excuse to throw a lot of notes at the page.
It might be possible for AI to work at that level someday, but it's not just a technical problem, and you won't be able to solve it by throwing a corpus of compositions at it.
Aside from that, this still sounds like aimless noodling. It's far more polished noodling with some awareness of genre cliches, but it's still essentially aimless - and so meaningless.
>This isn't chess or go, where you either win or you don't. Chopin composed for a reason, and it wasn't just an excuse to throw a lot of notes at the page.
I disagree based on your following quote:
> Aside from that, this still sounds like aimless noodling. It's far more polished noodling with some awareness of genre cliches, but it's still essentially aimless - and so meaningless.
This sort of problem can easily be reframed as a win-or-lose problem: we simply consider whether the music sounds good or not. More concretely: would this music be able to convince you that a gifted human composer created it?
I'm fine with the answer being no, but I don't understand why the original comment I replied to wrote off the entire exercise. No, we may not be there yet, but this seems like a good step forward to me.
I think the core difference of opinion here seems to stem from a different interpretation of "meaningless".
You're arguing an academic definition of meaningful: can you design an AI that solves a hard problem?
gambler (and I think TheOtherHobbes, who you are replying to) are arguing a practical definition of meaningful: does it solve a problem people actually have?
It's neat that you can make artificial music, but actually generating music, per gambler's original comment, isn't a problem people have. It also doesn't actually add too much to culture. Essentially, the results are "meaningless" in that, even if it was successful at sounding good, what value would it actually have besides novelty?
Something similar was done in the Left 4 Dead series with the AI Director:
Tim Larkin: We took several steps to keep the music interesting enough that the players would be inclined to keep it on as they play. We keep it changing so it won't become tedious; to this end, we created a music director that runs alongside the AI director, tracking the player's experience rather than their emotional state. We keep the music appropriate to each player's situation and highly personalized. The music engine in Left 4 Dead has a complete client-side, multi-track system per player that is completely unique to that player and can even be monitored by the spectators. Since some of the fun of Left 4 Dead is watching your friends when you're dead, we thought it was important to hear their personal soundtrack as well. This feature is unique to Left 4 Dead.
For single-player it works really well for building tension leading up to being attacked by waves of zombies and creating calm spots after high-stress encounters.
In the online versus mode, however, players got really good at using the musical hints as "tells" for when certain actions happen in the game. Most notably, there were musical signatures consisting of a few notes that would play when the special infected characters spawn. Coordinated teams could use this to their advantage to know when the infected team is about to stage an attack (like an ambush at a choke point in the map).
It would be even more interesting if it allowed variations in the same soundtrack according to different dramatic moments in the same scenarios.
Example: a man is walking down the street vs a man is walking alone in the night down the street vs a man is walking alone in the night down the street unaware of a bright red dot on his back.
All these scenarios could use the same soundtrack, but require different dramatic levels. The game would send data to the music algorithms so that the soundtrack would reflect the right mood.
There have been videos of people taking a pre-recorded segment of a game (say, a 1v1 fight) and improvising a jazz score around it. How awesome would it be to have this happen live for any game? Anything from mood to specific sound effects.
I see the applicability more for video games. There are a bunch of problems that don't exist in other forms of music, e.g. when you want to make the music adapt to user actions.
Wouldn't solve all the problems, but I think there could be very interesting results if a composer provides skeletons of themes used, and "AI" adapts them to world and player state dynamically.
And do the same for the actors, script, lighting and sound effects. And also have some AI for the 'input variables', after all Netflix already does that to select their content.
It might be a way off that for "production", but for editing, when a score isn't finished yet, as filler to help the editing process, this could be an amazing tool.
The line between movies and games will blur dramatically. Think Bandersnatch, but with the viewer saying "open the door" and the directorGAN generating a new storyline.
While this may not be perfect composition, this is a surreal (and almost sad) moment in my life to hear passable music created by a computer of its own volition. I work at a company that works with a lot of machine learning, so I generally understand its limitations and have never been an alarmist. That being said, I generally thought of it as being applied to automate work. For some reason I had always considered that which we normally attribute to human creativity to be off-limits. Sure, it's not great now, but in 10 years will it be able to compose better music than Chopin? In 10 years, will music created by a computer surprise and delight me more than music composed by humans?
Fear not, my friend, because the execution is only a part of why music and art as a whole carries meaning for us.
Consider two tracks that are identical (forget copyright for a minute). Between one that an AI generated and one that a human composed, I would personally grant the human-composed version more credit and enjoy it more. The story of how art is created and the stories of the artist are as substantial to appreciating art as a stroke of a brush or a note on a page. Computers will never replicate this until the singularity.
First of all, I disagree with this premise, as the overwhelming majority of music either (1) doesn't have a particular story behind it at all or (2) becomes popular, and then people learn the story behind it.
Even accepting the premise, what happens when the next artist with a great story is simply using MuseNet to write their emotional pieces and passing it off as human? They'll be functionally the same, yet it still feels like something was lost.
What makes you think that a computer that can generate music cannot also generate a background for the "creator" of that music? It can give you stories that touch our hearts even more than we can imagine.
Why generate a fake story? Why not communicate the real and moving journey of how a single note in the training data travelled through hundreds of neurons and thousands of matrices, and eventually made it past the final activation function to become a feature in the output tensor?
Given two stories, both identical, where one story is real - the real story will always be more meaningful because it has actually happened within the constraints of our reality, granting it validity and us the ability to relate to it.
Now, consider two stories, both identical, where one story is "real" and the other story is from a simulated universe. Now I'd say that both stories are of possibly equivalent value, since both have happened.
I kinda thank you for your comment, it gives me more...hope? :) (I'm an artist)
It's the same as when we see another human do something extraordinary that we thought humans couldn't do (based mostly on ourselves). For example, an artist who can draw something utterly lifelike, or a sculptor who shapes hard marble into soft, flowing cloth with just a hammer and chisel.
Human potential always intrigues us in a "What? Humans can do THAT?" kind of way.
Yes, machines and AI can do the same thing in a fraction of the time, from a practical standpoint, but it's not and never will be the same: it's empty. It's just a lifeless product, and we never feel a connection to it.
I think that when placed against each other in this hypothetical, yes, one would naturally side with a human (even if it is just out of solidarity). But what about when humans (who already claim the fame from songs written by other humans) claim the fame for songs written by computers, which have no legal recourse? Would we just assume the music claimed by a human was originated by a human?
It doesn't sound passable to me - it sounds boring, it's a hack around a text-parsing architecture ffs, trying to make it understand multi-dimensional and multi-timbral data... There are models that do better, and can "surprise and delight" you in different ways than a human would. Think about DeepDream and the whole experience of trying to spot all the weird doggie parts that the computer manages to sneak in those pictures - I don't think that a human could paint a DeepDream-dog picture nearly as effortlessly and perfectly as a computer can! I would definitely describe that as successful art, as far as it goes. But that doesn't mean that DeepDream "solves" visual art as a task!
Music created solely by machines will probably remain derivative and simplistic for a long time. I expect the biggest result of this research in the near term is that we'll be able to create tools that lower the skill and time required to create good music, kind of like an audio version of templates/autocomplete/spellcheck.
There are a lot of futurists and singularity types that take it personally when people disagree with their assessment. It's no big deal, but the open minded, progressive thing to do would be to have a debate and save the downvotes for trolls.
It's probably only a matter of time before we have a GauGAN-like interface for synthetic music creation... so you could say "I want a sad song with a soft intro and a buildup of tension here, with lyrics covering these emotions and things, which lasts 7 minutes".
ML/DL is coming for a lot of the grunt work. It's coming for us as programmers as well. It's probably a few years away, but it's coming.
Given how easy it is to train a Transformer on any sequence data, and given how plentiful open source code is, I'd say "CodeNet" is probably less than a year away. OpenAI will probably do it first given they already have the setup.
I've been training on Stack Overflow and the model has already learned the syntaxes and common coding conventions of a bunch of different languages all on its own. Excited to see what else it's able to do as I keep experimenting.
Some sample outputs (you'll probably want to browse to some of the "Random" questions because by default it's showing "answers" right now and I haven't trained that model as long as some of the older question-generation ones): https://stackroboflow.com
I've tried it as well and got good syntactic results. For programs that make more sense, I think we will need more layers and attention heads. Perhaps someone will fork GPT-2 and add the sparse transformer to it.
That CodeNet would be the SkyNet, essentially. What's shown here looks impressive, but it's the same good old text generator that can produce something that looks very similar to the dataset used to train it. It can't go beyond the dataset and generate something new. From the mathematical point of view, that generator interpolates samples from the dataset and generates a new sample.
To give an idea of how big the gap between MuseNet and CodeNet is, we can consider the simple problem of reversing a sequence: [1,2,3,4,5] should become [5,4,3,2,1], and so on. How many samples do you need to look at to understand how to reverse an arbitrary sequence of numbers? Do you need to retrain your brain to reverse a sequence of pictures? No, because instead of memorizing the given samples, you looked at a few and built a mental model of "reversing a sequence of things". Now, state-of-the-art ML models can reverse sequences as long as they use the same numbers as in the dataset, i.e. we can train them to reverse any sequence of 1..5 or 1..50 numbers, but once we add 6 to the input, the model instantly fails, no matter how complex and fancy it is. I don't even dare to add a letter to the input. Reason? 6 isn't in the samples it's learnt to interpolate. And CodeNet is supposed to generate a C++ program that would reverse any sequence, btw.
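A toy illustration of that memorization-versus-rule gap (this is not a real seq2seq experiment, just a sketch of the point):

```python
# A model that only memorizes/interpolates its training pairs has nothing to
# say about a token it has never seen, while the underlying rule generalizes
# trivially. Purely illustrative; not an actual neural network.
import random

train_vocab = [1, 2, 3, 4, 5]
train_pairs = {}
for _ in range(10_000):
    seq = tuple(random.choices(train_vocab, k=5))
    train_pairs[seq] = tuple(reversed(seq))        # "dataset": input -> reversed input

def lookup_model(seq):
    # Stand-in for a learner that can only reproduce its training distribution.
    return train_pairs.get(tuple(seq))

def rule(seq):
    # The "mental model": reverse anything, regardless of the alphabet.
    return list(reversed(seq))

print(lookup_model([1, 2, 3, 4, 5]))   # almost certainly seen: (5, 4, 3, 2, 1)
print(lookup_model([1, 2, 3, 4, 6]))   # 6 never occurs in training: None
print(rule([1, 2, 3, 4, 6]))           # [6, 4, 3, 2, 1]
```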
At the moment, ML is kind of stuck at this picture-interpolation stage. For AI, we don't need to interpolate samples; we need to build a "mental model" of what these samples are, and as far as I know, we have no clue how to even approach this problem.
Yeah, I know what you are saying... But let's just let somebody try this experiment (and somebody eventually will), and we can judge what can or cannot be learned by the results.
We will definitely get a great code autocompleter at the very least.
Can you explain? I'm not an expert on ML by any stretch of the imagination, but you'd think with the sort of stringent logical coherence required to construct useful programs, it'd be a pretty subpar use case. Or do you mean smaller-scope tools to aid programming, like linters and autocompleters?
I wonder if you could find a representation for computer programs that eliminated all of the degrees of freedom that were syntax errors, leaving only valid programs. In a sense that's what an AST is but you can still have invalid ASTs. I bet it would be a lot easier to generate interesting programs in a representation like that.
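A minimal sketch of the kind of representation I mean, where anything you can construct is a valid program (the types here are invented for illustration):

```python
# Programs are built from a few typed constructors instead of free-form text,
# so a random generator literally cannot produce a syntax error.
import random
from dataclasses import dataclass

@dataclass
class Num:            # integer literal
    value: int

@dataclass
class Add:            # binary addition
    left: object
    right: object

@dataclass
class Mul:            # binary multiplication
    left: object
    right: object

def random_expr(depth=3):
    """Every value returned here is a well-formed program by construction."""
    if depth == 0 or random.random() < 0.3:
        return Num(random.randint(0, 9))
    op = random.choice([Add, Mul])
    return op(random_expr(depth - 1), random_expr(depth - 1))

def evaluate(e):
    if isinstance(e, Num):
        return e.value
    if isinstance(e, Add):
        return evaluate(e.left) + evaluate(e.right)
    return evaluate(e.left) * evaluate(e.right)

expr = random_expr()
print(expr, "=", evaluate(expr))
```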
There is cartesian genetic programming, and there are some lisp-like models that encode a program as a tree where all combinations are valid.
Combined with recent work on convolutional graph DNNs, this might be a good approach.
It's not passable...it's pretty obviously algorithmic.
The program does not have volition.
Why would you think that using statistics to generate a model of a piece of art (which is just data in the case of MIDI and pixels) would be "off-limits"? People have been doing this for decades.
No one knows the answer to your last two questions, but there is no indication that this program is leading there.
MuseNet will play an experimental concert today from 12-3p PT livestreamed on our Twitch channel. No human (including us) has ever heard these pieces before, so this will be an interesting showcase of MuseNet’s strengths and weaknesses. Through May 12th, we’re also inviting everyone to try out our prototype MuseNet-powered music generation tool.
I really, really want to see a live band or orchestra play a set of AI-composed pieces, none of which were filtered or manipulated by humans. Just generate MIDI, spit out sheet music (hopefully written well enough that it can be sight-read), and hope for the best. I'd buy a ticket for that, without a doubt!
We'll be playing generated MIDI without human filtering or manipulation today (though playing them synthetically of course)! If you know of an orchestra looking for some music, we could do the rest of what you describe :).
This may be academically interesting, but the music still sounds fake enough to be unpleasant (i.e. there's no way I'd spend any time listening to this voluntarily).
I disagree with this: some of the stuff being played live on Twitch right now I would definitely voluntarily listen to, especially if it was actually performed by people with real instruments and not just midi.
There was a baroque-pop song just now that had a ritardando that almost gave me chills. Probably copied from Chopin, but still.
Yes, the lack of the human factor is very noticeable if you are a musician. I believe it's pretty similar to how grandmasters can tell whether they are playing a human or a bot. Something that's hard to explain. Whether it's better or worse is just subjective.
The concept of how different human intelligence is from "AI" fascinates me, as it would seem to say a lot about the nature of intelligence and how far we are from GAI (pretty darn far).
One thing to keep in mind is that the distance from getting a computer to compose anything at all to getting it to compose at close to human level could be 90% of the distance to getting it to compose at a superhuman level.
Even several months before AlphaGo beat Lee Sedol there were people saying the AI could be good but never great. Now everyone admits it's superhuman.
I agree that getting real texture and subtlety into the work is really hard. However, a really good system has the possibility of being better than any human has ever been, and IMO that's super exciting.
What does "super human" mean for a cultural artifact like music? With a game, "super human" can mean "beats humans". For supervised learning, where there's some label other than human opinion, "super human" can mean "more accurate than a human labeler". But for music composition, you need a difference source of information (listener ratings? sales/streaming plays?) to determine what "better" means, which would be a meaningful structural change rather than just an extension of this technique.
So long as the training set only tells it what human-composed compositions look like (MIDI files of existing music), a system has no signal to find "super human" territory.
You're right that measuring the skill of a composer is hard; however, as you say, listener ratings and sales/streams would be a pretty good metric. Even for songs in the top 100 charts, how high can an AI get? If all the top 10 songs were written by an AI, that would be pretty convincing evidence it was superhuman, I think.
On the issue of generalising: AlphaGo learned Go by watching human-played games and by playing against itself, and it is now better than any human. So it's not true that a system is limited to the skill of the examples it's shown; it's possible to design something that can surpass the examples it learned from.
Yup, this is why I said that the output is somewhat uninteresting musically - there are actually machine-learning music models that do a lot better from that POV, despite being seemingly unambitious in many other ways. (I've mentioned one elsewhere in this thread, but BachBot and DeepBach also deserve mention!)
Agreed. It mostly captures rhythmical elements, which make it sound "like some music", but it soon tends to sound like a broken record. Maybe its decoder was not designed to capture very long-term dependencies, and that makes it sound so "myopic". It's very promising though, and it seems in the future we'll have half-decent pieces suitable e.g. for background music.
Lack of narrative. It's less disturbing if you're in the mode of listening to background music, but for active listening, classical music is so much richer.
I believe they selected some of the best-sounding pieces. Playing with their tool (which is a lot of fun) is more hit-and-miss: some interesting results, but also some slightly unpleasant music.
Something I'm curious about: If I make some music I really like through this tool, do I own the copyright to that? Can I turn generated music into an album and sell it? I'm not sure if the site does caching, but if it does and another person and I generate the same music, do we both own rights to that?
The legal consensus, such as it is, seems to be that (if you did not otherwise agree to a contract/license modifying this in arbitrary ways) you create a new copyright & own it if you use their music-editing tool to tweak settings until you got something you liked, because you are exercising creative control, making choices, and engaging in labor. On the other hand, if you merely generated a random sample, neither you nor anyone else own a copyright on it.
How do you prove that too? Maybe in the future somebody creates a random painting in FuturePaintGAN(TM)+3DCanvasPrinter(TM) that moves millions of people to tears and sells for hundreds of thousands in some auction house. Is that their IP?
What if that person is a monkey[1]? Is it "animal-made art"[2]?
How do you prove anything to a court about who owns a copyright? You present what evidence you have and marshal what arguments you can, and hope that the person in the right prevails.
As your own links indicate, animals have no copyrights, any more than a computer program would, because they are not human, and copyright is explicitly granted to human creative efforts.
I think it will be interesting to see whether there's a legal difference between a person being inspired by obviously copyrighted material and creating something similar, and a machine learning model that is trained on the same.
If I look at the Marvin Gaye Blurred Lines case: https://en.wikipedia.org/wiki/Blurred_Lines#Marvin_Gaye_laws... which took years to resolve involving humans, I wouldn't personally risk releasing machine-generated music that was trained on copyrighted music.
> Doable with a single 1080ti and a couple of hundred midi files?
A 1080ti would probably require something like several days or a week. It depends on how big the model is... Probably not a big deal. However, a few hundred MIDI files would be pushing it in terms of sample size. If you look at experiments like my GPT-2-small finetuning to make it generate poetry instead of random English text ( https://www.gwern.net/GPT-2 ), it really works best if you are into at least the megabyte range of text. Similarly with StyleGAN, if you want to retrain my anime face StyleGAN on a specific character ( https://www.gwern.net/Faces#transfer-learning ), you want at least a few hundred faces. Below that, you're going to need architectures designed specifically for transfer learning/few-shot learning, which are designed to work in the low _n_ regime. (They exist, but StyleGAN and GPT-2 are not them.)
I'm very into both music composition/production and ML, so this is neat. That being said, I think the "computer generated music" path is probably the wrong approach in the short term. Music is really complex and leans heavily on both emotions and creativity, which aren't even on the radar for AI. Being able to dynamically remix and modulate existing music is still really cool though.
I would kill for a VST tool that would take a set of midi tracks, and synthesize a new track for a specific instrument that "blends" with them. I would also kill for something that can take a set of "target notes" and break them up/syncopate/add rests to produce good melodies, or take a base melody and suggest changes to spice it up.
I don't think AlphaGo is a good example here. It plays with strange brilliance, but I wouldn't call that creativity. If you care to elaborate on why you said this, I'm curious to hear your rationale.
To me, creativity is really about generation of "aesthetic novelty" which is hard to get from a ML algorithm that is trying to approximate patterns in training data. Eventually, there will be models trained on a wide variety of art, music, stories, etc that can recognize aesthetic and structural isomorphism between mediums (say between a grizzly picture and death metal), then we'll lose our competitive advantage. I don't think we're nearly so close to that as the singularity types would have us believe though.
I suppose I was loosely defining creativity as "the ability to generate novelty or novel insight". People call AlphaGo creative because it demonstrated a new and novel way to play that has since influenced others. AI music will eventually teach us things about music in a similar way.
Theorists of creativity, e.g. Margaret Boden, don't limit it to aesthetic domains. They include creativity in Go, in mathematics, even in coding. Marcus du Sautoy has a recent intro book.
One could say music is also a game with a strict ruleset. Indeed, having played the piano for 20+ years now, I often think about how, no matter how good I am, I won't invent a new form of music or even a new style of music; I am limited to the 'rules' of my piano.
One could not say that. The question is not about creating new styles or "forms" (??) of music. It is about creating compositions. I was only objecting to your use of AlphaGo as being based on "emotions" and being "creative". It is not generating explanatory knowledge, which is the kind of creativity we actually care about when we talk about AI. With current methods, you can only solve problems that have narrow optimization goals which are easily defined.
You might be. That hardly means that music is. Music is whatever humans think is cool to listen to.
Actually I think "inventing new forms of music" is a pretty great musical Turing test. How much neural-network training would it take to make an AI that can take the sum total of existing music, extrapolate the rules, and then deliberately break those rules in such a way as to make something that humans would find interesting?
That doesn't surprise me, if all you do is follow the rules of music theory on a standard piano. If you creatively deviate from music theory and modify your instrument (or create a new instrument) you could easily come up with something new, the question is whether anyone besides you would enjoy it.
To be fair, I recently watched a video with Dr Hannah Fry in which she played real vs AI music samples, and while the real human music was more complex, the AI composed music, though simpler, was still good and at first, hard to pick out.
This thing sounds awesome! After hearing it, I am truly amazed at the level to which AI has evolved! It's good music... Heck, it's not just good music, it's potentially borderline (with some tweaks by human professionals) great music!
But, despite this potential greatness, if there's a problem, it's that this AI only produces music...
What this AI composer really needs is an AI lyricist to write lyrics for the songs it composes!
Sort of like an AI Lerner to its AI Loewe...
An AI Hammerstein to its AI Rodgers...
An AI Gilbert to its AI Sullivan...
An AI Tim Rice to its AI Andrew Lloyd Webber...
An AI Robert Plant to its AI Jimmy Page...
An AI Keith Richards to its AI Mick Jagger...
An AI Paul McCartney to its AI John Lennon...
An AI Bernie Taupin to its AI Bernie Taupin...
An AI James Hetfield to its AI Lars Ulrich...
An AI Weird Al Yankovic... to its... AI Weird Al Yankovic... <g>
You know, an AI Assistant for this... AI Assistant... <g>
Well, an AI Assistant to write lyrics that is... An AI "Lyrcistant"... <g>
Come on, I know there's someone in the AI world who can do this! But it might be a bit challenging... the AI would not only have to write poetry, but it would have to match that poetry to all of the various characteristics of the music...
Not an easy task, to say the least!
But, for the right AI researcher... an interesting, challenging, worthy one!
(I think I hear 2001's "Daisy Bell" playing in the background...)
By the way... disclaimer: I am an AI. That is, An AI wrote this message on HN.
No, I'm kidding about that! But... how would you know? (insert ominous sounding music here) <g>
While I admire the effort, the music sounds quite unpleasant...
Jukedeck has significantly better AI-generated music, but since I have not found a description of how their model works, it is hard to compare it to this.
>nothing sounds better to me than the music I used to listen to when I was a teenager.
Tangential, but listening to the music from your teen years is a form of therapy for people who have developed dementia. It's possible the music we imprint on in our teen years holds a special value in our brains.
Impressive. Would this model benefit from something like "dilated attention"? Instead of feeding it raw sound samples, we could split the input into 16 sec, 8 sec, 4 sec and so on slices, assign each slice a "sound vector" serving as a short description of that slice and let the generator take those sound vectors as input. This should supposedly let it gain global consistency in output.
Now an unpopular opinion. I'm not an ML expert, so take my words with reasonable skepticism. This fancy GPT-2 model diagram can impress the uninitiated, but we are initiated, right? There is really no science there, and it's still the good old number grinder: an input of fixed size is passed through a big random pile of matrix multiplications and sigmoids and yields a fixed-size output. We could technically replace this nice-looking GPT-2 model with a flat stack of matmuls and tanhs with a ton of weights and, given enough powerful GPUs (that would cost tens of millions), train that model and get the same result. It just won't make an impression of science. How are these GPT-2 models designed? By somewhat random experiments with the model structure. The key here is the GPU datacenter that can quickly evaluate the model on a huge dataset. The breakthrough would be achieving the same quality with very few weights.
> Instead of feeding it raw sound samples, we could split the input into 16 sec, 8 sec, 4 sec and so on slices, assign each slice a "sound vector" serving as a short description of that slice and let the generator take those sound vectors as input.
I didn’t quite get it. How would you feed this variable sized input?
To illustrate more this idea, let's use soundtrack v=negh-3hi1vE on youtube. Such soundtracks consist of multiple more or less repeating patterns. The period of each pattern is different: some background pattern that sets the mood of the music may have a long period of tens of seconds. The primary pattern that's playing right now has a short period of 0.25 seconds, plays for a few seconds and then fades off. The idea is to split the soundtrack into 10 second chunks and map each chunk to a vector of a fixed size, say 128. The same thing we do with words. Now we have a sequence of shape (?, 128) that can be theoretically fed into a music generator and as long as we can map such vectors back to 10 second sound chunks, we can generate music. Then we introduce a similar sequence that splits the soundtrack into 5 second chunks. Then another sequence for 2.5 seconds chunks and so on. Now we have multiple sequences that we can feed to the generator. Currently we take 1/48000th second slices and map them to vectors, but that's about as good as trying to generate meaningful text by drawing it pixel by pixel (which we can surely do and the model will have 250 billion weights and take 2 million years to train on commodity hardware).
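The splitting part of that idea might look roughly like this; encode_chunk is a placeholder for whatever model produces the "sound vector" (here it just averages the chunk so the sketch runs):

```python
# Multi-resolution chunking: one embedded sequence per chunk length.
import numpy as np

SR = 48000  # samples per second

def make_sequences(wav, chunk_secs=(10, 5, 2.5),
                   encode_chunk=lambda c: c.mean(keepdims=True)):
    seqs = {}
    for secs in chunk_secs:
        n = int(secs * SR)
        chunks = [wav[i:i + n] for i in range(0, len(wav) - n + 1, n)]
        seqs[secs] = np.stack([encode_chunk(c) for c in chunks])  # (num_chunks, vec_dim)
    return seqs

wav = np.random.randn(60 * SR)  # one minute of "audio"
for secs, seq in make_sequences(wav).items():
    print(f"{secs}-second chunks -> sequence of shape {seq.shape}")
```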
The same way we map words to vectors or entire pictures to vectors. We'll have another ML model that would take 1 second of sound as input (48000 1-byte numbers) and produce a vector of, say, 128 float32 numbers that "describes" this 1 second of sound.
This would rule out such common mapping methods as word2vec, because unlike words, the vast majority of 1-sec chunks of audio would be unique (or only repeat within a single recording).
Sure, we can probably find a way to map two similar chunks to two similar vectors. However, with a 1:1 mapping the resulting vectors will be just as unique. That's a problem because, if you recall, we want to predict the next unit of music based on the units the model has seen so far. Training a model for this task requires showing it sequences of encoded units of music (vectors), where we must have many examples of how a particular vector follows a combination of particular vectors. If most of our vectors are unique, we won't have enough examples to train the model. For example, if we show the model multiple examples of the phrase "I'm going to [some verb]", it will eventually learn that "to" after "I'm going" is quite likely, that a verb is more likely after "to" than an adjective, etc. This wouldn't have happened if the model saw "going" or "to" only once during training.
Can we diff spectrograms to define the "distance" between two chunks of sound and use this measure to guide the ML learning process?
Would it help to decompose sound into subpatterns with Fourier transform?
Afaik, there is a similar technique for recognizing faces: a face picture is mapped to a "face vector". Yet this technique doesn't need the notion of "sequence of faces" to train the model. Can we use it to get "sound vectors"?
I'm not sure what would be useful "subpatterns" of sound. In language modeling, there are word based, and character based models. Given enough text, an RNN can be trained on either, and I'm not sure which approach is better. For music the closest equivalent of a word is (probably) a chord, and the closest equivalent of a character is (probably) a single note, but perhaps it should be something like a harmonic, I don't know.
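A quick, toy illustration of those two granularities (this is not MuseNet's actual encoding, just made-up note data):

```python
# "Character-level" = one token per note event; "word-level" = one token per chord.
from collections import defaultdict

notes = [("C4", 0.0), ("E4", 0.0), ("G4", 0.0),   # a C major chord at beat 0
         ("B3", 1.0), ("D4", 1.0), ("G4", 1.0)]   # a G major chord at beat 1

note_tokens = [pitch for pitch, _ in notes]

by_onset = defaultdict(list)
for pitch, onset in notes:
    by_onset[onset].append(pitch)
chord_tokens = ["+".join(sorted(ps)) for _, ps in sorted(by_onset.items())]

print(note_tokens)   # ['C4', 'E4', 'G4', 'B3', 'D4', 'G4']
print(chord_tokens)  # ['C4+E4+G4', 'B3+D4+G4']
```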
Unlike faces, music is a sequence (of sounds). It's closer to video than to an image. So we need to chop it up and to encode each chunk.
Ultimately, I believe that we just need a lot of data. Given enough data, we can train a model which is large enough to learn everything it needs in an end-to-end fashion. The primary achievement of the GPT-2 paper is training a big model on lots of data. In this work, it appears they only used a couple of available MIDI datasets for training, which is probably not enough. Training on all available audio recordings (either raw, or converted to symbolic format) would probably be a game changer.
The same way we feed the variable size sequence of characters or sound samples into this RNN. Instead of raw samples at the 16 kHz rate, we'll have one sequence of 1 sample per second, another sequence of 1 sample per 0.5 seconds and so on. We can go as far as 1 sample per 1/48000 sec, but I don't think it's practical (but this is what these music generators do).
We can think of a ML model that takes 1 second of sound as input and produces a vector of fixed length that describes this sound:
S[0..n] = the raw input, 48000 bytes per second of sound
F[1][k..k+48000] -> [0..255], maps 1 second of sound to a "sound vector".
F[2][k..k+96000] -> ..., same, but takes 2 seconds of sound as input
Now instead of the raw input S, we can use the sequences F[1], F[2], etc. Supposedly, F[10] would detect patterns that change every 10 seconds. It's common in soundtracks to have some background "mood" melody that changes a bit every 10-15 seconds, then a more loud and faster melody that changes every 5 seconds and so on, up to some very frequent patterns like F[0.2] that's used in drum'n'bass or electronic music in general.
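Concretely, one of these F maps could be a small encoder that squeezes a fixed-length chunk of audio down to a vector; the architecture and all sizes below are made up for illustration:

```python
# Sketch of an F[1]-style map: 1 second of audio (48000 samples) -> 128-dim vector.
# In practice this would be trained as part of an autoencoder or similar.
import torch
import torch.nn as nn

class ChunkEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=16), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=16, stride=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                  # collapse the time axis
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, wav):                           # wav: (batch, 48000)
        h = self.net(wav.unsqueeze(1)).squeeze(-1)    # (batch, 64)
        return self.proj(h)                           # (batch, 128)

enc = ChunkEncoder()
print(enc(torch.randn(4, 48000)).shape)               # torch.Size([4, 128])
```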
This is how music is composed by people, I guess. Most of the electronic music can be decomposed into 5-6 patterns that repeat with almost mathematical precision. The artist only randomly changes params of each layer during the soundtrack, e.g. layer #3 with a period of 7 seconds slightly changes frequency for the next 20 seconds, etc.
Masterpieces have the same multilayered structure, except that those subpatterns are more complex.
I'm not an ML guy, so can't say if this is an autoencoder.
We can combine multiple sequences in any way we want. Obviously, we can come up with some nice looking "tower of lstms" where each level of that tower processes the corresponding F[i] sequence: sequence F1 goes to level T1 which is a bunch of LSTMs; then F2 and the output of T1 go to T2 and so on. The only thing that I think matters is (1) feed all these sequences to the model and (2) have enough weights in the model. And obviously a big GPU farm to run experiments.
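Very roughly, something like this; all sizes are invented and the alignment between levels is deliberately naive, it's only meant to show the wiring:

```python
# "Tower" sketch: level T1 consumes the finest sequence F1; each higher level
# consumes its own F sequence concatenated with a downsampled copy of the
# level below's output.
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, levels=3):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.LSTM(feat_dim if i == 0 else feat_dim + hidden, hidden, batch_first=True)
            for i in range(levels)
        )

    def forward(self, fs):  # fs: list of (batch, seq_len_i, feat_dim), finest first
        prev = None
        for lstm, f in zip(self.levels, fs):
            if prev is not None:
                step = prev.size(1) // f.size(1)          # naive temporal alignment
                prev = prev[:, ::step, :][:, : f.size(1), :]
                f = torch.cat([f, prev], dim=-1)
            prev, _ = lstm(f)
        return prev

tower = Tower()
f1 = torch.randn(2, 64, 128)   # e.g. 0.25-second chunks
f2 = torch.randn(2, 32, 128)   # 0.5-second chunks
f3 = torch.randn(2, 16, 128)   # 1-second chunks
print(tower([f1, f2, f3]).shape)   # torch.Size([2, 16, 256])
```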
Ok, but if we are using a hierarchical model like multilayer lstm, shouldn’t we expect it to learn to extract the relevant info at multiple time scales? I mean, shouldn’t the output of T1 already contain all the important info in F2? If not, what extra information do you hope to supply there via F2?
T1 indeed contains all the info needed, but T1 also has limited capacity and can't capture long patterns. T1 would need hundreds of billions of weights to capture minute-long patterns. I think this idea is similar to the often-used skip connections.
But the job of T1 is not to capture long term patterns, it’s to extract useful short scale features for T2 so that T2 could extract longer term patterns. T3 would hopefully extract even longer scale patterns from T2 output, and so on. That’s the point of having the lstm hierarchy, right?
Why would you try to manually duplicate this process by creating F1, F2, etc?
The idea of skip connections would be like feeding T1 output to T3, in addition to T2. Again, I’m not sure what useful info F sequences would supply in this scenario.
This sounds reasonable, but I think in practice the capacity of T1 won't be enough to capture long patterns and the F2 sequence is supposed to help T2 to restore the lost info about the longer pattern. The idea is to make T1 really good at capturing small patterns, like speech in pop music, while T2 would be responsible for background music with longer patterns.
Don't we already do this with text translation? Why not let one model read printed text pixel by pixel and the other model produce a translation, also pixel by pixel? Instead we choose to split printed text into small chunks (that we call words), give every chunk a "word vector" (those word2vec models), and produce text also one word at a time.
Not very impressive/interesting, they aren't releasing their code or models, and it seems that they had to use lots of rather obscure hacks to make the whole multi-instrument thing work. (Though I suppose it definitely is impressive in other ways, especially hacking their existing GPT architecture to do something worthwhile with non-text tokens. So, I'm not saying that any sort of work should be dismissed outright!)
I do still think that the good old https://github.com/hexahedria/biaxial-rnn-music-composition (Hexahedria's Biaxial RNN/Tied Parallel Networks, which also published a paper achieving SOTA, at least for that time, on a variety of MIDI datasets) has more interesting/compelling output musically, despite starting from a comparatively tiny dataset and using rather elegant and easily understandable convolution- and RNN-based techniques. Too bad that the implementation relies on Theano, which is quite endangered at this point (it doesn't seem to support up-to-date Python 3?), but I do think it's a compelling starting point if anyone really wants to work in this domain.
I expect the next major copyright law legal battle will be around the question of whether content generated by some machine learning approach that was trained on copyrighted data is a derivative work or not.
As an amateur video producer constantly on the lookout for even some basic music (as in not Grammy-winning, or even "artist" quality) to add to my work, music that I don't have to pay a fortune to license but that isn't incredibly unoriginal either, this path has me excited.
My project is currently on hold, but it will definitely receive updates in the future. It's kind of a for-fun project with no hope of monetization, given that the space is already crowded with Google's Magenta, and now OpenAI is joining the dance with MuseNet.
ML models that generate music continue to improve. However, to me, the result still just seems like a novelty and nothing more. Perhaps it will take much more iteration and improvement in compute power before I hear a piece of music that I find truly inspiring that was generated by an AI.
On the other hand, I've been much more impacted by visual "art" already being generated by AIs. Perhaps the musical medium is ironically harder to crack since the format is simpler, rather like how it took longer for an AI to defeat an expert Go player than a chess player.
Wondering if synthesizing a single instrument rather than an entire MIDI file might be better, as each note could be evaluated via finite differentials. The Transformer would take as input a single note from, say, violin or guitar, and "translate" the pitch and envelope to the new sound. The physical simulation of music, rather than notational logic. The problem of long-range order in the composition still remains unsolved, unless the model accounts for the importance of transitions in music theory. In Neural Machine Translation, characters, words, even grammatical rules themselves can pre-define such an implicit structure.
Enjoyable to the trained mind, to be sure. Especially as a moment in history. But the tunes themselves lack soul. And it was jarring for me, as the background music I had turned off to catch the tail end of MuseNet's performance was at a particularly feverish level of human expressiveness ;)
Bad Brains - Live at the CBGB's 1982 (Full Concert)
What speaks to me here is this is one of the closest things we have to creativity in AI and while AGI is a ways away, this is a key step, even though it may not seem like some hard tech we can put to use in industry immediately.
I think it’s a subtle distinction, but imagine a model where we can throw in thousands of math proofs then give the model some initial assumptions and just let it run wild. I think getting a neural net to model the creative spark / the ingenuity is what has been missing.
A part of me doesn’t want to believe that it is possible, but a part of me is genuinely curious as to the consequences.
I think this has great potential for content creators and their use of royalty-free music. It might just help with a lot of the big copyright issues, and distinguish each person's work with a distinct score.
As a sometimes composer, the thing that makes me most sad about this stuff is that it's all computer or no computer. Deep learning with symbolic stuff is still in its infancy, so I don't expect to really mix this with hand-composed material anytime soon. As far as I understand it as a layman, the way the input is compressed for learning is also how it's fed the sample, so there's no way to "pay more attention" to the piece you're augmenting than to the pieces in the training data.
Can someone live stream to Twitch, collect likes or comments, and then feed back and make a GAN trained on our collective response to the musical output?
I would LOVE to see where that goes... Is it going to turn into 4 chord pop? Or maybe more dominant, more resolve-y? Or maybe my assumptions will be wrong and we would collectively train for more complex music?
Feels like this kind of thing would be useful for generating better jazz backing tracks. The programmatic backing tracks you get these days are good, but can be really repetitive. I wonder if there's some way to force the model to follow changes.
Has anyone attempted to make a code autocomplete or snippet/boilerplate generator using this? Would be nice when coding on mobile, where it's hard to move between code and navigation due to the small screen.
> MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text.
Perfect for YouTube ads that must have obligatory music, complete with the background drone of "Pond 5... Pond 5..." because the producer can't pitch in 5 bucks for it.
How does licensing work here? Do they have the right to "merge" that music into their network weights? Is there a specific license for usage in a neural net?
Seems to have a hard time being the fifth Beatle.
I couldn't even get a cheap Beatles feel, much less a sublime melody.
However, some of the piano pieces are pleasant.
I just read a tweet today by one of the React Rally spokespeople about how Hacker News is a toxic forum... yet it's one of the least toxic, most data-oriented forums I've used in years.
As a web developer, all these "do it yourself" website systems make me sick. The end results look amateur and messy most of the time, and it's an insult to the industry to give people with no visual or code-based skills the tools to create sub-par websites.
I'll bet talented musicians and real composers feel the same about MuseNet. The music probably makes them cringe to their core.
Then there is the general public...
They don't notice anything wrong with do-it-yourself websites, and this music sounds amazing.
Quite true. I've found this to be the case with almost every "composition AI" system ever created, with the exception of those that aid human composers, such as those designed by David Cope.
> Neural network generating technical death metal, via livestream 24/7 to infinity. Trained on Archspire with modified SampleRNN. Read more about our research into eliminating humans from metal: https://arxiv.org/abs/1811.06633
Despite this being a cool feat, it sounds so close to the source material (down to parts of the songs being a lo-fi 1:1 copy-paste) that I don't see a real point in it even existing.
(except "omg death metal neural nets lol" factor)