I like that this shows how hard even conceptually simple ideas are to achieve in fine-tuning LLMs. Even given a pretty good starting dataset, a decent starting model, etc. this appears to have been a challenge.
One thing it did make me think about was that these models are suitable for things that don't have a natural definitive answer. That is, picking the perfect card given a set of picks is probably combinatorially impossible to solve. But picking a good card given a set is possible and LLMs can approach human level performance.
I think this leads to a set of problems that current LLMs may be fine-tuned to solve.
That lines up with my experience: for high-stakes decisions, they rarely give me a great answer. But for low-stakes decisions, they do well at giving me a good-enough answer. For example, I've been using them to help find gifts for friends and children this month. I don't need the best choice to solve the problem, just a good one.
How much additional calculation do individuals put into high-stakes decisions? Also, what is the variability in the quality of humans' high-stakes decisions?
I'm guessing LLM decision quality is rather average, but that the LLM has no easy way of spending the extra time to gather information around such high-stakes decisions the way a human would.
A random sampling of things GPT-4 has helped me with lately:
Where are the dates in Whole Foods? (A: with nuts, not fruits and veggies)
How can I steam bao without a steamer basket? (A: saucepan, 1" water, balled up aluminum foil, plate, baos, lid)
Any guess as to when this photo was taken? It looks like anywhere from the 70s to the 90s. (A: the photo paper has a logo that postdates a 2003 company merger)
Generating content for tabletop gaming with my friends (especially wacky ideas, like character names themed after items on the Taco Bell menu)
I had to buy some spare tools where I cared more about price than quality and it helped me choose some suitable brands
As mentioned, you can tell it a bit about a person (and feed in their wishlist if they have one) and it'll help you pick something they'll probably like
Finding something to do to spend an afternoon in a city while traveling
In general, anything where there is no objective best answer (meaning I can ask it to generate multiple possibilities and filter out the bad ideas) and where I value speed over correctness.
> Generating content for tabletop gaming with my friends (especially wacky ideas, like character names themed after items on the Taco Bell menu)
I've been going ham with this. I pay for ChatGPT Plus for work, so GPT-4's been helping me design encounters, plan sessions, and brainstorm ideas for one-shots. It gives me a sort of bland, vanilla idea for something, I suggest a twist on it, it goes, "oh great idea, here that is with your changes:" and I iterate with it from there.
Likewise I love theming characters, plotlines, and settings after songs, bands and albums, so I'll dump in a bunch of lyrics and ask ChatGPT to help me intertwine subtle references into descriptions, names, and plot points.
>I like that this shows how hard even conceptually simple ideas are to achieve in fine-tuning LLMs. Even given a pretty good starting dataset, a decent starting model, etc. this appears to have been a challenge.
Surely, but we can't gloss over the fact that this was accomplished by a single person.
Yes and no I think. I've seen individuals achieve things in their bedrooms that would make most corporations blush. Demoscene type stuff comes to mind as an example. Often a single person can become hyper obsessed with achieving some goal and in the absence of any interference can achieve something impressive beyond what can be achieved within a company.
Consider a PM involved in this project, feeding in requirements from a business. Instead of the "just get it done at any cost" mentality of a single person you would have KPIs and business objectives that would muddy the water.
I just mean to say that there is a gulf between what a single hacker in their basement can do, with no constraints other than their imagination, and what can be accomplished by a business. Sometimes the single-hacker achievement doesn't scale.
So, it is impressive that this is possible for a single person at all. But, from a business/operation perspective, I don't actually think that is as relevant as it may seem.
It's not the most revolutionary change to our daily lives, but I do genuinely look forward to playing against bots that have interesting play styles for games like Magic: the Gathering. I think this is a clear case where it could drastically improve the ability for the R&D team to come up with and test new mechanics at different levels of play.
> With that data, you can extract “ground truth” by looking at the draft picks made by the best players on the service (sorted by win rate).
Do you mean that you are looking at the draft picks from https://www.17lands.com/leaderboard and then sorting by Win Rate? Didn't you mean to choose Match Wins or Trophies? Otherwise, you're not measuring the best players on the service. You're training on draft choices where most choices were very good - i.e., win rate sort will show you the luckiest players, not the best ones. That will naturally show up in any validation or testing you do too.
Shouldn't this be compared not to an LLM baseline, but to a baseline where an "Elo" style score is computed for each card compared to others from the 17lands data; then, until you have two colors, suggest the best scoring card, or when you do have color(s), suggest the best scoring card within that color or a land?
I think it is possible for the LLM to have some semblance of rules knowledge, but it is more likely that it is picking up on card rarity, costs and "Big" more than anything else for unseen cards.
Your "accuracy" on the draft seems poor. I'm not sure it means what you think it means. Are you saying that when looking at the high win rate choices, where all the choices were mostly good, you happened to pick the choice that isn't the same as the player who originated the data? It actually seems harder to make a choice among all good choices.
> Do you mean that you are looking at the draft picks from https://www.17lands.com/leaderboard and then sorting by Win Rate? Didn't you mean to choose Match Wins or Trophies? Otherwise, you're not measuring the best players on the service. You're training on draft choices where most choices were very good - i.e., win rate sort will show you the luckiest players, not the best ones. That will naturally show up in any validation or testing you do too.
Ahh no just unclear in the post, I'm filtering to players in 17lands with a > 62% match win rate who are drafting at a high ranking (>=diamond rank). I look at all of those players' drafts though, even the ones where they do poorly.
> Your "accuracy" on the draft seems poor. I'm not sure it means what you think it means. Are you saying that when looking at the high win rate choices, where all the choices were mostly good, you happened to pick the choice that isn't the same as the player who originated the data? It actually seems harder to make a choice among all good choices.
Accuracy here is making the same choice from a given pack as one of the good players. Obviously subjective so not a perfect metric, but a decent check on ability to emulate a high-quality drafter.
In Elo-like matchmaking, you typically pair people such that each has roughly a 50% chance to win. Therefore, as the OP says, filtering down to people with a high (60+%) lifetime win rate creates some sort of (interesting) bias.
I would select from all games played at a sufficiently high level.
They don't fully use Elo for matchmaking. There's a league system, and you get matched with players in your league. The ranks reset frequently, too.
Edit - I did the math. From the data on the MTG Elo Project, top Magic players have about a 70-75% game win percentage over an average tournament player. They have the top player at ~2300 Elo with the average being around 1500 (in matches), and have scaled the Elo system so that a 200 point gap is a 60% chance to win a best-of-three match (this is NOT the same as Chess Elo scoring).
Hmm, but that will filter out more than half the players on the Match Wins and Trophies based leaderboards, many of them Diamond and Mythic. So I think your choice of 62% match win rate is almost certainly disproportionately selecting for people who received very good draft choices, even if it includes some actually very good players in the data set.
I mean 62% might feel like a good number, but it's arbitrary, you'd have to justify how you chose it, and just eyeballing it, it is filtering out a lot of very good players with many, many more match wins.
Perhaps you can sort by Latest Rank, and filter out people with 2 or fewer trophies. Or you will have to validate with known bad draft choices in the prompt, to see what it does. Suffice it to say, I still don't think the 17Lands data represents what you think it does.
Like without a direct discussion about measuring and accounting for luck in the draft... for all I know the data is seriously flawed. It probably isn't, but it's maybe one of many, many issues to address when dealing with strategy card game AI problems.
Maybe still not clear: I'm selecting players with a 62% lifetime win rate so mostly players who have been good over a larger number of drafts!
Definitely not perfect data though, and agree that defining good in this context is hard -- a lot of the variance of "good" depends on how you play the cards either way. All good points!
> I'm selecting players with a 62% lifetime win rate so mostly players who have been good over a larger number of drafts!
Hmm, but there are players with greater than a 62% lifetime win rate who have played very few drafts, and there may be many of those players in your data... do you see? Win rate alone isn't a good filter. You chose it, you are trying to justify it, and I'm not convinced, not without the hard numbers.
I'm not confused about which filter you chose. I just think it's a bad filter, and you haven't thought very deeply about how it affects the data, which presumably includes your test and validation data - however you're choosing to test and validate, apparently by hand, from some eyeballed examples.
Anyway I think you have to compare with a non-LLM, non-random baseline to have any sense if this stuff is working at all. I could be dead wrong. I would maybe compare with a community draft picker.
Data selection depends on the use-case. Two contrasting use-cases I see are:
- Emulation
- Advisor
In case of MTG player emulation for example, I think it makes sense to group data by some rankable criteria like winrate to train rank-specific models that can mimic players of each rank.
Thanks for the write-up. Rather than zeroing out the loss for the prompt, did you also try using a weighted loss with Axolotl? At one point, Microsoft's GPT-3 docs suggested this was beneficial when the responses are short (like you have with "Cut in."). Domain adaptation over subreddits/forums before finetuning may help as well.
Related comment from gwern: https://news.ycombinator.com/item?id=38438859. Can't find the docs now - I think they were the old GPT-3 ones - but they suggested a low value somewhere between 0.01 and 0.1.
Also - why qlora rather than a full finetune? Using LambdaLabs, it'd cost roughly the same as your quote. Cheaper I think if you're willing to gamble with fp8: https://github.com/mosaicml/llm-foundry/tree/main/scripts/tr.... And fewer hyperparameters to tune as well
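For anyone unfamiliar with the distinction: zeroing out the prompt loss means ignoring the prompt tokens entirely when computing cross-entropy, while a weighted loss keeps them with a small multiplier. A minimal sketch of both, assuming a Hugging Face style causal LM (the `prompt_len` boundary is a hypothetical input marking where the completion starts):

```python
import torch
import torch.nn.functional as F

def completion_loss(logits, input_ids, prompt_len, prompt_weight=0.0):
    """Cross-entropy over the completion tokens; prompt tokens are either
    ignored (prompt_weight=0.0) or down-weighted (e.g. 0.01-0.1)."""
    shift_logits = logits[:, :-1, :]  # position t predicts token t+1
    shift_labels = input_ids[:, 1:]
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)

    weights = torch.ones_like(per_token)
    weights[:, : prompt_len - 1] = prompt_weight  # positions whose target is still a prompt token
    return (per_token * weights).sum() / weights.sum().clamp(min=1e-8)
```

In practice the training framework usually handles this for you (e.g. by setting prompt labels to -100, which Hugging Face's loss ignores); the weighted variant just keeps a small nonzero contribution from the prompt.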
If I'm reading the author's writeup correctly, the prompt he's giving the agent at each pick contains only the names of the cards in its pool so far, and only gives the full text for the cards in the pack it's being passed. It doesn't look like context is being maintained between picks, presumably for context window size reasons.
If so, and if he's correct in his assumption that these sets are out of the bot's training cutoff window, then surely it's purely coincidence if it ends up being a good drafter? The bot would have literally no way to know what cards work well with its previous picks, what signals have been sent and received in the draft so far, etc. Not even the best human player could take (for example, from the sample prompt) "Gadwick's First Duel -- {1}{U} (uncommon)" and figure out what works well with that (if they've never seen the card before).
It would just end up picking generically good draft cards that share a color with its previous picks. Which is already what pick-order-based heuristics have always done.
> If I'm reading the author's writeup correctly, the prompt he's giving the agent at each pick contains only the names of the cards in its pool so far, and only gives the full text for the cards in the pack it's being passed. It doesn't look like context is being maintained between picks, presumably for context window size reasons.
Not quite -- there's a few ways the model learns the full card text:
* The models are trained on card trivia completions as well, where they're asked to complete the full text of a card plus information about it (type, CMC, etc.)
* The models do still have to learn next token completion on the cards in packs, meaning they learn to predict the full text of the cards while making draft picks as well.
Net net, the bots learn the text of the new cards pretty comprehensively.
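To illustrate what I mean (this is a paraphrase of the rough shape of the data, not the exact format from the post):

```python
# Hypothetical shapes of the two kinds of finetuning examples described above.
draft_pick_example = {
    "prompt": "Cards in pool (names only): ...\n"
              "Color summary: W=4, U=3, ...\n"
              "Cards in pack (full text): Gadwick's First Duel -- {1}{U} (uncommon) ...\n"
              "Which card do you pick?",
    "completion": "Gadwick's First Duel",
}

card_trivia_example = {
    "prompt": "What is the full text, type, and CMC of Gadwick's First Duel?",
    "completion": "Gadwick's First Duel -- {1}{U} ... (type, CMC, rules text)",
}
```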
And also, since it seems you're the author, can you also clarify if your methodology allowed for the bot to track signals outside of the color-identity-count summary statistic you pass in the prompt? Something like allowing it to notice that a card has wheeled, or that a certain synergy piece was passed a few picks ago.
Only the statistics you see in the prompt (which are clearly limited). I have a lot of ideas about how you could improve that context (most likely letting the AI record and track notes throughout a draft), but this one was relatively simple to implement. Definitely room for improvement!
In case you didn't see it, https://news.ycombinator.com/item?id=38525978 (I hacked Magic the Gathering: Arena for a 100% win rate) may interest this audience, if for no other reason than that the investigator discovered that Sparky, the pseudo-AI in MTGA, doesn't appear to be as stupidly complicated as one may have suspected from the outside.
Sparky is the Arena AI, but no one ever accused it of being a good Arena AI - it is very much only there for the new-player experience of playing against a dumb computer when you're first exposed to the game and don't know the rules, or for the computer equivalent of "playing against a goldfish" with a deck you made, to see how it draws or combos. It's not a Chess CPU.
I hope I also did not accuse it of being good, but the observation I was trying to make is that -- according to the article, I have not myself confirmed the claim -- they run the card evaluation logic and gameplanning locally, not in a data center full of H100s, which I consider to be quite a feat given the free-text-y self-modifying rules of M:TG
One of the big things to note is that Sparky plays very basic decks, with few complicated cards and combos. Rules-based AI could definitely play at a basic level using a beatdown strategy, but give it some sort of control/combo deck and it would struggle.
Hearthstone allegedly had bots hit Legend rank back in 2014 playing Aggro Shaman. Those were believed to be pretty simple rule-based bots, so yeah, +1, that approach gets you pretty far if you have decks crafted for it.
Unless I'm misreading something, the linked paper appears to use one-hot encoding to represent each card -- not any learned embedding -- unless I'm misunderstanding what you mean by "representation learning"?
I hadn't seen this; it's awesome! You'd think, given the volume of data available, that this type of method would outperform an LLM. Cool results.
Still, there are some fun things about LLM representations -- you can give the bots preferences / a personality in a system prompt, which is entertaining!
I wonder if you could use a smaller model or get better results if you treated each card as a token, gave the state of the draft as an input, and had the predicted token be the card to pick. You would have to train from scratch with a custom tokenizer.
I tried adding special tokens for a reddit-style dataset once. The format was: `<|post_author|>username<|post_title|>title here...`
The resulting model was so much worse than just formatting everything plaintext. This was with MPT-30B, 15 special tokens, 300M training tokens, and a full finetune.
I may have made a mistake, but I haven't seen any open source finetunes successfully add a large number of tokens yet either.
Try doing the same thing in your dataset, but don't actually add them as "special tokens"; just let them be multiple tokens.
Adding new tokens needs a ton of data to teach the model what each token means. Reusing existing tokens will let you easily teach it that a sequence of tokens now has a new meaning after fine-tuning.
I don't know how many tokens are required to get good results, because I simply didn't mark mine as "special_tokens" due to the issues that I had read about. I got great results, whereas others who have tried special tokens got pretty poor results. I'm sure there is a magic number, but it's just not been worth it for me to explore that area yet.
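To make the two approaches concrete, a minimal sketch (assuming Hugging Face transformers and the Mistral base model; the marker strings are just the ones from the example above):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Option A: genuinely new tokens. Their embeddings start untrained,
# so you need a lot of data before they mean anything.
tokenizer.add_tokens(["<|post_author|>", "<|post_title|>"])
model.resize_token_embeddings(len(tokenizer))

# Option B: leave the markers as plain text. They get split into existing
# subword tokens, and the finetune just learns the pattern from context.
print(tokenizer.tokenize("<|post_author|>someuser<|post_title|>some title"))
```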
I was thinking something fairly similar. You could probably do pretty well with a basic NN setup this way, no need for an LLM. It wouldn't work on "never seen before cards" and would probably make some absurd picks when it's wrong, but I'd bet you could get to 90% accuracy.
It would be interesting to compare to training a NN to draft w/o the Mistral starting point (both by epoch and by $). It's not obvious to me why the LLM component would be relevant. Maybe there are enough deck lists or mock drafts on the internet to have an influence I suppose. Or maybe 'fine tune an llm' just has more infrastructure than 'create a nn'. Maybe we need a nnfiddle to make that easier.
The benefit of the LLMs is that the checkpoint already "understands" a lot by default. Finetuning is relatively cheap and makes many tasks such as this one perform decently well simply by shoving some data into it.
The base checkpoint takes a lot of compute to make, but that's what holds most of its "knowledge", so to speak.
Making a NN from scratch means you'll have to somehow map the cards into inputs. I have limited knowledge of how MTG works, but most TCGs have text descriptions and complex effects. Mapping text to logic is what LLMs are really good at; otherwise you're starting from scratch and will also need a relatively large amount of compute before it starts displaying any kind of decent behaviour.
It's also easy for most software devs to do this - finetuning mostly consists of collecting text and feeding it into a finetuning script. You don't need to know linear algebra, what a "convolution" is, etc. to do finetuning.
Without Mistral, how would you get it to generalize to cards it hasn't seen before? I assume by "training a NN to draft without Mistral" you mean where the input layer is just a bitmapped vector of the cards in the pack, right? The killer feature of this experiment is that it works on sets the model has never seen before and has 0 training data on, using just the text of the card. I don't think you can do that without an LLM.
That's a good point. It looks like the article hints at some success on that front. It'd be interesting to see what that means quantitatively. Interesting that this delta could even be used as a measure of the llm's value.
I'd be curious about the difference in success w/ drafts on a new 2/2 bear with a different name, and cards with a new keyword 'fizzbangitude 7' as well.
I was actually just looking into fine-tuning an LLM for Magic: The Gathering this week -- I've been building a small card-similarity browser using semantic embeddings of cards to find functionally or flavorfully similar cards.
I've just been using InstructorXL, and either Instructor doesn't have enough innate knowledge of the game or I need to work on better prompts, but so far I've tried 9 different prompts and none of them seem to perform very well for generating embeddings.
So my next step was to try and download a dataset of similar cards (I have some ideas on this), and I was trying to see if I could use this to do triplet-loss training of a large embedding model or something.
Aaaaand, that's as far as I've gotten. I haven't actually figured out _how_ to hook all of that up, but your post is extremely inspirational for me. Thank you for posting this!!
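For reference, this is the kind of thing I'm imagining (a sketch using the classic sentence-transformers fit API; the triples here are hand-picked examples of the sort of data I'd want to mine):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# (anchor, functionally similar, dissimilar) card texts
triples = [
    ("Murder {1}{B}{B} Destroy target creature.",
     "Hero's Downfall {1}{B}{B} Destroy target creature or planeswalker.",
     "Giant Growth {G} Target creature gets +3/+3 until end of turn."),
    # ... many more mined triples here
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # or any base embedding model
examples = [InputExample(texts=[a, p, n]) for a, p, n in triples]
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model)

# pulls anchors closer to the similar card than to the dissimilar one
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```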
Search for articles showing you code for fine-tuning Llama 2, ideally including a colab notebook that you can run and modify yourself so that you have real code to work with. You can try to modify their working example to suit your own toy project as a first step.
I like how it identified that you haven't committed to either white or blue yet. It was aware of deck composition and not just going for the jugular. Keep tuning. It could also be human bias, because you also played the hand. Have someone else draft against your LLM, then play it yourself and see if it's the same. Statistically it should match given enough games.
I've seen models learn heuristics that are harmful to real performance, and I wonder how directly accuracy transfers to actually good drafting.
A question: when GPT-4's explanations contradicted the picks, how many of them were in fact correct?
> A question: when GPT-4's explanations contradicted the picks, how many of them were in fact correct?
It was mostly when a card is good in a vacuum but not as good in a specific set. WOE (which this was trained on) skewed pretty aggressive, so GPT-4 tended to overvalue strong expensive cards (compared to what good players thought, at least).
For some reason I thought fine tuning is not possible without specialized hardware (A100 / H100). Where can I learn more about hardware requirements for fine tuning on consumer GPUs?
There is not a lot of great content out there making this clear, but basically all that matters for basic fine tuning is how much VRAM you have -- since the 3090 / 4090 have 24GB VRAM they're both pretty decent fine tuning chips. I think you could probably fine-tune a model up to ~13B parameters on one of them with PEFT (https://github.com/huggingface/peft)
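As a rough sketch of what that looks like with QLoRA-style training (argument names vary a bit between library versions, so treat this as a starting point rather than copy-paste):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,  # base weights quantized to 4-bit to fit in VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small adapter weights get gradients
```

From there it's a normal Trainer / SFT loop on your formatted examples; tools like Axolotl wrap all of this behind a config file.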
Definitely possible on even older off-the-shelf hardware. I use 24GB 4090s for 13b-sized models and have even used 12GB Titans for 7b models, admittedly at much slower rates.
I have a 3080 Ti with 12GB VRAM and would like to try fine-tuning the same Mistral 7B model (which I found incredibly potent). Any tips on how to get started?
Really interesting, thanks for writing this up. I'd love to see this applied to actually playing the game, provided that you could fit a (long) game state in the context window.
> I was particularly interested in testing models’ ability to reason (i.e., perform a somewhat complex task that requires high context understanding) about out-of-distribution (i.e., unseen) data.
I was under the assumption that fine-tuning LLMs was useful only when you need to change the model's tone (speak like a pirate, Voldemort, etc.).
Are there other examples where LLMs were trained to reason a particular way?
Check out Orca. IIRC, it's a technique that aims to encode additional logical capabilities into smaller models by having larger models generate step-by-step solutions to various problems. This doesn't just make them speak more like GPT-4/3.5, but is supposedly making them think more like it as well.
> I was under the assumption that fine-tuning LLMs was useful only when you need to change the model's tone (speak like a pirate, Voldemort, etc.).
A lot of why I tried this out was to test the limits of this belief; you see a lot of talk like this out there, and it sounded like nonsense to me.
Finetuning is fundamentally not much different from continued pretraining; if you feed the model high-quality, high-volume data, I think it's reasonable to expect it to acquire new skills.
In order to speak like a pirate, it has to be able to reason :) I've done some fine-tunes similar to the MTG example as well; in mine I was fine-tuning it to speak JSON and reason about some input, and yes, you can indeed get these models to perform novel tasks.
Finetuning is a useful workaround for cases where the context size is unsuitable for the task at hand. Does anybody know whether anyone has considered finetuning an LLM on the Linux kernel sources' history and its associated mailing lists?
Super interesting work. Do you have thoughts on how to leverage this to create a deck-builder AI that would also simulate games? The major problem here is that the search space for MTG is amazingly vast.
I've seen this effort previously, pretty exciting stuff:
I've definitely thought about this problem and think it's in the range of 'feasible', but it would be pretty slow and expensive given how much context you need to provide a model for it to be able to reason about the game state. Worth trying though!
Yeah, that surprised me too, given that https://github.com/magefree/mage is open source and pretty actively developed.
But watching the rest of the video it looks like his implementation only needed to support <20 different cards that he picked, so the limited rule set he needed might have been easy enough to write.
I don't like accuracy as an evaluation metric here - you could end up with a similarly viable draft/deck at 50% accuracy, or a completely non-viable one.
Is it managing mana curves, creature count, mana-fixing priority, etc.? I'd also love to know what the common characteristics of "missing" are. Does it undervalue a splash, or removal, or fast/slow?
I think Yahoo fantasy sports and others in the space are doing amazing work on this idea. I wonder if an LLM is even necessary for this; it's mostly maths. Analyze past winning decks and make decisions based on performance.
I think a non-LLM model would be good for learning to draft within a particular set, because it can simply encode each card and learn what cards do well together, comparative strength, etc. without learning the stats/rules text on the card.
For learning to draft in one environment and then applying it to a new set of cards, like done here, it would become far more difficult to do without an LLM. The variations and nuances of cards rules text would have to be encoded as features, which would be extremely cumbersome. The LLM gets you some level of understanding of that for free.
Fair and valid points. It's been years since I played Magic, but I agree an LLM would be very useful in reading the cards and possibly then getting into understanding strategy based on that context.
Some attributes of MTG cards (mana cost, power/toughness) are numbers, but the most important part, the effect of the card, is defined in English text.
So, in my opinion, this seems to be an area where LLM can work well.
The MTGA and MO development teams do a lot of work to put card effects into if-then rules, but unfortunately their work is not visible to the players :)
2. The model is effectively trained to predict the next token based on the previous tokens in each of these examples, which has the side effect here of teaching it to make a draft pick based on the contents of a pack.
Nothing too fancy, just next word prediction more or less
Curious how different the performance would be if instead of a 'Hall of Famer' we tell the bot that it is decently-good, but will be deactivated if it can't achieve human-level performance...