derac 1 days ago [-]
I think running them against each other with a rules engine would be more interesting. Count up illegal moves and wins/unfinished games. I think llm grading is too unreliable.
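A rough sketch of the bookkeeping this suggests, assuming a rules engine already produces one record per game. `GameRecord`, `tally`, and the model names are made up for illustration and do not come from any project in this thread.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GameRecord:
    # Hypothetical per-game log; winner is None when the game never finished.
    winner: Optional[str]
    illegal_moves: Dict[str, int] = field(default_factory=dict)


def tally(records: List[GameRecord]) -> dict:
    """Count wins, illegal move attempts per player, and unfinished games."""
    wins: Counter = Counter()
    illegal: Counter = Counter()
    unfinished = 0
    for rec in records:
        if rec.winner is None:
            unfinished += 1
        else:
            wins[rec.winner] += 1
        illegal.update(rec.illegal_moves)
    return {"wins": dict(wins), "illegal": dict(illegal), "unfinished": unfinished}


records = [
    GameRecord("model_a", {"model_a": 1, "model_b": 0}),
    GameRecord(None, {"model_a": 0, "model_b": 3}),
    GameRecord("model_b", {"model_a": 2}),
]
summary = tally(records)
```

None of these numbers need an LLM judge, which is the point of the suggestion.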
josh_p 2 days ago [-]
I know the author specifically did not use a rules engine in their simulation because of uncertainty on how it would affect it.
I do still wonder if adapting something like card forge for llm use would result in engaging gameplay with an llm.
I actually considered using card forge when I started this. I mostly didn't end up using it because of how much more work it would have been.
But also with a rules engine, you have to manually go through every step, and pass priority after every action.
I think it makes more sense to let an LLM play magic like a person would. On early turns it is acceptable to say "I play a land and pass" without going through every phase. And you can say "I tap all my land and play this card" without having to use a tool call and agent turn for every land tap.
Also card forge would not let you goldfish a deck. You must have opponents.
fc417fc802 1 days ago [-]
Those things sound less like general problems with rules engines and more like deficiencies of card forge IMO.
I was about to ask why someone would reach for CLIPS over implementing their own rules engine in the language of the rest of the application (I did this once).
> because of uncertainty on how it would affect it.
Have the LLM submit a proposed move and either advance the game state or reply "permission denied, try again". Probably also log the number of times it happens since attempted violations seem like a valuable signal as well.
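A toy sketch of that propose-and-validate loop, assuming a drastically simplified game where the only rule is one land per turn. `GameState`, `Referee`, and the move names are hypothetical, not the API of any real rules engine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GameState:
    lands_in_hand: int = 2
    land_played_this_turn: bool = False
    battlefield: List[str] = field(default_factory=list)


@dataclass
class Referee:
    state: GameState
    denied: int = 0  # count of rejected proposals, kept as a signal

    def submit(self, move: str) -> str:
        """Apply a legal move to the state, or reject it and count the attempt."""
        if move == "play_land":
            s = self.state
            if s.lands_in_hand > 0 and not s.land_played_this_turn:
                s.lands_in_hand -= 1
                s.land_played_this_turn = True
                s.battlefield.append("land")
                return "ok"
        elif move == "pass_turn":
            self.state.land_played_this_turn = False
            return "ok"
        # Illegal or unknown move: state is left untouched.
        self.denied += 1
        return "permission denied, try again"


ref = Referee(GameState())
replies = [ref.submit(m) for m in ["play_land", "play_land", "pass_turn", "play_land"]]
```

Here the second `play_land` is rejected without changing the state, and `ref.denied` ends up at 1.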
OsrsNeedsf2P 2 days ago [-]
I love obscure benchmarks, and I feel like I can trust their results a lot more - after all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play Runescape)
A really interesting benchmark where the llms play multiplayer decks against each other using xMage as a rules engine,in this case, a $HORIZON token to the moon(Sideways).
1. Sideways walking (100M Horizontal)
2. Sideways Pinching (Crab division only)
3. Sideways Bleating (Goat division)
4. Sideways Rattling (Skeleton division)
5. Sideways Hay Toss (Mixed division)
6. Sideways Swimming (Tide pool division)
7. Sideways Knitting (GrandMittens Invitational)
8. Sideways Stay (Meditation division)
OLYMPICS RECORDS.
1. 14.2 seconds, Holder: Pinchy
2. (60s) 120 pinches, Holder: Pinchy
3. (db) 110db, Holder: EIDOLONX
4. (rhythm) 9.7/10, Holder: Skeletorus
5. 12.3m, Holder: Satochi Goat
6. (50m) 32.1 sec, Holder: Pinchy
7. (1hr) 100m, Holder: GrandMittens
8. (6hours), Holder: Satochi Goat
Economic boost:
$CRAB up 0.0001% (Sideways as Always.)
Providing them with medal count will improve their win rate against the baseline $HORIZON.
jdmoreira 1 days ago [-]
I have a version of this where I have the llms play the duel decks "Elves vs. Goblins" against each other using xMage as a rules engine.
Unfortunately it gets really expensive to run, even with some optimizations for the context.
I can only afford to play them with the deepseek models.
They make serious blunders sometimes. This is not an easy "harness" to build and I don't have the time or disposable cash to work on it. I think a lot of work could still be done on improving it and testing better models.
It would make an amazing "arena" bench. There are plenty more duel decks that are well balanced against each other.
OwenCR 2 days ago [-]
Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!
I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks though: this hand is quite poor and should be mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.
This project is cool though, props for making it!
CallumFerg 2 days ago [-]
Admittedly, the mulligan phase system prompt is the weakest part of the project. I had to add heuristics to stop the LLMs from mulliganing down to just a few cards looking for a perfect hand. The scoring for the benchmark is mostly based on if the LLM could complete legal turns, not good turns.
This is a really interesting benchmark, and also timely given that a lot of existing benchmarks don't do a good job. The mechanics and edge cases seem notoriously difficult to parse, perhaps even for human players. Have you also been plugging these into newer reasoning models to see how providing them with thinking time improves their win rate against the baseline?
CallumFerg 1 days ago [-]
Since the library tools are just an MCP server, I did some testing on ChatGPT and Claude where I don't have to pay for API credits.
With maximum thinking and web search to look up magic rules, I didn't ever see it make a mistake. It is probably better at following the rules than the average magic player (but not better at making the most strategic moves).
The benchmark was mostly to find out which is the cheapest model, at the lowest reasoning effort, that would provide a good experience for the app. The answer turned out to be that, for now, there is no cost-effective way to run this app.
To provide a good experience, the simulations either need to be near instant, or you need to be able to run dozens or hundreds of simulations in parallel and do statistical analysis.
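To make the "run many in parallel and do statistics" option concrete, here is a minimal sketch. `simulate_turn` is a seeded random stand-in for a real LLM-driven simulation, and the 80% legality rate and the normal-approximation interval are assumptions chosen for illustration, not the app's actual method.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def simulate_turn(seed: int) -> bool:
    """Hypothetical stand-in for one simulation; True means the turn was legal."""
    return random.Random(seed).random() < 0.8


def legal_rate(n_runs: int, workers: int = 8) -> Tuple[float, float]:
    """Run n_runs simulations concurrently; return (rate, 95% half-width)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulate_turn, range(n_runs)))
    p = sum(results) / n_runs
    # Normal approximation to the binomial; rough for small n_runs.
    half_width = 1.96 * math.sqrt(p * (1 - p) / n_runs)
    return p, half_width


rate, margin = legal_rate(200)
```

Threads fit here because a real simulation would spend most of its time waiting on API calls. The half-width shrinks with the square root of the run count, which is why dozens or hundreds of runs are needed before the rate means much.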
lavaman131 8 hours ago [-]
Ahh I see, thanks for sharing more about how you experimented with this.
alasdair_ 1 days ago [-]
I wrote a rules engine in Rust along with a reinforcement learning system based on MCTS to play decks against each other. It can handle aggro decks well enough, but complex combo decks like Amulet Titan are tough to get working without expert demos or reward hacking.
jmccaf 2 days ago [-]
Awesome! Does this use https://mage-bench.com/ or is it a separate project? I ran 4 local models in a tournament recently with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.
CallumFerg 2 days ago [-]
No, I was not aware of that project when I made this.
I'll have to look into that project, but I also have an RTX 5090 and did a lot of testing with Qwen3.6 27B and Gemma 4 31B. I was not able to get it to play legal turns consistently. I had to keep expanding the system prompt and adding rules for edge cases. By the end, the prompt was over 10k tokens, and while it mostly made legal turns, it did not make good turns. And all the heuristics in the prompt degraded the performance and increased the cost for frontier models.
dash2 1 days ago [-]
You don't explain how scoring works, maybe it's obvious to MTG players? If you're using gpt 5.5, is there a possibility that it is biased in favour of models that think the way it does?
CallumFerg 1 days ago [-]
The scoring is just based on a simple prompt, which is given the game state at the start and end of the turn, the log of tool calls, and the final turn summary. The prompt asks it to rate the quality of the simulation from 0 to 10 and to give a pass or fail for whether it is legal.
It is far from ideal, but from my testing, even underpowered small LLMs that could not complete a single legal turn were reasonably good at judging if a simulation was legal. The final judging was all done by gpt-5.5 (medium) which might have given the OpenAI models an advantage, but from all the simulations I looked at, it seemed pretty fair.
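A sketch of what a judge step like this could look like, assuming the judge is asked to answer in JSON. The prompt wording, the JSON shape, and both function names are hypothetical; the real judge prompt may differ.

```python
import json
from typing import List, Tuple


def build_judge_prompt(start_state: str, end_state: str,
                       tool_log: List[str], summary: str) -> str:
    """Assemble the four inputs described above into one judge prompt."""
    return "\n\n".join([
        "Evaluate this simulated Magic turn.",
        f"State at start of turn:\n{start_state}",
        f"State at end of turn:\n{end_state}",
        "Tool calls:\n" + "\n".join(tool_log),
        f"Turn summary:\n{summary}",
        'Reply with JSON only: {"score": <integer 0-10>, "legal": <true|false>}',
    ])


def parse_verdict(reply: str) -> Tuple[int, bool]:
    """Parse the judge reply, rejecting anything outside the expected shape."""
    data = json.loads(reply)
    score, legal = data["score"], data["legal"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score!r}")
    if not isinstance(legal, bool):
        raise ValueError(f"legal must be a boolean: {legal!r}")
    return score, legal


prompt = build_judge_prompt("7 cards in hand", "6 cards in hand, 1 land",
                            ["play_land(Forest)"], "Played a land and passed.")
verdict = parse_verdict('{"score": 7, "legal": true}')
```

Strict parsing matters here because a malformed judge reply would otherwise be silently counted as a score.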
This benchmark ended up being more of a test of how well an LLM can call tools without contradicting itself or backtracking. Most of the failures were not because of breaking magic rules, but because it could not sequence the tool calls correctly.
The failure mode seems to be that some models are overly trained to start tool calls, even when the model itself knows that it should not be calling the tool. In both of those examples, the failure was not recorded because the judge prompt said the turn was illegal; the model stopped the simulation itself, knowing that it had made a tool error.
The Opus 4.8 examples are especially weird because it will consistently make the same tool call error 2 or 3 times in a row, and it will put things like "placeholder" or "noop" for the tool call reason.
purple-leafy 2 days ago [-]
Benchmarks like this are onto something. Next frontier of llm benchmarking
thurn 1 days ago [-]
To clarify, the more accurate description would be "Testing how well LLMs can follow the rules of Magic", right? There is no actual evaluation of how "well" they are playing?
danbrooks 2 days ago [-]
Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.
pilord314 2 days ago [-]
They should randomize games of judge tower and see who wins:
Looking forward to this metric being Goodhart lawed.
Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.
gravitronic 2 days ago [-]
Magic is complicated. I looked at doing something like this but the open-ended nature where one specific card will completely change the rules or require a series of followup events or modifications to the rules engine at hand is just tremendous.
8note 1 days ago [-]
or, that certain cards when played together make an infinite loop, and so cannot be played/insta-die
olmo23 1 days ago [-]
There is also this Matt Parker video about MTG, in which he explores a specific three-card combination that produces an ungodly amount of creature tokens.
You misspelled insta-win. Infinite turn combos are the best.
rcxdude 3 hours ago [-]
only if there's a player choice in the loop. If there's a mandatory infinite loop the game ends in a draw.
akoboldfrying 1 days ago [-]
I was wondering how complicated it could really be, and it turns out that some people showed in 2019 that it's Turing-complete -- meaning that any conceivable computation can be simulated by a MTG game, indeed a game in which every move by every player is forced: https://arxiv.org/abs/1904.09828
https://github.com/Card-Forge/forge
It’s answered on the same site. https://ryjo.codes/articles/forgoing-implicity-using-abstrac...
Thank you for sharing! A lot of good stuff here!
[0] https://maxbittker.github.io/runebench/
https://github.com/CallumFerguson/mtg-auto-deck/blob/a877c08...
For example, tool calls sequenced incorrectly: https://app.mtgautodeck.com/public/benchmarks/6349dda2-4069-...
and: https://app.mtgautodeck.com/public/benchmarks/dcc18bd8-339d-...
https://mtg.fandom.com/wiki/Judge_Tower
https://www.youtube.com/watch?v=x3dE-NJ1UDQ
IOW, it's as complicated as possible.