For very modest definitions of playing. Perhaps it'd be more impressive if they recorded a demo file and let that play back without the realtime overhead? Even so, it can only move forward, move back, turn, and fire, and it only knows to face away from the wall it has collided with. This is so far below even basic Doom bots that I'd be afraid to call it playing.
The ASCII intermediate representation also seems unnecessary and very limiting. But perhaps that's to keep it near realtime? It looks like about 1 FPS.
And why run on a Mac? Why not a beefy PC with a GPU that can do the calculations faster?
Still, does seem like a fun challenge. Maybe with further tuning or training it can level up
It’s not. The fine tuning taught the LLM how to give single-character responses (move/fire keyboard controls) in response to a sequence of ASCII-art-ized frames of the game being played.
Is it actually ASCII art or just a textual encoding? The art representation is nice for looking under the hood and seeing something pretty, but I feel like that is very far from an optimal way to textually encode Doom for a language model to process. Especially since there is no pitching the camera, you can encode all of the information you need to represent a frame in a single line of ASCII. If they are actually using an ASCII art representation, I bet they would get way better performance encoding the frame as a single line of text.
I never realised you could encode each column of Doom as a single character, but of course you can! I suppose the one thing missing would be distance, but with 8 bits per character you could reserve the upper bits to represent approximate distance.
That's weirdly inspiring! What other games can I make where the visuals are conceptually no more than a line of characters, but which can get macroexpanded into immersive graphics?
Another point to note is that you aren't stuck with a single character to encode a column of Doom as text. You could also do something like a letter to represent the content, followed by a number to represent the distance.
I think the only weird part about that is that certain letter-number pairs may be a single token with some other semantics in the model, and other letter-number pairs would be a pair of tokens. I think that could impact the performance of the model (but probably not by a huge amount).
Nothing you are saying is technically incorrect. But, optimal performance was not the goal. The goal was to see if this crazy stupid concept would actually work. And, it does!
Ah, I think I clicked the actual post link and saw nothing, and backed out. Thanks for the direct link to the video.
And yeah, I totally get not aiming for optimal performance. I think it would be interesting to see how a language model could perform with a format that is less visually catered, though. Like, textually there is little association between columns; it's just a string of characters, and some of them happen to be newline characters. A more densely packed encoding would play more into the logic and reasoning encoded into the model, rather than just trying to parse out ASCII art.
Funny. I made a captcha challenge of calculus problems for a comment section on my personal blog page. But 5 years after college, I couldn't remember how to even do them myself so I changed it :-/
You don't actually need much. For a form I used to get spam in, I just added a "write 42 here" field, so anyone who actually cares to read would be able to fill it in. Spam fell to zero.
(for a site with a slightly higher profile this wouldn't be enough, but for a minor corner of the internet with no ill intent actually aimed at it that turned out to be enough to block the fuzzing "fill all the forms" spam)
As contrasting experience, I did that (a simple math problem) on our contact form and it did NOT drop spam to zero; our spammers were too smart for that. Even an actual reCAPTCHA didn't completely eliminate it (although it mostly did, enough that it's fine for us).
Similarly an empty input field that is css'd to be outside the viewport is often filled by spambots but not humans. But I like the edge case UX of your idea more.
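A minimal server-side sketch combining both ideas in this subthread (the visible "write 42 here" challenge and the hidden honeypot field). The field names `answer` and `website` are hypothetical; any form-handling backend could run a check like this before accepting a submission.

```python
# Reject a form submission if the hidden honeypot field was filled
# (only bots "see" it) or the trivial visible challenge is wrong.

def looks_human(form: dict) -> bool:
    """Return True only if the honeypot is empty and the answer is correct."""
    if form.get("website", ""):          # hidden field; humans leave it empty
        return False
    return form.get("answer", "").strip() == "42"

print(looks_human({"answer": "42", "website": ""}))             # True
print(looks_human({"answer": "42", "website": "http://spam"}))  # False
print(looks_human({"answer": "7", "website": ""}))              # False
```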
Just watch out that Chrome’s autofill doesn’t fill it in. Cost us a huge chunk of new signups until we found out. Chrome ignores autofill directives under some circumstances.
It's also visible to users with CSS overrides and/or other browser impairments. The more I think about it, the more strongly I prefer the "type 42" explicit input field.
The question I got was surprisingly simple: it asked to find "the least real root of the polynomial p(x) = (x+5)(x-4)(x+1)". A determined attacker can quickly hack together something with Tesseract and feed it into even GPT-3.5 to get the correct answer to questions like these.
I guess that means the captcha is doing its job, since running LLMs isn't very cheap or scalable. But any harder problem means you start filtering out a significant chunk of human users. Based on the other replies to your comment, it seems that the questions at their current difficulty already stop a lot of human users, yet allow a determined attacker with the setup I described to pass through easily.
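For questions like the one quoted above, you don't even need an LLM once the text is extracted: a computer algebra system answers it instantly. A rough sketch, using SymPy on the exact polynomial from the comment:

```python
# Solve the captcha question "least real root of p(x) = (x+5)(x-4)(x+1)"
# symbolically; Tesseract would supply the text, this supplies the answer.
from sympy import solve, symbols

x = symbols("x")
roots = solve((x + 5) * (x - 4) * (x + 1), x)

least = min(r for r in roots if r.is_real)
print(least)  # -5
```

This is why difficulty has to come from somewhere other than the math itself: anything a CAS can parse, it can solve at negligible cost.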
Absolute banger.
But auto-aim on the vertical axis is missing. You should be able to have the crosshair under an enemy and still hit them.
But in any case, nicely done!
Funny enough, when I've tried to introduce (indoctrinate) friends to DOOM, "how do I aim up" has consistently been the biggest hangup.
This makes sense when I try to indoctrinate my teenager who grew up on Halo and Call of Duty. But I began noticing this hangup in the late 90s with friends my own age.
Doom is still under copyright protection last I knew. The source is GPL, but have the assets ever been liberally licensed? I think they're more abandonware.
I'm sure you could still do it, but personally I try to respect copyright strictly for any projects I'm going to share. It just feels annoying to have copyright nonsense hanging over me otherwise.
This made me curious, so I looked at the original Doom shareware distributions on archive.org. They do include a license that allows free distribution but prohibits commercial use and generally seems to want you to not do anything other than run the software as designed. Although there are several different versions of the license and I didn't look through all of them, it's possible that some distributions were made with less restrictive licenses.
This surprised me because I thought that id's original shareware releases actually had more permissive licenses than that. Maybe the original Commander Keen did.
I guess maybe id/ZeniMax/Microsoft could theoretically sue you. But in practice the shareware assets are used completely freely without issue all over the internet.
Having re-watched that movie recently, he's not wrong -- that's a deeply odd book for an apparent 8 year old girl to be holding. And with the number of aliens that look like humans across the movies...
They call it entrapment - the officials put him in a position where he believes he's required to shoot in order to pass a test, but he sees no reason to. So finally he has to go with his gut and shoot the most probable target, even though he wouldn't have if not placed in that situation with those expectations.
I always thought there is room for mini web games in 2024. It's a bummer that there's currently no decent site to simply play little games. I would appreciate games like this to play between my coding sessions. And I am obviously not interested in downloading games; I am interested in web-native games.
Google has been contracting for the military doing AI for over a decade; I'm pretty sure targeting objects with a computer in a combat-type situation isn't going to stop anyone. There are aim bots for most FPS games, too.
I'm not sure it's possible to make this secure. To render the positions of the enemies, the browser receives 4 coords. To submit the captcha, the browser submits 4 coords – the same ones it received. Perhaps you could track the variance between the exact position and the position the user selected, as well as the timing. But would it be enough?
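The variance-and-timing idea could look something like this server-side sketch. The thresholds are invented for illustration: humans click near, but rarely exactly on, a target, and take human-scale time to do it.

```python
# Flag a captcha submission as bot-like if it echoes the served
# coordinates pixel-perfectly or arrives inhumanly fast.

def suspicious(targets, clicks, elapsed_ms: float) -> bool:
    """targets/clicks: lists of (x, y); elapsed_ms: time to solve."""
    if elapsed_ms < 300:                       # faster than human reaction
        return True
    offsets = [abs(tx - cx) + abs(ty - cy)
               for (tx, ty), (cx, cy) in zip(targets, clicks)]
    return all(o == 0 for o in offsets)        # exact echo of server coords

targets = [(100, 50), (200, 80)]
print(suspicious(targets, targets, 5000))                  # True
print(suspicious(targets, [(103, 48), (198, 83)], 5000))   # False
```

Of course a bot can trivially add jitter and delays, which is the point being made: the heuristic raises the bar only slightly.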
Not really Doom, a few years old, and now broken apparently. IIRC it was basically just a mouse only shooting gallery mini-game.
EDIT: Not broken, just not obvious one must click the sound options to start. Still just a mouse gallery mini-game. Doubtful you'd even need AI to solve it.
If they switch to canvas rendering and include some twist (e.g. shoot x but not y, limit the input rate, etc.), then I think considerable computing effort would be necessary to break the lock.
Wow, that's pretty impressive to me and I think it's awesome that you were able to put this together quickly. I admit that I don't have a CV background, so maybe this is easier for a programmer who's already experienced in that area.
To be fair I don't think you need CV in this specific case where the problem space is very limited.
1. There's no lighting, so the enemies have specific, fixed pixel colours that don't appear in any of the backgrounds. Scan and target these.
2. Enemies appear in a specific zone of the canvas. This makes the scan faster and combines with the point below.
If there's expected ambiguity one can a. detect a few interesting background properties by looking at pixels where enemies never appear (e.g corners), and/or b. use a couple of other pixels relative to the candidate match (maybe neighbours, maybe not, could just as well be 20px down, 10 left) to discriminate.
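The fixed-palette scan from points 1 and 2 can be sketched in a few lines. The enemy colour and spawn zone here are hypothetical; a real bot would read them out of the actual canvas.

```python
# Scan only the zone where enemies can appear, looking for a sprite
# colour that never occurs in the background.

ENEMY_RGB = (252, 84, 84)                  # made-up sprite colour
ZONE = (slice(40, 120), slice(0, 320))     # hypothetical spawn rows/cols

def find_enemy(frame):
    """Return the first (x, y) in the zone matching the enemy palette."""
    rows, cols = ZONE
    for y in range(rows.start, rows.stop):
        for x in range(cols.start, cols.stop):
            if frame[y][x] == ENEMY_RGB:
                return (x, y)
    return None

# 200x320 black frame with one enemy pixel planted inside the zone
frame = [[(0, 0, 0)] * 320 for _ in range(200)]
frame[60][150] = ENEMY_RGB
print(find_enemy(frame))  # (150, 60)
```

No computer vision in sight: with fixed colours and a fixed zone, it's a plain array search.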
Side story: one day my team was tasked with doing textual document content recognition for some biz. Everyone was like "oh it's going to be $$$ to pull out CV+OCR and have the OCR learn the specific font".
Turns out the document in question was:
- an extremely standardised gov format
- produced only by gov administration
- of a known fixed, overall size with clear identifiable boundaries
- printing known, standardised list of fields at fixed position
- with a known, standard font specifically made for quick automatic recognition
- containing only /[A-Za-z0-9]/ chars (plus a few I can't recall, but essentially dash, plus, slash...)
- on a known, standardised background
- the only variable is the quality of the scan and the size parameters
So I put up a file upload form, piped the image through some reasonable ImageMagick filter sequence to turn it into a no-background monochrome, looked for corners/borders, resized and rotated, scanned through the image until I hit a black pixel, then looked at lit/unlit pixel patterns (think 7-segment display in reverse).
Cobbled the thing together in a couple of afternoons, with a quick, simple UI to have the user crop/rotate the doc (putting it mostly upright). It was stupidly fast to run and the success rate was very high. Interestingly enough, the failure mode was very good: it could reliably tell "ok, I can't make any sense of this", versus the OCR, which claimed success but output gibberish.
You can get surprisingly far with very little when you have known knowns.
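The "7-segment display in reverse" idea reduces to matching each glyph cell against a table of known lit/unlit patterns. A toy sketch, with invented 3x3 glyphs standing in for the standardised font described above:

```python
# Match a binarised glyph cell (1 = ink) against known pixel patterns;
# an unrecognised pattern fails loudly with '?' instead of guessing,
# which is the good failure mode described in the anecdote.

GLYPHS = {
    (1, 1, 1,
     1, 0, 1,
     1, 1, 1): "0",
    (0, 1, 0,
     0, 1, 0,
     0, 1, 0): "1",
}

def read_cell(cell):
    """cell: 9 pixels, row-major. Unknown patterns become '?'."""
    return GLYPHS.get(tuple(cell), "?")

print(read_cell([1, 1, 1, 1, 0, 1, 1, 1, 1]))  # "0"
print(read_cell([0, 1, 0, 0, 1, 0, 0, 1, 0]))  # "1"
print(read_cell([1, 0, 0, 0, 0, 0, 0, 0, 1]))  # "?"
```

With a fixed font and fixed positions there's no learning step at all, just table lookups, which is why it beat the generic OCR on both speed and honesty.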
Nah, a proper anecdote should end with 'and you could check one checkbox at the gov site and, instead of the scan, you would receive the 'printed' PDF/A with the text layer intact'.
But yeah, there is always a way to optimize. Even when making a clean-room implementation (i.e. not looking at the source of that DOOM captcha), you can easily narrow the recognition down to a couple of 2x2 blocks and just pattern-match them against a known background (i.e. not a monster).
I'm in awe at the late stages of this cat and mouse game. I write a lot of bots and scrapers, and I feel thoroughly out-gunned against a bunch of PhD data scientists.
I know this is just for fun, but I think this could be a genuinely good solution if it was heavily obfuscated, and the enemy positions were streamed from the server.
Who else is clicking "click to start" like me? It turns out you have to choose one of the buttons. I thought they were there to let me enable/disable the sound, but they both also act as start buttons.
Didn't know a simple interface with a sound switch and a game start button could be designed this badly.
A button's label may be a verb, but it doesn't have to be. Generally, buttons represent one of three things:
1. an action – this is usually a verb in the imperative mood (e.g. Reply, Save, Add to basket)
2. a status – those omit the verb and only specify the new state of an object, which might be a lot of things, like a noun (Spam), an adjective (Favourite) or maybe an adverb (In progress, Later)
3. a navigation item – on the Web, this is better represented as a link, so let's not go into this here
I would argue that "with/without sound" is a clear example of a status here.
It's still way clearer to not omit the verb. "Report spam" vs just "Spam".
Also links are not buttons. There's nothing to get into here. It's straight up wrong from every perspective to think of a link as a button even from an accessibility standpoint.
Or make the "click to start" text itself clickable and have it start the game with sound. Anyone who wants it muted will make sure to click the mute symbol first, and then the ambiguity resolves itself anyway.
MathDoku does that and I hate it, because sometimes cookies expire and it plays loud music in the middle of the night when I start it. What's wrong with a separate sound on/off checkbox?
I think most people would agree your solution is preferable, but the spirit of this subthread was "what's the smallest change that would improve things" rather than "how could it be redesigned from scratch?"
I would also argue the MathDoku problem is different. That sounds like a mode confusion type issue, where the user expects a certain level of automation but it has been disabled by the system without adequate feedback.
What's wrong with "start with sound" and "start without sound"? That's a guaranteed single click, whereas with a checkbox you need either one or two clicks.
Who else is missing the forest for the trees? It turns out you have to focus on the merit of the contribution instead of inconsequential UI design optimization.
Didn’t know a simple demo (with disclaimers) from someone who is clearly doing something novel could be commented on this badly.
I'd argue that if it confuses the user it's not inconsequential. And also, something can be both innovative and at the same time have room for improvement. Companies are literally chasing down user feedback.
A user's feedback is one of the best things that can ever happen to your program. The worst is to never get used by anyone, and the second worst is to have users walk away with no idea why.
I certainly was confused and had a hard time starting it. If a significant amount of people can't even figure out how to start the game, the problem isn't inconsequential.
I agree with you, but this is distracting from the merits of the demo. Also, this is currently #2 on the front page so clearly many people are able to navigate the demo UI, even if it is suboptimal.
I decided to leave only a secondary comment at the bottom of the thread for the same reason as yours, and it still got 14 upvotes (i.e. thanks) in a short time before this branch bubbled up. People definitely get confused, and that's worth talking about before the merits of the demo, because you have to run it somehow. I almost left too, thinking it was broken, hugged to death, or something. It is distracting, and we'll live through it :)
I tapped "click to start" on my phone a few times, saw nothing happened and assumed it didn't work on mobile and tapped back to come read the comments. I am neuroatypical, though, maybe I don't count as human.
https://news.ycombinator.com/item?id=39813174