Capturing the Flag with GPT-4 (micahflee.com)
235 points by hiddencost on April 25, 2023 | 56 comments



Neat! My mind-blown moment with GPT-4 was realizing that it will often be able to tell you the output of the (unique, not available in training data) scripts it writes for you.


I was actually working on something in this vain yesterday: I asked it for output and found it often generated the output I asked for, but it was not actually the output of the SQL query that it wrote.

The query it wrote wasn't even valid SQL but it was close enough to make you think it would work.


I know this is extremely unnecessary and pedantic but I think you meant "in this vein" instead of "in this vain".


Well, maybe the OP was working in vain, given that the SQL query was not valid code.


Yes, thank you, I always get them mixed up.


Yeah I asked it a Python question and got back an answer with some Python code to demonstrate. The Python code worked great but demonstrated the exact opposite of the answer that was given.


Are you sure you were using GPT-4? If so, can you provide a transcript or screenshot?



It is impressive, but no, it won’t.

It’ll tell you what a typical output for the command might be, and the more complex the script, the more wrong and full of hallucinations it will be.

There’s a huge difference.

Specifically, you have no way of knowing the difference between accurate outputs and inaccurate outputs, without running the command yourself, making it largely worthless.

Without access to an environment, it's not possible to magically know what the output of a command will be; it doesn't have an embedded understanding of code, specifically not of iteration, mapped data, or mathematical functions.

For trivial obvious outputs it’s good, but it’s not executing the code; it’s generating what seems like plausible output; and if the output is trivially derived from the input, it’ll be impressively accurate.

…but, as the complexity of the task increases or the task deviates from “standard problem space” the triviality of generating accurate output decreases and it stops generating impressive outputs.

Tldr; yes, but it doesn’t scale well beyond trivial outputs.


> Specifically, you have no way of knowing the difference between accurate outputs and inaccurate outputs, without running the command yourself, making it largely worthless.

The former is necessarily the case given the Halting Problem; the latter is falsified by the fact we can reason about code despite the Halting Problem.


> the latter is falsified by the fact we can reason about code despite the Halting Problem.

I'm not talking in general terms, or describing 'what the code does' in a summary or bullet point high-level form. No one is arguing that it can't summarize and describe what code does. These models are very good at that.

I'm talking specifically about generating the output of the command, as the OP specifically mentioned.

It does generate the exact output for commands and scripts if you request it, sometimes even if you don't; it's just that the output is often hallucinated rubbish.

Being impressed that GPT can invent from 'thin air' some creative writing (fiction) when you tell it `pretend you're a docker container and now run 'ls'` is, I feel, missing the boat in terms of understanding the capabilities of these LLMs.


And they're often not, even for quite complex functions that require symbolically executing quite a few calculations to get to the prerequisite output.

Nobody is impressed that it "can invent from 'thin air' some creative writing (fiction)", but that it often does not and in fact produces correct output. You're right we can't rely on it producing the correct output as it currently stands, but that it is capable of doing this at all is impressive.


>> without running the command yourself,

> the latter is falsified by the fact we can reason about code despite the Halting Problem

I think wokwokwok's point holds true in practice.

Our patience and working memory are far more limited than what is needed to accurately model all the necessary details of even moderately complex algorithms in our heads.

That's one of the main reasons to limit code complexity to improve readability/maintainability.

https://en.wikipedia.org/wiki/Cyclomatic_complexity


Can you explain how the halting problem applies here?


I suspect this is a troll posting, but on the off chance I'm wrong... The LLM gives the output of a command. To do so, it has to be able to determine when the command exits. This is exactly the halting problem.

For a trivial example, what is the output of:

    while True:
        pass
    print("goodbye world")

(This is also proof that leaving out the curly braces makes code harder instead of simpler #python-lie-to-me. It took multiple edits to get this to render correctly on HN.)


Rice's theorem may actually be more appropriate here (but it's a consequence of the undecidability of the halting problem).

But it's important to note that just because there's no algorithm that works on ALL programs doesn't mean that the semantic properties of all programs are undecidable. Clearly, for particular programs that are bounded and guaranteed to terminate (e.g. no unbounded loops or recursion allowed), we can determine such properties, and I believe theorem provers in fact only allow such programs. Similarly, you can restrict yourself to only the programs that you can prove will terminate in N steps (which might exclude some programs that do terminate but require more than N steps of compute to prove).
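
To make the bounded case concrete, here's a small sketch (my own illustration, not from the thread): once you impose an explicit step budget, "what does this program output?" is decidable by just running it and giving up when the budget runs out.

    def run_with_budget(step_fn, state, max_steps):
        """step_fn(state) -> (done, new_state). Simulate for at most max_steps steps."""
        for _ in range(max_steps):
            done, state = step_fn(state)
            if done:
                return ("halted", state)
        return ("unknown", None)  # might halt later, might not; we can't tell within the budget

    # Collatz-style iteration: terminates quickly for this input, but unbounded in general.
    def collatz_step(n):
        if n == 1:
            return True, n
        return False, (n // 2 if n % 2 == 0 else 3 * n + 1)

    print(run_with_budget(collatz_step, 27, 1000))  # ('halted', 1)
    print(run_with_budget(collatz_step, 27, 10))    # ('unknown', None)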


> It’ll tell you what a typical output for the command might be, and the more complex the script, the more wrong and full of hallucinations it will be.

It doesn't even have to be complex. Ask it about a program in a language that's not very popular and the odds that it'll completely screw up its answers are high.


Obviously I wouldn't have been impressed by hallucinated output. I'm talking about correctly modeling a variable's state through its conception of what a Python interpreter does, which requires building a model of that interpreter and extracting consequences from it, rather than pure language statistics.

I agree that LLMs will often hallucinate. There is obviously no guarantee that the output is correct. But sometimes it is correct anyway, which I notice by actually running the code.

Here is a trivial example which I only mention to bring the conversation back to the reality of GPT-4 actually being able to do things like this:

Me:

You are a Python interpreter. Please give the correct output of the supplied code, with no commentary.

  >>> a = ["wokwokwok", "says", "i", "have", "no", "understanding", "of", "code"]
  >>> a.append("!")
  >>> " ".join([w.upper() for w in a])
GPT-4, on first attempt:

WOKWOKWOK SAYS I HAVE NO UNDERSTANDING OF CODE !


That’s because you’ve asked for something that is trivially derived.

> Tldr; yes, but it doesn’t scale well beyond trivial outputs.


Which algorithms are trivial and which are non-trivial? Is sorting trivial? base64-encoding? Can you provide a set of small scripts that you expect GPT-4 to be unable to correctly simulate execution of?

It is weird that you seem to agree that it is capable of performing algorithmic simulation, while discounting that with "but it’s not executing the code; it’s generating what seems like plausible output", in a way that seems suspiciously close to defining anything it simulates correctly as "trivial", and anything it would fail at as "executing the code"...
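
As a concrete starting point, here's the kind of small probe I have in mind (a hypothetical example of mine, not something the thread tested): the exact printed value depends on every one of the 49 iterations, so plausible-looking guesses aren't enough.

    # Hypothetical probe: the only way to get the exact output is to track
    # the accumulator through every single iteration.
    acc = 0
    for i in range(1, 50):
        acc = (acc * 31 + i * i) % 1_000_003
    print(acc)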


> the more complex the script, the more wrong and full of hallucinations it will be

I often have that same problem myself.


> Tldr; yes, but it doesn’t scale well beyond trivial outputs.

I understand that the current LLM architectures are fundamentally incapable of goal-seeking and that they lack any concept of "correctness". However, I also recognize that somehow, the incredibly "wrong" architecture of ChatGPT is able to be useful.

I'd like to think that with access to a sandbox runtime environment, a significantly larger context window, and perhaps additional copies of the LLM "supervising/orchestrating" multiple "lower" copies of the LLM by breaking down large work into smaller tasks, the current ChatGPT architecture could scale well beyond trivial scripts.

And then I hope we abandon this LLM architecture and develop architectures which can actually internally work towards "going beyond" in terms of quality of output for a given task.
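
Very roughly, I picture something like the sketch below (all names here are hypothetical placeholders, not a real API): a "supervisor" call splits the goal into tasks, "worker" calls write code for each one, and a sandbox executes that code so the model reacts to real output instead of hallucinating it.

    def solve(goal, llm, sandbox):
        """llm(prompt) -> str and sandbox(code) -> str are assumed, hypothetical interfaces."""
        tasks = llm("Break this goal into small, independent tasks:\n" + goal).splitlines()
        results = []
        for task in tasks:
            code = llm("Write a standalone script for this task:\n" + task)
            output = sandbox(code)  # real execution, not a guessed transcript
            results.append(llm("Task: " + task + "\nScript output:\n" + output + "\nSummarize the result."))
        return llm("Goal: " + goal + "\nTask results:\n" + "\n".join(results) + "\nGive the final answer.")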


Instead of abandoning anything, why not just use the appropriate tooling? Boilerplate templates if you want some code written for you, a debugger if you want to know what's wrong with what you're doing. Both have been around for more than twenty years.


That's provably only possible for "small" scripts, due to the halting problem and all that.


That is very interesting; I'd like to try it once GPT-4 is widely available. I did try GPT-3 with a CTF a few weeks ago, and it gave seemingly plausible code but outright incorrect answers.


I did some advent of code exercises with GPT-3 and it often ended the script with something like

    // returns 1337
    return result;
Sometimes the comment stated the correct answer to the puzzle, but the script returned something else since it had a bug.


GPT-3 isn't the 1337 h4x0r it thinks it is.


Hum... maybe it runs the scripts and sees the output?


It does not. Regular ChatGPT without plugins does not have access to any tools. Throw it a script with some weird outputs and it'll definitely fail, every time. While the script 'evaluation' stuff can be pretty decently impressive, it is not actually executing anything.


I used it on the htb cyber apocalypse ctf a few weeks ago. It did phenomenally, and I was able to get a ton of flags on the crypto and ML challenges I otherwise would not have even tried.


I get that it's a way to find out what GPT-4 is capable of, but IMHO that defeats the point of a game like a CTF.

It's like playing with an aimbot. You may be beating the other players, but where's the fun?


I've run CTF games at major conferences. The point is to solve the challenges by any means necessary within the rules... and that which is not forbidden is allowed.

If I, as a person running a CTF, did not want my players to do this, I would set up a few problems which would have incorrect (but not obviously so) "solutions" generated when fed to an LLM.

The Shamir's Secret Sharing challenge reminds me of the time I was playing DEF CON CTF Quals and had the bright idea to try to scan for challenges. I found one - it involved hiding fragments of a split secret in a modified version of ADVENT. I solved it. Even when the board was fully opened, it was nowhere to be seen. You does your hacks and you takes your chances...


> it involved hiding fragments of a split secret in a modified version of ADVENT. I solved it. Even when the board was fully opened, it was nowhere to be seen

Can you explain what this means? I don't understand, except for the split secret part.

Also: How did you do it?


https://en.wikipedia.org/wiki/Colossal_Cave_Adventure

http://point-at-infinity.org/ssss/

DEF CON CTF quals is (or was) "Jeopardy" style with five categories with five problems each. The thing I found was not one of the 25 problems.


My comment was negative and didn't add anything. I should have refrained.

Sorry to the author, and thanks for the article; I learned about Lagrange interpolation.
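
For anyone else who's curious, here's a minimal sketch of the recovery step (my own toy numbers, not from the article): evaluate the Lagrange interpolation of the shares at x = 0 over a prime field.

    # Toy Shamir recovery: the secret is f(0) for a degree-(t-1) polynomial mod p.
    # These shares come from f(x) = 1234 + 166*x + 94*x^2 mod 7919, so the secret is 1234.
    def recover(shares, p):
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (0 - xj) % p
                    den = den * (xi - xj) % p
            secret = (secret + yi * num * pow(den, -1, p)) % p  # pow(den, -1, p): modular inverse (Python 3.8+)
        return secret

    print(recover([(1, 1494), (2, 1942), (3, 2578)], 7919))  # 1234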


I think it’s a reasonable thing to consider, and you brought it up in a healthy, non-inflammatory way.


I often use cheats and walkthroughs, to the chagrin of my gamer friends. I have fun though, so who cares?


[flagged]


Yeah, this one was not hard to spot. Very obvious ChatGPT style. Try playing with a prompt preamble to set the writing style and tone.


Besides, the account's called catgpt3; might as well pick a... feline type of prompt.


It's like using a calculator on a math test. On some tests they're allowed, on others using one would be cheating.


I appreciate that they say they need to learn more about 'exec' when asking GPT-4. It also plays well into some of the reading strategies I've seen: get a high-level understanding first, then read the documentation with a more general idea going in. That could also lower frustration for new engineers, so they push through to the end!


I was hoping for someone to have poisoned GPT-4's training data with a flag you had to get it to say.

It would be a pretty big flex.


Especially since they would have had to do it in 2021!

The MIT Mystery Hunt has occasionally managed to get the NYT crossword on the day of the hunt to contain a clue answer or two.


The spinning globe in the background appears to have an East Pole and a West Pole.


> Especially because this challenge actually includes a very tricky part related to base-27

It looks like a bog standard function for converting to a base X number. All GPT-4 had to do was paste 27 in.
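
For reference, such a conversion is roughly this shape (an illustrative sketch, not the challenge's actual code); the only base-specific part is the 27 you pass in:

    # Repeated divmod by the base; works for any base up to len(digits).
    def to_base(n, base, digits="0123456789abcdefghijklmnopqrstuvwxyz"):
        if n == 0:
            return digits[0]
        out = []
        while n > 0:
            n, r = divmod(n, base)
            out.append(digits[r])
        return "".join(reversed(out))

    print(to_base(1000, 27))  # "1a1", since 1*27**2 + 10*27 + 1 == 1000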


ChatGPT is quite capable of doing some "reasoning", but assume for a second that it's not, and all it does is search for human-written solutions and adapt them a little to your question. That alone is incredibly helpful.


I tried using GPT-4 on the qualifier rounds for hackasat, and I think you need to know the basics of what a good answer is (or maybe satellite red teaming is too obscure), but I never got a flag out of it.


Hehe, you can solve the two "Street" challenges with GPT-4; they are bog-standard entry-level competitive programming challenges disguised as reverse engineering challenges.


I thought the "/shurdles" problem was to be solved via "chroot".


You need to be root or have CAP_SYS_CHROOT to use the chroot system call. You can, however, create a new user and mount namespace on distros that allow unprivileged namespaces (Ubuntu) and then chroot away. The challenge could have been solved that way, depending on the kernel used and whether the binary was a setuid binary reading a flag file.

But the way the challenge was designed, it's more about just changing argv[0] rather than the actual executable path.
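
A hedged illustration of that argv[0] trick in Python (my own example, not the challenge's actual binary): exec a real program, but lie about its name in the argument vector.

    import os

    # On Linux, cat's own /proc/self/cmdline shows the spoofed argv[0]
    # ("totally-not-cat"), while /proc/self/exe would still resolve to /bin/cat.
    pid = os.fork()
    if pid == 0:
        os.execv("/bin/cat", ["totally-not-cat", "/proc/self/cmdline"])
    os.waitpid(pid, 0)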


Yep, challenge author here, and it was definitely to teach that `argv[0]` is not trustworthy. I've seen privileged processes try to re-invoke themselves (as, say, a child process) by looking at `argv[0]` rather than something like `/proc/self/exe` (which is also subject to race conditions if the directory is writable).

The binary was not setuid, but was only executable (not readable) by the user used.


>The binary was not setuid, but was only executable (not readable) by the user used.

Ah, then ptrace/gdb could have been used to dump it out as well :). Looks like a fun CTF; too bad I was too busy for bsides this year...


Yes. Thank you. I just missed that part.

It's been a long while since doing basic linux administration. I am getting rusty.


> And this time, I did it with the help of GTP-4

Should've had *GPT*-4 proofread this.


Hehe, I make the typo all the time! I need to consciously remind myself: Generative Pre-trained Transformer.


Just remember the letters are in alphabetical order.



