Capturing the Flag with GPT-4 (micahflee.com)
235 points by hiddencost on April 25, 2023 | 56 comments



Neat! My mind-blown moment with GPT-4 was realizing that it will often be able to tell you the output of the (unique, not available in training data) scripts it writes for you.


I was actually working on something in this vain yesterday: I asked it for output and found it often generated the output I asked for, but it was not actually the output of the SQL query that it wrote.

The query it wrote wasn't even valid SQL but it was close enough to make you think it would work.


I know this is extremely unnecessary and pedantic but I think you meant "in this vein" instead of "in this vain".


Well, maybe the OP was working in vain, given that the SQL query was not valid code.


Yes, thank you, I always get them mixed up.


Yeah I asked it a Python question and got back an answer with some Python code to demonstrate. The Python code worked great but demonstrated the exact opposite of the answer that was given.


Are you sure you were using GPT-4? If so, can you provide a transcript or screenshot?



It is impressive, but no, it won’t.

It’ll tell you what a typical output for the command might be, and the more complex the script, the more wrong and full of hallucinations it will be.

There’s a huge difference.

Specifically, you have no way of knowing the difference between accurate outputs and inaccurate outputs, without running the command yourself, making it largely worthless.

Without access to an environment, it's not possible to magically know what the output of a command will be; it doesn't have an embedded understanding of code, specifically not of iteration, mapped data, or mathematical functions.

For trivial obvious outputs it’s good, but it’s not executing the code; it’s generating what seems like plausible output; and if the output is trivially derived from the input, it’ll be impressively accurate.

…but, as the complexity of the task increases or the task deviates from “standard problem space” the triviality of generating accurate output decreases and it stops generating impressive outputs.

Tldr; yes, but it doesn’t scale well beyond trivial outputs.


> Specifically, you have no way of knowing the difference between accurate outputs and inaccurate outputs, without running the command yourself, making it largely worthless.

The former is necessarily the case given the Halting Problem; the latter is falsified by the fact we can reason about code despite the Halting Problem.


> the latter is falsified by the fact we can reason about code despite the Halting Problem.

I'm not talking in general terms, or describing 'what the code does' in a summary or bullet point high-level form. No one is arguing that it can't summarize and describe what code does. These models are very good at that.

I'm talking specifically about generating the output of the command, as the OP specifically mentioned.

It does generate the exact output for commands and scripts if you request it, sometimes even if you don't; it's just that the output is often hallucinated rubbish.

Being impressed that GPT can invent from 'thin air' some creative writing (fiction) when you tell it `pretend you're a docker container and now run 'ls'` is, I feel, missing the boat in terms of understanding the capabilities of these LLMs.


And they're often not, even for quite complex functions that require symbolically executing quite a few calculations to get to the prerequisite output.

Nobody is impressed that it "can invent from 'thin air' some creative writing (fiction)", but that it often does not and in fact produces correct output. You're right we can't rely on it producing the correct output as it currently stands, but that it is capable of doing this at all is impressive.


>> without running the command yourself,

> the latter is falsified by the fact we can reason about code despite the Halting Problem

I think wokwokwok's point holds true in practice.

Our patience and working memory are far more limited than what is needed to accurately model all the necessary details of even moderately complex algorithms in our heads.

That's one of the main reasons to limit code complexity to improve readability/maintainability.

https://en.wikipedia.org/wiki/Cyclomatic_complexity


Can you explain how the halting problem applies here?


I suspect this is a troll posting, but on the off chance I'm wrong... The LLM gives the output of a command. To do so, it has to be able to determine when the command exits. This is exactly the halting problem.

For a trivial example, what is the output of:

    while True:
        pass
    print("goodbye world")

(This is also proof that leaving out the curly braces makes code harder instead of simpler #python-lie-to-me. It took multiple edits to get this to render correctly on HN.)


Rice's theorem may actually be more appropriate here (but it's a consequence of the undecidability of the halting problem).

But it's important to note that just because there's no algorithm that works on ALL programs doesn't mean that the semantic properties of all programs are undecidable. Clearly, for particular programs that are bounded and guaranteed to terminate (e.g. no unbounded loops or recursion allowed), we can determine such properties, and I believe theorem provers in fact only allow such programs. Similarly, you can restrict yourself to only the programs that you can prove will terminate in N steps (which might exclude some programs that do terminate but require more than N steps of compute to prove).
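
To make the bounded case concrete, here's a small sketch (my own illustration, not from the thread): once you impose an explicit step budget, "what does this program output?" is decidable by just running it and giving up when the budget runs out.

    def run_with_budget(step_fn, state, max_steps):
        """step_fn(state) -> (done, new_state). Simulate for at most max_steps steps."""
        for _ in range(max_steps):
            done, state = step_fn(state)
            if done:
                return ("halted", state)
        return ("unknown", None)  # might halt later, might not; we can't tell within the budget

    # Collatz-style iteration: terminates quickly for this input, but unbounded in general.
    def collatz_step(n):
        if n == 1:
            return True, n
        return False, (n // 2 if n % 2 == 0 else 3 * n + 1)

    print(run_with_budget(collatz_step, 27, 1000))  # ('halted', 1)
    print(run_with_budget(collatz_step, 27, 10))    # ('unknown', None)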


> It’ll tell you what a typical output for the command might be, and the more complex the script, the more wrong and full of hallucinations it will be.

It doesn't even have to be complex. Ask it about a program in a language that's not very popular and the odds that it'll completely screw up its answers are high.


Obviously I wouldn't have been impressed by hallucinated output. I'm talking about correctly modeling a variable's state through its conception of what a Python interpreter does, which requires building a model of that interpreter and extracting consequences from it, rather than pure language statistics.

I agree that LLMs will often hallucinate. There is obviously no guarantee that the output is correct. But sometimes it is correct anyway, which I notice by actually running the code.

Here is a trivial example which I only mention to bring the conversation back to the reality of GPT-4 actually being able to do things like this:

Me:

You are a Python interpreter. Please give the correct output of the supplied code, with no commentary.

  >>> a = ["wokwokwok", "says", "i", "have", "no", "understanding", "of", "code"]
  >>> a.append("!")
  >>> " ".join([w.upper() for w in a])
GPT-4, on first attempt:

WOKWOKWOK SAYS I HAVE NO UNDERSTANDING OF CODE !


That’s because you’ve asked for something that is trivially derived.

> Tldr; yes, but it doesn’t scale well beyond trivial outputs.


Which algorithms are trivial and which are non-trivial? Is sorting trivial? base64-encoding? Can you provide a set of small scripts that you expect GPT-4 to be unable to correctly simulate execution of?

It is weird that you seem to agree that it is capable of performing algorithmic simulation, while discounting that with "but it’s not executing the code; it’s generating what seems like plausible output", in a way that seems suspiciously close to defining anything it simulates correctly as "trivial", and anything it would fail at as "executing the code"...
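
As a concrete starting point, here's the kind of small probe I have in mind (a hypothetical example of mine, not something the thread tested): the exact printed value depends on every one of the 49 iterations, so plausible-looking guesses aren't enough.

    # Hypothetical probe: the only way to get the exact output is to track
    # the accumulator through every single iteration.
    acc = 0
    for i in range(1, 50):
        acc = (acc * 31 + i * i) % 1_000_003
    print(acc)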


> the more complex the script, the more wrong and full of hallucinations it will be

I often have that same problem myself.


> Tldr; yes, but it doesn’t scale well beyond trivial outputs.

I understand that the current LLM architectures are fundamentally incapable of goal-seeking and that they lack any concept of "correctness". However, I also recognize that somehow, the incredibly "wrong" architecture of ChatGPT is able to be useful.

I'd like to think that with access to a sandbox runtime environment, a significantly larger context window, and perhaps additional copies of the LLM "supervising/orchestrating" multiple "lower" copies of the LLM by breaking down large work into smaller tasks, the current ChatGPT architecture could scale well beyond trivial scripts.

And then I hope we abandon this LLM architecture and develop architectures which can actually internally work towards "going beyond" in terms of quality of output for a given task.
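
Very roughly, I picture something like the sketch below (all names here are hypothetical placeholders, not a real API): a "supervisor" call splits the goal into tasks, "worker" calls write code for each one, and a sandbox executes that code so the model reacts to real output instead of hallucinating it.

    def solve(goal, llm, sandbox):
        """llm(prompt) -> str and sandbox(code) -> str are assumed, hypothetical interfaces."""
        tasks = llm("Break this goal into small, independent tasks:\n" + goal).splitlines()
        results = []
        for task in tasks:
            code = llm("Write a standalone script for this task:\n" + task)
            output = sandbox(code)  # real execution, not a guessed transcript
            results.append(llm("Task: " + task + "\nScript output:\n" + output + "\nSummarize the result."))
        return llm("Goal: " + goal + "\nTask results:\n" + "\n".join(results) + "\nGive the final answer.")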


Instead of abandoning anything, why not just use the appropriate tooling? Boilerplate templates if you want some code written for you, a debugger if you want to know what's wrong with what you're doing. Both have been around for more than twenty years.


That's provably only possible for "small" scripts, due to the halting problem and all that.


That is very interesting; I'd like to try it once GPT-4 is widely available. I did try GPT-3 with a CTF a few weeks ago, and it gave seemingly plausible code but outright incorrect answers.


I did some advent of code exercises with GPT-3 and it often ended the script with something like

    // returns 1337
    return result;
Sometimes the comment stated the correct answer to the puzzle, but the script returned something else since it had a bug.


GPT-3 isn't the 1337 h4x0r it thinks it is.


Hum... maybe it runs the scripts and sees the output?


It does not. Regular ChatGPT without plugins does not have access to any tools. Throw it a script with some weird outputs and it'll definitely fail, every time. While the script 'evaluation' stuff can be pretty decently impressive, it is not actually executing anything.


I used it on the htb cyber apocalypse ctf a few weeks ago. It did phenomenally, and I was able to get a ton of flags on the crypto and ML challenges I otherwise would not have even tried.


I get that it's a way to find out what GPT-4 is capable of, but IMHO that defeats the point of a game like a CTF.

It's like playing with an aimbot. You may be beating the other players, but where's the fun?


I've run CTF games at major conferences. The point is to solve the challenges by any means necessary within the rules... and that which is not forbidden is allowed.

If I, as a person running a CTF, did not want my players to do this, I would set up a few problems which would have incorrect (but not obviously so) "solutions" generated when fed to an LLM.

The Shamir's Secret Sharing challenge reminds me of the time I was playing DEF CON CTF Quals and had the bright idea to try to scan for challenges. I found one - it involved hiding fragments of a split secret in a modified version of ADVENT. I solved it. Even when the board was fully opened, it was nowhere to be seen. You does your hacks and you takes your chances...


> it involved hiding fragments of a split secret in a modified version of ADVENT. I solved it. Even when the board was fully opened, it was nowhere to be seen

Can you explain what this means? I don't understand, except for the split secret part.

Also: How did you do it?


https://en.wikipedia.org/wiki/Colossal_Cave_Adventure

http://point-at-infinity.org/ssss/

DEF CON CTF quals is (or was) "Jeopardy" style with five categories with five problems each. The thing I found was not one of the 25 problems.


My comment was negative and didn't add anything. I should have refrained.

Sorry to the author, and thanks for the article; I learned about Lagrange interpolation.
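
For anyone else who's curious, here's a minimal sketch of the recovery step (my own toy numbers, not from the article): evaluate the Lagrange interpolation of the shares at x = 0 over a prime field.

    # Toy Shamir recovery: the secret is f(0) for a degree-(t-1) polynomial mod p.
    # These shares come from f(x) = 1234 + 166*x + 94*x^2 mod 7919, so the secret is 1234.
    def recover(shares, p):
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (0 - xj) % p
                    den = den * (xi - xj) % p
            secret = (secret + yi * num * pow(den, -1, p)) % p  # pow(den, -1, p): modular inverse (Python 3.8+)
        return secret

    print(recover([(1, 1494), (2, 1942), (3, 2578)], 7919))  # 1234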


I think it’s a reasonable thing to consider, and you brought it up in a healthy, non-inflammatory way.


I often use cheats and walkthroughs, to the chagrin of my gamer friends. I have fun though, so who cares?


[flagged]


Yeah, this one was not hard to spot. Very obvious ChatGPT style. Try playing with a prompt preamble to set the writing style and tone.


Besides, the account's called catgpt3; might as well pick a... feline type of prompt.


It's like using a calculator on a math test. On some tests they're allowed, on others using one would be cheating.


I appreciate that they say they need to learn more about 'exec' when asking GPT-4. It also plays well into some of the reading strategies I've seen: get a high-level understanding first, then read the documentation with a more general idea going in. That could also lower frustration for new engineers, so they push through to the end!


I was hoping for someone to have poisoned GPT-4's training data with a flag you had to get it to say.

It would be a pretty big flex.


Especially since they would have had to do it in 2021!

The MIT Mystery Hunt has occasionally managed to get the NYT crossword on the day of the hunt to contain a clue answer or two.


The spinning globe in the background appears to have an East Pole and a West Pole.


> Especially because this challenge actually includes a very tricky part related to base-27

It looks like a bog standard function for converting to a base X number. All GPT-4 had to do was paste 27 in.
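
For reference, such a conversion is roughly this shape (an illustrative sketch, not the challenge's actual code); the only base-specific part is the 27 you pass in:

    # Repeated divmod by the base; works for any base up to len(digits).
    def to_base(n, base, digits="0123456789abcdefghijklmnopqrstuvwxyz"):
        if n == 0:
            return digits[0]
        out = []
        while n > 0:
            n, r = divmod(n, base)
            out.append(digits[r])
        return "".join(reversed(out))

    print(to_base(1000, 27))  # "1a1", since 1*27**2 + 10*27 + 1 == 1000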


ChatGPT is quite capable of doing some "reasoning", but assume for a second that it's not, and all it does is search for human-written solutions and adapt them a little to your question. That alone is incredibly helpful.


I tried using GPT-4 on the qualifier rounds for hackasat, and I think you need to know the basics of what a good answer is (or maybe satellite red teaming is too obscure), but I never got a flag out of it.


Hehe, you can solve the two "Street" challenges with GPT-4; they are bog-standard entry-level competitive programming challenges disguised as reverse engineering challenges.


I thought the "/shurdles" problem was to be solved via "chroot".


You need to be root or have CAP_SYS_CHROOT to use the chroot system call. You can, however, create a new user and mount namespace on distros that allow unprivileged namespaces (Ubuntu) and then chroot away. The challenge could have been solved that way, depending on the kernel used and whether the binary was a setuid binary reading a flag file.

But the way the challenge was designed, it's more about just changing argv[0] rather than the actual executable path.
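
A hedged illustration of that argv[0] trick in Python (my own example, not the challenge's actual binary): exec a real program, but lie about its name in the argument vector.

    import os

    # On Linux, cat's own /proc/self/cmdline shows the spoofed argv[0]
    # ("totally-not-cat"), while /proc/self/exe would still resolve to /bin/cat.
    pid = os.fork()
    if pid == 0:
        os.execv("/bin/cat", ["totally-not-cat", "/proc/self/cmdline"])
    os.waitpid(pid, 0)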


Yep, challenge author here, and it was definitely to teach that `argv[0]` is not trustworthy. I've seen privileged processes try to re-invoke themselves (as, say, a child process) by looking at `argv[0]` rather than something like `/proc/self/exe` (which is also subject to race conditions if the directory is writable).

The binary was not setuid, but was only executable (not readable) by the user used.


>The binary was not setuid, but was only executable (not readable) by the user used.

Ah, then ptrace/gdb could have been used to dump it out as well :). Looks like a fun CTF; too bad I was too busy for bsides this year...


Yes. Thank you. I just missed that part.

It's been a long while since doing basic linux administration. I am getting rusty.


> And this time, I did it with the help of GTP-4

Should've had *GPT*-4 proofread this.


Hehe, I make the typo all the time! I need to consciously remind myself: Generative Pre-trained Transformer.


Just remember the letters are in alphabetical order.



