Neat! My mind-blown moment with GPT-4 was realizing that it will often be able to tell you the output of the (unique, not available in training data) scripts it writes for you.
I was actually working on something in this vein yesterday: I asked it for output and found it often generated the output I asked for, but it was not actually the output of the SQL query that it wrote.
The query it wrote wasn't even valid SQL but it was close enough to make you think it would work.
Yeah I asked it a Python question and got back an answer with some Python code to demonstrate. The Python code worked great but demonstrated the exact opposite of the answer that was given.
It’ll tell you what a typical output for the command might be, and the more complex the script, the more wrong and full of hallucinations it will be.
There’s a huge difference.
Specifically, you have no way of knowing the difference between accurate outputs and inaccurate outputs, without running the command yourself, making it largely worthless.
Without access to an environment, it's not possible to magically know what the output of a command will be; the model doesn't have an embedded understanding of code, specifically not of iteration, mapped data, or mathematical functions.
For trivial obvious outputs it’s good, but it’s not executing the code; it’s generating what seems like plausible output; and if the output is trivially derived from the input, it’ll be impressively accurate.
…but, as the complexity of the task increases or the task deviates from “standard problem space” the triviality of generating accurate output decreases and it stops generating impressive outputs.
Tldr; yes, but it doesn’t scale well beyond trivial outputs.
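To illustrate the scaling claim (both snippets are invented for this comment, not taken from any GPT transcript): the first output is trivially derived from the input, so plausible-text generation and real execution agree; the second forces 20 dependent iterations, which is exactly where the plausible-text approach falls apart.

```python
# Trivial: the output is derived almost directly from the input,
# so "plausible text" and real execution coincide.
greeting = "hello".upper()
print(greeting)  # HELLO

# Non-trivial: each iteration depends on the previous one, so the
# only route to the answer is to actually perform all 20 steps.
x = 1
for _ in range(20):
    x = (x * 31 + 7) % 1000
print(x)  # 641
```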
> Specifically, you have no way of knowing the difference between accurate outputs and inaccurate outputs, without running the command yourself, making it largely worthless.
The former is necessarily the case given the Halting Problem; the latter is falsified by the fact we can reason about code despite the Halting Problem.
> the latter is falsified by the fact we can reason about code despite the Halting Problem.
I'm not talking in general terms, or describing 'what the code does' in a summary or bullet point high-level form. No one is arguing that it can't summarize and describe what code does. These models are very good at that.
I'm talking specifically about generating the output of the command, as the OP specifically mentioned.
It does generate the exact output for commands and scripts if you request it, and sometimes even if you don't; those outputs are just, often, hallucinated rubbish.
Being impressed that GPT can invent from 'thin air' some creative writing (fiction) when you tell it `pretend you're a docker container and now run 'ls'` is, I feel, missing the boat in terms of understanding or being impressed by the capabilities of these LLMs.
And they're often not, even for quite complex functions that require symbolically executing quite a few calculations to get to the correct output.
Nobody is impressed that it "can invent from 'thin air' some creative writing (fiction)", but that it often does not and in fact produces correct output. You're right we can't rely on it producing the correct output as it currently stands, but that it is capable of doing this at all is impressive.
> the latter is falsified by the fact we can reason about code despite the Halting Problem
I think wokwokwok's point holds true in practice.
Our patience and working memory are far more limited than what would be needed to accurately model all the necessary details of even moderately complex algorithms in our heads.
That's one of the main reasons to limit code complexity to improve readability/maintainability.
Suspect this is a troll posting, but on the off chance I'm wrong... The LLM gives the output of a command. To do so, it has to be able to determine when the command exits. This is exactly the halting problem.
For a trivial example, what is the output of:
```
while True:
    pass
print("goodbye world")
```
(this is also proof that leaving out the curly braces makes code harder instead of simpler #python-lie-to-me. multiple edits to get this to render correctly on HN )
Rice's theorem may actually be more appropriate here (but it's a consequence of the halting theorem).
But it's important to note that just because there's no algorithm that works on ALL programs doesn't mean that the semantic properties of all programs are undecidable. Clearly for the particular programs where the program is bounded and guaranteed to terminate (e.g. no unbounded loops or recursion allowed) we can determine such properties, and I believe theorem provers in fact only allow such programs. And similarly you can restrict yourself to only the programs that you can prove will terminate in N steps (which might be excluding some programs that do terminate but require more than N steps of compute to prove).
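One concrete version of that restriction: execute the program under an explicit step budget, so any program that halts within N steps yields a definite answer, and everything else is reported as undetermined rather than guessed at. A minimal sketch (the function name and the `result` convention are invented for illustration):

```python
import sys

def run_with_step_limit(code, max_steps):
    """Run `code`, giving up if it executes more than max_steps lines."""
    steps = 0

    def tracer(frame, event, arg):
        nonlocal steps
        if event == "line":
            steps += 1
            if steps > max_steps:
                raise TimeoutError("step budget exhausted")
        return tracer

    env = {}
    sys.settrace(tracer)
    try:
        exec(code, env)
        return ("halted", env.get("result"))
    except TimeoutError:
        return ("undetermined", None)
    finally:
        sys.settrace(None)
```

A bounded program like `result = sum(range(100))` comes back with a definite answer, while an unbounded `while True: pass` exhausts the budget and is reported as undetermined instead of being guessed at.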
> It’ll tell you what a typical output for the command might be, and the more complex the script, the more wrong and full of hallucinations it will be.
It doesn't even have to be complex. Ask it about a program in a language that's not very popular and the odds that it'll completely screw up its answer are high.
Obviously I wouldn't have been impressed by hallucinated output. I'm talking about correctly modeling a variable's state through its conception of what a Python interpreter does, which requires building a model of that interpreter and extracting consequences from it, rather than pure language statistics.
I agree that LLMs will often hallucinate. There is obviously no guarantee that the output is correct. But sometimes it is correct anyway, which I notice by actually running the code.
Here is a trivial example which I only mention to bring the conversation back to the reality of GPT-4 actually being able to do things like this:
Me:
You are a Python interpreter. Please give the correct output of the supplied code, with no commentary.
>>> a = ["wokwokwok", "says", "i", "have", "no", "understanding", "of", "code"]
>>> a.append("!")
>>> " ".join([w.upper() for w in a])
Which algorithms are trivial and which are non-trivial? Is sorting trivial? base64-encoding? Can you provide a set of small scripts that you expect GPT-4 to be unable to correctly simulate execution of?
It is weird that you seem to agree that it is capable of performing algorithmic simulation, while discounting that with "but it's not executing the code; it's generating what seems like plausible output", in a way that seems suspiciously close to defining anything it simulates correctly as "trivial", and anything it fails at as proof that it isn't really "executing the code"...
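For anyone who wants to try, here are the actual CPython outputs for the two examples mentioned (sorting and base64-encoding), to compare against whatever GPT-4 produces:

```python
import base64

chars = sorted("wokwokwok")                        # sorting a small string
encoded = base64.b64encode(b"wokwokwok").decode()  # base64-encoding it

print(chars)    # ['k', 'k', 'k', 'o', 'o', 'o', 'w', 'w', 'w']
print(encoded)  # d29rd29rd29r
```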
> Tldr; yes, but it doesn’t scale well beyond trivial outputs.
I understand that the current LLM architectures are fundamentally incapable of goal-seeking and that they lack any concept of "correctness". However, I also recognize that somehow, the incredibly "wrong" architecture of ChatGPT is able to be useful.
I'd like to think that with access to a sandbox runtime environment, a significantly larger context window, and perhaps additional copies of the LLM "supervising/orchestrating" multiple "lower" copies of the LLM by breaking down large work into smaller tasks, that the current ChatGPT architecture could scale well beyond trivial scripts.
And then I hope we abandon this LLM architecture and develop architectures which can actually internally work towards "going beyond" in terms of quality of output for a given task.
Instead of abandoning anything, why not just use appropriate tooling? Boilerplate if you want some code written for you, debugger if you want to know what's wrong with what you're doing. Been around for more than twenty years.
That is very interesting, which I'd like to try once GPT4 is widely available. I did try GPT3 with a CTF a few weeks ago, and it gave seemingly plausible code but outright incorrect answers.
It does not. Regular ChatGPT without plugins does not have access to any tools. Throw it a script with some weird outputs and it'll definitely fail, every time. While the script 'evaluation' stuff can be pretty decently impressive, it is not actually executing anything.
I used it on the htb cyber apocalypse ctf a few weeks ago. It did phenomenal and I was able to get a ton of flags on the crypto and ML challenges I otherwise would not have even tried.
I've run CTF games at major conferences. The point is to solve the challenges by any means necessary within the rules... and that which is not forbidden is allowed.
If I, as a person running a CTF, did not want my players to do this, I would set up a few problems which would have incorrect (but not obviously so) "solutions" generated when fed to LLM.
The Shamir's Secret Sharing reminds me of the time I was playing DEF CON CTF Quals, and had the bright idea to try to scan for challenges. I found one - it involved hiding fragments of a split secret in a modified version of ADVENT. I solved it. Even when the board was fully opened, it was nowhere to be seen. You does your hacks and you takes your chances...
> it involved hiding fragments of a split secret in a modified version of ADVENT. I solved it. Even when the board was fully opened, it was nowhere to be seen
Can you explain what this means? I don't understand, except for the split secret part.
I appreciate that they say they need to learn more about `exec` when asking GPT-4;
it also plays well with some of the reading strategies I've seen: get a high-level understanding first, then read the documentation with a more general idea going in. It could also lower frustration for new engineers, so they push through to the end!
Chat-GPT is quite capable of doing some "reasoning", but assume for a second it's not, and all it does is to search for human-written solutions and adapt them a little to your question. That alone is incredibly helpful.
I tried using GPT-4 on the qualifier rounds for hackasat and I think you need to know the basics of what a good answer looks like (or maybe satellite red teaming is too obscure), but I never got a flag out of it.
Hehe, you can solve the two "Street" challenges with GPT-4; they are bog-standard entry-level competitive programming challenges disguised as reverse engineering challenges.
You need to be root or have CAP_SYS_CHROOT to use the chroot system call. You can, however, create a new user and mount namespace on distros that allow unprivileged namespaces (Ubuntu) and then chroot away. The challenge could have been solved that way, depending on the kernel used and whether the binary was a suid binary reading a flag file.
But the way the challenge was designed, it's more about just changing argv[0] rather than the actual executable path.
Yep, challenge author here, and it was definitely to teach that `argv[0]` is not trustworthy. I've seen privileged processes try to re-invoke themselves (as, say, a child process) by looking at `argv[0]` rather than something like `/proc/self/exe` (which is also subject to race conditions if the directory is writable).
The binary was not setuid, but was only executable (not readable) by the user used.