> C-to-Rust transpiler is a pretty doable project.
It's been done, but what comes out is terrible Rust. Everything is unsafe types with C semantics.
An intelligent C to Rust translator would be a big win. You'd need to annotate the input C with info about how long arrays are and such, to guide the translator. It might be possible to use an LLM to analyze the code and provide annotations. Usually, C code does have array length info; it's just not in a form that the language ties to the array itself. If you see
char* buf = malloc(len);
the programmer knows that "buf" has length "len", but the language does not. Something needs to annotate "buf" with that info so that the translator knows it. Then the translator can generate Rust:
let mut buf = vec![0;len];
The payoff comes at calls. C code:
int write_to_device(char* buf, size_t len)
is a common idiom. LLMs are good at idioms. At this point, one can guess that this
is equivalent to
fn write_to_device(buf: &[u8]) -> i32
in Rust. Then the translator has to track "len" to make sure that
assert_eq!(buf.len(), len);
either is provably true or, if it can't be proven, gets emitted as an assert that checks it at run time.
So that's a path to translation into safe Rust.
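As a rough sketch of what that path could produce: only the signatures come from the example above. The function bodies and the write_to_device_compat shim are hypothetical, made up here for illustration.

```rust
/// Safe translation of `int write_to_device(char* buf, size_t len)`.
/// The slice carries its own length, so `len` drops out of the signature.
/// The body is a stand-in: it only reports how many bytes it was handed.
fn write_to_device(buf: &[u8]) -> i32 {
    buf.len() as i32
}

/// Hypothetical shim for call sites where the translator could not prove
/// that `len == buf.len()`: keep `len` and check it at run time.
fn write_to_device_compat(buf: &[u8], len: usize) -> i32 {
    assert_eq!(buf.len(), len);
    write_to_device(buf)
}

fn main() {
    let len = 16;
    // Translation of `char* buf = malloc(len);`
    let mut buf = vec![0u8; len];
    buf[0] = b'x';

    // The length is known to be `len` here, so no assert is needed.
    println!("{}", write_to_device(&buf));

    // Unproven case: the assert runs, and passes for this input.
    println!("{}", write_to_device_compat(&buf, len));
}
```

If the length assumption is wrong, the shim panics instead of reading past the end of the buffer, which is the trade being proposed.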
Funding could probably be obtained from Homeland Security for this, given the new White House level interest in safe languages and the headaches being caused by the cyber war. Is CVS still down?
> It's been done, but what comes out is terrible Rust. Everything is unsafe types with C semantics.
The idea behind the current c2rust tool is that you'd do a one-shot conversion to Rust and then gradually do refactoring passes over the barely-Rust code to convert it to correct Rust code. The focus is on preserving the semantics of C rather than on producing anything close to idiomatic Rust (e.g., a + b gets translated to a.wrapping_add(b) all the time). Which is an approach, but in practice I'm not sure it ends up providing any value over "set up your system to compile both C and Rust into a final image and then slowly move stuff from the C side to the Rust side as appropriate".
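To make the a + b example concrete, here is a minimal sketch of the two styles. The function names are hypothetical, and the first function only imitates the style described above; it is not actual c2rust output.

```rust
/// Semantics-preserving style: unsigned addition in C wraps on overflow,
/// so a literal translation spells that out on every addition.
fn add_c_style(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

/// What a human might write after deciding that overflow is a bug:
/// report it to the caller instead of wrapping silently.
fn add_checked(a: u32, b: u32) -> Option<u32> {
    a.checked_add(b)
}

fn main() {
    println!("{}", add_c_style(u32::MAX, 1)); // wraps around to 0
    println!("{:?}", add_checked(u32::MAX, 1)); // None
    println!("{:?}", add_checked(2, 3)); // Some(5)
}
```

Getting from the first form to the second takes a decision about intent, which is the part a purely mechanical translation can't make.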
> Usually, C code does have array length info; it's just not in a form that the language ties to the array itself.
This is actually why C23 made VLA support semi-mandatory: it enables you to describe a function signature as
int write_to_device(size_t len, char buf[len])
and C23 compilers are required to support that, even in the absence of full VLA support! The intent of making this support mandatory was to be able to use it as a basis for adding better bounds-checking support to the language and compilers. (Although, as you noticed, there is an order-of-declarations issue compared to the typical idiomatic expression of such APIs in C, where the buffer comes before its length, and the committee has yet to find a solution to that.)
> The idea behind the current c2rust tool is that you'd do a one-shot conversion to Rust and then gradually do refactoring passes over the barely-Rust code to convert it to correct Rust code.
I've seen what comes out of the transpiler. Nobody should touch that code by hand. It's awful Rust, and uglier than the original C. Modifying that by hand is like modifying compiler-generated machine code.
> This is actually why C23 made VLA support semi-mandatory.
C23 doesn't actually use that info. You can't get the size of buf from buf. I proposed something like that 12 years ago.[1] But I wanted to add enough features to check it.
> I've seen what comes out of the transpiler. Nobody should touch that code by hand. It's awful Rust, and uglier than the original C.
I can't disagree here. I think the original idea was to rely on automated refactoring tools to try to make the generated Rust somewhat more palatable, but I never was able to get that working.
> C23 doesn't actually use that info.
True; the intent is to require it so that it can be leveraged by future extensions. The C committee tends to move glacially.
The real problem is not translating code. It's translating data types. If you can determine that a "char *" in C can be a Vec in Rust, you're most of the way there. It's no longer ambiguous what to do with the accesses.
This is where I think LLMs could help. Ask an LLM: "In this code, could variable 'buf' be safely represented as a Rust 'Vec', and if so, what is its length?" LLMs don't really know the languages, but they have access to many samples, which is probably good enough to get a correct guess most of the time. That's enough to provide annotation hints to a dumb translator. The problem here is translating C idioms to Rust idioms, which is an LLM kind of problem.
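A minimal sketch of what "annotation hints to a dumb translator" might look like. Everything here (PtrHint, translate_decl, the emitted strings) is hypothetical and invented for illustration; no existing tool is being described.

```rust
/// Hypothetical hint that an analysis pass (LLM or otherwise) attaches
/// to a `char*` variable or parameter in the C source.
#[derive(Debug)]
enum PtrHint {
    /// Owned heap buffer whose length is the given C expression.
    OwnedBuffer { len_expr: String },
    /// Borrowed array whose length lives in another parameter.
    BorrowedSlice { len_param: String },
    /// No confident guess: leave it as a raw pointer.
    Unknown,
}

/// The "dumb translator" step: emit Rust text for `name` purely from the hint.
/// It does no analysis of its own.
fn translate_decl(name: &str, hint: &PtrHint) -> String {
    match hint {
        PtrHint::OwnedBuffer { len_expr } => {
            format!("let mut {name} = vec![0u8; {len_expr}];")
        }
        PtrHint::BorrowedSlice { len_param } => {
            format!("{name}: &[u8] /* replaces ({name}, {len_param}) */")
        }
        PtrHint::Unknown => {
            format!("{name}: *mut c_char /* needs manual review */")
        }
    }
}

fn main() {
    let hints = [
        PtrHint::OwnedBuffer { len_expr: "len".to_string() },
        PtrHint::BorrowedSlice { len_param: "len".to_string() },
        PtrHint::Unknown,
    ];
    for hint in &hints {
        println!("{}", translate_decl("buf", hint));
    }
}
```

The guess only has to be right most of the time, as long as a wrong guess shows up as a failed check rather than as memory corruption.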
They have made some improvements here recently. There is a lot less unsafe generated. The rest is more idiomatic too. The cost is that it will be throwing panics everywhere until you fix the faulty assumptions it asserted. I like the new way better.