> We work very closely with Google DeepMind to adapt Gemini models for Google-scale coding and other Software Engineering usecases.
Considering how terrible and frequently broken the code the public-facing Gemini produces is, I have to be honest: that kind of scares me.
Gemini frequently fails at some fairly basic stuff, even in popular languages where it would have had a lot of source material to work from and where other public models (even free ones) sail through.
To give a fun, fairly recent example, here's a prime factorisation algorithm it produced for Python (can you spot all the problems?):
# Find the prime factorization of n
prime_factors = []
while n > 1:
    p = 2
    while n % p == 0:
        prime_factors.append(p)
        n //= p
    p += 1
prime_factors.append(n)
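For comparison, a correct trial-division version would look more like this (a minimal sketch for reference):
# Correct trial division: p is initialised once, before the loop, and each
# prime factor is divided out completely before moving on.
prime_factors = []
p = 2
while p * p <= n:
    while n % p == 0:
        prime_factors.append(p)
        n //= p
    p += 1
if n > 1:
    prime_factors.append(n)  # whatever remains at the end is itself prime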
They probably use AI for writing tests, small internal tools/scripts, building generic frontends and quick prototypes/demos/proofs of concept. That could easily be that 25% of the code. And modern LLMs are pretty okayish with that.
I believe most people use AI to help them quickly figure out how to use a library or an API without having to read all of its (often outdated) documentation, rather than to help them solve some mathematical challenge.
If the documentation is so out of date that it doesn't help, that doesn't bode well for the AI's training data helping it get things right either, does it?
Unfortunately, it often hallucinates wrong parameters (or gets their order wrong) when there are multiple different APIs for similar packages. For example, there are plenty of ML model inference packages, and the code suggestions for NVIDIA Triton Inference Server Python code are pretty much always wrong: it generates code that's probably correct for other Python ML inference packages with slightly different APIs.
I often find the opposite. Documentation can be up to date, but AI suggests deprecated or removed functions because there’s more old code than new code. Pgx v5 is a particularly consistent example.
We are sorely lacking a "Make Computer Science a Science" movement. The tech lead's blurb is par for the course: it talks about "SWE productivity" with no reference to scientific inquiry or to a foundational understanding of safety, correctness, verification, and validation of these new LLM technologies.
Did you know that Software Engineering is a university level degree? That it is a field of scientific study, with professors who dedicate their lives to it? What happens when companies ignore science and worse yet cause harm like pollution or medical malpractice, or in this case, spread Silicon Valley lies and bullshit???
Did you know? How WEIRD.
How about you not harass other commenters with such arrogantly ignorant sarcastic questions?? Or is that part of corporate "for-profit" culture too????
> Did you know that Software Engineering is a university level degree? That it is a field of scientific study, with professors who dedicate their lives to it?
So is marketing? So is finance? So is petroleum engineering?
I didn't say it's hard, but it's most definitely leetcode, as in "pointless algorithmic exercise that will only show you if the candidate recently worked on a similar question".
Curious, I would expect a programmer of your age to remember Knuth's "Beware of bugs in the above code; I have only proved it correct, not tried it."
I'm happy you know math, but my point before this thread got derailed was that we're holding (coding) AI to a higher standard than actual humans, namely expecting it to write bug-free code.
> my point before this thread got derailed was that we're holding (coding) AI to a higher standard than actual humans, namely expecting it to write bug-free code
This seems like a very layman attitude, and I would be surprised to find many devs adhering to it. Comments in this thread alone suggest that many devs on HN do not agree.
I hold myself to a higher standard than AI tools are capable of, from my experience. (Maybe some people don't, and that's where the disconnect is between the apologists and the naysayers?)
Humans can actually run the code and know what it should output. The LLM can't, and putting it in a loop against the code's output doesn't work well either, since the LLM can't navigate that well.
A senior programmer like me knows that primality-based problems like the one posed in your link are easily gamed.
Testing for small prime factors is easy - brute force is your friend. Testing for large prime factors requires more effort. So the first trick is to figure out the bounds to the problem. Is it int32? Then brute-force it. Is it int64, where you might have a value like the Mersenne prime 2^61-1? Perhaps it's time to pull out a math reference. Is it longer, like an unbounded Python int? Definitely switch to something like the GNU Multiple Precision Arithmetic Library.
In this case, the maximum value is 1,000, which means we can enumerate all distinct primes in that range and test each one for presence in each input value, one by one:
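Something along these lines fits that description and passes the tests below (a sketch, assuming a module-level _primes list holding every prime up to 1,000, computed further down):
def distinctPrimeFactors(nums: list[int]) -> int:
    # Constraints from the problem statement, enforced only in debug mode.
    assert 1 <= len(nums) <= 10_000, "size out of range"
    distinct_factors = set()
    for num in nums:
        assert 2 <= num <= 1_000, "num out of range"
        # Check each precomputed prime for presence in this value.
        for p in _primes:
            if p > num:
                break
            if num % p == 0:
                distinct_factors.add(p)
    return len(distinct_factors)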
That worked without testing, though I felt better after I ran the test suite, which found no errors. Here's the test suite:
import unittest

class TestExamples(unittest.TestCase):
    def test_example_1(self):
        self.assertEqual(distinctPrimeFactors([2,4,3,7,10,6]), 4)

    def test_example_2(self):
        self.assertEqual(distinctPrimeFactors([2,4,8,16]), 1)

    def test_2_is_valid(self):
        self.assertEqual(distinctPrimeFactors([2]), 1)

    def test_1000_is_valid(self):
        self.assertEqual(distinctPrimeFactors([1_000]), 2)  # (2*5)**3

    def test_10_000_values_is_valid(self):
        values = _primes[:20] * (10_000 // 20)
        assert len(values) == 10_000
        self.assertEqual(distinctPrimeFactors(values), 20)

@unittest.skipUnless(__debug__, "can only test in debug mode")
class TestConstraints(unittest.TestCase):
    def test_too_few(self):
        with self.assertRaisesRegex(AssertionError, "size out of range"):
            distinctPrimeFactors([])

    def test_too_many(self):
        with self.assertRaisesRegex(AssertionError, "size out of range"):
            distinctPrimeFactors([2]*10_001)

    def test_num_too_small(self):
        with self.assertRaisesRegex(AssertionError, "num out of range"):
            distinctPrimeFactors([1])

    def test_num_too_large(self):
        with self.assertRaisesRegex(AssertionError, "num out of range"):
            distinctPrimeFactors([1_001])

if __name__ == "__main__":
    unittest.main()
I had two typos in my test suite (an "=" for "==", and a ", 20))" instead of "), 20)"), and my original test_num_too_large() tested 10_001 instead of the boundary case of 1_001, so three mistakes in total.
If I had no internet access, I would compute that table thusly:
_primes = [2]
for value in range(3, 1000):
    if all(value % p > 0 for p in _primes):
        _primes.append(value)
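As a quick sanity check on that table: there are 168 primes below 1,000, and the largest is 997.
assert len(_primes) == 168   # count of primes below 1,000
assert _primes[-1] == 997    # largest prime below 1,000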
Do let me know of any remaining mistakes.
What kind of senior programmers do you work with who can't handle something like this?
EDIT: For fun I wrote an implementation based on sympy's integer factorization:
from sympy.ntheory import factorint

def distinctPrimeFactors(nums: list[int]) -> int:
    distinct_factors = set()
    for num in nums:
        distinct_factors.update(factorint(num))
    return len(distinct_factors)
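This works because factorint returns a dict mapping each prime factor to its exponent, and set.update() over a dict adds only the keys. For example:
from sympy.ntheory import factorint

print(factorint(1_000))   # {2: 3, 5: 3} -> the distinct primes are 2 and 5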
Here's a new test case, which takes about 17 seconds to run:
Empirical testing (for example: https://news.ycombinator.com/item?id=33293522) has established that the people on Hacker News tend to be junior in their skills. Understanding this fact can help you understand why certain opinions and reactions are more likely here. Surprisingly, the more skilled individuals tend to be found on Reddit (same testing performed there).
I’m not sure that’s evidence; I looked at that and saw it was written in Go and just didn’t bother. As someone with 40 years of coding experience and a fundamental dislike of Go, I didn’t feel the need to even try. So the numbers can easily be skewed, surely.
Only individuals who submitted multiple bad solutions before giving up were counted as failing. If you look but don't bother, or submit a single bad solution, you aren't counted. Thousands of individuals were tested on Hacker News and Reddit, and surprisingly, it's not even close: Reddit is where the hackers are. I mean, at the time of the testing, years ago.
That doesn't change my point. It didn't test every dev on all platforms; it tested a subset, and that subset may well have different attributes from the ones that didn't engage. So it says nothing about the audience for the forums as a whole, just the few thousand who engaged.
It could even be that there are fewer Go programmers here and some just took a stab at it even though they don't know the language, so it could just be selecting for which forum has the most Go programmers. Hardly rigorous.
Agreed. But remember, this isn't the only time the population has been tested. This is just the test (from two years ago, in 2022) that I happen to have a link to.
It's also fine to be an outlier. I've been programming for 24 years and have been hanging out on HackerNews on and off for 11. HN was way more relevant to me 11 years ago than it is now, and I don't think that's necessarily only because the subject matter changed, but probably also because I have.
The way the site works is explained in the first puzzle, "Hack This Site". TLDR, it builds and runs your code against a test suite. If your solutions weren't accepted, it's because they're wrong.