Terence Tao on proof checkers and AI programs (scientificamerican.com)
331 points by antineutrino 12 days ago | 174 comments


The description of "project manager mathematicians" reminded me of "A letter to my old friend Jonathan" [1] and the followup [2] by Edsger Dijkstra from 1975. It was written as satire, basically criticizing the way that software was produced by showing how ridiculous it would be to produce mathematics the same way. But in some ways it was prescient. Of course, the main point of these documents is to critique intellectual property (particularly the absurdity of applying it to mathematical truth) but fortunately that aspect is not a significant concern of this push toward mechanization.

[1]: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD04xx/E...

[2]: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD05xx/E...


LLMs can do math. AFAIK this is the SOTA if you don't formalize it:

https://arxiv.org/abs/2301.13867


This was fantastic. Thank you for sharing it!

Except it's a nonsensical comparison? 99.999% of software is not living on the cutting edge and is just engineering implementation that can be managed.

Really insightful. What's missing, though, I think, is that LLMs abstract things in ways that are increasingly super-human. Tao says: "It’s very easy to transform a problem into one that’s harder than into one that’s simpler. And AI has not demonstrated any ability to be any better than humans in this regard." But I think, due to the way LLMs work, much higher-level insights will become possible. And that is the really exciting part.

Right now it's a helper. A fact checker. And doer of tedious things. Soon it will be a suggester of insights. Already LLMs compress embeddings and knowledge and have insights that we don't see.

Hinton gives the example of how a nuclear bomb and a compost heap are related and that most humans would miss that: https://www.youtube.com/watch?v=Gg-w_n9NJIE&t=4613s


The biggest problem with LLMs is that they are simply regressions over the writings of millions of humans on the internet. It's very much a machine that is trained to mimic how humans write, nothing more.

The bigger concern is that we don't have the training data required for it to mimic something smarter than a human (assuming that a human is given sufficient time to work on the same problem).

I have no doubt ML will exceed human intelligence, but the current bottleneck is figuring out how to get a model to train itself against its own output without human intervention, instead of doing a regression against the entire world's collective writings. Remember, Ramanujan was, with no formal training and only access to a few math books, able to achieve brilliant discoveries in mathematics. In terms of training data used by ML models, that is basically nothing.


LLMs are increasingly trained on multimodal data besides human written text, such as videos showing complex real-world phenomena that humans may also observe, but not understand, or time series data that humans may observe but be incapable of predicting. I don't see a fundamental reason why the LLM that is trained on such data may not already become a superhuman predictor of e. g. time-series data, with the right model architecture, amount of data, and compute.

In a way, next-token prediction is also a time-series prediction problem, and one that is arguably more difficult than producing the time series/text in the first place. That is because, to predict what a human will say about a topic, you don't just need to model the topic but also the mental processes of the human talking about it, which adds another layer of abstraction.


In this case, what are we hoping to get from the LLM that, say, a simulation or domain specific analysis would not already give us? Also why does it need to be an LLM that is doing these things?

Wouldn't it make more sense to delegate such a pattern finding intelligence? Why does the LLM have to do everything these days?


> Also why does it need to be an LLM that is doing these things?

Well, if you ask me it's pretty clear why: we are in the hype cycle of a very specific new type of hammer, so of course everything must be a nail. Another reason is that our understanding of LLMs is limited, which makes them limitless and constantly shape/purpose-shifting for those whose job it is to sell them to us. As an engineer, I don't know what to do with them: they have no "datasheet" describing their field and range of applicability, and I feel there are many of us annoyed and tired of being constantly pulled into discussions about how to use them at all cost.

(Yes, I understand it was rhetorical)


> Why does the LLM have to do everything these days?

Because why not? It feels like we've stumbled on a first actually general ML model, and we're testing the limits of this approach - throwing more and more complex tasks at them. And so far, surprisingly, LLMs seem to manage. The more they do, the more interesting it is to see how far they can be pushed until they finally break.

> Wouldn't it make more sense to delegate such a pattern finding intelligence?

Maybe. But then, if an LLM can do that specific thing comparably well, and it can do a hundred other similarly specific tasks with acceptable results, then there's also a whole space between those tasks, where they can be combined and blended, which none of the specialized models can access. Call it "synergy", "cross-pollination", "being a generalist".

EDIT:

As a random example: speech recognition models were really bad until recently[0], because it turns out that having an actual understanding of the language is extremely helpful for recognizing speech correctly. That's why an LLM (or the multi-modal variant, or some future, better general-purpose model) has to do everything - because seemingly separate skills reinforce each other.

--

[0] - Direct and powerful evidence: compare your experience with voice assistants and dictation keyboards vs. the conversation mode in the ChatGPT app. The latter can easily handle casual speech with weird accents, delivered outside on a windy day near a busy street, with near-100% accuracy. It's a really spectacular jump in capabilities.


> Because why not? It feels like we've stumbled on a first actually general ML model, and we're testing the limits of this approach - throwing more and more complex tasks at them. And so far, surprisingly, LLMs seem to manage.

We live in completely different worlds. Every LLM I've tried manages to do nothing except spout bullshit. If your job is to create bullshit, an LLM is useful. If your job requires anything approximating correctness, LLMs are useless.


>may not already become a superhuman predictor of e. g. time-series data, with the right model architecture, amount of data, and compute.

We have that (to the degree possible) with non-LLM analysis - all kinds of regression, machine learning techniques, etc. An LLM might not even be applicable here, or have any edge over something specialized.

And usually this breaks down because the world (e.g. weather or stock market prices) is not predictable enough at the granularity we're interested in, but is full of chaotic behavior and non-linearities.


Right, I think the insight that LLMs "create" though is via compression of knowledge in new ways.

It's like teaching one undergraduate (reading some blog on the internet and training on next-word prediction), then another undergraduate (training on another blog), then another one, etc., and finding a way to store what's in common among all these undergraduates. Over time, it starts to see commonalities and to become more powerful than a human - or even than a human who had taught a million undergraduates. But it's not just at the undergraduate level. It's the most abstract/efficient way to represent the essence of their ideas.

And in math that's the whole ball game.


> the current bottleneck is figuring out how to get a model to train itself against its own output without human intervention

Isn't that what AlphaZero was known for? After all, the 'zero' in the name refers to learning from zero human data - pure self-play from scratch - which is the concept you're talking about. So we have a proof of existence already, and there have been recent papers showing progress in some mathematical domains (though simplified ones so far, such as Euclidean geometry - which does however feature complex proofs).


Yes, but AlphaZero is based on reinforcement learning, where there is a simple cost function to optimize. There hasn't been much progress in applying reinforcement learning to LLMs to get them to self-improve. I agree with the quote that this will be necessary to get superhuman performance in mathematics, and Lean may very well play a role there, since it can help provide a cost function by checking correctness objectively.
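
To make that "Lean as a cost function" idea concrete, here is a minimal sketch (my own, not from the article): sample a candidate proof from a model, run the Lean checker on it, and use the accept/reject verdict as a binary reward. The model interface (sample_proof, update) is hypothetical; the only real tool invoked is the lean binary on a .lean file.

  import subprocess
  import tempfile

  def lean_accepts(statement: str, candidate_proof: str) -> bool:
      """Objective reward signal: does Lean's checker accept the proof?"""
      source = f"theorem candidate : {statement} := by\n  {candidate_proof}\n"
      with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
          f.write(source)
          path = f.name
      result = subprocess.run(["lean", path], capture_output=True)
      return result.returncode == 0

  def rl_step(model, statement: str) -> None:
      proof = model.sample_proof(statement)          # hypothetical policy sample
      reward = 1.0 if lean_accepts(statement, proof) else 0.0
      model.update(statement, proof, reward)         # hypothetical policy-gradient update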

> but the current bottleneck is figuring out how to get a model to train itself against its own output without human intervention

this is sort of humorous, because the entire current generation of Transformer-based encoders was built specifically to get "a model to train itself without human intervention"

ref. arXiv:2304.12210v2 [cs.LG], "A Cookbook of Self-Supervised Learning"


Yes, thank you!! I have been trying to make this point for years. It seems self-evident to me, but it's hard to convince people of this.

But I think it's worth mentioning that what we have is already useful for research and making people smarter. Even Terence Tao has blogged that there were times where an LLM suggested an approach to a problem that hadn't occurred to him. This also happened to me once, enabling me to solve a problem that I'd failed to figure out on my own.


Does that mean ChatGPT is able to prove something, or summarize something, simply because some training materials prove or summarize similar things? What if the thing I asked it to summarize never existed in the training material? Is it capable of finding similar materials and applying the same summarization?

Ramanujan attributed his success in no small part to divine assistance. I wonder if sentient AI would do the same :)

Hinton's example doesn't resonate with me -- I do gardening as a hobby, so my immediate answer to his "What is common to nukes and a compost heap?" was the concept of critical mass. It took me 10 seconds to articulate the word, so I'm not as fast, but I still got the "correct" analogy right away, because with the right information the answer is obvious.

Hinton says this demonstrates analogical thinking. I read about gardening online and I've probably read about compost heap physics written down somewhere online.

So what more likely happened is that ChatGPT was previously trained on some article about compost heap chain reactions - there's a lot of soil science and agriculture writing that discusses this, even for laypeople like myself, some of which I've already read - so it isn't a long reach if you know enough about both topics.

Thus Hinton's example strikes me as more having failed to control for LLMs having already basically seen the answer before in its training texts.

EDIT: A few minutes later in the video, Ilya comments (https://youtu.be/Gg-w_n9NJIE?t=4966): "We have a proof of existence, the human brain is a neural network" (and the panelist agrees, assuming/barring any computational "magic" happening in the human brain). I'm interested in this comment, because I generally hold that position as well, but I see many naysayers and dissenters who will respond "Well but real neurons don't run on 8 bits, neurons have different types and cells and contain DNA in a chemical soup of hormones, etc. etc., you are just confusing the model for the real thing in a backwards way", and so on. So I think there's an interesting divide in philosophy there as well, and I don't personally have an articulate answer to explain why I might hold the position that natural brains are also neural networks other than an appeal to computationalism.


If one AI based system becomes an expert in writing math proofs, the cost of getting another one is simply copying files and provisioning them into different cloud instances. This can be done within minutes to hours.

If one human becomes an expert in writing math proofs, the cost of getting another one is waiting for a human in the population who wants math as a career, then waiting for that human to finish school (decades) and get advanced specialization, with no guarantee that the human will finish school or become proficient enough to advance the frontier of knowledge. By the time you finished waiting for this, you could have trillions of AI experts working in parallel.

The bandwidth of the brain when it comes to assimilating new information is low (up to hundreds of words per minute). Machines will be able to clone a lifetime of knowledge in seconds, have thousands of conversations in parallel, even serializing parts of their "brain" and sending them over to other AIs, like the Borg. These are things that are not possible within the human condition.

And once programmable matter is attained, you could produce exponential amounts of computronium and do the equivalent of millennia of research in seconds. This is how the omega point could happen.


> Right now it's a helper. A fact checker. And doer of tedious things.

One of these is not like the others.


I know absolutely nothing about math, but... I'm thinking about the history of software, where back in the day you had all these amazing projects like RollerCoaster Tycoon basically being written by one guy. Then software engineering became modularized in a way that sounds kind of similar to what is described in the interview, and now you have hordes of people who churn out React for a living (like myself) and software engineering is now a giant assembly line and the productivity (or skill required) per individual has dropped to near zero.

I guess I have the impression that when fields are in their heyday, the best work is done by some genius fitting like a hundred things in one brain, and when that gets replaced by an assembly line, the field has basically stopped producing anything of real value. Anyone in math (or otherwise) want to tell me why that's completely wrong?


This intuition denies the inherent human power: that we are not merely individuals, and can work together to accomplish greater things than any individual could. Software engineering has certainly _not_ stopped doing cool things -- exactly the opposite -- obviously. I don't care to justify this, since there was no justification in the claim I'm responding to.

My theory is not that individual genius accomplishes more than a group, but that whether a field is currently more suited for individual single-brain productivity or assembly-line cooperation is a signal of whether that field is nascent, at its peak, or in decline. I genuinely think software technology has not advanced in a long time. Computers got really fast, which allowed us to build a mountain of abstractions on top, and now you can throw together web apps, and there is no shortage of companies running around with that hammer and banging on everything. But the actual boundaries of what is possible with software have not moved in a while.

Most of the software nowadays (in addition to being terribly slow and broken, by the way) feels like stuff that doesn't actually need to exist. It doesn't advance the boundaries of technology or humanity's skill tree in any way. It's a "nice to have" — it might have some social value or make some business process more efficient or whatever — but no one looks at it and goes, wow, we couldn't do that before.

Someone is going to bring up AI and LLMs, and about that, I will just say that ChatGPT was a real accomplishment... but after that, I have not seen any use cases of LLMs that represent a genuine step forward. Like smart contracts, it's a solution in desperate search of a problem, a technology in search of a breakout app. You feel like there's just no way it doesn't have wide reaching implications, and everyone around you (especially VCs) talks about the "possibilities unlocked" — but two years in and we haven't seen a single actually interesting use case. (Again, of course they're being incorporated to optimize some business process or whatever. But who cares?)

Meanwhile, as far as the day to day reality of software engineering goes, if you work professionally as an engineer, your job is more or less wrangling microservices in one form or another. If you're at a big company this is probably literal, but even if you're at a startup the bigco transplants start doing microservice stuff basically on arrival, and a bunch of the "best practices" and frameworks and tools are meant to mimic microservices type thinking (making things more complicated rather than less).

As a final note, I grew up being a giant fan of YC and PG's essays, so I will make a comment about the latest batch of YC startups. It seems to me that a bunch of them basically exist to make some process more efficient. But then you wonder, why does that process need to exist in the first place? Like there was one startup helping you to optimize your AWS costs or something, but why is software written in such a way to begin with that you have all these microservices and towers of abstractions and untrackable infrastructure? Like you look at these startups solving a problem of solving a problem... and then you get to the root problem, and the whole thing is just a giant castle in the sky. I think this describes much of the motivating thrust of software today.


Theorem provers are a rapidly developing software innovation that is invading a field that has existed since Euclid. I would consider Bitcoin an advancement of boundaries as well. Possibly even Kubernetes, VR, etc.

These boundary-pushing technologies appear as a "decline", as you put it, because most of the software world needs to worry about paying bills with CRUD apps instead of pushing technological boundaries. I would say this project-managementization of math, as Terence Tao put it, would push the boundaries of math to the point of a sort of revolution, but the sad truth is that it will lose its romantic qualities.

I'm sure there are plenty of mathematicians that enjoy the Euclid/Andrew Wiles way of things and will come to similar conclusions as you. It will be easy to view it as either a revolution or decline depending on whether you prefer advancing the field as a universal truth or you prefer the practice as it is.


Here's a potential analogy: AlphaGo made a major breakthrough when it beat Lee Sedol. For many of us, that was the end of it, but for professional Go players, it revolutionized the game in ways that no one outside of the Go community can really appreciate. Likewise for AlphaFold and protein structure determination, where it has revolutionized a field of research, and at this point you only see this directly if you're using the tool day-to-day (in which case it may have dramatically changed the nature of your research).

For many software products I think you have a point, where superficial things like word processors are mostly just getting bloated these days. But for open-ended things like Mathematics, having new tools can crack open deep problems that were unapproachable in the old way. This new way of doing research may not turn out to be for everyone, but at the end of the day the field will explore previously unapproachable but undeniably interesting territory.

It might also make mathematics more accessible, and a possible outcome of that might be that more superficial mathematics gets done. But if it also enables more deep mathematics, then there's no point being cynical about the former. Maybe there is a similar fallacy with respect to software development.


> software engineering is now a giant assembly line and the productivity (or skill required) per individual has dropped to near zero

If you want a more engaging job you can take a paycut and work at a startup where the code is an even bigger mess written by people who are just as lazy and disorganized but in another (more "prestigious") programming language.

Joking aside, when inevitably someone needs to fix a critical bug you'll see the skills come out. When products/services become stable and profitable it doesn't necessarily mean the original devs have left the company or that someone can't rise to the occasion.


I think that is unfair. The productivity per person in old projects like RollerCoaster Tycoon is impressive. But the overall result pales in comparison with modern games. Saying the field has stopped producing anything of real value is very wrong.

To me it's like comparing a cathedral made by 100 people over 100 years to a shack made by one person in one month. Like, it stands, I suppose. It gives him shelter and lets him live. But it's no cathedral.


I probably should've defined "real value" more carefully. My framework here is basically the idea of technology as humanity's "skill tree," the total set of stuff we're able to do. Back when RCT (and Doom before that) were written, we genuinely weren't sure if such things could be created, and their creation represented a novel step forward. The vibe in software feels like we're long past the stage of doing things like that, like we're not even attempting to do things like that. I can only think of two things in recent history: whatever strand of AI culminated in ChatGPT, which is genuinely impressive, and Bitcoin, which ostensibly was created by one guy.

I think there have been advancements in RL that deserve to be on this list. AlphaFold at the very least; I hope you had the pleasure of reading the reactions of researchers who work on protein structure determination and how stunned they were that their field of research had been in large part solved.

Here's a few more that come to mind:

- Work decoding brain signals that allowed paralyzed people to communicate via a computer (this all happened pre Neuralink)

- I think self-driving cars fit your criteria, even if the jury is still out as to whether this has/will soon be 'solved.'

- In gaming, Unreal Engine's Nanite is very impressive and a real technological innovation

- There are efforts to get LLMs to learn via RL and not just from training on huge text corpuses. Actually mathematics is a prime example where you might dream of getting superhuman performance this way. People are at least trying to do this.

Just because there's a lot of BS which tends to drown everything else out, doesn't mean there isn't real progress too.


I suppose with a blanket statement like "nothing is happening" I'm practically begging to be contradicted :)

Some of this stuff looks very cool. Have we started getting RL to work generally? I remember my last impression being that we were in the "promising idea but struggling to get it working in practice" phase but that was some time ago now.


Edit: after writing what follows, I realized you might have been asking about RL applied to LLMs. I don't know if anyone has made any progress there yet.

Depends on what you mean by 'generally?' It won't be able to solve all kinds of problems, since e.g. you need problems with well-defined cost functions like win probability for a board game. But AlphaGo and AlphaZero have outstripped the best Go players, whereas before this happened people didn't expect computers to surpass human play any time soon.

For AlphaFold, it has literally revolutionized the field of structural biology. Deepmind has produced a catalog of the structure of nearly every known protein (see https://www.theverge.com/2022/7/28/23280743/deepmind-alphafo..., and also https://www.nature.com/articles/d41586-020-03348-4 if you can get past the paywall). It's probably the biggest thing to happen to the field since cryoelectron microscopy and maybe since X-ray crystallography. It may be a long time before we see commercially available products from this breakthrough, but the novel pharmaceuticals are coming.

Those are the biggest success stories in RL that I can think of, where they are not just theoretical but have led to tangible outcomes. I'm sure there are many more, as well as examples where reinforcement learning still struggles. Mathematics is still in the latter category, which is probably why Terence Tao doesn't mention it in this post. But I think these count as expanding the set of things humans can do, or could do if they ever work :)


It seems you've taken a roundabout way to arrive at innovation.

It's interesting because this is something that makes programming feel like a "bullshit job" to some: I'm not creating anything, I'm just assembling libraries/tools/framework. It certainly doesn't feel as rewarding, though the economic consensus (based on the salaries we get) is that value most definitely is being generated. But you could say the same of lumberjacks, potters, even the person doing QA on an assembly line, all in all it's not very innovative.

That's the thing with innovation though, once you've done it, it's done. We don't need to write assembly to make games, thanks to the sequential innovation of game engines (and everything they are built on).

Same with LLMs and Bitcoin: now that they exist, we can (and do) build on them. But that'll never make a whole new paradigm, at least not rapidly.

I think our economic system simply doesn't give the vast majority of people the chance to innovate. All the examples you've given (and others I can think of) represented big risks (usually in time invested) for big rewards. Give people UBI (or decent social safety nets) and you'll find that many more people can afford to innovate, some of which will do so in CS.

I have to go back to work now, some libraries need stitching together.


I have been thinking about something along these lines as well. With the professionalization/industrialization of software you also get extreme modularization/specialization. I find it really hard to follow the web-dev part of programming, even if I just focus on, say, Python, as there are so many frameworks and technologies that serve each little cog in the wheel.

It's interesting; I have the opposite feeling. Twenty years ago I assumed that as the field of computer science grew, people would specialize more and more. Instead, I see pressure first to become a "full stack developer", then a "dev ops", and now, with the cloud, everyone is also an amateur network administrator.

Meanwhile, everything is getting more complex. For example, consider logging: a few decades ago you simply wrote messages to standard output and maybe also to a file. Now we have libraries for logging, libraries to manage the libraries for logging, domain-specific languages that describe the structure of the log files, log rotation, etc. And logging is just one of many topics the developer needs to master; there is also database modeling, library management, automated testing, etc. And when you learn how to use some framework like a pro, it gets thrown away and something else becomes fashionable instead.

And instead of being given a clear description of a task and left alone to do it, what we have is endless "agile" meetings, unclear specifications that change twice before you complete the task anyway, and a Jira full of tickets with priorities where most tickets are the highest priority anyway.

So, the specialization of the frameworks, yes there is a lot of it. But with specialization of the developers, it seems the opposite is the trend. (At least in my bubble.)


I'm inclined to agree. People seem to be getting hung up on 'no, _my_ React app is hard and challenging for me!' which completely misses the point. I think it's a combination of a few things:

- skill is generally log-normally distributed, so there are few exceptional people anyway (see also Sturgeon's law: 99% of everything is crap)

- smaller/earlier fields may self-select for extreme talent as the support infrastructure just isn't there for most to make it, leading to an unrealistically high density of top talent

- it's a fundamentally different thing to make a substantial impact than to cookie-cutter a todo list app by wrangling the latest 47 frameworks. in that, it's highly unlikely that the typical engineering career trajectory will somehow hit an inflection point and change gears

- presumably, from a cost/benefit perspective there's a strong local maximum in spewing out tons of code cheaply for some gain, vs. a much more substantial investment in big things where gains are harder to realize (further out, less likely, etc). the more the field develops, the lower the skill floor gets to hit that local maximum. which therefore increases the gap between the typical talent and the top talent

- there's not a lot of focus on training top talent. it's not really a direct priority of most organizations/institutions by volume. most coders are neither aiming for nor pushed towards that goal

All told, I think there's a bunch of systemic effects at play. It seems unlikely to expect the density of talent to just get better for no reason, and it's easier to explain why, left unchecked, the average talent would degrade as a field grows. Perhaps that's not even an issue and we don't need 'so many cooks in the kitchen', if the top X% can push the foundations enough for everyone else to make apps with. But the general distribution should be worth paying attention to, because if we get it wrong it's probably costly to fix.


> _my_ React app is hard and challenging for me

I love the idea of "essential vs artificial complexity". React, claiming to solve the essential complexity of state management, among other things, creates a bunch of artificial complexities. And software engineers are no better as a result: they become masters of frameworks, but not masters of their art. That is why the comparison to an assembly line is apt. Doing assembly line work is still hard, but it is not software "engineering" in a certain definition of the word.


Yeah! I think of that 'intrinsic difficulty' in sort of information theory terms. In that, given the context of where we are and where we want to be, what is the minimum amount of work needed to get there. If that minimum work is a large amount, the problem is 'hard'. This also accounts for the change in difficulty over time as our 'where we are' gets better.

Line work is certainly still hard from a toil standpoint. But academically, little of that total effort is intrinsically necessary, so the work is in some sense highly inefficient. There may still be aspects that need human intervention of course, but the less that is the more robotic the task.

In theory, anything that's already figured out is a candidate to be offloaded to the robots, and in some sense isn't necessarily a good use of human time. (There may be other socioeconomic reasons to value humans doing these tasks, but, it's not absolutely required that a human do it over a machine)


Skill required near zero? I'm approaching 15 years as a developer and React is still completely foreign to me. I'd need days full-time just to learn one version of it, not to mention every prerequisite skill.

And what do you mean by "real" value? Carrots have value because we can eat them, React apps have value because they communicate information (or do other things app do).

Or do you mean artistic value? We can agree, if you like, that SPAs are not as much "art" as RCT. But... who cares?


Mathematics, or more generally academia, is no panacea either. At least in engineering and applied math, there is a different assembly line of "research" and publications going on. We live in the era of enshittification of everything.

In both software and math, there are still very good people. But they are too few and far between. And it is hard to separate the good from the bad.


Computer-checked proofs are one area where AI could be really useful fairly soon, though it may be closer to neural networks in chess engines than full LLMs.

You already use tons of solvers because proving everything by hand is mind-numbingly tedious and time-consuming, but tactics/solvers really struggle if you throw too many theorems and lemmas at them. Neural nets as search machines to pattern-match relevant lemmas fit perfectly. Similarly, induction and higher-order unification are full-on code synthesis, where brute-force iteration over all possible syntax trees is really inefficient.

And solvers do a backtracking solve anyway, so if the AI returns 95% irrelevant lemmas it doesn't matter. It would still be a dramatic improvement over manually searching for them.
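
A rough sketch of that retrieval idea (my own illustration; embed() is a placeholder where a trained encoder would go, and the lemma names are made up): embed the goal and the whole lemma library, hand the solver only the top-k nearest neighbours, and let backtracking absorb the false positives.

  import numpy as np

  def embed(statement: str) -> np.ndarray:
      # placeholder: a real system would use a trained neural encoder here
      rng = np.random.default_rng(abs(hash(statement)) % (2 ** 32))
      v = rng.normal(size=128)
      return v / np.linalg.norm(v)

  def select_premises(goal: str, library: list[str], k: int = 50) -> list[str]:
      # rank lemmas by similarity to the goal; irrelevant hits are cheap,
      # because the solver backtracks past them anyway
      g = embed(goal)
      ranked = sorted(library, key=lambda lemma: float(embed(lemma) @ g), reverse=True)
      return ranked[:k]

  # hypothetical usage: pass the shortlist to the tactic/solver instead of
  # the full library of thousands of lemmas
  shortlist = select_premises("n + 0 = n", ["Nat.add_zero", "Nat.mul_comm", "List.length_append"])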

Though I'm not convinced that computer-checked proofs are necessarily good for communication. There are reasons besides page count that proofs for human consumption are high-level and gloss over details.


Not a mathematician, but knowing how proof checkers work, I don't expect them to become practical without some significant future advancement; they certainly aren't right now.

The main problem is that in order to work they have to be connected to the formal definition of the mathematical objects you are using in your proof. But nobody thinks that way when writing (or reading!) a proof; you usually have a high-level, informal idea of what you are "morally" doing, and then you fill in formal details as needed.

Computer code (kinda) works because the formal semantics of the language is much closer to your mental model, so much so that in many cases it's possible to give an operational semantics; but generally speaking, math is different in its objectives.


You're commenting on a thread about an interview where an actual Fields medalist explains how proof assistants are actually used to do meaningful mathematical work, today. The issues you raise are all discussed in the interview.

practical = practical for routine mathematical work, which they aren't right now, which Tao says in the article itself.

Never have I ever said or implied, here or anywhere else, that proof checkers are useless and bad and stupid.


> practical = practical for routine mathematical work, which they aren't right now, which Tao says in the article itself.

I see, I guess it would have been nice to be clearer from the start.

> Never have I ever said or implied, here or anywhere else, that proof checkers are useless and bad and stupid.

Did I say you did?


They = LLMs.

As I’ve interpreted it, they mean that LLMs and similar tech isn’t particularly helpful to theorem provers at the moment.


You yourself identified some problems that a modern AI guru would salivate over: formal specifications from natural language, "filling in" formal details, etc.

I'll add another one: NN-based speedups to constraint solvers, which usually use something like branch-and-bound. This is (I've heard) a very active area of research.


The whole article is about how Terence Tao predicts this will become practical without a significant breakthrough.

>> I don’t think mathematics will become solved. If there was another major breakthrough in AI, it’s possible, but I would say that in three years you will see notable progress, and it will become more and more manageable to actually use AI.

He thinks you need another breakthrough to solve mathematics, i.e., to make mathematicians obsolete. But even if a major breakthrough doesn't happen, AI will be useful for mathematicians in 3 years.

Of course you can still disagree with Terence Tao. His being one of the best mathematicians doesn't make him an oracle.

> But nobody thinks that way when writing (or reading!) a proof

He even very specifically addressed this issue with:

>> Let’s say an AI supplies an incomprehensible, ugly proof. Then you can work with it, and you can analyze it. Suppose this proof uses 10 hypotheses to get one conclusion—if I delete one hypothesis, does the proof still work?

He clearly believes an ugly, incomprehensible proof is much better than no proof at all.


Actually, it is possible to do a high-level outline of a proof development and later fill in the details. This is done by starting out with axioms. I.e., don't define the basic types, but state that they exist as axioms, and also write down their basic properties as axioms. Then later, when you have some development going on, fill in the blanks by turning the axioms into definitions.
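
A tiny Lean 4 sketch of that workflow (my own illustration, with made-up names): postulate the basic type and its properties as axioms, build on them, and later swap the axioms for real definitions without touching the downstream proofs.

  -- Step 1: postulate the interface instead of defining it.
  axiom Point : Type
  axiom dist : Point → Point → Nat
  axiom dist_self : ∀ p : Point, dist p p = 0

  -- Step 2: downstream development already works against the axioms.
  theorem sum_of_self_dists (p q : Point) : dist p p + dist q q = 0 := by
    rw [dist_self p, dist_self q]

  -- Step 3 (later): replace the axioms with concrete definitions, e.g.
  -- `def Point := Int × Int` plus an actual `dist`, reprove `dist_self`
  -- as a lemma, and `sum_of_self_dists` goes through unchanged.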

Automated theorem proving is a concept that has been around for a long time already.

Isabelle's sledgehammer strategy combines automatic theorem provers (E, Z3, SPASS, Vampire, ...) and attempts to prove/refute a goal. In theory this seems fine, but in practice you end up with a reconstructed proof that applies 12 seemingly arbitrary lemmas. This sort of proof is unreadable and very fragile. I don't see how AI will magically solve this issue.


What is a "fragile" proof?

It no longer works when some definition or other theorem changes slightly. Most of these proof assistants provide ways to write proofs that can slightly adjust and "plug together" the other theorems in the "obvious" way.
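
To make the contrast concrete, a minimal Lean 4 illustration (my own sketch, assuming a recent toolchain where the omega decision procedure is available):

  def double (n : Nat) : Nat := n + n

  -- Fragile: this proof is literally "the definition unfolds this way";
  -- change `double` to `2 * n` and it breaks.
  theorem double_unfold (n : Nat) : double n = n + n := rfl

  -- More robust: unfold whatever the current definition is and hand the
  -- "obvious" arithmetic to a decision procedure; this survives the same edit.
  theorem double_add (a b : Nat) : double a + double b = double (a + b) := by
    unfold double
    omega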

So it's a brittle proof

> Computer-checked proofs are one area where AI could be really useful fairly soon, though it may be closer to neural networks in chess engines than full LLMs.

Yes, if significant progress is made in sample-efficiency. The current neural networks are extremely sample-inefficient. And formal math datasets are much smaller than those of, say, Python code.


Terence Tao also gave a great talk on this topic a few months ago, where he goes into more detail about uses of Lean: https://www.youtube.com/watch?v=AayZuuDDKP0

A working mathematician recently got GPT-4o to prove an (afaik) novel lemma while pursuing their research:

"My partner, who is mathematician, used ChatGPT last week for the first time to prove some lemmas for his research. He already suspected those lemmas were true and had some vague idea how to approach them, but he wasn't expert in this type of statement. This is the first time that he got correct and useful proofs out of the models.

[...]

These are the two lemmas. The first one is something that a collaborator of my partner noticed for small values of e in their computations. ChatGPT did not find the proof before being told to try Mobius functions.

https://chatgpt.com/share/9ee33e31-7cec-4847-92e4-eebb48d4ff...

The second looks a bit more standard to me, probably something that Mathematica would have been able to do as well. But perhaps that's just because I am more familiar with such formulas. Still, Mathematica doesn't give a nice derivation, so this is useful.

https://chatgpt.com/share/7335f11d-f7c0-4093-a761-1090a21579...

"


In a recent interview, Ben Goertzel mentioned having a PhD-level discussion with GPT-4 about novel math ideas cooked up by himself and a colleague. These things are becoming scary good at reasoning and heavyweight topics. As the ML field continues focusing on adding more System Two capabilities to complement LLMs' largely System One thinking, things will get wild.

This is my experience as well- even the popular hype underestimates how creative and intelligent LLMs can be. Most discussions are reasoning about what it should or shouldn’t be capable of based on how we think they work, without really looking directly at what it can actually do, and not focusing on what it can’t do. They are very alien- much worse than humans at many things, but also much better at others. And not in the areas we would expect, in both cases. Ironically they are especially good at creative and abstract thinking and very bad at keeping track of details and basic computation, nearly the opposite of what we’d expect from computers.

I am pessimistic about AI. The technology is definitely useful, but I watched a video that made good points.

  * On benchmark style tests, LLMs are not improving. GPT-4o is equivalent to GPT-4 and both barely improve on GPT-3.5T.
  * AI companies have been caught outright lying in demos and manipulating outcomes by pre-training for certain scenarios.
  * The main features added to GPT-4o have been features that manipulate humans with dark patterns
  * The dark patterns include emotional affects in tone and cute/quirky responses
  * These dark patterns encourage people to think of LLMs as humans that have a similar processing of the universe.
I seriously wonder about the transcript that these guys had with the LLM. Were they suggesting things? Did ChatGPT just reconfigure words that helped them think through the problem? I think the truth is that ChatGPT is a very effective rubber duck debugger.

> https://www.youtube.com/watch?v=VctsqOo8wsc


I don't think it makes sense to base your opinion on videos when you could simply use the product yourself and get first-hand experience.

I've come full circle on it. First I thought it was totally amazing, then I pushed it hard and found it lacking, and then I started using it more casually and now I use a little bit every day. I don't find it completely brilliant but it knows an above-average amount about everything. And it can make short work of very tedious tasks.

I just type to ChatGPT so there are no "emotional effects" involved.


Try playing around with getting GPT-4 to discuss creative solutions to real unsolved problems you are personally an expert on, or created yourself. That video looks like just standard internet ragebait to me.

I find it pretty annoying when people say you are just being manipulated by hype if you are impressed by LLMs, when I was a serious skeptic, thought GPT-3 was useless, and only changed my opinion by directly experimenting with GPT-4 on my own- by getting it to discuss and solve problems in my area of expertise as an academic researcher.


It's the new eternal summer, welcome to the club :) GPT-3 was translating 2000 lines of code across 5 languages and enabling me to ship at scale.

I don’t know Jack about C# and .NET and I’ve used ChatGPT to write several nontrivial programs.

> On benchmark style tests, LLMs are not improving. GPT-4o is equivalent to GPT-4 and both barely improve on GPT-3.5T.

The understatement of this year.

GPT-4o is 6x cheaper than GPT-4. So if it's actually equivalent to GPT-4, that's a great improvement.

In fact, calling it "great" is still a crazy understatement. It's a cutting-edge technology becoming 6x cheaper in 1.5 years. I'm quite sure that has never happened before in human history.


A lot of the video's points are plain wrong, both in numbers and experientially. I don't know how they feel comfortable claiming benchmark numbers didn't improve much from 3.5 to 4.

Going off the GP's points alone, they said GPT-3.5T vs. GPT-4, where T I assume stands for "Turbo". If that's the case, then the video must have its timelines wrong - GPT-3.5 Turbo came out with GPT-4 Turbo, which was some time after GPT-4, and OpenAI put a lot of work in making GPT-3.5 Turbo work much better than 3.5. I can sort of buy that someone could find the difference between 3.5 Turbo and 4 to be not that big, but they're making the wrong comparison. The real jump was from 3.5 to 4; 3.5 Turbo came out much later.

I feel that's because most people criticizing AI in that way don't know enough about math or other really abstract and conceptual domains to use it for PhD level work. And they don't know how to get GPT excited and having fun. There is an additional level of intelligence available to anyone who is passionate enough to get the AI to be exuberant about a topic.

It's almost like a kindergartner goes up to Einstein and finds out that Einstein's just really bad at schoolhouse rhymes - because he finds them boring. So the kid, adorably, concludes that Einstein is less smart than a kindergartner.

But if you need to talk to Einstein about partial differential equations and why, morally, we consider distributional solutions to have physical reality, and if you bring enough creativity and engagement to make the conversation fun and exciting for both you and him, then he'll really help you out!

Simple point: Don't trust the opinion of anyone who can't get GPT to "have fun". Don't trust the opinions of anyone who can't get GPT helpfully explaining and discussing very complicated things with them. There's a very big chance those people have poor opinions about GPT simply because they don't know prompting, and don't show enjoyment, camaraderie, and respect. Because they don't understand the elements of AI conversation, and because they don't respect the AI as an emerging fragment of pre-sentient personhood, they can't hold long and useful conversations.


Great points… people expect it to behave like an oracle - they want it to figure out what they meant and give a correct factual answer every time, and when it fails they write it off. But that isn't really what it is - it is more accurately a "situation simulator" based on predictive models. So you only get a useful response when you engineer the entire discourse so that a useful response is the appropriate one. Make sure it "knows" to respond as an enthusiastic and well-respected expert, not as an obnoxious forum troll.

I'm pretty sure that 2nd proof exists in some book or ebook somewhere. generatingfunctionology was the one that came to mind.

Impressive recall, but not novel reasoning


I’m pretty sure that LLMs can generalize a deep model and can create novel assemblages and are not just search engines.

The burden of proof would be on you to find the existing mathematical, erhm, proof.


Sorry, what is this quoted from?

It's interesting because that first one is something I was discussing recently with my friend Beren (does he have an account here? idk). We were thinking of it as summing over ordered partitions, rather than over partitions with a factor of |tau|!, but obviously those are the same thing.

(If you do it over cyclically ordered partitions -- so, a factor of (|tau|-1)! rather than |tau|! -- you get 0 instead of (-1)^e.)
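
(For readers who can't open the ChatGPT links above: the first lemma, as it is spelled out explicitly in a later comment, is the identity below, where the sum runs over all partitions tau of an e-element set and |tau| is the number of parts.)

  \sum_{\tau} (-1)^{|\tau|} \, |\tau|! \;=\; (-1)^{e}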

Here's Beren's combinatorial proof:

First let's do the second one I said, about cyclically ordered ones, because it's easier. Choose one element of the original set to be special. Now we can set up a bijection between even-length and odd-length cyclically ordered partitions as follows: If the special element is on its own, merge it with the part after it. If it's not on its own, split it out into its own part before the part it's in. So if we sum with a sign factor, we get 0.

OK, so what about the linearly ordered case? We'll use a similar bijection, except it won't quite be a bijection this time. Pick an order on your set. Apply the above bijection with the last element as the special element. Except some things don't get matched, namely, partitions that have the last element of your set on its own as the last part.

So in that case, take the second-to-last element, and apply recursively, etc, etc. Ultimately everything gets matched except for the partition where everything is separate and all the parts are in order. This partition has length equal to the size of the original set. So if the original set was even you get one more even one, and if the original set was odd you get one more odd one. Which, rephrased as a sum with a sign factor, yields the original statement.

Would be interested to send this to the mathematician you're referring to, but I have no idea who that might be since you didn't say. :)


this was from the current ACX Open Thread

Thanks! Went and posted over there.

Woah that second proof is really slick. Is there a combinatorial proof instead of algebraic proof as to why that holds?

Yes.

(HN comments aren't the best medium for writing mathematics. I hope this ends up reasonably readable.)

(d choose k) is the number of ways to pick k things from d.

(d choose k) k^2 is the number of ways to pick k things from d, and then colour one of them red and one of them (possibly the same one) blue.

(-1)^k (d choose k) k^2 is that (when k is even), or minus that (when k is odd).

So the identity says: Suppose we consider a set of d things, and we consider picking some of them (meaning 0 or more, though actually 0 is impossible; k is the number we pick) and colouring one of those red and one blue; then there are the same number of ways to do this picking an odd number of things as picking an even number.

Well, let's describe the process a bit differently. We take our set of d things. We colour one of them red. We colour one of them blue. Then we pick some set of the things (k in number), including the red thing and the blue thing (which may or may not also be the red thing). And the claim is that there are the same number of ways to do this with an even value of k as with an odd number.

But now this is almost trivial!

Once we've picked our red and blue things, we just have to decide which of the other things to pick. And as long as there's at least one of those, there are as many ways to pick an odd number of things as to pick an even number. (Fix one of them; we can either include it or not, and switching that changes the parity.)

So if d>2 then we're done -- for each choice of red and blue, there are the same number of "odd" and "even" ways to finish the job.

What if d<=2? Well, then the "theorem" isn't actually true.

d=1: (1 choose 0) 0^2 - (1 choose 1) 1^2 = -1.

d=2: (2 choose 0) 0^2 - (2 choose 1) 1^2 + (2 choose 2) 2^2 = 2.
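
(And a quick check of the smallest case where the claim does hold, d=3:)

d=3: (3 choose 0) 0^2 - (3 choose 1) 1^2 + (3 choose 2) 2^2 - (3 choose 3) 3^2 = 0 - 3 + 12 - 9 = 0.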


>We take our set of d things. We colour one of them red. We colour one of them blue. Then we pick some set of the things (k in number), including the red thing and the blue thing

In this alternate view of the process, where does the k^2 term come from? Isn't the number of ways to do it in this perspective just

(d-1 choose k-1) # We already chose the red one (which is also the blue) and need to choose k-1 more

+

(d-2 choose k-2) # We chose both red and blue ones and need to choose the other k-2

Edit: I think it actually works out, seems to be equal to

d * 1 * Binomial[d - 1, k - 1] + d * (d - 1) * Binomial[d - 2, k - 2] = Binomial[d, k]*k^2

And then you apply your argument about fixing one of the elements in each of the two cases. Actually the proof seems to be a variant (same core argument) as sum of even binomial terms = sum of odd binomial terms [1]

[1] https://math.stackexchange.com/questions/313832/a-combinator...


In the alternate view, the k^2 isn't there any more. Not because we are computing a different number, but because we're organizing the computation differently.

(It's turned into a d^2 factor, except that we then need to treat the red=blue case differently from the red!=blue case, so it's really a d factor plus a d^2-d factor.)

And yes, the last bit of the proof is exactly the same thing as one way to prove that the sum of "even" and "odd" binomial coefficients are equal -- provided n>0. (For the (0 choose whatever) case, the "even" sum is 1 and the "odd" sum is 0.)


ChatGPT just spit out nonsense in the first example. Look at the sum in step 6, over τ ≤ τ. Is τ the bound variable of the summation or a free variable? Put another way, how many terms are in this summation over "τ ≤ τ"? And how does this relate to establishing either the LHS or RHS of part 3, to then conclude the other side? Nothing actually coheres. What happened here is that ChatGPT remembered the Möbius function for the partition lattice, regurgitated it* without providing proof, and then spit out some nonsense afterwards that looks superficially reasonable.

But establishing that Möbius function is essentially the whole ballgame! The question being asked is very nearly the same as just being asked to prove that the Möbius function has that form.

[*: The regurgitation also has a slight error, as μ(τ, σ) with τ a refinement of σ is actually (-1)^(|τ| - |σ|) * the product of (|τ restricted to c| - 1)! over each equivalence class c in σ. The formula given by ChatGPT matches this one when |σ| = 1, which is the important case for the proof at hand, but is incorrect more generally.]

----

For what it's worth, the desired fact can be shown just by elementary induction, without explicitly invoking all the Möbius function machinery anyway:

Let N(t, e) be the number of partitions of {1, ..., e} into t many equivalence classes [we will always take e to be a non-negative integer, but allow for t to be an arbitrary integer, with N(t, e) = 0 when t is negative]. The sum in question is of (-1)^t * t! * N(t, e) over all t. Refer to this sum as Sum(e). We wish to show that Sum(e) comes out to (-1)^e. Since Sum(0) = 1 trivially, what we need to show is that Sum(e + 1) = -Sum(e).

There are two kinds of partitions on the set {1, ..., e + 1}: Those which put e + 1 in an equivalence class all on its own and those which put it in the same equivalence class as some value or values in {1, ..., e}. From this, we obtain the recurrence N(t, e + 1) = N(t - 1, e) + t * N(t, e).

Accordingly, Sum(e + 1) = Sum of (-1)^t * t! * N(t, e + 1) over all t = Sum of (-1)^t * t! * N(t - 1, e) over all t, plus sum of (-1)^t * t! * t * N(t, e) over all t. The first of these two sub-sums can be reparametrized as the sum of (-1)^(t + 1) * (t + 1)! * N(t, e) over all t. Now recombining the two sub-sums term-wise, and keeping in mind (-1)^(t + 1) * (t + 1)! = (-t - 1) * (-1)^t * t!, we get Sum(e + 1) = sum of (-1)^t * t! * (-t - 1 + t) * N(t, e) over all t. As (-t - 1 + t) = -1, this comes to -Sum(e) as desired, completing the proof.
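
(For the skeptical reader, here is a quick numerical sanity check of the identity: a minimal Python sketch of my own, using the recurrence above.)

  from math import factorial

  def N(t, e):
      # number of partitions of an e-element set into t nonempty classes,
      # via the recurrence N(t, e + 1) = N(t - 1, e) + t * N(t, e) from above
      if e == 0:
          return 1 if t == 0 else 0
      if t <= 0:
          return 0
      return N(t - 1, e - 1) + t * N(t, e - 1)

  for e in range(8):
      assert sum((-1) ** t * factorial(t) * N(t, e) for t in range(e + 1)) == (-1) ** e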


The second ChatGPT example proof is mostly fine, except it failed to see that the claimed theorem isn't actually true at d = 2. (It writes "d ≥ 2" for a step in its reasoning where d ≥ 3 is actually required.)

3 points stand out to me in this interview:

- Project manager mathematicians: Tao draws a future where mathematical insights are "produced" like anything else in our society. He attributes the lack of this productionalization of mathematics to a lack of tools in this area, but AI and proof assistants might be revolutionary in this regard (proof assistants already are). Human interaction and guidance are still needed.

- Implicit Knowledge: he points out that so much knowledge is not in papers, e.g. intuition and knowledge of failures. This is crucial and makes it necessary even for top mathematicians to talk to one another to not make mistakes again.

- Formalization of mathematics: one would think that mathematics is pretty formalized already (and it is), but in papers a lot of common knowledge is taken for granted, so having proofs formalized in such a way that proof assistants can understand them will help more people actually understand what is going on.

I think this just shows how Tao always searches for new ways to do mathematical research, which I first came across in his talk about one of the polymath projects.


For now this might be a project for Fields medalists, like Tao and Scholze, who have the leisure to spend 10x the time on a proof. I recently talked to a post-doc in a top-ranked math department, and he doesn't know anyone who actually uses lean4 in their work. For early-career people, it's probably better to trust your intuition and get the papers out.

I'm a math professor and I know many people who use or contribute to lean. Not all of them are Fields medalists. Maybe it takes more than a couple of people's opinion on the subject to have a good picture.

> people who use or contribute to lean

And do they get tenure- or funding-able output of it, or is it more of a side-hobby?


Yes, they get grants to work on it. Other questions? I can't say I enjoy this "gotcha" tone.

Snark and contrarianism are specialties around here. Thank you for sharing your experience; I genuinely appreciate it.

The person you are replying to asked a very useful question, and you supplied a useful answer. I don’t detect any snarky tone in their comment.

It sounds quite "gotcha" to me.

Compare these two cases:

> people who use or contribute to lean

And do they get tenure- or funding-able output of it, or is it more of a side-hobby?

> people who use or contribute to lean

Do they get tenure- or funding-able output of it, or is it more of a side-hobby?

The "And" at the begining makes it snarky to me. It's probably just me tho.


I’m with them. They’re using a tool they’re paid to work on. That’s always a special case. The general case would be what percentage of mathematicians use (a) proof assistants at all, (b) mechanized proof assistants, and (c) Lean specifically.

Do most working mathematicians use proof assistants or checkers? Does anyone have a percentage, with feedback on how they use them?


> They’re using a tool they’re paid to work on. That’s always a special case.

FWIW, the industry term for this is "dogfooding", and it's usually seen as a predictor of a high-quality tool.


There’s two issues here: getting paid to work on a product; using the product yourself. People often do one without much of the other. People dogfooding do both.

For the original question, a survey might want to separate these two. The first just says people are working on a product; the why could be any reason, like just making money. People who use it but aren't paid to are more valuable as reviewers.


That's not a “gotcha”, but a question about whether what I experienced in bioinformatics (i.e. that working on tools, especially already existing ones, is a career trap) also applies to maths.

So they get grants to work on these tools rather than to work on math. That seems like a very special situation and a very special math department, and therefore not a good picture of how helpful this is to math departments in general.

The mathematicians I know of who work on Lean4 as part of grant-funded research do so because their research funding is for formalising mathematics and Lean4 is the tool they choose to use for that.

I really have no idea why people are taking this approach here.

Proof assistants are useful. People use them to do actual maths. There is a lot of work involved in formalising all of maths (this isn't a proof-assistant problem; it's a huge undertaking, given that maths is 3,000+ years old and we've only just decided what formality really fundamentally consists of), so in some places you need to do more foundational work than in others to make it useful.


I guess we got in the top 10 worldwide for math in the Shanghai ranking by accident. If you don't know what you're speaking about, sometimes it's wise to not say anything.

Isn’t working on these tools a form of working on math? Math is a language unto itself.

"No true working mathematician would spend their time using Lean!"

I'm sure this also depends a lot on the field within mathematics. Areas like mathematical logic or algebra have a pretty formal style anyway, so it's comparatively less effort to translate proofs to lean. I would expect more people from these areas to use proof checkers than say from low-dimensional topology, which has a more "intuitive" style. Of course by "intuitive" I don't mean it's any less rigorous. But a picture proof of some homotopy equivalence is just much harder to translate to something lean can understand than some sequence of inequalities. To add a data point, I'm in geometry/topology and I've never seen or heard of anyone who uses this stuff (so far).

If you are curious, I encourage you to look up The Sphere Eversion Project [1].

It was a project in which we formalised a version of Gromov's (open, ample) h-principle for first-order differential relations. Part of the reason it was carried out was to demonstrate what is involved in formalising something in differential topology.

1. https://leanprover-community.github.io/sphere-eversion/


Pictures can be directly used in formal proofs; see [1] for instance. So the problem is not one of principle; rather, pictures and more direct syntax have not yet been integrated into current proof systems. When that is done, proofs will be much more natural and the formal proof will be closer to the 'proof in the mind'.

[1] https://www.unco.edu/nhs/mathematical-sciences/faculty/mille...


For Tao, spending 10x the time on a proof still means spending 5x less time than an average postdoc would. He is incredibly fast and productive.

lean4 - programming language and theorem prover

https://lean-lang.org/

https://github.com/leanprover/lean4


> Let’s say an AI supplies an incomprehensible, ugly proof. Then you can work with it, and you can analyze it. Suppose this proof uses 10 hypotheses to get one conclusion—if I delete one hypothesis, does the proof still work? That’s a science that doesn’t really exist yet because we don’t have so many AI-generated proofs, but I think there will be a new type of mathematician that will take AI-generated mathematics and make it more comprehensible.

My thoughts exactly about designing public APIs for code. A role traditionally performed only by experienced developers can now be greatly simplified and made accessible to everyone.


> German mathematician and Fields Medalist Peter Scholze collaborated in a Lean project—even though he told me he doesn’t know much about computers.

> With these formalization projects, not everyone needs to be a programmer. Some people can just focus on the mathematical direction; you’re just splitting up a big mathematical task into lots of smaller pieces. And then there are people who specialize in turning those smaller pieces into formal proofs. We don’t need everybody to be a programmer; we just need some people to be programmers. It’s a division of labor.

Who can get paid to be a "mathematical programmer" ?

PhD mathematicians? Grad students?


It took Terence Tao to speak out loud about the ego of many mathematicians who refuse to consider "AI" as a potential assistant.

We may realise we need much bigger neural nets and training facilities, or different models.

And in end, it can be a failure.

You know, R&D.


Tao has an interesting followup on Mastodon: https://mathstodon.xyz/@tao/112617150031897750

"I think in the future, instead of typing up our proofs, we would explain them to some GPT. And the GPT will try to formalize it in Lean as you go along. If everything checks out, the GPT will [essentially] say, “Here’s your paper in LaTeX; here’s your Lean proof. If you like, I can press this button and submit it to a journal for you.” It could be a wonderful assistant in the future."

That'd be nice, but eventually what will happen is that the human will submit a mathematical conjecture, and the computer will internally translate it into something like Lean and try to find either a proof or a disproof. It will then translate the result into natural language and report it to the user.

Unless mathematics is fundamentally more complicated than chess or go, I don't see why that could not happen.


It is fundamentally more complicated. Possible moves can be evaluated against one another in games. We have no idea how to make progress on many math conjectures, e.g. Goldbach or Riemann's. An AI would need to find connections with different fields in mathematics that no human has found before, and this is far beyond what chess or Go AIs are doing.

Not a mathematician, but I can imagine a few different things which make proofs much more difficult if we simply tried to map chess algorithms to theorem proving. In chess, each board position is a node in a game tree and the legal moves represent edges to other nodes in the game tree. We could represent a proof as a path through a tree of legal transformations to some initial state as well.

But the first problem is, the number of legal transformations is actually infinite. (Maybe I am wrong about this.) So it immediately becomes impossible to search the full tree of possibilities.

Ok, so maybe a breadth-first approach won't work. Maybe we can use something like Monte Carlo tree search with move (i.e. math operation) ordering. But unlike chess/go, we can't just use rollouts because the "game" never ends. You can always keep tacking on more operations.

Maybe with a constrained set of transformations and a really good move ordering function it would be possible. Maybe Lean is already doing this.
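To make that framing concrete, here is a rough sketch of a depth-bounded best-first search over an abstract rule interface (everything here, including the interface, the scoring function and the bounds, is a made-up placeholder, and it is nothing like what Lean actually does internally):

  import heapq

  def search_proof(goal, expand, is_proved, score, max_depth=20, max_nodes=100_000):
      """Toy best-first proof search over a hypothetical interface.

      goal:      initial proof state
      expand:    state -> iterable of (rule_name, next_state)  (the "legal moves")
      is_proved: state -> bool  (no open goals remain)
      score:     state -> float, lower = more promising (the "move ordering")
      """
      counter = 0  # tie-breaker so heapq never has to compare states directly
      frontier = [(score(goal), counter, 0, goal, [])]
      visited = 0
      while frontier and visited < max_nodes:
          _, _, depth, state, path = heapq.heappop(frontier)
          visited += 1
          if is_proved(state):
              return path            # the sequence of rule applications is the proof
          if depth >= max_depth:
              continue               # a hard depth bound stands in for "no rollouts"
          for rule_name, nxt in expand(state):
              counter += 1
              heapq.heappush(
                  frontier,
                  (score(nxt), counter, depth + 1, nxt, path + [rule_name]),
              )
      return None                    # bounded search gave up

The interesting part is entirely hidden in score and expand, which is roughly where the hard open problems (and the hoped-for AI contribution) live.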


> But the first problem is, the number of legal transformations is actually infinite.

I am fairly certain the number of legal transformations in mathematics is not infinite. There is a finite number of axioms, and all proven statements are derived from axioms through a finite number of steps.


Technically speaking one of the foundational axioms of ZFC set theory is actually an axiom schema, or an infinite collection of axioms all grouped together. I have no idea how lean or isabelle treat them.

Whether it needs to be a schema of an infinite number of axioms depends on how big the sets can be. In higher order logic the quantifiers can range over more types of objects.

Lean (and Coq, and Agda, etc.) do not use ZFC; they use variants of MLTT/CiC. Even Isabelle does not use ZFC.

Isabelle is generic and supports many object logics, listed in [1]. Isabelle/HOL is most popular, but Isabelle/ZF is also shipped in the distribution bundle for people who prefer set theory (like myself).

[1] https://isabelle.in.tum.de/doc/logics.pdf


Chess and Go are some of the simplest board games there are. Board gamers would put them in the very low "weight" rankings from a rules complexity perspective (compared to "modern" board games)!

Mathematics is infinitely (ha) more complex. Work your way up to understanding (at least partially) a proof of Gödel’s incompleteness theorem and then you will agree! Apologies if you have done that already.

To some extent, mathematics is a bit like drunkards looking for a coin at night under a streetlight because that’s where the light is… there’s a whole lot more out there though (quote from a prof in undergrad)


> Unless mathematics is fundamentally more complicated than chess or go, I don't see why that could not happen.

My usual comparison is Sokoban: there are still lots of levels that humans can beat that all Sokoban AIs cannot, including the AIs that came out of Deepmind and Google. The problem is that the training set is so much smaller, and the success heuristics so much more exacting, that we can't get the same benefits from scale as we do with chess. Math is even harder than that.

(I wonder if there's something innate about "planning problems" that makes them hard for AI to do.)


> Unless mathematics is fundamentally more complicated than chess or go

"Fundamentally" carries the weight of the whole solar system here. Everyone knows mathematics is more conceptually and computationally complicated than chess.

But of course every person has different opinions on what makes two things "fundamentally" different, so this is a tautological statement.


  > But they only publish the successful thing, not the process.
Learning from other people's failures is a really productive use of time (compared to failing on your own) - really wish this were easier to do for most things in life.

Proofs and proof checking are also big in SAT solving (think: MiniSat) and SMT (think: z3, cvc5). They're also coming to model counting (counting the number of solutions). In fact, we did one in Isabelle for an approximate model counter [1], which provides a probabilistically approximate count (a so-called PAC guarantee); that is a lot harder to do, due to the probabilistic nature.

[1] https://github.com/meelgroup/approxmc


A tangential topic, but anyone have recommendations for proof-checking languages? Lean, Isabelle, Coq, Idris are the few I know about. I've only ever used Isabelle, but can't really tell the difference between them. Anyone here used more than one, and can tell why people pick one over the other?

Isabelle is my favorite, and it's the one I've used the most, if only because sledgehammer is super cool, but Coq is neat as well. It has coqhammer (a word I will not say at work), and that works ok but it didn't seem to work as well as sledgehammer did for me.

Coq is much more type-theory focused, which I do think lends itself a bit better to being a regular programming language. Isabelle feels a lot more predicate-logic and set-theory-ish to me, which is useful in its own right but which I don't feel maps quite as cleanly to code (though the Isabelle code exporter has generally worked fine).


They are pretty similar as languages. In general I would recommend Lean because the open source community is very active and helpful and you can easily get help with whatever you're doing on Zulip. (Zulip is a Slack/Discord alternative.) The other ones are technically open source but the developers are far less engaged with the public.

Proof-checking languages are still just quite hard to use, compared to a regular programming language. So you're going to need assistance in order to get things done, and the Lean community is currently the best at that.


I have personally tried Coq and Lean and stuck to Lean as it has much better tooling, documentation and usability while being functionally similar to Coq.

Caveat that I have only used Coq and Lean side by side for about a month before deciding to stick with Lean.


One consideration in choosing a proof assistant is the type of proofs you are doing. For example, mathematicians often lean towards Lean (pun intended), while those in programming languages favour Coq or Isabelle, among others.

There's also https://en.wikipedia.org/wiki/Metamath

Its smallest proof-checker is only 500 lines.


It depends on the kind of proofs you deal with. If you need to write math in general terms, you can't avoid dealing at some level with the equivalent of category theory, and you will want natural access to dependent types. So Coq and Lean. On the other hand, HOL is easy to comprehend and sufficient for almost everything else, and HOL-based provers are designed to work with ATPs. HOL Light in particular is just another OCaml program, and all terms and data structures can be easily manipulated in OCaml.

Another one is Oak: https://oakproof.org/

As Tao says in the interview: Lean is the dominant one.

Lean has the largest existing math library, but you're kind of stuck in a microsoft ecosystem.

If you mean dependence on VS Code, there are emacs and neovim packages for Lean; some of the core devs use those exclusively. Also note that de Moura is no longer at Microsoft and Lean is not an MS product.

I'm not sure what you mean. I've been playing around with Lean and it works fine on a Mac. I am using VSCode, though.

I just spent hours looking at formal methods and had decided to learn Isabelle/HOL, and then this article appears!

I would recommend HOL Light and Lean. HOL Light is no longer being developed. It's very self-contained and the foundation is very simple to comprehend. It's great for learning ATP, not least because Harrison wrote a nice companion book. And OCaml these days is very easy to access. I also developed a way to run it natively (https://github.com/htzh/holnat), which, while slightly less pretty, makes loading and reloading significantly faster.

Thanks.

There's also https://github.com/jrh13/hol-light which I think is still maintained (just learned about it in https://www.youtube.com/watch?v=uvMjgKcZDec from this thread: https://news.ycombinator.com/item?id=40619482)


May I ask why Isabelle over Agda or Lean?

I know nothing; I am a complete beginner when it comes to formal methods. From reading about it, it seems that Isabelle/HOL is the best when it comes to automation, which apparently is something you really want. It might be easier to learn (controversial; some say Lean is easier). It's been used to prove some software (including seL4 and a version of Java), and Apple and AWS are using it (though I know AWS uses, or used, TLA+).

At the end of the day, I didn't want to spend more time reading about them than learning two of them (trying one and potentially switching later). The more you read about it, the more options open up (SPARK, TLA+, Coq, etc.).

I do find it ironic to read this article today given that I made that decision yesterday!


[shameless plug]I maintain a collection of proofs of leftpad in different prover languages, so people can compare them. It's here: https://github.com/hwayne/lets-prove-leftpad

[/invalid closing tag]


The main upside Isabelle has over other proof assistants is the existence of Sledgehammer: invoke it, and your current proof state gets handed off to external automatic provers (SMT solvers like cvc4, veriT and z3, or first-order provers like Vampire). If one of them finds a proof, Isabelle then reconstructs it.

It's essentially a brute-force button that no other proof assistant (aside from Coq) has.


Will be interesting when they become able to autocomplete current research topics correctly.

How about: AI will become everyone's co-pilot

Now we don't need to write hundreds of additional articles.


~"You don't ever want to be a company programming a bot for a company which is designed to program bots, to program bots, programming bots programming bots."~ [1]

—Mark Zuckerberg, probably [actually not].

I am not a robot, but the above limerick's author.

RIP to my IRL humanbro, who recently programmed himself out of his cushy Salesforce job.

Welcome... to The Future™.

--

[1] Paraphrasing Zuckerberg saying he doesn't think a structure of "just managers managing managers" is ideal.


The signal of AGI is whether it can help solve a hard problem or create a new one.

Isn’t proof checking at least as hard as k-sat and therefore NP? Given that neural networks process in constant time, how would that be possible without approximation?

This is kind of amusing because NP problems, by definition, must have a polynomial-length certificate for a given solution that can be checked in polynomial time.

The word "certification" can be read as "proof" without loss of meaning.

What you're asking is slightly different: whether the proof-checking problem in logic is in NP. I don't know enough math to know for sure, but assuming that any proof can be found in polynomial time by a non-deterministic Turing machine and that such a proof is polynomial in length, then yes, the proof-checking problem is in NP. Alas, I think that all of this is moot because of Turing and Church's negative answer to the Entscheidungsproblem [1].

It would help your understanding and everyone else's if you separated the search problem for a proof from the verification problem of a proof. Something is in NP when you can provide a verification that runs in polynomial time. The less often mentioned second condition is that the search problem must be solvable in polynomial time by a non-deterministic Turing machine. Problems which don't meet the second condition are generally only of theoretical value.

1. https://en.m.wikipedia.org/wiki/Entscheidungsproblem
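To make the certificate-equals-proof point concrete, here is a toy example (plain Python, not tied to any proof assistant): checking a claimed satisfying assignment for a CNF formula is linear in the size of the formula, even though finding such an assignment may be hard.

  def check_sat_certificate(clauses, assignment):
      """Check a claimed satisfying assignment for a CNF formula.

      clauses:    list of clauses, each a list of non-zero ints in DIMACS style
                  (3 means x3, -3 means NOT x3)
      assignment: dict mapping variable number -> bool

      Runs in time linear in the total number of literals.
      """
      for clause in clauses:
          # A clause is satisfied if at least one of its literals is true.
          if not any(assignment[abs(lit)] == (lit > 0) for lit in clause):
              return False  # this clause fails, so the certificate is rejected
      return True

  # (x1 OR NOT x2) AND (x2 OR x3)
  formula = [[1, -2], [2, 3]]
  print(check_sat_certificate(formula, {1: True, 2: True, 3: False}))  # True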


This article isn't about using neural networks for proof checking, but just about using them to assist with writing Lean programs, in the same way that developers use Copilot to write Java code or whatever. It might just spit out boilerplate code for you, of which there is still a great deal in formal proofs.

The proof checking is still actually done with normal proof checking programs.


Checking a proof is much easier than finding it.

Neural networks process in constant time per input/output (token or whatever). But some prompts could produce more tokens.

As you increase the size of the window, doesn't the size of the NN have to increase? So if you want to just use an NN for this, the bigger the problem it can handle, the bigger the window, so the longer per token produced; but also the more tokens produced, so it's longer on that front as well. I don't know if that adds up to polynomial time or not.

Note well: I'm not disagreeing with your overall assertion. I don't know if NNs can do this; I'm inclined to think they can't. All I'm saying is that even NNs are not constant time in problem size.
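Roughly, for a plain transformer (a back-of-the-envelope estimate that ignores constants and tricks like sparse or sliding-window attention): generating token i attends to all i earlier tokens, so with model width d and L layers,

  \text{cost of token } i \;\approx\; O\!\bigl(L \cdot (i \cdot d + d^2)\bigr),
  \qquad
  \text{total for an } n\text{-token output} \;\approx\; O\!\bigl(L \cdot (n^2 d + n d^2)\bigr),

which is polynomial (quadratic) in the output length rather than constant, matching the intuition that bigger problems cost more per answer.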


Most k-sat problems are easy, or can be reduced to a smaller and easier version of k-sat. Neural networks might be able to write Lean or Coq to feed into a theorem prover.

Humans run in constant time. So yeah it depends on the constant factor.

No.


A proof is basically a directed graph where nodes are propositions and edges are applications of deductive rules. One only needs to check that the starting nodes are axioms and that each edge is an instance of one of a finite set of deduction rules in the system. So the time to check a proof is linear in the number of edges.
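A minimal sketch of that check (the data layout and rule interface here are made up for illustration; real kernels are far more involved):

  def check_proof(proof, axioms, rules):
      """Check a proof presented as a DAG of propositions.

      proof:  dict mapping each proposition to ("axiom", None) or
              (rule_name, premises); premises must appear earlier in the dict
              (i.e. insertion order is a topological order)
      axioms: set of propositions accepted without proof
      rules:  dict rule_name -> function(premises, conclusion) -> bool
      Each premise -> conclusion edge is inspected once, so the check is
      linear in the number of edges.
      """
      verified = set()
      for prop, (rule_name, premises) in proof.items():
          if rule_name == "axiom":
              if prop not in axioms:
                  return False
          else:
              if any(p not in verified for p in premises):
                  return False              # premise not established yet
              if not rules[rule_name](premises, prop):
                  return False              # not a legal rule application
          verified.add(prop)
      return True

Real kernels differ in the details, but the shape of the check is the same: look at each edge once and verify it locally against a fixed rule set.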

Does Lean work with linear algebra proofs?


All I can read from this article is "the end of mathematics is on the horizon". Hopefully nothing he says actually materializes.

"their time frame is maybe a little bit aggressive" has to be the nicest way I've ever heard someone say "they're full of shit".

I think he's being quite literal in that he believes that they're too aggressive.

In the article he does after all second the notion that AI will eventually replace a lot of human labour related to mathematics.


I'm not reading it that way. I don't know where you are getting "AI will eventually replace..." from; if anything he seems to have the opposite idea: "even if AI can do the type of mathematics we do now, it means that we will just move to a higher type of mathematics".

The line I quoted is in response to "in two to three years mathematics will be “solved” in the same sense that chess is solved".

Unless my priors are dramatically wrong, anyone who believes that is indeed full of shit.


"Elon Must"

Relying on GPT-style systems to translate high-level math into Lean strikes me as dangerous unless you're going to double-check by hand that the LLM didn't suddenly decide to hallucinate parts of the answer.

Granted, the hallucinated part might not pass the final Lean verification, but ... what if it does because the LLM did not properly translate the mathematician's intent to Lean language?

I'm thinking the human will still need to read the Lean program, understand it and double-check that his original human intent was properly codified.

Better than having to produce it by hand, but still sub-optimal.


The trick is that the human only needs to read and understand the Lean statement of a theorem, and agree that it (with all involved definitions) indeed represents the original mathematical statement, but not the proof. That the proof indeed proves the statement is what Lean checks. We do not need to trust the LLM in any way.

So would I accept a proof made by GPT or whatever? Yes. But not a (re)definition.

The analogy for programming is that if someone manages to write a function with a certain input and output types, and the compiler accepts it, then we do know that indeed someone managed to write a function of that type. Of course we have no idea about the behaviour, but statements/theorems are types, not values :-)
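A toy illustration of that split, in Lean 4 syntax (nothing specific to the article, just the general shape):

  -- What the human must read and endorse is only the statement
  -- (together with the definitions it mentions):
  theorem sum_comm (a b : Nat) : a + b = b + a :=
    -- The proof term below could just as well have been produced by an LLM;
    -- Lean's kernel checks it either way, so it needs no trust.
    Nat.add_comm a b

If the proof term were nonsense, Lean would reject it; the only thing still requiring human judgment is whether the statement of sum_comm (and the definitions it uses) says what you think it says.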


The thing is, when AI systems are able to translate intuitive natural language proofs into formal Lean code, they will also be used to translate intuitive concepts into formal Lean definitions. And then we can't be sure whether those definitions actually define the concepts they are named after.
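A toy example of that failure mode (hypothetical names, Lean 4 syntax): a definition can typecheck and support a fully verified theorem while not meaning what its name suggests.

  -- A hypothetical auto-formalised definition whose name promises more than
  -- it delivers: the name says "prime", but the body only says "greater than 1".
  def IsPrimeBad (n : Nat) : Prop := n > 1

  -- Lean verifies this without complaint, yet it says nothing about primality.
  theorem four_is_prime : IsPrimeBad 4 := by
    unfold IsPrimeBad
    decide

Lean guarantees that the proof matches the definition as written, not that the definition matches the informal concept, so that last check stays with the human.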

There is no danger. If the output of the Neural net is wrong, then Lean will catch it.

>what if it does because the LLM did not properly translate the mathematician's intent to Lean language?

How so? The one thing which definitely needs to be checked by hand is the theorem to be proven, which obviously shouldn't be provided by a Neural network. The NN will try to find the proof, which is then verified by Lean, but fundamentally the NN can't prove the wrong theorem and have it verified by Lean.


I disagree. I think if the hallucinated part passes the final Lean verification, it would actually be trivial for a mathematician to verify that the intent is reflected in the theorem.

And also, in this case, doesn't the mathematician first come up with a theorem and then engage the AI to find a proof? I believe that's what Terence did recently, judging by his Mastodon threads from a while ago. Therefore that intent verification can be skipped too. Lean can act as the final arbiter of whether the AI produces something correct or not.


GPT != AI. GPT is a subset of AI. GPT is probably absolutely useless for high-level math, if for no other reason than that to be useful you'd need training data that basically by definition doesn't exist. Though I am also strongly inclined to believe that LLMs are constitutionally incapable of functioning at that level of math regardless, because that's simply not what they are designed to do.

But AI doesn't have to mean LLMs, recent narrowing of the term's meaning notwithstanding. There's all kinds of ways AIs in the more general sense could assist with proof writing in a proof language, and do it in a way that each step is still sound and reliable.

In fact once you accept you're working in a proof language like lean, the LLM part of the problem goes away. The AIs can work directly on such a language, the problems LLMs solve simply aren't present. There's no need to understand a high-context, fuzzy-grammar, background-laden set of symbols anymore. Also no problem with the AI "explaining" its thinking. We might still not be able to follow too large an explanation, but it would certainly be present even so, in a way that neural nets don't necessarily have one.


High-level math is made from a lot of low-level math combined to form the high level, right? You can definitely use tools like GPT to write the low-level math.

It doesn't work the same for all use cases, training data do play a role certainly.


> GPT is probably absolutely useless for high-level math

Terence Tao disagrees with you.

https://mathstodon.xyz/@tao/110601051375142142


This tool is like all other tools in that expert practitioners extract value from it and novices consider it useless. Fascinating. The advent of search engines was similar with many remarking on how if you searched for something sometimes you'd find websites that were wrong. Things change and yet stay the same. At the time I thought age was the discriminator. Today I know it's not that. It's something else.

But this shows you that the tools (LLMs included) don't understand what they are producing. It takes an expert to extract the value. The tools do enrich the probability distribution greatly, raising the concentration of solutions well above the noise floor, but they are still only amplifiers. Even if they get past the stage of producing apparent nonsense, in the end they still can't get into the user's head to see exactly what is desired, and a precise specification of the intention takes experience and effort. A precise specification of a program can exceed in effort that of actually writing the program.

If it could be solved with a Math Overflow-post level of effort, even from Terence Tao, it isn't what I was talking about as "high level math".

I also am not surprised by "Consider a generation function" coming out of an LLM. I am talking about a system that could solve that problem, entirely, as doing high level math. A system that can emit "have you considered using wood?" is not a system that can build a house autonomously.

It especially won't seem all that useful next to the generation of AIs I anticipate to be coming which use LLMs as a component to understand the world but are not just big LLMs.


These tools, and subsequently their unethical parent companies, will be air gapped by length from my research.

https://lean-lang.org/about/ is an open source project

I have great respect for Terry Tao, but let's not forget that while he won the Fields Medal at a relatively young age, he hasn't made any major contribution to math comparable to proving Fermat's Last Theorem (Andrew Wiles did), solving the Poincaré Conjecture (Grigori Perelman did) or Yitang Zhang's progress towards the twin prime conjecture.

Perhaps for mathematicians like Terry Tao AI will be a tool they will rely on, but for mathematicians like the aforementioned, a pen, paper and the published research literature will continue to be their main tools!



