The post is saying, in other words: in Lean, Rocq, or any other theorem prover, you get a formally verified proof, but you do NOT get a formally verified theorem statement.
So, even if the proof is correct, you need to determine if the theorem is what you want. Making that determination requires expertise. Since you cannot "run the theorem", you cannot vibe-code your way through it. E.g., there is no equivalent of "web app seems to be working!" You have to actually understand what the theorems are saying in a deep way.
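As a hypothetical Lean sketch of that gap (the theorem name is mine): the kernel happily accepts the proof below, yet the statement is worthless, because its hypothesis can never hold and so it says nothing about `xs` at all.

```lean
-- Sketch only: the proof checks, but the statement is vacuous.
-- The hypothesis `h : 0 < 0` is unsatisfiable, so this "theorem"
-- tells you nothing about lists, despite its promising-looking name.
theorem everyListIsEmpty (xs : List Nat) (h : 0 < 0) : xs = [] :=
  absurd h (Nat.lt_irrefl 0)
```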
Even then, this seems much more promising to me than other areas. Writing theorem statements is much, much easier than coming up with proofs, so it's not a big deal if a human has to do that. And once that's done, getting a correct proof out of an LLM/AI model can be fully automated (assuming you do get a proof out of it at all!).
In the case of Lean, propositions are encoded as (dependent) types, and the user typically has to write that encoding themselves and then use e.g. tactics to derive a term of said type.
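A minimal sketch of that workflow (the theorem name is mine): the statement is the type, and a tactic such as `omega` constructs the term that inhabits it.

```lean
-- The proposition `a + b = b + a` is a type; proving the theorem means
-- constructing a term of that type, here via the `omega` tactic.
theorem addCommExample (a b : Nat) : a + b = b + a := by
  omega
```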
Writing a statement you don't understand and then later getting a proof from an LLM doesn't seem all that useless to me; in my mind, it could still be useful as exploration. Worst case scenario: you encoded a tautology and the LLM gave you a trivial proof. Best case scenario: the proposition ends up being a lemma for something you actually want to prove.
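For that worst case, a Lean sketch of what "encoded a tautology, got a trivial proof" looks like:

```lean
-- A tautology: the statement is true by construction, so the proof
-- carries no information about anything you actually care about.
example (p : Prop) : p → p := fun hp => hp
```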
I do think there is a kernel of truth in what you've stated: if the user does not actually understand the statement of a proposition, then the proof is not very useful to them, since they don't know what the statement's truth implies. As someone who used to do mathematics, I still find immense value in vibe-coding away mundane computations, similar to what Lean's `simp` tactic already does, but much broader.
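The sort of mundane step I mean is the kind `simp` already discharges on its own, e.g.:

```lean
-- A routine simplification: `simp` rewrites `xs ++ []` to `xs` and closes the goal.
example (xs : List Nat) : (xs ++ []).length = xs.length := by
  simp
```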
> Third, you need to decide how far “down the stack” you want to go. That is to say, the software you want to verify operates over some kind of more complex system, for instance, maybe it’s C code which gets compiled down to X86 and runs on a particular chip, or maybe it’s a controller for a nuclear reactor and part of the system is the actual physical dynamics of the reactor. Do you really want your proof to involve specifying the semantics of the C compiler and the chip, or the way that the temperature and other variables fluctuate in the reactor?
I can appreciate what he's getting at, but my utopian vision for the future is that we won't need to reinvent the wheel like this each time we want verified software! E.g. for high-consequence systems, the hard part of compiler correctness is already handled by the efforts of CompCert, and SystemVerilog assertions for the design guarantees of processors are becoming more commonplace.
Yeah, but the problem is that programming languages and compilers change all the time, making it hard to maintain a formal model of them.
Exceptions exist (CompCert C and WebAssembly are two good examples), but, for example, the semantics of raw pointers in Rust are intentionally under-defined because the compiler writers want to keep the freedom to change them.
For the case of Propositional Logic, ChatGPT reflects the current epistemological crisis. When asked for help on a question, it could not properly apply the Law of the Excluded Middle [1].
1. https://chatgpt.com/share/696b7f8a-9760-8006-a1b5-89ffd7c5d2...
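For reference, the law itself is a one-liner in Lean's classical logic:

```lean
-- The Law of the Excluded Middle: every proposition is either true or false.
example (p : Prop) : p ∨ ¬p := Classical.em p
```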
Interesting article, thanks. There is indeed a "semantic gap". However, there is also a practical solution: bidirectional LLM translation. You can verify the formal specification by back-translating it to natural language with another LLM session, allowing human review at the intent level rather than requiring expertise in e.g. Event-B syntax (see https://rochuskeller.substack.com/p/why-rust-solves-a-proble...). This addresses the concern about "mis-defining concepts" without requiring the human to be a formal methods expert. The human can review intent and invariants in natural language, not proof obligations. The AI handles the mathematical tedium while the human focuses on domain correctness, which is exactly where human expertise belongs.
It's a solution only if the translation is proven correct. If not, you're in the same place as you started.
Why do we invent these formal languages, except to be more semantically precise than natural language? What does one gain, besides familiarity, by translating back into a more ambiguous language?
Mis-defining concepts can be extremely subtle. If you look at the allsome quantifier (https://dwheeler.com/essays/allsome.html), you'll see that these problems predate AI, and I struggle to see how natural language is going to help in cases like the "All Martians" one, where the confusion may be over whether Martians exist at all: something relatively implicit.
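In Lean terms, the trap is this (names are mine, just to make the vacuity explicit): if no Martians exist, "all Martians are green" is provable no matter what "green" means.

```lean
-- Vacuous truth: with no Martians, *any* universal claim about them holds.
example (Martian : Type) (green : Martian → Prop)
    (noMartians : Martian → False) : ∀ m : Martian, green m :=
  fun m => (noMartians m).elim
```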
We build pretty complex systems based only on "natural language" specifications. I think you are conflating specification ambiguity with verification accessibility.
> What does one gain besides familiarity by translation back into a more ambiguous language?
You gain intent verification. Formal languages are precise about implementation, but they are often opaque about intent. A formal specification can be "precisely wrong". E.g. you can write a perfectly precise Event-B spec that says "When the pedestrian button is pressed, the traffic light turns Green for cars"; the formalism is unambiguous, the logic is sound, the proof holds, but the intent is fatally flawed. Translating this back to natural language ("The system ensures that pressing the button turns the car light green") allows a human to instantly spot the error.
> All Martians are green
Modern LLMs are actually excellent at explicating these edge cases during back-translation if prompted correctly. If the formal spec allows vacuous truth, the back-translation agent can be instructed to explicitly flag existential assumptions. E.g. "For every Martian (assuming at least one exists), the color is Green", or "If there are no Martians, this rule is automatically satisfied". You are not translating back to casual speech; you are translating back to structured, explicit natural language that highlights exactly these kinds of edge cases.
Maybe it can be done, but I struggle to believe that adding that branch for every forall quantifier (and they may be plentiful in a proof) is going to make a proof more understandable. Rather, I feel like it will just balloon the number of words needed to explain the proof. It feels like it's going to fall on the bad side of verbosity, as the sibling comment says.
I think there is a misunderstanding about what is being back-translated.
We don't back-translate the proof steps (the thousands of intermediate logical derivations). That would indeed be verbose and useless.
We back-translate the specification: the Invariants, Guards, and Events.
For a traffic light system, we don't need the LLM to explain the 50 steps of predicate logic that prove inv3 holds. We just need it to translate inv3 itself:
Formal: inv3: light_NS = Green ⇒ light_EW = Red
Back-translation: 'Invariant: If the North-South light is Green, the East-West light MUST be Red.'
This isn't verbose; it's exactly the concise summary of the system's safety rules. The 'verbosity' of handling edge cases (like the 'Allsome' example) only applies when the specification itself relies on subtle edge cases, in which case being verbose is exactly what you want in order to prevent a hidden bug.
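For comparison, a hypothetical Lean rendering of that invariant (the `Color`, `Junction`, and `inv3` names are mine, not taken from any actual Event-B model):

```lean
-- Sketch of the traffic-light invariant discussed above.
inductive Color where
  | Red
  | Yellow
  | Green

structure Junction where
  light_NS : Color
  light_EW : Color

-- inv3: if the North-South light is Green, the East-West light must be Red.
def inv3 (j : Junction) : Prop :=
  j.light_NS = Color.Green → j.light_EW = Color.Red
```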
I agree: if AI (or humans) have mistranslated a natural-language statement into a formal statement, we should not rely on AI to correctly translate the formal statement back into natural language either.
For many statements I expect it's not possible to retain the exact meaning of the formal-language sentence without the natural language becoming at least as complex, and if you don't retain meaning exactly then you're vulnerable to the kind of thing the article warns about.
> why do we invent these formal languages except to be more semantically precise than natural language
To be... more precise?
On a more serious note, I cannot recommend "Exactly: How Precision Engineers Created the Modern World" by Winchester enough. While the book talks mostly about precision in mechanical engineering, it made me appreciate _precision_ itself to a greater degree.
Rhetorical question? My point is that back-translation into natural language is translation into a less precise form. How is that going to help? No number of additional abstraction layers is going to solve human confusion.
Some valid points, but I wish the authors had developed them more.
On the semantic gap between the original software and its representation in the ITP, program extraction, as in Rocq, probably deserves some discussion: the software is written natively in the ITP, and you have to prove the extraction itself sound. For example, MetaRocq did this for Rocq.
For the "how far down the stack" problem, there are some efforts from https://deepspec.org/, but it's an inherently difficult problem and often gets less love than the lab-environment projects.
If we can use AI to automatically implement a formal spec, then that formal specification language has just become a programming language.
It is more general than that: a programming language is a formal specification language that we know how to compile.
There are plenty of formal specifications that cannot be compiled, not even by an AI. And if you use AI, how do you make sure that the AI compiler compiles correctly?