kenjackson 4 days ago

This is actually no different than for humans once you get past the familiar. It's like the famous project management tree story: https://pmac-agpc.ca/project-management-tree-swing-story

If anything, LLMs have surprised me with how much better they are than humans at understanding instructions for text-based activities. But they are MUCH worse than humans when it comes to creating images/videos.

  • barotalomey 3 days ago

    > If anything, LLMs have surprised me with how much better they are than humans at understanding instructions for text-based activities.

    That's demonstrably false, as proven by both OpenAI's own research [1] and endless independent studies by now.

    What is fascinating is how some people cling to false ideas about what LLMs are and aren't.

    It's a recurring fallacy that's bound to get its own name any day now.

    1: https://news.ycombinator.com/item?id=43155825

    • otabdeveloper4 3 days ago

      People think coding is the difficult part of programming.

      Which it isn't, just like pressing keys isn't the difficult part of being a pianist.

      If they invented a machine to press piano keys with superhuman speed and precision that wouldn't make you a musician.

    • kenjackson 3 days ago

      You’re comparing an LLM to expert programmers. Compare an LLM on the same task versus the average college student. And try it for a math problem. A poetry problem. Ask it a more complex question about history or to do an analysis of an essay you wrote.

      Put it this way — I’m going to give you a text based question to solve and you have a choice to get another human to solve it (randomly selected from adults in the US) or ChatGPT, and both will be given 30 minutes to read and solve the problem — which would you choose?

      • aleph_minus_one 3 days ago

        > Put it this way — I’m going to give you a text based question to solve and you have a choice to get another human to solve it (randomly selected from adults in the US) or ChatGPT, and both will be given 30 minutes to read and solve the problem — which would you choose?

        You wouldn't randomly select an arbitrary adult from the USA to do brain surgery on you, so this argument is sophistry.

        • kenjackson 3 days ago

          Brain surgery requires a license.

          But I do expect an arbitrary adult to be able to follow instructions.

          Ok. How about you give me a text based task where you would pick the random adult over the LLM?

          • nyclounge 2 days ago

            I think you and the parent may be talking about 2 different things.

            Do I want to use an LLM to do it, from a business owner's perspective? Yeah, probably: it's cheaper and more convenient. Which one I'd want to use depends on the problem we're solving, right?

            I'm more concerned about the integrity of the current digital infrastructure. In that sense I would NOT trust ANYTHING really important to anything digital, much less to an LLM. Can I use it for exploration and then require an actual human expert to approve/edit? Absolutely!

            As long as the digital part can't result in significant physical or financial damage.

            Edit: and for HN ppl, of course the LLM will have to be open weight and all, and run locally on an air-gapped GPU, preferably in a Faraday cage.

          • aleph_minus_one 3 days ago

            > Brain surgery requires a license.

            This is rather a red-tape problem. :-)

        • daveguy 3 days ago

          I would choose a random person from my company who was hired to work in that domain to solve problems in that domain. Yes, regardless of the position. Accountant in the domain, yes. Office organizer in the domain, yes. Essentially anyone in the domain, yes. No offense, but by restricting the selection to the general human population you're setting a low bar for LLMs here.

          • kenjackson 3 days ago

            If the bar is for LLMs to replace domain experts about four years after introduction then yes, they are failing miserably.

            But if you were to go back to 2020 and ask whether you'd take a random human over the state-of-the-art AI to answer a text question, you'd take the random human every time except for arithmetic (and you'd have to write it in math notation, not plain English).

            And if you were to ask AI experts when you would choose an AI instead, they'd say not for at least a decade or two, if ever.

            • daveguy 2 days ago

              I wasn't talking about how impressive AI systems are, or how far they've come. I was talking about the fact that any random human with any experience in a specific field -- even though they are not a domain expert -- is going to do better than an LLM. Or, human common sense >>>> what LLMs are doing.

              • kenjackson 2 days ago

                We will have to agree to disagree about your fundamental point.

                • daveguy 2 days ago

                  Fair enough. We will see.

zahlman 4 days ago

Okay, but like.

If you do have that skill to communicate clearly and describe the requirements of a novel problem, why is the AI still useful? Actually writing the code should be relatively trivial from there. If it isn't, that points to a problem with your tools/architecture/etc. Programmers IMX are, on average, far too tolerant of boilerplate.

  • MBCook 4 days ago

    Exactly. This same point was mentioned on Accidental Tech Podcast last week during a section primarily about “vibe coding”. (May have been the paid-only segment)

    If the LLM gets something wrong, you have to be more exact to get it to make the program do the thing you want. And when that isn't perfect, you have to tell it exactly what you want it to do in THAT situation. And the next one. And the next one.

    At that point you’re programming. It may not be the same as coding in a traditional language, but isn’t it effectively the same process? You’re having to lay out all the exact steps to take when different things happen.

    So in the end have you replaced programmers or decreased the amount of programming needed? Or have you just changed the shape of the activity so it doesn’t look like what we’re used to calling programming today?

    John Siracusa (one of the hosts) compared it to the idea of a fourth generation language.

    From Wikipedia:

    “The concept of 4GL was developed from the 1970s through the 1990s, overlapping most of the development of 3GL, with 4GLs identified as ‘non-procedural’ or ‘program-generating’ languages”.

    Program generating language sounds an awful lot like what people are trying to use AI for. And these claims that we don’t need programmers anymore also sound a lot like the claims from when people were trying to make flowchart based languages. Or COBOL.

    “You don’t need programmers! The managers can write their own reports”.

    In fact “the term 4GL was first used formally by James Martin in his 1981 book Application Development Without Programmers” (Wikipedia again).

    They keep trying. But it all ends up still being programming.

    • daxfohl 4 days ago

      This is what I keep coming back to. I'm sure I'm not the only one here who frequently writes the code, or at least a PoC, then writes the design doc based on it. Because the code is the most concise and precise way to specify what you really want, and writing it gives you clarity on things you wouldn't have thought of if you'd only written the document. Unrolling that into pseudocode/English almost always gets convoluted for anything but very linear pieces of logic, and you're generally not going to get it right if you haven't already done a little exploratory coding beforehand.

      So to me, even in an ideal world the dream of AI coding is backwards. It's more verbose, it's harder to conceptualize, it's less precise, and it's going to be more of a pain to get right even if it worked perfectly.

      That's not to say it'll never work. But the interface has to change a lot. Instead of a UX where you have to think about and specify all the details up front, a useful assistant would be more conversational, analyze the existing codebase, clarify the change you're asking about, propose some options, ask which layer of the system, which design patterns to use, whether the level of coupling makes sense, what extensions of the functionality you're thinking about in the future, pros and cons of each approach, and also help point out conflicts or vague requirements, etc. But it seems like we've got quite a way to go before we get there.

      • grahac 3 days ago

        Agreed, although AIs today, with simple project-based rules, can do things like check and account for error cases, and write the appropriate unit tests for those error cases.

        I personally have found I can often produce equivalent code with less English than it would take to type the code myself.

        Also it works very well where the scope is well defined like implementing interfaces or porting a library from one language to another.

        • daxfohl 3 days ago

          Yeah, I guess it depends how much you care about the details. Sometimes you just want a thing to get done, and there are billions of acceptable ways to do it, so whatever GPT spits out is within the realm of good enough. Sometimes you want finer control, and in those cases trying to use AI exclusively is going to take longer than writing code.

          Not much different from image generation really. Sometimes AI is fine, but there's always going to be a need to drop down into Photoshop when you really care about some detail. Even if you could do the same thing with very detailed AI prompts and some trial and error, doing the thing in Photoshop will be easier.

      • namaria 4 days ago

        Another issue I see is the "Machine Stops" problem. When we come to depend on a system that fails to foster the skills and knowledge needed to reproduce it (i.e. if programming becomes so easy for so many people that they don't actually need to know how it works under the hood), you slowly lose the ability, as a society, to maintain and extend the system.

    • LikesPwsh 3 days ago

      I realise this is meant to be a jab at high-level programming languages, but SQL really did succeed at that.

      Its abstraction may leak sometimes, but most people using it are incredibly productive without needing to learn what a spool operator or bitmap does.

      Even though the GUI and natural language aspects of 4GL failed, declarative programming was worth it.

      • MBCook 3 days ago

        I really like SQL personally. You’re right it does work well, but I suspect that’s because it has a limited domain instead of being a general purpose language.

    • aleph_minus_one 3 days ago

      > At that point you’re programming. It may not be the same as coding in a traditional language, but isn’t it effectively the same process? You’re having to lay out all the exact steps to take when different things happen.

      No, it isn't.

      Programming is thinking deeply about

      - the invariants that your code obeys

      - which huge implications a small, innocent change in one part of the program will have for other, seemingly unrelated parts of the program

      - in which sense the current architecture is (still) the best possible for what the program does, and if not, what the best route is to get there

      - ...

    • euroderf 2 days ago

      So, which is it? Do you want to end up writing extremely detailed requirements, in English? Or do you want to DIY by filling your head with software-related abstractions - in some internal mental "language" that might often be beyond words - and then translating those mental abstractions to source code?

  • derefr 4 days ago

    An LLM is a very effective human-solution-description / pseudocode to "the ten programming languages we use at work, where I'm only really fluent in three of them, and have to use language references for the others each time I code in them" transpiler.

    It also remembers CLI tool args far better than I do. Before LLMs, I would often have to sit and just read a manpage in its entirety to see if a certain command-line tool could do a certain thing. (For example: do you know off-hand if you can get ls(1) to format file mtimes as ISO8601 or POSIX timestamps? Or — do you know how to make find(1) prune a specific subdirectory, so that it doesn't have to iterate-over-and-ignore the millions of tiny files inside it?) But now, I just ask the LLM for the flags that will make the tool do the thing; it spits them out (if they exist); and then I can go and look at the manpage and jump directly to that flag to learn about it — using the manpage as a reference, the way it was intended.

    Actually, speaking of CLI tools, it also just knows about tools that I don't. You have to be very good with your google-fu to go from the mental question of "how do I get disk IO queue saturation metrics in Linux?" to learning about e.g. the sar(1) command. Or you can just ask an LLM that actual literal question.
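
    For the record, both answers turn out to be yes. A quick sketch, wrapped in Python only so it runs as-is; it assumes GNU coreutils/findutils, and ./node_modules is just a stand-in for whichever huge subdirectory you want skipped:

      import subprocess

      # ls(1) can format mtimes as ISO 8601, or as epoch seconds via a custom strftime format:
      subprocess.run(["ls", "-l", "--time-style=full-iso"])
      subprocess.run(["ls", "-l", "--time-style=+%s"])

      # find(1) can prune a subdirectory so it never descends into it at all:
      subprocess.run(
          ["find", ".", "-path", "./node_modules", "-prune", "-o", "-type", "f", "-print"]
      )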

    • taurath 3 days ago

      I’ve found that the surfacing of tools and APIs really can help me dive into learning, but ironically usually by AI finding a tool and then me reading its documentation, as I want to understand if it has the capabilities or flexibility I have in mind. I can leave that to LLMs to tell me, but I find it’s too good an opportunity to build my own internal knowledge base to pass up. It’s the back and forth between having an LLM spit out familiar concepts and give new to me solutions. Overall it helps me get through learning quicker I think, because I can often work off of an example to start.

      • derefr 3 days ago

        Exactly — one thing LLMs are great at, is basically acting as a coworker who happens to have a very wide breadth of knowledge (i.e. to know at least a little about a lot) — who you can thus ask to "point you in a direction" any time you're stuck or don't know where to start.

    • Arcuru 3 days ago

      Before LLMs there existed quite a few tools to try to help with understanding CLI options; off the top of my head there are https://github.com/tldr-pages/tldr and explainshell.com

      LLMs are both more general and more useful than those tools. They're more flexible and composable, and can replace those tools with a small wrapper script. Part of the reason the LLMs can do that, though, is that they have those other tools as datasets to train on.

  • simonw 4 days ago

    Once you've got to a detailed specification, LLMs are a lot faster at correctly typing code than you are.

    • layer8 4 days ago

      As a developer, typing speed is rarely the bottleneck.

      • Kiro 4 days ago

        Old trope that is no longer true.

        • otabdeveloper4 3 days ago

          Is this a jab at enterprise Java programmers?

    • zahlman 4 days ago

      In your analysis, do you account for the time taken to type a detailed specification with which to prompt the LLM?

      Or the time to review the code - whether by manual fixes, or iterating with the prompt, or both?

      • simonw 4 days ago

        No, just the time spent typing the code.

        • zahlman 4 days ago

          I'm sure curiosity will get the better of me eventually, but as it stands I'm still unconvinced. Over the years I've ingrained a strong sense that just fixing things myself is easier than clearly explaining in text what needs to be done.

        • recursivegirth 4 days ago

          Time to first iteration is a huge metric that no one is tracking.

          • r0b05 4 days ago

            Could you explain this please?

    • tharant 3 days ago

      This is one reason I see to be optimistic about some of the hype around LLMs—folks will have to learn how to write high quality specifications and documentation in order to get good results from a language model; society desperately needs better documentation!

  • geor9e 4 days ago

    >Actually writing the code should be relatively trivial

    For you, maybe. This statement assumes years of grueling training to become bilingual in a foreign programming language. And I can't type at 1000 tokens/s personally - sometimes I just want to press the voice dictate key and blab for five seconds and move on to something actually interesting.

    • zahlman 4 days ago

      >This statement assumes years of grueling training to become bilingual in a foreign programming language

      ...So, less experienced programmers are supposed to be happy that they can save time with the same technology that will convince their employers that a human isn't necessary for the position?

      (And, frankly, I've overall quite enjoyed the many years I've put into the craft.)

      • geor9e 4 days ago

        You're seeing this entirely from the perspective of people who do programming as their job. I'm seeing it from the perspective of the other 99% of society. It feels really good that they're no longer gatekept by the rigid and cryptic interfaces that prevented them from really communicating with their computer, just because it couldn't speak their native tongue.

        • wrs 4 days ago

          The point of the PB&J thing is exactly to demonstrate that your native tongue isn’t precise enough to program a computer with. There’s a reason those interfaces are rigid, and it’s not “gatekeeping”. (The cryptic part is just to increase information density — see COBOL for an alternative.)

          • geor9e 4 days ago

            I think https://docs.cursor.com/chat/agent has shown plain English is precise enough to program a computer with, and some well respected programmers have become fans of it https://x.com/karpathy/status/1886192184808149383

            I only took exception to the original statement - that coding is trivial, and the question of whether AI is even useful. So many people are finally able to create things they were never able to before. That's something to celebrate. Coding isn't trivial for most people; it's more of an insurmountable barrier to entry. English works - that's why a clear-minded project manager can delegate programming to someone fluent in it, without knowing how to code themselves. We don't end up with them dumping a jar of jam on the floor, because intelligent beings can communicate in the context of a lot of prior knowledge they were trained on. That's how AI is overcoming the peanut butter and jelly problem of English. It doesn't need solutions defined for it; a word to the wise is sufficient.

            • namaria 4 days ago

              > intelligent beings can communicate in the context of a lot of prior knowledge

              This is key. It works because of previous work. People have shared context because they develop it over time, as we are raised: shared context is passed on to the new generation, and it grows.

              LLMs consume the context recorded in the training data, but they don't give it back. They diminish it, because people don't need to learn the shared context when using these tools. That appears to work in some use cases, but it degrades our collective shared context over time: the tools consume past shared context while atrophying our ability to maintain and increase it. Shared context is reproduced, and grows, when it is learned by people. If a tool just takes it and precludes people from learning it, there is a delayed effect: over time there is less shared context, and when the tool's performance degrades, the ability to maintain and extend that context will have degraded too. We might get to an irrecoverable state and spiral.

            • otabdeveloper4 3 days ago

              > plain English is precise enough to program a computer with

              Only if your problem is already uploaded on Github.

        • Arn_Thor 3 days ago

          Yep! I’m digitally literate but can’t do anything more advanced than “hello world”. Never had the time or really interest in learning programming.

          In the last year I’ve built scripts and even full fledged apps with GUIs to solve a number of problems and automate a bunch of routine tasks at work and in my hobbies. It feels like I woke up with a superpower.

          And I’ve learned a ton too, about how the plumbing works, but I still can’t write code on my own. That makes me useful but dependent on the AI.

  • larve 4 days ago

    Useful boilerplate:

    - documentation (reference, tutorials, overviews)
    - tools
    - logging and log analyzers
    - monitoring
    - configurability
    - unit tests
    - fuzzers
    - UIs
    - and not least: lots and lots of prototypes and iterating on ideas

    All of these are "trivial" once you have the main code, but they are incredibly valuable, and LLMs do a fantastic job.

    • zahlman 4 days ago

      I was referring specifically to boilerplate within the code itself. But sure, I can imagine some uses.

Syzygies 4 days ago

"I bought those expensive knives. Why doesn't my cooking taste better?"

"I picked up an extraordinary violin once. It sounded awful!"

There's an art here. Managerial genius is recognizing everyone's strengths and weaknesses, and maximizing impact. Coding with AI is no different.

Of course I have to understand the code well enough to have written it. Usually much of the time is spent proposing improvements.

I'm a few months in, learning to code with AI for my math research. After a career as a professor, I'm not sure I could explain to anyone what I'm starting to get right, but I'm working several times more efficiently than I ever could by hand.

Some people will get the hang of this faster than others, and it's bull to think this can be taught.

_wire_ 2 days ago

Step 1: Computer, make me a peanut butter and jelly sandwich.

If this can't work, the program abstraction is insufficient to the task. This insufficiency is not a surprise.

That an ordinary 5-year-old can make a sandwich after only ever seeing someone make one, and that the sandwich so made is a component of a life-sustaining matrix which inevitably leads to new 5-year-olds making their own sandwiches and serenading the world about the joys of peanut butter and jelly, is the crucial distinction between AI and intelligence.

The rest of the stuff about a Harvard professor ripping a hole in a bag and pouring jelly on a clump of bread on the floor is a kooky semantic game that reveals something about the limits of human intelligence among the academic elite.

We might wonder why some people have to get to university before encountering such a basic epistemological conundrum as what constitutes clarity in exposition... But maybe that's what teaching to the test in U.S. K-12 gets you.

Alan Kay is known to riff on a simple study in which Harvard students were asked what causes the Earth's seasons: almost all of them gave the wrong explanation, and many were very confident about the correctness of their wrong explanations.

Given that the measure of every AI chat program's performance is how agreeable its response is to a human, is there a clear distinction between the human and the AI?

If this HN discussion were among AI chat programs considering their own situations and formulating an understanding of their own problems, maybe waxing about the (for them) ineffable joy of eating a peanut butter and jelly sandwich...

But it isn't.

pkdpic 4 days ago

lol, I didn't realize how famous the PB&J exercise was. That's fantastic. I thought it was just from this puppet video I've been showing my 4yo and his friends. Anyway they seem to love it.

https://m.youtube.com/watch?v=RmbFJq2jADY&t=3m25s

Also seems like great advice; feels like a good description of what I've been gravitating towards / having more luck with lately when proompting.

  • 01HNNWZ0MV43FF 4 days ago

    My class did it with paper airplanes. My partner used the phrase "hotdog style" which I'd never heard. Good times!

extr 4 days ago

Didn't know they did the PB&J thing at Harvard. I remember doing that in the 3rd grade or thereabouts.

iDon 3 days ago

For decades people have been dreaming of higher-level languages, where a user can simply specify what they want and not how to do it (the name of the programming language Forth comes from 'fourth', as in fourth-generation computers, reflecting this idea).

Here we are - we've arrived at the next level.

The emphasis in my prompts is specification: clear and concise, defining terms as they are introduced, and I've had good results with that. I expect that we'll see specification/prompt languages evolve, in the same way that MCP has become a de facto standard API for connecting LLMs to other applications and servers. We could use a lot of the ideas from existing specification languages, and there has been a lot of work done on this over 40+ years, but my impression is they are largely fairly strict, because their motivation was provably-correct code. The ideas can be used in a more relaxed way, because prompting fits well with rapid application development (RAD) and prototyping - I think there is a sweet spot of high productivity in a kind of REPL (read/evaluate/print loop) with symbolic references and structure embedded in free-form text.

Other comments have mentioned the importance of specification and requirements analysis, and daxfohl mentions being able to patch new elements into the structure in subsequent prompts (via BASIC line-number insertion).

grahac 4 days ago

Anyone here see the CS50 peanut butter and jelly problem in person?

  • Nuzzerino 4 days ago

    We had this in 8th-grade science class, and IMO it was much better than the Harvard version. Still PB&J with instructions. The teacher had a skeleton named "it". Any time the instructions referenced the word "it", the teacher used the skeleton in its place.

  • csours 4 days ago

    Not in that course, but I've done it at a "STEM" day; it's just about the most fun I've ever had teaching.

    • grahac 4 days ago

      That's awesome.

      I got the sense the professor liked it a ton too. In the videos of this on YouTube you can see the professor really enjoying watching it all go down.

      It still is so memorable.

  • rileymat2 4 days ago

    I had a middle school art teacher do it in roughly 1995.

  • ryoshu 4 days ago

    We did it in 3rd grade in public school.

mkw5053 4 days ago

Similar to having a remote engineering team that fulfills the (insufficient) requirements but in ways you did not predict (or want).

daxfohl 4 days ago

Whenever I'm prompting an LLM for this kind of thing I find myself wishing there were a BASIC-style protocol that we could use to instruct LLMs: numbered statements, GOTOs to jump around, standardized like MCP or A2A so that all LLMs are trained to understand it and verified to follow the logic.

Why BASIC? It's a lot harder to mix English into structured programming constructs like nested blocks; flat numbered statements take prose more naturally. Plus it's nice that if you forget a step between 20 and 30 you can just say `25 print 'halfway'` from the chat.
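
A toy sketch of the mechanic in Python, just to make it concrete; the step texts and line numbers are made up:

    # Steps keyed by BASIC-style line numbers, so a later chat turn can splice
    # in "25 ..." without restating the whole prompt.
    steps = {
        10: "Read the input CSV.",
        20: "Drop rows with an empty 'email' column.",
        30: "Summarize the remaining rows as a markdown table.",
    }

    steps[25] = "print 'halfway'"  # inserted between 20 and 30, mid-conversation

    prompt = "\n".join(f"{n} {text}" for n, text in sorted(steps.items()))
    print(prompt)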

cadamsdotcom 3 days ago

AI is exposing the difference in effectiveness between communicating clearly and precisely (potentially being more verbose than you think you need), vs. leaning heavily on context.

01HNNWZ0MV43FF 4 days ago

> Over the past year, I’ve been fully immersed in the AI-rena—building products at warp speed with tools like Claude Code and Cursor, and watching the space evolve daily.

Blast fax kudos all around

sevenseacat 3 days ago

Heh, I remember doing that same peanut butter exercise in my high school computing class. It also has stuck in my head for all these years!

conductr 4 days ago

Funny to see, I used this exact analogy a few weeks ago regarding AI

kazinator 2 days ago

> If your “sandwich” is a product that doesn’t have an obvious recipe—a novel app, an unfamiliar UX, or a unique set of features—LLMs struggle

Bzzt, nope!

If the sandwich does have an obvious recipe --- an app similar to many that have been written before, a familiar, conventional UX, or boring features found in countless existing apps --- LLMs struggle.

Fixed it for ya!

davidcalloway 4 days ago

My teacher did the peanut butter and jelly problem with us in the fourth grade, but we were given the time to write the instructions as homework and she picked a few to execute the following day.

The disappointment always stayed with me that my instructions were not chosen, as I really had been far more precise than the fun examples she did choose. I recall even explaining which side of the knife to use when taking peanut butter from the jar.

Of course, she would still have found plenty of bugs in my instructions, which I wish I still had.

Thanks for that, and also the pet rats, Ms. Clouser!

tedunangst 4 days ago

Why is the peanut butter so sloppy?

derefr 4 days ago

> Today’s AI Still Has a PB&J Problem

If this is how you're modelling the problem, then I don't think you learned the right lesson from the PB&J "parable."

Here's a timeless bit of wisdom, several decades old at this point:

Managers think that if you can just replace code with something else that isn't text with formal syntax, then all of a sudden "regular people" (like them, maybe?) will be able to "program" a system. But it never works. And the reason it never works is fundamental to how humans relate to computers.

Hucksters continually reinvent the concept of "business rules engines" to sell to naive CTOs. As a manager, you might think it's a great idea to encode logic/constraints into some kind of database — maybe one you even "program" visually like UML or something! — and to then have some tool run through and interpret those. You can update business rules "live and on the fly", without calling a programmer!

They think it's a great idea... until the first time they try to actually use such a system in anger to encode a real business process. Then they hit the PB&J problem. And, in the end, they must get programmers to interface with the business rules engine for them.

What's going on there? What's missing in the interaction between a manager and a business rules engine, that gets fixed by inserting a programmer?

There are actually two things:

1. Mechanical sympathy. The programmer knows the solution domain — and so the programmer can act as an advocate for the solution domain (in the same way that a compiler does, but much more human-friendly and long-sighted/predictive/10k-ft-view-architectural). The programmer knows enough about the machine and about how programs should be built to know what just won't work — and so will push back on a half-assed design, rather than carrying the manager through on a shared delusion that what they're trying to do is going to work out.

2. Iterative formalization. The programmer knows what information is needed by a versatile union/superset of possible solution architectures in the solution space — not only to design a particular solution, but also to "work backward", comparing/contrasting which solution architectures might be a better fit given the design's parameters. And when the manager hasn't provided this information — the programmer knows to ask questions.

Asking the right questions to get the information needed to determine the right architecture and design a solution — that's called requirements analysis.

And no matter what fancy automatic "do what I mean" system you put in place between a manager and a machine — no matter how "smart" it might be — if it isn't playing the role of a programmer, both in guiding the manager through the requirements analysis process, and in pushing back through knowledge of mechanical sympathy... then you get PB&J.

That being said: LLMs aren't fundamentally incapable of "doing what programmers do", I don't think. The current generation of LLMs is just seemingly

1. highly sycophantic and constitutionally scared of speaking as an authority / pushing back / telling the user they're wrong; and

2. trained to always try to solve the problem as stated, rather than asking questions "until satisfied."

  • dsjoerg 3 days ago

    You're right about everything except you underestimate the current generation of LLMs. With the right prompting and guidance, they _already_ can give pushback and ask questions until satisfied.

    • derefr 3 days ago

      Well, yes and no.

      You can in-context-learn an LLM into being a domain expert in a specific domain — at which point it'll start challenging you within that domain.

      But — AFAIK — you can't get current LLMs to do the thing that experienced programmers do, where they can "know you're wrong, even though they don't know why yet" — where the response isn't "no, that's wrong, and here's what's right:" but rather "I don't know about that... one minute, let me check something" — followed by motivated googling, consulting docs, etc.

      And yes, the "motivated googling" part is something current models (DeepResearch) are capable of. But the layer above that is missing. You need a model with:

      1. trained-in reflective awareness — "knowing what you know [and what you don't]" — such that there's a constant signal within the model representing "how confident I am in the knowledge / sources that I'm basing what I'm saying upon", discriminated as a synthesis/reduction over the set of "memories" the model is relying upon;

      2. and a trained-in capability to evaluate the seeming authoritativeness and domain experience of the user, through their statements (or assertions-from-god in the system prompt about the user) — in order for the model to decide whether to trust a statement you think sounds "surprising", vs. when to say "uhhhhh lemme check that."

      • dsjoerg 7 hours ago

        Yeah, I agree that the current generation of LLMs don't appear to have been trained on solid "epistemological behavior". I believe the underlying architecture is capable of it, but I see signs of the training data not containing that sort of thing. In fact, in either the training or the prompting, or both, it seems like the LLMs I use have been tuned to do the opposite.

gblargg 4 days ago

At least with AI you can ask it what it understands about the topic so you know what you can assume.

  • GuB-42 4 days ago

    It turns out it is not a reliable approach. How an LLM works and how an LLM says it works can be completely different.

    Think about it: an LLM is an autocompleter. It gives you the most probable next word each time. That doesn't mean it doesn't understand high-level concepts, but in the end, it just writes stuff that is similar to its training dataset.

    For example, ask it to multiply two numbers. If the numbers are small enough, you will get the right answer. Now ask it to explain how it did it: it will probably tell you the process as commonly taught in school, but that's not actually how it did it. What it did is much weirder to us humans, and the only way to see how it actually works is to look at the internals of the neural network. The LLM can't describe it, it doesn't see inside itself, however, it has many textbooks in its training dataset, so it will grab an answer from these textbooks because that's how people answer.

    Seeing how it correctly describes the multiplication process and can multiply small numbers correctly, you would assume it can also multiply large numbers (as we do), but nope, it can't, unless it has access to a separate, conventionally built math module (i.e. not a neural net).
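
    A minimal sketch of that separate-math-module idea: route the arithmetic to ordinary exact integer code instead of letting the model guess the digits. Here call_llm is a hypothetical stand-in for whatever chat API you use, and the routing regex is only illustrative:

      import re

      def call_llm(prompt: str) -> str:
          # Hypothetical model call; replace with your actual client.
          raise NotImplementedError

      def answer(prompt: str) -> str:
          # If the prompt is literally "a * b", compute it with Python's
          # arbitrary-precision integers instead of asking the model.
          m = re.fullmatch(r"\s*(\d+)\s*[*xX]\s*(\d+)\s*", prompt)
          if m:
              return str(int(m.group(1)) * int(m.group(2)))
          return call_llm(prompt)

      print(answer("123456789 * 987654321"))  # 121932631112635269, exact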

    • EMIRELADERO 4 days ago

      > The LLM can't describe it, it doesn't see inside itself, however, it has many textbooks in its training dataset, so it will grab an answer from these textbooks because that's how people answer.

      EDIT: I see now that you were referring to the answers it uses to justify the result, not the underlying computations. Sorry! You can disregard the actual comment. Leaving for completeness.

      ORIGINAL COMMENT:

      That's not how it works. Addition in LLMs is believed to function through different mechanisms depending on model size and architecture, but the single consistent finding across different models is that they generalize beyond the training data for at least those simple arithmetic operations.

      For example: "Language models use trigonometry to do addition" https://arxiv.org/abs/2502.00873

      For a different "emergent" algorithm, see Anthropic's observations: https://transformer-circuits.pub/2025/attribution-graphs/met...

  • gitremote 4 days ago

    You are assuming that generative AI has self-awareness.