You don't need to understand the math. You don't need to know what a transformer is or how backpropagation works. But if you're building products on top of LLMs and you're treating the model as a black box that takes text in and returns text out, you're going to make bad product decisions. The failure modes of these systems are specific and predictable, and understanding them - even at a conceptual level - is now part of the job.
I've spent the last year building LLM-powered features at q.watt and watching other teams build them at Uzum. The teams that struggle are almost always the ones who don't understand what's actually happening inside the model. Not the math - the behavior. Here's what I think every PM needs to know.
How attention actually works, and why it matters
LLMs process text by paying "attention" to different parts of the input when generating each word of the output. The attention mechanism is what lets the model understand that "it" in "the server crashed because it ran out of memory" refers to "the server" and not "memory." It's how the model tracks relationships across long stretches of text.
Here's the thing that matters for product decisions: attention is not uniform across the context window. Models pay more attention to the beginning and end of their input than to the middle. This is sometimes called the "lost in the middle" problem, and it's been documented empirically across multiple model families. If you feed a model a 10,000-word document and ask it a question, the answer will be more accurate if the relevant information is in the first or last few paragraphs than if it's buried in the middle.
This has direct implications for how you design prompts and how you structure inputs. If you're building a document summarizer, the most important context should be at the top or bottom of what you send the model. If you're building a customer support assistant that needs to follow specific policies, put the most critical policies at the beginning of the system prompt, not in the middle of a long list. If you're building a RAG system that retrieves chunks of text and concatenates them, the order of those chunks matters more than most people realize.
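The chunk-ordering point can be made concrete with a small sketch. The idea is to place the highest-ranked chunks at the edges of the context, where attention is strongest, and let the weakest chunks fall in the middle. The function name and the alternating strategy here are my own, not a standard API:

```python
def order_chunks_for_context(chunks_by_relevance):
    """Reorder retrieved chunks for the "lost in the middle" effect.

    `chunks_by_relevance` is assumed sorted, most relevant first.
    The most relevant chunks end up at the start and end of the context;
    the least relevant end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: 1st chunk to the front, 2nd to the back, and so on.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reversing the back half pushes the weakest chunks toward the middle.
    return front + back[::-1]

ordered = order_chunks_for_context(["A", "B", "C", "D", "E"])
# "A" and "B" (most relevant) sit at the edges; "E" (least) lands in the middle.
```

The same trick applies to concatenating policies or instructions: rank them first, then interleave so nothing critical is buried mid-context.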
What temperature actually does
Temperature is a parameter that controls how "random" the model's outputs are. Most PM-facing documentation describes it as a creativity dial - low temperature for factual tasks, high temperature for creative ones. That's not wrong, but it's not precise enough to be useful.
What temperature actually does is adjust the probability distribution over the model's vocabulary at each generation step. At temperature 0, the model always picks the highest-probability next token - the most "expected" word given everything that came before. At higher temperatures, lower-probability tokens get a bigger share of the distribution, so the model is more likely to pick something surprising.
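To make the mechanics concrete, here's a toy temperature-scaled softmax over three candidate tokens. The logit values are invented for illustration, and temperature 0 is a special case in real decoders (greedy argmax, since you can't divide by zero):

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw model scores into probabilities via a temperature-scaled softmax.

    Low temperature sharpens the distribution (the top token dominates);
    high temperature flattens it (lower-ranked tokens gain probability mass).
    """
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                # toy scores for three candidate tokens
sharp = apply_temperature(logits, 0.5)  # top token takes most of the mass
flat = apply_temperature(logits, 2.0)   # probabilities move toward uniform
```

Run it and you'll see the top token's probability shrink as temperature rises - that shift is the entire mechanism behind the "creativity dial."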
The practical implication: low temperature doesn't make the model more accurate. It makes the model more consistent. If the model's most likely answer is wrong, low temperature will give you that wrong answer confidently and repeatedly. High temperature will give you more varied answers - some might be right, and some will be wrong in different ways.
For product design, this means: if you're building something where consistency matters more than accuracy - a tone-of-voice generator, a template filler - use low temperature. If you're building something where you want to explore the space of possible answers - a brainstorming tool, a creative writing assistant - use higher temperature. And if you're building something where accuracy matters, temperature is not your primary lever. You need better prompts, better retrieval, or human review.
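A small simulation makes the consistency-versus-accuracy distinction visible. Assume the model's highest-probability answer happens to be wrong - the tokens and probabilities here are invented for illustration:

```python
import random

tokens = ["Paris", "Lyon", "Nice"]   # suppose "Lyon" is the correct answer...
probs = [0.5, 0.4, 0.1]              # ...but the model favors "Paris"

def pick(probs, temperature, rng):
    """Greedy at temperature 0; otherwise sample from a re-sharpened distribution."""
    if temperature == 0:
        return max(range(len(probs)), key=lambda i: probs[i])
    # Raising probabilities to 1/T and renormalizing is equivalent
    # to scaling the logits; random.choices normalizes the weights.
    weights = [p ** (1 / temperature) for p in probs]
    return rng.choices(range(len(probs)), weights=weights)[0]

rng = random.Random(0)
greedy = {tokens[pick(probs, 0, rng)] for _ in range(50)}
varied = {tokens[pick(probs, 1.5, rng)] for _ in range(50)}
# greedy is always {"Paris"}: consistent, confidently wrong.
# varied typically spans several tokens: less consistent, sometimes right.
```

Temperature changed how often you see the right answer - it never changed which answer the model believes is most likely.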
Why RAG exists and what it actually solves
Retrieval-Augmented Generation is the pattern where you retrieve relevant documents from a knowledge base and include them in the model's context before asking it to answer a question. It's become the standard approach for building LLM products that need to answer questions about specific, up-to-date, or proprietary information.
RAG exists because of two fundamental limitations of base LLMs. First, the model's knowledge is frozen at its training cutoff. It doesn't know about things that happened after it was trained. Second, the model can't reliably recall specific facts from its training data - it can hallucinate plausible-sounding but incorrect details, especially for niche topics.
RAG solves both problems by giving the model the relevant information at inference time, in the context window, rather than relying on what it memorized during training. The model doesn't need to remember the answer - it just needs to read the retrieved documents and synthesize a response.
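The shape of the pipeline is simple enough to sketch end to end. This toy version scores documents by word overlap with the query - real systems use embedding similarity - but the retrieve-then-prompt structure is the same. Function names, documents, and prompt wording are all my own inventions:

```python
def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by word overlap with the query.

    A real system would use embeddings and a vector index,
    but the interface - query in, top-k chunks out - is the same.
    """
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query, documents):
    """Put retrieved text into the context so the model reads rather than recalls."""
    context = "\n\n".join(retrieve(query, documents))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}")

docs = ["The Q3 refund policy allows returns within 30 days.",
        "Office hours are 9 to 5 on weekdays.",
        "Shipping is free on orders over $50."]
prompt = build_prompt("What is the refund policy?", docs)
```

Everything downstream - the model's answer quality, its grounding - depends on that `retrieve` step returning the right documents.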
What RAG doesn't solve: it doesn't fix hallucination entirely. The model can still hallucinate even when the correct answer is in the context - it might ignore the retrieved document and generate something from its training data instead. It also doesn't fix the "lost in the middle" problem I mentioned earlier. If you're retrieving 20 chunks of text and concatenating them, the chunks in the middle are less likely to influence the answer than the ones at the beginning and end.
Retrieval quality matters enormously. A RAG system is only as good as its ability to retrieve the right documents. If your retrieval is returning irrelevant chunks, the model will either ignore them or, worse, try to synthesize an answer from them and produce something confidently wrong. Most RAG failures I've seen are retrieval failures, not generation failures.
Fine-tuning is not the same as prompting
I see this confusion constantly. Teams try to get a model to behave a certain way through prompting, fail, and conclude they need to fine-tune. Sometimes that's right. Often it's not.
Prompting tells the model what to do in the current context. Fine-tuning changes the model's weights - it changes what the model "knows" and how it behaves by default. Fine-tuning is expensive, requires labeled training data, and produces a model that's harder to update when your requirements change.
The right use case for fine-tuning is when you need the model to consistently produce outputs in a very specific format or style that's hard to specify in a prompt, or when you need to teach the model domain-specific knowledge that isn't in its training data. The wrong use case is when you just need the model to follow instructions more reliably - that's usually a prompting problem, not a fine-tuning problem.
At q.watt, we spent two weeks exploring fine-tuning for a specific output format before realizing the problem was that our prompts were ambiguous. Better prompts solved it in a day. Fine-tuning would have solved it too, but it would have cost more, taken longer, and left us with a model we'd have to retrain every time our output format changed.
The latency problem is a product problem
LLMs are slow compared to everything else in your stack. A database query takes milliseconds. A cache hit takes microseconds. A model inference takes seconds - sometimes many seconds for long outputs. Models are getting faster, but the gap relative to the rest of your stack isn't going to close dramatically in the near term.
For product design, this means you need to think carefully about where LLM calls sit in your user flows. A 3-second wait is fine if the user asked for something that obviously takes time - a summary of a long document, a detailed analysis. It's not fine if it's blocking a UI interaction the user expects to be instant.
Streaming responses - showing text as it's generated rather than waiting for the full response - helps a lot with perceived latency. The user sees something happening immediately, even if the full response takes 5 seconds. This is why almost every chat interface streams. It's not just aesthetics - it's a meaningful improvement in perceived performance.
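The UI pattern behind streaming is just consuming a generator and flushing each chunk as it arrives. This sketch fakes the network side with a plain Python generator - real providers deliver chunks over server-sent events, but the consuming loop looks the same:

```python
def stream_tokens(full_response, chunk_size=4):
    """Fake a streaming API: yield the response a few characters at a time.

    Stands in for a provider's streaming endpoint so the UI loop
    below can be shown without a network call.
    """
    for i in range(0, len(full_response), chunk_size):
        yield full_response[i:i + chunk_size]

# Append each chunk to the UI as it arrives instead of waiting
# for the whole response to finish generating.
for chunk in stream_tokens("Here is a summary of the document you uploaded..."):
    print(chunk, end="", flush=True)
print()
```

Total latency is unchanged - the user just sees progress from the first moment instead of staring at a spinner.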
The other thing to think about is where you can move LLM calls out of the critical path entirely. If you're generating a product description that will be shown to users, you don't need to generate it at the moment the user loads the page. You can generate it asynchronously when the product is created, store it, and serve it from a cache. That turns a 3-second LLM call into a sub-millisecond cache read. Most teams I've seen don't think about this early enough and end up with LLM calls blocking user-facing flows that should be fast.
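The pattern is generate-on-write, read-from-cache. Everything in this sketch is a stand-in: `generate_description` represents the slow LLM call, and the dict represents whatever cache or database you'd actually use:

```python
cache = {}  # stand-in for a real cache or database

def generate_description(product_name):
    """Stand-in for the slow LLM call (seconds in production)."""
    return f"A reliable, well-reviewed {product_name}."

def on_product_created(product_id, product_name):
    """Run the slow call once, off the critical path, and store the result.
    In production this would be a background job, not an inline call."""
    cache[product_id] = generate_description(product_name)

def get_description(product_id):
    """User-facing read: a cache lookup, not an inference call."""
    return cache.get(product_id, "Description coming soon.")

on_product_created("sku-42", "kettle")  # at creation time, asynchronously
print(get_description("sku-42"))        # at page load: sub-millisecond read
```

The fallback string also forces you to decide, up front, what the user sees if generation hasn't finished yet - a product question that's easy to miss until it happens.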
The model is not magic. It's a sophisticated pattern-matching system with specific, predictable failure modes. Understanding those failure modes is what separates PMs who build good LLM products from PMs who build products that work in demos and fail in production.