Prompt Engineering Essentials — Beyond the Polite Question
Prompt Engineering Is Real, Just Not Magic
The phrase "prompt engineering" sounds like marketing. The work it describes is real. A poorly written prompt produces vague, hedged, or off-topic output from even the best models. The same prompt, restructured, can turn the same model into something genuinely useful. The difference is not in the model — it is in how the request is shaped.
This article is the working version of prompt engineering for engineers, not the LinkedIn-influencer version with dramatic before-and-after screenshots. It covers the named techniques, what each one is for, and the small set you actually need to know to ship LLM features that work.
If you have read How LLMs Actually Work, this article is the practical layer on top.
The Default Prompt Mistake
A new developer's first prompt usually looks like:
Summarize this article.
Followed by 3000 words of article. The model produces a summary. Sometimes it is fine. Often it is too long, or too short, or a bullet list when you wanted a paragraph, or focuses on the wrong aspects. You re-prompt with "make it shorter" and end up in a back-and-forth that consumes tokens and time.
The problem is not that "summarize this article" is unclear. It is that the model has no information about the audience, the format, the length, or the criterion for what to include. It is forced to guess. Different runs produce different guesses. That is the whole problem.
Good prompt engineering is mostly removing those guesses.
The Anatomy of a Production Prompt
A production-quality prompt has a predictable structure:
- Role/persona — who the model is in this conversation.
- Task — what specifically you want it to do.
- Context — the data it should work from.
- Format — exactly what the output should look like.
- Constraints — what to do, what not to do.
- Examples — concrete demonstrations of inputs and outputs.
You will not always need all six. Most production prompts use three or four. Here is the same article-summary task done well:
You are a technical content editor. Summarize the article below
into 3 bullet points for a busy software engineer who wants to
decide in 15 seconds whether to read the full article.
Each bullet:
- Is one sentence, max 30 words.
- Names a specific claim or finding from the article.
- Avoids meta-statements ("the article discusses").
ARTICLE:
[paste here]
The model now has the role (editor), the task (a 3-bullet summary), the audience (a busy engineer), the reader's decision budget (15 seconds), and the constraints (one sentence, 30 words max, no meta-statements). The output gets dramatically more consistent because most of the guesses have been removed.
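The structure above is mechanical enough to capture in a helper. A minimal sketch, with illustrative names and no particular SDK assumed, that assembles the sections in a stable order:

```javascript
// Sketch: assemble the prompt sections into one string.
// Section order is fixed so diffs between prompt versions stay readable.
function buildPrompt({ role, task, context, format, constraints = [], examples = [] }) {
  const parts = [];
  if (role) parts.push(`You are ${role}.`);
  if (task) parts.push(task);
  if (constraints.length) {
    parts.push('Constraints:\n' + constraints.map(c => `- ${c}`).join('\n'));
  }
  if (format) parts.push(`Output format: ${format}`);
  if (examples.length) {
    parts.push(examples.map(e => `INPUT: ${e.input}\nOUTPUT: ${e.output}`).join('\n\n'));
  }
  if (context) parts.push(`ARTICLE:\n${context}`);
  return parts.join('\n\n');
}

const prompt = buildPrompt({
  role: 'a technical content editor',
  task: 'Summarize the article below into 3 bullet points for a busy software engineer.',
  constraints: ['One sentence per bullet, max 30 words', 'No meta-statements'],
  context: '...article text...',
});
```

A helper like this also makes prompt versioning easier later: each section lives in one place instead of being interleaved through a template string.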
The Named Techniques Worth Knowing
Five techniques cover most production prompting needs. Each has a specific use case.
1. System prompts vs user prompts
Modern chat APIs accept multiple messages with different roles. The system message sets behavior for the whole conversation. The user message is what you would type. Models give the system message higher weight — it is a stable instruction, not part of the back-and-forth.
messages: [
{ role: 'system', content: 'You are a code reviewer for senior engineers. Be specific and direct. Skip platitudes.' },
{ role: 'user', content: 'Review this Python function: ...' }
]
Persona, output format, and refusal rules belong in the system message. The actual question belongs in the user message. Mixing them — putting persona instructions in every user message — wastes tokens and gets less reliable behavior.
2. Few-shot prompting
Show the model 2-5 examples of the input/output pattern you want. The model picks up the pattern from the examples better than it picks it up from any description.
Convert these to past tense:
INPUT: I am running fast.
OUTPUT: I was running fast.
INPUT: She writes a letter.
OUTPUT: She wrote a letter.
INPUT: They are planning a trip.
OUTPUT:
Three examples is the sweet spot for most tasks. One example is often enough for trivial transformations. More than five rarely helps and starts costing real tokens. Use few-shot when the format is unusual or the rules are easier to demonstrate than to describe.
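The pattern above can be assembled programmatically, which keeps example pairs in data rather than hardcoded in a template. A minimal sketch, with illustrative names:

```javascript
// Sketch: build a few-shot prompt from example pairs plus the new input.
// Mirrors the INPUT/OUTPUT layout of the past-tense example above.
function fewShotPrompt(instruction, examples, input) {
  const shots = examples
    .map(e => `INPUT: ${e.input}\nOUTPUT: ${e.output}`)
    .join('\n\n');
  // End with a bare "OUTPUT:" so the model completes the pattern.
  return `${instruction}\n\n${shots}\n\nINPUT: ${input}\nOUTPUT:`;
}

const prompt = fewShotPrompt(
  'Convert these to past tense:',
  [
    { input: 'I am running fast.', output: 'I was running fast.' },
    { input: 'She writes a letter.', output: 'She wrote a letter.' },
  ],
  'They are planning a trip.'
);
```

Keeping the examples in a plain array also makes it cheap to swap them when you A/B test the prompt.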
3. Chain-of-thought (CoT)
Add the phrase "Think step by step before answering", or include an example where the reasoning appears before the final answer. The model then produces visible reasoning, which makes it more likely to get hard problems right (math, logic, multi-step planning).
You are a math tutor. For each problem, show the steps,
then state the final answer on its own line starting with "ANSWER:".
Problem: A train leaves at 10:30am traveling 60mph...
CoT works because each new token is conditioned on all previous tokens — letting the model write out the steps gives it more context for the final answer. The cost is more output tokens. For tasks where correctness matters more than length, the trade is usually worth it.
Modern models (GPT-4, Claude 3.5+, Gemini 2.5+) do CoT-style reasoning automatically when they see hard problems. The technique is most useful when you want structured reasoning the user can read and verify.
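When you pin the answer to a sentinel line, as the tutor prompt above does with "ANSWER:", extracting it downstream is a one-liner. A small sketch of that parsing step, illustrative only:

```javascript
// Sketch: pull the final answer out of a CoT response that follows the
// "ANSWER:" convention from the tutor prompt above.
function extractAnswer(response) {
  const match = response.match(/^ANSWER:\s*(.+)$/m);
  return match ? match[1].trim() : null; // null = model broke the format
}

const reply = 'Step 1: distance = 60 * 1.5 = 90 miles.\nANSWER: 90 miles';
extractAnswer(reply); // → '90 miles'
```

The null branch matters in production: when the model breaks the format, you want to detect it and retry, not silently show the reasoning to the user.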
4. Output format pinning
If you want JSON, ask for JSON explicitly and show the schema:
Respond with valid JSON matching this schema:
{
"summary": "string, 1-2 sentences",
"topics": ["string", "string", ...],
"sentiment": "positive" | "negative" | "neutral"
}
Do not include any text outside the JSON.
Most modern APIs also support structured outputs as a first-class feature — you pass a JSON schema and the API guarantees the response matches it. OpenAI's response_format, Anthropic's tool use, and Google's responseSchema all do this. When the API supports it, use it instead of asking nicely in the prompt — guaranteed parseable JSON saves real grief.
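When structured output isn't available and you're relying on the prompt alone, parse defensively and validate the fields you asked for. A sketch, assuming the schema above; the function name is illustrative:

```javascript
// Sketch: parse a model response that should match the summary schema,
// returning null on any deviation so the caller can retry or fall back.
function parseSummaryResponse(text) {
  // Strip markdown fences some models wrap JSON in anyway.
  const cleaned = text.replace(/^```(?:json)?\s*|\s*```$/g, '').trim();
  let data;
  try {
    data = JSON.parse(cleaned);
  } catch {
    return null;
  }
  const ok =
    typeof data.summary === 'string' &&
    Array.isArray(data.topics) &&
    ['positive', 'negative', 'neutral'].includes(data.sentiment);
  return ok ? data : null;
}
```

Validating the shape, not just parsing, is the point: "valid JSON with the wrong keys" is the most common failure mode of prompt-only format pinning.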
5. Decomposition
Split a complex task into multiple model calls. One call generates an outline; a second call writes the body using the outline; a third call edits for length. Each call is simpler than the all-in-one alternative, easier to debug, and easier to swap individual stages.
This is exactly what production agent frameworks (LangChain, the abc-blogger Panchadeva pipeline, AutoGPT) do under the hood. They orchestrate many small focused prompts instead of one large prompt that tries to do everything.
The trade-off is more latency and more tokens. For tasks where quality matters more than speed (writing, code generation, research), decomposition routinely produces better output than monolithic prompting.
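The outline-body-edit pipeline above can be sketched as three sequential calls. `callModel` here stands in for whatever chat-completion client you use; the prompts are illustrative:

```javascript
// Sketch: three focused calls instead of one monolithic prompt.
// Each stage's output becomes the next stage's context.
async function writeArticle(topic, callModel) {
  const outline = await callModel(
    `Write a 5-point outline for an article on: ${topic}`
  );
  const draft = await callModel(
    `Write the article body following this outline:\n${outline}`
  );
  const edited = await callModel(
    `Edit for concision, target 800 words:\n${draft}`
  );
  return edited;
}
```

Because each stage is a plain function of text in, text out, you can log, cache, or swap any stage independently, which is where the debuggability win comes from.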
The Production Wrinkles
Five techniques will get you most of the way. The remaining gap is the operational layer that turns prompting from a craft into something you can ship.
Temperature for the task. Set temperature low (0.1-0.3) when you want consistency: extracting data, classifying, formatting, summarizing. Set it higher (0.6-0.8) when you want creative variation: writing, brainstorming. The default of 0.7-1.0 is too high for most utility prompts.
Max tokens as a hard cap. Without a max_tokens limit, a model occasionally produces a 4000-token response when you wanted 200. Set a cap that is generous for the task and trust it.
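Both settings are easiest to keep honest as a per-task table rather than scattered literals. A sketch; the task names and numbers are illustrative, and the parameter names follow the common chat-completion shape:

```javascript
// Sketch: per-task request settings, kept in one place so changes
// are reviewable. Adjust names to match your SDK.
const TASK_SETTINGS = {
  extract:    { temperature: 0.2, max_tokens: 500 },
  classify:   { temperature: 0.1, max_tokens: 50 },
  brainstorm: { temperature: 0.8, max_tokens: 1000 },
};
```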
Prompt versioning. Treat prompts like code. Store them in version control. When you change a prompt, A/B test the new version against the old before rolling it out — small wording changes have surprisingly large effects on output quality.
Caching. If the same prompt produces the same output with temperature=0, cache the result. Anthropic and Google both offer prompt caching that lets you reuse a long prefix across many requests at much lower cost. For high-volume applications this can cut bills significantly.
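The application-side version of this is a cache keyed on the full prompt. A minimal in-memory sketch, assuming deterministic (temperature=0) calls; swap the Map for Redis or similar in real deployments:

```javascript
// Sketch: cache deterministic completions keyed on the full prompt.
// Only valid when the call is deterministic (temperature = 0).
const cache = new Map();

async function cachedComplete(prompt, callModel) {
  if (cache.has(prompt)) return cache.get(prompt);
  const result = await callModel(prompt);
  cache.set(prompt, result);
  return result;
}
```

Note the cache key must include everything that affects the output, so if you vary the model or settings per request, fold those into the key too.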
Eval datasets. Before changing a production prompt, run it against 20-50 representative inputs and compare outputs to a reference. Tools like Promptfoo, Braintrust, or LangSmith automate this. The amount of breakage you catch this way is large enough that the time investment pays back inside a month.
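A bare-bones version of such an eval loop fits in a dozen lines. A sketch; the comparison here is exact string match, where real setups use fuzzier scoring or an LLM judge:

```javascript
// Sketch: run a prompt over a case set and count exact matches
// against reference outputs.
async function runEval(cases, callModel) {
  let passed = 0;
  for (const { input, expected } of cases) {
    const output = await callModel(input);
    if (output.trim() === expected.trim()) passed++;
  }
  return { passed, total: cases.length };
}
```

Even this crude loop catches the regressions that matter most: the inputs that used to pass and no longer do after a prompt change.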
Guardrails and moderation. Production LLM features need to refuse off-topic, unsafe, or out-of-policy inputs. The cleanest pattern is a small fast model that classifies the input first, and a larger model that handles only the requests that pass classification. Run moderation on outputs too.
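The classify-then-answer pattern can be sketched as two composed calls. `classify` stands in for the small fast model and `answer` for the larger one; both names and the verdict strings are illustrative:

```javascript
// Sketch: gate the expensive model behind a cheap classifier.
// `classify` resolves to 'allow' or 'block'; `answer` is the main model.
async function guardedAnswer(userInput, classify, answer) {
  const verdict = await classify(userInput);
  if (verdict !== 'allow') {
    return "Sorry, I can't help with that request.";
  }
  return answer(userInput);
}
```

Treating anything other than an explicit 'allow' as a refusal is deliberate: a classifier error or timeout should fail closed, not open.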
What Stops Mattering
Some techniques you will see in older articles are not worth optimizing for in 2026:
- "Polite" prompting. "Please" and "thank you" have negligible effect on modern models. The instruction matters; the politeness does not.
- Token-by-token magic words. Older models occasionally responded to "Take a deep breath and think carefully" — modern frontier models do this internally already. Save the tokens.
- Adversarial encoding. Tricks to bypass safety filters mostly do not work on current models and create reputational risk.
- Long persona stories. "You are an expert with 30 years of experience..." adds little to a model that has been trained on enough text to know what an expert sounds like. A two-sentence role definition is enough.
The signal-to-noise ratio of prompt-engineering content is poor. The techniques in this article are the ones that survive across model generations because they correspond to how transformer models actually work.
A Practical Workflow
When I am building a new LLM feature, the workflow is:
- Write the simplest prompt that might work. Two sentences. No examples.
- Run it on 5 real inputs. Read the outputs critically. Note where they fail.
- Add the smallest fix for each failure mode. Add an example, tighten a constraint, set a format. One change at a time.
- Re-run on the same 5 inputs. Check that earlier outputs still look good.
- Expand to 20 inputs once the prompt is stable. Build an eval set.
- Lock the prompt and the model. Version both. Anything that touches them flows through the eval.
This loop turns prompt engineering from "type things and hope" into something with a feedback signal. It is not glamorous, but it is the difference between a working LLM feature and an unreliable one.
Where This Fits
Lesson 17 of the ABCsteps curriculum is hands-on with prompts for the project. This article gives you the named techniques the lesson uses — system prompts, few-shot, chain-of-thought, structured outputs, decomposition. With that vocabulary, every choice the lesson makes is recognizable as a deliberate technique, not magic. Once you have shipped one prompt-driven feature this way, the others get easier fast.