Design Prompts with Explicit, Actionable Criteria
One of the most common mistakes in production prompt design is relying on vague qualitative instructions instead of specifying concrete, categorical criteria the model can evaluate deterministically. The difference between a useful code review agent and a noisy one often comes down to how the review standards are articulated in the prompt.
Concrete Categories Beat Vague Adjectives
Consider the difference between telling the model to "check that comments are accurate" versus instructing it to "flag any comment whose description contradicts the observable behavior of the surrounding code." The first phrasing gives the model latitude to interpret "accurate" however it wants, leading to inconsistent results. The second defines a precise condition the model can check mechanically.
Similarly, instructions like "be conservative" or "only report findings you're highly confident about" do not measurably improve precision. The model has no reliable internal confidence calibration, so these qualifiers add no filtering value. What actually reduces false positives is defining explicit categories of issues to look for — and equally important, specifying what to ignore.
False Positives Undermine Accurate Categories Too
Even when some categories in your review prompt are well-defined, a high false positive rate in other categories erodes user trust across the board. If developers learn to ignore the tool's output because half the warnings are noise, they'll also miss the legitimate findings. The prompt must be tuned holistically — every category needs to earn its place by maintaining a high signal-to-noise ratio.
Specific, categorical review criteria consistently outperform confidence-based filtering. Define exactly what constitutes a finding — don't ask the model to self-assess how sure it is.
Watch for answer choices that include phrases like "be thorough", "find all issues", or "only report high-confidence findings". These sound reasonable but are ineffective in practice. The correct answer will specify concrete, enumerated criteria.
Apply Few-Shot Prompting for Consistency
Few-shot examples are the single most effective technique for getting consistent, predictable output from Claude in production systems. When the task involves ambiguity — edge cases in classification, nuanced formatting requirements, or domain-specific reasoning — a small set of well-chosen examples communicates expectations far more reliably than lengthy written instructions.
How Many Examples and What Should They Show?
Two to four targeted examples generally hit the sweet spot. Each example should demonstrate not just the desired output format but also the reasoning process that leads to that output. For a code review agent, this might mean showing an example where a suspicious pattern is flagged along with an explanation of why it constitutes a real issue — and a counterexample where a similar-looking pattern is correctly classified as benign.
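The pairing of a flagged example with a benign counterexample can be sketched as a few-shot message list. All content here is illustrative; the labels `FINDING` / `NO FINDING` are hypothetical conventions, not required output formats:

```python
# Hypothetical few-shot setup for a code review classifier: one flagged
# finding and one benign counterexample, each including the reasoning that
# distinguishes them.
FEW_SHOT_MESSAGES = [
    {"role": "user",
     "content": "Review: `# returns the user's age` above code that returns a name string."},
    {"role": "assistant",
     "content": "FINDING: the comment claims an age is returned, but the code "
                "returns a name string, so the comment contradicts behavior."},
    {"role": "user",
     "content": "Review: `# may return None for unknown ids` above a lookup that "
                "returns dict.get(user_id)."},
    {"role": "assistant",
     "content": "NO FINDING: dict.get returns None for missing keys, exactly as "
                "the comment describes."},
]

def build_messages(new_input: str) -> list[dict]:
    # The real request appends the new input after the worked examples.
    return FEW_SHOT_MESSAGES + [{"role": "user", "content": new_input}]
```

The counterexample is doing the precision work here: it shows the model where the boundary of the category lies, not just what a positive case looks like.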
Reducing Hallucination and Enabling Generalization
Few-shot examples anchor the model's behavior in concrete precedent rather than abstract instruction. This reduces the tendency to hallucinate findings or invent categories not specified in the prompt. Crucially, well-chosen examples also help the model generalize to novel patterns — when it sees how you've reasoned about edge cases A, B, and C, it can apply analogous reasoning to edge case D even though D wasn't explicitly covered.
Few-shot examples are the most effective technique for achieving consistent output. They demonstrate format, reasoning, and handling of ambiguity simultaneously — something that instructions alone cannot accomplish as reliably.
Enforce Structured Output with Tool Use and JSON Schemas
When your system requires guaranteed schema-compliant output — not "usually valid JSON" but always valid JSON matching a specific schema — the most reliable approach is tool_use combined with a JSON schema definition. This leverages the API's built-in enforcement mechanism rather than relying on prompt instructions alone.
Tool Choice Modes
The tool_choice parameter controls how the model selects tools:
- "auto" — the model decides whether to call a tool or respond with plain text. Useful when tool use is optional.
- "any" — the model must call some tool but can choose which one from the available set.
- Forced tool selection — you specify a particular tool by name, guaranteeing the model calls exactly that tool. This is the strongest guarantee for structured extraction.
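A forced-tool request for structured extraction might be assembled like this. The tool name, schema fields, and model string are hypothetical placeholders; the `tool_choice` shape follows the Messages API convention of `{"type": "tool", "name": ...}`:

```python
# Sketch of a forced-tool request body for structured extraction.
INVOICE_TOOL = {
    "name": "record_invoice",
    "description": "Record structured fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_total": {"type": "number"},
            "vendor": {"type": "string"},
        },
        "required": ["invoice_total", "vendor"],
    },
}

request_params = {
    "model": "claude-model-placeholder",  # substitute a real model id
    "max_tokens": 1024,
    "tools": [INVOICE_TOOL],
    # Forced selection: the model must call exactly this tool, so the
    # response is guaranteed to match the tool's input_schema.
    "tool_choice": {"type": "tool", "name": "record_invoice"},
    "messages": [{"role": "user", "content": "Extract fields from: ..."}],
}
```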
Structure vs. Semantics: A Critical Distinction
Strict JSON schemas eliminate syntax errors — you'll never get malformed JSON, missing required fields, or wrong data types. But schemas do not prevent semantic errors. For example, an invoice extraction tool might output a perfectly schema-valid response where the individual line items don't sum to the stated total. The structure is correct; the content is wrong.
Schema Design Best Practices
Thoughtful schema design reduces the surface area for errors:
- Required vs. optional fields: Mark fields as required only when the source document reliably contains that information. Making everything required forces the model to fabricate values when data is absent.
- Enums with an "other" escape hatch: When defining categorical fields, include an "other" value paired with a freeform detail string. This prevents the model from force-fitting novel inputs into the wrong category.
- Nullable fields: Explicitly allow null for fields that may not exist in every document. This gives the model a safe way to say "not found" rather than inventing a plausible value.
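All three practices can be seen in a single schema. This is a hypothetical document schema, assuming a pipeline where vendor and category are always present but a due date may not be:

```python
# Hypothetical schema applying the practices above: only reliably present
# fields are required, the category enum has an "other" escape hatch with a
# free-form detail string, and due_date is explicitly nullable.
DOCUMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "category": {
            "type": "string",
            "enum": ["utilities", "supplies", "services", "other"],
        },
        "category_detail": {
            "type": "string",
            "description": "Free-form label when category is 'other'.",
        },
        "due_date": {
            "type": ["string", "null"],
            "description": "ISO date, or null when the document states none.",
        },
    },
    "required": ["vendor", "category"],  # optional fields stay optional
}
```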
tool_use guarantees structural compliance only, not semantic correctness. Your schema will always be syntactically valid, but the values within may still be wrong. Semantic validation requires separate business-logic checks.
Don't fall for answers that imply tool_use output is always correct because it matches the schema. Schema conformance eliminates syntax issues, not content errors. The exam will test whether you understand this distinction.
Your data pipeline needs to extract structured information from unstructured documents and guarantee the output matches a predefined schema. Which approach provides the strongest guarantee of schema-compliant output?
tool_use with a JSON schema and forced tool_choice is the only approach that provides a built-in API-level guarantee of schema compliance. Prompt instructions and few-shot examples improve consistency but can't guarantee compliance, and post-processing retries add latency while still depending on the model eventually producing valid output.
Implement Validation, Retry, and Feedback Loops
Even with structured output guarantees, the extracted data may contain errors that require correction. An effective retry strategy doesn't just say "try again" — it appends specific, actionable error details to the prompt so the model knows exactly what to fix.
Retry with Targeted Error Feedback
When validation fails, the retry prompt should include the exact nature of the error: which field failed, what rule it violated, and what the expected vs. actual values were. For example: "The line_items total ($2,340) does not match the stated invoice_total ($2,430). Please re-extract and verify the amounts." This gives the model a concrete signal it can act on.
In contrast, a generic retry like "There was a validation error. Please try again." provides zero diagnostic information. The model is essentially guessing what went wrong, which often produces the same mistake or introduces new ones.
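The two retry styles can be contrasted in a small validate-and-retry loop. This is a minimal sketch: the model call is stubbed out as a plain function argument, and the invoice field names mirror the example above rather than any fixed schema:

```python
# Sketch of a retry loop that feeds field-level error details back to the
# model. call_model stands in for a real API request.
def validate(extraction: dict) -> list[str]:
    """Return specific, actionable error messages (empty list means valid)."""
    errors = []
    calculated = sum(item["amount"] for item in extraction.get("line_items", []))
    stated = extraction.get("invoice_total")
    if stated is not None and round(calculated, 2) != round(stated, 2):
        errors.append(
            f"line_items sum to ${calculated:.2f} but invoice_total is "
            f"${stated:.2f}. Re-extract and verify the amounts."
        )
    return errors

def extract_with_retry(document: str, call_model, max_retries: int = 2) -> dict:
    prompt = f"Extract invoice fields from:\n{document}"
    extraction = call_model(prompt)
    for _ in range(max_retries):
        errors = validate(extraction)
        if not errors:
            break
        # Append the exact failure details so the retry has a concrete signal,
        # instead of a generic "validation failed, please try again."
        prompt += "\n\nYour previous extraction failed validation:\n" + "\n".join(errors)
        extraction = call_model(prompt)
    return extraction
```

The key design choice is that `validate` returns messages written for the model, not just booleans for the pipeline: each message names the field, the rule, and the expected versus actual values.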
When Retries Don't Help
Retries are ineffective when the required information simply isn't present in the source document. If a field is missing from the input, no amount of retrying will conjure a correct value — the model will either keep producing the same fabrication or switch to a different one. Your validation logic should distinguish between "the model made a correctable error" and "the source data doesn't contain this information."
Tracking Patterns and Self-Correction
For production pipelines processing many documents, consider adding detected_pattern fields to your output schema so the model can flag recurring issues — for example, a specific document type that consistently triggers a particular false positive. This metadata helps you refine the prompt iteratively.

A powerful self-correction technique is to have the model extract both a calculated_total (summed from individual items) and a stated_total (read directly from the document). If these diverge, the system automatically flags the discrepancy for review, catching errors the schema alone can't detect.
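The cross-check on the two totals is a few lines of business logic. The field names follow the text above; the tolerance value is an illustrative assumption:

```python
# Sketch of the calculated_total vs. stated_total cross-check: any
# divergence beyond a small tolerance is flagged for human review.
def check_totals(extraction: dict, tolerance: float = 0.01) -> dict:
    calculated = extraction["calculated_total"]
    stated = extraction["stated_total"]
    flagged = abs(calculated - stated) > tolerance
    return {
        "flagged_for_review": flagged,
        "discrepancy": round(calculated - stated, 2),
    }
```

For instance, an extraction with `calculated_total` of 2340 and `stated_total` of 2430 would be flagged with a discrepancy of -90, even though it passed schema validation.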
Specific error details in retry prompts guide the model toward correction. A generic "try again" message provides no useful signal and typically does not improve results.
Watch for answer choices that describe retry strategies with generic error messages like "validation failed, please correct." The correct approach always includes specific field-level error details — which field, what was wrong, and what was expected.
Design Efficient Batch Processing Strategies
The Message Batches API offers 50% cost savings compared to synchronous requests, but with an important tradeoff: requests are processed within a 24-hour window with no guaranteed latency SLA. This makes batch processing ideal for specific workload types and completely wrong for others.
When Batch Processing Is Appropriate
- Overnight report generation: Summaries, analytics, and dashboards that need to be ready by morning but don't block any real-time process.
- Weekly or nightly audits: Code quality reviews, compliance scans, or documentation checks that run on a schedule.
- Nightly test generation: Creating test cases from production logs or specification documents where results are consumed the next day.
When Batch Processing Is NOT Appropriate
Any workflow that blocks a developer or process from proceeding should use the synchronous API. Pre-merge CI checks are the canonical example — a pull request cannot be merged until the review completes, so a potential 24-hour delay is unacceptable. The cost savings don't matter if they create a bottleneck in the development workflow.
Technical Constraints
Batch requests do not support multi-turn tool calling within a single request. Each batch item is a standalone request-response pair. Use custom_id fields to correlate requests with responses when processing results — this is how you match each output back to the input that generated it.
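The correlation step looks like this. Results here are plain dicts standing in for the SDK's result objects, which carry the same custom_id field, and batch results may arrive in a different order than the requests were submitted:

```python
# Sketch of matching batch outputs back to their inputs via custom_id.
inputs = {
    "doc-001": "First document text...",
    "doc-002": "Second document text...",
}

# Hypothetical results, returned in a different order than submitted.
results = [
    {"custom_id": "doc-002", "output": "summary of second"},
    {"custom_id": "doc-001", "output": "summary of first"},
]

# custom_id is the only reliable join key between request and response.
matched = {r["custom_id"]: {"input": inputs[r["custom_id"]], "output": r["output"]}
           for r in results}
```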
Use the Batch API for latency-tolerant workloads (overnight reports, nightly audits) and the synchronous API for blocking workflows (pre-merge checks, real-time user interactions). The decision hinges entirely on whether anything is waiting on the result.
Your engineering manager proposes using the Message Batches API for two workloads: (1) pre-merge code review checks that block PR merging, and (2) a nightly audit that scans the full codebase for style violations. Which workloads should use the Batch API?
Design Multi-Instance and Multi-Pass Review Architectures
When a single model instance reviews its own output, there's an inherent limitation: it retains the reasoning context from the generation phase. This means it's systematically less likely to question decisions it already justified to itself. This is the fundamental problem with same-session self-review.
Why Independent Review Instances Are More Effective
A separate model instance — running in a fresh session with no memory of the generation process — evaluates the output on its own merits. It doesn't know the reasoning that led to each decision, so it can assess the output more objectively. This is analogous to how code review works on engineering teams: the reviewer wasn't present during implementation and evaluates the code without the author's mental context.
Multi-Pass Architecture for Large Reviews
For complex tasks like reviewing a 14-file pull request, a single pass over all files produces uneven results — some files get detailed feedback while others receive shallow analysis. The solution is to decompose the review into focused passes:
- Per-file local analysis: Each file is reviewed individually in a dedicated pass, ensuring consistent depth and attention across all files.
- Cross-file integration pass: A separate pass examines how the files interact — checking data flow consistency, interface contracts, and architectural coherence that only become visible when considering multiple files together.
This decomposition ensures that neither local detail nor global coherence is sacrificed.
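The two-phase decomposition can be sketched as a small orchestrator. The `review_file` and `review_integration` arguments stand in for independent model calls (run in fresh sessions, per the point above about same-session bias):

```python
# Sketch of a multi-pass review: a dedicated per-file pass for consistent
# local depth, then a separate cross-file pass over the whole changeset.
def run_multi_pass_review(files: dict, review_file, review_integration) -> dict:
    # Pass 1: each file reviewed individually, so no file gets shallow
    # treatment just because it appeared late in a long context.
    local_findings = {path: review_file(path, source)
                      for path, source in files.items()}
    # Pass 2: one pass over all files together for data flow, interface
    # contracts, and architectural coherence.
    integration_findings = review_integration(files)
    return {"local": local_findings, "integration": integration_findings}
```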
Use separate sessions for generation and review. A model reviewing its own output in the same session retains reasoning context that biases the review. Independent instances produce more objective evaluations.