How photo details get drafted
A photo passed the checks and is ready. Now the reader does the actual tagging: it calls Bedrock vision once, gives it the small photo and the rules from your style doc, and asks for five fields back — a title, alt text, tags, a category, and a short description. The model returns them as plain structured data, with a confidence score on each field, and it is told plainly not to make up anything it can’t see. This is the one place a model is used, and it’s used carefully.
Key takeaways
- One Bedrock Claude Haiku 4.5 vision call per photo — no second model, no loop.
- The prompt includes the style doc: title format, the real tag list, the category list, words to prefer or avoid.
- Five fields come back as structured data: title, alt text, tags, category, description.
- Each field carries a confidence score; the model says when it can’t tell rather than guessing.
- Low-confidence or wrong-looking results route to a human instead of becoming a clean-looking draft.
The drafting flow, per photo
The style doc: what makes the drafts sound like you
A model with no guidance writes generic listings. The style doc is what makes the drafts sound like your shop instead of a stock catalog. It has a few short sections: the title format (“Material + Product + Colour,” or whatever your store uses), the actual list of tags you allow (so the model picks from your tags, not invented ones), the list of categories (so it places the item in a real bucket), and a small word list — words your brand prefers (“midnight,” not “dark blue”) and words to avoid. The doc lives in Drive and is mirrored to S3, so a rep can change a tag or a phrase without touching code.
Because the tag and category lists come from the doc, the model can’t invent a tag you don’t use. If it sees something that doesn’t fit any of your tags, it says so rather than making one up — and that becomes one of the low-confidence signals that routes the photo to a human.
One call, five fields
The reader sends one request to Bedrock Claude Haiku 4.5 with the small photo and the style doc. The prompt is short and firm: “Look at this product photo. Draft these five fields. Return them as plain structured data. For each field, give a confidence score from 0 to 1. Only describe what you can actually see. If you can’t tell the colour, the material, or what the item is, say so — do not guess. If this isn’t a clean product photo, set the not-a-product flag.” The five fields:
- Title. A clear, consistent name in your title format. Short, no filler, no all-caps shouting.
- Alt text. A plain description of the item for a shopper using a screen reader — what it is, its colour, and any obvious feature. This is the field most catalogs skip, and the one that helps both accessibility and search the most.
- Tags. A handful of tags chosen from your allowed list. Not a hundred — the few that actually fit.
- Category. One category from your list. If nothing fits, the model says so rather than forcing it.
- Description. Two or three plain sentences in your house voice. Honest about what the photo shows; no invented features, no made-up materials.
Confidence is the safety valve
The confidence score on each field is what keeps the system honest. A photo of a plain mug on a white background is easy — the model is confident on all five fields, and the draft is ready for the owner. A photo where the colour is ambiguous under odd lighting, or the item is partly out of frame, produces lower confidence on the fields it’s unsure about. The reader compares each score against the threshold in the rules doc and routes accordingly: all fields confident means the draft is ready; some fields weak means the draft is stored but the weak fields are marked so the owner’s eye goes straight to them; the not-a-product flag, or a uniformly low result, means the photo is flagged for a human instead of dressed up as a clean draft.
This is the whole reason the model is asked for confidence rather than just answers. A confident wrong title — “Red Mug” on a photo of a blue one — is worse than no title, because it sails through review when nobody’s looking closely. An honest “I’m not sure of the colour” sends the photo to the right place: a human’s eyes.
Why one call, and why Haiku
The reader makes exactly one model call per photo. No back-and-forth, no second model double-checking the first. One good vision call gets all five fields at once, which is cheaper and simpler than chaining calls, and the confidence scores plus the human review in Part 5 are the safety net — not a second model. Claude Haiku 4.5 is the right model here because the task is bounded and concrete: look at a clear product photo and describe it in five short fields. That doesn’t need the heaviest, most expensive model; the cheap fast one does it well, and the cost page shows just how cheap that makes the whole system.
Next post: how a bad photo gets flagged — the deterministic quality gate from Part 2 and the model’s own not-a-product check, working together so a wrong image goes to a human instead of becoming a tidy draft.
All posts