Here is a misconception that wastes developer hours: Nano Banana 2 image editing and image generation are separate modes requiring different API endpoints, different authentication, or different request structures. They are not. They use the same generateContent endpoint, the same auth header, the same response format, and the same $0.058 pricing on Wisdom Gate. The only difference between a generation request and an editing request is what you put in the contents array. Understanding that one distinction is what unlocks the full Nano Banana 2 core feature set for iterative creative workflows.
This article breaks down exactly how both modes work, when to use each one, how to pass image context from one turn to the next, and how to manage state in your application layer. By the end, you will have the complete mental model and working code to build AI image tools that handle both creation and refinement through a single, unified integration.
🚀 One endpoint. Two modes. Zero endpoint-switching overhead. Start building with Nano Banana 2 on Wisdom Gate today — open AI Studio for no-code testing, or grab your key at wisdom-gate.juheapi.com/hall/tokens and hit the API directly. $0.058/image, 20-second generation, 0.5K to 4K.
The Core Insight: Editing Is Just Iterative Generation
Before we get to code, the mental model matters. In most traditional image editing APIs, editing is a distinct operation — you call a separate endpoint, pass a mask, specify an inpainting region, and receive a patched result. This is not how Nano Banana 2 works.
In the Gemini 3.1 Flash unified transformer architecture, every request goes through the same generateContent endpoint, whether it produces a brand-new image from a text prompt or refines an existing image based on an instruction. The model is not "editing pixels"; it is continuing a visual conversation. When you include a previously generated image in the request alongside a new instruction, you are giving the model the same kind of context a human designer would have when receiving feedback on a draft: here is what I made, here is what needs to change, generate the next version.
This "iterative generation" framing changes how you architect your integration. Instead of building a generation pipeline and a separate editing pipeline, you build one pipeline that maintains a growing conversation history. Editing is generation with context.
Mode 1 — Text-to-Image Generation
What It Is
Text-to-image generation is the entry point of every creative session. You send a text prompt describing the desired output, and the model generates an image from scratch. No image input. No conversation history. Just a prompt and a configuration.
When to Use It
Use pure text-to-image generation when:
- You are creating the first asset in a session
- The user is describing something entirely new
- You have no reference image to refine
- You want to generate multiple independent variants from the same prompt
The Payload Structure
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Da Vinci style anatomical sketch of a dissected Monarch butterfly. Detailed drawings of the head, wings, and legs on textured parchment with notes in English."
        }
      ]
    }
  ],
  "generationConfig": {
    "responseModalities": ["TEXT", "IMAGE"],
    "imageConfig": {
      "aspectRatio": "1:1",
      "imageSize": "2K"
    }
  }
}
The contents array has exactly one entry: a user turn with a single text part. No image history. No inlineData part. The model starts fresh.
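As a minimal sketch, the same payload can be assembled in JavaScript. The helper name `buildGenerationBody` and its option defaults are assumptions made for illustration; only the field names come from the payload shown above.

```javascript
// Hypothetical helper (not part of any SDK): builds a text-to-image request body
// with exactly one user turn and a single text part.
function buildGenerationBody(prompt, { aspectRatio = "1:1", imageSize = "2K" } = {}) {
  if (typeof prompt !== "string" || prompt.trim() === "") {
    throw new Error("prompt must be a non-empty string");
  }
  return {
    contents: [{ role: "user", parts: [{ text: prompt }] }],
    generationConfig: {
      responseModalities: ["TEXT", "IMAGE"],
      imageConfig: { aspectRatio, imageSize },
    },
  };
}

const generationBody = buildGenerationBody(
  "Da Vinci style anatomical sketch of a dissected Monarch butterfly."
);
console.log(generationBody.contents.length); // 1
```

Validating the prompt before the call is a cheap way to avoid spending a request on an empty string.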
Full cURL Implementation
curl -s -X POST \
"https://wisdom-gate.juheapi.com/v1beta/models/gemini-3.1-flash-image-preview:generateContent" \
-H "x-goog-api-key: $WISDOM_GATE_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"role": "user",
"parts": [{
"text": "Da Vinci style anatomical sketch of a dissected Monarch butterfly. Detailed drawings of the head, wings, and legs on textured parchment with notes in English."
}]
}],
"tools": [{"google_search": {}}],
"generationConfig": {
"responseModalities": ["TEXT", "IMAGE"],
"imageConfig": {
"aspectRatio": "1:1",
"imageSize": "2K"
}
}
}' | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' \
| head -1 | base64 --decode > butterfly_turn1.png
This is your Turn 1. Store the output — you will need it for Turn 2.
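The jq filter above keeps the first inlineData part of the response. A rough JavaScript equivalent, which also collects any text parts, might look like the sketch below. The function name `parseResponseParts` is hypothetical, and the mock response is hand-written for illustration, not real API output.

```javascript
// Splits a generateContent response into its image and text parts.
// Assumes the response shape used in this article:
// candidates[0].content.parts[], each part holding either `text` or `inlineData`.
function parseResponseParts(response) {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  const imagePart = parts.find((p) => p.inlineData);
  const text = parts
    .filter((p) => typeof p.text === "string")
    .map((p) => p.text)
    .join("\n");
  return {
    imageB64: imagePart ? imagePart.inlineData.data : null,
    mimeType: imagePart ? imagePart.inlineData.mimeType : null,
    text: text === "" ? null : text,
  };
}

// Mock response for illustration only.
const mockResponse = {
  candidates: [{
    content: {
      parts: [
        { text: "A parchment sketch of a butterfly." },
        { inlineData: { mimeType: "image/png", data: "aGVsbG8=" } },
      ],
    },
  }],
};

const parsed = parseResponseParts(mockResponse);
console.log(parsed.text);              // A parchment sketch of a butterfly.
console.log(parsed.imageB64 !== null); // true
```

Returning null instead of throwing leaves the caller free to decide whether a missing image is an error.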
Pro Tip: Always set "responseModalities": ["IMAGE"] (without TEXT) for pure generation pipelines where you only need the image and a text response would break your parsing logic. Use ["TEXT", "IMAGE"] when you want the model to also return a description of what it generated, which is useful for generating alt text or changelogs alongside the image.
Mode 2 — Image-to-Image Editing
What It Is
Nano Banana 2 image editing mode is a continuation of a generation session. You pass the full conversation history, including the previous model response with its generated image as inlineData, alongside a new text instruction. The model processes both the visual context and the new instruction together, generating a refined version of the previous output.
When to Use It
Use image-to-image editing when:
- A user says "make the wings blue," "remove the background," or "add a border"
- You are building an iterative refinement tool (design review cycles, content approval workflows)
- You need to apply a sequence of targeted changes without regenerating from scratch
- You want to maintain visual consistency across a multi-turn creative session
The Payload Structure — The Critical Difference
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Da Vinci style anatomical sketch of a dissected Monarch butterfly. Detailed drawings of the head, wings, and legs on textured parchment with notes in English."
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "inlineData": {
            "mimeType": "image/png",
            "data": "<BASE64_IMAGE_FROM_TURN_1>"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Add a bold red rectangular border around the entire image. Keep the butterfly sketch identical."
        }
      ]
    }
  ],
  "generationConfig": {
    "responseModalities": ["TEXT", "IMAGE"],
    "imageConfig": {
      "aspectRatio": "1:1",
      "imageSize": "2K"
    }
  }
}
The structural difference is the contents array. In generation mode, it has one entry. In editing mode, it has three: the original user prompt, the model's prior response with the image as inlineData, and the new editing instruction. The model reads all three and generates the next version.
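The three-entry structure can be sketched as two small turn constructors plus a sanity check. All three function names are hypothetical. The alternation rule in `checkContents` reflects the convention used in this article's payloads and is an assumption, not a documented API requirement.

```javascript
// Hypothetical constructors for the two kinds of turn used in this article.
function toUserTurn(text) {
  return { role: "user", parts: [{ text }] };
}
function toModelTurn(imageB64, mimeType = "image/png") {
  return { role: "model", parts: [{ inlineData: { mimeType, data: imageB64 } }] };
}

// Returns null when roles alternate user, model, user, ... and the last entry
// is a user turn; otherwise returns a short description of the problem.
function checkContents(contents) {
  if (contents.length === 0) return "contents is empty";
  for (let i = 0; i < contents.length; i++) {
    const expected = i % 2 === 0 ? "user" : "model";
    if (contents[i].role !== expected) return `entry ${i} should be "${expected}"`;
  }
  if (contents.length % 2 === 0) return "last entry must be a user turn";
  return null;
}

const editContents = [
  toUserTurn("Da Vinci style anatomical sketch of a dissected Monarch butterfly."),
  toModelTurn("<BASE64_IMAGE_FROM_TURN_1>"),
  toUserTurn("Add a bold red rectangular border around the entire image."),
];

console.log(editContents.length);         // 3
console.log(checkContents(editContents)); // null
```

Running a check like this before each request catches a history that was truncated or appended out of order.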
The Multi-Turn Workflow: Step-by-Step Implementation
Turn 1 — Generate and Capture
Generate your initial image and capture the base64 output for use in the next turn:
# Generate Turn 1 and save the base64 data
TURN1_B64=$(curl -s -X POST \
"https://wisdom-gate.juheapi.com/v1beta/models/gemini-3.1-flash-image-preview:generateContent" \
-H "x-goog-api-key: $WISDOM_GATE_KEY" \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"role": "user",
"parts": [{"text": "Da Vinci style anatomical sketch of a dissected Monarch butterfly. Detailed drawings of the head, wings, and legs on textured parchment with notes in English."}]
}],
"generationConfig": {
"responseModalities": ["IMAGE"],
"imageConfig": {"aspectRatio": "1:1", "imageSize": "2K"}
}
}' | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' | head -1)
# Save Turn 1 image to disk
echo "$TURN1_B64" | base64 --decode > butterfly_turn1.png
echo "Turn 1 complete. Base64 captured for Turn 2."
The key step here is storing $TURN1_B64, not just the PNG file. The base64 string is what you will pass as inlineData in Turn 2.
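If only the PNG bytes are on hand (for example, after reading the saved file back in a Node.js service), the base64 string can be recovered from them. The sketch below assumes a Node.js runtime, where `Buffer` is a global, and uses hypothetical helper names. The demo input is just the 8-byte PNG signature, not a real image.

```javascript
// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Encodes PNG bytes as base64, rejecting input that does not start with the signature.
function pngBytesToBase64(bytes) {
  const buf = Buffer.from(bytes);
  const isPng = PNG_SIGNATURE.every((b, i) => buf[i] === b);
  if (!isPng) throw new Error("not a PNG: signature mismatch");
  return buf.toString("base64");
}

// Decodes a base64 string back into raw bytes.
function base64ToBytes(b64) {
  return Buffer.from(b64, "base64");
}

const signatureOnly = Uint8Array.from(PNG_SIGNATURE);
const signatureB64 = pngBytesToBase64(signatureOnly);
console.log(signatureB64); // iVBORw0KGgo=
```

The signature check is a quick guard against sending a JPEG or a truncated download with an image/png mimeType.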
Turn 2 — Pass Context and Edit
# Turn 2: Pass Turn 1 image as context + editing instruction.
# The body is streamed from a here-document on stdin (--data-binary @-) instead of
# being passed as a -d argument: a base64-encoded image can exceed the operating
# system's argument-size limit and fail with "Argument list too long".
TURN2_B64=$(curl -s -X POST \
"https://wisdom-gate.juheapi.com/v1beta/models/gemini-3.1-flash-image-preview:generateContent" \
-H "x-goog-api-key: $WISDOM_GATE_KEY" \
-H "Content-Type: application/json" \
--data-binary @- <<EOF | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' | head -1
{
  "contents": [
    {
      "role": "user",
      "parts": [{"text": "Da Vinci style anatomical sketch of a dissected Monarch butterfly. Detailed drawings of the head, wings, and legs on textured parchment with notes in English."}]
    },
    {
      "role": "model",
      "parts": [{
        "inlineData": {
          "mimeType": "image/png",
          "data": "$TURN1_B64"
        }
      }]
    },
    {
      "role": "user",
      "parts": [{"text": "Add a bold red rectangular border around the entire image. Keep the butterfly sketch identical."}]
    }
  ],
  "generationConfig": {
    "responseModalities": ["IMAGE"],
    "imageConfig": {"aspectRatio": "1:1", "imageSize": "2K"}
  }
}
EOF
)
echo "$TURN2_B64" | base64 --decode > butterfly_turn2_border.png
echo "Turn 2 complete."
Turn 3 — Continue the Edit Chain
# Turn 3: Convert to 3D render style, building on Turn 2.
# The body is streamed from a here-document on stdin (--data-binary @-): two embedded
# base64 images are far too large to pass as a single -d argument.
curl -s -X POST \
"https://wisdom-gate.juheapi.com/v1beta/models/gemini-3.1-flash-image-preview:generateContent" \
-H "x-goog-api-key: $WISDOM_GATE_KEY" \
-H "Content-Type: application/json" \
--data-binary @- <<EOF | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' | head -1 | base64 --decode > butterfly_turn3_3d.png
{
  "contents": [
    {
      "role": "user",
      "parts": [{"text": "Da Vinci style anatomical sketch..."}]
    },
    {
      "role": "model",
      "parts": [{"inlineData": {"mimeType": "image/png", "data": "$TURN1_B64"}}]
    },
    {
      "role": "user",
      "parts": [{"text": "Add a bold red rectangular border around the entire image."}]
    },
    {
      "role": "model",
      "parts": [{"inlineData": {"mimeType": "image/png", "data": "$TURN2_B64"}}]
    },
    {
      "role": "user",
      "parts": [{"text": "Reinterpret this as a photorealistic 3D render. The butterfly becomes a detailed sculptural model on a dark walnut display stand. Keep the red border as a physical frame."}]
    }
  ],
  "generationConfig": {
    "responseModalities": ["IMAGE"],
    "imageConfig": {"aspectRatio": "1:1", "imageSize": "4K"}
  }
}
EOF
echo "Turn 3 complete. Full edit chain: Sketch → Red Border → 3D Render."
Note the resolution upgrade on Turn 3: "imageSize": "4K". The final client-facing deliverable warrants maximum quality. Wisdom Gate delivers 4K in the same consistent 20 seconds as 0.5K — so upgrading resolution on the final turn costs no additional latency.
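One way to encode the "draft small, deliver large" choice is to derive the image configuration from the turn's position in the session. This is only a sketch. The helper name `imageConfigForTurn` and the rule that only the last turn is rendered at 4K are assumptions for illustration.

```javascript
// Hypothetical helper: earlier turns render at a working size, the final turn at 4K.
// turnIndex is zero-based; totalTurns is the planned length of the session.
function imageConfigForTurn(turnIndex, totalTurns, aspectRatio = "1:1") {
  const isFinal = turnIndex === totalTurns - 1;
  return { aspectRatio, imageSize: isFinal ? "4K" : "2K" };
}

const sizePlan = [0, 1, 2].map((i) => imageConfigForTurn(i, 3).imageSize);
console.log(sizePlan.join(" -> ")); // 2K -> 2K -> 4K
```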
State Management in Frontend Applications
The server-side pattern shown above works cleanly for scripts and backend services. Frontend applications need a state management strategy for the base64 strings that persist between turns.
React State Pattern
import { useState } from "react";

const ENDPOINT = "https://wisdom-gate.juheapi.com/v1beta/models/gemini-3.1-flash-image-preview:generateContent";

export function useImageEditSession() {
  // Conversation history: grows by one user turn and one model turn per call
  const [conversationHistory, setConversationHistory] = useState([]);
  const [currentImageB64, setCurrentImageB64] = useState(null);
  const [isGenerating, setIsGenerating] = useState(false);

  async function sendTurn(userPrompt) {
    setIsGenerating(true);
    try {
      // Build contents array from history + new user message
      const newUserTurn = { role: "user", parts: [{ text: userPrompt }] };
      const contents = [...conversationHistory, newUserTurn];

      const response = await fetch(ENDPOINT, {
        method: "POST",
        headers: {
          // NEXT_PUBLIC_ variables are bundled into browser code, so this key is
          // visible to users; in production, proxy the call through your backend.
          "x-goog-api-key": process.env.NEXT_PUBLIC_WISDOM_GATE_KEY,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          contents,
          generationConfig: {
            responseModalities: ["IMAGE"],
            imageConfig: { aspectRatio: "1:1", imageSize: "2K" },
          },
        }),
      });
      if (!response.ok) throw new Error(`Request failed with HTTP ${response.status}`);

      const data = await response.json();
      const imagePart = data.candidates?.[0]?.content?.parts?.find((p) => p.inlineData);
      if (!imagePart) throw new Error("No image in response: check responseModalities");
      // Reuse the mimeType the API reported, falling back to PNG
      const { data: newImageB64, mimeType = "image/png" } = imagePart.inlineData;

      // Update conversation history: add user turn + model response
      setConversationHistory((prev) => [
        ...prev,
        newUserTurn,
        { role: "model", parts: [{ inlineData: { mimeType, data: newImageB64 } }] },
      ]);
      setCurrentImageB64(newImageB64);
      return newImageB64;
    } finally {
      // Clear the loading flag on success and on failure
      setIsGenerating(false);
    }
  }

  function resetSession() {
    setConversationHistory([]);
    setCurrentImageB64(null);
  }

  return { sendTurn, currentImageB64, isGenerating, resetSession };
}
Key architectural decisions in this pattern:
- conversationHistory is the source of truth: it contains every user turn and every model response as the growing contents array
- currentImageB64 is derived state for UI rendering: the base64 string displayed as a data:image/png;base64, URI in the <img> tag
- resetSession() clears both, starting a new creative session without rebuilding the integration
- The 256K context window on Nano Banana 2 means this history array can grow through many turns before approaching any limit. For most product use cases (5 to 20 turns per session) it fits entirely within the context with room to spare
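Two of these points can be sketched as small pure functions: turning the stored base64 into a data URI for the img tag, and keeping an eye on how large the history payload has grown. Both helper names are hypothetical. The size figure counts characters in the request payload, which is not the same thing as model context tokens.

```javascript
// Builds the data URI used as the <img> src.
function toDataUri(imageB64, mimeType = "image/png") {
  return `data:${mimeType};base64,${imageB64}`;
}

// Rough request-size estimate: sums text and base64 characters across the history.
// This measures payload characters only, not tokens in the model's context window.
function historyPayloadChars(history) {
  let total = 0;
  for (const turn of history) {
    for (const part of turn.parts) {
      if (part.inlineData) total += part.inlineData.data.length;
      if (typeof part.text === "string") total += part.text.length;
    }
  }
  return total;
}

const sampleHistory = [
  { role: "user", parts: [{ text: "abc" }] },
  { role: "model", parts: [{ inlineData: { mimeType: "image/png", data: "AAAA" } }] },
];

console.log(toDataUri("AAAA"));                  // data:image/png;base64,AAAA
console.log(historyPayloadChars(sampleHistory)); // 7
```

Because every earlier image is resent on each turn, the upload size grows with the session even though the price per turn stays flat.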
Pro Tip: For production applications, persist conversationHistory to your backend (session store or database) rather than React state alone. This allows users to resume editing sessions across page refreshes, share sessions with collaborators, or roll back to earlier turns in the edit chain.
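Persistence and rollback both reduce to simple operations on the history array. The sketch below assumes each completed turn contributes exactly two entries (a user entry followed by a model entry), as in the hook above. The function names and the `version` field are hypothetical.

```javascript
// Keeps the first `turnCount` completed turns (one user entry + one model entry each).
function rollbackToTurn(history, turnCount) {
  if (!Number.isInteger(turnCount) || turnCount < 0) {
    throw new RangeError("turnCount must be a non-negative integer");
  }
  return history.slice(0, turnCount * 2);
}

// Serializes the session for a backend store; `version` is an illustrative field
// that leaves room to change the stored shape later.
function serializeSession(history) {
  return JSON.stringify({ version: 1, history });
}

// Restores a history array from its serialized form.
function restoreSession(json) {
  const parsedSession = JSON.parse(json);
  if (!Array.isArray(parsedSession.history)) throw new Error("invalid session payload");
  return parsedSession.history;
}

const storedHistory = [
  { role: "user", parts: [{ text: "sketch" }] },
  { role: "model", parts: [{ inlineData: { mimeType: "image/png", data: "AAAA" } }] },
  { role: "user", parts: [{ text: "add border" }] },
  { role: "model", parts: [{ inlineData: { mimeType: "image/png", data: "BBBB" } }] },
];

const afterRollback = rollbackToTurn(storedHistory, 1);
const restored = restoreSession(serializeSession(storedHistory));
console.log(afterRollback.length); // 2
console.log(restored.length);      // 4
```

After a rollback, the next user instruction is appended to the shortened history, so the edit chain branches from the earlier image.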
Pricing and Performance: Why Unified Matters
The economic implication of the unified endpoint model is direct: every turn in a multi-turn editing session costs $0.058 on Wisdom Gate, regardless of whether it is a generation turn or an editing turn. There is no editing surcharge, no premium for passing image context, and no additional charge for enabling Image Search Grounding.
| Workflow | Turns | Cost on Wisdom Gate | Latency per Turn |
|---|---|---|---|
| Single generation | 1 | $0.058 | 20 seconds |
| Generation + 1 edit | 2 | $0.116 | 20 seconds each |
| Full 3-turn session | 3 | $0.174 | 20 seconds each |
| 10-turn design review | 10 | $0.580 | 20 seconds each |
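The table's arithmetic is linear in the number of turns, which a few lines of JavaScript can reproduce. The per-turn price is the $0.058 figure quoted in this article. The function name is hypothetical, and the price is held in mills (thousandths of a dollar) so the multiplication stays in integers.

```javascript
// $0.058 per turn, expressed in mills to avoid floating-point drift when multiplying.
const PRICE_PER_TURN_MILLS = 58;

// Returns the session cost in US dollars for a given number of turns.
function sessionCostUsd(turns) {
  if (!Number.isInteger(turns) || turns < 0) {
    throw new RangeError("turns must be a non-negative integer");
  }
  return (turns * PRICE_PER_TURN_MILLS) / 1000;
}

for (const turns of [1, 2, 3, 10]) {
  console.log(`${turns} turn(s): $${sessionCostUsd(turns).toFixed(3)}`);
}
```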
The latency column matters for product design. Adding an image to the context via inlineData does not extend the generation time beyond the platform's consistent 20-second guarantee. Wisdom Gate's infrastructure is built to handle multi-modal inputs within the same response window, so your loading state, your timeout configuration, and your user experience design remain identical whether you are on Turn 1 or Turn 10.
This predictability is not incidental. For developers building creative tools where the user experience depends on reliable feedback timing — design review platforms, AI art tools, automated content workflows — a variable latency API is an architectural risk. A guaranteed 20 seconds is an engineering specification you can design around.
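Designing around a fixed latency usually means picking one timeout budget and applying it to every turn. The sketch below shows a generic promise timeout. The helper name `withTimeout` and the 45-second budget are assumptions (the stated 20 seconds plus headroom), not values from Wisdom Gate. Note that this helper only stops waiting. It does not cancel the underlying request.

```javascript
// Assumed budget: the article's 20 s latency plus headroom for upload and network.
const TURN_TIMEOUT_MS = 45000;

// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: wrap any turn, generation or edit, in the same budget.
// `sendTurn` here stands for whatever function performs the request.
// withTimeout(sendTurn("Make the wings blue"), TURN_TIMEOUT_MS)
//   .catch((err) => console.error(err.message));
```

Because the budget does not depend on the turn number, the same loading indicator and retry policy can serve the whole session.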
Mode Selection: The Decision Rule
Here is the practical decision rule that covers the most common use cases:
| User Action | Mode | Contents Array |
|---|---|---|
| "Generate a product image of X" | Text-to-image | One user turn (text only) |
| "Make the background white" | Image editing | Previous turns + new instruction |
| "Try a different color palette" | Image editing | Previous turns + new instruction |
| "Start over with a new concept" | Text-to-image | Reset history; new user turn (text only) |
| "Generate 5 variants of this" | Image editing × 5 | Same history, 5 separate edit calls |
The only moment you reset the conversation history and start fresh is when the user explicitly wants a new starting point. Every refinement, every targeted change, and every variant based on the current image is an editing turn — pass the history, add the instruction, receive the next version.
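The decision rule can be sketched as two helpers: one that decides which history to send, and one that fans a single instruction out into several independent variant requests over the same history. Both function names are hypothetical, and the generationConfig values simply mirror the ones used earlier in this article.

```javascript
// "Start over" is the only case that drops the history; everything else keeps it.
function historyForNextTurn(history, startOver) {
  return startOver ? [] : history;
}

// Builds `count` independent request bodies that share the same history and instruction.
function buildVariantRequests(history, instruction, count) {
  return Array.from({ length: count }, () => ({
    contents: [...history, { role: "user", parts: [{ text: instruction }] }],
    generationConfig: {
      responseModalities: ["IMAGE"],
      imageConfig: { aspectRatio: "1:1", imageSize: "2K" },
    },
  }));
}

const sessionHistory = [
  { role: "user", parts: [{ text: "Product image of a ceramic mug" }] },
  { role: "model", parts: [{ inlineData: { mimeType: "image/png", data: "AAAA" } }] },
];

const variantRequests = buildVariantRequests(sessionHistory, "Try a different color palette", 5);
console.log(variantRequests.length);                               // 5
console.log(variantRequests[0].contents.length);                   // 3
console.log(historyForNextTurn(sessionHistory, true).length);      // 0
```

Each variant request gets its own contents array, so appending a response to one branch does not affect the others.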
Putting It Together: The Unified Workflow Summary
Nano Banana 2 image editing and image generation are the same operation at the API level. The unified generateContent endpoint on Wisdom Gate handles both through one authentication pattern, one response structure, and one pricing model. The difference is your contents array: text alone for generation, conversation history plus new instruction for editing.
The 256K context window means your edit chains can be long without truncation. The consistent 20-second response time on Wisdom Gate means your product's UX is predictable at every turn. The $0.058 flat rate means your cost model is linear and forecastable regardless of how many edit turns a session requires.
Build one pipeline. Handle both modes. Stop maintaining separate generation and editing integrations and reduce your AI feature surface area to a single, well-understood endpoint.
Simplify your code. Amplify your creative output. Nano Banana 2 on Wisdom Gate is the single endpoint that handles your entire image creation and refinement lifecycle — from first prompt to final 4K hero asset. Test multi-turn editing right now in AI Studio with no API key, or get your production key at wisdom-gate.juheapi.com/hall/tokens and start building the unified creative tool your users actually need. $0.058 per turn, 20 seconds, every time.