
Does Prompt Language Matter? The 2026 Data

Jamin Mahmood-Wiebe


A Confession From a Non-Native Speaker

My mother tongue is German. But every time I open Claude, ChatGPT, or any other LLM, I speak to it in English. I stopped typing prompts months ago — but even when dictating, I default to English. Not because I think my German is bad. Because I have a gut feeling that the AI just... gets me better in English. Does prompt language matter that much, or was this just placebo?

For months, I assumed the latter. A bias from spending too much time in English-speaking developer communities. Then I started digging into the research. Turns out, my gut feeling has data behind it.

The Training Data Problem: English Owns the Internet

The root cause is simple: LLMs are trained predominantly on English text.

67% of LLaMA's training data is English
93% of GPT-3's training data was English
5-6% typical share of German in training corpora

According to Meta's LLaMA paper, the training mix was 67% English, with the remaining third split across dozens of languages and programming code. OpenAI's GPT-3 paper (Brown et al., 2020) documented an even more extreme 93% English. Even deliberately multilingual models like BigScience's BLOOM, designed to counter English dominance, still ended up with 30% English as the single largest language.

The consequence? English is the language where these models have seen the most examples of good writing, logical reasoning, factual knowledge, and nuanced expression. When you prompt in English, you are activating the densest, most well-trained neural pathways the model has.

German, at roughly 5-6% of typical training corpora, is actually one of the better-represented non-English languages. But "one of the better-represented non-English languages" still means roughly 10x less training data than English.

What the Benchmarks Actually Show

This is not speculation. Multiple research teams have measured the gap.

The Historical Baseline: GPT-4 (2023)

According to the GPT-4 Technical Report, OpenAI tested GPT-4 on translated MMLU across 26 languages. The gap was clear:

Performance degraded roughly in proportion to training data volume. German and French sat 3 points behind English. Low-resource languages showed gaps of 10-14 points.

But that was 2023. A lot has changed.

The 2025-2026 Reality: The Gap Has Narrowed Dramatically

Three years of model improvements have compressed the multilingual gap for high-resource languages. Here is where the leading models stand on multilingual benchmarks today:

The standout number: Gemini 2.5 Pro achieves 94% parity with English across 12 languages. That is a fundamentally different landscape than GPT-4's 3-14% gaps.

Claude Opus 4.6 averages 96 on BenchLM's multilingual suite, a 15.5-point improvement over Claude 3.5 Sonnet. The generational leap in multilingual capability has been massive.

Perhaps most surprising: a 2026 study by Lilt testing GPT-5.2, Claude Opus 4.6, and Gemini 3.1 on Arabic, German, and Korean found that German actually surpassed English on version editing tasks (53.66% vs 46.34%). The blanket "English is always better" assumption no longer holds for every task.

The Reversed Asymmetry: Chinese Models

DeepSeek-V3 tells an interesting story. On English factual knowledge (SimpleQA benchmark), it trails GPT-4o and Claude 3.5 Sonnet. But on Chinese SimpleQA, it surpasses both. This is not parity. It is a reversed advantage for the language the model was optimized for.

Qwen 3 expanded from 29 to 119 languages and dialects, trained on 36 trillion tokens. Qwen3-235B with thinking achieves around 80% on MMLU-ProX across multiple languages, closing in on parity.

The lesson: if you work primarily in Chinese, a Chinese-optimized model may outperform Western models even when those Western models are prompted in English.

The Reasoning Gap is Larger Than the Knowledge Gap

The quality difference is not uniform across tasks. For simple factual questions, language barely matters. For complex reasoning chains, the gap widens.

Shi et al. (2022) demonstrated that English chain-of-thought exemplars improved reasoning in other languages. Huang et al. (2023) formalized this with Cross-Lingual Thought (XLT) prompting: instructing the model to reason in English internally before responding in the target language, producing 1-10% improvements on arithmetic reasoning.

But a May 2025 survey cataloging 39 multilingual prompting techniques found that newer strategies can outperform the simple "think in English" approach. Cross-Lingual Self-Consistent Prompting (CLSP), which reasons in multiple languages and picks the most consistent answer, outperforms English-only chain-of-thought on math and causal reasoning. And Regressive Native-CoT actually outperforms English CoT for certain languages on subjectivity tasks.

The takeaway has evolved: English still helps for reasoning, but it is no longer the only path to good results.
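The consistency-voting step at the heart of CLSP is easy to sketch. This is a minimal illustration, not the paper's implementation: assume you have already collected one final answer per language-specific reasoning chain (the model calls themselves are omitted), and the selection is a simple majority vote:

```python
from collections import Counter

def cross_lingual_self_consistency(answers):
    """Pick the answer that the most reasoning chains agree on.

    `answers` is a list of final answers, one per language-specific
    chain of thought (e.g. an English, a German, and a Chinese
    reasoning path over the same problem).
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Three chains reasoned in different languages; two agree on "42".
print(cross_lingual_self_consistency(["42", "42", "7"]))  # → 42
```

The full technique also involves how each chain is prompted in its language; the vote shown here is only the aggregation step.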

The Token Tax: Non-English Languages Cost More

Beyond quality, there is a hard financial reality. LLM tokenizers were trained primarily on English text, making English the most efficiently encoded language.

German text typically runs about 30% more tokens than equivalent English, mainly because of compound nouns. "Geschwindigkeitsbegrenzung" (speed limit) might consume 4-5 tokens where "speed limit" costs 2. Multiply that across an entire conversation, and you are paying a real premium. These figures are based on analysis of OpenAI's cl100k_base tokenizer.

For a business running thousands of API calls per day, this adds up. A German-language customer service bot costs roughly 30% more per interaction than an English one, before you even consider the quality difference.
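To see what that premium looks like at scale, here is a back-of-the-envelope calculator. The 30% overhead figure comes from the discussion above; the call volume and the per-token price are hypothetical placeholders, not any provider's actual pricing:

```python
def monthly_cost(calls_per_day, tokens_per_call, price_per_1k_tokens,
                 language_overhead=1.0, days=30):
    """Rough monthly API cost. `language_overhead` scales token usage,
    e.g. 1.3 for the ~30% German token premium discussed above."""
    total_tokens = calls_per_day * tokens_per_call * language_overhead * days
    return total_tokens / 1000 * price_per_1k_tokens

# 5,000 calls/day, 800 tokens each, at a hypothetical $0.01 per 1K tokens:
english = monthly_cost(5000, 800, 0.01)                        # 1200.0
german = monthly_cost(5000, 800, 0.01, language_overhead=1.3)  # 1560.0
print(f"German premium per month: ${german - english:.2f}")    # → $360.00
```

Swap in your real traffic numbers and current pricing; the point is that a constant-factor token overhead turns into a constant-factor line item.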

Chinese Token Efficiency: From Myth to Reality

For years, the claim that "Chinese is more token-efficient" was dismissed as a myth. And with older tokenizers, it genuinely was. GPT-4's cl100k_base tokenizer turned each Chinese character into 2-3 BPE tokens, making Chinese text roughly 1.8x more expensive than English despite carrying more meaning per character.

That has fundamentally changed. With OpenAI's o200k_base tokenizer (used in GPT-4o and GPT-5), Chinese token efficiency has caught up dramatically. The vocabulary doubled from 100K to 200K tokens, with Unicode categories added specifically for CJK scripts.

The result is striking. Take "artificial intelligence" vs "人工智能": English requires 3 tokens, Chinese requires 2 tokens. Chinese actually wins. At the sentence level, according to tokenizer analysis by Tony Baloney and testing with o200k_base, Chinese and English are at near-parity, averaging roughly 1.0-1.1x. At the paragraph level, Chinese runs about 15% more tokens for equivalent content — a far cry from the old 80% penalty.

ℹ️

The real picture in 2026

Chinese information density IS translating into real token efficiency with modern tokenizers. The old 1.8x penalty has collapsed to roughly 1.1-1.2x. For short technical phrases, Chinese often beats English on token count. The "myth" is no longer a myth — it just took tokenizer improvements to catch up with the linguistic reality.

One caveat remains: tokenization can still fragment semantic units in Chinese, splitting compound concepts across token boundaries in ways that may affect comprehension during complex reasoning. Newer tokenizers have reduced but not eliminated this issue. And for low-resource languages using non-Latin scripts (Hindi, Tamil, Arabic), significant token premiums of 2-3x versus English persist even with o200k.

German-Specific Quirks I Have Noticed

After thousands of hours prompting LLMs in both English and German, here are the German-specific issues I see repeatedly:

Compound noun hallucinations. Models occasionally generate compound words that do not exist in German. They understand the principle of German compounds but sometimes invent new ones that sound plausible but are wrong.

Case system drift. In long outputs, dative and accusative cases start blending. The model defaults toward nominative, especially in complex nested sentences. Native speakers notice this immediately.

Sie/du inconsistency. The formal/informal distinction is a minefield. Models switch between Sie and du mid-conversation, especially after code blocks or technical explanations where the model briefly "forgets" its register.

Gender-inclusive language chaos. Ask for gender-neutral German (Gendern) and you will get an inconsistent mix of Genderstern (Mitarbeiter*innen), Doppelnennung (Mitarbeiterinnen und Mitarbeiter), and the generic masculine within the same paragraph, unless you are extremely specific in your instructions.

None of these issues occur in English output. The model has simply seen enough English to avoid these classes of errors.

Does Prompt Language Matter? The Task-Dependent Answer

After reviewing the research and testing extensively myself, here is what I recommend:

For Complex Reasoning and Analysis: Prompt in English

When you need the model to think hard, English wins. Math problems, logical deduction, strategic analysis, data interpretation. Write your prompt in English, even if you want the final answer in another language. Add "Respond in German" at the end if needed.

This maps directly to the research. English chain-of-thought is more reliable. English instruction following is more consistent. The model's alignment training (RLHF) was conducted predominantly in English.

For Creative Writing and Marketing Copy: Use Your Native Language

Here is where the conventional wisdom breaks. If you want natural-sounding German marketing copy, prompt in German. Prompting in English and requesting German output produces what I call "translationese" — grammatically correct German that reads like a translation. Native speakers feel it immediately, even if they cannot pinpoint why.

The cultural context matters too. German business communication has different conventions, different levels of formality, different rhetorical patterns. The model accesses these more naturally when the entire conversation is in German.

For Code: Always English

This one is non-negotiable. Variable names, comments, documentation, code explanations — everything should be in English. Programming is an English-native domain. The training data for code is overwhelmingly English. Prompting for code in German produces worse variable naming, less idiomatic patterns, and more errors. This is true whether you use Cursor, Claude Code, or any other AI coding tool.

For Structured Output: English Instructions, Target Language Content

When requesting JSON, tables, or any structured format, write the structural instructions in English. The model follows formatting rules more reliably in English. Content within the structure can be in your target language.

💡

The hybrid approach that works best

Write your system prompt in English. Write your user message in whatever language feels natural. Add a clear language instruction for the response. This leverages English's superior instruction-following while keeping your input natural and comfortable.
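The hybrid pattern is easy to wire up in code. This sketch uses the common role/content chat message format; the function name is mine, and you would adapt the shape to whichever provider's API you actually call:

```python
def build_hybrid_messages(system_prompt_en, user_message, response_language):
    """Assemble a chat request following the hybrid pattern: English
    system prompt for reliable instruction following, user message in
    whatever language feels natural, and an explicit instruction
    pinning the response language."""
    return [
        {"role": "system",
         "content": f"{system_prompt_en}\nAlways respond in {response_language}."},
        {"role": "user", "content": user_message},
    ]

messages = build_hybrid_messages(
    "You are a concise marketing assistant.",
    "Schreibe einen Slogan für eine Bäckerei in Berlin.",
    "German",
)
```

Pinning the response language in the system prompt, rather than the user message, also helps against the mid-conversation register drift described earlier.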

A Practical Trick: "Think in English, Respond in German"

The single most effective technique I have found: explicitly tell the model to reason in English internally.

Adding "Think through this problem in English, then provide your response in German" to complex prompts produces noticeably better results. This is the practitioner version of the Cross-Lingual Thought (XLT) prompting strategy from Huang et al. (2023).

It costs a few extra tokens for the internal reasoning. It is worth it every time for anything beyond simple questions.
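In practice I keep this as a tiny wrapper so the instruction is never forgotten. A minimal sketch, with a function name of my own invention:

```python
def xlt_prompt(task, response_language="German"):
    """Wrap a task with a Cross-Lingual Thought style instruction:
    reason in English internally, answer in the target language."""
    return (
        f"Think through this problem step by step in English, "
        f"then provide your final response in {response_language}.\n\n"
        f"Task: {task}"
    )

print(xlt_prompt("Fasse die Quartalszahlen zusammen und nenne drei Risiken."))
```

The task itself can stay in your native language; only the reasoning instruction needs to be in English.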

The Gap Is Closing — Fast for Some, Slow for Others

The trajectory from GPT-3 to current models tells the story:

For German, we are approaching functional parity. A 2026 Lilt study found that instruction retention drops only 3-7% for non-English languages in current frontier models. German even outperformed English on some specific tasks. Gemini 2.5 Pro hits 94% parity with English across 12 languages.

But the uncomfortable truth remains: for low-resource languages, the improvement is much slower. The INCLUDE benchmark (ICLR 2025), built with native-speaker-created resources rather than translations, revealed persistent gaps. On Qwen 2.5's MMLU-ProX results, the gap between English (70.3%) and Swahili (40.1%) is still a staggering 30 points. A Stanford HAI report and a January 2025 multilingual LLM survey both conclude that most LLMs remain fundamentally English-centric.

The "Always Prompt in English" Advice is Outdated

Here is where I need to update my own assumptions. A 2025 study across 35 languages on extraction tasks found that matching prompt language to content language consistently outperforms English prompts, with up to 50% accuracy improvement. English prompts even took 25-35% longer to process non-English content.

This does not mean English prompts are useless. For reasoning-heavy tasks without native-language input, English still tends to win. But the blanket advice to "always use English" is a relic of the GPT-3/GPT-4 era. Current models are good enough in major languages that task-matching matters more than language-switching.

What This Means for Businesses

If you are building AI-powered products or agentic workflows, prompt language is a real architectural decision:

Cost optimization. An English-first prompt strategy saves 20-40% on token costs for German applications, more for CJK languages. For high-volume applications, this is a line item worth optimizing.

Quality assurance. If your AI outputs are customer-facing, the language you prompt in affects the quality your customers see. A proper AI readiness assessment should include prompt language strategy. Test both approaches with native speakers before committing.

Longer contexts amplify the gap. In short prompts, language barely matters. In long contexts (10K+ tokens), non-English prompts show more instruction drift, formatting inconsistencies, and occasional language switching. If your use case involves long documents, English prompting becomes more important.

Temperature sensitivity. Higher temperature settings amplify multilingual quality differences. For non-English generation, keep temperature lower (0.3-0.5) than you would for English (0.7).
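If you bake this into an application, a simple default picker is enough. The values below are the starting points suggested above, not hard rules, and the function is my own shorthand:

```python
def suggested_temperature(language):
    """Heuristic from the guidance above: keep temperature lower for
    non-English generation, since higher settings amplify multilingual
    quality differences. Values are starting points, not rules."""
    if language.lower() == "english":
        return 0.7
    return 0.4  # midpoint of the 0.3-0.5 range suggested for non-English

print(suggested_temperature("German"))   # → 0.4
print(suggested_temperature("English"))  # → 0.7
```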

My Personal Workflow

I prompt in English for everything except German marketing copy. Even when building features for the German market, my system prompts are English, my technical discussions are English, and my code is obviously English.

When I need German output, I write "Respond in German" and accept the small quality tradeoff for natural-sounding results. For critical German content, I review and edit manually — the model gets 90% right, but that last 10% is where native fluency lives.

Is this optimal? The research says yes. Does it feel slightly absurd to speak a foreign language to a machine? Also yes. But until training data distribution catches up, English remains the lingua franca of AI — not by design, but by data.

Quick Reference: When to Use Which Language

| Task Type | Recommended Language | Why |
| --- | --- | --- |
| Complex reasoning, math, logic | English | Strongest chain-of-thought performance |
| Processing non-English documents | Match the document language | Up to 50% accuracy gain (2025 study) |
| Creative writing, marketing copy | Your native language | Avoids "translationese", captures cultural tone |
| Code generation | English | Programming is an English-native domain |
| System prompts, structured output | English | Better instruction following |
| Translation | Source language input + English instructions | Reliable formatting with natural source comprehension |

FAQ

Does prompting in English really make a difference for simple questions?

Not really. For basic factual questions or short interactions, the difference is negligible. The gap becomes meaningful for complex reasoning, long contexts, and tasks where instruction following matters.

Should I switch to English even if my English is not perfect?

Yes, for reasoning tasks. Imperfect English prompts still outperform native-language prompts for complex analysis in most cases. The model is very good at understanding non-native English. Your slightly awkward phrasing activates better neural pathways than perfect German.

Will this change as models improve?

For high-resource languages like German, French, and Japanese — yes, the gap is closing quickly. For low-resource languages, the improvement is slower. English dominance in training data is a structural issue that will take years to fully resolve.

What about Chinese-specialized models like DeepSeek or Qwen?

They do not just match English — they beat Western models on Chinese tasks. DeepSeek-V3 surpasses GPT-4o and Claude 3.5 Sonnet on Chinese SimpleQA. Qwen 3 supports 119 languages trained on 36 trillion tokens. If you work primarily in Chinese, a Chinese-optimized model is not just comparable — it is likely superior to prompting a Western model in English.

Is the token cost difference significant for individuals?

For personal use with subscription-based tools (ChatGPT Plus, Claude Pro), token efficiency barely matters — you are paying a flat fee. It becomes significant for API-based applications processing thousands of requests per day.
