AIEthos Research | LLM Semantic Case Study

By AIEthos LLC Research | May 11, 2026

How Three LLMs Interpret the Same 48 Enterprise Brands Differently

We pulled 48 subscription-audit reports and ran cross-model sentiment and semantic-gap analysis across OpenAI, Claude, and Gemini. What we found exposes a major blind spot in single-model GEO benchmarking.

Audits in Cohort: 57 Requested

Reports Analyzed: 48 of 51 Completed

Full 3-Model Agreement: 20.8% (10 / 48)

Semantic Gap Rows: 235 across 46 Reports

The Data

This analysis used a first-party subscription-audit cohort with 57 requested cases, 51 completed cases, and 48 analyzable reports after quality filtering. The final sample was transformed into a standardized analysis dataset that enabled cross-model comparison of sentiment outcomes and semantic-gap outcomes at the report level.

Method Snapshot

  • Reports were pulled from securely stored internal reporting artifacts and mapped to a consistent analysis shape.
  • Records that failed completeness checks were excluded prior to metric computation.
  • The transformed dataset was used to compare sentiment classifications across OpenAI, Claude, and Gemini.
  • The same dataset supported semantic-gap analysis using normalized topic and gap representations.
  • Outputs were aggregated at cohort level for agreement rates, divergence patterns, and supported topic summaries.
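
The mapping and completeness steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the raw field names (`id`, `sentiment`), the `ReportRow` shape, and the rule that a report is complete only when all three models return a valid label are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

MODELS = ("openai", "claude", "gemini")
VALID_LABELS = {"positive", "neutral", "negative"}


@dataclass(frozen=True)
class ReportRow:
    """Hypothetical standardized analysis shape for one report."""
    report_id: str
    sentiment: dict  # model name -> sentiment label


def normalize(raw: dict) -> Optional[ReportRow]:
    """Map one raw report to the analysis shape.

    Returns None when the report fails the (assumed) completeness check:
    every model must have a recognizable sentiment label.
    """
    raw_sentiment = raw.get("sentiment") or {}
    labels = {}
    for model in MODELS:
        label = str(raw_sentiment.get(model, "")).strip().lower()
        if label not in VALID_LABELS:
            return None
        labels[model] = label
    return ReportRow(report_id=str(raw.get("id", "")), sentiment=labels)


def build_dataset(raw_reports):
    """Keep only the reports that pass the completeness check."""
    rows = (normalize(r) for r in raw_reports)
    return [row for row in rows if row is not None]


# Two made-up reports: the second is missing two model outputs.
raw_reports = [
    {"id": "r1", "sentiment": {"openai": "Positive", "claude": "neutral", "gemini": "positive"}},
    {"id": "r2", "sentiment": {"openai": "positive"}},
]
dataset = build_dataset(raw_reports)
print(len(dataset), dataset[0].sentiment)
```

Filtering before any metric is computed keeps every downstream rate on the same denominator, which is why the cohort figures in this study are all expressed out of 48.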

Sentiment Distribution by Model

Each model was asked about the same brands, yet the three distributions are strikingly different. OpenAI and Gemini skew strongly positive; Claude trends neutral.

OpenAI

Positive: 46 (96%)
Neutral: 2 (4%)
Negative: 0 (0%)

Claude

Positive: 10 (21%)
Neutral: 37 (77%)
Negative: 1 (2%)

Gemini

Positive: 44 (92%)
Neutral: 4 (8%)
Negative: 0 (0%)
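
As a quick check on the arithmetic, the percentages above can be reproduced from the raw counts. The sketch below uses only the counts reported in this section; the `COUNTS` and `shares` names are illustrative.

```python
# Sentiment counts per model, as reported in this section (48 reports each).
COUNTS = {
    "OpenAI": {"positive": 46, "neutral": 2, "negative": 0},
    "Claude": {"positive": 10, "neutral": 37, "negative": 1},
    "Gemini": {"positive": 44, "neutral": 4, "negative": 0},
}


def shares(counts):
    """Convert label counts into whole-number percentages of the total."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


for model, counts in COUNTS.items():
    print(model, sum(counts.values()), shares(counts))
```

Each model's counts sum to 48, so the three distributions are directly comparable.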

Top Tone Terms by Model

The vocabulary each model uses to describe brand tone reveals fundamentally different interpretive frames. Claude focuses on technical infrastructure signals; OpenAI and Gemini describe brand persona.

OpenAI

corporate ×32, mission-driven ×17, professional ×11, authoritative ×7, innovation-focused ×6, sustainability-oriented ×5, purpose-driven ×4, trust-oriented ×4

Claude

structured content ×91, llms.txt ×67, JSON-LD ×59, robots.txt ×56, schema ×53, signals ×43, sitemap.xml ×38, crawlers ×36

Gemini

authoritative ×43, mission-driven ×18, corporate ×16, structured ×16, sustainability ×5, governance ×5, transparency ×5, institutional ×5

Sentiment Divergence Examples

38 of 48 reports (79%) had at least one model disagreement. The most common pattern: OpenAI and Gemini classify positive while Claude returns neutral, including in stronger-readiness samples.

Industry | Sample | GEO Score | OpenAI | Claude | Gemini
Automotive | Sample A | 48 | positive | neutral | positive
Automotive | Sample B | 42 | positive | neutral | positive
Automotive | Sample C | 69 | positive | neutral | positive
Technology Services | Sample A | 66 | positive | neutral | positive
Consumer Electronics | Sample A | 71 | positive | neutral | neutral
Enterprise Technology | Sample A | 75 | positive | neutral | positive
Enterprise Technology | Sample B | 70 | positive | neutral | neutral
Healthcare | Sample A | 58 | positive | neutral | positive
Healthcare | Sample B | 60 | positive | neutral | positive
Healthcare | Sample C | 55 | positive | neutral | positive
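
One way to express the agreement check behind these figures is to treat a report as "full agreement" only when all three labels are identical. The sketch below applies that rule to the ten sample rows listed above and then recomputes the cohort-level rates from the reported counts (10 agreeing, 38 disagreeing, out of 48). The function names are illustrative, and the exact rule used in the study is assumed rather than confirmed.

```python
from collections import Counter

# (OpenAI, Claude, Gemini) labels for the ten sample rows listed above.
SAMPLE_ROWS = [
    ("positive", "neutral", "positive"),  # Automotive, Sample A
    ("positive", "neutral", "positive"),  # Automotive, Sample B
    ("positive", "neutral", "positive"),  # Automotive, Sample C
    ("positive", "neutral", "positive"),  # Technology Services, Sample A
    ("positive", "neutral", "neutral"),   # Consumer Electronics, Sample A
    ("positive", "neutral", "positive"),  # Enterprise Technology, Sample A
    ("positive", "neutral", "neutral"),   # Enterprise Technology, Sample B
    ("positive", "neutral", "positive"),  # Healthcare, Sample A
    ("positive", "neutral", "positive"),  # Healthcare, Sample B
    ("positive", "neutral", "positive"),  # Healthcare, Sample C
]


def full_agreement(labels):
    """True when every model returned the same sentiment label."""
    return len(set(labels)) == 1


def rate(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


patterns = Counter(SAMPLE_ROWS)
agree = sum(full_agreement(row) for row in SAMPLE_ROWS)

print(patterns.most_common())      # most frequent divergence patterns first
print(agree)                       # number of sample rows in full agreement
print(rate(10, 48), rate(38, 48))  # cohort agreement and disagreement rates
```

In the ten samples, eight follow the positive / neutral / positive pattern and two show Gemini siding with Claude; none are in full agreement, which is expected since the table lists divergence examples only.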

Semantic Gap Categories

Re-analysis of 235 semantic-gap rows across 46 reports shows a strong concentration in Structural gaps, followed by Entity/Definition and Intent/Context categories.

Category | Core Question | Primary Challenge | Rows | Share | Reports | Avg Gap
Intent/Context | Why are they asking? | Missing the underlying goal or hidden sub-questions. | 19 | 8.1% | 19 | 30.42
Entity/Definition | What are we talking about? | Conflicting definitions of the same term across teams. | 34 | 14.5% | 26 | 36.18
Sensory/Representation | How does it look/feel? | Translating raw signals (pixels/sound) into meaning. | 9 | 3.8% | 9 | 33.33
Structural | Where is the data? | Technical layers that fail to connect related concepts. | 171 | 72.8% | 46 | 46.98
Linguistic | How do we say it? | Moving from fuzzy human talk to rigid code. | 2 | 0.9% | 2 | 27.5
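
The Share column can be rechecked from the row counts alone. The following sketch uses only the figures in the table; `GAP_ROWS` and the other names are illustrative.

```python
# Semantic-gap row counts per category, as reported in the table above.
GAP_ROWS = {
    "Intent/Context": 19,
    "Entity/Definition": 34,
    "Sensory/Representation": 9,
    "Structural": 171,
    "Linguistic": 2,
}

total = sum(GAP_ROWS.values())

# Share of all gap rows per category, as a percentage with one decimal place.
gap_shares = {name: round(100 * n / total, 1) for name, n in GAP_ROWS.items()}

# Categories ordered from most to fewest rows.
ranked = sorted(GAP_ROWS, key=GAP_ROWS.get, reverse=True)

print(total)
for name in ranked:
    print(f"{name}: {GAP_ROWS[name]} rows, {gap_shares[name]}%")
```

The counts sum to the 235 rows reported, and the ranking matches the ordering described above: Structural first, then Entity/Definition, then Intent/Context.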

Lessons Learned

Four actionable takeaways from analyzing 48 enterprise brands across three LLMs.

01

Single-model sentiment scores are not a reliable GEO signal

OpenAI and Gemini classified 96% and 92% of brands as positive, respectively, while Claude rated 77% as neutral. A team relying on one model alone could reach a very different strategic conclusion about the same brand than a team relying on another.

02

Claude's tone vocabulary is technical, not emotional

Claude's top tone terms - structured content, llms.txt, JSON-LD, robots.txt - reveal that it interprets tone through a technical infrastructure lens rather than a brand persona lens. This makes Claude valuable for surfacing machine-readability risk, not only for reading sentiment.

03

Machine-readable authority gaps persist regardless of sentiment

Structural gaps dominate the semantic-gap landscape, accounting for 72.8% of all semantic-gap rows (171 of 235) and appearing in all 46 reports that had gap rows. With an average gap of about 47, these machine-readable authority and technical-readiness problems persist even in brands that all three models rate as positive - showing that positive sentiment does not imply citation-readiness.

04

Consensus is rare and should be the benchmark, not the exception

Only 10 of 48 brands achieved full three-model agreement. Enterprise GEO strategy should target consensus across model families as a performance bar, not optimize for a high score from a single preferred model.

Know How Three Models Read Your Brand

An AIEthos LLC audit generates a full three-model semantic profile - so you can see where OpenAI, Claude, and Gemini agree, where they diverge, and which structural gaps are hiding behind a positive headline score.

Data Note - This study is based on first-party AIEthos LLC subscription audit report data from securely stored internal systems. The requested cohort was 57 audits; 51 were completed at extraction time and 48 reports were fully analyzable. All three model sentiment outputs (OpenAI, Claude, Gemini) were present in the analyzed set. The sample is not statistically random. Findings reflect brand state at audit time and may not represent current deployments.