Aether One™ · The model layer

Open weights.
Sovereign inference.

Healthcare runs on regulated infrastructure. Aether One™ runs on models we can host inside that infrastructure — open-weight foundation models from Meta, Mistral, Microsoft, and Google, plus healthcare-specific open releases like MedGemma and BioMistral. The model layer is engineered, not assumed.

[Image: HIP One Clinical Review dashboard powered by Aether One™ — document upload and review interface running on sovereign open-weight inference]
Engineering thesis

8+ · Open-weight families — in production across Aether One™ agent surfaces
0 · PHI egress — sovereign deployments call no external inference APIs
3 · Inference frameworks — Ollama, vLLM, llama.cpp, matched to deployment shape

Why open-weight

Four properties closed APIs can't deliver.

For payer, provider, and government healthcare, the right model is the one you can host where the data lives, move between environments, audit when the regulator asks, and price predictably at production volume. Closed model APIs deliver none of the four.

01 · Residency

Data stays put

Inference happens on infrastructure the customer owns. PHI never crosses a network boundary to a third-party model API. Required for sovereign deployments and CMS-aligned environments.

02 · Cost

Predictable economics

No per-token pricing surprises at production volume. Cost is GPU-hours, not API charges. A 70B model running on an H100 costs the same on day one as on day one thousand.

03 · Portability

Deployable anywhere

The same model artifact runs on Azure Government, AWS GovCloud, an on-prem GPU rack, or an air-gapped DoD-adjacent system. The model doesn't know or care — the inference framework abstracts it.

04 · Audit

Reconstructable evidence

Specific model version, weights hash, and inference parameters are recorded in every audit packet. Reviewers can replay any historical decision exactly — impossible with a closed API that may have silently versioned underneath you.
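
For illustration, a minimal sketch of what such an audit record can capture — a SHA-256 over the on-disk weights artifact plus the exact inference parameters. The function names, fields, and path here are hypothetical, not the shipped schema:

    import hashlib
    import json
    from datetime import datetime, timezone

    def weights_sha256(path: str, chunk_size: int = 1 << 20) -> str:
        # Hash the model artifact on disk so the exact weights are identifiable.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def audit_model_record(model_id: str, model_path: str, params: dict) -> dict:
        # One audit-packet entry: enough to replay the decision exactly.
        return {
            "model_id": model_id,
            "weights_sha256": weights_sha256(model_path),
            "inference_params": params,  # temperature, seed, max_tokens, ...
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }

    record = audit_model_record(
        "llama-3.1-70b-instruct",
        "/models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical path
        {"temperature": 0.0, "seed": 42, "max_tokens": 1024},
    )
    print(json.dumps(record, indent=2))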

Models in production

What we run, and where.

Aether One™ agents are not single-model. Each agent surface uses the model best matched to its task — latency budget, reasoning depth, structured-output reliability, healthcare-domain pre-training. Below is the production matrix as of 2026.

Model family | Provenance | Where it runs | Why it's there
Llama 3.1 70B · instruct | Meta · open weights | Clinical reasoning agents, criteria decomposition | Best open-weight model for long-context clinical reasoning. Strong at the multi-step synthesis required by NCD/LCD criteria atomization.
Llama 3.1 8B · instruct | Meta · open weights | Routing, intake parsing, fast classification | Sub-second latency on a commodity GPU. Used wherever the agent's job is to route, classify, or extract structured data from a clean input.
Mixtral 8x7B | Mistral AI · Apache 2.0 | Multi-language correspondence, dense classification | Mixture-of-experts architecture gives strong multilingual performance for member correspondence and benefit translations.
Gemma 4 · 31B Dense | Google DeepMind · Apache 2.0 | Structured-output agents, JSON schema enforcement | Native function calling and structured JSON output are baked into the architecture, not bolted on via prompt engineering. 256K context. Used where agent output must conform to a strict JSON schema before being passed to a downstream system.
Gemma 4 · 26B MoE | Google DeepMind · Apache 2.0 | Rule-pack generation, code-as-policy | Mixture-of-experts architecture activates only 3.8B parameters per token — predictable inference cost for high-throughput code-reasoning workloads. Used internally for translating regulatory text into executable rule logic.
Microsoft Phi-3.5 mini | Microsoft · MIT license | Edge inference, tablet-resident deployments | 3.8B parameters, quantized to ~2 GB. Runs on a clinical workstation or tablet without a GPU. Used for ambient-agent scenarios where the latency budget is <500ms.
MedGemma | Google · open weights, healthcare-tuned | Clinical Q&A, evidence retrieval | Healthcare-domain pre-training improves clinical-terminology recall and reduces hallucination on medical queries vs. general models of the same size.
BioMistral 7B | Mistral 7B biomedical fine-tune · open weights | Biomedical literature retrieval, drug-interaction reasoning | Continued pre-training on biomedical corpora. Used in PA-RX for pharmacy-benefit reasoning and formulary-aware criteria evaluation.

Model selection is documented per-agent in the deployed rule-pack manifest. The selection is not static — we re-evaluate on each model release cycle (typically quarterly) and can swap a model behind an agent without API changes for the customer.
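
As a sketch of what that swap looks like mechanically — agents address a stable alias, and the manifest pins the concrete model behind it. The aliases, fields, and values below are illustrative, not the production manifest format:

    # Hypothetical rule-pack manifest excerpt. Agents call a stable alias;
    # the manifest pins model, quantization, and runtime behind it, so a
    # quarterly model swap is a manifest change, not an agent API change.
    MANIFEST = {
        "clinical_reasoning": {
            "model": "llama-3.1-70b-instruct",
            "quantization": "Q4_K_M",
            "runtime": "ollama",
        },
        "intake_routing": {
            "model": "llama-3.1-8b-instruct",
            "quantization": "Q8_0",
            "runtime": "vllm",
        },
    }

    def resolve(agent_surface: str) -> dict:
        # The only model reference an agent ever holds is its surface name.
        return MANIFEST[agent_surface]

    print(resolve("clinical_reasoning"))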

Inference patterns

Three frameworks, one model artifact.

A model is a file on disk. The inference framework is what serves it as an API to the agent. Aether One™ uses three, matched to deployment shape and latency budget. The model artifact itself is identical — only the runtime changes.

01 · Sovereign / on-prem

Ollama

Self-contained Go binary. Trivial to install, easy to update via signed model artifacts. The default for sovereign deployments where the operator wants minimal external dependency surface area. Supports GGUF format with hot model swap.

Used in: WISeR (CMS, NJ) · State Medicaid pilots
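
For flavor, a minimal call against Ollama's local REST API — by default it listens on localhost:11434, and nothing leaves the host. The model tag and prompt are illustrative:

    import requests

    # Generate against the local Ollama daemon; stream=False returns one JSON body.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:70b",  # assumes this tag has been pulled locally
            "prompt": "Summarize the submitted clinical documentation.",
            "stream": False,
        },
        timeout=120,
    )
    print(resp.json()["response"])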

02 · Production GPU cluster

vLLM

Throughput-optimized. PagedAttention memory management and continuous batching drive high requests-per-second on multi-tenant GPU clusters where latency budgets are bounded but volume is not. The default for hyperscaler-deployed HIP One.

Used in: Azure-deployed HIP One · Marketplace agents
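
A minimal offline-inference sketch using vLLM's Python API — the checkpoint path is illustrative and assumes an AWQ-quantized artifact on local disk:

    from vllm import LLM, SamplingParams

    # Load once; continuous batching then serves concurrent requests on the cluster.
    llm = LLM(model="/models/llama-3.1-70b-instruct-awq", quantization="awq")
    params = SamplingParams(temperature=0.0, max_tokens=512)

    prompts = ["Does the submitted documentation satisfy criterion 2b?"]
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)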

03 · Edge / ambient

llama.cpp

Plain C/C++ inference that runs on CPU or commodity hardware. Used for tablet-resident agents, clinical workstation copilots, and any scenario where the constraint is "no GPU available." Sub-second latency on Phi-3.5 mini quantized to Q4_K_M.

Used in: Ambient clinical agents · Tablet copilots
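
A CPU-only sketch using the llama-cpp-python bindings over a GGUF artifact — the path, context size, and prompt are illustrative:

    from llama_cpp import Llama

    # Load a 4-bit GGUF model entirely on CPU; no GPU required.
    llm = Llama(model_path="/models/phi-3.5-mini.Q4_K_M.gguf", n_ctx=4096)
    out = llm(
        "Classify this document: progress note, lab report, or imaging order?",
        max_tokens=32,
        temperature=0.0,
    )
    print(out["choices"][0]["text"])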

Quantization

From 140 GB to 40 GB without losing the doctor's intent.

A 70B-parameter model in full FP16 precision is ~140 GB. Most healthcare deployments don't have the GPU memory or the budget for that. Quantization compresses the model weights to lower precision — the right quantization keeps clinical-task accuracy nearly identical to the full-precision baseline.
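
The arithmetic behind the sizes in the table below is simple — parameter count times bytes per weight. The effective bits-per-weight figures in this sketch are approximations that fold in quantization block overhead:

    # Back-of-envelope memory footprint for a 70B-parameter model.
    params = 70e9

    bytes_per_weight = {
        "FP16": 2.0,       # 16 bits per weight
        "Q8_0": 1.0625,    # ~8.5 bits incl. per-block scales
        "Q4_K_M": 0.5625,  # ~4.5 bits effective (approximate)
    }

    for fmt, b in bytes_per_weight.items():
        print(f"{fmt}: ~{params * b / 1e9:.0f} GB")
    # FP16: ~140 GB · Q8_0: ~74 GB · Q4_K_M: ~39 GB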

Format | Size (Llama 70B) | When we use it | Tradeoff
FP16 (full precision) | ~140 GB | Reference baseline for evaluation | Highest accuracy. Impractical for most deployments — needs multiple H100s.
Q8_0 (8-bit) | ~70 GB | Production GPU clusters with H100/A100 capacity | ~99% of FP16 quality. Standard for hyperscaler-deployed HIP One.
Q4_K_M (4-bit, K-quants medium) | ~40 GB | Sovereign on-prem on commodity GPUs (A6000, L40S) | ~98% quality on clinical tasks. The sweet spot for sovereign deployments where GPU budget is fixed.
AWQ (activation-aware 4-bit) | ~38 GB | vLLM throughput-bound deployments | Slightly higher quality than GGUF Q4 at the same size; requires vLLM (not GGUF).

Quantization choice is recorded per-agent in the audit packet. If a regulator asks "what model produced this decision," the answer includes the exact quantization — not a vague "Llama 70B."

Deployment shapes

Same agents. Different infrastructure.

Aether One™ Sovereign is the most constrained deployment shape we support. Every other shape relaxes constraints in some direction. The model layer adapts — the agents do not.

Shape | Inference framework | Default model class | Notes
Sovereign on-prem | Ollama (or llama.cpp) | Llama 3.1 70B Q4_K_M | The reference shape. Air-gappable. No external network calls. Used by CMS WISeR.
Government cloud | vLLM | Llama 3.1 70B AWQ | Azure Government, AWS GovCloud. Customer-owned subscription, FedRAMP-aligned tenant.
Commercial cloud | vLLM | Llama 3.1 70B AWQ, or hosted Azure OpenAI | For customers who prefer cloud-vendor-managed model hosting. Same agent contracts.
Hybrid | vLLM (cloud) + Ollama (edge) | Mixed by agent | Routing agent on edge, reasoning agents in cloud. Used in ambient clinical scenarios.
Edge / tablet | llama.cpp | Phi-3.5 mini Q4_K_M | No GPU. Constrained by 16 GB device RAM. Used for clinical-workstation copilots.

Evaluation

Four metrics every agent is measured on.

Healthcare AI without measurement is just plausible-sounding text generation. Aether One™ agents are evaluated on the same four metrics, on holdout sets refreshed each rule-pack release. Numbers are visible to the customer in the audit packet, not buried in a marketing claim.

Metric 01

Clinical accuracy on holdout

Agent decision compared against gold-standard reviewer decision on a stratified holdout set of historical cases. Reviewer disagreement is itself a measured baseline — we don't claim accuracy higher than the inter-reviewer agreement on the same task.
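
A sketch of that ceiling rule — report the agent's holdout accuracy, capped at what a second human reviewer achieves against the same gold standard. The inputs and helper names are assumptions, not the production harness:

    def accuracy(pred: list[str], gold: list[str]) -> float:
        # Fraction of holdout cases where the decisions match.
        return sum(p == g for p, g in zip(pred, gold)) / len(gold)

    def reportable_accuracy(agent, gold, second_reviewer) -> float:
        # Never claim more than the inter-reviewer agreement on the same task.
        return min(accuracy(agent, gold), accuracy(second_reviewer, gold))

    gold  = ["affirm", "non-affirm", "non-affirm", "affirm"]
    agent = ["affirm", "non-affirm", "non-affirm", "affirm"]  # perfect match
    rev_b = ["affirm", "affirm", "non-affirm", "affirm"]      # 0.75 agreement
    print(reportable_accuracy(agent, gold, rev_b))  # 0.75, not 1.0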

Metric 02

Hallucination rate

For every retrieval-augmented response, we measure the rate at which the agent's cited evidence does not appear in the retrieved corpus. Target is <0.5%. The Clinical Query and Inference Layer (CQIL) is built around making this measurable.
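
One way to make the metric concrete, as a sketch — count a citation as grounded only if its span appears in the chunks the retriever actually returned. Verbatim containment is a simplification, and the field names are assumptions:

    def hallucination_rate(responses: list[dict]) -> float:
        # Fraction of cited spans absent from the retrieved corpus.
        cited = missing = 0
        for r in responses:
            corpus = " ".join(r["retrieved_chunks"])
            for span in r["cited_spans"]:
                cited += 1
                if span not in corpus:
                    missing += 1
        return missing / cited if cited else 0.0

    # Release gate, per the target above: hallucination_rate(batch) < 0.005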

Metric 03

False-affirm rate

The most consequential error class for prior authorization — an automated approval that should have been a non-affirm. Because auto-deny is architecturally prohibited, auto-affirm is the only auto-decision class that exists, and we measure its error rate on every rule-pack release.
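
As a sketch (field names assumed): among auto-affirmed cases, the fraction a gold-standard reviewer would have marked non-affirm:

    def false_affirm_rate(decisions: list[dict]) -> float:
        # Auto-deny does not exist by design, so auto-affirm is the only
        # auto-decision class whose error rate needs measuring.
        auto_affirms = [d for d in decisions if d["auto_decision"] == "affirm"]
        wrong = sum(d["gold"] == "non-affirm" for d in auto_affirms)
        return wrong / len(auto_affirms) if auto_affirms else 0.0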

Metric 04

P95 latency

Wall-clock time from input received to decision returned, at the 95th percentile across production traffic. Distinct budgets per agent class — eligibility check <2s, full clinical reasoning <30s, ambient copilot <500ms.
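
A sketch of the measurement, using the nearest-rank percentile definition and the per-class budgets quoted above (the class keys are illustrative):

    import math

    def p95(latencies_ms: list[float]) -> float:
        # Nearest-rank 95th percentile over production wall-clock samples.
        s = sorted(latencies_ms)
        return s[math.ceil(0.95 * len(s)) - 1]

    BUDGETS_MS = {
        "eligibility_check": 2_000,
        "clinical_reasoning": 30_000,
        "ambient_copilot": 500,
    }

    def within_budget(agent_class: str, latencies_ms: list[float]) -> bool:
        return p95(latencies_ms) <= BUDGETS_MS[agent_class]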

Specific numbers per agent surface are shared under NDA during evaluation engagements. We don't publish them on the marketing site — the numbers depend on the customer's data distribution, not just our model choice.

Engineered, not assumed.

If you're evaluating Aether One™ on engineering rigor, this is the page that says we have it. The deeper conversation — specific eval numbers, your data distribution, your deployment shape — happens under NDA with our engineering team.

Talk to engineering · Sovereign deployment shape