Open weights.
Sovereign inference.
Healthcare runs on regulated infrastructure. Aether One™ runs on models we can host inside that infrastructure — open-weight foundation models from Meta, Mistral, Microsoft, and Google, plus healthcare-specific open releases like MedGemma and BioMistral. The model layer is engineered, not assumed.
In production across Aether One™ agent surfaces
Sovereign deployments call no external inference APIs
Ollama, vLLM, llama.cpp — matched to deployment shape
Four properties closed APIs can't deliver.
For payer, provider, and government healthcare, the right model is the one you can host where the data lives, audit when the regulator asks, and run at a cost you can predict at production volume. Closed model APIs deliver none of those.
Data stays put
Inference happens on infrastructure the customer owns. PHI never crosses a network boundary to a third-party model API. Required for sovereign deployments and CMS-aligned environments.
Predictable economics
No per-token pricing surprise at production volume. Cost is GPU-hours, not API charges. A 70B model running on an H100 costs the same on day one as on day one thousand.
Deployable anywhere
Same model artifact runs on Azure Government, AWS GovCloud, on-prem GPU rack, air-gapped DoD-adjacent system. The model doesn't know or care — the inference framework abstracts it.
Reconstructable evidence
Specific model version, weights hash, and inference parameters are recorded in every audit packet. Reviewers can replay any historical decision exactly — impossible with a closed API that may have silently versioned underneath you.
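As a minimal sketch of what that replay evidence can look like (the field names and helper below are illustrative, not the actual Aether One™ audit schema):

```python
# Illustrative sketch only: field names are hypothetical, not the Aether One audit schema.
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class InferenceEvidence:
    agent: str            # which agent surface produced the decision
    model: str            # exact model identifier, e.g. "llama-3.1-70b-instruct"
    quantization: str     # e.g. "Q4_K_M"
    weights_sha256: str   # hash of the weights file that was actually served
    prompt_sha256: str    # hash of the fully rendered prompt
    temperature: float
    seed: int
    timestamp: str

def record_evidence(agent: str, model: str, quantization: str,
                    weights_path: str, prompt: str,
                    temperature: float, seed: int) -> InferenceEvidence:
    """Capture everything a reviewer needs to replay this inference exactly."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return InferenceEvidence(
        agent=agent,
        model=model,
        quantization=quantization,
        weights_sha256=h.hexdigest(),
        prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest(),
        temperature=temperature,
        seed=seed,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
```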
What we run, and where.
Aether One™ agents are not single-model. Each agent surface uses the model best matched to its task — latency budget, reasoning depth, structured-output reliability, healthcare-domain pre-training. Below is the production matrix as of 2026.
| Model family | Provenance | Where it runs | Why it's there |
|---|---|---|---|
| Llama 3.1 70B · instruct | Meta · open weights | Clinical reasoning agents, criteria decomposition | Best open-weight model for long-context clinical reasoning. Strong at multi-step synthesis required by NCD/LCD criteria atomization. |
| Llama 3.1 8B · instruct | Meta · open weights | Routing, intake parsing, fast classification | Sub-second latency on commodity GPU. Used wherever the agent's job is to route, classify, or extract structured data from a clean input. |
| Mixtral 8x7B | Mistral AI · Apache 2.0 | Multi-language correspondence, dense classification | Mixture-of-experts architecture gives strong multilingual performance for member correspondence and benefit translations. |
| Gemma 4 · 31B Dense | Google DeepMind · Apache 2.0 | Structured-output agents, JSON schema enforcement | Native function-calling and structured JSON output are baked into the architecture — not bolted on via prompt engineering. 256K context. Used where agent output must conform to a strict JSON schema before being passed to a downstream system. |
| Gemma 4 · 26B MoE | Google DeepMind · Apache 2.0 | Rule-pack generation, code-as-policy | Mixture-of-experts architecture activates only 3.8B parameters per token — predictable inference cost for high-throughput code-reasoning workloads. Used internally for translating regulatory text into executable rule logic. |
| Microsoft Phi-3.5 mini | Microsoft · MIT license | Edge inference, tablet-resident deployments | 3.8B parameters quantized to 2GB. Runs on a clinical workstation or tablet without a GPU. Used for ambient-agent scenarios where latency budget is <500ms. |
| MedGemma | Google · open weights, healthcare-tuned | Clinical Q&A, evidence retrieval | Healthcare-domain pre-training improves clinical-terminology recall and reduces hallucination on medical-specific queries vs. general models of the same size. |
| BioMistral 7B | Open-source biomedical fine-tune | Biomedical literature retrieval, drug interaction reasoning | Continued pre-training on biomedical corpora. Used in PA-RX for pharmacy benefit reasoning and formulary-aware criteria evaluation. |
Model selection is documented per-agent in the deployed rule-pack manifest. The selection is not static — we re-evaluate on each model release cycle (typically quarterly) and can swap a model behind an agent without API changes for the customer.
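A minimal sketch of the idea behind per-agent model bindings; the agent names, fields, and lookup helper are hypothetical, not the deployed manifest format:

```python
# Hypothetical per-agent model bindings; illustrative of the idea, not the deployed manifest format.
AGENT_MODEL_MANIFEST = {
    "criteria-decomposition": {
        "model": "llama-3.1-70b-instruct",
        "quantization": "Q4_K_M",
        "runtime": "ollama",
    },
    "intake-routing": {
        "model": "llama-3.1-8b-instruct",
        "quantization": "Q8_0",
        "runtime": "vllm",
    },
}

def resolve_model(agent_surface: str) -> dict:
    """Swapping the model behind an agent is a manifest edit, not an API change for the customer."""
    return AGENT_MODEL_MANIFEST[agent_surface]
```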
Three frameworks, one model artifact.
A model is a file on disk. The inference framework is what serves it as an API to the agent. Aether One™ uses three, matched to deployment shape and latency budget. The model artifact itself is identical — only the runtime changes.
Ollama
Self-contained Go binary. Trivial to install, easy to update via signed model artifacts. The default for sovereign deployments where the operator wants minimal external dependency surface area. Supports GGUF format with hot model swap.
Used in: WISeR (CMS, NJ) · State Medicaid pilots
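A minimal sketch of an agent calling a locally hosted model through Ollama's HTTP API, assuming a daemon on its default port; the model tag and prompt are placeholders:

```python
# Minimal sketch: calling a locally hosted model through Ollama's HTTP API.
# Assumes an Ollama daemon on its default port; nothing leaves the host.
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "llama3.1:70b") -> str:
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0},  # deterministic-leaning output for auditability
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```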
vLLM
Throughput-optimized. PagedAttention memory management and continuous batching drive high requests-per-second on multi-tenant GPU clusters where latency budgets are bounded but volume is not. The default for hyperscaler-deployed HIP One.
Used in: Azure-deployed HIP One · Marketplace agents
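A minimal sketch of the same model artifact served through vLLM's offline engine; the checkpoint path is a placeholder, and production deployments typically sit behind vLLM's OpenAI-compatible server rather than the offline API:

```python
# Minimal sketch of vLLM offline inference; the checkpoint path is a placeholder.
# The checkpoint must already be AWQ-quantized for quantization="awq" to apply.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.1-70b-instruct-awq",  # placeholder path
    quantization="awq",
    tensor_parallel_size=2,
)
params = SamplingParams(temperature=0.0, max_tokens=512)

# Continuous batching: vLLM schedules all pending prompts together on the GPU.
outputs = llm.generate(["First request...", "Second request..."], params)
for out in outputs:
    print(out.outputs[0].text)
```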
llama.cpp
Pure-C++ inference, runs on CPU or commodity hardware. Used for tablet-resident agents, clinical workstation copilots, and any scenario where the constraint is "no GPU available." Sub-second latency on Phi-3.5 mini quantized to Q4_K_M.
Used in: Ambient clinical agents · Tablet copilots
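A minimal sketch of CPU-only inference through llama.cpp's Python bindings (llama-cpp-python); the GGUF path is a placeholder:

```python
# Minimal sketch of CPU-only inference via llama-cpp-python; the GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3.5-mini-instruct-Q4_K_M.gguf",  # ~2 GB on disk, no GPU required
    n_ctx=4096,
    n_threads=8,
)
result = llm("Summarize the visit note in one sentence:", max_tokens=128, temperature=0.0)
print(result["choices"][0]["text"])
```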
From 140 GB to 40 GB without losing the doctor's intent.
A 70B-parameter model in full FP16 precision is ~140GB. Most healthcare deployments don't have the GPU memory or the budget for that. Quantization compresses the model weights to lower precision — the right quantization keeps clinical-task accuracy nearly identical to the full-precision baseline.
| Format | Size (Llama 70B) | When we use it | Tradeoff |
|---|---|---|---|
| FP16 (full precision) | ~140 GB | Reference baseline for evaluation | Highest accuracy. Impractical for most deployments; requires multiple H100s. |
| Q8_0 (8-bit) | ~70 GB | Production GPU clusters with H100/A100 capacity | ~99% of FP16 quality. Standard for hyperscaler-deployed HIP One. |
| Q4_K_M (4-bit, K-quants medium) | ~40 GB | Sovereign on-prem on commodity GPUs (A6000, L40S) | ~98% quality on clinical tasks. The sweet spot for sovereign deployments where GPU budget is fixed. |
| AWQ (Activation-aware 4-bit) | ~38 GB | vLLM throughput-bound deployments | Slightly higher quality than GGUF Q4 at the same size. Not a GGUF format, so it is served by vLLM rather than Ollama or llama.cpp. |
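The size column is driven by simple bytes-per-weight arithmetic. A back-of-envelope check, using approximate effective bit-widths and ignoring KV cache and per-format overhead:

```python
# Back-of-envelope check of the size column: parameters x bits-per-weight / 8.
# Bit-widths are approximate effective rates; ignores KV cache and format overhead.
PARAMS = 70e9  # Llama 3.1 70B

def approx_size_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

for fmt, bits in [("FP16", 16.0), ("Q8_0", 8.0), ("Q4_K_M", 4.6), ("AWQ 4-bit", 4.3)]:
    print(f"{fmt:10s} ~{approx_size_gb(bits):.0f} GB")
# FP16 ~140 GB, Q8_0 ~70 GB, Q4_K_M ~40 GB, AWQ ~38 GB
```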
Quantization choice is recorded per-agent in the audit packet. If a regulator asks "what model produced this decision," the answer includes the exact quantization — not a vague "Llama 70B."
Same agents. Different infrastructure.
Aether One™ Sovereign is the most constrained deployment shape we support. Every other shape relaxes constraints in some direction. The model layer adapts — the agents do not.
| Shape | Inference framework | Default model class | Notes |
|---|---|---|---|
| Sovereign on-prem | Ollama (or llama.cpp) | Llama 3.1 70B Q4_K_M | The reference shape. Air-gappable. No external network calls. Used by CMS WISeR. |
| Government cloud | vLLM | Llama 3.1 70B AWQ | Azure Government, AWS GovCloud. Customer-owned subscription, FedRAMP-aligned tenant. |
| Commercial cloud | vLLM | Llama 3.1 70B AWQ · or hosted Azure OpenAI | For customers who prefer cloud-vendor-managed model hosting. Same agent contracts. |
| Hybrid | vLLM (cloud) + Ollama (edge) | Mixed by agent | Routing agent on edge, reasoning agents in cloud. Used in ambient clinical scenarios. |
| Edge / tablet | llama.cpp | Phi-3.5 mini Q4_K_M | No GPU. Constrained by 16GB device RAM. Used for clinical-workstation copilots. |
Four metrics every agent is measured on.
Healthcare AI without measurement is just plausible-sounding text generation. Aether One™ agents are evaluated on the same four metrics, on holdout sets refreshed each rule-pack release. Numbers are visible to the customer in the audit packet, not buried in a marketing claim.
Clinical accuracy on holdout
Agent decision compared against gold-standard reviewer decision on a stratified holdout set of historical cases. Reviewer disagreement is itself a measured baseline — we don't claim accuracy higher than the inter-reviewer agreement on the same task.
Hallucination rate
For every retrieval-augmented response, we measure the rate at which the agent's cited evidence does not appear in the retrieved corpus. Target is <0.5%. The Clinical Query and Inference Layer (CQIL) is built around making this measurable.
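A minimal sketch of the grounding check behind this metric; the function and field names are illustrative, not the CQIL API, and a production check would normalize text rather than require verbatim matches:

```python
# Illustrative grounding check; field names are hypothetical, not the CQIL API.
def citations_grounded(cited: list[str], retrieved: list[str]) -> bool:
    """Every passage the agent cites must actually appear in the retrieved corpus."""
    corpus = "\n".join(retrieved)
    return all(passage in corpus for passage in cited)

def hallucination_rate(responses: list[dict]) -> float:
    """Share of retrieval-augmented responses whose citations are not grounded."""
    ungrounded = sum(1 for r in responses
                     if not citations_grounded(r["citations"], r["retrieved"]))
    return ungrounded / len(responses)
```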
False-affirm rate
The most consequential error class for prior authorization: an automated approval that should have been a non-affirm. Because auto-deny is architecturally prohibited, auto-affirm is the only automated decision class that exists, and we measure its error rate on every rule-pack release.
P95 latency
Wall-clock time from input received to decision returned, at the 95th percentile across production traffic. Distinct budgets per agent class — eligibility check <2s, full clinical reasoning <30s, ambient copilot <500ms.
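A minimal sketch of how the last two metrics fall out of a decision log; the record fields are illustrative:

```python
# Illustrative computation of false-affirm rate and P95 latency from a decision log.
# Record fields ("auto_decision", "gold_decision", "latency_ms") are hypothetical names.
import math

def false_affirm_rate(decisions: list[dict]) -> float:
    """Among automated affirms, the share a gold-standard reviewer marked non-affirm."""
    auto_affirms = [d for d in decisions if d["auto_decision"] == "affirm"]
    wrong = sum(1 for d in auto_affirms if d["gold_decision"] == "non-affirm")
    return wrong / len(auto_affirms) if auto_affirms else 0.0

def p95_latency_ms(decisions: list[dict]) -> float:
    """95th-percentile wall-clock latency, nearest-rank method."""
    ordered = sorted(d["latency_ms"] for d in decisions)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]
```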
Specific numbers per agent surface are shared under NDA during evaluation engagements. We don't publish them on the marketing site — the numbers depend on the customer's data distribution, not just our model choice.
Engineered, not assumed.
If you're evaluating Aether One™ on engineering rigor, this is the page that says we have it. The deeper conversation — specific eval numbers, your data distribution, your deployment shape — happens under NDA with our engineering team.