Open weights.
Sovereign inference.
Healthcare runs on regulated infrastructure. Aether One™ runs on models we can host inside that infrastructure — open-weight foundation models from Meta, Mistral, Microsoft, and Google, plus healthcare-specific open releases like MedGemma and BioMistral. The model layer is engineered, not assumed.
In production across Aether One™ agent surfaces
Sovereign deployments call no external inference APIs
Ollama, vLLM, llama.cpp — matched to deployment shape
Four properties closed APIs can't deliver.
For payer, provider, and government healthcare, the right model is the one you can host where the data lives, audit when the regulator asks, and run at a cost you can predict at production volume. Closed model APIs deliver none of those.
Data stays put
Inference happens on infrastructure the customer owns. PHI never crosses a network boundary to a third-party model API. Required for sovereign deployments and CMS-aligned environments.
Predictable economics
No per-token pricing surprise at production volume. Cost is GPU-hours, not API charges. A 70B model running on an H100 costs the same on day one as on day one thousand.
Deployable anywhere
Same model artifact runs on Azure Government, AWS GovCloud, on-prem GPU rack, air-gapped DoD-adjacent system. The model doesn't know or care — the inference framework abstracts it.
Reconstructable evidence
Specific model version, weights hash, and inference parameters are recorded in every audit packet. Reviewers can replay any historical decision exactly — impossible with a closed API that may have silently versioned underneath you.
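As a minimal sketch of what that replay evidence can look like (the field names and helper below are illustrative, not the actual Aether One™ audit schema):

```python
# Illustrative sketch only: field names are hypothetical, not the Aether One audit schema.
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class InferenceEvidence:
    agent: str            # which agent surface produced the decision
    model: str            # exact model identifier, e.g. "llama-3.1-70b-instruct"
    quantization: str     # e.g. "Q4_K_M"
    weights_sha256: str   # hash of the weights file that was actually served
    prompt_sha256: str    # hash of the fully rendered prompt
    temperature: float
    seed: int
    timestamp: str

def record_evidence(agent: str, model: str, quantization: str,
                    weights_path: str, prompt: str,
                    temperature: float, seed: int) -> InferenceEvidence:
    """Capture everything a reviewer needs to replay this inference exactly."""
    h = hashlib.sha256()
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return InferenceEvidence(
        agent=agent,
        model=model,
        quantization=quantization,
        weights_sha256=h.hexdigest(),
        prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest(),
        temperature=temperature,
        seed=seed,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
```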
What we run, and where.
Aether One™ agents are not single-model. Each agent surface uses the model best matched to its task — latency budget, reasoning depth, structured-output reliability, healthcare-domain pre-training. Below is the production matrix as of 2026.
| Model family | Provenance | Where it runs | Why it's there |
|---|---|---|---|
| Llama 3.1 70B · instruct | Meta · open weights | Clinical reasoning agents, criteria decomposition | Best open-weight model for long-context clinical reasoning. Strong at multi-step synthesis required by NCD/LCD criteria atomization. |
| Llama 3.1 8B · instruct | Meta · open weights | Routing, intake parsing, fast classification | Sub-second latency on commodity GPU. Used wherever the agent's job is to route, classify, or extract structured data from a clean input. |
| Mixtral 8x7B | Mistral AI · Apache 2.0 | Multi-language correspondence, dense classification | Mixture-of-experts architecture gives strong multilingual performance for member correspondence and benefit translations. |
| Gemma 4 · 31B Dense | Google DeepMind · Apache 2.0 | Structured-output agents, JSON schema enforcement | Native function-calling and structured JSON output are baked into the architecture — not bolted on via prompt engineering. 256K context. Used where agent output must conform to a strict JSON schema before being passed to a downstream system. |
| Gemma 4 · 26B MoE | Google DeepMind · Apache 2.0 | Rule-pack generation, code-as-policy | Mixture-of-experts architecture activates only 3.8B parameters per token — predictable inference cost for high-throughput code-reasoning workloads. Used internally for translating regulatory text into executable rule logic. |
| Microsoft Phi-3.5 mini | Microsoft · MIT license | Edge inference, tablet-resident deployments | 3.8B parameters quantized to 2GB. Runs on a clinical workstation or tablet without a GPU. Used for ambient-agent scenarios where latency budget is <500ms. |
| MedGemma | Google · open weights, healthcare-tuned | Clinical Q&A, evidence retrieval | Healthcare-domain pre-training improves clinical-terminology recall and reduces hallucination on medical-specific queries vs. general models of the same size. |
| BioMistral 7B | Open-source biomedical fine-tune | Biomedical literature retrieval, drug interaction reasoning | Continued pre-training on biomedical corpora. Used in PA-RX for pharmacy benefit reasoning and formulary-aware criteria evaluation. |
Model selection is documented per-agent in the deployed rule-pack manifest. The selection is not static — we re-evaluate on each model release cycle (typically quarterly) and can swap a model behind an agent without API changes for the customer.
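A minimal sketch of the idea behind per-agent model bindings; the agent names, fields, and lookup helper are hypothetical, not the deployed manifest format:

```python
# Hypothetical per-agent model bindings; illustrative of the idea, not the deployed manifest format.
AGENT_MODEL_MANIFEST = {
    "criteria-decomposition": {
        "model": "llama-3.1-70b-instruct",
        "quantization": "Q4_K_M",
        "runtime": "ollama",
    },
    "intake-routing": {
        "model": "llama-3.1-8b-instruct",
        "quantization": "Q8_0",
        "runtime": "vllm",
    },
}

def resolve_model(agent_surface: str) -> dict:
    """Swapping the model behind an agent is a manifest edit, not an API change for the customer."""
    return AGENT_MODEL_MANIFEST[agent_surface]
```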
Three frameworks, one model artifact.
A model is a file on disk. The inference framework is what serves it as an API to the agent. Aether One™ uses three, matched to deployment shape and latency budget. The model artifact itself is identical — only the runtime changes.
Ollama
Self-contained Go binary. Trivial to install, easy to update via signed model artifacts. The default for sovereign deployments where the operator wants minimal external dependency surface area. Supports GGUF format with hot model swap.
Used in: WISeR (CMS, NJ) · State Medicaid pilots
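A minimal sketch of an agent calling a locally hosted model through Ollama's HTTP API, assuming a daemon on its default port; the model tag and prompt are placeholders:

```python
# Minimal sketch: calling a locally hosted model through Ollama's HTTP API.
# Assumes an Ollama daemon on its default port; nothing leaves the host.
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "llama3.1:70b") -> str:
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0},  # deterministic-leaning output for auditability
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```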
vLLM
Throughput-optimized. PagedAttention memory management and continuous batching drive high requests-per-second on multi-tenant GPU clusters where latency budgets are bounded but volume is not. The default for hyperscaler-deployed HIP One.
Used in: Azure-deployed HIP One · Marketplace agents
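A minimal sketch of the same model artifact served through vLLM's offline engine; the checkpoint path is a placeholder, and production deployments typically sit behind vLLM's OpenAI-compatible server rather than the offline API:

```python
# Minimal sketch of vLLM offline inference; the checkpoint path is a placeholder.
# The checkpoint must already be AWQ-quantized for quantization="awq" to apply.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.1-70b-instruct-awq",  # placeholder path
    quantization="awq",
    tensor_parallel_size=2,
)
params = SamplingParams(temperature=0.0, max_tokens=512)

# Continuous batching: vLLM schedules all pending prompts together on the GPU.
outputs = llm.generate(["First request...", "Second request..."], params)
for out in outputs:
    print(out.outputs[0].text)
```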
llama.cpp
Pure-C++ inference, runs on CPU or commodity hardware. Used for tablet-resident agents, clinical workstation copilots, and any scenario where the constraint is "no GPU available." Sub-second latency on Phi-3.5 mini quantized to Q4_K_M.
Used in: Ambient clinical agents · Tablet copilots
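A minimal sketch of CPU-only inference through llama.cpp's Python bindings (llama-cpp-python); the GGUF path is a placeholder:

```python
# Minimal sketch of CPU-only inference via llama-cpp-python; the GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3.5-mini-instruct-Q4_K_M.gguf",  # ~2 GB on disk, no GPU required
    n_ctx=4096,
    n_threads=8,
)
result = llm("Summarize the visit note in one sentence:", max_tokens=128, temperature=0.0)
print(result["choices"][0]["text"])
```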
From 140 GB to 40 GB without losing the doctor's intent.
A 70B-parameter model in full FP16 precision is ~140GB. Most healthcare deployments don't have the GPU memory or the budget for that. Quantization compresses the model weights to lower precision — the right quantization keeps clinical-task accuracy nearly identical to the full-precision baseline.
| Format | Size (Llama 70B) | When we use it | Tradeoff |
|---|---|---|---|
| FP16 (full precision) | ~140 GB | Reference baseline for evaluation | Highest accuracy. Impractical for most deployments; requires multiple H100s. |
| Q8_0 (8-bit) | ~70 GB | Production GPU clusters with H100/A100 capacity | ~99% of FP16 quality. Standard for hyperscaler-deployed HIP One. |
| Q4_K_M (4-bit, K-quants medium) | ~40 GB | Sovereign on-prem on commodity GPUs (A6000, L40S) | ~98% quality on clinical tasks. The sweet spot for sovereign deployments where GPU budget is fixed. |
| AWQ (Activation-aware 4-bit) | ~38 GB | vLLM throughput-bound deployments | Slightly higher quality than GGUF Q4 at the same size. Not a GGUF format, so it is served by vLLM rather than Ollama or llama.cpp. |
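The size column is driven by simple bytes-per-weight arithmetic. A back-of-envelope check, using approximate effective bit-widths and ignoring KV cache and per-format overhead:

```python
# Back-of-envelope check of the size column: parameters x bits-per-weight / 8.
# Bit-widths are approximate effective rates; ignores KV cache and format overhead.
PARAMS = 70e9  # Llama 3.1 70B

def approx_size_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

for fmt, bits in [("FP16", 16.0), ("Q8_0", 8.0), ("Q4_K_M", 4.6), ("AWQ 4-bit", 4.3)]:
    print(f"{fmt:10s} ~{approx_size_gb(bits):.0f} GB")
# FP16 ~140 GB, Q8_0 ~70 GB, Q4_K_M ~40 GB, AWQ ~38 GB
```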
Quantization choice is recorded per-agent in the audit packet. If a regulator asks "what model produced this decision," the answer includes the exact quantization — not a vague "Llama 70B."
Same agents. Different infrastructure.
Aether One™ Sovereign is the most constrained deployment shape we support. Every other shape relaxes constraints in some direction. The model layer adapts — the agents do not.
| Shape | Inference framework | Default model class | Notes |
|---|---|---|---|
| Sovereign on-prem | Ollama (or llama.cpp) | Llama 3.1 70B Q4_K_M | The reference shape. Air-gappable. No external network calls. Used by CMS WISeR. |
| Government cloud | vLLM | Llama 3.1 70B AWQ | Azure Government, AWS GovCloud. Customer-owned subscription, FedRAMP-aligned tenant. |
| Commercial cloud | vLLM | Llama 3.1 70B AWQ · or hosted Azure OpenAI | For customers who prefer cloud-vendor-managed model hosting. Same agent contracts. |
| Hybrid | vLLM (cloud) + Ollama (edge) | Mixed by agent | Routing agent on edge, reasoning agents in cloud. Used in ambient clinical scenarios. |
| Edge / tablet | llama.cpp | Phi-3.5 mini Q4_K_M | No GPU. Constrained by 16GB device RAM. Used for clinical-workstation copilots. |
Four metrics every agent is measured on.
Healthcare AI without measurement is just plausible-sounding text generation. Aether One™ agents are evaluated on the same four metrics, on holdout sets refreshed each rule-pack release. Numbers are visible to the customer in the audit packet, not buried in a marketing claim.
Clinical accuracy on holdout
Agent decision compared against gold-standard reviewer decision on a stratified holdout set of historical cases. Reviewer disagreement is itself a measured baseline — we don't claim accuracy higher than the inter-reviewer agreement on the same task.
Hallucination rate
For every retrieval-augmented response, we measure the rate at which the agent's cited evidence does not appear in the retrieved corpus. Target is <0.5%. The Clinical Query and Inference Layer (CQIL) is built around making this measurable.
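A minimal sketch of the grounding check behind this metric; the function and field names are illustrative, not the CQIL API, and a production check would normalize text rather than require verbatim matches:

```python
# Illustrative grounding check; field names are hypothetical, not the CQIL API.
def citations_grounded(cited: list[str], retrieved: list[str]) -> bool:
    """Every passage the agent cites must actually appear in the retrieved corpus."""
    corpus = "\n".join(retrieved)
    return all(passage in corpus for passage in cited)

def hallucination_rate(responses: list[dict]) -> float:
    """Share of retrieval-augmented responses whose citations are not grounded."""
    ungrounded = sum(1 for r in responses
                     if not citations_grounded(r["citations"], r["retrieved"]))
    return ungrounded / len(responses)
```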
False-affirm rate
The most consequential error class for prior authorization: an automated approval that should have been a non-affirm. Because auto-deny is architecturally prohibited, auto-affirm is the only automated decision class that exists, and we measure its error rate on every rule-pack release.
P95 latency
Wall-clock time from input received to decision returned, at the 95th percentile across production traffic. Distinct budgets per agent class — eligibility check <2s, full clinical reasoning <30s, ambient copilot <500ms.
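A minimal sketch of how the last two metrics fall out of a decision log; the record fields are illustrative:

```python
# Illustrative computation of false-affirm rate and P95 latency from a decision log.
# Record fields ("auto_decision", "gold_decision", "latency_ms") are hypothetical names.
import math

def false_affirm_rate(decisions: list[dict]) -> float:
    """Among automated affirms, the share a gold-standard reviewer marked non-affirm."""
    auto_affirms = [d for d in decisions if d["auto_decision"] == "affirm"]
    wrong = sum(1 for d in auto_affirms if d["gold_decision"] == "non-affirm")
    return wrong / len(auto_affirms) if auto_affirms else 0.0

def p95_latency_ms(decisions: list[dict]) -> float:
    """95th-percentile wall-clock latency, nearest-rank method."""
    ordered = sorted(d["latency_ms"] for d in decisions)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]
```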
Specific numbers per agent surface are shared under NDA during evaluation engagements. We don't publish them on the marketing site — the numbers depend on the customer's data distribution, not just our model choice.
Engineered, not assumed.
If you're evaluating Aether One™ on engineering rigor, this is the page that says we have it. The deeper conversation — specific eval numbers, your data distribution, your deployment shape — happens under NDA with our engineering team.