Open weights.
Sovereign inference.
Healthcare runs on regulated infrastructure. Aether One™ runs on models we can host inside that infrastructure — open-weight foundation models from Meta, Mistral, Microsoft, and Google, plus healthcare-specific open releases like MedGemma and BioMistral. The model layer is engineered, not assumed.
In production across Aether One™ agent surfaces
Sovereign deployments call no external inference APIs
Ollama, vLLM, llama.cpp — matched to deployment shape
Four properties closed APIs can't deliver.
For payer, provider, and government healthcare, the right model is the one you can host where the data lives, audit when the regulator asks, and cost-predict at production volume. Closed model APIs deliver none of those.
Data stays put
Inference happens on infrastructure the customer owns. PHI never crosses a network boundary to a third-party model API. Required for sovereign deployments and CMS-aligned environments.
Predictable economics
No per-token pricing surprise at production volume. Cost is GPU-hours, not API charges. A 70B model running on an H100 costs the same on day one as on day one thousand.
Deployable anywhere
Same model artifact runs on Azure Government, AWS GovCloud, on-prem GPU rack, air-gapped DoD-adjacent system. The model doesn't know or care — the inference framework abstracts it.
Reconstructable evidence
Specific model version, weights hash, and inference parameters are recorded in every audit packet. Reviewers can replay any historical decision exactly — impossible with a closed API that may have silently versioned underneath you.
What we run, and where.
Healthcare agents are not single-model. Each agent surface uses the model best matched to its task — latency budget, reasoning depth, structured-output reliability, healthcare-domain pre-training. Below is the production matrix as of 2026.
| Model family | Provenance | Where it runs | Why it's there |
|---|---|---|---|
| Llama 3.1 70B · instruct | Meta · open weights | Clinical reasoning agents, criteria decomposition | Best open-weight model for long-context clinical reasoning. Strong at multi-step synthesis required by NCD/LCD criteria atomization. |
| Llama 3.1 8B · instruct | Meta · open weights | Routing, intake parsing, fast classification | Sub-second latency on commodity GPU. Used wherever the agent's job is to route, classify, or extract structured data from a clean input. |
| Mixtral 8x7B | Mistral AI · Apache 2.0 | Multi-language correspondence, dense classification | Mixture-of-experts architecture gives strong multilingual performance for member correspondence and benefit translations. |
| Gemma 4 · 31B Dense | Google DeepMind · Apache 2.0 | Structured-output agents, JSON schema enforcement | Native function-calling and structured JSON output are baked into the architecture — not bolted on via prompt engineering. 256K context. Used where agent output must conform to a strict JSON schema before being passed to a downstream system. |
| Gemma 4 · 26B MoE | Google DeepMind · Apache 2.0 | Rule-pack generation, code-as-policy | Mixture-of-experts architecture activates only 3.8B parameters per token — predictable inference cost for high-throughput code-reasoning workloads. Used internally for translating regulatory text into executable rule logic. |
| Microsoft Phi-3.5 mini | Microsoft · MIT license | Edge inference, tablet-resident deployments | 3.8B parameters quantized to 2GB. Runs on a clinical workstation or tablet without a GPU. Used for ambient-agent scenarios where latency budget is <500ms. |
| MedGemma | Google · open weights, healthcare-tuned | Clinical Q&A, evidence retrieval | Healthcare-domain pre-training improves clinical-terminology recall and reduces hallucination on medical-specific queries vs. general models of the same size. |
| BioMistral 7B | Open-source biomedical fine-tune | Biomedical literature retrieval, drug interaction reasoning | Continued pre-training on biomedical corpora. Used in PA-RX for pharmacy benefit reasoning and formulary-aware criteria evaluation. |
Model selection is documented per-agent in the deployed rule-pack manifest. The selection is not static — we re-evaluate on each model release cycle (typically quarterly) and can swap a model behind an agent without API changes for the customer.
The CliniGuard family — open-weight clinical NER on Hugging Face.
We don’t just run open-weight models inside the perimeter — we publish them outward. The CliniGuard family is the first two: a vitals-extraction model and a PHI/PII de-identification model. Both Apache-2.0. Both Bio_ClinicalBERT fine-tunes. Both auditable by any team that wants to inspect what’s deciding what stays and what leaves.
De-identification across 20 HIPAA Safe Harbor categories.
Detection and de-identification of Protected Health Information and Personally Identifiable Information in clinical text. The primitive that sits between local processing and any egress path — auditable, open, and reproducible.
| Base model | Bio_ClinicalBERT |
| Architecture | BertForTokenClassification |
| Parameters | ~110M |
| Tagging scheme | BIO · 41 labels (20 PHI types) |
| Max sequence length | 512 tokens |
| License | Apache-2.0 |
| Headline F1 | Micro 0.9695 · Macro 0.9656 |
Training data: Genzeon-curated proprietary clinical NER dataset spanning diverse healthcare note formats. Recommended for human-in-the-loop pairing in high-stakes de-identification workflows. English-only at present; multilingual support on the Genzeon Platforms roadmap.
Vital-sign and measurement extraction across 15 categories.
Automated extraction of vital signs, body measurements, and physiological parameters from clinical text — nursing notes, ED triage, and progress notes. Turns prose into structured clinical signal the rest of the pipeline can act on.
| Base model | Bio_ClinicalBERT |
| Architecture | BertForTokenClassification |
| Parameters | ~110M |
| Tagging scheme | BIO · 31 labels (15 vital categories) |
| Max sequence length | 512 tokens |
| License | Apache-2.0 |
| Headline F1 | Micro 1.0000 · Macro 1.0000 * |
Training data: Genzeon-curated proprietary clinical vital-signs NER dataset across nursing notes, ED triage, and progress notes. Trained 15 epochs with early stopping; best model selected by entity-level F1.
* Evaluated on a held-out split of the proprietary training distribution. External-distribution performance has not yet been benchmarked — community evaluation on public clinical NER benchmarks is welcomed and feedback channels are open on the model card.
Auditable primitives, not black-box claims.
Sovereign AI in healthcare requires primitives that customers can inspect before they touch a single record. De-identification and clinical extraction are load-bearing components — the kind that, if they fail silently, take compliance budgets with them. Publishing them under Apache-2.0 on Hugging Face, with named base models and reproducible architectures, is the posture this work requires. Our focus on auditable de-identification reflects what privacy and compliance teams in healthcare actually need from AI components.
Intended use. Both models are released for research, development, and integration into clinical workflows under appropriate human review. Neither is a medical device. Neither should be used for autonomous clinical decisioning. Expect more CliniGuard releases as additional Aether One™ primitives open up.
Three frameworks, one model artifact.
A model is a file on disk. The inference framework is what serves it as an API to the agent. Aether One™ uses three, matched to deployment shape and latency budget. The model artifact itself is identical — only the runtime changes.
Ollama
Self-contained Go binary. Trivial to install, easy to update via signed model artifacts. The default for sovereign deployments where the operator wants minimal external dependency surface area. Supports GGUF format with hot model swap.
Used in: WISeR (CMS, NJ) · State Medicaid pilots
vLLM
Throughput-optimized. PagedAttention scheduler, dynamic batching, continuous batching. Drives high requests-per-second on multi-tenant GPU clusters where latency budgets are bounded but volume is unbounded. The default for hyperscaler-deployed HIP One.
Used in: Azure-deployed HIP One · Marketplace agents
llama.cpp
Pure-C++ inference, runs on CPU or commodity hardware. Used for tablet-resident agents, clinical workstation copilots, and any scenario where the constraint is "no GPU available." Sub-second latency on Phi-3.5 mini quantized to Q4_K_M.
Used in: Ambient clinical agents · Tablet copilots
From 140GB to 18GB without losing the doctor's intent.
A 70B-parameter model in full FP16 precision is ~140GB. Most healthcare deployments don't have the GPU memory or the budget for that. Quantization compresses the model weights to lower precision — the right quantization keeps clinical-task accuracy nearly identical to the full-precision baseline.
| Format | Size (Llama 70B) | When we use it | Tradeoff |
|---|---|---|---|
| FP16 (full precision) | ~140 GB | Reference baseline for evaluation | Highest accuracy. Impractical for most deployments — needs multi-H100. |
| Q8_0 (8-bit) | ~70 GB | Production GPU clusters with H100/A100 capacity | ~99% of FP16 quality. Standard for hyperscaler-deployed HIP One. |
| Q4_K_M (4-bit, K-quants medium) | ~40 GB | Sovereign on-prem on commodity GPUs (A6000, L40S) | ~98% quality on clinical tasks. The sweet spot for sovereign deployments where GPU budget is fixed. |
| AWQ (Activation-aware 4-bit) | ~38 GB | vLLM throughput-bound deployments | Slightly higher quality than GGUF Q4 at the same size; requires vLLM (no GGUF). |
Quantization choice is recorded per-agent in the audit packet. If a regulator asks "what model produced this decision," the answer includes the exact quantization — not a vague "Llama 70B."
Same agents. Different infrastructure.
Aether One™ Sovereign is the most constrained deployment shape we support. Every other shape relaxes constraints in some direction. The model layer adapts — the agents do not.
| Shape | Inference framework | Default model class | Notes |
|---|---|---|---|
| Sovereign on-prem | Ollama (or llama.cpp) | Llama 3.1 70B Q4_K_M | The reference shape. Air-gappable. No external network calls. Used by CMS WISeR. |
| Government cloud | vLLM | Llama 3.1 70B AWQ | Azure Government, AWS GovCloud. Customer-owned subscription, FedRAMP-aligned tenant. |
| Commercial cloud | vLLM | Llama 3.1 70B AWQ · or hosted Azure OpenAI | For customers who prefer cloud-vendor-managed model hosting. Same agent contracts. |
| Hybrid | vLLM (cloud) + Ollama (edge) | Mixed by agent | Routing agent on edge, reasoning agents in cloud. Used in ambient clinical scenarios. |
| Edge / tablet | llama.cpp | Phi-3.5 mini Q4_K_M | No GPU. Constrained by 16GB device RAM. Used for clinical-workstation copilots. |
Four metrics every agent is measured on.
Healthcare AI without measurement is just plausible-sounding text generation. Healthcare agents are evaluated on the same four metrics, on holdout sets refreshed each rule-pack release. Numbers are visible to the customer in the audit packet, not buried in a marketing claim.
Clinical accuracy on holdout
Agent decision compared against gold-standard reviewer decision on a stratified holdout set of historical cases. Reviewer disagreement is itself a measured baseline — we don't claim accuracy higher than the inter-reviewer agreement on the same task.
Hallucination rate
For every retrieval-augmented response, we measure the rate at which the agent's cited evidence does not appear in the retrieved corpus. Target is <0.5%. The Clinical Query and Inference Layer (CQIL) is built around making this measurable.
False-affirm rate
The most consequential error class for prior authorization — an automated approval that should have been a non-affirm. The architectural prohibition on auto-deny means this is the only auto-decision class that exists, and we measure it on every rule-pack release.
P95 latency
Wall-clock time from input received to decision returned, at the 95th percentile across production traffic. Distinct budgets per agent class — eligibility check <2s, full clinical reasoning <30s, ambient copilot <500ms.
Specific numbers per agent surface are shared under NDA during evaluation engagements. We don't publish them on the marketing site — the numbers depend on the customer's data distribution, not just our model choice.
Engineered, not assumed.
If you're evaluating Aether One™ on engineering rigor, this is the page that says we have it. The deeper conversation — specific eval numbers, your data distribution, your deployment shape — happens under NDA with our engineering team.