About the position
Weβre seeking a Senior QA / Eval Engineer to own and evolve the quality and evaluation infrastructure behind our AI-powered patient engagement platform. Our eval system is multi-factorial (deterministic, LLM-based, human expert), running against every voice interaction we handle. Youβll be responsible for the verification and evaluation infrastructure that determine whether each interaction meets multi-faceted quality criteria.
This is a high-impact individual contributor role. You wonβt just write test cases β youβll shape how we define βqualityβ across our automated and manual workflows and build the tooling that makes that definition measurable and actionable.
Responsibilities
- Own and extend our multi-layered eval pipeline and verification portfolio: deterministic quality checks on tool calls, risk-factor heuristics, and LLM-graded transcript evaluation.
- Advance our capabilities to evaluate end-to-end system performance (across orchestrated agents, RAG-supported responses, multi-party voice conversations) with modular and auditable verification that is independent of any single model provider.
- Drive improvements to our observability stack to surface eval metrics, detect regressions, and enable data-driven quality decisions across the team.
- Build real-time monitoring and verification loops that catch issues in production interactions as they happen, intervening with context and feeding back for system refinement.
- Partner with ML engineers, product managers, and operations leads to translate real-world failure modes into automated checks, closing the loop between production incidents and eval coverage.
- Build and maintain adversarial and edge-case test suites β including prompt injection resistance, guardrail robustness, and graceful degradation under ambiguous patient inputs.
- Champion βshift-leftβ quality practices: embed eval criteria into prompt engineering workflows, define acceptance criteria for new agent behaviors, and make quality a first-class concern in the development cycle.
- Contribute to the design of our QA pipeline orchestration (background processing, Slack notifications, risk assessment persistence) to improve throughput, reliability, and developer experience.
Required skills and qualifications
- 5+ years of software engineering or test engineering experience, with 3+ years focused on quality infrastructure for AI/ML or data-intensive systems.
- Strong proficiency in Python, particularly for building test frameworks, eval pipelines, and API-level integration tests (e.g., pytest, FastAPI TestClient, Pydantic).
- Demonstrated experience designing evaluation or verification systems for LLM-based applications β with a clear understanding that the model is a generation layer, not the quality layer. Comfort with both deterministic and model-graded assessment approaches, and a point of view on when each is appropriate.
- Familiarity with the architectural tradeoffs of relying on LLM outputs in production β including variance across model versions, prompt sensitivity, and the need for external verification infrastructure that remains stable as underlying models change.
- Experience building extensible, rule-based validation systems (check registries, plugin architectures, or similar patterns) that scale across a growing surface area of features.
- Solid understanding of voice AI or conversational AI systems, including tool-calling patterns, transcript analysis, and interaction-level quality metrics.
- Hands-on experience with observability and metrics instrumentation in production environments.
- Excellent communication skills, with the ability to collaborate effectively across engineering, product, and non-technical stakeholders.
- Strong interest in healthcare innovation and building AI systems that meaningfully improve health outcomes.
Desired skills and qualifications
- Experience building QA or eval systems in healthcare or regulated environments, with familiarity with standards such as HIPAA, GDPR, or FDA guidance.
- Proven experience leading complex technical initiatives and mentoring junior engineers.
- Experience building or operating systems where quality guarantees live in the verification infrastructure rather than in any single model.
- Familiarity with risk-scoring systems, anomaly detection, or production safety nets for autonomous AI agents.
- Experience with AI safety testing, including adversarial evaluation, jailbreak testing, and bias detection in LLM outputs.
- Hands-on experience with CI/CD pipelines for eval automation (CircleCI, GitHub Actions, or equivalent) and infrastructure-as-code deployment patterns.
- Experience with voice UI testing tools and platforms, with a focus on evaluating speech generation and response quality.
- Knowledge of accessibility testing and inclusive design principles.