
Senior QA Engineer

Third Way Health
Full-time
On-site
Cambridge, Massachusetts, United States

About the position

We’re seeking a Senior QA / Eval Engineer to own and evolve the quality and evaluation infrastructure behind our AI-powered patient engagement platform. Our eval system is multi-factorial (deterministic, LLM-based, human expert) and runs against every voice interaction we handle. You’ll be responsible for the verification and evaluation infrastructure that determines whether each interaction meets multi-faceted quality criteria.

This is a high-impact individual contributor role. You won’t just write test cases; you’ll shape how we define “quality” across our automated and manual workflows and build the tooling that makes that definition measurable and actionable.

Responsibilities

  • Own and extend our multi-layered eval pipeline and verification portfolio: deterministic quality checks on tool calls, risk-factor heuristics, and LLM-graded transcript evaluation.
  • Advance our capabilities to evaluate end-to-end system performance (across orchestrated agents, RAG-supported responses, multi-party voice conversations) with modular and auditable verification that is independent of any single model provider.
  • Drive improvements to our observability stack to surface eval metrics, detect regressions, and enable data-driven quality decisions across the team.
  • Build real-time monitoring and verification loops that catch issues in production interactions as they happen, intervening with context and feeding findings back into system refinement.
  • Partner with ML engineers, product managers, and operations leads to translate real-world failure modes into automated checks, closing the loop between production incidents and eval coverage.
  • Build and maintain adversarial and edge-case test suites, including prompt injection resistance, guardrail robustness, and graceful degradation under ambiguous patient inputs.
  • Champion “shift-left” quality practices: embed eval criteria into prompt engineering workflows, define acceptance criteria for new agent behaviors, and make quality a first-class concern in the development cycle.
  • Contribute to the design of our QA pipeline orchestration (background processing, Slack notifications, risk assessment persistence) to improve throughput, reliability, and developer experience.
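To give a concrete flavor of the deterministic check-registry pattern mentioned above, here is a minimal, purely illustrative sketch in Python. None of these names come from our codebase; the `Interaction` fields and the two sample checks are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical interaction record; field names are illustrative only.
@dataclass
class Interaction:
    transcript: str
    tool_calls: list[dict] = field(default_factory=list)

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""

# Simple check registry: each deterministic check registers itself,
# so new checks can be added without touching the pipeline core.
CHECKS: dict[str, Callable[[Interaction], CheckResult]] = {}

def check(name: str):
    def register(fn: Callable[[Interaction], CheckResult]):
        CHECKS[name] = fn
        return fn
    return register

@check("tool_calls_have_required_fields")
def tool_call_fields(interaction: Interaction) -> CheckResult:
    # Flag any tool call missing a "name" field (illustrative schema).
    missing = [c for c in interaction.tool_calls if "name" not in c]
    return CheckResult(
        name="tool_calls_have_required_fields",
        passed=not missing,
        detail=f"{len(missing)} malformed tool call(s)",
    )

@check("transcript_nonempty")
def transcript_nonempty(interaction: Interaction) -> CheckResult:
    return CheckResult("transcript_nonempty", bool(interaction.transcript.strip()))

def run_checks(interaction: Interaction) -> list[CheckResult]:
    # Run every registered check; results feed downstream risk scoring.
    return [fn(interaction) for fn in CHECKS.values()]
```

The point of the registry is that adding eval coverage for a new feature means registering one more function, not modifying the pipeline itself.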


Required skills and qualifications

  • 5+ years of software engineering or test engineering experience, with 3+ years focused on quality infrastructure for AI/ML or data-intensive systems.
  • Strong proficiency in Python, particularly for building test frameworks, eval pipelines, and API-level integration tests (e.g., pytest, FastAPI TestClient, Pydantic).
  • Demonstrated experience designing evaluation or verification systems for LLM-based applications, with a clear understanding that the model is a generation layer, not the quality layer. Comfort with both deterministic and model-graded assessment approaches, and a point of view on when each is appropriate.
  • Familiarity with the architectural tradeoffs of relying on LLM outputs in production β€” including variance across model versions, prompt sensitivity, and the need for external verification infrastructure that remains stable as underlying models change.
  • Experience building extensible, rule-based validation systems (check registries, plugin architectures, or similar patterns) that scale across a growing surface area of features.
  • Solid understanding of voice AI or conversational AI systems, including tool-calling patterns, transcript analysis, and interaction-level quality metrics.
  • Hands-on experience with observability and metrics instrumentation in production environments.
  • Excellent communication skills, with the ability to collaborate effectively across engineering, product, and non-technical stakeholders.
  • Strong interest in healthcare innovation and building AI systems that meaningfully improve health outcomes.
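As a sketch of what we mean by pairing deterministic checks with model-graded assessment, consider the following hypothetical example. The SSN pattern and the injectable `grader` callable are illustrative assumptions, not our production logic; the deterministic guardrail stays stable across model versions while the grader can be swapped per provider.

```python
import re

# Illustrative guardrail: fail if the transcript contains anything
# shaped like a US Social Security number (a stand-in for PHI leakage).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def deterministic_phi_check(transcript: str) -> bool:
    """Deterministic check: True when no SSN-like string appears."""
    return SSN_PATTERN.search(transcript) is None

def graded_eval(transcript: str, grader) -> dict:
    # 'grader' is any callable returning a 0-1 quality score
    # (e.g. an LLM judge); keeping it injectable makes the eval
    # independent of any single model provider.
    return {
        "phi_safe": deterministic_phi_check(transcript),
        "quality_score": grader(transcript),
    }
```

The deterministic layer gives a hard pass/fail that never drifts with model updates; the graded layer captures nuance the rules cannot.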


Desired skills and qualifications

  • Experience building QA or eval systems in healthcare or other regulated environments, including familiarity with standards such as HIPAA, GDPR, or FDA guidance.
  • Proven experience leading complex technical initiatives and mentoring junior engineers.
  • Experience building or operating systems where quality guarantees live in the verification infrastructure rather than in any single model.
  • Familiarity with risk-scoring systems, anomaly detection, or production safety nets for autonomous AI agents.
  • Experience with AI safety testing, including adversarial evaluation, jailbreak testing, and bias detection in LLM outputs.
  • Hands-on experience with CI/CD pipelines for eval automation (CircleCI, GitHub Actions, or equivalent) and infrastructure-as-code deployment patterns.
  • Experience with voice UI testing tools and platforms, with a focus on evaluating speech generation and response quality.
  • Knowledge of accessibility testing and inclusive design principles.