Configure and maintain Datadog dashboards, alerts, monitors, SLOs & SLIs.
Integrate Datadog with cloud environments (AWS / Azure / GCP), Kubernetes, and on-prem applications.
Implement APM traces, RUM, Infrastructure Monitoring, and Log Management.
Develop and standardize observability best practices across teams.
Troubleshoot performance issues using Datadog metrics, logs & traces.
Automate monitoring setup using Terraform, Ansible, and CI/CD tools.
Work closely with DevOps, SRE, and development teams to ensure platform reliability.
Optimize alerting to reduce noise and enhance incident response processes.
Required Skills
Hands-on experience with Datadog (Dashboards, Log Pipelines, Metrics, Alerts, APM).
Strong knowledge of Linux-based systems and system performance metrics.
Experience working with Containers & Kubernetes (EKS / AKS / GKE).
Proficiency in at least one scripting language (Python, Bash, or another shell).
Experience with Cloud platforms: AWS / Azure / GCP.
Understanding of CI/CD pipelines and Infrastructure as Code (Terraform preferred).
Good to Have
Experience with Incident Management / SRE practices.
Familiarity with Prometheus, Grafana, Splunk, New Relic, or similar tools.
Knowledge of Service Mesh / Microservices architecture.
Networking basics (DNS, load balancing, SSL/TLS).