[Remote] Principal Observability Platform Engineer
Note: This is a remote position open to candidates in the USA. Nscale is the GPU cloud engineered for AI, providing high-performance infrastructure for AI start-ups and large enterprises. As a Principal Observability Platform Engineer, you will own the technical direction of Nscale's observability platform, ensuring it scales with the business and simplifies operations.
Responsibilities
- Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale
- Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management
- Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose
- Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how Nscale builds and operates
- Define standards and patterns that other engineers adopt, not by mandate, but because they're clearly better
- Mentor and technically grow the observability team; raise the ceiling on what the team can build and own
- Lead incident postmortems and use them to drive durable platform improvements
- Evaluate and introduce tooling that meaningfully improves signal quality, operational efficiency, or scalability, and retire what doesn't
Skills
- 8+ years in SRE, infrastructure engineering, platform engineering, or observability-focused roles
- You've operated observability infrastructure at serious scale. You know what breaks at 10x and you design for it
- You have a strong bias toward simplicity. You've seen over-engineered observability stacks collapse under their own weight and you build accordingly
- Deep hands-on experience with a significant subset of: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic
- Strong engineering fundamentals, proficient in Python, Go, or similar; comfortable owning complex systems end to end
- Experience with Kubernetes at scale; familiarity with GPU infrastructure or HPC environments (Slurm) is a strong plus
- You can architect systems, write the code, review others' work, and explain the tradeoffs clearly, all in the same week
- Infrastructure-as-Code is the default, not optional (Terraform, Ansible, or equivalent)
- You influence without authority. Teams want your opinion because it makes their work better
- Experience with high-volume streaming pipelines for observability data (Kafka, Vector, Fluent Bit, etc.)
- Background in AI/ML infrastructure observability: GPU utilisation, training job visibility, inference latency
- Prior experience defining observability strategy at an organisation level
Benefits
- Bonus
- Equity
- Commission programs
- Medical
- Dental
- Vision
- Flexible paid time off
- Parental leave
- Retirement plan participation
Company Overview