[Remote] Principal Observability Platform Engineer

Work from home, full-time role

Note: This is a remote role open to candidates in the USA. Nscale is the GPU cloud engineered for AI, providing high-performance infrastructure for AI start-ups and large enterprises. As a Principal Observability Platform Engineer, you will own the technical direction of Nscale's observability platform, ensuring it scales with the business and simplifies operations.

Responsibilities

Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale
Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management
Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose
Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how Nscale builds and operates
Define standards and patterns that other engineers adopt, not by mandate, but because they're clearly better
Mentor and technically grow the observability team; raise the ceiling on what the team can build and own
Lead incident postmortems and use them to drive durable platform improvements
Evaluate and introduce tooling that meaningfully improves signal quality, operational efficiency, or scalability, and retire what doesn't

Skills

8+ years in SRE, infrastructure engineering, platform engineering, or observability-focused roles
You've operated observability infrastructure at serious scale. You know what breaks at 10x and you design for it
You have a strong bias toward simplicity. You've seen over-engineered observability stacks collapse under their own weight and you build accordingly
Deep hands-on experience with a significant subset of: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic
Strong engineering fundamentals, proficient in Python, Go, or similar; comfortable owning complex systems end to end
Experience with Kubernetes at scale; familiarity with GPU infrastructure or HPC environments (Slurm) is a strong plus
You can architect systems, write the code, review others' work, and explain the tradeoffs clearly, all in the same week
Infrastructure-as-Code is the default, not an option (Terraform, Ansible, or equivalent)
You influence without authority. Teams want your opinion because it makes their work better
Experience with high-volume streaming pipelines for observability data (Kafka, Vector, Fluent Bit, etc.)
Background in AI/ML infrastructure observability: GPU utilisation, training job visibility, inference latency
Prior experience defining observability strategy at an organisation level

Benefits

Bonus
Equity
Commission programs
Medical
Dental
Vision
Flexible paid time off
Parental leave
Retirement plan participation

Company Overview

Nscale builds AI data centers and provides GPU cloud infrastructure that companies use to train, run, and scale large AI models. It was founded in 2024 and is headquartered in London, England, with a workforce of 201-500 employees. Its website is https://www.nscale.com.
