[Remote] Senior Cluster Site Reliability Engineer

Work from home Full-time role Hiring

Note: This is a remote position open to candidates in the USA. The Voleon Group is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. As a Senior Cluster Site Reliability Engineer, you will scale the research compute cluster, maintain high uptime and reliability, and collaborate with engineering teams to improve monitoring and operational frameworks.

Responsibilities

  • Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise.
  • Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
  • Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
  • Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
  • Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
  • Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Skills

  • 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
  • Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
  • Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
  • Experience with cloud infrastructure (AWS or GCP)
  • Familiarity designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
  • Experience with distributed storage technologies (Lustre, Ceph, S3)
  • Embodies a 'system engineer' rather than 'system administrator' mindset, thinking systematically and leveraging automation
  • Bachelor's degree in computer science
  • Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
  • Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
  • Familiarity with hybrid/on-prem environments
  • Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
  • Experience with HPC networking (InfiniBand, RDMA)
  • Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

Benefits

  • Referral bonus: if you refer a candidate for this role who is successfully hired and employed by The Voleon Group, you may earn $15,000. Please use this form to submit your referral.

Company Overview

  • The Voleon Group is a family of companies committed to the development & deployment of cutting-edge technologies in investment management. It was founded in 2008, and is headquartered in Berkeley, California, USA, with a workforce of 201-500 employees. Its website is http://voleon.com/.

Company H1B Sponsorship

  • The Voleon Group has a track record of offering H1B sponsorships, with 2 in 2025, 2 in 2024, 3 in 2023, 4 in 2022, 1 in 2021, 1 in 2020. Please note that this does not guarantee sponsorship for this specific role.