[Remote] Senior Network Engineer
Note: This is a remote position open to candidates in the USA. Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for AI systems. They are seeking a Senior Network Engineer to design and scale the network backbone for their AI infrastructure platform, ensuring high reliability and performance for demanding workloads.
Responsibilities
- Design and deploy scalable spine/leaf network architectures for AI data centers
- Engineer high-performance Ethernet fabrics supporting GPU clusters and AI workloads
- Build and maintain EVPN/VXLAN, BGP, and high-speed routing environments
- Optimize east-west traffic flows for AI training and inference operations
- Support RoCE/RDMA networking and low-latency transport technologies
- Support backbone, DCI, WAN, and edge connectivity solutions
- Collaborate with compute, storage, AI platform, and operations teams to deliver integrated infrastructure solutions
- Develop automation and Infrastructure-as-Code (IaC) solutions for network provisioning and operations
- Troubleshoot complex network, performance, and congestion issues across distributed environments
- Improve network observability, telemetry, and operational visibility
- You have hands-on experience working with SONiC and Junos
- Experience with cloud networking technologies, including VPCs, NFV, Direct Connect, and Cloud Connect
- You enjoy working with a small group of friendly, highly motivated, high-execution colleagues
- You’re comfortable with a high degree of autonomy, can independently prioritize your work and understand how it maps to the overall needs and goals of the company
- You’re knowledgeable in your domain but also enjoy wearing multiple hats and venturing outside of your comfort zone when the need arises
- You value the ability to write well and understand the importance of good documentation
Skills
- Experience with Cumulus NOS
- 5+ years of experience in large-scale data center networking
- Experience in spine-leaf architectures and L3 fabrics
- Experience with BGP, EVPN, VXLAN
- Experience operating high-performance computing (HPC) or GPU-dense environments
- Experience designing networks for hyperscalers, neoclouds, or high-scale SaaS infrastructure
- Experience with automation using Python, Ansible, Terraform, or similar tools
- Experience with network observability tooling and telemetry pipelines
- Proven ability to design systems that scale to thousands of nodes
- Strong documentation and communication skills
- Familiarity with NVIDIA networking (Spectrum, Quantum, BlueField, etc.)
- Familiarity with RDMA, RoCE, or InfiniBand fabrics
- Experience with multi-region backbone design
- Exposure to bare-metal provisioning systems
- Experience working in high-growth infrastructure startups
Benefits
- Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment
- Discretionary bonus
- Meaningful equity component
- Comprehensive benefits
Company Overview