[Remote] Founding Sr. Operations Support Specialist
Note: This is a remote position open to candidates in the USA. InSiteVerse is a US-based startup developing a FinTech mobile app designed to provide hedge-fund-grade trading intelligence to everyday investors. The Founding Sr. Operations Support Specialist will be responsible for ensuring the reliability and operational excellence of the platform, leading incident management, and establishing a strong observability culture.
Responsibilities
- Define and manage SLIs, SLOs, and error budgets for critical user journeys
- Ensure high system availability, low latency, and minimal error rates
- Proactively identify risks and implement strategies to prevent SLO breaches
- Partner with engineering to balance reliability against feature velocity
- Act as Incident Commander for high-severity (P0/P1) incidents
- Lead real-time war rooms, ensuring rapid issue resolution
- Own the full incident lifecycle: detection → response → recovery → RCA → prevention
- Establish and enforce incident response frameworks, SLAs, and escalation policies
- Drive blameless postmortems and continuous improvement
- Monitor and analyze observability dashboards across cloud, analytics, and application layers to identify infrastructure issues, detect application downtime, and uncover system anomalies impacting reliability
- Build dashboards and alerts for real-time system visibility
- Correlate signals across infrastructure, application, and AI systems
- Analyze trends from tickets, logs, and telemetry to detect systemic issues
- Monitor AI-specific signals (model drift, inference latency, failures, anomalies)
- Oversee intake of customer tickets, alerts, and operational signals
- Define and manage priority classification (P0–P3) and response expectations
- Resolve customer-impacting issues directly or coordinate their resolution with internal teams
- Drive collaboration across AI, Backend, Frontend, Mobile, DevOps, QA, and Product
- Define and optimize ticket workflows and escalation paths
- Lead communication during incidents with both technical and non-technical stakeholders
- Own the release calendar and operational readiness checks
- Ensure monitoring, rollback plans, and risk assessments are in place
- Validate system performance post-deployment
- Build automated runbooks and self-healing systems
- Reduce manual intervention through scripting and tooling
- Improve system resilience using failover, scaling, and redundancy mechanisms
Skills
- 10+ years of experience in Production Support / SRE / Technical Operations
- Strong understanding of SLOs, SLIs, SLAs, and error budgets
- Proven experience in incident management and troubleshooting distributed systems
- Hands-on experience with cloud platforms (AWS & GCP)
- Strong debugging and root cause analysis skills
- Experience supporting mobile applications (iOS/Android)
- Understanding of DevOps and SRE practices
- Exposure to AI/ML systems and model behavior monitoring
- Experience with log management and tracing systems
- Monitoring & Observability: Azure Monitor, Prometheus, Grafana
- Incident Management: PagerDuty, Opsgenie (or similar)
- Scripting/Automation: Python, PowerShell, Bash
- Logging: ELK Stack, Azure Monitor Logs, Splunk, or Datadog
- Tracing: OpenTelemetry, Jaeger, Zipkin, or Azure Application Insights
- Familiarity with low-latency or financial systems
Company Overview