VP Engineering – AI Cloud Infrastructure (San Francisco)
San Francisco, California, United States

VP Engineering


Location: San Francisco, CA

Compensation: Competitive Base + Equity + Benefits


TLDR

  • Our client is a rapidly growing Series A AI infrastructure company building an open AI cloud that provides developers with affordable, on-demand GPU infrastructure for training, inference, and production AI workloads.
  • We are seeking an executive-level VP Engineering to build and scale the core infrastructure powering our client's AI cloud platform. You'll lead infrastructure strategy while remaining 40% hands-on: driving architecture, contributing to technical design, and partnering with engineering on implementation and execution for GPU-native infrastructure.
  • If you're excited by large-scale infrastructure, AI platforms, cloud systems, and solving problems at the intersection of hardware and software, this is an opportunity to influence the architecture of a next-generation AI compute platform. This is a foundational leadership role focused on cloud infrastructure, distributed systems, AI platforms, and scaling both the technology and engineering organization.


Requirements

  • 12+ years building and scaling large-scale cloud infrastructure, with hands-on leadership of infrastructure organizations.
  • Proven experience building cloud platforms, GPU infrastructure, or AI/ML compute platforms in high-growth environments.
  • Expertise in Kubernetes, Linux, networking, distributed systems, storage architecture, and Infrastructure-as-Code.
  • Strong background in automation, observability, monitoring, reliability engineering, and highly available production systems.
  • Preferred: Experience with GPU scheduling, Slurm, Kubernetes GPU Operators, Ray, distributed training, and managing large-scale GPU clusters for AI training and inference.


Responsibilities

  • Own AI cloud infrastructure architecture, including GPU orchestration, compute scheduling, networking, storage, distributed systems, bare-metal deployments, and platform scalability.
  • Build and scale large GPU clusters for AI training and inference, with ownership of GPU provisioning, scheduling, utilization optimization, capacity management, reliability, and performance.
  • Provide hands-on technical leadership across infrastructure design, architecture reviews, system design, implementation, and escalation for complex production issues.
  • Establish Platform Engineering and SRE best practices across Kubernetes, observability, CI/CD, security, automation, SLOs, SLIs, incident response, and capacity planning.
  • Recruit and lead high-performing Infrastructure, Platform, and SRE teams, while partnering with executive leadership on strategy, vendor relationships, budgets, and infrastructure investments.