VP Engineering – AI Cloud Infrastructure (San Francisco)

San Francisco, California, United States

Location: San Francisco, CA

Compensation: Competitive Base + Equity + Benefits

TL;DR

Our client is a rapidly growing Series A AI infrastructure company building an open AI cloud that provides developers with affordable, on-demand GPU infrastructure for training, inference, and production AI workloads.
We are seeking an executive-level VP Engineering to build and scale the core infrastructure powering our client's AI cloud platform. You'll lead infrastructure strategy while remaining 40% hands-on: driving architecture, contributing to technical design, and partnering with engineering on implementation and execution for GPU-native infrastructure.
If you're excited by large-scale infrastructure, AI platforms, cloud systems, and solving problems at the intersection of hardware and software, this is an opportunity to influence the architecture of a next-generation AI compute platform. This is a foundational leadership role focused on cloud infrastructure, distributed systems, AI platforms, and scaling both the technology and engineering organization.

Requirements

12+ years building and scaling large cloud infrastructure, including hands-on leadership of infrastructure organizations.
Proven experience building cloud platforms, GPU infrastructure, or AI/ML compute platforms in high-growth environments.
Expertise in Kubernetes, Linux, networking, distributed systems, storage architecture, and Infrastructure-as-Code.
Strong background in automation, observability, monitoring, reliability engineering, and highly available production systems.
Preferred: Experience with GPU scheduling, Slurm, Kubernetes GPU Operators, Ray, distributed training, and managing large-scale GPU clusters for AI training and inference.

Responsibilities

Own AI cloud infrastructure architecture, including GPU orchestration, compute scheduling, networking, storage, distributed systems, bare-metal deployments, and platform scalability.
Build and scale large GPU clusters for AI training and inference, with ownership of GPU provisioning, scheduling, utilization optimization, capacity management, reliability, and performance.
Provide hands-on technical leadership across infrastructure design, architecture reviews, system design, implementation, and escalation for complex production issues.
Establish Platform Engineering and SRE best practices across Kubernetes, observability, CI/CD, security, automation, SLOs, SLIs, incident response, and capacity planning.
Recruit and lead high-performing Infrastructure, Platform, and SRE teams, while partnering with executive leadership on strategy, vendor relationships, budgets, and infrastructure investments.