VP Engineering
Location: San Francisco, CA
Compensation: Competitive Base + Equity + Benefits
TLDR
- Our client is a rapidly growing Series A AI infrastructure company building an open AI cloud that provides developers with affordable, on-demand GPU infrastructure for training, inference, and production AI workloads.
- We are seeking an executive-level VP Engineering to build and scale the core infrastructure powering our client's AI cloud platform. You'll lead infrastructure strategy while remaining 40% hands-on, driving architecture, contributing to technical design, and partnering with engineering on implementation and execution for GPU-native infrastructure.
- If you're excited by large-scale infrastructure, AI platforms, cloud systems, and solving problems at the intersection of hardware and software, this is an opportunity to influence the architecture of a next-generation AI compute platform. This is a foundational leadership role focused on cloud infrastructure, distributed systems, AI platforms, and scaling both the technology and engineering organization.
Requirements
- 12+ years of experience building and scaling large-scale cloud infrastructure, including hands-on leadership of infrastructure organizations.
- Proven experience building cloud platforms, GPU infrastructure, or AI/ML compute platforms in high-growth environments.
- Expertise in Kubernetes, Linux, networking, distributed systems, storage architecture, and Infrastructure-as-Code.
- Strong background in automation, observability, monitoring, reliability engineering, and highly available production systems.
- Preferred: Experience with GPU scheduling, Slurm, Kubernetes GPU Operators, Ray, distributed training, and managing large-scale GPU clusters for AI training and inference.
Responsibilities
- Own AI cloud infrastructure architecture, including GPU orchestration, compute scheduling, networking, storage, distributed systems, bare-metal deployments, and platform scalability.
- Build and scale large GPU clusters for AI training and inference, with ownership of GPU provisioning, scheduling, utilization optimization, capacity management, reliability, and performance.
- Provide hands-on technical leadership across infrastructure design, architecture reviews, system design, implementation, and escalation for complex production issues.
- Establish Platform Engineering and SRE best practices across Kubernetes, observability, CI/CD, security, automation, SLOs, SLIs, incident response, and capacity planning.
- Recruit and lead high-performing Infrastructure, Platform, and SRE teams, while partnering with executive leadership on strategy, vendor relationships, budgets, and infrastructure investments.