At NVIDIA, Site Reliability Engineering provides a rare chance to define, develop, and support large-scale production... to guarantee flawless service operation with consistent reliability and uptime. As an SRE here, you will be part of a welcoming...
For an engineer driven to tackle the unique telemetry, orchestration, and reliability challenges found only at gigawatt scale... of next-generation, open-source-driven datacenters at gigawatt scale. This role moves beyond infrastructure maintenance to define the...
hardware and software infrastructure required to build, validate, and release a wide variety of hardware and software products..., reliability, and/or velocity within the pipeline through implementation of robust infrastructure telemetry, KPIs, and indicators...
infrastructure. This is an outstanding opportunity to work where brand-new hardware, software, and infrastructure intersect.../AI Benchmarking and Telemetry Engineer to join our team and drive performance insights across our most advanced computing...
and application servers. Prior experience in Site Reliability Engineering/DevOps and managing large-scale server infrastructure... of the components of a cloud infrastructure including hardware platforms, OS, applications, databases, networks, web...
and operating massive-scale bare-metal Kubernetes environments. As a senior leader within Infrastructure Shared Services (ISS...), you'll bridge the gap between hardware and high-performance software, ensuring our global R&D teams have the reliable, secure...