using container-native Hadoop services to work in a Kubernetes cluster. As a Sr. Staff Software Engineer...
, and cluster deployments and lead discussions about network topologies, compute, management, telemetry, and storage fabrics...
. Experience running workloads on large-scale heterogeneous clusters is a plus. Experience optimizing GPU kernels...
. SpringBoot, Redis, MongoDB, Kafka, and MicroServices architecture. 3. AWS deployments, scaling and EKS cluster management 4...
, and negotiate staffing plans. Manage the end-to-end cluster development cycle, including forecasting, sourcing, procurement...
personnel to create costed bills of material (BOMs) for rack- and cluster-level solutions. Partner with business development...
of running AI/HPC workloads at the single-node and cluster level and develop test suites and performance automation. Lead the debug...
. Lead and manage interconnection applications and queue positions during the cluster study phases in NYISO/PJM/SERC...
stacks, and cluster environments. This role requires a good understanding of, and experience with, ROCm, CUDA, GPU architecture, ML...
, and automated provisioning. Strong experience in Kafka cluster management, topic configuration, performance tuning, and ensuring...
consistency. Proficiency in monitoring cluster health and resource utilization. Ability to troubleshoot complex database...
. Expertise in Databricks components such as Delta Lake, Notebooks, Pipelines, cluster management, and cloud integration (Azure...
networking. Experience with PCIe, CXL, NVMe interconnects and cluster schedulers (Kubernetes, Slurm). Proven ability...
. Experience running workloads, especially AI models, on large-scale heterogeneous clusters. Familiarity with clusters...
, EC2, RDS, S3, CloudWatch, IAM) and Kubernetes, including multi-cluster management. Strong programming skills (Python...
, CloudWatch, IAM) and Kubernetes, including cluster management. Proficient programming skills (Python, Go, or Java...
, and reliability. Standardize Databricks workspaces: cluster policies, repos/CI/CD, secrets, cost guardrails, and operational SLAs...
, storage systems, networking components, and cluster automation. Integrate modern technologies to ensure platform components...