, Kubernetes) and service mesh architectures. Experience with monitoring, observability, and alerting tools (e.g., Prometheus...
, lineage tracking, and compliance across the entire model inventory. Drive Advanced Observability & Monitoring: Develop... (Terraform, Helm), and robust monitoring/observability solutions (Prometheus, Grafana, ELK/EFK stack). Comprehensive knowledge...
, performance, and capacity planning. Promote reliability through automation, simplification, observability, root-cause analysis...
of web protocols. Preferred: Spring Boot development experience. Experience with IT operations and observability tools...
and platforms. Monitor system health using APM/observability tools (Dynatrace, AppDynamics, New Relic, Grafana, Prometheus...
. Ensure operational excellence through platform reliability, performance, observability, cost efficiency, and simplification...
Language Models (LLMs) - Implement AI agent observability, monitoring, and tracing solutions - Design and build Retrieval... AI Observability: Experience with tracing, monitoring, and debugging AI/LLM applications CI/CD & Monitoring: GitHub Actions, Jenkins...
and Langfuse for orchestration, chaining, and observability. Core Responsibilities · Implement and maintain MCP server...), manage orchestration (Kubernetes/GKE), and optimize nodes, autoscaling and resource requests. · Ensure observability...
observability tools during incident and performance investigations. Provides visibility to all stakeholders throughout the entire... and continuously improves our time to resolution metrics. Maintains and configures core observability tools to ensure optimum...
workflows. Ensure solutions meet requirements for security, data governance, observability, and reliability. Review solution...
towards exhaustive health monitoring of AI training supercomputers. Build AI Supercomputer observability solutions at scale, with deep... trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability...
scalability, reliability, and observability needs as usage grows and as Klaviyo's AI strategy evolves. Establish, refine... harnesses, observability standards). Mentor and uplevel other engineers on the team through design reviews, pairing, feedback...
application meets world-class standards in: Availability, Scalability, Performance, Observability (metrics, logging, tracing), Security...
troubleshooting guides (TSGs), wikis, tests, and telemetry, adding comprehensive observability and monitoring capabilities..., reliability, efficiency, observability, and performance of supercomputers while also driving consistency in monitoring...
automation. Experience with system monitoring, observability, and performance tuning. Experience supporting high-scale or data...