Job Description
About the Role
Raydian Cloud is seeking a forward-thinking DevOps Engineer to help build and scale the infrastructure that powers cutting-edge AI workloads. You’ll work at the intersection of cloud-native technologies and artificial intelligence operations (AIOps), enabling high-performance, secure, and automated environments for AI development and deployment. Your expertise in Infrastructure as Code and Kubernetes will be critical in supporting scalable AI pipelines and platform services.
Key Responsibilities
• Design and manage cloud infrastructure optimized for AI/ML workloads using Infrastructure as Code (Terraform, Pulumi, etc.)
• Deploy and maintain Kubernetes clusters tailored for GPU scheduling, distributed training, and inference workloads
• Build CI/CD pipelines for AI model training, validation, and deployment across environments
• Collaborate with data scientists and ML engineers to streamline model lifecycle management
• Implement observability and monitoring for AI services (e.g., Prometheus, Grafana, OpenTelemetry)
• Ensure infrastructure security, compliance, and cost-efficiency in multi-tenant AI environments
• Automate provisioning of AI-specific resources (e.g., GPU nodes, storage volumes, feature stores)
• Document infrastructure patterns, DevOps workflows, and platform architecture
Why Join Raydian Cloud?
• Shape the future of AI infrastructure and platform services
• Work with a visionary team blending deep tech and strategic execution
• Influence architecture decisions in a fast-moving AI startup environment
• Competitive compensation, flexible work culture, and growth opportunities
Job Requirements
Required Skills & Qualifications
• Strong experience with Kubernetes, including GPU scheduling and Helm
• Proficiency in Infrastructure as Code tools (Terraform, Pulumi, etc.)
• Familiarity with cloud platforms (AWS, Azure, GCP) and AI services (e.g., SageMaker, Vertex AI)
• Experience with CI/CD tools (GitHub Actions, GitLab CI, Argo Workflows)
• Scripting skills in Python, Bash, or Go
• Understanding of ML model lifecycle and data pipeline orchestration
• Excellent communication and collaboration skills across technical and business teams
Nice to Have
• Experience with Kubeflow, MLflow, or similar MLOps frameworks
• Knowledge of containerized AI workloads (e.g., TensorFlow Serving, Triton Inference Server)
• Familiarity with service mesh technologies (Istio, Linkerd) for AI microservices
• Certifications in Kubernetes or cloud platforms (e.g., CKA, AWS Certified DevOps Engineer - Professional)
About the Company
Raydian Cloud is a leader in AI-driven digital transformation, delivering secure, scalable, and sovereign cloud solutions for enterprises and governments. By leveraging strategic partnerships with industry leaders like NVIDIA, Rafay Systems, and Monetize360, we provide a complete ecosystem for AI innovation—from infrastructure to talent development. We empower organizations in highly regulated sectors such as healthcare, finance, and telecommunications to harness the power of AI while ensuring data sovereignty and strict regulatory compliance.