DevOps Engineer (AI and Services)
Location: Midrand, Gauteng, South Africa
Employment Type: Full-time, office-based
Reporting Line: General Manager – AI and Services
Contact: Chanel Lubbe – Associate Talent Specialist ([Email Address Removed])

Job Purpose
The DevOps Engineer will be responsible for deploying, managing, and optimizing the AI software stack to support our AI-driven applications. This role involves close collaboration with data scientists and machine learning engineers to ensure the seamless integration of AI models and services within our enterprise environment.
Key Responsibilities

  • Design, implement, and manage on-premises and hybrid infrastructure for AI solutions.
  • Leverage tools like NVIDIA AI Enterprise to streamline containerized application deployments and manage GPU resources effectively.
  • Automate deployment and configuration of AI software solutions, ensuring that machine learning models and AI frameworks (e.g., TensorFlow, PyTorch) are optimized for performance.
  • Develop scripts and tools in Python to facilitate rapid deployment of AI applications across various environments.
  • Implement CI/CD pipelines tailored to AI workloads and their GPU library dependencies (e.g., cuDNN) using tools like Jenkins, GitLab CI, or CircleCI.
  • Collaborate with data science teams to ensure efficient model versioning and deployment strategies.
  • Establish monitoring solutions to track the performance and utilization of AI resources and systems for efficient and reliable operations.
  • Analyze system performance, identify bottlenecks, and implement tuning strategies for optimal GPU and application performance.
  • Work closely with cross-functional teams, including data scientists, ML engineers, and IT, to support the integration of AI solutions into business applications.
  • Provide guidance and support for best practices in AI model training and deployment, ensuring effective use of AI tools and solutions.
  • Implement security measures and best practices to safeguard data and AI models.
  • Ensure compliance with relevant data protection regulations and industry standards.
  • Create and maintain comprehensive documentation related to infrastructure setups, deployment processes, and operational guidelines.
  • Conduct training sessions for team members on AI tools, platforms, DevOps practices, and workflows.

Requirements
Experience and Knowledge

  • 3+ years of experience in a DevOps role, with a focus on automation, AI, machine learning, or data engineering.
  • Hands-on experience with NVIDIA AI Enterprise software is an advantage.
  • Experience with technologies like Service Fabric, Redis, Rancher, ASP.NET, .NET Core, RabbitMQ, Elastic Stack, Git, APIs, and Terraform is beneficial.
  • Knowledge of AI platforms (e.g., NVIDIA, Intel, OpenShift AI, Kubernetes), AI models (e.g., Hugging Face, NVIDIA, Llama, GPT), and AI infrastructure (e.g., Dell AI Factory, Supermicro, Nutanix AI).

Skills and Education

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • Proficiency in Python and other scripting languages (e.g., Bash) for automation and tool development.
  • Familiarity with containerization technologies (Docker, Kubernetes, Rancher) as they relate to AI workloads.
  • Understanding of machine learning frameworks (TensorFlow, PyTorch) and their deployment.
  • Knowledge of coding, scripting, and configuration languages and tools such as Python, JavaScript, YAML, JSON, Terraform, and Ansible.
  • Understanding of messaging protocols, APIs, SDKs, and open-source databases.
  • Fundamental understanding of networking concepts like TCP/IP, DNS, TLS, and load balancing.
  • Strong analytical and problem-solving skills with keen attention to detail.
  • Excellent communication and teamwork abilities, with a collaborative mindset.
  • Ability to adapt to a fast-paced environment and manage multiple priorities effectively.

Desired Skills:

  • Python
  • DevOps
  • Machine Learning
  • NVIDIA
  • Kubernetes
  • Supermicro
