DevOps Engineer (AI and Services)
Location: Midrand, Gauteng, South Africa
Employment Type: Full-time and Office-based
Reporting Line: General Manager – AI and Services
Contact: Chanel Lubbe – Associate Talent Specialist ([Email Address Removed])
Job Purpose
The DevOps Engineer will be responsible for deploying, managing, and optimizing the AI software stack to support our AI-driven applications. This role involves close collaboration with data scientists and machine learning engineers to ensure the seamless integration of AI models and services within our enterprise environment.
Key Responsibilities
- Design, implement, and manage on-premises and hybrid infrastructure for AI solutions.
- Leverage tools like NVIDIA AI Enterprise to streamline containerized application deployments and manage GPU resources effectively.
- Automate deployment and configuration of AI software solutions, ensuring that machine learning models and AI frameworks (e.g., TensorFlow, PyTorch) are optimized for performance.
- Develop scripts and tools in Python to facilitate rapid deployment of AI applications across various environments.
- Implement CI/CD pipelines tailored for AI workloads, including those that depend on GPU libraries such as cuDNN, using tools like Jenkins, GitLab CI, or CircleCI.
- Collaborate with data science teams to ensure efficient model versioning and deployment strategies.
- Establish monitoring solutions to track the performance and utilization of AI resources and systems for efficient and reliable operations.
- Analyze system performance, identify bottlenecks, and implement tuning strategies for optimal GPU and application performance.
- Work closely with cross-functional teams, including data scientists, ML engineers, and IT, to support the integration of AI solutions into business applications.
- Provide guidance and support for best practices in AI model training and deployment, ensuring effective use of AI tools and solutions.
- Implement security measures and best practices to safeguard data and AI models.
- Ensure compliance with relevant data protection regulations and industry standards.
- Create and maintain comprehensive documentation related to infrastructure setups, deployment processes, and operational guidelines.
- Conduct training sessions for team members on AI tools, platforms, DevOps practices, and workflows.
Requirements
Experience and Knowledge
- 3+ years of experience in a DevOps role, with a focus on automation, AI, machine learning, or data engineering.
- Hands-on experience with NVIDIA AI Enterprise software is an advantage.
- Experience with technologies like Service Fabric, Redis, Rancher, ASP.NET, .NET Core, RabbitMQ, Elastic Stack, Git, APIs, and Terraform is beneficial.
- Knowledge of AI platforms (e.g., NVIDIA, Intel, OpenShift AI, Kubernetes), AI models (e.g., Hugging Face, NVIDIA, Llama, GPT), and AI infrastructure (e.g., Dell AI Factory, Supermicro, Nutanix AI).
Skills and Education
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- Proficiency in Python and other scripting languages (e.g., Bash) for automation and tool development.
- Familiarity with containerization technologies (Docker, Kubernetes, Rancher) as they relate to AI workloads.
- Understanding of machine learning frameworks (TensorFlow, PyTorch) and their deployment.
- Knowledge of coding/scripting languages (Python, JavaScript), data formats (YAML, JSON), and infrastructure-as-code tools (Terraform, Ansible).
- Understanding of messaging protocols, APIs, SDKs, and open-source databases.
- Fundamental understanding of networking concepts like TCP/IP, DNS, TLS, and load balancing.
- Strong analytical and problem-solving skills with keen attention to detail.
- Excellent communication and teamwork abilities, with a collaborative mindset.
- Ability to adapt to a fast-paced environment and manage multiple priorities effectively.
Desired Skills
- Python
- DevOps
- Machine Learning
- NVIDIA
- Kubernetes
- Supermicro