Key Responsibilities:
- Contribute to the global design and implementation of scalable and fault-tolerant infrastructure systems that support engineering and operational needs.
- Contribute to the deployment, configuration, and maintenance of distributed storage and database systems.
- Analyse system failures, performance issues, and misconfigurations across hardware, software, and network.
- Mentor the computer systems engineers and contribute to strategic technical planning.
Qualifications and Experience Requirements:
- BTech / BEng (Hons) in Computer Science, Software Engineering, Information Systems, Electronic Engineering, or equivalent, coupled with 13 years of relevant experience
- MTech / MEng in Computer Science, Software Engineering, Information Systems, Electronic Engineering, or equivalent, coupled with 9 years of relevant experience
- MEng in Computer Science, Software Engineering, Information Systems, Electronic Engineering, or equivalent, coupled with 7 years of relevant experience
- PhD in Computer Science, Software Engineering, Information Systems, Electronic Engineering, or equivalent, coupled with 5 years of relevant experience
- A minimum of 3 years’ experience in a technical leadership or software/system architecture role, with direct responsibility for large-scale, platform-based distributed systems
- Demonstrated hands-on experience in infrastructure design and automation, distributed systems, observability, CI/CD, container orchestration (e.g. Kubernetes), DevOps/SRE practices and cloud-native technologies.
- Experience leading teams or initiatives that intersect with data platforms, storage, networking, and systems engineering domains
Desired Skills:
- Ability to lead architectural discussions and influence design decisions
Desired Work Experience:
- 5 to 10 years
Desired Qualification Level:
- Degree
About The Employer:
Knowledge:
– In-depth understanding of systems engineering principles, including performance optimisation, fault tolerance, and resource scheduling in Linux-based environments.
– Strong knowledge of containerised environments (Docker, Podman), orchestration platforms (Kubernetes, Helm), and runtime architectures (containerd, CRI).
– Expertise in infrastructure-as-code, continuous integration/deployment (CI/CD), and configuration management tools (e.g., GitLab CI, Ansible, Terraform, ArgoCD).
– Advanced understanding of distributed computing and storage architectures, including Ceph, S3, NFS, and local/clustered file systems.
– Operational and architectural fluency in relational and NoSQL database systems (e.g., PostgreSQL, MySQL, MongoDB), including replication, backups, and performance tuning.
– Working knowledge of networking fundamentals, security protocols, and systems-level observability (e.g., Prometheus, Grafana, ELK/EFK stack).
– Familiarity with the HPC ecosystem (e.g., SLURM and other job schedulers) is beneficial for environments supporting scientific or research computing.
Competencies (Essential):
– Demonstrated technical leadership (3+ years): Proven ability to lead cross-functional initiatives across systems, storage, and database infrastructure, driving technical decisions from architecture through to implementation.
– Systems engineering expertise: Strong background in Linux administration, infrastructure automation, service orchestration, and performance optimisation across diverse environments.
– Distributed systems architecture: Extensive experience in designing and deploying scalable, resilient services using microservices, event-driven, and cloud-native design patterns.
– Containerisation and orchestration: Proficient in operating production-grade environments using Kubernetes, Docker, and Helm for both system and application deployments.
– Infrastructure automation and CI/CD: Hands-on experience with tools such as GitLab CI, ArgoCD, FluxCD, Jenkins, or GitHub Actions to enhance and secure platform operations.
– DevOps and SRE practices: Solid understanding of infrastructure-as-code, configuration management, and release automation (DevOps), alongside incident response, monitoring, SLIs/SLOs, and site reliability engineering (SRE).
– Advanced Linux expertise: Skilled in troubleshooting, kernel tuning, systemd orchestration, and large-scale system optimisation.
– Technical delivery and planning: Experience in backlog management, cross-team collaboration, and Agile sprint execution.
– Database administration: Practical experience managing both relational and NoSQL databases (e.g., PostgreSQL, MySQL, MongoDB), including high availability, backups, replication, and performance tuning.
– Strong diagnostic and problem-solving skills: Ability to adopt a root-cause-first approach, with a strong sense of ownership, accountability, and focus on long-term operational stability.
Skills:
– Technical leadership: Ability to lead architectural discussions, influence design decisions, and mentor junior engineers across infrastructure streams.
– Resource management and leadership: Demonstrates leadership that fosters innovation and supports the development of emerging skills. Builds trust through consistency, integrity, understanding, and patience, while effectively planning, allocating, and monitoring resources to achieve desired outcomes.
– Problem-solving and analytical skills: Strong capability in root cause analysis, systems troubleshooting, and resolving performance bottlenecks.
– Communication and collaboration: Ability to clearly articulate technical recommendations, engage with cross-functional stakeholders, and effectively incorporate feedback.
– Planning and delivery: Proficient in backlog grooming, sprint planning, and delivering technical solutions within Agile and DevOps environments.
– Continuous learning: Committed to staying up to date with evolving technologies, particularly in containerisation, cloud-native systems, observability, and systems automation.
– Documentation and knowledge sharing: Skilled in producing high-quality technical documentation and effectively sharing knowledge across engineering teams.