Design and build infrastructure solutions –
Executing the technical contract:
- Design and execute strategies that ensure the scalability and elasticity of infrastructure.
- Architect system and process solutions to satisfy business and IT needs with specific emphasis on application and data integration and interoperability.
- Engage in and improve the whole life cycle of application and cloud services – from inception and design, through deployment, operation and refinement.
- Align integration development, platform strengths and constraints and IT team efforts with business expectations.
- Understand, learn and analyse any Layer 2-7 network protocol, using network protocol analysis tools (e.g., a sniffer), in order to flesh out the architecture and design of new systems and to perform root-cause analysis on existing systems.
- Analyse large amounts of operational system and instrumentation data to identify the root causes of problems and to find the best architecture and design adjustments to solve such problems.
- Create various application and platform integration options for a given application and motivate which design would fit best in any given context.
- Design and implement application instrumentation to enhance platform integration, operational control and operational reporting and alerting.
- Design, develop, ship, and motivate the creation of software and systems to increase product reliability and organizational efficiency.
- Participate in and align with software release cycles. Work closely with Developers to ensure software releases are well designed, planned, implemented, released, and monitored.
- Automate time-consuming manual processes.
Systems Reliability Engineering (SRE):
- Lead development and tracking of SRE Error Budgets.
- Lead development of SRE dashboards.
- Assess current SRE solution and define the SRE approach for products.
- Work with Applications Development teams on designing, implementing, and improving SRE practices.
Enable a sustainable platform service –
- Guide reliability practices throughout the entire software development life cycle via activities such as architecture reviews, code reviews, creating platforms and frameworks, and capacity planning.
- Proactively identify integration issues, and design and implement improvements to the system and component development landscape with regard to application and data integration.
- Proactively and continuously analyse platforms and improve systems reliability and resilience.
- Conduct production readiness reviews (platform related).
- Work with Senior Engineering and Testing team members to build tools and testing strategies for problem prevention, detection, and chaos testing.
- Design and create test cases with targeted outcomes that are driven by solid architecture and design principles. This includes running test cases, analysing their outcomes and adjusting the application architecture and design accordingly to ensure the desired outcomes are met.
Incident and Risk Management:
- Analyse security vulnerabilities within a given context and design solutions for them.
- Troubleshoot production incidents in real time and lead root-cause investigations.
- Improve service reliability through blameless post-incident reviews and using code to prevent or respond to problem recurrence.
- Proactively identify and escalate system anomalies and risks.
- Perform code-level debugging on issues escalated to the team.
Best practice and knowledge sharing –
- Build an understanding of and adhere to the Enterprise Architecture principles.
- Coach architectural and development resources with regard to application and data integration and interoperability.
- Learn new languages or frameworks within a short amount of time as required to create frameworks/components or assist with troubleshooting.
- Attend and facilitate SRE training sessions.
- Bachelor’s Degree in IT or similar tertiary qualification.
- Honours Degree in Computer Science, Information Systems, IT Engineering or similar preferred.
- Minimum 8-10 years’ experience as a Platform Architect with advanced knowledge in the following key areas: containers, deployment architecture, benchmarking, design, and network engineering.
- Minimum 4-6 years of combined experience serving in either a DevOps, SRE, Systems, and/or Software Development role.
- Defining the SRE Roadmap for organisations.
- Extensive experience with public cloud technologies and solutions (Azure preferred).
- IaC tools (Terraform, GitLab).
- Configuration management tools like Ansible, Chef, and Packer.
- Container technology and orchestration (Kubernetes, Docker).
- Linux operating system, testing tools and database management with MySQL.
- Monitoring tools like New Relic, OpsRamp.
- Log management and the ELK Stack (Elasticsearch, Logstash, Kibana).
- Jenkins or other build tools.
- Hands-on experience in administering high availability and high-performance environments, as well as managing large-scale deployments of traffic-heavy applications.
- Handling multiple complex systems and not shying away from the challenge of improving them.
- Deep understanding of microservice architectures, application servers, networks and databases.
- Excellent understanding of Scalability processes and techniques.
- Basic knowledge of: Network Design.
- Working knowledge of:
  - Data Schema and Code Design with emphasis on integration.
  - Key network performance design considerations.
  - Windows System Internals.
  - Linux System Internals.
  - Cloud computing.
- Expert at creating:
  - Application Architecture.
  - Application Design.
  - Application Integration.
  - Data Integration.
- Developing with C/C++.
- Web Service standards.
- Object Oriented design and development.
- Solutions profiling and tracing, including the use of Telemetry Data.
- Enterprise Application Integration Patterns usage and implementation.
- Cloud integration patterns.
- Data integration patterns and techniques.
- Application instrumentation techniques in order to enhance operational control and reporting.
Ideal to have –
- 10+ Years of experience as a Platform Architect with advanced knowledge in the following key areas: containers, deployment architecture, benchmarking, design, and network engineering.
- 6+ Years of combined experience serving in either a DevOps, SRE, Systems, and/or Software Development role.
- 5 Years C++/C programming and Network Protocol level analysis.
- 2 Years Network Design.
- Cloud computing with the emphasis on integration.
- Dump analysis.
- The 7-layer OSI model.
- Network trace analysis for all 7 layers of the OSI model.
- Methods of securing APIs including but not limited to: Authentication and Authorization mechanisms, Transport and Message security.
- GOF design patterns usage and implementation.
- A working knowledge of: Budgeting and procurement.
- Ability to multitask without missing critical details.
- Strong learning potential (in order to learn new coding languages in a short period of time).
- Integration skills – mastering new technologies and harmonising them with existing systems to achieve better results.
- Excellent communication and collaboration skills.
While we would really like to respond to every application, should you not be contacted for this position within 10 working days please consider your application unsuccessful.
When applying for jobs, ensure that you meet the minimum job requirements. Only SA Citizens will be considered for this role. If you are not in the mentioned location of any of the jobs, please note your relocation plans in all applications for jobs and correspondence. Please e-mail a Word copy of your CV to [Email Address Removed] and mention the reference numbers of the jobs. We have a list of jobs on [URL Removed] Datafin IT Recruitment – Cape Town Jobs.