Job Description:
• We are looking for an experienced Senior Terraform Engineer to join our team. The successful candidate will be responsible for ensuring the reliability, availability, and performance of our production systems. The candidate will work closely with development teams to ensure that new systems are designed with reliability and scalability in mind.
Responsibilities:
• Design and implement systems to ensure the reliability, availability, and performance of our production systems.
• A primary focus will be cloud environment support, build automation, and developer productivity.
• Work with development teams to ensure that new systems are designed with reliability and scalability in mind.
• Develop and maintain monitoring and alerting systems to proactively detect and resolve issues.
• Continuously improve system reliability and performance through the development of automated tools and processes.
• Implement DevOps pipelines and infrastructure automation.
• Investigate and troubleshoot complex system issues and provide root cause analysis.
• Develop and maintain disaster recovery and business continuity plans.
• Collaborate with cross-functional teams to improve system scalability, security, and performance.
• Stay up to date with industry trends and emerging technologies.
Key Skills and Competencies:
• Bachelor's or Master's degree in Computer Science, Engineering, or a related field
• 5+ years of experience in site reliability engineering or a related field
• Strong understanding of the following monitoring concepts: infrastructure, system, and application health; system availability; latency; performance; and end-to-end monitoring.
• Strong monitoring and debugging skills.
• Strong experience with cloud infrastructure and services (AWS, GCP, or Azure)
• Strong expertise and hands-on project experience in enterprise-level development and maintenance of infrastructure as code using Terraform.
• Good practical Linux or Windows systems administration skills in a cloud or virtualized environment.
• Strong hands-on experience with network, storage and compute configuration and setup.
• Experience with container orchestration platforms such as Kubernetes
• Experience with automation and configuration management tools (e.g., Ansible, Puppet, Chef, Terraform)
• DevOps experience creating, maintaining, and managing CI/CD pipelines for infrastructure.
• Experience with monitoring and logging tools such as Prometheus, Grafana, and ELK stack
• Strong understanding of network protocols and infrastructure security best practices.
• Experience with scripting and configuration languages such as Terraform (HCL), Python, Ruby, or Bash
• Strong analytical and troubleshooting skills
• One year of experience with ITIL processes, including Incident, Problem, Change, Knowledge, and Event Management.
• Excellent communication and collaboration skills