Job title: Site Reliability Engineer
Job type: Contract
Emp type: Full-time
Pay interval: Hourly
Pay rate: Negotiable
Location: Abu Dhabi
Job published: 2024-01-22
Job ID: 32211
Contact name: Tharushi Pramodi
Contact email: Tharushi@Cloudsource.ae

Job Description

Job Title: Site Reliability Engineer (SRE)

Location: Abu Dhabi, UAE

Responsibilities:

  • Lead the design and implementation of scalable, reliable, and efficient solutions.
  • Manage and scale virtual machines (VMs) and Kubernetes clusters to ensure optimal performance and reliability.
  • Provide deep technical expertise in databases, with a focus on Postgres, to optimize system performance and reliability.
  • Work on the design and scalability of systems in both private and public cloud environments.
  • Collaborate with cross-functional teams to enhance system observability and monitoring using tools like Prometheus, Grafana, OpenTelemetry, and other relevant technologies.
  • Analyze system and platform metrics to proactively identify and address potential issues before they impact the user experience.
  • Contribute to the continuous improvement of our systems, processes, and infrastructure.

Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Minimum of 5 years of experience as a Site Reliability Engineer (SRE).
  • Solid experience with database management, particularly with Postgres.
  • Proficiency in container orchestration platforms, especially Kubernetes.
  • Strong background in scaling virtual machines and managing cloud environments (private and public).
  • Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, and OpenTelemetry.
  • Familiarity with system-level metrics and performance analysis.
  • Excellent problem-solving skills and the ability to troubleshoot complex issues in a production environment.

Preferred Skills:

  • Previous experience with private and public cloud platforms (e.g., AWS, Azure, GCP).
  • Knowledge of system-level metrics and instrumentation.
  • Familiarity with containerization technologies (e.g., Docker).
  • Experience with log management and analysis tools.
  • Strong scripting and automation skills (e.g., Python, Bash).
  • Excellent communication and collaboration skills.