Job Title: Site Reliability Engineer (SRE)

Location: Abu Dhabi, UAE

Responsibilities:

Lead the design and implementation of scalable, reliable, and efficient solutions.
Manage and scale virtual machines (VMs) and Kubernetes clusters to ensure optimal performance and reliability.
Provide deep technical expertise in databases, with a focus on Postgres, to optimize system performance and reliability.
Design and scale systems in both private and public cloud environments.
Collaborate with cross-functional teams to enhance system observability and monitoring using tools such as Prometheus, Grafana, and OpenTelemetry.
Analyze system and platform metrics to proactively identify and address potential issues before they impact the user experience.
Contribute to the continuous improvement of our systems, processes, and infrastructure.

Qualifications:

Bachelor's degree in Computer Science, Engineering, or a related field.
Minimum of 5 years of experience as a Site Reliability Engineer (SRE).
Solid experience with database management, particularly with Postgres.
Proficiency in container orchestration platforms, especially Kubernetes.
Strong background in scaling virtual machines and managing cloud environments (private and public).
Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, and OpenTelemetry.
Familiarity with system-level metrics and performance analysis.
Excellent problem-solving skills and the ability to troubleshoot complex issues in a production environment.

Preferred Skills:

Previous experience with public cloud platforms (e.g., AWS, Azure, GCP) and with private cloud environments.
Knowledge of system-level metrics and instrumentation.
Familiarity with containerization technologies (e.g., Docker).
Experience with log management and analysis tools.
Strong scripting and automation skills (e.g., Python, Bash).
Excellent communication and collaboration skills.