Site Reliability Engineer

Location: Pune, MH, IN

Department: Data Center

Position Description

The AdServer and RTB Production Infrastructure is pivotal to ensuring our software applications' reliability, availability, and overall excellence.
As an SRE, you will be responsible for the AdServer and RTB Production Infrastructure. Your essential duties are to keep large-scale distributed software applications running smoothly and performing well. The role centers on maintaining a robust, high-performing environment, contributing to the reliability of our services, and devising solutions that support 24/7 availability. Your technical expertise and dedication help maintain a seamless experience for our users while upholding high standards of operational excellence. Your specific responsibilities include:

Responsibilities:

Operational Support
- Be a primary point of contact for operational support of multiple large-scale distributed software applications in the Ad Server environment.
- Monitor application availability, promptly detect anomalies, analyze their impact, debug problems in production, and follow up on resolution by working closely with the engineering team.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Diligently work with the engineering team to expedite the resolution of incidents and ensure a swift return to normal operations.
- Be innovative in building dashboards, adding metrics, writing automation scripts to reduce operational toil, and streamlining processes to enhance system reliability and stability.
- Design and construct software and systems to effectively manage the Ad Serving platform, its underlying infrastructure, and applications.

On-Call Availability and Support
- Work in shifts to provide continuous on-call support for the production systems, and resolve issues independently using predefined handbooks.
- Show a sense of urgency for high-priority issues and arrange war rooms to resolve them.
- Provide timely updates on high-priority issues and perform handovers when a problem requires 24/7 attention.
- Conduct post-incident reviews to identify root causes, recommend preventive measures, and contribute to a culture of learning and improvement.

Requirements:

Bachelor's degree in computer science or a related discipline
3+ years of total experience in software development
Ability to program in languages like C or C++ and scripting languages like Shell or Python
Prior experience in technical engineering is a plus
A proactive approach to identifying problems, performance bottlenecks, and areas for improvement
Must know networking, database (MySQL), and Linux system concepts, as well as debugging and core dump analysis
Hands-on experience with monitoring and observability tools like Grafana, Nagios, Influx, ELK, etc.
Familiarity with containerization tools like Docker and incident management systems like Zenduty
Excellent communication and collaboration skills, with the ability to work effectively across teams
Self-motivated, with a positive mindset for investigating incidents