Site Reliability Engineer

A site reliability engineer (SRE) creates a bridge between development and IT operations by taking on tasks typically done by operations. Rather than being handled manually, these tasks are given to engineers who use automation tools to solve problems and build scalable, reliable software systems.

Standardization and automation are at the heart of what an SRE does, especially as systems migrate to the cloud. SREs therefore often have a background in software engineering, systems engineering or system administration, along with IT operations experience.

What is site reliability engineering?

We will start with a definition of site reliability engineering before we move on to the role and responsibilities of a site reliability engineer.

Site reliability engineering is a term that was first coined by Google, where it is described as “when you treat operations as if it’s a software problem.”

The main purpose of SRE is to develop software systems and automated solutions for operational work. SRE thus does the work traditionally done by operations, but uses engineers with software expertise to solve complex problems.

Site reliability engineering can therefore be considered a set of practices that incorporates aspects of software engineering into operations, thereby increasing the efficiency and reliability of software systems and improving workflow.

SRE and DevOps

Site reliability engineering is closely related to DevOps, another concept that links software development and operations. DevOps can be seen as a generalization of core SRE principles, and SRE consequently plays a large part in successfully implementing DevOps practices.

Additionally, both DevOps and SRE seek to bridge the gap between operations and development teams to deliver software faster.

However, an article by Google draws a distinction between the two terms, stating that SRE “happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas.”

Read more about DevOps and what a DevOps engineer does.

What does a site reliability engineer do?

A site reliability engineer (SRE) works between development and operations. The SRE, then, is a software developer with experience in and knowledge of IT operations.

Much of this role revolves around writing code to automate processes such as analyzing logs, testing production environments and responding to issues, so this engineer needs to be proficient in writing code.
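As a small illustration of the log analysis mentioned above, the following Python sketch counts log lines by severity level. The log format ("<timestamp> <LEVEL> <message>"), the function name count_levels and the sample lines are assumptions made for this example, not something a particular tool produces.

```python
from collections import Counter

# Severity levels this sketch recognizes (an assumption for the example).
KNOWN_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def count_levels(lines):
    """Count log lines per severity level.

    Assumes a hypothetical format: "<timestamp> <LEVEL> <message>".
    Lines that do not match are counted under "UNKNOWN".
    """
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) >= 2 and parts[1] in KNOWN_LEVELS:
            counts[parts[1]] += 1
        else:
            counts["UNKNOWN"] += 1
    return counts


# Hypothetical sample data.
sample = [
    "2024-01-01T00:00:00Z INFO service started",
    "2024-01-01T00:00:05Z ERROR database timeout",
    "2024-01-01T00:00:09Z ERROR database timeout",
    "garbled line",
]

summary = count_levels(sample)
print(summary["ERROR"], summary["INFO"], summary["UNKNOWN"])  # prints: 2 1 1
```

In practice a script like this would read from real log files or a log aggregation service, and the parsing would follow whatever format those logs actually use.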

Such automation allows developers, in turn, to focus exclusively on feature development, enabling them to bring new features to production as quickly as possible.

The operations team, for its part, will find its workload decreasing, as an SRE will automate solutions for any recurring problem.

Thus, the SRE shifts between development and operations work and maintains a balance between them.

Because an SRE's main focus is automation, they improve the performance, efficiency and monitoring of software development processes.

Required skill set

SREs dedicate their time to creating software that improves the reliability of systems, fixing issues and responding to incidents. As such, they need various technical skills.

They need knowledge of various automation tools, as they are usually responsible for building and integrating software tools that enhance the reliability and scalability of an organization's systems.

As mentioned above, the SRE needs to know how to code and be familiar with common programming languages, including Ruby, JavaScript and PHP.

They also need expertise in the major cloud providers, such as AWS and Google Cloud.

Daily roles and responsibilities of an SRE

Automation

As mentioned previously, SREs build automation tools to manage IT operations. Instead of performing these functions manually, their aim is to automate them. Such functions include:

Continuous integration and continuous delivery
Monitoring
Incident response
Alerts
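To make the alerts item above more concrete, here is a minimal Python sketch of an automated alert rule based on error rate. The function name should_alert and the 5% default threshold are assumptions chosen for illustration; real thresholds depend on the service in question.

```python
def should_alert(error_count, total_requests, threshold=0.05):
    """Return True when the error rate exceeds the threshold.

    The 0.05 (5%) default is a hypothetical value for this sketch.
    With no traffic there is no error rate to evaluate, so no alert fires.
    """
    if total_requests <= 0:
        return False
    return error_count / total_requests > threshold


print(should_alert(6, 100))  # 6% error rate is above 5%: prints True
print(should_alert(2, 100))  # 2% error rate is below 5%: prints False
```

A rule like this would normally be evaluated on a schedule by a monitoring system, with the result routed to an on-call tool rather than printed.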

Monitoring

SREs are responsible for ensuring that the underlying infrastructure is running smoothly and that systems and tools are working as expected.

They also monitor critical applications and services to minimize downtime and ensure their availability.
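Availability and downtime can be estimated from periodic health checks. The Python sketch below assumes a hypothetical setup in which a service is checked once per minute and each result is recorded as True (up) or False (down); the function names and the sample data are illustrative only.

```python
def availability(check_results):
    """Return the fraction of successful health checks (0.0 to 1.0)."""
    if not check_results:
        raise ValueError("at least one check result is required")
    successes = sum(1 for ok in check_results if ok)
    return successes / len(check_results)


def downtime_minutes(check_results, interval_minutes=1):
    """Estimate downtime as failed checks times the check interval."""
    failures = sum(1 for ok in check_results if not ok)
    return failures * interval_minutes


# Hypothetical day of one-minute checks: 1,440 checks, 3 of them failed.
day = [True] * 1437 + [False] * 3

print(f"{availability(day) * 100:.2f}%")  # prints: 99.79%
print(downtime_minutes(day))              # prints: 3
```

This estimate is only as precise as the check interval: an outage shorter than one interval may be missed, and a single failed check is counted as a full interval of downtime.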

Issue resolution

These engineers work closely with developers, especially when issues arise. They collaborate with developers to help with troubleshooting and provide consultation when alerts are issued.

When a developer runs into a problem, the SRE will investigate and then resolve the issue.

After the incident is resolved, the engineer will revisit the issue and determine its cause to ensure it doesn’t happen again.

Cross-team collaboration

As the above shows, SREs work across different teams, mainly operations and development. By building reliable systems and providing support, SREs give these teams more time to focus on building new features and hence get them out to customers faster.

Common tools used by SREs

Monitoring: such as AWS CloudWatch and New Relic
Incident management/on-call: such as PagerDuty and VictorOps
Project management and issue tracking: such as Jira and Trello
Infrastructure orchestration: such as Terraform and SaltStack

To find out about more tools used by site reliability engineers, from project management to infrastructure and container orchestration, check out this curated list of SRE tools.

How much does an SRE make?

According to PayScale, this type of engineer makes a salary of between $76,000 and $158,000 a year in the United States, with the average being $117,768 per year.

Conclusion

Site reliability engineer is becoming an increasingly important role within organizations. It is a challenging role that requires a passion for coding and automation.

Having such engineers in your organization will help reduce your operational costs while improving the reliability of your systems.