Site Reliability Engineer III

Location: US-TX-Addison
ID: 338388
Pos. Category: Corporate - Information Systems
Pos. Type: Full Time

Overview

Concentra is recognized as the nation’s leading occupational health care company.

With more than 40 years of experience, Concentra is dedicated to our mission to improve the health of America’s workforce, one patient at a time. With a wide range of services and proactive approaches to care, Concentra colleagues provide exceptional service to employers and exceptional care to their employees.

 

The Site Reliability Engineer (SRE) III is responsible for ensuring that the underlying infrastructure and critical systems are working as expected and running smoothly. The SRE III also monitors critical applications, services, and infrastructure to minimize downtime and ensure their availability. The SRE III plays a large part in improving core system stability and successfully implementing DevOps practices, applying engineering principles to operations with a focus on system reliability, scalability, and performance. This role balances the need for rapid feature development with the imperative of maintaining system stability and availability. The role emphasizes observability, proactive reliability engineering, continuous improvement, and collaboration across teams.

Responsibilities

  • Lead incident response efforts and conduct blameless postmortems to identify root causes and drive systemic improvements.
  • Define, monitor, and report on service-level indicators (SLIs), objectives (SLOs), and agreements (SLAs).
  • Evolve the architecture to support future requirements based on SLIs, SLOs, and SLAs.
  • Identify and eliminate toil by automating repetitive operational tasks, thus increasing velocity and reliability.
  • Keep management informed of problems that are severe or that exceed documented targets.
  • Ensure that all problems are resolved in a timely and efficient manner.
  • Own development of software to automate processes like analyzing logs, testing production environments, and responding to any issues.
  • Develop software tasks in accordance with standards and methodologies.
  • Possess deep knowledge of the entire technology stack.
  • Participate in capacity planning, performance analysis, and system tuning to ensure scalability and resilience.
  • Collaborate with development teams to ensure reliability is considered during design and implementation phases.
  • Mentor others to accelerate their career growth and encourage participation.
  • Provide technical mentoring to junior SREs.
  • Help build team spirit by assisting other staff members and promoting a positive workplace.
  • Challenge team processes, looking for ways to improve them.
  • Recognize potential areas where policies and procedures require change, or where new ones need to be developed, especially regarding future business expansion. Submit recommendations as appropriate.
  • Ensure all changes comply with change management policies and procedures.
  • Embody the philosophy of DevOps and Site Reliability Engineering by providing a prescriptive way of measuring and achieving reliability through engineering and operations work.
  • Monitor and report on any security violations related to the unwarranted access to corporate data.
  • Review outstanding issues daily to ensure that troubleshooting and resolutions are current.
  • Collaborate cross-functionally with application engineering, QA, and infrastructure teams to ensure observability and reliability.
  • Perform tool evaluation and selection in support of observability and automation.

Qualifications

  • Education Level: Bachelor’s Degree
  • Preferred experience includes AWS or Azure certifications.
  • Experience in lieu of required education is acceptable.
  • 7+ years of total work experience in IT, software engineering, or infrastructure roles.
  • Minimum of 5 years of hands-on experience in Site Reliability Engineering, DevOps, or closely related roles.
  • At least 3 years of direct experience with AWS and/or Azure, including infrastructure provisioning, automation, and monitoring.
  • Experience with implementing, managing, and using observability tools, data visualization, and application monitoring platforms such as Dynatrace, AWS CloudWatch, Azure Monitor, Grafana, Prometheus, or Datadog.
  • Familiarity with error budgets and their role in balancing reliability and innovation.
  • Direct experience building, launching, configuring, and maintaining AWS and/or Microsoft Azure cloud resources.
    • Expertise preferred in implementing methodologies for Automation, Continuous Integration, Continuous Delivery, High Availability, High Scalability, Monitoring, Logging, Security, and Governance.
  • Experience with Terraform and a strong understanding of Infrastructure as Code (IaC) principles.
  • Strong scripting knowledge using languages such as PowerShell, Bash, Python, Groovy, etc.
  • Proficiency in at least one programming language preferred, e.g., Python, Java, or .NET.
  • Proficient in Git for version control and collaborative development.
  • Experience with GitLab or similar platforms for source code management and CI/CD.
  • Familiarity with Atlassian tools (Jira, Confluence) is a plus.
  • Proficient in administering Linux and/or Windows-based platforms.
  • Experience supporting production enterprise applications.
  • Strong understanding of complex multi-tiered environments and their integration with DevOps toolsets.
  • Experience in problem management, preventive maintenance, and analytical and conceptual problem solving.
  • Experience in business process improvement is also desired.

Job-Related Skills/Competencies

  • Ability to effectively multi-task and adapt to changing business priorities
  • Excellent attention to detail
  • Willingness to learn new technologies
  • Excellent analytical and problem-solving skills
  • Excellent time management and organizational skills
  • Proven drive towards continual improvement
  • Strong interpersonal and communication skills
  • Strong dedication to quality customer service
  • Must possess a personal sense of urgency
  • Strong analytical mindset for risk assessment and mitigation
  • Ability to quantify reliability and communicate trade-offs
  • Ability to assess and mitigate risks to system reliability through proactive engineering
  • Skilled in quantifying reliability metrics and communicating their impact to stakeholders
