ESSENTIAL DUTIES AND RESPONSIBILITIES
- Take a purist SRE approach to shared multi-tenant infrastructure for resilient, microservice-based, containerized SaaS systems, as well as customer-centric application environments
- Oversee and automate the team’s growing presence in AWS
- Contribute features, bug fixes, and reliability improvements to core infrastructure systems
- Provide platform reliability engineering for a complex SAML / OAuth-based single sign-on central authentication platform
- Creatively build and develop tooling to aid in driving 24x7x365 follow-the-sun operations of critical production systems
- Automate deployment tasks for core product and infrastructure tools and maintain automation infrastructure
- Create system documentation and training materials to empower and educate our fellow team members
- Build and maintain observability tooling, metrics, and dashboarding for a global platform product infrastructure
- Improve our incident management lifecycle to identify, mitigate, and learn from reliability risks and issues
- Enhance platform observability by helping create a self-healing approach to platform reliability
- Collaborate with engineering teams, providing product feedback and, where necessary, contributing code to the product
REQUIRED SKILLS AND EXPERIENCE
Education and Work Experience:
- Bachelor’s Degree in Computer Science or related field.
- Software engineering and task automation skills with Bash, Python, and / or Go are a must.
- Familiarity with the Agile software development lifecycle.
- Deep background with Linux systems and engineering.
- Highly experienced with engineering and automating on Amazon Web Services (AWS).
- Experience supporting web applications running on Java / Apache / Tomcat in a live production environment.
- Prior experience with IaC tools like Terraform / Terragrunt / Terraspace.
- Prior experience with DevOps / GitOps tools (Git, Bitbucket, Flux CD, TeamCity) for gate promotions.
- Production-at-scale support background in a heavily microservice-based world.
- Hands-on engineering and ops expertise in containerization (Docker, Helm, Kubernetes / EKS, CNI and Ingress networking).
- Strong understanding of single sign-on, SAML, and OAuth (bonus if hands-on experience with Okta).
- Seasoned expertise around certificate technology and basic concepts of encryption.
- Experience working with relational databases such as Aurora Postgres and / or Oracle RDS.
- Advanced exposure to application development, web UI (design and development), JSON, and application architecture.
- Strong experience using observability tools (logging / APM) like Datadog, CloudWatch, and PagerDuty.
- Familiarity with event store / stream-processing technologies like Kafka or AWS SQS.
- Understanding of Open Application Model systems such as KubeVela or Crossplane.
Personal Qualities and Soft Skills:
- You greatly prefer writing code to clicking a GUI.
- You enjoy teaching, being a mentor to others, and working across boundaries.
- Outstanding troubleshooting skills; ability to think critically and display an aptitude for problem solving.
- Strong analytical mind with a penchant for process development and enhancement.
- A highly positive can-do attitude with a desire to be a team player.
- Great communication skills and ability to explain complex technical concepts to a varied audience.
- Demonstrate strong follow-through and a strong work ethic, and consistently keep and meet commitments.
Other Requirements:
- Ability to read, write, and speak English.
- We provide 24x7 support to our customers, so we expect you to take turns with your teammates being on-call for weekend production emergencies or to provide rotating weekend operational support.
- Travel: expect occasional travel (less than 5%) to other Guidewire offices for training and team meetings.