What This Position Is All About:
As a Cloud Operations Engineer, you will work across all facets of the Saks Off5th technology stack to ensure stability and uptime in our Production environments. You will manage the delivery, tuning, optimization, cost, and performance of workloads and services that run in our cloud environment, whether in AWS or a physical data center. You will be part of a team responsible for incident triage, day-to-day operations and maintenance, change implementation, front-end performance, Production readiness assurance, Production monitoring, and project implementation planning. You will work with product owners to help prioritize defects, and work closely with Development and DevOps to troubleshoot, plan, and deploy.
Who You Are:
- Ability to read, understand, and amend technical documentation, system diagrams, and integration flows
- Solid communication skills and ability to direct meetings and conversations towards specific goals
- Customer and team-centric attitude
- Support relationship development with internal and external teams and resources
- Knowledge of alerting and monitoring mechanisms
- Support on-call activities on a rotating schedule
- Support and escalate major incidents during production outages
- Prioritize stability and performance
- Maintain focus on customer experience
- Embody a sense of focused urgency in problem solving and communication
- Define action plans and follow through on them to case resolution
- Document systems, processes, and knowledge base articles
- Curious - You keep up with industry trends and best practices and understand when, why, and how to use them.
- Not afraid - to roll up your sleeves and deal with any challenges presented to you including complex legacy systems.
- Innovative - You can think outside the box. You learn from others but are not afraid to try something different.
- Strong Believer - in automating the mundane and repetitive tasks and building reliable systems that only need minimal supervision.
As the Cloud Operations Engineer, You Will:
- Be responsible for day-to-day availability, monitoring, maintenance, and incident response and resolution within the cloud (primarily AWS).
- Work closely with multiple teams and organizations to ensure all services' health meets defined SLAs.
- Troubleshoot and perform the initial triage of failed processes and instances in large scale systems
- Communicate effectively with stakeholders, including team leads, managers, directors, and above.
- Provide thorough documentation for all work using strong written skills.
- Follow and implement risk and compliance policies and procedures.
- Actively contribute in a dynamic team to ensure continuous improvement of processes, policies, automation, and self.
- Continue to develop personally and professionally through internal and extracurricular training.
- Assist with the deployment and day-to-day administration and maintenance of environments hosted in the AWS cloud.
- Demonstrate proficiency with AWS Cloud development principles, managed services, and Infrastructure as Code (Terraform).
- Design and implement automated processes using Terraform for cloud network environments, eliminating manual and repetitive tasks.
- Monitor work and ticket queues to assure issues are addressed and escalated as needed.
- Maintain an accurate picture of existing server, storage, networking software, hardware, and virtual environments to support scaling against various project requirements.
- Manage cloud infrastructure to maintain operational stability and security, including planned hardware and VM maintenance, patching, software upgrades, etc.
You Also Have:
- Bachelor’s degree in Computer Science or Computer Information Systems
- 5+ years of overall experience in Information Technology
- 3+ years of experience interacting with Salesforce Commerce Cloud (Demandware) preferred
- 3+ years of experience with AppDynamics, New Relic, Nagios, Datadog, CloudWatch, PagerDuty, or similar monitoring and alerting platforms
- 3+ years of experience in incident management or production support
- 3+ years of technical experience in the retail or e-commerce space
- 2+ years of experience with CDN providers, e.g., Akamai, Cloudflare
- 3+ years of experience with AWS cloud technologies
- 2+ years of experience with CI/CD tools and an understanding of release and deployment cycle concepts
- Strong experience in Windows and *nix environments
- Excellent understanding of TCP/IP and network communications
- Functional knowledge and experience with text and data representation and manipulation (XML, HTML, Regular Expressions, Scripting, SQL)
- Strong problem solving and analytical skills
- Strong written and verbal communication skills
- Ability to quickly understand security systems in order to identify and validate security requirements
- Ability to manage multiple projects, priorities and deadlines
- Demonstrated initiative, customer orientation and teamwork competencies
- Adaptability, flexibility and ability to work as part of a team or in an individual capacity
- Must demonstrate effective decision making, results delivery, and the ability to stay current with relevant technologies and security practices.
- Willingness to work outside of regular business hours as required, which can include evenings, weekends, and holidays
- Ability to handle and maintain the integrity and confidentiality of highly sensitive material and information
- Ability to work in an office/remote environment and concentrate on complex tasks for extended periods of time
How Often You May Travel: