Who We Are:
Hudson’s Bay Company is a diversified global retailer, focused on driving the performance of high quality stores and their all-channel offerings, growing through acquisitions, and unlocking the value of real estate holdings. Founded in 1670, HBC is the oldest company in North America. Our portfolio today includes formats ranging from luxury to premium department stores to off-price fashion shopping destinations, with more than 300 stores and over 30,000 employees around the world. Our leading banners across North America and Europe include Hudson’s Bay, Saks Fifth Avenue and Saks OFF 5TH. We have significant investments in real estate joint ventures. HBC has partnered with Simon Property Group Inc. in the HBC Global Properties Joint Venture, which owns properties in the United States and Germany. In Canada, HBC has partnered with RioCan Real Estate Investment Trust in the RioCan-HBC Joint Venture. A truly global corporate citizen, HBC is committed to responsible business practices to bring about positive change, and we work hard to shape a sustainable future for people and the planet. Our philanthropic initiatives help create healthy families, strong communities, and sport excellence in the cities and countries in which we operate around the world. We also strive to create innovative programs and resources that provide flexibility for work-life balance in order to maintain a positive working environment.
What This Position is All About:
The Site Reliability Engineering Manager role assists in planning, monitoring, and controlling the day-to-day operations and delivery of the Site Reliability Engineering teams. The role assists in managing team productivity and works to ensure the optimal health of the Hudson’s Bay eCommerce & CRM platforms by overseeing platform performance, resilience, and stability. This role is also an active participant in all aspects of Site Reliability Engineering, including technical vision, telemetry and observation decisions, automation strategy, solution delivery, and platform incident and problem management. This is a leadership role with both technical and people leadership responsibilities. As such, this role participates in short- and long-term systems planning, as well as team and organizational planning. This position reports directly to the Director, Site Reliability Engineering.
Who You Are:
- Provide technical and people leadership to the Site Reliability Engineering teams by facilitating one-on-one, team, and performance review meetings.
- Assist in budgeting, planning, hiring, and third-party contract negotiations.
- Oversee and report on project status, assemble project teams, and help to define assignments against defined schedules and milestones.
- Continuously work to improve the reliability, stability, and performance of the digital platforms by overseeing the implementation of fully automated telemetry, observation, & applied intelligence systems.
- Continuously work to improve problem identification and service restoration of digital platforms by leading and overseeing efforts to define, enhance, and deliver automated alerting and response systems with intelligent, self-healing capabilities.
- Provide periodic on-call escalation support based on established 24/7/365 support schedules.
- Fulfill the role of Escalation Manager/Critical Incident Manager on major incidents, facilitating incident resolution and leading teams through effective service restoration.
- Communicate and provide timely status and incident reports to Sr. Leadership.
- Collaborate with admins and platform engineers through implementation decisions to achieve highly reliable infrastructure, systems, and integrations.
- Lead conversations and provide business and engineering support for both in-house and external customers.
- Provide advanced Incident Management and Problem Management support to teams, to effectively identify, remediate, and resolve issues related to platform reliability, stability, and performance through careful analysis of telemetry data and system logs.
- Document all changes following controls, procedures, and documentation standards, and raise issues and concerns with recommendations for follow-up action.
You also have:
- Bachelor’s Degree in Computer Science or equivalent
- Azure/AWS, Microsoft, and Red Hat certifications, and knowledge of ITIL/MOF practices
- Highly experienced with monitoring, logging & telemetry tools like New Relic, Splunk, ELK, Nagios, SolarWinds, Prometheus, AWS CloudWatch, Datadog, etc.
- Experienced in the administration and support of Digital Retail Platforms, e.g. Salesforce CC, Shopify, Magento, IBM WebSphere Commerce, etc.
- Advanced understanding of Networking, Content Delivery Networks (CDN, e.g. Akamai, Cloudflare), and Cloud Platforms.
- Understanding of, and hands-on experience in, the monitoring of streaming platform technologies, like Apache Kafka.
- Highly experienced with automation and tools such as (but not limited to) Jenkins, Chef, Terraform, Ansible, etc.
- Expert in architecting, creating and supporting Automation (PowerShell, Python, Ruby, AWK, SED, etc.) to run health-checks and self-healing capabilities for the platforms.
- Advanced experience in the use of the following platforms and tools:
  - Cloud: MS Azure/AWS Cloud
  - Networking fundamentals: TCP/IP, DNS, WINS, DHCP, etc.
  - Collaboration & Change Management tools: Jira, ServiceNow, Cherwell, etc.
  - Databases: Oracle, MS SQL, Teradata, DB2, etc.
- 10+ years of experience working in global organizations with the ability to effectively communicate with executives, leaders and individual contributors across the organization.
- 5+ years of SRE experience working with telemetry, observation, self-healing solutions, and platform automation.
Thank you for your interest in HBC. We look forward to reviewing your application.
HBC provides equal employment opportunities (EEO) to all employees and applicants for employment.