Project Bear - Rebuilding Bash's Core Supply Chain Infrastructure

The Bear Season 2 Poster, by FX Networks

This was originally published on LinkedIn.

In July, the Bash team faced significant decisions: after a comprehensive review of our supply chain services, we had to choose between iterating on our existing legacy systems, which were largely 10+ year-old projects with little internal IP, or building new services to meet our needs for the 2023 season and beyond.

Context

We landed in this situation due to a longstanding misalignment between business and engineering culture, commonly seen in large corporate environments. The existing services had been meeting business requirements for many years but had not kept pace with modern engineering standards. This resulted in opaque systems with limited monitoring and understanding of the decisions they made.

Fast forward to 2021: business expectations shifted to an engineering-led mindset. Stakeholders now sought insight into how their systems worked and urged teams to invest in core engineering best practices.

Over the past two years, teams worked to upgrade their services and create some transparency. However, at their core, these services still suffered from misalignment, as evidenced by:

A majority of logic residing in SQL Stored Procedures, making changes risky.
Database queuing, which, at our scale, caused numerous problems.
Writing logs to databases.
Heavy reliance on deprecated/unsupported technology.
On-premise deployments with limited/no control of the environment.
A mid-2010s, Microsoft-centric stack with minimal open-source usage and no clear path to improvement (e.g. .NET Core, containerisation).

Our Tech Standards

Given these limitations, we set an ambitious goal: we wanted our 2023 peak shopping season to run on new services built around the tech standards we are developing within Bash:

Golang as our preferred language.
PostgreSQL as our primary database engine of choice, moving away from MS SQL.
Embracing event-driven architectures, utilising Kafka for internal and domain events.
Deployment in a cloud environment (AWS) on Kubernetes.
Seamless observability and monitoring with Prometheus integration and detailed Grafana dashboards.
Empowering our Data Team by centralising all pertinent data in Snowflake.
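To make the event-driven standard a little more concrete, here is a minimal sketch in Go of how a domain event could be encoded before being handed to a Kafka producer, and decoded again on the consumer side. Everything in it is an assumption made for illustration: the `DomainEvent` envelope, the `OrderAllocated` payload, the `order.allocated` type name and the field names are hypothetical and are not taken from our actual services. It uses only the standard library and stops short of calling a Kafka client.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// eventTypeOrderAllocated is a hypothetical event type name.
const eventTypeOrderAllocated = "order.allocated"

// DomainEvent is a hypothetical envelope wrapping every event payload.
type DomainEvent struct {
	Type       string          `json:"type"`
	OccurredAt time.Time       `json:"occurred_at"`
	Payload    json.RawMessage `json:"payload"`
}

// OrderAllocated is an illustrative payload, not a real schema.
type OrderAllocated struct {
	OrderID    string `json:"order_id"`
	FacilityID string `json:"facility_id"`
}

// EncodeOrderAllocated returns the message key and value a producer could send.
// Using the order ID as the key assumes key-based partitioning, so that all
// events for one order land on the same partition and are consumed in order.
func EncodeOrderAllocated(p OrderAllocated) (key, value []byte, err error) {
	payload, err := json.Marshal(p)
	if err != nil {
		return nil, nil, err
	}
	value, err = json.Marshal(DomainEvent{
		Type:       eventTypeOrderAllocated,
		OccurredAt: time.Now().UTC(),
		Payload:    payload,
	})
	if err != nil {
		return nil, nil, err
	}
	return []byte(p.OrderID), value, nil
}

// DecodeOrderAllocated is the consumer-side counterpart: it unwraps the
// envelope, checks the event type and decodes the payload.
func DecodeOrderAllocated(value []byte) (OrderAllocated, error) {
	var ev DomainEvent
	if err := json.Unmarshal(value, &ev); err != nil {
		return OrderAllocated{}, err
	}
	if ev.Type != eventTypeOrderAllocated {
		return OrderAllocated{}, fmt.Errorf("unexpected event type %q", ev.Type)
	}
	var p OrderAllocated
	if err := json.Unmarshal(ev.Payload, &p); err != nil {
		return OrderAllocated{}, err
	}
	return p, nil
}

func main() {
	key, value, err := EncodeOrderAllocated(OrderAllocated{OrderID: "ord-123", FacilityID: "store-42"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("key=%s value=%s\n", key, value)
}
```

Keeping the payload inside a typed envelope means consumers can route on `type` without knowing every payload schema up front, which is one common way to keep internal and domain events on the same tooling.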

Why Project Bear?

The project's name is inspired by the FX show "The Bear", which follows a young chef from the fine dining world who returns home to Chicago to transform his small family restaurant into a world-class dining establishment. Watching The Bear, we found many parallels between operating in a high-performance kitchen and an engineering team, where communication, situational awareness, and trust are crucial.

Here are some short clips which touch on some of the aspects in the show:

Sydney & Richie's Five Minutes
Richie Steps Up at the Michelin Star Restaurant
Kitchen Nightmare (and what goes wrong when you don't communicate)

New Services

By the first week of November, roughly four months after the project began, all Project Bear services were operational. The following services were deployed:

A new foundation for our Order Management System, enhancing order orchestration to be simple, extensible, and observable.
A new service for representing all stock in TFG stores and warehouses, and an allocation engine (affectionately known as Al’s Alligator 🐊) for routing order item pick instructions to the appropriate facility.
A new service for managing all supply chain facilities, allowing extensive configuration via Retool.
A new logistics engine, giving our operations teams more transparency into courier integrations and parcels.
(Bonus) A new communication service, revamping how we send SMSes and emails for critical journeys like one-time PINs.
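As a rough illustration of what routing order item pick instructions to a facility involves, here is a minimal Go sketch. It is not how Al's Alligator actually works: the `Facility`, `OrderItem` and `PickInstruction` types, the `Allocate` function and the "first facility in priority order that can fill the item in full" rule are all assumptions made for this example. A real allocation engine would weigh many more factors.

```go
package main

import "fmt"

// Facility is a hypothetical store or warehouse with stock per SKU.
type Facility struct {
	ID    string
	Stock map[string]int // SKU -> units on hand
}

// OrderItem is a hypothetical order line.
type OrderItem struct {
	SKU string
	Qty int
}

// PickInstruction tells one facility to pick one order item.
type PickInstruction struct {
	FacilityID string
	Item       OrderItem
}

// Allocate routes each item to the first facility (in slice order, treated
// here as priority order) that can fill it in full. It decrements that
// facility's stock so later items see the reduced quantity. Items that no
// facility can fill, or that have a non-positive quantity, are returned as
// unallocated.
func Allocate(facilities []Facility, items []OrderItem) (picks []PickInstruction, unallocated []OrderItem) {
	for _, item := range items {
		placed := false
		if item.Qty > 0 {
			for i := range facilities {
				f := &facilities[i]
				if f.Stock[item.SKU] >= item.Qty {
					f.Stock[item.SKU] -= item.Qty
					picks = append(picks, PickInstruction{FacilityID: f.ID, Item: item})
					placed = true
					break
				}
			}
		}
		if !placed {
			unallocated = append(unallocated, item)
		}
	}
	return picks, unallocated
}

func main() {
	facilities := []Facility{
		{ID: "store-1", Stock: map[string]int{"sku-a": 1}},
		{ID: "warehouse-1", Stock: map[string]int{"sku-a": 5, "sku-b": 2}},
	}
	items := []OrderItem{
		{SKU: "sku-a", Qty: 1},
		{SKU: "sku-a", Qty: 2},
		{SKU: "sku-c", Qty: 1},
	}
	picks, unallocated := Allocate(facilities, items)
	fmt.Println("picks:", picks)
	fmt.Println("unallocated:", unallocated)
}
```

Because the function works purely on in-memory data, it runs in well under a millisecond for small inputs and is easy to test in isolation, which is part of the appeal of moving this kind of logic out of stored procedures and into application code.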

Results

These new services significantly improved customer experience, evidenced by:

A 90% reduction in average time to send critical communications, down to less than 5 seconds during peak periods.
A reduction in the time our Order Management System takes to allocate an order to a facility, from minutes (sometimes hours during peak periods) to less than a second.
Operations teams empowered to monitor and take action on supply chain facilities through Retool apps built atop the new services.
A 66% improvement in our core order item cancellation metric during peak periods.

While this phase of the project is complete, we are only about 40% done with our overall goal of deprecating all our legacy systems (not just core supply chain). We will continue to expand the new services over the coming months until we can fully move away from our legacy on-premise services.

Reflection

This achievement would not have been possible without the hard work and dedication of the project engineering team, their PMs, and supporting engineers in the business. Thank you for all your hard work. It was a once-in-a-lifetime experience to work with such a talented team and to have an entire engineering organisation rally around a goal like this.

We will share more about some of our new services in the new year, but in the meantime the team will be taking a well-deserved holiday break.