Like many people in today’s day and age, on top of being a Data Engineer, I have a side hustle. More specifically, about 6 months ago, I purchased an ice cream shop. Small business ownership has its ups and downs, but few are more interesting than what happened this week.
On a normal day, I don’t have to deal with much for this business; I have a great management team that can handle most things thrown at them. Unfortunately, this was not one of those days. I received a call early in the morning from my general manager informing me of a massive freezer malfunction. It had caused all of the ice cream in our walk-in freezer (approximately 320 gallons) to melt.
After a few moments of panic, our team got to work. We started by breaking the work down into three categories: assessment, repair, and remediation.
Assessment
What is the damage? How far reaching is it? Can anything be saved? This mainly consisted of taking inventory of everything in the freezer that was lost and checking how much product we had outside of the freezer to see how long we could stay open. At this point in the process, we figured we could probably remain open for about 3 days without using the product from the walk-in. Assessment also included informing our staff.
Repair
What happened to cause this? How do we fix the root cause? For this step, we had to call in an expert, as neither I nor my staff have much experience fixing industrial cooling equipment. We were able to get in touch with our usual expert, and he told us he would be out later in the day.
Remediation
How do we get rid of all this? How do we recoup our losses? This was a major concern. Even though we have access to a dumpster, putting close to 3000 pounds of melted ice cream into it on a hot August day didn’t seem like the best plan. I ended up calling a waste removal company, who admitted they were not sure exactly what to do either. Together we came up with a plan: we would load the ice cream into large construction trash bags, and they would come and remove them. The next call was to our business insurance provider to put in a claim for spoilage, and the last was to our supplier to schedule an emergency order.
While doing all of this with my team, I couldn’t help but be reminded of how I have worked with teams in the past on major data issues. Whether they have impacted applications or reporting, the workflow is largely the same.
How to Triage Major Data Issues
Assessment
This typically consists of both identifying and reproducing the issue at hand – figuring out what failed so that a path forward can be identified. For example, this might mean pinpointing which portion of a report or which functionality within an application is affected. It often also involves identifying the impacted users so that they can be told that an issue is taking place and that the team is aware of it and working on a solution.
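To make the assessment step a little more concrete, here is a minimal sketch of how I might work out which reports and users are affected when a source table breaks. Everything in it is an assumption for illustration: the table names, report names, users, and the `assess_impact` helper are made up, and in practice this lineage would come from a data catalog or the reporting tool rather than hard-coded dictionaries.

```python
# Hypothetical lineage: which source tables each report reads from.
REPORT_SOURCES = {
    "daily_sales": {"orders", "products"},
    "inventory_levels": {"products", "stock"},
    "staff_hours": {"timesheets"},
}

# Hypothetical subscriptions: who relies on each report.
REPORT_USERS = {
    "daily_sales": {"ana", "raj"},
    "inventory_levels": {"raj", "mei"},
    "staff_hours": {"lee"},
}


def assess_impact(failed_tables):
    """Return (impacted reports, users to notify), both sorted by name."""
    failed = set(failed_tables)
    # A report is impacted if it reads from at least one failed table.
    reports = sorted(
        name for name, sources in REPORT_SOURCES.items() if sources & failed
    )
    # Collect every user of an impacted report, without duplicates.
    users = sorted(
        {user for name in reports for user in REPORT_USERS.get(name, set())}
    )
    return reports, users


reports, users = assess_impact(["products"])
print("Impacted reports:", reports)
print("Users to notify:", users)
```

Even a rough mapping like this answers the two assessment questions quickly: how far reaching is the damage, and who needs to hear about it.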
Repair
This is often my favorite part. Even though I can’t troubleshoot an industrial freezer, I can troubleshoot analytics code. Of course, sometimes experts are needed, whether for a failed pipeline with code I am unfamiliar with or for a performance issue that a database admin may be able to help with. Getting things back in working order is the top priority, and it often takes the knowledge of more than one person to repair the issue.
Remediation
Once things have been fixed, it is important to figure out whether any remnants of the issue are left over. Much like melted ice cream, commented-out code blocks, long-running sessions on the database, or bad data generated during the incident may still be out there, and cleaning all of this up is an essential step. It is also essential to inform users once repair and remediation are finished so that they can go back to using and trusting the app or reporting.
Though it could certainly be argued that these three steps apply in many scenarios, I have biases towards both data and ice cream, so I figured they would make good examples. Throughout each of these steps, clear communication and teamwork are key not only to fixing issues but also to keeping things running.
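As one example of remediation, here is a small sketch of clearing out bad data that was loaded during an incident window. It uses Python's built-in `sqlite3` module with an in-memory database so it can run anywhere. The `sales` table, its `loaded_at` column, the sample rows, and the `remove_bad_rows` helper are all assumptions made up for illustration; a real cleanup would depend on your own schema and would probably back up or quarantine the rows before deleting them.

```python
import sqlite3


def remove_bad_rows(conn, start, end):
    """Delete rows loaded in [start, end) and return how many were removed."""
    # The connection context manager commits on success and rolls back on error.
    with conn:
        cur = conn.execute(
            "DELETE FROM sales WHERE loaded_at >= ? AND loaded_at < ?",
            (start, end),
        )
    return cur.rowcount


# Hypothetical table and sample rows, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, loaded_at TEXT)"
)
conn.executemany(
    "INSERT INTO sales (amount, loaded_at) VALUES (?, ?)",
    [
        (12.50, "2024-08-01T09:00:00"),
        (-999.0, "2024-08-02T03:15:00"),  # loaded during the incident
        (-999.0, "2024-08-02T03:20:00"),  # loaded during the incident
        (8.75, "2024-08-03T10:30:00"),
    ],
)
conn.commit()

# ISO 8601 timestamps stored as text compare correctly as strings.
removed = remove_bad_rows(conn, "2024-08-02T00:00:00", "2024-08-03T00:00:00")
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(f"Removed {removed} bad rows, {remaining} rows remain")
```

Scoping the delete to the incident window keeps the cleanup from touching data that was loaded correctly before or after the failure.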
In summary, when issues happen, breaking them down into meaningful, manageable steps can help provide a clear path forward for you and your team.