How to get your engineering teams ready for launch
How we built a framework for measuring performance, tracking errors, and responding to incidents
In the run-up to a launch, working at a startup involves putting out one fire after another.
You squash a bug when a user reports it. Speed up the product when someone complains. Roll out a feature when there’s a definite demand.
It’s so important to keep the momentum and build what the users want.
But once you hit the sweet spot of product-market fit, you can no longer put off crossing the metaphorical bridge until you come to it.
Beta testing is over. Users like and want your product. Now it’s time to level up. That means it’s time for proactive action, not just reaction.
Of course, it’s impossible to get everything right in the first rollout. But being on the back foot is worse than making a slightly flawed first impression.
To ready ourselves for the launch, we at OSlash launched a mission—internally named Project Apollo.
🚀 Project Apollo
“One small step for OSlash, a giant leap for productivity”
We wanted to make sure that when we open the gates to the public, we avoid nasty surprises, and that nobody dies. Of course, that’s asking a lot but a team can try.
Our objectives were clear –
- Build a baseline for our current systems’ performance
- Prepare for widening service limits
- Build an observability platform
- Design a process for incident management & identifying SLOs
To meet the objectives, we divided the engineering team into three groups picked out at random.
Note: The teams worked towards the success of the mission alongside their everyday tasks. It was not an easy ask, yet they couldn’t have done it better.
Squad 1 — Performance Squad
The objective of this group was to answer the question “How many users can our platform serve today?”. With this answer, we gained two new abilities — identifying performance degradations and capacity planning for expected traffic surges (like, well, a product launch!).
For most engineering systems, these are the usual metrics for quantifying performance.
- 🛎 Availability
Availability is a measure of how much of the time the system is available to respond to requests. This is usually expressed as the percentage of uptime in a given year.
- ⏱ Latency
Latency is a measure of how long it takes for a request to get a response. This is often measured between different layers (for example, API gateway ↔ Lambda), usually expressed in 95th or 99th percentiles.
- 🚛 Throughput
Throughput is a measure of how many requests can be handled. This is usually expressed in rps (requests per second).
- 🪖 Durability
Durability is a measure of how reliably the data we stored is still there when we look for it. This is usually a function of the database we use. For example, see DynamoDB’s durability promise.
- ✅ Correctness
Correctness is a measure of the system always returning the right result we’re looking for.
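To make the availability and throughput definitions above concrete, here is a minimal sketch of the arithmetic behind them. The function names and the sample targets are illustrative assumptions, not figures from our dashboard.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 365) -> float:
    """Minutes of downtime a given availability target permits over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def throughput_rps(request_count: int, window_seconds: float) -> float:
    """Average requests per second handled over a measurement window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return request_count / window_seconds


if __name__ == "__main__":
    # Hypothetical availability targets, for illustration only.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
    print(f"1200 requests in 60 s is {throughput_rps(1200, 60):.1f} rps")
```

Turning a percentage into minutes per year makes it easier to judge whether a target is realistic for a small team.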
To get these numbers straight to our dashboard, the performance squad decided to measure different aspects of performance by breaking it down into parts.
Part I: Measure how fast the current system is
To do that, the squad employed Firebase Performance Monitoring, which tells us exactly how our users are experiencing our product. It reports –
- Load Time
- Wait Time
- Regional Delays
- Performance on Wi-Fi, 4G networks, etc.
The squad complemented Firebase with Sentry Performance Tracing to delve further into the exact user experience. Sentry reports the amount of time spent in different parts of our product.
Let’s say a user requests the shortcut o/roadmap. Sentry reports –
- How much time it took for the server to find it
- How much time it took for the network to deliver it
- How much time it took for the browser to perform the redirect on the client side
Part II: Identify the bottlenecks
We split every touchpoint in the product into smaller chunks, known as transactions, so that we could easily calculate how long each one takes to complete. Any transaction that takes more than 1.2 seconds to complete leads to user misery. Identifying such transactions gave us a fine-grained analysis of all the potential bottlenecks.
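The 1.2-second cut-off above can be applied to raw transaction timings with very little code. The sketch below is a simplified stand-in and not Sentry’s own User Misery formula; the `Transaction` record, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MISERY_THRESHOLD_SECONDS = 1.2  # cut-off quoted in the text above


@dataclass
class Transaction:
    name: str                # hypothetical transaction name, e.g. "resolve-shortcut"
    user_id: str
    duration_seconds: float


def slow_transaction_names(transactions: List[Transaction],
                           threshold: float = MISERY_THRESHOLD_SECONDS) -> List[str]:
    """Names of transactions that exceeded the threshold at least once."""
    return sorted({t.name for t in transactions if t.duration_seconds > threshold})


def miserable_user_share(transactions: List[Transaction],
                         threshold: float = MISERY_THRESHOLD_SECONDS) -> float:
    """Fraction of distinct users who hit at least one over-threshold transaction."""
    users = {t.user_id for t in transactions}
    if not users:
        return 0.0
    miserable = {t.user_id for t in transactions if t.duration_seconds > threshold}
    return len(miserable) / len(users)


if __name__ == "__main__":
    sample = [
        Transaction("resolve-shortcut", "u1", 0.20),
        Transaction("resolve-shortcut", "u2", 1.50),
        Transaction("create-shortcut", "u3", 0.90),
        Transaction("create-shortcut", "u4", 0.40),
    ]
    print(slow_transaction_names(sample))
    print(miserable_user_share(sample))
```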
The bottlenecks were identified using Sentry’s User Misery score, which reports the following:
- Transactions per minute: how many times a given operation occurs in a minute
- Latency: measured in P50, P95, and P99
- P50: 50% of the operations finish within this time (the median speed, not the average)
- P95: 95% of the operations finish within this time; only the slowest 5% take longer (near-worst speed)
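To show how the percentile figures above are derived from raw timings, here is a minimal sketch using the nearest-rank method. Monitoring tools may interpolate differently, so treat the exact method as an assumption; the sample durations are made up.

```python
from typing import List


def percentile(durations_ms: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value such that at least p% of samples are <= it."""
    if not durations_ms:
        raise ValueError("durations_ms must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(durations_ms)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical timings: four fast requests and one very slow outlier.
    durations = [100, 110, 120, 130, 5000]
    mean = sum(durations) / len(durations)
    print("mean:", mean)
    print("P50:", percentile(durations, 50))
    print("P95:", percentile(durations, 95))
```

In this sample a single outlier drags the mean far above what most requests experience, while P50 stays close to the typical request and P95 exposes the slow tail. That is why latency is usually tracked in percentiles and not averages.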
Part III: Keep the eyes on the speedometer
Another tool that we set up to monitor all our Lambda functions, including the time taken to complete a transaction, is Lumigo.
By tagging every prod release appropriately, we can compare performance across releases and make sure that it doesn’t degrade over time.
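Once timings are tagged by release, a regression check can be as simple as comparing per-transaction P95 values between two releases. The sketch below assumes the P95 figures have already been exported into plain dictionaries; it does not use Lumigo’s API, and the transaction names and the 10% tolerance are hypothetical.

```python
from typing import Dict


def regressed_transactions(baseline_p95_ms: Dict[str, float],
                           candidate_p95_ms: Dict[str, float],
                           tolerance: float = 0.10) -> Dict[str, float]:
    """Map each regressed transaction to its relative P95 increase.

    A transaction counts as regressed when its P95 grew by more than
    `tolerance` (0.10 means 10%) between the baseline and candidate release.
    """
    regressions = {}
    for name, old in baseline_p95_ms.items():
        new = candidate_p95_ms.get(name)
        if new is None or old <= 0:
            continue  # nothing to compare against
        change = (new - old) / old
        if change > tolerance:
            regressions[name] = change
    return regressions


if __name__ == "__main__":
    previous_release = {"resolve-shortcut": 40.0, "create-shortcut": 100.0}
    current_release = {"resolve-shortcut": 60.0, "create-shortcut": 105.0}
    print(regressed_transactions(previous_release, current_release))
```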
Part IV: Ensure preparedness for the incoming traffic
Once your product is launched, in the best-case scenario, you should expect a sudden spike in traffic. To be sure we don’t falter at this crucial juncture, the squad helped us answer two questions:
- How many simultaneous users can our product handle? More users mean more requests to our servers, which can eventually lead to HTTP 429 (Too Many Requests) errors
- How fast will our servers be when the number of users is high? A transaction that takes 40 ms under regular conditions might take 400 ms in high-traffic scenarios. This meant that we needed to figure out acceptable thresholds for the essential transactions under load
Load Testing with Vegeta
Vegeta is an HTTP load-testing tool that allows teams to simulate heavy traffic. If you are gearing up for a big launch or PR push, it is highly recommended to simulate every condition you can to see that the product does not break anywhere.
With Vegeta, we were able to figure out how much time the most frequent transactions took in peak traffic conditions.
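Whatever tool generates the traffic, the two questions above boil down to counting throttled responses and checking latency against a budget. The following sketch summarizes a list of load-test results; the `RequestResult` shape is a hypothetical simplification and not Vegeta’s output format, and the 100 ms budget is an illustrative assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestResult:
    status_code: int
    latency_ms: float


@dataclass
class LoadSummary:
    total: int
    throttled_share: float     # fraction of responses that were HTTP 429
    slowest_success_ms: float  # slowest 2xx response observed
    within_budget: bool        # no throttling and no success slower than the budget


def summarize(results: List[RequestResult], latency_budget_ms: float) -> LoadSummary:
    """Summarize one load-test run against a latency budget."""
    if not results:
        raise ValueError("results must not be empty")
    throttled = sum(1 for r in results if r.status_code == 429)
    success_latencies = [r.latency_ms for r in results if 200 <= r.status_code < 300]
    slowest = max(success_latencies, default=0.0)
    return LoadSummary(
        total=len(results),
        throttled_share=throttled / len(results),
        slowest_success_ms=slowest,
        within_budget=throttled == 0 and slowest <= latency_budget_ms,
    )


if __name__ == "__main__":
    run = [
        RequestResult(200, 40.0),
        RequestResult(200, 45.0),
        RequestResult(200, 400.0),  # the 40 ms -> 400 ms slowdown described above
        RequestResult(429, 5.0),    # throttled request
    ]
    print(summarize(run, latency_budget_ms=100.0))
```

Repeating the run at increasing request rates and noting where `within_budget` first turns false gives a rough estimate of how much traffic the system can take.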
Part V: Fix the low-hanging fruit
With help from Sentry and the data obtained from Vegeta, the squad enabled the engineering team to immediately fix the issues whose resolution made the product noticeably faster for all users.
Also, by linking all issues in Linear to Sentry, we were able to ensure that the fixed issues don’t resurface in production.
Squad 2 — Observability Squad
Observability is the ability to quickly find out anything you want to know about a system. In practice, it is a collection of systems such as error trackers, uptime monitors, log aggregators, and tracers that together give a bird’s-eye view of all system components at any given point in time. Observability is also different from monitoring: monitoring tells you when errors that you already know of (but haven’t fixed yet) happen again; observability, on the other hand, can give you more real-time information and help you predict faults.
We wanted to ensure that the observability system is our single source of truth for all platform events (errors, alerts, system & app metrics) and that it helps us quickly jump to a particular flow or transaction to help debug and fix issues.
Basic building blocks
- 🐞 Error alerting
We’ve been using Sentry for tracking all errors. How can we better leverage Sentry’s capabilities?
- Enable and set up source maps so errors are readable.
- Use appropriate levels to mark the severity of each issue.
- Identify users so we can spot deeper patterns.
- Set up Slack alerts for high-severity issues.
- 🔄 Tracing
Tracing means generating a unique ID that can be used to identify a transaction across systems. For example, when a user initiates an action, the initial request that is sent to the backend should carry a unique ID (called a trace ID) that is propagated across all discrete systems. This way, if we search for, say, trace ID req-abc-123, we should be able to follow it from the extension to the backend, back to the extension, and finally to the user. Tools & platforms (search term: “distributed tracing”) include Jaeger, Honeycomb, AWS X-Ray and OpenTracing. Sentry errors can also be enriched with this tracing metadata.
- 📡 Status page
Set up a status page (it can be private too) so we can see at a glance whether all systems are operational.
- 📊 Dashboard
A company-wide real-time dashboard with the most important metrics (like total installs, total active users, daily user signups, total number of Shortcuts created, overall system throughput, error rate, number of extension installs, etc.) is incredibly useful for a number of reasons:
- Throw it on a giant display in the center of the office and everyone gathers around to watch and celebrate big milestones (like, say, the 1000th user).
- Everyone knows what the most important goals for the company are.
- Easier to notice anomalies and take corrective action.
Product and engineering teams came together to identify important metrics for both teams and build an interactive dashboard together.
To get the dashboard built quickly, the Observability Squad used Retool.
With Retool, we’ve unearthed some great insights into the product. Each time a crucial number sees a sudden spike or an expected uptick, our hearts collectively skip a beat.
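Returning to the tracing building block described earlier in this section, the core idea of propagating one trace ID across systems can be sketched in a few lines. This is an illustration only and not how any particular tracing tool works; the header name `X-Trace-Id`, the `req-` prefix, the service names, and the in-memory log are all hypothetical.

```python
import uuid
from typing import Dict, List

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name


def ensure_trace_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers that carries a trace ID, reusing an incoming one."""
    outgoing = dict(headers)
    outgoing.setdefault(TRACE_HEADER, f"req-{uuid.uuid4().hex}")
    return outgoing


def log_event(log: List[dict], service: str, headers: Dict[str, str], message: str) -> None:
    """Record an event tagged with the trace ID so it can be found later."""
    log.append({"trace_id": headers[TRACE_HEADER], "service": service, "message": message})


def handle_in_backend(log: List[dict], headers: Dict[str, str]) -> str:
    headers = ensure_trace_id(headers)  # keeps the ID set by the caller
    log_event(log, "backend", headers, "shortcut resolved")
    return headers[TRACE_HEADER]


def handle_in_extension(log: List[dict]) -> str:
    headers = ensure_trace_id({})  # first hop: a new ID is created here
    log_event(log, "extension", headers, "user requested o/roadmap")
    return handle_in_backend(log, headers)


def events_for(log: List[dict], trace_id: str) -> List[dict]:
    """All events, across services, that belong to one trace."""
    return [event for event in log if event["trace_id"] == trace_id]


if __name__ == "__main__":
    log: List[dict] = []
    trace_id = handle_in_extension(log)
    for event in events_for(log, trace_id):
        print(event["service"], "-", event["message"])
```

Because every hop reuses the incoming ID instead of minting a new one, a single search by trace ID returns the whole journey of a request.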
Squad 3 — Incident management & preparedness Squad
Incident management is a set of definitions and rules that answer the following questions —
- What exactly happens when an incident (could be an error, a security breach, or downtime) occurs?
- Who or which team is the first responder?
- Who takes responsibility for coordinating various teams in responding to the incident?
- How long can a bug take to fix before it is escalated?
- Who will respond to incidents on non-working days?
- How much is reasonable pay for on-call duty?
- What is the template for a responsible disclosure or an RCA?
To make sure there is a process in place that can answer every question, the Incident Squad created playbooks that followed a carefully laid-out set of steps.
Step I: Categorization & Severity of issues
Step II: Service-Level Agreements according to SOC 2
- Severity critical: 3 business days
- Severity high: 30 days
- Severity moderate: 60 days
- Severity low: 180 days
Step III: Define SLAs according to OSlash Security
- Critical (S0): Within 1 week of being reported
- High (S1): Within 1-2 weeks of being reported
- Medium (S2): Within 2-3 weeks of being reported
- Low (S3): Within 3-4 weeks of being reported
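Severity windows like the S0 to S3 list above are easy to turn into concrete deadlines. The sketch below assumes the upper bound of each range is the hard deadline (for example, two weeks for S1), which is an interpretation for illustration and not a rule stated in our playbook; the function names and sample dates are hypothetical.

```python
from datetime import datetime, timedelta

# Upper bound of each window from the S0-S3 list above (an assumption).
FIX_WINDOW = {
    "S0": timedelta(weeks=1),
    "S1": timedelta(weeks=2),
    "S2": timedelta(weeks=3),
    "S3": timedelta(weeks=4),
}


def fix_deadline(severity: str, reported_at: datetime) -> datetime:
    """Latest time by which an issue of this severity should be fixed."""
    if severity not in FIX_WINDOW:
        raise ValueError(f"unknown severity: {severity}")
    return reported_at + FIX_WINDOW[severity]


def is_overdue(severity: str, reported_at: datetime, now: datetime) -> bool:
    """True when the fix window has elapsed, which is the cue to escalate."""
    return now > fix_deadline(severity, reported_at)


if __name__ == "__main__":
    reported = datetime(2022, 1, 3, 9, 0)
    print("S0 deadline:", fix_deadline("S0", reported))
    print("S1 overdue on 18 Jan:", is_overdue("S1", reported, datetime(2022, 1, 18, 9, 0)))
```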
To make the whole process seamless, the incident squad ended up trying out a bunch of tools for issue monitoring such as PagerDuty, Opsgenie, VictorOps, and incident.io.
In our experience, incident.io ticked all the boxes we were looking for.
After classifying all issues by level of severity, the incident squad went on to describe how all issues will be communicated to the users, who will stay on call, and how that person will be monetarily compensated for the extra hours put in.
If you are looking at building your own version of Project Apollo, here are a few key tips that might prove helpful:
- Document everything. Don’t worry about where to put the documents or how they should be laid out. Just document everything.
- Checkpoint daily. Catch up daily in squad standups, and feel free to go as deep and for as long as you want. Keep your eyes on the prize and make sure you are achieving your squad goals on a day-to-day basis.
- Prefer no-code and managed solutions. Prefer tools that are already battle-tested over building something on your own.
- Accommodate existing priorities. Keep your everyday tasks at a higher priority and work on the project for a couple of hours daily.
- Find consensus with help from senior folks. If the squad members cannot come to an agreement, consult with the senior folks; they’ll help by asking more questions, inviting more discussion, and helping find common ground between what’s best for the team and the company.
- Think critically. Ask questions like “should we even do this?”. A squad coming to a “we don’t need this at all” conclusion is still okay if they can back it up with data.
The entire mission took us two weeks to complete. In hindsight, to lessen the burden on the already stressed engineering team, we could have earmarked a couple more weeks for the activity.
We hope you found value in our experience. We wish to find you next to us as we travel...