March 14, 2024

Only the Paranoid Survive

Sift’s customers operate their hardware in the most challenging and remote environments on (and off) Earth. When designing machines for the future, the potential for failure is high — and these failures can be costly.

Before deploying their vehicles, engineers must catch and remediate every possible source of risk. The processes they use for developing and building their hardware must be foolproof, eliminating the possibility of human error and flagging anomalies with speed and accuracy. Otherwise, the potential for disaster is great.

As you can imagine, engineers who are trying to build machines without these kinds of systems in place regularly lose sleep. They’re worrying about potential risks, stressed that a tiny mistake they can’t find will destroy the entire mission.

If this feels all too familiar, you might want to upgrade your telemetry stack.

Taking on unnecessary risk

Are there anomalies in your system right now that you can’t see because they went unnoticed in your test data?

Far too often, postmortem investigations discover hidden risks only after an anomaly manifests in production. It’s the beauty of hindsight, but that doesn’t help you avoid exposure in the first place. Right now, you might be taking on risk you don’t need, simply because you’re using outdated tools that can’t manage the mountain of continuously streaming data from your system.

This inability to find anomalies in your test data leads to an overall lack of visibility and a low signal-to-noise ratio, which makes you far more likely to miss something that could lead to failure later on.

Space Shuttle Challenger explosion, 1986

Tiny mistakes can have big consequences

When designing complex machines, the smallest of errors can become quite costly. For example, the failure of a small O-ring seal in cold weather infamously led to the loss of the Space Shuttle Challenger. Tiny software bugs can have the same effect, and these scenarios continue to occur nearly 40 years later.

Boeing’s failed Starliner flight test in 2019 was caused by a single incorrect parameter. Without high-fidelity, end-to-end tests to verify the communication between booster and capsule during the mission, engineers couldn’t find this vulnerability until it was too late.

Similarly, the HAKUTO-R M1 lander crashed onto the lunar surface just last year. The root cause? A software bug that didn’t trust the (accurate) data coming from a laser rangefinder when it flew over a crater.

The risks start at a foundational level

How do we end up with these errors? When you lack adequate visibility into your machine, several factors compound your risk and make you more likely to miss a frozen O-ring, an incorrect parameter, or a fault-tolerance bug.

In order to avoid these mistakes, you have an important decision to make. Without proper tools, you can either choose to allow more risk into your system, or waste your engineers' time reviewing data with a low signal-to-noise ratio.

Sift rejects this premise. Modern visualization and analysis tools shouldn’t require you to choose. You can buy down risk and reduce the time your team spends on review. You can have the best of everything and sacrifice nothing.

Key engineer risk

Some companies – especially at an early stage – try to solve this problem by relying on engineers with specific subject matter expertise to reduce risk. But if you’re using legacy tools, it’s likely you only have a few key engineers who possess the specialized knowledge to even understand your telemetry dashboard.

By relying on specific people to operate your hardware, you end up taking on new risks. The most obvious one, staff attrition, creates vacancies in positions of expertise and leaves the areas those experts covered without an owner.

The less obvious risk — reduced internal mobility — will prevent you from shifting some of your best engineers from mature products to developmental efforts. Your top talent wants to keep building new hardware, not maintain operational systems. This kind of monotony will cause your team to go looking for the next exciting opportunity at a different company.

Sift empowers you to avoid these types of risks altogether by allowing engineers to add their specific knowledge to the system. Not only is that knowledge captured, it’s easy to access and searchable for everyone to reference. When new engineers encounter anomalies, they can leverage the past experience of the teams that came before them.

SpaceX mission control

Risks involved with skipping data reviews

Creating new software releases can also introduce risk to your production hardware. Companies with an observability problem acutely feel the tradeoff between creating frequent releases and burning out their engineers with tedious data review.

If you create frequent releases, your team will spend a high proportion of its time manually reviewing data with a low signal-to-noise ratio. As fatigue sets in, bugs will be missed. However, if you decide to create fewer releases, each one will encompass more and more software changes. With a long list of changes, any bugs you do find will be difficult to root-cause because of the ballooning number of updates.

Either way, the same level of risk remains. Tedious manual review is not sustainable. Engineers need automation to ensure that they are only reviewing data that requires their expertise.
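
As a rough illustration of what this kind of automation can look like, the minimal Python sketch below applies simple range rules to telemetry samples and surfaces only the violations for human review. The channel names, limits, and the `Sample`, `Rule`, and `review` helpers are hypothetical and chosen for illustration; this is not Sift's API or a description of how Sift works internally.

```python
from dataclasses import dataclass
from typing import Iterable, List, NamedTuple


class Sample(NamedTuple):
    """One telemetry reading (hypothetical schema)."""
    channel: str
    timestamp_s: float
    value: float


@dataclass(frozen=True)
class Rule:
    """Expected operating range for one channel (illustrative limits only)."""
    channel: str
    low: float
    high: float

    def violated_by(self, sample: Sample) -> bool:
        # A sample violates the rule if it is on this channel and out of range.
        in_range = self.low <= sample.value <= self.high
        return sample.channel == self.channel and not in_range


def review(samples: Iterable[Sample], rules: List[Rule]) -> List[Sample]:
    """Return only the samples that break at least one rule."""
    return [s for s in samples if any(r.violated_by(s) for r in rules)]


# Assumed channels and limits, made up for this example.
rules = [
    Rule("tank_pressure_kpa", low=150.0, high=450.0),
    Rule("battery_temp_c", low=-10.0, high=45.0),
]

samples = [
    Sample("tank_pressure_kpa", 0.0, 300.0),
    Sample("tank_pressure_kpa", 1.0, 475.0),  # above the assumed upper limit
    Sample("battery_temp_c", 1.0, 21.5),
    Sample("battery_temp_c", 2.0, -14.0),     # below the assumed lower limit
]

flagged = review(samples, rules)
for s in flagged:
    print(f"{s.channel} @ {s.timestamp_s:.1f}s = {s.value}")
```

In this toy data set, two of the four samples are flagged, so a reviewer would look at those two readings rather than the whole stream. Real review rules are usually richer than fixed ranges (rates of change, cross-channel conditions, stateful checks), but the principle of filtering before a human looks is the same.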

Risk of blowing up

Not every mistake manifests as a disaster on the same scale as the Space Shuttle Challenger or the M1 lander, but smaller mistakes can still result in gut-wrenching failures. The Starliner mission failure impeded the Boeing team’s progress, wasting time and money. Don’t wait until you’ve invested months and years into a project, only to lose your machine due to missed errors.

Sift automates data review and removes humans from the loop — making the review process easier, more objective, and more thorough. No more hours upon hours staring at tedious lines of code, and no more sleepless nights.

With greater visibility and a streamlined workflow, you’ll unearth issues earlier, and your operational hardware will experience fewer in-flight anomalies. Sift mitigates your exposure to risk without slowing down your processes — its automated review tools will actually allow your team to get to market faster.

You’ll also reduce the likelihood of incurring a Rapid Unscheduled Disassembly (RUD) due to stale information. Sift conducts ongoing tests that continually validate your source of truth, mitigating the chances of something going wrong.

Let Sift help with your paranoia and allow your team to focus on core engineering challenges – which is what they’d rather be doing.

Learn more about mitigating risk with Sift >>>