Maintenance KPIs & Reliability

The Reliability Engineer's Workflow: From Failure Data to PM Intervals

For the reliability engineer, the loop is data → insight → interval change. Here's a practical workflow that turns failure history into reliability gains.

Rovaryn Digital | June 25, 2026 | 12 min read


The Data Is Sitting There — But Nobody Has Built the Loop Yet

Here is a scenario that plays out at small and mid-size manufacturing plants more often than it should. A motor fails for the third time in fourteen months. The maintenance manager calls it bad luck. The production supervisor calls it a parts problem. But you, the reliability engineer, pull the work orders and see it immediately: the lubrication interval on that asset is the same generic quarterly schedule it has been on since the plant opened, and every failure has occurred between weeks ten and twelve of the cycle.

The interval is wrong. The data said so twice before this failure happened. Nobody built the loop that would have caught it.

This is the reliability engineer's core challenge at an SMB plant: the failure data exists — somewhere in a spreadsheet, a stack of paper work orders, or a CMMS that logs completions but never surfaces trends — but there is no systematic workflow to convert that data into interval decisions. So the same wrong intervals run indefinitely, MTBF (mean time between failures) stays flat, and every third or fourth PM cycle ends in a reactive repair that costs several times more in parts, labor, and lost production than the planned task would have.

This article walks through a practical, repeatable reliability engineer workflow for SMB plants: how to collect and clean failure data, calculate MTBF and MTTR, rank assets by criticality, translate those inputs into defensible PM interval changes, and close the loop with compliance tracking. By the end, you will have a framework you can start applying to your highest-priority assets this week.

Step 1 — Collect and Clean Your Failure History

The reliability engineer workflow begins with data collection, and at most SMB plants, the honest first task is acknowledging that the data is messy.

Work orders may be incomplete. Failure modes may be recorded as "broke" or "wouldn't run" instead of the specific component and failure mechanism. Repair dates may be logged but start-of-failure dates may not. Some failures may never have generated a formal work order at all.

Start by pulling every work order, repair log, or maintenance note for your target asset class over the past 12–24 months. Your goal is to identify, for each failure event:

Date of failure (or the date the failure was first observed, if that is the best proxy)
Date returned to service (to calculate MTTR — mean time to repair)
Failure mode — what component failed and how (wear, fatigue, contamination, seal leak, electrical fault, etc.)
Whether the failure was caught during a PM inspection or reported by production as a breakdown

That last column is diagnostic gold. Failures detected by PM inspection are your PM program working. Failures reported by production are your PM program missing a window. Track the ratio.

Clean the data ruthlessly. Drop entries with no failure date. Flag entries where the failure mode is too vague to be useful — they are not junk, but you cannot use them for interval analysis yet. A maintenance history that is thorough and searchable is one of the highest-value assets a reliability engineer can build; see the case for asset maintenance history for a fuller treatment of why this record matters beyond your own analysis.
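The collection and cleaning rules above can be sketched in a few lines of Python. This is a minimal illustration, not a CMMS export format: the `FailureEvent` fields, the `VAGUE_MODES` list, and the sample records are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FailureEvent:
    """Hypothetical record layout; field names are illustrative only."""
    asset_id: str
    failed_on: Optional[date]    # failure date, or first-observed date as a proxy
    returned_on: Optional[date]  # date returned to service
    failure_mode: str            # component and mechanism, e.g. "bearing wear"
    detected_by: str             # "pm" or "production"


# Assumed list of descriptions too vague for interval analysis.
VAGUE_MODES = {"", "broke", "wouldn't run", "unknown"}


def clean(events):
    """Drop events with no failure date; flag vague failure modes for follow-up."""
    usable, flagged = [], []
    for event in events:
        if event.failed_on is None:
            continue  # no failure date: dropped
        if event.failure_mode.strip().lower() in VAGUE_MODES:
            flagged.append(event)  # kept, but not usable for interval analysis yet
        else:
            usable.append(event)
    return usable, flagged


def pm_detection_ratio(events):
    """Share of failures caught by PM inspection rather than reported by production."""
    if not events:
        return None
    caught = sum(1 for event in events if event.detected_by == "pm")
    return caught / len(events)


# Illustrative data, not real plant records.
raw = [
    FailureEvent("M-101", date(2025, 3, 10), date(2025, 3, 11), "bearing wear", "production"),
    FailureEvent("M-101", date(2025, 8, 2), date(2025, 8, 2), "bearing wear", "pm"),
    FailureEvent("M-101", None, date(2025, 9, 1), "seal leak", "production"),
    FailureEvent("M-101", date(2025, 11, 20), date(2025, 11, 22), "broke", "production"),
]
usable, flagged = clean(raw)
ratio = pm_detection_ratio(usable)
print(len(usable), len(flagged), ratio)
```

Keeping the flagged records in a separate list, rather than deleting them, mirrors the point above: they are not junk, they just need a better failure-mode description before they can feed an interval decision.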

Step 2 — Calculate MTBF and MTTR Per Asset

With clean failure records in hand, calculate MTBF and MTTR for each asset in scope. These two metrics anchor every interval decision you will make.

MTBF (mean time between failures) = total operating time ÷ number of failures in that period.

MTTR (mean time to repair) = total repair time ÷ number of repair events in that period.

For a practical walkthrough of both calculations with worked examples, see the MTBF and MTTR calculation guide.

A few notes on doing this honestly at an SMB plant:

Operating time, not calendar time. If an asset runs two 8-hour shifts, 5 days a week, its operating time is roughly 4,160 hours per year (2 × 8 × 5 × 52), not 8,760. Use actual run hours if you have them from hour meters or production logs; use a shift-based estimate if you do not, and document your assumption.
Confidence requires enough events. An MTBF calculated from two failures in 18 months is directionally useful but statistically thin. Be explicit about your sample size and treat the result as a best current estimate, not a certified figure. As you accumulate more failure events, the estimate tightens.
Separate failure modes. A pump that failed once from a seal leak and once from bearing wear has two MTBF values — one per failure mode — and those may require different interval responses. Lumping them produces an average that is the right answer to no specific question.
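Here is one way the two formulas and the three notes above might look in code. It is a sketch under stated assumptions: the shift pattern is the two-shift example from the text, the repair events are invented, and the helper names (`shift_based_hours`, `mtbf_hours`, `mttr_hours`) are hypothetical.

```python
from collections import defaultdict


def shift_based_hours(shifts_per_day, hours_per_shift, days_per_week, weeks=52):
    """Estimate annual operating hours when no hour-meter data exists."""
    return shifts_per_day * hours_per_shift * days_per_week * weeks


def mtbf_hours(operating_hours, failure_count):
    """MTBF = total operating time / number of failures in the period."""
    if failure_count == 0:
        return None  # no failures observed: MTBF is undefined, not infinite
    return operating_hours / failure_count


def mttr_hours(repair_hours):
    """MTTR = total repair time / number of repair events in the period."""
    if not repair_hours:
        return None
    return sum(repair_hours) / len(repair_hours)


annual_hours = shift_based_hours(2, 8, 5)  # two 8-hour shifts, 5 days a week

# (failure_mode, repair_hours) for one pump over one year; illustrative data only.
events = [("bearing wear", 10.0), ("seal leak", 6.0), ("bearing wear", 14.0)]

by_mode = defaultdict(list)
for mode, hours in events:
    by_mode[mode].append(hours)

# One MTBF/MTTR pair per failure mode, with the sample size kept alongside.
summary = {
    mode: {
        "n": len(hours),
        "mtbf_h": mtbf_hours(annual_hours, len(hours)),
        "mttr_h": mttr_hours(hours),
    }
    for mode, hours in by_mode.items()
}
print(annual_hours, summary)
```

Carrying `n` next to each figure keeps the sample size visible, so a value built on one or two events is read as a rough estimate rather than a settled number.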

One industry research summary reports that PM programs, when applied with realistic intervals, can improve MTBF by 50–75% and reduce MTTR by 30–50% compared to purely reactive maintenance (Re-Leased, industry research summary, 2025). Treat those ranges as a ceiling, not a guarantee; your results depend on how closely your current intervals match the actual failure distribution of your assets. The gap between your current MTBF and those benchmarks is your reliability improvement opportunity.

Step 3 — Rank Assets by Criticality Before You Optimize

Not every asset deserves the same depth of interval analysis. Before you invest engineering time in Weibull fitting or detailed failure-mode mapping, rank your asset population by criticality so you allocate effort where the consequence of failure is highest.

A practical criticality ranking combines three factors:

Consequence of failure — does this asset stop a production line, create a safety risk, or trigger a regulatory hold? Or is it a secondary system with a standby backup?
Frequency of failure — how often has it failed in the last 12–24 months?
Cost of failure — parts, labor, and production loss per event.

Assets that score high on all three are your A-class critical assets. These receive the tightest PM intervals, the most rigorous compliance standards (world-class PM compliance for critical A-class assets is ≥95%, per SMRP Best Practices cited via eWorkOrders, 2026), and the first investment of your reliability engineering time. B- and C-class assets get progressively lighter treatment.

For a structured approach to building this ranking, the asset criticality ranking guide covers the scoring methodology in detail. Getting criticality right before you optimize intervals prevents a common mistake: spending three days analyzing an asset that has a standby backup and a $200 repair cost, while the single-point-of-failure press with a $15,000 downtime cost per event runs on a factory-default interval nobody has reviewed.
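A simple scoring pass is often enough to get a first ranking on paper. The sketch below is one possible scheme, not the methodology from the ranking guide: the 1 to 5 scales, the multiplication of the three factors, and the A/B/C cutoffs (`a_min`, `b_min`) are all assumptions you would tune for your plant.

```python
from dataclasses import dataclass


@dataclass
class AssetScore:
    """Hypothetical 1-5 scores for the three criticality factors."""
    name: str
    consequence: int  # 1 = standby backup exists, 5 = stops the line or safety risk
    frequency: int    # 1 = rarely fails, 5 = fails often
    cost: int         # 1 = cheap repair, 5 = high parts, labor, and downtime cost

    @property
    def score(self) -> int:
        # Multiplying (an assumption) pushes assets that are high on all three to the top.
        return self.consequence * self.frequency * self.cost


def classify(score, a_min=60, b_min=20):
    """Map a score to an A/B/C class; the cutoffs are illustrative, not a standard."""
    if score >= a_min:
        return "A"
    if score >= b_min:
        return "B"
    return "C"


assets = [
    AssetScore("Sump pump", consequence=1, frequency=3, cost=1),
    AssetScore("Press 1", consequence=5, frequency=4, cost=5),
    AssetScore("Conveyor drive", consequence=3, frequency=3, cost=3),
]
ranked = sorted(assets, key=lambda a: a.score, reverse=True)
classes = {a.name: classify(a.score) for a in ranked}
print(classes)
```

Whatever scheme you use, the output that matters is the ordering: it tells you which assets get your analysis time first.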

Step 4 — Translate Failure Data Into Interval Decisions

This is the core of the reliability engineer workflow: taking MTBF, failure-mode analysis, and criticality ranking and converting them into a specific interval recommendation — tighter, longer, or the same.

The basic logic of interval tightening

If your current PM interval is 90 days but your MTBF for that failure mode is 75 days, you are scheduling inspections after the expected failure has already occurred. The interval needs to move inside the failure distribution — typically to 50–70% of MTBF for that mode, giving you a detection window before the failure peak.

If your MTBF is 240 days and your interval is 90 days, you may be over-maintaining. More frequent PM is not always better — it consumes technician time, increases wear from unnecessary disassembly, and can introduce new failure modes through reassembly errors. Lengthening the interval recovers wrench time for higher-value tasks.
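The two cases above reduce to a comparison between the current interval and a band at 50–70% of MTBF. A minimal sketch, assuming the failure mode is one that responds to time-based PM at all; the function name and return shape are hypothetical.

```python
def recommend_interval(mtbf_days, current_interval_days, low=0.5, high=0.7):
    """Compare the current PM interval with a band at 50-70% of MTBF.

    Returns (action, (band_low_days, band_high_days)). The band fractions come
    from the rule of thumb in the text; treat the result as a starting point.
    """
    band_low = round(mtbf_days * low, 1)
    band_high = round(mtbf_days * high, 1)
    if current_interval_days > band_high:
        action = "tighten"               # inspections land too late in the failure window
    elif current_interval_days < band_low:
        action = "consider lengthening"  # possible over-maintenance
    else:
        action = "keep"
    return action, (band_low, band_high)


# The two examples from the text: a 90-day interval against a 75-day and a 240-day MTBF.
late_case = recommend_interval(mtbf_days=75, current_interval_days=90)
early_case = recommend_interval(mtbf_days=240, current_interval_days=90)
print(late_case)
print(early_case)
```

The function only says which direction to move. The exact interval inside the band is still a judgment call that should weigh criticality and sample size.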

Failure mode mapping

For each significant failure mode on a critical asset, ask:

Is this failure age-related (wear-out, fatigue, corrosion)? Age-related failures respond well to time- or cycle-based PM intervals. MTBF approximates where failures for that mode cluster; set your interval to intercept the asset before it gets there.
Is this failure random (electrical transient, contamination event, operator error)? Random failures do not respond to time-based PM. The right response is condition monitoring (vibration, thermography, oil analysis) or operator-error root-cause work, not a shorter PM interval.
Is this failure infant-mortality (early-life failures after installation or overhaul)? These point to installation or commissioning issues, not interval problems.

This logic is the foundation of reliability-centered maintenance (RCM): matching the maintenance task type to the actual failure pattern. Full RCM analysis (evaluated against criteria in standards like SAE JA1011, and typically built on a failure mode, effects, and criticality analysis such as the procedure in MIL-STD-1629A) can be extensive, but the core question, "what is the failure pattern for this mode, and what task type addresses it?", is accessible without formal RCM certification and applies directly at the SMB plant level.
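If you do get as far as Weibull fitting, the shape parameter (commonly written β) is the usual shorthand for these three patterns: below 1 means a falling failure rate, about 1 means a roughly constant rate, and above 1 means a rising rate. The sketch below takes an already-fitted β as input and maps it to the responses described above. The ±0.1 tolerance around 1 and all names are assumptions for illustration.

```python
from enum import Enum


class Pattern(Enum):
    INFANT_MORTALITY = "infant mortality"
    RANDOM = "random"
    AGE_RELATED = "age-related"


# Responses paraphrased from the three questions in the text.
RESPONSE = {
    Pattern.AGE_RELATED: "time- or cycle-based PM, interval set ahead of expected failure",
    Pattern.RANDOM: "condition monitoring or root-cause work, not a shorter PM interval",
    Pattern.INFANT_MORTALITY: "review installation and commissioning, not the interval",
}


def pattern_from_shape(beta, tolerance=0.1):
    """Classify a failure pattern from a fitted Weibull shape parameter.

    The tolerance band around 1.0 is an assumed cutoff, not a standard value.
    """
    if beta <= 0:
        raise ValueError("Weibull shape parameter must be positive")
    if beta < 1 - tolerance:
        return Pattern.INFANT_MORTALITY  # failure rate falls with age
    if beta > 1 + tolerance:
        return Pattern.AGE_RELATED       # failure rate rises with age
    return Pattern.RANDOM                # failure rate roughly constant


for beta in (0.6, 1.0, 2.5):
    pattern = pattern_from_shape(beta)
    print(beta, pattern.value, "->", RESPONSE[pattern])
```

With only a handful of failures per mode, a fitted β is itself a rough estimate, so treat the classification as a prompt for discussion with the technicians rather than a verdict.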

The interval is not an arbitrary schedule. It is a hypothesis about failure timing — and your MTBF data is how you test whether that hypothesis is correct.

Documenting your rationale

Write down the reasoning behind every interval change, even briefly: the failure mode, the MTBF it is based on, the sample size, and the resulting interval. This documentation serves two purposes. First, it makes the interval defensible to a maintenance manager or plant engineer who asks why the schedule changed. Second, it gives you a starting point for the next review cycle — you are not starting from scratch, you are updating a hypothesis with new data.
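The rationale does not need a special tool; a small structured record is enough. A minimal sketch, assuming the four items listed above plus the old interval and a decision date. The `IntervalChange` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IntervalChange:
    """One documented interval decision; fields mirror the items listed in the text."""
    asset_id: str
    failure_mode: str
    mtbf_days: float
    sample_size: int        # number of failures the MTBF is based on
    old_interval_days: int
    new_interval_days: int
    decided_on: date

    def summary(self) -> str:
        return (
            f"{self.asset_id} / {self.failure_mode}: "
            f"{self.old_interval_days}d -> {self.new_interval_days}d "
            f"(MTBF {self.mtbf_days:g}d from {self.sample_size} failures, "
            f"decided {self.decided_on.isoformat()})"
        )


# Illustrative values only.
note = IntervalChange("M-101", "bearing wear", 75, 3, 90, 45, date(2026, 6, 25))
print(note.summary())
```

At the next review you compare the new failure data against this record instead of reconstructing why the interval was set.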

Step 5 — Build the Interval Into the PM Schedule (Planning First)

An interval decision that lives in a notebook or a spreadsheet cell is not yet doing any work. It needs to become a scheduled PM task with a specific due date, an assigned technician, a checklist of inspection steps, and a mechanism that generates the next task automatically when the current one closes.

This is where the planning-first approach matters. The work-order-first model — where individual technicians or supervisors create work orders reactively as problems surface — cannot operationalize an interval change. The reliability engineer changes the interval; nobody updates the schedule; the old cadence continues. Planning-first means the PM schedule itself is the primary artifact: intervals are set at the planning level, the work-order queue flows from those intervals, and changing the interval immediately changes every future occurrence.

For a full treatment of building and managing a PM schedule from the planning level down, see the preventive maintenance planning guide and the features overview for how Maintenance Planning Manager structures this workflow.

Key checklist elements to attach to each PM task when you update the interval:

The specific inspection steps and measurements for this failure mode (not just "check motor")
Any condition triggers that should escalate to a work order during the PM (abnormal vibration, temperature, wear measurement outside tolerance)
Parts and tools required so the technician is not hunting during the task window
Estimated task duration (feeds workload planning)
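The planning-first idea, where future tasks are derived from the interval rather than created by hand, can be illustrated with a short generator. This is a sketch of the concept only, not how Maintenance Planning Manager or any other tool implements it. It assumes a fixed cadence from a start date; a system that schedules from the completion date of the last task would differ.

```python
from datetime import date, timedelta
from itertools import islice


def due_dates(start, interval_days):
    """Yield successive PM due dates derived from a single interval setting."""
    current = start
    while True:
        current += timedelta(days=interval_days)
        yield current


start = date(2026, 1, 5)  # illustrative start date

# Same asset, before and after an interval change from 90 to 45 days.
old_schedule = list(islice(due_dates(start, 90), 3))
new_schedule = list(islice(due_dates(start, 45), 3))
print(old_schedule)
print(new_schedule)
```

Because every occurrence comes from `interval_days`, changing that one value changes the whole future schedule, which is the behavior a work-order-first process cannot give you.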

Step 6 — Track PM Compliance and Close the Loop

Updating an interval achieves nothing if the PM is not completed on schedule. PM compliance % — completed PMs ÷ scheduled PMs, expressed as a percentage — is the metric that tells you whether the plan is executing.

World-class PM compliance is ≥90% overall, with ≥95% for critical A-class assets. Below 80% is generally considered not functioning effectively (SMRP Best Practices, cited via eWorkOrders, 2026). The 10% timeliness rule — PMs completed within 10% of their interval, so a monthly PM completed within roughly 3 days of its due date — gives you a tighter standard for assets where timing precision matters to the interval's effectiveness (eMaint / Fluke Reliability, 2026).
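Both measures are simple arithmetic. The sketch below shows the compliance percentage and one reading of the 10% timeliness rule, in which a PM counts as on time if it is completed within 10% of its interval on either side of the due date. The function names and the symmetric window are assumptions.

```python
from datetime import date


def pm_compliance_pct(completed, scheduled):
    """PM compliance % = completed PMs / scheduled PMs, as a percentage."""
    if scheduled == 0:
        return None
    return 100.0 * completed / scheduled


def on_time(due, done, interval_days, tolerance=0.10):
    """True if the PM was completed within tolerance * interval of its due date."""
    window_days = interval_days * tolerance
    return abs((done - due).days) <= window_days


overall = pm_compliance_pct(completed=44, scheduled=50)

# A monthly PM (30-day interval) has a window of about 3 days.
due = date(2026, 6, 15)
two_days_late = on_time(due, date(2026, 6, 17), interval_days=30)
five_days_late = on_time(due, date(2026, 6, 20), interval_days=30)
print(overall, two_days_late, five_days_late)
```

A PM can count as completed for the compliance percentage and still fail the timeliness check, so it is worth reporting the two numbers separately.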

Track compliance at the asset level, not just the aggregate. An overall compliance rate of 88% can mask a critical press running at 60% because its tasks fall at shift-end and get deferred. Asset-level compliance reveals the operational gaps the aggregate hides.
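To see how an aggregate hides an asset-level problem, consider the sketch below. The counts are invented so that the overall figure comes out at 88% while one press sits at 60%; the 80% cutoff is the "not functioning effectively" threshold cited above.

```python
# (completed, scheduled) PM counts per asset; illustrative numbers only.
records = {
    "Press 1": (6, 10),
    "Compressor": (40, 45),
    "Conveyor": (42, 45),
}

total_completed = sum(done for done, _ in records.values())
total_scheduled = sum(sched for _, sched in records.values())
overall_pct = 100.0 * total_completed / total_scheduled

per_asset_pct = {
    name: 100.0 * done / sched
    for name, (done, sched) in records.items()
    if sched > 0  # skip assets with nothing scheduled in the period
}

# Assets under the 80% threshold that the aggregate figure hides.
below_threshold = [name for name, pct in per_asset_pct.items() if pct < 80.0]
print(overall_pct, per_asset_pct, below_threshold)
```

The aggregate is still worth tracking, but the list of assets below threshold is what tells you where to intervene.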

A quarterly reliability review cycle works well at most SMB plants: pull MTBF trends, check compliance rates, and identify assets where failure frequency has changed. If a previously stable asset is trending toward more frequent failures, that is an early signal to tighten its interval before the next incident, not after.
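One lightweight way to run that check is to compare each asset's latest quarterly MTBF with the previous quarter and flag meaningful drops. This is a sketch only: the 20% drop threshold, the function name, and the sample history are assumptions, and a quarter-over-quarter comparison on few failures will be noisy.

```python
def flag_declining(mtbf_by_quarter, drop_threshold=0.20):
    """Return assets whose latest MTBF fell by at least drop_threshold vs. the prior quarter."""
    flagged = []
    for asset, series in mtbf_by_quarter.items():
        if len(series) < 2 or series[-2] <= 0:
            continue  # not enough history to compare
        change = (series[-1] - series[-2]) / series[-2]
        if change <= -drop_threshold:
            flagged.append(asset)
    return flagged


# MTBF in operating hours per quarter, oldest first; illustrative values only.
history = {
    "Press 1": [900.0, 880.0, 650.0],
    "Compressor": [1200.0, 1250.0, 1230.0],
}
to_review = flag_declining(history)
print(to_review)
```

A flagged asset is a prompt to look at the failure modes behind the drop, not an automatic instruction to shorten the interval.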

For strategies specifically focused on improving MTBF over time, see improve MTBF strategies. For the full KPI framework that connects PM compliance, MTBF, and MTTR into a coherent measurement system, see the MTBF and MTTR calculation guide.

The loop is: failure data → MTBF → criticality ranking → interval decision → PM schedule → compliance tracking → new failure data. Each cycle through the loop should tighten your intervals toward the actual failure distribution of your assets, and your MTBF trend is the measure of whether it is working.

Putting the Workflow Into Practice

The reliability engineer workflow described here — collect and clean failure data, calculate MTBF and MTTR, rank by criticality, translate data into interval decisions, build the interval into a planning-first schedule, and track compliance — is not a one-time project. It is a repeating cycle, and the cadence matters as much as the rigor of any single step.

At most SMB plants, the limiting factor is not analytical capability. It is the infrastructure for doing the loop consistently: a maintenance history that is complete enough to support analysis, a PM schedule that is structured enough to operationalize interval changes, and a compliance dashboard that surfaces problems at the asset level rather than burying them in an aggregate number.

If your current system — whether a spreadsheet, a disconnected CMMS, or a blank-canvas tool with no starting intervals — is the bottleneck, Maintenance Planning Manager is built to remove it. The platform is structured planning-first: you set the intervals, the work-order queue generates from those intervals, and the KPI dashboard tracks compliance and overdue counts in real time. Flat-fee pricing means adding a technician to the system does not add to your monthly invoice.

Try it free for 14 days — no credit card required. Start with your five highest-criticality assets, run the MTBF calculation, and see what a planning-first schedule looks like when the interval is yours to set. Start your free trial and bring the reliability engineer workflow off the whiteboard and into a system that runs it for you.


