In Part 1 of this series, we delved into the importance of understanding your end customer and the risks associated with diverse use. We touched on avenues to shift left and design with reliability in mind. In this post, we will cover the phases where companies traditionally allocate the most resources: the test-and-iterate phases of development. By the end of the post, you will have a set of steps for launching a reliable product.
Engineering Phase
In the engineering phase, reliability test plans should be finalized and correlated to an expected mission profile. The mission profile is usually documented in a Product Requirements Document and specifies the expected duration of use, the customer population, and the expected categories of use (e.g. whether the product is foreseeably expected to be used outdoors). Reliability tests need to be correlated to these usage behaviors and should leverage Accelerated Life Testing (ALT) to replicate years of usage in just weeks of test time. Developing these durable tests takes time, knowledge of acceleration models, knowledge of material limitations, and familiarity with failure mechanisms. Temperature is a great accelerant and is used across the industry to accelerate reaction rates, thermo-mechanical stresses, and much more. A common acceleration model used in ALT is the Arrhenius equation, where the acceleration factor is AF = exp[(Ea/kB) × (1/T_use − 1/T_test)], with temperatures in Kelvin, Ea the activation energy of the mechanism, and kB the Boltzmann constant. Industry standards, such as those from JEDEC, are great starting points for developing a new reliability test suite but should not be the final product. The final reliability of each product should be tied directly to the expected mission profile; this ensures products are designed for their customers, don’t carry unknown risks, and are not over-designed for a use case they aren’t intended for.
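To make the acceleration math concrete, here is a minimal Python sketch of that Arrhenius acceleration factor. The 0.7 eV activation energy and the 35 °C / 85 °C temperatures are illustrative assumptions, not values tied to any real product or mechanism.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(ea_ev: float, t_use_c: float, t_test_c: float) -> float:
    """Arrhenius acceleration factor: AF = exp[(Ea/kB) * (1/T_use - 1/T_test)].

    Temperatures are given in Celsius and converted to Kelvin.
    """
    t_use_k = t_use_c + 273.15
    t_test_k = t_test_c + 273.15
    return math.exp((ea_ev / K_B) * (1.0 / t_use_k - 1.0 / t_test_k))

# Assumed values for illustration only: 0.7 eV activation energy,
# 35 C field-use temperature, 85 C chamber temperature.
af = arrhenius_af(ea_ev=0.7, t_use_c=35.0, t_test_c=85.0)
print(f"Acceleration factor: {af:.1f}")  # roughly 40x in this hypothetical case
```

With an acceleration factor of roughly 40, five years of field use compresses into a few weeks of chamber time, which is exactly why temperature is such a workhorse accelerant.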
Reliability Execution
Reliability assessments in the test-and-iterate phases need a sufficient sample size to support statistically significant analysis. They need to be representative of the production population and provide coverage over manufacturing variability, design variability, and material variability. Because variability in both the product and the customer has to be evaluated, reliability work becomes increasingly successful with proper sample sizes. Binomial models can be used for quick sample-size calculations but should not be the only method for determining sample sizes or demonstrated reliability. Once testing has concluded, the results need to be analyzed by failure mechanism rather than by failure mode. A single failure mode (e.g. the product does not power on) can have many underlying failure mechanisms (e.g. a disengaged board-to-board connector or a fatigued FPC), so it is important to leverage Failure Analysis to understand the underlying physics of each failure. The final assessment of the engineering phase should be a signal on whether the design is reliable against the expected use. Manufacturing refinement can still occur through process improvements, quality controls, and rolling changes, but the inherent design should be reliable.
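As a rough sketch of that binomial shortcut (useful for quick planning but, as noted above, not the only basis for sample sizes), the snippet below uses the zero-failure “success-run” form of the binomial model. The 95% reliability at 90% confidence target is an assumption chosen only for illustration.

```python
import math

def success_run_sample_size(reliability: float, confidence: float) -> int:
    """Zero-failure (success-run) sample size from the binomial model:
    n = ln(1 - C) / ln(R), rounded up."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

def demonstrated_reliability(n: int, confidence: float) -> float:
    """Reliability demonstrated when n units pass with zero failures
    at the given confidence: R = (1 - C) ** (1 / n)."""
    return (1.0 - confidence) ** (1.0 / n)

# Assumed targets: demonstrate 95% reliability at 90% confidence, zero failures.
n = success_run_sample_size(reliability=0.95, confidence=0.90)
print(f"Units required: {n}")  # 45 units with zero failures
print(f"R demonstrated by 32 passing units: {demonstrated_reliability(32, 0.90):.3f}")
```

The same formula run backwards is a quick gut check on what a smaller build can actually demonstrate, which is often where the conversation about coverage of manufacturing and design variability starts.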
Reliability Demonstration Example
The following example demonstrates the flow from understanding the customer, to establishing an accelerated test, to analyzing the risks identified in the test. In this example, the umbrella accelerated life test is correlated to a prescribed activation energy, Ea. As results materialize, each risk should be understood through its own respective physics. By working through conclusive root-cause analysis, the reliability engineer can assess the risk associated with each mechanism and provide insight into the mitigation space.
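Here is a hedged numeric sketch of that flow under assumed values: the umbrella test is sized with a prescribed Ea, and each mechanism found in test is then reassessed against its own activation energy. The mission profile, temperatures, activation energies, and mechanism names below are all placeholders.

```python
import math

K_B = 8.617e-5  # Boltzmann constant, eV/K

def accel_factor(ea_ev: float, t_use_c: float = 35.0, t_test_c: float = 85.0) -> float:
    """Arrhenius acceleration factor between use and test temperatures."""
    t_use, t_test = t_use_c + 273.15, t_test_c + 273.15
    return math.exp((ea_ev / K_B) * (1.0 / t_use - 1.0 / t_test))

# Assumed mission profile: 5 years of field use at 35 C.
field_hours = 5 * 365 * 24

# Umbrella test planned against a prescribed activation energy of 0.7 eV.
umbrella_af = accel_factor(0.7)
test_hours = field_hours / umbrella_af
print(f"Planned test duration: {test_hours:.0f} h (AF = {umbrella_af:.1f})")

# Each mechanism observed in test is reassessed with its own physics.
# These activation energies are placeholders, not published values.
for mechanism, ea in [("electromigration-like", 0.9), ("corrosion-like", 0.5)]:
    af = accel_factor(ea)
    equivalent_years = test_hours * af / (365 * 24)
    print(f"{mechanism}: Ea = {ea} eV -> test covered ~{equivalent_years:.1f} field years")
```

The point of the sketch is the asymmetry: a mechanism with a lower activation energy than the umbrella assumption received far less field-equivalent coverage than planned, which is why each failure mechanism has to be assessed from its own physics.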
Scale Phase
In the scale phase, reliability test plans transition into a qualification plan. These qualification plans are used to A/B test a product’s scaling operations against the design’s demonstrated reliability from the engineering phase. Most, if not all, of the design-dependent risks should have been identified and either mitigated or accepted during the engineering phase, prior to locking the design. In the scale phase, new risks will materialize from large-volume variance, process regressions, and incomplete quality-control measures. As new suppliers, lines, tools, or other factors are brought up, these scaling contributions need to be assessed appropriately. Some new risks will be lot or supply dependent, while others will be systemic due to current control limits. Navigating the appropriate disposition of scale risks requires a comprehensive quality-control plan and strong knowledge of the underlying physics.
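One simple way to frame that A/B comparison is to ask whether the scale build’s failure rate has statistically shifted away from the engineering baseline. The sketch below uses a pooled two-proportion normal approximation with entirely hypothetical failure counts; it is a sanity check, not a substitute for a qualification plan.

```python
import math

def two_proportion_p_value(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in failure proportions using a
    pooled normal approximation (reasonable for large build quantities)."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: engineering qualification baseline vs. a new scale line.
p = two_proportion_p_value(fail_a=2, n_a=200, fail_b=15, n_b=300)
print(f"p-value for a shift in failure rate: {p:.3f}")
# A small p-value flags a scaling regression worth root-causing
# (new supplier, line, or quality-control gap) rather than a design flaw.
```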
Risk Assessments
Towards the end of the engineering phase and throughout the scale phase, the reliability and quality functions should be providing risk assessments not only for single failure mechanisms but also for the entire system assembly. These risk assessments inform Warranty Reserves in a company’s financial planning, ensuring the right budget is set aside for the downstream costs of a failed product in the field. Depending on the industry, this could include preventative maintenance, reactive maintenance, returns, replacements, repair, reverse logistics, customer support cost, and even recalls. An integrity program needs to be comfortable with risk assessments across the entire bathtub curve, including risks associated with the regions below (a short sketch mapping these regions to Weibull shape parameters follows the list):
Infant Mortality: Quality escapes and early failures
Random Failure: Constant risks associated with intrinsic and random failures
Wear Out: End-of-life failures due to predominantly fatigue-related risks
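One common, if hedged, way to tie these three regions to data is through the Weibull shape parameter: a fitted shape below 1 typically points to infant mortality, near 1 to a roughly constant hazard, and above 1 to wear out. The sketch below encodes that rule of thumb; the mechanism names and shape values are invented for illustration.

```python
def bathtub_region(beta: float) -> str:
    """Map a fitted Weibull shape parameter (beta) to the bathtub-curve region
    it most often indicates. The cutoffs are rules of thumb; the underlying
    failure mechanism should always confirm the statistical story."""
    if abs(beta - 1.0) <= 0.1:
        return "random failure (roughly constant hazard)"
    if beta < 1.0:
        return "infant mortality (decreasing hazard; quality escapes)"
    return "wear out (increasing hazard; fatigue-type mechanisms)"

# Hypothetical mechanisms and shape parameters, for illustration only.
for mechanism, beta in [("workmanship escapes", 0.6),
                        ("ESD-type events", 1.0),
                        ("solder-joint fatigue", 2.8)]:
    print(f"{mechanism}: beta = {beta} -> {bathtub_region(beta)}")
```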
Understanding the physics of failure ensures the proper risk-assessment modeling is leveraged. Methodologies for risk assessments include Weibull modeling, lognormal modeling, stress-strength modeling, Monte Carlo simulation, Markov chains, and many other statistical approaches to predicting risk. Those experienced in this field probably noticed that these are all statistical models. In my experience, combining the right statistical approach with the right physics-of-failure knowledge is the optimal approach to performing risk assessments. Some tools, such as ReliaSoft, will provide a ‘best fit’ model for a data set, determined by the coefficient of determination. A best-fit model alone does not necessarily mean the statistical model is correct for the mechanism. This is where a purely statistical, data-science approach breaks down in hardware quality and reliability. Risk assessments need to include the appropriate combination of statistical and physics-based models to best represent how risks materialize in the field.
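As a minimal sketch of one of those methods, the snippet below runs a stress-strength Monte Carlo: draw a stress and a strength from assumed distributions and count how often stress exceeds strength. The distributions, parameters, and units are placeholders; in practice both sides should come from measured data and the relevant physics, which is exactly the statistics-plus-physics pairing argued for above.

```python
import random

def stress_strength_unreliability(n_trials: int = 200_000, seed: int = 7) -> float:
    """Monte Carlo stress-strength interference: the fraction of trials where
    the applied stress exceeds the part's strength. All distributions and
    parameters below are illustrative placeholders, not measured data."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        stress = rng.gauss(mu=300.0, sigma=40.0)              # e.g. peak event load (N)
        strength = rng.weibullvariate(alpha=520.0, beta=8.0)  # part strength (N)
        if stress > strength:
            failures += 1
    return failures / n_trials

print(f"Estimated probability of failure: {stress_strength_unreliability():.4%}")
```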
Production Phase
Woohoo! You shipped the product and customers are loving it! This is an exciting time, and everyone should take a moment to enjoy the launch. It also begins an important next phase of the journey. Up to this point, field risks were all predicted based on development knowledge of failure mechanisms and statistical modeling. In production, the true magnitude of those risks will become evident.
Understand Field Risks
A strong integrity program will partner with a robust Customer Service operation to understand the Voice of the Customer. In-field telemetry, quantitative analysis, qualitative analysis, and failure information are fundamental to analyzing the friction your customers are facing. An excellent Voice of the Customer program helps feed information back into the living FMEA to refine severity, occurrence, and detection for a variety of failure modes. This phase of the product lifecycle should include a strong emphasis on learning. Lessons learned should be documented and fed back into the next product’s development cycle. Within Reliability Engineering, the goal of Lessons Learned is to improve the inherent risk knowledge in the architecture phase so that risks can be proactively designed out of the product, rather than continuing to rely on the more costly test-and-iterate cycles.
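To make that “living FMEA” loop a little more tangible, here is a small hedged sketch: recompute and re-rank risk priority numbers (severity × occurrence × detection) whenever field data shifts a score. The failure modes and ratings below are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FmeaLine:
    """One line of a living FMEA; all entries here are illustrative."""
    failure_mode: str
    severity: int    # 1-10, from design knowledge
    occurrence: int  # 1-10, refined from field telemetry and returns
    detection: int   # 1-10, refined from screening and quality controls

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection

# Hypothetical lines after a field-data review updated the occurrence scores.
fmea = [
    FmeaLine("does not power on", severity=8, occurrence=4, detection=3),
    FmeaLine("display flicker", severity=5, occurrence=6, detection=2),
    FmeaLine("enclosure crack", severity=6, occurrence=2, detection=4),
]

for line in sorted(fmea, key=lambda item: item.rpn, reverse=True):
    print(f"{line.failure_mode}: RPN = {line.rpn}")
```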
Culture for Success
Wow, that was a lot of content! If you’re still with me, I’d like to close with a note on enabling risk-mitigation functions. Quality, reliability, and other risk-based functions are well-intentioned professions accountable for minimizing business, customer, and financial risk by proactively identifying, dispositioning, and mitigating product risks. First and foremost, identifying all risks is the goal and should be recognized as a win! It is far more valuable to understand all of the risks to your product and company than to be naive and hope risks don’t exist. To establish a culture of integrity, psychological safety has to be employed, data-driven analysis needs to be the foundation of a durable decision framework, and the customer has to be the priority. There are many great talks on psychological safety, and I encourage you to listen to Amy Edmondson, who coined the term. Without the right culture, employees won’t feel comfortable identifying or disclosing risk. In the long term, this will damage a company’s brand as risks and issues transfer over to the customer.
I’m thankful to Spanner for letting me share insights into Reliability Engineering. For any help related to hardware integrity, please inquire at www.relfa.org.
Kevin Keeler is an accomplished product integrity leader who has focused his career on ensuring the future of innovative hardware is also reliable. Kevin believes products should be designed to be used and strives to enable his partners to build long-lasting, world-class hardware. Kevin is energized by innovation and believes the impossible is possible.
Kevin is based in the Bay Area of California and enjoys staying active through sports, family adventures, and playing with his dog, Stella.