Audit of possible application states. The audit is not limited to states reachable through the UI or API; it also covers hardware faults, OS bugs, and malicious-insider scenarios. States are categorized by business impact, severity, and probability.
Analysis of a random graph representing the service’s dependency tree. Failure probabilities are calculated for individual nodes and for the service as a whole. Weaknesses such as single points of failure are highlighted.
Verification of performance levels, with recommendations for addressing any concerns found. The audit includes a Multivariable Scalability Curve.
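To give a feel for the dependency-tree analysis, here is a minimal sketch of how per-node failure probabilities can roll up into a service-level figure. It is an illustration under simplifying assumptions, not the audit's actual method: failures are treated as independent, every dependency is treated as mandatory (no redundancy), and the service names and probabilities are hypothetical.

```python
from math import prod


def required_nodes(node, deps):
    """Return the node plus everything it transitively depends on."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # shared dependencies are counted only once
        seen.add(current)
        stack.extend(deps.get(current, []))
    return seen


def failure_probability(node, deps, p_fail):
    """P(node is unavailable).

    Assumption: failures are independent and every dependency is
    mandatory, so the node is up only if all required nodes are up.
    """
    return 1 - prod(1 - p_fail[n] for n in required_nodes(node, deps))


# Hypothetical dependency graph and per-node failure probabilities.
deps = {"api": ["auth", "db"], "auth": ["db"], "db": []}
p_fail = {"api": 0.01, "auth": 0.02, "db": 0.005}

risk = {name: failure_probability(name, deps, p_fail) for name in deps}
for name, value in sorted(risk.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.4f}")
```

In this simplified model every mandatory dependency is a single point of failure; a real analysis would also account for redundancy, correlated failures, and optional dependencies.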
Random Graph Theory expert
Comfortable in a large codebase
Over 15 years of experience building distributed systems
Know what can be made optional
You have a regular bugfix release scheduled after each feature release
You lost over $100k due to instability in the last two quarters
You are about to sign a contract with strict SLAs and big penalties
You lost clients to competitors, and they cite broad performance and stability issues as the reason for leaving
You do not fully understand either the cause or the remedy described by the dev team in the incident post-mortem