Kiploks analysis methodology

Backtests show performance.
Kiploks shows survivability.

Kiploks evaluates whether your strategy can survive real capital: we score robustness, validate out-of-sample, check data quality, and combine everything into a clear verdict (ROBUST / CAUTION / DO NOT DEPLOY) so you know what to do next.

Last updated: February 2026

How the report is built

The result you see follows a single pipeline, and each step feeds the next: the Data Quality Guard validates the data, walk-forward validation tests the edge out-of-sample, the benchmark, sensitivity, cost, and risk modules probe it from every angle, and everything rolls up into the Kiploks Robustness Score and the final verdict.

Glossary
IS (In-Sample)
The period used to tune or optimize the strategy. Results on IS data can be overfit.
OOS (Out-of-Sample)
The period not used for tuning; used to validate whether the edge holds. OOS results are a better signal of real-world performance.
CAGR (Compound Annual Growth Rate)
The smoothed annual return assuming gains are reinvested. Comparable across strategies and time periods regardless of how long the backtest runs.
WFE (Walk-Forward Efficiency)
Median of (OOS return / IS return) per time window, computed only over windows where IS return > 0. Measures how much of the tuned performance carries over to the next period. High WFE = robust; low or N/A = weak or insufficient data.
OOS Retention
The ratio of total out-of-sample returns to total in-sample returns across all walk-forward windows (sum of all OOS divided by sum of all IS). Measures overall survival of the edge across the full history, as opposed to WFE which uses the median per window. Retention > 1 with many failed windows indicates one dominant window — check window breakdown.
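As a quick illustration of the CAGR definition above, a minimal sketch (the function name and signature are ours, not from the report):

```python
def cagr(total_return, years):
    """Compound Annual Growth Rate: the smoothed annual return assuming
    gains are reinvested over the whole period."""
    return (1 + total_return) ** (1 / years) - 1

# A 150% total return earned over 3 years compounds to roughly 35.7% per year.
```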

Data Quality Guard (DQG)

We check that the data and run are trustworthy before scoring robustness. If the Data Quality Guard fails or is rejected, the robustness score is blocked so you do not rely on weak data.

What to do: If DQG fails, improve data quality, add more trades, or fix price gaps and re-run the analysis.

[How the Data Quality Guard Works]

Checks

  • Sampling: enough trades for reliable statistics
  • Gap density / price integrity (when OHLCV available)
  • Verdict: PASS / FAIL / REJECT
  • If DQG fails or is rejected, the robustness score is blocked (overall = 0)
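The gating logic above can be sketched as follows; the numeric thresholds here are hypothetical placeholders for illustration, not the values Kiploks actually uses:

```python
def dqg_verdict(n_trades, gap_density, min_trades=100, max_gap=0.05):
    """Data Quality Guard sketch. min_trades and max_gap are illustrative
    placeholders, not calibrated Kiploks thresholds."""
    if gap_density > 2 * max_gap:   # data too broken to trust at all
        return "REJECT"
    if n_trades < min_trades or gap_density > max_gap:
        return "FAIL"
    return "PASS"

def gated_score(robustness_score, verdict):
    """A failed or rejected DQG blocks the robustness score (overall = 0)."""
    return robustness_score if verdict == "PASS" else 0
```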

Walk-Forward Validation

We tune on one time window (In-Sample, IS), then measure how much of that edge carries over to the next (Out-of-Sample, OOS). This process is repeated across the full history to test the stability of your optimization logic.

Efficiency vs. Retention:

  • WFE (Efficiency) measures the consistency of the process: how reliably the strategy preserves its performance when facing unknown data. It uses the median to filter out "lucky" outlier windows.
  • OOS Retention measures the survival of the edge: the ratio of total money made Out-of-Sample versus In-Sample.

What to do:

If WFE is FAIL or WARN, your optimization process is likely capturing noise (overfitting); the strategy "breaks" as soon as the market shifts. If WFE is N/A, you have fewer than 3 windows with a positive In-Sample start — add more data or shorten windows to gather enough evidence.

[How to Read Walk-Forward Validation]

Key formulas

WFE:

WFE = median( OOS_i / IS_i ),  i ∈ { windows : IS_i > 0 }

Requires: count(IS_i > 0) ≥ 3; otherwise WFE = N/A

OOS Retention:

OOS Retention = Σᵢ OOS_i / Σᵢ IS_i,  Σᵢ IS_i > 0

Requires: Σ IS > 0; otherwise Retention = N/A

Retention > 1 with many failed windows indicates one dominant window — check window breakdown.
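Both formulas can be sketched directly from the definitions above, with window returns as (IS, OOS) pairs and `None` standing in for N/A:

```python
from statistics import median

def wfe(windows):
    """Median of OOS/IS over windows with IS > 0; None (N/A) below 3 such windows."""
    ratios = [oos / is_ for is_, oos in windows if is_ > 0]
    return median(ratios) if len(ratios) >= 3 else None

def oos_retention(windows):
    """Sum of all OOS returns over sum of all IS returns; None (N/A) if sum(IS) <= 0."""
    total_is = sum(is_ for is_, _ in windows)
    return sum(oos for _, oos in windows) / total_is if total_is > 0 else None

windows = [(0.10, 0.08), (0.05, 0.03), (0.08, 0.06)]  # (IS, OOS) per window
# wfe(windows) ≈ 0.75; oos_retention(windows) = 0.17 / 0.23 ≈ 0.74
```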

Window classification

  • normal — IS > 0, included in WFE
  • recovery — IS < 0, OOS > 0; excluded; reported separately
  • double_neg — IS < 0, OOS ≤ 0; excluded; reported separately
  • undefined — |IS| < ε; excluded
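A minimal sketch of this classification, with ε as a small numerical tolerance:

```python
def classify_window(is_ret, oos_ret, eps=1e-9):
    """Classify one walk-forward window; only 'normal' windows enter WFE."""
    if abs(is_ret) < eps:
        return "undefined"   # |IS| below numerical tolerance
    if is_ret > 0:
        return "normal"
    return "recovery" if oos_ret > 0 else "double_neg"
```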

Performance Degradation

Performance Degradation = (mean(OOS) − mean(IS)) / |mean(IS)|

where mean(IS) ≠ 0; otherwise N/A. Note: a result below −1 means the IS and OOS means have opposite signs; in that case the value is not comparable to the full-backtest CAGR.
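A sketch of the degradation formula, returning `None` for the N/A case:

```python
from statistics import mean

def performance_degradation(is_returns, oos_returns):
    """(mean(OOS) - mean(IS)) / |mean(IS)|; None (N/A) when mean(IS) == 0."""
    m_is, m_oos = mean(is_returns), mean(oos_returns)
    return (m_oos - m_is) / abs(m_is) if m_is != 0 else None
```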

WFE thresholds

  • ≥ 0.7 PASS Robust carry-over; the strategy "behaves" as expected.
  • 0.5 – 0.7 ACCEPTABLE Adequate stability; monitor for alpha decay.
  • 0.2 – 0.5 WARN Fragile; significant performance drop Out-of-Sample.
  • < 0.2 FAIL High overfitting risk; the "tuned" edge does not exist.
  • N/A Insufficient statistical evidence (need at least 3 windows with in-sample return > 0).
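The threshold table maps onto a simple lookup, with `None` standing in for N/A:

```python
def wfe_label(wfe):
    """Map a WFE value (or None for N/A) to its report label."""
    if wfe is None:
        return "N/A"
    if wfe >= 0.7:
        return "PASS"
    if wfe >= 0.5:
        return "ACCEPTABLE"
    if wfe >= 0.2:
        return "WARN"
    return "FAIL"
```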

Benchmark Metrics

Summary of how the strategy holds up on data it was not tuned on: Walk-Forward Efficiency, out-of-sample returns, how often it stays profitable, Kill-Switch (an exit rule, not just a metric), stability, and behavior across different market regimes.

[How to Read Benchmark Metrics]

Reported

  • Walk-Forward Efficiency (WFE) min / typical / max
  • OOS risk-adjusted returns, pass rate
  • Kill-Switch: consecutive losing OOS windows threshold
  • Stability, regimes (trend, range, high volatility)
  • Verdict: READY, INCUBATE, CAUTION, REJECT

Benchmark Comparison

We compute the strategy’s net edge after all costs (fees, slippage, market impact) so the comparison with the benchmark (e.g. BTC buy and hold) is fair. Your strategy is compared over the same period with the same risk lens; we show annualised growth rate, alpha, information ratio, and correlation.

[How to Read Benchmark Comparison]

What you see

  • CAGR and alpha vs benchmark
  • Information ratio vs benchmark
  • Correlation to benchmark
  • Net edge after costs (used in capacity logic)

Parameter Sensitivity & Stability

We analyze each parameter to ensure your strategy's edge is built on a solid "plateau" rather than a lucky "spike." A robust strategy should perform well even if the market shifts slightly and your parameters are no longer perfectly optimal.

Three Dimensions of Control:

  • Robustness Margin: Measures the "width" of the profitable zone. It tells you how far a parameter can move before performance drops by 20%.
  • Sensitivity Index (SI): Measures the risk of small changes. A high SI means a tiny shift in settings can turn a winning strategy into a losing one (a sign of a "fragile" edge).
  • Parameter Stability (PSI): Tracks how much the optimal value "drifts" over time across different market regimes (WFA windows).

What to do:

If a parameter is labeled Fragile or Unstable, the strategy is likely overfit to a specific price pattern. Consider widening your entry logic, reducing the number of parameters, or extending the test period to find a more universal "plateau."

[How to Read Parameter Sensitivity & Stability]

Core Metrics per Parameter

  • SI & PSI Labels: Stable, Reliable, Needs Tuning, Fragile.
  • Overfitting Trigger: SI ≥ 0.35 indicates high sensitivity.
  • Status: APPROVED, CONDITIONAL, REJECTED, HOLD.

Technical details

Robustness Margin (%):

Calculates the safe range around the peak:

threshold = score_peak × 0.8  (or × 1.2 if score_peak < 0)
robustPct = ((upper − lower) / 2) / |score_peak| × 100
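Read literally, the two formulas above translate to the following sketch, where `lower` and `upper` are assumed to bound the parameter zone whose score stays at or above the threshold:

```python
def plateau_threshold(score_peak):
    """Performance floor: 80% of the peak (or 120% for a negative peak)."""
    return score_peak * (0.8 if score_peak >= 0 else 1.2)

def robustness_margin_pct(score_peak, lower, upper):
    """robustPct = ((upper - lower) / 2) / |score_peak| * 100, per the
    formula above."""
    return (upper - lower) / 2 / abs(score_peak) * 100
```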

Sensitivity Index (SI):

Determined via 5-bin variance analysis:

SI = Var(mean score per bin) / Var(all scores),  SI ∈ [0, 1]

Higher SI = higher risk of performance collapse from small drift.
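A sketch of the 5-bin variance analysis under the stated definition; the exact binning and variance conventions used in the report may differ:

```python
from statistics import mean, pvariance

def sensitivity_index(param_values, scores, bins=5):
    """SI = Var(mean score per bin) / Var(all scores), over equal-width
    bins of the parameter range. Illustrative sketch only."""
    lo, hi = min(param_values), max(param_values)
    width = (hi - lo) / bins or 1.0
    buckets = [[] for _ in range(bins)]
    for p, s in zip(param_values, scores):
        buckets[min(int((p - lo) / width), bins - 1)].append(s)
    bin_means = [mean(b) for b in buckets if b]
    return pvariance(bin_means) / pvariance(scores)
```

A score that tracks the parameter closely yields SI near 1 (fragile), while a score that is flat across the parameter range yields SI near 0 (stable).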

PSI Thresholds (Drift Analysis):

  • < 0.15 Stable: Optimal value is consistent over time.
  • 0.15 – 0.40 Moderate: Optimal value shifts with regimes.
  • > 0.40 Unstable: Optimal value is random (likely noise).

Trading Intensity & Cost Drag

How often the strategy trades, what costs eat into returns, and how much capital it can handle. Break-even slippage tells you how much worse execution you can tolerate before the edge disappears.

What to do: If cost-to-edge is FAIL or WARNING, reduce turnover or improve execution quality. If break-even slippage is low, do not scale position size before execution is sufficient.

[How to Read Trading Intensity & Cost Drag]

We report trading frequency, costs (fees, slippage, market impact), break-even slippage, and capacity (AUM levels where returns drop −10%, −25%, or collapse entirely). Cost-to-edge (costs as % of gross edge): PASS < 20%, WARNING 20–40%, FAIL ≥ 40%.
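The cost-to-edge thresholds map onto a simple check (argument names are ours):

```python
def cost_to_edge_label(annual_cost_pct, gross_edge_pct):
    """Costs as a share of gross edge: PASS < 20%, WARNING 20-40%, FAIL >= 40%."""
    ratio = annual_cost_pct / gross_edge_pct * 100
    if ratio < 20:
        return "PASS"
    return "WARNING" if ratio < 40 else "FAIL"
```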


Technical details

Break-even slippage (bps):
  BreakevenSlippage = AvgNetProfitPerTrade_bps / 2

Market impact per trade:
  Impact = k × σ_daily × (Q / ADV)
  where k = market impact coefficient (calibrated per asset class),
  σ_daily = daily return volatility, Q = trade size, ADV = avg daily volume

Cost drag:
  CostDrag% = −AnnualCost%
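These formulas translate directly; we read the division by 2 in break-even slippage as the two fills (entry and exit) of a round trip:

```python
def breakeven_slippage_bps(avg_net_profit_per_trade_bps):
    """Tolerable slippage per fill: half the average net profit per trade."""
    return avg_net_profit_per_trade_bps / 2

def market_impact(k, sigma_daily, trade_size, adv):
    """Impact = k × σ_daily × (Q / ADV); k is calibrated per asset class."""
    return k * sigma_daily * (trade_size / adv)
```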

Strategy Action Plan

Concrete next steps derived from the full analysis. Use the action plan to prioritise what to fix first and to determine whether the strategy is ready to deploy.

Action Plan is not financial advice; it reflects statistical findings from the analysis.

[How the Strategy Action Plan Works]

  • Action items by priority: parameters, execution, risk
  • Slippage tolerance: safe range vs dangerous threshold
  • Deployment readiness: go / wait / fix

Risk Metrics (Out-of-Sample)

Risk is measured exclusively on data the strategy was not tuned on. We report return distribution statistics and optional narratives. The risk verdict (STABLE, CAUTION, or UNSTABLE) is accompanied by a recommended next action and maximum leverage guidance.

What to do: If risk is CAUTION or UNSTABLE, reduce position size or leverage and address the root causes (e.g. tail risk, deep drawdowns) before deploying.

[How to Read Risk Metrics (Out-of-Sample)]

Reported

  • Max drawdown, average drawdown, recovery time
  • Sharpe ratio, Sortino ratio, profit factor, win rate
  • VaR 95%, Expected Shortfall (CVaR)
  • Return skewness, kurtosis (fat tails), tail ratio
  • Verdict: STABLE / CAUTION / UNSTABLE
  • Recommendation: status, next steps, max leverage

Kiploks Robustness Score

A single number from 0 to 100 built from four components: Validation (does the edge hold out-of-sample?), Risk (reward vs drawdown and recovery), Stability (small parameter changes do not break the strategy), and Execution (edge survives fees and slippage).

What to do: If the score is 0 despite a profitable backtest, one of the four modules or the Data Quality Guard is blocking it (e.g. DQG reject, WFE fail, or execution cost exceeding the edge). Check the report for which gate failed and fix that first.

[How the Kiploks Robustness Score Works]

Formula (same as in report)

Score = (V^0.4 × R^0.3 × S^0.2 × E^0.1) × DQG × 100

where V = Validation, R = Risk, S = Stability, E = Execution, DQG = Data Quality Guard (0 or 1)

If any of V, R, S, E = 0, or DQG = 0, then Score = 0
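The score formula and its zero-gate rule as a sketch:

```python
def robustness_score(v, r, s, e, dqg):
    """Score = (V^0.4 × R^0.3 × S^0.2 × E^0.1) × DQG × 100.
    Any zero component, or a blocked DQG, forces the score to 0."""
    if 0 in (v, r, s, e) or dqg == 0:
        return 0.0
    return v**0.4 * r**0.3 * s**0.2 * e**0.1 * dqg * 100
```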

Final Verdict Summary

One screen answers: launch now, wait, or drop the strategy. The verdict is computed from the robustness score, Data Quality Guard, Walk-Forward Efficiency, risk and cost checks.

The checklist shows which validation and risk gates passed or failed; the summary text explains what works, what fails, and what we recommend next.

What to do: Follow the recommended action in the summary. If CAUTION or DO NOT DEPLOY, fix the flagged issues before reassessing deployment.

[How to Read the Final Verdict]

What you see

  • Verdict: ROBUST / CAUTION / DO NOT DEPLOY
  • Checklist: which gates passed or failed
  • Summary + recommended action

Verdict meaning

  • ROBUST: Strategy passes validation, risk, and cost checks. Review the full report and consider deploying.
  • CAUTION: Some gates failed or are borderline. Fix the flagged issues before deploying.
  • DO NOT DEPLOY: Critical failures detected (e.g. data quality, WFE, or risk). Address all summary items before any deployment.

Frequently asked questions

What does Walk-Forward Efficiency (WFE) = N/A mean?
We need at least 3 time windows with in-sample return > 0 to compute WFE. If you have fewer - for example because the backtest history is too short or the windows are too wide - WFE is not applicable. Add more data or shorten windows and re-run.
Why is my robustness score 0 if the strategy is profitable in the backtest?
The score is built from four modules (Validation, Risk, Stability, Execution) and is blocked when any module scores 0 or when the Data Quality Guard fails. A profitable backtest does not guarantee that all four modules pass. Check the report to see which gate failed - for example DQG reject, WFE below threshold, or execution costs exceeding the edge - and fix that first.
How do I interpret CAUTION vs DO NOT DEPLOY?
CAUTION means one or more checks failed or are borderline. The strategy has potential but requires specific fixes before deployment - the report tells you what to address. DO NOT DEPLOY means critical failures across data quality, validation, or risk. No deployment should occur until the listed issues are resolved.
What is the gap between WFE 0.5 and 0.7 in the thresholds?
The range 0.5 - 0.7 is labelled ACCEPTABLE: the edge carries over adequately but is not yet in the robust zone. Monitor for decay over time and consider whether the strategy needs more validation windows before full deployment.
Why is my Profit Factor capped at 20.0?
Profit Factor is gross profit divided by gross loss. When total losses approach zero the ratio blows up toward infinity, so we cap it at 20.0 to keep the metric numerically stable and comparable across strategies.
Do the terms on this page match what I see in my report?
Yes. We use the same labels throughout: Walk-Forward Efficiency (WFE), Data Quality Guard (DQG), Kiploks Robustness Score, and the verdicts ROBUST / CAUTION / DO NOT DEPLOY. If you see a term in the report that is not defined here, check the glossary at the top of this page.

See it in action

The main page shows an example analysis with the same blocks you see in a real report. Run a robustness analysis to get your own verdict, robustness score, and full breakdown.