Kiploks analysis methodology

Backtests show performance.
Kiploks shows survivability.

Kiploks evaluates whether your strategy can survive real capital: we score robustness, validate out-of-sample, check data quality, and combine everything into a clear verdict (ROBUST / CAUTION / DO NOT DEPLOY) so you know what to do next.

Last updated: February 2026

How the report is built

The result you see follows a single pipeline, and each step feeds the next: the Data Quality Guard validates the data, walk-forward validation tests the edge out-of-sample, the benchmark, sensitivity, cost, and risk modules probe it from every angle, and everything rolls up into the Kiploks Robustness Score and the final verdict.

Glossary
IS (In-Sample)
The period used to tune or optimize the strategy. Results on IS data can be overfit.
OOS (Out-of-Sample)
The period not used for tuning; used to validate whether the edge holds. OOS results are a better signal of real-world performance.
CAGR (Compound Annual Growth Rate)
The smoothed annual return assuming gains are reinvested. Comparable across strategies and time periods regardless of how long the backtest runs.
WFE (Walk-Forward Efficiency)
Median of (OOS return / IS return) per time window, computed only over windows where IS return > 0. Measures how much of the tuned performance carries over to the next period. High WFE = robust; low or N/A = weak or insufficient data.
OOS Retention
The ratio of total out-of-sample returns to total in-sample returns across all walk-forward windows (sum of all OOS divided by sum of all IS). Measures overall survival of the edge across the full history, as opposed to WFE which uses the median per window. Retention > 1 with many failed windows indicates one dominant window — check window breakdown.
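As a quick illustration of the CAGR definition above, a minimal sketch (the function name and signature are ours, not from the report):

```python
def cagr(total_return, years):
    """Compound Annual Growth Rate: the smoothed annual return assuming
    gains are reinvested over the whole period."""
    return (1 + total_return) ** (1 / years) - 1

# A 150% total return earned over 3 years compounds to roughly 35.7% per year.
```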

Data Quality Guard (DQG)

We check that the data and run are trustworthy before scoring robustness. If the Data Quality Guard fails or is rejected, the robustness score is blocked so you do not rely on weak data.

What to do: If DQG fails, improve data quality, add more trades, or fix price gaps and re-run the analysis.

[How the Data Quality Guard Works]

Checks

  • Sampling: enough trades for reliable statistics
  • Gap density / price integrity (when OHLCV available)
  • Verdict: PASS / FAIL / REJECT
  • If DQG fails or is rejected, the robustness score is blocked (overall = 0)
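The gating logic above can be sketched as follows; the numeric thresholds here are hypothetical placeholders for illustration, not the values Kiploks actually uses:

```python
def dqg_verdict(n_trades, gap_density, min_trades=100, max_gap=0.05):
    """Data Quality Guard sketch. min_trades and max_gap are illustrative
    placeholders, not calibrated Kiploks thresholds."""
    if gap_density > 2 * max_gap:   # data too broken to trust at all
        return "REJECT"
    if n_trades < min_trades or gap_density > max_gap:
        return "FAIL"
    return "PASS"

def gated_score(robustness_score, verdict):
    """A failed or rejected DQG blocks the robustness score (overall = 0)."""
    return robustness_score if verdict == "PASS" else 0
```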

Walk-Forward Validation

We tune on one time window (In-Sample, IS), then measure how much of that edge carries over to the next (Out-of-Sample, OOS). This process is repeated across the full history to test the stability of your optimization logic.

Efficiency vs. Retention:

  • WFE (Efficiency) measures the consistency of the process: how reliably the strategy preserves its performance when facing unknown data. It uses the median to filter out "lucky" outlier windows.
  • OOS Retention measures the survival of the edge: the ratio of total money made Out-of-Sample versus In-Sample.

What to do:

If WFE is FAIL or WARN, your optimization process is likely capturing noise (overfitting); the strategy "breaks" as soon as the market shifts. If WFE is N/A, you have fewer than 3 windows with a positive In-Sample start — add more data or shorten windows to gather enough evidence.

[How to Read Walk-Forward Validation]

Key formulas

WFE:

WFE = median( OOS_i / IS_i ),  i ∈ { windows : IS_i > 0 }

Requires: count(IS_i > 0) ≥ 3; otherwise WFE = N/A

OOS Retention:

OOS Retention = Σᵢ OOS_i / Σᵢ IS_i,  Σᵢ IS_i > 0

Requires: Σ IS > 0; otherwise Retention = N/A

Retention > 1 with many failed windows indicates one dominant window — check window breakdown.
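Both formulas can be sketched directly from the definitions above, with window returns as (IS, OOS) pairs and `None` standing in for N/A:

```python
from statistics import median

def wfe(windows):
    """Median of OOS/IS over windows with IS > 0; None (N/A) below 3 such windows."""
    ratios = [oos / is_ for is_, oos in windows if is_ > 0]
    return median(ratios) if len(ratios) >= 3 else None

def oos_retention(windows):
    """Sum of all OOS returns over sum of all IS returns; None (N/A) if sum(IS) <= 0."""
    total_is = sum(is_ for is_, _ in windows)
    return sum(oos for _, oos in windows) / total_is if total_is > 0 else None

windows = [(0.10, 0.08), (0.05, 0.03), (0.08, 0.06)]  # (IS, OOS) per window
# wfe(windows) ≈ 0.75; oos_retention(windows) = 0.17 / 0.23 ≈ 0.74
```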

Window classification

  • normal — IS > 0, included in WFE
  • recovery — IS < 0, OOS > 0; excluded; reported separately
  • double_neg — IS < 0, OOS ≤ 0; excluded; reported separately
  • undefined — |IS| < ε; excluded
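A minimal sketch of this classification, with ε as a small numerical tolerance:

```python
def classify_window(is_ret, oos_ret, eps=1e-9):
    """Classify one walk-forward window; only 'normal' windows enter WFE."""
    if abs(is_ret) < eps:
        return "undefined"   # |IS| below numerical tolerance
    if is_ret > 0:
        return "normal"
    return "recovery" if oos_ret > 0 else "double_neg"
```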

Performance Degradation

Performance Degradation = (mean(OOS) − mean(IS)) / |mean(IS)|

where mean(IS) ≠ 0; otherwise N/A. Note: a result below −1 means the IS and OOS means have opposite signs; in that case the value is not comparable to the full-backtest CAGR.
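A sketch of the degradation formula, returning `None` for the N/A case:

```python
from statistics import mean

def performance_degradation(is_returns, oos_returns):
    """(mean(OOS) - mean(IS)) / |mean(IS)|; None (N/A) when mean(IS) == 0."""
    m_is, m_oos = mean(is_returns), mean(oos_returns)
    return (m_oos - m_is) / abs(m_is) if m_is != 0 else None
```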

WFE thresholds

  • ≥ 0.7 PASS Robust carry-over; the strategy "behaves" as expected.
  • 0.5 – 0.7 ACCEPTABLE Adequate stability; monitor for alpha decay.
  • 0.2 – 0.5 WARN Fragile; significant performance drop Out-of-Sample.
  • < 0.2 FAIL High overfitting risk; the "tuned" edge does not exist.
  • N/A Insufficient statistical evidence (need at least 3 windows with in-sample return > 0).
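The threshold table maps onto a simple lookup, with `None` standing in for N/A:

```python
def wfe_label(wfe):
    """Map a WFE value (or None for N/A) to its report label."""
    if wfe is None:
        return "N/A"
    if wfe >= 0.7:
        return "PASS"
    if wfe >= 0.5:
        return "ACCEPTABLE"
    if wfe >= 0.2:
        return "WARN"
    return "FAIL"
```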

Benchmark Metrics

Summary of how the strategy holds up on data it was not tuned on: Walk-Forward Efficiency, out-of-sample returns, how often it stays profitable, Kill-Switch (an exit rule, not just a metric), stability, and behavior across different market regimes.

[How to Read Benchmark Metrics]

Reported

  • Walk-Forward Efficiency (WFE) min / typical / max
  • OOS risk-adjusted returns, pass rate
  • Kill-Switch: consecutive losing OOS windows threshold
  • Stability, regimes (trend, range, high volatility)
  • Verdict: READY, INCUBATE, CAUTION, REJECT

Benchmark Comparison

We compute the strategy’s net edge after all costs (fees, slippage, market impact) so the comparison with the benchmark (e.g. BTC buy and hold) is fair. Your strategy is compared over the same period with the same risk lens; we show annualised growth rate, alpha, information ratio, and correlation.

[How to Read Benchmark Comparison]

What you see

  • CAGR and alpha vs benchmark
  • Information ratio vs benchmark
  • Correlation to benchmark
  • Net edge after costs (used in capacity logic)

Parameter Sensitivity & Stability

We analyze each parameter to ensure your strategy's edge is built on a solid "plateau" rather than a lucky "spike." A robust strategy should perform well even if the market shifts slightly and your parameters are no longer perfectly optimal.

Three Dimensions of Control:

  • Robustness Margin: Measures the "width" of the profitable zone. It tells you how far a parameter can move before performance drops by 20%.
  • Sensitivity Index (SI): Measures the risk of small changes. A high SI means a tiny shift in settings can turn a winning strategy into a losing one (a sign of a "fragile" edge).
  • Parameter Stability (PSI): Tracks how much the optimal value "drifts" over time across different market regimes (WFA windows).

What to do:

If a parameter is labeled Fragile or Unstable, the strategy is likely overfit to a specific price pattern. Consider widening your entry logic, reducing the number of parameters, or extending the test period to find a more universal "plateau."

[How to Read Parameter Sensitivity & Stability]

Core Metrics per Parameter

  • SI & PSI Labels: Stable, Reliable, Needs Tuning, Fragile.
  • Overfitting Trigger: SI ≥ 0.35 indicates high sensitivity.
  • Status: APPROVED, CONDITIONAL, REJECTED, HOLD.

Technical details

Robustness Margin (%):

Calculates the safe range around the peak:

threshold = score_peak × 0.8  (or × 1.2 if score_peak < 0)
robustPct = ((upper − lower) / 2) / |score_peak| × 100
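Read literally, the two formulas above translate to the following sketch, where `lower` and `upper` are assumed to bound the parameter zone whose score stays at or above the threshold:

```python
def plateau_threshold(score_peak):
    """Performance floor: 80% of the peak (or 120% for a negative peak)."""
    return score_peak * (0.8 if score_peak >= 0 else 1.2)

def robustness_margin_pct(score_peak, lower, upper):
    """robustPct = ((upper - lower) / 2) / |score_peak| * 100, per the
    formula above."""
    return (upper - lower) / 2 / abs(score_peak) * 100
```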

Sensitivity Index (SI):

Determined via 5-bin variance analysis:

SI = Var(mean score per bin) / Var(all scores),  SI ∈ [0, 1]

Higher SI = higher risk of performance collapse from small drift.
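A sketch of the 5-bin variance analysis under the stated definition; the exact binning and variance conventions used in the report may differ:

```python
from statistics import mean, pvariance

def sensitivity_index(param_values, scores, bins=5):
    """SI = Var(mean score per bin) / Var(all scores), over equal-width
    bins of the parameter range. Illustrative sketch only."""
    lo, hi = min(param_values), max(param_values)
    width = (hi - lo) / bins or 1.0
    buckets = [[] for _ in range(bins)]
    for p, s in zip(param_values, scores):
        buckets[min(int((p - lo) / width), bins - 1)].append(s)
    bin_means = [mean(b) for b in buckets if b]
    return pvariance(bin_means) / pvariance(scores)
```

A score that tracks the parameter closely yields SI near 1 (fragile), while a score that is flat across the parameter range yields SI near 0 (stable).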

PSI Thresholds (Drift Analysis):

  • < 0.15 Stable: Optimal value is consistent over time.
  • 0.15 – 0.40 Moderate: Optimal value shifts with regimes.
  • > 0.40 Unstable: Optimal value is random (likely noise).

Trading Intensity & Cost Drag

How often the strategy trades, what costs eat into returns, and how much capital it can handle. Break-even slippage tells you how much worse execution you can tolerate before the edge disappears.

What to do: If cost-to-edge is FAIL or WARNING, reduce turnover or improve execution quality. If break-even slippage is low, do not scale position size before execution is sufficient.

[How to Read Trading Intensity & Cost Drag]

We report trading frequency, costs (fees, slippage, market impact), break-even slippage, and capacity (AUM levels where returns drop −10%, −25%, or collapse entirely). Cost-to-edge (costs as % of gross edge): PASS < 20%, WARNING 20–40%, FAIL ≥ 40%.
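The cost-to-edge thresholds map onto a simple check (argument names are ours):

```python
def cost_to_edge_label(annual_cost_pct, gross_edge_pct):
    """Costs as a share of gross edge: PASS < 20%, WARNING 20-40%, FAIL >= 40%."""
    ratio = annual_cost_pct / gross_edge_pct * 100
    if ratio < 20:
        return "PASS"
    return "WARNING" if ratio < 40 else "FAIL"
```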


Technical details

Break-even slippage (bps):
  BreakevenSlippage = AvgNetProfitPerTrade_bps / 2

Market impact per trade:
  Impact = k × σ_daily × (Q / ADV)
  where k = market impact coefficient (calibrated per asset class),
  σ_daily = daily return volatility, Q = trade size, ADV = avg daily volume

Cost drag:
  CostDrag% = −AnnualCost%
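These formulas translate directly; we read the division by 2 in break-even slippage as the two fills (entry and exit) of a round trip:

```python
def breakeven_slippage_bps(avg_net_profit_per_trade_bps):
    """Tolerable slippage per fill: half the average net profit per trade."""
    return avg_net_profit_per_trade_bps / 2

def market_impact(k, sigma_daily, trade_size, adv):
    """Impact = k × σ_daily × (Q / ADV); k is calibrated per asset class."""
    return k * sigma_daily * (trade_size / adv)
```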

Strategy Action Plan

Concrete next steps derived from the full analysis. Use the action plan to prioritise what to fix first and to determine whether the strategy is ready to deploy.

Action Plan is not financial advice; it reflects statistical findings from the analysis.

[How the Strategy Action Plan Works]

  • Action items by priority: parameters, execution, risk
  • Slippage tolerance: safe range vs dangerous threshold
  • Deployment readiness: go / wait / fix

Risk Metrics (Out-of-Sample)

Risk is measured exclusively on data the strategy was not tuned on. We report return distribution statistics and optional narratives. The risk verdict (STABLE, CAUTION, or UNSTABLE) is accompanied by a recommended next action and maximum leverage guidance.

What to do: If risk is CAUTION or UNSTABLE, reduce position size or leverage and address the root causes (e.g. tail risk, deep drawdowns) before deploying.

[How to Read Risk Metrics (Out-of-Sample)]

Reported

  • Max drawdown, average drawdown, recovery time
  • Sharpe ratio, Sortino ratio, profit factor, win rate
  • VaR 95%, Expected Shortfall (CVaR)
  • Return skewness, kurtosis (fat tails), tail ratio
  • Verdict: STABLE / CAUTION / UNSTABLE
  • Recommendation: status, next steps, max leverage

Kiploks Robustness Score

A single number from 0 to 100 built from four components: Validation (does the edge hold out-of-sample?), Risk (reward vs drawdown and recovery), Stability (small parameter changes do not break the strategy), and Execution (edge survives fees and slippage).

What to do: If the score is 0 despite a profitable backtest, one of the four modules or the Data Quality Guard is blocking it (e.g. DQG reject, WFE fail, or execution cost exceeding the edge). Check the report for which gate failed and fix that first.

[How the Kiploks Robustness Score Works]

Formula (same as in report)

Score = (V^0.4 × R^0.3 × S^0.2 × E^0.1) × DQG × 100

where V = Validation, R = Risk, S = Stability, E = Execution, DQG = Data Quality Guard (0 or 1)

If any of V, R, S, E = 0, or DQG = 0, then Score = 0
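The score formula and its zero-gate rule as a sketch:

```python
def robustness_score(v, r, s, e, dqg):
    """Score = (V^0.4 × R^0.3 × S^0.2 × E^0.1) × DQG × 100.
    Any zero component, or a blocked DQG, forces the score to 0."""
    if 0 in (v, r, s, e) or dqg == 0:
        return 0.0
    return v**0.4 * r**0.3 * s**0.2 * e**0.1 * dqg * 100
```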

Final Verdict Summary

One screen answers: launch now, wait, or drop the strategy. The verdict is computed from the robustness score, Data Quality Guard, Walk-Forward Efficiency, risk and cost checks.

The checklist shows which validation and risk gates passed or failed; the summary text explains what works, what fails, and what we recommend next.

What to do: Follow the recommended action in the summary. If CAUTION or DO NOT DEPLOY, fix the flagged issues before reassessing deployment.

[How to Read the Final Verdict]

What you see

  • Verdict: ROBUST / CAUTION / DO NOT DEPLOY
  • Checklist: which gates passed or failed
  • Summary + recommended action

Verdict meaning

  • ROBUST: Strategy passes validation, risk, and cost checks. Review the full report and consider deploying.
  • CAUTION: Some gates failed or are borderline. Fix the flagged issues before deploying.
  • DO NOT DEPLOY: Critical failures detected (e.g. data quality, WFE, or risk). Address all summary items before any deployment.

Frequently asked questions

What does Walk-Forward Efficiency (WFE) = N/A mean?
We need at least 3 time windows with in-sample return > 0 to compute WFE. If you have fewer - for example because the backtest history is too short or the windows are too wide - WFE is not applicable. Add more data or shorten windows and re-run.
Why is my robustness score 0 if the strategy is profitable in the backtest?
The score is built from four modules (Validation, Risk, Stability, Execution) and is blocked when any module scores 0 or when the Data Quality Guard fails. A profitable backtest does not guarantee that all four modules pass. Check the report to see which gate failed - for example DQG reject, WFE below threshold, or execution costs exceeding the edge - and fix that first.
How do I interpret CAUTION vs DO NOT DEPLOY?
CAUTION means one or more checks failed or are borderline. The strategy has potential but requires specific fixes before deployment - the report tells you what to address. DO NOT DEPLOY means critical failures across data quality, validation, or risk. No deployment should occur until the listed issues are resolved.
What is the gap between WFE 0.5 and 0.7 in the thresholds?
The range 0.5 - 0.7 is labelled ACCEPTABLE: the edge carries over adequately but is not yet in the robust zone. Monitor for decay over time and consider whether the strategy needs more validation windows before full deployment.
Why is my Profit Factor capped at 20.0?
Profit Factor is gross profit divided by gross loss. When total losses approach zero the ratio blows up toward infinity, so we cap it at 20.0 to keep the metric numerically stable and comparable across strategies.
Do the terms on this page match what I see in my report?
Yes. We use the same labels throughout: Walk-Forward Efficiency (WFE), Data Quality Guard (DQG), Kiploks Robustness Score, and the verdicts ROBUST / CAUTION / DO NOT DEPLOY. If you see a term in the report that is not defined here, check the glossary at the top of this page.

See it in action

The main page shows an example analysis with the same blocks you see in a real report. Run a robustness analysis to get your own verdict, robustness score, and full breakdown.