Add allowFailure flag on workflow steps
Opened by swampadmin · 1/25/2025
Problem
When building test or diagnostic workflows, some steps may fail due to external constraints (e.g., billing plan limitations, optional features not configured) rather than actual errors. Currently, any failed step marks the entire job and workflow as "failed", even when the failure is expected and acceptable.
There is no way to mark a step as "optional" or "allowed to fail." The existing dependency condition system (completed, always, etc.) controls whether downstream steps run, but doesn't prevent the failed step from counting toward the overall workflow status.
Proposed Solution
Add an allowFailure (or continueOnError) boolean flag to the step schema:
    steps:
      - name: check-log-streaming
        description: Check log streaming config (may 403 on free plans)
        allowFailure: true
        task:
          type: model_method
          modelIdOrName: my-log-config
          methodName: get
          inputs:
            logType: configuration

When allowFailure: true:
- If the step succeeds, it reports as succeeded as normal
- If the step fails, it reports as something like failed_allowed or warning instead of failed
- The step's failure does NOT propagate to the job or workflow status
- Downstream steps with dependsOn: succeeded still skip (the step did fail), but dependsOn: completed would fire
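The rules above can be sketched as a small status roll-up. This is only an illustration of the proposed semantics, not the actual engine code: the Step class, the function names, and the failed_allowed status string are hypothetical, and "completed" is assumed to mean "the upstream step finished in any state".

```python
from dataclasses import dataclass

# Hypothetical status names; the issue only suggests "failed_allowed" or "warning".
SUCCEEDED = "succeeded"
FAILED = "failed"
FAILED_ALLOWED = "failed_allowed"


@dataclass
class Step:
    name: str
    ok: bool                      # did the underlying task succeed?
    allow_failure: bool = False   # mirrors the proposed allowFailure flag


def step_status(step: Step) -> str:
    """Report a tolerated failure as failed_allowed instead of failed."""
    if step.ok:
        return SUCCEEDED
    return FAILED_ALLOWED if step.allow_failure else FAILED


def workflow_status(steps: list[Step]) -> str:
    """Only hard failures propagate to the job/workflow status."""
    statuses = [step_status(s) for s in steps]
    return FAILED if FAILED in statuses else SUCCEEDED


def dependency_fires(condition: str, upstream: Step) -> bool:
    """Evaluate a dependsOn condition against a finished upstream step."""
    status = step_status(upstream)
    if condition == "succeeded":
        # An allowed failure is still a failure, so this dependent is skipped.
        return status == SUCCEEDED
    if condition == "completed":
        # Assumption: "completed" fires whenever the upstream step has finished.
        return True
    raise ValueError(f"unsupported condition: {condition}")
```

With this shape, a failing step that sets the flag leaves workflow_status at succeeded, while a failing step without the flag still marks the workflow as failed.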
Use Case
We have a monolith test workflow (test-all-models) that exercises all 10 Tailscale model types in parallel. Two of the steps call logConfig.get, which returns HTTP 403 ("feature not available on current billing plan") on free-tier tailnets. The API call and error handling work correctly; the feature simply isn't available on that plan.
Without allowFailure, the workflow reports as "failed" even though 14 of 16 steps pass and the 2 failures are expected. This makes the workflow unusable as an automated health check because the exit code is always non-zero.
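To make the health-check impact concrete, here is a rough sketch of how the exit code would differ for the 16-step run described above. The exit_code helper and the (ok, allow_failure) tuple layout are hypothetical and exist only for this example; the real CLI may compute its exit code differently.

```python
def exit_code(results: list[tuple[bool, bool]]) -> int:
    """results holds (ok, allow_failure) per step; non-zero only on a hard failure."""
    hard_failures = [ok for ok, allowed in results if not ok and not allowed]
    return 1 if hard_failures else 0


# 14 passing steps plus the 2 logConfig.get steps that return HTTP 403.
passing = [(True, False)] * 14
today = passing + [(False, False)] * 2     # no flag: both failures count
proposed = passing + [(False, True)] * 2   # allowFailure: true on both steps

print(exit_code(today))     # 1
print(exit_code(proposed))  # 0
```

Under these assumptions the same run goes from a permanently non-zero exit code to zero, while any unexpected failure among the other 14 steps would still fail the check.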
Alternatives Considered
- Remove the steps: Works but loses visibility into which features are available
- Restructure into separate workflows: Adds complexity without solving the underlying problem
- Use dependency conditions: completed/always let downstream steps run, but the overall workflow status is still "failed"
Closed