Support for Workflow Automation¶
This notebook walks through existing functionality for (meta)data contributors to:
- register workflows,
- install sensor logic for automated workflow execution,
- programmatically register workflow-run state transitions, i.e. run events, and
- programmatically register generated assets, i.e. data and metadata outputs, with any workflow run event.
Register workflows¶
In the nmdc_runtime.api.boot.workflows module, add an entry for your workflow to the _raw list. Examples:
# nmdc_runtime/api/boot/workflows.py
{
"id": "test",
"created_at": datetime(2021, 9, 9, tzinfo=timezone.utc),
"name": "A test workflow",
"description": "For use in unit and integration tests",
},
{
"id": "metadata-in-1.0.0",
"created_at": datetime(2021, 10, 12, tzinfo=timezone.utc),
"name": "general metadata ETL",
"description": "Validate and ingest metadata from JSON files",
},
{
"id": "apply-changesheet-1.0.0",
"created_at": datetime(2021, 9, 30, tzinfo=timezone.utc),
"name": "apply metadata changesheet",
"description": "Validate and apply metadata changes from TSV/CSV files",
},
{
"id": "export-study-biosamples-as-csv-1.0.0",
"created_at": datetime(2022, 6, 8, tzinfo=timezone.utc),
"name": "export study biosamples metadata as CSV",
"description": "Export study biosamples metadata as CSV",
},
That's it. The id field is a primary key administered by workflow authors. That is, it is up to those who register a workflow by id here to ensure that the id corresponds to a semantically invariant version of an unambiguously known workflow (minor and patch updates may vary if no -x.y.z suffix is part of the registered id). Concretely, there is no requirement for, e.g., a commit-hash-including GitHub link to the workflow's entrypoint.
Install sensor logic¶
Sensors are used to:
- orchestrate runs of runtime-site-executable workflows, e.g. validation and ingest of JSON objects and changesheets against the NMDC schema
- create new Job resources for external Sites to claim
In the nmdc_runtime.site.repository module, you may add a function decorated with dagster.sensor (i.e. @sensor preceding the function's def), following the examples already installed.
Alternatively, if your workflow needs to run if and only if a new data object of a certain type is detected by the runtime, then you may declaratively hook into the existing generic nmdc_runtime.site.repository.process_workflow_job_triggers sensor by registering appropriate entries in the _raw lists of nmdc_runtime.api.boot.triggers and nmdc_runtime.api.boot.object_types. See the next subsection for details.
Register object-type and trigger metadata¶
If your workflow needs to run if and only if a new data object of a certain type is detected by the runtime, you can add entries to two modules as per the following examples:
# nmdc_runtime/api/boot/object_types.py
{
"id": "test",
"created_at": datetime(2021, 9, 7, tzinfo=timezone.utc),
"name": "A test object type",
"description": "For use in unit and integration tests",
},
{
"id": "metadata-in",
"created_at": datetime(2021, 6, 1, tzinfo=timezone.utc),
"name": "metadata submission",
"description": "Input to the portal ETL process",
},
{
"id": "metadata-changesheet",
"created_at": datetime(2021, 9, 30, tzinfo=timezone.utc),
"name": "metadata changesheet",
"description": "Specification for changes to existing metadata",
},
# nmdc_runtime/api/boot/triggers.py
{
"created_at": datetime(2021, 9, 9, tzinfo=timezone.utc),
"object_type_id": "test",
"workflow_id": "test",
},
{
"created_at": datetime(2021, 6, 1, tzinfo=timezone.utc),
"object_type_id": "metadata-in",
"workflow_id": "metadata-in-1.0.0",
},
{
"created_at": datetime(2021, 9, 30, tzinfo=timezone.utc),
"object_type_id": "metadata-changesheet",
"workflow_id": "apply-changesheet-1.0.0",
},
The corresponding sensor,
# nmdc_runtime/site/repository.py
@sensor(job=ensure_jobs.to_job(name="ensure_job_triggered", **preset_normal))
def process_workflow_job_triggers(_context):
is activated approximately 30 seconds after the last time it ran, in perpetuity.
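Conceptually, the trigger entries act as a lookup from object type to workflow. The plain-Python sketch below illustrates that lookup using the trigger pairs listed above; it is not the runtime's implementation, and workflows_for_object_types is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict

# Trigger pairs copied from the examples above (created_at omitted for brevity).
triggers = [
    {"object_type_id": "test", "workflow_id": "test"},
    {"object_type_id": "metadata-in", "workflow_id": "metadata-in-1.0.0"},
    {
        "object_type_id": "metadata-changesheet",
        "workflow_id": "apply-changesheet-1.0.0",
    },
]


def workflows_for_object_types(object_type_ids, trigger_entries):
    """Return workflow ids triggered by any of the given object types.

    Order follows the input object types; duplicates are dropped.
    """
    by_type = defaultdict(list)
    for trigger in trigger_entries:
        by_type[trigger["object_type_id"]].append(trigger["workflow_id"])

    matched = []
    for type_id in object_type_ids:
        for workflow_id in by_type.get(type_id, []):
            if workflow_id not in matched:
                matched.append(workflow_id)
    return matched


# A new object tagged as a changesheet maps to the changesheet workflow.
print(workflows_for_object_types(["metadata-changesheet"], triggers))
```

An object whose type has no trigger entry yields an empty list, which corresponds to no job being created for it.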
Register workflow-run state transitions¶
There are currently two ways to register workflow-run state transitions:
- through claiming advertised Jobs and updating corresponding job Operation resources
- direct event registration with /runs API entrypoints
Claiming a Job and updating the spawned Operation resource¶
If you have set up sensor logic to trigger the creation of a workflow Job resource when an appropriate input Object resource is available (see previous section), you may
- GET /jobs to list and filter for relevant jobs
- POST /jobs/{job_id}:claim to claim a job and receive the ID for a new Operation resource with which to register events regarding your workflow job execution.
- PATCH /operations/{op_id} to report on job operation status, including whether it is done or not.
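The three calls above can be sketched with the standard library as follows. Only the HTTP methods and paths come from this notebook; the base URL, the bearer-token header, the example IDs, and the exact shape of the PATCH body are assumptions made for illustration. The functions only build requests, so nothing is sent when the block runs.

```python
import json
from urllib.request import Request

API_BASE = "https://api.example.org"  # hypothetical base URL


def build_request(method, path, token, body=None):
    """Build (but do not send) a JSON request against the API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    # Bearer-token auth is an assumption; use whatever your deployment requires.
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(API_BASE + path, data=data, headers=headers, method=method)


def list_jobs(token):
    return build_request("GET", "/jobs", token)


def claim_job(job_id, token):
    return build_request("POST", f"/jobs/{job_id}:claim", token)


def update_operation(op_id, token, done, metadata=None):
    # Assumed body shape: a `done` flag plus optional `metadata`.
    body = {"done": done}
    if metadata is not None:
        body["metadata"] = metadata
    return build_request("PATCH", f"/operations/{op_id}", token, body)


request = claim_job("job-123", token="my-token")  # hypothetical job ID and token
print(request.get_method(), request.full_url)
```

Passing one of these request objects to urllib.request.urlopen would send it; keeping construction separate from sending makes the calls easy to inspect before pointing them at a live service.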
Direct workflow-execution event registration via /runs entrypoints¶
You may POST /runs/{run_id}/events to post events relevant to your workflow execution. It is your responsibility to supply (1) a run id and (2) a job/workflow id with each posted representation so that events may be collated to recover run provenance. The OpenLineage schema is used for representations.
If a workflow is registered with an executable by the runtime Site, you may POST /runs to request a run given workflow inputs/configuration. In this case, the runtime will return a run ID and will post run events that you may retrieve via GET /runs/{run_id}/events to list a run's events or GET /runs/{run_id} to get a summary of the run and its current status.
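A minimal sketch of an event body for POST /runs/{run_id}/events follows. The field names reflect a subset of the OpenLineage RunEvent as I understand it (eventType, eventTime, run.runId, job.namespace, job.name, producer); the "nmdc" namespace, the producer URL, and the make_run_event helper are hypothetical, and the runtime's API documentation remains the authority on which fields it requires.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(run_id, job_name, event_type,
                   job_namespace="nmdc",
                   producer="https://example.org/my-workflow"):
    """Build an OpenLineage-style run event as a plain dict.

    run_id and job_name carry the two identifiers that each posted
    representation is expected to supply.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "producer": producer,  # hypothetical identifier for the emitting code
    }


run_id = str(uuid.uuid4())  # one id shared by all events of the same run
start = make_run_event(run_id, "metadata-in-1.0.0", "START")
complete = make_run_event(run_id, "metadata-in-1.0.0", "COMPLETE")

# Each dict serializes to the JSON body of one POST to /runs/{run_id}/events.
print(json.dumps(start, indent=2))
```

Reusing the same run id across the start and completion events is what allows them to be collated into one run's provenance.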
Register workflow-generated assets¶
Each mechanism for registering workflow-run state transitions (see previous section) includes a facility for annotating transition representations with metadata about generated assets. Operation resources have result and metadata fields, and RunEvent resources (the representation schema for the /runs entrypoint suite) have outputs fields. The recommendation here is to include qualified references to nmdc:DataObject IDs.
Note that such registration of assets within the representations of Operations and RunEvents supplements, but does not replace, the primary requirement of provenance metadata embedded in submitted NMDC Schema nmdc:Activity representations, which also make reference to used and generated DataObjects.
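One way such references could be assembled is sketched below. The namespace/name dataset shape follows OpenLineage conventions as I understand them; reading "qualified" as "carries the nmdc: prefix", the example IDs, the dataset_refs helper, and the placement of references under an Operation's metadata field are all assumptions made for illustration.

```python
def dataset_refs(data_object_ids, namespace="nmdc"):
    """Turn nmdc:DataObject IDs into OpenLineage-style dataset references."""
    refs = []
    for data_object_id in data_object_ids:
        # Assumption: a qualified ID is one that carries the "nmdc:" prefix.
        if not data_object_id.startswith("nmdc:"):
            raise ValueError(f"unqualified DataObject ID: {data_object_id!r}")
        refs.append({"namespace": namespace, "name": data_object_id})
    return refs


generated = ["nmdc:dobj-example-1", "nmdc:dobj-example-2"]  # hypothetical IDs

# For a RunEvent, the references go in the event's `outputs` field.
run_event_fragment = {"outputs": dataset_refs(generated)}

# For an Operation, one option is to carry them in `metadata` (assumed layout).
operation_fragment = {"done": True, "metadata": {"outputs": generated}}

print(run_event_fragment)
```

Rejecting unprefixed IDs early keeps ambiguous references out of the registered events.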