gfw.common.beam.pipeline.Pipeline#

class Pipeline(name='', version='0.1.0', dag=None, pre_hooks=(), post_hooks=(), unparsed_args=(), **options)[source]#

Wrapper around beam.Pipeline with extended functionality.

Features:
  • Merges unparsed, parsed, and default options.

  • Supports custom DAG definitions.

  • Enables Google Cloud Profiler integration.

  • Automatically adds ./setup.py when sdk_container_image is not specified.

You can implement your own Dag object to be injected in the constructor, reuse the provided LinearDag, or just override the apply_dag() method of this class.

Parameters:
  • name (str) – The name of the pipeline. Defaults to an empty string.

  • version (str) – The version of the pipeline. Defaults to 0.1.0.

  • dag (Dag | None) – The DAG to be applied to the pipeline. Defaults to an empty LinearDag.

  • pre_hooks (Sequence[Callable[[...], None]]) – Sequence of callables executed before pipeline run. Each callable receives the pipeline instance as its only argument.

  • post_hooks (Sequence[Callable[[...], None]]) – Sequence of callables executed after pipeline run completes successfully. Each callable receives the pipeline instance as its only argument.

  • unparsed_args (Tuple[str, ...]) – A list of unparsed arguments to pass to Beam options. Defaults to an empty tuple.

  • **options (Any) – Additional options to pass to the Beam pipeline.

parsed_args#

The parsed arguments from the unparsed_args list.

pipeline_options#

The merged pipeline options, including parsed args, user options, and defaults.

pipeline#

The initialized beam Pipeline object.

Methods

apply_dag

Applies the provided DAG implementation to the self.pipeline.

default_options

Returns the default options for the pipeline.

run

Executes the Apache Beam pipeline.

Attributes

cloud_options

Returns the GoogleCloudOptions view of the PipelineOptions.

is_streaming

Returns whether the pipeline is running in streaming mode.

parsed_args

Parses the unparsed arguments using Beam's PipelineOptions.

pipeline

Returns the initialized beam.Pipeline object.

pipeline_options

Resolves pipeline options.

property parsed_args: dict[str, Any]#

Parses the unparsed arguments using Beam’s PipelineOptions.

Returns:

A dictionary of parsed arguments.

property pipeline_options: PipelineOptions#

Resolves pipeline options.

Combines parsed arguments by beam CLI, constructor parameters and defaults into a single PipelineOptions object.

Returns:

The merged pipeline options.

property cloud_options: GoogleCloudOptions#

Returns the GoogleCloudOptions view of the PipelineOptions.

property pipeline: Pipeline#

Returns the initialized beam.Pipeline object.

property is_streaming: bool#

Returns whether the pipeline is running in streaming mode.

apply_dag()[source]#

Applies the provided DAG implementation to the self.pipeline.

Return type:

PCollection

run(wait_until_finish=None)[source]#

Executes the Apache Beam pipeline.

Runs the configured DAG and returns the pipeline result along with its main output(s).

The execution can be either blocking or non-blocking depending on the pipeline type and the wait_until_finish parameter:

  • If wait_until_finish is None (default): - batch pipelines block until completion - streaming pipelines return immediately after submission

  • If wait_until_finish is explicitly set: - True blocks until the pipeline finishes - False returns immediately

Note that waiting on a streaming pipeline will block indefinitely unless the job is externally cancelled.

Post-hooks are only executed when the pipeline finishes successfully (i.e., in blocking mode and when the final state is DONE).

Parameters:

wait_until_finish (bool | None) – Whether to wait for the pipeline execution to complete. If None, the behavior is inferred from whether the pipeline is running in streaming mode.

Returns:

  • The PipelineResult of the executed pipeline.

  • The main output(s) produced by the DAG, which may be a PCollection, a tuple, a dict, or None.

Return type:

A tuple containing

static default_options()[source]#

Returns the default options for the pipeline.

Return type:

dict[str, Any]