gcloud dataproc clusters gke create - create a GKE-based virtual cluster
gcloud dataproc clusters gke create (CLUSTER : --region=REGION) --spark-engine-version=SPARK_ENGINE_VERSION (--gke-cluster=GKE_CLUSTER : --gke-cluster-location=GKE_CLUSTER_LOCATION) [--async] [--namespace=NAMESPACE] [--pools=[KEY=VALUE[;VALUE],...]] [--properties=[PREFIX:PROPERTY=VALUE,...]] [--setup-workload-identity] [--staging-bucket=STAGING_BUCKET] [--history-server-cluster=HISTORY_SERVER_CLUSTER : --history-server-cluster-region=HISTORY_SERVER_CLUSTER_REGION] [--metastore-service=METASTORE_SERVICE : --metastore-service-location=METASTORE_SERVICE_LOCATION] [GCLOUD_WIDE_FLAG ...]
Create a GKE-based virtual cluster.
Create a Dataproc on GKE cluster in us-central1 on a GKE cluster in the same project and region with default values:
$ gcloud dataproc clusters gke create my-cluster \ --region=us-central1 --gke-cluster=my-gke-cluster \ --spark-engine-version=latest --pools='name=dp,roles=default'
Create a Dataproc on GKE cluster in us-central1 on a GKE cluster in the same project and zone us-central1-f with default values:
$ gcloud dataproc clusters gke create my-cluster \ --region=us-central1 --gke-cluster=my-gke-cluster \ --gke-cluster-location=us-central1-f \ --spark-engine-version=3.1 --pools='name=dp,roles=default'
Create a Dataproc on GKE cluster in us-central1 with machine type 'e2-standard-4', autoscaling 5-15 nodes per zone.
$ gcloud dataproc clusters gke create my-cluster \ --region='us-central1' \ --gke-cluster='projects/my-project/locations/us-central1/cluster\ s/my-gke-cluster' --spark-engine-version=dataproc-1.5 \ --pools='name=dp-default,roles=default,machineType=e2-standard-4\ ,min=5,max=15'
Create a Dataproc on GKE cluster in us-central1 with two distinct node pools.
$ gcloud dataproc clusters gke create my-cluster \ --region='us-central1' --gke-cluster='my-gke-cluster' \ --spark-engine-version='dataproc-2.0' \ --pools='name=dp-default,roles=default,machineType=e2-standard-4\ --pools='name=workers,roles=spark-drivers;spark-executors,machin\ eType=n2-standard-8
- Cluster resource - The name of the cluster to create. The arguments in this
group can be used to specify the attributes of this resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways. To set the project attribute:
- —
provide the argument cluster on the command line with a fully specified name;
- —
provide the argument --project on the command line;
- —
set the property core/project.
This must be specified.
- CLUSTER
ID of the cluster or fully qualified identifier for the cluster. To set the cluster attribute:
provide the argument cluster on the command line.
This positional argument must be specified if any of the other arguments in this group are specified.
- --region=REGION
Dataproc region for the cluster. Each Dataproc region constitutes an independent resource namespace constrained to deploying instances into Compute Engine zones inside the region. Overrides the default dataproc/region property value for this command invocation. To set the region attribute:
provide the argument cluster on the command line with a fully specified name;
provide the argument --region on the command line;
set the property dataproc/region.
- --spark-engine-version=SPARK_ENGINE_VERSION
The version of the Spark engine to run on this cluster.
- Gke cluster resource - The GKE cluster to install the Dataproc cluster on. The
arguments in this group can be used to specify the attributes of this resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways. To set the project attribute:
- —
provide the argument --gke-cluster on the command line with a fully specified name;
- —
provide the argument --project on the command line;
- —
set the property core/project.
This must be specified.
- --gke-cluster=GKE_CLUSTER
ID of the gke-cluster or fully qualified identifier for the gke-cluster. To set the gke-cluster attribute:
provide the argument --gke-cluster on the command line.
This flag argument must be specified if any of the other arguments in this group are specified.
- --gke-cluster-location=GKE_CLUSTER_LOCATION
GKE region for the gke-cluster. To set the gke-cluster-location attribute:
provide the argument --gke-cluster on the command line with a fully specified name;
provide the argument --gke-cluster-location on the command line;
provide the argument --region on the command line;
set the property dataproc/region.
- --async
Return immediately, without waiting for the operation in progress to complete.
- --namespace=NAMESPACE
The name of the Kubernetes namespace to deploy Dataproc system components in. This namespace does not need to exist.
- --pools=[KEY=VALUE[;VALUE],...]
Each --pools flag represents a GKE node pool associated with the virtual cluster. It is comprised of a CSV in the form KEY=VALUE[;VALUE], where certain keys may have multiple values.
The following KEYs must be specified:
----------------------------------------------------------------------------------------------------------- KEY Type Example Description ------ ---------------- ------------------------ ---------------------------------------------------------- name string `my-node-pool` Name of the node pool.
roles repeated string `default;spark-driver` Roles that this node pool should perform. Valid values are `default`, `controller`, `spark-driver`, `spark-executor`. -----------------------------------------------------------------------------------------------------------
The following KEYs may be specified:
---------------------------------------------------------------------------------------------------------------------------------------------------------------- KEY Type Example Description --------------- ---------------- --------------------------------------------- --------------------------------------------------------------------------------- machineType string `n1-standard-8` Compute Engine machine type to use.
preemptible boolean `false` If true, then this node pool uses preemptible VMs. This cannot be true on the node pool with the `controllers` role (or `default` role if `controllers` role is not specified).
localSsdCount int `2` The number of local SSDs to attach to each node.
accelerator repeated string `nvidia-tesla-a100=1` Accelerators to attach to each node. In the format NAME=COUNT.
minCpuPlatform string `Intel Skylake` Minimum CPU platform for each node.
bootDiskKmsKey string `projects/project-id/locations/us-central1 The Customer Managed Encryption Key (CMEK) used to encrypt /keyRings/keyRing-name/cryptoKeys/key-name` the boot disk attached to each node in the node pool.
locations repeated string `us-west1-a;us-west1-c` Zones within the location of the GKE cluster. All `--pools` flags for a Dataproc cluster must have identical locations.
min int `0` Minimum number of nodes per zone that this node pool can scale down to.
max int `10` Maximum number of nodes per zone that this node pool can scale up to. ----------------------------------------------------------------------------------------------------------------------------------------------------------------
- --properties=[PREFIX:PROPERTY=VALUE,...]
Specifies configuration properties for installed packages, such as Spark. Properties are mapped to configuration files by specifying a prefix, such as "core:io.serializations".
- --setup-workload-identity
Sets up the GKE Workload Identity for your Dataproc on GKE cluster. Note that running this requires elevated permissions as it will manipulate IAM policies on the Google Service Accounts that will be used by your Dataproc on GKE cluster.
- --staging-bucket=STAGING_BUCKET
The Cloud Storage bucket to use to stage job dependencies, miscellaneous config files, and job driver console output when using this cluster.
- History server cluster resource - A Dataproc Cluster created as a History
Server, see https://cloud.google.com/dataproc/docs/concepts/jobs/history-server The arguments in this group can be used to specify the attributes of this resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways. To set the project attribute:
- —
provide the argument --history-server-cluster on the command line with a fully specified name;
- —
provide the argument --project on the command line;
- —
set the property core/project.
- --history-server-cluster=HISTORY_SERVER_CLUSTER
ID of the history-server-cluster or fully qualified identifier for the history-server-cluster. To set the history-server-cluster attribute:
provide the argument --history-server-cluster on the command line.
This flag argument must be specified if any of the other arguments in this group are specified.
- --history-server-cluster-region=HISTORY_SERVER_CLUSTER_REGION
Compute Engine region for the history-server-cluster. It must be the same region as the Dataproc cluster that is being created. To set the history-server-cluster-region attribute:
provide the argument --history-server-cluster on the command line with a fully specified name;
provide the argument --history-server-cluster-region on the command line;
provide the argument --region on the command line;
set the property dataproc/region.
- Metastore service resource - Dataproc Metastore Service to be used as an
external metastore. The arguments in this group can be used to specify the attributes of this resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways. To set the project attribute:
- —
provide the argument --metastore-service on the command line with a fully specified name;
- —
provide the argument --project on the command line;
- —
set the property core/project.
- --metastore-service=METASTORE_SERVICE
ID of the metastore-service or fully qualified identifier for the metastore-service. To set the metastore-service attribute:
provide the argument --metastore-service on the command line.
This flag argument must be specified if any of the other arguments in this group are specified.
- --metastore-service-location=METASTORE_SERVICE_LOCATION
Dataproc Metastore location for the metastore-service. To set the metastore-service-location attribute:
provide the argument --metastore-service on the command line with a fully specified name;
provide the argument --metastore-service-location on the command line;
provide the argument --region on the command line;
set the property dataproc/region.
These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.
Run $ gcloud help for details.
These variants are also available:
$ gcloud alpha dataproc clusters gke create
$ gcloud beta dataproc clusters gke create