Setting up and Running GCP Batch


Google Cloud Batch is a fully managed service that enables you to run batch workloads on Google Cloud. This guide will walk you through setting up and running batch jobs using GCP Batch, with a focus on practical implementation and best practices.

Prerequisites

Before getting started, ensure you have:

  1. GCP Account Setup:
    • Active Google Cloud account
    • Appropriate IAM permissions
    • Google Cloud SDK installed
    • Python 3.8+ installed
  2. Required GCP Services:
    • Cloud Batch
    • Cloud Storage
    • Compute Engine
    • Cloud Logging
    • Cloud Monitoring

Initial Setup

1. Google Cloud SDK Configuration

First, configure your Google Cloud credentials:

# Configure Google Cloud SDK and Application Default Credentials
gcloud init
gcloud auth application-default login

# Enable the required services
gcloud services enable batch.googleapis.com compute.googleapis.com \
    storage.googleapis.com logging.googleapis.com monitoring.googleapis.com

# Verify Batch access
gcloud batch jobs list --location=us-central1

2. Python Environment Setup

Set up your Python environment:

# Create virtual environment
python -m venv gcp-batch-env
source gcp-batch-env/bin/activate

# Install required packages
pip install google-cloud-batch google-cloud-storage
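
To confirm that the client library and your Application Default Credentials work, run a quick smoke test. This is a minimal sketch; the project ID and region are placeholders:

from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()
parent = "projects/your-project-id/locations/us-central1"  # placeholders

# Lists existing Batch jobs; an empty result still proves auth works
for job in client.list_jobs(parent=parent):
    print(job.name, job.status.state)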

Infrastructure Setup

1. VPC Configuration

Create a VPC network for your batch jobs:

from google.cloud import compute_v1
from google.api_core.exceptions import GoogleAPICallError

class BatchInfrastructure:
    def __init__(self):
        self.compute_client = compute_v1.NetworksClient()
    
    def create_vpc(self, project_id, network_name):
        try:
            network = compute_v1.Network()
            network.name = network_name
            network.auto_create_subnetworks = False
            
            operation = self.compute_client.insert(
                project=project_id,
                network_resource=network
            )
            return operation.result()
        except GoogleAPICallError as e:
            print(f"Error creating VPC: {e}")
            return None
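
Because auto_create_subnetworks is disabled, the new network starts with no subnets, and Batch VMs need one in the region where jobs run. A minimal sketch, assuming an illustrative region and CIDR range:

from google.cloud import compute_v1

def create_subnet(project_id, network_name, region="us-central1",
                  cidr="10.0.0.0/24"):
    # Create a subnet in the region where Batch jobs will run
    client = compute_v1.SubnetworksClient()
    subnet = compute_v1.Subnetwork(
        name=f"{network_name}-subnet",
        network=f"projects/{project_id}/global/networks/{network_name}",
        ip_cidr_range=cidr,
        region=region
    )
    operation = client.insert(
        project=project_id,
        region=region,
        subnetwork_resource=subnet
    )
    return operation.result()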

2. Compute Environment

GCP Batch has no standalone compute environment to provision ahead of time. Per-task requirements are declared on the TaskSpec, and VM provisioning is controlled by the job's AllocationPolicy. Build both up front so they can be attached to the job definition in the next section:

from google.cloud import batch_v1

def create_compute_environment():
    """Build the compute pieces of a Batch job: per-task resource
    requests plus an allocation policy describing the VMs to provision.
    """
    # Resources each task may consume
    compute_resource = batch_v1.ComputeResource(
        cpu_milli=2000,        # 2 vCPUs
        memory_mib=4096,       # 4 GiB
        boot_disk_mib=102400   # 100 GiB boot disk
    )

    # VMs Batch should create to run the tasks
    allocation_policy = batch_v1.AllocationPolicy(
        instances=[
            batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
                policy=batch_v1.AllocationPolicy.InstancePolicy(
                    machine_type="e2-standard-2"
                )
            )
        ]
    )
    return compute_resource, allocation_policy

Job Definition

1. Container Setup

Create a Docker container for your batch job:

# Dockerfile
FROM python:3.8-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY batch_job.py .

CMD ["python", "batch_job.py"]
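
The batch_job.py the image runs is your actual workload. As a hypothetical placeholder, here is a script that shards work using the BATCH_TASK_INDEX and BATCH_TASK_COUNT variables Batch injects into every task:

# batch_job.py
import os

def main():
    # Batch sets these on each task of a task group
    task_index = int(os.environ.get("BATCH_TASK_INDEX", 0))
    task_count = int(os.environ.get("BATCH_TASK_COUNT", 1))
    environment = os.environ.get("ENVIRONMENT", "development")

    print(f"[{environment}] processing shard {task_index + 1} of {task_count}")

if __name__ == "__main__":
    main()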

2. Job Definition

Define your batch job. The helper below builds the Job object locally; submission happens in the Running Jobs section:

def create_job_definition(compute_resource=None, allocation_policy=None):
    # Container runnable: note that Container is nested under Runnable,
    # and environment variables live on the runnable, not the container
    runnable = batch_v1.Runnable(
        container=batch_v1.Runnable.Container(
            image_uri="gcr.io/your-project/batch-job:latest",
            commands=["python", "batch_job.py"]
        ),
        environment=batch_v1.Environment(
            variables={"ENVIRONMENT": "production"}
        )
    )

    if compute_resource is None:
        compute_resource = batch_v1.ComputeResource(
            cpu_milli=1000,   # 1 vCPU
            memory_mib=2048   # 2 GiB
        )

    task = batch_v1.TaskSpec(
        runnables=[runnable],
        compute_resource=compute_resource
    )

    # Job.name is output-only; the job ID is supplied at submission time
    job = batch_v1.Job(
        task_groups=[
            batch_v1.TaskGroup(task_count=1, task_spec=task)
        ]
    )
    if allocation_policy is not None:
        job.allocation_policy = allocation_policy
    return job
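
Wiring the pieces together, assuming the helpers defined above:

# Build a job using the allocation policy from the compute environment step
compute_resource, allocation_policy = create_compute_environment()
job = create_job_definition(compute_resource, allocation_policy)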

Running Jobs

1. Job Submission

Submit a previously built job definition under a chosen job ID. Batch jobs have no generic parameters field; pass per-job values as environment variables on the task spec instead:

def submit_job(project_id, location, job_name, job):
    try:
        client = batch_v1.BatchServiceClient()

        # Job IDs must be unique within the project and location
        request = batch_v1.CreateJobRequest(
            parent=f"projects/{project_id}/locations/{location}",
            job_id=job_name,
            job=job
        )

        response = client.create_job(request)
        return response.name
    except GoogleAPICallError as e:
        print(f"Error submitting job: {e}")
        return None

2. Job Monitoring

Monitor job execution:

def monitor_job(project_id, location, job_name):
    try:
        client = batch_v1.BatchServiceClient()

        name = f"projects/{project_id}/locations/{location}/jobs/{job_name}"
        response = client.get_job(name=name)

        # JobStatus exposes state and run duration; creation time
        # lives on the job itself
        return {
            'state': response.status.state,
            'run_duration': response.status.run_duration,
            'create_time': response.create_time
        }
    except GoogleAPICallError as e:
        print(f"Error monitoring job: {e}")
        return None
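
A simple polling loop on top of monitor_job that waits for a terminal state (the interval is arbitrary):

import time

def wait_for_job(project_id, location, job_name, poll_seconds=30):
    terminal_states = {
        batch_v1.JobStatus.State.SUCCEEDED,
        batch_v1.JobStatus.State.FAILED,
    }
    while True:
        status = monitor_job(project_id, location, job_name)
        if status is None or status['state'] in terminal_states:
            return status
        time.sleep(poll_seconds)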

Error Handling

1. Retry Logic

Implement retry logic for rate-limited API calls:

from tenacity import retry, stop_after_attempt, wait_exponential
from google.api_core.exceptions import GoogleAPICallError

class BatchJobHandler:
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
    def execute_with_retry(self, func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        except GoogleAPICallError as e:
            if e.code == 429:  # Rate limited: re-raise so tenacity retries
                raise
            # Other API errors are treated as non-retryable
            print(f"Non-retryable error: {e}")
            return None
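
For example, submitting through the handler so transient rate limits are retried (job is the definition built earlier; the project, region, and job ID are placeholders):

handler = BatchJobHandler()
job_name = handler.execute_with_retry(
    submit_job, "your-project-id", "us-central1", "example-job", job
)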

2. Error Logging

Set up comprehensive error logging:

from google.cloud import logging

from google.cloud import logging

def log_job_error(project_id, job_id, error):
    try:
        client = logging.Client(project=project_id)
        logger = client.logger('batch-jobs')

        # Severity is a parameter of log_struct, not part of the payload
        logger.log_struct(
            {'job_id': job_id, 'error': str(error)},
            severity='ERROR'
        )
    except GoogleAPICallError as e:
        print(f"Error logging job error: {e}")

Best Practices

1. Resource Optimization

Right-size resource requests before submission. Batch jobs are immutable once created, so tune the definition rather than trying to update a running job:

def optimize_resources(job):
    try:
        spec = job.task_groups[0].task_spec
        spec.compute_resource.cpu_milli = 2000   # 2 vCPUs
        spec.compute_resource.memory_mib = 4096  # 4 GiB
        return job
    except (IndexError, AttributeError) as e:
        print(f"Error optimizing resources: {e}")
        return None

2. Cost Management

Estimate job cost up front so over-provisioned requests are caught before submission:

def estimate_cost(job, duration):
    """Rough cost estimate; duration is in hours and the per-unit
    rates are illustrative placeholders, not current GCP pricing."""
    try:
        resource = job.task_groups[0].task_spec.compute_resource
        cpu_cost = (resource.cpu_milli / 1000) * duration * 0.00001       # per vCPU-hour
        memory_cost = (resource.memory_mib / 1024) * duration * 0.000005  # per GiB-hour

        return cpu_cost + memory_cost
    except Exception as e:
        print(f"Error estimating cost: {e}")
        return None
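
For instance, a job sized at 2 vCPUs and 4 GiB running for 10 hours under the placeholder rates:

cost = estimate_cost(job, duration=10)  # duration in hours
if cost is not None:
    print(f"Estimated cost: ${cost:.4f}")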

Security Considerations

1. IAM Roles

Set up appropriate IAM roles. The policy below grants read-only job access to a user and job management rights to the Batch service account:

{
  "bindings": [
    {
      "role": "roles/batch.jobsViewer",
      "members": [
        "user:user@example.com"
      ]
    },
    {
      "role": "roles/batch.jobsEditor",
      "members": [
        "serviceAccount:batch-service@project.iam.gserviceaccount.com"
      ]
    }
  ]
}

2. Network Security

Configure network security:

def configure_security(project_id, network_name):
    try:
        client = compute_v1.FirewallsClient()

        firewall = compute_v1.Firewall()
        firewall.name = f"{network_name}-allow-http"
        firewall.network = f"projects/{project_id}/global/networks/{network_name}"
        firewall.allowed = [
            compute_v1.Allowed(
                I_p_protocol="tcp",  # field name as generated by the compute client
                ports=["80"]
            )
        ]
        # 0.0.0.0/0 exposes the port to the internet; narrow this to
        # known CIDR ranges for production workloads
        firewall.source_ranges = ["0.0.0.0/0"]

        operation = client.insert(
            project=project_id,
            firewall_resource=firewall
        )
        return operation.result()
    except GoogleAPICallError as e:
        print(f"Error configuring security: {e}")
        return None

Monitoring and Logging

1. Cloud Monitoring Integration

Set up Cloud Monitoring:

from google.api import metric_pb2
from google.cloud import monitoring_v3

def setup_monitoring(project_id):
    try:
        client = monitoring_v3.MetricServiceClient()

        project_name = f"projects/{project_id}"
        # MetricDescriptor comes from google.api, not monitoring_v3
        descriptor = metric_pb2.MetricDescriptor(
            type="custom.googleapis.com/batch/job_errors",
            display_name="Batch Job Errors",
            metric_kind=metric_pb2.MetricDescriptor.MetricKind.GAUGE,
            value_type=metric_pb2.MetricDescriptor.ValueType.INT64
        )

        response = client.create_metric_descriptor(
            name=project_name,
            metric_descriptor=descriptor
        )
        return response
    except GoogleAPICallError as e:
        print(f"Error setting up monitoring: {e}")
        return None

2. Log Management

Configure log management:

def configure_logging(project_id, job_name):
    try:
        client = logging.Client(project=project_id)

        # Batch task output is written to the batch_task_logs log;
        # entries can be narrowed further by job labels if needed
        log_filter = f'logName="projects/{project_id}/logs/batch_task_logs"'

        # Create a sink that exports matching entries to BigQuery
        sink = client.sink(
            f"batch-jobs-{job_name}",
            filter_=log_filter,
            destination=f"bigquery.googleapis.com/projects/{project_id}/datasets/batch_logs"
        )
        sink.create()

        return sink
    except GoogleAPICallError as e:
        print(f"Error configuring logging: {e}")
        return None

Conclusion

GCP Batch provides a powerful platform for running batch computing jobs. By following this guide, you can:

  1. Set up and configure GCP Batch infrastructure
  2. Create and manage job definitions
  3. Submit and monitor batch jobs
  4. Implement security best practices
  5. Optimize costs and resources

Remember to:

  • Regularly monitor job performance
  • Implement proper error handling
  • Follow security best practices
  • Optimize resource usage
  • Keep track of costs

With proper implementation, GCP Batch can significantly streamline your batch computing workflows while maintaining security and cost-effectiveness.