
Infrastructure Review: GCP Terraform for Sing For Hope

Review Date: November 9, 2025
Reviewer: Claude Code
Project: GCP Infrastructure for Gitea and n8n


Executive Summary

This Terraform project manages GCP infrastructure for hosting Gitea (self-hosted Git) and n8n (workflow automation) on separate VM instances. While the code is clean and functional, it has several critical issues that limit production-readiness, particularly around security, data persistence, and infrastructure-as-code completeness.

Overall Ratings:

  • Production-Readiness: 4/10
  • Code Quality: 6/10
  • Documentation: 7/10
  • Maintainability: 5/10
  • Security: 3/10

Strengths

1. Clean Project Structure

The project follows Terraform best practices with proper file organization:

  • Separate files for variables, outputs, and backend configuration
  • Clear naming conventions
  • Logical resource grouping

2. Good Documentation

The README.md provides:

  • Clear service descriptions
  • Prerequisite checklist
  • Step-by-step usage instructions
  • Honest acknowledgment of manual setup steps

3. Remote State Management

  • Uses GCS backend for state storage
  • Prevents state conflicts in team environments
  • Enables state locking

4. DNS Security

  • DNSSEC enabled with appropriate configuration
  • Uses NSEC3 for non-existence proof
  • Proper key specifications for key signing and zone signing

5. Sensible Defaults

  • Variables have reasonable default values
  • Machine type (e2-small) appropriate for small workloads
  • Standard region/zone selection

Critical Issues

1. Hardcoded Service Account Email

Location: main.tf:53, main.tf:76

service_account {
  email  = "456409048169-compute@developer.gserviceaccount.com"
  scopes = [...]
}

Impact:

  • Code is not portable across projects
  • Violates infrastructure-as-code principles
  • Will fail if used in different GCP projects

Fix: Use Terraform data sources to dynamically fetch the default compute service account:

data "google_compute_default_service_account" "default" {}

service_account {
  email  = data.google_compute_default_service_account.default.email
  scopes = [...]
}

2. Infrastructure Drift (Manual Configuration)

Location: README.md notes manual setup of Docker, Nginx, Certbot

Impact:

  • Infrastructure cannot be reproduced from code alone
  • Disaster recovery requires manual intervention and documentation
  • Team members cannot spin up identical environments
  • Configuration changes aren't tracked in version control

Fix:

  • Add startup scripts to metadata_startup_script in VM resources (see the sketch below)
  • OR use Packer to create pre-configured VM images
  • OR use configuration management tools (Ansible, Cloud Init)
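
As a minimal sketch of the first option, a startup script can be attached directly to the instance resource. The resource name, image, and script contents below are illustrative assumptions rather than the project's actual configuration; in practice the script would be added to the existing instance definitions in main.tf.

resource "google_compute_instance" "gitea" {
  name         = "gitea"
  machine_type = var.machine_type
  zone         = var.zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
    access_config {}
  }

  # Runs on first boot; package installation is safe to re-run.
  metadata_startup_script = <<-EOT
    #!/bin/bash
    set -euo pipefail
    apt-get update
    apt-get install -y docker.io nginx
    systemctl enable --now docker nginx
  EOT
}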

3. Security Vulnerabilities

a. Overly Permissive Firewall Rules

Location: All firewall rules use source_ranges = ["0.0.0.0/0"]

Issues:

  • Gitea port 3000 exposed to the internet (main.tf:82-93)
  • n8n port 5678 exposed to the internet (main.tf:121-132)
  • These should only be accessible via reverse proxy (Nginx)

b. Using Default VPC

Location: All resources use network = "default"

Issues:

  • Default network has permissive routing
  • No network segmentation
  • Shared with other project resources
  • Difficult to implement security best practices

c. No SSH Access Control

  • No explicit SSH firewall rules defined
  • Default GCP rules may be too permissive
  • No bastion host or IAP tunneling
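
One way to close this gap is to allow SSH only through IAP TCP forwarding, which originates from Google's 35.235.240.0/20 range. A hedged sketch follows; the rule name, network, and tags are assumptions:

resource "google_compute_firewall" "allow_iap_ssh" {
  name    = "allow-iap-ssh"
  network = "default" # move to the custom VPC once it exists

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # 35.235.240.0/20 is the source range used by IAP TCP forwarding,
  # so SSH becomes reachable only through an IAP tunnel.
  source_ranges = ["35.235.240.0/20"]
  target_tags   = ["gitea", "n8n"]
}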

4. No Data Persistence Strategy

Locations: main.tf:40-44, main.tf:64-68

Issues:

  • Boot disks use default sizing
  • No separate data volumes for application data
  • Gitea repositories and n8n workflows stored on ephemeral boot disk
  • No backup configuration
  • Risk of data loss if VMs are recreated

Impact: Critical user data (Git repositories, workflow configurations) could be lost during infrastructure updates.


Notable Gaps

5. Missing Startup Automation

Although the README acknowledges that this was intentionally omitted, the lack of startup automation creates operational challenges:

  • New team members can't provision working infrastructure
  • Updates require manual SSH intervention
  • No automated application updates or patching

6. Outdated Operating System

Location: main.tf:42, main.tf:66

The instances currently use debian-cloud/debian-11, while Debian 12 is available and supported.

7. No Monitoring or Alerting

Missing operational visibility:

  • No Cloud Monitoring dashboards
  • No alerting for VM health, disk usage, or service availability
  • No log aggregation configuration
  • No uptime checks for the applications
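
As one possible starting point, an HTTPS uptime check can be defined alongside the rest of the infrastructure. The hostname and variable names below are assumptions:

resource "google_monitoring_uptime_check_config" "gitea" {
  display_name = "gitea-uptime"
  timeout      = "10s"

  http_check {
    path    = "/"
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "git.example.com" # assumed hostname
    }
  }
}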

8. No High Availability or Auto-Healing

Current setup has single points of failure:

  • Single VM per service
  • No managed instance groups
  • No auto-restart on failure
  • No health checks

9. DNS Configuration Gaps

  • Zone signing key uses 1024-bit RSA (should be 2048 bits)
  • No apex domain record defined (only subdomains)
  • TTL of 300 seconds is reasonable but not documented
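
Strengthening the zone signing key is a small change to the managed zone's DNSSEC configuration. A sketch, with the zone name and domain as placeholders; note that GCP may only apply new key specs when DNSSEC is re-enabled, so the change can require toggling the state:

resource "google_dns_managed_zone" "main" {
  name     = "sfh-zone"     # placeholder; use the existing zone name
  dns_name = "example.com." # placeholder; use the real apex domain

  dnssec_config {
    state         = "on"
    non_existence = "nsec3"

    default_key_specs {
      key_type   = "keySigning"
      algorithm  = "rsasha256"
      key_length = 2048
    }

    default_key_specs {
      key_type   = "zoneSigning"
      algorithm  = "rsasha256"
      key_length = 2048 # up from the current 1024-bit key
    }
  }
}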

10. Missing Cost Optimization

  • No committed use discounts
  • Could use preemptible VMs for non-production
  • No resource tagging for cost allocation
  • No budgets or billing alerts

Detailed Recommendations

High Priority (Do First)

  1. Fix Hardcoded Service Account

    • Use data sources or variables
    • Ensures portability across projects
  2. Implement Application Provisioning

    • Add startup scripts with idempotent configuration
    • Or create golden images with Packer
    • Document all manual steps taken
  3. Secure Firewall Rules

    • Remove public access to ports 3000 and 5678
    • Restrict HTTP/HTTPS to Cloudflare IPs if using a CDN
    • Add explicit SSH rules with IP allowlisting
  4. Create Custom VPC

    • Separate network for these resources
    • Proper subnet configuration
    • Network tags for better organization
  5. Add Persistent Data Disks (attachment sketch after this list)

    resource "google_compute_disk" "gitea_data" {
      name = "gitea-data"
      size = 50
      type = "pd-standard"
      zone = var.zone
    }
    
  6. Implement Backup Strategy

    • Scheduled snapshots for data disks
    • Retention policies
    • Test restore procedures
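
The disk created in item 5 still needs to be attached to the VM. One way to do that is shown below; the instance resource name is an assumption about this project:

resource "google_compute_attached_disk" "gitea_data" {
  disk     = google_compute_disk.gitea_data.id
  instance = google_compute_instance.gitea.id # assumed instance resource name
  zone     = var.zone
}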

Medium Priority (Important but Not Urgent)

  1. Add Monitoring and Alerting

    • Cloud Monitoring dashboards for VM metrics
    • Uptime checks for services
    • Alert policies for disk usage, CPU, memory
    • Email/Slack notifications
  2. Upgrade Operating System

    • Change to debian-cloud/debian-12
    • Test application compatibility first
  3. Improve DNS Configuration

    • Increase zone signing key to 2048 bits
    • Add apex domain record if needed
    • Consider lower TTL during migrations
  4. Add Lifecycle Management

    lifecycle {
      prevent_destroy = true  # For production
      ignore_changes  = [metadata_startup_script]
    }
    
  5. Implement Better Secrets Management (sketch after this list)

    • Use Secret Manager for application secrets
    • Grant VMs access via service account
    • Avoid hardcoding credentials
  6. Add Resource Labels

    labels = {
      environment = "production"
      service     = "gitea"
      managed_by  = "terraform"
      cost_center = "engineering"
    }
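
For item 5, a minimal Secret Manager setup might look like the following. The secret name is an assumption, and the IAM binding reuses the default compute service account data source suggested earlier in this review:

resource "google_secret_manager_secret" "n8n_encryption_key" {
  secret_id = "n8n-encryption-key" # assumed name

  replication {
    auto {} # on older provider versions this is `automatic = true`
  }
}

# Allow the VM's service account to read the secret at runtime.
resource "google_secret_manager_secret_iam_member" "n8n_read" {
  secret_id = google_secret_manager_secret.n8n_encryption_key.id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${data.google_compute_default_service_account.default.email}"
}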
    

Low Priority (Nice to Have)

  1. Modularize Terraform Code

    • Create reusable modules for VM + DNS pattern
    • Separate module for firewall rules
    • Easier to maintain and extend
  2. Add terraform.tfvars.example (sketch after this list)

    • Document required variables
    • Provide example values
    • Help new team members get started
  3. Consider Terragrunt

    • If planning multi-environment setup (dev/staging/prod)
    • DRY configuration management
    • Environment-specific overrides
  4. Implement CI/CD for Terraform

    • Automated terraform plan on PRs
    • Automated terraform apply after merge
    • State locking verification
  5. Add Pre-commit Hooks

    • Run terraform fmt automatically
    • Run terraform validate
    • Run security scanning (tfsec, checkov)
  6. Consider Managed Services

    • Cloud Run for containerized apps (simpler than VMs)
    • Cloud SQL if databases are needed
    • Cloud Storage for artifacts/backups
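
For item 2, a terraform.tfvars.example could be as small as the following; the variable names are guesses at what variables.tf defines and should be adjusted to match:

# terraform.tfvars.example
project_id   = "my-gcp-project"
region       = "us-central1"
zone         = "us-central1-a"
machine_type = "e2-small"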

Security Recommendations Summary

Immediate Actions Required:

  1. Close ports 3000 and 5678 to public internet
  2. Implement IP allowlisting for SSH access
  3. Create custom VPC with proper firewall rules
  4. Enable VPC Flow Logs for security monitoring
  5. Implement Cloud Armor for DDoS protection

Additional Security Measures:

  • Enable OS Login for SSH key management
  • Use Identity-Aware Proxy (IAP) for VM access
  • Implement least-privilege service account permissions
  • Enable audit logging for all resources
  • Regular security scanning with Cloud Security Scanner
  • Implement Web Application Firewall (WAF) rules

Data Protection Recommendations

Backup Strategy:

# Example snapshot schedule
resource "google_compute_resource_policy" "daily_backup" {
  name   = "daily-backup-policy"
  region = var.region

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1
        start_time    = "04:00"
      }
    }
    retention_policy {
      max_retention_days    = 14
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }
  }
}
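
The policy above takes effect only once it is attached to a disk. A sketch of the attachment, assuming the data disk from the earlier recommendation:

resource "google_compute_disk_resource_policy_attachment" "gitea_backup" {
  name = google_compute_resource_policy.daily_backup.name
  disk = google_compute_disk.gitea_data.name # assumed data disk
  zone = var.zone
}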

Disaster Recovery Plan:

  1. Document manual setup steps in code-readable format (cloud-init)
  2. Test VM restoration from snapshots quarterly
  3. Maintain off-site backups of critical data
  4. Document RTO (Recovery Time Objective) and RPO (Recovery Point Objective)

Estimated Effort for Improvements

Priority   Task Category         Estimated Time
High       Security fixes        4-6 hours
High       Data persistence      2-3 hours
High       Service account fix   30 minutes
High       Startup scripts       4-8 hours
Medium     Monitoring setup      3-4 hours
Medium     OS upgrade            1-2 hours
Low        Modularization        6-8 hours
Low        CI/CD pipeline        4-6 hours

Total effort for high-priority items: ~12-18 hours
Total effort for all recommendations: ~30-40 hours


Conclusion

This infrastructure demonstrates a good starting point for managing cloud resources with Terraform. The code is readable, well-documented, and follows basic IaC principles. However, significant work is needed to make this production-ready.

The most critical issue is the hybrid approach where infrastructure is managed by Terraform but application configuration is manual. This creates a maintenance burden and makes disaster recovery difficult.

The recommended path forward:

  1. Start with security fixes (firewall rules, custom VPC)
  2. Add persistent data disks and backups
  3. Automate application provisioning
  4. Implement monitoring and alerting
  5. Create runbooks for common operations

With these improvements, this infrastructure could achieve a production-readiness score of 8/10 and provide a solid foundation for the Sing For Hope organization's DevOps needs.


Questions for Further Discussion

  1. What is the expected traffic volume for these services?
  2. What are the RTO/RPO requirements for disaster recovery?
  3. Is high availability required, or is some downtime acceptable?
  4. What is the budget for infrastructure costs?
  5. Are there compliance requirements (HIPAA, SOC2, etc.)?
  6. Who will be responsible for ongoing maintenance?
  7. Are there plans to add more services to this infrastructure?

Note: This review is based on the code as of November 9, 2025. As infrastructure evolves, periodic reviews should be conducted to ensure continued alignment with best practices and organizational needs.