devops/infra/gcp/REVIEW.md

# Infrastructure Review: GCP Terraform for Sing For Hope

**Review Date:** November 9, 2025
**Reviewer:** Claude Code
**Project:** GCP Infrastructure for Gitea and n8n

---

## Executive Summary

This Terraform project manages GCP infrastructure for hosting Gitea (self-hosted Git) and n8n (workflow automation) on separate VM instances. While the code is clean and functional, it has several critical issues that limit production-readiness, particularly around security, data persistence, and infrastructure-as-code completeness.

**Overall Ratings:**
- Production-Readiness: **4/10**
- Code Quality: **6/10**
- Documentation: **7/10**
- Maintainability: **5/10**
- Security: **3/10**

---

## Strengths

### 1. Clean Project Structure
The project follows Terraform best practices with proper file organization:
- Separate files for variables, outputs, and backend configuration
- Clear naming conventions
- Logical resource grouping

### 2. Good Documentation
The README.md provides:
- Clear service descriptions
- Prerequisite checklist
- Step-by-step usage instructions
- Honest acknowledgment of manual setup steps

### 3. Remote State Management
- Uses GCS backend for state storage
- Prevents state conflicts in team environments
- Enables state locking

### 4. DNS Security
- DNSSEC enabled with appropriate configuration
- Uses NSEC3 for authenticated denial of existence
- Explicit key specifications for both the key-signing and zone-signing keys (key length is revisited under DNS Configuration Gaps)

### 5. Sensible Defaults
- Variables have reasonable default values
- Machine type (e2-small) appropriate for small workloads
- Standard region/zone selection

---

## Critical Issues

### 1. Hardcoded Service Account Email
**Location:** `main.tf:53`, `main.tf:76`

```terraform
service_account {
  email  = "456409048169-compute@developer.gserviceaccount.com"
  scopes = [...]
}
```

**Impact:**
- Code is not portable across projects
- Violates infrastructure-as-code principles
- Will fail if used in different GCP projects

**Fix:**
Use Terraform data sources to dynamically fetch the default compute service account:
```terraform
data "google_compute_default_service_account" "default" {}

service_account {
  email  = data.google_compute_default_service_account.default.email
  scopes = [...]
}
```

### 2. Infrastructure Drift (Manual Configuration)
**Location:** README.md notes manual setup of Docker, Nginx, Certbot

**Impact:**
- Infrastructure cannot be reproduced from code alone
- Disaster recovery requires manual intervention and documentation
- Team members cannot spin up identical environments
- Configuration changes aren't tracked in version control

**Fix:**
- Add startup scripts to `metadata_startup_script` in VM resources
- OR use Packer to create pre-configured VM images
- OR use configuration management tools (Ansible, cloud-init)
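
As a minimal sketch of the first option, the manual Docker, Nginx, and Certbot installation could move into a startup script. The resource name `gitea`, the instance name, and the package list are assumptions for illustration, not taken from `main.tf`; the script runs on every boot, so it is limited to commands that are safe to repeat.

```terraform
# Sketch only: resource name, instance name, and package list are assumptions.
resource "google_compute_instance" "gitea" {
  name         = "gitea"
  machine_type = "e2-small"
  zone         = var.zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default" # matches the current setup; see the VPC issue below
    access_config {}    # ephemeral external IP
  }

  # Runs as root on every boot; apt-get install is a no-op once installed.
  metadata_startup_script = <<-EOT
    #!/bin/bash
    set -euo pipefail
    export DEBIAN_FRONTEND=noninteractive
    apt-get update
    apt-get install -y docker.io nginx certbot python3-certbot-nginx
    systemctl enable --now docker nginx
  EOT
}
```

Container definitions and Nginx site configuration would still need to be templated in, but this already removes the package installation step from the manual runbook.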

### 3. Security Vulnerabilities

#### a. Overly Permissive Firewall Rules
**Location:** All firewall rules use `source_ranges = ["0.0.0.0/0"]`

**Issues:**
- Gitea port 3000 exposed to the internet (`main.tf:82-93`)
- n8n port 5678 exposed to the internet (`main.tf:121-132`)
- These should only be accessible via reverse proxy (Nginx)
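
One way to express this, sketched under the assumption that Nginx terminates TLS on each VM and proxies to the application on localhost: drop the rules for ports 3000 and 5678 and keep a single rule for 80/443. The rule name and the `web` network tag are hypothetical.

```terraform
# Sketch only: the rule name and the "web" tag are assumptions.
# The existing rules for tcp:3000 and tcp:5678 would be deleted.
resource "google_compute_firewall" "allow_web" {
  name    = "allow-web"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["web"] # applies only to instances carrying this tag
}
```

For defense in depth, the containers can also publish their ports on `127.0.0.1` only, so the applications stay unreachable even if a permissive rule is reintroduced later.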

#### b. Using Default VPC
**Location:** All resources use `network = "default"`

**Issues:**
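- The default network ships with pre-populated firewall rules (`default-allow-ssh`, `default-allow-rdp`, `default-allow-icmp`, `default-allow-internal`) that, unless already removed, are broader than these services need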
- No network segmentation
- Shared with other project resources
- Difficult to implement security best practices
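
A dedicated network could look roughly like the sketch below. The names and the CIDR range are assumptions; `var.region` is used the same way as elsewhere in this review.

```terraform
# Sketch only: names and the 10.10.0.0/24 range are assumptions.
resource "google_compute_network" "apps" {
  name                    = "apps-vpc"
  auto_create_subnetworks = false # no implicit subnets in every region
}

resource "google_compute_subnetwork" "apps" {
  name          = "apps-subnet"
  region        = var.region
  network       = google_compute_network.apps.id
  ip_cidr_range = "10.10.0.0/24"

  # Optional: VPC Flow Logs for later security analysis.
  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}
```

A custom-mode network starts with no ingress rules of its own, so every allowed path has to be declared explicitly in Terraform.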

#### c. No SSH Access Control
- No explicit SSH firewall rules defined in Terraform
- The default network's `default-allow-ssh` rule, if still present, allows port 22 from `0.0.0.0/0`
- No bastion host or IAP tunneling
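
If IAP tunneling is chosen, SSH can be limited to Google's IAP TCP forwarding range (`35.235.240.0/20`). The rule name and the `ssh-iap` tag below are hypothetical; on the default network, the stock `default-allow-ssh` rule would also have to be removed for this restriction to take effect.

```terraform
# Sketch only: the rule name and the "ssh-iap" tag are assumptions.
resource "google_compute_firewall" "allow_ssh_iap" {
  name    = "allow-ssh-from-iap"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  source_ranges = ["35.235.240.0/20"] # IAP TCP forwarding range
  target_tags   = ["ssh-iap"]
}
```

Operators would then connect with `gcloud compute ssh <instance> --tunnel-through-iap` instead of reaching port 22 directly.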

### 4. No Data Persistence Strategy
**Locations:** `main.tf:40-44`, `main.tf:64-68`

**Issues:**
- Boot disks use default sizing
- No separate data volumes for application data
- Gitea repositories and n8n workflows are stored on the boot disk, which is deleted together with the instance by default (`auto_delete = true`)
- No backup configuration
- Risk of data loss if VMs are recreated

**Impact:**
Critical user data (Git repositories, workflow configurations) could be lost during infrastructure updates.
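
A separate data disk that survives instance replacement might be sketched as follows. The reference `google_compute_instance.gitea` is an assumed resource name, and the disk still has to be formatted and mounted inside the VM (for example by a startup script).

```terraform
# Sketch only: assumes the Gitea VM is declared as google_compute_instance.gitea.
resource "google_compute_disk" "gitea_data" {
  name = "gitea-data"
  size = 50 # GB
  type = "pd-standard"
  zone = var.zone

  lifecycle {
    prevent_destroy = true # guard against accidental deletion
  }
}

resource "google_compute_attached_disk" "gitea_data" {
  disk        = google_compute_disk.gitea_data.id
  instance    = google_compute_instance.gitea.id
  device_name = "gitea-data" # appears as /dev/disk/by-id/google-gitea-data
}
```

Keeping the attachment as its own resource means the VM can be recreated without Terraform touching the disk.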

---

## Notable Gaps

### 5. Missing Startup Automation
Startup automation is intentionally omitted per the README, but its absence creates operational challenges:
- New team members can't provision working infrastructure
- Updates require manual SSH intervention
- No automated application updates or patching

### 6. Outdated Operating System
**Location:** `main.tf:42`, `main.tf:66`

Currently using `debian-cloud/debian-11`. Debian 11 has been in LTS-only support since August 2024, while Debian 12 is available and fully supported.

### 7. No Monitoring or Alerting
Missing operational visibility:
- No Cloud Monitoring dashboards
- No alerting for VM health, disk usage, or service availability
- No log aggregation configuration
- No uptime checks for the applications
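
As a starting point for the last item, an HTTPS uptime check can be declared in Terraform. The hostname `git.example.org` and the `project_id` variable are placeholders, not values from this project.

```terraform
# Sketch only: the hostname and the project_id variable are assumptions.
variable "project_id" { type = string }

resource "google_monitoring_uptime_check_config" "gitea_https" {
  display_name = "gitea-https"
  timeout      = "10s"
  period       = "300s" # probe every 5 minutes

  http_check {
    path         = "/"
    port         = 443
    use_ssl      = true
    validate_ssl = true # also fails the check on an invalid or expired certificate
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "git.example.org"
    }
  }
}
```

An alert policy and notification channel would be layered on top of this check; they are left out here to keep the sketch short.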

### 8. No High Availability or Auto-Healing
Current setup has single points of failure:
- Single VM per service
- No managed instance groups
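- No application-level auto-restart on failure (Compute Engine's default `automatic_restart` restarts the VM after a crash or host event, not a failed service)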
- No health checks

### 9. DNS Configuration Gaps
- Zone signing key uses 1024-bit RSA (should be 2048 bits)
- No apex domain record defined (only subdomains)
- TTL of 300 seconds is reasonable but not documented
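
The key-length change could be expressed roughly as below. The zone name and domain are placeholders. Note that Cloud DNS uses `default_key_specs` when it generates a zone's initial keys, so changing them on an already-signed zone may require turning DNSSEC off and on again; verify this against the provider documentation before applying.

```terraform
# Sketch only: zone name and domain are placeholders.
resource "google_dns_managed_zone" "primary" {
  name     = "primary-zone"
  dns_name = "example.org."

  dnssec_config {
    state         = "on"
    non_existence = "nsec3"

    default_key_specs {
      algorithm  = "rsasha256"
      key_type   = "keySigning"
      key_length = 2048
    }

    default_key_specs {
      algorithm  = "rsasha256"
      key_type   = "zoneSigning"
      key_length = 2048 # raised from 1024
    }
  }
}
```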

### 10. Missing Cost Optimization
- No committed use discounts
- Could use Spot VMs (the successor to preemptible VMs) for non-production
- No resource labels for cost allocation
- No budgets or billing alerts
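
A budget with alert thresholds is a small addition. In this sketch the billing account variable, the display name, and the 50 USD amount are all assumptions, and the identity running Terraform needs permission to manage budgets on the billing account.

```terraform
# Sketch only: variable name, display name, and amount are assumptions.
variable "billing_account_id" { type = string }

data "google_project" "current" {}

resource "google_billing_budget" "monthly" {
  billing_account = var.billing_account_id
  display_name    = "gitea-n8n-monthly"

  budget_filter {
    projects = ["projects/${data.google_project.current.number}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "50"
    }
  }

  # Notify at 80% and 100% of the monthly amount.
  threshold_rules {
    threshold_percent = 0.8
  }
  threshold_rules {
    threshold_percent = 1.0
  }
}
```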

---

## Detailed Recommendations

### High Priority (Do First)

1. **Fix Hardcoded Service Account**
   - Use data sources or variables
   - Ensures portability across projects

2. **Implement Application Provisioning**
   - Add startup scripts with idempotent configuration
   - Or create golden images with Packer
   - Document all manual steps taken

3. **Secure Firewall Rules**
   - Remove public access to ports 3000 and 5678
   - Restrict HTTP/HTTPS to Cloudflare IPs if using CDN
   - Add explicit SSH rules with IP allowlisting

4. **Create Custom VPC**
   - Separate network for these resources
   - Proper subnet configuration
   - Network tags so firewall rules target only the intended VMs

5. **Add Persistent Data Disks**
   ```terraform
   resource "google_compute_disk" "gitea_data" {
     name = "gitea-data"
     size = 50
     type = "pd-standard"
     zone = var.zone
   }
   ```

6. **Implement Backup Strategy**
   - Scheduled snapshots for data disks
   - Retention policies
   - Test restore procedures

### Medium Priority (Important but Not Urgent)

7. **Add Monitoring and Alerting**
   - Cloud Monitoring dashboards for VM metrics
   - Uptime checks for services
   - Alert policies for disk usage, CPU, memory
   - Email/Slack notifications

8. **Upgrade Operating System**
   - Change to `debian-cloud/debian-12`
   - Test application compatibility first

9. **Improve DNS Configuration**
   - Increase zone signing key to 2048 bits
   - Add apex domain record if needed
   - Consider lower TTL during migrations

10. **Add Lifecycle Management**
    ```terraform
    lifecycle {
      prevent_destroy = true  # For production
      ignore_changes  = [metadata_startup_script]
    }
    ```

11. **Implement Better Secrets Management**
    - Use Secret Manager for application secrets
    - Grant VMs access via service account
    - Avoid hardcoding credentials

12. **Add Resource Labels**
    ```terraform
    labels = {
      environment = "production"
      service     = "gitea"
      managed_by  = "terraform"
      cost_center = "engineering"
    }
    ```
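
For the Secret Manager recommendation (item 11), a minimal sketch follows. The secret name is hypothetical, and the `auto {}` replication block assumes version 5 or later of the Google provider (older versions use `automatic = true`). The VM's access scopes must also permit Secret Manager calls.

```terraform
# Sketch only: the secret name is an assumption; requires google provider >= 5.0.
data "google_compute_default_service_account" "default" {}

resource "google_secret_manager_secret" "n8n_encryption_key" {
  secret_id = "n8n-encryption-key"

  replication {
    auto {}
  }
}

# Allow the VM's service account to read this one secret only.
resource "google_secret_manager_secret_iam_member" "n8n_reader" {
  secret_id = google_secret_manager_secret.n8n_encryption_key.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${data.google_compute_default_service_account.default.email}"
}
```

The secret value itself is best added outside Terraform (for example with `gcloud secrets versions add`) so that it never lands in the state file.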

### Low Priority (Nice to Have)

13. **Modularize Terraform Code**
    - Create reusable modules for VM + DNS pattern
    - Separate module for firewall rules
    - Easier to maintain and extend

14. **Add terraform.tfvars.example**
    - Document required variables
    - Provide example values
    - Help new team members get started

15. **Consider Terragrunt**
    - If planning multi-environment setup (dev/staging/prod)
    - DRY configuration management
    - Environment-specific overrides

16. **Implement CI/CD for Terraform**
    - Automated `terraform plan` on PRs
    - Automated `terraform apply` after merge
    - State locking verification

17. **Add Pre-commit Hooks**
    - Run `terraform fmt` automatically
    - Run `terraform validate`
    - Run security scanning (Checkov, or Trivy, which supersedes tfsec)

18. **Consider Managed Services**
    - Cloud Run for containerized apps (simpler than VMs)
    - Cloud SQL if databases are needed
    - Cloud Storage for artifacts/backups

---

## Security Recommendations Summary

### Immediate Actions Required:
1. Close ports 3000 and 5678 to public internet
2. Implement IP allowlisting for SSH access
3. Create custom VPC with proper firewall rules
4. Enable VPC Flow Logs for security monitoring
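5. Implement Cloud Armor for DDoS protection (standard security policies attach to load balancer backends, so this implies fronting the VMs with an external Application Load Balancer)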

### Additional Security Measures:
- Enable OS Login for SSH key management
- Use Identity-Aware Proxy (IAP) for VM access
- Implement least-privilege service account permissions
- Enable audit logging for all resources
- Regular security scanning with Web Security Scanner (formerly Cloud Security Scanner)
- Implement Web Application Firewall (WAF) rules
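
For the least-privilege item, one option is a dedicated service account per VM instead of the default compute account. The account ID, the chosen roles, and the `project_id` variable below are assumptions to adapt.

```terraform
# Sketch only: account ID, roles, and the project_id variable are assumptions.
variable "project_id" { type = string }

resource "google_service_account" "gitea" {
  account_id   = "gitea-vm"
  display_name = "Gitea VM"
}

# Grant only what the VM needs: writing logs and metrics.
resource "google_project_iam_member" "gitea_logging" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gitea.email}"
}

resource "google_project_iam_member" "gitea_monitoring" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gitea.email}"
}
```

The instance's `service_account` block would then reference `google_service_account.gitea.email` with the `cloud-platform` scope, leaving IAM roles as the effective access control.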

---

## Data Protection Recommendations

### Backup Strategy:
```terraform
# Example snapshot schedule
resource "google_compute_resource_policy" "daily_backup" {
  name   = "daily-backup-policy"
  region = var.region

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1
        start_time    = "04:00"
      }
    }
    retention_policy {
      max_retention_days    = 14
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }
  }
}
```
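
The schedule above has no effect until it is attached to a disk. A sketch of the attachment, assuming the `gitea_data` disk from recommendation 5 exists:

```terraform
# Sketch only: assumes google_compute_disk.gitea_data is declared elsewhere.
resource "google_compute_disk_resource_policy_attachment" "gitea_daily_backup" {
  name = google_compute_resource_policy.daily_backup.name
  disk = google_compute_disk.gitea_data.name
  zone = var.zone
}
```

The `start_time` in the policy is interpreted as UTC, so `"04:00"` should be checked against the team's local low-traffic window.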

### Disaster Recovery Plan:
1. Document manual setup steps in code-readable format (cloud-init)
2. Test VM restoration from snapshots quarterly
3. Maintain off-site backups of critical data
4. Document RTO (Recovery Time Objective) and RPO (Recovery Point Objective)

---

## Estimated Effort for Improvements

| Priority | Task Category | Estimated Time |
|----------|---------------|----------------|
| High | Security fixes | 4-6 hours |
| High | Data persistence | 2-3 hours |
| High | Service account fix | 30 minutes |
| High | Startup scripts | 4-8 hours |
| Medium | Monitoring setup | 3-4 hours |
| Medium | OS upgrade | 1-2 hours |
| Low | Modularization | 6-8 hours |
| Low | CI/CD pipeline | 4-6 hours |

**Total effort for high-priority items:** ~11-18 hours
**Total effort for all recommendations:** ~30-40 hours

---

## Conclusion

This infrastructure demonstrates a good starting point for managing cloud resources with Terraform. The code is readable, well-documented, and follows basic IaC principles. However, significant work is needed to make this production-ready.

The most critical issue is the **hybrid approach** where infrastructure is managed by Terraform but application configuration is manual. This creates a maintenance burden and makes disaster recovery difficult.

### Recommended Next Steps:
1. Start with security fixes (firewall rules, custom VPC)
2. Add persistent data disks and backups
3. Automate application provisioning
4. Implement monitoring and alerting
5. Create runbooks for common operations

With these improvements, this infrastructure could achieve a production-readiness score of 8/10 and provide a solid foundation for the Sing For Hope organization's DevOps needs.

---

## Questions for Further Discussion

1. What is the expected traffic volume for these services?
2. What are the RTO/RPO requirements for disaster recovery?
3. Is high availability required, or is some downtime acceptable?
4. What is the budget for infrastructure costs?
5. Are there compliance requirements (HIPAA, SOC2, etc.)?
6. Who will be responsible for ongoing maintenance?
7. Are there plans to add more services to this infrastructure?

---

**Note:** This review is based on the code as of November 9, 2025. As infrastructure evolves, periodic reviews should be conducted to ensure continued alignment with best practices and organizational needs.