Infrastructure Review: GCP Terraform for Sing For Hope
Review Date: November 9, 2025
Reviewer: Claude Code
Project: GCP Infrastructure for Gitea and n8n
Executive Summary
This Terraform project manages GCP infrastructure for hosting Gitea (self-hosted Git) and n8n (workflow automation) on separate VM instances. While the code is clean and functional, it has several critical issues that limit production-readiness, particularly around security, data persistence, and infrastructure-as-code completeness.
Overall Ratings:
- Production-Readiness: 4/10
- Code Quality: 6/10
- Documentation: 7/10
- Maintainability: 5/10
- Security: 3/10
Strengths
1. Clean Project Structure
The project follows Terraform best practices with proper file organization:
- Separate files for variables, outputs, and backend configuration
- Clear naming conventions
- Logical resource grouping
2. Good Documentation
The README.md provides:
- Clear service descriptions
- Prerequisite checklist
- Step-by-step usage instructions
- Honest acknowledgment of manual setup steps
3. Remote State Management
- Uses GCS backend for state storage
- Prevents state conflicts in team environments
- Enables state locking
4. DNS Security
- DNSSEC enabled with appropriate configuration
- Uses NSEC3 for non-existence proof
- Proper key specifications for signing and zone signing
5. Sensible Defaults
- Variables have reasonable default values
- Machine type (e2-small) appropriate for small workloads
- Standard region/zone selection
Critical Issues
1. Hardcoded Service Account Email
Location: main.tf:53, main.tf:76
service_account {
email = "456409048169-compute@developer.gserviceaccount.com"
scopes = [...]
}
Impact:
- Code is not portable across projects
- Violates infrastructure-as-code principles
- Will fail if used in different GCP projects
Fix: Use Terraform data sources to dynamically fetch the default compute service account:
data "google_compute_default_service_account" "default" {}
service_account {
email = data.google_compute_default_service_account.default.email
scopes = [...]
}
2. Infrastructure Drift (Manual Configuration)
Location: README.md notes manual setup of Docker, Nginx, Certbot
Impact:
- Infrastructure cannot be reproduced from code alone
- Disaster recovery requires manual intervention and documentation
- Team members cannot spin up identical environments
- Configuration changes aren't tracked in version control
Fix:
- Add startup scripts via metadata_startup_script in the VM resources
- OR use Packer to create pre-configured VM images
- OR use configuration management tools (Ansible, cloud-init)
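As a sketch of the first option, a minimal idempotent bootstrap might look like the following (the package list and resource name are assumptions, not taken from the repository):

```hcl
resource "google_compute_instance" "gitea" {
  # ... existing machine_type, boot_disk, network_interface, etc.

  # Runs on first boot; installs Docker and the reverse proxy stack
  # that the README currently describes as manual steps.
  metadata_startup_script = <<-EOT
    #!/bin/bash
    set -euo pipefail
    apt-get update
    apt-get install -y docker.io nginx certbot python3-certbot-nginx
    systemctl enable --now docker nginx
  EOT
}
```

Note that metadata_startup_script runs on every boot, so the script must be safe to re-run; apt-get and systemctl enable are idempotent here.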
3. Security Vulnerabilities
a. Overly Permissive Firewall Rules
Location: All firewall rules use source_ranges = ["0.0.0.0/0"]
Issues:
- Gitea port 3000 exposed to the internet (main.tf:82-93)
- n8n port 5678 exposed to the internet (main.tf:121-132)
- These ports should only be accessible via the reverse proxy (Nginx)
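One way to lock this down, sketched here with an assumed internal CIDR and tag, is to allow the application port only from within the VPC (or drop the rule entirely once Nginx proxies all traffic):

```hcl
resource "google_compute_firewall" "gitea_internal" {
  name    = "allow-gitea-internal"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["3000"]
  }

  # Assumed internal range; replace with the VPC's actual subnet CIDR.
  source_ranges = ["10.128.0.0/9"]
  # Assumed network tag; the tag must also be set on the Gitea instance.
  target_tags = ["gitea"]
}
```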
b. Using Default VPC
Location: All resources use network = "default"
Issues:
- Default network has permissive routing
- No network segmentation
- Shared with other project resources
- Difficult to implement security best practices
c. No SSH Access Control
- No explicit SSH firewall rules defined
- Default GCP rules may be too permissive
- No bastion host or IAP tunneling
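A hedged sketch of an SSH rule scoped to IAP tunneling (35.235.240.0/20 is Google's published source range for IAP TCP forwarding), which avoids exposing port 22 to the internet at all:

```hcl
resource "google_compute_firewall" "allow_iap_ssh" {
  name    = "allow-iap-ssh"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # Google's documented source block for IAP TCP forwarding.
  source_ranges = ["35.235.240.0/20"]
}
```

With this in place, operators connect via `gcloud compute ssh --tunnel-through-iap` instead of a public SSH endpoint.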
4. No Data Persistence Strategy
Locations: main.tf:40-44, main.tf:64-68
Issues:
- Boot disks use default sizing
- No separate data volumes for application data
- Gitea repositories and n8n workflows stored on ephemeral boot disk
- No backup configuration
- Risk of data loss if VMs are recreated
Impact: Critical user data (Git repositories, workflow configurations) could be lost during infrastructure updates.
Notable Gaps
5. Missing Startup Automation
While intentionally omitted per README, this creates operational challenges:
- New team members can't provision working infrastructure
- Updates require manual SSH intervention
- No automated application updates or patching
6. Outdated Operating System
Location: main.tf:42, main.tf:66
Currently using debian-cloud/debian-11 while Debian 12 is available and supported.
7. No Monitoring or Alerting
Missing operational visibility:
- No Cloud Monitoring dashboards
- No alerting for VM health, disk usage, or service availability
- No log aggregation configuration
- No uptime checks for the applications
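As a starting point, an uptime check for one of the services might look like this (the hostname and project variable are assumptions):

```hcl
resource "google_monitoring_uptime_check_config" "gitea" {
  display_name = "gitea-uptime"
  timeout      = "10s"
  period       = "300s"

  http_check {
    path    = "/"
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id   # assumed variable
      host       = "git.example.org" # assumed hostname
    }
  }
}
```

An alert policy can then be attached to this check to notify on failures.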
8. No High Availability or Auto-Healing
Current setup has single points of failure:
- Single VM per service
- No managed instance groups
- No auto-restart on failure
- No health checks
9. DNS Configuration Gaps
- Zone signing key uses 1024-bit RSA (should be 2048 bits)
- No apex domain record defined (only subdomains)
- TTL of 300 seconds is reasonable but not documented
10. Missing Cost Optimization
- No committed use discounts
- Could use preemptible VMs for non-production
- No resource tagging for cost allocation
- No budgets or billing alerts
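Budgets can themselves be managed in Terraform; a minimal sketch (the billing account variable and dollar amount are placeholders):

```hcl
resource "google_billing_budget" "monthly" {
  billing_account = var.billing_account_id # assumed variable
  display_name    = "infra-monthly-budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "100" # placeholder monthly cap
    }
  }

  # Alert when 90% of the budget is consumed.
  threshold_rules {
    threshold_percent = 0.9
  }
}
```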
Detailed Recommendations
High Priority (Do First)
1. Fix Hardcoded Service Account
   - Use data sources or variables
   - Ensures portability across projects
2. Implement Application Provisioning
   - Add startup scripts with idempotent configuration
   - Or create golden images with Packer
   - Document all manual steps taken
3. Secure Firewall Rules
   - Remove public access to ports 3000 and 5678
   - Restrict HTTP/HTTPS to CloudFlare IPs if using CDN
   - Add explicit SSH rules with IP allowlisting
4. Create Custom VPC
   - Separate network for these resources
   - Proper subnet configuration
   - Network tags for better organization
5. Add Persistent Data Disks
   resource "google_compute_disk" "gitea_data" {
     name = "gitea-data"
     size = 50
     type = "pd-standard"
     zone = var.zone
   }
6. Implement Backup Strategy
   - Scheduled snapshots for data disks
   - Retention policies
   - Test restore procedures
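A persistent data disk still has to be attached to the VM and mounted; the attachment side can be expressed in Terraform (the instance and disk resource names are assumptions):

```hcl
resource "google_compute_attached_disk" "gitea_data" {
  disk     = google_compute_disk.gitea_data.id # assumed disk resource
  instance = google_compute_instance.gitea.id  # assumed instance resource
}
```

Formatting the filesystem and mounting it (e.g. via fstab entries in a startup script) remains an OS-level step outside Terraform.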
Medium Priority (Important but Not Urgent)
1. Add Monitoring and Alerting
   - Cloud Monitoring dashboards for VM metrics
   - Uptime checks for services
   - Alert policies for disk usage, CPU, memory
   - Email/Slack notifications
2. Upgrade Operating System
   - Change to debian-cloud/debian-12
   - Test application compatibility first
3. Improve DNS Configuration
   - Increase the zone signing key to 2048 bits
   - Add an apex domain record if needed
   - Consider lower TTLs during migrations
4. Add Lifecycle Management
   lifecycle {
     prevent_destroy = true # For production
     ignore_changes  = [metadata_startup_script]
   }
5. Implement Better Secrets Management
   - Use Secret Manager for application secrets
   - Grant VMs access via service account
   - Avoid hardcoding credentials
6. Add Resource Labels
   labels = {
     environment = "production"
     service     = "gitea"
     managed_by  = "terraform"
     cost_center = "engineering"
   }
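For the secrets-management recommendation above, a minimal Secret Manager sketch (the secret name is illustrative, and the service account reference assumes the data-source fix described earlier):

```hcl
resource "google_secret_manager_secret" "n8n_encryption_key" {
  secret_id = "n8n-encryption-key" # illustrative name

  replication {
    auto {}
  }
}

# Let the VM's service account read the secret at runtime
# instead of baking credentials into the instance.
resource "google_secret_manager_secret_iam_member" "n8n_access" {
  secret_id = google_secret_manager_secret.n8n_encryption_key.id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${data.google_compute_default_service_account.default.email}"
}
```

The application then fetches the secret at startup (e.g. via gcloud or the Secret Manager client library) rather than reading it from disk.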
Low Priority (Nice to Have)
1. Modularize Terraform Code
   - Create reusable modules for the VM + DNS pattern
   - Separate module for firewall rules
   - Easier to maintain and extend
2. Add terraform.tfvars.example
   - Document required variables
   - Provide example values
   - Help new team members get started
3. Consider Terragrunt
   - If planning a multi-environment setup (dev/staging/prod)
   - DRY configuration management
   - Environment-specific overrides
4. Implement CI/CD for Terraform
   - Automated terraform plan on PRs
   - Automated terraform apply after merge
   - State locking verification
5. Add Pre-commit Hooks
   - Run terraform fmt automatically
   - Run terraform validate
   - Run security scanning (tfsec, checkov)
6. Consider Managed Services
   - Cloud Run for containerized apps (simpler than VMs)
   - Cloud SQL if databases are needed
   - Cloud Storage for artifacts/backups
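For the terraform.tfvars.example recommendation above, the file might contain something like this (all values illustrative; the variable names are assumed to match variables.tf):

```hcl
# terraform.tfvars.example -- copy to terraform.tfvars and fill in real values.
project_id   = "my-gcp-project" # illustrative
region       = "us-central1"
zone         = "us-central1-a"
machine_type = "e2-small"
```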
Security Recommendations Summary
Immediate Actions Required:
- Close ports 3000 and 5678 to public internet
- Implement IP allowlisting for SSH access
- Create custom VPC with proper firewall rules
- Enable VPC Flow Logs for security monitoring
- Implement Cloud Armor for DDoS protection
Additional Security Measures:
- Enable OS Login for SSH key management
- Use Identity-Aware Proxy (IAP) for VM access
- Implement least-privilege service account permissions
- Enable audit logging for all resources
- Regular security scanning with Cloud Security Scanner
- Implement Web Application Firewall (WAF) rules
Data Protection Recommendations
Backup Strategy:
# Example snapshot schedule
resource "google_compute_resource_policy" "daily_backup" {
name = "daily-backup-policy"
region = var.region
snapshot_schedule_policy {
schedule {
daily_schedule {
days_in_cycle = 1
start_time = "04:00"
}
}
retention_policy {
max_retention_days = 14
on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
}
}
}
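The schedule only takes effect once it is attached to a disk; with the gitea-data disk proposed earlier (resource name assumed):

```hcl
resource "google_compute_disk_resource_policy_attachment" "gitea_backup" {
  name = google_compute_resource_policy.daily_backup.name
  disk = google_compute_disk.gitea_data.name # assumed disk resource
  zone = var.zone
}
```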
Disaster Recovery Plan:
- Document manual setup steps in code-readable format (cloud-init)
- Test VM restoration from snapshots quarterly
- Maintain off-site backups of critical data
- Document RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
Estimated Effort for Improvements
| Priority | Task Category | Estimated Time |
|---|---|---|
| High | Security fixes | 4-6 hours |
| High | Data persistence | 2-3 hours |
| High | Service account fix | 30 minutes |
| High | Startup scripts | 4-8 hours |
| Medium | Monitoring setup | 3-4 hours |
| Medium | OS upgrade | 1-2 hours |
| Low | Modularization | 6-8 hours |
| Low | CI/CD pipeline | 4-6 hours |
Total effort for high-priority items: ~12-18 hours
Total effort for all recommendations: ~30-40 hours
Conclusion
This infrastructure demonstrates a good starting point for managing cloud resources with Terraform. The code is readable, well-documented, and follows basic IaC principles. However, significant work is needed to make this production-ready.
The most critical issue is the hybrid approach where infrastructure is managed by Terraform but application configuration is manual. This creates a maintenance burden and makes disaster recovery difficult.
Recommended Next Steps:
- Start with security fixes (firewall rules, custom VPC)
- Add persistent data disks and backups
- Automate application provisioning
- Implement monitoring and alerting
- Create runbooks for common operations
With these improvements, this infrastructure could achieve a production-readiness score of 8/10 and provide a solid foundation for the Sing For Hope organization's DevOps needs.
Questions for Further Discussion
- What is the expected traffic volume for these services?
- What are the RTO/RPO requirements for disaster recovery?
- Is high availability required, or is some downtime acceptable?
- What is the budget for infrastructure costs?
- Are there compliance requirements (HIPAA, SOC2, etc.)?
- Who will be responsible for ongoing maintenance?
- Are there plans to add more services to this infrastructure?
Note: This review is based on the code as of November 9, 2025. As infrastructure evolves, periodic reviews should be conducted to ensure continued alignment with best practices and organizational needs.