added infra

2025-11-09 11:17:13 -05:00
parent afcb0b7932
commit 35773c6efe
9 changed files with 855 additions and 0 deletions
--- a/infra/gcp/REVIEW.md
+++ b/infra/gcp/REVIEW.md
@@ -0,0 +1,385 @@
+# Infrastructure Review: GCP Terraform for Sing For Hope
+
+**Review Date:** November 9, 2025
+**Reviewer:** Claude Code
+**Project:** GCP Infrastructure for Gitea and n8n
+
+---
+
+## Executive Summary
+
+This Terraform project manages GCP infrastructure for hosting Gitea (self-hosted Git) and n8n (workflow automation) on separate VM instances. While the code is clean and functional, it has several critical issues that limit production-readiness, particularly around security, data persistence, and infrastructure-as-code completeness.
+
+**Overall Ratings:**
+- Production-Readiness: **4/10**
+- Code Quality: **6/10**
+- Documentation: **7/10**
+- Maintainability: **5/10**
+- Security: **3/10**
+
+---
+
+## Strengths
+
+### 1. Clean Project Structure
+The project follows Terraform best practices with proper file organization:
+- Separate files for variables, outputs, and backend configuration
+- Clear naming conventions
+- Logical resource grouping
+
+### 2. Good Documentation
+The README.md provides:
+- Clear service descriptions
+- Prerequisite checklist
+- Step-by-step usage instructions
+- Honest acknowledgment of manual setup steps
+
+### 3. Remote State Management
+- Uses GCS backend for state storage
+- Prevents state conflicts in team environments
+- Enables state locking
+
+### 4. DNS Security
+- DNSSEC enabled with appropriate configuration
+- Uses NSEC3 for non-existence proof
+- Defines explicit key specifications for both key signing and zone signing (see the key-length caveat under DNS Configuration Gaps)
+
+### 5. Sensible Defaults
+- Variables have reasonable default values
+- Machine type (e2-small) appropriate for small workloads
+- Standard region/zone selection
+
+---
+
+## Critical Issues
+
+### 1. Hardcoded Service Account Email
+**Location:** `main.tf:53`, `main.tf:76`
+
+```terraform
+service_account {
+  email  = "456409048169-compute@developer.gserviceaccount.com"
+  scopes = [...]
+}
+```
+
+**Impact:**
+- Code is not portable across projects
+- Violates infrastructure-as-code principles
+- Will fail if used in different GCP projects
+
+**Fix:**
+Use Terraform data sources to dynamically fetch the default compute service account:
+```terraform
+data "google_compute_default_service_account" "default" {}
+
+service_account {
+  email  = data.google_compute_default_service_account.default.email
+  scopes = [...]
+}
+```
+
+### 2. Infrastructure Drift (Manual Configuration)
+**Location:** README.md notes manual setup of Docker, Nginx, Certbot
+
+**Impact:**
+- Infrastructure cannot be reproduced from code alone
+- Disaster recovery requires manual intervention and documentation
+- Team members cannot spin up identical environments
+- Configuration changes aren't tracked in version control
+
+**Fix:**
+- Add startup scripts to `metadata_startup_script` in VM resources
+- OR use Packer to create pre-configured VM images
+- OR use configuration management tools (Ansible, cloud-init)
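+
+A minimal sketch of the first option, assuming a Debian image and the Debian package names `docker.io`, `nginx`, `certbot`, and `python3-certbot-nginx`. The resource name, instance name, and package list are illustrative and not taken from `main.tf`; certificate issuance and the application containers would still need their own steps.
+
+```terraform
+# Hypothetical instance definition; only metadata_startup_script is the point here.
+resource "google_compute_instance" "gitea" {
+  name         = "gitea" # illustrative name
+  machine_type = "e2-small"
+  zone         = var.zone
+
+  boot_disk {
+    initialize_params {
+      image = "debian-cloud/debian-11"
+    }
+  }
+
+  network_interface {
+    network = "default"
+    access_config {}
+  }
+
+  # Runs on every boot, so each command must be safe to repeat.
+  metadata_startup_script = <<-EOT
+    #!/bin/bash
+    set -euo pipefail
+    export DEBIAN_FRONTEND=noninteractive
+    apt-get update
+    apt-get install -y docker.io nginx certbot python3-certbot-nginx
+    systemctl enable --now docker nginx
+  EOT
+}
+```
+
+Because the script reruns at each boot, anything beyond package installation (for example writing Nginx site files) should check for existing state before making changes.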
+
+### 3. Security Vulnerabilities
+
+#### a. Overly Permissive Firewall Rules
+**Location:** All firewall rules use `source_ranges = ["0.0.0.0/0"]`
+
+**Issues:**
+- Gitea port 3000 exposed to the internet (`main.tf:82-93`)
+- n8n port 5678 exposed to the internet (`main.tf:121-132`)
+- These should only be accessible via reverse proxy (Nginx)
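+
+One way to express this, sketched under the assumption that Nginx on each VM proxies to the application over localhost: delete the rules for ports 3000 and 5678 and keep a single rule for the web ports. The rule name and the `web` network tag are hypothetical.
+
+```terraform
+# Hypothetical replacement for the per-application rules: only the
+# reverse-proxy ports are reachable from the internet.
+resource "google_compute_firewall" "allow_web" {
+  name    = "allow-web" # illustrative name
+  network = "default"
+
+  allow {
+    protocol = "tcp"
+    ports    = ["80", "443"]
+  }
+
+  source_ranges = ["0.0.0.0/0"]
+  target_tags   = ["web"] # assumes the VMs carry a "web" network tag
+}
+```
+
+With no allow rule left for 3000 or 5678, the VPC's implied deny-ingress rule blocks them from outside. On the default network, the pre-populated internal allow rule would still permit access from other instances in the same VPC.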
+
+#### b. Using Default VPC
+**Location:** All resources use `network = "default"`
+
+**Issues:**
+- Default network ships with pre-populated firewall rules that allow SSH, RDP, and ICMP from any source, plus all traffic between instances
+- No network segmentation
+- Shared with other project resources
+- Difficult to implement security best practices
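+
+A minimal sketch of a dedicated custom-mode network. The resource names and the `10.10.0.0/24` range are assumptions chosen for illustration.
+
+```terraform
+# Custom-mode VPC: no subnets are created automatically.
+resource "google_compute_network" "services" {
+  name                    = "services-vpc" # illustrative name
+  auto_create_subnetworks = false
+}
+
+# Single regional subnet for the Gitea and n8n VMs.
+resource "google_compute_subnetwork" "services" {
+  name          = "services-subnet" # illustrative name
+  region        = var.region
+  network       = google_compute_network.services.id
+  ip_cidr_range = "10.10.0.0/24" # hypothetical range
+}
+```
+
+Instances would then set `subnetwork = google_compute_subnetwork.services.id` in their `network_interface` block. A custom network starts with only the implied deny-ingress and allow-egress rules, so every opening has to be declared explicitly.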
+
+#### c. No SSH Access Control
+- No explicit SSH firewall rules defined
+- Unless it has been removed, the default network's pre-populated `default-allow-ssh` rule accepts port 22 from `0.0.0.0/0`
+- No bastion host or IAP tunneling
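+
+A sketch of an explicit SSH rule that only admits Identity-Aware Proxy TCP forwarding, whose published source range is `35.235.240.0/20`. The rule name is hypothetical, and the sketch assumes any broader SSH rule on the network is removed.
+
+```terraform
+# Allow SSH only from the IAP TCP forwarding range.
+resource "google_compute_firewall" "allow_ssh_iap" {
+  name    = "allow-ssh-from-iap" # illustrative name
+  network = "default"
+
+  allow {
+    protocol = "tcp"
+    ports    = ["22"]
+  }
+
+  source_ranges = ["35.235.240.0/20"]
+}
+```
+
+Operators would then connect with `gcloud compute ssh INSTANCE --tunnel-through-iap`, which requires the `roles/iap.tunnelResourceAccessor` role.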
+
+### 4. No Data Persistence Strategy
+**Locations:** `main.tf:40-44`, `main.tf:64-68`
+
+**Issues:**
+- Boot disks use default sizing
+- No separate data volumes for application data
+- Gitea repositories and n8n workflows are stored on the boot disk, which by default is deleted along with the VM (`auto_delete = true`)
+- No backup configuration
+- Risk of data loss if VMs are recreated
+
+**Impact:**
+Critical user data (Git repositories, workflow configurations) could be lost during infrastructure updates.
+
+---
+
+## Notable Gaps
+
+### 5. Missing Startup Automation
+While intentionally omitted per README, this creates operational challenges:
+- New team members can't provision working infrastructure
+- Updates require manual SSH intervention
+- No automated application updates or patching
+
+### 6. Outdated Operating System
+**Location:** `main.tf:42`, `main.tf:66`
+
+Currently using `debian-cloud/debian-11` while Debian 12 is available and supported.
+
+### 7. No Monitoring or Alerting
+Missing operational visibility:
+- No Cloud Monitoring dashboards
+- No alerting for VM health, disk usage, or service availability
+- No log aggregation configuration
+- No uptime checks for the applications
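+
+As an illustration of the last point, an HTTPS uptime check might look like the following. The hostname `git.example.org` and the `project_id` variable are placeholders, not values from this project.
+
+```terraform
+# Assumed variable; omit if the project already declares one.
+variable "project_id" {
+  type = string
+}
+
+# Probe the public Gitea endpoint every five minutes.
+resource "google_monitoring_uptime_check_config" "gitea_https" {
+  display_name = "gitea-https" # illustrative name
+  timeout      = "10s"
+  period       = "300s"
+
+  http_check {
+    path         = "/"
+    port         = 443
+    use_ssl      = true
+    validate_ssl = true
+  }
+
+  monitored_resource {
+    type = "uptime_url"
+    labels = {
+      project_id = var.project_id
+      host       = "git.example.org" # placeholder hostname
+    }
+  }
+}
+```
+
+An alert policy and notification channel would still be needed before a failed check notifies anyone.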
+
+### 8. No High Availability or Auto-Healing
+Current setup has single points of failure:
+- Single VM per service
+- No managed instance groups
+- No application-level auto-restart on failure
+- No health checks
+
+### 9. DNS Configuration Gaps
+- Zone signing key uses 1024-bit RSA (should be 2048 bits)
+- No apex domain record defined (only subdomains)
+- TTL of 300 seconds is reasonable but not documented
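+
+A sketch of the zone with a 2048-bit zone signing key. The zone name and `example.org.` are placeholders, and `rsasha256` is an assumption, since the review only establishes that the keys are RSA.
+
+```terraform
+resource "google_dns_managed_zone" "primary" {
+  name     = "primary-zone" # illustrative name
+  dns_name = "example.org." # placeholder domain
+
+  dnssec_config {
+    state         = "on"
+    non_existence = "nsec3"
+
+    # Key signing key
+    default_key_specs {
+      algorithm  = "rsasha256" # assumed algorithm
+      key_length = 2048
+      key_type   = "keySigning"
+      kind       = "dns#dnsKeySpec"
+    }
+
+    # Zone signing key, raised from 1024 to 2048 bits
+    default_key_specs {
+      algorithm  = "rsasha256" # assumed algorithm
+      key_length = 2048
+      key_type   = "zoneSigning"
+      kind       = "dns#dnsKeySpec"
+    }
+  }
+}
+```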
+
+### 10. Missing Cost Optimization
+- No committed use discounts
+- Could use Spot VMs (the successor to preemptible VMs) for non-production
+- No resource tagging for cost allocation
+- No budgets or billing alerts
+
+---
+
+## Detailed Recommendations
+
+### High Priority (Do First)
+
+1. **Fix Hardcoded Service Account**
+   - Use data sources or variables
+   - Ensures portability across projects
+
+2. **Implement Application Provisioning**
+   - Add startup scripts with idempotent configuration
+   - Or create golden images with Packer
+   - Document all manual steps taken
+
+3. **Secure Firewall Rules**
+   - Remove public access to ports 3000 and 5678
+   - Restrict HTTP/HTTPS to Cloudflare IPs if using a CDN
+   - Add explicit SSH rules with IP allowlisting
+
+4. **Create Custom VPC**
+   - Separate network for these resources
+   - Proper subnet configuration
+   - Network tags for better organization
+
+5. **Add Persistent Data Disks**
+   ```terraform
+   resource "google_compute_disk" "gitea_data" {
+     name = "gitea-data"
+     size = 50
+     type = "pd-standard"
+     zone = var.zone
+   }
+   ```
+
+6. **Implement Backup Strategy**
+   - Scheduled snapshots for data disks
+   - Retention policies
+   - Test restore procedures
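+
+The data disk from item 5 only helps once it is attached to a VM. A minimal sketch, assuming the Gitea instance is declared as `google_compute_instance.gitea` (a hypothetical resource name):
+
+```terraform
+# Attach the separately managed data disk to the Gitea VM.
+resource "google_compute_attached_disk" "gitea_data" {
+  disk     = google_compute_disk.gitea_data.id
+  instance = google_compute_instance.gitea.id # assumed resource name
+}
+```
+
+When this resource is used, the provider documentation calls for `lifecycle { ignore_changes = [attached_disk] }` on the instance so the two resources do not fight over the attachment. The disk still has to be formatted and mounted inside the VM, and the application data directories pointed at the mount.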
+
+### Medium Priority (Important but Not Urgent)
+
+7. **Add Monitoring and Alerting**
+   - Cloud Monitoring dashboards for VM metrics
+   - Uptime checks for services
+   - Alert policies for disk usage, CPU, memory
+   - Email/Slack notifications
+
+8. **Upgrade Operating System**
+   - Change to `debian-cloud/debian-12`
+   - Test application compatibility first
+
+9. **Improve DNS Configuration**
+   - Increase zone signing key to 2048 bits
+   - Add apex domain record if needed
+   - Consider lower TTL during migrations
+
+10. **Add Lifecycle Management**
+    ```terraform
+    lifecycle {
+      prevent_destroy = true  # For production
+      ignore_changes  = [metadata_startup_script]
+    }
+    ```
+
+11. **Implement Better Secrets Management**
+    - Use Secret Manager for application secrets
+    - Grant VMs access via service account
+    - Avoid hardcoding credentials
+
+12. **Add Resource Labels**
+    ```terraform
+    labels = {
+      environment = "production"
+      service     = "gitea"
+      managed_by  = "terraform"
+      cost_center = "engineering"
+    }
+    ```
+
+### Low Priority (Nice to Have)
+
+13. **Modularize Terraform Code**
+    - Create reusable modules for VM + DNS pattern
+    - Separate module for firewall rules
+    - Easier to maintain and extend
+
+14. **Add terraform.tfvars.example**
+    - Document required variables
+    - Provide example values
+    - Help new team members get started
+
+15. **Consider Terragrunt**
+    - If planning multi-environment setup (dev/staging/prod)
+    - DRY configuration management
+    - Environment-specific overrides
+
+16. **Implement CI/CD for Terraform**
+    - Automated `terraform plan` on PRs
+    - Automated `terraform apply` after merge
+    - State locking verification
+
+17. **Add Pre-commit Hooks**
+    - Run `terraform fmt` automatically
+    - Run `terraform validate`
+    - Run security scanning (tfsec, checkov)
+
+18. **Consider Managed Services**
+    - Cloud Run for containerized apps (simpler than VMs)
+    - Cloud SQL if databases are needed
+    - Cloud Storage for artifacts/backups
+
+---
+
+## Security Recommendations Summary
+
+### Immediate Actions Required:
+1. Close ports 3000 and 5678 to public internet
+2. Implement IP allowlisting for SSH access
+3. Create custom VPC with proper firewall rules
+4. Enable VPC Flow Logs for security monitoring
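+5. Implement Cloud Armor for DDoS protection (its security policies attach to load balancer backends, so this typically means fronting the VMs with an external Application Load Balancer)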
+
+### Additional Security Measures:
+- Enable OS Login for SSH key management
+- Use Identity-Aware Proxy (IAP) for VM access
+- Implement least-privilege service account permissions
+- Enable audit logging for all resources
+- Regular security scanning with Web Security Scanner (formerly Cloud Security Scanner)
+- Implement Web Application Firewall (WAF) rules
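+
+OS Login, the first measure above, can be switched on project-wide through a metadata entry. A minimal sketch; the resource name is illustrative.
+
+```terraform
+# Enable OS Login for every VM in the project.
+resource "google_compute_project_metadata_item" "enable_oslogin" {
+  key   = "enable-oslogin"
+  value = "TRUE"
+}
+```
+
+Users then need `roles/compute.osLogin` or `roles/compute.osAdminLogin` to connect, and SSH keys stored in instance or project metadata stop being honored.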
+
+---
+
+## Data Protection Recommendations
+
+### Backup Strategy:
+```terraform
+# Example snapshot schedule
+resource "google_compute_resource_policy" "daily_backup" {
+  name   = "daily-backup-policy"
+  region = var.region
+
+  snapshot_schedule_policy {
+    schedule {
+      daily_schedule {
+        days_in_cycle = 1
+        start_time    = "04:00"
+      }
+    }
+    retention_policy {
+      max_retention_days    = 14
+      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
+    }
+  }
+}
+```
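+
+The schedule takes effect only once it is attached to a disk. A sketch, assuming the data disk is declared as `google_compute_disk.gitea_data` as in the persistent-disk recommendation; the attachment's resource name is illustrative.
+
+```terraform
+# Bind the daily snapshot policy to the Gitea data disk.
+resource "google_compute_disk_resource_policy_attachment" "gitea_data_backup" {
+  name = google_compute_resource_policy.daily_backup.name
+  disk = google_compute_disk.gitea_data.name
+  zone = var.zone
+}
+```
+
+The policy's region must contain the disk's zone, so `var.region` and `var.zone` need to agree.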
+
+### Disaster Recovery Plan:
+1. Capture the manual setup steps in a machine-readable format (for example cloud-init)
+2. Test VM restoration from snapshots quarterly
+3. Maintain off-site backups of critical data
+4. Document RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
+
+---
+
+## Estimated Effort for Improvements
+
+| Priority | Task Category | Estimated Time |
+|----------|---------------|----------------|
+| High | Security fixes | 4-6 hours |
+| High | Data persistence | 2-3 hours |
+| High | Service account fix | 30 minutes |
+| High | Startup scripts | 4-8 hours |
+| Medium | Monitoring setup | 3-4 hours |
+| Medium | OS upgrade | 1-2 hours |
+| Low | Modularization | 6-8 hours |
+| Low | CI/CD pipeline | 4-6 hours |
+
+**Total effort for high-priority items:** ~11-18 hours
+**Total effort for all recommendations:** ~30-40 hours
+
+---
+
+## Conclusion
+
+This infrastructure demonstrates a good starting point for managing cloud resources with Terraform. The code is readable, well-documented, and follows basic IaC principles. However, significant work is needed to make this production-ready.
+
+The most critical issue is the **hybrid approach** where infrastructure is managed by Terraform but application configuration is manual. This creates a maintenance burden and makes disaster recovery difficult.
+
+### Recommended Next Steps:
+1. Start with security fixes (firewall rules, custom VPC)
+2. Add persistent data disks and backups
+3. Automate application provisioning
+4. Implement monitoring and alerting
+5. Create runbooks for common operations
+
+With these improvements, this infrastructure could achieve a production-readiness score of 8/10 and provide a solid foundation for the Sing For Hope organization's DevOps needs.
+
+---
+
+## Questions for Further Discussion
+
+1. What is the expected traffic volume for these services?
+2. What are the RTO/RPO requirements for disaster recovery?
+3. Is high availability required, or is some downtime acceptable?
+4. What is the budget for infrastructure costs?
+5. Are there compliance requirements (HIPAA, SOC2, etc.)?
+6. Who will be responsible for ongoing maintenance?
+7. Are there plans to add more services to this infrastructure?
+
+---
+
+**Note:** This review is based on the code as of November 9, 2025. As infrastructure evolves, periodic reviews should be conducted to ensure continued alignment with best practices and organizational needs.