# Infrastructure Review: GCP Terraform for Sing For Hope

**Review Date:** November 9, 2025
**Reviewer:** Claude Code
**Project:** GCP Infrastructure for Gitea and n8n

---

## Executive Summary

This Terraform project manages GCP infrastructure for hosting Gitea (self-hosted Git) and n8n (workflow automation) on separate VM instances. While the code is clean and functional, it has several critical issues that limit production-readiness, particularly around security, data persistence, and infrastructure-as-code completeness.

**Overall Ratings:**
- Production-Readiness: **4/10**
- Code Quality: **6/10**
- Documentation: **7/10**
- Maintainability: **5/10**
- Security: **3/10**

---

## Strengths

### 1. Clean Project Structure
The project follows Terraform best practices with proper file organization:
- Separate files for variables, outputs, and backend configuration
- Clear naming conventions
- Logical resource grouping

### 2. Good Documentation
The README.md provides:
- Clear service descriptions
- Prerequisite checklist
- Step-by-step usage instructions
- Honest acknowledgment of manual setup steps

### 3. Remote State Management
- Uses GCS backend for state storage
- Prevents state conflicts in team environments
- Enables state locking

### 4. DNS Security
- DNSSEC enabled with appropriate configuration
- Uses NSEC3 for non-existence proof
- Explicit key specifications for both key-signing and zone-signing keys

### 5. Sensible Defaults
- Variables have reasonable default values
- Machine type (e2-small) appropriate for small workloads
- Standard region/zone selection

---

## Critical Issues

### 1. Hardcoded Service Account Email
**Location:** `main.tf:53`, `main.tf:76`

```terraform
# Current: the project-specific default compute service account is hardcoded
service_account {
  email  = "456409048169-compute@developer.gserviceaccount.com"
  scopes = [...]
}
```

**Impact:**
- Code is not portable across projects
- Violates infrastructure-as-code principles
- Will fail if used in different GCP projects

**Fix:**
Use a Terraform data source to look up the default compute service account of whichever project the provider targets:
```terraform
# Top level: resolves the default compute service account at plan time
data "google_compute_default_service_account" "default" {}

# Inside each VM resource: reference the data source instead of a literal email
service_account {
  email  = data.google_compute_default_service_account.default.email
  scopes = [...]
}
```

### 2. Infrastructure Drift (Manual Configuration)
**Location:** README.md notes manual setup of Docker, Nginx, Certbot

**Impact:**
- Infrastructure cannot be reproduced from code alone
- Disaster recovery requires manual intervention and documentation
- Team members cannot spin up identical environments
- Configuration changes aren't tracked in version control

**Fix:**
- Add startup scripts to `metadata_startup_script` in VM resources
- OR use Packer to create pre-configured VM images
- OR use configuration management tools (Ansible, cloud-init)
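
The first option could look roughly like the sketch below. It is a minimal illustration, not the project's actual configuration: the resource name `gitea_example`, the instance name, and the package list are assumptions, and `var.zone` is assumed to be the zone variable the project already defines. The script only installs the tools the README lists as manual steps; application configuration and certificate issuance would still need to be added.

```terraform
# Hypothetical sketch: names and package choices are assumptions, not taken from main.tf.
resource "google_compute_instance" "gitea_example" {
  name         = "gitea-example"
  machine_type = "e2-small"
  zone         = var.zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
    access_config {} # ephemeral external IP
  }

  # GCE runs this script on every boot, so each step must be safe to repeat.
  metadata_startup_script = <<-EOT
    #!/bin/bash
    set -euo pipefail
    export DEBIAN_FRONTEND=noninteractive

    # Install Docker, Nginx, and Certbot only if Docker is not present yet.
    if ! command -v docker >/dev/null 2>&1; then
      apt-get update
      apt-get install -y docker.io nginx certbot python3-certbot-nginx
    fi

    systemctl enable --now docker nginx
  EOT
}
```

Because the script re-runs at each boot, guarding the install step keeps reboots fast and avoids unintended upgrades. Changing `metadata_startup_script` forces the instance to be replaced, which is one more reason to move application data off the boot disk first.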

### 3. Security Vulnerabilities

#### a. Overly Permissive Firewall Rules
**Location:** All firewall rules use `source_ranges = ["0.0.0.0/0"]`

**Issues:**
- Gitea port 3000 exposed to the internet (`main.tf:82-93`)
- n8n port 5678 exposed to the internet (`main.tf:121-132`)
- These ports should be reachable only through the Nginx reverse proxy, not directly from the internet
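
A sketch of the intended end state, assuming the rules that open ports 3000 and 5678 are deleted and Nginx terminates all public traffic. The resource name and the `web` network tag are hypothetical; the instances would need `tags = ["web"]` for the rule to apply to them.

```terraform
# Hypothetical sketch: only the reverse proxy ports are reachable from the internet.
resource "google_compute_firewall" "allow_web_example" {
  name    = "allow-web-example"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = ["0.0.0.0/0"]

  # Applies only to instances that carry this network tag (assumed name).
  target_tags = ["web"]
}
```

With no rule allowing 3000 or 5678, GCP's implied deny-ingress rule blocks direct access to the application ports, while Nginx can still reach them over localhost.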

#### b. Using Default VPC
**Location:** All resources use `network = "default"`

**Issues:**
- The default network is an auto mode network that ships with permissive pre-populated firewall rules and a subnet in every region
- No network segmentation
- Shared with other project resources
- Difficult to implement security best practices
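
A dedicated network could be sketched as follows. The resource names and the CIDR range are assumptions chosen for illustration, and `var.region` is assumed to be the region variable the project already defines. The `log_config` block also turns on VPC Flow Logs for the subnet.

```terraform
# Hypothetical sketch: a custom mode VPC with one explicitly defined subnet.
resource "google_compute_network" "apps_example" {
  name                    = "apps-example"
  auto_create_subnetworks = false # no automatic subnet in every region
}

resource "google_compute_subnetwork" "apps_example" {
  name          = "apps-example-subnet"
  region        = var.region
  network       = google_compute_network.apps_example.id
  ip_cidr_range = "10.10.0.0/24" # assumed range

  # Enables VPC Flow Logs on this subnet.
  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}
```

A custom mode network starts with no ingress-allowing firewall rules, so every opening has to be declared in Terraform, which is the main benefit over the default network.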

#### c. No SSH Access Control
- No explicit SSH firewall rules defined
- The default network's pre-populated `default-allow-ssh` rule, if still present, allows TCP port 22 from `0.0.0.0/0`
- No bastion host or IAP tunneling
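
One way to close this gap is to allow SSH only from Identity-Aware Proxy's TCP forwarding range. The sketch below assumes the instances stay on the `default` network; the resource name is hypothetical.

```terraform
# Hypothetical sketch: SSH is reachable only through IAP TCP forwarding.
resource "google_compute_firewall" "allow_ssh_iap_example" {
  name    = "allow-ssh-from-iap-example"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # 35.235.240.0/20 is the source range Google documents for IAP TCP forwarding.
  source_ranges = ["35.235.240.0/20"]
}
```

This rule only helps once the broader `default-allow-ssh` rule has been removed. Users then connect with `gcloud compute ssh --tunnel-through-iap` and need the IAP-secured Tunnel User role.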

### 4. No Data Persistence Strategy
**Locations:** `main.tf:40-44`, `main.tf:64-68`

**Issues:**
- Boot disks use default sizing
- No separate data volumes for application data
- Gitea repositories and n8n workflows are stored on the boot disk, which is deleted together with the instance by default (`auto_delete = true`)
- No backup configuration
- Risk of data loss if VMs are recreated

**Impact:**
Critical user data (Git repositories, workflow configurations) could be lost whenever an infrastructure update forces Terraform to replace a VM.
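
A minimal sketch of separating application data from the boot disk. The disk size, the resource names ending in `_example`, and the instance address `google_compute_instance.gitea` are assumptions; substitute the names actually used in `main.tf`.

```terraform
# Hypothetical sketch: a data disk whose lifecycle is independent of the VM.
resource "google_compute_disk" "gitea_data_example" {
  name = "gitea-data-example"
  type = "pd-standard"
  size = 50 # GB, assumed
  zone = var.zone

  lifecycle {
    prevent_destroy = true # refuse plans that would delete the data
  }
}

resource "google_compute_attached_disk" "gitea_data_example" {
  disk        = google_compute_disk.gitea_data_example.id
  instance    = google_compute_instance.gitea.id # assumed resource name
  device_name = "gitea-data"
}
```

When `google_compute_attached_disk` is used, the instance resource needs `lifecycle { ignore_changes = [attached_disk] }` so the two resources do not fight over the attachment. The disk still has to be formatted and mounted inside the VM, and Gitea pointed at the mount path.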

---

## Notable Gaps

### 5. Missing Startup Automation
While intentionally omitted per README, this creates operational challenges:
- New team members can't provision working infrastructure
- Updates require manual SSH intervention
- No automated application updates or patching

### 6. Outdated Operating System
**Location:** `main.tf:42`, `main.tf:66`

Currently using `debian-cloud/debian-11`, while Debian 12 is available and supported. Debian 11 left regular security support in August 2024 and is now covered only by the Debian LTS effort.
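
Instead of repeating the image string in each instance, the image family could be resolved once through a data source. The names below are hypothetical, and the output exists only to show how the value is consumed.

```terraform
# Hypothetical sketch: resolves to the newest image in the debian-12 family at plan time.
data "google_compute_image" "debian_example" {
  family  = "debian-12"
  project = "debian-cloud"
}

# Illustration only: exposes the resolved image so it can be inspected after apply.
output "debian_image_self_link_example" {
  value = data.google_compute_image.debian_example.self_link
}
```

Inside `boot_disk.initialize_params`, the instance would then set `image = data.google_compute_image.debian_example.self_link`. One trade-off: whenever the family publishes a new image, the next plan proposes replacing the VM, so this pattern is safest once application data lives on a separate disk.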

### 7. No Monitoring or Alerting
Missing operational visibility:
- No Cloud Monitoring dashboards
- No alerting for VM health, disk usage, or service availability
- No log aggregation configuration
- No uptime checks for the applications
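
An HTTPS uptime check is a small first step toward closing this gap. In the sketch below, the hostname `gitea.example.org` is a placeholder, and `var.project_id` is an assumed variable that may be named differently in this project.

```terraform
# Hypothetical sketch: probes the public Gitea endpoint every five minutes.
resource "google_monitoring_uptime_check_config" "gitea_https_example" {
  display_name = "gitea-https-example"
  timeout      = "10s"
  period       = "300s"

  http_check {
    path         = "/"
    port         = 443
    use_ssl      = true
    validate_ssl = true # also fails the check when the certificate is invalid
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id      # assumed variable name
      host       = "gitea.example.org" # placeholder hostname
    }
  }
}
```

An uptime check alone notifies nobody. It becomes useful once a `google_monitoring_alert_policy` and a notification channel reference it.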

### 8. No High Availability or Auto-Healing
Current setup has single points of failure:
- Single VM per service
- No managed instance groups
- No auto-restart on failure
- No health checks

### 9. DNS Configuration Gaps
- Zone signing key uses 1024-bit RSA (should be 2048 bits)
- No apex domain record defined (only subdomains)
- TTL of 300 seconds is reasonable but not documented
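
For the key-size gap, the `dnssec_config` block with a 2048-bit zone-signing key might look like the sketch below. The zone name and DNS name are placeholders, and the algorithm choice is an assumption rather than a reading of the current configuration.

```terraform
# Hypothetical sketch: DNSSEC with 2048-bit keys for both key types.
resource "google_dns_managed_zone" "example" {
  name     = "example-zone"
  dns_name = "example.org." # placeholder; note the trailing dot

  dnssec_config {
    state         = "on"
    non_existence = "nsec3"

    default_key_specs {
      algorithm  = "rsasha256"
      key_length = 2048
      key_type   = "keySigning"
    }

    default_key_specs {
      algorithm  = "rsasha256"
      key_length = 2048
      key_type   = "zoneSigning" # raised from the 1024 bits noted above
    }
  }
}
```

Changing key specs on a zone that is already signed may require temporarily turning DNSSEC off, so it is worth checking the Cloud DNS documentation and planning the registrar DS record update before applying.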

### 10. Missing Cost Optimization
- No committed use discounts
- Could use Spot VMs (the successor to preemptible VMs) for non-production
- No resource labels for cost allocation
- No budgets or billing alerts

---

## Detailed Recommendations

### High Priority (Do First)

1. **Fix Hardcoded Service Account**
   - Use data sources or variables
   - Ensures portability across projects

2. **Implement Application Provisioning**
   - Add startup scripts with idempotent configuration
   - Or create golden images with Packer
   - Document all manual steps taken

3. **Secure Firewall Rules**
   - Remove public access to ports 3000 and 5678
   - Restrict HTTP/HTTPS to Cloudflare IPs if using CDN
   - Add explicit SSH rules with IP allowlisting

4. **Create Custom VPC**
   - Separate network for these resources
   - Proper subnet configuration
   - Network tags for better organization

5. **Add Persistent Data Disks**
   ```terraform
   resource "google_compute_disk" "gitea_data" {
     name = "gitea-data"
     size = 50
     type = "pd-standard"
     zone = var.zone
   }
   ```

6. **Implement Backup Strategy**
   - Scheduled snapshots for data disks
   - Retention policies
   - Test restore procedures

### Medium Priority (Important but Not Urgent)

7. **Add Monitoring and Alerting**
   - Cloud Monitoring dashboards for VM metrics
   - Uptime checks for services
   - Alert policies for disk usage, CPU, memory
   - Email/Slack notifications

8. **Upgrade Operating System**
   - Change to `debian-cloud/debian-12`
   - Test application compatibility first

9. **Improve DNS Configuration**
   - Increase zone signing key to 2048 bits
   - Add apex domain record if needed
   - Consider lower TTL during migrations

10. **Add Lifecycle Management**
    ```terraform
    lifecycle {
      prevent_destroy = true # For production
      ignore_changes  = [metadata_startup_script]
    }
    ```

11. **Implement Better Secrets Management**
    - Use Secret Manager for application secrets
    - Grant VMs access via service account
    - Avoid hardcoding credentials

12. **Add Resource Labels**
    ```terraform
    labels = {
      environment = "production"
      service     = "gitea"
      managed_by  = "terraform"
      cost_center = "engineering"
    }
    ```

### Low Priority (Nice to Have)

13. **Modularize Terraform Code**
    - Create reusable modules for VM + DNS pattern
    - Separate module for firewall rules
    - Easier to maintain and extend

14. **Add terraform.tfvars.example**
    - Document required variables
    - Provide example values
    - Help new team members get started

15. **Consider Terragrunt**
    - If planning multi-environment setup (dev/staging/prod)
    - DRY configuration management
    - Environment-specific overrides

16. **Implement CI/CD for Terraform**
    - Automated `terraform plan` on PRs
    - Automated `terraform apply` after merge
    - State locking verification

17. **Add Pre-commit Hooks**
    - Run `terraform fmt` automatically
    - Run `terraform validate`
    - Run security scanning (tfsec, checkov)

18. **Consider Managed Services**
    - Cloud Run for containerized apps (simpler than VMs)
    - Cloud SQL if databases are needed
    - Cloud Storage for artifacts/backups

---

## Security Recommendations Summary

### Immediate Actions Required:
1. Close ports 3000 and 5678 to public internet
2. Implement IP allowlisting for SSH access
3. Create custom VPC with proper firewall rules
4. Enable VPC Flow Logs for security monitoring
5. Consider Cloud Armor for DDoS protection (its security policies generally attach to load balancer backend services, so this implies placing the VMs behind an external Application Load Balancer)

### Additional Security Measures:
- Enable OS Login for SSH key management
- Use Identity-Aware Proxy (IAP) for VM access
- Implement least-privilege service account permissions
- Enable audit logging for all resources
- Regular security scanning with Web Security Scanner (formerly Cloud Security Scanner)
- Implement Web Application Firewall (WAF) rules

---

## Data Protection Recommendations

### Backup Strategy:
```terraform
# Example snapshot schedule
resource "google_compute_resource_policy" "daily_backup" {
  name   = "daily-backup-policy"
  region = var.region

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1
        start_time    = "04:00"
      }
    }
    retention_policy {
      max_retention_days    = 14
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }
  }
}
```
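
A snapshot schedule takes effect only once it is attached to a disk. The sketch below assumes the `gitea_data` disk from the "Add Persistent Data Disks" recommendation and the `daily_backup` policy above both exist in the same configuration; the attachment's resource name is hypothetical.

```terraform
# Hypothetical sketch: binds the daily snapshot schedule to the Gitea data disk.
resource "google_compute_disk_resource_policy_attachment" "gitea_data_backup_example" {
  name = google_compute_resource_policy.daily_backup.name
  disk = google_compute_disk.gitea_data.name
  zone = var.zone
}
```

The policy's region has to contain the disk's zone, so `var.region` and `var.zone` need to agree. A second attachment of the same shape would cover an n8n data disk.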

### Disaster Recovery Plan:
1. Document manual setup steps in code-readable format (cloud-init)
2. Test VM restoration from snapshots quarterly
3. Maintain off-site backups of critical data
4. Document RTO (Recovery Time Objective) and RPO (Recovery Point Objective)

---

## Estimated Effort for Improvements

| Priority | Task Category       | Estimated Time |
|----------|---------------------|----------------|
| High     | Security fixes      | 4-6 hours      |
| High     | Data persistence    | 2-3 hours      |
| High     | Service account fix | 30 minutes     |
| High     | Startup scripts     | 4-8 hours      |
| Medium   | Monitoring setup    | 3-4 hours      |
| Medium   | OS upgrade          | 1-2 hours      |
| Low      | Modularization      | 6-8 hours      |
| Low      | CI/CD pipeline      | 4-6 hours      |

**Total effort for high-priority items:** ~11-18 hours (the sum of the four High rows above)
**Total effort for all recommendations:** ~30-40 hours

---

## Conclusion

This infrastructure demonstrates a good starting point for managing cloud resources with Terraform. The code is readable, well-documented, and follows basic IaC principles. However, significant work is needed to make this production-ready.

The most critical issue is the **hybrid approach** where infrastructure is managed by Terraform but application configuration is manual. This creates a maintenance burden and makes disaster recovery difficult.

### Recommended Next Steps:
1. Start with security fixes (firewall rules, custom VPC)
2. Add persistent data disks and backups
3. Automate application provisioning
4. Implement monitoring and alerting
5. Create runbooks for common operations

With these improvements, this infrastructure could achieve a production-readiness score of 8/10 and provide a solid foundation for the Sing For Hope organization's DevOps needs.

---

## Questions for Further Discussion

1. What is the expected traffic volume for these services?
2. What are the RTO/RPO requirements for disaster recovery?
3. Is high availability required, or is some downtime acceptable?
4. What is the budget for infrastructure costs?
5. Are there compliance requirements (HIPAA, SOC2, etc.)?
6. Who will be responsible for ongoing maintenance?
7. Are there plans to add more services to this infrastructure?

---

**Note:** This review is based on the code as of November 9, 2025. As infrastructure evolves, periodic reviews should be conducted to ensure continued alignment with best practices and organizational needs.