added infra
infra/gcp/REVIEW.md
# Infrastructure Review: GCP Terraform for Sing For Hope

**Review Date:** November 9, 2025
**Reviewer:** Claude Code
**Project:** GCP Infrastructure for Gitea and n8n

---

## Executive Summary

This Terraform project manages GCP infrastructure for hosting Gitea (self-hosted Git) and n8n (workflow automation) on separate VM instances. While the code is clean and functional, it has several critical issues that limit production-readiness, particularly around security, data persistence, and infrastructure-as-code completeness.

**Overall Ratings:**
- Production-Readiness: **4/10**
- Code Quality: **6/10**
- Documentation: **7/10**
- Maintainability: **5/10**
- Security: **3/10**

---

## Strengths

### 1. Clean Project Structure
The project follows Terraform best practices with proper file organization:
- Separate files for variables, outputs, and backend configuration
- Clear naming conventions
- Logical resource grouping

### 2. Good Documentation
The README.md provides:
- Clear service descriptions
- Prerequisite checklist
- Step-by-step usage instructions
- Honest acknowledgment of manual setup steps

### 3. Remote State Management
- Uses GCS backend for state storage
- Prevents state conflicts in team environments
- Enables state locking

### 4. DNS Security
- DNSSEC enabled with appropriate configuration
- Uses NSEC3 for non-existence proof
- Explicit key specifications for key signing and zone signing

### 5. Sensible Defaults
- Variables have reasonable default values
- Machine type (e2-small) is appropriate for small workloads
- Standard region/zone selection

---

## Critical Issues

### 1. Hardcoded Service Account Email
**Location:** `main.tf:53`, `main.tf:76`

```terraform
# Inside each google_compute_instance resource
service_account {
  email  = "456409048169-compute@developer.gserviceaccount.com"
  scopes = [...]
}
```

**Impact:**
- Code is not portable across projects
- Violates infrastructure-as-code principles
- Will fail if used in a different GCP project

**Fix:**
Use a Terraform data source to look up the default compute service account dynamically:
```terraform
# Resolves the default Compute Engine service account of the current project
data "google_compute_default_service_account" "default" {}

# Inside each google_compute_instance resource
service_account {
  email  = data.google_compute_default_service_account.default.email
  scopes = [...]
}
```

### 2. Infrastructure Drift (Manual Configuration)
**Location:** README.md notes manual setup of Docker, Nginx, Certbot

**Impact:**
- Infrastructure cannot be reproduced from code alone
- Disaster recovery requires manual intervention and documentation
- Team members cannot spin up identical environments
- Configuration changes aren't tracked in version control

**Fix:**
- Add startup scripts to `metadata_startup_script` in VM resources
- OR use Packer to create pre-configured VM images
- OR use configuration management tools (Ansible, cloud-init)

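The first option can be sketched as follows. This is a minimal example under stated assumptions, not the project's actual configuration: the resource name `gitea_example`, the instance name, and the package list are hypothetical, and `var.zone` is assumed to be the project's existing zone variable.

```terraform
# Hypothetical instance that provisions itself through a startup script.
# Resource name, instance name, and package list are illustrative assumptions.
resource "google_compute_instance" "gitea_example" {
  name         = "gitea-example"
  machine_type = "e2-small"
  zone         = var.zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
    access_config {} # ephemeral external IP
  }

  # Runs as root on every boot, so each step must be safe to repeat.
  metadata_startup_script = <<-EOT
    #!/bin/bash
    set -euo pipefail
    export DEBIAN_FRONTEND=noninteractive
    apt-get update
    apt-get install -y docker.io nginx certbot python3-certbot-nginx
    systemctl enable --now docker nginx
  EOT
}
```

In the Google provider, changing `metadata_startup_script` forces the instance to be recreated, so application data should live on a separate disk before this approach is adopted.
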
### 3. Security Vulnerabilities

#### a. Overly Permissive Firewall Rules
**Location:** All firewall rules use `source_ranges = ["0.0.0.0/0"]`

**Issues:**
- Gitea port 3000 exposed to the internet (`main.tf:82-93`)
- n8n port 5678 exposed to the internet (`main.tf:121-132`)
- These application ports should be reachable only through the reverse proxy (Nginx) on ports 80 and 443

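A sketch of the tightened rule set, assuming Nginx terminates traffic on ports 80 and 443 and proxies to Gitea and n8n over localhost. The rule name and the `web` network tag are hypothetical; the instances would need that tag, and the existing rules for ports 3000 and 5678 would be removed.

```terraform
# Only the reverse proxy ports are reachable from the internet.
# Rule name and target tag are illustrative assumptions.
resource "google_compute_firewall" "allow_web" {
  name      = "allow-web"
  network   = "default"
  direction = "INGRESS"

  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["web"]
}
```

With this in place, Gitea and n8n can bind to `127.0.0.1` so that their ports are unreachable even if a broader firewall rule is added later by mistake.
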
#### b. Using Default VPC
**Location:** All resources use `network = "default"`

**Issues:**
- The default network ships with pre-populated firewall rules (`default-allow-ssh`, `default-allow-rdp`, `default-allow-icmp`, `default-allow-internal`) that are broader than these services need
- No network segmentation
- Shared with other project resources
- Difficult to implement security best practices

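A dedicated network could look roughly like this. The names and the CIDR range are hypothetical, and `var.region` is assumed to be the project's existing region variable. The `log_config` block also turns on VPC Flow Logs for the subnet.

```terraform
# Custom-mode VPC: no subnets or firewall rules are created implicitly.
resource "google_compute_network" "services" {
  name                    = "services-vpc"
  auto_create_subnetworks = false
}

# Single subnet for the Gitea and n8n VMs (CIDR is an illustrative choice).
resource "google_compute_subnetwork" "services" {
  name          = "services-subnet"
  region        = var.region
  network       = google_compute_network.services.id
  ip_cidr_range = "10.10.0.0/24"

  # VPC Flow Logs for security monitoring
  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}
```

Instances would then reference `google_compute_subnetwork.services.id` in their `network_interface` block instead of `network = "default"`. A custom-mode network starts with no ingress allowed, so every firewall rule has to be declared explicitly.
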
#### c. No SSH Access Control
- No explicit SSH firewall rules defined
- The default network's `default-allow-ssh` rule, if still present, admits port 22 from any address
- No bastion host or IAP tunneling

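One way to close this gap is to allow SSH only from Identity-Aware Proxy. The sketch below assumes IAP TCP forwarding is used for administration; `35.235.240.0/20` is the source range Google documents for IAP TCP forwarding, while the rule name and the `ssh-iap` tag are hypothetical.

```terraform
# SSH is reachable only through IAP TCP forwarding.
# Rule name and target tag are illustrative assumptions.
resource "google_compute_firewall" "allow_ssh_iap" {
  name      = "allow-ssh-from-iap"
  network   = "default"
  direction = "INGRESS"

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  source_ranges = ["35.235.240.0/20"]
  target_tags   = ["ssh-iap"]
}
```

Administrators would then connect with `gcloud compute ssh INSTANCE --tunnel-through-iap`. On the default network, the pre-populated `default-allow-ssh` rule has to be deleted separately for this restriction to take effect.
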
### 4. No Data Persistence Strategy
**Locations:** `main.tf:40-44`, `main.tf:64-68`

**Issues:**
- Boot disks use default sizing
- No separate data volumes for application data
- Gitea repositories and n8n workflows are stored on the boot disk, which is deleted together with the instance by default (`auto_delete = true`)
- No backup configuration
- Risk of data loss if VMs are recreated

**Impact:**
Critical user data (Git repositories, workflow configurations) could be lost during infrastructure updates.

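A sketch of moving application data onto a disk with its own lifecycle, shown here for n8n. The variable `n8n_instance_id`, the disk name, the size, and the disk type are all hypothetical; `var.zone` is assumed to be the project's existing zone variable.

```terraform
# Hypothetical input: ID or self link of the existing n8n VM.
variable "n8n_instance_id" {
  description = "ID or self link of the n8n compute instance"
  type        = string
}

# Data disk that survives instance recreation.
resource "google_compute_disk" "n8n_data" {
  name = "n8n-data"
  type = "pd-balanced"
  size = 20 # GB, illustrative
  zone = var.zone

  lifecycle {
    prevent_destroy = true
  }
}

# Attach the disk without declaring it inside the instance resource.
resource "google_compute_attached_disk" "n8n_data" {
  disk     = google_compute_disk.n8n_data.id
  instance = var.n8n_instance_id
  zone     = var.zone
}
```

When `google_compute_attached_disk` is used, the provider documentation recommends adding `lifecycle { ignore_changes = [attached_disk] }` to the instance so the two resources do not fight over the attachment. The disk still has to be formatted and mounted inside the VM, for example from the startup script.
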
---

## Notable Gaps

### 5. Missing Startup Automation
While intentionally omitted per README, this creates operational challenges:
- New team members can't provision working infrastructure
- Updates require manual SSH intervention
- No automated application updates or patching

### 6. Outdated Operating System
**Location:** `main.tf:42`, `main.tf:66`

Currently using `debian-cloud/debian-11` while Debian 12 is available and supported.

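Rather than hardcoding the image string, the image family can be resolved through a data source. This is a sketch; the data source name `debian` and the output name are hypothetical.

```terraform
# Resolves the latest image in the Debian 12 family.
data "google_compute_image" "debian" {
  family  = "debian-12"
  project = "debian-cloud"
}

# Inside boot_disk.initialize_params, the instances would then use:
#   image = data.google_compute_image.debian.self_link
output "debian_image" {
  value = data.google_compute_image.debian.self_link
}
```

One trade-off: each new image in the family changes the resolved value, and a changed boot image forces the instance to be replaced. Pinning a specific image, or ignoring changes to the boot disk image in a `lifecycle` block, avoids surprise rebuilds.
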
### 7. No Monitoring or Alerting
Missing operational visibility:
- No Cloud Monitoring dashboards
- No alerting for VM health, disk usage, or service availability
- No log aggregation configuration
- No uptime checks for the applications

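An uptime check is the cheapest of these to add. The sketch below is an assumption-laden example: the hostname `git.example.org`, the variable `project_id`, and the check name are hypothetical and should be replaced with the project's own values.

```terraform
# Hypothetical variable; reuse the project's own project variable if one exists.
variable "project_id" {
  description = "GCP project that hosts the services"
  type        = string
}

# HTTPS uptime check against the public Gitea endpoint.
resource "google_monitoring_uptime_check_config" "gitea_https" {
  display_name = "gitea-https"
  timeout      = "10s"
  period       = "300s"

  http_check {
    path         = "/"
    port         = 443
    use_ssl      = true
    validate_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "git.example.org" # placeholder hostname
    }
  }
}
```

Because `validate_ssl` is enabled, the check also fails when the Certbot-managed certificate expires, which covers a failure mode the manual setup is prone to. An alert policy would still be needed to turn failed checks into notifications.
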
### 8. No High Availability or Auto-Healing
Current setup has single points of failure:
- Single VM per service
- No managed instance groups
- No auto-restart on failure
- No health checks

### 9. DNS Configuration Gaps
- Zone signing key uses 1024-bit RSA (should be 2048 bits)
- No apex domain record defined (only subdomains)
- TTL of 300 seconds is reasonable but not documented

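The key-length change could be expressed as below. This is a sketch with a placeholder zone: `example-zone` and `example.org.` are hypothetical, and the algorithm is assumed to stay RSASHA256. Supported key lengths per algorithm should be confirmed against the Cloud DNS documentation before applying.

```terraform
# Placeholder zone illustrating 2048-bit keys for both key types.
resource "google_dns_managed_zone" "example" {
  name     = "example-zone"
  dns_name = "example.org."

  dnssec_config {
    state         = "on"
    non_existence = "nsec3"

    default_key_specs {
      algorithm  = "rsasha256"
      key_length = 2048
      key_type   = "keySigning"
    }

    default_key_specs {
      algorithm  = "rsasha256"
      key_length = 2048
      key_type   = "zoneSigning"
    }
  }
}
```

According to the provider documentation, `default_key_specs` can only be updated while DNSSEC is off, so changing the key length on a live zone means disabling and re-enabling DNSSEC and updating the DS record at the registrar. That makes this a planned maintenance task rather than a routine `terraform apply`.
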
### 10. Missing Cost Optimization
- No committed use discounts
- Could use Spot (preemptible) VMs for non-production
- No resource labels for cost allocation
- No budgets or billing alerts

---

## Detailed Recommendations

### High Priority (Do First)

1. **Fix Hardcoded Service Account**
   - Use data sources or variables
   - Ensures portability across projects

2. **Implement Application Provisioning**
   - Add startup scripts with idempotent configuration
   - Or create golden images with Packer
   - Document all manual steps taken

3. **Secure Firewall Rules**
   - Remove public access to ports 3000 and 5678
   - Restrict HTTP/HTTPS to Cloudflare IPs if using a CDN
   - Add explicit SSH rules with IP allowlisting

4. **Create Custom VPC**
   - Separate network for these resources
   - Proper subnet configuration
   - Network tags for better organization

5. **Add Persistent Data Disks**
   ```terraform
   # Dedicated data disk so repositories outlive the VM
   resource "google_compute_disk" "gitea_data" {
     name = "gitea-data"
     size = 50 # GB
     type = "pd-standard"
     zone = var.zone
   }
   ```

6. **Implement Backup Strategy**
   - Scheduled snapshots for data disks
   - Retention policies
   - Test restore procedures

### Medium Priority (Important but Not Urgent)

7. **Add Monitoring and Alerting**
   - Cloud Monitoring dashboards for VM metrics
   - Uptime checks for services
   - Alert policies for disk usage, CPU, memory
   - Email/Slack notifications

8. **Upgrade Operating System**
   - Change to `debian-cloud/debian-12`
   - Test application compatibility first

9. **Improve DNS Configuration**
   - Increase zone signing key to 2048 bits
   - Add apex domain record if needed
   - Consider lower TTL during migrations

10. **Add Lifecycle Management**
    ```terraform
    lifecycle {
      prevent_destroy = true # For production
      ignore_changes  = [metadata_startup_script]
    }
    ```

11. **Implement Better Secrets Management**
    - Use Secret Manager for application secrets
    - Grant VMs access via service account
    - Avoid hardcoding credentials

12. **Add Resource Labels**
    ```terraform
    labels = {
      environment = "production"
      service     = "gitea"
      managed_by  = "terraform"
      cost_center = "engineering"
    }
    ```

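For item 11, a sketch of the Secret Manager wiring might look like this. All names are hypothetical, and the `replication { auto {} }` syntax assumes version 5 or later of the Google provider (earlier versions used `automatic = true`). The secret value itself is deliberately not set in Terraform so that it stays out of state.

```terraform
# Dedicated service account for the n8n VM (hypothetical name).
resource "google_service_account" "n8n" {
  account_id   = "n8n-vm"
  display_name = "n8n VM service account"
}

# Secret container; versions are added out of band, not via Terraform.
resource "google_secret_manager_secret" "n8n_encryption_key" {
  secret_id = "n8n-encryption-key"

  replication {
    auto {}
  }
}

# Read-only access to this one secret for the VM's service account.
resource "google_secret_manager_secret_iam_member" "n8n_reader" {
  secret_id = google_secret_manager_secret.n8n_encryption_key.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.n8n.email}"
}
```

Using a dedicated service account per VM instead of the default compute account also addresses the least-privilege concern, since the binding is scoped to a single secret.
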
### Low Priority (Nice to Have)

13. **Modularize Terraform Code**
    - Create reusable modules for VM + DNS pattern
    - Separate module for firewall rules
    - Easier to maintain and extend

14. **Add terraform.tfvars.example**
    - Document required variables
    - Provide example values
    - Help new team members get started

15. **Consider Terragrunt**
    - If planning multi-environment setup (dev/staging/prod)
    - DRY configuration management
    - Environment-specific overrides

16. **Implement CI/CD for Terraform**
    - Automated `terraform plan` on PRs
    - Automated `terraform apply` after merge
    - State locking verification

17. **Add Pre-commit Hooks**
    - Run `terraform fmt` automatically
    - Run `terraform validate`
    - Run security scanning (tfsec, checkov)

18. **Consider Managed Services**
    - Cloud Run for containerized apps (simpler than VMs)
    - Cloud SQL if databases are needed
    - Cloud Storage for artifacts/backups

---

## Security Recommendations Summary

### Immediate Actions Required:
1. Close ports 3000 and 5678 to the public internet
2. Implement IP allowlisting for SSH access
3. Create a custom VPC with proper firewall rules
4. Enable VPC Flow Logs for security monitoring
5. Implement Cloud Armor for DDoS protection (this requires placing the services behind an external Application Load Balancer)

### Additional Security Measures:
- Enable OS Login for SSH key management
- Use Identity-Aware Proxy (IAP) for VM access
- Implement least-privilege service account permissions
- Enable audit logging for all resources
- Run regular security scans with Web Security Scanner (formerly Cloud Security Scanner)
- Implement Web Application Firewall (WAF) rules

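OS Login is the smallest of these changes. A sketch, assuming it should apply to every VM in the project (the resource name is hypothetical):

```terraform
# Project-wide metadata entry that switches SSH key management to OS Login.
resource "google_compute_project_metadata_item" "enable_oslogin" {
  key   = "enable-oslogin"
  value = "TRUE"
}
```

With OS Login enabled, SSH access is governed by IAM roles such as `roles/compute.osLogin` rather than by keys stored in instance metadata, so removing a team member's access no longer requires touching the VMs.
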
---

## Data Protection Recommendations

### Backup Strategy:
```terraform
# Example snapshot schedule
resource "google_compute_resource_policy" "daily_backup" {
  name   = "daily-backup-policy"
  region = var.region

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1
        start_time    = "04:00"
      }
    }
    retention_policy {
      max_retention_days    = 14
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }
  }
}
```

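A schedule has no effect until it is attached to a disk. The sketch below binds the `daily_backup` policy above to the `gitea_data` disk from High Priority item 5; the attachment's resource name is hypothetical, and `var.zone` is assumed to be the zone the disk lives in.

```terraform
# Applies the daily snapshot schedule to the Gitea data disk.
resource "google_compute_disk_resource_policy_attachment" "gitea_data_backup" {
  name = google_compute_resource_policy.daily_backup.name
  disk = google_compute_disk.gitea_data.name
  zone = var.zone
}
```

The policy's region has to contain the disk's zone, which is why the policy uses `var.region` and the attachment uses `var.zone`.
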
### Disaster Recovery Plan:
1. Capture the manual setup steps as code (for example, cloud-init)
2. Test VM restoration from snapshots quarterly
3. Maintain off-site backups of critical data
4. Document RTO (Recovery Time Objective) and RPO (Recovery Point Objective)

---

## Estimated Effort for Improvements

| Priority | Task Category | Estimated Time |
|----------|---------------|----------------|
| High | Security fixes | 4-6 hours |
| High | Data persistence | 2-3 hours |
| High | Service account fix | 30 minutes |
| High | Startup scripts | 4-8 hours |
| Medium | Monitoring setup | 3-4 hours |
| Medium | OS upgrade | 1-2 hours |
| Low | Modularization | 6-8 hours |
| Low | CI/CD pipeline | 4-6 hours |

**Total effort for high-priority items:** ~11-18 hours (sum of the four high-priority rows above)
**Total effort for all recommendations:** ~30-40 hours

---

## Conclusion

This infrastructure is a good starting point for managing cloud resources with Terraform. The code is readable, well-documented, and follows basic IaC principles. However, significant work is needed to make it production-ready.

The most critical issue is the **hybrid approach** where infrastructure is managed by Terraform but application configuration is manual. This creates a maintenance burden and makes disaster recovery difficult.

### Recommended Next Steps:
1. Start with security fixes (firewall rules, custom VPC)
2. Add persistent data disks and backups
3. Automate application provisioning
4. Implement monitoring and alerting
5. Create runbooks for common operations

With these improvements, this infrastructure could achieve a production-readiness score of 8/10 and provide a solid foundation for the Sing For Hope organization's DevOps needs.

---

## Questions for Further Discussion

1. What is the expected traffic volume for these services?
2. What are the RTO/RPO requirements for disaster recovery?
3. Is high availability required, or is some downtime acceptable?
4. What is the budget for infrastructure costs?
5. Are there compliance requirements (HIPAA, SOC2, etc.)?
6. Who will be responsible for ongoing maintenance?
7. Are there plans to add more services to this infrastructure?

---

**Note:** This review is based on the code as of November 9, 2025. As infrastructure evolves, periodic reviews should be conducted to ensure continued alignment with best practices and organizational needs.