
Infrastructure Review: GCP Terraform for Sing For Hope

Review Date: November 9, 2025
Reviewer: Claude Code
Project: GCP Infrastructure for Gitea and n8n


Executive Summary

This Terraform project manages GCP infrastructure for hosting Gitea (self-hosted Git) and n8n (workflow automation) on separate VM instances. While the code is clean and functional, it has several critical issues that limit production-readiness, particularly around security, data persistence, and infrastructure-as-code completeness.

Overall Ratings:

  • Production-Readiness: 4/10
  • Code Quality: 6/10
  • Documentation: 7/10
  • Maintainability: 5/10
  • Security: 3/10

Strengths

1. Clean Project Structure

The project follows Terraform best practices with proper file organization:

  • Separate files for variables, outputs, and backend configuration
  • Clear naming conventions
  • Logical resource grouping

2. Good Documentation

The README.md provides:

  • Clear service descriptions
  • Prerequisite checklist
  • Step-by-step usage instructions
  • Honest acknowledgment of manual setup steps

3. Remote State Management

  • Uses GCS backend for state storage
  • Prevents state conflicts in team environments
  • Enables state locking

4. DNS Security

  • DNSSEC enabled with appropriate configuration
  • Uses NSEC3 for non-existence proof
  • Proper key specifications for key signing and zone signing

5. Sensible Defaults

  • Variables have reasonable default values
  • Machine type (e2-small) appropriate for small workloads
  • Standard region/zone selection

Critical Issues

1. Hardcoded Service Account Email

Location: main.tf:53, main.tf:76

service_account {
  email  = "456409048169-compute@developer.gserviceaccount.com"
  scopes = [...]
}

Impact:

  • Code is not portable across projects
  • Violates infrastructure-as-code principles
  • Will fail if used in different GCP projects

Fix: Use Terraform data sources to dynamically fetch the default compute service account:

data "google_compute_default_service_account" "default" {}

service_account {
  email  = data.google_compute_default_service_account.default.email
  scopes = [...]
}

2. Infrastructure Drift (Manual Configuration)

Location: README.md notes manual setup of Docker, Nginx, Certbot

Impact:

  • Infrastructure cannot be reproduced from code alone
  • Disaster recovery requires manual intervention and documentation
  • Team members cannot spin up identical environments
  • Configuration changes aren't tracked in version control

Fix:

  • Add startup scripts to metadata_startup_script in VM resources (see the sketch below)
  • OR use Packer to create pre-configured VM images
  • OR use configuration management tools (Ansible, Cloud Init)
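
As a minimal sketch of the first option, a startup script can be attached directly to the instance resource. The resource name, image, and script contents below are illustrative assumptions rather than the project's actual configuration; in practice the script would be added to the existing instance definitions in main.tf.

resource "google_compute_instance" "gitea" {
  name         = "gitea"
  machine_type = var.machine_type
  zone         = var.zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
    access_config {}
  }

  # Runs on first boot; package installation is safe to re-run.
  metadata_startup_script = <<-EOT
    #!/bin/bash
    set -euo pipefail
    apt-get update
    apt-get install -y docker.io nginx
    systemctl enable --now docker nginx
  EOT
}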

3. Security Vulnerabilities

a. Overly Permissive Firewall Rules

Location: All firewall rules use source_ranges = ["0.0.0.0/0"]

Issues:

  • Gitea port 3000 exposed to the internet (main.tf:82-93)
  • n8n port 5678 exposed to the internet (main.tf:121-132)
  • These should only be accessible via reverse proxy (Nginx)

b. Using Default VPC

Location: All resources use network = "default"

Issues:

  • Default network has permissive routing
  • No network segmentation
  • Shared with other project resources
  • Difficult to implement security best practices

c. No SSH Access Control

  • No explicit SSH firewall rules defined
  • Default GCP rules may be too permissive
  • No bastion host or IAP tunneling
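
One way to close this gap is to allow SSH only through IAP TCP forwarding, which originates from Google's 35.235.240.0/20 range. A hedged sketch follows; the rule name, network, and tags are assumptions:

resource "google_compute_firewall" "allow_iap_ssh" {
  name    = "allow-iap-ssh"
  network = "default" # move to the custom VPC once it exists

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # 35.235.240.0/20 is the source range used by IAP TCP forwarding,
  # so SSH becomes reachable only through an IAP tunnel.
  source_ranges = ["35.235.240.0/20"]
  target_tags   = ["gitea", "n8n"]
}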

4. No Data Persistence Strategy

Locations: main.tf:40-44, main.tf:64-68

Issues:

  • Boot disks use default sizing
  • No separate data volumes for application data
  • Gitea repositories and n8n workflows stored on ephemeral boot disk
  • No backup configuration
  • Risk of data loss if VMs are recreated

Impact: Critical user data (Git repositories, workflow configurations) could be lost during infrastructure updates.


Notable Gaps

5. Missing Startup Automation

Although the README acknowledges that this was intentionally omitted, the lack of startup automation creates operational challenges:

  • New team members can't provision working infrastructure
  • Updates require manual SSH intervention
  • No automated application updates or patching

6. Outdated Operating System

Location: main.tf:42, main.tf:66

The instances currently use debian-cloud/debian-11, while Debian 12 is available and supported.

7. No Monitoring or Alerting

Missing operational visibility:

  • No Cloud Monitoring dashboards
  • No alerting for VM health, disk usage, or service availability
  • No log aggregation configuration
  • No uptime checks for the applications
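
As one possible starting point, an HTTPS uptime check can be defined alongside the rest of the infrastructure. The hostname and variable names below are assumptions:

resource "google_monitoring_uptime_check_config" "gitea" {
  display_name = "gitea-uptime"
  timeout      = "10s"

  http_check {
    path    = "/"
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "git.example.com" # assumed hostname
    }
  }
}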

8. No High Availability or Auto-Healing

Current setup has single points of failure:

  • Single VM per service
  • No managed instance groups
  • No auto-restart on failure
  • No health checks

9. DNS Configuration Gaps

  • Zone signing key uses 1024-bit RSA (should be 2048 bits)
  • No apex domain record defined (only subdomains)
  • TTL of 300 seconds is reasonable but not documented
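
Strengthening the zone signing key is a small change to the managed zone's DNSSEC configuration. A sketch, with the zone name and domain as placeholders; note that GCP may only apply new key specs when DNSSEC is re-enabled, so the change can require toggling the state:

resource "google_dns_managed_zone" "main" {
  name     = "sfh-zone"     # placeholder; use the existing zone name
  dns_name = "example.com." # placeholder; use the real apex domain

  dnssec_config {
    state         = "on"
    non_existence = "nsec3"

    default_key_specs {
      key_type   = "keySigning"
      algorithm  = "rsasha256"
      key_length = 2048
    }

    default_key_specs {
      key_type   = "zoneSigning"
      algorithm  = "rsasha256"
      key_length = 2048 # up from the current 1024-bit key
    }
  }
}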

10. Missing Cost Optimization

  • No committed use discounts
  • Could use preemptible VMs for non-production
  • No resource tagging for cost allocation
  • No budgets or billing alerts

Detailed Recommendations

High Priority (Do First)

  1. Fix Hardcoded Service Account

    • Use data sources or variables
    • Ensures portability across projects
  2. Implement Application Provisioning

    • Add startup scripts with idempotent configuration
    • Or create golden images with Packer
    • Document all manual steps taken
  3. Secure Firewall Rules

    • Remove public access to ports 3000 and 5678
    • Restrict HTTP/HTTPS to Cloudflare IPs if using a CDN
    • Add explicit SSH rules with IP allowlisting
  4. Create Custom VPC

    • Separate network for these resources
    • Proper subnet configuration
    • Network tags for better organization
  5. Add Persistent Data Disks (attachment sketch after this list)

    resource "google_compute_disk" "gitea_data" {
      name = "gitea-data"
      size = 50
      type = "pd-standard"
      zone = var.zone
    }
    
  6. Implement Backup Strategy

    • Scheduled snapshots for data disks
    • Retention policies
    • Test restore procedures
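
The disk created in item 5 still needs to be attached to the VM. One way to do that is shown below; the instance resource name is an assumption about this project:

resource "google_compute_attached_disk" "gitea_data" {
  disk     = google_compute_disk.gitea_data.id
  instance = google_compute_instance.gitea.id # assumed instance resource name
  zone     = var.zone
}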

Medium Priority (Important but Not Urgent)

  1. Add Monitoring and Alerting

    • Cloud Monitoring dashboards for VM metrics
    • Uptime checks for services
    • Alert policies for disk usage, CPU, memory
    • Email/Slack notifications
  2. Upgrade Operating System

    • Change to debian-cloud/debian-12
    • Test application compatibility first
  3. Improve DNS Configuration

    • Increase zone signing key to 2048 bits
    • Add apex domain record if needed
    • Consider lower TTL during migrations
  4. Add Lifecycle Management

    lifecycle {
      prevent_destroy = true  # For production
      ignore_changes  = [metadata_startup_script]
    }
    
  5. Implement Better Secrets Management (sketch after this list)

    • Use Secret Manager for application secrets
    • Grant VMs access via service account
    • Avoid hardcoding credentials
  6. Add Resource Labels

    labels = {
      environment = "production"
      service     = "gitea"
      managed_by  = "terraform"
      cost_center = "engineering"
    }
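
For item 5, a minimal Secret Manager setup might look like the following. The secret name is an assumption, and the IAM binding reuses the default compute service account data source suggested earlier in this review:

resource "google_secret_manager_secret" "n8n_encryption_key" {
  secret_id = "n8n-encryption-key" # assumed name

  replication {
    auto {} # on older provider versions this is `automatic = true`
  }
}

# Allow the VM's service account to read the secret at runtime.
resource "google_secret_manager_secret_iam_member" "n8n_read" {
  secret_id = google_secret_manager_secret.n8n_encryption_key.id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${data.google_compute_default_service_account.default.email}"
}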
    

Low Priority (Nice to Have)

  1. Modularize Terraform Code

    • Create reusable modules for VM + DNS pattern
    • Separate module for firewall rules
    • Easier to maintain and extend
  2. Add terraform.tfvars.example (sketch after this list)

    • Document required variables
    • Provide example values
    • Help new team members get started
  3. Consider Terragrunt

    • If planning multi-environment setup (dev/staging/prod)
    • DRY configuration management
    • Environment-specific overrides
  4. Implement CI/CD for Terraform

    • Automated terraform plan on PRs
    • Automated terraform apply after merge
    • State locking verification
  5. Add Pre-commit Hooks

    • Run terraform fmt automatically
    • Run terraform validate
    • Run security scanning (tfsec, checkov)
  6. Consider Managed Services

    • Cloud Run for containerized apps (simpler than VMs)
    • Cloud SQL if databases are needed
    • Cloud Storage for artifacts/backups
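
For item 2, a terraform.tfvars.example could be as small as the following; the variable names are guesses at what variables.tf defines and should be adjusted to match:

# terraform.tfvars.example
project_id   = "my-gcp-project"
region       = "us-central1"
zone         = "us-central1-a"
machine_type = "e2-small"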

Security Recommendations Summary

Immediate Actions Required:

  1. Close ports 3000 and 5678 to public internet
  2. Implement IP allowlisting for SSH access
  3. Create custom VPC with proper firewall rules
  4. Enable VPC Flow Logs for security monitoring
  5. Implement Cloud Armor for DDoS protection

Additional Security Measures:

  • Enable OS Login for SSH key management
  • Use Identity-Aware Proxy (IAP) for VM access
  • Implement least-privilege service account permissions
  • Enable audit logging for all resources
  • Regular security scanning with Cloud Security Scanner
  • Implement Web Application Firewall (WAF) rules

Data Protection Recommendations

Backup Strategy:

# Example snapshot schedule
resource "google_compute_resource_policy" "daily_backup" {
  name   = "daily-backup-policy"
  region = var.region

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1
        start_time    = "04:00"
      }
    }
    retention_policy {
      max_retention_days    = 14
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }
  }
}
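
The policy above takes effect only once it is attached to a disk. A sketch of the attachment, assuming the data disk from the earlier recommendation:

resource "google_compute_disk_resource_policy_attachment" "gitea_backup" {
  name = google_compute_resource_policy.daily_backup.name
  disk = google_compute_disk.gitea_data.name # assumed data disk
  zone = var.zone
}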

Disaster Recovery Plan:

  1. Document manual setup steps in code-readable format (cloud-init)
  2. Test VM restoration from snapshots quarterly
  3. Maintain off-site backups of critical data
  4. Document RTO (Recovery Time Objective) and RPO (Recovery Point Objective)

Estimated Effort for Improvements

Priority   Task Category         Estimated Time
High       Security fixes        4-6 hours
High       Data persistence      2-3 hours
High       Service account fix   30 minutes
High       Startup scripts       4-8 hours
Medium     Monitoring setup      3-4 hours
Medium     OS upgrade            1-2 hours
Low        Modularization        6-8 hours
Low        CI/CD pipeline        4-6 hours

Total effort for high-priority items: ~12-18 hours
Total effort for all recommendations: ~30-40 hours


Conclusion

This infrastructure demonstrates a good starting point for managing cloud resources with Terraform. The code is readable, well-documented, and follows basic IaC principles. However, significant work is needed to make this production-ready.

The most critical issue is the hybrid approach where infrastructure is managed by Terraform but application configuration is manual. This creates a maintenance burden and makes disaster recovery difficult.

The recommended path forward:

  1. Start with security fixes (firewall rules, custom VPC)
  2. Add persistent data disks and backups
  3. Automate application provisioning
  4. Implement monitoring and alerting
  5. Create runbooks for common operations

With these improvements, this infrastructure could achieve a production-readiness score of 8/10 and provide a solid foundation for the Sing For Hope organization's DevOps needs.


Questions for Further Discussion

  1. What is the expected traffic volume for these services?
  2. What are the RTO/RPO requirements for disaster recovery?
  3. Is high availability required, or is some downtime acceptable?
  4. What is the budget for infrastructure costs?
  5. Are there compliance requirements (HIPAA, SOC2, etc.)?
  6. Who will be responsible for ongoing maintenance?
  7. Are there plans to add more services to this infrastructure?

Note: This review is based on the code as of November 9, 2025. As infrastructure evolves, periodic reviews should be conducted to ensure continued alignment with best practices and organizational needs.