Operations & SLA

Transparent about what I can and can't promise

Uptime Guarantee

99% Monthly Uptime

~7.2 hours of allowed downtime per month

What this means:

  • Measured on API endpoint availability
  • Excludes scheduled maintenance (announced 48 hours ahead)
  • The API counts as "up" while P99 latency stays under 150 ms
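For reference, the ~7.2-hour figure comes from a 30-day month (720 hours). Here is a minimal sketch of the arithmetic. The function names are illustrative, and removing announced maintenance from the measured period is an assumption about how the exclusion is applied.

```python
def allowed_downtime_hours(uptime_target: float, days_in_month: int = 30) -> float:
    """Hours of downtime permitted in a month at the given uptime target."""
    total_hours = days_in_month * 24
    return total_hours * (1.0 - uptime_target)


def measured_uptime(total_minutes: int, down_minutes: int,
                    maintenance_minutes: int = 0) -> float:
    """Uptime fraction for the month.

    Assumption: announced maintenance is removed from the measured period
    before the ratio is computed.
    """
    measured = total_minutes - maintenance_minutes
    return (measured - down_minutes) / measured


if __name__ == "__main__":
    print(round(allowed_downtime_hours(0.99), 2))      # 30-day month
    print(round(allowed_downtime_hours(0.99, 31), 2))  # 31-day month
```

A 31-day month allows slightly more downtime in absolute terms because the percentage target is the same.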

If SLA is missed:

<99% uptime: 10% monthly credit
<98% uptime: 25% monthly credit
<95% uptime: 50% monthly credit

Credits are applied automatically to the next invoice.
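The credit tiers can be expressed as a simple lookup. This is a sketch, not the billing code. It assumes tiers do not stack, so only the credit for the lowest matching tier applies. `sla_credit_percent` is a hypothetical name.

```python
def sla_credit_percent(uptime_percent: float) -> int:
    """Monthly credit (as a percentage of the invoice) for a measured uptime.

    Assumption: tiers do not stack; the lowest matching tier wins.
    """
    if uptime_percent < 95.0:
        return 50
    if uptime_percent < 98.0:
        return 25
    if uptime_percent < 99.0:
        return 10
    return 0  # SLA met, no credit
```

For example, a month measured at 98.5% uptime falls in the first tier and would earn a 10% credit.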

What I can't promise (yet):

  • 24/7 human support (I'm solo, monitoring is automated)
  • Instant fixes (some issues take time to debug)
  • Zero downtime (99% means ~7hrs/month is realistic)

Monitoring & Alerting

Infrastructure Monitoring

  • API availability: 60-second health checks via Pingdom
  • Latency tracking: P50, P95, P99 on all endpoints
  • Error rate monitoring: Alerts at >1% error rate
  • Database performance: Query time, connection pool
  • Resource utilization: CPU, memory, disk, network
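To make the latency and error-rate checks concrete, here is a minimal sketch of how P50/P95/P99 and the 1% error threshold could be computed from raw samples. It uses the nearest-rank percentile method, which is an assumption; the monitoring stack may interpolate differently. All function names are illustrative.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """P50, P95 and P99 for one endpoint's latency samples (milliseconds)."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


def error_rate_exceeds(errors: int, total: int, threshold: float = 0.01) -> bool:
    """True when the error rate is strictly above the threshold (1% by default)."""
    if total == 0:
        return False
    return errors / total > threshold
```

With 100 samples of 1 ms through 100 ms, `latency_summary` reports the 50th, 95th, and 99th values of the sorted list.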

Application Monitoring

  • Search quality: Zero-result rate tracked per customer
  • Model performance: NDCG monitoring on test queries
  • Indexing pipeline: Success rate, latency, backlog
  • Analytics pipeline: Query log, relevance measurement
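For readers unfamiliar with the two search-quality metrics, here is a rough sketch of how they can be computed. It uses the common linear-gain form of NDCG. The exact variant and cutoff used in production are not specified here, so treat `k` and the function names as assumptions.

```python
import math
from typing import List


def dcg(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(relevances: List[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def zero_result_rate(result_counts: List[int]) -> float:
    """Fraction of queries that returned no results."""
    if not result_counts:
        return 0.0
    return sum(1 for n in result_counts if n == 0) / len(result_counts)
```

`relevances` holds the graded relevance of each returned result in ranked order. A perfectly ordered list scores 1.0, and anything less indicates relevant results ranked too low.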

Alert Thresholds

P0 (Critical): API down, >1% error rate, >2s p99 latency
P1 (High): Indexing stopped, >2% error rate
P2 (Medium): Elevated latency, elevated zero-result rate
P3 (Low): Non-critical warnings, usage anomalies

Critical Alerts (P0/P1)

PagerDuty - immediate notification

Non-Critical (P2/P3)

Email notification
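The routing rule above is small enough to sketch directly. The channel names are placeholders for illustration, not real PagerDuty or email API identifiers.

```python
def notification_channel(severity: str) -> str:
    """Route an alert by severity: P0/P1 page immediately, P2/P3 go to email."""
    if severity in ("P0", "P1"):
        return "pagerduty"  # placeholder channel name
    if severity in ("P2", "P3"):
        return "email"      # placeholder channel name
    raise ValueError(f"unknown severity: {severity!r}")
```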

Incident Response

Severity | Detection | Response | Resolution Target | Communication
P0 (Critical) | <5 minutes | <2 hours | <4 hours | Immediate status page + email
P1 (High) | <5 minutes | <4 hours | <8 hours | Status page + email to affected
P2 (Non-critical) | - | 48 hours | - | Email

Incident Communication

  • Initial notification: Within 15 minutes of detection
  • Status updates: Every 1 hour during active incident
  • Resolution notification: Immediate upon fix deployment
  • Post-mortem: Published within 2 business days
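As an illustration of how these targets translate into concrete times, here is a minimal sketch that derives the deadlines for a P0 or P1 incident. It assumes every clock starts at detection, which the table does not state explicitly. The names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Response and resolution targets from the incident table (P0 and P1 only).
TARGETS = {
    "P0": {"response": timedelta(hours=2), "resolution": timedelta(hours=4)},
    "P1": {"response": timedelta(hours=4), "resolution": timedelta(hours=8)},
}


def incident_deadlines(severity: str, detected_at: datetime) -> Dict[str, datetime]:
    """Deadlines for one incident, all measured from detection (an assumption)."""
    target = TARGETS[severity]
    return {
        "initial_notification_by": detected_at + timedelta(minutes=15),
        "response_by": detected_at + target["response"],
        "resolution_target": detected_at + target["resolution"],
    }


def status_update_times(detected_at: datetime, resolved_at: datetime) -> List[datetime]:
    """Hourly status updates that fall strictly before resolution."""
    times = []
    next_update = detected_at + timedelta(hours=1)
    while next_update < resolved_at:
        times.append(next_update)
        next_update += timedelta(hours=1)
    return times
```

For a P0 detected at 10:00, this gives an initial notification by 10:15, a response by 12:00, and a resolution target of 14:00.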

Data Backup & Recovery

Backup Schedule

  • Customer data: Daily, 30-day retention
  • OpenSearch indices: Every 6 hours
  • Configuration: Git version controlled
  • ML models: Weekly, 90-day retention

Recovery Objectives

  • RTO (Recovery Time): <4 hours
  • RPO (Recovery Point): <6 hours
  • Backup testing: Monthly DR drills
  • Restore verification: Automated monthly tests
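To show what an RPO and a retention window mean in practice, here is a small sketch of two checks an automated backup monitor could run. It is illustrative only: the function names are hypothetical and the actual tooling is not described here.

```python
from datetime import datetime, timedelta
from typing import List

RPO = timedelta(hours=6)  # recovery point objective from the list above


def meets_rpo(last_backup_at: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """True when the newest backup is recent enough that a restore
    would lose less than the RPO's worth of data."""
    return now - last_backup_at < rpo


def expired_backups(backup_times: List[datetime], now: datetime,
                    retention_days: int = 30) -> List[datetime]:
    """Backups older than the retention window, i.e. candidates for deletion."""
    cutoff = now - timedelta(days=retention_days)
    return [t for t in backup_times if t < cutoff]
```

A backup taken four hours ago passes the RPO check, while one taken seven hours ago does not.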

Disaster Recovery

  • Multi-AZ setup: AWS us-east-2 (Ohio) with 3 availability zones
  • Failover procedure: Documented and tested quarterly
  • Data export: Available via API anytime

Security

Encryption

  • In transit: TLS 1.3 for all API endpoints
  • At rest: AES-256 for all stored data
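If you want your client to refuse anything older than TLS 1.3 when calling the API, Python's standard `ssl` module can enforce that. This is a client-side sketch under that assumption and says nothing about how the servers are configured.

```python
import ssl


def tls13_only_context() -> ssl.SSLContext:
    """Client SSL context that verifies certificates and rejects TLS < 1.3."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx
```

The returned context can be passed to any standard-library client that accepts an `SSLContext`, for example `urllib.request.urlopen(url, context=ctx)`.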

Authentication

  • API: Bearer tokens (SHA-256 hashed)
  • OAuth 2.0: Coming Q1 2027
  • Key rotation: Automated 90-day rotation
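One common way to read "SHA-256 hashed" is that only the hash of each bearer token is stored server-side, so a leaked database does not expose usable tokens. Here is a minimal sketch of that pattern, assuming this interpretation. The function names are illustrative and this is not the production code.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta


def generate_token() -> str:
    """Random URL-safe bearer token; shown to the customer once."""
    return secrets.token_urlsafe(32)


def hash_token(token: str) -> str:
    """SHA-256 hex digest; this is what would be stored."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def verify_token(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison of the presented token's hash with the stored one."""
    return hmac.compare_digest(hash_token(presented), stored_hash)


def rotation_due(issued_at: datetime, now: datetime, max_age_days: int = 90) -> bool:
    """True once a key has reached the 90-day rotation age."""
    return now - issued_at >= timedelta(days=max_age_days)
```

Using `hmac.compare_digest` avoids leaking how many leading characters matched through timing differences.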

Security Practices

  • Vulnerability scanning upon code changes (GitLab)
  • Dependency updates via Dependabot
  • Infrastructure as code (version controlled)
  • Access logging: 90-day retention

What I don't have (yet):

  • SOC 2 certification (in progress, 6+ months away)
  • Penetration test results (planned Q2 2026)
  • 24/7 human security monitoring (automated only)

Deployment & Updates

Schedule

  • Frequency: Weekly deployments
  • Maintenance window: Announced 48 hours ahead
  • Expected downtime: <15 minutes for most
  • Emergency patches: Deployed immediately if critical

Process

  • Canary deployments: 5% → 100% over 2 hours
  • Automated rollback: <15 minutes if issues
  • Health checks: Automated post-deployment
  • Smoke tests: Run on production after deploy
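To illustrate the canary ramp, here is a sketch of one way traffic could move from 5% to 100% over two hours. The linear ramp and the hash-based request bucketing are both assumptions made for illustration. The real rollout may use discrete steps instead.

```python
import hashlib


def canary_traffic_percent(minutes_since_start: float,
                           start_percent: float = 5.0,
                           ramp_minutes: float = 120.0) -> float:
    """Share of traffic on the new release, ramping linearly (assumed) to 100%."""
    if minutes_since_start <= 0:
        return start_percent
    if minutes_since_start >= ramp_minutes:
        return 100.0
    fraction = minutes_since_start / ramp_minutes
    return start_percent + (100.0 - start_percent) * fraction


def in_canary(request_key: str, percent: float) -> bool:
    """Deterministically assign a key (e.g. a customer ID) to the canary group."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10000  # 0..9999, i.e. 0.01% resolution
    return bucket < percent * 100
```

Hashing the key keeps a given customer on the same side of the split for a given percentage, which makes any regression easier to trace.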

Breaking Changes

  • Deprecation notice: 30 days minimum
  • Migration guide: Provided with all breaking changes
  • Backward compatibility: Maintained where possible
  • API versioning: /v1, /v2, etc. supported concurrently
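As a small illustration of the versioning and notice rules, the sketch below extracts the version prefix from a request path and computes the earliest date a deprecated version could be removed. The `/v2/search` path in the usage note is a hypothetical example, not a documented endpoint.

```python
from datetime import date, timedelta
from typing import Optional


def api_version(path: str) -> Optional[str]:
    """Return the leading version segment ("v1", "v2", ...) or None if absent."""
    first = path.strip("/").split("/", 1)[0]
    if len(first) > 1 and first[0] == "v" and first[1:].isdigit():
        return first
    return None


def earliest_removal_date(notice_date: date, notice_days: int = 30) -> date:
    """Earliest removal date given the 30-day minimum deprecation notice."""
    return notice_date + timedelta(days=notice_days)
```

For instance, `api_version("/v2/search")` yields `"v2"`, and a notice sent on 1 January means nothing is removed before 31 January.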

Questions About Operations?

I'm happy to discuss infrastructure details and answer technical questions.