Operations & SLA
Transparent about what I can and can't promise
Uptime Guarantee
99% Monthly Uptime
~7.2 hours of allowed downtime per month
What this means:
- Measured on API endpoint availability
- Excludes scheduled maintenance (announced 48 hours ahead)
- An endpoint counts as "up" only when its P99 latency is under 150 ms
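The downtime figure above follows directly from the uptime target. Here is a minimal sketch of the arithmetic, assuming a 30-day month by default and uptime measured as the share of passed health checks; the function names are illustrative, not part of the service.

```python
def allowed_downtime_hours(uptime_target: float, days_in_month: int = 30) -> float:
    """Return the hours of downtime a month can absorb at the given uptime target."""
    total_hours = days_in_month * 24
    return total_hours * (1.0 - uptime_target)


def measured_uptime_percent(total_checks: int, failed_checks: int) -> float:
    """Return uptime as the percentage of health checks that passed."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * (total_checks - failed_checks) / total_checks


if __name__ == "__main__":
    # 99% target over a 30-day and a 31-day month
    print(round(allowed_downtime_hours(0.99), 2))
    print(round(allowed_downtime_hours(0.99, 31), 2))
    # 60-second checks over 30 days give 43,200 checks (assumed check cadence)
    print(round(measured_uptime_percent(43_200, 400), 3))
```

A 31-day month allows slightly more downtime than a 30-day month at the same percentage, which is why the headline number is approximate.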
If SLA is missed:
- <99% uptime: 10% monthly credit
- <98% uptime: 25% monthly credit
- <95% uptime: 50% monthly credit
Credits applied automatically to next invoice
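The credit tiers can be read as a simple lookup. The sketch below assumes the tiers are not cumulative (only the largest applicable credit is granted) and that the credit is a percentage of the monthly invoice; both are assumptions, and the function names are hypothetical.

```python
def sla_credit_percent(uptime_percent: float) -> int:
    """Return the credit tier for a month's measured uptime (percent)."""
    # Check the lowest uptime tier first so the largest applicable credit wins.
    if uptime_percent < 95.0:
        return 50
    if uptime_percent < 98.0:
        return 25
    if uptime_percent < 99.0:
        return 10
    return 0


def credit_amount(monthly_invoice: float, uptime_percent: float) -> float:
    """Return the credit applied to the next invoice, rounded to cents."""
    return round(monthly_invoice * sla_credit_percent(uptime_percent) / 100, 2)


if __name__ == "__main__":
    print(sla_credit_percent(98.5))
    print(credit_amount(200.0, 97.5))
```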
What I can't promise (yet):
- 24/7 human support (I'm solo, monitoring is automated)
- Instant fixes (some issues take time to debug)
- Zero downtime (at 99%, up to ~7 hours of downtime per month is realistic)
Monitoring & Alerting
Infrastructure Monitoring
- API availability: 60-second health checks via Pingdom
- Latency tracking: P50, P95, P99 on all endpoints
- Error rate monitoring: Alerts at >1% error rate
- Database performance: Query time, connection pool
- Resource utilization: CPU, memory, disk, network
Application Monitoring
- Search quality: Zero-result rate tracked per customer
- Model performance: NDCG monitoring on test queries
- Indexing pipeline: Success rate, latency, backlog
- Analytics pipeline: Query log, relevance measurement
Alert Thresholds
- P0 (Critical): API down, >1% error rate, >2s P99 latency
- P1 (High): Indexing stopped, >2% error rate
- P2 (Medium): Elevated latency, elevated zero-result rate
- P3 (Low): Non-critical warnings, usage anomalies
Critical Alerts (P0/P1)
PagerDuty - immediate notification
Non-Critical (P2/P3)
Email notification
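The routing rule above (P0/P1 page, P2/P3 email) and the P0 conditions can be sketched as follows. This is an illustration only: the channel labels are placeholder strings rather than real PagerDuty or email API calls, and treating the three P0 conditions as an "any of" check is an assumption.

```python
from enum import Enum


class Severity(Enum):
    P0 = "critical"
    P1 = "high"
    P2 = "medium"
    P3 = "low"


def notification_channel(severity: Severity) -> str:
    """Route critical severities to paging and the rest to email."""
    if severity in (Severity.P0, Severity.P1):
        return "pagerduty"  # placeholder label, not an API call
    return "email"


def is_p0(api_up: bool, error_rate: float, p99_latency_s: float) -> bool:
    """True if any listed P0 condition holds: API down, >1% errors, >2 s P99."""
    return (not api_up) or error_rate > 0.01 or p99_latency_s > 2.0


if __name__ == "__main__":
    print(notification_channel(Severity.P1))
    print(is_p0(api_up=True, error_rate=0.005, p99_latency_s=0.12))
```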
Incident Response
| Severity | Detection | Response | Resolution Target | Communication |
|---|---|---|---|---|
| P0 (Critical) | <5 minutes | <2 hours | <4 hours | Immediate status page + email |
| P1 (High) | <5 minutes | <4 hours | <8 hours | Status page + email to affected |
| P2 (Non-critical) | - | 48 hours | - | - |
Incident Communication
- Initial notification: Within 15 minutes of detection
- Status updates: Every hour during an active incident
- Resolution notification: Immediately upon fix deployment
- Post-mortem: Published within 2 business days
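The communication commitments above translate into concrete deadlines once an incident has a detection and resolution time. The sketch below assumes hourly updates are counted from the detection time, that the post-mortem clock starts at resolution, and that business days are Monday to Friday with no holiday calendar; all three are assumptions, and the function names are illustrative.

```python
from datetime import datetime, timedelta


def initial_notification_deadline(detected_at: datetime) -> datetime:
    """Latest time for the first customer notification (15 minutes)."""
    return detected_at + timedelta(minutes=15)


def status_update_times(detected_at: datetime, resolved_at: datetime) -> list[datetime]:
    """Hourly update times that fall before the incident is resolved."""
    updates = []
    next_update = detected_at + timedelta(hours=1)
    while next_update < resolved_at:
        updates.append(next_update)
        next_update += timedelta(hours=1)
    return updates


def postmortem_due(resolved_at: datetime, business_days: int = 2) -> datetime:
    """Post-mortem deadline, skipping Saturdays and Sundays."""
    due = resolved_at
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return due


if __name__ == "__main__":
    detected = datetime(2025, 1, 3, 10, 0)   # a Friday
    resolved = datetime(2025, 1, 3, 12, 30)
    print(initial_notification_deadline(detected))
    print(status_update_times(detected, resolved))
    print(postmortem_due(resolved))
```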
Data Backup & Recovery
Backup Schedule
- Customer data: Daily, 30-day retention
- OpenSearch indices: Every 6 hours
- Configuration: Git version controlled
- ML models: Weekly, 90-day retention
Recovery Objectives
- RTO (Recovery Time): <4 hours
- RPO (Recovery Point): <6 hours
- Backup testing: Monthly DR drills
- Restore verification: Automated monthly tests
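One way to keep an eye on the recovery point objective is to compare the age of the most recent backup against the 6-hour target. This is a minimal sketch of such a check, not a description of the actual monitoring; the function name and the way timestamps are obtained are assumptions.

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=6)  # recovery point objective stated above


def meets_rpo(last_backup_at: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """True if restoring the latest backup would lose less than `rpo` of data."""
    return now - last_backup_at < rpo


if __name__ == "__main__":
    last_backup = datetime(2025, 1, 1, 0, 0)
    print(meets_rpo(last_backup, datetime(2025, 1, 1, 5, 30)))
    print(meets_rpo(last_backup, datetime(2025, 1, 1, 7, 0)))
```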
Disaster Recovery
- Multi-AZ setup: AWS us-east-2 (Ohio) with 3 availability zones
- Failover procedure: Documented and tested quarterly
- Data export: Available via API anytime
Security
Encryption
- In transit: TLS 1.3 for all API endpoints
- At rest: AES-256 for all stored data
Authentication
- API: Bearer tokens (SHA-256 hashed)
- OAuth 2.0: Coming Q1 2027
- Key rotation: Automated 90-day rotation
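Storing only a SHA-256 hash of each bearer token means a leaked database does not expose usable tokens. The sketch below shows the general pattern using Python's standard library; it is an illustration under assumptions (token length, hex encoding, rotation measured from issue time), not the service's actual implementation, and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta

ROTATION_PERIOD = timedelta(days=90)  # rotation interval stated above


def hash_token(token: str) -> str:
    """Return the hex SHA-256 digest that would be stored instead of the token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_token() -> tuple[str, str]:
    """Create a random token; return (token for the client, hash to store)."""
    token = secrets.token_urlsafe(32)
    return token, hash_token(token)


def verify_token(presented: str, stored_hash: str) -> bool:
    """Compare digests in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_token(presented), stored_hash)


def needs_rotation(issued_at: datetime, now: datetime) -> bool:
    """True once the key has reached the 90-day rotation period."""
    return now - issued_at >= ROTATION_PERIOD


if __name__ == "__main__":
    token, stored = issue_token()
    print(verify_token(token, stored))
    print(verify_token("not-the-token", stored))
```

A plain fast hash is reasonable here only because the tokens are long random strings; it would not be appropriate for user-chosen passwords.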
Security Practices
- Vulnerability scanning upon code changes (GitLab)
- Dependency updates via Dependabot
- Infrastructure as code (version controlled)
- Access logging: 90-day retention
What I don't have (yet):
- SOC 2 certification (in progress, 6+ months away)
- Penetration test results (planned Q2 2026)
- 24/7 human security monitoring (automated only)
Deployment & Updates
Schedule
- Frequency: Weekly deployments
- Maintenance window: Announced 48 hours ahead
- Expected downtime: <15 minutes for most deployments
- Emergency patches: Deployed immediately if critical
Process
- Canary deployments: 5% → 100% over 2 hours
- Automated rollback: <15 minutes if issues are detected
- Health checks: Automated post-deployment
- Smoke tests: Run on production after deploy
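The canary ramp above can be pictured as a traffic percentage that grows from 5% to 100% over the 2-hour window. The sketch below assumes a linear ramp; the real rollout may well proceed in discrete steps, and the function name is illustrative.

```python
def canary_traffic_percent(
    elapsed_minutes: float,
    start_percent: float = 5.0,
    ramp_minutes: float = 120.0,
) -> float:
    """Share of traffic sent to the new release, assuming a linear ramp."""
    if elapsed_minutes <= 0:
        return start_percent
    if elapsed_minutes >= ramp_minutes:
        return 100.0
    fraction = elapsed_minutes / ramp_minutes
    return start_percent + (100.0 - start_percent) * fraction


if __name__ == "__main__":
    for minutes in (0, 30, 60, 90, 120):
        print(minutes, canary_traffic_percent(minutes))
```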
Breaking Changes
- Deprecation notice: 30 days minimum
- Migration guide: Provided with all breaking changes
- Backward compatibility: Maintained where possible
- API versioning: /v1, /v2, etc. supported concurrently
Questions About Operations?
I'm happy to discuss infrastructure details and answer technical questions.