Operations & SLA

Transparent about what I can and can't promise

Uptime Guarantee

99% Monthly Uptime

~7.2 hours of allowed downtime per month

What this means:

Measured on API endpoint availability
Excludes scheduled maintenance (announced 48 hours ahead)
P99 latency under 150ms is considered "up"
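To make the arithmetic behind the uptime figure concrete, here is a minimal Python sketch. It assumes a 30-day month by default and reads the rules above as: a health check counts as "up" when the endpoint responds and P99 latency is under 150ms, and checks inside an announced maintenance window are left out. The `HealthCheck` type and its field names are hypothetical, not part of the actual monitoring setup.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class HealthCheck:
    """One probe result. Field names are illustrative, not from a real API."""
    reachable: bool               # endpoint answered successfully
    p99_latency_ms: float         # P99 latency observed for this interval
    in_maintenance: bool = False  # inside an announced maintenance window


def allowed_downtime_hours(uptime_target: float, days_in_month: int = 30) -> float:
    """Downtime budget in hours for a target given as a fraction (0.99 = 99%)."""
    return days_in_month * 24 * (1.0 - uptime_target)


def monthly_uptime_percent(checks: Iterable[HealthCheck]) -> float:
    """Share of counted checks that were 'up'; maintenance checks are excluded."""
    counted = [c for c in checks if not c.in_maintenance]
    if not counted:
        return 100.0  # assumption: nothing measured means nothing missed
    up = sum(1 for c in counted if c.reachable and c.p99_latency_ms < 150)
    return 100.0 * up / len(counted)
```

With these definitions, `allowed_downtime_hours(0.99)` gives about 7.2 hours for a 30-day month and about 7.44 hours for a 31-day month.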

If SLA is missed:

<99% uptime: 10% monthly credit
<98% uptime: 25% monthly credit
<95% uptime: 50% monthly credit

Credits applied automatically to next invoice
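The credit tiers can be expressed as a small lookup. This is a sketch of the tiers as listed, not the actual billing code; the function names and the two-decimal rounding of the amount are assumptions.

```python
def sla_credit_percent(uptime_percent: float) -> int:
    """Monthly credit (as a percent of the invoice) for a measured uptime."""
    # Check the lowest tier first so the largest applicable credit wins.
    if uptime_percent < 95.0:
        return 50
    if uptime_percent < 98.0:
        return 25
    if uptime_percent < 99.0:
        return 10
    return 0  # SLA met, no credit


def credit_amount(monthly_invoice: float, uptime_percent: float) -> float:
    """Credit applied to the next invoice (rounding is an assumption)."""
    return round(monthly_invoice * sla_credit_percent(uptime_percent) / 100, 2)
```

For example, a month measured at 97.5% uptime on a 200.00 invoice falls in the 25% tier.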

What I can't promise (yet):

24/7 human support (I'm solo, monitoring is automated)
Instant fixes (some issues take time to debug)
Zero downtime (99% means ~7hrs/month is realistic)

Monitoring & Alerting

Infrastructure Monitoring

• API availability: 60-second health checks via Pingdom
• Latency tracking: P50, P95, P99 on all endpoints
• Error rate monitoring: Alerts at >1% error rate
• Database performance: Query time, connection pool
• Resource utilization: CPU, memory, disk, network
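For readers unfamiliar with the latency percentiles and the error-rate alert above, the sketch below shows one common way to compute them. It uses the nearest-rank percentile definition, which is an assumption; the monitoring tools in use may interpolate differently. Function names are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def error_rate_alert(errors: int, total: int, threshold: float = 0.01) -> bool:
    """True when the error rate is strictly above the threshold (1% by default)."""
    if total == 0:
        return False  # no traffic, nothing to alert on
    return errors / total > threshold
```

A typical call would be `percentile(latencies_ms, 99)` over the samples collected for one endpoint in one reporting window.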

Application Monitoring

• Search quality: Zero-result rate tracked per customer
• Model performance: NDCG monitoring on test queries
• Indexing pipeline: Success rate, latency, backlog
• Analytics pipeline: Query log, relevance measurement
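Two of the application metrics above have simple definitions worth spelling out. The sketch below computes NDCG@k with linear gain (some implementations use 2^rel - 1 instead) and a per-customer zero-result rate. The ideal ranking is taken from the judged results of the query itself, and the log format of `(customer_id, result_count)` pairs is hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same results."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def zero_result_rate(query_log: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Fraction of queries per customer that returned no results."""
    totals: Dict[str, int] = defaultdict(int)
    zeros: Dict[str, int] = defaultdict(int)
    for customer_id, result_count in query_log:
        totals[customer_id] += 1
        if result_count == 0:
            zeros[customer_id] += 1
    return {customer: zeros[customer] / totals[customer] for customer in totals}
```

A ranking that already lists results in descending relevance scores 1.0; any other ordering of the same results scores lower.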

Alert Thresholds

P0 (Critical): API down, >1% error rate, >2s P99 latency
P1 (High): Indexing stopped, >2% error rate
P2 (Medium): Elevated latency, elevated zero-result rate
P3 (Low): Non-critical warnings, usage anomalies

Critical Alerts (P0/P1)

PagerDuty - immediate notification

Non-Critical (P2/P3)

Email notification
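As a rough illustration of how the thresholds and routing fit together, here is a sketch that classifies a metrics snapshot and picks a notification channel. Only the numeric P0 and P1 rules are encoded, since the P2 and P3 conditions are not given as numbers. The `Metrics` type, the field names, and the channel strings are hypothetical; no real PagerDuty or email API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    """Snapshot of the signals the thresholds refer to (illustrative names)."""
    api_up: bool
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_s: float     # P99 latency in seconds
    indexing_running: bool


def classify(m: Metrics) -> Optional[str]:
    """Return "P0" or "P1" for the numeric rules, else None."""
    if not m.api_up or m.error_rate > 0.01 or m.p99_latency_s > 2.0:
        return "P0"
    # With the thresholds as listed, an error rate above 2% has already
    # matched the 1% P0 rule, so only the indexing check can reach P1 here.
    if not m.indexing_running or m.error_rate > 0.02:
        return "P1"
    return None


# Critical alerts page immediately; non-critical alerts go out by email.
ROUTES = {"P0": "pagerduty", "P1": "pagerduty", "P2": "email", "P3": "email"}


def route(severity: str) -> str:
    """Notification channel for a severity label."""
    return ROUTES[severity]
```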

Incident Response

Severity	Detection	Response	Resolution Target	Communication
P0 (Critical)	<5 minutes	<2 hours	<4 hours	Immediate status page + email
P1 (High)	<5 minutes	<4 hours	<8 hours	Status page + email to affected
P2/P3 (Non-critical)	-	48 hours	-	Email

Incident Communication

• Initial notification: Within 15 minutes of detection
• Status updates: Every hour during an active incident
• Resolution notification: Immediate upon fix deployment
• Post-mortem: Published within 2 business days
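The communication timeline above can be turned into concrete deadlines. The sketch below assumes hourly updates are counted from detection, that the post-mortem clock starts on the resolution date, and that business days are Monday to Friday with no holiday calendar. All names are illustrative.

```python
from datetime import date, datetime, timedelta
from typing import Dict, List, Union


def add_business_days(start: date, days: int) -> date:
    """Advance by the given number of Mon-Fri days (holidays not considered)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def communication_deadlines(
    detected_at: datetime, resolved_at: datetime
) -> Dict[str, Union[datetime, List[datetime], date]]:
    """Deadlines implied by the 15-minute, hourly, and 2-business-day rules."""
    updates: List[datetime] = []
    next_update = detected_at + timedelta(hours=1)
    while next_update < resolved_at:
        updates.append(next_update)
        next_update += timedelta(hours=1)
    return {
        "initial_notification_by": detected_at + timedelta(minutes=15),
        "status_updates": updates,
        "post_mortem_by": add_business_days(resolved_at.date(), 2),
    }
```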

Data Backup & Recovery

Backup Schedule

Customer data: Daily, 30-day retention
OpenSearch indices: Every 6 hours
Configuration: Git version controlled
ML models: Weekly, 90-day retention

Recovery Objectives

RTO (Recovery Time): <4 hours
RPO (Recovery Point): <6 hours
Backup testing: Monthly DR drills
Restore verification: Automated monthly tests
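An automated check against the recovery point objective can be as simple as comparing the age of the newest backup with the RPO. The sketch below also shows retention pruning with the 30-day window as the default. It is not the actual backup tooling, and the function names are assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

RPO = timedelta(hours=6)  # recovery point objective from the table above


def meets_rpo(last_backup_at: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """True when a failure right now would lose less than `rpo` of data."""
    return now - last_backup_at < rpo


def expired_backups(
    backup_times: Iterable[datetime],
    now: datetime,
    retention: timedelta = timedelta(days=30),
) -> List[datetime]:
    """Backups older than the retention window, i.e. candidates for deletion."""
    return [t for t in backup_times if now - t > retention]
```

Passing `retention=timedelta(days=90)` would cover the ML model backups.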

Disaster Recovery

• Multi-AZ setup: AWS us-east-2 (Ohio) with 3 availability zones
• Failover procedure: Documented and tested quarterly
• Data export: Available via API anytime

Security

Encryption

• In transit: TLS 1.3 for all API endpoints
• At rest: AES-256 for all stored data

Authentication

• API: Bearer tokens (stored as SHA-256 hashes)
• OAuth 2.0: Coming Q1 2027
• Key rotation: Automated 90-day rotation
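To illustrate what hashing bearer tokens and rotating them every 90 days can look like, here is a minimal sketch using only the Python standard library. It assumes only the hash is stored server-side and that verification uses a constant-time comparison. The function names and the token length are assumptions, not a description of the actual service.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta
from typing import Tuple


def hash_token(token: str) -> str:
    """SHA-256 hex digest of a token; this is what would be stored."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_token() -> Tuple[str, str]:
    """Return (token to hand to the client, hash to store)."""
    token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoding
    return token, hash_token(token)


def verify_token(presented: str, stored_hash: str) -> bool:
    """Compare hashes in constant time to avoid timing leaks."""
    return hmac.compare_digest(hash_token(presented), stored_hash)


def needs_rotation(
    created_at: datetime, now: datetime, max_age: timedelta = timedelta(days=90)
) -> bool:
    """True once a key has reached the rotation age."""
    return now - created_at >= max_age
```

Because high-entropy random tokens are not guessable, a fast hash such as SHA-256 is a common choice here, unlike for user passwords.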

Security Practices

• Vulnerability scanning upon code changes (GitLab)
• Dependency updates via Dependabot
• Infrastructure as code (version controlled)
• Access logging: 90-day retention

What I don't have (yet):

• SOC 2 certification (in progress, 6+ months away)
• Penetration test results (planned Q2 2026)
• 24/7 human security monitoring (automated only)

Deployment & Updates

Schedule

• Frequency: Weekly deployments
• Maintenance window: Announced 48 hours ahead
• Expected downtime: <15 minutes for most deployments
• Emergency patches: Deployed immediately if critical

Process

• Canary deployments: 5% → 100% over 2 hours
• Automated rollback: <15 minutes if issues are detected
• Health checks: Automated post-deployment
• Smoke tests: Run on production after deploy
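One way to picture the canary step is a traffic share that grows from 5% to 100% over two hours, with each request assigned to a stable bucket. The sketch below assumes a linear ramp (the real rollout may use discrete steps) and hash-based bucketing on a request key such as a customer ID. Both choices, and all names, are assumptions.

```python
import hashlib


def canary_percent(
    elapsed_minutes: float,
    start: float = 5.0,
    end: float = 100.0,
    duration_minutes: float = 120.0,
) -> float:
    """Share of traffic on the new version, ramping linearly over the window."""
    if elapsed_minutes <= 0:
        return start
    if elapsed_minutes >= duration_minutes:
        return end
    return start + (end - start) * elapsed_minutes / duration_minutes


def routes_to_canary(request_key: str, percent: float) -> bool:
    """Stable assignment: the same key always lands in the same bucket."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0 to 9999
    return bucket < percent * 100  # percent has two decimals of resolution
```

Because the bucket for a key never changes, a key that reaches the canary at a low percentage stays on it as the percentage grows, which keeps behavior consistent for that caller during the rollout.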

Breaking Changes

• Deprecation notice: 30 days minimum
• Migration guide: Provided with all breaking changes
• Backward compatibility: Maintained where possible
• API versioning: /v1, /v2, etc. supported concurrently
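A small sketch of how path-based versioning and the 30-day notice might be checked in code. The set of supported versions, the example paths, and the function names are hypothetical; only the `/v1`, `/v2` prefix style and the 30-day minimum come from the list above.

```python
import re
from datetime import date, timedelta
from typing import Optional

SUPPORTED_VERSIONS = {"v1", "v2"}  # hypothetical: versions served concurrently
_VERSION_PREFIX = re.compile(r"^/(v\d+)(?:/|$)")


def api_version(path: str) -> Optional[str]:
    """Extract a supported version prefix from a request path, else None."""
    match = _VERSION_PREFIX.match(path)
    if match and match.group(1) in SUPPORTED_VERSIONS:
        return match.group(1)
    return None


def earliest_removal_date(notice_date: date, notice_days: int = 30) -> date:
    """First date a deprecated feature may be removed under the minimum notice."""
    return notice_date + timedelta(days=notice_days)
```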

Questions About Operations?

I'm happy to discuss infrastructure details and answer technical questions.
