Downtime Detection | Blue Frog Docs

Downtime Detection

Monitor and respond to website downtime and availability issues

What This Means

Downtime occurs when your website becomes inaccessible to users. Detecting downtime quickly is critical for minimizing impact on users, revenue, and search rankings.

Impact

  • Lost revenue - Every minute of downtime costs money
  • SEO penalties - Prolonged downtime can lower search rankings
  • User trust - Unreliable sites lose visitors
  • Tracking gaps - Analytics data is missing for the outage period
  • SLA violations - Business and contractual impacts

Types of Downtime

Type                 | Description                     | Detection
Full outage          | Site completely inaccessible    | HTTP request fails
Partial outage       | Some pages/features unavailable | Specific endpoint monitoring
Degraded performance | Site accessible but slow        | Response time monitoring
Regional outage      | Issues in specific locations    | Multi-location monitoring
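
As a rough illustration of how the detection methods above map to downtime types, the following Python sketch classifies the results of probing one URL from several locations. The `ProbeResult` type, the location labels, and the 3-second slowness threshold are assumptions made for this example, not part of any particular monitoring service. Partial outages are not covered here because they require probing several endpoints rather than one.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed cutoff for "degraded performance"; tune it to your own baseline.
SLOW_THRESHOLD_SECONDS = 3.0


@dataclass
class ProbeResult:
    """Outcome of one HTTP check from one monitoring location (hypothetical type)."""
    location: str             # label for the probing location, e.g. "us-east"
    status: Optional[int]     # HTTP status code, or None if the request failed
    response_seconds: float   # how long the request took


def probe_ok(probe: ProbeResult) -> bool:
    """A probe succeeds if it got a 2xx or 3xx response."""
    return probe.status is not None and 200 <= probe.status < 400


def classify(probes: List[ProbeResult]) -> str:
    """Map probes of a single URL from several locations to a downtime type."""
    if not probes:
        raise ValueError("at least one probe result is required")
    failed = [p for p in probes if not probe_ok(p)]
    if len(failed) == len(probes):
        return "full outage"            # every location failed
    if failed:
        return "regional outage"        # only some locations failed
    if any(p.response_seconds > SLOW_THRESHOLD_SECONDS for p in probes):
        return "degraded performance"   # reachable everywhere, but slow
    return "operational"
```

Treating a single failing location as a regional outage is a simplification; in practice you may want to require repeated failures before alerting.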

How to Monitor

External Monitoring Services

What to Monitor

  1. Homepage - Primary availability check
  2. Key pages - Product pages, checkout, login
  3. API endpoints - Critical functionality
  4. Third-party services - Payment, CDN, etc.

Monitoring Configuration

# Example monitoring setup
checks:
  - name: Homepage
    url: https://example.com
    interval: 60  # seconds
    timeout: 30   # seconds
    alerts:
      - type: email
        address: ops@example.com
      - type: slack
        webhook: https://hooks.slack.com/...

  - name: API Health
    url: https://api.example.com/health
    interval: 30            # seconds
    expected_status: 200    # HTTP status code a healthy response returns
    expected_content: "ok"  # text expected in the response body
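
To make the configuration fields concrete, here is a minimal Python sketch of how one check might be executed using only the standard library. It assumes each check is a dictionary with the same keys as the example above; the function names `evaluate` and `run_check`, the default of 200 for a missing `expected_status`, and the substring match for `expected_content` are assumptions of this sketch, not the behavior of a specific monitoring tool.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def evaluate(status: Optional[int], body: str,
             expected_status: int = 200,
             expected_content: Optional[str] = None) -> Tuple[bool, str]:
    """Return (healthy, reason) for one response."""
    if status is None:
        return False, "request failed"
    if status != expected_status:
        return False, f"unexpected status {status}"
    if expected_content is not None and expected_content not in body:
        return False, "expected content missing"
    return True, "ok"


def run_check(check: dict) -> Tuple[bool, str]:
    """Fetch check["url"] once and compare the response with the check's expectations."""
    try:
        with urllib.request.urlopen(check["url"], timeout=check.get("timeout", 30)) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; keep the status code for evaluation.
        status, body = err.code, ""
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar errors.
        status, body = None, ""
    return evaluate(status, body,
                    check.get("expected_status", 200),
                    check.get("expected_content"))
```

A scheduler would call `run_check` once per `interval` and send alerts when it returns an unhealthy result. Keeping `evaluate` separate from the network call makes the pass/fail logic easy to test offline.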

Response Procedures

1. Immediate Response

When downtime is detected:

  1. Acknowledge alert - Prevent escalation
  2. Verify issue - Check from multiple locations
  3. Check status pages - Hosting, CDN, third-party services
  4. Begin diagnosis - Server logs, error messages

2. Communication

  • Update the internal status channel
  • Post to the public status page if the outage is prolonged
  • Notify affected customers if necessary

3. Resolution

  • Implement fix or failover
  • Verify recovery from multiple locations
  • Document incident and root cause

4. Post-Incident

  • Conduct post-mortem analysis
  • Implement preventive measures
  • Update monitoring if gaps identified

Status Page Best Practices

Create a public status page:

## Current Status
All systems operational ✅

## Components
- Website: Operational
- API: Operational
- Payments: Operational
- CDN: Operational

## Recent Incidents
- [Date] Brief description - Resolved
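
If component states come from your monitoring checks, the status and component sections of a page like this could be generated rather than edited by hand. The Python sketch below is one possible approach; the function name `render_status_page` and the headline used when a component is not operational are assumptions for illustration.

```python
from typing import Dict


def render_status_page(components: Dict[str, str]) -> str:
    """Render the status and component sections from a name -> state mapping."""
    all_ok = all(state == "Operational" for state in components.values())
    # The degraded headline wording is an assumption; adapt it to your own page.
    headline = ("All systems operational ✅" if all_ok
                else "Some systems are experiencing issues")
    lines = ["## Current Status", headline, "", "## Components"]
    lines += [f"- {name}: {state}" for name, state in components.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_status_page({
        "Website": "Operational",
        "API": "Operational",
        "Payments": "Operational",
        "CDN": "Operational",
    }))
```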

Hosting options:

Alerting Strategy

Severity Levels

Level    | Response Time     | Examples
Critical | Immediate         | Full outage, data loss
High     | < 15 min          | Partial outage, degraded performance
Medium   | < 1 hour          | Non-critical feature failure
Low      | Next business day | Minor issues, cosmetic bugs

Alert Routing

  • Critical: Phone call, SMS, Slack, Email
  • High: Slack, Email
  • Medium: Email, Ticket
  • Low: Ticket only
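
The severity levels and routing rules above can be expressed as two lookup tables. This Python sketch is illustrative only: the channel identifiers, the event names, and the functions `channels_for` and `route` are hypothetical, and actually delivering a notification to each channel is left out.

```python
from typing import List

# Channels per severity level, copied from the routing list above.
ROUTES = {
    "critical": ["phone", "sms", "slack", "email"],
    "high": ["slack", "email"],
    "medium": ["email", "ticket"],
    "low": ["ticket"],
}

# Example events per severity level, taken from the Severity Levels table.
SEVERITY_BY_EVENT = {
    "full outage": "critical",
    "data loss": "critical",
    "partial outage": "high",
    "degraded performance": "high",
    "non-critical feature failure": "medium",
    "cosmetic bug": "low",
}


def channels_for(severity: str) -> List[str]:
    """Return the notification channels for a severity level (case-insensitive)."""
    return ROUTES[severity.strip().lower()]


def route(event: str) -> List[str]:
    """Look up an event's severity, then the channels that severity routes to."""
    return channels_for(SEVERITY_BY_EVENT[event.strip().lower()])
```

Unknown events or severities raise `KeyError` here, which surfaces unmapped cases instead of silently dropping an alert.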