Uptime Monitoring

Endpoint monitoring & incident management

Aexy's uptime monitoring module provides endpoint health checking for HTTP, TCP, and WebSocket services. Monitors run on configurable intervals and automatically create support tickets when services go down.

Features#

  • Multi-Protocol Checks: HTTP, TCP, and WebSocket endpoint monitoring
  • Configurable Intervals: 1-, 5-, 15-, 30-, or 60-minute check intervals
  • Automatic Incident Management: Incidents created after consecutive failures
  • Ticket Integration: Automatically creates and closes support tickets
  • Notifications: Slack, email, and webhook alerts
  • SSL Monitoring: Track SSL certificate expiration (HTTP checks)
  • Post-Mortem Notes: Root cause and resolution documentation

Check Types#

HTTP Check#

Monitors HTTP/HTTPS endpoints with configurable settings.

Setting | Description
URL | Full URL to check (http:// or https://)
HTTP Method | GET, POST, HEAD, OPTIONS
Expected Status Codes | Array of valid status codes (default: [200, 201, 204])
Request Headers | Custom headers (e.g., Authorization)
Verify SSL | Whether to verify SSL certificates
Timeout | Request timeout in seconds (default: 30)

Collected Metrics:

  • Response time (ms)
  • Status code
  • SSL certificate expiry (days remaining)
  • Error messages
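
As a rough illustration of how these metrics could be gathered, the sketch below uses only the Python standard library. It is not Aexy's actual checker: the function names (`evaluate_http_result`, `ssl_days_remaining`, `run_http_check`) and the returned dictionary keys are assumptions made for this example, and custom headers and the Verify SSL setting are left out for brevity.

```python
import ssl
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional, Sequence

DEFAULT_EXPECTED = (200, 201, 204)  # documented default for expected status codes


def evaluate_http_result(status_code: Optional[int],
                         expected: Sequence[int] = DEFAULT_EXPECTED) -> bool:
    """The check passes only when a status code was received and is expected."""
    return status_code is not None and status_code in expected


def ssl_days_remaining(not_after: str, now: Optional[datetime] = None) -> int:
    """Whole days (floored) until a certificate's notAfter timestamp.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan 15 10:30:00 2030 GMT".
    """
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def run_http_check(url: str, method: str = "GET", timeout: float = 30.0) -> dict:
    """Hypothetical single HTTP check returning the metrics listed above."""
    started = time.monotonic()
    status_code, error_message = None, None
    try:
        request = urllib.request.Request(url, method=method)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status_code = response.status
    except urllib.error.HTTPError as exc:
        status_code = exc.code  # urllib raises on 4xx/5xx; the code is still a result
    except OSError as exc:  # URLError, timeouts, refused connections
        error_message = str(exc)
    return {
        "is_up": evaluate_http_result(status_code),
        "status_code": status_code,
        "response_time_ms": int((time.monotonic() - started) * 1000),
        "error_message": error_message,
    }
```

Keeping the pass/fail decision in `evaluate_http_result` separate from the network call makes the expected-status logic easy to test without a live endpoint.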

TCP Check#

Monitors TCP port availability.

Setting | Description
Host | Hostname or IP address
Port | TCP port number
Timeout | Connection timeout in seconds (default: 30)

Collected Metrics:

  • Connection time (ms)
  • Connection success/failure
  • Error type (timeout, connection_refused)
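
A TCP check of this kind can be sketched with `socket.create_connection` from the standard library. The `TcpCheckResult` class, the `run_tcp_check` function, and the `connection_error` catch-all value are hypothetical names for this example; only `timeout` and `connection_refused` appear in the documentation above.

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TcpCheckResult:
    is_up: bool
    connection_time_ms: int
    error_type: Optional[str] = None


def run_tcp_check(host: str, port: int, timeout: float = 30.0) -> TcpCheckResult:
    """Open and immediately close a TCP connection, timing the attempt."""
    started = time.monotonic()
    error_type = None
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # a completed connect is all this check needs
    except socket.timeout:
        error_type = "timeout"
    except ConnectionRefusedError:
        error_type = "connection_refused"
    except OSError:
        error_type = "connection_error"  # assumed catch-all, e.g. DNS failure
    elapsed_ms = int((time.monotonic() - started) * 1000)
    return TcpCheckResult(error_type is None, elapsed_ms, error_type)
```

For example, `run_tcp_check("db.internal", 5432, timeout=5)` would report whether a PostgreSQL port accepts connections; the hostname here is a placeholder.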

WebSocket Check#

Monitors WebSocket endpoints with optional message validation.

Setting | Description
URL | WebSocket URL (ws:// or wss://)
WS Message | Optional message to send on connect
WS Expected Response | Expected response pattern (regex)
Timeout | Connection timeout in seconds (default: 30)

Collected Metrics:

  • Connection time (ms)
  • Handshake success
  • Response validation result
  • Error messages
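
The response validation step can be illustrated with Python's `re` module. This is a sketch under assumptions: the function name `validate_ws_response` is hypothetical, and the documentation does not say whether the pattern must match the whole response or any part of it, so the example uses `re.search` (match anywhere).

```python
import re
from typing import Optional


def validate_ws_response(expected_pattern: Optional[str],
                         response: Optional[str]) -> bool:
    """Return True when the response satisfies the expected pattern.

    With no pattern configured, the handshake alone decides the result,
    so validation passes trivially.
    """
    if not expected_pattern:
        return True
    if response is None:
        return False  # a pattern was expected but nothing was received
    return re.search(expected_pattern, response) is not None
```

With this approach, a monitor that sends `{"type": "ping"}` could use a pattern such as `"type":\s*"pong"` to accept any JSON reply that contains a pong type field.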

Incident Management#

How It Works#

  1. Consecutive Failures: Monitor tracks consecutive failed checks
  2. Threshold Reached: When failures >= threshold (default: 3), incident is created
  3. Ticket Created: Support ticket automatically created with incident details
  4. Notifications Sent: Slack/email/webhook alerts dispatched
  5. Recovery Detected: When check succeeds, incident is resolved
  6. Ticket Closed: Linked ticket automatically closed with resolution notes

Failure Flow#

Check fails
    ↓
Increment consecutive_failures
    ↓
consecutive_failures >= threshold?
    ├─ No → Schedule next check
    └─ Yes → Create incident
                ↓
            Create support ticket
                ↓
            Send notifications

Recovery Flow#

Check succeeds
    ↓
Has ongoing incident?
    ├─ No → Reset counters
    └─ Yes → Resolve incident
                ↓
            Close linked ticket
                ↓
            Send recovery notifications (if enabled)
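
The failure and recovery flows above can be sketched as one small state transition function. The names `MonitorState` and `record_check`, and the action strings it returns, are hypothetical. Two behaviors are assumptions not stated in the flows: no second incident is opened while one is ongoing, and the failure counter is also reset when an incident is resolved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitorState:
    threshold: int = 3  # documented default for consecutive_failures_threshold
    consecutive_failures: int = 0
    has_ongoing_incident: bool = False
    notify_on_recovery: bool = True


def record_check(state: MonitorState, is_up: bool) -> List[str]:
    """Apply one check result and return the actions it triggers, in order."""
    actions: List[str] = []
    if is_up:
        # Recovery flow
        if state.has_ongoing_incident:
            state.has_ongoing_incident = False
            actions += ["resolve_incident", "close_ticket"]
            if state.notify_on_recovery:
                actions.append("send_recovery_notification")
        state.consecutive_failures = 0  # reset counters
        return actions
    # Failure flow
    state.consecutive_failures += 1
    if (state.consecutive_failures >= state.threshold
            and not state.has_ongoing_incident):
        state.has_ongoing_incident = True
        actions += ["create_incident", "create_ticket", "send_notifications"]
    return actions
```

Returning the actions as a list, rather than performing them, keeps the decision logic separate from ticket and notification side effects.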

Incident States#

Status | Description
ongoing | Incident in progress, service still down
acknowledged | Team aware of issue, working on resolution
resolved | Service recovered, incident closed

Monitor Status#

Status | Description
up | All checks passing
down | Threshold failures reached
degraded | Some failures but under threshold
paused | Monitoring temporarily disabled
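
One way to read the status table is as a function of the monitor's failure streak. The sketch below is an interpretation, not the documented implementation: `derive_status` is a hypothetical name, and mapping `paused` to `is_active = false` is an assumption.

```python
def derive_status(is_active: bool, consecutive_failures: int,
                  threshold: int = 3) -> str:
    """Map a monitor's failure streak onto the four documented statuses."""
    if not is_active:
        return "paused"    # assumed: paused corresponds to is_active = false
    if consecutive_failures >= threshold:
        return "down"      # threshold failures reached
    if consecutive_failures > 0:
        return "degraded"  # some failures but under threshold
    return "up"            # all checks passing
```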

Notifications#

Supported Channels#

Channel | Configuration
Slack | Set slack_channel_id on monitor
Email | Uses team notification preferences
Webhook | Custom webhook_url for external integrations

Notification Events#

  • Incident Created: When the failure threshold is reached
  • Incident Acknowledged: When team acknowledges incident
  • Incident Resolved: When service recovers (if notify_on_recovery enabled)

Webhook Payload#

{
  "event": "incident.created",
  "monitor": {
    "id": "uuid",
    "name": "API Server",
    "check_type": "http",
    "url": "https://api.example.com/health"
  },
  "incident": {
    "id": "uuid",
    "status": "ongoing",
    "started_at": "2024-01-15T10:30:00Z",
    "error_message": "Connection timeout after 30s"
  },
  "workspace_id": "uuid"
}
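
A receiving service might parse this payload as sketched below. The `IncidentEvent` class and `parse_webhook` function are hypothetical names for this example. Only the `incident.created` event name appears in the documentation, so the sketch does not assume names for other events.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentEvent:
    event: str
    monitor_name: str
    check_type: str
    incident_id: str
    status: str
    started_at: datetime
    error_message: str


def parse_webhook(body: str) -> IncidentEvent:
    """Parse the JSON body shown above into a typed object."""
    payload = json.loads(body)
    monitor, incident = payload["monitor"], payload["incident"]
    return IncidentEvent(
        event=payload["event"],
        monitor_name=monitor["name"],
        check_type=monitor["check_type"],
        incident_id=incident["id"],
        status=incident["status"],
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11
        started_at=datetime.fromisoformat(incident["started_at"].replace("Z", "+00:00")),
        error_message=incident.get("error_message", ""),
    )
```

A handler could then branch on `event.event` and route `incident.created` to a paging or chat integration.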

Ticket Integration#

Automatic Ticket Creation#

When an incident is created, a support ticket is automatically generated:

  • Title: [UPTIME] {monitor_name} is down
  • Severity: high
  • Priority: urgent
  • Description: Monitor details, error message, timestamp
  • Team: Assigned to configured team (if set)

Automatic Ticket Closure#

When service recovers:

  1. Resolution comment added to ticket:
    Service has recovered.
    
    Duration: 15 minutes
    Started: 2024-01-15 10:30:00 UTC
    Resolved: 2024-01-15 10:45:00 UTC
    Total checks during incident: 15
    Failed checks: 15
    
  2. Ticket status set to closed
  3. Recovery notifications sent (if enabled)
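
The resolution comment shown in step 1 could be produced by a formatter like the one below. This is a sketch: `format_resolution_comment` is a hypothetical name, the function expects timezone-aware datetimes, and rounding the duration down to whole minutes is an assumption.

```python
from datetime import datetime, timezone


def format_resolution_comment(started_at: datetime, resolved_at: datetime,
                              total_checks: int, failed_checks: int) -> str:
    """Build the resolution comment from incident fields (aware datetimes)."""
    minutes = int((resolved_at - started_at).total_seconds() // 60)
    stamp = "%Y-%m-%d %H:%M:%S UTC"
    return "\n".join([
        "Service has recovered.",
        "",
        f"Duration: {minutes} minutes",
        f"Started: {started_at.astimezone(timezone.utc).strftime(stamp)}",
        f"Resolved: {resolved_at.astimezone(timezone.utc).strftime(stamp)}",
        f"Total checks during incident: {total_checks}",
        f"Failed checks: {failed_checks}",
    ])
```

The inputs correspond to the started_at, resolved_at, total_checks, and failed_checks columns of the uptime_incidents table.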

API Endpoints#

Monitors#

GET    /workspaces/{id}/uptime/monitors                    # List monitors
POST   /workspaces/{id}/uptime/monitors                    # Create monitor
GET    /workspaces/{id}/uptime/monitors/{monitor_id}       # Get monitor
PATCH  /workspaces/{id}/uptime/monitors/{monitor_id}       # Update monitor
DELETE /workspaces/{id}/uptime/monitors/{monitor_id}       # Delete monitor
POST   /workspaces/{id}/uptime/monitors/{monitor_id}/pause # Pause monitoring
POST   /workspaces/{id}/uptime/monitors/{monitor_id}/resume # Resume monitoring
POST   /workspaces/{id}/uptime/monitors/{monitor_id}/test  # Run immediate test
GET    /workspaces/{id}/uptime/monitors/{monitor_id}/checks # Get check history
GET    /workspaces/{id}/uptime/monitors/{monitor_id}/stats # Get monitor stats

Incidents#

GET    /workspaces/{id}/uptime/incidents                   # List incidents
GET    /workspaces/{id}/uptime/incidents/{incident_id}     # Get incident
PATCH  /workspaces/{id}/uptime/incidents/{incident_id}     # Update incident
POST   /workspaces/{id}/uptime/incidents/{incident_id}/acknowledge # Acknowledge
POST   /workspaces/{id}/uptime/incidents/{incident_id}/resolve     # Resolve

Statistics#

GET    /workspaces/{id}/uptime/stats                       # Workspace stats
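
As an illustration of calling these endpoints, the sketch below builds (but does not send) a request to create a monitor. Several details are assumptions: the base URL, bearer-token authentication, and request body field names that mirror the uptime_monitors columns. Check the actual API schema before relying on any of them.

```python
import json
import urllib.request


def build_create_monitor_request(base_url: str, workspace_id: str,
                                 token: str, monitor: dict) -> urllib.request.Request:
    """Prepare POST /workspaces/{id}/uptime/monitors with a JSON body."""
    url = f"{base_url.rstrip('/')}/workspaces/{workspace_id}/uptime/monitors"
    return urllib.request.Request(
        url,
        data=json.dumps(monitor).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


# Hypothetical body; field names follow the uptime_monitors table
request = build_create_monitor_request(
    "https://aexy.example.com/api", "ws-1", "TOKEN",
    {"name": "API Server", "check_type": "http",
     "url": "https://api.example.com/health", "check_interval_seconds": 300},
)
# Send with urllib.request.urlopen(request) once the values are real.
```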

Database Tables#

uptime_monitors#

Stores monitor configurations.

Column | Type | Description
id | UUID | Primary key
workspace_id | UUID | FK to workspaces
name | VARCHAR(255) | Display name
check_type | ENUM | http, tcp, websocket
url | VARCHAR(2048) | For HTTP/WS checks
host | VARCHAR(255) | For TCP checks
port | INTEGER | For TCP checks
http_method | VARCHAR(10) | GET, POST, HEAD, OPTIONS
expected_status_codes | JSONB | Array of valid status codes
request_headers | JSONB | Custom headers
verify_ssl | BOOLEAN | SSL verification
ws_message | TEXT | Message to send on WS connect
ws_expected_response | TEXT | Expected WS response pattern
check_interval_seconds | INTEGER | 60, 300, 900, 1800, 3600
timeout_seconds | INTEGER | Request timeout (default 30)
consecutive_failures_threshold | INTEGER | Failures before alerting
current_status | ENUM | up, down, degraded, paused
last_check_at | TIMESTAMP | Last check time
next_check_at | TIMESTAMP | Next scheduled check
consecutive_failures | INTEGER | Current failure streak
notification_channels | JSONB | ["slack", "email", "webhook"]
slack_channel_id | VARCHAR | Slack channel for alerts
webhook_url | VARCHAR | Custom webhook URL
notify_on_recovery | BOOLEAN | Send recovery notification
team_id | UUID | FK to teams (for ticket routing)
is_active | BOOLEAN | Whether monitoring is active

uptime_checks#

Stores individual check results (time-series data).

Column | Type | Description
id | UUID | Primary key
monitor_id | UUID | FK to uptime_monitors
is_up | BOOLEAN | Check result
status_code | INTEGER | HTTP status code
response_time_ms | INTEGER | Response time
error_message | TEXT | Error details
error_type | VARCHAR | timeout, connection_refused, ssl_error
ssl_expiry_days | INTEGER | Days until SSL expiry
checked_at | TIMESTAMP | Check timestamp

uptime_incidents#

Stores incident records linked to tickets.

Column | Type | Description
id | UUID | Primary key
monitor_id | UUID | FK to uptime_monitors
workspace_id | UUID | FK to workspaces
ticket_id | UUID | FK to tickets
status | ENUM | ongoing, acknowledged, resolved
started_at | TIMESTAMP | When incident started
resolved_at | TIMESTAMP | When resolved
acknowledged_at | TIMESTAMP | When acknowledged
acknowledged_by_id | UUID | FK to developers
first_error_message | TEXT | Initial error
last_error_message | TEXT | Most recent error
total_checks | INTEGER | Checks during incident
failed_checks | INTEGER | Failed checks count
root_cause | TEXT | Post-mortem notes
resolution_notes | TEXT | Resolution details

Temporal Schedules#

Defined in backend/src/aexy/temporal/schedules.py:

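# Excerpt: the two uptime entries from the schedule definitions.
# timedelta is datetime.timedelta; TaskQueue is defined elsewhere in the Aexy backend.
{
    # Every 60 seconds: dispatch checks for monitors that are due
    "id": "uptime-process-due-checks",
    "activity": "process_due_checks",
    "input_module": "aexy.temporal.activities.uptime",
    "interval": timedelta(seconds=60),
    "queue": TaskQueue.OPERATIONS,
},
{
    # Every 24 hours: remove old check records
    "id": "uptime-cleanup-old-checks",
    "activity": "cleanup_old_checks",
    "input_module": "aexy.temporal.activities.uptime",
    "interval": timedelta(hours=24),
    "queue": TaskQueue.OPERATIONS,
},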

Activity Descriptions#

Activity | Description
process_due_checks | Runs every minute, dispatches checks for monitors where next_check_at <= now
execute_check | Executes HTTP/TCP/WebSocket check for a single monitor
send_uptime_notification | Sends Slack/email/webhook notifications
cleanup_old_checks | Removes check records older than 30 days
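
The selection rule for process_due_checks (next_check_at <= now) and the 30-day retention used by cleanup_old_checks can be sketched as plain functions. This is an in-memory illustration with hypothetical names (`Monitor`, `select_due_monitors`, `schedule_next`, `cleanup_cutoff`). The real activities presumably query the database, and filtering on `is_active` is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Monitor:
    name: str
    check_interval_seconds: int
    next_check_at: datetime
    is_active: bool = True


def select_due_monitors(monitors: List[Monitor], now: datetime) -> List[Monitor]:
    """Monitors whose next scheduled check time has arrived."""
    return [m for m in monitors if m.is_active and m.next_check_at <= now]


def schedule_next(monitor: Monitor, now: datetime) -> None:
    """Move next_check_at forward by the monitor's configured interval."""
    monitor.next_check_at = now + timedelta(seconds=monitor.check_interval_seconds)


def cleanup_cutoff(now: datetime, retention_days: int = 30) -> datetime:
    """Check records with checked_at before this timestamp would be removed."""
    return now - timedelta(days=retention_days)
```

Because the dispatcher itself runs every 60 seconds, a monitor with a 60-second interval is picked up on roughly every run, while longer intervals are skipped until their next_check_at passes.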

Inspect runs and history in the Temporal UI at http://localhost:8080.


Frontend Pages#

Path | Description
/uptime | Dashboard with stats and active incidents
/uptime/monitors | List all monitors with create/manage
/uptime/monitors/{id} | Monitor detail with stats and history
/uptime/incidents | List all incidents
/uptime/incidents/{id} | Incident detail with timeline
/uptime/history | Check history browser

Configuration#

Environment Variables#

No additional environment variables are required. The module uses these existing variables:

  • REDIS_URL - For caching and LLM rate limiting
  • TEMPORAL_ADDRESS - For the Temporal workflow engine that runs checks
  • DATABASE_URL - For PostgreSQL storage

Check Intervals#

Value (seconds) | Description
60 | Every minute
300 | Every 5 minutes
900 | Every 15 minutes
1800 | Every 30 minutes
3600 | Every hour

Default Settings#

Setting | Default
timeout_seconds | 30
consecutive_failures_threshold | 3
verify_ssl | true
http_method | GET
expected_status_codes | [200, 201, 204]
notify_on_recovery | true
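
The defaults and allowed intervals above can be captured in a small settings object. The sketch below is illustrative only: `MonitorSettings` and `ALLOWED_INTERVALS` are hypothetical names, not the Pydantic schemas in schemas/uptime.py, and rejecting other interval values with `ValueError` is an assumption about how validation might behave.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_INTERVALS = (60, 300, 900, 1800, 3600)  # seconds, from the interval table


@dataclass
class MonitorSettings:
    check_interval_seconds: int
    timeout_seconds: int = 30
    consecutive_failures_threshold: int = 3
    verify_ssl: bool = True
    http_method: str = "GET"
    expected_status_codes: List[int] = field(default_factory=lambda: [200, 201, 204])
    notify_on_recovery: bool = True

    def __post_init__(self) -> None:
        # Reject intervals outside the documented set
        if self.check_interval_seconds not in ALLOWED_INTERVALS:
            raise ValueError(
                f"check_interval_seconds must be one of {ALLOWED_INTERVALS}"
            )
```

For example, `MonitorSettings(check_interval_seconds=300)` yields a five-minute monitor with every other field at its documented default.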

File Structure#

frontend/src/
├── app/
│   ├── (app)/uptime/
│   │   ├── page.tsx                    # Dashboard
│   │   ├── monitors/
│   │   │   ├── page.tsx                # Monitors list
│   │   │   └── [monitorId]/page.tsx    # Monitor detail
│   │   ├── incidents/
│   │   │   ├── page.tsx                # Incidents list
│   │   │   └── [incidentId]/page.tsx   # Incident detail
│   │   └── history/page.tsx            # Check history
│   └── products/uptime/page.tsx        # Product landing page
└── lib/
    └── uptime-api.ts                   # Uptime API client

backend/src/aexy/
├── models/uptime.py                    # SQLAlchemy models
├── schemas/uptime.py                   # Pydantic schemas
├── services/
│   ├── uptime_service.py               # Business logic
│   └── uptime_checker.py               # Check executors
├── api/uptime.py                       # REST endpoints
└── temporal/activities/uptime.py       # Temporal activities (process_due_checks, execute_check, cleanup_old_checks)

Troubleshooting#

Monitor Not Running Checks#

  1. Verify monitor is active (is_active = true)
  2. Check the Temporal worker is running (uptime checks run as Temporal scheduled activities):
    docker compose logs temporal-worker
    
  3. Open the Temporal UI at http://localhost:8080 and verify the uptime schedule is active and recent runs are succeeding.

Checks Timing Out#

  1. Increase timeout_seconds on the monitor
  2. Check if endpoint is accessible from container network
  3. Verify any firewall rules allow outbound connections

Ticket Not Created#

  1. Verify consecutive_failures_threshold has been reached
  2. Check ticket service is accessible
  3. Review Temporal worker logs for errors

Notifications Not Sending#

  1. Verify notification channel is configured on monitor
  2. Check Slack integration is connected (for Slack alerts)
  3. Verify webhook URL is reachable
  4. Review Temporal worker logs

SSL Errors#

  1. If the certificate is valid but checks still fail, try setting verify_ssl = false temporarily to confirm the failure is SSL-related
  2. Check certificate chain is complete
  3. Verify system trust store includes required root CAs

WebSocket Connection Failures#

  1. Verify WS URL uses correct protocol (ws:// or wss://)
  2. Check if server requires specific subprotocols
  3. Test connection manually with wscat:
    wscat -c wss://your-endpoint.com/ws
    

Access Control#

Uptime monitoring is enabled for:

  • Engineering Bundle: Full access
  • Full Access Bundle: Full access

Permission required: can_view_uptime

Configure access in Settings > Access for your workspace.