Uptime Monitoring - Aexy Docs

Aexy's uptime monitoring module provides endpoint health checking for HTTP, TCP, and WebSocket services. Monitors run on configurable intervals and automatically create support tickets when services go down.

Features#

Multi-Protocol Checks: HTTP, TCP, and WebSocket endpoint monitoring
Configurable Intervals: 1, 5, 15, 30, or 60 minute check intervals
Automatic Incident Management: Incidents created after consecutive failures
Ticket Integration: Automatically creates and closes support tickets
Notifications: Slack, email, and webhook alerts
SSL Monitoring: Track SSL certificate expiration (HTTP checks)
Post-Mortem Notes: Root cause and resolution documentation

Check Types#

HTTP Check#

Monitors HTTP/HTTPS endpoints with configurable settings.

Setting	Description
URL	Full URL to check (http:// or https://)
HTTP Method	GET, POST, HEAD, OPTIONS
Expected Status Codes	Array of valid status codes (default: [200, 201, 204])
Request Headers	Custom headers (e.g., Authorization)
Verify SSL	Whether to verify SSL certificates
Timeout	Request timeout in seconds (default: 30)

Collected Metrics:

Response time (ms)
Status code
SSL certificate expiry (days remaining)
Error messages

TCP Check#

Monitors TCP port availability.

Setting	Description
Host	Hostname or IP address
Port	TCP port number
Timeout	Connection timeout in seconds (default: 30)

Collected Metrics:

Connection time (ms)
Connection success/failure
Error type (timeout, connection_refused)

WebSocket Check#

Monitors WebSocket endpoints with optional message validation.

Setting	Description
URL	WebSocket URL (ws:// or wss://)
WS Message	Optional message to send on connect
WS Expected Response	Expected response pattern (regex)
Timeout	Connection timeout in seconds (default: 30)

Collected Metrics:

Connection time (ms)
Handshake success
Response validation result
Error messages

Incident Management#

How It Works#

Consecutive Failures: Monitor tracks consecutive failed checks
Threshold Reached: When failures >= threshold (default: 3), incident is created
Ticket Created: Support ticket automatically created with incident details
Notifications Sent: Slack/email/webhook alerts dispatched
Recovery Detected: When check succeeds, incident is resolved
Ticket Closed: Linked ticket automatically closed with resolution notes

Failure Flow#

Check fails
    ↓
Increment consecutive_failures
    ↓
consecutive_failures >= threshold?
    ├─ No → Schedule next check
    └─ Yes → Create incident
                ↓
            Create support ticket
                ↓
            Send notifications

Recovery Flow#

Check succeeds
    ↓
Has ongoing incident?
    ├─ No → Reset counters
    └─ Yes → Resolve incident
                ↓
            Close linked ticket
                ↓
            Send recovery notifications (if enabled)

Incident States#

Status	Description
ongoing	Incident in progress, service still down
acknowledged	Team aware of issue, working on resolution
resolved	Service recovered, incident closed

Monitor Status#

Status	Description
up	All checks passing
down	Threshold failures reached
degraded	Some failures but under threshold
paused	Monitoring temporarily disabled

Notifications#

Supported Channels#

Channel	Configuration
Slack	Set `slack_channel_id` on monitor
Email	Uses team notification preferences
Webhook	Custom `webhook_url` for external integrations

Notification Events#

Incident Created: When threshold failures reached
Incident Acknowledged: When team acknowledges incident
Incident Resolved: When service recovers (if notify_on_recovery enabled)

Webhook Payload#

{
  "event": "incident.created",
  "monitor": {
    "id": "uuid",
    "name": "API Server",
    "check_type": "http",
    "url": "https://api.example.com/health"
  },
  "incident": {
    "id": "uuid",
    "status": "ongoing",
    "started_at": "2024-01-15T10:30:00Z",
    "error_message": "Connection timeout after 30s"
  },
  "workspace_id": "uuid"
}

Ticket Integration#

Automatic Ticket Creation#

When an incident is created, a support ticket is automatically generated:

Title: [UPTIME] {monitor_name} is down
Severity: high
Priority: urgent
Description: Monitor details, error message, timestamp
Team: Assigned to configured team (if set)

Automatic Ticket Closure#

When service recovers:

Resolution comment added to ticket:

Service has recovered.

Duration: 15 minutes
Started: 2024-01-15 10:30:00 UTC
Resolved: 2024-01-15 10:45:00 UTC
Total checks during incident: 15
Failed checks: 15

Ticket status set to closed
Recovery notifications sent (if enabled)

API Endpoints#

Monitors#

GET    /workspaces/{id}/uptime/monitors                    # List monitors
POST   /workspaces/{id}/uptime/monitors                    # Create monitor
GET    /workspaces/{id}/uptime/monitors/{monitor_id}       # Get monitor
PATCH  /workspaces/{id}/uptime/monitors/{monitor_id}       # Update monitor
DELETE /workspaces/{id}/uptime/monitors/{monitor_id}       # Delete monitor
POST   /workspaces/{id}/uptime/monitors/{monitor_id}/pause # Pause monitoring
POST   /workspaces/{id}/uptime/monitors/{monitor_id}/resume # Resume monitoring
POST   /workspaces/{id}/uptime/monitors/{monitor_id}/test  # Run immediate test
GET    /workspaces/{id}/uptime/monitors/{monitor_id}/checks # Get check history
GET    /workspaces/{id}/uptime/monitors/{monitor_id}/stats # Get monitor stats

Incidents#

GET    /workspaces/{id}/uptime/incidents                   # List incidents
GET    /workspaces/{id}/uptime/incidents/{incident_id}     # Get incident
PATCH  /workspaces/{id}/uptime/incidents/{incident_id}     # Update incident
POST   /workspaces/{id}/uptime/incidents/{incident_id}/acknowledge # Acknowledge
POST   /workspaces/{id}/uptime/incidents/{incident_id}/resolve     # Resolve

Statistics#

GET    /workspaces/{id}/uptime/stats                       # Workspace stats

Database Tables#

uptime_monitors#

Stores monitor configurations.

Column	Type	Description
id	UUID	Primary key
workspace_id	UUID	FK to workspaces
name	VARCHAR(255)	Display name
check_type	ENUM	http, tcp, websocket
url	VARCHAR(2048)	For HTTP/WS checks
host	VARCHAR(255)	For TCP checks
port	INTEGER	For TCP checks
http_method	VARCHAR(10)	GET, POST, HEAD, OPTIONS
expected_status_codes	JSONB	Array of valid status codes
request_headers	JSONB	Custom headers
verify_ssl	BOOLEAN	SSL verification
ws_message	TEXT	Message to send on WS connect
ws_expected_response	TEXT	Expected WS response pattern
check_interval_seconds	INTEGER	60, 300, 900, 1800, 3600
timeout_seconds	INTEGER	Request timeout (default 30)
consecutive_failures_threshold	INTEGER	Failures before alerting
current_status	ENUM	up, down, degraded, paused
last_check_at	TIMESTAMP	Last check time
next_check_at	TIMESTAMP	Next scheduled check
consecutive_failures	INTEGER	Current failure streak
notification_channels	JSONB	["slack", "email", "webhook"]
slack_channel_id	VARCHAR	Slack channel for alerts
webhook_url	VARCHAR	Custom webhook URL
notify_on_recovery	BOOLEAN	Send recovery notification
team_id	UUID	FK to teams (for ticket routing)
is_active	BOOLEAN	Whether monitoring is active

uptime_checks#

Stores individual check results (time-series data).

Column	Type	Description
id	UUID	Primary key
monitor_id	UUID	FK to uptime_monitors
is_up	BOOLEAN	Check result
status_code	INTEGER	HTTP status code
response_time_ms	INTEGER	Response time
error_message	TEXT	Error details
error_type	VARCHAR	timeout, connection_refused, ssl_error
ssl_expiry_days	INTEGER	Days until SSL expiry
checked_at	TIMESTAMP	Check timestamp

uptime_incidents#

Stores incident records linked to tickets.

Column	Type	Description
id	UUID	Primary key
monitor_id	UUID	FK to uptime_monitors
workspace_id	UUID	FK to workspaces
ticket_id	UUID	FK to tickets
status	ENUM	ongoing, acknowledged, resolved
started_at	TIMESTAMP	When incident started
resolved_at	TIMESTAMP	When resolved
acknowledged_at	TIMESTAMP	When acknowledged
acknowledged_by_id	UUID	FK to developers
first_error_message	TEXT	Initial error
last_error_message	TEXT	Most recent error
total_checks	INTEGER	Checks during incident
failed_checks	INTEGER	Failed checks count
root_cause	TEXT	Post-mortem notes
resolution_notes	TEXT	Resolution details

Temporal Schedules#

Defined in backend/src/aexy/temporal/schedules.py:

{
    "id": "uptime-process-due-checks",
    "activity": "process_due_checks",
    "input_module": "aexy.temporal.activities.uptime",
    "interval": timedelta(seconds=60),
    "queue": TaskQueue.OPERATIONS,
},
{
    "id": "uptime-cleanup-old-checks",
    "activity": "cleanup_old_checks",
    "input_module": "aexy.temporal.activities.uptime",
    "interval": timedelta(hours=24),
    "queue": TaskQueue.OPERATIONS,
},

Activity Descriptions#

Activity	Description
`process_due_checks`	Runs every minute, dispatches checks for monitors where `next_check_at <= now`
`execute_check`	Executes HTTP/TCP/WebSocket check for a single monitor
`send_uptime_notification`	Sends Slack/email/webhook notifications
`cleanup_old_checks`	Removes check records older than 30 days

Inspect runs and history in the Temporal UI at http://localhost:8080.

Frontend Pages#

Path	Description
`/uptime`	Dashboard with stats and active incidents
`/uptime/monitors`	List all monitors with create/manage
`/uptime/monitors/{id}`	Monitor detail with stats and history
`/uptime/incidents`	List all incidents
`/uptime/incidents/{id}`	Incident detail with timeline
`/uptime/history`	Check history browser

Configuration#

Environment Variables#

No additional environment variables required. Uses existing:

REDIS_URL - For caching and LLM rate limiting
TEMPORAL_ADDRESS - For the Temporal workflow engine that runs checks
DATABASE_URL - For PostgreSQL storage

Check Intervals#

Value	Description
60	Every minute
300	Every 5 minutes
900	Every 15 minutes
1800	Every 30 minutes
3600	Every hour

Default Settings#

Setting	Default
`timeout_seconds`	30
`consecutive_failures_threshold`	3
`verify_ssl`	true
`http_method`	GET
`expected_status_codes`	[200, 201, 204]
`notify_on_recovery`	true

File Structure#

frontend/src/
├── app/
│   ├── (app)/uptime/
│   │   ├── page.tsx                    # Dashboard
│   │   ├── monitors/
│   │   │   ├── page.tsx                # Monitors list
│   │   │   └── [monitorId]/page.tsx    # Monitor detail
│   │   ├── incidents/
│   │   │   ├── page.tsx                # Incidents list
│   │   │   └── [incidentId]/page.tsx   # Incident detail
│   │   └── history/page.tsx            # Check history
│   └── products/uptime/page.tsx        # Product landing page
└── lib/
    └── uptime-api.ts                   # Uptime API client

backend/src/aexy/
├── models/uptime.py                    # SQLAlchemy models
├── schemas/uptime.py                   # Pydantic schemas
├── services/
│   ├── uptime_service.py               # Business logic
│   └── uptime_checker.py               # Check executors
├── api/uptime.py                       # REST endpoints
└── temporal/activities/uptime.py       # Temporal activities (process_due_checks, execute_check, cleanup_old_checks)

Troubleshooting#

Monitor Not Running Checks#

Verify monitor is active (is_active = true)
Check the Temporal worker is running (uptime checks run as Temporal scheduled activities):
```
docker compose logs temporal-worker
```
Open the Temporal UI at http://localhost:8080 and verify the uptime schedule is active and recent runs are succeeding.

Checks Timing Out#

Increase timeout_seconds on the monitor
Check if endpoint is accessible from container network
Verify any firewall rules allow outbound connections

Ticket Not Created#

Verify consecutive_failures_threshold has been reached
Check ticket service is accessible
Review Temporal worker logs for errors

Notifications Not Sending#

Verify notification channel is configured on monitor
Check Slack integration is connected (for Slack alerts)
Verify webhook URL is reachable
Review Temporal worker logs

SSL Errors#

If certificate is valid but failing, try setting verify_ssl = false
Check certificate chain is complete
Verify system trust store includes required root CAs

WebSocket Connection Failures#

Verify WS URL uses correct protocol (ws:// or wss://)
Check if server requires specific subprotocols
Test connection manually with wscat:
```
wscat -c wss://your-endpoint.com/ws
```

Access Control#

Uptime monitoring is enabled for:

Engineering Bundle: Full access
Full Access Bundle: Full access

Permission required: can_view_uptime

Configure access in Settings > Access for your workspace.