# Maintenance

> Regular maintenance tasks, system tuning, and operational procedures for optimal SOC performance

This guide covers regular maintenance tasks, system tuning, and operational procedures to ensure the Enterprise SOC infrastructure operates at peak performance. Proper maintenance prevents system degradation, maintains detection effectiveness, and ensures long-term reliability.

## Maintenance Overview

Regular maintenance is essential for SOC health. Schedule maintenance windows during low-activity periods and always have rollback procedures ready.

* Quick health checks and operational verification to ensure all systems are functioning correctly
* Rule updates, log review, and performance optimization for sustained effectiveness
* Deep system analysis, comprehensive updates, and capacity planning reviews
* Major upgrades, disaster recovery testing, and security assessments

## Regular Maintenance Tasks

### System Health Checks

Verify all SOC components are operational:

* **Wazuh**: Check manager and agent status

  ```bash theme={null}
  /var/ossec/bin/wazuh-control status
  /var/ossec/bin/agent_control -l
  ```

* **Elasticsearch**: Verify cluster health

  ```bash theme={null}
  curl -X GET "localhost:9200/_cluster/health?pretty"
  ```

* **Logstash/Fluentd**: Check pipeline status and throughput
* **Zabbix**: Verify server and agent connectivity
* **Prometheus**: Check target health and scrape status
* **TheHive**: Confirm platform accessibility and background jobs

Review disconnected agents:

* Identify offline Wazuh agents
* Check for network connectivity issues
* Verify agent services are running
* Document persistent offline agents
* Escalate critical system outages

Confirm logs are being received:

* Check Logstash/Fluentd event rates
* Verify events
appearing in Elasticsearch
* Review pipeline errors and failed events
* Monitor queue depths and backlogs
* Identify silent log sources

Ensure alerting is functioning:

* Verify recent alerts in Wazuh
* Check TheHive integration status
* Test notification channels (email, Slack, etc.)
* Review alert delivery times

Check disk space on all systems:

* Elasticsearch data nodes (alert at 75% usage)
* Wazuh manager log storage
* Backup storage capacity
* Database storage
* Archive retention

### Quick Performance Check

* **Query Response Times**: Test dashboard load times (should be \< 5 seconds)
* **Indexing Rate**: Verify Elasticsearch indexing keeps pace with ingestion
* **CPU/Memory**: Check for resource exhaustion on critical systems
* **Network Throughput**: Monitor bandwidth utilization

### Rule and Signature Updates

Update Snort and Suricata signatures:

**Snort:**

```bash theme={null}
# Pull latest rules from source
pulledpork.pl -c /etc/snort/pulledpork.conf

# Test configuration
snort -T -c /etc/snort/snort.conf

# Restart Snort
systemctl restart snort
```

**Suricata:**

```bash theme={null}
# Update rules using suricata-update
suricata-update

# Reload rules without restart
kill -USR2 "$(pidof suricata)"
```

**Post-Update:**

* Review new rules added
* Monitor for new false positives
* Document any rule suppressions needed

Update Wazuh detection rules:

```bash theme={null}
# Backup current rules
cp -r /var/ossec/ruleset/rules /var/ossec/ruleset/rules.backup.$(date +%F)

# Update Wazuh ruleset (script availability depends on the Wazuh version)
/var/ossec/bin/update_ruleset

# Test rules interactively against sample log lines
/var/ossec/bin/wazuh-logtest

# Restart Wazuh manager
systemctl restart wazuh-manager
```

Review custom rules in `/var/ossec/etc/rules/local_rules.xml` for compatibility.

Refresh threat intelligence feeds:

* Update IOC databases
* Import new MISP events (if using MISP)
* Update IP reputation lists
* Refresh malware hash databases
* Update domain blocklists
* Sync with industry threat feeds

Tune detection rules:

* Review top noisy
alerts from the past week
* Create suppression rules for confirmed false positives
* Adjust alert severity levels
* Update correlation thresholds
* Document tuning decisions

**Example Wazuh suppression:**

```xml theme={null}
3 local_rules.xml
```

Review and prioritize vulnerabilities:

* Check for new CVEs affecting SOC infrastructure
* Review Wazuh vulnerability detection results
* Prioritize patching based on risk
* Schedule patch deployment
* Verify patch application

### Performance Optimization

* **Elasticsearch Index Optimization**:

  ```bash theme={null}
  # Force merge indices that are no longer being written to
  # (narrow the pattern so it excludes the current write index)
  curl -X POST "localhost:9200/wazuh-alerts-*/_forcemerge?max_num_segments=1"
  ```

* **Clear old logs and temporary files**
* **Review slow query logs**
* **Optimize heavy dashboard queries**
* **Check for index bloat**

### Comprehensive System Review

Apply system updates:

**Operating System Updates:**

```bash theme={null}
# Ubuntu/Debian
apt update && apt upgrade -y

# CentOS/RHEL
yum update -y
```

**SOC Component Updates:**

* Wazuh manager and agents
* Elasticsearch cluster
* Logstash/Fluentd
* TheHive and Cortex
* Zabbix server and agents
* Prometheus and exporters

Test updates in a staging environment before production deployment. Always have a rollback plan ready.
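A rollback plan for a component update can start with a timestamped archive of its configuration directory, taken before the update runs. The following is a minimal sketch under stated assumptions: `snapshot_config` and `restore_config` are hypothetical helper names, and the paths in the usage comment are examples only, not part of the procedures described in this guide.

```bash
# snapshot_config SRC_DIR BACKUP_ROOT   (hypothetical helper)
# Archives SRC_DIR into BACKUP_ROOT and prints the path of the archive.
snapshot_config() {
  src=$1
  dest_root=$2
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$dest_root/$(basename "$src")-$stamp.tar.gz"
  mkdir -p "$dest_root" || return 1
  # -C stores paths relative to the parent of SRC_DIR,
  # so the archive can be unpacked over the same parent later
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  printf '%s\n' "$archive"
}

# restore_config ARCHIVE TARGET_PARENT   (hypothetical helper)
# Unpacks a snapshot_config archive back under TARGET_PARENT.
restore_config() {
  tar -xzf "$1" -C "$2"
}

# Example usage (paths are assumptions):
#   archive=$(snapshot_config /etc/logstash /mnt/backup/pre-update)
#   ...apply the update, validate...
#   restore_config "$archive" /etc      # only if the update must be rolled back
```

Printing the archive path lets the calling script record exactly which snapshot belongs to which change, which is useful when documenting the rollback step in a change ticket.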
Manage log data lifecycle:

**Elasticsearch Index Management:**

```bash theme={null}
# Delete indices older than 90 days
curator_cli --host localhost delete_indices --filter_list \
  '[{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":90}]'

# Close indices older than 30 days (keep but not searchable)
curator_cli --host localhost close --filter_list \
  '[{"filtertype":"age","source":"name","direction":"older","timestring":"%Y.%m.%d","unit":"days","unit_count":30}]'
```

**Archive old data:**

* Snapshot indices to long-term storage
* Compress archived logs
* Verify archive integrity
* Update retention documentation

Analyze resource usage trends:

* Review storage growth rate
* Project future capacity needs (3-6 months)
* Analyze CPU and memory trends
* Review network bandwidth utilization
* Identify resource bottlenecks
* Plan infrastructure upgrades

**Key Metrics:**

* Events per second (EPS) trend
* Storage growth (GB per day)
* Query performance trends
* Agent count growth

Audit user access and permissions:

* Review active user accounts
* Verify role assignments
* Remove inactive accounts
* Audit privileged access
* Review API key usage
* Update access documentation

**Systems to review:**

* Wazuh dashboard access
* Elasticsearch users
* TheHive user accounts
* System SSH access
* Service accounts

Evaluate detection coverage:

* Map detections to MITRE ATT\&CK framework
* Identify coverage gaps
* Review detection rule effectiveness
* Analyze false positive rates
* Update detection priorities
* Document coverage improvements

Verify integrations are functioning:

* Test Wazuh → TheHive alert creation
* Verify Cortex analyzer connectivity
* Test IDS → Logstash → Elasticsearch pipeline
* Confirm Prometheus → Alertmanager flow
* Validate email/Slack notifications
* Check firewall log ingestion

### Documentation Updates

* Update runbooks with new procedures
* Document configuration changes
* Refresh architecture diagrams
* Update contact lists
* Review and update
incident playbooks

### Major Updates and Testing

Plan and execute major upgrades:

* Review release notes for breaking changes
* Test upgrades in staging environment
* Back up all configurations and data
* Schedule maintenance window
* Execute upgrade following vendor procedures
* Validate functionality post-upgrade
* Update documentation

**Upgrade Priority:**

1. Security patches (immediate)
2. Critical bug fixes (within 1 month)
3. Feature updates (quarterly)

Validate backup and recovery procedures:

* Test restore from backups
* Verify backup completeness
* Practice failover procedures
* Test DR site readiness (if applicable)
* Document recovery times (RTO/RPO)
* Update DR documentation
* Train staff on DR procedures

Disaster recovery testing is critical. Untested backups are not backups.

Conduct security review of SOC infrastructure:

* Vulnerability scan all SOC systems
* Review security configurations
* Audit authentication mechanisms
* Test network segmentation
* Review firewall rules
* Assess encryption in transit and at rest
* Penetration test SOC components (optional)

Establish performance baselines:

* Measure query response times
* Benchmark indexing rates
* Test maximum EPS capacity
* Measure alert processing latency
* Document baseline metrics
* Compare against previous quarters
* Identify performance degradation

Verify regulatory compliance:

* Review audit logs for completeness
* Verify log retention meets requirements
* Confirm encryption standards
* Validate access controls
* Review incident documentation
* Generate compliance reports
* Address any findings

### Strategic Planning

* Review SOC metrics and KPIs
* Assess team training needs
* Plan infrastructure improvements
* Budget for upcoming year
* Evaluate new technologies
* Update SOC roadmap

## Log Retention and Cleanup

### Retention Policy Guidelines

**30 days** - Full search and analysis

All logs immediately searchable in Elasticsearch with full indexing

**31-90 days** - Reduced access

Closed
indices, which must be reopened before they can be searched

**91-365 days** - Archive storage

Snapshots stored on cheaper storage, restore required for access

**1-7 years** - Compliance retention

Compressed archives for regulatory compliance, rarely accessed

### Elasticsearch Index Lifecycle Management

Use Elasticsearch Index Lifecycle Management (ILM) to automate index transitions through lifecycle phases.

**Example ILM Policy:**

```json theme={null}
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": {"max_num_segments": 1},
          "shrink": {"number_of_shards": 1}
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

The `freeze` action is deprecated in recent Elasticsearch releases, so check the ILM documentation for your version before using it.

### Wazuh Log Management

**Archive old Wazuh logs:**

```bash theme={null}
# Compress logs older than 30 days
find /var/ossec/logs/archives -name "*.log" -mtime +30 -exec gzip {} \;

# Move compressed archives to cold storage
find /var/ossec/logs/archives -name "*.gz" -mtime +90 -exec mv {} /mnt/cold-storage/wazuh/ \;

# Delete archives older than retention policy
find /mnt/cold-storage/wazuh -name "*.gz" -mtime +365 -delete
```

## Performance Tuning

### Elasticsearch Optimization

**Index Settings Optimization:**

```json theme={null}
{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "refresh_interval": "30s",
    "codec": "best_compression"
  }
}
```

**Best Practices:**

* Use time-based indices (daily or weekly rollover)
* Set appropriate shard count (aim for 20-50GB per shard)
* Increase refresh interval for write-heavy indices
* Enable compression for older indices
* Disable replicas during bulk indexing
* Use index templates for consistent settings

**Slow Query Analysis:**

```bash theme={null}
# Raise slow log logger verbosity
# (slow logs are only written once index-level slowlog thresholds are set)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.index.search.slowlog": "DEBUG",
"logger.index.indexing.slowlog": "DEBUG" } }' ``` **Optimization Techniques:** * Use filter context instead of query context when possible * Limit result size and use pagination * Avoid wildcard queries on large fields * Use index patterns to limit search scope * Cache frequently used aggregations * Optimize field mappings (use keyword for exact match) **JVM Heap Size:** ```bash theme={null} # In /etc/elasticsearch/jvm.options # Set heap to 50% of available RAM, max 32GB -Xms16g -Xmx16g ``` **Best Practices:** * Set min and max heap size equal * Never exceed 32GB heap size * Allocate 50% of RAM to heap, leave 50% for filesystem cache * Use SSD storage for data directories * Ensure adequate CPU cores (2+ per node) * Monitor GC pauses (should be \< 1 second) ### Wazuh Performance Tuning **Increase concurrent agent connections:** ```xml theme={null} secure 1514 tcp 131072 no no yes ``` **Agent buffer optimization:** ```xml theme={null} no 5000 500 ``` ### Logstash/Fluentd Pipeline Tuning **Logstash pipeline workers:** ```yaml theme={null} # In /etc/logstash/logstash.yml pipeline.workers: 4 pipeline.batch.size: 250 pipeline.batch.delay: 50 queue.type: persisted queue.max_bytes: 1gb ``` ## Backup and Disaster Recovery ### Backup Strategy **What to backup:** * Elasticsearch indices (snapshots) * Wazuh manager configuration and rules * TheHive case database * Custom detection rules and scripts * System configurations * SSL certificates and keys * User and access control data **Elasticsearch Snapshots:** ```bash theme={null} # Create snapshot repository curl -X PUT "localhost:9200/_snapshot/backup_repository" -H 'Content-Type: application/json' -d' { "type": "fs", "settings": { "location": "/mnt/backup/elasticsearch", "compress": true } }' # Create snapshot (automated via cron) curl -X PUT "localhost:9200/_snapshot/backup_repository/snapshot_$(date +%F)" -H 'Content-Type: application/json' -d' { "indices": "wazuh-*,suricata-*,snort-*", "ignore_unavailable": true, 
"include_global_state": false }' ``` **Wazuh Configuration Backup:** ```bash theme={null} #!/bin/bash # Daily Wazuh backup script BACKUP_DIR="/mnt/backup/wazuh/$(date +%F)" mkdir -p $BACKUP_DIR # Backup configurations tar -czf $BACKUP_DIR/wazuh-config.tar.gz /var/ossec/etc/ # Backup rules tar -czf $BACKUP_DIR/wazuh-rules.tar.gz /var/ossec/ruleset/ # Backup agent keys cp /var/ossec/etc/client.keys $BACKUP_DIR/ ``` **Replicate to offsite location:** * Cloud storage (S3, Azure Blob, Google Cloud Storage) * Secondary datacenter * Tape backup for long-term retention * Encrypted backup transfer * Verify backup integrity after transfer **Quarterly restore tests:** * Restore Elasticsearch snapshot to test cluster * Restore Wazuh configuration to test manager * Verify data completeness and integrity * Document restore procedures and timing * Update DR documentation with findings ### Disaster Recovery Procedures Disaster recovery procedures must be tested regularly. Plan for complete SOC failure and practice recovery. **Recovery Priority:** 1. **Critical (RTO: 4 hours)** * Wazuh manager (detection and alerting) * Elasticsearch cluster (log search) * TheHive (incident management) 2. **High (RTO: 8 hours)** * IDS/IPS systems (Snort/Suricata) * Log ingestion pipeline (Logstash/Fluentd) * Prometheus monitoring 3. 
   **Medium (RTO: 24 hours)**
   * Zabbix infrastructure monitoring
   * Historical data restore
   * Dashboard customizations

**Recovery Procedures:**

* Determine scope of failure
* Identify affected systems
* Estimate recovery time
* Activate incident response team
* Notify stakeholders
* Deploy fresh OS on replacement hardware
* Restore system configurations from backup
* Restore application data
* Verify system functionality
* Re-establish network connectivity
* Restore Elasticsearch snapshots
* Import Wazuh agent keys
* Restore TheHive case database
* Verify data integrity
* Resume log ingestion
* Test all integrations
* Verify alerting functions
* Reconnect agents
* Resume normal operations
* Document recovery process and timing

## Compliance and Auditing

### Audit Log Management

Maintain comprehensive audit logs for security operations, system changes, and access to comply with regulations.

**What to Audit:**

* User authentication and authorization
* Configuration changes
* Rule modifications
* Incident access and modifications
* Data exports and queries
* System administrative actions
* Backup and restore operations

**Elasticsearch Audit Logging:**

```yaml theme={null}
# In elasticsearch.yml
xpack.security.audit.enabled: true
xpack.security.audit.logfile.events.include:
  - access_granted
  - access_denied
  - authentication_failed
  - authentication_success
  - connection_denied
  - connection_granted
```

### Compliance Reporting

Generate regular compliance reports:

* **Log retention compliance**: Verify retention periods met
* **Access reviews**: Document user access audits
* **Incident response**: Timeline and actions for all incidents
* **System availability**: Uptime and SLA metrics
* **Vulnerability management**: Patching compliance
* **Change management**: Documentation of all changes

## Maintenance Best Practices

Maintain detailed documentation of:

* Maintenance procedures
* Configuration changes
* Troubleshooting steps
* Lessons learned

Always test changes in staging:

*
  New rules and signatures
* Software updates
* Configuration modifications
* Integration changes

Have rollback procedures for:

* Configuration changes
* Software upgrades
* Rule deployments
* Infrastructure changes

Enhanced monitoring post-maintenance:

* Watch for new errors
* Monitor performance metrics
* Review alert volume
* Validate functionality

### Change Management Process

* Document what will change and why
* Identify affected systems
* Assess risk and impact
* Schedule maintenance window
* Prepare rollback plan
* Notify stakeholders of maintenance window
* Inform SOC team of expected changes
* Update status pages
* Set expectations for downtime
* Follow documented procedure
* Take snapshots/backups before making changes
* Make changes incrementally
* Test at each step
* Document actual changes made
* Test all affected functionality
* Verify integrations
* Check performance metrics
* Review logs for errors
* Confirm with stakeholders
* Record changes made
* Note any issues encountered
* Update configuration documentation
* Share lessons learned
* Close change ticket

### Maintenance Windows

**Recommended Schedule:**

* **Emergency Patches**: As needed (security critical)
* **Routine Updates**: Weekly, Tuesday 2-4 AM
* **Major Changes**: Monthly, first Sunday 12-6 AM
* **DR Testing**: Quarterly, scheduled 3 months in advance

Schedule maintenance during lowest traffic periods based on your organization's patterns. Review metrics to identify optimal windows.
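Automated maintenance jobs can guard themselves by checking whether the current time falls inside one of the recurring windows listed above. The following is a minimal sketch, assuming the windows use the server's local time and that "12-6 AM" means midnight to 06:00; `in_maintenance_window` is a hypothetical helper name, not an existing tool.

```bash
# in_maintenance_window DOW HOUR DOM   (hypothetical helper)
#   DOW:  day of week, 1=Monday .. 7=Sunday (as printed by `date +%u`)
#   HOUR: hour of day, 0-23, without a leading zero
#   DOM:  day of month, 1-31, without a leading zero
# Prints "routine", "major", or "none"; returns 0 only inside a window.
in_maintenance_window() {
  dow=$1
  hour=$2
  dom=$3
  # Routine updates: every Tuesday, 02:00-03:59
  if [ "$dow" -eq 2 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]; then
    echo "routine"
    return 0
  fi
  # Major changes: first Sunday of the month (day 1-7), 00:00-05:59
  if [ "$dow" -eq 7 ] && [ "$dom" -le 7 ] && [ "$hour" -lt 6 ]; then
    echo "major"
    return 0
  fi
  echo "none"
  return 1
}

# Example usage: strip the single leading zero that `date` adds,
# so values such as "08" are compared as plain decimal numbers.
#   hour=$(date +%H); hour=${hour#0}
#   dom=$(date +%d);  dom=${dom#0}
#   in_maintenance_window "$(date +%u)" "$hour" "$dom" || exit 0
```

Emergency patches and DR tests are left out on purpose, since they are scheduled case by case rather than on a fixed weekly or monthly pattern.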
## Troubleshooting Common Issues

### High Resource Usage

**Symptoms**: CPU, memory, or disk at capacity

**Solutions**:

* Identify resource-intensive processes
* Optimize heavy queries
* Increase refresh intervals
* Archive or delete old data
* Scale horizontally (add nodes)

### Agent Connectivity Issues

**Symptoms**: Agents showing as disconnected

**Solutions**:

* Verify network connectivity
* Check firewall rules (port 1514 for Wazuh)
* Restart agent service
* Re-key agent if authentication fails
* Check manager capacity

### Slow Query Performance

**Symptoms**: Dashboards loading slowly

**Solutions**:

* Review slow query logs
* Optimize query filters
* Reduce time range
* Make sure frequently filtered fields have a suitable mapping (for example, `keyword` for exact matches)
* Increase cluster resources

## Related Resources

* [Monitoring Guide](/operations/monitoring-guide) - Daily monitoring operations and alert management
* [Incident Handling](/operations/incident-handling) - Procedures for responding to security incidents
* [Threat Hunting](/operations/threat-hunting) - Proactive threat detection techniques