Appendix D: Health Monitor Overview


The Health Monitor service was introduced in the 8.8 release of the NMC and Edge Appliance. The Health Monitor status is available for Edge Appliances running version 8.8 or higher.

Status

Health Monitor status for Edge Appliances is available from the NMC Filers Overview page and the NMC Filer Details page, along with recommended remediation steps. In addition, warning and error messages are available via NMC notifications and the NMC API and, if configured, can be sent as email alerts, syslog messages, and SNMP traps.

Figure D-1: Health Monitor status.

The NMC Filers Overview “Health” column shows an aggregate of Health Monitor check status and reports the highest severity check status as the current Health Monitor status for the Edge Appliance. For example, if all checks are healthy, “Healthy” is displayed in the Filers Overview page. If one check is in an unhealthy state but all other checks are healthy, “Unhealthy” is displayed for the Edge Appliance.

Tip: If any Health condition is displayed as “Unhealthy”, you can view detailed information and any recommendations by hovering over the “Unhealthy” indicator. Alternatively, clicking “View Recommendations” opens the Health Monitor Current Status dialog box which displays detailed information and any recommendations.

Status Types

  • Unhealthy: The check is unhealthy. Edge Appliance functionality is likely to be degraded.

  • Warning: The check is approaching an unhealthy state. Not all checks include a warning status.

  • Healthy: The check is reporting no errors.

  • Disabled: This feature is currently disabled.

  • “ - ”: The Edge Appliance Health Monitor status is “Unknown”. The “Unknown” status is reported for pre-8.8 Edge Appliances.
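The aggregation rule for the "Health" column described above can be sketched in a few lines of Python. This is only an illustration, not the NMC implementation: the numeric severity ranking, the function name `aggregate_status`, and the choice to skip "Disabled" checks and return "Unknown" when nothing is reported are all assumptions.

```python
# Assumed severity ranking for illustration; the documentation only states
# that the highest-severity check status is reported for the appliance.
SEVERITY = {"Healthy": 0, "Warning": 1, "Unhealthy": 2}


def aggregate_status(check_statuses):
    """Return the highest-severity status among the reported checks."""
    # Assumption: "Disabled" (or any unrecognized value) does not count.
    ranked = [s for s in check_statuses if s in SEVERITY]
    if not ranked:
        # Assumption: no usable check results maps to "Unknown".
        return "Unknown"
    return max(ranked, key=SEVERITY.__getitem__)


# One unhealthy check is enough to mark the whole appliance "Unhealthy".
print(aggregate_status(["Healthy", "Healthy", "Unhealthy", "Disabled"]))
```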

Polling and Thresholds

Each Health Monitor check has an associated default polling interval and default polling threshold. The polling interval is the number of minutes between polls. The polling threshold is the number of consecutive polls that must match the conditions for a status before the check status changes to that status. Consider the following example:

Polling Interval: 1 minute

Polling Threshold: 30

Expected result: Status for the associated check is polled every minute. If the condition associated with the “Warning” or “Unhealthy” state is matched for 30 consecutive polling events, the associated status is triggered.

The polling interval and polling threshold for each Health Monitor check use default values that cannot currently be adjusted using the NMC UI. Back-end configuration options allow Nasuni Customer Support to customize these values if required.
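The interaction between polling interval and polling threshold can be illustrated with a small Python sketch. The class name `ThresholdTracker` is hypothetical, and the recovery behavior (a single healthy poll clears the status) is an assumption; the documentation only describes how "Warning" and "Unhealthy" are triggered.

```python
class ThresholdTracker:
    """Change status only after `threshold` consecutive matching polls."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.candidate = None      # non-healthy condition seen on recent polls
        self.streak = 0            # how many consecutive polls matched it
        self.status = "Healthy"

    def poll(self, observed):
        if observed == "Healthy":
            # Assumption: one healthy poll resets the streak and the status.
            self.candidate, self.streak, self.status = None, 0, "Healthy"
            return self.status
        if observed == self.candidate:
            self.streak += 1
        else:
            # A different condition starts a new streak.
            self.candidate, self.streak = observed, 1
        if self.streak >= self.threshold:
            self.status = observed
        return self.status


tracker = ThresholdTracker(threshold=30)
for _ in range(29):
    tracker.poll("Unhealthy")
print(tracker.status)             # still "Healthy" after 29 matching polls
print(tracker.poll("Unhealthy"))  # the 30th consecutive match triggers it
```

With a polling interval of 1 minute and a polling threshold of 30, the condition therefore has to persist for roughly 30 minutes before the status changes.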

Monitors and Remediation

Tip: You can also monitor hardware conditions using iDRAC. See iDRAC Configuration.

CPU

If CPU usage exceeds 90 percent (average across cores), the "Warning" status is triggered. If CPU usage exceeds 95 percent (average across cores), the "Unhealthy" status is triggered.

Polling Interval: 1 minute

Polling Threshold: 5 consecutive

Remediation - Add CPUs if possible or contact Nasuni Customer Support.
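As a rough illustration of the CPU thresholds above, the following Python sketch averages per-core usage and maps the result to a status. The function names are hypothetical, and "exceeds" is interpreted as strictly greater than. The same comparison would apply to the Memory check, which uses the same 90 and 95 percent values.

```python
def average_cpu(per_core_percent):
    """Average usage across cores, as the CPU check description implies."""
    return sum(per_core_percent) / len(per_core_percent)


def classify_usage(percent, warning=90.0, unhealthy=95.0):
    """Map a usage percentage to a status using the documented thresholds."""
    if percent > unhealthy:
        return "Unhealthy"
    if percent > warning:
        return "Warning"
    return "Healthy"


# Four cores averaging 94 percent: above 90, but not above 95.
cores = [99, 97, 88, 92]
print(average_cpu(cores), classify_usage(average_cpu(cores)))  # 94.0 Warning
```

Note that this only classifies a single sample. The status itself changes only after the condition holds for the polling threshold (5 consecutive polls for CPU).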

Cyber Resilience

Health Monitor periodically checks the status of the Cyber Resilience service.

  • If no volumes have Ransomware Detection enabled, this check is disabled. The display field is also disabled.

  • If at least one volume has Ransomware Detection enabled, this check is enabled. The status of the Cyber Resilience service is as follows:

    • Unhealthy if any of the following conditions is true:

      • The attempt to acquire a new set of known ransomware patterns has failed twice.

      • The Ransomware Detection service is in a failed state.

    • Healthy otherwise.

Polling Interval:

  • For the known ransomware patterns update check: Once every six hours.

  • For the Ransomware Detection service check: Every minute.

Polling Threshold:

  • For the known ransomware patterns update check: 2 failures.

  • For the Ransomware Detection service check: 30 failures.

Remediation:

  • For the known ransomware patterns update check: No customer action necessary. Optionally contact Support. Any errors are shown in the log files. The status automatically switches to Healthy after the first successful update of known ransomware patterns.

  • For the Ransomware Detection service check: An alert is emitted. No customer action necessary. Optionally contact Support.
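The decision logic for the Cyber Resilience check can be summarized in a short Python sketch. The function and parameter names are hypothetical. `service_failed` is assumed to mean that the Ransomware Detection service check has already reached its polling threshold of 30 failures.

```python
def cyber_resilience_status(volumes_with_detection,
                            pattern_update_failures,
                            service_failed):
    """Illustrative mapping of the documented conditions to a status."""
    if volumes_with_detection == 0:
        # No volume has Ransomware Detection enabled: the check is disabled.
        return "Disabled"
    if pattern_update_failures >= 2 or service_failed:
        return "Unhealthy"
    return "Healthy"


print(cyber_resilience_status(0, 5, True))   # Disabled
print(cyber_resilience_status(3, 2, False))  # Unhealthy
print(cyber_resilience_status(3, 1, False))  # Healthy
```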

Directory Services

If the Edge Appliance is joined to Active Directory and the Nasuni Edge Appliance AD health polling job fails five times consecutively, the "Unhealthy" status is triggered.

Polling Interval: 5 minutes

Polling Threshold: 5 consecutive

Customers that have implemented ID mapping based on RFC2307 Active Directory Unix attributes need to ensure that the Active Directory Machine account for the Edge Appliance has a defined UID and GID in Active Directory.

Remediation - Confirm that the Edge Appliance can contact the Active Directory Domain and confirm that the AD domain controllers for the site are online. Attempt to use the Edge Appliance UI to rejoin the Edge Appliance to AD. If that fails, open a case with Nasuni Customer Support.

Disk: Disk Errors

If the system's io_error_cnt is more than 0, the "Unhealthy" status is triggered.

If SMART reports an error (only relevant for hardware Edge Appliances), the “Unhealthy” status is triggered.

Polling Interval: 10 minutes

Polling Threshold: 1 event

Remediation - Contact Nasuni Customer Support. A disk might need to be replaced.

File System: Internal File System Utilization

If the Nasuni filesystem is using more than 80 percent of available space, the "Warning" status is triggered. If any filesystem is using more than 90 percent of available space, the "Unhealthy" status is triggered.

Note: Cache and Copy on Write (COW) disk utilization are not included in the Health Monitor file system utilization checks.

Polling Interval: 5 minutes

Polling Threshold: 10 consecutive

Remediation - High internal file system utilization could indicate a problem with key Nasuni processes. If the internal file system fills, processes could fail. Contact Nasuni Customer Support.

Memory

If Edge Appliance memory usage exceeds 90 percent, the "Warning" status is triggered. If memory usage exceeds 95 percent, the "Unhealthy" status is triggered.

Polling Interval: 1 minute

Polling Threshold: 30 consecutive

Remediation - Consider adding memory if possible or contact Nasuni Customer Support.

Memory Fragmentation

Health Monitor checks for excessive memory fragmentation and supports both “Warning” and “Unhealthy” thresholds. Memory fragmentation events are not currently visible in the Filer Overview or Filer Details page, although “Warning” and “Unhealthy” messages are logged to NMC notifications. Memory fragmentation can impact Edge Appliance performance or cause operations to fail.

Polling Interval: 1 minute

Polling Threshold: 5 consecutive

Remediation - Contact Nasuni Customer Support if the errors continue.

NFS: File Protocol Access

Health Monitor periodically attempts to check NFS exports and SMB (CIFS) shares. If these checks fail, the "Unhealthy" status is triggered. The Edge Appliance Health Monitor details and remediation messages do not list the name of the specific “Unhealthy” NFS export or SMB (CIFS) share, but the NMC notifications do.

Polling Interval: 3 minutes

Polling Threshold: 3 consecutive

Remediation:

  • Check NMC notifications to obtain the names of unhealthy NFS exports or SMB (CIFS) shares. A notification is raised for each NFS export or SMB (CIFS) share that is inaccessible.

  • Confirm that the configured Nasuni volume path referenced for the NFS export or SMB (CIFS) share is valid. If the path has been renamed or moved, the NFS export or SMB (CIFS) share could potentially be unavailable.

  • Manually test NFS or SMB (CIFS) connectivity from a client.

  • If the Edge Appliance is joined to Active Directory and notifications indicate that many or all SMB (CIFS) shares are unavailable on an Edge Appliance, this may indicate that the health check process lacks permission to check the status of the SMB (CIFS) shares.

    Two configurations could cause this issue:

    • Using the Windows MMC “Shared Folders” snap-in or the “Computer Management: Shared Folders” properties to edit share permissions is not supported by Nasuni, and could remove the permissions that the health check process depends on to check SMB (CIFS) share health.

      Rather than using the Windows MMC interface for shares to edit share permissions, you should use the built-in Nasuni share Authentication option to control the users and groups that can access shares.

      If using the Windows MMC to edit share permissions is a requirement, ensure that the Active Directory Machine account for the Edge Appliance has at least read access to all Nasuni shares in the MMC.

    • Customers that have implemented ID mapping based on RFC2307 Active Directory Unix attributes need to ensure that the Active Directory Machine account for the Edge Appliance has a defined UID and GID in Active Directory.

Nasuni File IQ

Health Monitor periodically checks the status of Nasuni File IQ. The status of Nasuni File IQ is as follows:

  • Unhealthy if any of the following conditions is true:

    • NFIQ database is down for an extended period of time.

    • Grafana is down for an extended period of time.

    • NFIQ Audit Event Consumer Service (fsep) is down or experiencing errors for an extended period of time.

    • NFIQ Metadata Crawler Service (fsms) is down or experiencing errors for an extended period of time.

    • NFIQ Audit Event Aggregator Service (fsagg) is down or experiencing errors for an extended period of time.

    • NFIQ Audit Event Consumer Service (fsep) is not able to connect to Azure Event Hubs for an extended period of time.

  • Healthy otherwise.

Polling Interval:

  • For the NFIQ Audit Event Consumer Service: Once every 15 minutes.

  • For all the other Nasuni File IQ services: Every minute.

Polling Threshold:

  • For all the Nasuni File IQ services: 30 failures.

Remediation:

  • For the NFIQ Audit Event Consumer Service: Open *.servicebus.windows.net:9093 outbound and ensure that no Intrusion Prevention System (IPS) or firewall blocks connectivity or SSL.

  • For all the other Nasuni File IQ services: No customer action necessary. Optionally contact Support. Any errors are shown in the log files.
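When verifying the firewall rule for the NFIQ Audit Event Consumer Service, a quick outbound TCP test from a host on the same network segment can help. The Python sketch below is a minimal example using only the standard library. The host name in the usage comment is a placeholder, not a real namespace, and the function name `can_connect` is hypothetical.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (placeholder host; substitute your Event Hubs namespace):
# can_connect("your-namespace.servicebus.windows.net", 9093)
```

A successful TCP connection does not prove that SSL traffic passes unmodified, so an Intrusion Prevention System could still interfere after the connection is established.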

Network: Network Connectivity Checks

Polling Group 1

  • Nasuni NMC messaging queues - If a GET call to the SQS "queue_exists" endpoint fails, the "Unhealthy" status is triggered.

  • Global File Lock - Checks if connection to the Global File Lock server endpoint is functional. If a GET call to the NOC's GFL health check endpoint fails, the "Unhealthy" status is triggered.

Polling Interval: 10 minutes

Polling Threshold: 3 consecutive

Polling Group 2

  • Nasuni Orchestration Center (NOC) - If a GET call to the NOC's AUTH endpoint fails, the "Unhealthy" status is triggered.

  • Object Storage - Checks if each of the volumes can connect to the cloud. If a GET call fails for any volume, the "Unhealthy" status is triggered.

Polling Interval: 5 minutes

Polling Threshold: 3 consecutive

Remediation - Confirm network routing and connectivity between the Edge Appliance and network endpoints. The Nasuni Service Console running on the Edge Appliance offers built-in network utilities for traceroute and ping.
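The connectivity checks above are described as GET calls that either succeed or fail. A comparable probe can be sketched in Python as follows. The actual endpoints and success criteria used by Health Monitor are not documented here, so the URL in the usage comment is a placeholder, treating only 2xx responses as success is an assumption, and the `opener` parameter exists purely to make the function testable.

```python
import urllib.error
import urllib.request


def endpoint_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET to `url` completes with a 2xx response."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, connection failures, timeouts, or a malformed URL.
        return False


# Example usage (placeholder URL):
# endpoint_reachable("https://endpoint.example.com/health")
```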

Services: Nasuni Services

Nasuni Services - If a critical Nasuni service is not running, the "Unhealthy" status is triggered.

Most Services

Polling Interval: 1 minute

Polling Threshold: 5 consecutive

UnityFS local file system process

Polling Interval: 5 minutes

Polling Threshold: 3 consecutive

Remediation - Contact Nasuni Customer Support.