Ops IQ Appliance Dashboard
You can access these dashboards from the Nasuni Portal (portal.nasuni.com).
Ops IQ → Appliance Dashboard
Public Preview
Ops IQ is being launched in Public Preview to give all customers early access to these powerful new capabilities while allowing us to gather real-world feedback. During this phase, your input helps shape the final experience. While the core functionality is stable, features may evolve based on what we learn from customers like you. We encourage you to explore, experiment, and share feedback—it’s a key part of building a product that truly works for you.
Prerequisites
Requires version 9.15 or higher on all appliances (Edge, File IQ, and HA Edge). No additional setup is needed for these dashboards.
Using these dashboards, you can view data since appliances were updated to version 9.15 or higher.
Note: Changes in the underlying metric calculations can affect the history of certain resource metrics.
Permissions
Access to the dashboard is controlled by the Ops IQ permissions under Roles & Permissions. To access the appliance dashboard, ensure the 'View Appliance Dashboard' permission is enabled for your role.
Ops IQ permissions are enabled for the following pre-canned roles:
- Account Owner
- Super User
- Read Only
Note: Ops IQ permissions are disabled for any custom roles. Admins with the 'Edit Roles' (IAM) permission need to grant view permission to users with custom roles. After permissions are updated, users must re-authenticate for the changes to take effect.
Data Granularity
Appliances collect telemetry data every minute and send reports every five minutes. Refreshing the dashboard or website within a five-minute window will not display new data until the next report is sent.
Data Collection
Appliances run collectd, a lightweight open-source daemon, along with supporting Nasuni software to collect, aggregate, and dispatch system and application performance metrics. Most resource data is sampled every 10 seconds and averaged over one-minute intervals. As a result, short-lived spikes in system telemetry may appear flattened or smoothed out in the charts.
Binning
These charts use MAX aggregation when binning data over time. As you adjust the time range, each data point represents the maximum value observed within its interval.
Impact:
- Longer time range: Data is aggregated over larger intervals, which may flatten short spikes or make them less visible while highlighting occasional peaks.
- Zooming in (shorter time range): Shows more granular detail, revealing peaks that were previously binned together.
This behavior helps highlight the highest observed values, but keep in mind that spike visibility decreases as the time range expands.
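To make the binning behavior concrete, here is a small illustrative sketch (with made-up sample values) of how MAX aggregation folds per-minute samples into wider intervals:

```python
# Hypothetical sketch of MAX binning: per-minute samples are grouped into
# fixed-width intervals, and each interval reports the maximum value seen.
def bin_max(samples, bin_size):
    """Group per-minute samples into bins of `bin_size` points, keeping
    the maximum of each bin (MAX aggregation)."""
    return [max(samples[i:i + bin_size]) for i in range(0, len(samples), bin_size)]

# Six per-minute CPU readings with one short spike at 97%.
readings = [41, 43, 97, 40, 42, 44]

# Narrow bins (zoomed in) keep the spike as its own data point...
print(bin_max(readings, 2))   # [43, 97, 44]
# ...while wide bins (zoomed out) fold everything into a single peak value.
print(bin_max(readings, 6))   # [97]
```

Because the aggregation is MAX rather than average, the peak itself survives at any zoom level, but its duration and surrounding shape are lost as bins widen.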
Disaster Recovery
After a disaster recovery event, telemetry from the original appliance is not retained. Ops IQ will begin collecting and reporting telemetry from the newly restored appliance.
Appliance uptime (shown under Appliance Details) indicates when the appliance was last booted and can help verify its operational duration.
Dashboard Details
Appliance Selector
Click 'Select new Appliance' and choose the appliance whose details, health status, and resource telemetry you want to view.
Appliance Overview
Appliance details are minimized by default. To expand appliance details, click the “Show Appliance Details” link.
Appliance Details:
- Serial Number
- Private IP Address
- Appliance Time Zone: Local timezone of the appliance
- Uptime: Time since the last boot
- Volumes: Total number of volumes (owned and connected)
- Available NEA Version: Shows the available update version. The currently installed version is listed below the selected appliance name.
- Last DR Config: Time when the last backup bundle was uploaded to the Nasuni Orchestration Center (NOC). During a Disaster Recovery, this configuration bundle is used to recover the appliance.
Hardware Details
- Platform: Underlying hardware platform
- CPU (GHz)
- RAM (GiB)
- Cache Disk (GiB)
- OS Disk (GiB)
- CoW Disk (GiB)
- File IQ Disk (GiB): Available only on File IQ appliances.
Current Appliance Health
The top gauges give you quick insight into appliance health. They monitor the current health of the appliance based on telemetry from the last 30 minutes. Changing the time range or time zone does not affect these gauges.
Gauges
Load Average: The current system load on the appliance. System load is a measure of the number of processes actively competing for CPU time.
If an appliance has 8 CPU cores, then:
| 15-min Load Average | Implications |
| --- | --- |
| 4.0 (a load average lower than the number of cores) | Indicates underutilization — the CPU has available headroom. |
| 8.0 (all cores are utilized) | The system is fully utilized (each core has one process running or ready to run). |
| 9.0 (a load average higher than the number of cores) | Suggests CPU contention — there are more processes wanting CPU time than available cores. |
CPU Utilization: Measures how busy the CPU is, expressed as a percentage of time the CPU spends doing work vs. being idle. Sustained high CPU utilization can lead to performance degradation, queue buildup, and a poor user experience.
| CPU Utilization | Implications |
| --- | --- |
| <80% | The CPU has adequate headroom for new processes. |
| 80–95% | The system can become unresponsive or sluggish. New processes queue up for CPU time. |
| >95% | All cores are saturated — no headroom for additional work or spikes. |
Memory Utilization: High memory utilization is not always detrimental, but it can lead to memory pressure, resulting in severe degradation of system performance or even crashes when the system runs out of memory. High memory utilization can affect system performance differently than high CPU utilization.
- Memory Utilization is calculated as: Memory Utilization = Total Memory - Available Memory
This reflects how much memory is actively in use and helps identify potential memory pressure.
For appliances running versions earlier than 10.0.4:
Available memory is estimated as: free + buffered + cached memory
- Note: A portion of the cached memory may not be available for consumption.
For appliances running version 10.0.4 and later:
Available memory is collected directly using the free -m command, which provides a more accurate system-level estimate of usable memory without swapping.
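As an illustration of the two calculations described above (values in GiB are made up, and the helper names are ours, not Nasuni's):

```python
# Sketch of the two available-memory estimates described above.
def available_pre_10_0_4(free, buffered, cached):
    # Pre-10.0.4 appliances: available ~ free + buffered + cached.
    # Note: a portion of cached memory may not actually be reclaimable.
    return free + buffered + cached

def utilization_pct(total, available):
    # Memory Utilization = Total Memory - Available Memory, as a percentage.
    return (total - available) / total * 100

total = 32.0  # GiB
# Pre-10.0.4 estimate, built from its components:
est = available_pre_10_0_4(free=2.0, buffered=1.0, cached=9.0)   # 12.0 GiB
print(round(utilization_pct(total, est), 1))   # 62.5
# On 10.0.4+ the "available" figure is read directly from `free -m`
# instead of being estimated from these components.
```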
| Memory Utilization | Implications |
| --- | --- |
| <80% | There is a healthy amount of available memory on the appliance. No Out Of Memory (OOM) risk. |
| 80–95% | Early signs of memory pressure that may degrade performance, lead to instability over time, or escalate into critical issues. Monitor closely. |
| >95% | The system cannot efficiently allocate memory to new or growing processes. |
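The gauge thresholds above can be sketched as a small classifier. This is illustrative only; the labels "ok", "warning", and "critical" are placeholders, not Ops IQ terminology:

```python
# Illustrative sketch of the gauge thresholds described above.
# The labels "ok" / "warning" / "critical" are our placeholders.
def cpu_status(utilization_pct):
    if utilization_pct < 80:
        return "ok"          # adequate headroom for new processes
    if utilization_pct <= 95:
        return "warning"     # system may become sluggish; processes queue
    return "critical"        # all cores saturated, no headroom for spikes

def memory_status(total_gib, available_gib):
    # Memory Utilization = Total Memory - Available Memory, as a percentage.
    used_pct = (total_gib - available_gib) / total_gib * 100
    return cpu_status(used_pct)  # same <80 / 80-95 / >95 bands

print(cpu_status(72))           # ok
print(memory_status(16, 0.5))   # critical (96.9% utilized)
```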
Monitoring Resource Trends
Controls
- Appliance Selector: Pick a 9.15+ appliance to fetch its details and telemetry.
- Time Range: Lists pre-defined time ranges from 10 minutes to 24 hours. For time ranges greater than 24 hours, select custom dates from the dropdown.
  - Notes:
    - Selecting a time range greater than 10 days can impact page load time, as fetching all the data points takes a few extra seconds.
    - You can retrieve a maximum of 14 days of data at a time. You can always go further back in time to retrieve another 14-day window.
    - Changes to the underlying data or collection methods can impact historical data points, making older data points inaccessible.
    - The appliance dashboard shows data since the appliance was updated to version 9.15 or later.
- Time Zone: Select the appropriate timezone.
  - Default: Browser timezone.
  - Tip: Select the appliance's timezone to view telemetry in its local time. The appliance timezone is available under Appliance Details.
- Auto-Refresh: Available for both the Appliance Dashboard and Volume Dashboard. When enabled, dashboards refresh every five minutes to show up-to-date information.
  - Auto-refresh is disabled by default.
Chart Interactions
The following chart interactions are available:
- View in full screen: Opens the chart in full screen to view granular details.
- Print chart: Print the selected chart. Using this option, you can save it as a PDF file.
- Download JPEG
- Download CSV
System Performance
Load Average
This chart displays the system's load average over the past 5, 10, and 15 minutes. Load average represents the number of active processes that are either using the CPU or waiting for it.
- How to read it:
- Each line shows the rolling average over its time window:
- 5m – short-term trend (recent spikes)
- 10m – mid-term trend
- 15m – long-term trend
- Interpretation:
- Compare values to the number of vCPUs assigned to the VM.
- Load ≈ #Cores → Fully utilized
- Load < #Cores → CPU has headroom
- Load > #Cores → CPU contention; processes may be delayed
- Example:
- If the VM has 4 vCPUs and the 5-minute load average is 6.2, the system is over capacity and likely experiencing performance degradation.
- Remediations:
- If the load on your appliance is consistently higher than the number of CPU cores, we recommend increasing the CPU cores on the appliance or distributing the workload across other appliances.
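The load-versus-cores comparison above can be sketched as follows. The helper is illustrative, not part of Ops IQ; os.getloadavg is the standard Python way to read the 1/5/15-minute values on a Linux system:

```python
import os

# Illustrative sketch: interpret a load average relative to the core count.
def load_headroom(load_avg, cores):
    """Return a rough interpretation of a load average for `cores` vCPUs."""
    if load_avg < cores:
        return "headroom"        # CPU has spare capacity
    if load_avg == cores:
        return "fully utilized"  # each core has one runnable process
    return "contention"          # processes are waiting for CPU time

# Example from the text: 4 vCPUs with a 5-minute load average of 6.2.
print(load_headroom(6.2, 4))   # contention

# On a live Linux system, the current 1/5/15-minute load averages
# can be read with:
one, five, fifteen = os.getloadavg()
```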
CPU Utilization
This chart displays the percentage of CPU resources utilized over time across all vCPUs. It includes time spent on user applications, system tasks, and background processing.
- How to read it:
- Spikes indicate periods of intense CPU activity.
- Interpretation:
- >90% (sustained) → CPU is likely a bottleneck; may impact application performance
- Frequent 100% spikes → Tasks may queue or experience latency.
Memory Usage
For appliances running version 10.0.4 and later:
This chart shows how much system memory is being utilized over time, using a simplified and more accurate measure of available memory, as reported by the Linux free -m utility.
- Memory Utilization: Calculated as Total Memory - Available Memory, representing memory in active use.
- Data Source: Memory utilization data is collected using the Linux free -m utility.
- How to Read It:
- A low memory utilization value indicates the system has ample headroom.
- A rising memory utilization trend suggests increasing memory pressure.
- Interpretation:
- High memory utilization over time may indicate a growing workload or a memory leak.
- Sudden drops in available memory without recovery may warrant further investigation.
For appliances running versions earlier than 10.0.4:
This chart displays how system memory is being used over time, broken down into:
- Utilized: Memory actively used by applications and the OS
- Cached: Memory used to cache frequently accessed files and data
- Buffered: Memory temporarily used for filesystem metadata and I/O operations
- How to read it:
- Utilized memory typically dominates, but cached and buffered areas are reclaimable when needed.
- Interpretation:
- High memory utilization is normal, especially if cache and buffers are large.
- Cached and buffered memory helps performance — it speeds up disk and file access.
- Low free memory becomes problematic only when cache/buffers are shrinking rapidly, indicating that the reclaimable portion of utilized memory will soon be exhausted.
- Learn more about how Linux consumes RAM: Help! Linux ate my RAM!
Network Activity
SMB Sessions
This chart shows the total number of active SMB (Server Message Block) sessions over time.
- How to read it:
- The y-axis shows the total number of concurrent SMB sessions.
- Each point indicates the number of active client connections at that moment.
- Interpretation:
- Stable session count indicates consistent client activity.
- Sudden drops may suggest disconnects, network issues, or access problems.
- Sharp spikes could indicate load tests, backup jobs, or atypical usage patterns.
- Sessions climbing continuously without leveling off may signal resource exhaustion or runaway client behavior.
Network Utilization: Client
This chart displays network throughput between clients and the appliance, broken down as:
- Client Reads: Data sent to clients (e.g., file and directory reads)
- Client Writes: Data received from clients (e.g., file uploads and writes)
- How to read it:
- The y-axis shows data rate (Kbps, Mbps, and Gbps).
- Higher Client Reads typically indicate read-heavy workloads.
- Higher Client Writes may indicate backups or write-heavy activity such as migrations.
- Interpretation:
- Balanced traffic suggests normal file access behavior.
- Sustained high Client Reads may indicate large downloads or frequent access to files, which may trigger more Cloud Downloads and increased cache misses.
- Spikes in either direction may align with sync jobs, backups, or user batch activity.
- Watch out for:
- Utilization nearing network interface capacity can become a bottleneck.
- Sudden unexpected drop in both directions may suggest client disconnects or network issues.
Network Utilization: Cloud
This chart shows the data transfer between the appliance and the cloud object store, broken down as:
- Cloud Downloads: Data retrieved from cloud storage (e.g., file restores, cache misses, metadata lookups)
- Cloud Uploads: Unprotected data pushed to the cloud (e.g., snapshots, file changes, new content)
- Note: This chart also shows network activity between the appliance and the Nasuni Orchestration Center (NOC) as cloud downloads and uploads.
- How to read it:
- The y-axis shows throughput (Kbps, Mbps, and Gbps).
- Cloud Uploads often dominate during ingest or migration.
- Cloud Downloads indicate access to uncached data or sync operations.
- Interpretation:
- Spikes in Cloud Downloads may indicate file restores, cache rehydration, or large read activity.
- Steady Cloud Uploads typically reflect continuous data ingest or scheduled backups.
- Low values reflect idle workloads or fully cached access.
- Watch out for:
- Cloud Downloads spike unexpectedly — may indicate frequent cache misses or user/machine-initiated read activity.
- Sustained high Cloud Uploads — could signal large data ingestion, sync jobs, or ransomware-related activity.
- Throughput nearing uplink limits may cause performance degradation — check Quality of Service (QoS) settings.
Cache Performance
Cache Hits and Misses
This chart tracks how often file requests are served from the local cache (hits) versus fetched from the cloud (misses).
- Cache Hit: File or metadata was available locally — served immediately.
- Cache Miss: File or metadata was not in cache — had to be downloaded from the cloud.
- How to read it:
- The y-axis shows the total counts of cache hits and cache misses.
- High cache hit counts indicate effective caching.
- Cache miss spikes often occur during the first time a file or directory is accessed or when clients request infrequently accessed files or directories.
- Interpretation:
- High cache hit ratio → Efficient performance; most client I/O stays local
- Frequent cache misses → Cache is too small, being evicted too often, or clients are accessing cold data
- Watch out for:
- A sudden increase in cache misses may indicate a shift in workload.
- A sudden drop in cache hits can correlate with a recent cache eviction.
- High cache miss rate during peak usage — can impact client performance due to cloud fetch delays. In the event of a sustained high miss rate, consider upgrading to a larger cache for low-latency access to your files.
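As a quick illustration of the hit-ratio reasoning above (the sample counts are invented):

```python
# Illustrative cache hit ratio computed from hit/miss counts like those
# plotted in this chart.
def hit_ratio(hits, misses):
    """Percentage of requests served from the local cache."""
    total = hits + misses
    return hits / total * 100 if total else 0.0

print(round(hit_ratio(9_500, 500), 1))    # 95.0 -> most client I/O stays local
print(round(hit_ratio(6_000, 4_000), 1))  # 60.0 -> heavy cloud fetching; the
                                          # cache may be undersized for the
                                          # active working set
```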
Cache Utilization
This chart shows how much of the appliance’s configured cache is currently in use. The cache stores frequently accessed files and metadata locally to reduce cloud fetches and improve performance.
- How to read it:
- The y-axis shows cache usage as a percentage of the total cache size.
- Refer to the appliance detail section to quickly see the cache disk size.
- A growth in cache utilization can be attributed to new data or faulting in of existing data from the cloud.
- A sharp decline in cache utilization indicates eviction. The Edge evicts items from cache based on the cache reserve settings configured via the Nasuni Management Console (NMC).
- Refer to this document for more details on cache eviction: Nasuni Edge Appliance Eviction Algorithm
- Interpretation:
- Moderate utilization (50–80%) is expected in healthy systems.
- Very low utilization may indicate over-provisioning or excessive cache flushing.
- Near 100% utilization can lead to frequent evictions and increased cache misses.
- Watch out for:
- 100% Cache Utilization — No room for new data. The Edge will encounter issues evicting data from its cache and is in an unhealthy state.
- Cache remains above 95% — May signal issues with eviction.
- Multiple drops in Cache Utilization
- Typical for migration jobs or large ingest operations.
- If seen in day-to-day operations, it may result in slower file access times.
- High cache utilization, combined with rising cache misses, suggests that the cache is too small for the active working set. Consider expanding the cache disk.
Disk Performance
Depending on its type, an appliance will have the following disks:
- Cache Disk
- OS Disk
- CoW Disk
- File IQ DB Disk (Only on File IQ appliances)
Note: Currently, Ops IQ reports IOPS and Total I/O Time for the Cache and OS disks only.
IOPS
This chart displays the number of read and write I/O operations per second (IOPS) performed by the disk. IOPS measures the number of discrete read or write requests issued to the disk and is a key indicator of disk activity and performance.
- How to read it:
- The y-axis shows the number of read and write operations per second.
- Read IOPS: Number of read operations per second
- Write IOPS: Number of write operations per second
- Data Source:
- Metrics are collected via the collectd disk plugin, using disk_ops.read and disk_ops.write, derived from Linux’s /proc/diskstats.
- disk_time.read: The average time taken to complete a read operation.
- disk_time.write: The average time taken to complete a write operation.
- How to Interpret
- Higher IOPS typically indicates greater disk activity. This can be expected under load from internal services or file data and metadata activity.
- Sudden spikes may indicate burst workloads or unexpected demand.
- Consistent IOPS activity indicates ongoing client or system access to the disk.
- In case of the Cache Disk:
- High read IOPS → Clients are reading files already present in cache, or data is faulted in from the S3 bucket.
- High write IOPS → New or modified data is being written to the cache disk before protecting it with a snapshot.
- Sustained high IOPS, especially if approaching hardware limits, may lead to increased latency or contention.
- Best Practices
- Compare against baseline IOPS for your disk type (HDDs: hundreds, SSDs: thousands to tens of thousands).
- Monitor for changes over time that may suggest workload shifts, performance degradation, or emerging bottlenecks.
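For readers curious how IOPS falls out of /proc/diskstats, here is a rough sketch. It assumes the standard diskstats layout (field 4 is reads completed, field 8 is writes completed), and the snapshots below are fabricated. Because the counters are cumulative, IOPS is the delta between two samples divided by the sampling interval:

```python
# Sketch: derive per-second IOPS from /proc/diskstats, the same kernel
# source collectd's disk plugin reads.
def parse_diskstats(text, device):
    """Return (reads_completed, writes_completed) for `device`."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]), int(fields[7])
    raise ValueError(f"device {device!r} not found")

def iops(sample_a, sample_b, interval_s):
    """IOPS from two cumulative (reads, writes) snapshots."""
    (r0, w0), (r1, w1) = sample_a, sample_b
    return (r1 - r0) / interval_s, (w1 - w0) / interval_s

# Two fabricated snapshots taken 10 seconds apart (the sampling interval
# mentioned under "Data Collection"):
snap0 = "   8   0 sda 1000 0 0 0 2000 0 0 0"
snap1 = "   8   0 sda 1500 0 0 0 2300 0 0 0"
read_iops, write_iops = iops(parse_diskstats(snap0, "sda"),
                             parse_diskstats(snap1, "sda"), 10)
print(read_iops, write_iops)   # 50.0 30.0
```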
Disk I/O Time (Latency and Saturation)
This chart displays the time spent on disk I/O operations, providing insights into disk latency and potential saturation. It helps in understanding how responsive the disk is to requests and whether it's becoming a bottleneck.
- How to read it:
- I/O Time (Total): The total amount of time, in milliseconds, that the disk spent actively performing I/O operations (both reads and writes). It's a direct measure of how busy the disk is.
- Weighted I/O Time (Saturation): This line represents the total time spent doing I/O, but it is weighted by the number of concurrent or overlapping I/O requests that the disk was handling at any given moment. This metric provides a comprehensive view of disk saturation and contention, as it accounts for both the time operations take and the cumulative impact of multiple operations being processed simultaneously or waiting in the disk's queue.
- Data Source:
- Metrics are collected via the collectd disk plugin, using disk_io_time.io_time and disk_io_time.weighted_io_time, derived from Linux’s /proc/diskstats.
- How to Interpret:
- I/O Time (Total):
- Higher values indicate increased disk activity and longer periods where the disk is actively engaged in processing requests.
- Sustained high values approaching 1000ms (1 second) over a 1-second interval suggest the disk is operating near its capacity.
- Spikes can indicate burst workloads or periods of intense disk usage.
- Weighted I/O Time (Saturation):
- Higher values signify increased disk contention and potential I/O bottlenecks. When Weighted I/O Time is significantly higher than I/O Time, the disk is not only busy: many concurrent I/O requests are competing for disk resources or waiting in the queue, meaning demand for overlapping I/O exceeds the disk's effective capacity to process it efficiently.
- A sustained increase in Weighted I/O Time (especially when I/O Time is also high) is a strong indicator of disk saturation, leading to increased application latency and degraded performance.
- Watch out for:
- High and sustained I/O Time: The disk is constantly busy, potentially indicating a workload that exceeds its capabilities.
- Weighted I/O Time significantly higher than I/O Time: A critical indicator of disk saturation and a bottleneck; I/O requests are queuing up, leading to noticeable performance degradation.
- Best Practices:
- Establish Baselines: Review last month’s telemetry to understand the typical I/O Time and Weighted I/O Time for your disk under normal operating conditions. This will help you identify anomalies more easily.
- Address Saturation: If Weighted I/O Time consistently indicates saturation, consider:
- Distributing workloads across multiple appliances.
- Upgrading to faster disk hardware (e.g., SSDs or NVMe).
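The "Weighted I/O Time much higher than I/O Time" rule of thumb above can be sketched as a simple heuristic. The numeric thresholds here are illustrative assumptions, not Ops IQ values:

```python
# Rough saturation heuristic based on the interpretation above: weighted
# I/O time far above plain I/O time means requests are queuing.
def disk_saturation(io_time_ms, weighted_io_time_ms, interval_ms=1000):
    busy_pct = io_time_ms / interval_ms * 100            # how busy the disk was
    queue_factor = (weighted_io_time_ms / io_time_ms) if io_time_ms else 0.0
    # Assumed thresholds: >90% busy AND heavy request overlap.
    saturated = busy_pct > 90 and queue_factor > 2.0
    return busy_pct, queue_factor, saturated

# Disk busy for 950 ms of a 1000 ms interval, with heavy request overlap:
print(disk_saturation(950, 4750))   # (95.0, 5.0, True)
# Moderately busy disk with little queuing:
print(disk_saturation(400, 500))    # (40.0, 1.25, False)
```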
FAQs
- The Appliance Dashboard loads, but none of the charts have data. How do I fix this?
- Ops IQ requires Nasuni appliances to be on version 9.15 or later. Update the appliance to the latest version.
- Ensure the appliance is online.
- Are NIQ Appliances included in the Appliance Dashboard?
- Yes. NEA and NIQ appliances are included in the dashboard.
- What causes gaps in telemetry?
- Several factors can cause an interruption in the telemetry. Here are a few causes:
- The appliance was shut down, or the appliance was recovered via disaster recovery.
- An internal issue prevented Nasuni services from collecting the resource metrics.
- How do I provide feedback during the Public Preview?
- We would love your feedback on these new dashboards. Use the 'Share Feedback' feature to tell us what you think of them.