Node Name
|
Node Name is typically assigned by the system
administrator based on the desired naming convention. It is usually the same
as, or derived from, the Linux hostname of the server. Once selected by the
sysadmin, the value persists across OS reboots.
|
Node GUID
|
GUID of the HFI or Switch.
|
Port #
|
The link port number on which this SMP arrived.
|
Link State
|
Port State
|
Physical Link State
|
Physical Port State
|
LinkQualityIndicator
|
This is a status indicator, similar to the
signal strength bar display on a mobile phone, that reports link quality on
a scale of 0 to 5, with 5 being very good. Values in the lower part of the range
may indicate hardware problems (with a port or cable, for example) that surface
as signal integrity issues, leading to performance and other problems.
|
Link Width
|
The possible values for link width are 1x, 2x, 3x,
and 4x.
|
Link Width Enabled
|
The set of link widths that the LNI protocol
negotiates. LNI uses only Link Width Enabled (LW.E) to negotiate link width.
|
Link Width Supported
|
The link widths the port can negotiate to
during LNI. In some implementations, firmware, driver, or local device
settings may restrict this value further.
|
Active Link Speed
|
The link speed currently active on this port.
|
Link Speed Enabled
|
The link speeds enabled on this port.
|
Link Speed Supported
|
The link speeds supported by this port.
|
RcvData (MB)
|
Receive Data Rate in Megabytes per second
(MBps).
|
RcvPkts
|
The total number of received fabric data
packets.
|
MulticastRcvPkts
|
The total number of multicast and collective
packets received.
This counter includes all valid packets and all packets with a
header up to and including the DLID, where the DLID is within the configured
range for multicast or collectives. Packets within the configured multicast or
collective address space are counted, even if later checks determine the packet
is unroutable or exceeds the SwitchInfo.MulticastFDBCap,
SwitchInfo.CollectiveFDBCap, configured SwitchInfo.MulticastFDBTop or
configured SwitchInfo.CollectiveFDBTop.
|
RcvErrors
|
This counter indicates the total number of
packets containing an error that were received by the port, including physical
errors and malformed packets. It may indicate possible misconfiguration of a
port, either by the SM or (more likely) by user intervention (e.g., using a
tool such as opaportconfig).
|
RcvConstraintErrors
|
This counter is incremented when partition key
or source LID violations are detected in a received packet, indicating a
possible security issue or misconfiguration of device security settings.
|
RcvSwitchRelayErrors
|
This counter indicates the number of packets
that were dropped due to internal routing errors. It is indicative of the
possible misconfiguration of a switch by the SM.
|
RcvRemotePhysicalErrors
|
This counter indicates the number of
downstream effects of signal integrity problems. It is indicative of a signal
integrity (SI) issue in the upstream path.
|
RcvFECN
|
When a device receives a packet with the FECN
(Forward Explicit Congestion Notification) bit set to one, this counter is
incremented.
|
RcvBECN
|
When a device receives a packet with the BECN
(Backward Explicit Congestion Notification) bit set to one, this counter is
incremented.
|
RcvBubble
|
This counter indicates the total number of
"flit times" where one or more packets have started to be received, but the
receiver received idle flits from the wire.
|
XmitData (MB)
|
The total amount of fabric data transmitted, in
megabytes.
|
XmitPkts
|
The total number of fabric packets transmitted.
This counter includes all fabric packet head flits transmitted
with and without errors (such as PktBadHead).
|
MulticastXmitPkts
|
The total number of multicast and collective
packets transmitted.
|
XmitDiscards
|
The number of packets dropped due to one of the following
errors:
- Switch lifetime limit exceeded
- Switch head-of-queue lifetime limit exceeded
- Output port not in active state
- Packet length exceeded maximum fabric packet size for MTU for VL
- Flow control disabled and insufficient credits available
- SC2VL_t mapping invalid for given SC
|
XmitConstraintErrors
|
This counter is incremented when partition key
or source LID violations are detected in a packet queued for transmission,
indicating a possible security issue or misconfiguration of device security
settings.
|
XmitWait
|
This counter indicates the amount of time (in
"flit times") any virtual lane had data but was unable to transmit, for reasons
such as no credits being available or the link being busy sending non-data
packets such as link layer retraining or flow control.
|
XmitTimeCong
|
This counter indicates the total number of
"flit times" that the port was in a congested state.
|
XmitWastedBW
|
This counter indicates the number of "flit
times" where one or more packets have been started but the transmitters are
forced to send idles due to bubbles.
|
XmitWaitData
|
This counter indicates the number of "flit
times" where one or more packets have been started but interrupted due to
bubbles in the ingress stream.
|
LocalLinkIntegrityErrors
|
This counter indicates the number of retries
initiated by the link transfer layer. It may be indicative of low signal
quality, possibly caused by long or low-quality cables.
|
FMConfigErrors
|
This counter indicates inconsistencies in
low-level SMA configuration between the two sides of the link. It is indicative
of the possible misconfiguration of a port, either by the SM or (more likely) by
user intervention (e.g., using a tool such as opaportconfig).
|
ExcessiveBufferOverruns
|
This counter, associated with credit
management, indicates an input buffer overrun. It may indicate possible
misconfiguration of a port, either by the SM or (more likely) by user
intervention (e.g., using a tool such as opaportconfig).
|
SwPortCongestion
|
This switch-only counter indicates the number
of packets that were discarded because they could not be transmitted due to
flow control issues.
|
MarkFECN
|
This counter indicates the total number of
packets that were marked FECN by the transmitter due to congestion.
|
LinkErrorRecovery
|
This counter indicates the number of times the
link has successfully completed the link error recovery process. If LQI is
fluctuating toward low values and this counter is increasing, it may be
indicative of a bad link and a more severe signal quality problem.
|
LinkDowned
|
This counter indicates the total number of
times the port has failed the link error recovery process and downed the link.
A large number of occurrences of these events can cause disruptions to fabric
traffic.
|
UncorrectableErrors
|
This counter indicates the number of
unrecoverable internal device errors. It is indicative of a severe hardware
defect or data corruption on the wire.
|
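The descriptions of LinkQualityIndicator, LinkErrorRecovery, and LinkDowned above suggest a simple triage rule: a low LQI combined with a rising LinkErrorRecovery count points to a bad link, and a rising LinkDowned count is more serious still. The following is a minimal sketch of that rule, not part of any fabric tool. The names `PortSample` and `triage_link`, the returned labels, and the default threshold of 3 for "the lower part of the range" are all assumptions made for illustration; the source does not give a numeric cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortSample:
    """One reading of three per-port values (hypothetical container)."""
    link_quality_indicator: int  # LinkQualityIndicator, 0-5 (5 is very good)
    link_error_recovery: int     # LinkErrorRecovery counter
    link_downed: int             # LinkDowned counter


def triage_link(previous: PortSample, current: PortSample,
                low_lqi_threshold: int = 3) -> str:
    """Compare two readings of the same port and return a coarse label.

    low_lqi_threshold is an assumed cutoff, not a documented value.
    Counter clears or wraps between the two readings are not handled.
    """
    for sample in (previous, current):
        if not 0 <= sample.link_quality_indicator <= 5:
            raise ValueError("LinkQualityIndicator must be in the range 0-5")

    lqi_low = current.link_quality_indicator < low_lqi_threshold
    recoveries_rising = current.link_error_recovery > previous.link_error_recovery
    downed_rising = current.link_downed > previous.link_downed

    if downed_rising:
        # Link error recovery failed at least once and the link went down.
        return "link-downed"
    if lqi_low and recoveries_rising:
        # Low LQI and a rising recovery count: possible bad link.
        return "suspect-bad-link"
    if lqi_low:
        # Low LQI alone: possible port or cable signal integrity issue.
        return "check-signal-integrity"
    return "ok"


if __name__ == "__main__":
    before = PortSample(link_quality_indicator=4, link_error_recovery=2, link_downed=0)
    after = PortSample(link_quality_indicator=2, link_error_recovery=5, link_downed=0)
    print(triage_link(before, after))
```

Taking two readings rather than one matters because the counters are cumulative: only the change between readings shows whether recoveries or link-down events are still occurring.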