===============================================================================
Changes in 3.2.0
===============================================================================
# Enabled experimental platform support for Blackwell B200-based systems.
# Added one-shot and two-shot NVLINK SHARP (NVLS) allreduce algorithms for
half-precision (``float16``, ``bfloat16``) and full-precision (``float32``) datatypes on
NVLINK4- and NVLINK5-enabled platforms.
# Added multi-SM based acceleration of TP collectives (`reduce`, `fcollect`,
`reducescatter`) to improve NVLINK BW utilization on NVLINK4-enabled platforms,
achieving 8x/16x speedups for medium to large message sizes (>=1MB).
# NVSHMEM now also ships an LLVM IR bitcode device library to support MLIR-compliant
compiler toolchain integration with new and upcoming Python DSLs (Triton, Mosaic,
Numba, and so on). This feature also enhances perftest to support cubin-based
``cudaCooperativeLaunch`` and kernel function-based ``nvshmemx_collective_launch``
execution to improve robustness of the new bitcode device library.
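For reference, a minimal sketch of the kernel function-based ``nvshmemx_collective_launch``
path exercised here; the kernel body and single-element buffer are illustrative only:

    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void ring_put(int *dst) {
        int mype = nvshmem_my_pe();
        int peer = (mype + 1) % nvshmem_n_pes();
        nvshmem_int_p(dst, mype, peer);   /* write my PE id to my right neighbor */
    }

    int main(void) {
        nvshmem_init();
        int *dst = (int *)nvshmem_malloc(sizeof(int));
        void *args[] = { &dst };
        /* NVSHMEM-aware cooperative launch of the kernel on every PE. */
        nvshmemx_collective_launch((const void *)ring_put, dim3(1), dim3(1), args, 0, 0);
        cudaDeviceSynchronize();
        nvshmem_free(dst);   /* collective; also synchronizes PEs before freeing */
        nvshmem_finalize();
        return 0;
    }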
# Enhanced the NVSHMEM host/device-side collective and point-to-point perftests to use a
new command-line interface that supports runtime tuning of message size, datatype,
reduction op, iteration count, and so on.
# Improved heuristics for the automatic selection of on-stream NVLS
collectives for allgather, reducescatter, and allreduce operations that span
NVLINK-connected, GPU-based systems.
# Eliminated the dynamic link-time dependency on MPI and SHMEM in perftest and
examples, replacing it with dynamic load-time capability.
# Fixed a bug that was related to incorrect bus bandwidth reporting in the
``shmem_p_bw``, ``shmem_g_bw``, ``shmem_atomic_bw``, ``shmem_put_bw``, and ``shmem_get_bw`` perftests.
# Fixed a bug that was related to rounding error in NVLS reducescatter min and
max operations due to incorrect usage of vectorized ``float16`` instead of ``uint32`` datatypes.
# Fixed a bug that was related to dynamic loading of an unversioned bootstrap library.
# Fixed a bug that was related to linking CMake projects to system installer packages.
# Fixed a bug that was related to building the heterogeneous version of the device library.
# Fixed a bug that was related to establishing QP connections in the IBGDA transport
when using Dynamic Connection (DC) mode.
# Fixed a bug that was related to building perftests for earlier CUDA versions
(for example, 11.8) that do not support half-precision datatypes (for example, ``__nv_bfloat16``).
# Fixed a bug that was related to ABI compatibility breakage for the allreduce maxloc op.
# Fixed a bug that was related to a non-deterministic deadlock/race condition on the GPU when mixing
``nvshmemx_team_split_strided`` with ``nvshmemx_barrier_all_on_stream`` operations back-to-back.
# Fixed a bug that was related to out-of-memory (OOM) errors during dynamic device-memory-based
symmetric heap reservation on platforms with > 8 NVLINK-connected GPUs.
# Fixed a documentation bug that was related to incorrect usage of
``MPI_Bcast`` and the unversioned ``nvshmemx_init_attr_t`` structure when initializing NVSHMEM using a unique ID.
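For reference, a minimal sketch of the corrected unique-ID initialization flow; error
handling is omitted, the UID helper, type, and initializer names follow the UID
bootstrap documentation, and rank/size come from MPI:

    #include <mpi.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(int argc, char **argv) {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* PE 0 creates the unique ID; everyone receives it via MPI_Bcast. */
        nvshmemx_uniqueid_t id = NVSHMEMX_UNIQUEID_INITIALIZER;
        if (rank == 0) nvshmemx_get_uniqueid(&id);
        MPI_Bcast(&id, sizeof(nvshmemx_uniqueid_t), MPI_BYTE, 0, MPI_COMM_WORLD);

        /* Use the versioned initializer for the attribute structure. */
        nvshmemx_init_attr_t attr = NVSHMEMX_INIT_ATTR_INITIALIZER;
        nvshmemx_set_attr_uniqueid_args(rank, nranks, &id, &attr);
        nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);

        /* ... NVSHMEM work ... */

        nvshmem_finalize();
        MPI_Finalize();
        return 0;
    }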
# Fixed a bug that was related to host memory corruption/free when creating
multiple teams using ``nvshmem_team_split_strided``.

===============================================================================
Changes in 3.1.0
===============================================================================
# Added support for NVLINK SHARP (NVLS) based collective algorithms on x86 + Hopper and
Grace Hopper architecture based single- and multi-node NVLINK platforms for
popular deep-learning collective communications (ReduceScatter, Allgather, Allreduce)
device and on-stream APIs. This feature improves small-message latency by
2-3x when compared with one-shot algorithms over NVLINK.
# Added support for GPU kernels to query NVLS-enabled (multicast) symmetric memory
through the low-level `nvshmemx_mc_ptr` host and device API for a given target `team`.
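A minimal host-side sketch, assuming the query takes a team and a symmetric address
and returns the multicast mapping (or NULL when NVLS is unavailable):

    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(void) {
        nvshmem_init();
        float *buf = (float *)nvshmem_malloc(1024 * sizeof(float));

        /* Query the multicast (NVLS) mapping of a symmetric buffer for a team.
           A NULL return means no NVLS mapping is available for that team/buffer. */
        void *mc = nvshmemx_mc_ptr(NVSHMEM_TEAM_WORLD, buf);
        if (mc != NULL) {
            /* mc can be handed to device code that issues multimem accesses. */
        }

        nvshmem_free(buf);
        nvshmem_finalize();
        return 0;
    }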
# Added support for a new low-latency protocol (LL128) for Allgather collective
communication device and on-stream APIs.
# Enhanced the existing low-latency protocol (LL) warp-scoped collectives to
provide a 2x speedup over traditional algorithms when scaling up to 32 GPUs.
# Added support for half-precision (FP16/BF16) formats in collective
communication (ReduceScatter, Allgather, Allreduce) on-device and on-stream
APIs.
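An illustrative on-stream allreduce sketch using the ``float`` variant; the FP16/BF16
variants added here follow the same pattern with their respective typenames:

    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(void) {
        nvshmem_init();
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        const size_t n = 1 << 20;
        float *src = (float *)nvshmem_malloc(n * sizeof(float));
        float *dst = (float *)nvshmem_malloc(n * sizeof(float));

        /* Sum-allreduce across all PEs, enqueued on a CUDA stream. */
        nvshmemx_float_sum_reduce_on_stream(NVSHMEM_TEAM_WORLD, dst, src, n, stream);
        cudaStreamSynchronize(stream);

        nvshmem_free(src);
        nvshmem_free(dst);
        nvshmem_finalize();
        return 0;
    }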
# Added support for Python wheels via the PyPI repository and rpm/deb package
distribution.
# Added support for dynamic RDMA Global Identifier (GID) discovery for RoCE
transports. This feature enables automatic fallback to the discovered GID
without requiring the user to specify the GID via a runtime variable.
# Added support for a heterogeneous library build system. This feature allows
the NVSHMEM static library to be built with a separate CUDA version from the NVSHMEM host library.
This enables new features such as NVLS in the host library while still
allowing applications compiled against lower versions of CUDA to link to the
NVSHMEM device library, making the entire library portable to different CUDA
minor versions while remaining feature complete. Users can specify a distinct
CUDA version for the device library by specifying
``NVSHMEM_DEVICELIB_CUDA_HOME=<PATH TO CUDA>``. Otherwise, the host CUDA version is used.
# Enhanced the NVSHMEM on-stream signal APIs to use
`cuStreamWriteValue()` over P2P-connected GPUs when possible, which enables a
zero-SM implementation of the on-stream signaling operation.
# Added support for DMABuf-based registration of NIC control structures in
IBGDA to leverage mainline DMABuf support in newer Linux kernels (instead of
the proprietary nvidia-peermem solution).
# Added sample code for the NVSHMEM UniqueID (UID) socket-based
bootstrap modality under the `examples` directory.
# Added NVSHMEM performance benchmarks to our release binary
packages.
# Removed host API based NVSHMEM collectives performance benchmarks.
# Enhanced collectives performance reporting by adding new metrics, Algorithmic Bandwidth
(algoBW) and Bus Bandwidth (BusBW), to the NVSHMEM performance benchmarks.
# Fixed support for the Ninja build generator in our CMake build system.
# Fixed a runtime bug related to use of the ``NVSHMEM_DEVICE_TIMEOUT_POLLLING`` build-time variable.
# Enhanced our CI pipelines to support job-specific timeouts to force
early termination of any job that hangs on the GPU or CPU and to avoid pipeline
starvation of subsequently queued jobs on the same system.
# Fixed a performance bug in the on-stream collectives perftest related to the use of
cudaMemcpyAsync on the same CUDA stream where the cudaEvents used to profile the start
and end times of the on-stream communication kernel are submitted.
# Fixed a bug related to virtual member functions of
`nvshmemi_symmetric_heap` by forcing their access specifier to protected to
limit access to inherited child classes only.
# Fixed a bug related to recursive destructor memory corruption and
`nullptr` access to a static member function of the `nvshmemi_mem_transport`
class.
# Fixed a bug with incorrect compile-time values for the
``NVML_GPU_FABRIC_STATE_COMPLETED`` and ``NVML_GPU_FABRIC_UUID_LEN``
constants.
# Fixed a bug in ``nvshmemx_collective_launch_query_gridsize`` which could
cause it to erroneously return a gridsize of 0.
# Fixed a bug during ``nvshmem_init`` which could cause the application to crash during MNNVL
discovery when used with CUDA compat libraries at runtime for CUDA toolkit > 12.4.
# Fixed a bug in ``nvshmemx_collective_launch`` which could cause duplicate initialization of
the NVSHMEM device state.
# Fixed a bug related to uninitialized variables in IBGDA device code.
# Fixed a bug related to out-of-bounds (OOB) access in the atomic BW performance
test.
# Fixed a bug related to missing C/C++ `stdint` headers on Ubuntu 24.04 + x86
based systems.
# Fixed a bug related to incorrect calculation of the team-specific stride when
creating a new team using `nvshmem_team_split_strided`.
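For reference, a minimal sketch of strided team creation (here, a team of the
even-numbered PEs):

    #include <nvshmem.h>

    int main(void) {
        nvshmem_init();
        int npes = nvshmem_n_pes();

        /* Team of the even-numbered PEs: start at PE 0, stride 2. */
        nvshmem_team_t even_team;
        nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, 0, 2, (npes + 1) / 2,
                                   NULL, 0, &even_team);

        if (even_team != NVSHMEM_TEAM_INVALID) {
            /* ... collectives on even_team ... */
            nvshmem_team_destroy(even_team);
        }
        nvshmem_finalize();
        return 0;
    }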
# Enhanced the reduce-based collective symmetric memory scratch space to
512KB to accommodate additional space for reducescatter-based collectives.

===============================================================================
Changes in 3.0.6
===============================================================================
# Added support for multi-node systems that have both RDMA networks
(IB, RoCE, Slingshot, and so on) and NVLink as multi-node interconnects.
# Added support for ABI backward compatibility between host and device libraries.
Within the same NVSHMEM major version, a newer host library will continue to be
compatible with an older device library version. The work involved minimizing the
ABI surface between host and device libraries and versioning the structs and
functions that are part of the new ABI surface.
# Enhanced NVSHMEM's memory management infrastructure using an object-oriented
programming (OOP) framework with multi-level inheritance to manage support for
various memory types and to enable support for newer memory types in the future.
# Added support for PTX testing in NVSHMEM.
# Added support for CPU-assisted IBGDA via the NIC handler to manage the NIC doorbell.
The NIC handler can now be selected through the new environment variable
`NVSHMEM_IBGDA_NIC_HANDLER`. This feature enables IBGDA adoption on systems
that don't have the `PeerMappingOverride=1` driver setting.
# Improved performance of IBGDA transport initialization by 20-50% when scaling up
the number of PEs, by batching and minimizing the number of memory registration
invocations for IB control structures.
# Enhanced support for composing NVSHMEM_TEAM_SHARED on Multi-node NVLink (MNNVL)
based systems.
# Improved performance for block-scoped reductions by parallelizing send/recv of data
when sending small messages. Also, NVSHMEM device code compiled with CUDA 11.0
and std=c++17 will automatically make use of cooperative group reduction APIs to
improve the performance of local reductions.
# Fixed the implementation of system-scoped atomic memory operations (AMO)
such as `nvshmem_fence/atomic_<ops>` and signaled operations `nvshmem_signal_<op>`
when communicating over NVLink.
# Added IBGDA support to automatically prefer RC over DC connected QPs and updated
the default values of `NVSHMEM_IBGDA_NUM_RC_PER_PE/NVSHMEM_IBGDA_NUM_DCI` to 1.
# Added assertions in the DEVX and IBGDA transports for checking extended atomics
support in the RDMA NICs.
# Added support for a no-collective synchronization action in
`nvshmem_malloc/calloc/align/free`, to follow OpenSHMEM spec-compliant behavior
when the requested size is 0 or the buffer passed to free is NULL, respectively.
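Illustrative sketch of the spec-compliant behavior (neither call triggers a collective
synchronization):

    #include <stddef.h>
    #include <nvshmem.h>

    int main(void) {
        nvshmem_init();

        /* Per OpenSHMEM semantics, a zero-byte request returns NULL and
           performs no barrier/synchronization across PEs. */
        void *p = nvshmem_malloc(0);
        (void)p;

        /* Freeing NULL is a no-op and likewise performs no synchronization. */
        nvshmem_free(NULL);

        nvshmem_finalize();
        return 0;
    }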
# Added support for `nvshmemx_fcollectmem/broadcastmem` device and on-stream APIs.
# Improved performance tracing for on-stream and host collectives performance
benchmarks by using the `cudaEventElapsedTime` API instead of `gettimeofday`.
# Added the performance benchmark `bootstrap_coll` for the various bootstrap
modalities in NVSHMEM.
# Added support for the "Include-What-You-Use" (IWYU) framework in the CMake build system.
# Removed support for deprecated Power-9 systems.
# Removed support for the deprecated Makefile build system. NVSHMEM now supports the CMake
build system exclusively.
# Fixed a bug in remote transports during memory registration and deregistration,
with respect to the memory handle management cache.
# Fixed a bug in the QP mapping options `NVSHMEM_IBGDA_DCI_MAP_BY=warp` and
`NVSHMEM_IBGDA_RC_MAP_BY=warp`, which previously led to suboptimal mapping of
QPs to warps/DCTs.
# Fixed a bug to dynamically load explicitly versioned `libcuda.so` and `libnvml.so`.
# Fixed a bug in computing NVSHMEM team symmetric heap memory requirements during
runtime initialization.
# Fixed a bug related to stale filepaths when aborting an NVSHMEM runtime.
# Fixed a bug when building NVSHMEM remote transports with
`HAVE_IBV_ACCESS_RELAXED_ORDERING` set.
# Fixed a bug that manifested as a GPU device hang when using the RC QP
type with IBGDA.
# Fixed a bug with an incorrect value of the broadcast LL threshold.
# Fixed a bug in IBDEVX related to an incorrect endianness check.
# Fixed a memory leak in `nvshmem_team_destroy` related to missing teardown of the
two internal subteams created for each user-created team.
# Fixed several minor bugs and memory leaks.

===============================================================================
Changes in 2.11.0
===============================================================================

# Added support for Multi-node NVLink (MNNVL) systems when all nodes are
connected via NVLink.
# Added support for multiple NICs per PE in the IBGDA transport. It can be enabled
using the NVSHMEM_IBGDA_ENABLE_MULTI_PORT runtime environment variable.
# Added support for sockets-based bootstrapping of NVSHMEM jobs through the Unique ID based initialization API.
# Added the nvshmemx_hostlib_init API that allows NVSHMEM host-library-only initialization.
This is useful for applications that only use the NVSHMEM host API and do not need to
statically link the NVSHMEM device library.
# Added support for dynamically linking the NVSHMEM library through dlopen().
# Introduced a new nvshmemx_vendor_get_version_info API that allows API consumers to query
NVSHMEM_VENDOR_MAJOR_VERSION, NVSHMEM_VENDOR_MINOR_VERSION, and NVSHMEM_VENDOR_PATCH_VERSION.
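Illustrative usage sketch, assuming the version values are returned through three
integer out-parameters:

    #include <stdio.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(void) {
        nvshmem_init();

        /* Query the vendor (library) version at runtime. */
        int major = 0, minor = 0, patch = 0;
        nvshmemx_vendor_get_version_info(&major, &minor, &patch);
        if (nvshmem_my_pe() == 0)
            printf("NVSHMEM vendor version: %d.%d.%d\n", major, minor, patch);

        nvshmem_finalize();
        return 0;
    }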
# Added the NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE runtime environment variable
to get full API support with Multi-Process per GPU (MPG) runs even if
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is not set to 1/PEs.
# Improved throughput and bandwidth performance of the IBGDA transport.
# Fixed a hang that was introduced in the CUDA VMM path on DGX1V systems in NVSHMEM 2.10.1.
# Improved performance of the nvshmemx_quiet_on_stream() API with the IBGDA transport
by leveraging multiple CUDA threads to perform the IBGDA quiet operation.
# Fixed a hang with the minimal proxy service in nvshmem_global_exit on Grace Hopper
systems due to memory reordering of loads/stores.
# Enabled relaxed ordering by default for InfiniBand transports. Added the runtime
environment variable NVSHMEM_IB_ENABLE_RELAXED_ORDERING to disable it.
# Increased the number of threads launched to execute the nvshmemx_<typename>_<op>_reduce_on_stream() API.
# Added the runtime environment variable NVSHMEM_DISABLE_DMABUF to disable the use of dmabuf.
# Fixed a bug in the IBGDA transport when performing very large message transfers beyond
the maximum size supported by a single NIC work request.
# Fixed several minor bugs and memory leaks.

===============================================================================
Changes in 2.10.1
===============================================================================

# Support for single and multi-node Grace Hopper systems
# Support for the EFA provider using the libfabric transport, which can be
enabled with NVSHMEM_LIBFABRIC_PERSONA=EFA
# NVRTC support was added for the NVSHMEM device implementation headers.
# Fixed memory leaks in nvshmem_finalize
# Added support for calling nvshmem_init and nvshmem_finalize in a loop with
any bootstrap. Previously this support existed only for the MPI bootstrap
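For illustration, the pattern this enables (each iteration performs a full bootstrap
and teardown):

    #include <nvshmem.h>

    int main(void) {
        /* Initialize and finalize repeatedly in the same process; with this
           release the pattern works with any bootstrap, not just MPI. */
        for (int i = 0; i < 3; ++i) {
            nvshmem_init();
            int mype = nvshmem_my_pe();
            (void)mype;
            /* ... allocate symmetric memory, communicate, free ... */
            nvshmem_finalize();
        }
        return 0;
    }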
# Performance optimizations in the Alltoall collective API
# Implemented warp-level automated coalescing of nvshmem_<typename>_g
operations to contiguous addresses in the IBGDA transport
# Removed redundant consistency operations in the IBGDA transport
# Added support for synchronized memory operations when using the VMM API for the NVSHMEM symmetric heap
# Code refactoring to improve the host and device library ABI interface
# Several bug fixes

===============================================================================
Changes in 2.9.0
===============================================================================

# Improvements to the CMake build system. CMake is now the default build system and
the Makefile build system is deprecated.
# Added loadable network transport modules.
# NVSHMEM device code can now be inlined to improve performance by enabling
NVSHMEM_ENABLE_ALL_DEVICE_INLINING when building the NVSHMEM library.
# Improvements to collective communication performance.
# Updated the libfabric transport to fragment messages larger than the maximum
length supported by the provider.
# Improvements to the IBGDA transport, including large message support, user buffer
registration, blocking g/get/amo performance, CUDA module support, and several
bugfixes.
# Introduced ABI compatibility for bootstrap modules. This release is
backwards compatible with the ABI introduced in NVSHMEM 2.8.0.
# Added NVSHMEM_BOOTSTRAP_*_PLUGIN environment variables that can be used to
override the default filename used when opening each bootstrap plugin.
# Improved error handling for GDRCopy.
# Added a check to detect when the same number of PEs is not run on all nodes.
# Added a check to detect availability of the nvidia_peermem kernel module.
# Reduced internal stream synchronizations to fix a compatibility bug with CUDA
graph capture.
# Fixed a data consistency issue with CUDA graph capture support.

===============================================================================
Changes in 2.8.0
===============================================================================

# The transport formerly called GPU Initiated Communication (GIC) has been
renamed to InfiniBand GPUDirect Async (IBGDA) to reflect the underlying
technology used by that transport.
# Improvements to the all-to-all algorithm were made for both the IBGDA and
IBRC transports. These changes specifically focused on latency-bound all-to-all
operations.
# Support for RC connections was added to IBGDA to optimize workloads on small
PE sets.
# Fixed an issue in the IBGDA transport which caused all GPUs on the same host
to use the same NIC.
# Fixed an issue in the DMA-BUF registration path. Users no longer need to
limit their allocation granularity to 4GiB when using DMABUF.

===============================================================================
Changes in 2.7.0
===============================================================================

# Added an experimental CMake build system that will replace the Makefile in a
future release
# Updated the GPU Initiated Communication (GIC) transport, which provides significant
performance improvements over NVSHMEM 2.6.0
# Added NVSHMEM version checks to ensure that the dynamically linked NVSHMEM
host library is compatible with the statically linked device library. Also
added compatibility checks for the inbuilt bootstrap plugins.
# Added support for CUDA minor version compatibility, which allows NVSHMEM
application binaries built with CUDA M.X to run with M.Y, where M is the
major version and X and Y are compatible minor versions
# The NVSHMEM library now statically links libcudart_static.a and dlopens libcuda.so
# Improved timing in NVSHMEM performance tests to reduce noise in measurements
# Added support for Hopper compute_90 and sm_90
# Removed support for Pascal compute_60, sm_60, compute_61, and sm_61
# Added a version number suffix to libnvshmem_host.so and the bootstrap plugins
# Added support for dmabuf memory registration
# Updated the Hydra installation script to install Hydra 4.0.2
# Added a pre-built Hydra launcher to NVSHMEM binary packages.
# Catch user buffer registration errors when the requested buffer overlaps with an
already registered memory region
# Fixed an issue causing validation errors in collective operations when all GPUs
in a job are connected via PCIe without a remote transport that uses the proxy
thread.


===============================================================================
Changes in 2.6.0
===============================================================================

# Added a new GPU-initiated communication transport that allows kernel-initiated
communication to be issued directly to the NIC, bypassing the CPU proxy thread.
The transport is currently provided in experimental mode and is disabled by default.
Please refer to the installation guide for how to enable it.
# Updated the libfabric transport with initial support for Slingshot-11 networks.
Performance tuning for the libfabric transport is ongoing.
# Added collective algorithms for bcast/fcollect/reduce that use a low-latency (LL)
optimization by sending data and synchronization together, resulting in
significant performance improvements.
# Added warp- and block-scope implementations of the recursive exchange algorithm for
reduce collectives
# Fixed a bug in the host/on-stream RMA API for very large data transfers
# Fixed a bug in the implementation of the nvshmem_fence and nvshmemx_quiet_on_stream APIs

===============================================================================
Changes in 2.5.0
===============================================================================

# Added multi-instance support in NVSHMEM. NVSHMEM now builds as two libraries,
libnvshmem_host.so and libnvshmem_device.a, making it possible for an
application to have multiple components (for example, shared libraries and the
application itself) that use NVSHMEM. Support for the single library, libnvshmem.a,
still exists for legacy purposes but will eventually be removed.
# Added the nvshmemx_init_status API to query the initialization state of NVSHMEM
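Illustrative usage sketch; the NVSHMEM_STATUS_NOT_INITIALIZED constant used in the
comparison is assumed from the NVSHMEM headers:

    #include <nvshmem.h>
    #include <nvshmemx.h>

    int main(void) {
        /* Only initialize if another component has not already done so. */
        int i_initialized = 0;
        if (nvshmemx_init_status() == NVSHMEM_STATUS_NOT_INITIALIZED) {
            nvshmem_init();
            i_initialized = 1;
        }

        /* ... use NVSHMEM ... */

        if (i_initialized) nvshmem_finalize();
        return 0;
    }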
# Added an experimental DevX transport that directly uses the Mellanox software stack
for InfiniBand devices
# Added an experimental libfabric transport that will be used to support Slingshot
networks in a future release
# Added support for CUDA_VISIBLE_DEVICES. Support for CUDA_VISIBLE_DEVICES is
not yet available with CUDA VMM and requires setting NVSHMEM_DISABLE_CUDA_VMM=1.
# Updated the PMI and PMI-2 bootstraps to plugins
# Added the nvshmem-info utility to display information about the NVSHMEM library
# Fixed warnings when using NVSHMEM in applications compiled without the RDC
(Relocatable Device Code) option
# Renamed internal variables to avoid potential conflicts with variables in the
application
# Implemented the nvshmem_alltoallmem API
# Improved the GPU to NIC assignment logic for the Summit/Sierra supercomputers
# Fixed the host barrier API implementation for non-blocking on-stream (*_nbi_on_stream)
point-to-point operations
# Updated descriptions for NVSHMEM environment variables displayed via
nvshmem-info or by setting NVSHMEM_INFO=1

===============================================================================
Changes in 2.4.1
===============================================================================

# Added limited support for Multiple Processes per GPU (MPG) on x86 platforms.
The amount of support depends on the availability of CUDA MPS. MPG support is
currently not available on P9 platforms.
# Added a local buffer registration API that allows non-symmetric buffers to be
used as local buffers in the NVSHMEM API.
# Added support for dynamic symmetric heap allocation, which eliminates the need
to specify NVSHMEM_SYMMETRIC_SIZE.
This feature is available with CUDA >= 11.3 and is enabled by default on x86
platforms. On P9 platforms, it is disabled by default and can be enabled using the
NVSHMEM_DISABLE_CUDA_VMM environment variable.
# Support for very large RMA messages has been added
# NVSHMEM can now be built without ibrc support by setting NVSHMEM_IBRC_SUPPORT=0
in the environment before building.
This allows users to build and run NVSHMEM without the GDRCopy and OFED dependencies.
# Support for calling nvshmem_init/finalize multiple times with the MPI bootstrap
# Improved testing coverage (large messages, exercising full GPU memory, and so on)
# Improved the default PE to NIC assignment for DGX2 systems
# Optimized channel request processing by the CPU proxy thread
# Added support for the shmem_global_exit API
# Removed redundant barriers to improve the collectives’ performance
# Significant code refactoring to use templates instead of macros for internal
functions
# Improved performance for the device-side blocking RMA and strided RMA API
# Bug fix for buffers with large offsets into the NVSHMEM symmetric heap

===============================================================================
Changes in 2.2.1
===============================================================================

# Implemented dynamic heap memory allocation (requires CUDA version >= 11.3) for
runs with P2P GPUs. It can be enabled using NVSHMEM_DISABLE_CUDA_VMM=0. Support
for IB runs will be added in the next release.
# Improved UCX transport performance for AMO and RMA operations
# Improved performance for warp and block put/get operations
# Added atomic support for PCIe-connected GPUs over the UCX transport
# The UCX transport now supports non-symmetric buffers for use as local buffers
in RMA and AMO operations
# Added support for initializing NVSHMEM in a CUmodule
# Enabled MPI and PMIx bootstrap modules to be compiled externally from the
NVSHMEM build. This allows multiple builds of these plugins to support various
MPI and PMIx libraries. They can be selected by setting NVSHMEM_BOOTSTRAP="plugin"
and NVSHMEM_BOOTSTRAP_PLUGIN="plugin_name.so". Plugin sources are installed along
with the compiled NVSHMEM library.
# Enabled the MPI bootstrap to be used with nvshmem_init by setting
NVSHMEM_BOOTSTRAP=MPI or via the bootstrap plugin method.
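A minimal illustrative sketch of the environment-variable path (assumes the MPI
bootstrap plugin is available; the variable can equally be set in the job's launch
environment instead of via setenv):

    #include <stdlib.h>
    #include <mpi.h>
    #include <nvshmem.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Ask nvshmem_init to bootstrap over the already-initialized MPI runtime. */
        setenv("NVSHMEM_BOOTSTRAP", "MPI", 1);
        nvshmem_init();

        /* ... NVSHMEM work ... */

        nvshmem_finalize();
        MPI_Finalize();
        return 0;
    }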
# Fixed bugs in the nvshmem_<typename>_g and fetch atomics implementations
# Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match the
OpenSHMEM specification
# Fixed the type of the nreduce argument in the reduction API to size_t to match the OpenSHMEM
specification
# Improved NVSHMEM build times with the multi-threaded option in the CUDA compiler
(requires CUDA version >= 11.2)
# Several fixes to address Coverity reports

===============================================================================
Changes in 2.1.2
===============================================================================

# Added a new, experimental UCX internode communication transport layer
# Added support for automatic warp-level coalescing of nvshmem_g operations
# Added support for put-with-signal operations on CUDA streams
# Added support for mapping the symmetric heap using the cuMem APIs
# Improved performance of the single-threaded NVSHMEM put/get device API
# Added the NVSHMEM_MAX_TEAMS environment variable to specify the maximum number
of teams that can be created
# Improved the host and on-stream Alltoall performance by using NCCL
# Fixed a bug in the compare-and-swap operation that caused several bytes of the
compare operand to be lost
# Added CPU core affinity to debugging output
# Added support for the CUDA 11.3 cudaDeviceFlushGPUDirectRDMAWrites API for consistency
# Improved support for the NVIDIA Tools Extension (NVTX) to enable performance
analysis through NVIDIA Nsight
# Removed support for the nvshmem_wait API, which has been deprecated in OpenSHMEM 1.5
# Removed the NVSHMEM_IS_P2P_RUN environment variable; the runtime automatically determines it
# Made improvements to the NVSHMEM example codes
# Added the NVSHMEM_REMOTE_TRANSPORT environment variable for selecting the networking
layer used for communication between nodes
# Set maxrregcount to 32 for non-inlined device functions to ensure that calling
these NVSHMEM functions does not negatively affect kernel occupancy

===============================================================================
Changes in 2.0.3
===============================================================================

# Added a work-around to avoid deadlocks due to CUDA context resource reconfiguration
on Power systems
# Added the environment variable NVSHMEM_CUDA_LIMIT_STACK_SIZE to set the GPU thread stack size
on Power systems
# Use of NCCL for stream and host NVSHMEM collectives is now supported on Power systems
# Updated the threading level support reported for host and stream-based APIs
to NVSHMEM_THREAD_SERIALIZED. Device-side APIs support NVSHMEM_THREAD_MULTIPLE
# Fixed a bug that could lead to incorrect behavior for atomic compare-and-swap
# Fixed an issue that was observed to lead to incorrect results when using GDRCopy

===============================================================================
Changes in 2.0.2 EA
===============================================================================

# Added the teams and team-based collectives APIs from OpenSHMEM 1.5.
# Added support to use the NVIDIA Collective Communication Library (NCCL) for
optimized NVSHMEM host and on-stream collectives.
# Added support for RDMA over Converged Ethernet (RoCE) networks.
# Added support for PMI-2 to enable an NVSHMEM job launch with srun/SLURM.
# Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible
launchers, such as Slurm and Open MPI.
# Uniformly reformatted the perftest benchmark output.
# Added support for the putmem_signal and signal_wait_until APIs.
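A minimal device-side sketch of the neighbor-exchange pattern these APIs enable
(assuming the prefixed names nvshmem_putmem_signal and nvshmem_signal_wait_until;
intended to be launched with one thread per PE, with dest, source, and sig symmetric
and sig initialized to 0):

    #include <stdint.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void exchange(char *dest, const char *source, uint64_t *sig, size_t nbytes) {
        int mype = nvshmem_my_pe();
        int peer = (mype + 1) % nvshmem_n_pes();

        /* Deliver the payload and set the peer's signal word in one operation. */
        nvshmem_putmem_signal(dest, source, nbytes, sig, 1, NVSHMEM_SIGNAL_SET, peer);

        /* Block until our own signal word is set by our left neighbor. */
        nvshmem_signal_wait_until(sig, NVSHMEM_CMP_EQ, 1);
    }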
# Improved support for single-node environments without InfiniBand.
# Fixed a bug that occurred when large numbers of fetch atomic operations were
performed on InfiniBand.
# Improved topology awareness in NIC-to-GPU assignments for DGX A100 systems.

===============================================================================
Changes in 1.1.3
===============================================================================

# Implements the nvshmem_<type>_put_signal API from OpenSHMEM 1.5
# Adds the nvshmemx_signal_op API
# Optimizes the implementation of the signal set operation over P2P-connected GPUs
# Optimizes the performance of the nvshmem_fence() function
# Optimizes the latency of the NVSHMEM atomics API
# Fixes a bug in the nvshmem_ptr API
# Fixes a bug in the implementation of the host-side strided transfer (iput, iget, etc.) API
# Fixes a bug in on-stream reductions for the `long long` datatype
# Fixes a hang during the nvshmem barrier collective operation
# Fixes __device__ nvshmem_quiet() to also do quiet on IB ops to self
# Fixes a bug in the fetch atomic and g implementations

===============================================================================
Changes in 1.0.1
===============================================================================

# Combines the memory of multiple GPUs into a partitioned global address space
that’s accessed through NVSHMEM APIs.
# Includes a low-overhead, in-kernel communication API for use by GPU threads.
# Includes stream-based and CPU-initiated communication APIs.
# Supports peer-to-peer communication using NVIDIA NVLink and PCI Express, and for
GPU clusters using NVIDIA Mellanox® InfiniBand.
# Supports x86 and POWER9 processors.
# Is interoperable with MPI and other OpenSHMEM implementations.