diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f1a8d15..a4b805e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,12 +1,14 @@ # Contributing -Thank you for your interest in contributing to cuQuantum Python! Based on the type of contribution, it will fall into two categories: +Thank you for your interest in contributing to cuQuantum Python! Based on the type of contribution, it will fall into three categories: 1. You want to report a bug, feature request, or documentation issue - - File an [issue](https://github.com/NVIDIA/cuQuantum/issues/new) + - File an [issue](https://github.com/NVIDIA/cuQuantum/issues) describing what you encountered or what you want to see changed. - The NVIDIA team will evaluate the issues and triage them, scheduling them for a release. If you believe the issue needs priority attention comment on the issue to notify the team. 2. You want to implement a feature or bug-fix - At this time we do not accept code contributions. +3. You want to share your nice work built upon cuQuantum: + - We would love to hear more about your work! Please share with us on [NVIDIA/cuQuantum GitHub Discussions](https://github.com/NVIDIA/cuQuantum/discussions)! We also take any cuQuantum-related questions on this forum. diff --git a/python/CONTRIBUTING.md b/python/CONTRIBUTING.md index f1a8d15..a4b805e 100644 --- a/python/CONTRIBUTING.md +++ b/python/CONTRIBUTING.md @@ -1,12 +1,14 @@ # Contributing -Thank you for your interest in contributing to cuQuantum Python! Based on the type of contribution, it will fall into two categories: +Thank you for your interest in contributing to cuQuantum Python! Based on the type of contribution, it will fall into three categories: 1. You want to report a bug, feature request, or documentation issue - - File an [issue](https://github.com/NVIDIA/cuQuantum/issues/new) + - File an [issue](https://github.com/NVIDIA/cuQuantum/issues) describing what you encountered or what you want to see changed. 
- The NVIDIA team will evaluate the issues and triage them, scheduling them for a release. If you believe the issue needs priority attention comment on the issue to notify the team. 2. You want to implement a feature or bug-fix - At this time we do not accept code contributions. +3. You want to share your nice work built upon cuQuantum: + - We would love to hear more about your work! Please share with us on [NVIDIA/cuQuantum GitHub Discussions](https://github.com/NVIDIA/cuQuantum/discussions)! We also take any cuQuantum-related questions on this forum. diff --git a/python/README.md b/python/README.md index d723737..7c146ad 100644 --- a/python/README.md +++ b/python/README.md @@ -5,19 +5,7 @@ Please visit the [NVIDIA cuQuantum Python documentation](https://docs.nvidia.com/cuda/cuquantum/python). -## Building - -### Requirements - -Build-time dependencies of the cuQuantum Python package and some versions that -are known to work are as follows: - -* CUDA Toolkit 11.x -* cuQuantum 22.07+ -* cuTENSOR 1.5.0+ -* Python 3.8+ -* Cython - e.g. 0.29.21 -* [packaging](https://packaging.pypa.io/en/latest/) +## Installation ### Install cuQuantum Python from conda-forge @@ -33,26 +21,53 @@ Alternatively, assuming you already have a Python environment set up (it doesn't you can also install cuQuantum Python this way: ``` -pip install cuquantum-python +pip install cuquantum-python-cu11 ``` -The `pip` solver will also install both cuTENSOR and cuQuantum for you. +The `pip` solver will also install all dependencies for you (including both cuTENSOR and cuQuantum wheels). + +Notes: -Note: To properly install the wheels the environment variable `CUQUANTUM_ROOT` must not be set. +- Users can still install cuQuantum Python using `pip install cuquantum-python`, which currently points to the `cuquantum-python-cu11` wheel (this is subject to change in the future). Installing wheels with the `-cuXX` suffix is encouraged. 
+- To manually manage all Python dependencies, append `--no-deps` to `pip install` to bypass the `pip` solver, see below. -### Install cuQuantum Python from source +### Building and installing cuQuantum Python from source + +#### Requirements + +The build-time dependencies of the cuQuantum Python package include: + +* CUDA Toolkit 11.x +* cuStateVec 1.1.0+ +* cuTensorNet 2.0.0+ +* cuTENSOR 1.5.0+ +* Python 3.8+ +* Cython >=0.29.22,<3 +* pip 21.3.1+ +* [packaging](https://packaging.pypa.io/en/latest/) +* setuptools 61.0.0+ +* wheel 0.34.0+ + +Except for CUDA and Python, the rest of the build-time dependencies are handled by the new PEP-517-based build system (see Step 7 below). To compile and install cuQuantum Python from source, please follow the steps below: -1. Set `CUDA_PATH` to point to your CUDA installation -2. Set `CUQUANTUM_ROOT` to point to your cuQuantum installation -3. Set `CUTENSOR_ROOT` to point to your cuTENSOR installation -4. Make sure CUDA, cuQuantum and cuTENSOR are visible in your `LD_LIBRARY_PATH` -5. Run `pip install -v .` +1. Clone the [NVIDIA/cuQuantum](https://github.com/NVIDIA/cuQuantum) repository: `git clone https://github.com/NVIDIA/cuQuantum` +2. Set `CUDA_PATH` to point to your CUDA installation +3. [optional] Set `CUQUANTUM_ROOT` to point to your cuQuantum installation +4. [optional] Set `CUTENSOR_ROOT` to point to your cuTENSOR installation +5. [optional] Make sure cuQuantum and cuTENSOR are visible in your `LD_LIBRARY_PATH` +6. Switch to the directory containing the Python implementation: `cd cuQuantum/python` +7. Build and install: + - Run `pip install .` if you skip Step 3-5 above + - Run `pip install -v --no-deps --no-build-isolation .` otherwise (advanced) Notes: -- For the `pip install` step, adding the `-e` flag after `-v` would allow installing the package in-place (i.e., in "editable mode" for testing/developing). 
-- If `CUSTATEVEC_ROOT` and `CUTENSORNET_ROOT` are set (for the cuStateVec and the cuTensorNet libraries, respectively), they overwrite `CUQUANTUM_ROOT`. -- For local development, set `CUQUANTUM_IGNORE_SOLVER=1` to ignore the dependency on the `cuquantum` wheel. +- For Step 7, if you are building from source for testing/developing purposes you'd likely want to insert a `-e` flag before the last period (so `pip ... .` becomes `pip ... -e .`): + * `-e`: use the "editable" (in-place) mode + * `-v`: enable more verbose output + * `--no-deps`: avoid installing the *run-time* dependencies + * `--no-build-isolation`: reuse the current Python environment instead of creating a new one for building the package (this avoids installing any *build-time* dependencies) +- As an alternative to setting `CUQUANTUM_ROOT`, `CUSTATEVEC_ROOT` and `CUTENSORNET_ROOT` can be set to point to the cuStateVec and the cuTensorNet libraries, respectively. The latter two environment variables take precedence if defined. ## Running @@ -64,14 +79,16 @@ Runtime dependencies of the cuQuantum Python package include: * An NVIDIA GPU with compute capability 7.0+ * Driver: Linux (450.80.02+) * CUDA Toolkit 11.x -* cuQuantum 22.07+ -* cuTENSOR 1.5.0+ +* cuStateVec 1.1.0+ +* cuTensorNet 2.0.0+ +* cuTENSOR 1.6.1+ * Python 3.8+ * NumPy v1.19+ * CuPy v9.5.0+ (see [installation guide](https://docs.cupy.dev/en/stable/install.html)) * PyTorch v1.10+ (optional, see [installation guide](https://pytorch.org/get-started/locally/)) * Qiskit v0.24.0+ (optional, see [installation guide](https://qiskit.org/documentation/getting_started.html)) * Cirq v0.6.0+ (optional, see [installation guide](https://quantumai.google/cirq/install)) +* mpi4py v3.1.0+ (optional, see [installation guide](https://mpi4py.readthedocs.io/en/stable/install.html)) If you install everything from conda-forge, the dependencies are taken care for you (except for the driver). @@ -102,4 +119,4 @@ variable `CUDA_PATH` is not set. 
## Citing cuQuantum -Pleae click this Zenodo badge to see the citation format: [![DOI](https://zenodo.org/badge/435003852.svg)](https://zenodo.org/badge/latestdoi/435003852) +Please click this Zenodo badge to see the citation format: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6385574.svg)](https://doi.org/10.5281/zenodo.6385574) diff --git a/python/builder/__init__.py b/python/builder/__init__.py new file mode 100644 index 0000000..ab19887 --- /dev/null +++ b/python/builder/__init__.py @@ -0,0 +1,39 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + + +# How does the build system for cuquantum-python work? +# +# - When building a wheel ("pip wheel", "pip install .", or "python setup.py +# bdist_wheel" (discouraged!)), we want to build against the cutensor & +# cuquantum wheels that would be installed to site-packages, so we need +# two things: +# 1. make them the *build-time* dependencies +# 2. set up linker flags to modify rpaths +# +# - For 1. we opt in to use PEP-517, as setup_requires is known to not work +# automatically for users. This is the "price" we pay (by design of +# PEP-517), as it creates a new, "isolated" environment (referred to as +# build isolation) to which all build-time dependencies that live on PyPI +# are installed. Another "price" (also by design) is in the non-editable +# mode (without the "-e" flag) it always builds a wheel for installation. +# +# - For 2. the solution is to create our own bdist_wheel (called first) and +# build_ext (called later) commands. The former would inform the latter +# whether we are building a wheel. +# +# - There is an escape hatch for 1. which is to set "--no-build-isolation". +# Then, users are expected to set CUQUANTUM_ROOT (or CUSTATEVEC_ROOT & +# CUTENSORNET_ROOT) and manage all build-time dependencies themselves. 
+# This, together with "-e", would not produce any wheel, which is the old +# behavior offered by the environment variable CUQUANTUM_IGNORE_SOLVER=1, +# which has been removed and no longer works. +# +# - In any case, the custom build_ext command is in use, which would compute +# the needed compiler flags (depending on whether it is building a wheel) +# and overwrite the incoming Extension instances. +# +# - In any case, the dependencies (on PyPI wheels) are set up by default, +# and "--no-deps" can be passed as usual to tell pip to ignore the +# *run-time* dependencies. diff --git a/python/builder/pep517.py b/python/builder/pep517.py new file mode 100644 index 0000000..276a456 --- /dev/null +++ b/python/builder/pep517.py @@ -0,0 +1,44 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +# This module implements basic PEP 517 backend support, see e.g. +# - https://peps.python.org/pep-0517/ +# - https://setuptools.pypa.io/en/latest/build_meta.html#dynamic-build-dependencies-and-other-build-meta-tweaks +# Specifically, there are 5 APIs required to create a proper build backend, see below. +# For now it's mostly a pass-through to setuptools, except that we need to determine +# some dependencies at build time. +# +# Note that we purposely do not implement the PEP-660 API hooks so that "pip install ... +# --no-build-isolation -e ." behaves as expected (in-place build/installation without +# creating a wheel). This may require pip>21.3.0. + +from packaging.version import Version +from setuptools import build_meta as _build_meta + +import utils # this is builder.utils (the build system has sys.path set up) + + +prepare_metadata_for_build_wheel = _build_meta.prepare_metadata_for_build_wheel +build_wheel = _build_meta.build_wheel +build_sdist = _build_meta.build_sdist + + +# Note: this function returns a list of *build-time* dependencies, so it's not affected +# by "--no-deps" based on the PEP-517 design. 
+def get_requires_for_build_wheel(config_settings=None): + # set up version constraints: note that CalVer like 22.03 is normalized to + # 22.3 by setuptools, so we must follow the same practice in the constraints; + # also, we don't need the patch number here + cuqnt_require = [f'custatevec-cu{utils.cuda_major_ver}~=1.1', # ">=1.1.0,<2" + f'cutensornet-cu{utils.cuda_major_ver}~=2.0', # ">=2.0.0,<3" + ] + + return _build_meta.get_requires_for_build_wheel(config_settings) + cuqnt_require + + +# Note: We have never promised to support sdist (CUQNT-514). We really cannot +# care less about the correctness here. If we are lucky, setuptools would do +# the right thing for us, but even if it's wrong let's not worry about it. +def get_requires_for_build_sdist(config_settings=None): + return _build_meta.get_requires_for_build_sdist(config_settings) diff --git a/python/builder/utils.py b/python/builder/utils.py new file mode 100644 index 0000000..b028a2b --- /dev/null +++ b/python/builder/utils.py @@ -0,0 +1,205 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import os +import re +import site +import sys + +from packaging.version import Version +from setuptools.command.build_ext import build_ext as _build_ext +from wheel.bdist_wheel import bdist_wheel as _bdist_wheel + + +# Get __version__ variable +source_root = os.path.abspath(os.path.dirname(__file__)) +with open(os.path.join(source_root, '..', 'cuquantum', '_version.py')) as f: + exec(f.read()) +cuqnt_py_ver = __version__ +cuqnt_py_ver_obj = Version(cuqnt_py_ver) +cuqnt_ver_major_minor = f"{cuqnt_py_ver_obj.major}.{cuqnt_py_ver_obj.minor}" + +del __version__, cuqnt_py_ver_obj, source_root + + +# We can't assume users to have CTK installed via pip, so we really need this... +# TODO(leofang): try /usr/local/cuda? 
+try: + cuda_path = os.environ['CUDA_PATH'] +except KeyError as e: + raise RuntimeError('CUDA is not found, please set $CUDA_PATH') from e + + +def check_cuda_version(): + try: + # We cannot do a dlopen and call cudaRuntimeGetVersion, because it + # requires GPUs. We also do not want to rely on the compiler utility + # provided in distutils (deprecated) or setuptools, as this is a very + # simple string parsing task. + # TODO: switch to cudaRuntimeGetVersion once it's fixed (nvbugs 3624208) + cuda_h = os.path.join(cuda_path, 'include', 'cuda.h') + with open(cuda_h, 'r') as f: + cuda_h = f.read() + m = re.search('#define CUDA_VERSION ([0-9]*)', cuda_h) + if m: + ver = int(m.group(1)) + else: + raise RuntimeError("cannot parse CUDA_VERSION") + except: + raise + else: + # 11020 -> "11.2" + return str(ver // 1000) + '.' + str((ver % 100) // 10) + + +# We only support CUDA 11 in v22.11 +cuda_ver = check_cuda_version() +if cuda_ver == '11.0': + cutensor_ver = cuda_ver + cuda_major_ver = '11' +elif '11.0' < cuda_ver < '12.0': + cutensor_ver = '11' + cuda_major_ver = '11' +else: + raise RuntimeError(f"Unsupported CUDA version: {cuda_ver}") + + +building_wheel = False + + +class bdist_wheel(_bdist_wheel): + + def run(self): + global building_wheel + building_wheel = True + super().run() + + +class build_ext(_build_ext): + + def _set_library_roots(self): + custatevec_root = cutensornet_root = cutensor_root = None + # Note that we need sys.path because of build isolation (since PEP 517) + py_paths = sys.path + [site.getusersitepackages()] + site.getsitepackages() + + # search order: + # 1. installed "cuquantum" package + # 2. 
env var + for path in py_paths: + path = os.path.join(path, 'cuquantum') + if os.path.isdir(os.path.join(path, 'include')): + custatevec_root = cutensornet_root = path + break + else: + # We allow setting CUSTATEVEC_ROOT and CUTENSORNET_ROOT separately for the ease + # of development, but users are encouraged to either install cuquantum from PyPI + # or conda, or set CUQUANTUM_ROOT to the existing installation. + cuquantum_root = os.environ.get('CUQUANTUM_ROOT') + try: + custatevec_root = os.environ['CUSTATEVEC_ROOT'] + except KeyError as e: + if cuquantum_root is None: + raise RuntimeError('cuStateVec is not found, please set $CUQUANTUM_ROOT ' + 'or $CUSTATEVEC_ROOT') from e + else: + custatevec_root = cuquantum_root + try: + cutensornet_root = os.environ['CUTENSORNET_ROOT'] + except KeyError as e: + if cuquantum_root is None: + raise RuntimeError('cuTensorNet is not found, please set $CUQUANTUM_ROOT ' + 'or $CUTENSORNET_ROOT') from e + else: + cutensornet_root = cuquantum_root + + # search order: + # 1. installed "cutensor" package + # 2. env var + for path in py_paths: + path = os.path.join(path, 'cutensor') + if os.path.isdir(os.path.join(path, 'include')): + cutensor_root = path + break + else: + try: + cutensor_root = os.environ['CUTENSOR_ROOT'] + except KeyError as e: + raise RuntimeError('cuTENSOR is not found, please set $CUTENSOR_ROOT') from e + + return custatevec_root, cutensornet_root, cutensor_root + + def _prep_includes_libs_rpaths(self): + """ + Set global vars cusv_incl_dir, cutn_incl_dir, cusv_lib_dir, cutn_lib_dir, + cusv_lib, cutn_lib, and extra_linker_flags. 
+ """ + custatevec_root, cutensornet_root, cutensor_root = self._set_library_roots() + + global cusv_incl_dir, cutn_incl_dir + cusv_incl_dir = [os.path.join(cuda_path, 'include'), + os.path.join(custatevec_root, 'include')] + cutn_incl_dir = [os.path.join(cuda_path, 'include'), + os.path.join(cutensornet_root, 'include')] + + global cusv_lib_dir, cutn_lib_dir + # we include both lib64 and lib to accommodate all possible sources + cusv_lib_dir = [os.path.join(custatevec_root, 'lib'), + os.path.join(custatevec_root, 'lib64')] + cutn_lib_dir = [os.path.join(cutensornet_root, 'lib'), + os.path.join(cutensornet_root, 'lib64'), + os.path.join(cutensor_root, 'lib'), # wheel + os.path.join(cutensor_root, 'lib', cutensor_ver)] # tarball + + global cusv_lib, cutn_lib, extra_linker_flags + if not building_wheel: + # Note: with PEP-517 the editable mode would not build a wheel for installation + # (and we purposely do not support PEP-660). + cusv_lib = ['custatevec'] + cutn_lib = ['cutensornet', 'cutensor'] + extra_linker_flags = [] + else: + # Note: soname = library major version + # We don't need to link to cuBLAS/cuSOLVER at build time (TODO: perhaps cuTENSOR too...?) + cusv_lib = [':libcustatevec.so.1'] + cutn_lib = [':libcutensornet.so.2', ':libcutensor.so.1'] + # The rpaths must be adjusted given the following full-wheel installation: + # - cuquantum-python: site-packages/cuquantum/{custatevec, cutensornet}/ [=$ORIGIN] + # - cusv & cutn: site-packages/cuquantum/lib/ + # - cutensor: site-packages/cutensor/lib/ + # - cublas: site-packages/nvidia/cublas/lib/ + # - cusolver: site-packages/nvidia/cusolver/lib/ + # (Note that starting v22.11 we use the new wheel format, so all lib wheels have suffix -cuXX, + # and cuBLAS/cuSOLVER additionally have prefix nvidia-.) 
+ ldflag = "-Wl,--disable-new-dtags," + ldflag += "-rpath,$ORIGIN/../lib," + ldflag += "-rpath,$ORIGIN/../../cutensor/lib," + ldflag += "-rpath,$ORIGIN/../../nvidia/cublas/lib," + ldflag += "-rpath,$ORIGIN/../../nvidia/cusolver/lib" + extra_linker_flags = [ldflag] + + print("\n"+"*"*80) + print("CUDA version:", cuda_ver) + print("CUDA path:", cuda_path) + print("cuStateVec path:", custatevec_root) + print("cuTensorNet path:", cutensornet_root) + print("cuTENSOR path:", cutensor_root) + print("*"*80+"\n") + + def build_extension(self, ext): + if ext.name.endswith("custatevec"): + ext.include_dirs = cusv_incl_dir + ext.library_dirs = cusv_lib_dir + ext.libraries = cusv_lib + ext.extra_link_args = extra_linker_flags + elif ext.name.endswith("cutensornet"): + ext.include_dirs = cutn_incl_dir + ext.library_dirs = cutn_lib_dir + ext.libraries = cutn_lib + ext.extra_link_args = extra_linker_flags + + super().build_extension(ext) + + def build_extensions(self): + self._prep_includes_libs_rpaths() + super().build_extensions() diff --git a/python/cuquantum/__init__.py b/python/cuquantum/__init__.py index 2ba5f96..6576381 100644 --- a/python/cuquantum/__init__.py +++ b/python/cuquantum/__init__.py @@ -5,7 +5,7 @@ from cuquantum import custatevec from cuquantum import cutensornet from cuquantum.cutensornet import ( - contract, contract_path, einsum, einsum_path, Network, BaseCUDAMemoryManager, MemoryPointer, + contract, contract_path, einsum, einsum_path, tensor_qualifiers_dtype, Network, BaseCUDAMemoryManager, MemoryPointer, NetworkOptions, OptimizerInfo, OptimizerOptions, PathFinderOptions, ReconfigOptions, SlicerOptions, CircuitToEinsum) from cuquantum.utils import ComputeType, cudaDataType, libraryPropertyType from cuquantum._version import __version__ @@ -27,6 +27,11 @@ cutensornet.GraphAlgo, cutensornet.MemoryModel, cutensornet.OptimizerCost, + cutensornet.TensorSVDConfigAttribute, + cutensornet.TensorSVDNormalization, + cutensornet.TensorSVDPartition, + 
cutensornet.TensorSVDInfoAttribute, + cutensornet.GateSplitAlgo, ): cutensornet._internal.enum_utils.add_enum_class_doc(enum, chomp="_ATTRIBUTE|_PREFERENCE_ATTRIBUTE") diff --git a/python/cuquantum/__main__.py b/python/cuquantum/__main__.py new file mode 100644 index 0000000..6b063dc --- /dev/null +++ b/python/cuquantum/__main__.py @@ -0,0 +1,107 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import argparse +import os +import site +import sys + +import cuquantum # get the shared libraries loaded + + +def get_lib_path(name): + """Get the loaded shared library path.""" + # Ideally we should call dl_iterate_phdr or dladdr to do the job, but this + # is simpler and not bad; the former two are not strictly portable anyway + # (not part of POSIX). Obviously this only works on Linux! + try: + with open('/proc/self/maps') as f: + lib_map = f.read() + except FileNotFoundError as e: + raise NotImplementedError("This utility is available only on Linux.") from e + lib = set() + for line in lib_map.split('\n'): + if name in line: + fields = line.split() + lib.add(fields[-1]) # pathname is the last field, check "man proc" + if len(lib) == 0: + raise ValueError(f"library {name} is not loaded") + elif len(lib) > 1: + # This could happen when, e.g., a library exists in both the user env + # and LD_LIBRARY_PATH, and somehow both copies get loaded. This is a + # messy problem, but let's work around it by assuming the one in the + # user env is preferred. 
+ lib2 = set() + for s in [site.getusersitepackages()] + site.getsitepackages(): + for path in lib: + if path.startswith(s): + lib2.add(path) + if len(lib2) != 1: + raise RuntimeError(f"cannot find the unique copy of {name}: {lib}") + else: + lib = lib2 + return lib.pop() + + +def _get_cuquantum_libs(): + paths = set() + for lib in ('custatevec', 'cutensornet', 'cutensor'): + path = os.path.normpath(get_lib_path(f"lib{lib}.so")) + paths.add(path) + return tuple(paths) + + +def _get_cuquantum_includes(): + paths = set() + for path in _get_cuquantum_libs(): + path = os.path.normpath(os.path.join(os.path.dirname(path), '..')) + if not os.path.isdir(os.path.join(path, 'include')): + path = os.path.normpath(os.path.join(path, '../include')) + else: + path = os.path.join(path, 'include') + assert os.path.isdir(path), f"path={path} is invalid" + paths.add(path) + return tuple(paths) + + +def _get_cuquantum_target(target): + target = f"lib{target}.so" + libs = [os.path.basename(lib) for lib in _get_cuquantum_libs()] + for lib in libs: + if target in lib: + lib = '.'.join(lib.split('.')[:3]) # keep SONAME + flag = f"-l:{lib} " + break + else: + assert False + return flag + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--includes', action='store_true', + help='get cuQuantum include flags') + parser.add_argument('--libs', action='store_true', + help='get cuQuantum linker flags') + parser.add_argument('--target', action='append', default=[], + choices=('custatevec', 'cutensornet'), + help='get the linker flag for the target cuQuantum component') + args = parser.parse_args() + + if not sys.argv[1:]: + parser.print_help() + sys.exit(1) + if args.includes: + out = ' '.join(f"-I{path}" for path in _get_cuquantum_includes()) + print(out, end=' ') + if args.libs: + paths = set([os.path.dirname(path) for path in _get_cuquantum_libs()]) + out = ' '.join(f"-L{path}" for path in paths) + print(out, end=' ') + flag = '' + for target in 
args.target: + flag += _get_cuquantum_target(target) + if target == 'cutensornet': + flag += _get_cuquantum_target('cutensor') + print(flag) diff --git a/python/cuquantum/_version.py b/python/cuquantum/_version.py index bb5b86c..732ec2b 100644 --- a/python/cuquantum/_version.py +++ b/python/cuquantum/_version.py @@ -5,4 +5,4 @@ # Note: cuQuantum Python follows the cuQuantum SDK version, which is now # switched to YY.MM and is different from individual libraries' (semantic) # versioning scheme. -__version__ = '22.07.1' # the last digit is for cuQuantum Python only +__version__ = '22.11.0' diff --git a/python/cuquantum/cutensornet/__init__.py b/python/cuquantum/cutensornet/__init__.py index 3f85e3a..162bf55 100644 --- a/python/cuquantum/cutensornet/__init__.py +++ b/python/cuquantum/cutensornet/__init__.py @@ -7,3 +7,4 @@ from cuquantum.cutensornet.memory import * from cuquantum.cutensornet.tensor_network import * from cuquantum.cutensornet.circuit_converter import * +from cuquantum.cutensornet._internal.utils import get_mpi_comm_pointer diff --git a/python/cuquantum/cutensornet/_internal/circuit_converter_utils.py b/python/cuquantum/cutensornet/_internal/circuit_converter_utils.py index a293bde..f427e72 100644 --- a/python/cuquantum/cutensornet/_internal/circuit_converter_utils.py +++ b/python/cuquantum/cutensornet/_internal/circuit_converter_utils.py @@ -4,15 +4,15 @@ try: import cirq - from . import cirq_parser_utils + from . import circuit_parser_utils_cirq except ImportError: - cirq = cirq_parser_utils = None + cirq = circuit_parser_utils_cirq = None import cupy as cp try: import qiskit - from . import qiskit_parser_utils + from . 
import circuit_parser_utils_qiskit except ImportError: - qiskit = qiskit_parser_utils = None + qiskit = circuit_parser_utils_qiskit = None from .tensor_wrapper import _get_backend_asarray_func @@ -26,7 +26,7 @@ def check_version(package_name, version, minimum_version): """ - Check if the current version of a package is above the required minimum + Check if the current version of a package is above the required minimum. """ version_numbers = [int(i) for i in version.split('.')] minimum_version_numbers = [int(i) for i in minimum_version.split('.')] @@ -52,11 +52,11 @@ def infer_parser(circuit): if qiskit and isinstance(circuit, qiskit.QuantumCircuit): qiskit_version = qiskit.__qiskit_version__['qiskit'] # qiskit metapackage version check_version('qiskit', qiskit_version, QISKIT_MIN_VERSION) - return qiskit_parser_utils + return circuit_parser_utils_qiskit elif cirq and isinstance(circuit, cirq.Circuit): cirq_version = cirq.__version__ check_version('cirq', cirq_version, CIRQ_MIN_VERSION) - return cirq_parser_utils + return circuit_parser_utils_cirq else: base = circuit.__module__.split('.')[0] raise NotImplementedError(f'circuit from {base} not supported') @@ -91,7 +91,7 @@ def parse_bitstring(bitstring, n_qubits=None): def parse_fixed_qubits(fixed): """ - Given a set of qubits with fixed states, return the output bitstring and corresponding qubits order + Given a set of qubits with fixed states, return the output bitstring and corresponding qubits order. """ if fixed: fixed_qubits, fixed_bitstring = zip(*fixed.items()) @@ -133,19 +133,49 @@ def get_bitstring_tensors(bitstring, dtype='complex128', backend=cp): def convert_mode_labels_to_expression(input_mode_labels, output_mode_labels): """ - Create an Einsum expression from input and output index labels + Create an Einsum expression from input and output index labels. Args: input_mode_labels: A sequence of mode labels for each input tensor. output_mode_labels: The desired mode labels for the output tensor. 
Returns: - An Einsum expression in explicit form + An Einsum expression in explicit form. """ input_symbols = [''.join(map(_get_symbol, idx)) for idx in input_mode_labels] expression = ','.join(input_symbols) + '->' + ''.join(map(_get_symbol, output_mode_labels)) return expression +def get_pauli_gates(pauli_map, dtype='complex128', backend=cp): + """ + Populate the gates for all pauli operators. + + Args: + pauli_map: A dictionary mapping qubits to pauli operators. + dtype: Data type for the tensor operands. + backend: The package the tensor operands belong to. + + Returns: + A sequence of pauli gates. + """ + asarray = _get_backend_asarray_func(backend) + pauli_i = asarray([[1,0], [0,1]], dtype=dtype) + pauli_x = asarray([[0,1], [1,0]], dtype=dtype) + pauli_y = asarray([[0,-1j], [1j,0]], dtype=dtype) + pauli_z = asarray([[1,0], [0,-1]], dtype=dtype) + + operand_map = {'I': pauli_i, + 'X': pauli_x, + 'Y': pauli_y, + 'Z': pauli_z} + gates = [] + for qubit, pauli_char in pauli_map.items(): + operand = operand_map.get(pauli_char) + if operand is None: + raise ValueError('pauli string character must be one of I/X/Y/Z') + gates.append((operand, (qubit,))) + return gates + def parse_gates_to_mode_labels_operands( gates, qubits_frontier, diff --git a/python/cuquantum/cutensornet/_internal/cirq_parser_utils.py b/python/cuquantum/cutensornet/_internal/circuit_parser_utils_cirq.py similarity index 71% rename from python/cuquantum/cutensornet/_internal/cirq_parser_utils.py rename to python/cuquantum/cutensornet/_internal/circuit_parser_utils_cirq.py index e8342c0..68c7c6b 100644 --- a/python/cuquantum/cutensornet/_internal/cirq_parser_utils.py +++ b/python/cuquantum/cutensornet/_internal/circuit_parser_utils_cirq.py @@ -49,7 +49,7 @@ def unfold_circuit(circuit, dtype='complex128', backend=cp): gate_qubits = operation.qubits tensor = unitary(operation).reshape((2,) * 2 * len(gate_qubits)) tensor = asarray(tensor, dtype=dtype) - gates.append([tensor, operation.qubits]) + 
gates.append((tensor, operation.qubits)) return qubits, gates def get_lightcone_circuit(circuit, coned_qubits): @@ -64,25 +64,16 @@ def get_lightcone_circuit(circuit, coned_qubits): A :class:`cirq.Circuit` object that potentially contains less number of gates """ coned_qubits = set(coned_qubits) + all_operations = list(circuit.all_operations()) n_qubits = len(circuit.all_qubits()) - moments = [] - reversed_moments = circuit.moments[::-1] - n_moments = len(reversed_moments) - for ix, moment in enumerate(reversed_moments): - if len(coned_qubits) == n_qubits: - moments.extend(reversed_moments[ix:]) - break - reduced_moment = [] - reversed_operations = moment.operations[::-1] - n_operations = len(reversed_operations) - for iy, operation in enumerate(reversed_operations): - if len(coned_qubits) == n_qubits: - reduced_moment.extend(reversed_operations[iy:]) - break - qubit_set = set(operation.qubits) - if qubit_set & coned_qubits: - reduced_moment.append(operation) - coned_qubits |= qubit_set - moments.append(Moment(reduced_moment[::-1])) - newqc = Circuit(moments[::-1]) + ix = len(all_operations) + tail_operations = [] + while len(coned_qubits) != n_qubits and ix>0: + ix -= 1 + operation = all_operations[ix] + qubit_set = set(operation.qubits) + if qubit_set & coned_qubits: + tail_operations.append(operation) + coned_qubits |= qubit_set + newqc = Circuit(all_operations[:ix]+tail_operations[::-1]) return newqc diff --git a/python/cuquantum/cutensornet/_internal/qiskit_parser_utils.py b/python/cuquantum/cutensornet/_internal/circuit_parser_utils_qiskit.py similarity index 63% rename from python/cuquantum/cutensornet/_internal/qiskit_parser_utils.py rename to python/cuquantum/cutensornet/_internal/circuit_parser_utils_qiskit.py index 5294263..eb0caad 100644 --- a/python/cuquantum/cutensornet/_internal/qiskit_parser_utils.py +++ b/python/cuquantum/cutensornet/_internal/circuit_parser_utils_qiskit.py @@ -4,14 +4,14 @@ import cupy as cp from qiskit import QuantumCircuit -from 
qiskit.circuit import Barrier, ControlledGate, Delay, Gate, Instruction, Measure +from qiskit.circuit import Barrier, ControlledGate, Delay, Gate, Measure from qiskit.extensions import UnitaryGate from .tensor_wrapper import _get_backend_asarray_func def remove_measurements(circuit): """ - Return a circuit with final measurement operations removed + Return a circuit with final measurement operations removed. """ circuit = circuit.copy() circuit.remove_final_measurements() @@ -22,39 +22,26 @@ def remove_measurements(circuit): def get_inverse_circuit(circuit): """ - Return a circuit with all gate operations inversed + Return a circuit with all gate operations inversed. """ return circuit.inverse() -def unfold_circuit(circuit, dtype='complex128', qubit_map=None, gates=None, backend=cp): +def get_decomposed_gates(circuit, qubit_map=None, gates=None, gate_process_func=None): """ - Unfold the circuit to obtain the qubits and all gate tensors. All :class:`qiskit.circuit.Gate` and - :class:`qiskit.circuit.Instruction` in the circuit will be decomposed into either standard gates or customized unitary gates. - Barrier and delay operations will be discarded. - - Args: - circuit: A :class:`qiskit.QuantumCircuit` object. All parameters in the circuit must be binded. - dtype: Data type for the tensor operands. - backend: The package the tensor operands belong to. - - Returns: - All qubits and gate operations from the input circuit + Return the gate sequence for the given circuit. Compound gates/instructions will be decomposed + to either standard gates or customized unitary gates. 
""" if gates is None: gates = [] - asarray = _get_backend_asarray_func(backend) - qubits = circuit.qubits for operation, gate_qubits, _ in circuit: if qubit_map: gate_qubits = [qubit_map[q] for q in gate_qubits] if isinstance(operation, Gate): if 'standard_gate' in str(type(operation)) or isinstance(operation, UnitaryGate): - tensor = operation.to_matrix().reshape((2,2)*len(gate_qubits)) - tensor = asarray(tensor, dtype=dtype) - if isinstance(operation, ControlledGate): - # in qiskit notation, qubit at high index is the target qubit - gate_qubits = gate_qubits[::-1] - gates.append([tensor, gate_qubits]) + if callable(gate_process_func): + gates.append(gate_process_func(operation, gate_qubits)) + else: + gates.append((operation, gate_qubits)) continue else: if isinstance(operation, (Barrier, Delay)): @@ -63,10 +50,37 @@ def unfold_circuit(circuit, dtype='complex128', qubit_map=None, gates=None, back elif not isinstance(operation.definition, QuantumCircuit): # Instruction as composite gate raise ValueError(f'operation type {type(operation)} not supported') - # for composite gate, must provide a map from the sub circuit to the original circuit next_qubit_map = dict(zip(operation.definition.qubits, gate_qubits)) - _, gates = unfold_circuit(operation.definition, dtype=dtype, qubit_map=next_qubit_map, gates=gates, backend=backend) + gates = get_decomposed_gates(operation.definition, qubit_map=next_qubit_map, gates=gates, gate_process_func=gate_process_func) + return gates + +def unfold_circuit(circuit, dtype='complex128', backend=cp): + """ + Unfold the circuit to obtain the qubits and all gate tensors. All :class:`qiskit.circuit.Gate` and + :class:`qiskit.circuit.Instruction` in the circuit will be decomposed into either standard gates or customized unitary gates. + Barrier and delay operations will be discarded. + + Args: + circuit: A :class:`qiskit.QuantumCircuit` object. All parameters in the circuit must be binded. + dtype: Data type for the tensor operands. 
+ backend: The package the tensor operands belong to. + + Returns: + All qubits and gate operations from the input circuit + """ + asarray = _get_backend_asarray_func(backend) + qubits = circuit.qubits + + def gate_process_func(operation, gate_qubits): + tensor = operation.to_matrix().reshape((2,2)*len(gate_qubits)) + tensor = asarray(tensor, dtype=dtype) + if isinstance(operation, ControlledGate): + # in qiskit notation, qubit at high index is the target qubit + gate_qubits = gate_qubits[::-1] + return tensor, gate_qubits + + gates = get_decomposed_gates(circuit, gate_process_func=gate_process_func) return qubits, gates @@ -82,18 +96,17 @@ def get_lightcone_circuit(circuit, coned_qubits): A :class:`qiskit.QuantumCircuit` object that potentially contains less number of gates """ coned_qubits = set(coned_qubits) - reverse_coned_operations = [] - newqc = circuit.copy() - newqc.data = [] - for ix, (operation, gate_qubits, _) in enumerate(circuit[::-1]): - if len(coned_qubits) == circuit.num_qubits: - # when all qubits are coned, all inner gates are preserved - newqc.data = circuit.data[:len(circuit)-ix] - break + gates = get_decomposed_gates(circuit) + newqc = QuantumCircuit(circuit.qubits) + ix = len(gates) + tail_operations = [] + while len(coned_qubits) != circuit.num_qubits and ix>0: + ix -= 1 + operation, gate_qubits = gates[ix] qubit_set = set(gate_qubits) if qubit_set & coned_qubits: - reverse_coned_operations.append([operation, gate_qubits, _]) + tail_operations.append([operation, gate_qubits]) coned_qubits |= qubit_set - - newqc.data.extend(reverse_coned_operations[::-1]) + for operation, gate_qubits in gates[:ix] + tail_operations[::-1]: + newqc.append(operation, gate_qubits) return newqc diff --git a/python/cuquantum/cutensornet/_internal/einsum_parser.py b/python/cuquantum/cutensornet/_internal/einsum_parser.py index 3b58e05..ad00880 100644 --- a/python/cuquantum/cutensornet/_internal/einsum_parser.py +++ 
b/python/cuquantum/cutensornet/_internal/einsum_parser.py @@ -8,6 +8,7 @@ from collections import Counter from itertools import chain +import string import numpy as np @@ -59,7 +60,7 @@ def parse_single(single): """ Parse single operand mode labels considering ellipsis. Leading or trailing whitespace, if present, is removed. """ - subexpr = single.strip().split('...') + subexpr = single.strip(string.whitespace).split('...') n = len(subexpr) expr = [[Ellipsis]] * (2*n - 1) expr[::2] = subexpr @@ -73,7 +74,7 @@ def check_single(single): for s in single: if s is Ellipsis: continue - if s.isspace() or s in disallowed_labels: + if s in string.whitespace or s in disallowed_labels: return False return True diff --git a/python/cuquantum/cutensornet/_internal/enum_utils.py b/python/cuquantum/cutensornet/_internal/enum_utils.py index cfb0856..25adda7 100644 --- a/python/cuquantum/cutensornet/_internal/enum_utils.py +++ b/python/cuquantum/cutensornet/_internal/enum_utils.py @@ -72,6 +72,17 @@ def create_options_class_from_enum(options_class_name: str, enum_class: IntEnum, return options_class +def snake_to_camel(names): + name = "" + for i, sub_name in enumerate(names): + if i == 0: + name += sub_name.lower() + else: + name += sub_name[0].upper() + sub_name[1:] + name += "_t" + return name + + def camel_to_snake(name, upper=True): """ Convert string from camel case to snake style. 
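As a rough illustration of the `snake_to_camel` helper added in the enum_utils.py hunk above, the sketch below copies its body and applies it to a list of name fragments. The fragments are hypothetical and chosen only for the example; the diff does not show how cuQuantum itself calls the helper.

```python
def snake_to_camel(names):
    # Body copied from the hunk above: lowercase the first fragment,
    # upper-case the first letter of every later fragment (leaving the
    # rest of it unchanged), then append the "_t" suffix.
    name = ""
    for i, sub_name in enumerate(names):
        if i == 0:
            name += sub_name.lower()
        else:
            name += sub_name[0].upper() + sub_name[1:]
    name += "_t"
    return name


# Hypothetical input fragments, not taken from the cuQuantum sources.
fragments = ["CUTENSORNET", "contraction", "optimizer", "config"]
result = snake_to_camel(fragments)
print(result)  # cutensornetContractionOptimizerConfig_t
```

Note that the helper takes an already-split sequence of fragments rather than a single underscore-separated string, and that an empty fragment after the first would raise `IndexError` on `sub_name[0]`.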
diff --git a/python/cuquantum/cutensornet/_internal/optimizer_ifc.py b/python/cuquantum/cutensornet/_internal/optimizer_ifc.py index fa9cf48..84710e4 100644 --- a/python/cuquantum/cutensornet/_internal/optimizer_ifc.py +++ b/python/cuquantum/cutensornet/_internal/optimizer_ifc.py @@ -9,6 +9,7 @@ __all__ = ['OptimizerInfoInterface'] from collections.abc import Sequence +import itertools import operator import numpy as np @@ -16,14 +17,17 @@ from cuquantum import cutensornet as cutn -def _parse_and_map_sliced_modes(sliced_modes, mode_map_user_to_ord, size_dict, dtype_mode=np.int32, dtype_extent=np.int64): +def _parse_and_map_sliced_modes(sliced_modes, mode_map_user_to_ord, size_dict): """ - Parse user-provided sliced modes and create individual, contiguous sliced_modes and sliced extents array. + Parse user-provided sliced modes, create and return a contiguous (sliced mode, sliced extent) array of + type `cutn.cutensornet.slice_info_pair_dtype`. """ num_sliced_modes = len(sliced_modes) + slice_info_array = np.empty((num_sliced_modes,), dtype=cutn.cutensornet.slice_info_pair_dtype) + if num_sliced_modes == 0: - return num_sliced_modes, np.zeros((num_sliced_modes,), dtype=dtype_mode), np.zeros((num_sliced_modes,), dtype=dtype_extent) + return slice_info_array # The sliced modes have already passed basic checks when creating the OptimizerOptions dataclass. @@ -31,7 +35,7 @@ def _parse_and_map_sliced_modes(sliced_modes, mode_map_user_to_ord, size_dict, d if pairs: sliced_modes, sliced_extents = zip(*sliced_modes) else: - sliced_extents = np.ones((num_sliced_modes,), dtype=dtype_extent) + sliced_extents = (1,) # Check for invalid mode labels. 
invalid_modes = tuple(filter(lambda k: k not in mode_map_user_to_ord, sliced_modes)) @@ -39,19 +43,20 @@ def _parse_and_map_sliced_modes(sliced_modes, mode_map_user_to_ord, size_dict, d message = f"Invalid sliced mode labels: {invalid_modes}" raise ValueError(message) - sliced_modes = np.asarray([mode_map_user_to_ord[m] for m in sliced_modes], dtype=dtype_mode) - remainder = tuple(size_dict[m] % e for m, e in zip(sliced_modes, sliced_extents)) - if any(remainder): + slice_info_array["sliced_mode"] = sliced_modes = [mode_map_user_to_ord[m] for m in sliced_modes] + remainder = any(size_dict[m] % e for m, e in itertools.zip_longest(sliced_modes, sliced_extents, fillvalue=1)) + if remainder: raise ValueError("The sliced extents must evenly divide the original extents of the corresponding mode.") + slice_info_array["sliced_extent"] = sliced_extents - return num_sliced_modes, sliced_modes, np.asanyarray(sliced_extents, dtype=dtype_extent) + return slice_info_array InfoEnum = cutn.ContractionOptimizerInfoAttribute -class OptimizerInfoInterface(object): - """ - """ + +class OptimizerInfoInterface: + def __init__(self, network): """ """ @@ -63,10 +68,11 @@ def __init__(self, network): self._largest_tensor = np.zeros((1,), dtype=get_dtype(InfoEnum.LARGEST_TENSOR)) self._num_slices = np.zeros((1,), dtype=get_dtype(InfoEnum.NUM_SLICES)) self._num_sliced_modes = np.zeros((1,), dtype=get_dtype(InfoEnum.NUM_SLICED_MODES)) + self._slicing_config = np.zeros((1,), dtype=get_dtype(InfoEnum.SLICING_CONFIG)) self._slicing_overhead = np.zeros((1,), dtype=get_dtype(InfoEnum.SLICING_OVERHEAD)) self.num_contraction = len(self.network.operands) - 1 - self._path = np.zeros((2*self.num_contraction, ), dtype=np.int32) + self._path = np.zeros((1,), dtype=get_dtype(InfoEnum.PATH)) @staticmethod def _get_scalar_attribute(network, name, attribute): @@ -97,13 +103,6 @@ def num_slices(self): return int(self._num_slices) - @num_slices.setter - def num_slices(self, number): - """ - Set the number of 
slices in the network. - """ - OptimizerInfoInterface._set_scalar_attribute(network, InfoEnum.NUM_SLICES, self._num_slices, number) - @property def flop_count(self): """ @@ -137,16 +136,11 @@ def path(self): """ Return the contraction path in linear format. """ + path = np.empty((2*self.num_contraction,), dtype=np.int32) + self._path["data"] = path.ctypes.data + OptimizerInfoInterface._get_scalar_attribute(self.network, InfoEnum.PATH, self._path) - network = self.network - - path_wrapper = cutn.ContractionPath(self.num_contraction, self._path.ctypes.data) - size = path_wrapper.get_size() - cutn.contraction_optimizer_info_get_attribute(network.handle, network.optimizer_info_ptr, InfoEnum.PATH, path_wrapper.get_path(), size) - - path = list(zip(*[iter(self._path)]*2)) - - return path + return list(zip(*[iter(path)]*2)) @path.setter def path(self, path): @@ -164,10 +158,13 @@ def path(self, path): raise ValueError(f"The length of the contraction path ({num_contraction}) must be one less than the number of operands ({len(network.operands)}).") path = reduce(operator.concat, path) - self._path = np.array(path, dtype=np.int32) - path_wrapper = cutn.ContractionPath(num_contraction, self._path.ctypes.data) - size = path_wrapper.get_size() - cutn.contraction_optimizer_info_set_attribute(network.handle, network.optimizer_info_ptr, InfoEnum.PATH, path_wrapper.get_path(), size) + path_array = np.asarray(path, dtype=np.int32) + + # Construct the path type. + path = np.array((num_contraction, path_array.ctypes.data), dtype=get_dtype(InfoEnum.PATH)) + + # Set the attribute. + OptimizerInfoInterface._set_scalar_attribute(self.network, InfoEnum.PATH, self._path, path) @property def num_sliced_modes(self): @@ -178,13 +175,6 @@ def num_sliced_modes(self): return int(self._num_sliced_modes) - @num_sliced_modes.setter - def num_sliced_modes(self, number): - """ - Set the number of sliced_modes in the network. 
- """ - OptimizerInfoInterface._set_scalar_attribute(self.network, InfoEnum.NUM_SLICED_MODES, self._num_sliced_modes, number) - @property def sliced_mode_extent(self): """ @@ -197,14 +187,15 @@ def sliced_mode_extent(self): num_sliced_modes = self.num_sliced_modes - sliced_modes = np.zeros((num_sliced_modes,), dtype=get_dtype(InfoEnum.SLICED_MODE)) - size = num_sliced_modes * sliced_modes.dtype.itemsize - cutn.contraction_optimizer_info_get_attribute(network.handle, network.optimizer_info_ptr, InfoEnum.SLICED_MODE, sliced_modes.ctypes.data, size) - sliced_modes = tuple(network.mode_map_ord_to_user[m] for m in sliced_modes) # Convert to user mode labels + slice_info_array = np.empty((num_sliced_modes,), dtype=cutn.cutensornet.slice_info_pair_dtype) - sliced_extents = np.zeros((num_sliced_modes,), dtype=get_dtype(InfoEnum.SLICED_EXTENT)) - size = num_sliced_modes * sliced_extents.dtype.itemsize - cutn.contraction_optimizer_info_get_attribute(network.handle, network.optimizer_info_ptr, InfoEnum.SLICED_EXTENT, sliced_extents.ctypes.data, size) + slicing_config = self._slicing_config + slicing_config["num_sliced_modes"] = num_sliced_modes + slicing_config["data"] = slice_info_array.ctypes.data + OptimizerInfoInterface._get_scalar_attribute(self.network, InfoEnum.SLICING_CONFIG, slicing_config) + + sliced_modes = tuple(network.mode_map_ord_to_user[m] for m in slice_info_array["sliced_mode"]) # Convert to user mode labels + sliced_extents = slice_info_array["sliced_extent"] return tuple(zip(sliced_modes, sliced_extents)) @@ -216,18 +207,16 @@ def sliced_mode_extent(self, sliced_modes): sliced_mode = sequence of sliced modes, or sequence of (sliced mode, sliced extent) pairs """ - network = self.network - - num_sliced_modes, sliced_modes, sliced_extents = _parse_and_map_sliced_modes(sliced_modes, network.mode_map_user_to_ord, network.size_dict) + get_dtype = cutn.contraction_optimizer_info_get_attribute_dtype - # Set the number of sliced modes first - self.num_sliced_modes 
= num_sliced_modes + network = self.network - size = num_sliced_modes * sliced_modes.dtype.itemsize - cutn.contraction_optimizer_info_set_attribute(network.handle, network.optimizer_info_ptr, InfoEnum.SLICED_MODE, sliced_modes.ctypes.data, size) + # Construct the slicing config type. + slice_info_array = _parse_and_map_sliced_modes(sliced_modes, network.mode_map_user_to_ord, network.size_dict) + slicing_config = np.array((len(slice_info_array), slice_info_array.ctypes.data), dtype=get_dtype(InfoEnum.SLICING_CONFIG)) - size = num_sliced_modes * sliced_extents.dtype.itemsize - cutn.contraction_optimizer_info_set_attribute(network.handle, network.optimizer_info_ptr, InfoEnum.SLICED_EXTENT, sliced_extents.ctypes.data, size) + # Set the attribute. + OptimizerInfoInterface._set_scalar_attribute(network, InfoEnum.SLICING_CONFIG, self._slicing_config, slicing_config) @property def intermediate_modes(self): diff --git a/python/cuquantum/cutensornet/_internal/package_ifc_cupy.py b/python/cuquantum/cutensornet/_internal/package_ifc_cupy.py index bfc0f67..d6025a8 100644 --- a/python/cuquantum/cutensornet/_internal/package_ifc_cupy.py +++ b/python/cuquantum/cutensornet/_internal/package_ifc_cupy.py @@ -10,6 +10,7 @@ import cupy as cp +from . 
import utils from .package_ifc import Package @@ -17,7 +18,7 @@ class CupyPackage(Package): @staticmethod def get_current_stream(device_id): - with cp.cuda.Device(device_id): + with utils.device_ctx(device_id): stream = cp.cuda.get_current_stream() return stream @@ -35,6 +36,6 @@ def create_external_stream(device_id, stream_ptr): @staticmethod def create_stream(device_id): - with cp.cuda.Device(device_id): + with utils.device_ctx(device_id): stream = cp.cuda.Stream(null=False, non_blocking=False, ptds=False) return stream diff --git a/python/cuquantum/cutensornet/_internal/tensor_ifc_cupy.py b/python/cuquantum/cutensornet/_internal/tensor_ifc_cupy.py index c284f20..e607790 100644 --- a/python/cuquantum/cutensornet/_internal/tensor_ifc_cupy.py +++ b/python/cuquantum/cutensornet/_internal/tensor_ifc_cupy.py @@ -11,6 +11,7 @@ import cupy import numpy +from . import utils from .tensor_ifc import Tensor @@ -61,7 +62,15 @@ def empty(cls, shape, **context): name = context.get('dtype', 'float32') dtype = CupyTensor.name_to_dtype[name] device = context.get('device', None) - with cupy.cuda.Device(device=device): + + if isinstance(device, cupy.cuda.Device): + device_id = device.id + elif isinstance(device, int): + device_id = device + else: + raise ValueError(f"The device must be specified as an integer or cupy.cuda.Device instance, not '{device}'.") + + with utils.device_ctx(device_id): tensor = cupy.empty(shape, dtype=dtype) return tensor @@ -77,7 +86,7 @@ def to(self, device='cpu'): if not isinstance(device, int): raise ValueError(f"The device must be specified as an integer or 'cpu', not '{device}'.") - with cupy.cuda.Device(device): + with utils.device_ctx(device): tensor_device = cupy.asarray(self.tensor) return tensor_device diff --git a/python/cuquantum/cutensornet/_internal/tensor_ifc_numpy.py b/python/cuquantum/cutensornet/_internal/tensor_ifc_numpy.py index 8d2843d..ea218d6 100644 --- a/python/cuquantum/cutensornet/_internal/tensor_ifc_numpy.py +++ 
b/python/cuquantum/cutensornet/_internal/tensor_ifc_numpy.py @@ -11,8 +11,10 @@ import cupy import numpy +from . import utils from .tensor_ifc import Tensor + class NumpyTensor(Tensor): """ Tensor wrapper for numpy ndarrays. @@ -73,7 +75,7 @@ def to(self, device='cpu'): if not isinstance(device, int): raise ValueError(f"The device must be specified as an integer or 'cpu', not '{device}'.") - with cupy.cuda.Device(device): + with utils.device_ctx(device): tensor_device = cupy.asarray(self.tensor) return tensor_device diff --git a/python/cuquantum/cutensornet/_internal/utils.py b/python/cuquantum/cutensornet/_internal/utils.py index 930b938..b4ef794 100644 --- a/python/cuquantum/cutensornet/_internal/utils.py +++ b/python/cuquantum/cutensornet/_internal/utils.py @@ -6,8 +6,10 @@ A collection of (internal use) helper functions. """ +import contextlib +import ctypes import functools -from typing import Callable, Dict, Optional +from typing import Callable, Dict, Mapping, Optional import cupy as cp import numpy as np @@ -17,6 +19,7 @@ from . import package_wrapper from . import tensor_wrapper + def infer_object_package(obj): """ Infer the package that defines this object. @@ -55,13 +58,42 @@ def _create_stream_ctx_ptr_cupy_stream(package_ifc, stream): return stream, stream_ctx, stream_ptr -def get_or_create_stream(device, stream, op_package): +@contextlib.contextmanager +def device_ctx(new_device_id): + """ + Semantics: + + 1. The device context manager makes the specified device current from the point of entry until the point of exit. + + 2. When the context manager exits, the current device is reset to what it was when the context manager was entered. + + 3. Any explicit setting of the device within the context manager (using cupy.cuda.Device().use(), torch.cuda.set_device(), + etc) will overrule the device set by the context manager from that point onwards till the context manager exits. 
In + other words, the context manager provides a local device scope and the current device can be explicitly reset for the + remainder of that scope. + + Corollary: if any library function resets the device globally and this is an undesired side-effect, such functions must be + called from within the device context manager. + + Device context managers can be arbitrarily nested. + """ + old_device_id = cp.cuda.runtime.getDevice() + try: + if old_device_id != new_device_id: + cp.cuda.runtime.setDevice(new_device_id) + yield + finally: + # We should always restore the old device at exit. + cp.cuda.runtime.setDevice(old_device_id) + + +def get_or_create_stream(device_id, stream, op_package): """ Create a stream object from a stream pointer or extract the stream pointer from a stream object, or use the current stream. Args: - device: The device (CuPy object) for the stream. + device_id: The device ID. stream: A stream object, stream pointer, or None. op_package: The package the tensor network operands belong to. @@ -69,7 +101,6 @@ def get_or_create_stream(device, stream, op_package): tuple: CuPy stream object, package stream context, stream pointer. """ - device_id = device.id op_package_ifc = package_wrapper.PACKAGE[op_package] if stream is None: stream = op_package_ifc.get_current_stream(device_id) @@ -134,11 +165,10 @@ def get_memory_limit(memory_limit, device): def get_operands_data(operands): """ - Get the raw data pointer of the input operands and their alignment for cutensornet. + Get the raw data pointer of the input operands for cuTensorNet. 
""" op_data = tuple(o.data_ptr for o in operands) - alignments = tuple(get_maximal_alignment(p) for p in op_data) - return op_data, alignments + return op_data def create_empty_tensor(cls, extents, dtype, device_id, stream_ctx): @@ -155,30 +185,27 @@ def create_empty_tensor(cls, extents, dtype, device_id, stream_ctx): return tensor -def create_output_tensor(cls, package, output, size_dict, device, data_type): +def create_output_tensor(cls, package, output, size_dict, device_id, data_type): """ - Create output tensor and associated data (modes, extents, strides, alignment). This operation is + Create output tensor and associated data (modes, extents, strides). This operation is blocking and is safe to use with asynchronous memory pools. """ modes = tuple(m for m in output) extents = tuple(size_dict[m] for m in output) package_ifc = package_wrapper.PACKAGE[package] - device_id = device.id stream = package_ifc.create_stream(device_id) stream, stream_ctx, _ = _create_stream_ctx_ptr_cupy_stream(package_ifc, stream) - with device: + with device_ctx(device_id): start = stream.record() output = create_empty_tensor(cls, extents, data_type, device_id, stream_ctx) end = stream.record() end.synchronize() strides = output.strides - alignment = get_maximal_alignment(output.data_ptr) - - return output, modes, extents, strides, alignment + return output, modes, extents, strides def get_network_device_id(operands): @@ -204,6 +231,7 @@ def get_operands_dtype(operands): return dtype +# Unused since cuQuantum 22.11 def get_maximal_alignment(address): """ Calculate the maximal alignment of the provided memory location. @@ -242,6 +270,7 @@ def check_operands_match(orig_operands, new_operands, attribute, description): raise ValueError(message) +# Unused since cuQuantum 22.11 def check_alignments_match(orig_alignments, new_alignments): """ Check if alignment matches between the corresponding new and old operands, and raise an exception if it doesn't. 
@@ -257,6 +286,30 @@ def check_alignments_match(orig_alignments, new_alignments): raise ValueError(message) +def check_tensor_qualifiers(qualifiers, dtype, num_inputs): + """ + Check if the tensor qualifiers array is valid. + """ + + if qualifiers is None: + return 0 + + prolog = "The tensor qualifiers must be specified as a one-dimensional NumPy ndarray of 'tensor_qualifiers_dtype' objects." + if not isinstance(qualifiers, np.ndarray): + raise ValueError(prolog) + elif qualifiers.dtype != dtype: + message = prolog + f" The dtype of the ndarray is '{qualifiers.dtype}'." + raise ValueError(message) + elif qualifiers.ndim != 1: + message = prolog + f" The shape of the ndarray is {qualifiers.shape}." + raise ValueError(message) + elif len(qualifiers) != num_inputs: + message = prolog + f" The length of the ndarray is {len(qualifiers)}, while the expected length is {num_inputs}." + raise ValueError(message) + + return qualifiers + + def check_autotune_params(iterations): """ Check if the autotune parameters are of the correct type and within range. @@ -285,6 +338,77 @@ def get_ptr_from_memory_pointer(mem_ptr): raise AttributeError(message) +class Value: + """ + A simple value wrapper holding a default value. + """ + def __init__(self, default, *, validator: Callable[[object], bool]): + """ + Args: + default: The default value to use. + validator: A callable that validates the provided value. + """ + self.validator = validator + self._data = default + + @property + def data(self): + return self._data + + @data.setter + def data(self, value): + self._data = self._validate(value) + + def _validate(self, value): + if self.validator(value): + return value + raise ValueError(f"Internal Error: value '{value}' is not valid.") + + +def check_and_set_options(required: Mapping[str, Value], provided: Mapping[str, object]): + """ + Update each option specified in 'required' by getting the value from 'provided' if it exists or using a default. 
+ """ + for option, value in required.items(): + try: + value.data = provided.pop(option) + except KeyError: + pass + required[option] = value.data + + assert not provided, "Unrecognized options." + + +@contextlib.contextmanager +def cuda_call_ctx(stream, blocking=True, timing=True): + """ + A simple context manager that provides (non-)blocking behavior depending on the `blocking` parameter for CUDA calls. + The call is timed only for blocking behavior when timing is requested. + + An `end` event is recorded after the CUDA call for use in establishing stream ordering for non-blocking calls. This + event is returned together with a `Value` object that stores the elapsed time if the call is blocking and timing is + requested, or None otherwise. + """ + if blocking: + start = cp.cuda.Event(disable_timing = False if timing else True) + stream.record(start) + + end = cp.cuda.Event(disable_timing = False if timing and blocking else True) + + time = Value(None, validator=lambda v: True) + yield end, time + + stream.record(end) + + if not blocking: + return + + end.synchronize() + + if timing: + time.data = cp.cuda.get_elapsed_time(start, end) + + # Decorator definitions def atomic(handler: Callable[[Optional[object]], None], method: bool = False) -> Callable: @@ -361,3 +485,28 @@ def inner(*args, **kwargs): return outer +def get_mpi_comm_pointer(comm): + """Simple helper to get the address to and size of a ``MPI_Comm`` handle. + + Args: + comm (mpi4py.MPI.Comm): An MPI communicator. + + Returns: + tuple: A pair of int values representing the address and the size. 
+ """ + # We won't initialize MPI for users in any case + try: + import mpi4py + init = mpi4py.rc.initialize + mpi4py.rc.initialize = False + from mpi4py import MPI + except ImportError as e: + raise RuntimeError("please install mpi4py") from e + finally: + mpi4py.rc.initialize = init + + if not isinstance(comm, MPI.Comm): + raise ValueError("invalid MPI communicator") + comm_ptr = MPI._addressof(comm) # = MPI_Comm* + mpi_comm_size = MPI._sizeof(MPI.Comm) + return comm_ptr, mpi_comm_size diff --git a/python/cuquantum/cutensornet/circuit_converter.py b/python/cuquantum/cutensornet/circuit_converter.py index 8cbf653..d359c18 100644 --- a/python/cuquantum/cutensornet/circuit_converter.py +++ b/python/cuquantum/cutensornet/circuit_converter.py @@ -8,7 +8,9 @@ __all__ = ['CircuitToEinsum'] +import collections.abc import importlib +import warnings import numpy as np @@ -120,7 +122,33 @@ def state_vector(self, fixed=EMPTY_DICT): The Einstein summation expression and a list of tensor operands. The order of the output mode labels is consistent with :attr:`CircuitToEinsum.qubits`. For :class:`cirq.Circuit`, this order corresponds to all qubits in the circuit sorted in ascending order. For :class:`qiskit.QuantumCircuit`, this order is the same as :attr:`qiskit.QuantumCircuit.qubits`. + + .. note:: + + The kwargs "fixed" is deprecated and will be removed in the future; please switch to :meth:`CircuitToEinsum.batched_amplitudes` for the same functionality. + """ + if fixed: + warnings.warn("The kwargs \"fixed\" is deprecated and will be removed in the future; please " + "switch to CircuitToEinsum.batched_amplitudes() for the same functionality.") + elif fixed is None: + fixed = dict() + + return self.batched_amplitudes(fixed) + + def batched_amplitudes(self, fixed): """ + Generate the Einstein summation expression and tensor operands to compute a batch of bitstring amplitudes for the input circuit. 
+ + Args: + fixed: A dictionary that maps certain qubits to the corresponding fixed states 0 or 1. + + Returns: + The Einstein summation expression and a list of tensor operands. The order of the output mode labels is consistent with :attr:`CircuitToEinsum.qubits`. + For :class:`cirq.Circuit`, this order corresponds to all qubits in the circuit sorted in ascending order. + For :class:`qiskit.QuantumCircuit`, this order is the same as :attr:`qiskit.QuantumCircuit.qubits`. + """ + if not isinstance(fixed, collections.abc.Mapping): + raise TypeError('fixed must be a dictionary') input_mode_labels, input_operands, qubits_frontier = self._get_inputs() fixed_qubits, fixed_bitstring = circ_utils.parse_fixed_qubits(fixed) @@ -130,7 +158,7 @@ def state_vector(self, fixed=EMPTY_DICT): operands = input_operands + circ_utils.get_bitstring_tensors(fixed_bitstring, dtype=self.dtype, backend=self.backend) output_mode_labels = [qubits_frontier[q] for q in self.qubits if q not in fixed] - expression = circ_utils.convert_mode_labels_to_expression(mode_labels, output_mode_labels=output_mode_labels) + expression = circ_utils.convert_mode_labels_to_expression(mode_labels, output_mode_labels) return expression, operands def amplitude(self, bitstring): @@ -151,7 +179,7 @@ def amplitude(self, bitstring): mode_labels = input_mode_labels + [[qubits_frontier[q]] for q in self.qubits] output_mode_labels = [] - expression = circ_utils.convert_mode_labels_to_expression(mode_labels, output_mode_labels=output_mode_labels) + expression = circ_utils.convert_mode_labels_to_expression(mode_labels, output_mode_labels) operands = input_operands + circ_utils.get_bitstring_tensors(bitstring, dtype=self.dtype, backend=self.backend) return expression, operands @@ -179,23 +207,9 @@ def reduced_density_matrix(self, where, fixed=EMPTY_DICT, lightcone=True): .. 
seealso:: `unitary reverse lightcone cancellation `_ """ - parser = self.parser n_qubits = self.n_qubits - - if lightcone: - coned_qubits = list(where) + list(fixed.keys()) - circuit = parser.get_lightcone_circuit(self.circuit, coned_qubits) - _, gates = parser.unfold_circuit(circuit, dtype=self.dtype, backend=self.backend) - # in cirq, the lightcone circuit may only contain a subset of the original qubits - # It's imperative to use qubits=self.qubits to generate the input tensors - input_mode_labels, input_operands, qubits_frontier = circ_utils.parse_inputs(self.qubits, gates, self.dtype, self.backend) - else: - circuit = self.circuit - input_mode_labels, input_operands, qubits_frontier = self._get_inputs() - # avoid inplace modification on metadata - qubits_frontier = qubits_frontier.copy() - - next_frontier = max(qubits_frontier.values()) + 1 + coned_qubits = list(where) + list(fixed.keys()) + input_mode_labels, input_operands, qubits_frontier, next_frontier, inverse_gates = self._get_forward_inverse_metadata(lightcone, coned_qubits) # handle tensors/mode labels for qubits with fixed state fixed_qubits, fixed_bitstring = circ_utils.parse_fixed_qubits(fixed) @@ -214,9 +228,6 @@ def reduced_density_matrix(self, where, fixed=EMPTY_DICT, lightcone=True): qubits_frontier[iqubit] = next_frontier next_frontier += 1 - # inverse circuit - inverse_circuit = parser.get_inverse_circuit(circuit) - _, inverse_gates = parser.unfold_circuit(inverse_circuit, dtype=self.dtype, backend=self.backend) igate_mode_labels, igate_operands = circ_utils.parse_gates_to_mode_labels_operands(inverse_gates, qubits_frontier, next_frontier) @@ -232,7 +243,67 @@ def reduced_density_matrix(self, where, fixed=EMPTY_DICT, lightcone=True): output_left_mode_labels.append(left_mode_labels) output_right_mode_labels.append(right_mode_labels) output_mode_labels = output_left_mode_labels + output_right_mode_labels - expression = circ_utils.convert_mode_labels_to_expression(mode_labels, 
output_mode_labels=output_mode_labels) + expression = circ_utils.convert_mode_labels_to_expression(mode_labels, output_mode_labels) + return expression, operands + + def expectation(self, pauli_string, lightcone=True): + """ + Generate the Einstein summation expression and tensor operands to compute the expectation value of a Pauli + string for the input circuit. + + Unitary reverse lightcone cancellation refers to removing the identity formed by a unitary gate (from + the ket state) and its inverse (from the bra state) when there exists no additional operators + in-between. One can take advantage of this technique to reduce the effective network size by + only including the *causal* gates (gates residing in the lightcone). + + Args: + pauli_string: The Pauli string for expectation value computation. It can be: + + - a sequence of characters ``'I'``/``'X'``/``'Y'``/``'Z'``. The length must be equal to the number of qubits. + - a dictionary mapping the selected qubits to Pauli characters. Qubits not specified are + assumed to be applied with the identity operator ``'I'``. + + lightcone: Whether to apply the unitary reverse lightcone cancellation technique to reduce the number of tensors in expectation value computation. + + Returns: + The Einstein summation expression and a list of tensor operands. + + .. note:: + + When ``lightcone=True``, the identity Pauli operators will be omitted in the output operands. The unitary reverse lightcone cancellation technique is then + applied based on the remaining causal qubits to further reduce the size of the network. The reduction effect depends on the circuit topology and the input Pauli string + (so the contraction path cannot be reused for the contraction of different Pauli strings). 
When ``lightcone=False``, the identity Pauli operators are preserved in the output operands such that the output tensor network has the identical topology for different Pauli strings, and the contraction path only needs to be computed once and can be reused for all Pauli strings. + + .. seealso:: `unitary reverse lightcone cancellation `_ + """ + if isinstance(pauli_string, collections.abc.Sequence): + if len(pauli_string) != self.n_qubits: + raise ValueError('pauli_string must be of equal size as the number of qubits in the circuit') + pauli_string = dict(zip(self.qubits, pauli_string)) + else: + if not isinstance(pauli_string, collections.abc.Mapping): + raise TypeError('pauli_string must be either a sequence of pauli characters or a dictionary') + + n_qubits = self.n_qubits + if lightcone: + pauli_map = {qubit: pauli_char for qubit, pauli_char in pauli_string.items() if pauli_char!='I'} + else: + pauli_map = pauli_string + coned_qubits = pauli_map.keys() + input_mode_labels, input_operands, qubits_frontier, next_frontier, inverse_gates = self._get_forward_inverse_metadata(lightcone, coned_qubits) + + pauli_gates = circ_utils.get_pauli_gates(pauli_map, dtype=self.dtype, backend=self.backend) + gates = pauli_gates + inverse_gates + + gate_mode_labels, gate_operands = circ_utils.parse_gates_to_mode_labels_operands(gates, + qubits_frontier, + next_frontier) + + mode_labels = input_mode_labels + gate_mode_labels + [[qubits_frontier[ix]] for ix in self.qubits] + operands = input_operands + gate_operands + input_operands[:n_qubits] + + output_mode_labels = [] + expression = circ_utils.convert_mode_labels_to_expression(mode_labels, output_mode_labels) return expression, operands def _get_inputs(self): @@ -248,3 +319,38 @@ def _get_inputs(self): if self._metadata is None: self._metadata = circ_utils.parse_inputs(self.qubits, self.gates, self.dtype, self.backend) return self._metadata + + def _get_forward_inverse_metadata(self, lightcone, coned_qubits): + """parse the 
metadata for forward and inverse circuit. + + Args: + lightcone: Whether to apply the unitary reverse lightcone cancellation technique to reduce the number of tensors in expectation value computation. + coned_qubits: An iterable of qubits to be coned. + + Returns: + tuple: A 5-tuple (``input_mode_labels``, ``input_operands``, ``qubits_frontier``, ``next_frontier``, ``inverse_gates``): + + - ``input_mode_labels`` : A sequence of mode labels for initial states and gate tensors. + - ``input_operands`` : A sequence of operands for initial states and gate tensors. + - ``qubits_frontier``: A dictionary mapping all qubits to their current mode labels. + - ``next_frontier``: The next mode label to use. + - ``inverse_gates``: A sequence of (operand, qubits) for the inverse circuit. + """ + parser = self.parser + if lightcone: + circuit = parser.get_lightcone_circuit(self.circuit, coned_qubits) + _, gates = parser.unfold_circuit(circuit, dtype=self.dtype, backend=self.backend) + # in cirq, the lightcone circuit may only contain a subset of the original qubits + # It's imperative to use qubits=self.qubits to generate the input tensors + input_mode_labels, input_operands, qubits_frontier = circ_utils.parse_inputs(self.qubits, gates, self.dtype, self.backend) + else: + circuit = self.circuit + input_mode_labels, input_operands, qubits_frontier = self._get_inputs() + # avoid inplace modification on metadata + qubits_frontier = qubits_frontier.copy() + + next_frontier = max(qubits_frontier.values()) + 1 + # inverse circuit + inverse_circuit = parser.get_inverse_circuit(circuit) + _, inverse_gates = parser.unfold_circuit(inverse_circuit, dtype=self.dtype, backend=self.backend) + return input_mode_labels, input_operands, qubits_frontier, next_frontier, inverse_gates \ No newline at end of file diff --git a/python/cuquantum/cutensornet/configuration.py b/python/cuquantum/cutensornet/configuration.py index 98f0527..ed18084 100644 --- a/python/cuquantum/cutensornet/configuration.py 
+++ b/python/cuquantum/cutensornet/configuration.py @@ -11,7 +11,7 @@ import collections from dataclasses import dataclass from logging import Logger -from typing import Dict, Hashable, Iterable, Mapping, Optional, Tuple, Union +from typing import Dict, Hashable, Iterable, Literal, Mapping, Optional, Tuple, Union import cupy as cp @@ -34,6 +34,10 @@ class NetworkOptions(object): logger (logging.Logger): Python Logger object. The root logger will be used if a logger object is not provided. memory_limit: Maximum memory available to cuTensorNet. It can be specified as a value (with optional suffix like K[iB], M[iB], G[iB]) or as a percentage. The default is 80%. + blocking: A flag specifying the behavior of the execution methods :meth:`Network.autotune` and :meth:`Network.contract`. + When ``blocking`` is ``True``, these methods do not return until the operation is complete. When blocking is ``"auto"``, + the methods return immediately when the input tensors are on the GPU. The execution methods always block when the + input tensors are on the CPU. The default is ``True``. allocator: An object that supports the :class:`BaseCUDAMemoryManager` protocol, used to draw device memory. If an allocator is not provided, a memory allocator from the library package will be used (:func:`torch.cuda.caching_allocator_alloc` for PyTorch operands, :func:`cupy.cuda.alloc` otherwise). 
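The `blocking` semantics described in the docstring above can be sketched in plain Python. This is a minimal illustration, not the cuQuantum implementation: `BlockingOption` and `returns_immediately` are hypothetical names introduced here, and the validation uses an identity check for `True` (an assumption of this sketch) so that truthy look-alikes such as `1` are rejected.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class BlockingOption:
    """Hypothetical stand-in for the ``blocking`` field; not the cuQuantum class."""
    blocking: Literal[True, "auto"] = True

    def __post_init__(self):
        # Only the literal True or the string "auto" are accepted.
        if self.blocking is not True and self.blocking != "auto":
            raise ValueError("blocking must be either True or 'auto'.")


def returns_immediately(option, operands_on_gpu):
    """Return True if, per the docstring, an execution method would not wait."""
    # CPU operands always block; GPU operands block unless blocking == "auto".
    return operands_on_gpu and option.blocking == "auto"
```

With the default `blocking=True` the helper reports blocking behavior for both CPU and GPU operands; only `"auto"` combined with GPU operands yields an immediate return.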
@@ -43,6 +47,7 @@ class NetworkOptions(object): handle : Optional[int] = None logger : Optional[Logger] = None memory_limit : Optional[Union[int, str]] = r'80%' + blocking : Literal[True, "auto"] = True allocator : Optional[BaseCUDAMemoryManager] = None def __post_init__(self): @@ -64,6 +69,9 @@ def __post_init__(self): if not (m1 or m2): raise ValueError(MEM_LIMIT_DOC % self.memory_limit) + if self.blocking != True and self.blocking != "auto": + raise ValueError("The value specified for blocking must be either True or 'auto'.") + if self.allocator is not None and not isinstance(self.allocator, BaseCUDAMemoryManager): raise TypeError("The allocator must be an object of type that fulfils the BaseCUDAMemoryManager protocol.") @@ -104,6 +112,8 @@ class OptimizerOptions(object): reconfiguration: Options for the reconfiguration algorithm as a :class:`~cuquantum.ReconfigOptions` object or dict containing the ``(parameter, value)`` items for ``ReconfigOptions``. seed: Optional seed for the random number generator. See `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SEED`. + cost_function: The objective function to use for finding the optimal contraction path. + See `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_COST_FUNCTION_OBJECTIVE`. 
""" samples : Optional[int] = None threads : Optional[int] = None @@ -111,6 +121,7 @@ class OptimizerOptions(object): slicing : Optional[Union[SlicerOptions, ModeSequenceType, ModeExtentSequenceType]] = None reconfiguration : Optional[ReconfigOptions] = None seed : Optional[int] = None + cost_function: Optional[int] = None def _check_option(self, option, option_class, checker=None): if isinstance(option, option_class): @@ -160,6 +171,8 @@ def __post_init__(self): self.slicing = self._check_option(self.slicing, SlicerOptions, self._check_specified_slices) self.reconfiguration = self._check_option(self.reconfiguration, ReconfigOptions, None) self._check_int(self.seed, "seed") + if self.cost_function is not None: + self.cost_function = cuquantum.cutensornet.OptimizerCost(self.cost_function) @dataclass diff --git a/python/cuquantum/cutensornet/cutensornet.pxd b/python/cuquantum/cutensornet/cutensornet.pxd index a4bd024..2dbee35 100644 --- a/python/cuquantum/cutensornet/cutensornet.pxd +++ b/python/cuquantum/cutensornet/cutensornet.pxd @@ -7,7 +7,7 @@ # Once we switch over the names would be prettier (in the Cython # layer). 
-from libc.stdint cimport int32_t +from libc.stdint cimport int32_t, int64_t, uint32_t from cuquantum.utils cimport DataType, DeviceAllocType, DeviceFreeType, Stream @@ -23,14 +23,27 @@ cdef extern from '<cutensornet.h>' nogil: ctypedef void* _ContractionAutotunePreference 'cutensornetContractionAutotunePreference_t' ctypedef void* _WorkspaceDescriptor 'cutensornetWorkspaceDescriptor_t' ctypedef void* _SliceGroup 'cutensornetSliceGroup_t' + ctypedef void* _TensorDescriptor 'cutensornetTensorDescriptor_t' + ctypedef void* _TensorSVDConfig 'cutensornetTensorSVDConfig_t' + ctypedef void* _TensorSVDInfo 'cutensornetTensorSVDInfo_t' # cuTensorNet structs ctypedef struct _NodePair 'cutensornetNodePair_t': int first int second + ctypedef struct _ContractionPath 'cutensornetContractionPath_t': int numContractions _NodePair *data + + ctypedef struct _SliceInfoPair 'cutensornetSliceInfoPair_t': + int32_t slicedMode + int64_t slicedExtent + + ctypedef struct _SlicingConfig 'cutensornetSlicingConfig_t': + uint32_t numSlicedModes + _SliceInfoPair* data + ctypedef struct _DeviceMemHandler 'cutensornetDeviceMemHandler_t': void* ctx DeviceAllocType device_alloc @@ -38,6 +51,10 @@ cdef extern from '<cutensornet.h>' nogil: # Cython limitation: cannot use C defines in declaring a static array, # so we just have to hard-code CUTENSORNET_ALLOCATOR_NAME_LEN here...
char name[64] + + ctypedef struct _TensorQualifiers 'cutensornetTensorQualifiers_t': + int32_t isConjugate # cannot assign default value to fields in cdef structs + ctypedef void(*LoggerCallbackData 'cutensornetLoggerCallbackData_t')( int32_t logLevel, const char* functionName, @@ -95,6 +112,7 @@ cdef extern from '<cutensornet.h>' nogil: CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_INTERMEDIATE_MODES CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_EFFECTIVE_FLOPS_EST CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_RUNTIME_EST + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_SLICING_CONFIG ctypedef enum _ContractionAutotunePreferenceAttribute 'cutensornetContractionAutotunePreferenceAttributes_t': CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS @@ -107,6 +125,34 @@ cdef extern from '<cutensornet.h>' nogil: ctypedef enum _Memspace 'cutensornetMemspace_t': CUTENSORNET_MEMSPACE_DEVICE + CUTENSORNET_MEMSPACE_HOST + + ctypedef enum _TensorSVDConfigAttribute 'cutensornetTensorSVDConfigAttributes_t': + CUTENSORNET_TENSOR_SVD_CONFIG_ABS_CUTOFF + CUTENSORNET_TENSOR_SVD_CONFIG_REL_CUTOFF + CUTENSORNET_TENSOR_SVD_CONFIG_S_NORMALIZATION + CUTENSORNET_TENSOR_SVD_CONFIG_S_PARTITION + + ctypedef enum _TensorSVDNormalization 'cutensornetTensorSVDNormalization_t': + CUTENSORNET_TENSOR_SVD_NORMALIZATION_NONE + CUTENSORNET_TENSOR_SVD_NORMALIZATION_L1 + CUTENSORNET_TENSOR_SVD_NORMALIZATION_L2 + CUTENSORNET_TENSOR_SVD_NORMALIZATION_LINF + + ctypedef enum _TensorSVDPartition 'cutensornetTensorSVDPartition_t': + CUTENSORNET_TENSOR_SVD_PARTITION_NONE + CUTENSORNET_TENSOR_SVD_PARTITION_US + CUTENSORNET_TENSOR_SVD_PARTITION_SV + CUTENSORNET_TENSOR_SVD_PARTITION_UV_EQUAL + + ctypedef enum _TensorSVDInfoAttribute 'cutensornetTensorSVDInfoAttributes_t': + CUTENSORNET_TENSOR_SVD_INFO_FULL_EXTENT + CUTENSORNET_TENSOR_SVD_INFO_REDUCED_EXTENT + CUTENSORNET_TENSOR_SVD_INFO_DISCARDED_WEIGHT + + ctypedef enum _GateSplitAlgo 'cutensornetGateSplitAlgo_t': + CUTENSORNET_GATE_SPLIT_ALGO_DIRECT + CUTENSORNET_GATE_SPLIT_ALGO_REDUCED # cuTensorNet consts int
CUTENSORNET_MAJOR diff --git a/python/cuquantum/cutensornet/cutensornet.pyx b/python/cuquantum/cutensornet/cutensornet.pyx index d05f5fd..315d530 100644 --- a/python/cuquantum/cutensornet/cutensornet.pyx +++ b/python/cuquantum/cutensornet/cutensornet.pyx @@ -34,13 +34,19 @@ cdef extern from * nogil: # network descriptor int cutensornetCreateNetworkDescriptor( _Handle, int32_t, const int32_t[], const int64_t* const[], - const int64_t* const[], const int32_t* const[], const uint32_t[], + const int64_t* const[], const int32_t* const[], const _TensorQualifiers[], int32_t, const int64_t[], const int64_t[], const int32_t[], - uint32_t, DataType, _ComputeType, _NetworkDescriptor*) + DataType, _ComputeType, _NetworkDescriptor*) int cutensornetDestroyNetworkDescriptor(_NetworkDescriptor) int cutensornetGetOutputTensorDetails( const _Handle, const _NetworkDescriptor, int32_t*, size_t*, int32_t*, int64_t*, int64_t*) + int cutensornetGetOutputTensorDescriptor( + const _Handle, const _NetworkDescriptor, + _TensorDescriptor*) + int cutensornetGetTensorDetails( + const _Handle, const _TensorDescriptor, + int32_t*, size_t*, int32_t*, int64_t*, int64_t*) # workspace descriptor int cutensornetCreateWorkspaceDescriptor( @@ -48,6 +54,9 @@ cdef extern from * nogil: int cutensornetWorkspaceComputeSizes( const _Handle, const _NetworkDescriptor, const _ContractionOptimizerInfo, _WorkspaceDescriptor) + int cutensornetWorkspaceComputeContractionSizes( + const _Handle, const _NetworkDescriptor, + const _ContractionOptimizerInfo, _WorkspaceDescriptor) int cutensornetWorkspaceGetSize( const _Handle, const _WorkspaceDescriptor, _WorksizePref, _Memspace, uint64_t*) @@ -150,6 +159,60 @@ cdef extern from * nogil: int cutensornetLoggerSetMask(int32_t) int cutensornetLoggerForceDisable() + # tensor descriptor + int cutensornetCreateTensorDescriptor( + _Handle, int32_t, const int64_t[], const int64_t[], const int32_t[], + DataType, _TensorDescriptor*) + int 
cutensornetDestroyTensorDescriptor(_TensorDescriptor) + + # svdConfig + int cutensornetCreateTensorSVDConfig(_Handle, _TensorSVDConfig*) + int cutensornetDestroyTensorSVDConfig(_TensorSVDConfig) + int cutensornetTensorSVDConfigGetAttribute( + _Handle, _TensorSVDConfig, _TensorSVDConfigAttribute, void*, size_t) + int cutensornetTensorSVDConfigSetAttribute( + _Handle, _TensorSVDConfig, _TensorSVDConfigAttribute, void*, size_t) + + # svdInfo + int cutensornetCreateTensorSVDInfo(_Handle, _TensorSVDInfo*) + int cutensornetDestroyTensorSVDInfo(_TensorSVDInfo) + int cutensornetTensorSVDInfoGetAttribute( + _Handle, _TensorSVDInfo, _TensorSVDInfoAttribute, void*, size_t) + + # tensorSVD + int cutensornetWorkspaceComputeSVDSizes( + _Handle, _TensorDescriptor, _TensorDescriptor, _TensorDescriptor, + _TensorSVDConfig, _WorkspaceDescriptor) + int cutensornetTensorSVD( + _Handle, _TensorDescriptor, void*, _TensorDescriptor, void*, void*, + _TensorDescriptor, void*, _TensorSVDConfig, _TensorSVDInfo, + _WorkspaceDescriptor, Stream) + + # tensorQR + int cutensornetWorkspaceComputeQRSizes( + _Handle, _TensorDescriptor, _TensorDescriptor, _TensorDescriptor, + _WorkspaceDescriptor) + int cutensornetTensorQR( + _Handle, _TensorDescriptor, void*, _TensorDescriptor, void*, + _TensorDescriptor, void*, _WorkspaceDescriptor, Stream) + + # gate split + int cutensornetWorkspaceComputeGateSplitSizes( + _Handle, _TensorDescriptor, _TensorDescriptor, _TensorDescriptor, + _TensorDescriptor, _TensorDescriptor, _GateSplitAlgo, + _TensorSVDConfig, _ComputeType, _WorkspaceDescriptor) + int cutensornetGateSplit( + _Handle, _TensorDescriptor, void*, _TensorDescriptor, void*, + _TensorDescriptor, void*, _TensorDescriptor, void*, void*, + _TensorDescriptor, void*, _GateSplitAlgo, _TensorSVDConfig, + _ComputeType, _TensorSVDInfo, _WorkspaceDescriptor, Stream) + + # distributed + int cutensornetDistributedResetConfiguration(_Handle, void*, size_t) + int cutensornetDistributedGetNumRanks(_Handle, int*) + 
int cutensornetDistributedGetProcRank(_Handle, int*) + int cutensornetDistributedSynchronize(_Handle) + class cuTensorNetError(RuntimeError): def __init__(self, status): @@ -225,9 +288,9 @@ cpdef size_t get_cudart_version() except*: cpdef intptr_t create_network_descriptor( intptr_t handle, int32_t n_inputs, n_modes_in, extents_in, - strides_in, modes_in, alignments_in, + strides_in, modes_in, qualifiers_in, int32_t n_modes_out, extents_out, - strides_out, modes_out, uint32_t alignment_out, + strides_out, modes_out, int data_type, int compute_type) except*: """Create a tensor network descriptor. @@ -261,11 +324,11 @@ cpdef intptr_t create_network_descriptor( to the corresponding tensor's modes - a nested Python sequence of :class:`int` - alignments_in: A host array of alignments for each input tensor. It can + qualifiers_in: A host array of qualifiers for each input tensor. It can be - - an :class:`int` as the pointer address to the array - - a Python sequence of :class:`int` + - an :class:`int` as the pointer address to the numpy array with dtype `tensor_qualifiers_dtype` + - a numpy array with dtype `tensor_qualifiers_dtype` n_modes_out (int32_t): The number of modes of the output tensor. If this is set to -1 and ``modes_out`` is set to 0 (not provided), @@ -286,7 +349,6 @@ cpdef intptr_t create_network_descriptor( - an :class:`int` as the pointer address to the array - a Python sequence of :class:`int` - alignment_out (uint32_t): The alignment for the output tensor. data_type (cuquantum.cudaDataType): The data type of the input and output tensors. 
compute_type (cuquantum.ComputeType): The compute type of the tensor @@ -388,14 +450,13 @@ cpdef intptr_t create_network_descriptor( # a pointer address, take it as is modesInPtr = modes_in - # alignments_in can be a pointer address, or a Python sequence - cdef vector[uint32_t] alignmentsInData - cdef uint32_t* alignmentsInPtr - if cpython.PySequence_Check(alignments_in): - alignmentsInData = alignments_in - alignmentsInPtr = alignmentsInData.data() - else: # a pointer address - alignmentsInPtr = alignments_in + # qualifiers_in can be a pointer address or a numpy array + cdef _TensorQualifiers* qualifiersInPtr + if isinstance(qualifiers_in, _numpy.ndarray): + assert qualifiers_in.dtype == tensor_qualifiers_dtype + qualifiersInPtr = <_TensorQualifiers*>qualifiers_in.ctypes.data + else: + qualifiersInPtr = <_TensorQualifiers*> qualifiers_in # extents_out can be a pointer address, or a Python sequence cdef vector[int64_t] extentsOutData @@ -427,8 +488,8 @@ cpdef intptr_t create_network_descriptor( cdef _NetworkDescriptor tn_desc with nogil: status = cutensornetCreateNetworkDescriptor(<_Handle>handle, - n_inputs, numModesInPtr, extentsInPtr, stridesInPtr, modesInPtr, alignmentsInPtr, - n_modes_out, extentsOutPtr, stridesOutPtr, modesOutPtr, alignment_out, + n_inputs, numModesInPtr, extentsInPtr, stridesInPtr, modesInPtr, qualifiersInPtr, + n_modes_out, extentsOutPtr, stridesOutPtr, modesOutPtr, data_type, <_ComputeType>compute_type, &tn_desc) check_status(status) return tn_desc @@ -461,6 +522,10 @@ cpdef tuple get_output_tensor_details(intptr_t handle, intptr_t tn_desc): .. 
seealso:: `cutensornetGetOutputTensorDetails` """ + warnings.warn("cuquantum.cutensornet.get_output_tensor_details() is " + "deprecated and will be removed in a future release; please " + "switch to cuquantum.cutensornet.get_output_tensor_descriptor() " + "instead", DeprecationWarning, 2) cdef int32_t numModesOut = 0 with nogil: status = cutensornetGetOutputTensorDetails( @@ -480,6 +545,63 @@ cpdef tuple get_output_tensor_details(intptr_t handle, intptr_t tn_desc): check_status(status) return (numModesOut, modes, extents, strides) +cpdef intptr_t get_output_tensor_descriptor( + intptr_t handle, intptr_t tn_desc) except*: + """Get the network's output tensor descriptor. + + Args: + handle (intptr_t): The library handle. + tn_desc (intptr_t): The tensor network descriptor. + + Returns: + intptr_t: An opaque descriptor handle (as Python :class:`int`). + Users are responsible for calling :func:`destroy_tensor_descriptor` to + clean it up. + + .. seealso:: `cutensornetGetOutputTensorDescriptor` + """ + cdef _TensorDescriptor desc + with nogil: + status = cutensornetGetOutputTensorDescriptor( + <_Handle>handle, <_NetworkDescriptor>tn_desc, &desc) + check_status(status) + return desc + + +cpdef tuple get_tensor_details(intptr_t handle, intptr_t desc): + """Get the tensor's metadata. + + Args: + handle (intptr_t): The library handle. + desc (intptr_t): A tensor descriptor. + + Returns: + tuple: + The metadata of the tensor: ``(num_modes, modes, extents, + strides)``. + + ..
seealso:: `cutensornetGetTensorDetails` + + """ + cdef int32_t numModesOut = 0 + with nogil: + status = cutensornetGetTensorDetails( + <_Handle>handle, <_TensorDescriptor>desc, + &numModesOut, NULL, NULL, NULL, NULL) + check_status(status) + modes = _numpy.empty(numModesOut, dtype=_numpy.int32) + extents = _numpy.empty(numModesOut, dtype=_numpy.int64) + strides = _numpy.empty(numModesOut, dtype=_numpy.int64) + cdef int32_t* mPtr = modes.ctypes.data + cdef int64_t* ePtr = extents.ctypes.data + cdef int64_t* sPtr = strides.ctypes.data + with nogil: + status = cutensornetGetTensorDetails( + <_Handle>handle, <_TensorDescriptor>desc, + &numModesOut, NULL, mPtr, ePtr, sPtr) + check_status(status) + return (numModesOut, modes, extents, strides) + cpdef intptr_t create_workspace_descriptor(intptr_t handle) except*: """Create a workspace descriptor. @@ -523,9 +645,18 @@ cpdef workspace_compute_sizes( tn_desc (intptr_t): The tensor network descriptor. info (intptr_t): The optimizer info handle. workspace (intptr_t): The workspace descriptor. + + .. warning:: + + This function is deprecated and will be removed in a future release. + Use :func:`workspace_compute_contraction_sizes` instead. .. seealso:: `cutensornetWorkspaceComputeSizes` """ + warnings.warn("cuquantum.cutensornet.workspace_compute_sizes() is deprecated and will " + "be removed in the future; please switch to " + "cuquantum.cutensornet.workspace_compute_contraction_sizes() instead", + DeprecationWarning, 2) with nogil: status = cutensornetWorkspaceComputeSizes( <_Handle>handle, <_NetworkDescriptor>tn_desc, @@ -534,6 +665,26 @@ cpdef workspace_compute_sizes( check_status(status) +cpdef workspace_compute_contraction_sizes( + intptr_t handle, intptr_t tn_desc, intptr_t info, intptr_t workspace): + """Compute the required workspace sizes for tensor network contraction. + + Args: + handle (intptr_t): The library handle. + tn_desc (intptr_t): The tensor network descriptor. + info (intptr_t): The optimizer info handle. 
+ workspace (intptr_t): The workspace descriptor. + + .. seealso:: `cutensornetWorkspaceComputeContractionSizes` + """ + with nogil: + status = cutensornetWorkspaceComputeContractionSizes( + <_Handle>handle, <_NetworkDescriptor>tn_desc, + <_ContractionOptimizerInfo>info, + <_WorkspaceDescriptor>workspace) + check_status(status) + + cpdef uint64_t workspace_get_size( intptr_t handle, intptr_t workspace, int pref, int mem_space) except*: """Get the workspace size for the corresponding preference and memory @@ -713,12 +864,34 @@ cpdef destroy_contraction_optimizer_info(intptr_t info): ######################### Python specific utility ######################### +contraction_path_dtype = _numpy.dtype( + {'names':['num_contractions','data'], + 'formats': (_numpy.uint32, _numpy.intp), + 'itemsize': sizeof(_ContractionPath), + }, align=True +) + +# We need this dtype because its members are not of the same type... +slice_info_pair_dtype = _numpy.dtype( + {'names': ('sliced_mode','sliced_extent'), + 'formats': (_numpy.int32, _numpy.int64), + 'itemsize': sizeof(_SliceInfoPair), + }, align=True +) + +slicing_config_dtype = _numpy.dtype( + {'names': ('num_sliced_modes','data'), + 'formats': (_numpy.uint32, _numpy.intp), + 'itemsize': sizeof(_SlicingConfig), + }, align=True +) + cdef dict contract_opti_info_sizes = { CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES: _numpy.int64, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICED_MODES: _numpy.int32, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_SLICED_MODE: _numpy.int32, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_SLICED_EXTENT: _numpy.int64, - CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH: ContractionPath, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH: contraction_path_dtype, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PHASE1_FLOP_COUNT: _numpy.float64, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT: _numpy.float64, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_LARGEST_TENSOR: _numpy.float64, @@ -727,6 +900,7 @@ cdef dict contract_opti_info_sizes = 
{ CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_INTERMEDIATE_MODES: _numpy.int32, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_EFFECTIVE_FLOPS_EST: _numpy.float64, CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_RUNTIME_EST: _numpy.float64, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_SLICING_CONFIG: slicing_config_dtype, } cpdef contraction_optimizer_info_get_attribute_dtype(int attr): @@ -736,7 +910,8 @@ cpdef contraction_optimizer_info_get_attribute_dtype(int attr): attr (ContractionOptimizerInfoAttribute): The attribute to query. Returns: - The data type of the queried attribute. + The data type of the queried attribute. The returned dtype is always + a valid NumPy dtype object. .. note:: This API has no C counterpart and is a convenient helper for allocating memory for :func:`contraction_optimizer_info_get_attribute` @@ -750,23 +925,26 @@ cpdef contraction_optimizer_info_get_attribute_dtype(int attr): val = ContractionOptimizerInfoAttribute.PATH dtype = contraction_optimizer_info_get_attribute_dtype(val) - # setter + # for setting a path path = np.asarray([(1, 3), (1, 2), (0, 1)], dtype=np.int32) - path_obj = dtype(path.size//2, path.ctypes.data) + # ... 
or for getting a path; note that num_contractions is the number of + # input tensors minus one + path = np.empty(2*num_contractions, dtype=np.int32) + + path_obj = np.zeros((1,), dtype=dtype) + path_obj["num_contractions"] = path.size // 2 + path_obj["data"] = path.ctypes.data + + # for setting a path contraction_optimizer_info_set_attribute( - handle, info, val, path_obj.get_data(), path_obj.get_size()) + handle, info, val, path_obj.ctypes.data, path_obj.dtype.itemsize) - # getter - # num_contractions is the number of input tensors minus one - path = np.empty(2*num_contractions, dtype=np.int32) - path_obj = dtype(num_contractions, path.ctypes.data) + # for getting a path contraction_optimizer_info_get_attribute( - handle, info, val, path_obj.get_data(), path_obj.get_size()) + handle, info, val, path_obj.ctypes.data, path_obj.dtype.itemsize) # now path is filled print(path) - See also the documentation of :class:`ContractionPath`. This design is subject - to change in a future release. """ return contract_opti_info_sizes[attr] @@ -789,9 +967,6 @@ cpdef contraction_optimizer_info_get_attribute( .. note:: To compute ``size``, use the itemsize of the corresponding data type, which can be queried using :func:`contraction_optimizer_info_get_attribute_dtype`. - .. note:: For getting the :data:`ContractionOptimizerInfoAttribute.PATH` attribute - please see :func:`contraction_optimizer_info_get_attribute_dtype`. - .. seealso:: `cutensornetContractionOptimizerInfoGetAttribute` """ with nogil: @@ -817,9 +992,6 @@ cpdef contraction_optimizer_info_set_attribute( .. note:: To compute ``size``, use the itemsize of the corresponding data type, which can be queried using :func:`contraction_optimizer_info_get_attribute_dtype`. - .. note:: For setting the :data:`ContractionOptimizerInfoAttribute.PATH` attribute - please see :func:`contraction_optimizer_info_get_attribute_dtype`. - ..
seealso:: `cutensornetContractionOptimizerInfoSetAttribute` """ with nogil: @@ -1272,7 +1444,7 @@ cpdef contraction( stream (intptr_t): The CUDA stream handle (``cudaStream_t`` as Python :class:`int`). - .. note:: + .. warning:: This function is deprecated and will be removed in a future release. Use :func:`contract_slices` instead. @@ -1623,70 +1795,582 @@ cpdef logger_force_disable(): check_status(status) -cdef class ContractionPath: - """A proxy object to hold a `cutensornetContractionPath_t` struct. +cpdef intptr_t create_tensor_descriptor( + intptr_t handle, int32_t n_modes, extents, strides, modes, + int data_type) except*: + """Create a tensor descriptor. - Users provide the number of contractions and a pointer address to the actual - contraction path, and this object creates an `cutensornetContractionPath_t` - instance and fills in the provided information. + Args: + handle (intptr_t): The library handle. + n_modes (int32_t): The number of modes of the tensor. + extents: The extents of the tensor (on host). It can be - Example: + - an :class:`int` as the pointer address to the array + - a Python sequence of :class:`int` - .. code-block:: python + strides: The strides of the tensor (on host). It can be - # the pairwise contraction order is stored as C int - path = np.asarray([(1, 3), (1, 2), (0, 1)], dtype=np.int32) - path_obj = ContractionPath(path.size//2, path.ctypes.data) + - an :class:`int` as the pointer address to the array + - a Python sequence of :class:`int` + + modes: The modes of the tensor (on host). It can be + + - an :class:`int` as the pointer address to the array + - a Python sequence of :class:`int` + + data_type (cuquantum.cudaDataType): The data type of the tensor. + + Returns: + intptr_t: An opaque descriptor handle (as Python :class:`int`). + + .. note:: + If ``strides`` is set to 0 (``NULL``), it means the tensor is in + the Fortran layout (F-contiguous). + + .. 
seealso:: `cutensornetCreateTensorDescriptor` + """ + # extents can be a pointer address, or a Python sequence + cdef vector[int64_t] extentsData + cdef int64_t* extentsPtr + if cpython.PySequence_Check(extents): + extentsData = extents + extentsPtr = extentsData.data() + else: # a pointer address + extentsPtr = extents + + # strides can be a pointer address, or a Python sequence + cdef vector[int64_t] stridesData + cdef int64_t* stridesPtr + if cpython.PySequence_Check(strides): + stridesData = strides + stridesPtr = stridesData.data() + else: # a pointer address + stridesPtr = strides + + # modes can be a pointer address, or a Python sequence + cdef vector[int32_t] modesData + cdef int32_t* modesPtr + if cpython.PySequence_Check(modes): + modesData = modes + modesPtr = modesData.data() + else: # a pointer address + modesPtr = modes + + cdef _TensorDescriptor desc + with nogil: + status = cutensornetCreateTensorDescriptor( + <_Handle>handle, n_modes, extentsPtr, stridesPtr, modesPtr, + data_type, &desc) + check_status(status) + return desc + + +cpdef destroy_tensor_descriptor(intptr_t desc): + """Destroy a tensor descriptor. + + Args: + desc (intptr_t): The tensor descriptor. + + .. seealso:: `cutensornetDestroyTensorDescriptor` + """ + with nogil: + status = cutensornetDestroyTensorDescriptor(<_TensorDescriptor>desc) + check_status(status) + + +cpdef intptr_t create_tensor_svd_config( + intptr_t handle) except*: + """Create a tensor SVD config object. + + Args: + handle (intptr_t): The library handle. + + Returns: + intptr_t: An opaque tensor SVD config handle (as Python :class:`int`). + + .. seealso:: `cutensornetCreateTensorSVDConfig` + """ + cdef _TensorSVDConfig config + with nogil: + status = cutensornetCreateTensorSVDConfig( + <_Handle>handle, &config) + check_status(status) + return config + + +cpdef destroy_tensor_svd_config(intptr_t config): + """Destroy a tensor SVD config object. + + Args: + config (intptr_t): The tensor SVD config handle. + + .. 
seealso:: `cutensornetDestroyTensorSVDConfig` + """ + with nogil: + status = cutensornetDestroyTensorSVDConfig( + <_TensorSVDConfig>config) + check_status(status) + + +######################### Python specific utility ######################### + +cdef dict tensor_svd_cfg_sizes = { + CUTENSORNET_TENSOR_SVD_CONFIG_ABS_CUTOFF: _numpy.float64, + CUTENSORNET_TENSOR_SVD_CONFIG_REL_CUTOFF: _numpy.float64, + CUTENSORNET_TENSOR_SVD_CONFIG_S_NORMALIZATION: _numpy.int32, # = sizeof(enum value) + CUTENSORNET_TENSOR_SVD_CONFIG_S_PARTITION: _numpy.int32, # = sizeof(enum value) +} + +cpdef tensor_svd_config_get_attribute_dtype(int attr): + """Get the Python data type of the corresponding tensor SVD config attribute. + + Args: + attr (TensorSVDConfigAttribute): The attribute to query. + + Returns: + The data type of the queried attribute. + + .. note:: This API has no C counterpart and is a convenient helper for + allocating memory for :func:`tensor_svd_config_get_attribute` + and :func:`tensor_svd_config_set_attribute`. + """ + dtype = tensor_svd_cfg_sizes[attr] + if attr == CUTENSORNET_TENSOR_SVD_CONFIG_S_NORMALIZATION: + if _numpy.dtype(dtype).itemsize != sizeof(_TensorSVDNormalization): + warnings.warn("binary size may be incompatible") + elif attr == CUTENSORNET_TENSOR_SVD_CONFIG_S_PARTITION: + if _numpy.dtype(dtype).itemsize != sizeof(_TensorSVDPartition): + warnings.warn("binary size may be incompatible") + return dtype + +########################################################################### + + +cpdef tensor_svd_config_get_attribute( + intptr_t handle, intptr_t config, int attr, + intptr_t buf, size_t size): + """Get the tensor SVD config attribute. + + Args: + handle (intptr_t): The library handle. + config (intptr_t): The tensor SVD config handle. + attr (TensorSVDConfigAttribute): The attribute to query. + buf (intptr_t): The pointer address (as Python :class:`int`) for storing + the returned attribute value. + size (size_t): The size of ``buf`` (in bytes). + + ..
note:: To compute ``size``, use the itemsize of the corresponding data + type, which can be queried using :func:`tensor_svd_config_get_attribute_dtype`. + + .. seealso:: `cutensornetTensorSVDConfigGetAttribute` + """ + with nogil: + status = cutensornetTensorSVDConfigGetAttribute( + <_Handle>handle, <_TensorSVDConfig>config, + <_TensorSVDConfigAttribute>attr, + buf, size) + check_status(status) - # get the pointer address to the underlying `cutensornetContractionPath_t` - my_func(..., path_obj.get_data(), ...) - # path must outlive path_obj! - del path_obj - del path +cpdef tensor_svd_config_set_attribute( + intptr_t handle, intptr_t config, int attr, + intptr_t buf, size_t size): + """Set the tensor SVD config attribute. + + Args: + handle (intptr_t): The library handle. + config (intptr_t): The tensor SVD config handle. + attr (TensorSVDConfigAttribute): The attribute to set. + buf (intptr_t): The pointer address (as Python :class:`int`) to the attribute data. + size (size_t): The size of ``buf`` (in bytes). + + .. note:: To compute ``size``, use the itemsize of the corresponding data + type, which can be queried using :func:`tensor_svd_config_get_attribute_dtype`. + + .. seealso:: `cutensornetTensorSVDConfigSetAttribute` + """ + with nogil: + status = cutensornetTensorSVDConfigSetAttribute( + <_Handle>handle, <_TensorSVDConfig>config, + <_TensorSVDConfigAttribute>attr, + buf, size) + check_status(status) + + +cpdef intptr_t create_tensor_svd_info(intptr_t handle) except*: + """Create a tensor SVD info object. Args: - num_contractions (int): The number of contractions in the provided path. - data (uintptr_t): The pointer address (as Python :class:`int`) to the provided path. + handle (intptr_t): The library handle. + + Returns: + intptr_t: An opaque tensor SVD info handle (as Python :class:`int`). + + .. 
seealso:: `cutensornetCreateTensorSVDInfo` + """ + cdef _TensorSVDInfo info + with nogil: + status = cutensornetCreateTensorSVDInfo( + <_Handle>handle, &info) + check_status(status) + return info + + +cpdef destroy_tensor_svd_info(intptr_t info): + """Destroy a tensor SVD info object. + + Args: + info (intptr_t): The tensor SVD info handle. + + .. seealso:: `cutensornetDestroyTensorSVDInfo` + """ + with nogil: + status = cutensornetDestroyTensorSVDInfo( + <_TensorSVDInfo>info) + check_status(status) + + +######################### Python specific utility ######################### + +cdef dict tensor_svd_info_sizes = { + CUTENSORNET_TENSOR_SVD_INFO_FULL_EXTENT: _numpy.int64, + CUTENSORNET_TENSOR_SVD_INFO_REDUCED_EXTENT: _numpy.int64, + CUTENSORNET_TENSOR_SVD_INFO_DISCARDED_WEIGHT: _numpy.float64, +} + +cpdef tensor_svd_info_get_attribute_dtype(int attr): + """Get the Python data type of the corresponding tensor SVD info attribute. + + Args: + attr (TensorSVDInfoAttribute): The attribute to query. + + Returns: + The data type of the queried attribute. The returned dtype is always + a valid NumPy dtype object. + + .. note:: This API has no C counterpart and is a convenient helper for + allocating memory for :func:`tensor_svd_info_get_attribute`. + + """ + return tensor_svd_info_sizes[attr] + +########################################################################### + + +cpdef tensor_svd_info_get_attribute( + intptr_t handle, intptr_t info, int attr, + intptr_t buf, size_t size): + """Get the tensor SVD info attribute. + + Args: + handle (intptr_t): The library handle. + info (intptr_t): The tensor SVD info handle. + attr (TensorSVDInfoAttribute): The attribute to query. + buf (intptr_t): The pointer address (as Python :class:`int`) for storing + the returned attribute value. + size (size_t): The size of ``buf`` (in bytes). + + .. 
note:: To compute ``size``, use the itemsize of the corresponding data + type, which can be queried using :func:`tensor_svd_info_get_attribute_dtype`. + + .. seealso:: `cutensornetTensorSVDInfoGetAttribute` + """ + with nogil: + status = cutensornetTensorSVDInfoGetAttribute( + <_Handle>handle, <_TensorSVDInfo>info, + <_TensorSVDInfoAttribute>attr, + buf, size) + check_status(status) + + +cpdef workspace_compute_svd_sizes( + intptr_t handle, intptr_t tensor_in, intptr_t tensor_u, + intptr_t tensor_v, intptr_t config, intptr_t workspace): + """Compute the required workspace sizes for :func:`tensor_svd`. + + Args: + handle (intptr_t): The library handle. + tensor_in (intptr_t): The input tensor descriptor. + tensor_u (intptr_t): The tensor descriptor for the output U. + tensor_v (intptr_t): The tensor descriptor for the output V. + config (intptr_t): The tensor SVD config handle. + workspace (intptr_t): The workspace descriptor. + + .. seealso:: `cutensornetWorkspaceComputeSVDSizes` + """ + with nogil: + status = cutensornetWorkspaceComputeSVDSizes( + <_Handle>handle, <_TensorDescriptor>tensor_in, + <_TensorDescriptor>tensor_u, <_TensorDescriptor>tensor_v, + <_TensorSVDConfig>config, + <_WorkspaceDescriptor>workspace) + check_status(status) + + +cpdef tensor_svd( + intptr_t handle, intptr_t tensor_in, intptr_t raw_data_in, + intptr_t tensor_u, intptr_t u, + intptr_t s, + intptr_t tensor_v, intptr_t v, + intptr_t config, intptr_t info, + intptr_t workspace, intptr_t stream): + """Perform SVD decomposition of a tensor. + + Args: + handle (intptr_t): The library handle. + tensor_in (intptr_t): The input tensor descriptor. + raw_data_in (intptr_t): The pointer address (as Python :class:`int`) to the + input tensor (on device). + tensor_u (intptr_t): The tensor descriptor for the output U. + u (intptr_t): The pointer address (as Python :class:`int`) to the output + tensor U (on device). 
+ s (intptr_t): The pointer address (as Python :class:`int`) to the output + array S (on device). + tensor_v (intptr_t): The tensor descriptor for the output V. + v (intptr_t): The pointer address (as Python :class:`int`) to the output + tensor V (on device). + config (intptr_t): The tensor SVD config handle. + info (intptr_t): The tensor SVD info handle. + workspace (intptr_t): The workspace descriptor. + stream (intptr_t): The CUDA stream handle (``cudaStream_t`` as Python + :class:`int`). .. note:: - Users are responsible for managing the lifetime of the underlying path data - (i.e. the validity of the ``data`` pointer). - .. warning:: - The design of how `cutensornetContractionPath_t` is handled in Python is - experimental and subject to change in a future release. + After this function call, the output tensor descriptors ``tensor_u`` and + ``tensor_v`` may have their shapes and strides changed. See the documentation + for further information. + + .. seealso:: `cutensornetTensorSVD` + """ + with nogil: + status = cutensornetTensorSVD( + <_Handle>handle, <_TensorDescriptor>tensor_in, raw_data_in, + <_TensorDescriptor>tensor_u, u, + s, + <_TensorDescriptor>tensor_v, v, + <_TensorSVDConfig>config, <_TensorSVDInfo>info, + <_WorkspaceDescriptor>workspace, stream) + check_status(status) + + +cpdef workspace_compute_qr_sizes( + intptr_t handle, intptr_t tensor_in, intptr_t tensor_q, + intptr_t tensor_r, intptr_t workspace): + """Compute the required workspace sizes for :func:`tensor_qr`. + + Args: + handle (intptr_t): The library handle. + tensor_in (intptr_t): The input tensor descriptor. + tensor_q (intptr_t): The tensor descriptor for the output Q. + tensor_r (intptr_t): The tensor descriptor for the output R. + workspace (intptr_t): The workspace descriptor. + + .. 
seealso:: `cutensornetWorkspaceComputeQRSizes` """ - cdef _ContractionPath* path + with nogil: + status = cutensornetWorkspaceComputeQRSizes( + <_Handle>handle, <_TensorDescriptor>tensor_in, + <_TensorDescriptor>tensor_q, <_TensorDescriptor>tensor_r, + <_WorkspaceDescriptor>workspace) + check_status(status) - def __cinit__(self, int num_contractions, uintptr_t data): - self.path = <_ContractionPath*>PyMem_Malloc(sizeof(_ContractionPath)) - def __dealloc__(self): - PyMem_Free(self.path) +cpdef tensor_qr( + intptr_t handle, intptr_t tensor_in, intptr_t raw_data_in, + intptr_t tensor_q, intptr_t q, + intptr_t tensor_r, intptr_t r, + intptr_t workspace, intptr_t stream): + """Perform QR decomposition of a tensor. - def __init__(self, int num_contractions, uintptr_t data): - """ - __init__(self, int num_contractions, uintptr_t data) - """ - self.path.numContractions = num_contractions - self.path.data = <_NodePair*>data + Args: + handle (intptr_t): The library handle. + tensor_in (intptr_t): The input tensor descriptor. + raw_data_in (intptr_t): The pointer address (as Python :class:`int`) to the + input tensor (on device). + tensor_q (intptr_t): The tensor descriptor for the output Q. + q (intptr_t): The pointer address (as Python :class:`int`) to the output + tensor Q (on device). + tensor_r (intptr_t): The tensor descriptor for the output R. + r (intptr_t): The pointer address (as Python :class:`int`) to the output + tensor R (on device). + workspace (intptr_t): The workspace descriptor. + stream (intptr_t): The CUDA stream handle (``cudaStream_t`` as Python + :class:`int`). - def get_path(self): - """Get the pointer address to the underlying `cutensornetContractionPath_t` struct. + .. 
seealso:: `cutensornetTensorQR` + """ + with nogil: + status = cutensornetTensorQR( + <_Handle>handle, <_TensorDescriptor>tensor_in, raw_data_in, + <_TensorDescriptor>tensor_q, q, + <_TensorDescriptor>tensor_r, r, + <_WorkspaceDescriptor>workspace, stream) + check_status(status) - Returns: - uintptr_t: The pointer address. - """ - return self.path - def get_size(self): - """Get the size of the `cutensornetContractionPath_t` struct. +cpdef workspace_compute_gate_split_sizes( + intptr_t handle, intptr_t tensor_a, intptr_t tensor_b, + intptr_t tensor_g, intptr_t tensor_u, intptr_t tensor_v, + int algo, intptr_t svd_config, int compute_type, + intptr_t workspace): + """Compute the required workspace sizes for :func:`gate_split`. - Returns: - size_t: ``sizeof(cutensornetContractionPath_t)``. - """ - return sizeof(_ContractionPath) + Args: + handle (intptr_t): The library handle. + tensor_a (intptr_t): The tensor descriptor for the input A. + tensor_b (intptr_t): The tensor descriptor for the input B. + tensor_g (intptr_t): The tensor descriptor for the input G (the gate). + tensor_u (intptr_t): The tensor descriptor for the output U. + tensor_v (intptr_t): The tensor descriptor for the output V. + algo (cuquantum.cutensornet.GateSplitAlgo): The gate splitting algorithm. + svd_config (intptr_t): The tensor SVD config handle. + compute_type (cuquantum.ComputeType): The compute type of the + computation. + workspace (intptr_t): The workspace descriptor. + + .. 
seealso:: `cutensornetWorkspaceComputeGateSplitSizes` + """ + with nogil: + status = cutensornetWorkspaceComputeGateSplitSizes( + <_Handle>handle, <_TensorDescriptor>tensor_a, + <_TensorDescriptor>tensor_b, <_TensorDescriptor>tensor_g, + <_TensorDescriptor>tensor_u, <_TensorDescriptor>tensor_v, + <_GateSplitAlgo>algo, <_TensorSVDConfig>svd_config, + <_ComputeType>compute_type, <_WorkspaceDescriptor>workspace) + check_status(status) + + +cpdef gate_split( + intptr_t handle, intptr_t tensor_a, intptr_t raw_data_a, + intptr_t tensor_b, intptr_t raw_data_b, + intptr_t tensor_g, intptr_t raw_data_g, + intptr_t tensor_u, intptr_t u, + intptr_t s, + intptr_t tensor_v, intptr_t v, + int algo, intptr_t svd_config, int compute_type, + intptr_t svd_info, intptr_t workspace, intptr_t stream): + """Perform gate split operation. + + Args: + handle (intptr_t): The library handle. + tensor_a (intptr_t): The tensor descriptor for the input A. + raw_data_a (intptr_t): The pointer address (as Python :class:`int`) to the + input tensor A (on device). + tensor_b (intptr_t): The tensor descriptor for the input B. + raw_data_b (intptr_t): The pointer address (as Python :class:`int`) to the + input tensor B (on device). + tensor_g (intptr_t): The tensor descriptor for the input G (the gate). + raw_data_g (intptr_t): The pointer address (as Python :class:`int`) to the + gate tensor G (on device). + tensor_u (intptr_t): The tensor descriptor for the output U. + u (intptr_t): The pointer address (as Python :class:`int`) to the output + tensor U (on device). + s (intptr_t): The pointer address (as Python :class:`int`) to the output + array S (on device). + tensor_v (intptr_t): The tensor descriptor for the output V. + v (intptr_t): The pointer address (as Python :class:`int`) to the output + tensor V (on device). + algo (cuquantum.cutensornet.GateSplitAlgo): The gate splitting algorithm. + svd_config (intptr_t): The tensor SVD config handle. 
+ compute_type (cuquantum.ComputeType): The compute type of the + computation. + svd_info (intptr_t): The tensor SVD info handle. + workspace (intptr_t): The workspace descriptor. + stream (intptr_t): The CUDA stream handle (``cudaStream_t`` as Python + :class:`int`). + + .. note:: + + After this function call, the output tensor descriptors ``tensor_u`` and + ``tensor_v`` may have their shapes and strides changed. See the documentation + for further information. + + .. seealso:: `cutensornetGateSplit` + """ + with nogil: + status = cutensornetGateSplit( + <_Handle>handle, + <_TensorDescriptor>tensor_a, raw_data_a, + <_TensorDescriptor>tensor_b, raw_data_b, + <_TensorDescriptor>tensor_g, raw_data_g, + <_TensorDescriptor>tensor_u, u, + s, + <_TensorDescriptor>tensor_v, v, + <_GateSplitAlgo>algo, <_TensorSVDConfig>svd_config, + <_ComputeType>compute_type, <_TensorSVDInfo>svd_info, + <_WorkspaceDescriptor>workspace, stream) + check_status(status) + + +cpdef distributed_reset_configuration( + intptr_t handle, intptr_t comm_ptr, size_t comm_size): + """Reset the distributed communicator. + + Args: + handle (intptr_t): The library handle. + comm_ptr (intptr_t): The pointer to the provided communicator. + comm_size (size_t): The size of the provided communicator + (``sizeof(comm)``). + + .. note:: For using MPI communicators from mpi4py, the helper function + :func:`~cuquantum.cutensornet.get_mpi_comm_pointer` can be used: + + .. code-block:: python + + cutn.distributed_reset_configuration(handle, *get_mpi_comm_pointer(comm)) + + .. seealso:: `cutensornetDistributedResetConfiguration` + """ + with nogil: + status = cutensornetDistributedResetConfiguration( + <_Handle>handle, comm_ptr, comm_size) + check_status(status) + + +cpdef int distributed_get_num_ranks(intptr_t handle) except -1: + """Get the number of distributed ranks. + + Args: + handle (intptr_t): The library handle. + + .. 
seealso:: `cutensornetDistributedGetNumRanks` + """ + cdef int rank + with nogil: + status = cutensornetDistributedGetNumRanks( + <_Handle>handle, &rank) + check_status(status) + return rank + + +cpdef int distributed_get_proc_rank(intptr_t handle) except -1: + """Get the current process rank. + + Args: + handle (intptr_t): The library handle. + + .. seealso:: `cutensornetDistributedGetProcRank` + """ + cdef int rank + with nogil: + status = cutensornetDistributedGetProcRank( + <_Handle>handle, &rank) + check_status(status) + return rank + + +cpdef distributed_synchronize(intptr_t handle): + """Synchronize the distributed communicator. + + Args: + handle (intptr_t): The library handle. + + .. seealso:: `cutensornetDistributedSynchronize` + """ + with nogil: + status = cutensornetDistributedSynchronize(<_Handle>handle) + check_status(status) class GraphAlgo(IntEnum): @@ -1741,6 +2425,7 @@ class ContractionOptimizerInfoAttribute(IntEnum): NUM_INTERMEDIATE_MODES = CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_INTERMEDIATE_MODES EFFECTIVE_FLOPS_EST = CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_EFFECTIVE_FLOPS_EST RUNTIME_EST = CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_RUNTIME_EST + SLICING_CONFIG = CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_SLICING_CONFIG class ContractionAutotunePreferenceAttribute(IntEnum): """See `cutensornetContractionAutotunePreferenceAttributes_t`.""" @@ -1756,6 +2441,39 @@ class WorksizePref(IntEnum): class Memspace(IntEnum): """See `cutensornetMemspace_t`.""" DEVICE = CUTENSORNET_MEMSPACE_DEVICE + HOST = CUTENSORNET_MEMSPACE_HOST + +class TensorSVDConfigAttribute(IntEnum): + """See `cutensornetTensorSVDConfigAttributes_t`.""" + ABS_CUTOFF = CUTENSORNET_TENSOR_SVD_CONFIG_ABS_CUTOFF + REL_CUTOFF = CUTENSORNET_TENSOR_SVD_CONFIG_REL_CUTOFF + S_NORMALIZATION = CUTENSORNET_TENSOR_SVD_CONFIG_S_NORMALIZATION + S_PARTITION = CUTENSORNET_TENSOR_SVD_CONFIG_S_PARTITION + +class TensorSVDNormalization(IntEnum): + """See `cutensornetTensorSVDNormalization_t`.""" + NONE = 
CUTENSORNET_TENSOR_SVD_NORMALIZATION_NONE + L1 = CUTENSORNET_TENSOR_SVD_NORMALIZATION_L1 + L2 = CUTENSORNET_TENSOR_SVD_NORMALIZATION_L2 + LINF = CUTENSORNET_TENSOR_SVD_NORMALIZATION_LINF + +class TensorSVDPartition(IntEnum): + """See `cutensornetTensorSVDPartition_t`.""" + NONE = CUTENSORNET_TENSOR_SVD_PARTITION_NONE + US = CUTENSORNET_TENSOR_SVD_PARTITION_US + SV = CUTENSORNET_TENSOR_SVD_PARTITION_SV + UV_EQUAL = CUTENSORNET_TENSOR_SVD_PARTITION_UV_EQUAL + +class TensorSVDInfoAttribute(IntEnum): + """See `cutensornetTensorSVDInfoAttributes_t`.""" + FULL_EXTENT = CUTENSORNET_TENSOR_SVD_INFO_FULL_EXTENT + REDUCED_EXTENT = CUTENSORNET_TENSOR_SVD_INFO_REDUCED_EXTENT + DISCARDED_WEIGHT = CUTENSORNET_TENSOR_SVD_INFO_DISCARDED_WEIGHT + +class GateSplitAlgo(IntEnum): + """See `cutensornetGateSplitAlgo_t`.""" + DIRECT = CUTENSORNET_GATE_SPLIT_ALGO_DIRECT + REDUCED = CUTENSORNET_GATE_SPLIT_ALGO_REDUCED del IntEnum @@ -1766,6 +2484,13 @@ MINOR_VER = CUTENSORNET_MINOR PATCH_VER = CUTENSORNET_PATCH VERSION = CUTENSORNET_VERSION +# numpy dtypes +tensor_qualifiers_dtype = _numpy.dtype( + {'names':('is_conjugate', ), + 'formats': (_numpy.int32, ), + 'itemsize': sizeof(_TensorQualifiers), + }, align=True +) # who owns a reference to user-provided Python objects (k: owner, v: object) cdef dict owner_pyobj = {} diff --git a/python/cuquantum/cutensornet/memory.py b/python/cuquantum/cutensornet/memory.py index 97fd19d..0c9925d 100644 --- a/python/cuquantum/cutensornet/memory.py +++ b/python/cuquantum/cutensornet/memory.py @@ -13,6 +13,9 @@ import cupy as cp +from ._internal import utils + + class MemoryPointer: """ An RAII class for a device memory buffer. 
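The attribute getters and setters added earlier in this diff (for example `tensor_svd_config_get_attribute`) take a pointer address and a byte size instead of a Python value. The sketch below illustrates that calling convention with `ctypes` only. `fake_get_attribute` is a hypothetical stand-in, not the real binding, and the 64-bit float is an assumption chosen to mirror the `ABS_CUTOFF` entry in `tensor_svd_cfg_sizes`.

```python
import ctypes


def fake_get_attribute(buf, size):
    """Hypothetical stand-in for a C getter: writes a double to address ``buf``."""
    if size != ctypes.sizeof(ctypes.c_double):
        raise ValueError("size does not match the attribute's item size")
    # Reinterpret the integer address as a pointer to double and store a value.
    ctypes.cast(buf, ctypes.POINTER(ctypes.c_double))[0] = 1e-8


# The caller owns the buffer and passes its address and its size in bytes.
value = ctypes.c_double(0.0)
fake_get_attribute(ctypes.addressof(value), ctypes.sizeof(value))
result = value.value
```

With the real bindings, the buffer type would presumably come from the dtype returned by `tensor_svd_config_get_attribute_dtype`, so that the size passed matches what the library expects.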
@@ -84,20 +87,20 @@ def __init__(self, device_id, logger): """ __init__(device_id) """ - self.device = cp.cuda.Device(device_id) + self.device_id = device_id self.logger = logger def memalloc(self, size): - with self.device: + with utils.device_ctx(self.device_id): device_ptr = cp.cuda.runtime.malloc(size) self.logger.debug(f"_RawCUDAMemoryManager (allocate memory): size = {size}, ptr = {device_ptr}, " - f"device = {self.device}, stream={cp.cuda.get_current_stream()}") + f"device = {self.device_id}, stream={cp.cuda.get_current_stream()}") def create_finalizer(): def finalizer(): - with self.device: - cp.cuda.runtime.free(device_ptr) + # Note: With UVA there is no need to switch context to the device the memory belongs to before calling free(). + cp.cuda.runtime.free(device_ptr) self.logger.debug(f"_RawCUDAMemoryManager (release memory): ptr = {device_ptr}") return finalizer @@ -117,16 +120,16 @@ def __init__(self, device_id, logger): """ __init__(device_id) """ - self.device = cp.cuda.Device(device_id) + self.device_id = device_id self.logger = logger def memalloc(self, size): - with self.device: + with utils.device_ctx(self.device_id): cp_mem_ptr = cp.cuda.alloc(size) device_ptr = cp_mem_ptr.ptr self.logger.debug(f"_CupyCUDAMemoryManager (allocate memory): size = {size}, ptr = {device_ptr}, " - f"device = {self.device}, stream={cp.cuda.get_current_stream()}") + f"device = {self.device_id}, stream={cp.cuda.get_current_stream()}") return cp_mem_ptr @@ -165,4 +168,3 @@ def finalizer(): _MEMORY_MANAGER = {'_raw' : _RawCUDAMemoryManager, 'cupy' : _CupyCUDAMemoryManager, 'torch' : _TorchCUDAMemoryManager} - diff --git a/python/cuquantum/cutensornet/tensor_network.py b/python/cuquantum/cutensornet/tensor_network.py index df5c6c7..81cb583 100644 --- a/python/cuquantum/cutensornet/tensor_network.py +++ b/python/cuquantum/cutensornet/tensor_network.py @@ -10,10 +10,7 @@ import collections import dataclasses -import functools import logging -import os -import sys import cupy 
as cp import numpy as np @@ -71,12 +68,14 @@ class Network: >>> logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)-8s %(message)s', datefmt='%m-%d %H:%M:%S') Args: - subscripts : The mode labels (subscripts) defining the Einstein summation expression as a comma-separated sequence of + subscripts: The mode labels (subscripts) defining the Einstein summation expression as a comma-separated sequence of characters. Unicode characters are allowed in the expression thereby expanding the size of the tensor network that can be specified using the Einstein summation convention. - operands : A sequence of tensors (ndarray-like objects). The currently supported types are :class:`numpy.ndarray`, + operands: A sequence of tensors (ndarray-like objects). The currently supported types are :class:`numpy.ndarray`, :class:`cupy.ndarray`, and :class:`torch.Tensor`. - options : Specify options for the tensor network as a :class:`~cuquantum.NetworkOptions` object. Alternatively, a `dict` + qualifiers: Specify the tensor qualifiers as a :class:`numpy.ndarray` of :class:`~cuquantum.tensor_qualifiers_dtype` objects + of length equal to the number of operands. + options: Specify options for the tensor network as a :class:`~cuquantum.NetworkOptions` object. Alternatively, a `dict` containing the parameters for the ``NetworkOptions`` constructor can also be provided. If not specified, the value will be set to the default-constructed ``NetworkOptions`` object. @@ -167,7 +166,7 @@ class Network: as specifying options for the tensor network and the optimizer. """ - def __init__(self, *operands, options=None): + def __init__(self, *operands, qualifiers=None, options=None): """ __init__(subscripts, *operands, options=None) """ @@ -192,6 +191,13 @@ def __init__(self, *operands, options=None): self.device_id = options.device_id self.operands = tensor_wrapper.to(self.operands, self.device_id) + # Set blocking or non-blocking behavior. 
+ self.blocking = self.options.blocking is True or self.network_location == 'cpu' + if self.blocking: + self.call_prologue = "This call is blocking and will return only after the operation is complete." + else: + self.call_prologue = "This call is non-blocking and will return immediately after the operation is launched on the device." + # Infer the library package the operands belong to. self.package = utils.get_operands_package(self.operands) @@ -222,12 +228,13 @@ def __init__(self, *operands, options=None): extents_in = tuple(o.shape for o in self.operands) strides_in = tuple(o.strides for o in self.operands) - self.operands_data, alignments_in = utils.get_operands_data(self.operands) + self.operands_data = utils.get_operands_data(self.operands) modes_in = tuple(tuple(m for m in _input) for _input in self.inputs) num_modes_in = tuple(len(m) for m in modes_in) + self.qualifiers_in = utils.check_tensor_qualifiers(qualifiers, cutn.tensor_qualifiers_dtype, num_inputs) - self.contraction, modes_out, extents_out, strides_out, alignment_out = utils.create_output_tensor( - self.output_class, self.package, self.output, self.size_dict, self.device, self.data_type) + self.contraction, modes_out, extents_out, strides_out = utils.create_output_tensor( + self.output_class, self.package, self.output, self.size_dict, self.device_id, self.data_type) # Create/set handle. if options.handle is not None: @@ -235,19 +242,19 @@ def __init__(self, *operands, options=None): self.handle = options.handle else: self.own_handle = True - with self.device: + with utils.device_ctx(self.device_id): self.handle = cutn.create() # Network definition. 
self.network = cutn.create_network_descriptor(self.handle, num_inputs, - num_modes_in, extents_in, strides_in, modes_in, alignments_in, # inputs - num_modes_out, extents_out, strides_out, modes_out, alignment_out, # output + num_modes_in, extents_in, strides_in, modes_in, self.qualifiers_in, # inputs + num_modes_out, extents_out, strides_out, modes_out, # output typemaps.NAME_TO_DATA_TYPE[self.data_type], self.compute_type) # Keep output extents for creating new tensors, if needed. self.extents_out = extents_out - # Path optimization atributes. + # Path optimization attributes. self.optimizer_config_ptr, self.optimizer_info_ptr = None, None self.optimized = False @@ -257,11 +264,16 @@ def __init__(self, *operands, options=None): # Contraction plan attributes. self.plan = None + self.planned = False # Autotuning attributes. self.autotune_pref_ptr = None self.autotuned = False + # Attributes to establish stream ordering. + self.workspace_stream = None + self.last_compute_event = None + self.valid_state = True self.logger.info("The network has been created.") @@ -285,6 +297,13 @@ def _check_optimized(self, *args, **kwargs): if not self.optimized: raise RuntimeError(f"{what} cannot be performed before contract_path() has been called.") + def _check_planned(self, *args, **kwargs): + """ + """ + what = kwargs['what'] + if not self.planned: + raise RuntimeError(f"Internal Error: {what} cannot be performed before planning has been done.") + def _free_plan_resources(self, exception=None): """ Free resources allocated in network contraction planning. 
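The `_check_planned` method above is meant to be used through `utils.precondition`, which this diff does not show. The following is a minimal sketch of how such a guard decorator could work, assuming the checker receives the instance plus a `what` keyword. `precondition` and `ToyNetwork` here are hypothetical illustrations, not the library's implementation.

```python
import functools


def precondition(checker, what=None):
    """Hypothetical decorator: run ``checker`` before the wrapped method."""
    def outer(method):
        @functools.wraps(method)
        def inner(self, *args, **kwargs):
            # The checker raises if the object is not in the required state.
            checker(self, what=what)
            return method(self, *args, **kwargs)
        return inner
    return outer


class ToyNetwork:
    def __init__(self):
        self.planned = False

    def _check_planned(self, *args, **kwargs):
        what = kwargs['what']
        if not self.planned:
            raise RuntimeError(
                f"Internal Error: {what} cannot be performed before planning has been done.")

    @precondition(_check_planned, "Contraction")
    def contract(self):
        return "contracted"
```

Calling `ToyNetwork().contract()` raises `RuntimeError` until `planned` is set to `True`, which is the behavior the new `planned` flag appears intended to enforce for `autotune()` and `contract()`.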
@@ -327,21 +346,22 @@ def _free_path_resources(self, exception=None): @utils.precondition(_check_valid_network) @utils.precondition(_check_optimized, "Workspace memory allocation") @utils.atomic(_free_workspace_memory, method=True) - def _allocate_workspace_memory_perhaps(self, stream_ctx): + def _allocate_workspace_memory_perhaps(self, stream, stream_ctx): if self.workspace_ptr is not None: return assert self.workspace_size is not None, "Internal Error." self.logger.debug("Allocating memory for contracting the tensor network...") - with self.device, stream_ctx: + with utils.device_ctx(self.device_id), stream_ctx: try: self.workspace_ptr = self.allocator.memalloc(self.workspace_size) except TypeError as e: message = "The method 'memalloc' in the allocator object must conform to the interface in the "\ "'BaseCUDAMemoryManager' protocol." raise TypeError(message) from e - self.logger.debug(f"Finished allocating memory of size {formatters.MemoryStr(self.workspace_size)} for contraction.") + self.workspace_stream = stream + self.logger.debug(f"Finished allocating memory of size {formatters.MemoryStr(self.workspace_size)} for contraction in the context of stream {self.workspace_stream}.") device_ptr = utils.get_ptr_from_memory_pointer(self.workspace_ptr) cutn.workspace_set(self.handle, self.workspace_desc, cutn.Memspace.DEVICE, device_ptr, self.workspace_size) @@ -357,7 +377,7 @@ def _calculate_workspace_size(self): # Release workspace already allocated, if any, because the new requirements are likely different. 
self.workspace_ptr = None - cutn.workspace_compute_sizes(self.handle, self.network, self.optimizer_info_ptr, self.workspace_desc) + cutn.workspace_compute_contraction_sizes(self.handle, self.network, self.optimizer_info_ptr, self.workspace_desc) min_size = cutn.workspace_get_size(self.handle, self.workspace_desc, cutn.WorksizePref.MIN, cutn.Memspace.DEVICE) max_size = cutn.workspace_get_size(self.handle, self.workspace_desc, cutn.WorksizePref.MAX, cutn.Memspace.DEVICE) @@ -376,7 +396,6 @@ def _calculate_workspace_size(self): # Set workspace size to enable contraction planning. The device pointer will be set later during allocation. cutn.workspace_set(self.handle, self.workspace_desc, cutn.Memspace.DEVICE, 0, self.workspace_size) - @utils.precondition(_check_valid_network) @utils.precondition(_check_optimized, "Planning") @utils.atomic(_free_plan_resources, method=True) @@ -456,10 +475,16 @@ def _set_optimizer_options(self, optimize): enum = ConfEnum.SEED self._set_opt_config_option('seed', enum, optimize.seed) + enum = ConfEnum.COST_FUNCTION_OBJECTIVE + self._set_opt_config_option('cost_function', enum, optimize.cost_function) + @utils.precondition(_check_valid_network) @utils.atomic(_free_path_resources, method=True) - def contract_path(self, optimize=None): - """Compute the best contraction path together with any slicing that is needed to ensure that the contraction can be + def contract_path(self, optimize=None, **kwargs): + """ + contract_path(optimize=None) + + Compute the best contraction path together with any slicing that is needed to ensure that the contraction can be performed within the specified memory limit. 
Args: @@ -481,6 +506,10 @@ def contract_path(self, optimize=None): optimize = utils.check_or_create_options(configuration.OptimizerOptions, optimize, "path optimizer options") + internal_options = dict() + internal_options['create_plan'] = utils.Value(True, validator=lambda v: isinstance(v, bool)) + utils.check_and_set_options(internal_options, kwargs) + if self.optimizer_config_ptr is None: self.optimizer_config_ptr = cutn.create_contraction_optimizer_config(self.handle) if self.optimizer_info_ptr is None: @@ -488,6 +517,10 @@ def contract_path(self, optimize=None): opt_info_ifc = optimizer_ifc.OptimizerInfoInterface(self) + # Special case worth optimizing, as it's an extremely common use case with a trivial path + if len(self.operands) == 2: + optimize.path = [(0, 1)] + # Compute path (or set provided path). if isinstance(optimize.path, configuration.PathFinderOptions): # Set optimizer options. @@ -525,11 +558,15 @@ def contract_path(self, optimize=None): self.optimized = True - # Calculate workspace size required. - self._calculate_workspace_size() + if internal_options['create_plan']: + # Calculate workspace size required. + self._calculate_workspace_size() - # Create plan. - self._create_plan() + # Create plan. + self._create_plan() + self.planned = True + else: + self.planned = False return opt_info.path, opt_info @@ -566,6 +603,7 @@ def _set_autotune_option(self, name, enum, value): @utils.precondition(_check_valid_network) @utils.precondition(_check_optimized, "Autotuning") + @utils.precondition(_check_planned, "Autotuning") def autotune(self, *, iterations=3, stream=None): """Autotune the network to reduce the contraction cost. @@ -588,24 +626,25 @@ def autotune(self, *, iterations=3, stream=None): self._set_autotune_options(options) # Allocate device memory (in stream context) if needed. 
- stream, stream_ctx, stream_ptr = utils.get_or_create_stream(self.device, stream, self.package) - self._allocate_workspace_memory_perhaps(stream_ctx) + stream, stream_ctx, stream_ptr = utils.get_or_create_stream(self.device_id, stream, self.package) + self._allocate_workspace_memory_perhaps(stream, stream_ctx) # Check if we still hold an output tensor; if not, create a new one. if self.contraction is None: self.contraction = utils.create_empty_tensor(self.output_class, self.extents_out, self.data_type, self.device_id, stream_ctx) + timing = bool(self.logger and self.logger.handlers) self.logger.info(f"Starting autotuning...") - with self.device: - start = stream.record() + self.logger.info(f"{self.call_prologue}") + with utils.device_ctx(self.device_id), utils.cuda_call_ctx(stream, self.blocking, timing) as (self.last_compute_event, elapsed): cutn.contraction_autotune(self.handle, self.plan, self.operands_data, self.contraction.data_ptr, self.workspace_desc, self.autotune_pref_ptr, stream_ptr) - end = stream.record() - end.synchronize() - elapsed = cp.cuda.get_elapsed_time(start, end) + + if elapsed.data is not None: + self.logger.info(f"The autotuning took {elapsed.data:.3f} ms to complete.") self.autotuned = True - self.logger.info(f"The autotuning took {elapsed:.3f} ms to complete.") + @utils.precondition(_check_valid_network) def reset_operands(self, *operands): @@ -618,7 +657,7 @@ def reset_operands(self, *operands): - The shapes, strides, datatypes match those of the old ones. - The packages that the operands belong to match those of the old ones. - - If input tensors are on GPU, the library package, device, and alignments must match. + - If input tensors are on GPU, the library package and device must match. Args: operands: See :class:`Network`'s documentation. 
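The autotuning hunk above replaces explicit `stream.record()` calls with a context manager that yields an `elapsed` holder whose `.data` is only filled in when timing is enabled. The sketch below shows that pattern in isolation. It is an assumption-laden analogue: `timed_call_ctx` and `Holder` are hypothetical names, and host wall-clock time from `time.perf_counter` stands in for CUDA events.

```python
import contextlib
import time


class Holder:
    """Mutable container so a result can be filled in after the ``with`` body runs."""
    def __init__(self, data=None):
        self.data = data


@contextlib.contextmanager
def timed_call_ctx(timing):
    """Hypothetical analogue of the call context: yield a holder, fill it on exit."""
    elapsed = Holder()
    start = time.perf_counter() if timing else None
    yield elapsed
    if timing:
        # Convert seconds to milliseconds, matching the log message format.
        elapsed.data = (time.perf_counter() - start) * 1000.0


with timed_call_ctx(timing=True) as elapsed:
    sum(range(1000))  # placeholder for the real work

if elapsed.data is not None:
    print(f"The call took {elapsed.data:.3f} ms to complete.")
```

Yielding a holder rather than a number lets the caller bind the name before the measurement exists, which is why the diff checks `elapsed.data is not None` before logging.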
@@ -650,16 +689,13 @@ def reset_operands(self, *operands): raise ValueError(f"The new operands must be on the same device ({device_id}) as the original operands " f"({self.device_id}).") - _, orig_alignments = utils.get_operands_data(self.operands) - new_operands_data, new_alignments = utils.get_operands_data(operands) - utils.check_alignments_match(orig_alignments, new_alignments) - # Finally, replace the original data pointers by the new ones. - self.operands_data = new_operands_data + self.operands_data = utils.get_operands_data(operands) self.logger.info("The operands have been reset.") @utils.precondition(_check_valid_network) @utils.precondition(_check_optimized, "Contraction") + @utils.precondition(_check_planned, "Contraction") def contract(self, *, slices=None, stream=None): """Contract the network and return the result. @@ -675,8 +711,8 @@ def contract(self, *, slices=None, stream=None): """ # Allocate device memory (in stream context) if needed. - stream, stream_ctx, stream_ptr = utils.get_or_create_stream(self.device, stream, self.package) - self._allocate_workspace_memory_perhaps(stream_ctx) + stream, stream_ctx, stream_ptr = utils.get_or_create_stream(self.device_id, stream, self.package) + self._allocate_workspace_memory_perhaps(stream, stream_ctx) # Check if we still hold an output tensor; if not, create a new one. if self.contraction is None: @@ -697,16 +733,15 @@ def contract(self, *, slices=None, stream=None): message = f"The provided 'slices' must be a range object or a sequence object. The object type is {type(slices)}." 
raise TypeError(message) + timing = bool(self.logger and self.logger.handlers) self.logger.info("Starting network contraction...") - with self.device: - start = stream.record() + self.logger.info(f"{self.call_prologue}") + with utils.device_ctx(self.device_id), utils.cuda_call_ctx(stream, self.blocking, timing) as (self.last_compute_event, elapsed): cutn.contract_slices(self.handle, self.plan, self.operands_data, self.contraction.data_ptr, False, self.workspace_desc, slice_group, stream_ptr) - end = stream.record() - end.synchronize() - elapsed = cp.cuda.get_elapsed_time(start, end) - self.logger.info(f"The contraction took {elapsed:.3f} ms to complete.") + if elapsed.data is not None: + self.logger.info(f"The contraction took {elapsed.data:.3f} ms to complete.") # Destroy slice group, if created. if slice_group != 0: @@ -718,6 +753,7 @@ def contract(self, *, slices=None, stream=None): else: out = self.contraction.tensor self.contraction = None # We cannot overwrite what we've already handed to users. + return out def free(self): @@ -731,6 +767,10 @@ def free(self): return try: + # Future operations on the workspace stream should be ordered after the computation. + if self.last_compute_event is not None: + self.workspace_stream.wait_event(self.last_compute_event) + self._free_path_resources() if self.autotune_pref_ptr is not None: @@ -759,7 +799,7 @@ def free(self): self.logger.info("The network resources have been released.") -def contract(*operands, options=None, optimize=None, stream=None, return_info=False): +def contract(*operands, qualifiers=None, options=None, optimize=None, stream=None, return_info=False): r""" contract(subscripts, *operands, options=None, optimize=None, stream=None, return_info=False) @@ -775,6 +815,8 @@ def contract(*operands, options=None, optimize=None, stream=None, return_info=Fa can be specified using the Einstein summation convention. operands : A sequence of tensors (ndarray-like objects). 
The currently supported types are :class:`numpy.ndarray`, :class:`cupy.ndarray`, and :class:`torch.Tensor`. + qualifiers: Specify the tensor qualifiers as a :class:`numpy.ndarray` of :class:`~cuquantum.tensor_qualifiers_dtype` objects + of length equal to the number of operands. options : Specify options for the tensor network as a :class:`~cuquantum.NetworkOptions` object. Alternatively, a `dict` containing the parameters for the ``NetworkOptions`` constructor can also be provided. If not specified, the value will be set to the default-constructed ``NetworkOptions`` object. @@ -796,14 +838,15 @@ def contract(*operands, options=None, optimize=None, stream=None, return_info=Fa .. code-block:: python - from cuquantum import cutensornet, NetworkOptions, contract + from cuquantum import cutensornet as cutn + from cuquantum import contract, NetworkOptions - handle = cutensornet.create() + handle = cutn.create() network_opts = NetworkOptions(handle=handle, ...) out = contract(..., options=network_opts, ...) # ... the same handle can be reused for further calls ... # when it's done, remember to destroy the handle - cutensornet.destroy(handle) + cutn.destroy(handle) Examples: @@ -896,12 +939,8 @@ def contract(*operands, options=None, optimize=None, stream=None, return_info=Fa >>> r = contract('ij,jk', a, b) """ - options = utils.check_or_create_options(configuration.NetworkOptions, options, "network options") - - optimize = utils.check_or_create_options(configuration.OptimizerOptions, optimize, "path optimizer options") - # Create network. - with Network(*operands, options=options) as network: + with Network(*operands, qualifiers=qualifiers, options=options) as network: # Compute path. 
opt_info = network.contract_path(optimize=optimize) @@ -917,7 +956,7 @@ def contract(*operands, options=None, optimize=None, stream=None, return_info=Fa return output -def contract_path(*operands, options=None, optimize=None): +def contract_path(*operands, qualifiers=None, options=None, optimize=None): """ contract_path(subscripts, *operands, options=None, optimize=None) @@ -933,6 +972,8 @@ def contract_path(*operands, options=None, optimize=None): can be specified using the Einstein summation convention. operands : A sequence of tensors (ndarray-like objects). The currently supported types are :class:`numpy.ndarray`, :class:`cupy.ndarray`, and :class:`torch.Tensor`. + qualifiers: Specify the tensor qualifiers as a :class:`numpy.ndarray` of :class:`~cuquantum.tensor_qualifiers_dtype` objects + of length equal to the number of operands. options : Specify options for the tensor network as a :class:`~cuquantum.NetworkOptions` object. Alternatively, a `dict` containing the parameters for the ``NetworkOptions`` constructor can also be provided. If not specified, the value will be set to the default-constructed ``NetworkOptions`` object. @@ -952,26 +993,23 @@ def contract_path(*operands, options=None, optimize=None): .. code-block:: python - from cuquantum import cutensornet, NetworkOptions, contract_path + from cuquantum import cutensornet as cutn + from cuquantum import contract_path, NetworkOptions - handle = cutensornet.create() + handle = cutn.create() network_opts = NetworkOptions(handle=handle, ...) path, info = contract_path(..., options=network_opts, ...) # ... the same handle can be reused for further calls ... # when it's done, remember to destroy the handle - cutensornet.destroy(handle) + cutn.destroy(handle) """ - options = utils.check_or_create_options(configuration.NetworkOptions, options, "network options") - - optimize = utils.check_or_create_options(configuration.OptimizerOptions, optimize, "path optimizer options") - # Create network.
- with Network(*operands, options=options) as network: + with Network(*operands, qualifiers=qualifiers, options=options) as network: # Compute path. - path, opt_info = network.contract_path(optimize=optimize) + path, opt_info = network.contract_path(optimize=optimize, create_plan=False) return path, opt_info @@ -1102,6 +1140,6 @@ def einsum_path(*operands, optimize=True): with Network(*operands) as network: # Compute path. - path, opt_info = network.contract_path() + path, opt_info = network.contract_path(create_plan=False) return ['einsum_path', *path], str(opt_info) diff --git a/python/pyproject.toml b/python/pyproject.toml new file mode 100644 index 0000000..4f47793 --- /dev/null +++ b/python/pyproject.toml @@ -0,0 +1,13 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + + +[build-system] +# Ideally we wanna add cuquantum to this list too, but its version +# constraint needs to be dynamically determined, and setuptools' +# support for dynamic dependencies is still on beta, so we use a +# custom PEP-517 backend to handle that instead. 
+requires = ["Cython>=0.29.22,<3", "packaging", "setuptools>=61.0.0", "wheel"] +build-backend = "pep517" +backend-path = ["builder"] diff --git a/python/samples/cutensornet/approxTN/gate_split_example.py b/python/samples/cutensornet/approxTN/gate_split_example.py new file mode 100644 index 0000000..6816685 --- /dev/null +++ b/python/samples/cutensornet/approxTN/gate_split_example.py @@ -0,0 +1,231 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import cupy as cp +import numpy as np + +import cuquantum +from cuquantum import cutensornet as cutn + + +print("cuTensorNet-vers:", cutn.get_version()) +dev = cp.cuda.Device() # get current device +props = cp.cuda.runtime.getDeviceProperties(dev.id) +print("===== device info ======") +print("GPU-name:", props["name"].decode()) +print("GPU-clock:", props["clockRate"]) +print("GPU-memoryClock:", props["memoryClockRate"]) +print("GPU-nSM:", props["multiProcessorCount"]) +print("GPU-major:", props["major"]) +print("GPU-minor:", props["minor"]) +print("========================") + +################################################################################### +# Gate Split: A_{i,j,k,l} B_{k,o,p,q} G_{m,n,l,o}-> A'_{i,j,x,m} S_{x} B'_{x,n,p,q} +################################################################################### + +data_type = cuquantum.cudaDataType.CUDA_R_32F +compute_type = cuquantum.ComputeType.COMPUTE_32F + +# Create an array of modes + +modes_A_in = [ord(c) for c in ('i','j','k','l')] # input +modes_B_in = [ord(c) for c in ('k','o','p','q')] +modes_G_in = [ord(c) for c in ('m','n','l','o')] + +modes_A_out = [ord(c) for c in ('i','j','x','m')] # output +modes_B_out = [ord(c) for c in ('x','n','p','q')] + +# Create an array of extent (shapes) for each tensor +extent_A_in = (16, 16, 16, 2) +extent_B_in = (16, 2, 16, 16) +extent_G_in = (2, 2, 2, 2) + +shared_extent_out = 16 # truncate shared extent to 16 +extent_A_out = (16, 16, shared_extent_out, 
2) +extent_B_out = (shared_extent_out, 2, 16, 16) + +############################ +# Allocate & initialize data +############################ +cp.random.seed(1) +A_in_d = cp.random.random(extent_A_in, dtype=np.float32).astype(np.float32, order='F') # we use fortran layout throughout this example +B_in_d = cp.random.random(extent_B_in, dtype=np.float32).astype(np.float32, order='F') +G_in_d = cp.random.random(extent_G_in, dtype=np.float32).astype(np.float32, order='F') + +A_out_d = cp.empty(extent_A_out, dtype=np.float32, order='F') +S_out_d = cp.empty(shared_extent_out, dtype=np.float32) +B_out_d = cp.empty(extent_B_out, dtype=np.float32, order='F') + +print("Allocate memory for data and initialize data.") + +free_mem, total_mem = dev.mem_info +worksize = free_mem *.7 + +############# +# cuTensorNet +############# + +stream = cp.cuda.Stream() +handle = cutn.create() + +nmode_A_in = len(modes_A_in) +nmode_B_in = len(modes_B_in) +nmode_G_in = len(modes_G_in) +nmode_A_out = len(modes_A_out) +nmode_B_out = len(modes_B_out) + +############################### +# Create tensor descriptors +############################### + +# strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout +strides = 0 +desc_tensor_A_in = cutn.create_tensor_descriptor(handle, nmode_A_in, extent_A_in, strides, modes_A_in, data_type) +desc_tensor_B_in = cutn.create_tensor_descriptor(handle, nmode_B_in, extent_B_in, strides, modes_B_in, data_type) +desc_tensor_G_in = cutn.create_tensor_descriptor(handle, nmode_G_in, extent_G_in, strides, modes_G_in, data_type) + +desc_tensor_A_out = cutn.create_tensor_descriptor(handle, nmode_A_out, extent_A_out, strides, modes_A_out, data_type) +desc_tensor_B_out = cutn.create_tensor_descriptor(handle, nmode_B_out, extent_B_out, strides, modes_B_out, data_type) + +######################################## +# Setup gate split truncation parameters +######################################## + +svd_config = 
cutn.create_tensor_svd_config(handle) +absCutoff_dtype = cutn.tensor_svd_config_get_attribute_dtype(cutn.TensorSVDConfigAttribute.ABS_CUTOFF) +absCutoff = np.array(1e-2, dtype=absCutoff_dtype) + +cutn.tensor_svd_config_set_attribute(handle, + svd_config, cutn.TensorSVDConfigAttribute.ABS_CUTOFF, absCutoff.ctypes.data, absCutoff.dtype.itemsize) + +relCutoff_dtype = cutn.tensor_svd_config_get_attribute_dtype(cutn.TensorSVDConfigAttribute.REL_CUTOFF) +relCutoff = np.array(1e-2, dtype=relCutoff_dtype) + +cutn.tensor_svd_config_set_attribute(handle, + svd_config, cutn.TensorSVDConfigAttribute.REL_CUTOFF, relCutoff.ctypes.data, relCutoff.dtype.itemsize) + +# create SVDInfo to record truncation information +svd_info = cutn.create_tensor_svd_info(handle) + +gate_algo = cutn.GateSplitAlgo.REDUCED +print("Setup gate split truncation options.") + +############################### +# Query Workspace Size +############################### +work_desc = cutn.create_workspace_descriptor(handle) + +cutn.workspace_compute_gate_split_sizes(handle, + desc_tensor_A_in, desc_tensor_B_in, desc_tensor_G_in, + desc_tensor_A_out, desc_tensor_B_out, + gate_algo, svd_config, compute_type, work_desc) +required_workspace_size = cutn.workspace_get_size(handle, + work_desc, cutn.WorksizePref.MIN, cutn.Memspace.DEVICE) +if worksize < required_workspace_size: + raise MemoryError("Not enough workspace memory is available.") +work = cp.cuda.alloc(required_workspace_size) +cutn.workspace_set( + handle, work_desc, + cutn.Memspace.DEVICE, + work.ptr, required_workspace_size) + +print("Query and allocate required workspace.") + +########### +# Execution +########### + +min_time_cutensornet = 1e100 +num_runs = 3 # to get stable perf results +e1 = cp.cuda.Event() +e2 = cp.cuda.Event() + +for i in range(num_runs): + # restore output + A_out_d[:] = 0 + S_out_d[:] = 0 + B_out_d[:] = 0 + dev.synchronize() + + # restore output tensor descriptors as `cutensornet.gate_split` can potentially update the shared extent 
in desc_tensor_A_out/B_out. + # therefore we here restore desc_tensor_A_out/B_out to the original problem + cutn.destroy_tensor_descriptor(desc_tensor_A_out) + cutn.destroy_tensor_descriptor(desc_tensor_B_out) + desc_tensor_A_out = cutn.create_tensor_descriptor(handle, nmode_A_out, extent_A_out, strides, modes_A_out, data_type) + desc_tensor_B_out = cutn.create_tensor_descriptor(handle, nmode_B_out, extent_B_out, strides, modes_B_out, data_type) + + e1.record() + # execution + cutn.gate_split(handle, + desc_tensor_A_in, A_in_d.data.ptr, + desc_tensor_B_in, B_in_d.data.ptr, + desc_tensor_G_in, G_in_d.data.ptr, + desc_tensor_A_out, A_out_d.data.ptr, + S_out_d.data.ptr, + desc_tensor_B_out, B_out_d.data.ptr, + gate_algo, svd_config, compute_type, svd_info, work_desc, stream.ptr) + e2.record() + + # Synchronize and measure timing + e2.synchronize() + time = cp.cuda.get_elapsed_time(e1, e2) # ms + min_time_cutensornet = min_time_cutensornet if min_time_cutensornet < time else time + +full_extent_dtype = cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.FULL_EXTENT) +full_extent = np.empty(1, dtype=full_extent_dtype) +cutn.tensor_svd_info_get_attribute(handle, + svd_info, cutn.TensorSVDInfoAttribute.FULL_EXTENT, full_extent.ctypes.data, full_extent.itemsize) +full_extent = int(full_extent) + +reduced_extent_dtype = cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.REDUCED_EXTENT) +reduced_extent = np.empty(1, dtype=reduced_extent_dtype) +cutn.tensor_svd_info_get_attribute(handle, + svd_info, cutn.TensorSVDInfoAttribute.REDUCED_EXTENT, reduced_extent.ctypes.data, reduced_extent.itemsize) +reduced_extent = int(reduced_extent) + +discarded_weight_dtype = cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT) +discarded_weight = np.empty(1, dtype=discarded_weight_dtype) +cutn.tensor_svd_info_get_attribute(handle, + svd_info, cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT, discarded_weight.ctypes.data,
discarded_weight.itemsize) +discarded_weight = float(discarded_weight) + +print(f"Execution time: {min_time_cutensornet} ms") +print("SVD truncation info:") +print(f"For fixed extent truncation of {shared_extent_out}, an absolute cutoff value of {float(absCutoff)}, and a relative cutoff value of {float(relCutoff)}, full extent {full_extent} is reduced to {reduced_extent}") +print(f"Discarded weight: {discarded_weight}") + +# Recall that when we do value-based truncation through absolute or relative cutoff, +# the extent found at runtime may be lower than what we specified in desc_tensor_. +# Therefore we may need to create new containers to hold the new data, which takes on the Fortran layout corresponding to the new extent + +if reduced_extent != shared_extent_out: + extent_A_out_reduced, strides_A_out = cutn.get_tensor_details(handle, desc_tensor_A_out)[2:] + extent_B_out_reduced, strides_B_out = cutn.get_tensor_details(handle, desc_tensor_B_out)[2:] + # note strides in cutensornet are in the unit of count and strides in cupy/numpy are in the unit of nbytes + strides_A_out = [i * A_out_d.itemsize for i in strides_A_out] + strides_B_out = [i * B_out_d.itemsize for i in strides_B_out] + A_out_d = cp.ndarray(extent_A_out_reduced, dtype=np.float32, memptr=A_out_d.data, strides=strides_A_out) + S_out_d = cp.ndarray(reduced_extent, dtype=np.float32, memptr=S_out_d.data, order='F') + B_out_d = cp.ndarray(extent_B_out_reduced, dtype=np.float32, memptr=B_out_d.data, strides=strides_B_out) + +T_d = cp.einsum("ijkl,kopq,mnlo->ijmnpq", A_in_d, B_in_d, G_in_d) +out = cp.einsum("ijxm,x,xnpq->ijmnpq", A_out_d, S_out_d, B_out_d) + +print(f"max diff after truncation {abs(out-T_d).max()}") +print("Check cuTensorNet result.") + +####################################################### + +cutn.destroy_tensor_descriptor(desc_tensor_A_in) +cutn.destroy_tensor_descriptor(desc_tensor_B_in) +cutn.destroy_tensor_descriptor(desc_tensor_G_in) +cutn.destroy_tensor_descriptor(desc_tensor_A_out)
+cutn.destroy_tensor_descriptor(desc_tensor_B_out) +cutn.destroy_tensor_svd_config(svd_config) +cutn.destroy_tensor_svd_info(svd_info) +cutn.destroy_workspace_descriptor(work_desc) +cutn.destroy(handle) + +print("Free resource and exit.") diff --git a/python/samples/cutensornet/approxTN/mps_example.py b/python/samples/cutensornet/approxTN/mps_example.py new file mode 100644 index 0000000..6900565 --- /dev/null +++ b/python/samples/cutensornet/approxTN/mps_example.py @@ -0,0 +1,354 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import itertools + +import cupy as cp +import numpy as np + +import cuquantum +from cuquantum import cutensornet as cutn + +class MPSHelper: + + """ + MPSHelper(num_sites, phys_extent, max_virtual_extent, initial_state, data_type, compute_type) + + Create an MPSHelper object for gate splitting algorithm. + i j + -------A-------B------- i j k + p| |q -------> -------A`-------B`------- + GGGGGGGGG r| |s + r| |s + + Args: + num_sites: The number of sites in the MPS. + phys_extents: The extent for the physical mode where the gate tensors are acted on. + max_virtual_extent: The maximal extent allowed for the virtual mode shared between adjacent MPS tensors. + initial_state: A sequence of :class:`cupy.ndarray` representing the initial state of the MPS. + data_type (cuquantum.cudaDataType): The data type for all tensors and gates. + compute_type (cuquantum.ComputeType): The compute type for all gate splitting. 
+ + """ + + def __init__(self, num_sites, phys_extent, max_virtual_extent, initial_state, data_type, compute_type): + self.num_sites = num_sites + self.phys_extent = phys_extent + self.data_type = data_type + self.compute_type = compute_type + + self.phys_modes = [] + self.virtual_modes = [] + self.new_mode = itertools.count(start=0, step=1) + + for i in range(num_sites+1): + self.virtual_modes.append(next(self.new_mode)) + if i != num_sites: + self.phys_modes.append(next(self.new_mode)) + + untruncated_max_extent = phys_extent ** (num_sites // 2) + if max_virtual_extent == 0: + self.max_virtual_extent = untruncated_max_extent + else: + self.max_virtual_extent = min(max_virtual_extent, untruncated_max_extent) + + self.handle = cutn.create() + self.work_desc = cutn.create_workspace_descriptor(self.handle) + self.svd_config = cutn.create_tensor_svd_config(self.handle) + self.svd_info = cutn.create_tensor_svd_info(self.handle) + self.gate_algo = cutn.GateSplitAlgo.DIRECT + + self.desc_tensors = [] + self.state_tensors = [] + + # create tensor descriptors + for i in range(self.num_sites): + # copy each input tensor into Fortran layout, keeping its own dtype (no reliance on a global `tensor`) + self.state_tensors.append(initial_state[i].astype(initial_state[i].dtype, order="F")) + extent = self.get_tensor_extent(i) + modes = self.get_tensor_modes(i) + desc_tensor = cutn.create_tensor_descriptor(self.handle, 3, extent, 0, modes, self.data_type) + self.desc_tensors.append(desc_tensor) + + def get_tensor(self, site): + """Get the tensor operands for a specific site.""" + return self.state_tensors[site] + + def get_tensor_extent(self, site): + """Get the extent of the MPS tensor at a specific site.""" + return self.state_tensors[site].shape + + def get_tensor_modes(self, site): + """Get the current modes of the MPS tensor at a specific site.""" + return (self.virtual_modes[site], self.phys_modes[site], self.virtual_modes[site+1]) + + def set_svd_config(self, abs_cutoff, rel_cutoff, renorm, partition): + """Update the SVD truncation setting.
+ + Args: + abs_cutoff: The cutoff value for absolute singular value truncation. + rel_cutoff: The cutoff value for relative singular value truncation. + renorm (cuquantum.cutensornet.TensorSVDNormalization): The option for renormalization of the truncated singular values. + partition (cuquantum.cutensornet.TensorSVDPartition): The option for partitioning of the singular values. + """ + + if partition != cutn.TensorSVDPartition.UV_EQUAL: + raise NotImplementedError("this basic example expects partition to be cutensornet.TensorSVDPartition.UV_EQUAL") + + svd_config_attributes = [cutn.TensorSVDConfigAttribute.ABS_CUTOFF, + cutn.TensorSVDConfigAttribute.REL_CUTOFF, + cutn.TensorSVDConfigAttribute.S_NORMALIZATION, + cutn.TensorSVDConfigAttribute.S_PARTITION] + + for (attr, value) in zip(svd_config_attributes, [abs_cutoff, rel_cutoff, renorm, partition]): + dtype = cutn.tensor_svd_config_get_attribute_dtype(attr) + value = np.array([value], dtype=dtype) + cutn.tensor_svd_config_set_attribute(self.handle, + self.svd_config, attr, value.ctypes.data, value.dtype.itemsize) + + def set_gate_algorithm(self, gate_algo): + """Set the algorithm to use for all gate split operations. + + Args: + gate_algo (cuquantum.cutensornet.GateSplitAlgo): The gate splitting algorithm to use. 
+ """ + + self.gate_algo = gate_algo + + def compute_max_workspace_sizes(self): + """Compute the maximal workspace needed for MPS gating algorithm.""" + modes_in_A = [ord(c) for c in ('i', 'p', 'j')] + modes_in_B = [ord(c) for c in ('j', 'q', 'k')] + modes_in_G = [ord(c) for c in ('p', 'q', 'r', 's')] + modes_out_A = [ord(c) for c in ('i', 'r', 'j')] + modes_out_B = [ord(c) for c in ('j', 's', 'k')] + + max_extents_AB = (self.max_virtual_extent, self.phys_extent, self.max_virtual_extent) + extents_in_G = (self.phys_extent, self.phys_extent, self.phys_extent, self.phys_extent) + + desc_tensor_in_A = cutn.create_tensor_descriptor(self.handle, 3, max_extents_AB, 0, modes_in_A, self.data_type) + desc_tensor_in_B = cutn.create_tensor_descriptor(self.handle, 3, max_extents_AB, 0, modes_in_B, self.data_type) + desc_tensor_in_G = cutn.create_tensor_descriptor(self.handle, 4, extents_in_G, 0, modes_in_G, self.data_type) + desc_tensor_out_A = cutn.create_tensor_descriptor(self.handle, 3, max_extents_AB, 0, modes_out_A, self.data_type) + desc_tensor_out_B = cutn.create_tensor_descriptor(self.handle, 3, max_extents_AB, 0, modes_out_B, self.data_type) + + cutn.workspace_compute_gate_split_sizes(self.handle, + desc_tensor_in_A, desc_tensor_in_B, desc_tensor_in_G, + desc_tensor_out_A, desc_tensor_out_B, + self.gate_algo, self.svd_config, self.compute_type, self.work_desc) + + workspace_size = cutn.workspace_get_size(self.handle, self.work_desc, cutn.WorksizePref.MIN, cutn.Memspace.DEVICE) + + # free resources + cutn.destroy_tensor_descriptor(desc_tensor_in_A) + cutn.destroy_tensor_descriptor(desc_tensor_in_B) + cutn.destroy_tensor_descriptor(desc_tensor_in_G) + cutn.destroy_tensor_descriptor(desc_tensor_out_A) + cutn.destroy_tensor_descriptor(desc_tensor_out_B) + return workspace_size + + def set_workspace(self, work, workspace_size): + """Set the device workspace buffer used by the MPS gating algorithm. + + Args: + work: Pointer to the allocated workspace.
+ workspace_size: The required workspace size on the device. + """ + cutn.workspace_set(self.handle, self.work_desc, cutn.Memspace.DEVICE, work.ptr, workspace_size) + + def apply_gate(self, site_A, site_B, gate, verbose, stream): + """In-place execution of the apply-gate algorithm on site A and site B. + + Args: + site_A: The first site to which the gate is applied. + site_B: The second site to which the gate is applied. + gate (cupy.ndarray): The input data for the gate tensor. + verbose: Whether to print out the runtime information during truncation. + stream (cupy.cuda.Stream): The CUDA stream on which the computation is performed. + """ + if site_B - site_A != 1: + raise ValueError("Site B must be the right site of site A") + if site_B >= self.num_sites: + raise ValueError("Site index cannot exceed maximum number of sites") + + desc_tensor_in_A = self.desc_tensors[site_A] + desc_tensor_in_B = self.desc_tensors[site_B] + + phys_mode_in_A = self.phys_modes[site_A] + phys_mode_in_B = self.phys_modes[site_B] + phys_mode_out_A = next(self.new_mode) + phys_mode_out_B = next(self.new_mode) + modes_G = (phys_mode_in_A, phys_mode_in_B, phys_mode_out_A, phys_mode_out_B) + extent_G = (self.phys_extent, self.phys_extent, self.phys_extent, self.phys_extent) + desc_tensor_in_G = cutn.create_tensor_descriptor(self.handle, 4, extent_G, 0, modes_G, self.data_type) + + # construct and initialize the expected output A and B + tensor_in_A = self.state_tensors[site_A] + tensor_in_B = self.state_tensors[site_B] + left_extent_A = tensor_in_A.shape[0] + extent_AB_in = tensor_in_A.shape[2] + right_extent_B = tensor_in_B.shape[2] + combined_extent_left = min(left_extent_A, extent_AB_in * self.phys_extent) * self.phys_extent + combined_extent_right = min(right_extent_B, extent_AB_in * self.phys_extent) * self.phys_extent + extent_Aout_B = min(combined_extent_left, combined_extent_right, self.max_virtual_extent) + + extent_out_A = (left_extent_A, self.phys_extent, extent_Aout_B) +
extent_out_B = (extent_Aout_B, self.phys_extent, right_extent_B) + + tensor_out_A = cp.zeros(extent_out_A, dtype=tensor_in_A.dtype, order="F") + tensor_out_B = cp.zeros(extent_out_B, dtype=tensor_in_B.dtype, order="F") + + # create tensor descriptors for output A and B + modes_out_A = (self.virtual_modes[site_A], phys_mode_out_A, self.virtual_modes[site_A+1]) + modes_out_B = (self.virtual_modes[site_B], phys_mode_out_B, self.virtual_modes[site_B+1]) + + desc_tensor_out_A = cutn.create_tensor_descriptor(self.handle, 3, extent_out_A, 0, modes_out_A, self.data_type) + desc_tensor_out_B = cutn.create_tensor_descriptor(self.handle, 3, extent_out_B, 0, modes_out_B, self.data_type) + + cutn.gate_split(self.handle, + desc_tensor_in_A, tensor_in_A.data.ptr, + desc_tensor_in_B, tensor_in_B.data.ptr, + desc_tensor_in_G, gate.data.ptr, + desc_tensor_out_A, tensor_out_A.data.ptr, + 0, # we factorize singular values equally onto output A and B. + desc_tensor_out_B, tensor_out_B.data.ptr, + self.gate_algo, self.svd_config, self.compute_type, + self.svd_info, self.work_desc, stream.ptr) + + if verbose: + full_extent = np.array([0], dtype=cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.FULL_EXTENT)) + reduced_extent = np.array([0], dtype=cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.REDUCED_EXTENT)) + discarded_weight = np.array([0], dtype=cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT)) + + cutn.tensor_svd_info_get_attribute( + self.handle, self.svd_info, cutn.TensorSVDInfoAttribute.FULL_EXTENT, + full_extent.ctypes.data, full_extent.dtype.itemsize) + cutn.tensor_svd_info_get_attribute( + self.handle, self.svd_info, cutn.TensorSVDInfoAttribute.REDUCED_EXTENT, + reduced_extent.ctypes.data, reduced_extent.dtype.itemsize) + cutn.tensor_svd_info_get_attribute( + self.handle, self.svd_info, cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT, + discarded_weight.ctypes.data, discarded_weight.dtype.itemsize) + + 
print("Virtual bond truncated from {0} to {1} with a discarded weight of {2:.6f}".format(full_extent[0], reduced_extent[0], discarded_weight[0])) + + self.phys_modes[site_A] = phys_mode_out_A + self.phys_modes[site_B] = phys_mode_out_B + self.desc_tensors[site_A] = desc_tensor_out_A + self.desc_tensors[site_B] = desc_tensor_out_B + + extent_out_A = np.zeros((3,), dtype=np.int64) + extent_out_B = np.zeros((3,), dtype=np.int64) + extent_out_A, strides_out_A = cutn.get_tensor_details(self.handle, desc_tensor_out_A)[2:] + extent_out_B, strides_out_B = cutn.get_tensor_details(self.handle, desc_tensor_out_B)[2:] + + # Recall that `cutensornet.gate_split` can potentially find reduced extent during SVD truncation when value-based truncation is used. + # Therefore we here update the container for output tensor A and B. + if extent_out_A[2] != extent_Aout_B: + # note strides in cutensornet are in the unit of count and strides in cupy/numpy are in the unit of nbytes + strides_out_A = [i * tensor_out_A.itemsize for i in strides_out_A] + strides_out_B = [i * tensor_out_B.itemsize for i in strides_out_B] + tensor_out_A = cp.ndarray(extent_out_A, dtype=tensor_out_A.dtype, memptr=tensor_out_A.data, strides=strides_out_A) + tensor_out_B = cp.ndarray(extent_out_B, dtype=tensor_out_B.dtype, memptr=tensor_out_B.data, strides=strides_out_B) + + self.state_tensors[site_A] = tensor_out_A + self.state_tensors[site_B] = tensor_out_B + + cutn.destroy_tensor_descriptor(desc_tensor_in_A) + cutn.destroy_tensor_descriptor(desc_tensor_in_B) + cutn.destroy_tensor_descriptor(desc_tensor_in_G) + + def __del__(self): + """Free all resources owned by the object.""" + for desc_tensor in self.desc_tensors: + cutn.destroy_tensor_descriptor(desc_tensor) + cutn.destroy(self.handle) + cutn.destroy_workspace_descriptor(self.work_desc) + cutn.destroy_tensor_svd_config(self.svd_config) + cutn.destroy_tensor_svd_info(self.svd_info) + + +if __name__ == '__main__': + + print("cuTensorNet-vers:", 
cutn.get_version()) + dev = cp.cuda.Device() # get current device + props = cp.cuda.runtime.getDeviceProperties(dev.id) + print("===== device info ======") + print("GPU-name:", props["name"].decode()) + print("GPU-clock:", props["clockRate"]) + print("GPU-memoryClock:", props["memoryClockRate"]) + print("GPU-nSM:", props["multiProcessorCount"]) + print("GPU-major:", props["major"]) + print("GPU-minor:", props["minor"]) + print("========================") + + data_type = cuquantum.cudaDataType.CUDA_C_64F + compute_type = cuquantum.ComputeType.COMPUTE_64F + + num_sites = 16 + phys_extent = 2 + max_virtual_extent = 12 + + ## we initialize the MPS state as a product state |000...000> + initial_state = [] + for i in range(num_sites): + # we create dummy indices for MPS tensors on the boundary for easier bookkeeping + # we'll use Fortran layout throughout this example + tensor = cp.zeros((1,2,1), dtype=np.complex128, order="F") + tensor[0,0,0] = 1.0 + initial_state.append(tensor) + + ################################## + # Initialize an MPSHelper object + ################################## + + mps_helper = MPSHelper(num_sites, phys_extent, max_virtual_extent, initial_state, data_type, compute_type) + + ################################## + # Setup options for gate operation + ################################## + + abs_cutoff = 1e-2 + rel_cutoff = 1e-2 + renorm = cutn.TensorSVDNormalization.L2 + partition = cutn.TensorSVDPartition.UV_EQUAL + mps_helper.set_svd_config(abs_cutoff, rel_cutoff, renorm, partition) + + gate_algo = cutn.GateSplitAlgo.REDUCED + mps_helper.set_gate_algorithm(gate_algo) + + ##################################### + # Workspace estimation and allocation + ##################################### + + free_mem, total_mem = dev.mem_info + worksize = free_mem *.7 + required_workspace_size = mps_helper.compute_max_workspace_sizes() + work = cp.cuda.alloc(worksize) + print(f"Maximal workspace size required: {required_workspace_size / 1024 ** 3:.3f} GB") +
mps_helper.set_workspace(work, required_workspace_size) + + ########### + # Execution + ########### + + stream = cp.cuda.Stream() + cp.random.seed(0) + num_layers = 10 + for i in range(num_layers): + start_site = i % 2 + print(f"Cycle {i}:") + verbose = (i == num_layers-1) + for j in range(start_site, num_sites-1, 2): + # initialize a random 2-qubit gate + gate = cp.random.random([phys_extent,]*4) + 1.j * cp.random.random([phys_extent,]*4) + gate = gate.astype(gate.dtype, order="F") + mps_helper.apply_gate(j, j+1, gate, verbose, stream) + + stream.synchronize() + print("========================") + print("After gate application") + for i in range(num_sites): + tensor = mps_helper.get_tensor(i) + modes = mps_helper.get_tensor_modes(i) + print(f"Site {i}, extent: {tensor.shape}, modes: {modes}") \ No newline at end of file diff --git a/python/samples/cutensornet/approxTN/tensor_qr_example.py b/python/samples/cutensornet/approxTN/tensor_qr_example.py new file mode 100644 index 0000000..4008436 --- /dev/null +++ b/python/samples/cutensornet/approxTN/tensor_qr_example.py @@ -0,0 +1,139 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import cupy as cp +import numpy as np + +import cuquantum +from cuquantum import cutensornet as cutn + + +print("cuTensorNet-vers:", cutn.get_version()) +dev = cp.cuda.Device() # get current device +props = cp.cuda.runtime.getDeviceProperties(dev.id) +print("===== device info ======") +print("GPU-name:", props["name"].decode()) +print("GPU-clock:", props["clockRate"]) +print("GPU-memoryClock:", props["memoryClockRate"]) +print("GPU-nSM:", props["multiProcessorCount"]) +print("GPU-major:", props["major"]) +print("GPU-minor:", props["minor"]) +print("========================") + +############################################### +# Tensor QR: T_{i,j,m,n} -> Q_{i,x,m} R_{n,x,j} +############################################### + +data_type = cuquantum.cudaDataType.CUDA_R_32F + +# Create 
an array of modes + +modes_T = [ord(c) for c in ('i','j','m','n')] # input +modes_Q = [ord(c) for c in ('i','x','m')] # QR output +modes_R = [ord(c) for c in ('n','x','j')] + +# Create an array of extent (shapes) for each tensor +extent_T = (16, 16, 16, 16) +extent_Q = (16, 256, 16) +extent_R = (16, 256, 16) + +############################ +# Allocate & initialize data +############################ + +T_d = cp.random.random(extent_T, dtype=np.float32).astype(np.float32, order='F') # we use fortran layout throughout this example +Q_d = cp.empty(extent_Q, dtype=np.float32, order='F') +R_d = cp.empty(extent_R, dtype=np.float32, order='F') + +print("Allocate memory for data and initialize data.") + +free_mem, total_mem = dev.mem_info +worksize = free_mem *.7 + +############# +# cuTensorNet +############# +stream = cp.cuda.Stream() +handle = cutn.create() + +nmode_T = len(modes_T) +nmode_Q = len(modes_Q) +nmode_R = len(modes_R) + +############################### +# Create tensor descriptors +############################### + +# strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout +strides = 0 +desc_tensor_T = cutn.create_tensor_descriptor(handle, nmode_T, extent_T, strides, modes_T, data_type) +desc_tensor_Q = cutn.create_tensor_descriptor(handle, nmode_Q, extent_Q, strides, modes_Q, data_type) +desc_tensor_R = cutn.create_tensor_descriptor(handle, nmode_R, extent_R, strides, modes_R, data_type) + +####################################### +# Query and allocate required workspace +####################################### +work_desc = cutn.create_workspace_descriptor(handle) + +cutn.workspace_compute_qr_sizes(handle, desc_tensor_T, desc_tensor_Q, desc_tensor_R, work_desc) +required_workspace_size = cutn.workspace_get_size(handle, + work_desc, cutn.WorksizePref.MIN, cutn.Memspace.DEVICE) +if worksize < required_workspace_size: + raise MemoryError("Not enough workspace memory is available.") +work = 
cp.cuda.alloc(required_workspace_size) +cutn.workspace_set( + handle, work_desc, + cutn.Memspace.DEVICE, + work.ptr, required_workspace_size) + +print("Query and allocate required workspace.") + +########### +# Execution +########### + +min_time_cutensornet = 1e100 +num_runs = 3 # to get stable perf results +e1 = cp.cuda.Event() +e2 = cp.cuda.Event() + +for i in range(num_runs): + # restore output + Q_d[:] = 0 + R_d[:] = 0 + dev.synchronize() + + e1.record() + # execution + cutn.tensor_qr(handle, desc_tensor_T, T_d.data.ptr, + desc_tensor_Q, Q_d.data.ptr, + desc_tensor_R, R_d.data.ptr, + work_desc, stream.ptr) + e2.record() + + # Synchronize and measure timing + e2.synchronize() + time = cp.cuda.get_elapsed_time(e1, e2) # ms + min_time_cutensornet = min_time_cutensornet if min_time_cutensornet < time else time + +print(f"Execution time: {min_time_cutensornet} ms") + +out = cp.einsum("ixm,nxj->ijmn", Q_d, R_d) + +rtol = atol = 1e-5 +if not cp.allclose(out, T_d, rtol=rtol, atol=atol): + raise RuntimeError(f"result is incorrect, max diff {abs(out-T_d).max()}") +print("Check cuTensorNet result.") + +################ +# Free resources +################ + +cutn.destroy_tensor_descriptor(desc_tensor_T) +cutn.destroy_tensor_descriptor(desc_tensor_Q) +cutn.destroy_tensor_descriptor(desc_tensor_R) +cutn.destroy_workspace_descriptor(work_desc) +cutn.destroy(handle) + +print("Free resource and exit.") diff --git a/python/samples/cutensornet/approxTN/tensor_svd_example.py b/python/samples/cutensornet/approxTN/tensor_svd_example.py new file mode 100644 index 0000000..6c5e065 --- /dev/null +++ b/python/samples/cutensornet/approxTN/tensor_svd_example.py @@ -0,0 +1,208 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import cupy as cp +import numpy as np + +import cuquantum +from cuquantum import cutensornet as cutn + + +print("cuTensorNet-vers:", cutn.get_version()) +dev = cp.cuda.Device() # get current device +props = 
cp.cuda.runtime.getDeviceProperties(dev.id) +print("===== device info ======") +print("GPU-name:", props["name"].decode()) +print("GPU-clock:", props["clockRate"]) +print("GPU-memoryClock:", props["memoryClockRate"]) +print("GPU-nSM:", props["multiProcessorCount"]) +print("GPU-major:", props["major"]) +print("GPU-minor:", props["minor"]) +print("========================") + +###################################################### +# Tensor SVD: T_{i,j,m,n} -> U_{i,x,m} S_{x} V_{n,x,j} +###################################################### + +data_type = cuquantum.cudaDataType.CUDA_R_32F + +# Create an array of modes + +modes_T = [ord(c) for c in ('i','j','m','n')] # input +modes_U = [ord(c) for c in ('i','x','m')] # SVD output +modes_V = [ord(c) for c in ('n','x','j')] + +# Create an array of extent (shapes) for each tensor +extent_T = (16, 16, 16, 16) +shared_extent = 256 // 2 # truncate shared extent from 256 to 128 +extent_U = (16, shared_extent, 16) +extent_V = (16, shared_extent, 16) + +############################ +# Allocate & initialize data +############################ +cp.random.seed(1) +T_d = cp.random.random(extent_T, dtype=np.float32).astype(np.float32, order='F') # we use fortran layout throughout this example +U_d = cp.empty(extent_U, dtype=np.float32, order='F') +S_d = cp.empty(shared_extent, dtype=np.float32) +V_d = cp.empty(extent_V, dtype=np.float32, order='F') + +print("Allocate memory for data and initialize data.") + +free_mem, total_mem = dev.mem_info +worksize = free_mem *.7 + +############# +# cuTensorNet +############# + +stream = cp.cuda.Stream() +handle = cutn.create() + +nmode_T = len(modes_T) +nmode_U = len(modes_U) +nmode_V = len(modes_V) + +############################### +# Create tensor descriptor +############################### + +# strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout +strides = 0 +desc_tensor_T = cutn.create_tensor_descriptor(handle, nmode_T, 
extent_T, strides, modes_T, data_type) +desc_tensor_U = cutn.create_tensor_descriptor(handle, nmode_U, extent_U, strides, modes_U, data_type) +desc_tensor_V = cutn.create_tensor_descriptor(handle, nmode_V, extent_V, strides, modes_V, data_type) + +################################## +# Setup SVD truncation parameters +################################## + +svd_config = cutn.create_tensor_svd_config(handle) +abs_cutoff_dtype = cutn.tensor_svd_config_get_attribute_dtype(cutn.TensorSVDConfigAttribute.ABS_CUTOFF) +abs_cutoff = np.array(1e-2, dtype=abs_cutoff_dtype) + +cutn.tensor_svd_config_set_attribute(handle, + svd_config, cutn.TensorSVDConfigAttribute.ABS_CUTOFF, abs_cutoff.ctypes.data, abs_cutoff.dtype.itemsize) + +rel_cutoff_dtype = cutn.tensor_svd_config_get_attribute_dtype(cutn.TensorSVDConfigAttribute.REL_CUTOFF) +rel_cutoff = np.array(4e-2, dtype=rel_cutoff_dtype) + +cutn.tensor_svd_config_set_attribute(handle, + svd_config, cutn.TensorSVDConfigAttribute.REL_CUTOFF, rel_cutoff.ctypes.data, rel_cutoff.dtype.itemsize) + +print("Setup SVD truncation parameters.") + +# create SVDInfo to record truncation information +svd_info = cutn.create_tensor_svd_info(handle) + +############################### +# Query Workspace Size +############################### +work_desc = cutn.create_workspace_descriptor(handle) + +cutn.workspace_compute_svd_sizes(handle, desc_tensor_T, desc_tensor_U, desc_tensor_V, svd_config, work_desc) +required_workspace_size = cutn.workspace_get_size(handle, + work_desc, cutn.WorksizePref.MIN, cutn.Memspace.DEVICE) +if worksize < required_workspace_size: + raise MemoryError("Not enough workspace memory is available.") +work = cp.cuda.alloc(required_workspace_size) +cutn.workspace_set( + handle, work_desc, + cutn.Memspace.DEVICE, + work.ptr, required_workspace_size) + +print("Query and allocate required workspace.") + +##### +# Run +##### + +min_time_cutensornet = 1e100 +num_runs = 3 # to get stable perf results +e1 = cp.cuda.Event() +e2 = 
cp.cuda.Event() + +for i in range(num_runs): + # restore output + U_d[:] = 0 + S_d[:] = 0 + V_d[:] = 0 + dev.synchronize() + + # restore output tensor descriptors as `cutensornet.tensor_svd` can potentially update the shared extent in desc_tensor_U/V. + # therefore we here restore desc_tensor_U/V to the original problem + cutn.destroy_tensor_descriptor(desc_tensor_U) + cutn.destroy_tensor_descriptor(desc_tensor_V) + desc_tensor_U = cutn.create_tensor_descriptor(handle, nmode_U, extent_U, strides, modes_U, data_type) + desc_tensor_V = cutn.create_tensor_descriptor(handle, nmode_V, extent_V, strides, modes_V, data_type) + + e1.record() + # execution + cutn.tensor_svd(handle, desc_tensor_T, T_d.data.ptr, + desc_tensor_U, U_d.data.ptr, + S_d.data.ptr, + desc_tensor_V, V_d.data.ptr, + svd_config, svd_info, + work_desc, stream.ptr) + + e2.record() + + # Synchronize and measure timing + e2.synchronize() + time = cp.cuda.get_elapsed_time(e1, e2) # ms + min_time_cutensornet = min_time_cutensornet if min_time_cutensornet < time else time + +full_extent_dtype = cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.FULL_EXTENT) +full_extent = np.empty(1, dtype=full_extent_dtype) +cutn.tensor_svd_info_get_attribute(handle, + svd_info, cutn.TensorSVDInfoAttribute.FULL_EXTENT, full_extent.ctypes.data, full_extent.itemsize) +full_extent = int(full_extent) + +reduced_extent_dtype = cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.REDUCED_EXTENT) +reduced_extent = np.empty(1, dtype=reduced_extent_dtype) +cutn.tensor_svd_info_get_attribute(handle, + svd_info, cutn.TensorSVDInfoAttribute.REDUCED_EXTENT, reduced_extent.ctypes.data, reduced_extent.itemsize) +reduced_extent = int(reduced_extent) + +discarded_weight_dtype = cutn.tensor_svd_info_get_attribute_dtype(cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT) +discarded_weight = np.empty(1, dtype=discarded_weight_dtype) +cutn.tensor_svd_info_get_attribute(handle, + svd_info, 
cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT, discarded_weight.ctypes.data, discarded_weight.itemsize) +discarded_weight = float(discarded_weight) + +print(f"Execution time: {min_time_cutensornet} ms") +print("SVD truncation info:") +print(f"For fixed extent truncation of {shared_extent}, an absolute cutoff value of {float(abs_cutoff)}, and a relative cutoff value of {float(rel_cutoff)}, full extent {full_extent} is reduced to {reduced_extent}") +print(f"Discarded weight: {discarded_weight}") + +# Recall that when we do value-based truncation through absolute or relative cutoff, +# the extent found at runtime may be lower than what we specified in desc_tensor_. +# Therefore we may need to create new containers to hold the new data, which takes on the Fortran layout corresponding to the new extent +extent_U_out, strides_U_out = cutn.get_tensor_details(handle, desc_tensor_U)[2:] +extent_V_out, strides_V_out = cutn.get_tensor_details(handle, desc_tensor_V)[2:] + +if extent_U_out[1] != shared_extent: + # note strides in cutensornet are in the unit of count and strides in cupy/numpy are in the unit of nbytes + strides_U_out = [i * U_d.itemsize for i in strides_U_out] + strides_V_out = [i * V_d.itemsize for i in strides_V_out] + U_d = cp.ndarray(extent_U_out, dtype=np.float32, memptr=U_d.data, strides=strides_U_out) + S_d = cp.ndarray(extent_U_out[1], dtype=np.float32, memptr=S_d.data, order='F') + V_d = cp.ndarray(extent_V_out, dtype=np.float32, memptr=V_d.data, strides=strides_V_out) + +out = cp.einsum("ixm,x,nxj->ijmn", U_d, S_d, V_d) + +print(f"max diff after truncation {abs(out-T_d).max()}") +print("Check cuTensorNet result.") + +####################################################### + +cutn.destroy_tensor_descriptor(desc_tensor_T) +cutn.destroy_tensor_descriptor(desc_tensor_U) +cutn.destroy_tensor_descriptor(desc_tensor_V) +cutn.destroy_workspace_descriptor(work_desc) +cutn.destroy_tensor_svd_config(svd_config) +cutn.destroy_tensor_svd_info(svd_info) +cutn.destroy(handle) +
+print("Free resource and exit.") diff --git a/python/samples/cutensornet/circuit_converter/cirq_advanced.ipynb b/python/samples/cutensornet/circuit_converter/cirq_advanced.ipynb index 713e367..79967fe 100644 --- a/python/samples/cutensornet/circuit_converter/cirq_advanced.ipynb +++ b/python/samples/cutensornet/circuit_converter/cirq_advanced.ipynb @@ -154,7 +154,7 @@ }, { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -392,7 +392,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Exact energy from cutn: 0.44448295245377206\n" + "Exact energy from cutn: 0.4444829524537696\n" ] } ], @@ -400,12 +400,11 @@ "def compute_energy_cutn(resolved_circuit, length, h, jr, jc):\n", " nrow = ncol = length\n", " assert length == jr.shape[1] == jc.shape[0]\n", - " Zop = cp.diag([1,-1]).astype('complex128')\n", " \n", - " def compute_rdm(myconverter, where, options):\n", - " expression, operands = myconverter.reduced_density_matrix(where, lightcone=True)\n", - " rdm = contract(expression, *operands, options=options)\n", - " return rdm\n", + " def compute_energy_term(myconverter, pauli_string, options):\n", + " expression, operands = myconverter.expectation(pauli_string, lightcone=True)\n", + " e = contract(expression, *operands, options=options).real\n", + " return e\n", " \n", " def expectation(x):\n", " energy = 0.\n", @@ -418,30 +417,25 @@ " \n", " for i in range(nrow):\n", " for j in range(ncol):\n", - " # for one-body terms, we construct the 1-RDM for each qubit\n", + " # one-body terms\n", " q = qubits[i*ncol+j]\n", - " where = (q,)\n", - " rdm = compute_rdm(myconverter, where, options)\n", - " energy += cp.sum(rdm * Zop).real * h[i][j]\n", - " # this would work too\n", - " # energy += contract('ij,ij->', rdm, Zop, options=options).real * h[i][j]\n", + " pauli_string = {q: 'Z'}\n", + " energy += compute_energy_term(myconverter, pauli_string, options) * h[i][j]\n", " \n", - " # for two-body terms, we construct the 2-RDM for all adjacent pairs of qubits\n", + " # two-body terms\n", " # - vertical bond\n", " if i != nrow-1:\n", " top = qubits[i*ncol+j]\n", " bottom = qubits[(i+1)*ncol+j]\n", - " where = (top, bottom)\n", - " rdm = compute_rdm(myconverter, where, options)\n", - " energy += cp.einsum('ijIJ,iI,jJ->', rdm, Zop, Zop).real * jr[i][j]\n", + " pauli_string = {top: 'Z', bottom: 'Z'}\n", + " energy += compute_energy_term(myconverter, pauli_string, options) * 
jr[i][j]\n", " \n", " # - horizontal bond\n", " if j != ncol-1:\n", " left = qubits[i*ncol+j]\n", " right = qubits[i*ncol+(j+1)]\n", - " where = (left, right)\n", - " rdm = compute_rdm(myconverter, where, options)\n", - " energy += cp.einsum('ijIJ,iI,jJ->', rdm, Zop, Zop).real * jc[i][j]\n", + " pauli_string = {left:'Z', right:'Z'}\n", + " energy += compute_energy_term(myconverter, pauli_string, options) * jc[i][j]\n", " \n", " # handle should be explictly destroyed\n", " cutn.destroy(handle)\n", @@ -526,7 +520,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -563,7 +557,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.9.12" } }, "nbformat": 4, diff --git a/python/samples/cutensornet/circuit_converter/cirq_basic.ipynb b/python/samples/cutensornet/circuit_converter/cirq_basic.ipynb index 85b9eab..846d33a 100644 --- a/python/samples/cutensornet/circuit_converter/cirq_basic.ipynb +++ b/python/samples/cutensornet/circuit_converter/cirq_basic.ipynb @@ -26,6 +26,8 @@ "metadata": {}, "outputs": [], "source": [ + "import itertools\n", + "\n", "import cirq\n", "from cirq.testing import random_circuit\n", "import cupy as cp\n", @@ -162,9 +164,9 @@ "einsum expression:\n", "a,b,c,d,e,f,g,ha,ijdb,kg,lf,mc,nm,ok,pqhj,ri,sp,tl,uvqe,wn,xs,yzwo,Ar,By,CDuv,EFtx,Gz,HIDF,JC,KLGE,I,J,B,A,H,L,K->\n", "\n", - "for bitstring 0000000, amplitude: (0.17677669529663717+0j), probability: 0.031250000000000104\n", + "for bitstring 0000000, amplitude: (0.17677669529663714+0j), probability: 0.03125000000000009\n", "\n", - "difference from state vector: 0.0\n" + "difference from state vector: 2.7755575615628914e-17\n" ] } ], @@ -183,6 +185,98 @@ "print(f'difference from state vector: {amp_diff}')" ] }, + { + "cell_type": "markdown", + "id": "34161bd5-0a1b-4972-b84e-0ae34b7ab216", + "metadata": {}, + "source": [ + "### calculate batch of bistring amplitudes\n", + "\n", + "In this example, we calculate a batch of bistring amplitudes $\\langle 00000ij|\\psi\\rangle$ where the first 5 qubits are fixed at state $00000$ and the last two qubit states are batched. This is equivalent to computing a slice of the full state vector." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "e5e6a666-a586-4ec2-ba03-7398049cc721", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "for bitstring 0000000, amplitude: 0.1768+0.0000j, difference from state vector: 0.0000\n", + "for bitstring 0000001, amplitude: 0.0000+0.0000j, difference from state vector: 0.0000\n", + "for bitstring 0000010, amplitude: 0.0000+0.0000j, difference from state vector: 0.0000\n", + "for bitstring 0000011, amplitude: 0.1250+0.1250j, difference from state vector: 0.0000\n" + ] + } + ], + "source": [ + "fixed_states = '00000'\n", + "fixed_index = tuple(map(int, fixed_states))\n", + "num_fixed = len(fixed_states)\n", + "\n", + "# mapping of the first 5 qubits to the fixed state\n", + "fixed = dict(zip(myconverter.qubits[:num_fixed], fixed_states))\n", + "\n", + "expression, operands = myconverter.batched_amplitudes(fixed)\n", + "batched_amplitudes = contract(expression, *operands)\n", + "\n", + "for ibit, jbit in itertools.product(range(2), repeat=2):\n", + " bitstring = fixed_states + str(ibit) + str(jbit)\n", + " index = fixed_index + (ibit, jbit)\n", + " amplitude = batched_amplitudes[(ibit, jbit)]\n", + " amplitude_from_sv = sv[index]\n", + " amp_diff = abs(amplitude-amplitude_from_sv)\n", + " print(f'for bitstring {bitstring}, amplitude: {amplitude:.4f}, difference from state vector: {amp_diff:.4f}')" + ] + }, + { + "cell_type": "markdown", + "id": "773d861a-1b00-4bfa-8a33-4e766954e560", + "metadata": {}, + "source": [ + "### compute expectation value $\\langle \\psi|\\hat{O}| \\psi\\rangle$\n", + "\n", + "In this example, we compute the expectation value for a Pauli string $IIYXIXY$. For comparison, we compute the same value by contracting the reduced density matrix with the operator."
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "b05c257d-602a-4473-83ee-93adbe701b5a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "expectation value for IIYXIXY: (0.5000000000000016+0j)\n", + "is expectation value in agreement? True\n" + ] + } + ], + "source": [ + "pauli_string = 'IIYXIXY'\n", + "expression, operands = myconverter.expectation(pauli_string, lightcone=True)\n", + "expec = contract(expression, *operands)\n", + "print(f'expectation value for {pauli_string}: {expec}')\n", + "\n", + "# expectation value from reduced density matrix\n", + "qubits = myconverter.qubits\n", + "where = qubits[2:4] + qubits[5:]\n", + "rdm_expression, rdm_operands = myconverter.reduced_density_matrix(where, lightcone=True)\n", + "rdm = contract(rdm_expression, *rdm_operands)\n", + "\n", + "pauli_x = cp.asarray([[0,1],[1,0]], dtype=myconverter.dtype)\n", + "pauli_y = cp.asarray([[0,-1j], [1j,0]], dtype=myconverter.dtype)\n", + "pauli_z = cp.asarray([[1,0],[0,-1]], dtype=myconverter.dtype)\n", + "expec_from_rdm = cp.einsum('abcdABCD,aA,bB,cC,dD->', rdm, pauli_y, pauli_x, pauli_x, pauli_y)\n", + "\n", + "print(f\"is expectation value in agreement?\", cp.allclose(expec, expec_from_rdm))" + ] + }, { "cell_type": "markdown", "id": "cc22ffa5-92b8-4978-a6a9-c7e4ec731bd6", @@ -195,7 +289,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "id": "75170ff5-8b17-43c3-9220-610680cd4f39", "metadata": {}, "outputs": [ @@ -209,7 +303,6 @@ } ], "source": [ - "qubits = sorted(circuit.all_qubits()) # ensure we can index the qubits correctly\n", "where = qubits[:2]\n", "fixed = {qubits[3]: '0',\n", " qubits[4]: '0'}\n", @@ -241,7 +334,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.9.12" } }, "nbformat": 4, diff --git a/python/samples/cutensornet/circuit_converter/qiskit_advanced.ipynb 
b/python/samples/cutensornet/circuit_converter/qiskit_advanced.ipynb index 07cb563..b94378b 100644 --- a/python/samples/cutensornet/circuit_converter/qiskit_advanced.ipynb +++ b/python/samples/cutensornet/circuit_converter/qiskit_advanced.ipynb @@ -57,7 +57,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -86,7 +86,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -174,7 +174,7 @@ }, { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -227,7 +227,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -257,7 +257,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -349,22 +349,31 @@ " \n", " circuit = create_qaoa_circ(G, theta)\n", " myconverter = CircuitToEinsum(circuit, dtype='complex128', backend=cp)\n", - " Zop = cp.diag([1,-1]).astype('complex128')\n", " \n", " for (i, j), weight in weights.items():\n", - " where = (circuit.qubits[i], circuit.qubits[j])\n", - " \n", - " expression, operands = myconverter.reduced_density_matrix(where, lightcone=True)\n", + " pauli_string = {circuit.qubits[i]: 'Z',\n", + " circuit.qubits[j]: 'Z'}\n", + " expression, operands = myconverter.expectation(pauli_string, lightcone=True)\n", " _, path_info = get_path(expression, operands, options, (i, j))\n", - " rdm = contract(expression, *operands,\n", - " optimize={'path': path_info.path, 'slicing': path_info.slices},\n", - " options=options)\n", + " e += contract(expression, *operands,\n", + " optimize={'path': path_info.path, 'slicing': path_info.slices},\n", + " options=options).real\n", " \n", - " e += cp.einsum('ijIJ,iI,jJ->', rdm, Zop, Zop).real\n", - "\n", + " # the same task can be achieved with reduced density matrix:\n", + " \n", + " # where = (circuit.qubits[i], circuit.qubits[j])\n", + " # expression, operands = myconverter.reduced_density_matrix(where, lightcone=True)\n", + " # _, path_info = get_path(expression, operands, options, (i, j))\n", + " # rdm = contract(expression, *operands, \n", + " # optimize={'path': path_info.path, 'slicing': path_info.slices}, \n", + " # options=options).real\n", + " # Zop = cp.diag([1,-1]).astype('complex128')\n", + " # e+= cp.einsum('ijIJ,Ii,Jj->', rdm, Zop, Zop).real\n", + " \n", " # handle should be explictly destroyed\n", + " \n", " cutn.destroy(handle)\n", - " \n", + " \n", " return e\n", " \n", " return expectation\n", @@ -460,13 +469,13 @@ "name": "stdout", "output_type": "stream", "text": [ - " fun: -5.974514320263179\n", + " fun: -5.974513781450166\n", " maxcv: 0.0\n", " message: 'Optimization terminated successfully.'\n", - " nfev: 99\n", + " nfev: 126\n", " status: 1\n", " 
success: True\n", - " x: array([2.09900591, 1.27967004, 1.80460603, 1.99555842]) \n", + " x: array([2.09892427, 1.27968803, 1.80457485, 1.99563031]) \n", "\n", " fun: -6.4296875\n", " maxcv: 0.0\n", @@ -590,7 +599,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.9.12" } }, "nbformat": 4, diff --git a/python/samples/cutensornet/circuit_converter/qiskit_basic.ipynb b/python/samples/cutensornet/circuit_converter/qiskit_basic.ipynb index 89e0584..c239095 100644 --- a/python/samples/cutensornet/circuit_converter/qiskit_basic.ipynb +++ b/python/samples/cutensornet/circuit_converter/qiskit_basic.ipynb @@ -26,6 +26,8 @@ "metadata": {}, "outputs": [], "source": [ + "import itertools\n", + "\n", "import cupy as cp\n", "import numpy as np\n", "import qiskit\n", @@ -53,7 +55,7 @@ "outputs": [ { "data": { - "image/png": "\n", + "image/png": "\n", "text/plain": [ "
" ] @@ -177,6 +179,97 @@ "print(f'difference from state vector {amp_diff}')" ] }, + { + "cell_type": "markdown", + "id": "6fa82949-43e1-4d55-8af5-554aa5c9020e", + "metadata": {}, + "source": [ + "### calculate batch of bistring amplitudes\n", + "\n", + "In this example, we calculate a batch of bistring amplitudes $\\langle 00000ij|\\psi\\rangle$ where the first 5 qubits are fixed at state $00000$ and the last two qubit states are batched. This is equivalent to computing a slice of the full state vector." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "180793ae-8cbd-44ba-ac3b-19d4679989cb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "for bitstring 0000000, amplitude: -0.4516+0.2872j, difference from state vector: 0.0000\n", + "for bitstring 0000001, amplitude: 0.0000+0.0000j, difference from state vector: 0.0000\n", + "for bitstring 0000010, amplitude: -0.1704+0.5073j, difference from state vector: 0.0000\n", + "for bitstring 0000011, amplitude: 0.0000+0.0000j, difference from state vector: 0.0000\n" + ] + } + ], + "source": [ + "fixed_states = '00000'\n", + "fixed_index = tuple(map(int, fixed_states))\n", + "num_fixed = len(fixed_states)\n", + "\n", + "# mapping of the first 5 qubits to the fixed state\n", + "fixed = dict(zip(myconverter.qubits[:num_fixed], fixed_states))\n", + "\n", + "expression, operands = myconverter.batched_amplitudes(fixed)\n", + "batched_amplitudes = contract(expression, *operands)\n", + "\n", + "for ibit, jbit in itertools.product(range(2), repeat=2):\n", + " bitstring = fixed_states + str(ibit) + str(jbit)\n", + " index = fixed_index + (ibit, jbit)\n", + " amplitude = batched_amplitudes[(ibit, jbit)]\n", + " amplitude_from_sv = sv[index]\n", + " amp_diff = abs(amplitude-amplitude_from_sv)\n", + " print(f'for bitstring {bitstring}, amplitude: {amplitude:.4f}, difference from state vector: {amp_diff:.4f}')" + ] + }, + { + "cell_type": "markdown", + "id": 
"5ae16861-c6eb-4ec0-a527-c65aed4cbcd8", + "metadata": {}, + "source": [ + "### compute expectation value $\\langle \\psi|\\hat{O}| \\psi\\rangle$\n", + "\n", + "In this example, we compute the expectation value for a pauli string $IXXZZII$. For comparision, we compute the same value via contracting reduced density matrix with the operator." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "079beec1-8bc2-4aec-a272-01e160e104f1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "expectation value for IXXZZII: (0.13257764904443228+0j)\n", + "is expectation value in agreement? True\n" + ] + } + ], + "source": [ + "pauli_string = 'IXXZZII'\n", + "expression, operands = myconverter.expectation(pauli_string, lightcone=True)\n", + "expec = contract(expression, *operands)\n", + "print(f'expectation value for {pauli_string}: {expec}')\n", + "\n", + "# expectation value from reduced density matrix\n", + "qubits = myconverter.qubits\n", + "where = qubits[1:5]\n", + "rdm_expression, rdm_operands = myconverter.reduced_density_matrix(where, lightcone=True)\n", + "rdm = contract(rdm_expression, *rdm_operands)\n", + "\n", + "pauli_x = cp.asarray([[0,1],[1,0]], dtype=myconverter.dtype)\n", + "pauli_z = cp.asarray([[1,0],[0,-1]], dtype=myconverter.dtype)\n", + "expec_from_rdm = cp.einsum('abcdABCD,aA,bB,cC,dD->', rdm, pauli_x, pauli_x, pauli_z, pauli_z)\n", + "\n", + "print(f\"is expectation value in agreement?\", cp.allclose(expec, expec_from_rdm))" + ] + }, { "cell_type": "markdown", "id": "24f09f1f-6507-4804-ac12-d96d8ee332da", @@ -189,7 +282,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "id": "af4a503c-2f3e-4093-89ef-06457f68de9c", "metadata": {}, "outputs": [ @@ -234,7 +327,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.9.12" } }, "nbformat": 4, diff --git a/python/samples/cutensornet/coarse/example21.py 
b/python/samples/cutensornet/coarse/example21.py new file mode 100644 index 0000000..910b58e --- /dev/null +++ b/python/samples/cutensornet/coarse/example21.py @@ -0,0 +1,17 @@ +""" +Example illustrating lazy conjugation using tensor qualifiers. +""" +import numpy as np + +from cuquantum import contract, tensor_qualifiers_dtype + +a = np.random.rand(3, 2) + 1j * np.random.rand(3, 2) +b = np.random.rand(2, 3) + 1j * np.random.rand(2, 3) + +# Specify tensor qualifiers for the second tensor operand 'b'. +qualifiers = np.zeros((2,), dtype=tensor_qualifiers_dtype) +qualifiers[1]['is_conjugate'] = True + +r = contract("ij,jk", a, b, qualifiers=qualifiers) +s = np.einsum("ij,jk", a, b.conj()) +assert np.allclose(r, s), "Incorrect results for a * conjugate(b)" diff --git a/python/samples/cutensornet/coarse/example22_mpi_auto.py b/python/samples/cutensornet/coarse/example22_mpi_auto.py new file mode 100644 index 0000000..76b825c --- /dev/null +++ b/python/samples/cutensornet/coarse/example22_mpi_auto.py @@ -0,0 +1,79 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +""" +Example illustrating automatically parallelizing slice-based tensor network contraction with cuQuantum using MPI. +Here we use: + + - the buffer interface APIs offered by mpi4py v3.1.0+ for communicating ndarray-like objects + - CUDA-aware MPI (note: as of cuTensorNet v2.0.0 using non-CUDA-aware MPI is not supported + and would cause segfault). 
+ - cuQuantum 22.11+ (cuTensorNet v2.0.0+) for the new distributed contraction feature + +$ mpiexec -n 4 python example22_mpi_auto.py +""" +import os + +import cupy as cp +from cupy.cuda.runtime import getDeviceCount +from mpi4py import MPI # this line initializes MPI + +import cuquantum +from cuquantum import cutensornet as cutn + + +root = 0 +comm = MPI.COMM_WORLD +rank, size = comm.Get_rank(), comm.Get_size() + +# Check if the env var is set +if "CUTENSORNET_COMM_LIB" not in os.environ: + raise RuntimeError("need to set CUTENSORNET_COMM_LIB to the path of the MPI wrapper library") + +if not os.path.isfile(os.environ["CUTENSORNET_COMM_LIB"]): + raise RuntimeError("CUTENSORNET_COMM_LIB does not point to the path of the MPI wrapper library") + +# Assign the device for each process. +device_id = rank % getDeviceCount() +cp.cuda.Device(device_id).use() + +expr = 'ehl,gj,edhg,bif,d,c,k,iklj,cf,a->ba' +shapes = [(8, 2, 5), (5, 7), (8, 8, 2, 5), (8, 6, 3), (8,), (6,), (5,), (6, 5, 5, 7), (6, 3), (3,)] + +# Set the operand data on root. Since we use the buffer interface APIs offered by mpi4py for communicating array +# objects, we can directly use device arrays (cupy.ndarray, for example) as we assume mpi4py is built against +# a CUDA-aware MPI. +if rank == root: + operands = [cp.random.rand(*shape) for shape in shapes] +else: + operands = [cp.empty(shape) for shape in shapes] + +# Broadcast the operand data. Throughout this sample we take advantage of the upper-case mpi4py APIs +# that support communicating CPU & GPU buffers (without staging) to reduce serialization overhead for +# array-like objects. This capability requires mpi4py v3.1.0+.
+for operand in operands: + comm.Bcast(operand, root) + +# Bind the communicator to the library handle +handle = cutn.create() +cutn.distributed_reset_configuration( + handle, *cutn.get_mpi_comm_pointer(comm) +) + +# Compute the contraction (with distributed path finding & contraction execution) +result = cuquantum.contract(expr, *operands, options={'device_id' : device_id, 'handle': handle}) + +# Create a new GPU buffer for verification +result_cp = cp.empty_like(result) + +# Sum the partial contribution from each process on root, with GPU +if rank == root: + comm.Reduce(sendbuf=MPI.IN_PLACE, recvbuf=result_cp, op=MPI.SUM, root=root) +else: + comm.Reduce(sendbuf=result_cp, recvbuf=None, op=MPI.SUM, root=root) + +# Check correctness. +if rank == root: + result_cp = cp.einsum(expr, *operands, optimize=True) + print("Does the cuQuantum parallel contraction result match the cupy.einsum result?", cp.allclose(result, result_cp)) diff --git a/python/samples/cutensornet/fine/example4_mpi_nccl.py b/python/samples/cutensornet/fine/example4_mpi_nccl.py new file mode 100644 index 0000000..c87a509 --- /dev/null +++ b/python/samples/cutensornet/fine/example4_mpi_nccl.py @@ -0,0 +1,99 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +""" +Example illustrating slice-based parallel tensor network contraction with cuQuantum using NCCL and MPI. Here +we create the input tensors directly on the GPU using CuPy since NCCL only supports GPU buffers. + +The low-level Python wrapper for NCCL is provided by CuPy. MPI (through mpi4py) is only needed to bootstrap +the multiple processes, set up the NCCL communicator, and to communicate data on the CPU. NCCL can be used +without MPI for a "single process multiple GPU" model. + +For users who do not have NCCL installed already, CuPy provides detailed instructions on how to install +it for both pip and conda users when "import cupy.cuda.nccl" fails. 
+ +We recommend that those using CuPy v10+ use CuPy's high-level "cupyx.distributed" module to avoid having to +manipulate GPU pointers in Python. + +Note that with recent NCCL, GPUs cannot be oversubscribed (not more than one process per GPU). Users will +see an NCCL error if the number of processes on a node exceeds the number of GPUs on that node. + +$ mpiexec -n 4 python example4_mpi_nccl.py +""" + +import cupy as cp +from cupy.cuda import nccl +from cupy.cuda.runtime import getDeviceCount +from mpi4py import MPI + +from cuquantum import Network + +# Set up the MPI environment. +root = 0 +comm_mpi = MPI.COMM_WORLD +rank, size = comm_mpi.Get_rank(), comm_mpi.Get_size() + +# Assign the device for each process. +device_id = rank % getDeviceCount() + +# Define the tensor network topology. +expr = 'ehl,gj,edhg,bif,d,c,k,iklj,cf,a->ba' +shapes = [(8, 2, 5), (5, 7), (8, 8, 2, 5), (8, 6, 3), (8,), (6,), (5,), (6, 5, 5, 7), (6, 3), (3,)] + +# Note that all NCCL operations must be performed in the correct device context. +cp.cuda.Device(device_id).use() + +# Set up the NCCL communicator. +nccl_id = nccl.get_unique_id() if rank == root else None +nccl_id = comm_mpi.bcast(nccl_id, root) +comm_nccl = nccl.NcclCommunicator(size, nccl_id, rank) + +# Set the operand data on root. +if rank == root: + operands = [cp.random.rand(*shape) for shape in shapes] +else: + operands = [cp.empty(shape) for shape in shapes] + +# Broadcast the operand data. We pass in the CuPy ndarray data pointers to the NCCL APIs. +stream_ptr = cp.cuda.get_current_stream().ptr +for operand in operands: + comm_nccl.broadcast(operand.data.ptr, operand.data.ptr, operand.size, nccl.NCCL_FLOAT64, root, stream_ptr) + +# Create network object. +network = Network(expr, *operands) + +# Compute the path on all ranks with 8 samples for hyperoptimization. Force slicing to enable parallel contraction. 
+path, info = network.contract_path(optimize={'samples': 8, 'slicing': {'min_slices': max(16, size)}}) + +# Select the best path from all ranks. Note that we still use the MPI communicator here for simplicity. +opt_cost, sender = comm_mpi.allreduce(sendobj=(info.opt_cost, rank), op=MPI.MINLOC) +if rank == root: + print(f"Process {sender} has the path with the lowest FLOP count {opt_cost}.") + +# Broadcast info from the sender to all other ranks. +info = comm_mpi.bcast(info, sender) + +# Set path and slices. +path, info = network.contract_path(optimize={'path': info.path, 'slicing': info.slices}) + +# Calculate this process's share of the slices. +num_slices = info.num_slices +chunk, extra = num_slices // size, num_slices % size +slice_begin = rank * chunk + min(rank, extra) +slice_end = num_slices if rank == size - 1 else (rank + 1) * chunk + min(rank + 1, extra) +slices = range(slice_begin, slice_end) + +print(f"Process {rank} is processing slice range: {slices}.") + +# Contract the group of slices the process is responsible for. +result = network.contract(slices=slices) + +# Sum the partial contribution from each process on root. +stream_ptr = cp.cuda.get_current_stream().ptr +comm_nccl.reduce(result.data.ptr, result.data.ptr, result.size, nccl.NCCL_FLOAT64, nccl.NCCL_SUM, root, stream_ptr) + +# Check correctness. 
+if rank == root: + result_cp = cp.einsum(expr, *operands, optimize=True) + print("Does the cuQuantum parallel contraction result match the cupy.einsum result?", cp.allclose(result, result_cp)) diff --git a/python/samples/cutensornet/tensornet_example.py b/python/samples/cutensornet/tensornet_example.py old mode 100644 new mode 100755 index 3461f33..ab030e3 --- a/python/samples/cutensornet/tensornet_example.py +++ b/python/samples/cutensornet/tensornet_example.py @@ -21,59 +21,43 @@ print("GPU-minor:", props["minor"]) print("========================") -########################################################## -# Computing: D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y} -########################################################## +###################################################################################### +# Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m} +###################################################################################### -print("Include headers and define data types") +print("Include headers and define data types.") data_type = cuquantum.cudaDataType.CUDA_R_32F compute_type = cuquantum.ComputeType.COMPUTE_32F -numInputs = 3 +num_inputs = 4 # Create an array of modes -modesA = [ord(c) for c in ('m','h','k','n')] -modesB = [ord(c) for c in ('u','k','h')] -modesC = [ord(c) for c in ('x','u','y')] -modesD = [ord(c) for c in ('m','x','n','y')] +modes_A = [ord(c) for c in ('a','b','c','d','e','f')] +modes_B = [ord(c) for c in ('b','g','h','e','i','j')] +modes_C = [ord(c) for c in ('m','a','g','f','i','k')] +modes_D = [ord(c) for c in ('l','c','h','d','j','m')] +modes_R = [ord(c) for c in ('k','l')] # Create an array of extents (shapes) for each tensor -extentA = (96, 64, 64, 96) -extentB = (96, 64, 64) -extentC = (64, 96, 64) -extentD = (96, 64, 96, 64) - -print("Define network, modes, and extents") - -############################ -# Allocate & initialize data -############################ - -A_d = 
cp.random.random((np.prod(extentA),), dtype=np.float32) -B_d = cp.random.random((np.prod(extentB),), dtype=np.float32) -C_d = cp.random.random((np.prod(extentC),), dtype=np.float32) -D_d = cp.empty((np.prod(extentD),), dtype=np.float32) -rawDataIn_d = (A_d.data.ptr, B_d.data.ptr, C_d.data.ptr) - -A = cp.asnumpy(A_d) -B = cp.asnumpy(B_d) -C = cp.asnumpy(C_d) -D = np.empty(D_d.shape, dtype=np.float32) - -#################### -# Allocate workspace -#################### - -# this is one way to proceed: query the currently available memory on the -# device, and allocate a big fraction of it... -#freeMem, totalMem = dev.mem_info -#worksize = int(freeMem * 0.9) -# ...but in this case we can set a much tighter upper bound, since we know -# the rough answer already -worksize = 128*1024**2 # = 128 MB, can be smaller -work = cp.cuda.alloc(worksize) - -print("Allocate memory for data and workspace, and initialize data.") +dim = 8 +extent_A = (dim,) * 6 +extent_B = (dim,) * 6 +extent_C = (dim,) * 6 +extent_D = (dim,) * 6 +extent_R = (dim,) * 2 + +print("Define network, modes, and extents.") + +################# +# Initialize data +################# + +A_d = cp.random.random((np.prod(extent_A),), dtype=np.float32) +B_d = cp.random.random((np.prod(extent_B),), dtype=np.float32) +C_d = cp.random.random((np.prod(extent_C),), dtype=np.float32) +D_d = cp.random.random((np.prod(extent_D),), dtype=np.float32) +R_d = cp.zeros((np.prod(extent_R),), dtype=np.float32) +raw_data_in_d = (A_d.data.ptr, B_d.data.ptr, C_d.data.ptr, D_d.data.ptr) ############# # cuTensorNet @@ -82,73 +66,68 @@ stream = cp.cuda.Stream() handle = cutn.create() -nmodeA = len(modesA) -nmodeB = len(modesB) -nmodeC = len(modesC) -nmodeD = len(modesD) +nmode_A = len(modes_A) +nmode_B = len(modes_B) +nmode_C = len(modes_C) +nmode_D = len(modes_D) +nmode_R = len(modes_R) ############################### # Create Contraction Descriptor ############################### -# These also work, but require a bit more keystrokes 
-#modesA = np.asarray(modesA, dtype=np.int32) -#modesB = np.asarray(modesB, dtype=np.int32) -#modesC = np.asarray(modesC, dtype=np.int32) -#modesIn = (modesA.ctypes.data, modesB.ctypes.data, modesC.ctypes.data) -#extentA = np.asarray(extentA, dtype=np.int64) -#extentB = np.asarray(extentB, dtype=np.int64) -#extentC = np.asarray(extentC, dtype=np.int64) -#extentsIn = (extentA.ctypes.data, extentB.ctypes.data, extentC.ctypes.data) - -modesIn = (modesA, modesB, modesC) -extentsIn = (extentA, extentB, extentC) -numModesIn = (nmodeA, nmodeB, nmodeC) - -# strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout -stridesIn = (0, 0, 0) - -# compute the alignments -# we hard-code them here because CuPy arrays are at least 256B aligned -alignmentsIn = (256, 256, 256) -alignmentOut = 256 - -# setup tensor network -descNet = cutn.create_network_descriptor(handle, - numInputs, numModesIn, extentsIn, stridesIn, modesIn, alignmentsIn, # inputs - nmodeD, extentD, 0, modesD, alignmentOut, # output +modes_in = (modes_A, modes_B, modes_C, modes_D) +extents_in = (extent_A, extent_B, extent_C, extent_D) +num_modes_in = (nmode_A, nmode_B, nmode_C, nmode_D) + +# Strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout +strides_in = (0, 0, 0, 0) + +# Set up the tensor qualifiers for all input tensors +qualifiers_in = np.zeros(num_inputs, dtype=cutn.tensor_qualifiers_dtype) + +# Set up tensor network +desc_net = cutn.create_network_descriptor(handle, + num_inputs, num_modes_in, extents_in, strides_in, modes_in, qualifiers_in, # inputs + nmode_R, extent_R, 0, modes_R, # output data_type, compute_type) print("Initialize the cuTensorNet library and create a network descriptor.") +##################################################### +# Choose workspace limit based on available resources +##################################################### + +free_mem, total_mem = 
dev.mem_info +workspace_limit = int(free_mem * 0.9) + ############################################## # Find "optimal" contraction order and slicing ############################################## -optimizerConfig = cutn.create_contraction_optimizer_config(handle) +optimizer_config = cutn.create_contraction_optimizer_config(handle) # Set the value of the partitioner imbalance factor to 30 (if desired) imbalance_dtype = cutn.contraction_optimizer_config_get_attribute_dtype( cutn.ContractionOptimizerConfigAttribute.GRAPH_IMBALANCE_FACTOR) imbalance_factor = np.asarray((30,), dtype=imbalance_dtype) cutn.contraction_optimizer_config_set_attribute( - handle, optimizerConfig, cutn.ContractionOptimizerConfigAttribute.GRAPH_IMBALANCE_FACTOR, + handle, optimizer_config, cutn.ContractionOptimizerConfigAttribute.GRAPH_IMBALANCE_FACTOR, imbalance_factor.ctypes.data, imbalance_factor.dtype.itemsize) -optimizerInfo = cutn.create_contraction_optimizer_info(handle, descNet) +optimizer_info = cutn.create_contraction_optimizer_info(handle, desc_net) -cutn.contraction_optimize( - handle, descNet, optimizerConfig, worksize, optimizerInfo) +cutn.contraction_optimize(handle, desc_net, optimizer_config, workspace_limit, optimizer_info) -numSlices_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( +num_slices_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( cutn.ContractionOptimizerInfoAttribute.NUM_SLICES) -numSlices = np.zeros((1,), dtype=numSlices_dtype) +num_slices = np.zeros((1,), dtype=num_slices_dtype) cutn.contraction_optimizer_info_get_attribute( - handle, optimizerInfo, cutn.ContractionOptimizerInfoAttribute.NUM_SLICES, - numSlices.ctypes.data, numSlices.dtype.itemsize) -numSlices = int(numSlices) + handle, optimizer_info, cutn.ContractionOptimizerInfoAttribute.NUM_SLICES, + num_slices.ctypes.data, num_slices.dtype.itemsize) +num_slices = int(num_slices) -assert numSlices > 0 +assert num_slices > 0 print("Find an optimized contraction path with cuTensorNet 
optimizer.") @@ -156,20 +135,18 @@ # Initialize all pair-wise contraction plans (for cuTENSOR) ########################################################### -workDesc = cutn.create_workspace_descriptor(handle) -cutn.workspace_compute_sizes(handle, descNet, optimizerInfo, workDesc) -requiredWorkspaceSize = cutn.workspace_get_size( - handle, workDesc, +work_desc = cutn.create_workspace_descriptor(handle) +cutn.workspace_compute_contraction_sizes(handle, desc_net, optimizer_info, work_desc) +required_workspace_size = cutn.workspace_get_size( + handle, work_desc, cutn.WorksizePref.MIN, cutn.Memspace.DEVICE) -if worksize < requiredWorkspaceSize: - raise MemoryError("Not enough workspace memory is available.") +work = cp.cuda.alloc(required_workspace_size) cutn.workspace_set( - handle, workDesc, + handle, work_desc, cutn.Memspace.DEVICE, - work.ptr, worksize) -plan = cutn.create_contraction_plan( - handle, descNet, optimizerInfo, workDesc) + work.ptr, required_workspace_size) +plan = cutn.create_contraction_plan(handle, desc_net, optimizer_info, work_desc) ################################################################################### # Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel @@ -177,45 +154,41 @@ pref = cutn.create_contraction_autotune_preference(handle) -numAutotuningIterations = 5 # may be 0 +num_autotuning_iterations = 5 # may be 0 n_iter_dtype = cutn.contraction_autotune_preference_get_attribute_dtype( cutn.ContractionAutotunePreferenceAttribute.MAX_ITERATIONS) -numAutotuningIterations = np.asarray([numAutotuningIterations], dtype=n_iter_dtype) +num_autotuning_iterations = np.asarray([num_autotuning_iterations], dtype=n_iter_dtype) cutn.contraction_autotune_preference_set_attribute( handle, pref, cutn.ContractionAutotunePreferenceAttribute.MAX_ITERATIONS, - numAutotuningIterations.ctypes.data, numAutotuningIterations.dtype.itemsize) + num_autotuning_iterations.ctypes.data, num_autotuning_iterations.dtype.itemsize) -# 
modify the plan again to find the best pair-wise contractions +# Modify the plan again to find the best pair-wise contractions cutn.contraction_autotune( - handle, plan, rawDataIn_d, D_d.data.ptr, - workDesc, pref, stream.ptr) + handle, plan, raw_data_in_d, R_d.data.ptr, + work_desc, pref, stream.ptr) cutn.destroy_contraction_autotune_preference(pref) print("Create a contraction plan for cuTENSOR and optionally auto-tune it.") -##### -# Run -##### +########### +# Execution +########### minTimeCUTENSOR = 1e100 -numRuns = 3 # to get stable perf results +num_runs = 3 # to get stable perf results e1 = cp.cuda.Event() e2 = cp.cuda.Event() -sliceGroup = cutn.create_slice_group_from_id_range(handle, 0, numSlices, 1) - -for i in range(numRuns): - # restore output - D_d.data.copy_from(D.ctypes.data, D.size * D.dtype.itemsize) - dev.synchronize() +slice_group = cutn.create_slice_group_from_id_range(handle, 0, num_slices, 1) +for i in range(num_runs): # Contract over all slices. # A user may choose to parallelize over the slices across multiple devices. 
e1.record() cutn.contract_slices( - handle, plan, rawDataIn_d, D_d.data.ptr, False, - workDesc, sliceGroup, stream.ptr) + handle, plan, raw_data_in_d, R_d.data.ptr, False, + work_desc, slice_group, stream.ptr) e2.record() # Synchronize and measure timing @@ -223,16 +196,20 @@ time = cp.cuda.get_elapsed_time(e1, e2) / 1000 # ms -> s minTimeCUTENSOR = minTimeCUTENSOR if minTimeCUTENSOR < time else time - print("Contract the network, each slice uses the same contraction plan.") -# recall that we set strides to null (0), so the data are in F-contiguous layout -A_d = A_d.reshape(extentA, order='F') -B_d = B_d.reshape(extentB, order='F') -C_d = C_d.reshape(extentC, order='F') -D_d = D_d.reshape(extentD, order='F') -out = cp.einsum("mhkn,ukh,xuy->mxny", A_d, B_d, C_d) -if not cp.allclose(out, D_d): +# free up the workspace +del work + +# Recall that we set strides to null (0), so the data are in F-contiguous layout +A_d = A_d.reshape(extent_A, order='F') +B_d = B_d.reshape(extent_B, order='F') +C_d = C_d.reshape(extent_C, order='F') +D_d = D_d.reshape(extent_D, order='F') +R_d = R_d.reshape(extent_R, order='F') +path, _ = cuquantum.einsum_path("abcdef,bgheij,magfik,lchdjm->kl", A_d, B_d, C_d, D_d) +out = cp.einsum("abcdef,bgheij,magfik,lchdjm->kl", A_d, B_d, C_d, D_d, optimize=path) +if not cp.allclose(out, R_d): raise RuntimeError("result is incorrect") print("Check cuTensorNet result against that of cupy.einsum().") @@ -242,20 +219,20 @@ cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT) flops = np.zeros((1,), dtype=flops_dtype) cutn.contraction_optimizer_info_get_attribute( - handle, optimizerInfo, cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT, + handle, optimizer_info, cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT, flops.ctypes.data, flops.dtype.itemsize) flops = float(flops) -print(f"numSlices: {numSlices}") -print(f"{minTimeCUTENSOR * 1000 / numSlices} ms / slice") -print(f"{flops/1e9/minTimeCUTENSOR} GFLOPS/s") +print(f"num_slices: {num_slices}") 
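The generalized column-major (F-contiguous) layout that cuTensorNet assumes when strides are passed as 0, mentioned in the comment above, can be illustrated without a GPU. This is a minimal sketch in plain Python, assuming strides are counted in elements; `column_major_strides` and `flat_index` are hypothetical helper names, not cuTensorNet APIs.

```python
def column_major_strides(extents):
    """Element strides for a generalized column-major layout (first mode varies fastest)."""
    strides = []
    step = 1
    for extent in extents:
        strides.append(step)
        step *= extent
    return tuple(strides)


def flat_index(index, strides):
    """Offset of a multi-index into the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))


extents = (2, 3, 4)
strides = column_major_strides(extents)
print(strides)                         # (1, 2, 6)
print(flat_index((1, 2, 3), strides))  # 1*1 + 2*2 + 3*6 = 23
```

This is the same element ordering that `reshape(extent, order='F')` recovers from the flat buffer in the correctness check.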
+print(f"{minTimeCUTENSOR * 1000 / num_slices} ms / slice") +print(f"{flops / 1e9 / minTimeCUTENSOR} GFLOPS/s") -cutn.destroy_slice_group(sliceGroup) +cutn.destroy_slice_group(slice_group) cutn.destroy_contraction_plan(plan) -cutn.destroy_contraction_optimizer_info(optimizerInfo) -cutn.destroy_contraction_optimizer_config(optimizerConfig) -cutn.destroy_network_descriptor(descNet) -cutn.destroy_workspace_descriptor(workDesc) +cutn.destroy_contraction_optimizer_info(optimizer_info) +cutn.destroy_contraction_optimizer_config(optimizer_config) +cutn.destroy_network_descriptor(desc_net) +cutn.destroy_workspace_descriptor(work_desc) cutn.destroy(handle) print("Free resource and exit.") diff --git a/python/samples/cutensornet/tensornet_example_mpi.py b/python/samples/cutensornet/tensornet_example_mpi.py old mode 100644 new mode 100755 index 13c1617..956c63e --- a/python/samples/cutensornet/tensornet_example_mpi.py +++ b/python/samples/cutensornet/tensornet_example_mpi.py @@ -17,11 +17,11 @@ print("*** Printing is done only from the root process to prevent jumbled messages ***") print(f"The number of processes is {size}") -# Get cuTensorNet version and device properties. -numDevices = cp.cuda.runtime.getDeviceCount() -deviceId = rank % numDevices # We assume that the processes are mapped to nodes in contiguous chunks. 
-dev = cp.cuda.Device(deviceId) +num_devices = cp.cuda.runtime.getDeviceCount() +device_id = rank % num_devices +dev = cp.cuda.Device(device_id) dev.use() + props = cp.cuda.runtime.getDeviceProperties(dev.id) if rank == root: print("cuTensorNet-vers:", cutn.get_version()) @@ -34,60 +34,62 @@ print("GPU-minor:", props["minor"]) print("========================") -########################################################## -# Computing: D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y} -########################################################## +###################################################################################### +# Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m} +###################################################################################### if rank == root: - print("Include headers and define data types") + print("Include headers and define data types.") data_type = cuquantum.cudaDataType.CUDA_R_32F compute_type = cuquantum.ComputeType.COMPUTE_32F -numInputs = 3 +num_inputs = 4 # Create an array of modes -modesA = [ord(c) for c in ('m','h','k','n')] -modesB = [ord(c) for c in ('u','k','h')] -modesC = [ord(c) for c in ('x','u','y')] -modesD = [ord(c) for c in ('m','x','n','y')] +modes_A = [ord(c) for c in ('a','b','c','d','e','f')] +modes_B = [ord(c) for c in ('b','g','h','e','i','j')] +modes_C = [ord(c) for c in ('m','a','g','f','i','k')] +modes_D = [ord(c) for c in ('l','c','h','d','j','m')] +modes_R = [ord(c) for c in ('k','l')] # Create an array of extents (shapes) for each tensor -extentA = (96, 64, 64, 96) -extentB = (96, 64, 64) -extentC = (64, 96, 64) -extentD = (96, 64, 96, 64) +dim = 8 +extent_A = (dim,) * 6 +extent_B = (dim,) * 6 +extent_C = (dim,) * 6 +extent_D = (dim,) * 6 +extent_R = (dim,) * 2 if rank == root: - print("Define network, modes, and extents") + print("Define network, modes, and extents.") -############################ -# Allocate & initialize data -############################ 
+################# +# Initialize data +################# if rank == root: - A = np.random.random((np.prod(extentA),)).astype(np.float32) - B = np.random.random((np.prod(extentB),)).astype(np.float32) - C = np.random.random((np.prod(extentC),)).astype(np.float32) + A = np.random.random(np.prod(extent_A)).astype(np.float32) + B = np.random.random(np.prod(extent_B)).astype(np.float32) + C = np.random.random(np.prod(extent_C)).astype(np.float32) + D = np.random.random(np.prod(extent_D)).astype(np.float32) else: - A = np.empty((np.prod(extentA),), dtype=np.float32) - B = np.empty((np.prod(extentB),), dtype=np.float32) - C = np.empty((np.prod(extentC),), dtype=np.float32) -D = np.empty(extentD, dtype=np.float32) + A = np.empty(np.prod(extent_A), dtype=np.float32) + B = np.empty(np.prod(extent_B), dtype=np.float32) + C = np.empty(np.prod(extent_C), dtype=np.float32) + D = np.empty(np.prod(extent_D), dtype=np.float32) +R = np.empty(extent_R) -# Broadcast data to all ranks. comm.Bcast(A, root) comm.Bcast(B, root) comm.Bcast(C, root) +comm.Bcast(D, root) -# Copy data onto the device on all ranks. 
A_d = cp.asarray(A) B_d = cp.asarray(B) C_d = cp.asarray(C) -D_d = cp.empty((np.prod(extentD),), dtype=np.float32) -rawDataIn_d = (A_d.data.ptr, B_d.data.ptr, C_d.data.ptr) - -if rank == root: - print("Allocate memory for data, calculate workspace limit, and initialize data.") +D_d = cp.asarray(D) +R_d = cp.empty(np.prod(extent_R), dtype=np.float32) +raw_data_in_d = (A_d.data.ptr, B_d.data.ptr, C_d.data.ptr, D_d.data.ptr) ############# # cuTensorNet @@ -96,41 +98,30 @@ stream = cp.cuda.Stream() handle = cutn.create() -nmodeA = len(modesA) -nmodeB = len(modesB) -nmodeC = len(modesC) -nmodeD = len(modesD) +nmode_A = len(modes_A) +nmode_B = len(modes_B) +nmode_C = len(modes_C) +nmode_D = len(modes_D) +nmode_R = len(modes_R) ############################### # Create Contraction Descriptor ############################### -# These also work, but require a bit more keystrokes -#modesA = np.asarray(modesA, dtype=np.int32) -#modesB = np.asarray(modesB, dtype=np.int32) -#modesC = np.asarray(modesC, dtype=np.int32) -#modesIn = (modesA.ctypes.data, modesB.ctypes.data, modesC.ctypes.data) -#extentA = np.asarray(extentA, dtype=np.int64) -#extentB = np.asarray(extentB, dtype=np.int64) -#extentC = np.asarray(extentC, dtype=np.int64) -#extentsIn = (extentA.ctypes.data, extentB.ctypes.data, extentC.ctypes.data) - -modesIn = (modesA, modesB, modesC) -extentsIn = (extentA, extentB, extentC) -numModesIn = (nmodeA, nmodeB, nmodeC) - -# strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout -stridesIn = (0, 0, 0) - -# compute the alignments -# we hard-code them here because CuPy arrays are at least 256B aligned -alignmentsIn = (256, 256, 256) -alignmentOut = 256 - -# setup tensor network -descNet = cutn.create_network_descriptor(handle, - numInputs, numModesIn, extentsIn, stridesIn, modesIn, alignmentsIn, # inputs - nmodeD, extentD, 0, modesD, alignmentOut, # output +modes_in = (modes_A, modes_B, modes_C, modes_D) +extents_in = 
(extent_A, extent_B, extent_C, extent_D) +num_modes_in = (nmode_A, nmode_B, nmode_C, nmode_D) + +# Strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout +strides_in = (0, 0, 0, 0) + +# Set up the tensor qualifiers for all input tensors +qualifiers_in = np.zeros(num_inputs, dtype=cutn.tensor_qualifiers_dtype) + +# Set up tensor network +desc_net = cutn.create_network_descriptor(handle, + num_inputs, num_modes_in, extents_in, strides_in, modes_in, qualifiers_in, # inputs + nmode_R, extent_R, 0, modes_R, # output data_type, compute_type) if rank == root: @@ -140,38 +131,33 @@ # Choose workspace limit based on available resources ##################################################### -freeMem, totalMem = dev.mem_info -totalMem = comm.allreduce(totalMem, MPI.MIN) -workspaceLimit = int(totalMem * 0.9) +free_mem, total_mem = dev.mem_info +free_mem = comm.allreduce(free_mem, MPI.MIN) +workspace_limit = int(free_mem * 0.9) ############################################## # Find "optimal" contraction order and slicing ############################################## -optimizerConfig = cutn.create_contraction_optimizer_config(handle) -optimizerInfo = cutn.create_contraction_optimizer_info(handle, descNet) - -# Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny -# example with 3 operands, all processes will compute the same globally optimal path. This is not the case for large -# tensor networks. For large networks, hyperoptimization is also beneficial and can be enabled by setting the -# optimizer config attribute cutn.ContractionOptimizerConfigAttribute.HYPER_NUM_SAMPLES. 
+optimizer_config = cutn.create_contraction_optimizer_config(handle) +optimizer_info = cutn.create_contraction_optimizer_info(handle, desc_net) # Force slicing min_slices_dtype = cutn.contraction_optimizer_config_get_attribute_dtype( cutn.ContractionOptimizerConfigAttribute.SLICER_MIN_SLICES) min_slices_factor = np.asarray((size,), dtype=min_slices_dtype) cutn.contraction_optimizer_config_set_attribute( - handle, optimizerConfig, cutn.ContractionOptimizerConfigAttribute.SLICER_MIN_SLICES, + handle, optimizer_config, cutn.ContractionOptimizerConfigAttribute.SLICER_MIN_SLICES, min_slices_factor.ctypes.data, min_slices_factor.dtype.itemsize) cutn.contraction_optimize( - handle, descNet, optimizerConfig, workspaceLimit, optimizerInfo) + handle, desc_net, optimizer_config, workspace_limit, optimizer_info) flops_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT) flops = np.zeros((1,), dtype=flops_dtype) cutn.contraction_optimizer_info_get_attribute( - handle, optimizerInfo, cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT, + handle, optimizer_info, cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT, flops.ctypes.data, flops.dtype.itemsize) flops = float(flops) @@ -180,9 +166,9 @@ if rank == root: print(f"Process {sender} has the path with the lowest FLOP count {flops}.") -# Get buffer size for optimizerInfo and broadcast it. +# Get buffer size for optimizer_info and broadcast it. if rank == sender: - bufSize = cutn.contraction_optimizer_info_get_packed_size(handle, optimizerInfo) + bufSize = cutn.contraction_optimizer_info_get_packed_size(handle, optimizer_info) else: bufSize = 0 # placeholder bufSize = comm.bcast(bufSize, sender) @@ -190,60 +176,59 @@ # Allocate buffer. buf = np.empty((bufSize,), dtype=np.int8) -# Pack optimizerInfo on sender and broadcast it. +# Pack optimizer_info on sender and broadcast it. 
if rank == sender: - cutn.contraction_optimizer_info_pack_data(handle, optimizerInfo, buf, bufSize) + cutn.contraction_optimizer_info_pack_data(handle, optimizer_info, buf, bufSize) comm.Bcast(buf, sender) -# Unpack optimizerInfo from buffer. +# Unpack optimizer_info from buffer. if rank != sender: cutn.update_contraction_optimizer_info_from_packed_data( - handle, buf, bufSize, optimizerInfo) + handle, buf, bufSize, optimizer_info) -numSlices_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( +num_slices_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( cutn.ContractionOptimizerInfoAttribute.NUM_SLICES) -numSlices = np.zeros((1,), dtype=numSlices_dtype) +num_slices = np.zeros((1,), dtype=num_slices_dtype) cutn.contraction_optimizer_info_get_attribute( - handle, optimizerInfo, cutn.ContractionOptimizerInfoAttribute.NUM_SLICES, - numSlices.ctypes.data, numSlices.dtype.itemsize) -numSlices = int(numSlices) + handle, optimizer_info, cutn.ContractionOptimizerInfoAttribute.NUM_SLICES, + num_slices.ctypes.data, num_slices.dtype.itemsize) +num_slices = int(num_slices) -assert numSlices > 0 +assert num_slices > 0 # Calculate each process's share of the slices. 
-procChunk = numSlices / size -extra = numSlices % size -procSliceBegin = rank * procChunk + min(rank, extra) -procSliceEnd = numSlices if rank == size - 1 else (rank + 1) * procChunk + min(rank + 1, extra) +proc_chunk = num_slices // size +extra = num_slices % size +proc_slice_begin = rank * proc_chunk + min(rank, extra) +proc_slice_end = num_slices if rank == size - 1 else (rank + 1) * proc_chunk + min(rank + 1, extra) if rank == root: print("Find an optimized contraction path with cuTensorNet optimizer.") -############################################################# -# Create workspace descriptor, allocate workspace, and set it -############################################################# - -workDesc = cutn.create_workspace_descriptor(handle) -cutn.workspace_compute_sizes(handle, descNet, optimizerInfo, workDesc) -requiredWorkspaceSize = cutn.workspace_get_size( - handle, workDesc, +########################################################### +# Initialize all pair-wise contraction plans (for cuTENSOR) +########################################################### + +work_desc = cutn.create_workspace_descriptor(handle) +cutn.workspace_compute_contraction_sizes(handle, desc_net, optimizer_info, work_desc) +required_workspace_size = cutn.workspace_get_size( + handle, work_desc, cutn.WorksizePref.MIN, cutn.Memspace.DEVICE) -work = cp.cuda.alloc(requiredWorkspaceSize) +work = cp.cuda.alloc(required_workspace_size) cutn.workspace_set( - handle, workDesc, + handle, work_desc, cutn.Memspace.DEVICE, - work.ptr, requiredWorkspaceSize) + work.ptr, required_workspace_size) if rank == root: print("Allocate workspace.") - + ########################################################### # Initialize all pair-wise contraction plans (for cuTENSOR) ########################################################### -plan = cutn.create_contraction_plan( - handle, descNet, optimizerInfo, workDesc) +plan = cutn.create_contraction_plan(handle, desc_net, optimizer_info, work_desc)
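The per-process slice range computed above can be sketched in plain Python, independent of MPI and cuTensorNet. This is a minimal illustration rather than part of the sample: `slice_range` is a hypothetical helper name, and it assumes floor division so that slice IDs stay integral, as in the `example4_mpi_nccl.py` sample.

```python
def slice_range(rank, size, num_slices):
    """Return the half-open range of slice IDs owned by `rank` (hypothetical helper)."""
    # Every process gets `chunk` slices; the first `extra` processes
    # each take one additional slice so that nothing is left over.
    chunk, extra = divmod(num_slices, size)
    begin = rank * chunk + min(rank, extra)
    end = (rank + 1) * chunk + min(rank + 1, extra)
    return range(begin, end)


# Distribute 10 slices over 4 processes.
for rank in range(4):
    print(rank, list(slice_range(rank, 4, 10)))
```

For the last rank the general formula already evaluates to `num_slices`, so the explicit `rank == size - 1` special case in the sample is a safeguard rather than a different rule. When there are more processes than slices, the trailing ranks receive an empty range.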
################################################################################### # Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel @@ -251,49 +236,42 @@ pref = cutn.create_contraction_autotune_preference(handle) -numAutotuningIterations = 5 # may be 0 +num_autotuning_iterations = 5 # may be 0 n_iter_dtype = cutn.contraction_autotune_preference_get_attribute_dtype( cutn.ContractionAutotunePreferenceAttribute.MAX_ITERATIONS) -numAutotuningIterations = np.asarray([numAutotuningIterations], dtype=n_iter_dtype) +num_autotuning_iterations = np.asarray([num_autotuning_iterations], dtype=n_iter_dtype) cutn.contraction_autotune_preference_set_attribute( handle, pref, cutn.ContractionAutotunePreferenceAttribute.MAX_ITERATIONS, - numAutotuningIterations.ctypes.data, numAutotuningIterations.dtype.itemsize) + num_autotuning_iterations.ctypes.data, num_autotuning_iterations.dtype.itemsize) # modify the plan again to find the best pair-wise contractions cutn.contraction_autotune( - handle, plan, rawDataIn_d, D_d.data.ptr, - workDesc, pref, stream.ptr) + handle, plan, raw_data_in_d, R_d.data.ptr, + work_desc, pref, stream.ptr) cutn.destroy_contraction_autotune_preference(pref) -if rank == root: +if rank == root: print("Create a contraction plan for cuTENSOR and optionally auto-tune it.") -##### -# Run -##### +########### +# Execution +########### minTimeCUTENSOR = 1e100 -numRuns = 3 # to get stable perf results +num_runs = 3 # to get stable perf results e1 = cp.cuda.Event() e2 = cp.cuda.Event() +slice_group = cutn.create_slice_group_from_id_range(handle, proc_slice_begin, proc_slice_end, 1) -# Create a cutensornetSliceGroup_t object from a range of slice IDs. -sliceGroup = cutn.create_slice_group_from_id_range(handle, procSliceBegin, procSliceEnd, 1) - -for i in range(numRuns): - dev.synchronize() - - # Contract over the range of slices this process is responsible for. - - # Don't accumulate into output since we use a one-process-per-gpu model. 
- accumulateOutput = False - +for i in range(num_runs): + # Contract over all slices. + # A user may choose to parallelize over the slices across multiple devices. e1.record() cutn.contract_slices( - handle, plan, rawDataIn_d, D_d.data.ptr, accumulateOutput, - workDesc, sliceGroup, stream.ptr) + handle, plan, raw_data_in_d, R_d.data.ptr, False, + work_desc, slice_group, stream.ptr) e2.record() # Synchronize and measure timing @@ -302,40 +280,53 @@ minTimeCUTENSOR = minTimeCUTENSOR if minTimeCUTENSOR < time else time if rank == root: - print("Contract the network, all slices within the same rank use the same contraction plan.") - print(f"numSlices: {numSlices}") - numSlicesProc = procSliceEnd - procSliceBegin - print(f"numSlices on root process: {numSlicesProc}") - if numSlicesProc > 0: - print(f"{minTimeCUTENSOR * 1000 / numSlicesProc} ms / slice") - -cutn.destroy_slice_group(sliceGroup) -D[...] = cp.asnumpy(D_d).reshape(extentD, order='F') + print("Contract the network, each slice uses the same contraction plan.") + +# free up the workspace +del work + +R[...] = cp.asnumpy(R_d).reshape(extent_R, order='F') # Reduce on root process. if rank == root: - comm.Reduce(MPI.IN_PLACE, D, root=root) + comm.Reduce(MPI.IN_PLACE, R, root=root) else: - comm.Reduce(D, D, root=root) + comm.Reduce(R, R, root=root) # Compute the reference result. 
if rank == root: - # recall that we set strides to null (0), so the data are in F-contiguous layout - A_d = A_d.reshape(extentA, order='F') - B_d = B_d.reshape(extentB, order='F') - C_d = C_d.reshape(extentC, order='F') - D_d = D_d.reshape(extentD, order='F') - out = cp.einsum("mhkn,ukh,xuy->mxny", A_d, B_d, C_d) - if not cp.allclose(out, D): + # Recall that we set strides to null (0), so the data are in F-contiguous layout + A_d = A_d.reshape(extent_A, order='F') + B_d = B_d.reshape(extent_B, order='F') + C_d = C_d.reshape(extent_C, order='F') + D_d = D_d.reshape(extent_D, order='F') + path, _ = cuquantum.einsum_path("abcdef,bgheij,magfik,lchdjm->kl", A_d, B_d, C_d, D_d) + out = cp.einsum("abcdef,bgheij,magfik,lchdjm->kl", A_d, B_d, C_d, D_d, optimize=path) + + if not cp.allclose(out, R): raise RuntimeError("result is incorrect") print("Check cuTensorNet result against that of cupy.einsum().") ####################################################### +flops_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( + cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT) +flops = np.zeros((1,), dtype=flops_dtype) +cutn.contraction_optimizer_info_get_attribute( + handle, optimizer_info, cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT, + flops.ctypes.data, flops.dtype.itemsize) +flops = float(flops) + +if rank == root: + print(f"num_slices: {num_slices}") + print(f"{minTimeCUTENSOR * 1000 / num_slices} ms / slice") + print(f"{flops / 1e9 / minTimeCUTENSOR} GFLOPS/s") + +cutn.destroy_slice_group(slice_group) cutn.destroy_contraction_plan(plan) -cutn.destroy_contraction_optimizer_info(optimizerInfo) -cutn.destroy_contraction_optimizer_config(optimizerConfig) -cutn.destroy_network_descriptor(descNet) -cutn.destroy_workspace_descriptor(workDesc) +cutn.destroy_contraction_optimizer_info(optimizer_info) +cutn.destroy_contraction_optimizer_config(optimizer_config) +cutn.destroy_network_descriptor(desc_net) +cutn.destroy_workspace_descriptor(work_desc) cutn.destroy(handle) if 
rank == root: diff --git a/python/samples/cutensornet/tensornet_example_mpi_auto.py b/python/samples/cutensornet/tensornet_example_mpi_auto.py new file mode 100755 index 0000000..1365d44 --- /dev/null +++ b/python/samples/cutensornet/tensornet_example_mpi_auto.py @@ -0,0 +1,293 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import cupy as cp +import numpy as np +from mpi4py import MPI + +import cuquantum +from cuquantum import cutensornet as cutn + + +root = 0 +comm = MPI.COMM_WORLD +rank, size = comm.Get_rank(), comm.Get_size() +if rank == root: + print("*** Printing is done only from the root process to prevent jumbled messages ***") + print(f"The number of processes is {size}") + +num_devices = cp.cuda.runtime.getDeviceCount() +device_id = rank % num_devices +dev = cp.cuda.Device(device_id) +dev.use() + +props = cp.cuda.runtime.getDeviceProperties(dev.id) +if rank == root: + print("cuTensorNet-vers:", cutn.get_version()) + print("===== root process device info ======") + print("GPU-name:", props["name"].decode()) + print("GPU-clock:", props["clockRate"]) + print("GPU-memoryClock:", props["memoryClockRate"]) + print("GPU-nSM:", props["multiProcessorCount"]) + print("GPU-major:", props["major"]) + print("GPU-minor:", props["minor"]) + print("========================") + +###################################################################################### +# Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m} +###################################################################################### + +if rank == root: + print("Include headers and define data types.") + +data_type = cuquantum.cudaDataType.CUDA_R_32F +compute_type = cuquantum.ComputeType.COMPUTE_32F +num_inputs = 4 + +# Create an array of modes +modes_A = [ord(c) for c in ('a','b','c','d','e','f')] +modes_B = [ord(c) for c in ('b','g','h','e','i','j')] +modes_C = [ord(c) for c in ('m','a','g','f','i','k')] 
+modes_D = [ord(c) for c in ('l','c','h','d','j','m')] +modes_R = [ord(c) for c in ('k','l')] + +# Create an array of extents (shapes) for each tensor +dim = 8 +extent_A = (dim,) * 6 +extent_B = (dim,) * 6 +extent_C = (dim,) * 6 +extent_D = (dim,) * 6 +extent_R = (dim,) * 2 + +if rank == root: + print("Define network, modes, and extents.") + +################# +# Initialize data +################# + +if rank == root: + A = np.random.random(np.prod(extent_A)).astype(np.float32) + B = np.random.random(np.prod(extent_B)).astype(np.float32) + C = np.random.random(np.prod(extent_C)).astype(np.float32) + D = np.random.random(np.prod(extent_D)).astype(np.float32) +else: + A = np.empty(np.prod(extent_A), dtype=np.float32) + B = np.empty(np.prod(extent_B), dtype=np.float32) + C = np.empty(np.prod(extent_C), dtype=np.float32) + D = np.empty(np.prod(extent_D), dtype=np.float32) + +comm.Bcast(A, root) +comm.Bcast(B, root) +comm.Bcast(C, root) +comm.Bcast(D, root) + +A_d = cp.asarray(A) +B_d = cp.asarray(B) +C_d = cp.asarray(C) +D_d = cp.asarray(D) +R_d = cp.empty(np.prod(extent_R), dtype=np.float32) +raw_data_in_d = (A_d.data.ptr, B_d.data.ptr, C_d.data.ptr, D_d.data.ptr) + +############# +# cuTensorNet +############# + +stream = cp.cuda.Stream() +handle = cutn.create() + +nmode_A = len(modes_A) +nmode_B = len(modes_B) +nmode_C = len(modes_C) +nmode_D = len(modes_D) +nmode_R = len(modes_R) + +############################### +# Create Contraction Descriptor +############################### + +modes_in = (modes_A, modes_B, modes_C, modes_D) +extents_in = (extent_A, extent_B, extent_C, extent_D) +num_modes_in = (nmode_A, nmode_B, nmode_C, nmode_D) + +# Strides are optional; if no stride (0) is provided, then cuTensorNet assumes a generalized column-major data layout +strides_in = (0, 0, 0, 0) + +# Set up the tensor qualifiers for all input tensors +qualifiers_in = np.zeros(num_inputs, dtype=cutn.tensor_qualifiers_dtype) + +# Set up tensor network +desc_net = 
cutn.create_network_descriptor(handle, + num_inputs, num_modes_in, extents_in, strides_in, modes_in, qualifiers_in, # inputs + nmode_R, extent_R, 0, modes_R, # output + data_type, compute_type) + +if rank == root: + print("Initialize the cuTensorNet library and create a network descriptor.") + +##################################################### +# Choose workspace limit based on available resources +##################################################### + +free_mem, total_mem = dev.mem_info +free_mem = comm.allreduce(free_mem, MPI.MIN) +workspace_limit = int(free_mem * 0.9) + +cutn_comm = comm.Dup() +cutn.distributed_reset_configuration(handle, MPI._addressof(cutn_comm), MPI._sizeof(cutn_comm)) +if rank == root: + print("Reset distributed MPI configuration") + +############################################## +# Find "optimal" contraction order and slicing +############################################## + +optimizer_config = cutn.create_contraction_optimizer_config(handle) +optimizer_info = cutn.create_contraction_optimizer_info(handle, desc_net) + +# Force slicing +min_slices_dtype = cutn.contraction_optimizer_config_get_attribute_dtype( + cutn.ContractionOptimizerConfigAttribute.SLICER_MIN_SLICES) +min_slices_factor = np.asarray((size,), dtype=min_slices_dtype) +cutn.contraction_optimizer_config_set_attribute( + handle, optimizer_config, cutn.ContractionOptimizerConfigAttribute.SLICER_MIN_SLICES, + min_slices_factor.ctypes.data, min_slices_factor.dtype.itemsize) + +cutn.contraction_optimize( + handle, desc_net, optimizer_config, workspace_limit, optimizer_info) + + +num_slices_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( + cutn.ContractionOptimizerInfoAttribute.NUM_SLICES) +num_slices = np.zeros((1,), dtype=num_slices_dtype) +cutn.contraction_optimizer_info_get_attribute( + handle, optimizer_info, cutn.ContractionOptimizerInfoAttribute.NUM_SLICES, + num_slices.ctypes.data, num_slices.dtype.itemsize) +num_slices = int(num_slices) + +assert 
num_slices > 0 + +if rank == root: + print("Find an optimized contraction path with cuTensorNet optimizer.") + +########################################################### +# Initialize all pair-wise contraction plans (for cuTENSOR) +########################################################### + +work_desc = cutn.create_workspace_descriptor(handle) +cutn.workspace_compute_contraction_sizes(handle, desc_net, optimizer_info, work_desc) +required_workspace_size = cutn.workspace_get_size( + handle, work_desc, + cutn.WorksizePref.MIN, + cutn.Memspace.DEVICE) +work = cp.cuda.alloc(required_workspace_size) +cutn.workspace_set( + handle, work_desc, + cutn.Memspace.DEVICE, + work.ptr, required_workspace_size) + +if rank == root: + print("Allocate workspace.") + +########################################################### +# Initialize all pair-wise contraction plans (for cuTENSOR) +########################################################### + +plan = cutn.create_contraction_plan(handle, desc_net, optimizer_info, work_desc) + +################################################################################### +# Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel +################################################################################### + +pref = cutn.create_contraction_autotune_preference(handle) + +num_autotuning_iterations = 5 # may be 0 +n_iter_dtype = cutn.contraction_autotune_preference_get_attribute_dtype( + cutn.ContractionAutotunePreferenceAttribute.MAX_ITERATIONS) +num_autotuning_iterations = np.asarray([num_autotuning_iterations], dtype=n_iter_dtype) +cutn.contraction_autotune_preference_set_attribute( + handle, pref, + cutn.ContractionAutotunePreferenceAttribute.MAX_ITERATIONS, + num_autotuning_iterations.ctypes.data, num_autotuning_iterations.dtype.itemsize) + +# modify the plan again to find the best pair-wise contractions +cutn.contraction_autotune( + handle, plan, raw_data_in_d, R_d.data.ptr, + work_desc, pref, 
stream.ptr) + +cutn.destroy_contraction_autotune_preference(pref) + +if rank == root: + print("Create a contraction plan for cuTENSOR and optionally auto-tune it.") + +########### +# Execution +########### + +minTimeCUTENSOR = 1e100 +num_runs = 3 # to get stable perf results +e1 = cp.cuda.Event() +e2 = cp.cuda.Event() +slice_group = cutn.create_slice_group_from_id_range(handle, 0, num_slices, 1) + +for i in range(num_runs): + # Contract over all slices. + # A user may choose to parallelize over the slices across multiple devices. + e1.record() + cutn.contract_slices( + handle, plan, raw_data_in_d, R_d.data.ptr, False, + work_desc, slice_group, stream.ptr) + e2.record() + + # Synchronize and measure timing + e2.synchronize() + time = cp.cuda.get_elapsed_time(e1, e2) / 1000 # ms -> s + minTimeCUTENSOR = minTimeCUTENSOR if minTimeCUTENSOR < time else time + +if rank == root: + print("Contract the network, each slice uses the same contraction plan.") + +# free up the workspace +del work + +# Compute the reference result. 
+if rank == root: + # recall that we set strides to null (0), so the data are in F-contiguous layout + A_d = A_d.reshape(extent_A, order='F') + B_d = B_d.reshape(extent_B, order='F') + C_d = C_d.reshape(extent_C, order='F') + D_d = D_d.reshape(extent_D, order='F') + R_d = R_d.reshape(extent_R, order='F') + path, _ = cuquantum.einsum_path("abcdef,bgheij,magfik,lchdjm->kl", A_d, B_d, C_d, D_d) + out = cp.einsum("abcdef,bgheij,magfik,lchdjm->kl", A_d, B_d, C_d, D_d, optimize=path) + + if not cp.allclose(out, R_d): + raise RuntimeError("result is incorrect") + print("Check cuTensorNet result against that of cupy.einsum().") + +####################################################### + +flops_dtype = cutn.contraction_optimizer_info_get_attribute_dtype( + cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT) +flops = np.zeros((1,), dtype=flops_dtype) +cutn.contraction_optimizer_info_get_attribute( + handle, optimizer_info, cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT, + flops.ctypes.data, flops.dtype.itemsize) +flops = float(flops) + +if rank == root: + print(f"num_slices: {num_slices}") + print(f"{minTimeCUTENSOR * 1000 / num_slices} ms / slice") + print(f"{flops / 1e9 / minTimeCUTENSOR} GFLOPS/s") + +cutn.destroy_slice_group(slice_group) +cutn.destroy_contraction_plan(plan) +cutn.destroy_contraction_optimizer_info(optimizer_info) +cutn.destroy_contraction_optimizer_config(optimizer_config) +cutn.destroy_network_descriptor(desc_net) +cutn.destroy_workspace_descriptor(work_desc) +cutn.destroy(handle) + +if rank == root: + print("Free resource and exit.") diff --git a/python/setup.py b/python/setup.py index 90700a7..2592164 100644 --- a/python/setup.py +++ b/python/setup.py @@ -3,217 +3,82 @@ # SPDX-License-Identifier: BSD-3-Clause import os -import site import sys -from packaging.version import Version -from setuptools import setup, Extension, find_packages from Cython.Build import cythonize +from setuptools import setup, Extension, find_packages - -# Get __version__ 
variable +# this is tricky: sys.path gets overwritten at different stages of the build +# flow, so we need to hack sys.path ourselves... source_root = os.path.abspath(os.path.dirname(__file__)) -with open(os.path.join(source_root, 'cuquantum', '_version.py')) as f: - exec(f.read()) +sys.path.append(os.path.join(source_root, 'builder')) +import utils # this is builder.utils # Use README for the project long description -with open("README.md") as f: +with open(os.path.join(source_root, "README.md")) as f: long_description = f.read() -# set up version constraints: note that CalVer like 22.03 is normalized to -# 22.3 by setuptools, so we must follow the same practice in the constraints; -# also, we don't need the Python patch number here -cuqnt_py_ver = Version(__version__) -cuqnt_ver_major_minor = f"{cuqnt_py_ver.major}.{cuqnt_py_ver.minor}" - - -# search order: -# 1. installed "cuquantum" package -# 2. env var -for path in site.getsitepackages(): - path = os.path.join(path, 'cuquantum') - if os.path.isdir(path): - cuquantum_root = path - using_cuquantum_wheel = True - break -else: - cuquantum_root = os.environ.get('CUQUANTUM_ROOT') - using_cuquantum_wheel = False - - -# We allow setting CUSTATEVEC_ROOT and CUTENSORNET_ROOT separately for the ease -# of development, but users are encouraged to either install cuquantum from PyPI -# or conda, or set CUQUANTUM_ROOT to the existing installation. 
-try: - custatevec_root = os.environ['CUSTATEVEC_ROOT'] - using_cuquantum_wheel = False -except KeyError as e: - if cuquantum_root is None: - raise RuntimeError('cuStateVec is not found, please install "cuquantum" ' - 'or set $CUQUANTUM_ROOT') from e - else: - custatevec_root = cuquantum_root -try: - cutensornet_root = os.environ['CUTENSORNET_ROOT'] - using_cuquantum_wheel = False -except KeyError as e: - if cuquantum_root is None: - raise RuntimeError('cuTensorNet is not found, please install "cuquantum" ' - 'or set $CUQUANTUM_ROOT') from e - else: - cutensornet_root = cuquantum_root - - -# search order: -# 1. installed "cutensor" package -# 2. env var -for path in site.getsitepackages(): - path = os.path.join(path, 'cutensor') - if os.path.isdir(path): - cutensor_root = path - assert using_cuquantum_wheel # if this raises, the env is corrupted - break -else: - cutensor_root = os.environ.get('CUTENSOR_ROOT') - assert not using_cuquantum_wheel -if cutensor_root is None: - raise RuntimeError('cuTENSOR is not found, please install "cutensor" ' - 'or set $CUTENSOR_ROOT') - - -# We can't assume users to have CTK installed via pip, so we really need this... -# TODO(leofang): try /usr/local/cuda? -try: - cuda_path = os.environ['CUDA_PATH'] -except KeyError as e: - raise RuntimeError('CUDA is not found, please set $CUDA_PATH') from e +# Get test requirements +with open(os.path.join(source_root, "tests/requirements.txt")) as f: + tests_require = f.read().split('\n') -# TODO: use setup.cfg and/or pyproject.toml -setup_requires = [ - 'Cython>=0.29.22,<3', - 'packaging', - ] +# Runtime dependencies +# - cuTENSOR version is constrained in the cutensornet-cuXX package, so we don't +# need to list it install_requires = [ 'numpy', # 'cupy', # TODO: use "cupy-wheel" once it's stablized, see https://github.com/cupy/cupy/issues/6688 # 'torch', # <-- PyTorch is optional; also, the PyPI version does not support GPU... 
+ f'custatevec-cu{utils.cuda_major_ver}~=1.1', # ">=1.1.0,<2" + f'cutensornet-cu{utils.cuda_major_ver}~=2.0', # ">=2.0.0,<3" ] -ignore_cuquantum_dep = bool(os.environ.get('CUQUANTUM_IGNORE_SOLVER', False)) -if not ignore_cuquantum_dep: - assert using_cuquantum_wheel # if this raises, the env is corrupted - # - cuTENSOR version is constrained in the cuquantum package, so we don't - # need to list it - # - here we assume no API breaking across releases, if there's any we must - # bump the lowest supported version; we can't cap the highest supported - # version as we don't use semantic versioning, unfortunately... - setup_requires.append(f'cuquantum>={cuqnt_ver_major_minor}.*') - install_requires.append(f'cuquantum>={cuqnt_ver_major_minor}.*') - - -def check_cuda_version(): - try: - # We cannot do a dlopen and call cudaRuntimeGetVersion, because it - # requires GPUs. We also do not want to rely on the compiler utility - # provided in distutils (deprecated) or setuptools, as this is a very - # simple string parsing task. - # TODO: switch to cudaRuntimeGetVersion once it's fixed (nvbugs 3624208) - cuda_h = os.path.join(cuda_path, 'include', 'cuda.h') - with open(cuda_h, 'r') as f: - cuda_h = f.read().split('\n') - for line in cuda_h: - if "#define CUDA_VERSION" in line: - ver = int(line.split()[-1]) - break - else: - raise RuntimeError("cannot parse CUDA_VERSION") - except: - raise - else: - # 11020 -> "11.2" - return str(ver // 1000) + '.' 
+ str((ver % 100) // 10) -cuda_ver = check_cuda_version() -if cuda_ver in ('10.2', '11.0'): - cutensor_ver = cuda_ver -elif '11.0' < cuda_ver < '12.0': - cutensor_ver = '11' +# Note: the extension attributes are overwritten in build_extension() +ext_modules = [ + Extension( + "cuquantum.custatevec.custatevec", + sources=["cuquantum/custatevec/custatevec.pyx"], + ), + Extension( + "cuquantum.cutensornet.cutensornet", + sources=["cuquantum/cutensornet/cutensornet.pyx"], + ), + Extension( + "cuquantum.utils", + sources=["cuquantum/utils.pyx"], + include_dirs=[os.path.join(utils.cuda_path, 'include')], + ), +] + + +cmdclass = { + 'build_ext': utils.build_ext, + 'bdist_wheel': utils.bdist_wheel, +} + +if utils.cuda_major_ver == '11': + cuda_classifier = [ + "Environment :: GPU :: NVIDIA CUDA :: 11.0", + "Environment :: GPU :: NVIDIA CUDA :: 11.1", + "Environment :: GPU :: NVIDIA CUDA :: 11.2", + "Environment :: GPU :: NVIDIA CUDA :: 11.3", + "Environment :: GPU :: NVIDIA CUDA :: 11.4", + "Environment :: GPU :: NVIDIA CUDA :: 11.5", + "Environment :: GPU :: NVIDIA CUDA :: 11.6", + "Environment :: GPU :: NVIDIA CUDA :: 11.7", + "Environment :: GPU :: NVIDIA CUDA :: 11.8", + ] else: - raise RuntimeError(f"Unsupported CUDA version: {cuda_ver}") - - -def prepare_libs_and_rpaths(): - global cusv_lib_dir, cutn_lib_dir - # we include both lib64 and lib to accommodate all possible sources - cusv_lib_dir = [os.path.join(custatevec_root, 'lib'), - os.path.join(custatevec_root, 'lib64')] - cutn_lib_dir = [os.path.join(cutensornet_root, 'lib'), - os.path.join(cutensornet_root, 'lib64'), - os.path.join(cutensor_root, 'lib', cutensor_ver)] - - global cusv_lib, cutn_lib, extra_linker_flags - if using_cuquantum_wheel: - cusv_lib = [':libcustatevec.so.1'] - cutn_lib = [':libcutensornet.so.1', ':libcutensor.so.1'] - # The rpaths must be adjusted given the following full-wheel installation: - # cuquantum-python: site-packages/cuquantum/{custatevec, cutensornet}/ [=$ORIGIN] - # cusv & cutn: 
site-packages/cuquantum/lib/ - # cutensor: site-packages/cutensor/lib/CUDA_VER/ - ldflag = "-Wl,--disable-new-dtags," - ldflag += "-rpath,$ORIGIN/../lib," - ldflag += f"-rpath,$ORIGIN/../../cutensor/lib/{cutensor_ver}" - extra_linker_flags = [ldflag] - else: - cusv_lib = ['custatevec'] - cutn_lib = ['cutensornet', 'cutensor'] - extra_linker_flags = [] - - -prepare_libs_and_rpaths() -print("\n****************************************************************") -print("CUDA version:", cuda_ver) -print("CUDA path:", cuda_path) -print("cuStateVec path:", custatevec_root) -print("cuTensorNet path:", cutensornet_root) -print("cuTENSOR path:", cutensor_root) -print("****************************************************************\n") - - -custatevec = Extension( - "cuquantum.custatevec.custatevec", - sources=["cuquantum/custatevec/custatevec.pyx"], - include_dirs=[os.path.join(cuda_path, 'include'), - os.path.join(custatevec_root, 'include')], - library_dirs=cusv_lib_dir, - libraries=cusv_lib, - extra_link_args=extra_linker_flags, -) - - -cutensornet = Extension( - "cuquantum.cutensornet.cutensornet", - sources=["cuquantum/cutensornet/cutensornet.pyx"], - include_dirs=[os.path.join(cuda_path, 'include'), - os.path.join(cutensornet_root, 'include')], - library_dirs=cutn_lib_dir, - libraries=cutn_lib, - extra_link_args=extra_linker_flags, -) - - -utils = Extension( - "cuquantum.utils", - sources=["cuquantum/utils.pyx"], - include_dirs=[os.path.join(cuda_path, 'include')], -) - + cuda_classifier = None +# TODO: move static metadata to pyproject.toml setup( - name="cuquantum-python", - version=__version__, + name=f"cuquantum-python-cu{utils.cuda_major_ver}", + version=utils.cuqnt_py_ver, description="NVIDIA cuQuantum Python", long_description=long_description, long_description_content_type="text/markdown", @@ -234,32 +99,15 @@ def prepare_libs_and_rpaths(): "Programming Language :: Python :: 3.10", "Programming Language :: Python :: Implementation :: CPython", "Environment :: 
GPU :: NVIDIA CUDA", - "Environment :: GPU :: NVIDIA CUDA :: 11.0", - "Environment :: GPU :: NVIDIA CUDA :: 11.1", - "Environment :: GPU :: NVIDIA CUDA :: 11.2", - "Environment :: GPU :: NVIDIA CUDA :: 11.3", - "Environment :: GPU :: NVIDIA CUDA :: 11.4", - "Environment :: GPU :: NVIDIA CUDA :: 11.5", - "Environment :: GPU :: NVIDIA CUDA :: 11.6", - "Environment :: GPU :: NVIDIA CUDA :: 11.7", - ], - ext_modules=cythonize([custatevec, cutensornet, utils,], + ] + cuda_classifier, + ext_modules=cythonize(ext_modules, verbose=True, language_level=3, compiler_directives={'embedsignature': True}), packages=find_packages(include=['cuquantum', 'cuquantum.*']), package_data={"": ["*.pxd", "*.pyx", "*.py"],}, zip_safe=False, python_requires='>=3.8', - setup_requires=setup_requires, install_requires=install_requires, - tests_require=install_requires + [ - # pytest < 6.2 is slow in collecting tests - 'pytest>=6.2', - 'opt_einsum', - # optional test deps - #'cffi>=1.0.0', - #'nbmake>=1.3.0', # for testing notebooks - #'cirq>=0.6.0', - #'qiskit>=0.24.0', - ] + tests_require=install_requires+tests_require, + cmdclass=cmdclass, ) diff --git a/python/tests/conftest.py b/python/tests/conftest.py new file mode 100644 index 0000000..9bddfdf --- /dev/null +++ b/python/tests/conftest.py @@ -0,0 +1,29 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +# The following configs are needed to deselect/ignore collected tests for +# various reasons, see pytest-dev/pytest#3730. In particular, this strategy +# is borrowed from https://github.com/pytest-dev/pytest/issues/3730#issuecomment-567142496. 
+ + +def pytest_configure(config): + config.addinivalue_line( + "markers", "uncollect_if(*, func): function to unselect tests from parametrization" + ) + + +def pytest_collection_modifyitems(config, items): + removed = [] + kept = [] + for item in items: + m = item.get_closest_marker('uncollect_if') + if m: + func = m.kwargs['func'] + if func(**item.callspec.params): + removed.append(item) + continue + kept.append(item) + if removed: + config.hook.pytest_deselected(items=removed) + items[:] = kept diff --git a/python/tests/cuquantum_tests/__init__.py b/python/tests/cuquantum_tests/__init__.py index c08f9b5..e589952 100644 --- a/python/tests/cuquantum_tests/__init__.py +++ b/python/tests/cuquantum_tests/__init__.py @@ -1,3 +1,270 @@ # Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES # # SPDX-License-Identifier: BSD-3-Clause + +import os +import sys +import tempfile + +try: + import cffi +except ImportError: + cffi = None +import cupy +import numpy +import pytest + +from cuquantum import ComputeType, cudaDataType + + +if cffi: + # if the Python binding is not installed in the editable mode (pip install + # -e .), the cffi tests would fail as the modules cannot be imported + sys.path.append(os.getcwd()) + + +dtype_to_data_type = { + numpy.float16: cudaDataType.CUDA_R_16F, + numpy.float32: cudaDataType.CUDA_R_32F, + numpy.float64: cudaDataType.CUDA_R_64F, + numpy.complex64: cudaDataType.CUDA_C_32F, + numpy.complex128: cudaDataType.CUDA_C_64F, +} + + +dtype_to_compute_type = { + numpy.float16: ComputeType.COMPUTE_16F, + numpy.float32: ComputeType.COMPUTE_32F, + numpy.float64: ComputeType.COMPUTE_64F, + numpy.complex64: ComputeType.COMPUTE_32F, + numpy.complex128: ComputeType.COMPUTE_64F, +} + + +# we don't wanna recompile for every test case... 
+_cffi_mod1 = None +_cffi_mod2 = None + +def _can_use_cffi(): + if cffi is None or os.environ.get('CUDA_PATH') is None: + return False + else: + return True + + +class MemoryResourceFactory: + + def __init__(self, source): + self.source = source + + def get_dev_mem_handler(self): + if self.source == "py-callable": + return (*self._get_cuda_callable(), self.source) + elif self.source == "cffi": + # ctx is not needed, so set to NULL + return (0, *self._get_functor_address(), self.source) + elif self.source == "cffi_struct": + return self._get_handler_address() + # TODO: add more different memory sources + else: + raise NotImplementedError + + def _get_cuda_callable(self): + def alloc(size, stream): + return cupy.cuda.runtime.mallocAsync(size, stream) + + def free(ptr, size, stream): + cupy.cuda.runtime.freeAsync(ptr, stream) + + return alloc, free + + def _get_functor_address(self): + if not _can_use_cffi(): + raise RuntimeError + + global _cffi_mod1 + if _cffi_mod1 is None: + import importlib + mod_name = f"cuquantum_test_{self.source}" + ffi = cffi.FFI() + ffi.set_source(mod_name, """ + #include <cuda_runtime.h> + + // cffi limitation: we can't use the actual type cudaStream_t because + // it's considered an "incomplete" type and we can't get the functor + // address by doing so...
+ + int my_alloc(void* ctx, void** ptr, size_t size, void* stream) { + return (int)cudaMallocAsync(ptr, size, stream); + } + + int my_free(void* ctx, void* ptr, size_t size, void* stream) { + return (int)cudaFreeAsync(ptr, stream); + } + """, + include_dirs=[os.environ['CUDA_PATH']+'/include'], + library_dirs=[os.environ['CUDA_PATH']+'/lib64'], + libraries=['cudart'], + ) + ffi.cdef(""" + int my_alloc(void* ctx, void** ptr, size_t size, void* stream); + int my_free(void* ctx, void* ptr, size_t size, void* stream); + """) + ffi.compile(verbose=True) + self.ffi = ffi + _cffi_mod1 = importlib.import_module(mod_name) + self.ffi_mod = _cffi_mod1 + + alloc_addr = self._get_address("my_alloc") + free_addr = self._get_address("my_free") + return alloc_addr, free_addr + + def _get_handler_address(self): + if not _can_use_cffi(): + raise RuntimeError + + global _cffi_mod2 + if _cffi_mod2 is None: + import importlib + mod_name = f"cuquantum_test_{self.source}" + ffi = cffi.FFI() + ffi.set_source(mod_name, """ + #include <cuda_runtime.h> + + // cffi limitation: we can't use the actual type cudaStream_t because + // it's considered an "incomplete" type and we can't get the functor + // address by doing so...
+ + int my_alloc(void* ctx, void** ptr, size_t size, void* stream) { + return (int)cudaMallocAsync(ptr, size, stream); + } + + int my_free(void* ctx, void* ptr, size_t size, void* stream) { + return (int)cudaFreeAsync(ptr, stream); + } + + typedef struct { + void* ctx; + int (*device_alloc)(void* ctx, void** ptr, size_t size, void* stream); + int (*device_free)(void* ctx, void* ptr, size_t size, void* stream); + char name[64]; + } myHandler; + + myHandler* init_myHandler(myHandler* h, const char* name) { + h->ctx = NULL; + h->device_alloc = my_alloc; + h->device_free = my_free; + memcpy(h->name, name, 64); + return h; + } + """, + include_dirs=[os.environ['CUDA_PATH']+'/include'], + library_dirs=[os.environ['CUDA_PATH']+'/lib64'], + libraries=['cudart'], + ) + ffi.cdef(""" + typedef struct { + ...; + } myHandler; + + myHandler* init_myHandler(myHandler* h, const char* name); + """) + ffi.compile(verbose=True) + self.ffi = ffi + _cffi_mod2 = importlib.import_module(mod_name) + self.ffi_mod = _cffi_mod2 + + h = self.handler = self.ffi_mod.ffi.new("myHandler*") + self.ffi_mod.lib.init_myHandler(h, self.source.encode()) + return self._get_address(h) + + def _get_address(self, func_name_or_ptr): + if isinstance(func_name_or_ptr, str): + func_name = func_name_or_ptr + data = str(self.ffi_mod.ffi.addressof(self.ffi_mod.lib, func_name)) + else: + ptr = func_name_or_ptr # ptr to struct + data = str(self.ffi_mod.ffi.addressof(ptr[0])) + # data has this format: "" + return int(data.split()[-1][:-1], base=16) + + +class MemHandlerTestBase: + + mod = None + prefix = None + error = None + + def _test_set_get_device_mem_handler(self, source, handle): + if (isinstance(source, str) and source.startswith('cffi') + and not _can_use_cffi()): + pytest.skip("cannot run cffi tests") + + if source is not None: + mr = MemoryResourceFactory(source) + handler = mr.get_dev_mem_handler() + self.mod.set_device_mem_handler(handle, handler) + # round-trip test + queried_handler = 
self.mod.get_device_mem_handler(handle) + if source == 'cffi_struct': + # I'm lazy, otherwise I'd also fetch the functor addresses here... + assert queried_handler[0] == 0 # ctx is NULL + assert queried_handler[-1] == source + else: + assert queried_handler == handler + else: + with pytest.raises(self.error) as e: + queried_handler = self.mod.get_device_mem_handler(handle) + assert f'{self.prefix.upper()}_STATUS_NO_DEVICE_ALLOCATOR' in str(e.value) + + +class LoggerTestBase: + + mod = None + prefix = None + + def test_logger_set_level(self): + self.mod.logger_set_level(6) # on + self.mod.logger_set_level(0) # off + + def test_logger_set_mask(self): + self.mod.logger_set_mask(16) # should not raise + + def test_logger_set_callback_data(self): + # we also test logger_open_file() here to avoid polluting stdout + + def callback(level, name, message, my_data, is_ok=False): + log = f"{level}, {name}, {message} (is_ok={is_ok}) -> logged\n" + my_data.append(log) + + handle = None + my_data = [] + is_ok = True + + with tempfile.TemporaryDirectory() as temp: + file_name = os.path.join(temp, f"{self.prefix}_test") + self.mod.logger_open_file(file_name) + self.mod.logger_set_callback_data(callback, my_data, is_ok=is_ok) + self.mod.logger_set_level(6) + + try: + handle = self.mod.create() + self.mod.destroy(handle) + except: + if handle: + self.mod.destroy(handle) + raise + finally: + self.mod.logger_force_disable() # to not affect the rest of tests + + with open(file_name) as f: + log_from_f = f.read() + + # check the log file + assert f'[{self.prefix}Create]' in log_from_f + assert f'[{self.prefix}Destroy]' in log_from_f + + # check the captured data (note we log 2 APIs) + log = ''.join(my_data) + assert log.count("-> logged") >= 2 + assert log.count("is_ok=True") >= 2 diff --git a/python/tests/cuquantum_tests/custatevec_tests/test_custatevec.py b/python/tests/cuquantum_tests/custatevec_tests/test_custatevec.py index b85ae20..abd6b87 100644 --- 
a/python/tests/cuquantum_tests/custatevec_tests/test_custatevec.py +++ b/python/tests/cuquantum_tests/custatevec_tests/test_custatevec.py @@ -3,14 +3,7 @@ # SPDX-License-Identifier: BSD-3-Clause import copy -import os -import sys -import tempfile - -try: - import cffi -except ImportError: - cffi = None + import cupy from cupy import testing import numpy @@ -20,6 +13,9 @@ from cuquantum import ComputeType, cudaDataType from cuquantum import custatevec as cusv +from .. import (_can_use_cffi, dtype_to_compute_type, dtype_to_data_type, + MemHandlerTestBase, MemoryResourceFactory, LoggerTestBase) + ################################################################### # @@ -30,23 +26,6 @@ # ################################################################### -if cffi: - # if the Python binding is not installed in the editable mode (pip install - # -e .), the cffi tests would fail as the modules cannot be imported - sys.path.append(os.getcwd()) - -dtype_to_data_type = { - numpy.dtype(numpy.complex64): cudaDataType.CUDA_C_32F, - numpy.dtype(numpy.complex128): cudaDataType.CUDA_C_64F, -} - - -dtype_to_compute_type = { - numpy.dtype(numpy.complex64): ComputeType.COMPUTE_32F, - numpy.dtype(numpy.complex128): ComputeType.COMPUTE_64F, -} - - @pytest.fixture() def handle(): h = cusv.create() @@ -192,155 +171,6 @@ def test_get_property(self): cuquantum.libraryPropertyType.PATCH_LEVEL) -# we don't wanna recompile for every test case... 
-_cffi_mod1 = None -_cffi_mod2 = None - -def _can_use_cffi(): - if cffi is None or os.environ.get('CUDA_PATH') is None: - return False - else: - return True - - -class MemoryResourceFactory: - - def __init__(self, source, name=None): - self.source = source - self.name = source if name is None else name - - def get_dev_mem_handler(self): - if self.source == "py-callable": - return (*self._get_cuda_callable(), self.name) - elif self.source == "cffi": - # ctx is not needed, so set to NULL - return (0, *self._get_functor_address(), self.name) - elif self.source == "cffi_struct": - return self._get_handler_address() - # TODO: add more different memory sources - else: - raise NotImplementedError - - def _get_cuda_callable(self): - def alloc(size, stream): - return cupy.cuda.runtime.mallocAsync(size, stream) - - def free(ptr, size, stream): - cupy.cuda.runtime.freeAsync(ptr, stream) - - return alloc, free - - def _get_functor_address(self): - if not _can_use_cffi(): - raise RuntimeError - - global _cffi_mod1 - if _cffi_mod1 is None: - import importlib - mod_name = f"cusv_test_{self.source}" - ffi = cffi.FFI() - ffi.set_source(mod_name, """ - #include <cuda_runtime.h> - - // cffi limitation: we can't use the actual type cudaStream_t because - // it's considered an "incomplete" type and we can't get the functor - // address by doing so... 
- - int my_alloc(void* ctx, void** ptr, size_t size, void* stream) { - return (int)cudaMallocAsync(ptr, size, stream); - } - - int my_free(void* ctx, void* ptr, size_t size, void* stream) { - return (int)cudaFreeAsync(ptr, stream); - } - """, - include_dirs=[os.environ['CUDA_PATH']+'/include'], - library_dirs=[os.environ['CUDA_PATH']+'/lib64'], - libraries=['cudart'], - ) - ffi.cdef(""" - int my_alloc(void* ctx, void** ptr, size_t size, void* stream); - int my_free(void* ctx, void* ptr, size_t size, void* stream); - """) - ffi.compile(verbose=True) - self.ffi = ffi - _cffi_mod1 = importlib.import_module(mod_name) - self.ffi_mod = _cffi_mod1 - - alloc_addr = self._get_address("my_alloc") - free_addr = self._get_address("my_free") - return alloc_addr, free_addr - - def _get_handler_address(self): - if not _can_use_cffi(): - raise RuntimeError - - global _cffi_mod2 - if _cffi_mod2 is None: - import importlib - mod_name = f"cusv_test_{self.source}" - ffi = cffi.FFI() - ffi.set_source(mod_name, """ - #include <cuda_runtime.h> - - // cffi limitation: we can't use the actual type cudaStream_t because - // it's considered an "incomplete" type and we can't get the functor - // address by doing so... 
- - int my_alloc(void* ctx, void** ptr, size_t size, void* stream) { - return (int)cudaMallocAsync(ptr, size, stream); - } - - int my_free(void* ctx, void* ptr, size_t size, void* stream) { - return (int)cudaFreeAsync(ptr, stream); - } - - typedef struct { - void* ctx; - int (*device_alloc)(void* ctx, void** ptr, size_t size, void* stream); - int (*device_free)(void* ctx, void* ptr, size_t size, void* stream); - char name[64]; - } myHandler; - - myHandler* init_myHandler(myHandler* h, const char* name) { - h->ctx = NULL; - h->device_alloc = my_alloc; - h->device_free = my_free; - memcpy(h->name, name, 64); - return h; - } - """, - include_dirs=[os.environ['CUDA_PATH']+'/include'], - library_dirs=[os.environ['CUDA_PATH']+'/lib64'], - libraries=['cudart'], - ) - ffi.cdef(""" - typedef struct { - ...; - } myHandler; - - myHandler* init_myHandler(myHandler* h, const char* name); - """) - ffi.compile(verbose=True) - self.ffi = ffi - _cffi_mod2 = importlib.import_module(mod_name) - self.ffi_mod = _cffi_mod2 - - h = self.handler = self.ffi_mod.ffi.new("myHandler*") - self.ffi_mod.lib.init_myHandler(h, self.name.encode()) - return self._get_address(h) - - def _get_address(self, func_name_or_ptr): - if isinstance(func_name_or_ptr, str): - func_name = func_name_or_ptr - data = str(self.ffi_mod.ffi.addressof(self.ffi_mod.lib, func_name)) - else: - ptr = func_name_or_ptr # ptr to struct - data = str(self.ffi_mod.ffi.addressof(ptr[0])) - # data has this format: "" - return int(data.split()[-1][:-1], base=16) - - class TestHandle: def test_handle_create_destroy(self, handle): @@ -380,7 +210,7 @@ def test_abs2sum_on_z_basis(self, handle, input_form): basis_bits = list(range(self.n_qubits)) basis_bits, basis_bits_len = self._return_data( basis_bits, 'basis_bits', *input_form['basis_bits']) - data_type = dtype_to_data_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] # case 1: both are computed sum0, sum1 = cusv.abs2sum_on_z_basis( @@ -425,7 +255,7 @@ def 
test_abs2sum_array_no_mask(self, handle, xp, input_form): sv[1] = 1./numpy.sqrt(2) sv[4] = 1./numpy.sqrt(2) - data_type = dtype_to_data_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] bit_ordering = list(range(self.n_qubits)) bit_ordering, bit_ordering_len = self._return_data( bit_ordering, 'bit_ordering', *input_form['bit_ordering']) @@ -458,7 +288,7 @@ def test_collapse_on_z_basis(self, handle, parity, input_form): basis_bits = list(range(self.n_qubits)) basis_bits, basis_bits_len = self._return_data( basis_bits, 'basis_bits', *input_form['basis_bits']) - data_type = dtype_to_data_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] cusv.collapse_on_z_basis( handle, sv.data.ptr, data_type, self.n_qubits, @@ -489,7 +319,7 @@ def test_collapse_by_bitstring(self, handle, input_form): bit_ordering = list(range(self.n_qubits)) bit_ordering, _ = self._return_data( bit_ordering, 'bit_ordering', *input_form['bit_ordering']) - data_type = dtype_to_data_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] norm = 0.5 # the sv after collapse is normalized as sv -> sv / \sqrt{norm} @@ -528,7 +358,7 @@ def test_measure_on_z_basis(self, handle, rand, collapse, input_form): basis_bits = list(range(self.n_qubits)) basis_bits, basis_bits_len = self._return_data( basis_bits, 'basis_bits', *input_form['basis_bits']) - data_type = dtype_to_data_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] orig_sv = sv.copy() parity = cusv.measure_on_z_basis( @@ -561,7 +391,7 @@ def test_batch_measure(self, handle, rand, collapse, input_form): sv[-1] = numpy.sqrt(0.5) orig_sv = sv.copy() - data_type = dtype_to_data_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] bitstring = numpy.empty(self.n_qubits, dtype=numpy.int32) bit_ordering = list(range(self.n_qubits)) bit_ordering, _ = self._return_data( @@ -606,7 +436,7 @@ def test_apply_pauli_rotation(self, handle, input_form): sv[0] = 0 sv[4] = 1 - data_type = dtype_to_data_type[sv.dtype] + data_type = 
dtype_to_data_type[self.dtype] targets = [0, 1] targets, targets_len = self._return_data( targets, 'targets', *input_form['targets']) @@ -646,8 +476,8 @@ def test_apply_matrix(self, handle, xp, input_form, mempool): pytest.skip("cannot run cffi tests") sv = self.get_sv() - data_type = dtype_to_data_type[sv.dtype] - compute_type = dtype_to_compute_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] + compute_type = dtype_to_compute_type[self.dtype] targets = [0, 1, 2] targets, targets_len = self._return_data( targets, 'targets', *input_form['targets']) @@ -710,8 +540,8 @@ def test_apply_generalized_permutation_matrix( sv = self.get_sv() sv[:] = 1 # invalid sv just to make math checking easier - data_type = dtype_to_data_type[sv.dtype] - compute_type = dtype_to_compute_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] + compute_type = dtype_to_compute_type[self.dtype] # TODO(leofang): test permutation on either host or device permutation = list(numpy.random.permutation(2**self.n_qubits)) @@ -787,8 +617,8 @@ def test_compute_expectation(self, handle, xp, expect_dtype, input_form, mempool sv = self.get_sv() sv[:] = numpy.sqrt(1/(2**self.n_qubits)) - data_type = dtype_to_data_type[sv.dtype] - compute_type = dtype_to_compute_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] + compute_type = dtype_to_compute_type[self.dtype] basis_bits = list(range(self.n_qubits)) basis_bits, basis_bits_len = self._return_data( basis_bits, 'basis_bits', *input_form['basis_bits']) @@ -835,8 +665,8 @@ def test_compute_expectations_on_pauli_basis(self, handle): # create a uniform sv sv = self.get_sv() sv[:] = numpy.sqrt(1/(2**self.n_qubits)) - data_type = dtype_to_data_type[sv.dtype] - compute_type = dtype_to_compute_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] + compute_type = dtype_to_compute_type[self.dtype] # measure XX...X, YY..Y, ZZ...Z paulis = [[cusv.Pauli.X for i in range(self.n_qubits)], @@ -877,8 +707,8 @@ def test_sampling(self, handle, 
input_form, mempool): sv = self.get_sv() sv[:] = numpy.sqrt(1/(2**self.n_qubits)) - data_type = dtype_to_data_type[sv.dtype] - compute_type = dtype_to_compute_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] + compute_type = dtype_to_compute_type[self.dtype] shots = 4096 bitstrings = numpy.empty((shots,), dtype=numpy.int64) @@ -959,8 +789,8 @@ def test_accessor_get(self, handle, readonly, input_form, mempool): data /= cupy.sqrt(data**2) sv[:] = data - data_type = dtype_to_data_type[sv.dtype] - compute_type = dtype_to_compute_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] + compute_type = dtype_to_compute_type[self.dtype] # measure all qubits bit_ordering = list(range(self.n_qubits)) @@ -1021,8 +851,8 @@ def test_accessor_set(self, handle, readonly, input_form, mempool): data /= cupy.sqrt(data**2) sv[:] = data - data_type = dtype_to_data_type[sv.dtype] - compute_type = dtype_to_compute_type[sv.dtype] + data_type = dtype_to_data_type[self.dtype] + compute_type = dtype_to_compute_type[self.dtype] # measure all qubits bit_ordering = list(range(self.n_qubits)) @@ -1109,8 +939,8 @@ def test_apply_matrix_type( and not _can_use_cffi()): pytest.skip("cannot run cffi tests") - data_type = dtype_to_data_type[xp.dtype(dtype)] - compute_type = dtype_to_compute_type[xp.dtype(dtype)] + data_type = dtype_to_data_type[dtype] + compute_type = dtype_to_compute_type[dtype] n_targets = 4 # matrix can live on host or device @@ -1170,7 +1000,7 @@ def test_batch_measure_with_offset( self, multi_gpu_handles, rand, collapse, input_form): handles = multi_gpu_handles sub_sv = self.get_sv() - data_type = dtype_to_data_type[sub_sv[0].dtype] + data_type = dtype_to_data_type[self.dtype] bit_ordering = list(range(self.n_local_bits)) bit_ordering, bit_ordering_len = self._return_data( bit_ordering, 'bit_ordering', *input_form['bit_ordering']) @@ -1260,7 +1090,7 @@ class TestSwap: def test_swap_index_bits(self, handle, dtype, input_form): n_qubits = 4 sv = 
cupy.zeros(2**n_qubits, dtype=dtype) - data_type = dtype_to_data_type[sv.dtype] + data_type = dtype_to_data_type[dtype] # set sv to |0110> sv[6] = 1 @@ -1317,7 +1147,7 @@ def test_multi_device_swap_index_bits( handles = multi_gpu_handles n_handles = len(handles) sub_sv = self.get_sv() - data_type = dtype_to_data_type[sub_sv[0].dtype] + data_type = dtype_to_data_type[self.dtype] # set sv to |0110> (up to normalization) with cupy.cuda.Device(0): @@ -1365,79 +1195,21 @@ def test_multi_device_swap_index_bits( assert sub_sv[1][4] == 1 -class TestMemHandler: +class TestMemHandler(MemHandlerTestBase): + + mod = cusv + prefix = "custatevec" + error = cusv.cuStateVecError # TODO: add more different memory sources @pytest.mark.parametrize( 'source', (None, "py-callable", 'cffi', 'cffi_struct') ) - def test_set_get_device_mem_handler(self, handle, source): - if (isinstance(source, str) and source.startswith('cffi') - and not _can_use_cffi()): - pytest.skip("cannot run cffi tests") + def test_set_get_device_mem_handler(self, source, handle): + self._test_set_get_device_mem_handler(source, handle) - if source is not None: - mr = MemoryResourceFactory(source) - handler = mr.get_dev_mem_handler() - cusv.set_device_mem_handler(handle, handler) - # round-trip test - queried_handler = cusv.get_device_mem_handler(handle) - if source == 'cffi_struct': - # I'm lazy, otherwise I'd also fetch the functor addresses here... 
- assert queried_handler[0] == 0 # ctx is NULL - assert queried_handler[-1] == source - else: - assert queried_handler == handler - else: - with pytest.raises(cusv.cuStateVecError) as e: - queried_handler = cusv.get_device_mem_handler(handle) - assert 'CUSTATEVEC_STATUS_NO_DEVICE_ALLOCATOR' in str(e.value) - - -class TestLogger: - - def test_logger_set_level(self): - cusv.logger_set_level(6) # on - cusv.logger_set_level(0) # off - - def test_logger_set_mask(self): - cusv.logger_set_mask(16) # should not raise - - def test_logger_set_callback_data(self): - # we also test logger_open_file() here to avoid polluting stdout - - def callback(level, name, message, my_data, is_ok=False): - log = f"{level}, {name}, {message} (is_ok={is_ok}) -> logged\n" - my_data.append(log) - - handle = None - my_data = [] - is_ok = True - - with tempfile.TemporaryDirectory() as temp: - file_name = os.path.join(temp, "cusv_test") - cusv.logger_open_file(file_name) - cusv.logger_set_callback_data(callback, my_data, is_ok=is_ok) - cusv.logger_set_level(6) - - try: - handle = cusv.create() - cusv.destroy(handle) - except: - if handle: - cusv.destroy(handle) - raise - finally: - cusv.logger_force_disable() # to not affect the rest of tests - - with open(file_name) as f: - log_from_f = f.read() - - # check the log file - assert '[custatevecCreate]' in log_from_f - assert '[custatevecDestroy]' in log_from_f - - # check the captured data (note we log 2 APIs) - log = ''.join(my_data) - assert log.count("-> logged") >= 2 - assert log.count("is_ok=True") >= 2 + +class TestLogger(LoggerTestBase): + + mod = cusv + prefix = "custatevec" diff --git a/python/tests/cuquantum_tests/cutensornet_tests/circuit_utils.py b/python/tests/cuquantum_tests/cutensornet_tests/circuit_utils.py index 28d081c..be63d71 100644 --- a/python/tests/cuquantum_tests/cutensornet_tests/circuit_utils.py +++ b/python/tests/cuquantum_tests/cutensornet_tests/circuit_utils.py @@ -2,6 +2,7 @@ # # SPDX-License-Identifier: BSD-3-Clause 
+import itertools from types import MappingProxyType try: @@ -20,8 +21,11 @@ qiskit = None from cuquantum import contract, CircuitToEinsum +from cuquantum.cutensornet._internal.circuit_converter_utils import convert_mode_labels_to_expression from cuquantum.cutensornet._internal.circuit_converter_utils import EINSUM_SYMBOLS_BASE -from .testutils import atol_mapper, rtol_mapper +from cuquantum.cutensornet._internal.circuit_converter_utils import get_pauli_gates +from cuquantum.cutensornet._internal.circuit_converter_utils import parse_gates_to_mode_labels_operands +from .test_utils import atol_mapper, rtol_mapper # note: this implementation would cause pytorch tests being silently skipped @@ -67,6 +71,11 @@ def where_fixed_generator(qubits, nfix_max, nsite_max=None): yield where, fixed +def random_pauli_string_generator(n_qubits, num_strings=4): + for _ in range(num_strings): + yield ''.join(np.random.choice(['I','X', 'Y', 'Z'], n_qubits)) + + def get_partial_indices(qubits, fixed): partial_indices = [slice(None)] * len(qubits) index_map = {'0': slice(0, 1), @@ -189,23 +198,24 @@ def __init__(self, circuit, dtype, backend, nsample, nsite_max, nfix_max): self.nsite_max = max(1, min(nsite_max, self.n_qubits-1)) self.nfix_max = max(min(nfix_max, self.n_qubits-nsite_max-1), 0) - def get_state_vector_from_simulator(self, fixed=EMPTY_DICT): + def get_state_vector_from_simulator(self): if self.sv is None: self.sv = self._get_state_vector_from_simulator() - if fixed: - partial_indices = get_partial_indices(self.qubits, fixed) - sv = self.sv[tuple(partial_indices)] - return sv.reshape((2,)*(self.n_qubits-len(fixed))) - else: - return self.sv + return self.sv def get_amplitude_from_simulator(self, bitstring): sv = self.get_state_vector_from_simulator() index = [int(ibit) for ibit in bitstring] return sv[tuple(index)] + def get_batched_amplitudes_from_simulator(self, fixed): + sv = self.get_state_vector_from_simulator() + partial_indices = get_partial_indices(self.qubits, 
fixed) + batched_amplitudes = sv[tuple(partial_indices)] + return batched_amplitudes.reshape((2,)*(self.n_qubits-len(fixed))) + def get_reduced_density_matrix_from_simulator(self, where, fixed=EMPTY_DICT): - """ + r""" For where = (a, b), reduced density matrix is formulated as: :math: `rho_{a,b,a^{\prime},b^{\prime}} = \sum_{c,d,e,...} SV^{\star}_{a^{\prime}, b^{\prime}, c, d, e, ...} SV_{a, b, c, d, e, ...}` """ @@ -229,19 +239,43 @@ def get_reduced_density_matrix_from_simulator(self, where, fixed=EMPTY_DICT): else: rdm = contract(expression, sv, sv.conj()) return rdm + + def get_expectation_from_sv(self, pauli_string): + input_mode_labels = [[*range(self.n_qubits)]] + qubits_frontier = dict(zip(self.qubits, itertools.count())) + next_frontier = max(qubits_frontier.values()) + 1 + + pauli_map = dict(zip(self.qubits, pauli_string)) + dtype = getattr(self.backend, self.dtype) + pauli_gates = get_pauli_gates(pauli_map, dtype=dtype, backend=self.backend) + gate_mode_labels, gate_operands = parse_gates_to_mode_labels_operands(pauli_gates, + qubits_frontier, + next_frontier) + + mode_labels = input_mode_labels + gate_mode_labels + [[qubits_frontier[ix] for ix in self.qubits]] + output_mode_labels = [] + expression = convert_mode_labels_to_expression(mode_labels, output_mode_labels) + + sv = self.get_state_vector_from_simulator() + if self.backend is torch: + operands = [sv] + gate_operands + [sv.conj().resolve_conj()] + else: + operands = [sv] + gate_operands + [sv.conj()] + expec = contract(expression, *operands) + return expec + def _get_state_vector_from_simulator(self): raise NotImplementedError def test_state_vector(self): - for fixed in where_fixed_generator(self.qubits, self.nfix_max): - expression, operands = self.converter.state_vector(fixed=fixed) - sv1 = contract(expression, *operands) - sv2 = self.get_state_vector_from_simulator(fixed=fixed) - self.backend.allclose( - sv1, sv2, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) + expression, 
operands = self.converter.state_vector() + sv1 = contract(expression, *operands) + sv2 = self.get_state_vector_from_simulator() + self.backend.allclose( + sv1, sv2, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) - def test_bitstrings(self): + def test_amplitude(self): for bitstring in bitstring_generator(self.n_qubits, self.nsample): expression, operands = self.converter.amplitude(bitstring) amp1 = contract(expression, *operands) @@ -249,7 +283,15 @@ def test_bitstrings(self): self.backend.allclose( amp1, amp2, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) - def test_reduced_density_matrices(self): + def test_batched_amplitudes(self): + for fixed in where_fixed_generator(self.qubits, self.nfix_max): + expression, operands = self.converter.batched_amplitudes(fixed) + batched_amps1 = contract(expression, *operands) + batched_amps2 = self.get_batched_amplitudes_from_simulator(fixed) + self.backend.allclose( + batched_amps1, batched_amps2, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) + + def test_reduced_density_matrix(self): for where, fixed in where_fixed_generator(self.qubits, self.nfix_max, nsite_max=self.nsite_max): expression1, operands1 = self.converter.reduced_density_matrix(where, fixed=fixed, lightcone=True) expression2, operands2 = self.converter.reduced_density_matrix(where, fixed=fixed, lightcone=False) @@ -262,11 +304,27 @@ def test_reduced_density_matrices(self): rdm1, rdm2, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) self.backend.allclose( rdm1, rdm3, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) + + def test_expectation(self): + for pauli_string in random_pauli_string_generator(self.n_qubits, 2): + expression1, operands1 = self.converter.expectation(pauli_string, lightcone=True) + expression2, operands2 = self.converter.expectation(pauli_string, lightcone=False) + assert len(operands1) <= len(operands2) + expec1 = contract(expression1, *operands1) + expec2 = contract(expression2, 
*operands2) + expec3 = self.get_expectation_from_sv(pauli_string) + + self.backend.allclose( + expec1, expec2, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) + self.backend.allclose( + expec1, expec3, atol=atol_mapper[self.dtype], rtol=rtol_mapper[self.dtype]) def run_tests(self): self.test_state_vector() - self.test_bitstrings() - self.test_reduced_density_matrices() + self.test_amplitude() + self.test_batched_amplitudes() + self.test_reduced_density_matrix() + self.test_expectation() class CirqTester(BaseTester): diff --git a/python/tests/cuquantum_tests/cutensornet_tests/data.py b/python/tests/cuquantum_tests/cutensornet_tests/data.py index 1ce7224..ae9cd6b 100644 --- a/python/tests/cuquantum_tests/cutensornet_tests/data.py +++ b/python/tests/cuquantum_tests/cutensornet_tests/data.py @@ -2,20 +2,17 @@ # # SPDX-License-Identifier: BSD-3-Clause -try: - import torch -except ImportError: - torch = None - import cuquantum -# note: this implementation would cause pytorch tests being silently skipped -# if pytorch is not available, which is the desired effect since otherwise -# it'd be too noisy -backend_names = ("numpy", "cupy") -if torch: - backend_names += ("torch-cpu", "torch-gpu") +# We include torch tests here unconditionally, and use pytest deselect to +# exclude them if torch is not present. 
+backend_names = ( + "numpy", + "cupy", + "torch-cpu", + "torch-gpu", +) dtype_names = ( diff --git a/python/tests/cuquantum_tests/cutensornet_tests/test_contract.py b/python/tests/cuquantum_tests/cutensornet_tests/test_contract.py index 02a6688..25d73fd 100644 --- a/python/tests/cuquantum_tests/cutensornet_tests/test_contract.py +++ b/python/tests/cuquantum_tests/cutensornet_tests/test_contract.py @@ -15,12 +15,14 @@ from cuquantum.cutensornet._internal.utils import infer_object_package from .data import backend_names, dtype_names, einsum_expressions -from .testutils import atol_mapper, EinsumFactory, rtol_mapper -from .testutils import compute_and_normalize_numpy_path -from .testutils import set_path_to_optimizer_options +from .test_utils import atol_mapper, EinsumFactory, rtol_mapper +from .test_utils import compute_and_normalize_numpy_path +from .test_utils import deselect_contract_tests +from .test_utils import set_path_to_optimizer_options # TODO: parametrize compute type? +@pytest.mark.uncollect_if(func=deselect_contract_tests) @pytest.mark.parametrize( "use_numpy_path", (False, True) ) @@ -46,9 +48,7 @@ def _test_runner( stream, use_numpy_path, **kwargs): einsum_expr = copy.deepcopy(einsum_expr_pack) if isinstance(einsum_expr, list): - einsum_expr, network_opts, optimizer_opts, overwrite_dtype = einsum_expr - if dtype != overwrite_dtype: - pytest.skip(f"skipping {dtype} is requested") + einsum_expr, network_opts, optimizer_opts, _ = einsum_expr else: network_opts = optimizer_opts = None assert isinstance(einsum_expr, (str, tuple)) @@ -83,9 +83,9 @@ def _test_runner( *data, options=network_opts, optimize=optimizer_opts, stream=stream, return_info=return_info) if return_info: - out, info = out - assert isinstance(info[0], list) # path - assert isinstance(info[1], cuquantum.OptimizerInfo) + out, (path, info) = out + assert isinstance(path, list) + assert isinstance(info, cuquantum.OptimizerInfo) else: # cuquantum.einsum() optimize = kwargs.pop('optimize') if 
optimize == 'path': diff --git a/python/tests/cuquantum_tests/cutensornet_tests/test_contract_path.py b/python/tests/cuquantum_tests/cutensornet_tests/test_contract_path.py index ebd8ad0..cb1a4e4 100644 --- a/python/tests/cuquantum_tests/cutensornet_tests/test_contract_path.py +++ b/python/tests/cuquantum_tests/cutensornet_tests/test_contract_path.py @@ -8,9 +8,9 @@ import cuquantum from .data import einsum_expressions -from .testutils import compute_and_normalize_numpy_path -from .testutils import EinsumFactory -from .testutils import set_path_to_optimizer_options +from .test_utils import compute_and_normalize_numpy_path +from .test_utils import EinsumFactory +from .test_utils import set_path_to_optimizer_options @pytest.mark.parametrize( diff --git a/python/tests/cuquantum_tests/cutensornet_tests/test_cutensornet.py b/python/tests/cuquantum_tests/cutensornet_tests/test_cutensornet.py index 4a6dc87..52d6955 100644 --- a/python/tests/cuquantum_tests/cutensornet_tests/test_cutensornet.py +++ b/python/tests/cuquantum_tests/cutensornet_tests/test_cutensornet.py @@ -5,21 +5,24 @@ from collections import abc import functools import os -import sys -import tempfile -try: - import cffi -except ImportError: - cffi = None import cupy from cupy import testing import numpy +try: + import mpi4py + from mpi4py import MPI # init! +except ImportError: + mpi4py = MPI = None import pytest import cuquantum from cuquantum import ComputeType, cudaDataType from cuquantum import cutensornet as cutn +from .test_utils import atol_mapper, rtol_mapper + +from .. 
import (_can_use_cffi, dtype_to_compute_type, dtype_to_data_type, + MemHandlerTestBase, MemoryResourceFactory, LoggerTestBase) ################################################################### @@ -31,29 +34,6 @@ # ################################################################### -if cffi: - # if the Python binding is not installed in the editable mode (pip install - # -e .), the cffi tests would fail as the modules cannot be imported - sys.path.append(os.getcwd()) - -dtype_to_data_type = { - numpy.float16: cudaDataType.CUDA_R_16F, - numpy.float32: cudaDataType.CUDA_R_32F, - numpy.float64: cudaDataType.CUDA_R_64F, - numpy.complex64: cudaDataType.CUDA_C_32F, - numpy.complex128: cudaDataType.CUDA_C_64F, -} - - -dtype_to_compute_type = { - numpy.float16: ComputeType.COMPUTE_16F, - numpy.float32: ComputeType.COMPUTE_32F, - numpy.float64: ComputeType.COMPUTE_64F, - numpy.complex64: ComputeType.COMPUTE_32F, - numpy.complex128: ComputeType.COMPUTE_64F, -} - - def manage_resource(name): def decorator(impl): @functools.wraps(impl) @@ -65,22 +45,40 @@ def test_func(self, *args, **kwargs): tn, dtype, input_form, output_form = self.tn, self.dtype, self.input_form, self.output_form einsum, shapes = tn # unpack tn = TensorNetworkFactory(einsum, shapes, dtype) - i_n_inputs, i_n_modes, i_extents, i_strides, i_modes, i_alignments = \ + i_n_inputs, i_n_modes, i_extents, i_strides, i_modes = \ tn.get_input_metadata(**input_form) - o_n_modes, o_extents, o_strides, o_modes, o_alignments = \ + o_n_modes, o_extents, o_strides, o_modes = \ tn.get_output_metadata(**output_form) + i_qualifiers = numpy.zeros(i_n_inputs, dtype=cutn.tensor_qualifiers_dtype) h = cutn.create_network_descriptor( self.handle, - i_n_inputs, i_n_modes, i_extents, i_strides, i_modes, i_alignments, - o_n_modes, o_extents, o_strides, o_modes, o_alignments, + i_n_inputs, i_n_modes, i_extents, i_strides, i_modes, i_qualifiers, + o_n_modes, o_extents, o_strides, o_modes, dtype_to_data_type[dtype], 
dtype_to_compute_type[dtype]) # we also need to keep the tn data alive self.tn = tn + elif name == 'tensor_decom': + tn, dtype, tensor_form = self.tn, self.dtype, self.tensor_form + einsum, shapes = tn # unpack + tn = TensorDecompositionFactory(einsum, shapes, dtype) + h = [] + for t in tn.tensor_names: + t = cutn.create_tensor_descriptor( + self.handle, + *tn.get_tensor_metadata(t, **tensor_form), + dtype_to_data_type[dtype]) + h.append(t) + # we also need to keep the tn data alive + self.tn = tn elif name == 'config': h = cutn.create_contraction_optimizer_config(self.handle) elif name == 'info': h = cutn.create_contraction_optimizer_info( self.handle, self.dscr) + elif name == 'svd_config': + h = cutn.create_tensor_svd_config(self.handle) + elif name == 'svd_info': + h = cutn.create_tensor_svd_info(self.handle) elif name == 'autotune': h = cutn.create_contraction_autotune_preference(self.handle) elif name == 'workspace': @@ -103,12 +101,22 @@ def test_func(self, *args, **kwargs): elif name == 'dscr' and hasattr(self, name): cutn.destroy_network_descriptor(self.dscr) del self.dscr + elif name == 'tensor_decom' and hasattr(self, name): + for t in self.tensor_decom: + cutn.destroy_tensor_descriptor(t) + del self.tensor_decom elif name == 'config' and hasattr(self, name): cutn.destroy_contraction_optimizer_config(self.config) del self.config elif name == 'info' and hasattr(self, name): cutn.destroy_contraction_optimizer_info(self.info) del self.info + elif name == 'svd_config' and hasattr(self, name): + cutn.destroy_tensor_svd_config(self.svd_config) + del self.svd_config + elif name == 'svd_info' and hasattr(self, name): + cutn.destroy_tensor_svd_info(self.svd_info) + del self.svd_info elif name == 'autotune' and hasattr(self, name): cutn.destroy_contraction_autotune_preference(self.autotune) del self.autotune @@ -122,155 +130,6 @@ def test_func(self, *args, **kwargs): return decorator -# we don't wanna recompile for every test case... 
-_cffi_mod1 = None -_cffi_mod2 = None - -def _can_use_cffi(): - if cffi is None or os.environ.get('CUDA_PATH') is None: - return False - else: - return True - - -class MemoryResourceFactory: - - def __init__(self, source, name=None): - self.source = source - self.name = source if name is None else name - - def get_dev_mem_handler(self): - if self.source == "py-callable": - return (*self._get_cuda_callable(), self.name) - elif self.source == "cffi": - # ctx is not needed, so set to NULL - return (0, *self._get_functor_address(), self.name) - elif self.source == "cffi_struct": - return self._get_handler_address() - # TODO: add more different memory sources - else: - raise NotImplementedError - - def _get_cuda_callable(self): - def alloc(size, stream): - return cupy.cuda.runtime.mallocAsync(size, stream) - - def free(ptr, size, stream): - cupy.cuda.runtime.freeAsync(ptr, stream) - - return alloc, free - - def _get_functor_address(self): - if not _can_use_cffi(): - raise RuntimeError - - global _cffi_mod1 - if _cffi_mod1 is None: - import importlib - mod_name = f"cutn_test_{self.source}" - ffi = cffi.FFI() - ffi.set_source(mod_name, """ - #include <cuda_runtime.h> - - // cffi limitation: we can't use the actual type cudaStream_t because - // it's considered an "incomplete" type and we can't get the functor - // address by doing so... 
- - int my_alloc(void* ctx, void** ptr, size_t size, void* stream) { - return (int)cudaMallocAsync(ptr, size, stream); - } - - int my_free(void* ctx, void* ptr, size_t size, void* stream) { - return (int)cudaFreeAsync(ptr, stream); - } - """, - include_dirs=[os.environ['CUDA_PATH']+'/include'], - library_dirs=[os.environ['CUDA_PATH']+'/lib64'], - libraries=['cudart'], - ) - ffi.cdef(""" - int my_alloc(void* ctx, void** ptr, size_t size, void* stream); - int my_free(void* ctx, void* ptr, size_t size, void* stream); - """) - ffi.compile(verbose=True) - self.ffi = ffi - _cffi_mod1 = importlib.import_module(mod_name) - self.ffi_mod = _cffi_mod1 - - alloc_addr = self._get_address("my_alloc") - free_addr = self._get_address("my_free") - return alloc_addr, free_addr - - def _get_handler_address(self): - if not _can_use_cffi(): - raise RuntimeError - - global _cffi_mod2 - if _cffi_mod2 is None: - import importlib - mod_name = f"cutn_test_{self.source}" - ffi = cffi.FFI() - ffi.set_source(mod_name, """ - #include <cuda_runtime.h> - - // cffi limitation: we can't use the actual type cudaStream_t because - // it's considered an "incomplete" type and we can't get the functor - // address by doing so... 
- - int my_alloc(void* ctx, void** ptr, size_t size, void* stream) { - return (int)cudaMallocAsync(ptr, size, stream); - } - - int my_free(void* ctx, void* ptr, size_t size, void* stream) { - return (int)cudaFreeAsync(ptr, stream); - } - - typedef struct { - void* ctx; - int (*device_alloc)(void* ctx, void** ptr, size_t size, void* stream); - int (*device_free)(void* ctx, void* ptr, size_t size, void* stream); - char name[64]; - } myHandler; - - myHandler* init_myHandler(myHandler* h, const char* name) { - h->ctx = NULL; - h->device_alloc = my_alloc; - h->device_free = my_free; - memcpy(h->name, name, 64); - return h; - } - """, - include_dirs=[os.environ['CUDA_PATH']+'/include'], - library_dirs=[os.environ['CUDA_PATH']+'/lib64'], - libraries=['cudart'], - ) - ffi.cdef(""" - typedef struct { - ...; - } myHandler; - - myHandler* init_myHandler(myHandler* h, const char* name); - """) - ffi.compile(verbose=True) - self.ffi = ffi - _cffi_mod2 = importlib.import_module(mod_name) - self.ffi_mod = _cffi_mod2 - - h = self.handler = self.ffi_mod.ffi.new("myHandler*") - self.ffi_mod.lib.init_myHandler(h, self.name.encode()) - return self._get_address(h) - - def _get_address(self, func_name_or_ptr): - if isinstance(func_name_or_ptr, str): - func_name = func_name_or_ptr - data = str(self.ffi_mod.ffi.addressof(self.ffi_mod.lib, func_name)) - else: - ptr = func_name_or_ptr # ptr to struct - data = str(self.ffi_mod.ffi.addressof(ptr[0])) - # data has this format: "" - return int(data.split()[-1][:-1], base=16) - - class TestLibHelper: def test_get_version(self): @@ -310,20 +169,21 @@ def __init__(self, einsum, shapes, dtype): assert all([len(i) == len(s) for i, s in zip(inputs, i_shapes)]) assert len(output) == len(o_shape) + # xp strides in bytes, cutn strides in counts + itemsize = cupy.dtype(dtype).itemsize + self.input_tensors = [ testing.shaped_random(s, cupy, dtype) for s in i_shapes] self.input_n_modes = [len(i) for i in inputs] self.input_extents = i_shapes - 
self.input_strides = [arr.strides for arr in self.input_tensors] + self.input_strides = [[stride // itemsize for stride in arr.strides] for arr in self.input_tensors] self.input_modes = [tuple([ord(m) for m in i]) for i in inputs] - self.input_alignments = [256] * len(i_shapes) self.output_tensor = cupy.empty(o_shape, dtype=dtype) self.output_n_modes = len(o_shape) self.output_extent = o_shape - self.output_stride = self.output_tensor.strides + self.output_stride = [stride // itemsize for stride in self.output_tensor.strides] self.output_mode = tuple([ord(m) for m in output]) - self.output_alignment = 256 def _get_data_type(self, category): if 'n_modes' in category: @@ -334,8 +194,6 @@ def _get_data_type(self, category): return numpy.int64 elif 'mode' in category: return numpy.int32 - elif 'alignment' in category: - return numpy.uint32 elif 'tensor' in category: return None # unused else: @@ -397,17 +255,14 @@ def get_input_metadata(self, **kwargs): extents = self._return_data('input_extents', kwargs.pop('extent')) strides = self._return_data('input_strides', kwargs.pop('stride')) modes = self._return_data('input_modes', kwargs.pop('mode')) - alignments = self._return_data( - 'input_alignments', kwargs.pop('alignment')) - return n_inputs, n_modes, extents, strides, modes, alignments + return n_inputs, n_modes, extents, strides, modes def get_output_metadata(self, **kwargs): n_modes = self.output_n_modes extent = self._return_data('output_extent', kwargs.pop('extent')) stride = self._return_data('output_stride', kwargs.pop('stride')) mode = self._return_data('output_mode', kwargs.pop('mode')) - alignment = self.output_alignment - return n_modes, extent, stride, mode, alignment + return n_modes, extent, stride, mode def get_input_tensors(self, **kwargs): data = self._return_data('input_tensors', kwargs['data']) @@ -429,11 +284,11 @@ def get_output_tensor(self): ), 'input_form': ( {'n_modes': 'int', 'extent': 'int', 'stride': 'int', - 'mode': 'int', 'alignment': 
'int', 'data': 'int'}, + 'mode': 'int', 'data': 'int'}, {'n_modes': 'int', 'extent': 'seq', 'stride': 'seq', - 'mode': 'seq', 'alignment': 'int', 'data': 'seq'}, + 'mode': 'seq', 'data': 'seq'}, {'n_modes': 'seq', 'extent': 'nested_seq', 'stride': 'nested_seq', - 'mode': 'seq', 'alignment': 'seq', 'data': 'seq'}, + 'mode': 'seq', 'data': 'seq'}, ), 'output_form': ( {'extent': 'int', 'stride': 'int', 'mode': 'int'}, @@ -448,18 +303,33 @@ class TestTensorNetworkBase: class TestTensorNetworkDescriptor(TestTensorNetworkBase): + @pytest.mark.parametrize( + 'API', ('old', 'new') + ) @manage_resource('handle') @manage_resource('dscr') - def test_descriptor_create_destroy(self): + def test_descriptor_create_destroy(self, API): # we could just do a simple round-trip test, but let's also get # this helper API tested handle, dscr = self.handle, self.dscr - num_modes, modes, extents, strides = cutn.get_output_tensor_details(handle, dscr) + + if API == 'old': + # TODO: remove this branch + num_modes, modes, extents, strides = cutn.get_output_tensor_details( + handle, dscr) + else: + tensor_dscr = cutn.get_output_tensor_descriptor(handle, dscr) + num_modes, modes, extents, strides = cutn.get_tensor_details( + handle, tensor_dscr) + assert num_modes == self.tn.output_n_modes assert (modes == numpy.asarray(self.tn.output_mode, dtype=numpy.int32)).all() assert (extents == numpy.asarray(self.tn.output_extent, dtype=numpy.int64)).all() assert (strides == numpy.asarray(self.tn.output_stride, dtype=numpy.int64)).all() + if API == 'new': + cutn.destroy_tensor_descriptor(tensor_dscr) + class TestOptimizerInfo(TestTensorNetworkBase): @@ -468,12 +338,11 @@ def _get_path(self, handle, info): def _set_path(self, handle, info, path): attr = cutn.ContractionOptimizerInfoAttribute.PATH + dtype = cutn.contraction_optimizer_info_get_attribute_dtype(attr) if not isinstance(path, numpy.ndarray): path = numpy.ascontiguousarray(path, dtype=numpy.int32) - num_contraction = path.shape[0] - p = 
cutn.ContractionPath(num_contraction, path.ctypes.data) - cutn.contraction_optimizer_info_set_attribute( - handle, info, attr, p.get_path(), p.get_size()) + path_obj = numpy.asarray((path.shape[0], path.ctypes.data), dtype=dtype) + self._set_scalar_attr(handle, info, attr, path_obj) def _get_scalar_attr(self, handle, info, attr): dtype = cutn.contraction_optimizer_info_get_attribute_dtype(attr) @@ -507,6 +376,7 @@ def test_optimizer_info_create_destroy(self): def test_optimizer_info_get_set_attribute(self, attr): if attr in ( cutn.ContractionOptimizerInfoAttribute.NUM_SLICES, + cutn.ContractionOptimizerInfoAttribute.NUM_SLICED_MODES, cutn.ContractionOptimizerInfoAttribute.PHASE1_FLOP_COUNT, cutn.ContractionOptimizerInfoAttribute.FLOP_COUNT, cutn.ContractionOptimizerInfoAttribute.LARGEST_TENSOR, @@ -519,6 +389,7 @@ def test_optimizer_info_get_set_attribute(self, attr): cutn.ContractionOptimizerInfoAttribute.PATH, cutn.ContractionOptimizerInfoAttribute.SLICED_MODE, cutn.ContractionOptimizerInfoAttribute.SLICED_EXTENT, + cutn.ContractionOptimizerInfoAttribute.SLICING_CONFIG, cutn.ContractionOptimizerInfoAttribute.INTERMEDIATE_MODES, cutn.ContractionOptimizerInfoAttribute.NUM_INTERMEDIATE_MODES, ): @@ -697,10 +568,16 @@ def test_contraction_workflow( # manage workspace if mempool is None: cutn.workspace_compute_sizes(handle, dscr, info, workspace) + required_size_deprecated = cutn.workspace_get_size( + handle, workspace, + getattr(cutn.WorksizePref, f"{workspace_pref.upper()}"), + cutn.Memspace.DEVICE) # TODO: parametrize memspace? + cutn.workspace_compute_contraction_sizes(handle, dscr, info, workspace) required_size = cutn.workspace_get_size( handle, workspace, getattr(cutn.WorksizePref, f"{workspace_pref.upper()}"), cutn.Memspace.DEVICE) # TODO: parametrize memspace? 
+ assert required_size == required_size_deprecated if workspace_size < required_size: assert False, \ f"wrong assumption on the workspace size " \ @@ -780,77 +657,601 @@ def test_slice_group(self, source): @pytest.mark.parametrize( 'source', (None, "py-callable", 'cffi', 'cffi_struct') ) -class TestMemHandler: +class TestMemHandler(MemHandlerTestBase): + + mod = cutn + prefix = "cutensornet" + error = cutn.cuTensorNetError @manage_resource('handle') def test_set_get_device_mem_handler(self, source): - if (isinstance(source, str) and source.startswith('cffi') - and not _can_use_cffi()): - pytest.skip("cannot run cffi tests") + self._test_set_get_device_mem_handler(source, self.handle) - handle = self.handle - if source is not None: - mr = MemoryResourceFactory(source) - handler = mr.get_dev_mem_handler() - cutn.set_device_mem_handler(handle, handler) - # round-trip test - queried_handler = cutn.get_device_mem_handler(handle) - if source == 'cffi_struct': - # I'm lazy, otherwise I'd also fetch the functor addresses here... - assert queried_handler[0] == 0 # ctx is NULL - assert queried_handler[-1] == source + +class TensorDecompositionFactory: + + # QR Example: "ab->ax,xb" + # SVD Example: "ab->ax,x,xb" + # Gate Example: "ijk,klm,jlpq->->ipk,k,kqm" for indirect gate with singular values returned. + # "ijk,klm,jlpq->ipk,-,kqm" for direct gate algorithm with singular values equally partitioned onto u and v + + # self.reconstruct must be a valid einsum expr and can be used to reconstruct + # the input tensor if no/little truncation was done + + # This factory CANNOT be reused; once a tensor descriptor uses it, it must + # be discarded.
+ + svd_partitioned = ('<', '-', '>') # reserved symbols + + def __init__(self, einsum, shapes, dtype): + if len(shapes) == 3: + self.tensor_names = ['input', 'left', 'right'] + self.einsum = einsum + elif len(shapes) == 5: + self.tensor_names = ['inputA', 'inputB', 'inputG', 'left', 'right'] + if einsum.count("->") == 1: + self.gate_algorithm = cutn.GateSplitAlgo.DIRECT + self.einsum = einsum + elif einsum.count("->") == 2: + self.gate_algorithm = cutn.GateSplitAlgo.REDUCED + self.einsum = einsum.replace("->->", "->") else: - assert queried_handler == handler + raise NotImplementedError + else: + raise NotImplementedError + + inputs, output = self.einsum.split('->') + output = output.split(',') + if len(output) == 2: # QR + left, right = output + self.reconstruct = f"{left},{right}->{inputs}" + all_modes = [inputs, left, right] + elif len(output) == 3: # SVD or Gate + left, mid_mode, right = output + common_mode = set(left).intersection(right).pop() + assert len(common_mode) == 1 + idx_left = left.find(common_mode) + idx_right = right.find(common_mode) + self.mid_mode = mid_mode + + if len(shapes) == 3: # svd + all_modes = [inputs, left, right] + assert shapes[1][idx_left] == shapes[2][idx_right] + self.mid_extent = shapes[1][idx_left] + self.reference_einsum = None + if mid_mode in self.svd_partitioned: + # s is already merged into left, both, or right + self.reconstruct = f"{left},{right}->{inputs}" + else: + assert mid_mode == common_mode + self.reconstruct = f"{left},{common_mode},{right}->{inputs}" + else: # Gate + all_modes = list(inputs.split(","))+[left, right] + assert shapes[3][idx_left] == shapes[4][idx_right] + self.mid_extent = shapes[3][idx_left] + contracted_output_modes = "".join((set(left) | set(right)) - (set(left) & set(right))) + self.reference_einsum = f"{inputs}->{contracted_output_modes}" + if mid_mode in self.svd_partitioned: + # s is already merged into left, both, or right + self.reconstruct = f"{left},{right}->{contracted_output_modes}" 
+ else: + assert mid_mode == common_mode + self.reconstruct = f"{left},{common_mode},{right}->{contracted_output_modes}" else: - with pytest.raises(cutn.cuTensorNetError) as e: - queried_handler = cutn.get_device_mem_handler(handle) - assert 'CUTENSORNET_STATUS_NO_DEVICE_ALLOCATOR' in str(e.value) + assert False + del output + + # xp strides in bytes, cutn strides in counts + dtype = cupy.dtype(dtype) + itemsize = dtype.itemsize + + for name, shape, modes in zip(self.tensor_names, shapes, all_modes): + real_dtype = dtype.char.lower() + if name.startswith('input'): + if dtype.char != real_dtype: # complex + arr = (cupy.random.random(shape, dtype=real_dtype) + + 1j*cupy.random.random(shape, dtype=real_dtype)).astype(dtype) + else: + arr = cupy.random.random(shape, dtype=dtype) + else: + arr = cupy.empty(shape, dtype=dtype, order='F') + setattr(self, f'{name}_tensor', arr) + setattr(self, f'{name}_n_modes', len(arr.shape)) + setattr(self, f'{name}_extent', arr.shape) + setattr(self, f'{name}_stride', [stride // itemsize for stride in arr.strides]) + setattr(self, f'{name}_mode', tuple([ord(m) for m in modes])) + def _get_data_type(self, category): + if 'n_modes' in category: + return numpy.int32 + elif 'extent' in category: + return numpy.int64 + elif 'stride' in category: + return numpy.int64 + elif 'mode' in category: + return numpy.int32 + elif 'tensor' in category: + return None # unused + else: + assert False -class TestLogger: + def _return_data(self, category, return_value): + data = getattr(self, category) - def test_logger_set_level(self): - cutn.logger_set_level(6) # on - cutn.logger_set_level(0) # off + if return_value == 'int': + if len(data) == 0: + # empty, give it a NULL + return 0 + else: + # return int as void* + data = numpy.asarray(data, dtype=self._get_data_type(category)) + setattr(self, category, data) # keep data alive + return data.ctypes.data + elif return_value == 'seq': + return data + else: + assert False - def test_logger_set_mask(self): - 
cutn.logger_set_mask(16) # should not raise + def get_tensor_metadata(self, name, **kwargs): + assert name in self.tensor_names + n_modes = getattr(self, f'{name}_n_modes') + extent = self._return_data(f'{name}_extent', kwargs.pop('extent')) + stride = self._return_data(f'{name}_stride', kwargs.pop('stride')) + mode = self._return_data(f'{name}_mode', kwargs.pop('mode')) + return n_modes, extent, stride, mode - def test_logger_set_callback_data(self): - # we also test logger_open_file() here to avoid polluting stdout + def get_tensor_ptr(self, name): + return getattr(self, f'{name}_tensor').data.ptr - def callback(level, name, message, my_data, is_ok=False): - log = f"{level}, {name}, {message} (is_ok={is_ok}) -> logged\n" - my_data.append(log) - handle = None - my_data = [] - is_ok = True +@testing.parameterize(*testing.product({ + 'tn': ( + ('ab->ax,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->ax,bx', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,bx', [(8, 8), (8, 8), (8, 8)]), + ('ab->ax,xb', [(6, 8), (6, 6), (6, 8)]), + ('ab->ax,bx', [(6, 8), (6, 6), (8, 6)]), + ('ab->xa,xb', [(6, 8), (6, 6), (6, 8)]), + ('ab->xa,bx', [(6, 8), (6, 6), (8, 6)]), + ('ab->ax,xb', [(8, 6), (8, 6), (6, 6)]), + ('ab->ax,bx', [(8, 6), (8, 6), (6, 6)]), + ('ab->xa,xb', [(8, 6), (6, 8), (6, 6)]), + ('ab->xa,bx', [(8, 6), (6, 8), (6, 6)]), + ), + 'dtype': ( + numpy.float32, numpy.float64, numpy.complex64, numpy.complex128 + ), + 'tensor_form': ( + {'extent': 'int', 'stride': 'int', 'mode': 'int'}, + {'extent': 'seq', 'stride': 'seq', 'mode': 'seq'}, + ), +})) +class TestTensorQR: - with tempfile.TemporaryDirectory() as temp: - file_name = os.path.join(temp, "cutn_test") - cutn.logger_open_file(file_name) - cutn.logger_set_callback_data(callback, my_data, is_ok=is_ok) - cutn.logger_set_level(6) + # There is no easy way for us to test each API independently, so we instead + # parametrize the steps and test the whole workflow + @manage_resource('handle') + 
@manage_resource('tensor_decom') + @manage_resource('workspace') + def test_tensor_qr(self): + # unpack + handle, tn, workspace = self.handle, self.tn, self.workspace + tensor_in, tensor_q, tensor_r = self.tensor_decom + dtype = cupy.dtype(self.dtype) + + # prepare workspace + cutn.workspace_compute_qr_sizes( + handle, tensor_in, tensor_q, tensor_r, workspace) + # for now host workspace is always 0, so just query device one + # also, it doesn't matter which one (min/recommended/max) is queried + required_size = cutn.workspace_get_size( + handle, workspace, cutn.WorksizePref.MIN, + cutn.Memspace.DEVICE) # TODO: parametrize memspace? + if required_size > 0: + workspace_ptr = cupy.cuda.alloc(required_size) + cutn.workspace_set( + handle, workspace, cutn.Memspace.DEVICE, + workspace_ptr.ptr, required_size) + # round-trip check + assert (workspace_ptr.ptr, required_size) == cutn.workspace_get( + handle, workspace, cutn.Memspace.DEVICE) + + # perform QR + stream = cupy.cuda.get_current_stream().ptr # TODO + cutn.tensor_qr( + handle, tensor_in, tn.get_tensor_ptr('input'), + tensor_q, tn.get_tensor_ptr('left'), + tensor_r, tn.get_tensor_ptr('right'), + workspace, stream) + + # we add a minimal correctness check here as we are not protected by + # any high-level API yet + out = cupy.einsum(tn.reconstruct, tn.left_tensor, tn.right_tensor) + assert cupy.allclose(out, tn.input_tensor, + rtol=rtol_mapper[dtype.name], + atol=atol_mapper[dtype.name]) + + +# TODO: expand tests: +# - add truncation +# - use config (cutoff & normalization) +@testing.parameterize(*testing.product({ + 'tn': ( + # no truncation, no partition + ('ab->ax,x,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->ax,x,bx', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,x,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,x,bx', [(8, 8), (8, 8), (8, 8)]), + ('ab->ax,x,xb', [(6, 8), (6, 6), (6, 8)]), + ('ab->ax,x,bx', [(6, 8), (6, 6), (8, 6)]), + ('ab->xa,x,xb', [(6, 8), (6, 6), (6, 8)]), + ('ab->xa,x,bx', [(6, 8), (6, 6), (8, 6)]), + 
('ab->ax,x,xb', [(8, 6), (8, 6), (6, 6)]), + ('ab->ax,x,bx', [(8, 6), (8, 6), (6, 6)]), + ('ab->xa,x,xb', [(8, 6), (6, 8), (6, 6)]), + ('ab->xa,x,bx', [(8, 6), (6, 8), (6, 6)]), + # no truncation, partition to u + ('ab->ax,<,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->ax,<,bx', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,<,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,<,bx', [(8, 8), (8, 8), (8, 8)]), + # no truncation, partition to v + ('ab->ax,>,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->ax,>,bx', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,>,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,>,bx', [(8, 8), (8, 8), (8, 8)]), + # no truncation, partition to both + ('ab->ax,-,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->ax,-,bx', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,-,xb', [(8, 8), (8, 8), (8, 8)]), + ('ab->xa,-,bx', [(8, 8), (8, 8), (8, 8)]), + ), + 'dtype': ( + numpy.float32, numpy.float64, numpy.complex64, numpy.complex128 + ), + 'tensor_form': ( + {'extent': 'int', 'stride': 'int', 'mode': 'int'}, + {'extent': 'seq', 'stride': 'seq', 'mode': 'seq'}, + ), +})) +class TestTensorSVD: + + def _get_scalar_attr(self, handle, obj_type, obj, attr): + if obj_type == 'config': + dtype_getter = cutn.tensor_svd_config_get_attribute_dtype + getter = cutn.tensor_svd_config_get_attribute + elif obj_type == 'info': + dtype_getter = cutn.tensor_svd_info_get_attribute_dtype + getter = cutn.tensor_svd_info_get_attribute + else: + assert False - try: - handle = cutn.create() - cutn.destroy(handle) - except: - if handle: - cutn.destroy(handle) - raise - finally: - cutn.logger_force_disable() # to not affect the rest of tests + dtype = dtype_getter(attr) + data = numpy.empty((1,), dtype=dtype) + getter(handle, obj, attr, data.ctypes.data, data.dtype.itemsize) + return data + + def _set_scalar_attr(self, handle, obj_type, obj, attr, data): + assert obj_type == 'config' # svd info has no setter + dtype_getter = cutn.tensor_svd_config_get_attribute_dtype + setter = cutn.tensor_svd_config_set_attribute + + dtype = 
dtype_getter(attr) + if not isinstance(data, numpy.ndarray): + data = numpy.asarray(data, dtype=dtype) + setter(handle, obj, attr, data.ctypes.data, data.dtype.itemsize) + + # There is no easy way for us to test each API independently, so we instead + # parametrize the steps and test the whole workflow + @manage_resource('handle') + @manage_resource('tensor_decom') + @manage_resource('svd_config') + @manage_resource('svd_info') + @manage_resource('workspace') + def test_tensor_svd(self): + # unpack + handle, tn, workspace = self.handle, self.tn, self.workspace + tensor_in, tensor_u, tensor_v = self.tensor_decom + svd_config, svd_info = self.svd_config, self.svd_info + dtype = cupy.dtype(self.dtype) + + # prepare workspace + cutn.workspace_compute_svd_sizes( + handle, tensor_in, tensor_u, tensor_v, svd_config, workspace) + # for now host workspace is always 0, so just query device one + # also, it doesn't matter which one (min/recommended/max) is queried + required_size = cutn.workspace_get_size( + handle, workspace, cutn.WorksizePref.MIN, + cutn.Memspace.DEVICE) # TODO: parametrize memspace? 
+ if required_size > 0: + workspace_ptr = cupy.cuda.alloc(required_size) + cutn.workspace_set( + handle, workspace, cutn.Memspace.DEVICE, + workspace_ptr.ptr, required_size) + # round-trip check + assert (workspace_ptr.ptr, required_size) == cutn.workspace_get( + handle, workspace, cutn.Memspace.DEVICE) + + # set singular value partitioning, if requested + if tn.mid_mode in tn.svd_partitioned: + if tn.mid_mode == '<': + data = cutn.TensorSVDPartition.US + elif tn.mid_mode == '-': + data = cutn.TensorSVDPartition.UV_EQUAL + else: # = '>': + data = cutn.TensorSVDPartition.SV + self._set_scalar_attr( + handle, 'config', svd_config, + cutn.TensorSVDConfigAttribute.S_PARTITION, + data) + # do a round-trip test as a sanity check + factor = self._get_scalar_attr( + handle, 'config', svd_config, + cutn.TensorSVDConfigAttribute.S_PARTITION) + assert factor == data + + # perform SVD + stream = cupy.cuda.get_current_stream().ptr # TODO + if tn.mid_mode in tn.svd_partitioned: + s_ptr = 0 + else: + s = cupy.empty(tn.mid_extent, dtype=dtype.char.lower()) + s_ptr = s.data.ptr + cutn.tensor_svd( + handle, tensor_in, tn.get_tensor_ptr('input'), + tensor_u, tn.get_tensor_ptr('left'), + s_ptr, + tensor_v, tn.get_tensor_ptr('right'), + svd_config, svd_info, workspace, stream) + + # sanity checks (only valid for no truncation) + assert tn.mid_extent == self._get_scalar_attr( + handle, 'info', svd_info, + cutn.TensorSVDInfoAttribute.FULL_EXTENT) + assert tn.mid_extent == self._get_scalar_attr( + handle, 'info', svd_info, + cutn.TensorSVDInfoAttribute.REDUCED_EXTENT) + assert 0 == self._get_scalar_attr( + handle, 'info', svd_info, + cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT) + + # we add a minimal correctness check here as we are not protected by + # any high-level API yet + if tn.mid_mode in tn.svd_partitioned: + out = cupy.einsum( + tn.reconstruct, tn.left_tensor, tn.right_tensor) + else: + out = cupy.einsum( + tn.reconstruct, tn.left_tensor, s, tn.right_tensor) + assert
cupy.allclose(out, tn.input_tensor, + rtol=rtol_mapper[dtype.name], + atol=atol_mapper[dtype.name]) + + +# TODO: expand tests: +# - add truncation +# - use config (cutoff & normalization) +@testing.parameterize(*testing.product({ + 'tn': ( + # direct algorithm, no truncation, no partition + ('ijk,klm,jlpq->ipk,k,kqm', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->kpi,k,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->pki,k,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + # direct algorithm, no truncation, partition onto u + ('ijk,klm,jlpq->ipk,<,kqm', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->kpi,<,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->pki,<,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + # direct algorithm, no truncation, partition onto v + ('ijk,klm,jlpq->ipk,>,kqm', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->kpi,>,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->pki,>,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + # direct algorithm, no truncation, partition onto u and v equally + ('ijk,klm,jlpq->ipk,-,kqm', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->kpi,-,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->pki,-,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + # reduced algorithm, no truncation, no partition + ('ijk,klm,jlpq->->ipk,k,kqm', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->->kpi,k,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->->pki,k,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + # reduced algorithm, no truncation, partition onto u + ('ijk,klm,jlpq->->ipk,<,kqm', [(4, 2, 4), (4, 2, 4), 
(2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->->kpi,<,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->->pki,<,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + # reduced algorithm, no truncation, partition onto v + ('ijk,klm,jlpq->->ipk,>,kqm', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->->kpi,>,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->->pki,>,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + # reduced algorithm, no truncation, partition onto u and v equally + ('ijk,klm,jlpq->->ipk,-,kqm', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (4, 2, 8), (8, 2, 4)]), + ('ijk,klm,jlpq->->kpi,-,qmk', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (8, 2, 4), (2, 4, 8)]), + ('ijk,klm,jlpq->->pki,-,mkq', [(4, 2, 4), (4, 2, 4), (2, 2, 2, 2), (2, 8, 4), (4, 8, 2)]), + ), + 'dtype': ( + numpy.float32, numpy.float64, numpy.complex64, numpy.complex128 + ), + 'tensor_form': ( + {'extent': 'int', 'stride': 'int', 'mode': 'int'}, + {'extent': 'seq', 'stride': 'seq', 'mode': 'seq'}, + ), +})) +class TestTensorGate: + + def _get_scalar_attr(self, handle, obj_type, obj, attr): + if obj_type == 'config': + dtype_getter = cutn.tensor_svd_config_get_attribute_dtype + getter = cutn.tensor_svd_config_get_attribute + elif obj_type == 'info': + dtype_getter = cutn.tensor_svd_info_get_attribute_dtype + getter = cutn.tensor_svd_info_get_attribute + else: + assert False + + dtype = dtype_getter(attr) + data = numpy.empty((1,), dtype=dtype) + getter(handle, obj, attr, data.ctypes.data, data.dtype.itemsize) + return data + + def _set_scalar_attr(self, handle, obj_type, obj, attr, data): + assert obj_type == 'config' # svd info has no setter + dtype_getter = cutn.tensor_svd_config_get_attribute_dtype + setter = cutn.tensor_svd_config_set_attribute + + dtype = dtype_getter(attr) + if not isinstance(data, numpy.ndarray): + data = numpy.asarray(data, dtype=dtype) 
+ setter(handle, obj, attr, data.ctypes.data, data.dtype.itemsize) + + # There is no easy way for us to test each API independently, so we instead + # parametrize the steps and test the whole workflow + @manage_resource('handle') + @manage_resource('tensor_decom') + @manage_resource('svd_config') + @manage_resource('svd_info') + @manage_resource('workspace') + def test_gate_split(self): + # unpack + handle, tn, workspace = self.handle, self.tn, self.workspace + tensor_in_a, tensor_in_b, tensor_in_g, tensor_u, tensor_v = self.tensor_decom + gate_algorithm = tn.gate_algorithm + svd_config, svd_info = self.svd_config, self.svd_info + dtype = cupy.dtype(self.dtype) + compute_type = dtype_to_compute_type[self.dtype] + # prepare workspace + cutn.workspace_compute_gate_split_sizes(handle, + tensor_in_a, tensor_in_b, tensor_in_g, tensor_u, tensor_v, + gate_algorithm, svd_config, compute_type, workspace) + # for now host workspace is always 0, so just query device one + # also, it doesn't matter which one (min/recommended/max) is queried + required_size = cutn.workspace_get_size( + handle, workspace, cutn.WorksizePref.MIN, + cutn.Memspace.DEVICE) # TODO: parametrize memspace? 
+ if required_size > 0: + workspace_ptr = cupy.cuda.alloc(required_size) + cutn.workspace_set( + handle, workspace, cutn.Memspace.DEVICE, + workspace_ptr.ptr, required_size) + # round-trip check + assert (workspace_ptr.ptr, required_size) == cutn.workspace_get( + handle, workspace, cutn.Memspace.DEVICE) + + # set singular value partitioning, if requested + if tn.mid_mode in tn.svd_partitioned: + if tn.mid_mode == '<': + data = cutn.TensorSVDPartition.US + elif tn.mid_mode == '-': + data = cutn.TensorSVDPartition.UV_EQUAL + else: # = '>': + data = cutn.TensorSVDPartition.SV + self._set_scalar_attr( + handle, 'config', svd_config, + cutn.TensorSVDConfigAttribute.S_PARTITION, + data) + # do a round-trip test as a sanity check + factor = self._get_scalar_attr( + handle, 'config', svd_config, + cutn.TensorSVDConfigAttribute.S_PARTITION) + assert factor == data + + # perform gate split + stream = cupy.cuda.get_current_stream().ptr # TODO + if tn.mid_mode in tn.svd_partitioned: + s_ptr = 0 + else: + s = cupy.empty(tn.mid_extent, dtype=dtype.char.lower()) + s_ptr = s.data.ptr + cutn.gate_split(handle, tensor_in_a, tn.get_tensor_ptr('inputA'), + tensor_in_b, tn.get_tensor_ptr('inputB'), + tensor_in_g, tn.get_tensor_ptr('inputG'), + tensor_u, tn.get_tensor_ptr('left'), s_ptr, + tensor_v, tn.get_tensor_ptr('right'), + gate_algorithm, svd_config, compute_type, + svd_info, workspace, stream) + + # sanity checks (only valid for no truncation) + assert tn.mid_extent == self._get_scalar_attr( + handle, 'info', svd_info, + cutn.TensorSVDInfoAttribute.FULL_EXTENT) + assert tn.mid_extent == self._get_scalar_attr( + handle, 'info', svd_info, + cutn.TensorSVDInfoAttribute.REDUCED_EXTENT) + assert 0 == self._get_scalar_attr( + handle, 'info', svd_info, + cutn.TensorSVDInfoAttribute.DISCARDED_WEIGHT) + + # we add a minimal correctness check here as we are not protected by + # any high-level API yet + if tn.mid_mode in tn.svd_partitioned: + out = cupy.einsum( + tn.reconstruct,
tn.left_tensor, tn.right_tensor) + else: + out = cupy.einsum( + tn.reconstruct, tn.left_tensor, s, tn.right_tensor) + reference = cupy.einsum(tn.reference_einsum, tn.inputA_tensor, tn.inputB_tensor, tn.inputG_tensor) + error = cupy.linalg.norm(out - reference) + assert cupy.allclose(out, reference, + rtol=rtol_mapper[dtype.name], + atol=atol_mapper[dtype.name]) + + +class TestTensorSVDConfig: + + @manage_resource('handle') + @manage_resource('svd_config') + def test_tensor_svd_config_create_destroy(self): + # simple round-trip test + pass + + @pytest.mark.parametrize( + 'attr', [val for val in cutn.TensorSVDConfigAttribute] + ) + @manage_resource('handle') + @manage_resource('svd_config') + def test_tensor_svd_config_get_set_attribute(self, attr): + handle, svd_config = self.handle, self.svd_config + dtype = cutn.tensor_svd_config_get_attribute_dtype(attr) + # Hack: assume this is a valid value for all attrs + factor = numpy.asarray([0.8], dtype=dtype) + cutn.tensor_svd_config_set_attribute( + handle, svd_config, attr, + factor.ctypes.data, factor.dtype.itemsize) + # do a round-trip test as a sanity check + factor2 = numpy.zeros_like(factor) + cutn.tensor_svd_config_get_attribute( + handle, svd_config, attr, + factor2.ctypes.data, factor2.dtype.itemsize) + assert factor == factor2 + + +@pytest.mark.skipif(mpi4py is None, reason="need mpi4py") +@pytest.mark.skipif(os.environ.get("CUTENSORNET_COMM_LIB") is None, + reason="wrapper lib not set") +class TestDistributed: + + def _get_comm(self, comm): + if comm == 'world': + return MPI.COMM_WORLD.Dup() + elif comm == 'self': + return MPI.COMM_SELF.Dup() + else: + assert False + + @pytest.mark.parametrize( + 'comm', ('world', 'self'), + ) + @manage_resource('handle') + def test_distributed(self, comm): + handle = self.handle + comm = self._get_comm(comm) + cutn.distributed_reset_configuration( + handle, *cutn.get_mpi_comm_pointer(comm)) + assert comm.Get_size() == cutn.distributed_get_num_ranks(handle) + assert 
comm.Get_rank() == cutn.distributed_get_proc_rank(handle) + cutn.distributed_synchronize(handle) + # no need to free the comm, for world/self mpi4py does it for us... - with open(file_name) as f: - log_from_f = f.read() - # check the log file - assert '[cutensornetCreate]' in log_from_f - assert '[cutensornetDestroy]' in log_from_f +class TestLogger(LoggerTestBase): - # check the captured data (note we log 2 APIs) - log = ''.join(my_data) - assert log.count("-> logged") >= 2 - assert log.count("is_ok=True") >= 2 + mod = cutn + prefix = "cutensornet" diff --git a/python/tests/cuquantum_tests/cutensornet_tests/test_internal.py b/python/tests/cuquantum_tests/cutensornet_tests/test_internal.py new file mode 100644 index 0000000..4e97467 --- /dev/null +++ b/python/tests/cuquantum_tests/cutensornet_tests/test_internal.py @@ -0,0 +1,87 @@ +import threading + +import cupy as cp +from cupy.cuda.runtime import getDevice, setDevice +import pytest + +from cuquantum.cutensornet._internal import utils + + +class TestDeviceCtx: + + @pytest.mark.skipif( + cp.cuda.runtime.getDeviceCount() < 2, reason='not enough GPUs') + def test_device_ctx(self): + assert getDevice() == 0 + with utils.device_ctx(0): + assert getDevice() == 0 + with utils.device_ctx(1): + assert getDevice() == 1 + with utils.device_ctx(0): + assert getDevice() == 0 + assert getDevice() == 1 + assert getDevice() == 0 + assert getDevice() == 0 + + with utils.device_ctx(1): + assert getDevice() == 1 + setDevice(0) + with utils.device_ctx(1): + assert getDevice() == 1 + assert getDevice() == 0 + assert getDevice() == 0 + + @pytest.mark.skipif( + cp.cuda.runtime.getDeviceCount() < 2, reason='not enough GPUs') + def test_thread_safe(self): + # adapted from https://github.com/cupy/cupy/blob/master/tests/cupy_tests/cuda_tests/test_device.py + # recall that the CUDA context is maintained per-thread, so when each thread + # starts it is on the default device (=device 0).
+ t0_setup = threading.Event() + t1_setup = threading.Event() + t0_first_exit = threading.Event() + + t0_exit_device = [] + t1_exit_device = [] + + def t0_seq(): + with utils.device_ctx(0): + with utils.device_ctx(1): + t0_setup.set() + t1_setup.wait() + t0_exit_device.append(getDevice()) + t0_exit_device.append(getDevice()) + t0_first_exit.set() + assert getDevice() == 0 + + def t1_seq(): + t0_setup.wait() + with utils.device_ctx(1): + with utils.device_ctx(0): + t1_setup.set() + t0_first_exit.wait() + t1_exit_device.append(getDevice()) + t1_exit_device.append(getDevice()) + assert getDevice() == 0 + + try: + cp.cuda.runtime.setDevice(1) + t0 = threading.Thread(target=t0_seq) + t1 = threading.Thread(target=t1_seq) + t1.start() + t0.start() + t0.join() + t1.join() + assert t0_exit_device == [1, 0] + assert t1_exit_device == [0, 1] + finally: + cp.cuda.runtime.setDevice(0) + + def test_one_shot(self): + dev = utils.device_ctx(0) + with dev: + pass + # CPython raises AttributeError, but we should not care here + with pytest.raises(Exception): + with dev: + pass diff --git a/python/tests/cuquantum_tests/cutensornet_tests/test_network.py b/python/tests/cuquantum_tests/cutensornet_tests/test_network.py index 74489cb..2b5f610 100644 --- a/python/tests/cuquantum_tests/cutensornet_tests/test_network.py +++ b/python/tests/cuquantum_tests/cutensornet_tests/test_network.py @@ -2,6 +2,7 @@ # # SPDX-License-Identifier: BSD-3-Clause +import functools import copy import re import sys @@ -15,12 +16,15 @@ from cuquantum.cutensornet._internal.utils import infer_object_package from .data import backend_names, dtype_names, einsum_expressions -from .testutils import atol_mapper, EinsumFactory, rtol_mapper -from .testutils import compute_and_normalize_numpy_path -from .testutils import set_path_to_optimizer_options +from .test_utils import atol_mapper, EinsumFactory, rtol_mapper +from .test_utils import check_intermediate_modes +from .test_utils import compute_and_normalize_numpy_path 
+from .test_utils import deselect_contract_tests +from .test_utils import set_path_to_optimizer_options # TODO: parametrize compute type? +@pytest.mark.uncollect_if(func=deselect_contract_tests) @pytest.mark.parametrize( "use_numpy_path", (False, True) ) @@ -49,9 +53,7 @@ def test_network( stream, use_numpy_path): einsum_expr = copy.deepcopy(einsum_expr_pack) if isinstance(einsum_expr, list): - einsum_expr, network_opts, optimizer_opts, overwrite_dtype = einsum_expr - if dtype != overwrite_dtype: - pytest.skip(f"skipping {dtype} is requested") + einsum_expr, network_opts, optimizer_opts, _ = einsum_expr else: network_opts = optimizer_opts = None assert isinstance(einsum_expr, (str, tuple)) @@ -60,21 +62,24 @@ def test_network( operands = factory.generate_operands( factory.input_shapes, xp, dtype, order) backend = sys.modules[infer_object_package(operands[0])] + data = factory.convert_by_format(operands) if stream: if backend is numpy: stream = cupy.cuda.Stream() # implementation detail else: stream = backend.cuda.Stream() - data = factory.convert_by_format(operands) tn = Network(*data, options=network_opts) # We already test tn as a context manager in the samples, so let's test # explicitly calling tn.free() here. 
try: if not use_numpy_path: - _, info = tn.contract_path(optimize=optimizer_opts) + path, info = tn.contract_path(optimize=optimizer_opts) uninit_f_str = re.compile("{.*}") assert uninit_f_str.search(str(info)) is None + check_intermediate_modes( + info.intermediate_modes, factory.input_modes, + factory.output_modes, path) else: try: path_ref = compute_and_normalize_numpy_path( @@ -87,22 +92,37 @@ def test_network( optimizer_opts = set_path_to_optimizer_options( optimizer_opts, path_ref) path, _ = tn.contract_path(optimizer_opts) - assert path == path_ref # round-trip test + # round-trip test + # note that within each pair it could have different order + assert all(map(lambda x, y: sorted(x) == sorted(y), path, path_ref)) if autotune: tn.autotune(iterations=autotune, stream=stream) - out = tn.contract(stream=stream) - if stream: - stream.synchronize() - backend_out = sys.modules[infer_object_package(out)] - assert backend_out is backend - assert out.dtype == operands[0].dtype - - out_ref = opt_einsum.contract( - *data, backend="torch" if "torch" in xp else xp) - assert backend.allclose( - out, out_ref, atol=atol_mapper[dtype], rtol=rtol_mapper[dtype]) + # check the result + self._verify_contract( + tn, operands, backend, data, xp, dtype, stream) - # TODO: test tn.reset_operands() + # generate new data and bind them to the TN + operands = factory.generate_operands( + factory.input_shapes, xp, dtype, order) + data = factory.convert_by_format(operands) + tn.reset_operands(*operands) + # check the result + self._verify_contract( + tn, operands, backend, data, xp, dtype, stream) finally: tn.free() + + def _verify_contract( + self, tn, operands, backend, data, xp, dtype, stream): + out = tn.contract(stream=stream) + if stream: + stream.synchronize() + backend_out = sys.modules[infer_object_package(out)] + assert backend_out is backend + assert out.dtype == operands[0].dtype + + out_ref = opt_einsum.contract( + *data, backend="torch" if "torch" in xp else xp) + assert 
backend.allclose( + out, out_ref, atol=atol_mapper[dtype], rtol=rtol_mapper[dtype]) diff --git a/python/tests/cuquantum_tests/cutensornet_tests/testutils.py b/python/tests/cuquantum_tests/cutensornet_tests/test_utils.py similarity index 69% rename from python/tests/cuquantum_tests/cutensornet_tests/testutils.py rename to python/tests/cuquantum_tests/cutensornet_tests/test_utils.py index 8496a5a..756dd3a 100644 --- a/python/tests/cuquantum_tests/cutensornet_tests/testutils.py +++ b/python/tests/cuquantum_tests/cutensornet_tests/test_utils.py @@ -2,6 +2,7 @@ # # SPDX-License-Identifier: BSD-3-Clause +import re import sys import cupy @@ -75,6 +76,71 @@ def compute_and_normalize_numpy_path(data, num_operands): return norm_path +def convert_linear_to_ssa(path): + n_inputs = len(path)+1 + remaining = [*range(n_inputs)] + ssa_path = [] + counter = n_inputs + + for first, second in path: + idx1 = remaining[first] + idx2 = remaining[second] + ssa_path.append((idx1, idx2)) + remaining.remove(idx1) + remaining.remove(idx2) + remaining.append(counter) + counter += 1 + + return ssa_path + + +def check_ellipsis(modes): + # find ellipsis, record the position, remove it, and modify the modes + if isinstance(modes, str): + ellipsis = modes.find("...") + if ellipsis >= 0: + modes = modes.replace("...", "") + else: + try: + ellipsis = modes.index(Ellipsis) + except ValueError: + ellipsis = -1 + if ellipsis >= 0: + modes = modes[:ellipsis] + modes[ellipsis+1:] + return ellipsis, modes + + +def check_intermediate_modes( + intermediate_modes, input_modes, output_modes, path): + + # remove ellipsis, if any, since it's singleton + input_modes = list(map( + lambda modes: (lambda modes: check_ellipsis(modes))(modes)[1], + input_modes + )) + _, output_modes = check_ellipsis(output_modes) + # peek at the very first element + if (isinstance(intermediate_modes[0], tuple) + and isinstance(intermediate_modes[0][0], str)): + # this is our internal mode label for ellipsis + custom_label = 
re.compile(r'\b__\d+__\b') + intermediate_modes = list(map( + lambda modes: list(filter(lambda mode: not custom_label.match(mode), modes)), + intermediate_modes + )) + + ssa_path = convert_linear_to_ssa(path) + contraction_list = input_modes + contraction_list += intermediate_modes + + for k, (i, j) in enumerate(ssa_path): + modesA = set(contraction_list[i]) + modesB = set(contraction_list[j]) + modesOut = set(intermediate_modes[k]) + assert modesOut.issubset(modesA.union(modesB)) + assert set(output_modes) == set(intermediate_modes[-1]) + + class EinsumFactory: """Take a valid einsum expression and compute shapes, modes, etc for testing.""" @@ -99,17 +165,7 @@ def _gen_shape(self, modes): shape = [] # find ellipsis, record the position, and remove it - if isinstance(modes, str): - ellipsis = modes.find("...") - if ellipsis >= 0: - modes = modes.replace("...", "") - else: - try: - ellipsis = modes.index(Ellipsis) - except ValueError: - ellipsis = -1 - if ellipsis >= 0: - modes = modes[:ellipsis] + modes[ellipsis+1:] + ellipsis, modes = check_ellipsis(modes) # generate extents for remaining modes for mode in modes: @@ -210,3 +266,21 @@ def convert_by_format(self, operands, *, dummy=False): data.append(tuple(self.output_modes)) return data + + +# We use the pytest marker hook to deselect/ignore collected tests +# that we do not want to run. This is better than showing a ton of +# tests as "skipped" at the end, since technically they never get +# tested. +# +# Note the arguments here must be named and ordered in exactly the +# same way as the tests being marked by @pytest.mark.uncollect_if(). 
+def deselect_contract_tests( + einsum_expr_pack, xp, dtype, *args, **kwargs): + if xp.startswith('torch') and torch is None: + return True + if isinstance(einsum_expr_pack, list): + _, _, _, overwrite_dtype = einsum_expr_pack + if dtype != overwrite_dtype: + return True + return False diff --git a/python/tests/cuquantum_tests/test_cuquantum.py b/python/tests/cuquantum_tests/test_cuquantum.py new file mode 100644 index 0000000..3214a9c --- /dev/null +++ b/python/tests/cuquantum_tests/test_cuquantum.py @@ -0,0 +1,55 @@ +# Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES +# +# SPDX-License-Identifier: BSD-3-Clause + +import os +import subprocess +import sys + +import pytest + + +# TODO: mark this test as slow and don't run it every time +class TestModuleUtils: + + @pytest.mark.parametrize( + 'includes', (True, False) + ) + @pytest.mark.parametrize( + 'libs', (True, False) + ) + @pytest.mark.parametrize( + 'target', (None, 'custatevec', 'cutensornet', True) + ) + def test_cuquantum(self, includes, libs, target): + # We need to launch a subprocess to have a clean ld state + cmd = [sys.executable, '-m', 'cuquantum'] + if includes: + cmd.append('--includes') + if libs: + cmd.append('--libs') + if target: + if target is True: + cmd.extend(('--target', 'custatevec')) + cmd.extend(('--target', 'cutensornet')) + else: + cmd.extend(('--target', target)) + + result = subprocess.run(cmd, capture_output=True, env=os.environ) + if result.returncode: + if includes is False and libs is False and target is None: + assert result.returncode == 1 + assert 'usage' in result.stdout.decode() + return + msg = f'Got error:\n' + msg += f'stdout: {result.stdout.decode()}\n' + msg += f'stderr: {result.stderr.decode()}\n' + assert False, msg + + out = result.stdout.decode().split() + if includes: + assert any([s.startswith('-I') for s in out]) + if libs: + assert any([s.startswith('-L') for s in out]) + if target: + assert any([s.startswith('-l') for s in out]) diff --git 
a/python/tests/run_python_tests.sh b/python/tests/run_python_tests.sh new file mode 100755 index 0000000..1ac9050 --- /dev/null +++ b/python/tests/run_python_tests.sh @@ -0,0 +1,116 @@ +#!/bin/bash +# +# Unified launch script for cuquantum-Python tests +# TODO: unify this scripts with others + +set -x + +# The (path to) the MPI launcher. +MPIEXEC=mpirun + +# Open MPI needs this to run inside the docker +export OMPI_ALLOW_RUN_AS_ROOT=1 +export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 + +if [ -z "${CUTENSORNET_BUILD_BINARY_DIR}" ]; then + echo "Error: CUTENSORNET_BUILD_BINARY_DIR is not set" + exit -1 +fi + +echo "Build directory: ${CUTENSORNET_BUILD_BINARY_DIR}" + +# The path to the cuTensorNet-MPI wrapper library. +export CUTENSORNET_COMM_LIB=${CUTENSORNET_BUILD_BINARY_DIR}/../distributed_interfaces/libcutensornet_distributed_interface_mpi.so + +# Show Open MPI info +ompi_info + +# The Python3 executable. +PYTHON3=python3 + +# The path to the Python 'samples' directory. +SAMPLES_DIR=../samples + +# The path to the directory from which pytest collects "cuquantum" tests. +PYTEST_CUQUANTUM_DIR=./cuquantum_tests + +# The path to the directory from which pytest collects "samples" tests. +PYTEST_SAMPLES_DIR=./samples_tests + +########################################################### +# Error function adopted from the Google Shell Style Guide. +########################################################### +error() { + echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $*" >&2 +} + +################################################################################ +# Function to run the specified MPI samples. In case of error, the script name +# and the error code are logged to stderr and a non-zero status corresponding +# to the number of test failures is returned. +# +# Globals: +# MPIEXEC +# PYTHON3 +# Arguments: +# The number of processes to use. +# The list of MPI samples (with path) to run. 
+################################################################################ +run_python_mpi_samples() { + nproc=$1 + mpi_samples="${@:2}" + status=0 + for mpi_sample in ${mpi_samples}; do + # WAR: our CI only has single GPU, so we cannot launch more than 1 NCCL rank + if [[ "${mpi_sample}" = *"nccl"* ]]; then + nproc=1 + fi + ${MPIEXEC} -np ${nproc} ${PYTHON3} ${mpi_sample} + test_status=$? + if [ ${test_status} -ne 0 ]; then + error "Test \"${mpi_sample}\" exited with status ${test_status}." + fi + status=$((status+${test_status})) + done + return ${status} +} + +STATUS=0 + +################################################################################ +# Tests using pytest. +################################################################################ + +${PYTHON3} -m pytest ${PYTEST_CUQUANTUM_DIR} +test_status=$? +if [ ${test_status} -ne 0 ]; then + error "pytest \"${PYTEST_CUQUANTUM_DIR}\" exited with status ${test_status}." +fi +STATUS=$((STATUS+${test_status})) + +${PYTHON3} -m pytest -n 2 ${PYTEST_SAMPLES_DIR} +test_status=$? +if [ ${test_status} -ne 0 ]; then + error "pytest \"${PYTEST_SAMPLES_DIR}\" exited with status ${test_status}." +fi +STATUS=$((STATUS+${test_status})) + +################################################################################ +# Test MPI samples. +################################################################################ + +# The path to the cuTensorNet-MPI wrapper library. +export CUTENSORNET_COMM_LIB=${CUTENSORNET_BUILD_BINARY_DIR}/../distributed_interfaces/libcutensornet_distributed_interface_mpi.so + +# Find all the MPI sample programs. +mpi_samples=$(find ${SAMPLES_DIR} -name "*_mpi*.py") + +run_python_mpi_samples 2 ${mpi_samples} +test_status=$? +if [ ${test_status} -ne 0 ]; then + error "The MPI samples tests in \"${SAMPLES_DIR}\" exited with status ${test_status}." 
+fi +STATUS=$((STATUS+${test_status})) + +unset CUTENSORNET_COMM_LIB +exit ${STATUS} diff --git a/samples/custatevec/CMakeLists.txt b/samples/custatevec/CMakeLists.txt index e234579..3b59879 100644 --- a/samples/custatevec/CMakeLists.txt +++ b/samples/custatevec/CMakeLists.txt @@ -69,7 +69,8 @@ set(CMAKE_CUDA_EXTENSIONS OFF) set(CMAKE_CUDA_FLAGS_ARCH_SM70 "-gencode arch=compute_70,code=sm_70") set(CMAKE_CUDA_FLAGS_ARCH_SM75 "-gencode arch=compute_75,code=sm_75") set(CMAKE_CUDA_FLAGS_ARCH_SM80 "-gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80") -set(CMAKE_CUDA_FLAGS_ARCH "${CMAKE_CUDA_FLAGS_ARCH_SM70} ${CMAKE_CUDA_FLAGS_ARCH_SM75} ${CMAKE_CUDA_FLAGS_ARCH_SM80}") +set(CMAKE_CUDA_FLAGS_ARCH_SM90 "-gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90") +set(CMAKE_CUDA_FLAGS_ARCH "${CMAKE_CUDA_FLAGS_ARCH_SM70} ${CMAKE_CUDA_FLAGS_ARCH_SM75} ${CMAKE_CUDA_FLAGS_ARCH_SM80} ${CMAKE_CUDA_FLAGS_ARCH_SM90}") set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} ${CMAKE_CUDA_FLAGS_ARCH}") # ########################################## @@ -104,7 +105,7 @@ function(add_custatevec_example GROUP_TARGET EXAMPLE_NAME EXAMPLE_SOURCES) ${EXAMPLE_TARGET} PROPERTIES CUDA_ARCHITECTURES - "70;75;80" + "70;75;80;90" ) # Install example install( diff --git a/samples/custatevec/Makefile b/samples/custatevec/Makefile index 57ad27a..a96ea9a 100644 --- a/samples/custatevec/Makefile +++ b/samples/custatevec/Makefile @@ -13,7 +13,8 @@ LINKER_FLAGS := -lcudart -lcustatevec ARCH_FLAGS_SM70 = -gencode arch=compute_70,code=sm_70 ARCH_FLAGS_SM75 = -gencode arch=compute_75,code=sm_75 ARCH_FLAGS_SM80 = -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 -ARCH_FLAGS = $(ARCH_FLAGS_SM70) $(ARCH_FLAGS_SM75) $(ARCH_FLAGS_SM80) +ARCH_FLAGS_SM90 = -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 +ARCH_FLAGS = $(ARCH_FLAGS_SM70) $(ARCH_FLAGS_SM75) $(ARCH_FLAGS_SM80) $(ARCH_FLAGS_SM90) CXX_FLAGS = -std=c++11 $(INCLUDE_DIRS) 
$(LIBRARY_DIRS) $(ARCH_FLAGS) $(LINKER_FLAGS) diff --git a/samples/custatevec/README.md b/samples/custatevec/README.md index 6f5ec4b..20f2436 100644 --- a/samples/custatevec/README.md +++ b/samples/custatevec/README.md @@ -23,12 +23,12 @@ make -j8 # Support -* **Supported SM Architectures:** SM 7.0, SM 7.5, SM 8.0, SM 8.6 +* **Supported GPU Architectures:** any NVIDIA GPU with compute capability 7.0 or later * **Supported OSes:** Linux * **Supported CPU Architectures**: x86_64, arm64, ppc64le * **Language**: `C++11` # Prerequisites -* [CUDA 11.4 toolkit](https://developer.nvidia.com/cuda-downloads) (or above) and compatible driver (see [CUDA Driver Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions)). +* [CUDA 11.8 toolkit](https://developer.nvidia.com/cuda-downloads) (or above) and compatible driver (see [CUDA Driver Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions)). 
* [CMake 3.13](https://cmake.org/download/) or above diff --git a/samples/cutensornet/CMakeLists.txt b/samples/cutensornet/CMakeLists.txt index 0014f62..b372298 100644 --- a/samples/cutensornet/CMakeLists.txt +++ b/samples/cutensornet/CMakeLists.txt @@ -100,6 +100,7 @@ function(add_cutensornet_example GROUP_TARGET EXAMPLE_NAME EXAMPLE_SOURCES) cutensornet cutensor cudart + cusolver cublasLt $<$:MPI::MPI_CXX> ) @@ -128,10 +129,15 @@ endfunction() add_custom_target(cutensornet_examples) add_cutensornet_example(cutensornet_examples "cuTENSORNet.example.tensornet" tensornet_example.cu) +add_cutensornet_example(cutensornet_examples "cuTENSORNet.example.tensornet.svd" approxTN/tensor_svd_example.cu) +add_cutensornet_example(cutensornet_examples "cuTENSORNet.example.tensornet.qr" approxTN/tensor_qr_example.cu) +add_cutensornet_example(cutensornet_examples "cuTENSORNet.example.tensornet.gate" approxTN/gate_split_example.cu) +add_cutensornet_example(cutensornet_examples "cuTENSORNet.example.tensornet.mps" approxTN/mps_example.cu) find_package(MPI) if (MPI_FOUND) add_cutensornet_example(cutensornet_examples "cuTENSORNet.example.tensornet.mpi" tensornet_example_mpi.cu) + add_cutensornet_example(cutensornet_examples "cuTENSORNet.example.tensornet.mpi.auto" tensornet_example_mpi_auto.cu) else () - message(WARNING "An MPI installation was not detected. Please install MPI if you would like to build the distributed example(s).") + message(WARNING "An MPI installation was not detected. 
Please install CUDA-aware MPI if you would like to build the distributed example(s).") endif () diff --git a/samples/cutensornet/Makefile b/samples/cutensornet/Makefile index 8bc4189..3ec739a 100644 --- a/samples/cutensornet/Makefile +++ b/samples/cutensornet/Makefile @@ -10,19 +10,26 @@ MPI_ROOT := ${MPI_ROOT} INCLUDE_DIRS := -I${CUTENSORNET_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_ROOT}/include LIBRARY_DIRS := -L${CUTENSORNET_ROOT}/lib -L${CUTENSORNET_ROOT}/lib64 -L${CUTENSOR_ROOT}/lib/11 -LINKER_FLAGS := -lcutensornet -lcutensor -lcudart +LINKER_FLAGS := -lcutensornet -lcutensor -lcudart -lcusolver ARCH_FLAGS_SM70 = -gencode arch=compute_70,code=sm_70 ARCH_FLAGS_SM75 = -gencode arch=compute_75,code=sm_75 ARCH_FLAGS_SM80 = -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 -ARCH_FLAGS = $(ARCH_FLAGS_SM70) $(ARCH_FLAGS_SM75) $(ARCH_FLAGS_SM80) +ARCH_FLAGS_SM86 = -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 +ARCH_FLAGS_SM90 = -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 +ARCH_FLAGS = $(ARCH_FLAGS_SM70) $(ARCH_FLAGS_SM75) $(ARCH_FLAGS_SM80) $(ARCH_FLAGS_SM86) $(ARCH_FLAGS_SM90) CXX_FLAGS = -std=c++11 $(INCLUDE_DIRS) $(LIBRARY_DIRS) $(LINKER_FLAGS) $(ARCH_FLAGS) all: check-env ${CUDA_PATH}/bin/nvcc tensornet_example.cu -o tensornet_example ${CXX_FLAGS} + ${CUDA_PATH}/bin/nvcc approxTN/tensor_svd_example.cu -o tensor_svd_example ${CXX_FLAGS} + ${CUDA_PATH}/bin/nvcc approxTN/tensor_qr_example.cu -o tensor_qr_example ${CXX_FLAGS} + ${CUDA_PATH}/bin/nvcc approxTN/gate_split_example.cu -o gate_split_example ${CXX_FLAGS} + ${CUDA_PATH}/bin/nvcc approxTN/mps_example.cu -o mps_example ${CXX_FLAGS} ifdef MPI_ROOT - ${CUDA_PATH}/bin/nvcc tensornet_example_mpi.cu -Xlinker -rpath,${MPI_ROOT}/lib -L${MPI_ROOT}/lib -o tensornet_example_mpi ${CXX_FLAGS} -lmpi + ${CUDA_PATH}/bin/nvcc tensornet_example_mpi.cu -Xlinker -rpath,${MPI_ROOT}/lib -L${MPI_ROOT}/lib -o tensornet_example_mpi 
${CXX_FLAGS} -lmpi + ${CUDA_PATH}/bin/nvcc tensornet_example_mpi_auto.cu -Xlinker -rpath,${MPI_ROOT}/lib -L${MPI_ROOT}/lib -o tensornet_example_mpi_auto ${CXX_FLAGS} -lmpi endif check-env: @@ -52,4 +59,6 @@ check-env: fi clean: - rm -f tensornet_example tensornet_example.o tensornet_example_mpi tensornet_example_mpi.o + rm -f tensornet_example tensornet_example.o tensornet_example_mpi tensornet_example_mpi.o tensornet_example_mpi_auto tensornet_example_mpi_auto.o + rm -f tensor_qr_example tensor_qr_example.o tensor_svd_example tensor_svd_example.o + rm -f gate_split_example gate_split_example.o mps_example mps_example.o diff --git a/samples/cutensornet/README.md b/samples/cutensornet/README.md index 5e59b7f..ff40e0a 100644 --- a/samples/cutensornet/README.md +++ b/samples/cutensornet/README.md @@ -30,15 +30,29 @@ To execute the serial sample in a command shell, simply use: ``` ./tensornet_example ``` -To execute the parallel MPI sample, run: +To execute the parallel MPI sample with automatic MPI parallelization, run: +``` +mpiexec -n N ./tensornet_example_mpi_auto +``` +where `N` is the desired number of processes. You will need to define +the environment variable `CUTENSORNET_COMM_LIB` as described in the Getting Started +section of the cuTensorNet library documentation (Installation and Compilation). + +To execute the parallel MPI sample with explicit MPI parallelization, run: ``` mpiexec -n N ./tensornet_example_mpi ``` where `N` is the desired number of processes. In this example, `N` can be larger than the number of GPUs in your system. +The tensor SVD sample can be executed in a command shell using: +``` +./tensor_svd_example +``` +The samples for tensor QR, gate split, and MPS can be executed in the same fashion.
+ ## Support -* **Supported SM Architectures:** SM 7.0, SM 7.5, SM 8.0, SM 8.6 +* **Supported SM Architectures:** SM 7.0, SM 7.5, SM 8.0, SM 8.6, SM 9.0 * **Supported OSes:** Linux * **Supported CPU Architectures**: x86_64, aarch64-sbsa, ppc64le * **Language**: C++11 or above @@ -64,9 +78,27 @@ This sample consists of: * Performing the computation of the contraction using `cutensornetContractSlices` for a group of slices (in this case, all of the slices) created (destroyed) using the `cutensornetCreateSliceGroupFromIDRange` (`cutensornetDestroySliceGroup`) API. * Freeing the cuTensorNet resources. -### 2. Parallel execution (`tensornet_example_mpi.cu`) +### 2. Parallel execution (`tensornet_example_mpi_auto.cu`) -The parallel MPI sample illustrates advanced usage of cuTensorNet. Specifically, it demonstrates how to find a contraction path in parallel and how to exploit slice-based parallelism by contracting a subset of slices on each process. +This parallel MPI sample enables automatic distributed parallelization across multiple GPUs. +Specifically, it demonstrates how to activate automatic distributed parallelization inside +the cuTensorNet library so that it finds a contraction path and then contracts +the tensor network in parallel using exactly the same source code as in a serial (single-GPU) run. +Running this sample currently requires a CUDA-aware MPI library implementation. Please refer +to the Getting Started section of the cuTensorNet library documentation for full details. + +This sample consists of: +* A basic skeleton setting up a simple MPI+CUDA computation using a one-GPU-per-MPI-process model. +* An activation call that enables automatic distributed parallelization inside the cuTensorNet library. +* Parallel execution of the tensor network contraction path finder (`cutensornetContractionOptimize`). +* Parallel execution of the tensor network contraction (`cutensornetContractSlices`). + +### 3.
Parallel execution via explicit MPI calls (`tensornet_example_mpi.cu`) + +This parallel MPI sample illustrates advanced usage of cuTensorNet. Specifically, it demonstrates +how to find a contraction path in parallel and how to exploit slice-based parallelism by contracting +a subset of slices on each process using manual MPI instrumentation. Note that the previous parallel +sample does all of this for you automatically, without any changes to the original (serial) source code. This sample consists of: * A basic skeleton setting up a simple MPI+CUDA computation using a one GPU per process model. @@ -74,3 +106,60 @@ This sample consists of: * Finding an optimal path with `cutensornetContractionOptimize` in parallel, and using global reduction (`MPI_MINLOC`) to find the best path and the owning process's identity. Note that the contraction optimizer on each process sets a different random seed, so each process typically computes a different optimal path for sufficiently large tensor networks. * Broadcasting the winner's `optimizerInfo` object by serializing it using the `cutensornetContractionOptimizerInfoGetPackedSize` and `cutensornetContractionOptimizerInfoPackData` APIs, and deserializing it into an existing `optimizerInfo` object using the `cutensornetUpdateContractionOptimizerInfoFromPackedData` API. * Computing the subset of slice IDs (in a relatively load-balanced fashion) for which each process is responsible, contracting them, and performing a global reduction (sum) to get the final result on the root process. + +### 4. Tensor QR (`approxTN/tensor_qr_example.cu`) + +This sample demonstrates how to use cuTensorNet to perform a tensor QR operation. + +This sample consists of: +* Defining input and output tensors using `cutensornetCreateTensorDescriptor`. +* Querying the required workspace for the computation using `cutensornetWorkspaceComputeQRSizes`. +* Performing the tensor QR computation using `cutensornetTensorQR`. +* Freeing the cuTensorNet resources.
+ +### 5. Tensor SVD (`approxTN/tensor_svd_example.cu`) + +This sample demonstrates how to use cuTensorNet to perform a tensor SVD operation. + +This sample consists of: +* Defining input and output tensors using `cutensornetCreateTensorDescriptor`. Fixed-extent truncation can be specified directly by modifying the corresponding extent in the output tensor descriptor. +* Setting up the SVD truncation options by calling `cutensornetTensorSVDConfigSetAttribute` on the `svdConfig` object created by `cutensornetCreateTensorSVDConfig`. +* Optionally, calling `cutensornetCreateTensorSVDInfo` and `cutensornetTensorSVDInfoGetAttribute` to store and retrieve runtime SVD truncation information. +* Querying the required workspace for the computation using `cutensornetWorkspaceComputeSVDSizes`. +* Performing the tensor SVD computation using `cutensornetTensorSVD`. +* Freeing the cuTensorNet resources. + +### 6. Gate Split (`approxTN/gate_split_example.cu`) + +This sample demonstrates how to use cuTensorNet to perform a single gate split operation. + +This sample consists of: +* Defining input and output tensors using `cutensornetCreateTensorDescriptor`. Fixed-extent truncation can be specified directly by modifying the corresponding extent in the output tensor descriptor. +* Setting up the SVD truncation options by calling `cutensornetTensorSVDConfigSetAttribute` on the `svdConfig` object created by `cutensornetCreateTensorSVDConfig`. +* Optionally, calling `cutensornetCreateTensorSVDInfo` and `cutensornetTensorSVDInfoGetAttribute` to store and retrieve runtime SVD truncation information. +* Querying the required workspace for the computation using `cutensornetWorkspaceComputeGateSplitSizes`. The gate split algorithm is specified in `cutensornetGateSplitAlgo_t`. +* Performing the gate split computation using `cutensornetTensorGateSplit`. +* Freeing the cuTensorNet resources. + +### 7.
MPS (`approxTN/mps_example.cu`) + +This sample demonstrates how to integrate cuTensorNet into a matrix product state (MPS) simulator. + +This sample is based on an `MPSHelper` class that systematically manages the MPS metadata and cuTensorNet library objects. +The following functionality is encapsulated in this class: +* Dynamically updating the `cutensornetTensorDescriptor_t` for all MPS tensors by calling `cutensornetCreateTensorDescriptor` and `cutensornetDestroyTensorDescriptor`. +* Querying the maximal data size needed for each MPS tensor. +* Setting up the SVD truncation options by calling `cutensornetTensorSVDConfigSetAttribute` on the `svdConfig` object created by `cutensornetCreateTensorSVDConfig`. +* Querying the required workspace size for all gate split operations by calling `cutensornetWorkspaceComputeGateSplitSizes` on the largest problem. +* Optionally, calling `cutensornetCreateTensorSVDInfo` and `cutensornetTensorSVDInfoGetAttribute` to store and retrieve runtime SVD truncation information. +* Performing gate split operations for all gates using `cutensornetTensorGateSplit`. +* Freeing the cuTensorNet resources. +* Finding an optimal contraction path with `cutensornetContractionOptimize` in parallel, + and using global reduction (`MPI_MINLOC`) to find the best path and the owning process's identity. + Note that the contraction optimizer on each process sets a different random seed, so each process + typically computes a different optimal path for sufficiently large tensor networks. +* Broadcasting the winner's `optimizerInfo` object by serializing it using the `cutensornetContractionOptimizerInfoGetPackedSize` + and `cutensornetContractionOptimizerInfoPackData` APIs, and deserializing it into an existing `optimizerInfo` + object using the `cutensornetUpdateContractionOptimizerInfoFromPackedData` API function.
+* Computing the subset of slice IDs (in a relatively load-balanced fashion) for which each process is responsible, + contracting them, and performing a global reduction (sum) to get the final result on the root process. diff --git a/samples/cutensornet/approxTN/gate_split_example.cu b/samples/cutensornet/approxTN/gate_split_example.cu new file mode 100644 index 0000000..6658007 --- /dev/null +++ b/samples/cutensornet/approxTN/gate_split_example.cu @@ -0,0 +1,380 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +// Sphinx: #1 +#include +#include + +#include +#include +#include + +#include +#include + +#define HANDLE_ERROR(x) \ +{ const auto err = x; \ +if( err != CUTENSORNET_STATUS_SUCCESS ) \ +{ printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); return err; } \ +}; + +#define HANDLE_CUDA_ERROR(x) \ +{ const auto err = x; \ + if( err != cudaSuccess ) \ + { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); return err; } \ +}; + +struct GPUTimer +{ + GPUTimer(cudaStream_t stream): stream_(stream) + { + cudaEventCreate(&start_); + cudaEventCreate(&stop_); + } + + ~GPUTimer() + { + cudaEventDestroy(start_); + cudaEventDestroy(stop_); + } + + void start() + { + cudaEventRecord(start_, stream_); + } + + float seconds() + { + cudaEventRecord(stop_, stream_); + cudaEventSynchronize(stop_); + float time; + cudaEventElapsedTime(&time, start_, stop_); + return time * 1e-3; + } + + private: + cudaEvent_t start_, stop_; + cudaStream_t stream_; +}; + +int main() +{ + const size_t cuTensornetVersion = cutensornetGetVersion(); + printf("cuTensorNet-vers:%ld\n",cuTensornetVersion); + + cudaDeviceProp prop; + int deviceId{-1}; + HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) ); + HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) ); + + printf("===== device info ======\n"); + printf("GPU-name:%s\n", prop.name); + printf("GPU-clock:%d\n", prop.clockRate); + 
printf("GPU-memoryClock:%d\n", prop.memoryClockRate); + printf("GPU-nSM:%d\n", prop.multiProcessorCount); + printf("GPU-major:%d\n", prop.major); + printf("GPU-minor:%d\n", prop.minor); + printf("========================\n"); + + // Sphinx: #2 + /************************************************************************************ + * Gate Split: A_{i,j,k,l} B_{k,o,p,q} G_{m,n,l,o}-> A'_{i,j,x,m} S_{x} B'_{x,n,p,q} + *************************************************************************************/ + typedef float floatType; + cudaDataType_t typeData = CUDA_R_32F; + cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F; + + // Create vector of modes + std::vector modesAIn{'i','j','k','l'}; + std::vector modesBIn{'k','o','p','q'}; + std::vector modesGIn{'m','n','l','o'}; // input, G is the gate operator + + std::vector modesAOut{'i','j','x','m'}; + std::vector modesBOut{'x','n','p','q'}; // SVD output + + // Extents + std::unordered_map extent; + extent['i'] = 16; + extent['j'] = 16; + extent['k'] = 16; + extent['l'] = 2; + extent['m'] = 2; + extent['n'] = 2; + extent['o'] = 2; + extent['p'] = 16; + extent['q'] = 16; + + const int64_t maxExtent = 16; //truncate to a maximal extent of 16 + extent['x'] = maxExtent; + + // Create a vector of extents for each tensor + std::vector extentAIn; + for (auto mode : modesAIn) + extentAIn.push_back(extent[mode]); + std::vector extentBIn; + for (auto mode : modesBIn) + extentBIn.push_back(extent[mode]); + std::vector extentGIn; + for (auto mode : modesGIn) + extentGIn.push_back(extent[mode]); + std::vector extentAOut; + for (auto mode : modesAOut) + extentAOut.push_back(extent[mode]); + std::vector extentBOut; + for (auto mode : modesBOut) + extentBOut.push_back(extent[mode]); + + // Sphinx: #3 + /*********************************** + * Allocating data on host and device + ************************************/ + + size_t elementsAIn = 1; + for (auto mode : modesAIn) + elementsAIn *= extent[mode]; + size_t elementsBIn 
= 1; + for (auto mode : modesBIn) + elementsBIn *= extent[mode]; + size_t elementsGIn = 1; + for (auto mode : modesGIn) + elementsGIn *= extent[mode]; + size_t elementsAOut = 1; + for (auto mode : modesAOut) + elementsAOut *= extent[mode]; + size_t elementsBOut = 1; + for (auto mode : modesBOut) + elementsBOut *= extent[mode]; + + size_t sizeAIn = sizeof(floatType) * elementsAIn; + size_t sizeBIn = sizeof(floatType) * elementsBIn; + size_t sizeGIn = sizeof(floatType) * elementsGIn; + size_t sizeAOut = sizeof(floatType) * elementsAOut; + size_t sizeBOut = sizeof(floatType) * elementsBOut; + size_t sizeS = sizeof(floatType) * extent['x']; + + printf("Total memory: %.2f GiB\n", (sizeAIn + sizeBIn + sizeGIn + sizeAOut + sizeBOut + sizeS)/1024./1024./1024); + + void* D_AIn; + void* D_BIn; + void* D_GIn; + void* D_AOut; + void* D_BOut; + void* D_S; + + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_AIn, sizeAIn) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_BIn, sizeBIn) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_GIn, sizeGIn) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_AOut, sizeAOut) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_BOut, sizeBOut) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_S, sizeS) ); + + floatType *AIn = (floatType*) malloc(sizeAIn); + floatType *BIn = (floatType*) malloc(sizeBIn); + floatType *GIn = (floatType*) malloc(sizeGIn); + + if (AIn == NULL || BIn == NULL || GIn == NULL) + { + printf("Error: Host allocation of tensor data.\n"); + return -1; + } + + /********************** + * Initialize input data + ***********************/ + for (uint64_t i = 0; i < elementsAIn; i++) + AIn[i] = ((floatType) rand())/RAND_MAX; + for (uint64_t i = 0; i < elementsBIn; i++) + BIn[i] = ((floatType) rand())/RAND_MAX; + for (uint64_t i = 0; i < elementsGIn; i++) + GIn[i] = ((floatType) rand())/RAND_MAX; + + HANDLE_CUDA_ERROR( cudaMemcpy(D_AIn, AIn, sizeAIn, cudaMemcpyHostToDevice) ); + HANDLE_CUDA_ERROR( cudaMemcpy(D_BIn, BIn, sizeBIn, cudaMemcpyHostToDevice) 
); + HANDLE_CUDA_ERROR( cudaMemcpy(D_GIn, GIn, sizeGIn, cudaMemcpyHostToDevice) ); + + printf("Allocate memory for data, and initialize data.\n"); + + // Sphinx: #4 + /****************** + * cuTensorNet + *******************/ + + cudaStream_t stream; + HANDLE_CUDA_ERROR( cudaStreamCreate(&stream) ); + + cutensornetHandle_t handle; + HANDLE_ERROR( cutensornetCreate(&handle) ); + + /************************** + * Create tensor descriptors + ***************************/ + + cutensornetTensorDescriptor_t descTensorAIn; + cutensornetTensorDescriptor_t descTensorBIn; + cutensornetTensorDescriptor_t descTensorGIn; + cutensornetTensorDescriptor_t descTensorAOut; + cutensornetTensorDescriptor_t descTensorBOut; + + const int32_t numModesAIn = modesAIn.size(); + const int32_t numModesBIn = modesBIn.size(); + const int32_t numModesGIn = modesGIn.size(); + const int32_t numModesAOut = modesAOut.size(); + const int32_t numModesBOut = modesBOut.size(); + + const int64_t* strides = NULL; // assuming fortran layout for all tensors + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesAIn, extentAIn.data(), strides, modesAIn.data(), typeData, &descTensorAIn) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesBIn, extentBIn.data(), strides, modesBIn.data(), typeData, &descTensorBIn) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesGIn, extentGIn.data(), strides, modesGIn.data(), typeData, &descTensorGIn) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesAOut, extentAOut.data(), strides, modesAOut.data(), typeData, &descTensorAOut) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesBOut, extentBOut.data(), strides, modesBOut.data(), typeData, &descTensorBOut) ); + + printf("Initialize the cuTensorNet library and create tensor descriptors.\n"); + + // Sphinx: #5 + /************************************************** + * Setup gate split truncation options and algorithm + 
***************************************************/ + + cutensornetTensorSVDConfig_t svdConfig; + HANDLE_ERROR( cutensornetCreateTensorSVDConfig(handle, &svdConfig) ); + double absCutoff = 1e-2; + HANDLE_ERROR( cutensornetTensorSVDConfigSetAttribute(handle, + svdConfig, + CUTENSORNET_TENSOR_SVD_CONFIG_ABS_CUTOFF, + &absCutoff, + sizeof(absCutoff)) ); + double relCutoff = 1e-2; + HANDLE_ERROR( cutensornetTensorSVDConfigSetAttribute(handle, + svdConfig, + CUTENSORNET_TENSOR_SVD_CONFIG_REL_CUTOFF, + &relCutoff, + sizeof(relCutoff)) ); + + cutensornetGateSplitAlgo_t gateAlgo = CUTENSORNET_GATE_SPLIT_ALGO_REDUCED; + /******************************************************** + * Create SVDInfo to record runtime SVD truncation details + *********************************************************/ + + cutensornetTensorSVDInfo_t svdInfo; + HANDLE_ERROR( cutensornetCreateTensorSVDInfo(handle, &svdInfo)) ; + + // Sphinx: #6 + /************************************** + * Query and allocate required workspace + ***************************************/ + + cutensornetWorkspaceDescriptor_t workDesc; + HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) ); + + HANDLE_ERROR( cutensornetWorkspaceComputeGateSplitSizes(handle, + descTensorAIn, descTensorBIn, descTensorGIn, + descTensorAOut, descTensorBOut, + gateAlgo, + svdConfig, typeCompute, + workDesc) ); + uint64_t requiredWorkspaceSize = 0; + HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, + workDesc, + CUTENSORNET_WORKSIZE_PREF_MIN, + CUTENSORNET_MEMSPACE_DEVICE, + &requiredWorkspaceSize) ); + void *work = nullptr; + HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) ); + + HANDLE_ERROR( cutensornetWorkspaceSet(handle, + workDesc, + CUTENSORNET_MEMSPACE_DEVICE, + work, + requiredWorkspaceSize) ); + + printf("Allocate workspace.\n"); + + // Sphinx: #7 + /********************** + * Execution + **********************/ + + GPUTimer timer{stream}; + double minTimeCUTENSOR = 1e100; + const int numRuns = 3; // 
to get stable perf results + for (int i=0; i < numRuns; ++i) + { + // restore output + cudaMemsetAsync(D_AOut, 0, sizeAOut, stream); + cudaMemsetAsync(D_S, 0, sizeS, stream); + cudaMemsetAsync(D_BOut, 0, sizeBOut, stream); + + // With value-based truncation, `cutensornetGateSplit` can potentially update the shared extent in descTensorA/BOut. + // We here restore descTensorA/BOut to the original problem. + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorAOut) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorBOut) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesAOut, extentAOut.data(), strides, modesAOut.data(), typeData, &descTensorAOut) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesBOut, extentBOut.data(), strides, modesBOut.data(), typeData, &descTensorBOut) ); + + cudaDeviceSynchronize(); + timer.start(); + HANDLE_ERROR( cutensornetGateSplit(handle, + descTensorAIn, D_AIn, + descTensorBIn, D_BIn, + descTensorGIn, D_GIn, + descTensorAOut, D_AOut, + D_S, + descTensorBOut, D_BOut, + gateAlgo, + svdConfig, typeCompute, svdInfo, + workDesc, stream) ); + // Synchronize and measure timing + auto time = timer.seconds(); + minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time; + } + + printf("Performing Gate Split\n"); + + // Sphinx: #8 + /************************************* + * Query runtime truncation information + **************************************/ + + double discardedWeight{0}; + int64_t reducedExtent{0}; + cudaDeviceSynchronize(); // device synchronization. 
+ HANDLE_ERROR( cutensornetTensorSVDInfoGetAttribute( handle, svdInfo, CUTENSORNET_TENSOR_SVD_INFO_DISCARDED_WEIGHT, &discardedWeight, sizeof(discardedWeight)) ); + HANDLE_ERROR( cutensornetTensorSVDInfoGetAttribute( handle, svdInfo, CUTENSORNET_TENSOR_SVD_INFO_REDUCED_EXTENT, &reducedExtent, sizeof(reducedExtent)) ); + + printf("elapsed time: %.2f ms\n", minTimeCUTENSOR * 1000.f); + printf("reduced extent found at runtime: %lld\n", (long long) reducedExtent); // reducedExtent is a signed int64_t + printf("discarded weight: %.6f\n", discardedWeight); + + // Sphinx: #9 + /*************** + * Free resources + ****************/ + + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorAIn) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorBIn) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorGIn) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorAOut) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorBOut) ); + HANDLE_ERROR( cutensornetDestroyTensorSVDConfig(svdConfig) ); + HANDLE_ERROR( cutensornetDestroyTensorSVDInfo(svdInfo) ); + HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) ); + HANDLE_ERROR( cutensornetDestroy(handle) ); + + if (AIn) free(AIn); + if (BIn) free(BIn); + if (GIn) free(GIn); + if (D_AIn) cudaFree(D_AIn); + if (D_BIn) cudaFree(D_BIn); + if (D_GIn) cudaFree(D_GIn); + if (D_AOut) cudaFree(D_AOut); + if (D_BOut) cudaFree(D_BOut); + if (D_S) cudaFree(D_S); + if (work) cudaFree(work); + + printf("Free resource and exit.\n"); + + return 0; +} diff --git a/samples/cutensornet/approxTN/mps_example.cu b/samples/cutensornet/approxTN/mps_example.cu new file mode 100644 index 0000000..02c1e74 --- /dev/null +++ b/samples/cutensornet/approxTN/mps_example.cu @@ -0,0 +1,690 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. 
+ * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include + +/**************************************************************** + * Basic Matrix Product State (MPS) Algorithm + * + * Input: + * 1. A-J are MPS tensors + * 2. XXXXX are rank-4 gate tensors: + * + * A---B---C---D---E---F---G---H---I---J MPS tensors + * | | | | | | | | | | + * XXXXX XXXXX XXXXX XXXXX XXXXX gate cycle 0 + * | | | | | | | | | | + * | XXXXX XXXXX XXXXX XXXXX | gate cycle 1 + * | | | | | | | | | | + * XXXXX XXXXX XXXXX XXXXX XXXXX gate cycle 2 + * | | | | | | | | | | + * | XXXXX XXXXX XXXXX XXXXX | gate cycle 3 + * | | | | | | | | | | + * XXXXX XXXXX XXXXX XXXXX XXXXX gate cycle 4 + * | | | | | | | | | | + * | XXXXX XXXXX XXXXX XXXXX | gate cycle 5 + * | | | | | | | | | | + * XXXXX XXXXX XXXXX XXXXX XXXXX gate cycle 6 + * | | | | | | | | | | + * | XXXXX XXXXX XXXXX XXXXX | gate cycle 7 + * | | | | | | | | | | + * + * + * Output: + * 1. maximal virtual extent of the bonds (===) is `maxVirtualExtent` (set by user). 
+ * + * A===B===C===D===E===F===G===H===I===J MPS tensors + * | | | | | | | | | | + * + * + * Algorithm: + * Iterative over the gate cycles, within each cycle, perform gate split operation below for all relevant tensors + * ---A---B---- + * | | GateSplit ---A===B--- + * XXXXX -------> | | + * | | +******************************************************************/ + +// Sphinx: #1 +#define HANDLE_ERROR(x) \ +{ const auto err = x; \ +if( err != CUTENSORNET_STATUS_SUCCESS ) \ +{ std::cout << "Error: " << cutensornetGetErrorString(err) << " in line " << __LINE__ << std::endl; return err;} \ +}; + +#define HANDLE_CUDA_ERROR(x) \ +{ const auto err = x; \ + if( err != cudaSuccess ) \ + { std::cout << "Error: " << cudaGetErrorString(err) << " in line " << __LINE__ << std::endl; return err; } \ +}; + +// Sphinx: #2 +class MPSHelper +{ + public: + /** + * \brief Construct an MPSHelper object for gate splitting algorithm. + * i j k + * -------A-------B------- i j k + * p| |q -------> -------A`-------B`------- + * GGGGGGGGG r| |s + * r| |s + * \param[in] numSites The number of sites in the MPS + * \param[in] physExtent The extent for the physical mode where the gate tensors are acted on. + * \param[in] maxVirtualExtent The maximal extent allowed for the virtual mode shared between adjacent MPS tensors. + * \param[in] initialVirtualExtents A vector of size \p numSites-1 where the ith element denotes the extent of the shared mode for site i and site i+1 in the beginning of the simulation. + * \param[in] typeData The data type for all tensors and gates + * \param[in] typeCompute The compute type for all gate splitting process + */ + MPSHelper(int32_t numSites, + int64_t physExtent, + int64_t maxVirtualExtent, + const std::vector& initialVirtualExtents, + cudaDataType_t typeData, + cutensornetComputeType_t typeCompute); + + /** + * \brief Initialize the MPS metadata and cutensornet library. 
+ */ + cutensornetStatus_t initialize(); + + /** + * \brief Compute the maximal number of elements for each site. + */ + std::vector getMaxTensorElements() const; + + /** + * \brief Update the SVD truncation setting. + * \param[in] absCutoff The cutoff value for absolute singular value truncation. + * \param[in] relCutoff The cutoff value for relative singular value truncation. + * \param[in] renorm The option for renormalization of the truncated singular values. + * \param[in] partition The option for partitioning of the singular values. + */ + cutensornetStatus_t setSVDConfig(double absCutoff, + double relCutoff, + cutensornetTensorSVDNormalization_t renorm, + cutensornetTensorSVDPartition_t partition); + + /** + * \brief Update the algorithm to use for the gating process. + * \param[in] gateAlgo The gate algorithm to use for MPS simulation. + */ + void setGateAlgorithm(cutensornetGateSplitAlgo_t gateAlgo) {gateAlgo_ = gateAlgo;} + + /** + * \brief Compute the maximal workspace needed for the MPS gating algorithm. + * \param[out] workspaceSize The required workspace size on the device. + */ + cutensornetStatus_t computeMaxWorkspaceSizes(uint64_t* workspaceSize); + + /** + * \brief Set the device workspace used by the MPS gating algorithm. + * \param[in] work Pointer to the allocated workspace. + * \param[in] workspaceSize The size of the allocated workspace on the device. + */ + cutensornetStatus_t setWorkspace(void* work, uint64_t workspaceSize); + + /** + * \brief In-place execution of the apply gate algorithm on \p siteA and \p siteB. + * \param[in] siteA The first site to which the gate is applied. + * \param[in] siteB The second site to which the gate is applied. Must be adjacent to \p siteA. + * \param[in,out] dataInA The data for the MPS tensor at \p siteA. The input will be overwritten with the output MPS tensor data. + * \param[in,out] dataInB The data for the MPS tensor at \p siteB. The input will be overwritten with the output MPS tensor data. 
+ * \param[in] dataInG The input data for the gate tensor. + * \param[in] verbose Whether to print out the runtime information regarding truncation. + * \param[in] stream The CUDA stream on which the computation is performed. + */ + cutensornetStatus_t applyGate(uint32_t siteA, + uint32_t siteB, + void* dataInA, + void* dataInB, + const void* dataInG, + bool verbose, + cudaStream_t stream); + + /** + * \brief Free all the tensor descriptors in mpsHelper. + */ + ~MPSHelper() + { + if (inited_) + { + for (auto& descTensor: descTensors_) + { + cutensornetDestroyTensorDescriptor(descTensor); + } + cutensornetDestroy(handle_); + cutensornetDestroyWorkspaceDescriptor(workDesc_); + } + if (svdConfig_ != nullptr) + { + cutensornetDestroyTensorSVDConfig(svdConfig_); + } + if (svdInfo_ != nullptr) + { + cutensornetDestroyTensorSVDInfo(svdInfo_); + } + } + + private: + int32_t numSites_; ///< Number of sites in the MPS + int64_t physExtent_; ///< Extent for the physical index + int64_t maxVirtualExtent_{0}; ///< The maximal extent allowed for the virtual dimension + cudaDataType_t typeData_; + cutensornetComputeType_t typeCompute_; + + bool inited_{false}; + std::vector physModes_; ///< A vector of length \p numSites_ storing the physical mode of each site. + std::vector virtualModes_; ///< A vector of length \p numSites_+1; For site i, virtualModes_[i] and virtualModes_[i+1] represents the left and right virtual mode. + std::vector extentsPerSite_; ///< A vector of length \p numSites_+1; For site i, extentsPerSite_[i] and extentsPerSite_[i+1] represents the left and right virtual extent. 
+ + cutensornetHandle_t handle_; + std::vector descTensors_; /// A vector of length \p numSites_ storing the cutensornetTensorDescriptor_t for each site + cutensornetWorkspaceDescriptor_t workDesc_{nullptr}; + cutensornetTensorSVDConfig_t svdConfig_{nullptr}; + cutensornetTensorSVDInfo_t svdInfo_{nullptr}; + cutensornetGateSplitAlgo_t gateAlgo_{CUTENSORNET_GATE_SPLIT_ALGO_DIRECT}; + int32_t nextMode_{0}; /// The next mode label to use for labelling site tensors and gates. +}; + +// Sphinx: #3 +MPSHelper::MPSHelper(int32_t numSites, + int64_t physExtent, + int64_t maxVirtualExtent, + const std::vector& initialVirtualExtents, + cudaDataType_t typeData, + cutensornetComputeType_t typeCompute) + : numSites_(numSites), + physExtent_(physExtent), + typeData_(typeData), + typeCompute_(typeCompute) +{ + // initialize vectors to store the modes and extents for physical and virtual bond + for (int32_t i=0; i MPSHelper::getMaxTensorElements() const +{ + // compute the maximal tensor sizes for all sites during MPS simulation + std::vector maxTensorElements(numSites_); + int64_t maxLeftExtent = 1; + for (int32_t i=0; i= numSites_) + { + std::cout<< "Site index can not exceed maximal number of sites" << std::endl; + return CUTENSORNET_STATUS_INVALID_VALUE; + } + + auto descTensorInA = descTensors_[siteA]; + auto descTensorInB = descTensors_[siteB]; + + cutensornetTensorDescriptor_t descTensorInG; + + /********************************* + * Create output tensor descriptors + **********************************/ + int32_t physModeInA = physModes_[siteA]; + int32_t physModeInB = physModes_[siteB]; + int32_t physModeOutA = nextMode_++; + int32_t physModeOutB = nextMode_++; + const int32_t modesG[]{physModeInA, physModeInB, physModeOutA, physModeOutB}; + const int64_t extentG[]{physExtent_, physExtent_, physExtent_, physExtent_}; + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle_, + /*numModes=*/4, + extentG, + /*strides=*/nullptr, // fortran layout + modesG, + typeData_, + 
&descTensorInG) ); + + int64_t leftExtentA = extentsPerSite_[siteA]; + int64_t extentABIn = extentsPerSite_[siteA+1]; + int64_t rightExtentB = extentsPerSite_[siteA+2]; + // Compute the expected shared extent of output tensor A and B. + int64_t combinedExtentLeft = std::min(leftExtentA, extentABIn*physExtent_) * physExtent_; + int64_t combinedExtentRight = std::min(rightExtentB, extentABIn*physExtent_) * physExtent_; + int64_t extentABOut = std::min({combinedExtentLeft, combinedExtentRight, maxVirtualExtent_}); + + cutensornetTensorDescriptor_t descTensorOutA; + cutensornetTensorDescriptor_t descTensorOutB; + const int32_t modesOutA[]{virtualModes_[siteA], physModeOutA, virtualModes_[siteA+1]}; + const int32_t modesOutB[]{virtualModes_[siteB], physModeOutB, virtualModes_[siteB+1]}; + const int64_t extentOutA[]{leftExtentA, physExtent_, extentABOut}; + const int64_t extentOutB[]{extentABOut, physExtent_, rightExtentB}; + + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle_, + /*numModes=*/3, + extentOutA, + /*strides=*/nullptr, // fortran layout + modesOutA, + typeData_, + &descTensorOutA) ); + + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle_, + /*numModes=*/3, + extentOutB, + /*strides=*/nullptr, // fortran layout + modesOutB, + typeData_, + &descTensorOutB) ); + + /********** + * Execution + ***********/ + HANDLE_ERROR( cutensornetGateSplit(handle_, + descTensorInA, dataInA, + descTensorInB, dataInB, + descTensorInG, dataInG, + descTensorOutA, dataInA, // overwrite in place + /*s=*/nullptr, // we partition s equally onto A and B, therefore s is not needed + descTensorOutB, dataInB, // overwrite in place + gateAlgo_, svdConfig_, typeCompute_, + svdInfo_, workDesc_, stream) ); + + /************************** + * Query runtime information + ***************************/ + if (verbose) + { + int64_t fullExtent; + int64_t reducedExtent; + double discardedWeight; + HANDLE_ERROR( cutensornetTensorSVDInfoGetAttribute( handle_, svdInfo_, 
CUTENSORNET_TENSOR_SVD_INFO_FULL_EXTENT, &fullExtent, sizeof(fullExtent)) ); + HANDLE_ERROR( cutensornetTensorSVDInfoGetAttribute( handle_, svdInfo_, CUTENSORNET_TENSOR_SVD_INFO_REDUCED_EXTENT, &reducedExtent, sizeof(reducedExtent)) ); + HANDLE_ERROR( cutensornetTensorSVDInfoGetAttribute( handle_, svdInfo_, CUTENSORNET_TENSOR_SVD_INFO_DISCARDED_WEIGHT, &discardedWeight, sizeof(discardedWeight)) ); + std::cout << "virtual bond truncated from " << fullExtent << " to " << reducedExtent << " with a discarded weight of " << discardedWeight << std::endl; + } + + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorInA) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorInB) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorInG) ); + + // update pointer to the output tensor descriptor and the output shared extent + physModes_[siteA] = physModeOutA; + physModes_[siteB] = physModeOutB; + descTensors_[siteA] = descTensorOutA; + descTensors_[siteB] = descTensorOutB; + + int32_t numModes = 3; + std::vector extentAOut(numModes); + HANDLE_ERROR( cutensornetGetTensorDetails(handle_, descTensorOutA, &numModes, nullptr, nullptr, extentAOut.data(), nullptr) ); + // update the shared extent of outputs A and B, which may be reduced at runtime if absCutoff or relCutoff is non-zero. 
+ extentsPerSite_[siteA+1] = extentAOut[2]; // mode label order is always (left_virtual, physical, right_virtual) + return CUTENSORNET_STATUS_SUCCESS; +} + +// Sphinx: #10 +int main() +{ + const size_t cuTensornetVersion = cutensornetGetVersion(); + printf("cuTensorNet-vers:%ld\n",cuTensornetVersion); + + cudaDeviceProp prop; + int deviceId{-1}; + HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) ); + HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) ); + + printf("===== device info ======\n"); + printf("GPU-name:%s\n", prop.name); + printf("GPU-clock:%d\n", prop.clockRate); + printf("GPU-memoryClock:%d\n", prop.memoryClockRate); + printf("GPU-nSM:%d\n", prop.multiProcessorCount); + printf("GPU-major:%d\n", prop.major); + printf("GPU-minor:%d\n", prop.minor); + printf("========================\n"); + + // Sphinx: #11 + /*********************************** + * Step 1: basic MPS setup + ************************************/ + + // set up the simulation settings for the MPS + typedef std::complex<double> complexType; + cudaDataType_t typeData = CUDA_C_64F; + cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_64F; + int32_t numSites = 16; + int64_t physExtent = 2; + int64_t maxVirtualExtent = 12; + const std::vector initialVirtualExtents(numSites-1, 1); // start the MPS with a shared extent of 1 + + // initialize an MPSHelper to dynamically update tensor metadata + MPSHelper mpsHelper(numSites, physExtent, maxVirtualExtent, initialVirtualExtents, typeData, typeCompute); + HANDLE_ERROR( mpsHelper.initialize() ); + + // Sphinx: #12 + /*********************************** + * Step 2: data allocation + ************************************/ + + // query largest tensor sizes for the MPS + const std::vector maxElementsPerSite = mpsHelper.getMaxTensorElements(); + std::vector tensors_h; + std::vector tensors_d; + for (int32_t i=0; i + *(complexType*)(data_h) = complexType(1,0); + void* data_d; + HANDLE_CUDA_ERROR( cudaMalloc(&data_d, maxSize) ); + // data transfer from host to 
device + HANDLE_CUDA_ERROR( cudaMemcpy(data_d, data_h, maxSize, cudaMemcpyHostToDevice) ); + tensors_h.push_back(data_h); + tensors_d.push_back(data_d); + } + + // initialize 4 random gate tensors on host and copy them to device + const int32_t numRandomGates = 4; + const int64_t numGateElements = physExtent * physExtent * physExtent * physExtent; // shape (2, 2, 2, 2) + size_t gateSize = sizeof(complexType) * numGateElements; + complexType* gates_h[numRandomGates]; + void* gates_d[numRandomGates]; + + for (int i=0; i +#include + +#include +#include +#include + +#include +#include + +#define HANDLE_ERROR(x) \ +{ const auto err = x; \ +if( err != CUTENSORNET_STATUS_SUCCESS ) \ +{ printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); return err; } \ +}; + +#define HANDLE_CUDA_ERROR(x) \ +{ const auto err = x; \ + if( err != cudaSuccess ) \ + { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); return err; } \ +}; + +struct GPUTimer +{ + GPUTimer(cudaStream_t stream): stream_(stream) + { + cudaEventCreate(&start_); + cudaEventCreate(&stop_); + } + + ~GPUTimer() + { + cudaEventDestroy(start_); + cudaEventDestroy(stop_); + } + + void start() + { + cudaEventRecord(start_, stream_); + } + + float seconds() + { + cudaEventRecord(stop_, stream_); + cudaEventSynchronize(stop_); + float time; + cudaEventElapsedTime(&time, start_, stop_); + return time * 1e-3; + } + + private: + cudaEvent_t start_, stop_; + cudaStream_t stream_; +}; + +int64_t computeCombinedExtent(const std::unordered_map &extentMap, + const std::vector &modes) +{ + int64_t combinedExtent{1}; + for (auto mode: modes) + { + auto it = extentMap.find(mode); + if (it != extentMap.end()) + combinedExtent *= it->second; + } + return combinedExtent; +} + +int main() +{ + const size_t cuTensornetVersion = cutensornetGetVersion(); + printf("cuTensorNet-vers:%ld\n",cuTensornetVersion); + + cudaDeviceProp prop; + int deviceId{-1}; + HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) ); + 
HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) ); + + printf("===== device info ======\n"); + printf("GPU-name:%s\n", prop.name); + printf("GPU-clock:%d\n", prop.clockRate); + printf("GPU-memoryClock:%d\n", prop.memoryClockRate); + printf("GPU-nSM:%d\n", prop.multiProcessorCount); + printf("GPU-major:%d\n", prop.major); + printf("GPU-minor:%d\n", prop.minor); + printf("========================\n"); + + // Sphinx: #2 + /********************************************** + * Tensor QR: T_{i,j,m,n} -> Q_{i,x,m} R_{n,x,j} + ***********************************************/ + + typedef float floatType; + cudaDataType_t typeData = CUDA_R_32F; + + // Create vector of modes + int32_t sharedMode = 'x'; + + std::vector modesT{'i','j','m','n'}; // input + std::vector modesQ{'i', sharedMode,'m'}; + std::vector modesR{'n', sharedMode,'j'}; // QR output + + // Extents + std::unordered_map extentMap; + extentMap['i'] = 16; + extentMap['j'] = 16; + extentMap['m'] = 16; + extentMap['n'] = 16; + + int64_t rowExtent = computeCombinedExtent(extentMap, modesQ); + int64_t colExtent = computeCombinedExtent(extentMap, modesR); + + // cuTensorNet tensor QR operates in reduced mode expecting k = min(m, n) + extentMap[sharedMode] = rowExtent <= colExtent? 
rowExtent: colExtent; + + // Create a vector of extents for each tensor + std::vector extentT; + for (auto mode : modesT) + extentT.push_back(extentMap[mode]); + std::vector extentQ; + for (auto mode : modesQ) + extentQ.push_back(extentMap[mode]); + std::vector extentR; + for (auto mode : modesR) + extentR.push_back(extentMap[mode]); + + // Sphinx: #3 + /*********************************** + * Allocating data on host and device + ************************************/ + + size_t elementsT = 1; + for (auto mode : modesT) + elementsT *= extentMap[mode]; + size_t elementsQ = 1; + for (auto mode : modesQ) + elementsQ *= extentMap[mode]; + size_t elementsR = 1; + for (auto mode : modesR) + elementsR *= extentMap[mode]; + + size_t sizeT = sizeof(floatType) * elementsT; + size_t sizeQ = sizeof(floatType) * elementsQ; + size_t sizeR = sizeof(floatType) * elementsR; + + printf("Total memory: %.2f GiB\n", (sizeT + sizeQ + sizeR)/1024./1024./1024); + + floatType *T = (floatType*) malloc(sizeT); + floatType *Q = (floatType*) malloc(sizeQ); + floatType *R = (floatType*) malloc(sizeR); + + if (T == NULL || Q==NULL || R==NULL ) + { + printf("Error: Host allocation of input T or output Q/R.\n"); + return -1; + } + + void* D_T; + void* D_Q; + void* D_R; + + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_T, sizeT) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_Q, sizeQ) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_R, sizeR) ); + + /**************** + * Initialize data + *****************/ + + for (uint64_t i = 0; i < elementsT; i++) + T[i] = ((floatType) rand())/RAND_MAX; + + HANDLE_CUDA_ERROR( cudaMemcpy(D_T, T, sizeT, cudaMemcpyHostToDevice) ); + printf("Allocate memory for data, and initialize data.\n"); + + // Sphinx: #4 + /****************** + * cuTensorNet + *******************/ + + cudaStream_t stream; + HANDLE_CUDA_ERROR( cudaStreamCreate(&stream) ); + + cutensornetHandle_t handle; + HANDLE_ERROR( cutensornetCreate(&handle) ); + + /*************************** + * Create tensor 
descriptors + ****************************/ + + cutensornetTensorDescriptor_t descTensorIn; + cutensornetTensorDescriptor_t descTensorQ; + cutensornetTensorDescriptor_t descTensorR; + + const int32_t numModesIn = modesT.size(); + const int32_t numModesQ = modesQ.size(); + const int32_t numModesR = modesR.size(); + + const int64_t* strides = NULL; // assuming fortran layout for all tensors + + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesIn, extentT.data(), strides, modesT.data(), typeData, &descTensorIn) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesQ, extentQ.data(), strides, modesQ.data(), typeData, &descTensorQ) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesR, extentR.data(), strides, modesR.data(), typeData, &descTensorR) ); + + printf("Initialize the cuTensorNet library and create all tensor descriptors.\n"); + + // Sphinx: #5 + /******************************************** + * Query and allocate required workspace sizes + *********************************************/ + + cutensornetWorkspaceDescriptor_t workDesc; + HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) ); + HANDLE_ERROR( cutensornetWorkspaceComputeQRSizes(handle, descTensorIn, descTensorQ, descTensorR, workDesc) ); + uint64_t hostWorkspaceSize, deviceWorkspaceSize; + + // for tensor QR, it does not matter which cutensornetWorksizePref_t we pick + HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, workDesc, CUTENSORNET_WORKSIZE_PREF_RECOMMENDED, CUTENSORNET_MEMSPACE_DEVICE, &deviceWorkspaceSize) ); + HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, workDesc, CUTENSORNET_WORKSIZE_PREF_RECOMMENDED, CUTENSORNET_MEMSPACE_HOST, &hostWorkspaceSize) ); + + void *devWork = nullptr, *hostWork = nullptr; + if (deviceWorkspaceSize > 0) { + HANDLE_CUDA_ERROR( cudaMalloc(&devWork, deviceWorkspaceSize) ); + } + if (hostWorkspaceSize > 0) { + hostWork = malloc(hostWorkspaceSize); + } + HANDLE_ERROR( 
cutensornetWorkspaceSet(handle, workDesc, CUTENSORNET_MEMSPACE_DEVICE, devWork, deviceWorkspaceSize) ); + HANDLE_ERROR( cutensornetWorkspaceSet(handle, workDesc, CUTENSORNET_MEMSPACE_HOST, hostWork, hostWorkspaceSize) ); + + // Sphinx: #6 + /********** + * Execution + ***********/ + + GPUTimer timer{stream}; + double minTimeCUTENSOR = 1e100; + const int numRuns = 3; // to get stable perf results + for (int i=0; i < numRuns; ++i) + { + // restore output + cudaMemsetAsync(D_Q, 0, sizeQ, stream); + cudaMemsetAsync(D_R, 0, sizeR, stream); + cudaDeviceSynchronize(); + + timer.start(); + HANDLE_ERROR( cutensornetTensorQR(handle, + descTensorIn, D_T, + descTensorQ, D_Q, + descTensorR, D_R, + workDesc, + stream) ); + // Synchronize and measure timing + auto time = timer.seconds(); + minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time; + } + + printf("Performing QR\n"); + + HANDLE_CUDA_ERROR( cudaMemcpyAsync(Q, D_Q, sizeQ, cudaMemcpyDeviceToHost) ); + HANDLE_CUDA_ERROR( cudaMemcpyAsync(R, D_R, sizeR, cudaMemcpyDeviceToHost) ); + + cudaDeviceSynchronize(); // device synchronization. 
+ printf("%.2f ms\n", minTimeCUTENSOR * 1000.f); + + // Sphinx: #7 + /*************** + * Free resources + ****************/ + + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorIn) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorQ) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorR) ); + HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) ); + HANDLE_ERROR( cutensornetDestroy(handle) ); + + if (T) free(T); + if (Q) free(Q); + if (R) free(R); + if (D_T) cudaFree(D_T); + if (D_Q) cudaFree(D_Q); + if (D_R) cudaFree(D_R); + if (devWork) cudaFree(devWork); + if (hostWork) free(hostWork); + + printf("Free resource and exit.\n"); + + return 0; +} diff --git a/samples/cutensornet/approxTN/tensor_svd_example.cu b/samples/cutensornet/approxTN/tensor_svd_example.cu new file mode 100644 index 0000000..e8fbf84 --- /dev/null +++ b/samples/cutensornet/approxTN/tensor_svd_example.cu @@ -0,0 +1,355 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. 
+ * + * SPDX-License-Identifier: BSD-3-Clause + */ + +// Sphinx: #1 +#include +#include + +#include +#include +#include + +#include +#include + +#define HANDLE_ERROR(x) \ +{ const auto err = x; \ +if( err != CUTENSORNET_STATUS_SUCCESS ) \ +{ printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); return err; } \ +}; + +#define HANDLE_CUDA_ERROR(x) \ +{ const auto err = x; \ + if( err != cudaSuccess ) \ + { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); return err; } \ +}; + +struct GPUTimer +{ + GPUTimer(cudaStream_t stream): stream_(stream) + { + cudaEventCreate(&start_); + cudaEventCreate(&stop_); + } + + ~GPUTimer() + { + cudaEventDestroy(start_); + cudaEventDestroy(stop_); + } + + void start() + { + cudaEventRecord(start_, stream_); + } + + float seconds() + { + cudaEventRecord(stop_, stream_); + cudaEventSynchronize(stop_); + float time; + cudaEventElapsedTime(&time, start_, stop_); + return time * 1e-3; + } + + private: + cudaEvent_t start_, stop_; + cudaStream_t stream_; +}; + +int64_t computeCombinedExtent(const std::unordered_map &extentMap, + const std::vector &modes) +{ + int64_t combinedExtent{1}; + for (auto mode: modes) + { + auto it = extentMap.find(mode); + if (it != extentMap.end()) + combinedExtent *= it->second; + } + return combinedExtent; +} + +int main() +{ + const size_t cuTensornetVersion = cutensornetGetVersion(); + printf("cuTensorNet-vers:%ld\n",cuTensornetVersion); + + cudaDeviceProp prop; + int deviceId{-1}; + HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) ); + HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) ); + + printf("===== device info ======\n"); + printf("GPU-name:%s\n", prop.name); + printf("GPU-clock:%d\n", prop.clockRate); + printf("GPU-memoryClock:%d\n", prop.memoryClockRate); + printf("GPU-nSM:%d\n", prop.multiProcessorCount); + printf("GPU-major:%d\n", prop.major); + printf("GPU-minor:%d\n", prop.minor); + printf("========================\n"); + + // Sphinx: #2 + 
/****************************************************** + * Tensor SVD: T_{i,j,m,n} -> U_{i,x,m} S_{x} V_{n,x,j} + *******************************************************/ + + typedef float floatType; + cudaDataType_t typeData = CUDA_R_32F; + + // Create vector of modes + int32_t sharedMode = 'x'; + + std::vector<int32_t> modesT{'i','j','m','n'}; // input + std::vector<int32_t> modesU{'i', sharedMode,'m'}; + std::vector<int32_t> modesV{'n', sharedMode,'j'}; // SVD output + + // Extents + std::unordered_map<int32_t, int64_t> extentMap; + extentMap['i'] = 16; + extentMap['j'] = 16; + extentMap['m'] = 16; + extentMap['n'] = 16; + + int64_t rowExtent = computeCombinedExtent(extentMap, modesU); + int64_t colExtent = computeCombinedExtent(extentMap, modesV); + // cuTensorNet tensor SVD operates in reduced mode expecting k <= min(m, n) + int64_t fullSharedExtent = rowExtent <= colExtent? rowExtent: colExtent; + const int64_t maxExtent = fullSharedExtent / 2; //fix extent truncation with half of the singular values trimmed out + extentMap[sharedMode] = maxExtent; + + // Create a vector of extents for each tensor + std::vector<int64_t> extentT; + for (auto mode : modesT) + extentT.push_back(extentMap[mode]); + std::vector<int64_t> extentU; + for (auto mode : modesU) + extentU.push_back(extentMap[mode]); + std::vector<int64_t> extentV; + for (auto mode : modesV) + extentV.push_back(extentMap[mode]); + + // Sphinx: #3 + /*********************************** + * Allocating data on host and device + ************************************/ + + size_t elementsT = 1; + for (auto mode : modesT) + elementsT *= extentMap[mode]; + size_t elementsU = 1; + for (auto mode : modesU) + elementsU *= extentMap[mode]; + size_t elementsV = 1; + for (auto mode : modesV) + elementsV *= extentMap[mode]; + + size_t sizeT = sizeof(floatType) * elementsT; + size_t sizeU = sizeof(floatType) * elementsU; + size_t sizeS = sizeof(floatType) * extentMap[sharedMode]; + size_t sizeV = sizeof(floatType) * elementsV; + + printf("Total memory: %.2f GiB\n", (sizeT + sizeU + sizeS + 
sizeV)/1024./1024./1024); + + floatType *T = (floatType*) malloc(sizeT); + floatType *U = (floatType*) malloc(sizeU); + floatType *S = (floatType*) malloc(sizeS); + floatType *V = (floatType*) malloc(sizeV); + + if (T == NULL || U==NULL || S==NULL || V==NULL) + { + printf("Error: Host allocation of input T or output U/S/V.\n"); + return -1; + } + + void* D_T; + void* D_U; + void* D_S; + void* D_V; + + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_T, sizeT) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_U, sizeU) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_S, sizeS) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_V, sizeV) ); + + /**************** + * Initialize data + *****************/ + + for (uint64_t i = 0; i < elementsT; i++) + T[i] = ((floatType) rand())/RAND_MAX; + + HANDLE_CUDA_ERROR( cudaMemcpy(D_T, T, sizeT, cudaMemcpyHostToDevice) ); + printf("Allocate memory for data, and initialize data.\n"); + + // Sphinx: #4 + /****************** + * cuTensorNet + *******************/ + + cudaStream_t stream; + HANDLE_CUDA_ERROR( cudaStreamCreate(&stream) ); + + cutensornetHandle_t handle; + HANDLE_ERROR( cutensornetCreate(&handle) ); + + /************************** + * Create tensor descriptors + ***************************/ + + cutensornetTensorDescriptor_t descTensorIn; + cutensornetTensorDescriptor_t descTensorU; + cutensornetTensorDescriptor_t descTensorV; + + const int32_t numModesIn = modesT.size(); + const int32_t numModesU = modesU.size(); + const int32_t numModesV = modesV.size(); + + const int64_t* strides = NULL; // assuming fortran layout for all tensors + + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesIn, extentT.data(), strides, modesT.data(), typeData, &descTensorIn) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesU, extentU.data(), strides, modesU.data(), typeData, &descTensorU) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesV, extentV.data(), strides, modesV.data(), typeData, &descTensorV) 
); + + printf("Initialize the cuTensorNet library and create all tensor descriptors.\n"); + + // Sphinx: #5 + /******************************** + * Setup SVD truncation parameters + *********************************/ + + cutensornetTensorSVDConfig_t svdConfig; + HANDLE_ERROR( cutensornetCreateTensorSVDConfig(handle, &svdConfig) ); + double absCutoff = 1e-2; + HANDLE_ERROR( cutensornetTensorSVDConfigSetAttribute(handle, + svdConfig, + CUTENSORNET_TENSOR_SVD_CONFIG_ABS_CUTOFF, + &absCutoff, + sizeof(absCutoff)) ); + double relCutoff = 4e-2; + HANDLE_ERROR( cutensornetTensorSVDConfigSetAttribute(handle, + svdConfig, + CUTENSORNET_TENSOR_SVD_CONFIG_REL_CUTOFF, + &relCutoff, + sizeof(relCutoff)) ); + + /******************************************************** + * Create SVDInfo to record runtime SVD truncation details + *********************************************************/ + + cutensornetTensorSVDInfo_t svdInfo; + HANDLE_ERROR( cutensornetCreateTensorSVDInfo(handle, &svdInfo)) ; + + // Sphinx: #6 + /************************************************************** + * Query the required workspace sizes and allocate memory + **************************************************************/ + + cutensornetWorkspaceDescriptor_t workDesc; + HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) ); + HANDLE_ERROR( cutensornetWorkspaceComputeSVDSizes(handle, descTensorIn, descTensorU, descTensorV, svdConfig, workDesc) ); + uint64_t hostWorkspaceSize, deviceWorkspaceSize; + // for tensor SVD, it does not matter which cutensornetWorksizePref_t we pick + HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, workDesc, CUTENSORNET_WORKSIZE_PREF_RECOMMENDED, CUTENSORNET_MEMSPACE_DEVICE, &deviceWorkspaceSize) ); + HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, workDesc, CUTENSORNET_WORKSIZE_PREF_RECOMMENDED, CUTENSORNET_MEMSPACE_HOST, &hostWorkspaceSize) ); + + void *devWork = nullptr, *hostWork = nullptr; + if (deviceWorkspaceSize > 0) { + HANDLE_CUDA_ERROR( 
cudaMalloc(&devWork, deviceWorkspaceSize) ); + } + if (hostWorkspaceSize > 0) { + hostWork = malloc(hostWorkspaceSize); + } + HANDLE_ERROR( cutensornetWorkspaceSet(handle, workDesc, CUTENSORNET_MEMSPACE_DEVICE, devWork, deviceWorkspaceSize) ); + HANDLE_ERROR( cutensornetWorkspaceSet(handle, workDesc, CUTENSORNET_MEMSPACE_HOST, hostWork, hostWorkspaceSize) ); + + // Sphinx: #7 + /********** + * Execution + ***********/ + + GPUTimer timer{stream}; + double minTimeCUTENSOR = 1e100; + const int numRuns = 3; // to get stable perf results + for (int i=0; i < numRuns; ++i) + { + // restore output + cudaMemsetAsync(D_U, 0, sizeU, stream); + cudaMemsetAsync(D_S, 0, sizeS, stream); + cudaMemsetAsync(D_V, 0, sizeV, stream); + cudaDeviceSynchronize(); + + // With value-based truncation, `cutensornetTensorSVD` can potentially update the shared extent in descTensorU/V. + // We here restore descTensorU/V to the original problem. + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorU) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorV) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesU, extentU.data(), strides, modesU.data(), typeData, &descTensorU) ); + HANDLE_ERROR( cutensornetCreateTensorDescriptor(handle, numModesV, extentV.data(), strides, modesV.data(), typeData, &descTensorV) ); + + timer.start(); + HANDLE_ERROR( cutensornetTensorSVD(handle, + descTensorIn, D_T, + descTensorU, D_U, + D_S, + descTensorV, D_V, + svdConfig, + svdInfo, + workDesc, + stream) ); + // Synchronize and measure timing + auto time = timer.seconds(); + minTimeCUTENSOR = (minTimeCUTENSOR < time) ? 
minTimeCUTENSOR : time; + } + + printf("Performing SVD\n"); + + HANDLE_CUDA_ERROR( cudaMemcpyAsync(U, D_U, sizeU, cudaMemcpyDeviceToHost) ); + HANDLE_CUDA_ERROR( cudaMemcpyAsync(S, D_S, sizeS, cudaMemcpyDeviceToHost) ); + HANDLE_CUDA_ERROR( cudaMemcpyAsync(V, D_V, sizeV, cudaMemcpyDeviceToHost) ); + + // Sphinx: #8 + /************************************* + * Query runtime truncation information + **************************************/ + + double discardedWeight{0}; + int64_t reducedExtent{0}; + cudaDeviceSynchronize(); // device synchronization. + HANDLE_ERROR( cutensornetTensorSVDInfoGetAttribute( handle, svdInfo, CUTENSORNET_TENSOR_SVD_INFO_DISCARDED_WEIGHT, &discardedWeight, sizeof(discardedWeight)) ); + HANDLE_ERROR( cutensornetTensorSVDInfoGetAttribute( handle, svdInfo, CUTENSORNET_TENSOR_SVD_INFO_REDUCED_EXTENT, &reducedExtent, sizeof(reducedExtent)) ); + + printf("elapsed time: %.2f ms\n", minTimeCUTENSOR * 1000.f); + printf("reduced extent found at runtime: %lu\n", reducedExtent); + printf("discarded weight: %.2f\n", discardedWeight); + + // Sphinx: #9 + /*************** + * Free resources + ****************/ + + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorIn) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorU) ); + HANDLE_ERROR( cutensornetDestroyTensorDescriptor(descTensorV) ); + HANDLE_ERROR( cutensornetDestroyTensorSVDConfig(svdConfig) ); + HANDLE_ERROR( cutensornetDestroyTensorSVDInfo(svdInfo) ); + HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) ); + HANDLE_ERROR( cutensornetDestroy(handle) ); + + if (T) free(T); + if (U) free(U); + if (S) free(S); + if (V) free(V); + if (D_T) cudaFree(D_T); + if (D_U) cudaFree(D_U); + if (D_S) cudaFree(D_S); + if (D_V) cudaFree(D_V); + if (devWork) cudaFree(devWork); + if (hostWork) free(hostWork); + + printf("Free resource and exit.\n"); + + return 0; +} diff --git a/samples/cutensornet/tensornet_example.cu b/samples/cutensornet/tensornet_example.cu index b6efaab..c14e144 
100644 --- a/samples/cutensornet/tensornet_example.cu +++ b/samples/cutensornet/tensornet_example.cu @@ -1,10 +1,11 @@ -/* +/* * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. - * + * * SPDX-License-Identifier: BSD-3-Clause - */ + */ // Sphinx: #1 + #include #include @@ -14,20 +15,25 @@ #include #include -#include + #define HANDLE_ERROR(x) \ { const auto err = x; \ -if( err != CUTENSORNET_STATUS_SUCCESS ) \ -{ printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); return err; } \ + if( err != CUTENSORNET_STATUS_SUCCESS ) \ + { printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); \ + fflush(stdout); \ + } \ }; #define HANDLE_CUDA_ERROR(x) \ -{ const auto err = x; \ - if( err != cudaSuccess ) \ - { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); return err; } \ +{ const auto err = x; \ + if( err != cudaSuccess ) \ + { printf("CUDA Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); \ + fflush(stdout); \ + } \ }; + struct GPUTimer { GPUTimer(cudaStream_t stream): stream_(stream) @@ -62,53 +68,72 @@ struct GPUTimer }; -int main() +int main(int argc, char **argv) { - const size_t cuTensornetVersion = cutensornetGetVersion(); - printf("cuTensorNet-vers:%ld\n",cuTensornetVersion); + static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture!"); + bool verbose = true; + + // Check cuTensorNet version + const size_t cuTensornetVersion = cutensornetGetVersion(); + if(verbose) + printf("cuTensorNet version: %ld\n", cuTensornetVersion); + + // Set GPU device + int numDevices {0}; + HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) ); + const int deviceId = 0; + HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) ); cudaDeviceProp prop; - int deviceId{-1}; - HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) ); HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) ); - printf("===== device info ======\n"); - printf("GPU-name:%s\n", prop.name); - 
printf("GPU-clock:%d\n", prop.clockRate); - printf("GPU-memoryClock:%d\n", prop.memoryClockRate); - printf("GPU-nSM:%d\n", prop.multiProcessorCount); - printf("GPU-major:%d\n", prop.major); - printf("GPU-minor:%d\n", prop.minor); - printf("========================\n"); + if(verbose) { + printf("===== device info ======\n"); + printf("GPU-name:%s\n", prop.name); + printf("GPU-clock:%d\n", prop.clockRate); + printf("GPU-memoryClock:%d\n", prop.memoryClockRate); + printf("GPU-nSM:%d\n", prop.multiProcessorCount); + printf("GPU-major:%d\n", prop.major); + printf("GPU-minor:%d\n", prop.minor); + printf("========================\n"); + } typedef float floatType; cudaDataType_t typeData = CUDA_R_32F; cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F; - printf("Include headers and define data types\n"); + if(verbose) + printf("Included headers and defined data types\n"); // Sphinx: #2 /********************** - * Computing: D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y} + * Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m} **********************/ - constexpr int32_t numInputs = 3; + constexpr int32_t numInputs = 4; - // Create vector of modes - std::vector modesA{'m','h','k','n'}; - std::vector modesB{'u','k','h'}; - std::vector modesC{'x','u','y'}; - std::vector modesD{'m','x','n','y'}; + // Create vectors of tensor modes + std::vector modesA{'a','b','c','d','e','f'}; + std::vector modesB{'b','g','h','e','i','j'}; + std::vector modesC{'m','a','g','f','i','k'}; + std::vector modesD{'l','c','h','d','j','m'}; + std::vector modesR{'k','l'}; - // Extents + // Set mode extents std::unordered_map extent; - extent['m'] = 96; - extent['n'] = 96; - extent['u'] = 96; - extent['h'] = 64; - extent['k'] = 64; - extent['x'] = 64; - extent['y'] = 64; + extent['a'] = 16; + extent['b'] = 16; + extent['c'] = 16; + extent['d'] = 16; + extent['e'] = 16; + extent['f'] = 16; + extent['g'] = 16; + extent['h'] = 16; + extent['i'] = 16; + extent['j'] = 
16; + extent['k'] = 16; + extent['l'] = 16; + extent['m'] = 16; // Create a vector of extents for each tensor std::vector extentA; @@ -123,8 +148,12 @@ int main() std::vector extentD; for (auto mode : modesD) extentD.push_back(extent[mode]); + std::vector extentR; + for (auto mode : modesR) + extentR.push_back(extent[mode]); - printf("Define network, modes, and extents\n"); + if(verbose) + printf("Defined tensor network, modes, and extents\n"); // Sphinx: #3 /********************** @@ -143,28 +172,36 @@ int main() size_t elementsD = 1; for (auto mode : modesD) elementsD *= extent[mode]; + size_t elementsR = 1; + for (auto mode : modesR) + elementsR *= extent[mode]; size_t sizeA = sizeof(floatType) * elementsA; size_t sizeB = sizeof(floatType) * elementsB; size_t sizeC = sizeof(floatType) * elementsC; size_t sizeD = sizeof(floatType) * elementsD; - printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC + sizeD)/1024./1024./1024); + size_t sizeR = sizeof(floatType) * elementsR; + if(verbose) + printf("Total GPU memory used for tensor storage: %.2f GiB\n", + (sizeA + sizeB + sizeC + sizeD + sizeR) / 1024. /1024. 
/ 1024); void* rawDataIn_d[numInputs]; - void* D_d; + void* R_d; HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) ); HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) ); HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) ); - HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_d, sizeD)); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[3], sizeD) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &R_d, sizeR)); floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA); floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB); floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC); floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD); + floatType *R = (floatType*) malloc(sizeof(floatType) * elementsR); - if (A == NULL || B == NULL || C == NULL || D == NULL) + if (A == NULL || B == NULL || C == NULL || D == NULL || R == NULL) { - printf("Error: Host allocation of A or C.\n"); + printf("Error: Host memory allocation failed!\n"); return -1; } @@ -172,19 +209,23 @@ int main() * Initialize data *******************/ + memset(R, 0, sizeof(floatType) * elementsR); for (uint64_t i = 0; i < elementsA; i++) - A[i] = ((floatType) rand())/RAND_MAX; + A[i] = ((floatType) rand()) / RAND_MAX; for (uint64_t i = 0; i < elementsB; i++) - B[i] = ((floatType) rand())/RAND_MAX; + B[i] = ((floatType) rand()) / RAND_MAX; for (uint64_t i = 0; i < elementsC; i++) - C[i] = ((floatType) rand())/RAND_MAX; - memset(D, 0, sizeof(floatType) * elementsD); + C[i] = ((floatType) rand()) / RAND_MAX; + for (uint64_t i = 0; i < elementsD; i++) + D[i] = ((floatType) rand()) / RAND_MAX; HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) ); HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) ); HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) ); + HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) ); - printf("Allocate 
memory for data, and initialize data.\n"); + if(verbose) + printf("Allocated GPU memory for data, and initialize data\n"); // Sphinx: #4 /************************* @@ -201,44 +242,27 @@ int main() const int32_t nmodeB = modesB.size(); const int32_t nmodeC = modesC.size(); const int32_t nmodeD = modesD.size(); + const int32_t nmodeR = modesR.size(); /******************************* * Create Network Descriptor *******************************/ - const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data()}; - int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC}; - const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data()}; - const int64_t* stridesIn[] = {NULL, NULL, NULL}; // strides are optional; if no stride is provided, then cuTensorNet assumes a generalized column-major data layout - - // Notice that pointers are allocated via cudaMalloc are aligned to 256 byte - // boundaries by default; however here we're checking the pointer alignment explicitly - // to demonstrate how one would check the alginment for arbitrary pointers. 
- - auto getMaximalPointerAlignment = [](const void* ptr) { - const uint64_t ptrAddr = reinterpret_cast<uint64_t>(ptr); - uint32_t alignment = 1; - while(ptrAddr % alignment == 0 && - alignment < 256) // at the latest we terminate once the alignment reached 256 bytes (we could be going, but any alignment larger or equal to 256 is equally fine) - { - alignment *= 2; - } - return alignment; - }; - const uint32_t alignmentsIn[] = {getMaximalPointerAlignment(rawDataIn_d[0]), - getMaximalPointerAlignment(rawDataIn_d[1]), - getMaximalPointerAlignment(rawDataIn_d[2])}; - const uint32_t alignmentOut = getMaximalPointerAlignment(D_d); - - // setup tensor network + const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data(), modesD.data()}; + int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC, nmodeD}; + const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data(), extentD.data()}; + const int64_t* stridesIn[] = {NULL, NULL, NULL, NULL}; // strides are optional; if no stride is provided, cuTensorNet assumes a generalized column-major data layout + + // Set up tensor network cutensornetNetworkDescriptor_t descNet; HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle, - numInputs, numModesIn, extentsIn, stridesIn, modesIn, alignmentsIn, - nmodeD, extentD.data(), /*stridesOut = */NULL, modesD.data(), alignmentOut, - typeData, typeCompute, - &descNet) ); + numInputs, numModesIn, extentsIn, stridesIn, modesIn, NULL, + nmodeR, extentR.data(), /*stridesOut = */NULL, modesR.data(), + typeData, typeCompute, + &descNet) ); - printf("Initialize the cuTensorNet library and create a network descriptor.\n"); + if(verbose) + printf("Initialized the cuTensorNet library and created a tensor network descriptor\n"); // Sphinx: #5 /******************************* @@ -247,7 +271,9 @@ int main() size_t freeMem, totalMem; HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) ); - uint64_t workspaceLimit = totalMem * 0.9; + uint64_t workspaceLimit = 
(uint64_t)((double)freeMem * 0.9); + if(verbose) + printf("Workspace limit = %lu\n", workspaceLimit); /******************************* * Find "optimal" contraction order and slicing @@ -256,16 +282,15 @@ int main() cutensornetContractionOptimizerConfig_t optimizerConfig; HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) ); - // Set the value of the partitioner imbalance factor, if desired - int32_t imbalance_factor = 30; - HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute( - handle, - optimizerConfig, - CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_IMBALANCE_FACTOR, - &imbalance_factor, - sizeof(imbalance_factor)) ); - + // Set the desired number of hyper-samples (defaults to 0) + int32_t num_hypersamples = 8; + HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle, + optimizerConfig, + CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES, + &num_hypersamples, + sizeof(num_hypersamples)) ); + // Create contraction optimizer info and find an optimized contraction path cutensornetContractionOptimizerInfo_t optimizerInfo; HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) ); @@ -275,17 +300,18 @@ int main() workspaceLimit, optimizerInfo) ); + // Query the number of slices the tensor network execution will be split into int64_t numSlices = 0; HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute( - handle, - optimizerInfo, - CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES, - &numSlices, - sizeof(numSlices)) ); - + handle, + optimizerInfo, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES, + &numSlices, + sizeof(numSlices)) ); assert(numSlices > 0); - printf("Find an optimized contraction path with cuTensorNet optimizer.\n"); + if(verbose) + printf("Found an optimized contraction path using cuTensorNet optimizer\n"); // Sphinx: #6 /******************************* @@ -296,49 +322,48 @@ int main() HANDLE_ERROR( 
cutensornetCreateWorkspaceDescriptor(handle, &workDesc) ); uint64_t requiredWorkspaceSize = 0; - HANDLE_ERROR( cutensornetWorkspaceComputeSizes(handle, - descNet, - optimizerInfo, - workDesc) ); + HANDLE_ERROR( cutensornetWorkspaceComputeContractionSizes(handle, + descNet, + optimizerInfo, + workDesc) ); HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, - workDesc, - CUTENSORNET_WORKSIZE_PREF_MIN, - CUTENSORNET_MEMSPACE_DEVICE, - &requiredWorkspaceSize) ); + workDesc, + CUTENSORNET_WORKSIZE_PREF_MIN, + CUTENSORNET_MEMSPACE_DEVICE, + &requiredWorkspaceSize) ); - void *work = nullptr; + void* work = nullptr; HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) ); HANDLE_ERROR( cutensornetWorkspaceSet(handle, - workDesc, - CUTENSORNET_MEMSPACE_DEVICE, - work, - requiredWorkspaceSize) ); + workDesc, + CUTENSORNET_MEMSPACE_DEVICE, + work, + requiredWorkspaceSize) ); - printf("Allocate workspace.\n"); + if(verbose) + printf("Allocated and set up the GPU workspace\n"); // Sphinx: #7 /******************************* - * Initialize all pair-wise contraction plans (for cuTENSOR). + * Initialize the pairwise contraction plan (for cuTENSOR). *******************************/ cutensornetContractionPlan_t plan; - HANDLE_ERROR( cutensornetCreateContractionPlan(handle, - descNet, - optimizerInfo, - workDesc, - &plan) ); - + descNet, + optimizerInfo, + workDesc, + &plan) ); /******************************* * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel - * for each pairwise contraction. + * for each pairwise tensor contraction. 
*******************************/ cutensornetContractionAutotunePreference_t autotunePref; HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle, - &autotunePref) ); + &autotunePref) ); const int numAutotuningIterations = 5; // may be 0 HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute( @@ -348,94 +373,112 @@ int main() &numAutotuningIterations, sizeof(numAutotuningIterations)) ); - // modify the plan again to find the best pair-wise contractions + // Modify the plan again to find the best pair-wise contractions HANDLE_ERROR( cutensornetContractionAutotune(handle, - plan, - rawDataIn_d, - D_d, - workDesc, - autotunePref, - stream) ); + plan, + rawDataIn_d, + R_d, + workDesc, + autotunePref, + stream) ); HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) ); - printf("Create a contraction plan for cuTensorNet and optionally auto-tune it.\n"); + if(verbose) + printf("Created a contraction plan for cuTensorNet and optionally auto-tuned it\n"); // Sphinx: #8 /********************** - * Run + * Execute the tensor network contraction **********************/ - cutensornetSliceGroup_t sliceGroup{}; - // Create a cutensornetSliceGroup_t object from a range of slice IDs. 
+ // Create a cutensornetSliceGroup_t object from a range of slice IDs + cutensornetSliceGroup_t sliceGroup{}; HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) ); - GPUTimer timer{stream}; + GPUTimer timer {stream}; double minTimeCUTENSOR = 1e100; - const int numRuns = 3; // to get stable perf results - for (int i=0; i < numRuns; ++i) + const int numRuns = 3; // number of repeats to get stable performance results + for (int i = 0; i < numRuns; ++i) { - cudaMemcpy(D_d, D, sizeD, cudaMemcpyHostToDevice); // restore output - cudaDeviceSynchronize(); + HANDLE_CUDA_ERROR( cudaMemcpy(R_d, R, sizeR, cudaMemcpyHostToDevice) ); // restore the output tensor on GPU + HANDLE_CUDA_ERROR( cudaDeviceSynchronize() ); /* - * Contract over all slices. - * - * A user may choose to parallelize over the slices across multiple devices. + * Contract all slices of the tensor network */ timer.start(); - int32_t accumulateOutput = 0; + int32_t accumulateOutput = 0; // output tensor data will be overwritten HANDLE_ERROR( cutensornetContractSlices(handle, - plan, - rawDataIn_d, - D_d, - accumulateOutput, - workDesc, - sliceGroup, // Alternatively, NULL can also be used to contract over all the slices instead of specifying a sliceGroup object. - stream) ); - - // Synchronize and measure timing + plan, + rawDataIn_d, + R_d, + accumulateOutput, + workDesc, + sliceGroup, // alternatively, NULL can also be used to contract over all slices instead of specifying a sliceGroup object + stream) ); + + // Synchronize and measure best timing auto time = timer.seconds(); - minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time; + minTimeCUTENSOR = (time > minTimeCUTENSOR) ? 
minTimeCUTENSOR : time; } - printf("Contract the network, each slice uses the same contraction plan.\n"); + if(verbose) + printf("Contracted the tensor network, each slice used the same contraction plan\n"); + // Print the 1-norm of the output tensor (verification) + HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) ); + HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host + double norm1 = 0.0; + for (int64_t i = 0; i < elementsR; ++i) { + norm1 += std::abs(R[i]); + } + if(verbose) + printf("Computed the 1-norm of the output tensor: %e\n", norm1); /*************************/ - double flops{0.}; + // Query the total Flop count for the tensor network contraction + double flops {0.0}; HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute( - handle, - optimizerInfo, - CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT, - &flops, - sizeof(flops)) ); - - printf("numSlices: %ld\n", numSlices); - printf("%.2f ms / slice\n", minTimeCUTENSOR * 1000.f / numSlices); - printf("%.2f GFLOPS/s\n", flops/1e9/minTimeCUTENSOR ); + handle, + optimizerInfo, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT, + &flops, + sizeof(flops)) ); + + if(verbose) { + printf("Number of tensor network slices = %ld\n", numSlices); + printf("Tensor network contraction time (ms) = %.3f\n", minTimeCUTENSOR * 1000.f); + } + // Free cuTensorNet resources HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) ); - HANDLE_ERROR( cutensornetDestroy(handle) ); - HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) ); HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) ); - HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) ); - HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) ); HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) ); + HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) ); + HANDLE_ERROR( 
cutensornetDestroyContractionOptimizerConfig(optimizerConfig) ); + HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) ); + HANDLE_ERROR( cutensornetDestroy(handle) ); - if (A) free(A); - if (B) free(B); - if (C) free(C); + // Free Host memory resources + if (R) free(R); if (D) free(D); + if (C) free(C); + if (B) free(B); + if (A) free(A); + + // Free GPU memory resources + if (work) cudaFree(work); + if (R_d) cudaFree(R_d); if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]); if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]); if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]); - if (D_d) cudaFree(D_d); - if (work) cudaFree(work); + if (rawDataIn_d[3]) cudaFree(rawDataIn_d[3]); - printf("Free resource and exit.\n"); + if(verbose) + printf("Freed resources and exited\n"); return 0; } diff --git a/samples/cutensornet/tensornet_example_mpi.cu b/samples/cutensornet/tensornet_example_mpi.cu index 2446e60..1629901 100644 --- a/samples/cutensornet/tensornet_example_mpi.cu +++ b/samples/cutensornet/tensornet_example_mpi.cu @@ -6,10 +6,6 @@ // Sphinx: #1 -// Sphinx: MPI #1 [begin] -#include -// Sphinx: MPI #1 [end] - #include #include @@ -20,10 +16,16 @@ #include #include +// Sphinx: MPI #1 [begin] + +#include + +// Sphinx: MPI #1 [end] + #define HANDLE_ERROR(x) \ { const auto err = x; \ if( err != CUTENSORNET_STATUS_SUCCESS ) \ - { printf("[Process %d] Error: %s in line %d\n", rank, cutensornetGetErrorString(err), __LINE__); \ + { printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); \ fflush(stdout); \ MPI_Abort(MPI_COMM_WORLD, err); \ } \ @@ -32,7 +34,7 @@ #define HANDLE_CUDA_ERROR(x) \ { const auto err = x; \ if( err != cudaSuccess ) \ - { printf("[Process %d] CUDA Error: %s in line %d\n", rank, cudaGetErrorString(err), __LINE__); \ + { printf("CUDA Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); \ fflush(stdout); \ MPI_Abort(MPI_COMM_WORLD, err) ; \ } \ @@ -45,7 +47,7 @@ if( err != MPI_SUCCESS ) \ { char error[MPI_MAX_ERROR_STRING]; int len; \ 
MPI_Error_string(err, error, &len); \ - printf("[Process %d] MPI Error: %s in line %d\n", rank, error, __LINE__); \ + printf("MPI Error: %s in line %d\n", error, __LINE__); \ fflush(stdout); \ MPI_Abort(MPI_COMM_WORLD, err); \ } \ @@ -53,7 +55,6 @@ // Sphinx: MPI #2 [end] - struct GPUTimer { GPUTimer(cudaStream_t stream): stream_(stream) @@ -88,57 +89,49 @@ struct GPUTimer }; -int main(int argc, char *argv[]) +int main(int argc, char **argv) { - static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture."); + static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture!"); // Sphinx: MPI #3 [begin] - // Initialize MPI. - int errorCode = MPI_Init(&argc, &argv); - if (errorCode != MPI_SUCCESS) - { - printf("Error initializing MPI.\n"); - MPI_Abort(MPI_COMM_WORLD, errorCode); - } - - const int root{0}; - int rank{}; + // Initialize MPI + HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) ); + int rank {-1}; HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) ); - - int numProcs{}; + int numProcs {0}; HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) ); // Sphinx: MPI #3 [end] - if (rank == root) + bool verbose = (rank == 0) ? true : false; + if (verbose) { - printf("*** Printing is done only from the root process to prevent jumbled messages ***\n"); - printf("The number of processes is %d.\n", numProcs); + printf("*** Printing is done only from the root MPI process to prevent jumbled messages ***\n"); + printf("The number of MPI processes is %d\n", numProcs); } + if(verbose) + printf("Initialized MPI service\n"); - // Get cuTensornet version and device properties. 
+ // Check cuTensorNet version const size_t cuTensornetVersion = cutensornetGetVersion(); - if (rank == root) - printf("cuTensorNet-vers:%ld\n", cuTensornetVersion); - - int numDevices; - HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) ); - - cudaDeviceProp prop; + if(verbose) + printf("cuTensorNet version: %ld\n", cuTensornetVersion); // Sphinx: MPI #4 [begin] - // Set deviceId based on ranks and nodes. - int deviceId = rank % numDevices; // We assume that the processes are mapped to nodes in contiguous chunks. + // Set GPU device based on ranks and nodes + int numDevices {0}; + HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) ); + const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) ); + cudaDeviceProp prop; HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) ); // Sphinx: MPI #4 [end] - if (rank == root) - { - printf("===== root process device info ======\n"); + if(verbose) { + printf("===== device info ======\n"); printf("GPU-name:%s\n", prop.name); printf("GPU-clock:%d\n", prop.clockRate); printf("GPU-memoryClock:%d\n", prop.memoryClockRate); @@ -152,33 +145,39 @@ int main(int argc, char *argv[]) MPI_Datatype floatTypeMPI = MPI_FLOAT; cudaDataType_t typeData = CUDA_R_32F; cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F; - auto Absolute = fabsf; - if (rank == root) - printf("Include headers and define data types\n"); + if(verbose) + printf("Included headers and defined data types\n"); // Sphinx: #2 /********************** - * Computing: D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y} + * Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m} **********************/ - constexpr int32_t numInputs = 3; + constexpr int32_t numInputs = 4; - // Create vector of modes - std::vector<int32_t> modesA{'m','h','k','n'}; - std::vector<int32_t> modesB{'u','k','h'}; - std::vector<int32_t> modesC{'x','u','y'}; - std::vector<int32_t> 
modesD{'m','x','n','y'}; + // Create vectors of tensor modes + std::vector<int32_t> modesA{'a','b','c','d','e','f'}; + std::vector<int32_t> modesB{'b','g','h','e','i','j'}; + std::vector<int32_t> modesC{'m','a','g','f','i','k'}; + std::vector<int32_t> modesD{'l','c','h','d','j','m'}; + std::vector<int32_t> modesR{'k','l'}; - // Extents + // Set mode extents std::unordered_map<int32_t, int64_t> extent; - extent['m'] = 96; - extent['n'] = 96; - extent['u'] = 96; - extent['h'] = 64; - extent['k'] = 64; - extent['x'] = 64; - extent['y'] = 64; + extent['a'] = 16; + extent['b'] = 16; + extent['c'] = 16; + extent['d'] = 16; + extent['e'] = 16; + extent['f'] = 16; + extent['g'] = 16; + extent['h'] = 16; + extent['i'] = 16; + extent['j'] = 16; + extent['k'] = 16; + extent['l'] = 16; + extent['m'] = 16; // Create a vector of extents for each tensor std::vector<int64_t> extentA; @@ -193,9 +192,12 @@ int main(int argc, char *argv[]) std::vector<int64_t> extentD; for (auto mode : modesD) extentD.push_back(extent[mode]); + std::vector<int64_t> extentR; + for (auto mode : modesR) + extentR.push_back(extent[mode]); - if (rank == root) - printf("Define network, modes, and extents\n"); + if(verbose) + printf("Defined tensor network, modes, and extents\n"); // Sphinx: #3 /********************** @@ -214,31 +216,37 @@ int main(int argc, char *argv[]) size_t elementsD = 1; for (auto mode : modesD) elementsD *= extent[mode]; + size_t elementsR = 1; + for (auto mode : modesR) + elementsR *= extent[mode]; size_t sizeA = sizeof(floatType) * elementsA; size_t sizeB = sizeof(floatType) * elementsB; size_t sizeC = sizeof(floatType) * elementsC; size_t sizeD = sizeof(floatType) * elementsD; - if (rank == root) - printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC + sizeD)/1024./1024./1024); + size_t sizeR = sizeof(floatType) * elementsR; + if(verbose) + printf("Total GPU memory used for tensor storage: %.2f GiB\n", + (sizeA + sizeB + sizeC + sizeD + sizeR) / 1024. /1024. 
/ 1024); void* rawDataIn_d[numInputs]; - void* D_d; + void* R_d; HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) ); HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) ); HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) ); - HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_d, sizeD)); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[3], sizeD) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &R_d, sizeR)); floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA); floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB); floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC); floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD); + floatType *R = (floatType*) malloc(sizeof(floatType) * elementsR); - if (A == NULL || B == NULL || C == NULL || D == NULL) + if (A == NULL || B == NULL || C == NULL || D == NULL || R == NULL) { - printf("Process %d: Error: Host allocation of A, B, C, or D.\n", rank); - MPI_Abort(MPI_COMM_WORLD, -1); - + printf("Error: Host memory allocation failed!\n"); + return -1; } // Sphinx: MPI #5 [begin] @@ -247,31 +255,35 @@ int main(int argc, char *argv[]) * Initialize data *******************/ - // Rank root creates the tensor data. - if (rank == root) + memset(R, 0, sizeof(floatType) * elementsR); + if(rank == 0) { for (uint64_t i = 0; i < elementsA; i++) - A[i] = ((floatType) rand())/RAND_MAX; + A[i] = ((floatType) rand()) / RAND_MAX; for (uint64_t i = 0; i < elementsB; i++) - B[i] = ((floatType) rand())/RAND_MAX; + B[i] = ((floatType) rand()) / RAND_MAX; for (uint64_t i = 0; i < elementsC; i++) - C[i] = ((floatType) rand())/RAND_MAX; + C[i] = ((floatType) rand()) / RAND_MAX; + for (uint64_t i = 0; i < elementsD; i++) + D[i] = ((floatType) rand()) / RAND_MAX; } - // Broadcast data to all ranks. 
- HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, root, MPI_COMM_WORLD) ); - HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, root, MPI_COMM_WORLD) ); - HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, root, MPI_COMM_WORLD) ); + // Broadcast input data to all ranks + HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) ); + HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) ); + HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) ); + HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) ); - // Copy data onto the device on all ranks. + // Copy data to GPU HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) ); HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) ); HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) ); + HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) ); - // Sphinx: MPI #5 [end] + if(verbose) + printf("Allocated GPU memory for data, and initialized data\n"); - if (rank == root) - printf("Allocate memory for data, calculate workspace limit, and initialize data.\n"); + // Sphinx: MPI #5 [end] // Sphinx: #4 /************************* @@ -280,6 +292,7 @@ int main(int argc, char *argv[]) cudaStream_t stream; cudaStreamCreate(&stream); + cutensornetHandle_t handle; HANDLE_ERROR( cutensornetCreate(&handle) ); @@ -287,45 +300,27 @@ int main(int argc, char *argv[]) const int32_t nmodeB = modesB.size(); const int32_t nmodeC = modesC.size(); const int32_t nmodeD = modesD.size(); + const int32_t nmodeR = modesR.size(); /******************************* * Create Network Descriptor *******************************/ - const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data()}; - int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC}; - const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data()}; - const int64_t* 
stridesIn[] = {NULL, NULL, NULL}; // strides are optional; if no stride is provided, then cuTensorNet assumes a generalized column-major data layout - - // Notice that pointers are allocated via cudaMalloc are aligned to 256 byte - // boundaries by default; however here we're checking the pointer alignment explicitly - // to demonstrate how one would check the alginment for arbitrary pointers. - - auto getMaximalPointerAlignment = [](const void* ptr) { - const uint64_t ptrAddr = reinterpret_cast(ptr); - uint32_t alignment = 1; - while(ptrAddr % alignment == 0 && - alignment < 256) // at the latest we terminate once the alignment reached 256 bytes (we could be going, but any alignment larger or equal to 256 is equally fine) - { - alignment *= 2; - } - return alignment; - }; - const uint32_t alignmentsIn[] = {getMaximalPointerAlignment(rawDataIn_d[0]), - getMaximalPointerAlignment(rawDataIn_d[1]), - getMaximalPointerAlignment(rawDataIn_d[2])}; - const uint32_t alignmentOut = getMaximalPointerAlignment(D_d); - - // setup tensor network + const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data(), modesD.data()}; + int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC, nmodeD}; + const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data(), extentD.data()}; + const int64_t* stridesIn[] = {NULL, NULL, NULL, NULL}; // strides are optional; if no stride is provided, cuTensorNet assumes a generalized column-major data layout + + // Set up tensor network cutensornetNetworkDescriptor_t descNet; HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle, - numInputs, numModesIn, extentsIn, stridesIn, modesIn, alignmentsIn, - nmodeD, extentD.data(), /*stridesOut = */NULL, modesD.data(), alignmentOut, - typeData, typeCompute, - &descNet) ); + numInputs, numModesIn, extentsIn, stridesIn, modesIn, NULL, + nmodeR, extentR.data(), /*stridesOut = */NULL, modesR.data(), + typeData, typeCompute, + &descNet) ); - if (rank == root) - printf("Initialize the 
cuTensorNet library and create a network descriptor.\n"); + if(verbose) + printf("Initialized the cuTensorNet library and created a tensor network descriptor\n"); // Sphinx: #5 /******************************* @@ -333,113 +328,117 @@ int main(int argc, char *argv[]) *******************************/ size_t freeMem, totalMem; - HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem ) ); - HANDLE_MPI_ERROR( MPI_Allreduce(MPI_IN_PLACE, &totalMem, 1, MPI_INT64_T, MPI_MIN, MPI_COMM_WORLD) ); - uint64_t workspaceLimit = totalMem * 0.9; + HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) ); + uint64_t workspaceLimit = (uint64_t)((double)freeMem * 0.9); + // Make sure all MPI processes will assume the minimal workspace size among all + HANDLE_MPI_ERROR( MPI_Allreduce(MPI_IN_PLACE, &workspaceLimit, 1, MPI_INT64_T, MPI_MIN, MPI_COMM_WORLD) ); + if(verbose) + printf("Workspace limit = %lu\n", workspaceLimit); /******************************* - * Find "optimal" contraction order and slicing + * Find "optimal" contraction order and slicing (in parallel) *******************************/ cutensornetContractionOptimizerConfig_t optimizerConfig; HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) ); + // Set the desired number of hyper-samples (defaults to 0) + int32_t num_hypersamples = 8; + HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle, + optimizerConfig, + CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES, + &num_hypersamples, + sizeof(num_hypersamples)) ); + + // Create contraction optimizer info cutensornetContractionOptimizerInfo_t optimizerInfo; - HANDLE_ERROR(cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) ); + HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) ); // Sphinx: MPI #6 [begin] // Compute the path on all ranks so that we can choose the path with the lowest cost. 
Note that since this is a tiny - // example with 3 operands, all processes will compute the same globally optimal path. This is not the case for large - // tensor networks. For large networks, hyperoptimization is also beneficial and can be enabled by setting the - // optimizer config attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES. - - // Force slicing. - int32_t min_slices = numProcs; - HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute( - handle, - optimizerConfig, - CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES, - &min_slices, - sizeof(min_slices)) ); - + // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large + // tensor networks. For large networks, hyper-optimization does become beneficial. + + // Enforce tensor network slicing (for parallelization) + const int32_t min_slices = numProcs; + HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle, + optimizerConfig, + CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES, + &min_slices, + sizeof(min_slices)) ); + + // Find an optimized tensor network contraction path on each MPI process HANDLE_ERROR( cutensornetContractionOptimize(handle, - descNet, - optimizerConfig, - workspaceLimit, - optimizerInfo) ); + descNet, + optimizerConfig, + workspaceLimit, + optimizerInfo) ); + // Query the obtained Flop count double flops{-1.}; - HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute( - handle, - optimizerInfo, - CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT, - &flops, - sizeof(flops)) ); + HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(handle, + optimizerInfo, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT, + &flops, + sizeof(flops)) ); - // Choose the path with the lowest cost. 
+ // Choose the contraction path with the lowest Flop cost struct { - double value; - int rank; + double value; + int rank; } in{flops, rank}, out; - HANDLE_MPI_ERROR( MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD) ); - - int sender = out.rank; + const int sender = out.rank; flops = out.value; - if (rank == root) - { - printf("Process %d has the path with the lowest FLOP count %lf.\n", sender, flops); - } - size_t bufSize; + if (verbose) + printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops); - // Get buffer size for optimizerInfo and broadcast it. + // Get the buffer size for optimizerInfo and broadcast it + size_t bufSize {0}; if (rank == sender) { HANDLE_ERROR( cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize) ); } - HANDLE_MPI_ERROR( MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD) ); - // Allocate buffer. + // Allocate a buffer std::vector buffer(bufSize); - // Pack optimizerInfo on sender and broadcast it. + // Pack optimizerInfo on sender and broadcast it if (rank == sender) { HANDLE_ERROR( cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize) ); } - HANDLE_MPI_ERROR( MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD) ); - // Unpack optimizerInfo from buffer. + // Unpack optimizerInfo from the buffer if (rank != sender) { HANDLE_ERROR( cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo) ); } + // Query the number of slices the tensor network execution will be split into int64_t numSlices = 0; HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute( - handle, - optimizerInfo, - CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES, - &numSlices, - sizeof(numSlices)) ); - + handle, + optimizerInfo, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES, + &numSlices, + sizeof(numSlices)) ); assert(numSlices > 0); - // Calculate each process's share of the slices. 
- + // Calculate each process's share of the slices int64_t procChunk = numSlices / numProcs; int extra = numSlices % numProcs; int procSliceBegin = rank * procChunk + std::min(rank, extra); - int procSliceEnd = rank == numProcs - 1 ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra); + int procSliceEnd = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra); // Sphinx: MPI #6 [end] - if (rank == root) - printf("Find an optimized contraction path with cuTensorNet optimizer.\n"); + if(verbose) + printf("Found an optimized contraction path using cuTensorNet optimizer\n"); // Sphinx: #6 /******************************* @@ -450,49 +449,48 @@ int main(int argc, char *argv[]) HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) ); uint64_t requiredWorkspaceSize = 0; - HANDLE_ERROR( cutensornetWorkspaceComputeSizes(handle, - descNet, - optimizerInfo, - workDesc) ); + HANDLE_ERROR( cutensornetWorkspaceComputeContractionSizes(handle, + descNet, + optimizerInfo, + workDesc) ); HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, - workDesc, - CUTENSORNET_WORKSIZE_PREF_MIN, - CUTENSORNET_MEMSPACE_DEVICE, - &requiredWorkspaceSize) ); + workDesc, + CUTENSORNET_WORKSIZE_PREF_MIN, + CUTENSORNET_MEMSPACE_DEVICE, + &requiredWorkspaceSize) ); - void *work = nullptr; + void* work = nullptr; HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) ); HANDLE_ERROR( cutensornetWorkspaceSet(handle, - workDesc, - CUTENSORNET_MEMSPACE_DEVICE, - work, - requiredWorkspaceSize) ); + workDesc, + CUTENSORNET_MEMSPACE_DEVICE, + work, + requiredWorkspaceSize) ); - if (rank == root) - printf("Allocate workspace.\n"); + if(verbose) + printf("Allocated and set up the GPU workspace\n"); // Sphinx: #7 /******************************* - * Initialize all pair-wise contraction plans (for cuTENSOR) + * Initialize the pairwise contraction plan (for cuTENSOR). 
*******************************/ cutensornetContractionPlan_t plan; - HANDLE_ERROR( cutensornetCreateContractionPlan(handle, - descNet, - optimizerInfo, - workDesc, - &plan) ); + descNet, + optimizerInfo, + workDesc, + &plan) ); /******************************* * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel + * for each pairwise tensor contraction. *******************************/ - cutensornetContractionAutotunePreference_t autotunePref; HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle, - &autotunePref) ); + &autotunePref) ); const int numAutotuningIterations = 5; // may be 0 HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute( @@ -502,37 +500,37 @@ int main(int argc, char *argv[]) &numAutotuningIterations, sizeof(numAutotuningIterations)) ); - // modify the plan again to find the best pair-wise contractions + // Modify the plan again to find the best pair-wise contractions HANDLE_ERROR( cutensornetContractionAutotune(handle, - plan, - rawDataIn_d, - D_d, - workDesc, - autotunePref, - stream) ); + plan, + rawDataIn_d, + R_d, + workDesc, + autotunePref, + stream) ); HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) ); - if (rank == root) - printf("Create a contraction plan for cuTensorNet and optionally auto-tune it.\n"); + if(verbose) + printf("Created a contraction plan for cuTensorNet and optionally auto-tuned it\n"); // Sphinx: #8 /********************** - * Run + * Execute the tensor network contraction (in parallel) **********************/ // Sphinx: MPI #7 [begin] + // Create a cutensornetSliceGroup_t object from a range of slice IDs cutensornetSliceGroup_t sliceGroup{}; - // Create a cutensornetSliceGroup_t object from a range of slice IDs. 
HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup) ); // Sphinx: MPI #7 [end] GPUTimer timer{stream}; double minTimeCUTENSOR = 1e100; - const int numRuns = 3; // to get stable perf results - for (int i=0; i < numRuns; ++i) + const int numRuns = 3; // to get stable performance results + for (int i = 0; i < numRuns; ++i) { cudaDeviceSynchronize(); @@ -541,7 +539,7 @@ int main(int argc, char *argv[]) */ timer.start(); - // Don't accumulate into output since we use a one-process-per-gpu model. + // Don't accumulate into output since we use a one-process-per-gpu model int32_t accumulateOutput = 0; // Sphinx: MPI #8 [begin] @@ -549,7 +547,7 @@ int main(int argc, char *argv[]) HANDLE_ERROR( cutensornetContractSlices(handle, plan, rawDataIn_d, - D_d, + R_d, accumulateOutput, workDesc, sliceGroup, @@ -557,109 +555,80 @@ int main(int argc, char *argv[]) // Sphinx: MPI #8 [end] - // Synchronize and measure timing - auto time = timer.seconds(); - minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time; - } + // Sphinx: MPI #9 [begin] - if (rank == root) - printf("Contract the network, all slices within the same rank use the same contraction plan.\n"); + // Perform Allreduce operation on the output tensor + HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) ); + HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host + HANDLE_MPI_ERROR( MPI_Allreduce(MPI_IN_PLACE, R, elementsR, floatTypeMPI, MPI_SUM, MPI_COMM_WORLD) ); - /*************************/ + // Sphinx: MPI #9 [end] - if (rank == root) - { - printf("numSlices: %ld\n", numSlices); - int64_t numSlicesProc = procSliceEnd - procSliceBegin; - printf("numSlices on root process: %ld\n", numSlicesProc); - if (numSlicesProc > 0) - printf("%.2f ms / slice\n", minTimeCUTENSOR * 1000.f / numSlicesProc); + // Measure timing + auto time = timer.seconds(); + minTimeCUTENSOR = (minTimeCUTENSOR < time) ? 
minTimeCUTENSOR : time; } - HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) ); + if (verbose) + printf("Contracted the tensor network, all slices within the same rank used the same contraction plan.\n"); - HANDLE_CUDA_ERROR( cudaMemcpy(D, D_d, sizeD, cudaMemcpyDeviceToHost) ); - - // Sphinx: MPI #9 [begin] - - // Reduce on root process. - if (rank == root) - { - HANDLE_MPI_ERROR( MPI_Reduce(MPI_IN_PLACE, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) ); - } - else - { - HANDLE_MPI_ERROR( MPI_Reduce(D, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) ); + // Print the 1-norm of the output tensor (verification) + double norm1 = 0.0; + for (int64_t i = 0; i < elementsR; ++i) { + norm1 += std::abs(R[i]); } + if(verbose) + printf("Computed the 1-norm of the output tensor: %e\n", norm1); - // Sphinx: MPI #9 [end] - - // Compute the reference result. - if (rank == root) - { - floatType *Reference = (floatType*) malloc(sizeof(floatType) * elementsD); - if (Reference == NULL) - { - printf("Error: Host allocation of Reference.\n"); - MPI_Abort(MPI_COMM_WORLD, -1); - } - - void *Reference_d; - HANDLE_CUDA_ERROR( cudaMalloc((void**) &Reference_d, sizeD) ); + /*************************/ - int32_t accumulateOutput = 0; - HANDLE_ERROR( cutensornetContractSlices(handle, - plan, - rawDataIn_d, - Reference_d, - accumulateOutput, - workDesc, - NULL, // Contract over all the slices. - stream) ); - cudaDeviceSynchronize(); - HANDLE_CUDA_ERROR( cudaMemcpy(Reference, Reference_d, sizeD, cudaMemcpyDeviceToHost) ); - - // Calculate the error. 
- floatType max{}, maxError{}; - for (int i=0; i < elementsD; ++i) - { - floatType error = Absolute(D[i] - Reference[i]); - if (error > maxError) - maxError = error; - if (Absolute(Reference[i]) > max) - max = Absolute(Reference[i]); - } - printf("The inf norm of the reference result is %f, the maximum absolute error is %f, and the maximum relative error is %e.\n", max, maxError, maxError/max); - - free(Reference); - cudaFree(Reference_d); + // Query the total Flop count for the tensor network contraction + flops = 0.0; + HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute( + handle, + optimizerInfo, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT, + &flops, + sizeof(flops)) ); + + if(verbose) { + printf("Number of tensor network slices = %ld\n", numSlices); + printf("Tensor network contraction time (ms) = %.3f\n", minTimeCUTENSOR * 1000.f); } - HANDLE_ERROR( cutensornetDestroy(handle) ); - HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) ); + // Free cuTensorNet resources + HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) ); HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) ); - HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) ); - HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) ); HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) ); + HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) ); + HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) ); + HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) ); + HANDLE_ERROR( cutensornetDestroy(handle) ); - if (A) free(A); - if (B) free(B); - if (C) free(C); + // Free Host memory resources + if (R) free(R); if (D) free(D); + if (C) free(C); + if (B) free(B); + if (A) free(A); + + // Free GPU memory resources + if (work) cudaFree(work); + if (R_d) cudaFree(R_d); if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]); if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]); if (rawDataIn_d[2]) 
cudaFree(rawDataIn_d[2]); - if (D_d) cudaFree(D_d); - if (work) cudaFree(work); - - if (rank == root) - printf("Free resources and exit.\n"); + if (rawDataIn_d[3]) cudaFree(rawDataIn_d[3]); // Sphinx: MPI #10 [begin] + // Shut down MPI service HANDLE_MPI_ERROR( MPI_Finalize() ); // Sphinx: MPI #10 [end] + if(verbose) + printf("Freed resources and exited\n"); + return 0; } diff --git a/samples/cutensornet/tensornet_example_mpi_auto.cu b/samples/cutensornet/tensornet_example_mpi_auto.cu new file mode 100644 index 0000000..b5da86a --- /dev/null +++ b/samples/cutensornet/tensornet_example_mpi_auto.cu @@ -0,0 +1,564 @@ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +// Sphinx: #1 + +#include +#include + +#include +#include +#include + +#include +#include + +// Sphinx: MPI #1 [begin] + +#include + +// Sphinx: MPI #1 [end] + +#define HANDLE_ERROR(x) \ +{ const auto err = x; \ + if( err != CUTENSORNET_STATUS_SUCCESS ) \ + { printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); \ + fflush(stdout); \ + MPI_Abort(MPI_COMM_WORLD, err); \ + } \ +}; + +#define HANDLE_CUDA_ERROR(x) \ +{ const auto err = x; \ + if( err != cudaSuccess ) \ + { printf("CUDA Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); \ + fflush(stdout); \ + MPI_Abort(MPI_COMM_WORLD, err) ; \ + } \ +}; + +// Sphinx: MPI #2 [begin] + +#define HANDLE_MPI_ERROR(x) \ +{ const auto err = x; \ + if( err != MPI_SUCCESS ) \ + { char error[MPI_MAX_ERROR_STRING]; int len; \ + MPI_Error_string(err, error, &len); \ + printf("MPI Error: %s in line %d\n", error, __LINE__); \ + fflush(stdout); \ + MPI_Abort(MPI_COMM_WORLD, err); \ + } \ +}; + +// Sphinx: MPI #2 [end] + +struct GPUTimer +{ + GPUTimer(cudaStream_t stream): stream_(stream) + { + cudaEventCreate(&start_); + cudaEventCreate(&stop_); + } + + ~GPUTimer() + { + cudaEventDestroy(start_); + cudaEventDestroy(stop_); + } + + void start() + { + cudaEventRecord(start_, 
stream_); + } + + float seconds() + { + cudaEventRecord(stop_, stream_); + cudaEventSynchronize(stop_); + float time; + cudaEventElapsedTime(&time, start_, stop_); + return time * 1e-3; + } + + private: + cudaEvent_t start_, stop_; + cudaStream_t stream_; +}; + + +int main(int argc, char **argv) +{ + static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture!"); + + // Sphinx: MPI #3 [begin] + + // Initialize MPI + HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) ); + int rank {-1}; + HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) ); + int numProcs {0}; + HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) ); + + // Sphinx: MPI #3 [end] + + bool verbose = (rank == 0) ? true : false; + if (verbose) + { + printf("*** Printing is done only from the root MPI process to prevent jumbled messages ***\n"); + printf("The number of MPI processes is %d\n", numProcs); + } + if(verbose) + printf("Initialized MPI service\n"); + + // Check cuTensorNet version + const size_t cuTensornetVersion = cutensornetGetVersion(); + if(verbose) + printf("cuTensorNet version: %ld\n", cuTensornetVersion); + + // Sphinx: MPI #4 [begin] + + // Set GPU device based on ranks and nodes + int numDevices {0}; + HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) ); + const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks + HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) ); + cudaDeviceProp prop; + HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) ); + + // Sphinx: MPI #4 [end] + + if(verbose) { + printf("===== device info ======\n"); + printf("GPU-name:%s\n", prop.name); + printf("GPU-clock:%d\n", prop.clockRate); + printf("GPU-memoryClock:%d\n", prop.memoryClockRate); + printf("GPU-nSM:%d\n", prop.multiProcessorCount); + printf("GPU-major:%d\n", prop.major); + printf("GPU-minor:%d\n", prop.minor); + printf("========================\n"); + } + + typedef float floatType; + MPI_Datatype 
floatTypeMPI = MPI_FLOAT; + cudaDataType_t typeData = CUDA_R_32F; + cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F; + + if(verbose) + printf("Included headers and defined data types\n"); + + // Sphinx: #2 + /********************** + * Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m} + **********************/ + + constexpr int32_t numInputs = 4; + + // Create vectors of tensor modes + std::vector<int32_t> modesA{'a','b','c','d','e','f'}; + std::vector<int32_t> modesB{'b','g','h','e','i','j'}; + std::vector<int32_t> modesC{'m','a','g','f','i','k'}; + std::vector<int32_t> modesD{'l','c','h','d','j','m'}; + std::vector<int32_t> modesR{'k','l'}; + + // Set mode extents + std::unordered_map<int32_t, int64_t> extent; + extent['a'] = 16; + extent['b'] = 16; + extent['c'] = 16; + extent['d'] = 16; + extent['e'] = 16; + extent['f'] = 16; + extent['g'] = 16; + extent['h'] = 16; + extent['i'] = 16; + extent['j'] = 16; + extent['k'] = 16; + extent['l'] = 16; + extent['m'] = 16; + + // Create a vector of extents for each tensor + std::vector<int64_t> extentA; + for (auto mode : modesA) + extentA.push_back(extent[mode]); + std::vector<int64_t> extentB; + for (auto mode : modesB) + extentB.push_back(extent[mode]); + std::vector<int64_t> extentC; + for (auto mode : modesC) + extentC.push_back(extent[mode]); + std::vector<int64_t> extentD; + for (auto mode : modesD) + extentD.push_back(extent[mode]); + std::vector<int64_t> extentR; + for (auto mode : modesR) + extentR.push_back(extent[mode]); + + if(verbose) + printf("Defined tensor network, modes, and extents\n"); + + // Sphinx: #3 + /********************** + * Allocating data + **********************/ + + size_t elementsA = 1; + for (auto mode : modesA) + elementsA *= extent[mode]; + size_t elementsB = 1; + for (auto mode : modesB) + elementsB *= extent[mode]; + size_t elementsC = 1; + for (auto mode : modesC) + elementsC *= extent[mode]; + size_t elementsD = 1; + for (auto mode : modesD) + elementsD *= extent[mode]; + size_t elementsR = 1; + for (auto mode : modesR) + elementsR *= 
extent[mode]; + + size_t sizeA = sizeof(floatType) * elementsA; + size_t sizeB = sizeof(floatType) * elementsB; + size_t sizeC = sizeof(floatType) * elementsC; + size_t sizeD = sizeof(floatType) * elementsD; + size_t sizeR = sizeof(floatType) * elementsR; + if(verbose) + printf("Total GPU memory used for tensor storage: %.2f GiB\n", + (sizeA + sizeB + sizeC + sizeD + sizeR) / 1024. /1024. / 1024); + + void* rawDataIn_d[numInputs]; + void* R_d; + HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[3], sizeD) ); + HANDLE_CUDA_ERROR( cudaMalloc((void**) &R_d, sizeR)); + + floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA); + floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB); + floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC); + floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD); + floatType *R = (floatType*) malloc(sizeof(floatType) * elementsR); + + if (A == NULL || B == NULL || C == NULL || D == NULL || R == NULL) + { + printf("Error: Host memory allocation failed!\n"); + return -1; + } + + // Sphinx: MPI #5 [begin] + + /******************* + * Initialize data + *******************/ + + memset(R, 0, sizeof(floatType) * elementsR); + if(rank == 0) + { + for (uint64_t i = 0; i < elementsA; i++) + A[i] = ((floatType) rand()) / RAND_MAX; + for (uint64_t i = 0; i < elementsB; i++) + B[i] = ((floatType) rand()) / RAND_MAX; + for (uint64_t i = 0; i < elementsC; i++) + C[i] = ((floatType) rand()) / RAND_MAX; + for (uint64_t i = 0; i < elementsD; i++) + D[i] = ((floatType) rand()) / RAND_MAX; + } + + // Broadcast input data to all ranks + HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) ); + HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) ); + HANDLE_MPI_ERROR( 
MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) ); + HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) ); + + // Copy data to GPU + HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) ); + HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) ); + HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) ); + HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) ); + + if(verbose) + printf("Allocated GPU memory for data, and initialized data\n"); + + // Sphinx: MPI #5 [end] + + // Sphinx: #4 + /************************* + * cuTensorNet + *************************/ + + cudaStream_t stream; + cudaStreamCreate(&stream); + + cutensornetHandle_t handle; + HANDLE_ERROR( cutensornetCreate(&handle) ); + + const int32_t nmodeA = modesA.size(); + const int32_t nmodeB = modesB.size(); + const int32_t nmodeC = modesC.size(); + const int32_t nmodeD = modesD.size(); + const int32_t nmodeR = modesR.size(); + + /******************************* + * Create Network Descriptor + *******************************/ + + const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data(), modesD.data()}; + int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC, nmodeD}; + const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data(), extentD.data()}; + const int64_t* stridesIn[] = {NULL, NULL, NULL, NULL}; // strides are optional; if no stride is provided, cuTensorNet assumes a generalized column-major data layout + + // Set up tensor network + cutensornetNetworkDescriptor_t descNet; + HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle, + numInputs, numModesIn, extentsIn, stridesIn, modesIn, NULL, + nmodeR, extentR.data(), /*stridesOut = */NULL, modesR.data(), + typeData, typeCompute, + &descNet) ); + + if(verbose) + printf("Initialized the cuTensorNet library and created a tensor network descriptor\n"); + + // Sphinx: #5 + 
/******************************* + * Choose workspace limit based on available resources. + *******************************/ + + size_t freeMem, totalMem; + HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) ); + uint64_t workspaceLimit = (uint64_t)((double)freeMem * 0.9); + if(verbose) + printf("Workspace limit = %lu\n", workspaceLimit); + + // Sphinx: MPI #6 [begin] + + /******************************* + * Activate distributed (parallel) execution prior to + * calling contraction path finder and contraction executor + *******************************/ + // HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) ); // resets back to serial execution + MPI_Comm cutnComm; + HANDLE_MPI_ERROR( MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm) ); // duplicate MPI communicator + HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)) ); + if(verbose) + printf("Reset distributed MPI configuration\n"); + + // Sphinx: MPI #6 [end] + + /******************************* + * Find "optimal" contraction order and slicing (in parallel) + *******************************/ + + cutensornetContractionOptimizerConfig_t optimizerConfig; + HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) ); + + // Set the desired number of hyper-samples (defaults to 0) + int32_t num_hypersamples = 8; + HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle, + optimizerConfig, + CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES, + &num_hypersamples, + sizeof(num_hypersamples)) ); + + // Create contraction optimizer info and find an optimized contraction path + cutensornetContractionOptimizerInfo_t optimizerInfo; + HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) ); + + HANDLE_ERROR( cutensornetContractionOptimize(handle, + descNet, + optimizerConfig, + workspaceLimit, + optimizerInfo) ); + + // Query the number of slices the tensor network execution will be split 
into + int64_t numSlices = 0; + HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute( + handle, + optimizerInfo, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES, + &numSlices, + sizeof(numSlices)) ); + assert(numSlices > 0); + + if(verbose) + printf("Found an optimized contraction path using cuTensorNet optimizer\n"); + + // Sphinx: #6 + /******************************* + * Create workspace descriptor, allocate workspace, and set it. + *******************************/ + + cutensornetWorkspaceDescriptor_t workDesc; + HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) ); + + uint64_t requiredWorkspaceSize = 0; + HANDLE_ERROR( cutensornetWorkspaceComputeContractionSizes(handle, + descNet, + optimizerInfo, + workDesc) ); + + HANDLE_ERROR( cutensornetWorkspaceGetSize(handle, + workDesc, + CUTENSORNET_WORKSIZE_PREF_MIN, + CUTENSORNET_MEMSPACE_DEVICE, + &requiredWorkspaceSize) ); + + void* work = nullptr; + HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) ); + + HANDLE_ERROR( cutensornetWorkspaceSet(handle, + workDesc, + CUTENSORNET_MEMSPACE_DEVICE, + work, + requiredWorkspaceSize) ); + + if(verbose) + printf("Allocated and set up the GPU workspace\n"); + + // Sphinx: #7 + /******************************* + * Initialize the pairwise contraction plan (for cuTENSOR). + *******************************/ + + cutensornetContractionPlan_t plan; + HANDLE_ERROR( cutensornetCreateContractionPlan(handle, + descNet, + optimizerInfo, + workDesc, + &plan) ); + + /******************************* + * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel + * for each pairwise tensor contraction. 
+ *******************************/ + cutensornetContractionAutotunePreference_t autotunePref; + HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle, + &autotunePref) ); + + const int numAutotuningIterations = 5; // may be 0 + HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute( + handle, + autotunePref, + CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS, + &numAutotuningIterations, + sizeof(numAutotuningIterations)) ); + + // Modify the plan again to find the best pair-wise contractions + HANDLE_ERROR( cutensornetContractionAutotune(handle, + plan, + rawDataIn_d, + R_d, + workDesc, + autotunePref, + stream) ); + + HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) ); + + if(verbose) + printf("Created a contraction plan for cuTensorNet and optionally auto-tuned it\n"); + + // Sphinx: #8 + /********************** + * Execute the tensor network contraction (in parallel) + **********************/ + + // Create a cutensornetSliceGroup_t object from a range of slice IDs + cutensornetSliceGroup_t sliceGroup{}; + HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) ); + + GPUTimer timer {stream}; + double minTimeCUTENSOR = 1e100; + const int numRuns = 3; // number of repeats to get stable performance results + for (int i = 0; i < numRuns; ++i) + { + HANDLE_CUDA_ERROR( cudaMemcpy(R_d, R, sizeR, cudaMemcpyHostToDevice) ); // restore the output tensor on GPU + HANDLE_CUDA_ERROR( cudaDeviceSynchronize() ); + + /* + * Contract all slices of the tensor network (in parallel) + */ + timer.start(); + + int32_t accumulateOutput = 0; // output tensor data will be overwritten + HANDLE_ERROR( cutensornetContractSlices(handle, + plan, + rawDataIn_d, + R_d, + accumulateOutput, + workDesc, + sliceGroup, // alternatively, NULL can also be used to contract over all slices instead of specifying a sliceGroup object + stream) ); + + // Synchronize and measure best timing + auto time = timer.seconds(); 
+ minTimeCUTENSOR = (time > minTimeCUTENSOR) ? minTimeCUTENSOR : time; + } + + if(verbose) + printf("Contracted the tensor network, each slice used the same contraction plan\n"); + + // Print the 1-norm of the output tensor (verification) + HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) ); + HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host + double norm1 = 0.0; + for (int64_t i = 0; i < elementsR; ++i) { + norm1 += std::abs(R[i]); + } + if(verbose) + printf("Computed the 1-norm of the output tensor: %e\n", norm1); + + /*************************/ + + // Query the total Flop count for the tensor network contraction + double flops {0.0}; + HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute( + handle, + optimizerInfo, + CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT, + &flops, + sizeof(flops)) ); + + if(verbose) { + printf("Number of tensor network slices = %ld\n", numSlices); + printf("Tensor network contraction time (ms) = %.3f\n", minTimeCUTENSOR * 1000.f); + } + + // Free cuTensorNet resources + HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) ); + HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) ); + HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) ); + HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) ); + HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) ); + HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) ); + HANDLE_ERROR( cutensornetDestroy(handle) ); + + // Free Host memory resources + if (R) free(R); + if (D) free(D); + if (C) free(C); + if (B) free(B); + if (A) free(A); + + // Free GPU memory resources + if (work) cudaFree(work); + if (R_d) cudaFree(R_d); + if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]); + if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]); + if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]); + if (rawDataIn_d[3]) cudaFree(rawDataIn_d[3]); + + // Sphinx: MPI #7 [begin] + + // Shut down MPI 
service + HANDLE_MPI_ERROR( MPI_Finalize() ); + + // Sphinx: MPI #7 [end] + + if(verbose) + printf("Freed resources and exited\n"); + + return 0; +}