# [0031] Preliminary Draft of the HLSL changes for CoopVec #432

Changes to `proposals/0031-hlsl-vector-matrix-operations.md`:
## Instructions

- Proposal: [0031](0031-hlsl-vector-matrix-operations.md)
- Author(s): [Damyan Pepper][damyanp], [Chris Bieneman][llvm-beanz],
  [Anupama Chandrasekhar][anupamachandra]
- Sponsor: [Damyan Pepper][damyanp]
- Status: **Under Consideration**
- Planned Version: Shader Model 6.9

[damyanp]: https://github.com/damyanp
[llvm-beanz]: https://github.com/llvm-beanz
[anupamachandra]: https://github.com/anupamachandra

## Introduction

This proposes a set of HLSL APIs that enable the use of the hardware-accelerated
vector/matrix operations described in [0029].

[0029]: 0029-cooperative-vector.md

## Motivation

Modern GPUs have dedicated silicon to accelerate matrix operations, but HLSL
doesn't provide a mechanism to easily utilize these units. Evaluation of
matrix-vector operations (multiply, muladd, accumulation) in HLSL was
previously scalarized at the DXIL level, making it hard to employ these
specialized units. This proposal builds on the "long vectors" feature described
in [0026], providing a mechanism to express matrix-vector operations in HLSL
that can be lowered to the DXIL operations described in [0029]; these
primitives provide the right level of abstraction for hardware acceleration.

[0026]: 0026-hlsl-long-vector-type.md

## Proposed solution

We introduce the `dx::linalg` namespace, which exposes functions for the new
matrix-vector operations:

* **Matrix-Vector Multiply:** Multiply a matrix in memory by a vector
  parameter.
* **Matrix-Vector Multiply-Add:** Multiply a matrix in memory by a vector
  parameter and add a vector loaded from memory.
* **Vector-Vector Outer Product and Accumulate:** Compute the outer product of
  two vectors and accumulate the result matrix atomically-elementwise in
  memory.
* **Reduce and Accumulate:** Accumulate elements of a vector
  atomically-elementwise to corresponding elements in memory.


## Detailed Design

### `dx::linalg::MatrixRef`

`MatrixRef` is a wrapper class that represents a matrix stored in a
(RW)ByteAddressBuffer, along with its type, dimensions, layout, start offset,
and stride.

#### Syntax

```c++
namespace dx {
namespace linalg {

template <TypeInterpretation Interpretation, uint M, uint K,
          MatrixLayout Layout>
class MatrixRef {
  RWByteAddressBuffer Buffer;
  uint Stride;
  uint StartOffset;
};

} // namespace linalg
} // namespace dx
```

> Note: we need to support both `RWByteAddressBuffer` and `ByteAddressBuffer`
> if we want to use `MatrixRef` for both `Mul` and `OuterProductAccumulate`.
> How do we do this?

> **Review comment (Collaborator):** Curious for @tex3d's feedback here, but I
> wonder if we should have `MatrixRef` and `RWMatrixRef` as separate types to
> clarify usage and align with existing HLSL conventions? The other approaches
> I can think of aren't particularly clean and would complicate our future
> desire to support separate compilation.


#### Arguments

##### Template parameters

* **Interpretation**: Describes the type of the values in the buffer. See the
  [Type Interpretation] section for details.

* **M x K**: The matrix dimensions.

* **Layout**: Specifies the layout of the matrix. See the [Matrix Layouts]
  section for details.

##### Member Variables

The matrix is loaded from a raw buffer **Buffer**, starting at **StartOffset**.
For row-major and column-major layouts, **Stride** specifies the number of
bytes from one row/column to the next. For optimal layouts, **Stride** is
ignored.

The base address of **Buffer** and the **StartOffset** must be 64-byte aligned.

The **Stride** must be 16-byte aligned.
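
For illustration, here is a minimal sketch of how a shader might bind a
`MatrixRef` to a buffer region. The enumerator values
`TypeInterpretation::F16` and `MatrixLayout::RowMajor` are assumptions, since
the [Type Interpretation] and [Matrix Layouts] sections are still to be filled
in:

```c++
// Sketch only: TypeInterpretation::F16 and MatrixLayout::RowMajor are
// assumed enumerator values, pending the sections below.
RWByteAddressBuffer WeightsBuffer : register(u0);

void BindWeights() {
  // A 32x64 matrix of 16-bit floats, stored row-major at byte offset 0.
  // StartOffset (0) is 64-byte aligned; Stride is one row of 64 half
  // values = 128 bytes, which satisfies the 16-byte alignment rule.
  dx::linalg::MatrixRef<dx::linalg::TypeInterpretation::F16, 32, 64,
                        dx::linalg::MatrixLayout::RowMajor>
      Weights = {WeightsBuffer, /*Stride=*/128, /*StartOffset=*/0};
}
```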

### `dx::linalg::VectorRef`

`VectorRef` is a wrapper class that represents a vector stored in a
ByteAddressBuffer, specifying its type and start offset.

> TODO: Needs a length/size parameter?

```c++
namespace dx {
namespace linalg {

template <TypeInterpretation Interpretation> class VectorRef {
  RWByteAddressBuffer Buffer;
  uint StartOffset;
};

} // namespace linalg
} // namespace dx
```

#### Arguments

##### Template Parameters

* **Interpretation**: Describes the type of the values in the buffer. See the
  [Type Interpretation] section for details.

##### Member Variables

The vector is loaded from a raw buffer **Buffer**, starting at
**StartOffset**.

The base address of **Buffer** and the **StartOffset** must be 64-byte
aligned.
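
A corresponding sketch for `VectorRef`, again with `TypeInterpretation::F16`
as an assumed enumerator value (and noting the open TODO about a length
parameter):

```c++
// Sketch only: a bias vector of 16-bit floats starting at byte offset 256,
// which is 64-byte aligned as required.
RWByteAddressBuffer BiasBuffer : register(u1);

void BindBias() {
  dx::linalg::VectorRef<dx::linalg::TypeInterpretation::F16> Bias =
      {BiasBuffer, /*StartOffset=*/256};
}
```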

### `dx::linalg::InterpretedVector`

> **Review comment (Member):** It's not clear to me that this is actually
> necessary. For input types that are converted to a destination type prior to
> math, specifying the conversion target as an "interpretation" is confusing -
> and also seems unnecessary given the lack of precision requirements. For
> input types that are bit-packed, HLSL has an explicit scalar type to use
> instead, e.g. `uint8_t4_packed`, which could then be composed as a native
> vector type without this kind of wrapper.
`InterpretedVector` is a wrapper class that represents a native vector
`vector<T, N>` together with an interpretation that determines the actual type
the vector will be interpreted as.

```c++
namespace dx {
namespace linalg {

template <typename T, uint N, TypeInterpretation Interpretation>
class InterpretedVector {
  vector<T, N> vec;
};

} // namespace linalg
} // namespace dx
```

#### Arguments

##### Template Parameters

* **T**: The declared element type of **vec**.
* **N**: The declared length of **vec**.
* **Interpretation**: Allows functions operating on these vectors to interpret
  the vector as a type different from its declared type. Depending on the
  value, the type conversion may be arithmetic or a bitcast. See the [Type
  Interpretation] section for more details. The interpreted type also
  determines the actual number of elements in the vector, which may differ
  from **N** for packed types.

##### Member Variables

A native vector **vec** of type **T** and size **N**.
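
As a sketch, wrapping a declared vector with an interpretation might look like
the following; `TypeInterpretation::F16` is an assumed enumerator value:

```c++
// Sketch only: a float vector whose elements should be treated as 16-bit
// floats by the linalg functions (an arithmetic conversion in this case).
void WrapVector() {
  vector<float, 32> v = (vector<float, 32>)0; // stand-in for computed data
  dx::linalg::InterpretedVector<float, 32,
                                dx::linalg::TypeInterpretation::F16>
      iv = {v};
}
```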

### `dx::linalg::Mul` and `dx::linalg::MulAdd`

The `dx::linalg::Mul` function multiplies a matrix by an input vector. The
matrix is loaded from memory, while the vector is stored in a variable.

The `dx::linalg::MulAdd` function behaves like `dx::linalg::Mul`, but also
adds a bias vector (loaded from memory) to the result.

#### Syntax

```c++
namespace dx {
namespace linalg {

template <TypeInterpretation MatrixInterpretation, uint M, uint K,
          MatrixLayout Layout, typename InputType, uint InputNumComp,
          TypeInterpretation InputInterpretation, typename ResultType,
          bool MatrixNeedsTranspose>
vector<ResultType, M>
Mul(MatrixRef<MatrixInterpretation, M, K, Layout> WeightMatrix,
    InterpretedVector<InputType, InputNumComp, InputInterpretation>
        InputVector);

} // namespace linalg
} // namespace dx

```

```c++
namespace dx {
namespace linalg {

template <TypeInterpretation MatrixInterpretation, uint M, uint K,
          MatrixLayout Layout, TypeInterpretation BiasInterpretation,
          typename InputType, uint InputNumComp,
          TypeInterpretation InputInterpretation, typename ResultType>
vector<ResultType, M>
MulAdd(MatrixRef<MatrixInterpretation, M, K, Layout> WeightMatrix,
       InterpretedVector<InputType, InputNumComp, InputInterpretation>
           InputVector,
       VectorRef<BiasInterpretation> BiasVector);

} // namespace linalg
} // namespace dx
```

#### Arguments

* **WeightMatrix**: The matrix multiplicand, loaded from a raw buffer.

* **InputVector**: The vector multiplicand.

* **BiasVector**: A vector, loaded from a raw buffer, that is added to the
  result of the matrix-vector multiply.
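
Tying these together, a hedged sketch of one fully-connected layer follows. It
reuses the assumed enumerator values and the hypothetical 32x64 weight matrix
and bias vector from the earlier sketches; how the non-deducible template
arguments (such as **ResultType**) are supplied is still an open design
question, so the call below assumes they can be deduced or defaulted:

```c++
// Sketch only: result = WeightMatrix * input + bias, with M = 32, K = 64.
RWByteAddressBuffer WeightsBuf : register(u0);
RWByteAddressBuffer BiasBuf : register(u1);

vector<half, 32> EvaluateLayer(vector<half, 64> input) {
  dx::linalg::MatrixRef<dx::linalg::TypeInterpretation::F16, 32, 64,
                        dx::linalg::MatrixLayout::RowMajor>
      Weights = {WeightsBuf, /*Stride=*/128, /*StartOffset=*/0};
  dx::linalg::VectorRef<dx::linalg::TypeInterpretation::F16> Bias =
      {BiasBuf, /*StartOffset=*/0};
  dx::linalg::InterpretedVector<half, 64,
                                dx::linalg::TypeInterpretation::F16>
      iv = {input};
  return dx::linalg::MulAdd(Weights, iv, Bias);
}
```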


### `dx::linalg::OuterProductAccumulate`

The `dx::linalg::OuterProductAccumulate` function computes the outer product
of two vectors and accumulates the result matrix atomically-elementwise in
memory.

#### Syntax

```c++
namespace dx {
namespace linalg {

template <typename T, uint M, uint N, MatrixLayout Layout,
          TypeInterpretation Interpretation>
void OuterProductAccumulate(vector<T, M> InputVector1,
                            vector<T, N> InputVector2,
                            MatrixRef<Interpretation, M, N, Layout> AccMatrix);

} // namespace linalg
} // namespace dx

```

#### Arguments

* **InputVector1**: The first vector operand, of length M.

* **InputVector2**: The second vector operand, of length N.

* **AccMatrix**: The M x N matrix, stored in a raw buffer, to which the outer
  product is accumulated atomically-elementwise.
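
A hedged usage sketch: accumulating the outer product of two activation
vectors into a matrix in memory (for example, a training-style gradient
update). The enumerator values and the `GradBuffer` binding are assumptions:

```c++
// Sketch only: Grad += a * transpose(b), accumulated atomically-elementwise.
RWByteAddressBuffer GradBuffer : register(u2);

void AccumulateGradient(vector<half, 32> a, vector<half, 64> b) {
  dx::linalg::MatrixRef<dx::linalg::TypeInterpretation::F16, 32, 64,
                        dx::linalg::MatrixLayout::RowMajor>
      Grad = {GradBuffer, /*Stride=*/128, /*StartOffset=*/0};
  dx::linalg::OuterProductAccumulate(a, b, Grad);
}
```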

### `dx::linalg::VectorAccumulate`

The `dx::linalg::VectorAccumulate` function accumulates the elements of a
vector atomically-elementwise to the corresponding elements in memory.

#### Syntax

```c++
namespace dx {
namespace linalg {

template <typename T, uint N>
void VectorAccumulate(vector<T, N> InputVector, RWByteAddressBuffer Buffer,
                      uint StartOffset);

} // namespace linalg
} // namespace dx
```

#### Arguments

* **InputVector**: The vector whose elements are accumulated.

* **Buffer**: The raw buffer containing the accumulation targets.

* **StartOffset**: The byte offset in **Buffer** at which the target elements
  begin.
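
A hedged usage sketch; the `AccumBuffer` binding is an assumption:

```c++
// Sketch only: atomically accumulate a per-invocation vector into the 64
// floats starting at byte offset 0 of AccumBuffer.
RWByteAddressBuffer AccumBuffer : register(u3);

void AccumulatePartialSums(vector<float, 64> partialSums) {
  dx::linalg::VectorAccumulate(partialSums, AccumBuffer, /*StartOffset=*/0);
}
```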

### Type Interpretation

> To be filled

### Matrix Layout

> To be filled

First strawman:

> To be fixed

```c++

ByteAddressBuffer inputMatrix0;
ByteAddressBuffer inputMatrix1;
ByteAddressBuffer biasVector0;
ByteAddressBuffer biasVector1;

void ps_main(args) // args: texture, normal, position
{
  PreProcessing(args);

  // Neural Network computes the output vector
  // using the same input args and trained data
  // in the form of matrices and bias vectors.

  // The input vector is computed from the shader input.
  vector<uint32_t, M> inputVector = SomeFunction(args);

  // Below, the physical calculations are replaced by NN evaluation;
  // the matrices and bias vectors are trained offline and loaded to memory.

  // layer0 = inputVector*inputMatrix0 + biasVector0
  // The matrix and bias are loaded from memory at offsets moffset0 and boffset0.
  dx::linalg::MatrixRef inMat0 = {inputMatrix0, moffset0};
  dx::linalg::VectorRef biasV0 = {biasVector0, boffset0};
  vector<uint32_t, K> layer0 = dx::linalg::MulAdd(inputVector, inMat0, biasV0);
  layer0 = max(layer0, 0); // Apply activation function

  // layer1 = layer0*inputMatrix1 + biasVector1
  // The matrix and bias are loaded from memory at offsets moffset1 and boffset1.
  dx::linalg::MatrixRef inMat1 = {inputMatrix1, moffset1};
  dx::linalg::VectorRef biasV1 = {biasVector1, boffset1};
  vector<uint32_t, K> layer1 = dx::linalg::MulAdd(layer0, inMat1, biasV1);
  layer1 = max(layer1, 0); // Apply activation function

  // output = layer1*inputMatrix1 + biasVector1
  vector<uint32_t, N> output = dx::linalg::MulAdd(layer1, inMat1, biasV1);

  output = exp(output);

  color.r = output[0] * args.lightcolor;
  color.g = output[1] * args.lightcolor;
  color.b = output[2] * args.lightcolor;
}
```


## Alternatives considered (Optional)

TBD

## Acknowledgments (Optional)

We would like to thank Jeff Bolz for his contribution to this spec.

<!-- {% endraw %} -->