ia32-64/x86/tdpbssd.tdpbsud.tdpbusd.tdpbuud.html

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:svg="http://www.w3.org/2000/svg" xmlns:x86="http://www.felixcloutier.com/x86"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><link rel="stylesheet" type="text/css" href="style.css"></link><title>TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD
		— Dot Product of Signed/Unsigned Bytes with DwordAccumulation</title></head><body><header><nav><ul><li><a href='index.html'>Index</a></li><li>December 2023</li></ul></nav></header><h1>TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD
		— Dot Product of Signed/Unsigned Bytes with DwordAccumulation</h1>


<table>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th></tr>
<tr>
<td>VEX.128.F2.0F38.W0 5E 11:rrr:bbb TDPBSSD tmm1, tmm2, tmm3</td>
<td>A</td>
<td>V/N.E.</td>
<td>AMX-INT8</td>
<td>Matrix multiply signed byte elements from tmm2 by signed byte elements from tmm3 and accumulate the dword elements in tmm1.</td></tr>
<tr>
<td>VEX.128.F3.0F38.W0 5E 11:rrr:bbb TDPBSUD tmm1, tmm2, tmm3</td>
<td>A</td>
<td>V/N.E.</td>
<td>AMX-INT8</td>
<td>Matrix multiply signed byte elements from tmm2 by unsigned byte elements from tmm3 and accumulate the dword elements in tmm1.</td></tr>
<tr>
<td>VEX.128.66.0F38.W0 5E 11:rrr:bbb TDPBUSD tmm1, tmm2, tmm3</td>
<td>A</td>
<td>V/N.E.</td>
<td>AMX-INT8</td>
<td>Matrix multiply unsigned byte elements from tmm2 by signed byte elements from tmm3 and accumulate the dword elements in tmm1.</td></tr>
<tr>
<td>VEX.128.NP.0F38.W0 5E 11:rrr:bbb TDPBUUD tmm1, tmm2, tmm3</td>
<td>A</td>
<td>V/N.E.</td>
<td>AMX-INT8</td>
<td>Matrix multiply unsigned byte elements from tmm2 by unsigned byte elements from tmm3 and accumulate the dword elements in tmm1.</td></tr></table>
<h2 id="instruction-operand-encoding">Instruction Operand Encoding<a class="anchor" href="#instruction-operand-encoding">
			¶
		</a></h2>
<table>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th></tr>
<tr>
<td>A</td>
<td>N/A</td>
<td>ModRM:reg (r, w)</td>
<td>ModRM:r/m (r)</td>
<td>VEX.vvvv (r)</td>
<td>N/A</td></tr></table>
<h2 id="description">Description<a class="anchor" href="#description">
			¶
		</a></h2>
<p>For each possible combination of (row of tmm2, column of tmm3), the instruction performs a set of SIMD dot-products on all corresponding four byte elements, one from tmm2 and one from tmm3, adds the results of those dot-products, and then accumulates the result into the corresponding row and column of tmm1. Each dword in input tiles tmm2 and tmm3 is interpreted as four byte elements. These may be signed or unsigned. Each letter in the two-letter pattern SU, US, SS, UU indicates the signed/unsigned nature of the values in tmm2 and tmm3, respectively.</p>
<p>Any attempt to execute the TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD instructions inside an Intel TSX transaction will result in a transaction abort.</p>
<h2 id="operation">Operation<a class="anchor" href="#operation">
			¶
		</a></h2>
<pre>define DPBD(c,x,y):// arguments are dwords
    if *x operand is signed*:
        extend_src1 := SIGN_EXTEND
    else:
        extend_src1 := ZERO_EXTEND
    if *y operand is signed*:
        extend_src2 := SIGN_EXTEND
    else:
        extend_src2 := ZERO_EXTEND
    p0dword := extend_src1(x.byte[0]) * extend_src2(y.byte[0])
    p1dword := extend_src1(x.byte[1]) * extend_src2(y.byte[1])
    p2dword := extend_src1(x.byte[2]) * extend_src2(y.byte[2])
    p3dword := extend_src1(x.byte[3]) * extend_src2(y.byte[3])
    c := c + p0dword + p1dword + p2dword + p3dword
</pre>
<h3 id="tdpbssd--tdpbsud--tdpbusd--tdpbuud-tsrcdest--tsrc1--tsrc2--register-only-version-">TDPBSSD, TDPBSUD, TDPBUSD, TDPBUUD tsrcdest, tsrc1, tsrc2 (Register Only Version)<a class="anchor" href="#tdpbssd--tdpbsud--tdpbusd--tdpbuud-tsrcdest--tsrc1--tsrc2--register-only-version-">
			¶
		</a></h3>
<pre>// C = m x n (tsrcdest), A = m x k (tsrc1), B = k x n (tsrc2)
tsrc1_elements_per_row := tsrc1.colsb / 4
tsrc2_elements_per_row := tsrc2.colsb / 4
tsrcdest_elements_per_row := tsrcdest.colsb / 4
for m in 0 ... tsrcdest.rows-1:
    tmp := tsrcdest.row[m]
    for k in 0 ... tsrc1_elements_per_row-1:
        for n in 0 ... tsrcdest_elements_per_row-1:
            DPBD( tmp.dword[n], tsrc1.row[m].dword[k], tsrc2.row[k].dword[n] )
    write_row_and_zero(tsrcdest, m, tmp, tsrcdest.colsb)
zero_upper_rows(tsrcdest, tsrcdest.rows)
zero_tilecfg_start()
</pre>
<h2 id="intel-c-c++-compiler-intrinsic-equivalent">Intel C/C++ Compiler Intrinsic Equivalent<a class="anchor" href="#intel-c-c++-compiler-intrinsic-equivalent">
			¶
		</a></h2>
<pre>TDPBSSD void _tile_dpbssd(__tile dst, __tile src1, __tile src2);
</pre>
<pre>TDPBSUD void _tile_dpbsud(__tile dst, __tile src1, __tile src2);
</pre>
<pre>TDPBUSD void _tile_dpbusd(__tile dst, __tile src1, __tile src2);
</pre>
<pre>TDPBUUD void _tile_dpbuud(__tile dst, __tile src1, __tile src2);
</pre>
<h2 id="flags-affected">Flags Affected<a class="anchor" href="#flags-affected">
			¶
		</a></h2>
<p>None.</p>
<h2 class="exceptions" id="exceptions">Exceptions<a class="anchor" href="#exceptions">
			¶
		</a></h2>
<p>AMX-E4; see Section 2.10, “Intel® AMX Instruction Exception Classes,” for details.</p><footer><p>
		This UNOFFICIAL, mechanically-separated, non-verified reference is provided for convenience, but it may be
		inc<span style="opacity: 0.2">omp</span>lete or b<sub>r</sub>oke<sub>n</sub> in various obvious or non-obvious
		ways. Refer to <a href="https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4">Intel® 64 and IA-32 Architectures Software Developer’s Manual</a> for anything serious.
	</p></footer></body></html>