Skip to content

Add array built-in functions to the spec #7054

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 61 additions & 1 deletion doc/plutus-core-spec/cardano/builtins6.tex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should probably be a paragraph at the start of the section saying that we've added a new type array(t) and saying what its denotation is: see Tables 4.7 and 4.8 on page 29. The question is what should ⟦array(t)⟧ be? Making it ⟦t⟧*, the set of all tuples of elements of the denotation of t is perfectly sensible even though that's the same as the denotation of list(t), since the only observable difference is in the operations we provide and their complexity. We could introduce some extra notation to distinguish tuples representing arrays from those representing lists, but I don't know if that's really necessary. Maybe an explanatory sentence when introducing the array type operator would be just as good.

Copy link
Contributor Author

@Unisay Unisay Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a Batch 6 sub-sub-section for the array type operator and its concrete syntax.
image

Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,39 @@
\subsection{Batch 6}
\label{sec:default-builtins-6}

\subsubsection{Built-in type operators}
\label{sec:built-in-type-operatos-6}
The sixth batch adds type operators defined in Table~\ref{table:built-in-type-operators-6}.

\begin{table}[H]
\centering
\begin{tabular}{|l|p{14mm}|l|l|}
\hline
Operator $\op$ & $\left|\op\right|$ & Denotation & Concrete Syntax\\
\hline
\texttt{array}
& 1
& $\denote{\arrayOf{t}} = \denote{t}^*$
& See below\\
\hline
\end{tabular}
\caption{Type operators, Batch 6}
\label{table:built-in-type-operators-6}
\end{table}

\paragraph{Concrete syntax for arrays.}
An array of type $\texttt{array}(t)$ is written as a syntactic list
\texttt{[$c_1, \ldots, c_n$]} where each $c_i$ lies in $\bitc_t$.
Some valid constant expressions are thus:
\begin{verbatim}
(con (array bool) [True, False, True])
(con (array integer) [])
\end{verbatim}

\subsubsection{Built-in functions}
\label{sec:built-in-functions-6}
The sixth batch of builtin operations is defined in Table~\ref{table:built-in-functions-6}.
The sixth batch of built-in types is defined in Table~\ref{table:built-in-types-6}.
Operations are defined in Table~\ref{table:built-in-functions-6}.

\setlength{\LTleft}{-12mm} % Shift the table left a bit to centre it on the page
\renewcommand*{\arraystretch}{1.25} %% Stretch the space between the rows by a factor of 25%
Expand Down Expand Up @@ -63,6 +93,36 @@ \subsubsection{Built-in functions}
[x_{k+1}, \ldots, x_n] & \text{if $1 \leq k \leq n-1$} \\ \relax
[] &\text{if $k \geq n$}\\
\end{array}\right.$} & & \\
\TT{lengthOfArray}
& $[\forall a_\#, \arrayOf{a_\#}] \to \ty{integer}$
& $[] \mapsto 0$
\newline
$[x_1] \mapsto 1$
\newline
$[x_1,\ldots,x_n] \mapsto n$
&
& \\
\TT{listToArray}
& $[\forall a_\#, \listOf{a_\#}] \to \arrayOf{a_\#}$
& $[] \mapsto []$
\newline
$[x_1] \mapsto [x_1]$
\newline
$[x_1,\ldots,x_n] \mapsto [x_1,\ldots,x_n]$
Comment on lines +107 to +111
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a better denotation. Maybe we should use a different syntax to denotate arrays? @effectfully please suggest, what would you do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the whole table looks when rendered:
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, although I think you could miss out the two [x₁] ↦ ... cases since they're covered by [x₁, ..., xₙ] with n=1 (and arguably the [] case is covered by n=0, but that might be a bit confusing).

Also,

[x₁, ..., xₙ] ↦ [x₁, ..., xₙ]

might look a bit weird,but on page 33 we've already got (x,y) ↦ (x,y) for the denotation of mkPairData, so its maybe not such a big deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could miss out the two [x₁] ↦ ... cases since they're covered by [x₁, ..., xₙ] with n=1 (and arguably the [] case is covered by n=0, but that might be a bit confusing).

Yeah, I wasn't sure if [x₁, ..., xₙ] ↦ [x₁, ..., xₙ] subsumes [x₁] ↦ [x₁] or even [] ↦ [] so I went for more explicit notation. Happy to simplify it if you think that there is an ambiguity.

&
& \\
\TT{indexArray}
& $[\forall a_\#, \arrayOf{a_\#}, \ty{integer}]$ \text{$\;\;\; \to \ty{a_\#}$}
& $([x_1,\ldots,x_n], k)$
\smallskip
\newline
\text{$\;\;\mapsto \left\{ \begin{array}{ll}
\errorX & \text{if $k < 0$} \\ \relax %
x_{k+1} & \text{if $0 \leq k \leq n-1$} \\ \relax
\errorX & \text{if $k > n-1$}\\
\end{array}\right.$}
& Yes
& \\
\hline
\end{longtable}

Expand Down
51 changes: 49 additions & 2 deletions doc/plutus-core-spec/flat-serialisation.tex
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ \chapter{Serialising Plutus Core Terms and Programs Using the \texttt{flat} Form


\section{Encoding and decoding}
\label{sec:encoding-and-decoding}
Firstly recall some notation from Section~\ref{sec:notation}. The set of all
finite sequences of bits is denoted by $\Bits = \{\bits{0},\bits{1}\}^*$. For
brevity we write a sequence of bits in the form $b_{n-1} \cdots b_0$ instead of
Expand Down Expand Up @@ -79,7 +80,7 @@ \section{Encoding and decoding}
\subsection{Padding}
The encoding functions mentioned above produce sequences of \textit{bits}, but
we sometimes need sequences of \textit{bytes}. To this end we introduce a
functions $\pad: \Bits \rightarrow \Bits$ which adds a sequence of $\bits{0}$s
function $\pad: \Bits \rightarrow \Bits$ which adds a sequence of $\bits{0}$s
followed by a $\bits{1}$ to a sequence $s$ to get a sequence whose length is a
multiple of 8; if $s$ is a sequence such that $\length(s)$ is already a multiple of 8
then $\pad$ still adds an extra byte of padding; $\pad$ is used both for
Expand Down Expand Up @@ -155,6 +156,44 @@ \subsection{Lists}
\Dlist_X(\bits{1} \cdot s) &= (s'', x \cdot l) \quad \text{if $D_X(s) = (s', x)$ and $\Dlist_X(s') = (s'', l).$}
\end{align*}

\subsection{Arrays}
Arrays are encoded as a sequence of blocks of up to 255 elements.
Each block is preceded by a single unsigned byte containing the count of elements in the block,
and the end of the encoding is marked by a zero-length block.

Let $X$ be the set of elements in the array for which we have defined an encoder $\E_X$ and a
decoder $\D_X$. We define an encoder $\Earray_X$ which encodes arrays of elements of $X$:

$$
\Earray_X(s, [x_1, \ldots, x_n]) =
\begin{cases}
s \cdot \E_8(0) & \text{if } n = 0 \\
\text{let } k = \min(n, 255) \text{ in } \\
\Earray_X(s \cdot \E_8(k) \cdot \E_X(x_1) \cdots \E_X(x_k), [x_{k+1}, \ldots, x_n])
& \text{if } n > 0
\end{cases}
$$

The corresponding decoder is given by:

$$
\Darray_X(s) =
\begin{cases}
(s', []) & \text{if } \D_8(s) = (s', 0) \\
\text{let } (s', k) = \D_8(s) \text{ in } \\
\text{let } (s'', x_1) = \D_X(s') \text{ and } \ldots \text{ and } (s^{(k)}, x_k) = \D_X(s^{(k-1)}) \text{ in } \\
\text{let } (s''', l) = \Darray_X(s^{(k)}) \text{ in } \\
(s''', [x_1, \ldots, x_k] \cdot l) & \text{if } 1 \leq k \leq 255
\end{cases}
$$

The decoder reads the initial byte to determine the length of the subsequent block.
If the length $k$ is greater than 0 and no more bytes are available to decode $k$
elements of type $X$, the decoder would fail as per the general behaviour of decoders
described in Section~\ref{sec:encoding-and-decoding}.
The concatenation of the decoded block $[x_1, \ldots, x_k]$ with the recursively
decoded list $l$ (from the remainder $s^{(k)}$) forms the final result.

\subsection{Natural numbers}
We encode natural numbers by splitting their binary representations into
sequences of 7-bit blocks, then emitting these as a list with the \textbf{least
Expand Down Expand Up @@ -443,6 +482,7 @@ \subsection{Built-in types}
$\ty{bls12\_381\_G1\_element}$ & $\bits{1001}$ & 9 \\
$\ty{bls12\_381\_G2\_element}$ & $\bits{1010}$ & 10 \\
$\ty{bls12\_381\_MlResult}$ & $\bits{1011}$ & 11 \\
$\ty{array}$ & $\bits{1100}$ & 12 \\
\hline
\end{tabular}
\caption{Type tags}
Expand All @@ -465,6 +505,7 @@ \subsection{Built-in types}
&\etype(\ty{unit}) &&= [3] \\
&\etype(\ty{bool}) &&= [4] \\
&\etype(\listOf{t}) &&= [7,5] \cdot \etype(t) \\
&\etype(\arrayOf{t}) &&= [7,12] \cdot \etype(t) \\
&\etype(\pairOf{t_1}{t_2}) &&= [7,7,6] \cdot \etype(t_1) \cdot \etype(t_2)\\
&\etype(\ty{data}) &&= [8].
\end{alignat*}
Expand All @@ -476,6 +517,7 @@ \subsection{Built-in types}
&\dtype(3 \cdot l) &&= (l, \ty{unit}) \\
&\dtype(4 \cdot l) &&= (l, \ty{bool}) \\
&\dtype([7,5] \cdot l) &&= (l', \listOf{t}) &&\quad \text{if $\dtype(l) = (l', t)$}\\
&\dtype([7,12] \cdot l) &&= (l', \arrayOf{t}) &&\quad \text{if $\dtype(l) = (l', t)$}\\
&\dtype([7,7,6] \cdot l) &&= (l'', \pairOf{t_1}{t_2})
&&\ \begin{cases}
\text{if} & \dtype(l) = (l', t_1)\\
Expand Down Expand Up @@ -528,7 +570,8 @@ \subsection{Constants}
& \Econstant{\ty{bool}}(s, \texttt{False}) &&= s \cdot \bits{0}\\
& \Econstant{\ty{bool}}(s, \texttt{True}) &&= s \cdot \bits{1}\\
& \Econstant{\listOf{\tn}}(s,l) &&= \Elist^{\tn}_{\mathsf{constant}}(s, l) \\
& \Econstant{\pairOf{\tn_1}{\tn_2}}(s,(c_1,c_2)) &&= \Econstant{\tn_2}(\Econstant{\tn_1}(s, c_1), c_2)\\
& \Econstant{\arrayOf{\tn}}(s,l) &&= \Earray^{\tn}_{\mathsf{constant}}(s, l) \\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that flat encodes arrays in a different way from lists. I have a program here that'll take a flat file and decode it, describing what the bits of the encoding mean. I wrote that to help me understand the flat encoding so that I could write a proper specifcation of the parts that we use. I don't quite remember why, but I added the array operations a few weeks ago and had to write some code to decode arrays. After a bit of trial and error, it seemed that arrays are encoded by emitting a natural number that says how many elements it has, then emitting the encodings of the individual elements, and then for some reason emitting 8 zero bits afterwards. I didn't actually check the implementation though, so maybe you should have a look and see if that's correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look but I couldn''t work out how the flat encoding for arrays is determined. I checked again and arrays are definitely encoded differently from lists.

Copy link
Contributor Author

@Unisay Unisay Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The haddock for the Vector instance says

Vectors are encoded as arrays.

but that seems to be very misleading. The Haskell Array type is more like some kind of association list and you can see that in the examples for the Array instance: it's encoded as a sequence of (index,element) pairs. That's definitely not what it's doing with UPLC arrays. It seems to be doing something much more like what you'd do with standard arrays. In the source for the flat library it has

instance Flat a => Flat (V.Vector a) where
  size   = sizeSequence
  encode = encodeSequence
  decode = decodeSequence

and in the the file that defines `encodeSequence it says

In practice, this means that the sequence is encoded as a sequence of blocks of up to 255 elements, with every block preceded by the count of the elements in the block and a final 0-length block.

This looks much more like what I observed, and my flat debugger fails when I give it an array with more than 255 elements because I assumed that there was only one block. I think (but haven't checked in detail) that it's encoding an array as a sequence of blocks, each one consisting of a size (eight bits long?) followed by that many elements, and then the 00000000 at the end is an empty block that tells you thhat you've finished. Presumably that means that you can serialise an array in many different ways depending on how you break it into blocks. I guess we''ll have work out the details and put it all into the specification. I think something similar happens in CBOR and that's described in our specification, so maybe we could steal that, or at least a modified version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, my experiments show that Strict.Vectors are encoded as sequences:

#!/usr/bin/env runhaskell

import Data.Bits (testBit)
import Data.ByteString (ByteString, unpack)
import Data.Vector.Strict (Vector)
import Data.Vector.Strict qualified as StrictVector
import Data.Word (Word8)
import Flat (DecodeException, Flat (..), flat, unflat)
import Flat.Instances.Test (tstBits)
import Flat.Instances.Vector ()

main :: IO ()
main = do
  let
    words :: Vector Word8
    words = StrictVector.fromList [1, 2, 3, 4, 5]

    bytes :: ByteString
    bytes = flat words

  print $ tstBits words
  -- (True,56,"00000101 00000001 00000010 00000011 00000100 00000101 00000000")

  putStrLn $ renderBits bytes
  -- 00000101 00000001 00000010 00000011 00000100 00000101 00000000 00000001

  let
    words' :: Either DecodeException (Vector Word8)
    words' = unflat bytes

  print words' -- Right [1,2,3,4,5]

renderBits :: ByteString -> String
renderBits bs = unwords [byteToBits w | w <- unpack bs]
 where
  byteToBits :: Word8 -> String
  byteToBits byte = [if testBit byte i then '1' else '0' | i <- [7, 6 .. 0]]

-- from module Data.Vector.Orphans () in the Plutus repo
instance (Flat a) => Flat (Vector a) where
  size = size . StrictVector.toLazy -- Strict to Lazy is O(1)
  encode = encode . StrictVector.toLazy
  decode = StrictVector.fromLazy <$> decode -- Strict from Lazy is O(1)

I interpret the encoded bits as:

00000101: size = 5 
00000001: 1
00000010: 2
00000011: 3
00000100: 4
00000101: 5
00000000: end of the vector
00000001: padding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor

@ana-pantilie ana-pantilie Apr 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Unisay Maybe the binary serialisation/deserialisation section of https://cips.cardano.org/cip/CIP-0138 helps. It describes how the flat encoding works. I may have made mistakes but I did spend quite some time on it to try to figure out what the encoding is!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if it turns out that what I wrote in the CIP is wrong, we should amend it.

Copy link
Contributor

@kwxm kwxm Apr 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if it turns out that what I wrote in the CIP is wrong, we should amend it.

It says "The decoding of arrays is, of course, the inverse of the encoding function". I think that may not be exactly correct. For encoding it specifies that you should produce a bunch of initial 255-byte blocks possibly followed by a smaller final one. I suspect (but haven't checked) that the decoder will accept a sequence of blocks of random nonzero lengths (up to 255 bytes). There's a similar thing for bytestrings in flat (see p62 of the specification) and certain items in data: see the first bullet point on p52 and also Section B.5 on p55. See also issue #5852.

& \Econstant{\pairOf{\tn_1}{\tn_2}}(s,(c_1,c_2)) &&= \Econstant{\tn_2}(\Econstant{\tn_1}(s, c_1), c_2)\\
& \Econstant{\ty{data}}(s,d) &&= \E_{\B^*}(s, \eData(d))
\end{alignat*}

Expand All @@ -540,6 +583,7 @@ \subsection{Constants}
&\dConstant{\ty{bool}}(\bits{0} \cdot s) &&= (s, \texttt{False}) \\
&\dConstant{\ty{bool}}(\bits{1} \cdot s) &&= (s, \texttt{True}) \\
&\dConstant{\listOf{\tn}}(s) &&= \Dlist^{\tn}_{\mathsf{constant}}(s, l) \\
&\dConstant{\arrayOf{\tn}}(s) &&= \Darray^{\tn}_{\mathsf{constant}}(s, l) \\
&\dConstant{\pairOf{\tn_1}{\tn_2}}(s) &&= (s'', (c_1, c_2))
&& \begin{cases}
\text{if} & \dConstant{\tn_1}(s) = (s', c_1) \\
Expand Down Expand Up @@ -699,6 +743,9 @@ \subsection{Built-in functions}
\TT{caseList} & $\bits{1011000}$ & 88 \\
\TT{caseData} & $\bits{1011001}$ & 89 \\
\TT{dropList} & $\bits{1011010}$ & 90 \\
\TT{lengthOfArray} & $\bits{1011011}$ & 91 \\
\TT{listToArray} & $\bits{1011100}$ & 92 \\
\TT{indexArray} & $\bits{1011101}$ & 93 \\
\hline
\end{tabular}
\caption{Tags for built-in functions (Batch 6)}
Expand Down
3 changes: 3 additions & 0 deletions doc/plutus-core-spec/header.tex
Original file line number Diff line number Diff line change
Expand Up @@ -273,6 +273,7 @@
%% \newcommand\uniqmem[1]{#1^{\blacktriangledown}} %% Unique member of size type

\newcommand\listOf[1]{\mathtt{list}\,\mathtt{(}#1\mathtt{)}}
\newcommand\arrayOf[1]{\mathtt{array}\,\mathtt{(}#1\mathtt{)}}
\newcommand\pairOf[2]{\mathtt{pair}\,\mathtt{(}#1\mathtt{,}\,#2\mathtt{)}}


Expand Down Expand Up @@ -406,6 +407,7 @@

\newcommand\E{\mathsf{E}} % Encoders
\newcommand\Elist{\overrightarrow{\mathsf{E}}}
\newcommand\Earray{\overrightarrow{\mathsf{E}}}
\newcommand\Eblock{\mathsf{E}_{\mathsf{block}}}

\newcommand\encode[1]{\E_{\mathsf{#1}}}
Expand All @@ -422,6 +424,7 @@

\newcommand\D{\mathsf{D}} % Decoders
\newcommand\Dlist{\overrightarrow{\mathsf{D}}}
\newcommand\Darray{\overrightarrow{\mathsf{D}}}

\newcommand\decode[1]{\D_{\mathsf{#1}}}
\newcommand\decodelist[1]{\Dlist_{\mathsf{#1}}}
Expand Down