2018-02-05

ChaCha20 in PowerShell

I couldn't sleep the other night so to pass the time, naturally, I wrote a ChaCha20 implementation in PowerShell.

ChaCha20 is a variant of Salsa20 and promises better input diffusion and, in some cases, a speed improvement.

Given the Salsa20 PowerShell implementation from last week year, you'd want to add a new ChaCha20 function:

Function ChaCha20-QuarterRound {
  Param(
    [Ref] $a,
    [Ref] $b,
    [Ref] $c,
    [Ref] $d
  )
  $a.Value = Salsa20-Sum $a.Value $b.Value; $d.Value = Salsa20-Rotate ($d.Value -bxor $a.Value) 16
  $c.Value = Salsa20-Sum $c.Value $d.Value; $b.Value = Salsa20-Rotate ($b.Value -bxor $c.Value) 12
  $a.Value = Salsa20-Sum $a.Value $b.Value; $d.Value = Salsa20-Rotate ($d.Value -bxor $a.Value) 8
  $c.Value = Salsa20-Sum $c.Value $d.Value; $b.Value = Salsa20-Rotate ($b.Value -bxor $c.Value) 7
}

and then edit your Salsa20-Core logic from this:


for ($i = $Rounds; $i -gt 0; $i-=2) {
  $x4  = $x4  -bxor (Salsa20-Rotate (Salsa20-Sum $x0  $x12) 7)
  $x8  = $x8  -bxor (Salsa20-Rotate (Salsa20-Sum $x4  $x0)  9)
  $x12 = $x12 -bxor (Salsa20-Rotate (Salsa20-Sum $x8  $x4)  13)
  $x0  = $x0  -bxor (Salsa20-Rotate (Salsa20-Sum $x12 $x8)  18)
  [...snip...]
  $x12 = $x12 -bxor (Salsa20-Rotate (Salsa20-Sum $x15 $x14) 7)
  $x13 = $x13 -bxor (Salsa20-Rotate (Salsa20-Sum $x12 $x15) 9)
  $x14 = $x14 -bxor (Salsa20-Rotate (Salsa20-Sum $x13 $x12) 13)
  $x15 = $x15 -bxor (Salsa20-Rotate (Salsa20-Sum $x14 $x13) 18)
}

to this:

for ($i = $Rounds; $i -gt 0; $i-=2) {
  ChaCha20-QuarterRound ([Ref]$x0) ([Ref]$x4) ([Ref]$x8)  ([Ref]$x12)
  ChaCha20-QuarterRound ([Ref]$x1) ([Ref]$x5) ([Ref]$x9)  ([Ref]$x13)
  ChaCha20-QuarterRound ([Ref]$x2) ([Ref]$x6) ([Ref]$x10) ([Ref]$x14)
  ChaCha20-QuarterRound ([Ref]$x3) ([Ref]$x7) ([Ref]$x11) ([Ref]$x15)
  ChaCha20-QuarterRound ([Ref]$x0) ([Ref]$x5) ([Ref]$x10) ([Ref]$x15)
  ChaCha20-QuarterRound ([Ref]$x1) ([Ref]$x6) ([Ref]$x11) ([Ref]$x12)
  ChaCha20-QuarterRound ([Ref]$x2) ([Ref]$x7) ([Ref]$x8)  ([Ref]$x13)
  ChaCha20-QuarterRound ([Ref]$x3) ([Ref]$x4) ([Ref]$x9)  ([Ref]$x14)
}

You'd want do that, but you shouldn't. You should inline the work in the core directly for performance reasons. Also, the inputs are different as per page 5 of the paper and ChaCha20 flips the nonce and IV:

Salsa20:

void ECRYPT_ivsetup(ECRYPT_ctx *x,const u8 *iv)
{
  x->input[6] = U8TO32_LITTLE(iv + 0);
  x->input[7] = U8TO32_LITTLE(iv + 4);
  x->input[8] = 0;
  x->input[9] = 0;
}

ChaCha20:

void ECRYPT_ivsetup(ECRYPT_ctx *x,const u8 *iv)
{
  x->input[12] = 0;
  x->input[13] = 0;
  x->input[14] = U8TO32_LITTLE(iv + 0);
  x->input[15] = U8TO32_LITTLE(iv + 4);
}

The change from inputs 6,7,8,9 to 12,13,14,15 isn't critical if you read page 5. What's important is that the last two 32-bit values in Salsa20 (input[8] and input[9]) are the nonce (the zeroes), whereas in ChaCha20, they are the first (input[12] and input[13]).

The ChaCha20-Core function in PowerShell is similar to the Salsa20-core:

Function ChaCha20-Core {
  Param(
    [Parameter(Mandatory=$True,Position=0)]
    [Byte[]] $k,
    [Parameter(Mandatory=$True,Position=1)]
    [Byte[]] $in,
    [Parameter(Mandatory=$False,Position=2)]
    [Byte] $Rounds = 20
  )

  [UInt32] $j0  = $x0  = Salsa20-LoadLittleEndian  $c[ 0..3]
  [UInt32] $j1  = $x1  = Salsa20-LoadLittleEndian  $c[ 4..7]
  [UInt32] $j2  = $x2  = Salsa20-LoadLittleEndian  $c[ 8..11]
  [UInt32] $j3  = $x3  = Salsa20-LoadLittleEndian  $c[12..15]
  [UInt32] $j4  = $x4  = Salsa20-LoadLittleEndian  $k[ 0..3]
  [UInt32] $j5  = $x5  = Salsa20-LoadLittleEndian  $k[ 4..7]
  [UInt32] $j6  = $x6  = Salsa20-LoadLittleEndian  $k[ 8..11]
  [UInt32] $j7  = $x7  = Salsa20-LoadLittleEndian  $k[12..15]
  [UInt32] $j8  = $x8  = Salsa20-LoadLittleEndian  $k[16..19]
  [UInt32] $j9  = $x9  = Salsa20-LoadLittleEndian  $k[20..23]
  [UInt32] $j10 = $x10 = Salsa20-LoadLittleEndian  $k[24..27]
  [UInt32] $j11 = $x11 = Salsa20-LoadLittleEndian  $k[28..31]
  [UInt32] $j12 = $x12 = Salsa20-LoadLittleEndian $in[0..3]
  [UInt32] $j13 = $x13 = Salsa20-LoadLittleEndian $in[4..7]
  [UInt32] $j14 = $x14 = Salsa20-LoadLittleEndian $in[8..11]
  [UInt32] $j15 = $x15 = Salsa20-LoadLittleEndian $in[12..15]

  for ($i = $Rounds; $i -gt 0; $i-=2) {
    $x0 = Salsa20-Sum $x0 $x4;   $x12 = Salsa20-Rotate ($x12 -bxor $x0)  16
    $x8 = Salsa20-Sum $x8 $x12;   $x4 = Salsa20-Rotate ($x4  -bxor $x8)  12
    $x0 = Salsa20-Sum $x0 $x4;   $x12 = Salsa20-Rotate ($x12 -bxor $x0)   8
    $x8 = Salsa20-Sum $x8 $x12;   $x4 = Salsa20-Rotate ($x4  -bxor $x8)   7

    $x1 = Salsa20-Sum $x1 $x5;   $x13 = Salsa20-Rotate ($x13 -bxor $x1)  16
    $x9 = Salsa20-Sum $x9 $x13;   $x5 = Salsa20-Rotate ($x5  -bxor $x9)  12
    $x1 = Salsa20-Sum $x1 $x5;   $x13 = Salsa20-Rotate ($x13 -bxor $x1)   8
    $x9 = Salsa20-Sum $x9 $x13;   $x5 = Salsa20-Rotate ($x5  -bxor $x9)   7

    $x2 = Salsa20-Sum $x2 $x6;   $x14 = Salsa20-Rotate ($x14 -bxor $x2)  16
    $x10 = Salsa20-Sum $x10 $x14; $x6 = Salsa20-Rotate ($x6  -bxor $x10) 12
    $x2 = Salsa20-Sum $x2 $x6;   $x14 = Salsa20-Rotate ($x14 -bxor $x2)   8
    $x10 = Salsa20-Sum $x10 $x14; $x6 = Salsa20-Rotate ($x6  -bxor $x10)  7

    $x3  = Salsa20-Sum $x3 $x7;  $x15 = Salsa20-Rotate ($x15 -bxor $x3)  16
    $x11 = Salsa20-Sum $x11 $x15; $x7 = Salsa20-Rotate ($x7  -bxor $x11) 12
    $x3  = Salsa20-Sum $x3 $x7;  $x15 = Salsa20-Rotate ($x15 -bxor $x3)   8
    $x11 = Salsa20-Sum $x11 $x15; $x7 = Salsa20-Rotate ($x7  -bxor $x11)  7

    $x0  = Salsa20-Sum $x0 $x5;  $x15 = Salsa20-Rotate ($x15 -bxor $x0)  16
    $x10 = Salsa20-Sum $x10 $x15; $x5 = Salsa20-Rotate ($x5  -bxor $x10) 12
    $x0  = Salsa20-Sum $x0 $x5;  $x15 = Salsa20-Rotate ($x15 -bxor $x0)   8
    $x10 = Salsa20-Sum $x10 $x15; $x5 = Salsa20-Rotate ($x5  -bxor $x10)  7

    $x1  = Salsa20-Sum $x1 $x6;  $x12 = Salsa20-Rotate ($x12 -bxor $x1)  16
    $x11 = Salsa20-Sum $x11 $x12; $x6 = Salsa20-Rotate ($x6  -bxor $x11) 12
    $x1  = Salsa20-Sum $x1 $x6;  $x12 = Salsa20-Rotate ($x12 -bxor $x1)   8
    $x11 = Salsa20-Sum $x1 $x12;  $x6 = Salsa20-Rotate ($x6  -bxor $x11)  7

    $x2 = Salsa20-Sum $x2 $x7;   $x13 = Salsa20-Rotate ($x13 -bxor $x2)  16
    $x8 = Salsa20-Sum $x8 $x13;   $x7 = Salsa20-Rotate ($x7  -bxor $x8)  12
    $x2 = Salsa20-Sum $x2 $x7;   $x13 = Salsa20-Rotate ($x13 -bxor $x2)   8
    $x8 = Salsa20-Sum $x8 $x13;   $x7 = Salsa20-Rotate ($x7  -bxor $x8)   7

    $x3 = Salsa20-Sum $x3 $x4;   $x14 = Salsa20-Rotate ($x14 -bxor $x3)  16
    $x9 = Salsa20-Sum $x9 $x14;   $x4 = Salsa20-Rotate ($x4  -bxor $x9)  12
    $x3 = Salsa20-Sum $x3 $x4;   $x14 = Salsa20-Rotate ($x14 -bxor $x3)   8
    $x9 = Salsa20-Sum $x9 $x14;   $x4 = Salsa20-Rotate ($x4  -bxor $x9)   7
  }

  $x0  = Salsa20-Sum $x0 $j0
  $x1  = Salsa20-Sum $x1 $j1
  $x2  = Salsa20-Sum $x2 $j2
  $x3  = Salsa20-Sum $x3 $j3
  $x4  = Salsa20-Sum $x4 $j4
  $x5  = Salsa20-Sum $x5 $j5
  $x6  = Salsa20-Sum $x6 $j6
  $x7  = Salsa20-Sum $x7 $j7
  $x8  = Salsa20-Sum $x8 $j8
  $x9  = Salsa20-Sum $x9 $j9
  $x10 = Salsa20-Sum $x10 $j10
  $x11 = Salsa20-Sum $x11 $j11
  $x12 = Salsa20-Sum $x12 $j12
  $x13 = Salsa20-Sum $x13 $j13
  $x14 = Salsa20-Sum $x14 $j14
  $x15 = Salsa20-Sum $x15 $j15

  $out = @()
  $out += Salsa20-StoreLittleEndian $x0
  $out += Salsa20-StoreLittleEndian $x1
  $out += Salsa20-StoreLittleEndian $x2
  $out += Salsa20-StoreLittleEndian $x3
  $out += Salsa20-StoreLittleEndian $x4
  $out += Salsa20-StoreLittleEndian $x5
  $out += Salsa20-StoreLittleEndian $x6
  $out += Salsa20-StoreLittleEndian $x7
  $out += Salsa20-StoreLittleEndian $x8
  $out += Salsa20-StoreLittleEndian $x9
  $out += Salsa20-StoreLittleEndian $x10
  $out += Salsa20-StoreLittleEndian $x11
  $out += Salsa20-StoreLittleEndian $x12
  $out += Salsa20-StoreLittleEndian $x13
  $out += Salsa20-StoreLittleEndian $x14
  $out += Salsa20-StoreLittleEndian $x15

  Return $out
}

It's longer than just calling ChaCha20-QuarterRound a few times, but the ChaCha20-QuarterRound method is about 23% slower, which is why I'm writing this. An initial test of these PowerShell functions had the pass-by-reference method approximately 7% faster on one box and 5% slower on another for a 65,535-byte string. A lot of this is machine-dependent, I know, and I haven't even mentioned CPU statistics or cycle counts yet. I was originally planning on saying "Here ya go! Here's ChaCha20 in PowerSshell! Good luck, everybody!" and be on my way, but I noticed this -5 to +7% difference and wanted to figure out why. I got curious about how to measure performance between the two algorithms, and compare them against an AES-128 implemention. Performance measurement is hard, folks.

I took a swing at it, generating a few hundred strings of various lengths and measuring the total time taken to encrypt them. To minimize I/O as a bottleneck I eliminated any writing-to-disk. For simplicity, I kept everything linear and in memory. If you want to split work up into multiple cores, more power to you. (As a note, Salsa20 and its variants are great at parallelization. Since you can encrypt block n of your message without having to know anything about block n-1, you could take your message and key, divide the message by 64 bytes, and simultaneously encrypt those (message/64) blocks, say one per core, and stitch them back together again when they've all completed. But that's extra credit.)

A serious benchmark is probably — no, definitely — more work than this implementation merits, but I wrote a modest approximation. First, create a function to create strings:

Function New-String {
  Param(
    [UInt16] $Length = 16
  )
  [String] $out = ''

  for ($i = 0; $i -lt $Length; $i++) {
    $out += [Char] (Get-Random -Minimum 33 -Maximum 126)
  }
  Return $out
}

Then use that function to generate strings of given lengths and measure how long it take to encrypt them with Measure-Command:

$algos = @(' salsa', 'chacha', 'aes128')
$iter  = 100

:LENLOOP for ($len = 256; $len -lt [Math]::Pow(2,16); $len *= 2) {
  $avg  = New-Object -TypeName Double[] $algos.Length
  $min  = New-Object -TypeName Double[] $algos.Length
  $max  = New-Object -TypeName Double[] $algos.Length

  for ($i = 0; $i -lt $min.Length; $i++) { $min[$i] = [Double]::MaxValue }

  :DOLOOP for ($i = 1; $i -le $iter; $i++) {
    $s = New-String -Length $len

    $x = Measure-Command { Set-Salsa20String -String $s }
    $avg[0] += ($x.TotalSeconds - $avg[0]) / $i
    if ($min[0] -gt $x.TotalSeconds) { $min[0] = $x.TotalSeconds }
    if ($max[0] -lt $x.TotalSeconds) { $max[0] = $x.TotalSeconds }

    $x = Measure-Command { Set-ChaCha20String -String $s }
    $avg[1] += ($x.TotalSeconds - $avg[1]) / $i
    if ($min[1] -gt $x.TotalSeconds) { $min[1] = $x.TotalSeconds }
    if ($max[1] -lt $x.TotalSeconds) { $max[1] = $x.TotalSeconds }

    $x = Measure-Command { Set-AESString -String $s }
    $avg[2] += ($x.TotalSeconds - $avg[2]) / $i
    if ($min[2] -gt $x.TotalSeconds) { $min[2] = $x.TotalSeconds }
    if ($max[2] -lt $x.TotalSeconds) { $max[2] = $x.TotalSeconds }
  }
}

Kick this off with some log lines thrown in and look at the results:

$out = ''
:OUTLOOP for ($i = 0; $i -lt $algos.Length; $i++) {
  $throughput = $len / $avg[$i]
  $ratio      = $avg[$i] / $avg[0]
  
  $out += (Get-Date).ToString('yyyy-MM-dd HH:mm:ss.ffffff')
  $out += " {0}"            -f ($algos[$i])
  $out += " strlen={0}"     -f ($len)
  $out += " iter={0}"       -f ($iter)
  $out += " min={0:0.000}s" -f ($min[$i])
  $out += " max={0:0.000}s" -f ($max[$i])
  $out += " avg={0:0.000}s" -f ($avg[$i])
  $out += " {0:0.000}b/s"   -f ($throughput)
  $out += " ({0:0.000}x)"   -f ($ratio)
  $out += "`n"
}
Write-Host $out

Eventually, you start to see (approximately) how fast these PowerShell-only encryption algorithms are. They are certainly not as fast as C natively compiled to your exact architecture, but if you're administratively prohibited from using Real Encryption, you have to make due with the best you've got. I wanted to know string length, string count, minimum and maximum encryption times, the mean, and approximate throughput in bytes per second.

For each string length, I used the Salsa20 average as my baseline (so Salsa20's average speed difference is by definition always 1.000x) to determine if ChaCha20 was faster or slower, and by how much:

salsa strlen=256 iter=100 min=0.226s max=0.405s avg=0.279s 918.348b/s (1.000x)
chacha strlen=256 iter=100 min=0.232s max=0.452s avg=0.283s 904.866b/s (1.015x)
aes128 strlen=256 iter=100 min=23.673s max=30.619s avg=27.547s 9.293b/s (98.821x)

 salsa strlen=512 iter=100 min=0.467s max=0.806s avg=0.527s 971.958b/s (1.000x)
chacha strlen=512 iter=100 min=0.465s max=0.825s avg=0.529s 966.959b/s (1.005x)
aes128 strlen=512 iter=100 min=47.188s max=61.495s avg=51.975s 9.851b/s (98.666x)

OK, that gives us an idea of how the two algorithms work. ChaCha20 is about 1% slower over 100 different encryptions. Using the ChaCha20-QuarterRound version of the script, the ChaCha20 numbers are very different:

salsa strlen=256 iter=100 min=0.221s max=0.357s avg=0.236s 1084.247b/s (1.000x)
chacha strlen=256 iter=100 min=0.273s max=0.435s avg=0.290s 883.173b/s (1.228x)
aes128 strlen=256 iter=100 min=23.035s max=24.718s avg=23.626s 10.835b/s (100.065x)

Hence why I say that the shorter, sweeter, pass-by-reference version of ChaCha20 is about 23% slower. I ran the tests repeatedly across a number of different string lengths and got similar results. I'm bad at computers, so I don't really know why it's worse, but it is. Profiling says so:

salsa strlen=512 iter=100 min=0.452s max=0.704s avg=0.479s 1069.053b/s (1.00x)
chacha strlen=512 iter=100 min=0.556s max=0.842s avg=0.585s 875.259b/s (1.22x)
aes128 strlen=512 iter=100 min=46.737s max=49.244s avg=47.843s 10.702b/s (99.90x)

 salsa strlen=1024 iter=100 min=0.949s max=1.436s avg=1.001s 1023.088b/s (1.00x)
chacha strlen=1024 iter=100 min=1.162s max=1.740s avg=1.222s 837.759b/s (1.22x)
aes128 strlen=1024 iter=100 min=97.043s max=108.716s avg=98.242s 10.423b/s (98.15x)

 salsa strlen=2048 iter=100 min=1.926s max=2.791s avg=2.042s 1002.976b/s (1.00x)
chacha strlen=2048 iter=100 min=2.355s max=3.964s avg=2.482s 825.099b/s (1.22x)
aes128 strlen=2048 iter=100 min=191.266s max=217.787s avg=195.406s 10.481b/s (95.70x)

So I ran the numbers on all of the different algorithms and it looks pretty straight-forward. On my test box, Salsa20 and the non-PBR ChaCha20 version do about 1 KiB/s of data. AES-128 handles about 10 B/s. This makes the Latin dance algorithms roughly two orders of magnitude faster, which is nice.

A picture is worth a thousand words, so I graphed it out (string length is the X-axis, maximum encryption time encountered for that length, in seconds, is the Y-axis):



You have to zoom in to the tail end of the graph to even see a very narrow difference between Salsa20 and ChaCha20, with ChaCha20 just slightly, slightly faster than Salsa20. They're pretty much indistinguishable in terms of overall performance, even at tens-of-thousands-of-character-length strings.

"Where's AES in all of this?" you ask. Same numbers, same graph, but with AES added:



AES isn't even in the same ballpark.

No comments: