ChaCha20 in PowerShell
I couldn't sleep the other night so to pass the time, naturally, I wrote a ChaCha20 implementation in PowerShell.
ChaCha20 is a variant of Salsa20 and promises better input diffusion and, in some cases, a speed improvement.
Given the Salsa20 PowerShell implementation from last week year, you'd want to add a new ChaCha20 function:
Function ChaCha20-QuarterRound { Param( [Ref] $a, [Ref] $b, [Ref] $c, [Ref] $d ) $a.Value = Salsa20-Sum $a.Value $b.Value; $d.Value = Salsa20-Rotate ($d.Value -bxor $a.Value) 16 $c.Value = Salsa20-Sum $c.Value $d.Value; $b.Value = Salsa20-Rotate ($b.Value -bxor $c.Value) 12 $a.Value = Salsa20-Sum $a.Value $b.Value; $d.Value = Salsa20-Rotate ($d.Value -bxor $a.Value) 8 $c.Value = Salsa20-Sum $c.Value $d.Value; $b.Value = Salsa20-Rotate ($b.Value -bxor $c.Value) 7 }
and then edit your Salsa20-Core logic from this:
for ($i = $Rounds; $i -gt 0; $i-=2) { $x4 = $x4 -bxor (Salsa20-Rotate (Salsa20-Sum $x0 $x12) 7) $x8 = $x8 -bxor (Salsa20-Rotate (Salsa20-Sum $x4 $x0) 9) $x12 = $x12 -bxor (Salsa20-Rotate (Salsa20-Sum $x8 $x4) 13) $x0 = $x0 -bxor (Salsa20-Rotate (Salsa20-Sum $x12 $x8) 18) [...snip...] $x12 = $x12 -bxor (Salsa20-Rotate (Salsa20-Sum $x15 $x14) 7) $x13 = $x13 -bxor (Salsa20-Rotate (Salsa20-Sum $x12 $x15) 9) $x14 = $x14 -bxor (Salsa20-Rotate (Salsa20-Sum $x13 $x12) 13) $x15 = $x15 -bxor (Salsa20-Rotate (Salsa20-Sum $x14 $x13) 18) }
to this:
for ($i = $Rounds; $i -gt 0; $i-=2) { ChaCha20-QuarterRound ([Ref]$x0) ([Ref]$x4) ([Ref]$x8) ([Ref]$x12) ChaCha20-QuarterRound ([Ref]$x1) ([Ref]$x5) ([Ref]$x9) ([Ref]$x13) ChaCha20-QuarterRound ([Ref]$x2) ([Ref]$x6) ([Ref]$x10) ([Ref]$x14) ChaCha20-QuarterRound ([Ref]$x3) ([Ref]$x7) ([Ref]$x11) ([Ref]$x15) ChaCha20-QuarterRound ([Ref]$x0) ([Ref]$x5) ([Ref]$x10) ([Ref]$x15) ChaCha20-QuarterRound ([Ref]$x1) ([Ref]$x6) ([Ref]$x11) ([Ref]$x12) ChaCha20-QuarterRound ([Ref]$x2) ([Ref]$x7) ([Ref]$x8) ([Ref]$x13) ChaCha20-QuarterRound ([Ref]$x3) ([Ref]$x4) ([Ref]$x9) ([Ref]$x14) }
You'd want do that, but you shouldn't. You should inline the work in the core directly for performance reasons. Also, the inputs are different as per page 5 of the paper and ChaCha20 flips the nonce and IV:
Salsa20:
void ECRYPT_ivsetup(ECRYPT_ctx *x,const u8 *iv) { x->input[6] = U8TO32_LITTLE(iv + 0); x->input[7] = U8TO32_LITTLE(iv + 4); x->input[8] = 0; x->input[9] = 0; }
ChaCha20:
void ECRYPT_ivsetup(ECRYPT_ctx *x,const u8 *iv) { x->input[12] = 0; x->input[13] = 0; x->input[14] = U8TO32_LITTLE(iv + 0); x->input[15] = U8TO32_LITTLE(iv + 4); }
The change from inputs 6,7,8,9 to 12,13,14,15 isn't critical if you read page 5. What's important is that the last two 32-bit values in Salsa20 (input[8] and input[9]) are the nonce (the zeroes), whereas in ChaCha20, they are the first (input[12] and input[13]).
The ChaCha20-Core function in PowerShell is similar to the Salsa20-core:
Function ChaCha20-Core { Param( [Parameter(Mandatory=$True,Position=0)] [Byte[]] $k, [Parameter(Mandatory=$True,Position=1)] [Byte[]] $in, [Parameter(Mandatory=$False,Position=2)] [Byte] $Rounds = 20 ) [UInt32] $j0 = $x0 = Salsa20-LoadLittleEndian $c[ 0..3] [UInt32] $j1 = $x1 = Salsa20-LoadLittleEndian $c[ 4..7] [UInt32] $j2 = $x2 = Salsa20-LoadLittleEndian $c[ 8..11] [UInt32] $j3 = $x3 = Salsa20-LoadLittleEndian $c[12..15] [UInt32] $j4 = $x4 = Salsa20-LoadLittleEndian $k[ 0..3] [UInt32] $j5 = $x5 = Salsa20-LoadLittleEndian $k[ 4..7] [UInt32] $j6 = $x6 = Salsa20-LoadLittleEndian $k[ 8..11] [UInt32] $j7 = $x7 = Salsa20-LoadLittleEndian $k[12..15] [UInt32] $j8 = $x8 = Salsa20-LoadLittleEndian $k[16..19] [UInt32] $j9 = $x9 = Salsa20-LoadLittleEndian $k[20..23] [UInt32] $j10 = $x10 = Salsa20-LoadLittleEndian $k[24..27] [UInt32] $j11 = $x11 = Salsa20-LoadLittleEndian $k[28..31] [UInt32] $j12 = $x12 = Salsa20-LoadLittleEndian $in[0..3] [UInt32] $j13 = $x13 = Salsa20-LoadLittleEndian $in[4..7] [UInt32] $j14 = $x14 = Salsa20-LoadLittleEndian $in[8..11] [UInt32] $j15 = $x15 = Salsa20-LoadLittleEndian $in[12..15] for ($i = $Rounds; $i -gt 0; $i-=2) { $x0 = Salsa20-Sum $x0 $x4; $x12 = Salsa20-Rotate ($x12 -bxor $x0) 16 $x8 = Salsa20-Sum $x8 $x12; $x4 = Salsa20-Rotate ($x4 -bxor $x8) 12 $x0 = Salsa20-Sum $x0 $x4; $x12 = Salsa20-Rotate ($x12 -bxor $x0) 8 $x8 = Salsa20-Sum $x8 $x12; $x4 = Salsa20-Rotate ($x4 -bxor $x8) 7 $x1 = Salsa20-Sum $x1 $x5; $x13 = Salsa20-Rotate ($x13 -bxor $x1) 16 $x9 = Salsa20-Sum $x9 $x13; $x5 = Salsa20-Rotate ($x5 -bxor $x9) 12 $x1 = Salsa20-Sum $x1 $x5; $x13 = Salsa20-Rotate ($x13 -bxor $x1) 8 $x9 = Salsa20-Sum $x9 $x13; $x5 = Salsa20-Rotate ($x5 -bxor $x9) 7 $x2 = Salsa20-Sum $x2 $x6; $x14 = Salsa20-Rotate ($x14 -bxor $x2) 16 $x10 = Salsa20-Sum $x10 $x14; $x6 = Salsa20-Rotate ($x6 -bxor $x10) 12 $x2 = Salsa20-Sum $x2 $x6; $x14 = Salsa20-Rotate ($x14 -bxor $x2) 8 $x10 = Salsa20-Sum $x10 $x14; $x6 = Salsa20-Rotate ($x6 -bxor $x10) 7 $x3 = Salsa20-Sum $x3 $x7; $x15 = Salsa20-Rotate ($x15 -bxor $x3) 16 $x11 = Salsa20-Sum $x11 $x15; $x7 = Salsa20-Rotate ($x7 -bxor $x11) 12 $x3 = Salsa20-Sum $x3 $x7; $x15 = Salsa20-Rotate ($x15 -bxor $x3) 8 $x11 = Salsa20-Sum $x11 $x15; $x7 = Salsa20-Rotate ($x7 -bxor $x11) 7 $x0 = Salsa20-Sum $x0 $x5; $x15 = Salsa20-Rotate ($x15 -bxor $x0) 16 $x10 = Salsa20-Sum $x10 $x15; $x5 = Salsa20-Rotate ($x5 -bxor $x10) 12 $x0 = Salsa20-Sum $x0 $x5; $x15 = Salsa20-Rotate ($x15 -bxor $x0) 8 $x10 = Salsa20-Sum $x10 $x15; $x5 = Salsa20-Rotate ($x5 -bxor $x10) 7 $x1 = Salsa20-Sum $x1 $x6; $x12 = Salsa20-Rotate ($x12 -bxor $x1) 16 $x11 = Salsa20-Sum $x11 $x12; $x6 = Salsa20-Rotate ($x6 -bxor $x11) 12 $x1 = Salsa20-Sum $x1 $x6; $x12 = Salsa20-Rotate ($x12 -bxor $x1) 8 $x11 = Salsa20-Sum $x1 $x12; $x6 = Salsa20-Rotate ($x6 -bxor $x11) 7 $x2 = Salsa20-Sum $x2 $x7; $x13 = Salsa20-Rotate ($x13 -bxor $x2) 16 $x8 = Salsa20-Sum $x8 $x13; $x7 = Salsa20-Rotate ($x7 -bxor $x8) 12 $x2 = Salsa20-Sum $x2 $x7; $x13 = Salsa20-Rotate ($x13 -bxor $x2) 8 $x8 = Salsa20-Sum $x8 $x13; $x7 = Salsa20-Rotate ($x7 -bxor $x8) 7 $x3 = Salsa20-Sum $x3 $x4; $x14 = Salsa20-Rotate ($x14 -bxor $x3) 16 $x9 = Salsa20-Sum $x9 $x14; $x4 = Salsa20-Rotate ($x4 -bxor $x9) 12 $x3 = Salsa20-Sum $x3 $x4; $x14 = Salsa20-Rotate ($x14 -bxor $x3) 8 $x9 = Salsa20-Sum $x9 $x14; $x4 = Salsa20-Rotate ($x4 -bxor $x9) 7 } $x0 = Salsa20-Sum $x0 $j0 $x1 = Salsa20-Sum $x1 $j1 $x2 = Salsa20-Sum $x2 $j2 $x3 = Salsa20-Sum $x3 $j3 $x4 = Salsa20-Sum $x4 $j4 $x5 = Salsa20-Sum $x5 $j5 $x6 = Salsa20-Sum $x6 $j6 $x7 = Salsa20-Sum $x7 $j7 $x8 = Salsa20-Sum $x8 $j8 $x9 = Salsa20-Sum $x9 $j9 $x10 = Salsa20-Sum $x10 $j10 $x11 = Salsa20-Sum $x11 $j11 $x12 = Salsa20-Sum $x12 $j12 $x13 = Salsa20-Sum $x13 $j13 $x14 = Salsa20-Sum $x14 $j14 $x15 = Salsa20-Sum $x15 $j15 $out = @() $out += Salsa20-StoreLittleEndian $x0 $out += Salsa20-StoreLittleEndian $x1 $out += Salsa20-StoreLittleEndian $x2 $out += Salsa20-StoreLittleEndian $x3 $out += Salsa20-StoreLittleEndian $x4 $out += Salsa20-StoreLittleEndian $x5 $out += Salsa20-StoreLittleEndian $x6 $out += Salsa20-StoreLittleEndian $x7 $out += Salsa20-StoreLittleEndian $x8 $out += Salsa20-StoreLittleEndian $x9 $out += Salsa20-StoreLittleEndian $x10 $out += Salsa20-StoreLittleEndian $x11 $out += Salsa20-StoreLittleEndian $x12 $out += Salsa20-StoreLittleEndian $x13 $out += Salsa20-StoreLittleEndian $x14 $out += Salsa20-StoreLittleEndian $x15 Return $out }
It's longer than just calling ChaCha20-QuarterRound a few times, but the ChaCha20-QuarterRound method is about 23% slower, which is why I'm writing this. An initial test of these PowerShell functions had the pass-by-reference method approximately 7% faster on one box and 5% slower on another for a 65,535-byte string. A lot of this is machine-dependent, I know, and I haven't even mentioned CPU statistics or cycle counts yet. I was originally planning on saying "Here ya go! Here's ChaCha20 in PowerSshell! Good luck, everybody!" and be on my way, but I noticed this -5 to +7% difference and wanted to figure out why. I got curious about how to measure performance between the two algorithms, and compare them against an AES-128 implemention. Performance measurement is hard, folks.
I took a swing at it, generating a few hundred strings of various lengths and measuring the total time taken to encrypt them. To minimize I/O as a bottleneck I eliminated any writing-to-disk. For simplicity, I kept everything linear and in memory. If you want to split work up into multiple cores, more power to you. (As a note, Salsa20 and its variants are great at parallelization. Since you can encrypt block n of your message without having to know anything about block n-1, you could take your message and key, divide the message by 64 bytes, and simultaneously encrypt those (message/64) blocks, say one per core, and stitch them back together again when they've all completed. But that's extra credit.)
A serious benchmark is probably — no, definitely — more work than this implementation merits, but I wrote a modest approximation. First, create a function to create strings:
Function New-String { Param( [UInt16] $Length = 16 ) [String] $out = '' for ($i = 0; $i -lt $Length; $i++) { $out += [Char] (Get-Random -Minimum 33 -Maximum 126) } Return $out }
Then use that function to generate strings of given lengths and measure how long it take to encrypt them with Measure-Command:
$algos = @(' salsa', 'chacha', 'aes128') $iter = 100 :LENLOOP for ($len = 256; $len -lt [Math]::Pow(2,16); $len *= 2) { $avg = New-Object -TypeName Double[] $algos.Length $min = New-Object -TypeName Double[] $algos.Length $max = New-Object -TypeName Double[] $algos.Length for ($i = 0; $i -lt $min.Length; $i++) { $min[$i] = [Double]::MaxValue } :DOLOOP for ($i = 1; $i -le $iter; $i++) { $s = New-String -Length $len $x = Measure-Command { Set-Salsa20String -String $s } $avg[0] += ($x.TotalSeconds - $avg[0]) / $i if ($min[0] -gt $x.TotalSeconds) { $min[0] = $x.TotalSeconds } if ($max[0] -lt $x.TotalSeconds) { $max[0] = $x.TotalSeconds } $x = Measure-Command { Set-ChaCha20String -String $s } $avg[1] += ($x.TotalSeconds - $avg[1]) / $i if ($min[1] -gt $x.TotalSeconds) { $min[1] = $x.TotalSeconds } if ($max[1] -lt $x.TotalSeconds) { $max[1] = $x.TotalSeconds } $x = Measure-Command { Set-AESString -String $s } $avg[2] += ($x.TotalSeconds - $avg[2]) / $i if ($min[2] -gt $x.TotalSeconds) { $min[2] = $x.TotalSeconds } if ($max[2] -lt $x.TotalSeconds) { $max[2] = $x.TotalSeconds } } }
Kick this off with some log lines thrown in and look at the results:
$out = '' :OUTLOOP for ($i = 0; $i -lt $algos.Length; $i++) { $throughput = $len / $avg[$i] $ratio = $avg[$i] / $avg[0] $out += (Get-Date).ToString('yyyy-MM-dd HH:mm:ss.ffffff') $out += " {0}" -f ($algos[$i]) $out += " strlen={0}" -f ($len) $out += " iter={0}" -f ($iter) $out += " min={0:0.000}s" -f ($min[$i]) $out += " max={0:0.000}s" -f ($max[$i]) $out += " avg={0:0.000}s" -f ($avg[$i]) $out += " {0:0.000}b/s" -f ($throughput) $out += " ({0:0.000}x)" -f ($ratio) $out += "`n" } Write-Host $out
Eventually, you start to see (approximately) how fast these PowerShell-only encryption algorithms are. They are certainly not as fast as C natively compiled to your exact architecture, but if you're administratively prohibited from using Real Encryption, you have to make due with the best you've got. I wanted to know string length, string count, minimum and maximum encryption times, the mean, and approximate throughput in bytes per second.
For each string length, I used the Salsa20 average as my baseline (so Salsa20's average speed difference is by definition always 1.000x) to determine if ChaCha20 was faster or slower, and by how much:
salsa strlen=256 iter=100 min=0.226s max=0.405s avg=0.279s 918.348b/s (1.000x) chacha strlen=256 iter=100 min=0.232s max=0.452s avg=0.283s 904.866b/s (1.015x) aes128 strlen=256 iter=100 min=23.673s max=30.619s avg=27.547s 9.293b/s (98.821x) salsa strlen=512 iter=100 min=0.467s max=0.806s avg=0.527s 971.958b/s (1.000x) chacha strlen=512 iter=100 min=0.465s max=0.825s avg=0.529s 966.959b/s (1.005x) aes128 strlen=512 iter=100 min=47.188s max=61.495s avg=51.975s 9.851b/s (98.666x)
OK, that gives us an idea of how the two algorithms work. ChaCha20 is about 1% slower over 100 different encryptions. Using the ChaCha20-QuarterRound version of the script, the ChaCha20 numbers are very different:
salsa strlen=256 iter=100 min=0.221s max=0.357s avg=0.236s 1084.247b/s (1.000x) chacha strlen=256 iter=100 min=0.273s max=0.435s avg=0.290s 883.173b/s (1.228x) aes128 strlen=256 iter=100 min=23.035s max=24.718s avg=23.626s 10.835b/s (100.065x)
Hence why I say that the shorter, sweeter, pass-by-reference version of ChaCha20 is about 23% slower. I ran the tests repeatedly across a number of different string lengths and got similar results. I'm bad at computers, so I don't really know why it's worse, but it is. Profiling says so:
salsa strlen=512 iter=100 min=0.452s max=0.704s avg=0.479s 1069.053b/s (1.00x) chacha strlen=512 iter=100 min=0.556s max=0.842s avg=0.585s 875.259b/s (1.22x) aes128 strlen=512 iter=100 min=46.737s max=49.244s avg=47.843s 10.702b/s (99.90x) salsa strlen=1024 iter=100 min=0.949s max=1.436s avg=1.001s 1023.088b/s (1.00x) chacha strlen=1024 iter=100 min=1.162s max=1.740s avg=1.222s 837.759b/s (1.22x) aes128 strlen=1024 iter=100 min=97.043s max=108.716s avg=98.242s 10.423b/s (98.15x) salsa strlen=2048 iter=100 min=1.926s max=2.791s avg=2.042s 1002.976b/s (1.00x) chacha strlen=2048 iter=100 min=2.355s max=3.964s avg=2.482s 825.099b/s (1.22x) aes128 strlen=2048 iter=100 min=191.266s max=217.787s avg=195.406s 10.481b/s (95.70x)
So I ran the numbers on all of the different algorithms and it looks pretty straight-forward. On my test box, Salsa20 and the non-PBR ChaCha20 version do about 1 KiB/s of data. AES-128 handles about 10 B/s. This makes the Latin dance algorithms roughly two orders of magnitude faster, which is nice.
A picture is worth a thousand words, so I graphed it out (string length is the X-axis, maximum encryption time encountered for that length, in seconds, is the Y-axis):
You have to zoom in to the tail end of the graph to even see a very narrow difference between Salsa20 and ChaCha20, with ChaCha20 just slightly, slightly faster than Salsa20. They're pretty much indistinguishable in terms of overall performance, even at tens-of-thousands-of-character-length strings.
"Where's AES in all of this?" you ask. Same numbers, same graph, but with AES added:
AES isn't even in the same ballpark.
No comments:
Post a Comment