ChaCha20 in PowerShell
I couldn't sleep the other night so to pass the time, naturally, I wrote a ChaCha20 implementation in PowerShell.
ChaCha20 is a variant of Salsa20 and promises better input diffusion and, in some cases, a speed improvement.
Given the Salsa20 PowerShell implementation from last week year, you'd want to add a new ChaCha20 function:
Function ChaCha20-QuarterRound {
Param(
[Ref] $a,
[Ref] $b,
[Ref] $c,
[Ref] $d
)
$a.Value = Salsa20-Sum $a.Value $b.Value; $d.Value = Salsa20-Rotate ($d.Value -bxor $a.Value) 16
$c.Value = Salsa20-Sum $c.Value $d.Value; $b.Value = Salsa20-Rotate ($b.Value -bxor $c.Value) 12
$a.Value = Salsa20-Sum $a.Value $b.Value; $d.Value = Salsa20-Rotate ($d.Value -bxor $a.Value) 8
$c.Value = Salsa20-Sum $c.Value $d.Value; $b.Value = Salsa20-Rotate ($b.Value -bxor $c.Value) 7
}and then edit your Salsa20-Core logic from this:
for ($i = $Rounds; $i -gt 0; $i-=2) {
$x4 = $x4 -bxor (Salsa20-Rotate (Salsa20-Sum $x0 $x12) 7)
$x8 = $x8 -bxor (Salsa20-Rotate (Salsa20-Sum $x4 $x0) 9)
$x12 = $x12 -bxor (Salsa20-Rotate (Salsa20-Sum $x8 $x4) 13)
$x0 = $x0 -bxor (Salsa20-Rotate (Salsa20-Sum $x12 $x8) 18)
[...snip...]
$x12 = $x12 -bxor (Salsa20-Rotate (Salsa20-Sum $x15 $x14) 7)
$x13 = $x13 -bxor (Salsa20-Rotate (Salsa20-Sum $x12 $x15) 9)
$x14 = $x14 -bxor (Salsa20-Rotate (Salsa20-Sum $x13 $x12) 13)
$x15 = $x15 -bxor (Salsa20-Rotate (Salsa20-Sum $x14 $x13) 18)
}to this:
for ($i = $Rounds; $i -gt 0; $i-=2) {
ChaCha20-QuarterRound ([Ref]$x0) ([Ref]$x4) ([Ref]$x8) ([Ref]$x12)
ChaCha20-QuarterRound ([Ref]$x1) ([Ref]$x5) ([Ref]$x9) ([Ref]$x13)
ChaCha20-QuarterRound ([Ref]$x2) ([Ref]$x6) ([Ref]$x10) ([Ref]$x14)
ChaCha20-QuarterRound ([Ref]$x3) ([Ref]$x7) ([Ref]$x11) ([Ref]$x15)
ChaCha20-QuarterRound ([Ref]$x0) ([Ref]$x5) ([Ref]$x10) ([Ref]$x15)
ChaCha20-QuarterRound ([Ref]$x1) ([Ref]$x6) ([Ref]$x11) ([Ref]$x12)
ChaCha20-QuarterRound ([Ref]$x2) ([Ref]$x7) ([Ref]$x8) ([Ref]$x13)
ChaCha20-QuarterRound ([Ref]$x3) ([Ref]$x4) ([Ref]$x9) ([Ref]$x14)
}You'd want do that, but you shouldn't. You should inline the work in the core directly for performance reasons. Also, the inputs are different as per page 5 of the paper and ChaCha20 flips the nonce and IV:
Salsa20:
void ECRYPT_ivsetup(ECRYPT_ctx *x,const u8 *iv)
{
x->input[6] = U8TO32_LITTLE(iv + 0);
x->input[7] = U8TO32_LITTLE(iv + 4);
x->input[8] = 0;
x->input[9] = 0;
}ChaCha20:
void ECRYPT_ivsetup(ECRYPT_ctx *x,const u8 *iv)
{
x->input[12] = 0;
x->input[13] = 0;
x->input[14] = U8TO32_LITTLE(iv + 0);
x->input[15] = U8TO32_LITTLE(iv + 4);
}The change from inputs 6,7,8,9 to 12,13,14,15 isn't critical if you read page 5. What's important is that the last two 32-bit values in Salsa20 (input[8] and input[9]) are the nonce (the zeroes), whereas in ChaCha20, they are the first (input[12] and input[13]).
The ChaCha20-Core function in PowerShell is similar to the Salsa20-core:
Function ChaCha20-Core {
Param(
[Parameter(Mandatory=$True,Position=0)]
[Byte[]] $k,
[Parameter(Mandatory=$True,Position=1)]
[Byte[]] $in,
[Parameter(Mandatory=$False,Position=2)]
[Byte] $Rounds = 20
)
[UInt32] $j0 = $x0 = Salsa20-LoadLittleEndian $c[ 0..3]
[UInt32] $j1 = $x1 = Salsa20-LoadLittleEndian $c[ 4..7]
[UInt32] $j2 = $x2 = Salsa20-LoadLittleEndian $c[ 8..11]
[UInt32] $j3 = $x3 = Salsa20-LoadLittleEndian $c[12..15]
[UInt32] $j4 = $x4 = Salsa20-LoadLittleEndian $k[ 0..3]
[UInt32] $j5 = $x5 = Salsa20-LoadLittleEndian $k[ 4..7]
[UInt32] $j6 = $x6 = Salsa20-LoadLittleEndian $k[ 8..11]
[UInt32] $j7 = $x7 = Salsa20-LoadLittleEndian $k[12..15]
[UInt32] $j8 = $x8 = Salsa20-LoadLittleEndian $k[16..19]
[UInt32] $j9 = $x9 = Salsa20-LoadLittleEndian $k[20..23]
[UInt32] $j10 = $x10 = Salsa20-LoadLittleEndian $k[24..27]
[UInt32] $j11 = $x11 = Salsa20-LoadLittleEndian $k[28..31]
[UInt32] $j12 = $x12 = Salsa20-LoadLittleEndian $in[0..3]
[UInt32] $j13 = $x13 = Salsa20-LoadLittleEndian $in[4..7]
[UInt32] $j14 = $x14 = Salsa20-LoadLittleEndian $in[8..11]
[UInt32] $j15 = $x15 = Salsa20-LoadLittleEndian $in[12..15]
for ($i = $Rounds; $i -gt 0; $i-=2) {
$x0 = Salsa20-Sum $x0 $x4; $x12 = Salsa20-Rotate ($x12 -bxor $x0) 16
$x8 = Salsa20-Sum $x8 $x12; $x4 = Salsa20-Rotate ($x4 -bxor $x8) 12
$x0 = Salsa20-Sum $x0 $x4; $x12 = Salsa20-Rotate ($x12 -bxor $x0) 8
$x8 = Salsa20-Sum $x8 $x12; $x4 = Salsa20-Rotate ($x4 -bxor $x8) 7
$x1 = Salsa20-Sum $x1 $x5; $x13 = Salsa20-Rotate ($x13 -bxor $x1) 16
$x9 = Salsa20-Sum $x9 $x13; $x5 = Salsa20-Rotate ($x5 -bxor $x9) 12
$x1 = Salsa20-Sum $x1 $x5; $x13 = Salsa20-Rotate ($x13 -bxor $x1) 8
$x9 = Salsa20-Sum $x9 $x13; $x5 = Salsa20-Rotate ($x5 -bxor $x9) 7
$x2 = Salsa20-Sum $x2 $x6; $x14 = Salsa20-Rotate ($x14 -bxor $x2) 16
$x10 = Salsa20-Sum $x10 $x14; $x6 = Salsa20-Rotate ($x6 -bxor $x10) 12
$x2 = Salsa20-Sum $x2 $x6; $x14 = Salsa20-Rotate ($x14 -bxor $x2) 8
$x10 = Salsa20-Sum $x10 $x14; $x6 = Salsa20-Rotate ($x6 -bxor $x10) 7
$x3 = Salsa20-Sum $x3 $x7; $x15 = Salsa20-Rotate ($x15 -bxor $x3) 16
$x11 = Salsa20-Sum $x11 $x15; $x7 = Salsa20-Rotate ($x7 -bxor $x11) 12
$x3 = Salsa20-Sum $x3 $x7; $x15 = Salsa20-Rotate ($x15 -bxor $x3) 8
$x11 = Salsa20-Sum $x11 $x15; $x7 = Salsa20-Rotate ($x7 -bxor $x11) 7
$x0 = Salsa20-Sum $x0 $x5; $x15 = Salsa20-Rotate ($x15 -bxor $x0) 16
$x10 = Salsa20-Sum $x10 $x15; $x5 = Salsa20-Rotate ($x5 -bxor $x10) 12
$x0 = Salsa20-Sum $x0 $x5; $x15 = Salsa20-Rotate ($x15 -bxor $x0) 8
$x10 = Salsa20-Sum $x10 $x15; $x5 = Salsa20-Rotate ($x5 -bxor $x10) 7
$x1 = Salsa20-Sum $x1 $x6; $x12 = Salsa20-Rotate ($x12 -bxor $x1) 16
$x11 = Salsa20-Sum $x11 $x12; $x6 = Salsa20-Rotate ($x6 -bxor $x11) 12
$x1 = Salsa20-Sum $x1 $x6; $x12 = Salsa20-Rotate ($x12 -bxor $x1) 8
$x11 = Salsa20-Sum $x1 $x12; $x6 = Salsa20-Rotate ($x6 -bxor $x11) 7
$x2 = Salsa20-Sum $x2 $x7; $x13 = Salsa20-Rotate ($x13 -bxor $x2) 16
$x8 = Salsa20-Sum $x8 $x13; $x7 = Salsa20-Rotate ($x7 -bxor $x8) 12
$x2 = Salsa20-Sum $x2 $x7; $x13 = Salsa20-Rotate ($x13 -bxor $x2) 8
$x8 = Salsa20-Sum $x8 $x13; $x7 = Salsa20-Rotate ($x7 -bxor $x8) 7
$x3 = Salsa20-Sum $x3 $x4; $x14 = Salsa20-Rotate ($x14 -bxor $x3) 16
$x9 = Salsa20-Sum $x9 $x14; $x4 = Salsa20-Rotate ($x4 -bxor $x9) 12
$x3 = Salsa20-Sum $x3 $x4; $x14 = Salsa20-Rotate ($x14 -bxor $x3) 8
$x9 = Salsa20-Sum $x9 $x14; $x4 = Salsa20-Rotate ($x4 -bxor $x9) 7
}
$x0 = Salsa20-Sum $x0 $j0
$x1 = Salsa20-Sum $x1 $j1
$x2 = Salsa20-Sum $x2 $j2
$x3 = Salsa20-Sum $x3 $j3
$x4 = Salsa20-Sum $x4 $j4
$x5 = Salsa20-Sum $x5 $j5
$x6 = Salsa20-Sum $x6 $j6
$x7 = Salsa20-Sum $x7 $j7
$x8 = Salsa20-Sum $x8 $j8
$x9 = Salsa20-Sum $x9 $j9
$x10 = Salsa20-Sum $x10 $j10
$x11 = Salsa20-Sum $x11 $j11
$x12 = Salsa20-Sum $x12 $j12
$x13 = Salsa20-Sum $x13 $j13
$x14 = Salsa20-Sum $x14 $j14
$x15 = Salsa20-Sum $x15 $j15
$out = @()
$out += Salsa20-StoreLittleEndian $x0
$out += Salsa20-StoreLittleEndian $x1
$out += Salsa20-StoreLittleEndian $x2
$out += Salsa20-StoreLittleEndian $x3
$out += Salsa20-StoreLittleEndian $x4
$out += Salsa20-StoreLittleEndian $x5
$out += Salsa20-StoreLittleEndian $x6
$out += Salsa20-StoreLittleEndian $x7
$out += Salsa20-StoreLittleEndian $x8
$out += Salsa20-StoreLittleEndian $x9
$out += Salsa20-StoreLittleEndian $x10
$out += Salsa20-StoreLittleEndian $x11
$out += Salsa20-StoreLittleEndian $x12
$out += Salsa20-StoreLittleEndian $x13
$out += Salsa20-StoreLittleEndian $x14
$out += Salsa20-StoreLittleEndian $x15
Return $out
}It's longer than just calling ChaCha20-QuarterRound a few times, but the ChaCha20-QuarterRound method is about 23% slower, which is why I'm writing this. An initial test of these PowerShell functions had the pass-by-reference method approximately 7% faster on one box and 5% slower on another for a 65,535-byte string. A lot of this is machine-dependent, I know, and I haven't even mentioned CPU statistics or cycle counts yet. I was originally planning on saying "Here ya go! Here's ChaCha20 in PowerSshell! Good luck, everybody!" and be on my way, but I noticed this -5 to +7% difference and wanted to figure out why. I got curious about how to measure performance between the two algorithms, and compare them against an AES-128 implemention. Performance measurement is hard, folks.
I took a swing at it, generating a few hundred strings of various lengths and measuring the total time taken to encrypt them. To minimize I/O as a bottleneck I eliminated any writing-to-disk. For simplicity, I kept everything linear and in memory. If you want to split work up into multiple cores, more power to you. (As a note, Salsa20 and its variants are great at parallelization. Since you can encrypt block n of your message without having to know anything about block n-1, you could take your message and key, divide the message by 64 bytes, and simultaneously encrypt those (message/64) blocks, say one per core, and stitch them back together again when they've all completed. But that's extra credit.)
A serious benchmark is probably — no, definitely — more work than this implementation merits, but I wrote a modest approximation. First, create a function to create strings:
Function New-String {
Param(
[UInt16] $Length = 16
)
[String] $out = ''
for ($i = 0; $i -lt $Length; $i++) {
$out += [Char] (Get-Random -Minimum 33 -Maximum 126)
}
Return $out
}Then use that function to generate strings of given lengths and measure how long it take to encrypt them with Measure-Command:
$algos = @(' salsa', 'chacha', 'aes128')
$iter = 100
:LENLOOP for ($len = 256; $len -lt [Math]::Pow(2,16); $len *= 2) {
$avg = New-Object -TypeName Double[] $algos.Length
$min = New-Object -TypeName Double[] $algos.Length
$max = New-Object -TypeName Double[] $algos.Length
for ($i = 0; $i -lt $min.Length; $i++) { $min[$i] = [Double]::MaxValue }
:DOLOOP for ($i = 1; $i -le $iter; $i++) {
$s = New-String -Length $len
$x = Measure-Command { Set-Salsa20String -String $s }
$avg[0] += ($x.TotalSeconds - $avg[0]) / $i
if ($min[0] -gt $x.TotalSeconds) { $min[0] = $x.TotalSeconds }
if ($max[0] -lt $x.TotalSeconds) { $max[0] = $x.TotalSeconds }
$x = Measure-Command { Set-ChaCha20String -String $s }
$avg[1] += ($x.TotalSeconds - $avg[1]) / $i
if ($min[1] -gt $x.TotalSeconds) { $min[1] = $x.TotalSeconds }
if ($max[1] -lt $x.TotalSeconds) { $max[1] = $x.TotalSeconds }
$x = Measure-Command { Set-AESString -String $s }
$avg[2] += ($x.TotalSeconds - $avg[2]) / $i
if ($min[2] -gt $x.TotalSeconds) { $min[2] = $x.TotalSeconds }
if ($max[2] -lt $x.TotalSeconds) { $max[2] = $x.TotalSeconds }
}
}Kick this off with some log lines thrown in and look at the results:
$out = ''
:OUTLOOP for ($i = 0; $i -lt $algos.Length; $i++) {
$throughput = $len / $avg[$i]
$ratio = $avg[$i] / $avg[0]
$out += (Get-Date).ToString('yyyy-MM-dd HH:mm:ss.ffffff')
$out += " {0}" -f ($algos[$i])
$out += " strlen={0}" -f ($len)
$out += " iter={0}" -f ($iter)
$out += " min={0:0.000}s" -f ($min[$i])
$out += " max={0:0.000}s" -f ($max[$i])
$out += " avg={0:0.000}s" -f ($avg[$i])
$out += " {0:0.000}b/s" -f ($throughput)
$out += " ({0:0.000}x)" -f ($ratio)
$out += "`n"
}
Write-Host $outEventually, you start to see (approximately) how fast these PowerShell-only encryption algorithms are. They are certainly not as fast as C natively compiled to your exact architecture, but if you're administratively prohibited from using Real Encryption, you have to make due with the best you've got. I wanted to know string length, string count, minimum and maximum encryption times, the mean, and approximate throughput in bytes per second.
For each string length, I used the Salsa20 average as my baseline (so Salsa20's average speed difference is by definition always 1.000x) to determine if ChaCha20 was faster or slower, and by how much:
salsa strlen=256 iter=100 min=0.226s max=0.405s avg=0.279s 918.348b/s (1.000x) chacha strlen=256 iter=100 min=0.232s max=0.452s avg=0.283s 904.866b/s (1.015x) aes128 strlen=256 iter=100 min=23.673s max=30.619s avg=27.547s 9.293b/s (98.821x) salsa strlen=512 iter=100 min=0.467s max=0.806s avg=0.527s 971.958b/s (1.000x) chacha strlen=512 iter=100 min=0.465s max=0.825s avg=0.529s 966.959b/s (1.005x) aes128 strlen=512 iter=100 min=47.188s max=61.495s avg=51.975s 9.851b/s (98.666x)
OK, that gives us an idea of how the two algorithms work. ChaCha20 is about 1% slower over 100 different encryptions. Using the ChaCha20-QuarterRound version of the script, the ChaCha20 numbers are very different:
salsa strlen=256 iter=100 min=0.221s max=0.357s avg=0.236s 1084.247b/s (1.000x) chacha strlen=256 iter=100 min=0.273s max=0.435s avg=0.290s 883.173b/s (1.228x) aes128 strlen=256 iter=100 min=23.035s max=24.718s avg=23.626s 10.835b/s (100.065x)
Hence why I say that the shorter, sweeter, pass-by-reference version of ChaCha20 is about 23% slower. I ran the tests repeatedly across a number of different string lengths and got similar results. I'm bad at computers, so I don't really know why it's worse, but it is. Profiling says so:
salsa strlen=512 iter=100 min=0.452s max=0.704s avg=0.479s 1069.053b/s (1.00x) chacha strlen=512 iter=100 min=0.556s max=0.842s avg=0.585s 875.259b/s (1.22x) aes128 strlen=512 iter=100 min=46.737s max=49.244s avg=47.843s 10.702b/s (99.90x) salsa strlen=1024 iter=100 min=0.949s max=1.436s avg=1.001s 1023.088b/s (1.00x) chacha strlen=1024 iter=100 min=1.162s max=1.740s avg=1.222s 837.759b/s (1.22x) aes128 strlen=1024 iter=100 min=97.043s max=108.716s avg=98.242s 10.423b/s (98.15x) salsa strlen=2048 iter=100 min=1.926s max=2.791s avg=2.042s 1002.976b/s (1.00x) chacha strlen=2048 iter=100 min=2.355s max=3.964s avg=2.482s 825.099b/s (1.22x) aes128 strlen=2048 iter=100 min=191.266s max=217.787s avg=195.406s 10.481b/s (95.70x)
So I ran the numbers on all of the different algorithms and it looks pretty straight-forward. On my test box, Salsa20 and the non-PBR ChaCha20 version do about 1 KiB/s of data. AES-128 handles about 10 B/s. This makes the Latin dance algorithms roughly two orders of magnitude faster, which is nice.
A picture is worth a thousand words, so I graphed it out (string length is the X-axis, maximum encryption time encountered for that length, in seconds, is the Y-axis):
You have to zoom in to the tail end of the graph to even see a very narrow difference between Salsa20 and ChaCha20, with ChaCha20 just slightly, slightly faster than Salsa20. They're pretty much indistinguishable in terms of overall performance, even at tens-of-thousands-of-character-length strings.
"Where's AES in all of this?" you ask. Same numbers, same graph, but with AES added:
AES isn't even in the same ballpark.


No comments:
Post a Comment