This commit is kind of a quine!
Look at that commit message. Mouseover it. Click it. It's the same as the short SHA of the commit itself! Crazy.
How it went down
The initial inspiration came from the quine tweet.
How it was built
You can probably guess what the strategy was here: brute-force guessing the short SHA. This project also has a dependency on how GitHub works, since it leverages GitHub's commit SHA auto-linking. I did a quick test to figure out the minimum number of characters required to trigger short SHA auto-linking: it appears to be 7.
Armed with this knowledge, I wrote a Ruby script, trashed it, rewrote it in Go, and iterated on that. As much as I wanted to play around with Ruby's new-ish async stuff, I'm much more comfortable with Go's concurrency patterns. Plus, my guess was that the little bump in speed would come in handy in the long run.
The final version of the program does the following (a rough sketch of the worker loop follows the list):
- spawn n workers
- each worker gets a unique path on the filesystem
- each worker initializes a repo at their path
- then, in a loop:
  - generate a 7-character hexadecimal string
  - make an empty commit with the hex string as the commit message
  - check the short SHA with git rev-parse --short=7
  - if it's a match:
    - celebrate! 🎉
    - also, be sure to print it (and write it to a file to be safe)
    - then bail
  - if not:
    - git reset --hard HEAD~
  - occasionally, remove the repo and re-git init
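For flavor, here's a rough sketch in Go of what that worker loop might look like. To be clear, this is not the actual program: the temp-dir paths, the helper names, and the nuke-every-10,000-iterations cadence are all assumptions on my part.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// git runs a git subcommand inside dir and returns its trimmed output.
func git(dir string, args ...string) (string, error) {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

// initRepo wipes dir and sets up a fresh repo with a root commit, so
// that `git reset --hard HEAD~` always has somewhere to go back to.
func initRepo(dir string) {
	os.RemoveAll(dir)
	os.MkdirAll(dir, 0o755)
	git(dir, "init")
	git(dir, "config", "user.name", "quine")
	git(dir, "config", "user.email", "quine@example.com")
	git(dir, "commit", "--allow-empty", "-m", "init")
}

// randomHex returns a 7-character lowercase hex string.
func randomHex() string {
	b := make([]byte, 4)
	rand.Read(b)
	return hex.EncodeToString(b)[:7]
}

func worker(id int, found chan<- string) {
	dir := filepath.Join(os.TempDir(), fmt.Sprintf("quine-worker-%d", id))
	initRepo(dir)

	for i := 1; ; i++ {
		guess := randomHex()

		// Empty commit with the guess as the commit message.
		git(dir, "commit", "--allow-empty", "-m", guess)

		short, _ := git(dir, "rev-parse", "--short=7", "HEAD")
		if short == guess {
			found <- guess // celebrate! 🎉
			return
		}

		// Not a match: throw the commit away.
		git(dir, "reset", "--hard", "HEAD~")

		// Occasionally nuke the repo and start fresh.
		if i%10000 == 0 {
			initRepo(dir)
		}
	}
}

func main() {
	const n = 300
	found := make(chan string, n)
	for i := 0; i < n; i++ {
		go worker(i, found)
	}
	winner := <-found
	fmt.Println(winner)
	os.WriteFile("winner.txt", []byte(winner+"\n"), 0o644) // and write it to a file to be safe
}
```

Each guess shells out to git a few times (commit, rev-parse, reset), which is presumably where most of the per-iteration time goes.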
How it was run
I didn't want to run the script on my own machine because I didn't want to go through the hassle of keeping it plugged in and/or figuring out how to keep the script running and/or confirming that it was actually running all the time. Plus, I figured I could get a fancier rig from the cloud ☁️.
Up on GCP, I went through 4 instances while trying to figure out the right setup. IIRC the first few either were too slow at disk ops or had too little CPU. Running trials with 50 to 200 workers would either blow out the CPU and have all the goroutines fighting over time, or push the single-iteration time too high, e.g. on the order of tens of seconds.
I eventually settled on GCP's lowest-tier compute-optimized instance, c2-standard-4, a choice based heavily on the fact that it seems you can only attach local SSDs to compute-optimized, memory-optimized, or GPU instances. By this point I was operating under the assumption that I was wasting precious time talking to the network-attached disks these instances have by default, so a local SSD seemed necessary.
On this box, I was able to run 300 workers, hitting almost exactly 80% CPU usage, and iterating roughly every 750ms. With 7 hex digits, we have a 1 in 16^7 = 268,435,456 chance of guessing the short SHA. So: 300 workers each guessing roughly every 750ms works out to about 400 guesses per second, and 268,435,456 / 400 ≈ 671,000 seconds, i.e. I should be expecting a result within ~7¾ days.
How did it actually shake out?
3.434675925925926 days! Pretty good!
Things I learned and open questions
git reset --hard doesn't completely get rid of your changes
This became clear after two things:
- I started seeing messages about git automatically running git gc
- I checked the reflog and, well, there were at least some remnants of the reset commits in there
I briefly had a version of the program which would run git gc after every reset, but that didn't seem to help upon checking the reflog, so I ended up adding an occasional nuke and re-init of the repo.
But then, how can you truly get rid of your changes without rm-rf-ing? (I'm sure there's a way, I just haven't gotten around to Googling it).
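For what it's worth, one approach that should do it (an assumption on my part; I never tried it in the program) is to expire the reflog and then prune, reusing the git helper from the sketch above:

```go
// Hypothetical cleanup (not in the original program): drop every reflog
// entry immediately, then prune the now-unreachable commit objects.
git(dir, "reflog", "expire", "--expire=now", "--all")
git(dir, "gc", "--prune=now")
```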
Did I really need 300 workers?
I made a lot of assumptions during this project, and this is kind of one of them. I did do a bit of tuning by starting the program up and watching the average iteration time of each worker. It seemed like iteration time scaled more with the number of workers than with (what I imagine is) disk usage.
That would be the other variable here, right? With more workers come more disk operations, and I would guess the bottleneck would then come from workers waiting on disk I/O.
I ended up limiting it to 300 workers, in part because with any more than about 330 workers I'd start seeing "too many files open" errors. I guessed there was a way to increase this limit, but I didn't spend the time on it; besides, my CPU usage was already at a comfortable 80% with 300 workers.
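(For the record, the usual fix is to raise the file-descriptor limit, e.g. with ulimit -n; I believe a Go program can also bump its own soft limit at startup. A hypothetical sketch for Linux, which the actual run never used:)

```go
package main

import "syscall"

// raiseFileLimit bumps the soft open-file limit up to the hard limit.
// Hypothetical: the actual run just stayed under the default limit.
func raiseFileLimit() error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return err
	}
	lim.Cur = lim.Max
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
}

func main() {
	if err := raiseFileLimit(); err != nil {
		panic(err)
	}
	// ...spawn workers as usual...
}
```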
How much do empty commits interact with the filesystem?
I know git works by performing lots of arcane operations on the .git directory. Empty commits seemed like the right way to go to limit disk operations, but I wonder how much time I was really saving by doing this, versus committing a single markdown file with the guessed SHA or something.
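A crude way to find out would be to time a few hundred commits each way and compare. A rough, hypothetical sketch in the same shell-out-to-git style (none of this was part of the actual project):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

func run(dir string, args ...string) {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	cmd.Run()
}

// timeCommits makes n commits in a throwaway repo, optionally writing
// and staging a small markdown file before each one, and reports how
// long the whole loop took.
func timeCommits(withFile bool, n int) time.Duration {
	dir, _ := os.MkdirTemp("", "commit-bench")
	defer os.RemoveAll(dir)
	run(dir, "init")
	run(dir, "config", "user.name", "bench")
	run(dir, "config", "user.email", "bench@example.com")

	start := time.Now()
	for i := 0; i < n; i++ {
		msg := fmt.Sprintf("guess-%07d", i)
		if withFile {
			os.WriteFile(filepath.Join(dir, "guess.md"), []byte(msg+"\n"), 0o644)
			run(dir, "add", "guess.md")
		}
		run(dir, "commit", "--allow-empty", "-m", msg)
	}
	return time.Since(start)
}

func main() {
	fmt.Println("empty commits:", timeCommits(false, 500))
	fmt.Println("file commits: ", timeCommits(true, 500))
}
```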
What is the true meaning of df2128c?
D? F? 21? 28?? C??!?!1 What is the significance of these numbers and letters? We may never know ¯\_(ツ)_/¯