[xxhash] Creating cuckooget for Ultra-Fast Local Site Saving [cuckoo hash table]
Hello, I'm Munou.
I created this because there was nothing with the features I wanted for creating a local mirror of a site.
httpithub.cturackooget
Introduction
cuckooget's site-mirroring feature is strongly influenced by Xavier Roche, the creator of HTTrack.
[Xavier Roche's blog](http://blog.httrack.com)
While reading his blog, I saw the following article.
[Coucal, Cuckoo-hashing-based hashtable with stash area C library](http://blog.httrack.com)
For cuckoo hash tables themselves, Kumagi-san's well-known slide deck, "The World of Hash Tables You Don't Know", gives an easy-to-understand explanation.
Then I had an idea. The interesting part is that, on a collision, a cuckoo hash table "kicks out the other element, like a cuckoo's habit" of evicting eggs from a nest. I thought this property could be put to use when handling a huge URL list full of potential duplicates.
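To make the "kick out the other element" idea concrete, here is a minimal sketch of a cuckoo-hashed set used for duplicate detection. This is not cuckooget's actual code; the two hash functions are stdlib stand-ins (the real tool pairs XXH32 with XXH64), and all names here are my own.

```python
import hashlib

def _slot(seed: bytes, key: str, size: int) -> int:
    # Stand-in hash function; the real tool uses XXH32/XXH64 instead.
    digest = hashlib.blake2b(key.encode(), salt=seed, digest_size=8).digest()
    return int.from_bytes(digest, "big") % size

class CuckooSet:
    """Two tables, one candidate slot per key in each; a collision evicts the resident."""
    SEEDS = (b"table-one", b"table-two")

    def __init__(self, size: int = 64, max_kicks: int = 32):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def __contains__(self, key) -> bool:
        # A key can only ever live in one of its two slots: O(1) lookup.
        return any(self.tables[w][_slot(self.SEEDS[w], key, self.size)] == key
                   for w in (0, 1))

    def add(self, key) -> bool:
        """Return True if inserted, False if the key was already present."""
        if key in self:
            return False                     # cheap duplicate detection
        cur, which = key, 0
        for _ in range(self.max_kicks):
            i = _slot(self.SEEDS[which], cur, self.size)
            cur, self.tables[which][i] = self.tables[which][i], cur  # kick out resident
            if cur is None:
                return True
            which ^= 1                       # the evicted key retries in the other table
        raise RuntimeError("eviction cycle; a real table resizes or uses a stash")
```

Lookups touch at most two slots regardless of table load, which is exactly why this structure is attractive for deduplicating a huge URL list.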
Hash Calculation
Since cryptographic security is not required, I adopted xxHash, a fast non-cryptographic hash I came across recently.
A cuckoo hash table needs two hash functions, so I generate them with XXH32 and XXH64.
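The point of using two hashes is that each key gets an independent candidate slot in each table. XXH32/XXH64 are not in Python's standard library, so this sketch substitutes blake2b at two digest sizes (BLAKE2 encodes the digest length in its parameter block, so the two outputs are independent, not truncations of one another); the function name is my own.

```python
import hashlib

def two_indices(url: str, table_size: int) -> tuple:
    """Return the key's candidate slot in each of the two tables.

    blake2b with two digest sizes stands in for XXH32/XXH64 here:
    a 4-byte digest plays the 32-bit role, an 8-byte digest the 64-bit role.
    """
    h32 = int.from_bytes(hashlib.blake2b(url.encode(), digest_size=4).digest(), "big")
    h64 = int.from_bytes(hashlib.blake2b(url.encode(), digest_size=8).digest(), "big")
    return h32 % table_size, h64 % table_size
```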
Also, Iigau-kun pointed out that the table size had a fixed upper limit, so I made the table resizable with _resize.
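Resizing a cuckoo table cannot simply copy slots over, because each key's indices are taken modulo the table size and therefore all change when the size changes. A sketch of the shape such a resize takes (not cuckooget's actual _resize; hashes are stdlib stand-ins for XXH32/XXH64):

```python
import hashlib

SEEDS = (b"table-one", b"table-two")

def _slot(seed: bytes, key: str, size: int) -> int:
    # Stand-in for XXH32/XXH64.
    digest = hashlib.blake2b(key.encode(), salt=seed, digest_size=8).digest()
    return int.from_bytes(digest, "big") % size

def resize(tables, max_kicks: int = 32):
    """Return doubled tables with every key reinserted at its new slots."""
    new_size = len(tables[0]) * 2
    new = [[None] * new_size, [None] * new_size]
    for key in (k for table in tables for k in table if k is not None):
        cur, which = key, 0
        for _ in range(max_kicks):
            i = _slot(SEEDS[which], cur, new_size)
            cur, new[which][i] = new[which][i], cur   # kick out any resident
            if cur is None:
                break
            which ^= 1                                # evicted key tries the other table
        else:
            raise RuntimeError("still cycling after resize; grow again")
    return new
```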
Unique Weighting
With the -w / --weight option, you specify strings contained in URLs, in order, and the save priority of a URL changes according to which of those strings it contains.
This allows flexible configuration without relying on hierarchical path specifications.
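My reading of the option is that substrings listed earlier rank higher. A hypothetical sketch of such a weighting function (option parsing omitted; the function name and behavior are assumptions, not cuckooget's actual code):

```python
def weight(url: str, patterns: list) -> int:
    """Lower rank = saved sooner. The first pattern in the -w/--weight list
    that appears in the URL decides the rank; URLs matching none sort last."""
    for rank, pattern in enumerate(patterns):
        if pattern in url:
            return rank
    return len(patterns)

urls = ["https://example.com/misc/x",
        "https://example.com/img/logo.png",
        "https://example.com/docs/index.html"]
# Prioritize documentation pages over images, everything else last,
# regardless of where each URL sits in the directory tree.
queue = sorted(urls, key=lambda u: weight(u, ["/docs/", "/img/"]))
```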
Fastest?
I won't write much about it as I might face criticism, but the default number of connections is 50.
But honestly, it's fast even without going up to 50.
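A common way to cap concurrency at a number like 50 is a semaphore around each fetch. This is only a sketch of the general pattern in asyncio, not how cuckooget itself is implemented; the actual HTTP request is stubbed out.

```python
import asyncio

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:                 # at most `limit` fetches run at once
        await asyncio.sleep(0)      # a real crawler would await the HTTP request here
        return url

async def crawl(urls, limit: int = 50):
    sem = asyncio.Semaphore(limit)
    # gather preserves input order even though fetches interleave
    return await asyncio.gather(*(fetch(u, sem) for u in urls))
```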
The reason speed matters is that I was saving the 5ch UNIX board around 2022 or 2023; it came to roughly 30 GB.
At the time, I was shocked to see historical threads and posts being rapidly pushed into the old-log archive by an overwhelming flood of scripted spam.
Therefore, for me, speed was an unavoidable priority.
What's Left...
I'd like to add a cache file (something I tried once and failed at) so that processing can resume after an interruption, and perhaps implement an LRU cache...
I've failed various times, so I'd like to somehow incorporate these features.
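For the LRU idea, the classic shape is an ordered map that promotes entries on access and evicts from the cold end. A minimal sketch under my own naming, nothing from cuckooget:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least-recently-used entry once `capacity` is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)        # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used
```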
Well then. Until next time.
And... I was so tired that I fell asleep while looking at my computer...