Powerful HTTrack CLI Options for Dumping Sites
Hello, it's Munou.
HTTrack dates from around the year 2000 and still carries some constraints from that era. The commands for dumping a site tend to get long, so I put them in a shell script, which I'm leaving here.
Japanese articles mostly cover the Windows GUI version and there is little information on CLI usage, so this is mainly for the record.
#!/bin/bash
httrack \
'https://example.com' \
'+*/*.pdf' \
--sockets=59 \
--robots=0 \
--user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36' \
-O '/media/your/outdir/' \
--can-go-up-and-down \
--keep-alive \
--mirror \
--depth=999999999 \
-%P \
--retries=999999 \
--ext-depth=0 \
--timeout=9999 \
-T1000000 \
--max-rate=0 \
--disable-security-limits
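Since the command gets long, one way to keep it readable is to hold the options in a bash array, where each entry can carry its own comment. The following is only a sketch under assumptions: the function names `build_httrack_args` and `run_mirror` and the variable `HTTRACK_ARGS` are made up for illustration, the flags are exactly the ones from the script above, and the comments are my reading of HTTrack's option list rather than verified behavior.

```shell
#!/bin/bash
# Sketch: collect the HTTrack options in an array so each one can be commented.
# build_httrack_args, run_mirror and HTTRACK_ARGS are hypothetical names.

build_httrack_args() {
  local url="$1" outdir="$2"
  HTTRACK_ARGS=(
    "$url"
    '+*/*.pdf'                  # scan rule: also accept PDF files
    --sockets=59                # number of simultaneous connections
    --robots=0                  # do not follow robots.txt
    --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    -O "$outdir"                # output directory
    --can-go-up-and-down        # follow links both up and down the directory tree
    --keep-alive
    --mirror
    --depth=999999999
    -%P
    --retries=999999
    --ext-depth=0
    --timeout=9999
    -T1000000                   # kept as in the original script
    --max-rate=0                # 0 appears to remove the speed limit
    --disable-security-limits
  )
}

run_mirror() {
  build_httrack_args "$1" "$2"
  # Quoting "${HTTRACK_ARGS[@]}" passes every entry as exactly one argument.
  httrack "${HTTRACK_ARGS[@]}"
}
```

To start a crawl, you would call something like `run_mirror 'https://example.com' '/media/your/outdir/'` at the end of the script. Nothing runs until that call is made, so the option list can be inspected or edited first.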
I will explain the important options.
max-rate=0
Setting a large upper limit such as --max-rate=999999 does not seem to be recognized correctly, and the standard speed limit still applies. Setting it to 0 appears to remove the speed limit.
disable-security-limits
This option seems to work together with the one above. I couldn't tell which of the two is essential, so I included both.
Note that I could not find this option in the official documentation, so it may be a developer option.
ext-depth=0
This sets a maximum depth (HTTrack's option list describes it as the depth for external links), and with 0 it seems to crawl almost indefinitely.
At the moment the combination seems to matter: with ext-depth=0 set, adding --depth=999999999 lets HTTrack crawl up to 999,999,999 levels, and ext-depth=0 alone might not be enough.
Also, if it is not set to 0, an error like the following is output.
nohup.out:PANIC! : Too many URLs : >99999 [3031]
This probably means the option triggered behavior that the software, which has been around since 2000, was not designed to handle.
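The error line above looks like grep output over nohup.out, so a small helper for checking a long-running crawl might look like the following. This is a sketch under assumptions: the function name `find_panics` is hypothetical, and it assumes the crawl was started with nohup so that HTTrack's messages end up in nohup.out in the current directory.

```shell
#!/bin/bash
# Sketch: print PANIC lines from an HTTrack log, prefixed with the file name.
# find_panics is a hypothetical helper name; nohup.out is an assumed default.
# Exit status is grep's: 0 if a PANIC line was found, 1 if none.

find_panics() {
  local log="${1:-nohup.out}"
  # -H forces the "filename:" prefix even when only one file is searched.
  grep -H 'PANIC!' -- "$log"
}
```

Called as `find_panics` or `find_panics /path/to/nohup.out`, it prints nothing and returns a non-zero status when the log has no PANIC lines, which makes it usable in an `if` statement.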
So, these were the three important options.
For other options, please refer to the official documentation.
https://www.httrack.com/html/fcguide.html
That's all for now.
Until next time.