Caveats: these are all guesswork; they might be incorrect or may block more than intended, but they work for me.
NB: I have a robots.txt file specifying a crawl rate of one request every 5 seconds; the crawlers below appear to be ignoring it. Generally I turn a blind eye to anything that's not hitting a server-generated URL multiple times a second - the ones below are being excessive and causing undue load on relatively modest servers.
User-agent: *
Allow: /
Crawl-delay: 5
Found this crawling our jobs pages at multiple requests per second, with a spoofed user agent made to look like a normal browser and some sneaky DNS: all the requests come from hostnames of the form random-alnums.setaptr.net, all of which resolve to the same single IP address on nslookup - but that isn't the IP address the requests actually originate from. I think I've managed to block it with the following CIDR subnets added to the firewall (these rules are almost certainly too broad, but each /24 is only 256 IPs, to keep the collateral damage down); example firewall commands follow the list:
- 173.244.208.0/24
- 173.244.209.0/24
- 173.244.210.0/24
- 173.244.211.0/24
- 209.95.51.0/24
- 209.95.56.0/24
- 107.182.230.0/24
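If the firewall in question is plain iptables, the equivalent block rules would look something like this (just a sketch - adapt for nftables, ufw, or whatever firewall you actually run):

# Drop everything from the offending /24 ranges
for subnet in 173.244.208.0/24 173.244.209.0/24 173.244.210.0/24 173.244.211.0/24 \
              209.95.51.0/24 209.95.56.0/24 107.182.230.0/24; do
    iptables -A INPUT -s "$subnet" -j DROP
done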
Crawling and making requests too quickly - I blocked the following two IP addresses (the access-log one-liner after the list is how this sort of offender shows up):
- 67.212.239.134
- 67.212.239.133
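For reference, offenders like this are easy to spot by counting requests per client IP in the access log - a rough sketch assuming an Apache-style combined log (the log path will vary by setup):

# Top 20 client IPs by request count
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20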
Bytespider (Bytedance's spider - they're the TikTok owner) - too many requests per second. The suspicion from googling around is that they're trying to build their own LLM, so they're hoovering up as much web content as possible as quickly as possible.
In the .htaccess file I added the following - bear in mind I could see half the requests weren't identifying themselves in the user agent, but given the pattern I'm pretty certain it was still them (FYI they crawl from AWS):
RewriteCond %{HTTP_USER_AGENT} "(Bytespider)" [NC]
RewriteRule "^.*$" - [F,L]
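Once a rule like that is live it's easy to sanity-check from the command line - any request whose user agent contains the blocked string should come back as a 403 (hypothetical hostname, substitute your own site):

curl -I -A "Mozilla/5.0 (compatible; Bytespider)" https://www.example.com/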
In the .htaccess file I added the following to block ffuf ("Fuzz Faster U Fool", a web fuzzing tool whose default user agent contains "Fuzz Faster"):
RewriteCond %{HTTP_USER_AGENT} "(Fuzz Faster)" [NC]
RewriteRule "^.*$" - [F,L]
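If you'd rather keep these in one place, the two user-agent rules can be combined into a single block - a sketch, assuming mod_rewrite is enabled and RewriteEngine On isn't already set elsewhere in the file:

# Return 403 to anything whose user agent contains either string
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "(Bytespider|Fuzz Faster)" [NC]
RewriteRule "^.*$" - [F,L]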