Perishable Press
Tag: robots
Found 26 matching results
Page 1 of 2

Blackhole for Bad Bots – PHP Version

This post summarizes Blackhole for Bad Bots version 4.0 and later, and is meant to simplify things for users of those versions. For older versions, as well as the download, demo, and other important information about the standalone PHP version of Blackhole for Bad Bots, please see the original tutorial. Continue »

Worst IPs: 2016 Edition

A little late this year, but following tradition, here is my list of the absolute worst IP addresses from 2016, all in nice numerical order for easy crunching. These IPs are associated with all sorts of malicious activity, including exploit scanning, email harvesting, brute-force login attacks, referrer spam, and everything in between. Really obnoxious stuff […] Continue »
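For reference, a line or two of .htaccess is enough to block any of these addresses. A minimal sketch using Apache 2.2-style access control, with a placeholder documentation IP rather than a real offender:

    # Deny a single bad IP (192.0.2.1 is an illustrative placeholder)
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.1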

How to Block Baidu Bot

A user of my 6G Firewall recently asked how to block the “baidu” bot from accessing their site. This post explains why Baidu is not blocked in 6G and provides a quick .htaccess technique to deny it (or anything claiming to be it) access to your site. Continue »
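The gist of the technique is matching on the User-Agent header. A minimal sketch assuming Apache with mod_rewrite (the post's actual rules may differ slightly):

    # Forbid requests whose User-Agent claims to be Baidu
    <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} (baidu) [NC]
        RewriteRule .* - [F,L]
    </IfModule>

Because the match is on the User-Agent string, this denies anything claiming to be Baidu, whether or not it really is.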

New Plugin: Blackhole for Bad Bots

Image courtesy NASA/JPL-Caltech. Update: Pro version now available! Check out Blackhole Pro » Finally translated my Blackhole Spider Trap into a FREE WordPress plugin. It’s fun, fast, flexible, and works silently behind the scenes to protect your WordPress-powered site from malicious bots. Here are some of the features: easy to set up, squeaky-clean code […] Continue »

Integrating Google No CAPTCHA reCAPTCHA in WordPress Forms

In this tutorial you will learn how to integrate Google’s new reCAPTCHA model into WordPress login, comment, registration, and lost password forms. Continue »
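As a taste of the approach, here is a minimal sketch of hooking the widget into the login form (the site key is a placeholder, and server-side verification of the g-recaptcha-response field is still required, as the tutorial covers):

    // Load Google's reCAPTCHA script on the login screen
    add_action( 'login_enqueue_scripts', function () {
        wp_enqueue_script( 'google-recaptcha', 'https://www.google.com/recaptcha/api.js', array(), null );
    } );

    // Render the reCAPTCHA widget inside the login form
    add_action( 'login_form', function () {
        echo '<div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>';
    } );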

Humans.txt

One thing I love about Twitter is the instant feedback. For the past few weeks I’ve been seeing lots of 404 requests like this: https://perishablepress.com/humans.txt At first I thought it was some script kiddie getting creative, you know, as a play on the robots.txt file, which is also located in the root of […] Continue »
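For anyone unfamiliar, humans.txt is a plain-text file served from the site root that credits the people behind a site. An illustrative example of its contents:

    /* TEAM */
    Name: Your Name
    Site: https://example.com/

    /* SITE */
    Standards: HTML5, CSS3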

Multiple Sitemaps

Yes, you can have multiple sitemaps for your site. Create the sitemaps you need, and then specify them in your robots.txt file. For example, here are the robots.txt directives for the two sitemaps used here at Perishable Press: Continue »
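The site's actual sitemap URLs are behind the link; as an illustrative sketch, declaring two sitemaps in robots.txt looks like this:

    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/sitemap-extra.xml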

Better Robots.txt Rules for WordPress

Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt. […] Continue »
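The post walks through the actual rules; as a rough sketch (not necessarily the post's final recommendation), a lean WordPress robots.txt might look like this:

    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    Sitemap: https://example.com/sitemap.xml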

Protect Your Site with a Blackhole for Bad Bots

One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the honeypot trap, which then […] Continue »
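In outline, the trap needs two pieces: a robots.txt rule forbidding the trap directory, and a hidden link that only rule-ignoring bots will follow. A minimal sketch, with an illustrative directory name:

    # robots.txt: declare the trap off-limits
    User-agent: *
    Disallow: /blackhole/

    <!-- hidden link somewhere in your markup -->
    <a href="/blackhole/" rel="nofollow" style="display:none;">Do not follow this link</a>

Compliant crawlers never see the trap; anything that lands in /blackhole/ has, by definition, ignored the rules.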

Stop 404s for Mobile Versions of Your Site

If you’ve been keeping an eye on your 404 errors recently, you will have noticed an increase in requests for nonexistent mobile files and directories, especially over the past year or so. The scripts and bots requesting these files from your server seem to be looking for a mobile version of your site. Unfortunately, they […] Continue »
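One hedged way to quiet such requests, assuming Apache with mod_alias and illustrative paths (not necessarily the ones the post targets), is to answer them with 410 Gone:

    # Tell clients the requested mobile files are permanently gone
    RedirectMatch 410 ^/(m|mobile|iphone)(/.*)?$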

Tell Google NOT to Index Certain Parts of Your Web Pages

There are several ways to instruct Google to stay away from various pages in your site: robots.txt directives, nofollow attributes on links, meta noindex/nofollow directives, X-Robots noindex/nofollow directives, and so on. These directives all function in different ways, but they all serve the same basic purpose: control how Google crawls the various pages on your […] Continue »
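As a quick illustration of the meta flavor, a single line in the head of a page keeps it out of the index (a minimal sketch; the other methods are covered in the post):

    <meta name="robots" content="noindex,nofollow">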

Yahoo! Slurp too Stupid to be a Robot

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not […] Continue »

Yahoo! Lies about Obeying Robots.txt Directives

There are two possibilities here: either Yahoo!’s Slurp crawler is broken, or Yahoo! lies about obeying robots directives. Neither case is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the […] Continue »

Yahoo! Once Again Caught Disobeying Robots.txt Rules

Hmmm… Let’s see here. Google can do it. MSN/Live can do it. Even Ask can do it. So why oh why can’t Yahoo!’s grubby Slurp crawler manage to adhere to robots.txt crawl directives? Just when I thought Yahoo! finally figured it out, I discover more Slurp tracks in my Blackhole trap for bad spiders: Continue »

Unexplained Crawl Behavior Involving Tagged Query Strings

I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following: https://example.com/press/page/2/?tag=spam https://example.com/press/page/3/?tag=code https://example.com/press/page/2/?tag=email https://example.com/press/page/2/?tag=xhtml […] Continue »

Taking Advantage of the X-Robots Tag

Controlling the spidering, indexing, and caching of your (X)HTML-based web pages is possible with meta robots directives such as these: <meta name="googlebot" content="index,archive,follow,noodp"/> <meta name="robots" content="all,index,follow"/> <meta name="msnbot" content="all,index,follow"/> I use these directives here at Perishable Press and they continue to serve me well for controlling how the "big bots" crawl and represent my (X)HTML-based […] Continue »
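The advantage of the X-Robots-Tag is that it delivers the same directives as an HTTP header, so it also works for non-HTML resources such as PDFs. A minimal sketch assuming Apache with mod_headers:

    # Keep PDF files out of search indexes via HTTP header
    <IfModule mod_headers.c>
        <FilesMatch "\.pdf$">
            Header set X-Robots-Tag "noindex, nofollow"
        </FilesMatch>
    </IfModule>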
