Tag: robots
27 posts

All the little .txt files you can put in the root directory of your website

The ones I know of: ads.txt, humans.txt, robots.txt, and security.txt. This site makes use of robots.txt and humans.txt. I don’t need ads.txt because 3rd-party ads aren’t currently running on the site, and security.txt seems unnecessary because the site’s contact form is easy enough for anyone to find. Continue reading »
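For reference, here is a minimal sketch of two of these files (placeholder values, not this site’s actual files): an allow-everything robots.txt, and a security.txt as specified by RFC 9116 (which prefers /.well-known/security.txt, with the root location as a legacy fallback):

    # robots.txt — allow all crawling
    User-agent: *
    Disallow:

    # security.txt — security contact info per RFC 9116 (placeholder values)
    Contact: mailto:security@example.com
    Expires: 2026-12-31T00:00:00.000Z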

Blackhole for Bad Bots – Quick Start

Welcome to the Quick Start Guide for the standalone PHP version of Blackhole for Bad Bots. This post is basically a condensed summary of the original Blackhole tutorial. So if you are new to the concept of blocking bad bots, check out the original tutorial. Otherwise, for those who are familiar, the following guide should simplify things and help you get started with Blackhole as quickly as possible. Continue reading »
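The gist of the setup, in a hedged sketch (the trap path and markup here are illustrative; the tutorial has the actual files): robots.txt forbids a trap directory, and a hidden link points at it, so only bots that ignore your robots rules ever reach it:

    # robots.txt — forbid the trap directory for rule-abiding bots
    User-agent: *
    Disallow: /blackhole/

    <!-- hidden link somewhere in your page markup -->
    <a href="/blackhole/" rel="nofollow" style="display:none;">Do NOT follow this link or you will be banned from the site!</a>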

Worst IPs: 2016 Edition

A little late this year, but following tradition here is my list of the absolute worst IP addresses from 2016. All in nice numerical order for easy crunching. These IPs are associated with all sorts of malicious activity, including exploit scanning, email harvesting, brute-force login attacks, referrer spam, and everything in between. Really obnoxious stuff that degrades your site’s performance and potentially threatens security. Continue reading »
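Once you have a list like this, one common way to put it to work (a sketch with placeholder IPs, not entries from the actual list) is to deny the addresses via .htaccess:

    # .htaccess — block known bad IPs (Apache 2.2 syntax; placeholder IPs)
    Order Allow,Deny
    Allow from all
    Deny from 1.2.3.4
    Deny from 5.6.7.8

    # Apache 2.4+ equivalent:
    # <RequireAll>
    #     Require all granted
    #     Require not ip 1.2.3.4
    #     Require not ip 5.6.7.8
    # </RequireAll>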

How to Block Baidu Bot

A user of my 6G Firewall recently asked how to block the “baidu” bot from accessing their site. This post explains why Baidu is not blocked in 6G and provides a quick .htaccess technique to deny it (or anything claiming to be it) access to your site. Continue reading »
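The general shape of such a rule looks like this (a sketch of the technique; see the post for the exact rules). It matches any user agent containing “baidu” and returns 403 Forbidden:

    # .htaccess — deny anything identifying as Baidu
    <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} baidu [NC]
        RewriteRule .* - [F,L]
    </IfModule>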

WordPress Plugin: Blackhole for Bad Bots

Update: Pro version now available! Check out Blackhole Pro » Finally translated my Blackhole Spider Trap into a FREE WordPress plugin. It’s fun, fast, flexible, and works silently behind the scenes to protect your WordPress-powered site from malicious bots. Here are some of the features: Continue reading »

Integrating Google No Captcha reCaptcha In WordPress Forms

In this tutorial you will learn how to integrate Google’s new reCAPTCHA model in the WordPress login, comment, registration, and lost password forms. Continue reading »
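A rough sketch of the login-form case only (MY_RECAPTCHA_SITE_KEY and MY_RECAPTCHA_SECRET_KEY are hypothetical constants standing in for your own keys; the tutorial’s actual code differs):

    <?php
    // Output the reCAPTCHA widget on the WordPress login form.
    add_action( 'login_form', function () {
        echo '<div class="g-recaptcha" data-sitekey="' . esc_attr( MY_RECAPTCHA_SITE_KEY ) . '"></div>';
        echo '<script src="https://www.google.com/recaptcha/api.js" async defer></script>';
    } );

    // Verify the submitted token with Google before allowing login.
    add_filter( 'authenticate', function ( $user, $username ) {
        if ( empty( $username ) ) {
            return $user; // not a login attempt
        }
        if ( empty( $_POST['g-recaptcha-response'] ) ) {
            return new WP_Error( 'captcha', 'Please complete the reCAPTCHA.' );
        }
        $response = wp_remote_post( 'https://www.google.com/recaptcha/api/siteverify', array(
            'body' => array(
                'secret'   => MY_RECAPTCHA_SECRET_KEY,
                'response' => sanitize_text_field( wp_unslash( $_POST['g-recaptcha-response'] ) ),
            ),
        ) );
        if ( is_wp_error( $response ) ) {
            return new WP_Error( 'captcha', 'Could not reach the reCAPTCHA server.' );
        }
        $result = json_decode( wp_remote_retrieve_body( $response ) );
        return empty( $result->success )
            ? new WP_Error( 'captcha', 'reCAPTCHA verification failed.' )
            : $user;
    }, 21, 2 );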

Humans.txt

One thing I love about Twitter is the instant feedback. For the past few weeks I’ve been seeing lots of 404 requests like this:

    https://perishablepress.com/humans.txt
    https://perishablepress.com/humans.txt
    https://perishablepress.com/humans.txt

At first I thought it was some skript kiddie getting creative, you know as a play on the robots.txt file, which is also located in the root of many websites. So it seemed interesting enough to tweet about: Continue reading »
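For the curious, humans.txt is an initiative from humanstxt.org: a plain-text file crediting the people behind a site. A minimal example of the format (contents here are illustrative placeholders):

    /* TEAM */
    Developer: Your Name
    Location: City, Country

    /* SITE */
    Software: WordPress
    Standards: HTML5, CSS3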

Multiple Sitemaps

Yes, you can have multiple sitemaps for your site. Create the sitemaps you need, and then specify them in your robots.txt file. For example, here are the robots.txt directives for the two sitemaps used here at Perishable Press: Continue reading »
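The pattern looks like this (a sketch with hypothetical filenames; the post lists the actual directives used here): each sitemap gets its own Sitemap line, and each must use the full URL:

    Sitemap: https://example.com/sitemap-pages.xml
    Sitemap: https://example.com/sitemap-posts.xml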

Better Robots.txt Rules for WordPress

Cleaning up my files during the recent redesign, I realized that several years had somehow passed since the last time I even looked at the site’s robots.txt file. I guess that’s a good thing, but with all of the changes to site structure and content, it was time again for a delightful romp through robots.txt. This post summarizes my research and gives you a near-perfect robots file, so you can copy/paste completely “as-is”, or use a template to give you […] Continue reading »
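As a taste of the sort of rules the post arrives at (a hedged sketch of commonly recommended WordPress directives, not the post’s exact file):

    # robots.txt — common WordPress rules (illustrative sketch)
    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    Sitemap: https://example.com/sitemap.xml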

Protect Your Site with a Blackhole for Bad Bots

One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the honeypot trap, which then performs a WHOIS lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots are immediately denied […] Continue reading »
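Compressed into a short PHP sketch, the trap’s flow looks something like this (conceptual only, with hypothetical file paths; the actual Blackhole script does more, including the WHOIS lookup):

    <?php
    // blackhole.php — lives inside the robots.txt-forbidden directory.
    // Any request that reaches this file came from a bot ignoring the
    // rules, so its IP gets recorded and denied on subsequent visits.
    // (Hypothetical sketch; not the actual Blackhole code.)

    $blacklist = __DIR__ . '/blackhole.dat'; // hypothetical data file
    $ip = $_SERVER['REMOTE_ADDR'];

    // Deny previously trapped bots.
    $trapped = file_exists( $blacklist )
        ? file( $blacklist, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES )
        : array();
    if ( in_array( $ip, $trapped, true ) ) {
        http_response_code( 403 );
        exit( 'Access denied.' );
    }

    // First visit: record the offending IP.
    file_put_contents( $blacklist, $ip . "\n", FILE_APPEND | LOCK_EX );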

Stop 404s for Mobile Versions of Your Site

If you’ve been keeping an eye on your 404 errors recently, you will have noticed an increase in requests for nonexistent mobile files and directories, especially over the past year or so. The scripts and bots requesting these files from your server seem to be looking for a mobile version of your site. Unfortunately, they are wasting bandwidth and resources in the process. It has become common to see the following 404 errors constantly repeated in your log files: http://domain.tld/apple-touch-icon.png […] Continue reading »
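One way to quiet the noise (a sketch of one possible approach, not necessarily the post’s solution; the target path is hypothetical) is to point the most common touch-icon probes at a file that actually exists:

    # .htaccess — alias common touch-icon probes to a real icon
    <IfModule mod_alias.c>
        RedirectMatch 301 ^/apple-touch-icon(-\d+x\d+)?(-precomposed)?\.png$ /favicon.png
    </IfModule>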

Tell Google NOT to Index Certain Parts of Your Web Pages

There are several ways to instruct Google to stay away from various pages in your site: robots.txt directives, nofollow attributes on links, meta noindex/nofollow directives, X-Robots noindex/nofollow directives, and so on. These directives all function in different ways, but they all serve the same basic purpose: control how Google crawls the various pages on your site. For example, you can use meta noindex to instruct Google not to index your sitemap, RSS feed, or any other page you wish. This […] Continue reading »
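Two of these methods side by side, as a hedged sketch (the file pattern is illustrative): a meta tag for individual pages, and an X-Robots-Tag header for static XML resources, set via .htaccess:

    <!-- in the page head: keep this page out of the index -->
    <meta name="robots" content="noindex,nofollow">

    # .htaccess — noindex static XML files (requires mod_headers)
    <IfModule mod_headers.c>
        <FilesMatch "\.xml$">
            Header set X-Robots-Tag "noindex,nofollow"
        </FilesMatch>
    </IfModule>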

Yahoo! Slurp too Stupid to be a Robot

I really hate bad robots. When a web crawler, spider, bot — or whatever you want to call it — behaves in a way that is contrary to expected and/or accepted protocols, we say that the bot is acting suspiciously, behaving badly, or just acting stupid in general. Unfortunately, there are thousands — if not hundreds of thousands — of nefarious bots violating our sites every minute of the day. For the most part, there are effective methods available enabling […] Continue reading »

Yahoo! Lies about Obeying Robots.txt Directives

There are two possibilities here: either Yahoo!’s Slurp crawler is broken, or Yahoo! lies about obeying robots directives. Neither case is good. Slurp just can’t seem to keep its nose out of my private business. And, as I’ve discussed before, this happens all the time. Here are the two most recent offenses, as recorded in the log file for my blackhole spider trap: Continue reading »

Yahoo! Once Again Caught Disobeying Robots.txt Rules

Hmmm… Let’s see here. Google can do it. MSN/Live can do it. Even Ask can do it. So why oh why can’t Yahoo!’s grubby Slurp crawler manage to adhere to robots.txt crawl directives? Just when I thought Yahoo! finally figured it out, I discover more Slurp tracks in my Blackhole trap for bad spiders: Continue reading »

Unexplained Crawl Behavior Involving Tagged Query Strings

I need your help! I am losing my mind trying to solve another baffling mystery. For the past three or four months, I have been recording many 404 errors generated from msnbot, Yahoo-Slurp, and other spider crawls. These errors result from invalid requests for URLs containing query strings such as the following:

    https://example.com/press/page/2/?tag=spam
    https://example.com/press/page/3/?tag=code
    https://example.com/press/page/2/?tag=email
    https://example.com/press/page/2/?tag=xhtml
    https://example.com/press/page/4/?tag=notes
    https://example.com/press/page/2/?tag=flash
    https://example.com/press/page/2/?tag=links
    https://example.com/press/page/3/?tag=theme
    https://example.com/press/page/2/?tag=press

Note: For these example URLs, I replaced my domain, perishablepress.com, with the generic example.com. Turns out that listing the plain-text […] Continue reading »
