How WordPress lets the world know you’ve posted.
When you set up a new WordPress-based Web log you’re asked if you’d like the world to know about it, and this magic happens through Pingomatic. Click through to Options, then the Writing tab in your WordPress admin interface; it’s there at the bottom. But what exactly does that mean? Who’s coming to see my new post?
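For the curious, here is roughly what a ping looks like on the wire. This is a minimal sketch, not WordPress’s own code: it builds the weblogUpdates.ping XML-RPC call that update services of this kind accept. The blog title is made up, and the endpoint you would pass to send_ping (http://rpc.pingomatic.com/ is the usual entry in that Writing tab) is an assumption you should check against your own settings.

```python
import xmlrpc.client

# Assumed values for illustration only; the blog title is hypothetical.
BLOG_NAME = "Example Blog"
BLOG_URL = "http://blog.jules.com/"


def build_ping_request(name: str, url: str) -> str:
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((name, url), methodname="weblogUpdates.ping")


def send_ping(endpoint: str, name: str, url: str):
    """Send the ping over the network. Defined for reference, not called here."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    return proxy.weblogUpdates.ping(name, url)


# Build the payload locally so nothing is sent anywhere.
payload = build_ping_request(BLOG_NAME, BLOG_URL)
```

Printing payload shows a small methodCall document carrying just the blog’s name and URL. Everything that follows in the logs is the services reacting to that.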
Clicking the Publish button gets things rolling:
127.0.0.1 - - [24/Feb/2008:15:32:02 +0000] "GET /wp-admin/post-new.php HTTP/1.1" 200 16513 "http://blog.jules.com/wp-admin/post.php?action=edit&post=***" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en; rv:1.8.1.12) Gecko/20080206 Camino/1.5.5"
Who’s this a fraction of a second later?
67.207.146.36 - - [24/Feb/2008:15:32:02 +0000] "GET /wp-cron.php?check=*** HTTP/1.0" 200 - "-" "-"
Oh, it’s the local WordPress installation calling wp-cron. Oopsie. wp-cron is a low-rent version of Unix cron.
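To show what “low-rent” means here: real cron is driven by the clock, while wp-cron only gets a chance to run when something hits the site. The sketch below illustrates that idea in Python. It is not how WordPress implements it, and the class and method names are invented for the example.

```python
import time


class PseudoCron:
    """Runs due jobs only when something calls tick(), e.g. on a page request."""

    def __init__(self):
        self.jobs = []  # list of (run_at_timestamp, callable)

    def schedule(self, run_at, job):
        """Register a job to run at or after the given timestamp."""
        self.jobs.append((run_at, job))

    def tick(self, now=None):
        """Run every job whose time has passed and return how many ran."""
        now = time.time() if now is None else now
        due = [(t, job) for t, job in self.jobs if t <= now]
        self.jobs = [(t, job) for t, job in self.jobs if t > now]
        for _, job in due:
            job()
        return len(due)
```

If nobody visits, tick() never fires and the jobs simply wait. That is the trade-off against a real cron daemon.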
Fast forward about eight seconds and it looks like the first across the line is Moreover:
72.13.32.7 - - [24/Feb/2008:15:32:10 +0000] "GET /feed/ HTTP/1.0" 200 3385 "-" "Moreoverbot/5.00 (+http://www.moreover.com)"
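If you want to pick lines like these apart yourself, a regular expression does the job. This sketch assumes the log is in what looks like Apache’s combined format (client IP, timestamp, request, status, size, referer, user agent). The pattern and field names are my own, so adjust them if your LogFormat differs.

```python
import re
from datetime import datetime

# One named group per field of a combined-format access log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache logs "-" when no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


sample = (
    '72.13.32.7 - - [24/Feb/2008:15:32:10 +0000] "GET /feed/ HTTP/1.0" '
    '200 3385 "-" "Moreoverbot/5.00 (+http://www.moreover.com)"'
)
record = parse_line(sample)
```

With the timestamps parsed into datetime objects, working out who arrived how many seconds after the post went live is a subtraction.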
It’s a mind-bending 46 seconds after publishing and it looks like everybody has the message. Here’s the Googlebot, who does the polite thing first and asks for a robots.txt file.
66.249.70.194 - - [24/Feb/2008:15:32:48 +0000] "GET /robots.txt HTTP/1.1" 301 26 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.70.194 - - [24/Feb/2008:15:32:48 +0000] "GET /robots.txt/ HTTP/1.1" 200 50 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.70.194 - - [24/Feb/2008:15:32:48 +0000] "GET /feed/atom/ HTTP/1.1" 200 3321 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.70.194 - - [24/Feb/2008:15:32:49 +0000] "GET / HTTP/1.1" 200 4650 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Now here’s Yahoo’s Feed Seeker.
216.39.58.18 - - [24/Feb/2008:15:32:49 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide)"
216.39.58.18 - - [24/Feb/2008:15:32:49 +0000] "GET /robots.txt/ HTTP/1.0" 200 24 "http://blog.jules.com/robots.txt" "YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide)"
209.131.41.49 - - [24/Feb/2008:15:32:57 +0000] "GET / HTTP/1.0" 200 14023 "-" "YahooFeedSeeker/2.0 (compatible; Mozilla 4.0; MSIE 5.5; http://publisher.yahoo.com/rssguide)"
Next is the dubiously useful Snapbot:
38.98.19.67 - - [24/Feb/2008:15:37:54 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.66 - - [24/Feb/2008:15:37:54 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.68 - - [24/Feb/2008:15:37:54 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.67 - - [24/Feb/2008:15:37:55 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.66 - - [24/Feb/2008:15:37:55 +0000] "GET /feed/atom/ HTTP/1.0" 200 10134 "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.68 - - [24/Feb/2008:15:37:58 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.67 - - [24/Feb/2008:15:37:58 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.68 - - [24/Feb/2008:15:37:59 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.66 - - [24/Feb/2008:15:38:00 +0000] "GET /robots.txt HTTP/1.0" 301 - "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
38.98.19.66 - - [24/Feb/2008:15:38:01 +0000] "GET / HTTP/1.0" 200 14023 "-" "Snapbot/1.0 (Snap Shots, +http://www.snap.com)"
What the heck is Snapbot doing? Why does it need to ask eight times for the same file and never follow the redirect it gets back? Sounds like Snapbot needs an update to 1.0.1.
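Rather than counting by eye, the Snapbot lines can be tallied with a few lines of Python. This is only a sketch: the tuples below are the client IP, path and status copied by hand from the excerpt above, and the function name is mine.

```python
from collections import Counter

# (client IP, requested path, status) copied from the Snapbot lines above.
snapbot_requests = [
    ("38.98.19.67", "/robots.txt", 301),
    ("38.98.19.66", "/robots.txt", 301),
    ("38.98.19.68", "/robots.txt", 301),
    ("38.98.19.67", "/robots.txt", 301),
    ("38.98.19.66", "/feed/atom/", 200),
    ("38.98.19.68", "/robots.txt", 301),
    ("38.98.19.67", "/robots.txt", 301),
    ("38.98.19.68", "/robots.txt", 301),
    ("38.98.19.66", "/robots.txt", 301),
    ("38.98.19.66", "/", 200),
]


def tally(requests):
    """Count requests per (path, status) pair and per client IP."""
    by_path = Counter((path, status) for _, path, status in requests)
    by_ip = Counter(ip for ip, _, _ in requests)
    return by_path, by_ip


by_path, by_ip = tally(snapbot_requests)

# A well-behaved client would have asked for the redirect target at least once.
followed_redirect = any(path == "/robots.txt/" for _, path, _ in snapbot_requests)
```

Looking at by_path and by_ip makes the pattern plain: three machines hammering the same URL, and not one request for the address the 301 pointed them to.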
When you ask WordPress for robots.txt you get sent along to robots.txt/ with a trailing slash (the response also advertises xmlrpc.php in its X-Pingback header):
HTTP/1.1 301 Moved Permanently
Date: Sun, 24 Feb 2008 16:39:43 GMT
Server: Apache
X-Pingback: http://blog.jules.com/xmlrpc.php
Location: http://blog.jules.com/robots.txt/
Content-Type: text/html; charset=UTF-8
301 is the HTTP status code for a permanent redirect. It’s used to point old URLs at new ones and to do the right thing when accessing directories. URLs can change and it’s an elegant way to deal with the problem. When you get this response back from a Web server, your side should parse the headers for a Location field and request that URL instead. Notice the trailing slash WordPress puts on that line; more on that in a second.
For example, if you ask Apache for the archives of Feb ’08 at http://blog.jules.com/2008/02 it’ll do the right thing and add a trailing slash:
HTTP/1.1 301 Moved Permanently
Date: Sun, 24 Feb 2008 16:57:52 GMT
Server: Apache
X-Pingback: http://blog.jules.com/xmlrpc.php
Location: http://blog.jules.com/2008/02/
Content-Type: text/html; charset=UTF-8
See the Location field?
Back when all Web sites were served from a filesystem it was important to note what was a directory and what was not. The former would have a slash and the latter would not. robots.txt is usually a file and this might be where Snap is coming unstuck.
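Handling the redirect is not hard. Here is a minimal sketch of what a crawler could do with the response shown earlier: read the status line, look for a Location header, and work out the next URL to fetch. The function names are mine, and a real client would also cap the number of hops and deal with folded or repeated headers.

```python
from urllib.parse import urljoin


def parse_response_head(raw):
    """Split a raw HTTP response head into (status_code, headers dict)."""
    lines = raw.strip().splitlines()
    status = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise them.
        headers[name.strip().lower()] = value.strip()
    return status, headers


def next_url(requested_url, raw):
    """Return the URL to fetch next, or None if no redirect was requested."""
    status, headers = parse_response_head(raw)
    if status in (301, 302, 307, 308) and "location" in headers:
        # urljoin also copes with relative Location values.
        return urljoin(requested_url, headers["location"])
    return None


response = """HTTP/1.1 301 Moved Permanently
Date: Sun, 24 Feb 2008 16:39:43 GMT
Server: Apache
X-Pingback: http://blog.jules.com/xmlrpc.php
Location: http://blog.jules.com/robots.txt/
Content-Type: text/html; charset=UTF-8"""

target = next_url("http://blog.jules.com/robots.txt", response)
```

Here target comes out as the trailing-slash URL from the Location line, which is the request Googlebot and BlogPulse make and Snapbot does not.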
Ok, blood pressure is going down and here’s Google Feedfetcher (which is different from Googlebot above):
209.85.238.7 - - [24/Feb/2008:15:38:02 +0000] "GET /feed/ HTTP/1.1" 200 3385 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)"
Talking of Googlebot, he’s back for the post itself almost six minutes later:
66.249.70.194 - - [24/Feb/2008:15:38:29 +0000] "GET /2008/02/24/the-thing-about-transatlantic-first-class/ HTTP/1.1" 200 4550 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BlogPulse is next in line:
64.158.138.84 - - [24/Feb/2008:15:38:37 +0000] "GET /robots.txt HTTP/1.1" 301 26 "-" "BlogPulseLive (support@blogpulse.com)"
64.158.138.84 - - [24/Feb/2008:15:38:37 +0000] "GET /robots.txt/ HTTP/1.1" 200 50 "-" "BlogPulseLive (support@blogpulse.com)"
64.158.138.84 - - [24/Feb/2008:15:38:37 +0000] "GET /feed/ HTTP/1.1" 200 3385 "-" "BlogPulseLive (support@blogpulse.com)"
BlogPulse does the right thing with the 301 redirect (Oh Snap!).
Googlebot just can’t get enough of this article so here he is again a smidge over two minutes from the last request:
66.249.70.194 - - [24/Feb/2008:15:40:42 +0000] "GET /2008/02/24/the-thing-about-transatlantic-first-class/ HTTP/1.1" 200 4550 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Technorati joins the party:
208.66.64.4 - - [24/Feb/2008:15:41:19 +0000] "GET / HTTP/1.0" 200 4650 "-" "Technoratibot/0.7"
208.66.64.4 - - [24/Feb/2008:15:41:19 +0000] "GET /feed/ HTTP/1.0" 200 3385 "-" "Technoratibot/0.7"
208.66.64.4 - - [24/Feb/2008:15:41:20 +0000] "GET /feed/rss/ HTTP/1.0" 200 1235 "-" "Technoratibot/0.7"
208.66.64.4 - - [24/Feb/2008:15:41:20 +0000] "GET /feed/atom/ HTTP/1.0" 200 3321 "-" "Technoratibot/0.7"
Here’s a request from something mysteriously tagged Java/1.6.0_02:
67.202.34.117 - - [24/Feb/2008:15:42:41 +0000] "GET / HTTP/1.1" 200 14023 "-" "Java/1.6.0_02"
Some on-demand application up at Amazon’s EC2 farm is, well, not doing a whole lot. Just a simple HTTP request for the root of the site. Doesn’t touch the feed.
I get a similar request every four-ish hours from a user agent of ONNET-OPENAPI and they all stem from 211.239.119.213. No reverse mapping is assigned but the IP maps back to APNIC. International robot of mystery and all that.
Finally, Technorati is back for more:
208.66.64.4 - - [24/Feb/2008:16:00:12 +0000] "GET / HTTP/1.0" 200 4650 "-" "Technoratibot/0.7"
208.66.64.4 - - [24/Feb/2008:16:00:13 +0000] "GET /feed/ HTTP/1.0" 200 3385 "-" "Technoratibot/0.7"
208.66.64.4 - - [24/Feb/2008:16:00:13 +0000] "GET /feed/rss/ HTTP/1.0" 200 1235 "-" "Technoratibot/0.7"
208.66.64.4 - - [24/Feb/2008:16:00:13 +0000] "GET /feed/atom/ HTTP/1.0" 200 3321 "-" "Technoratibot/0.7"
That’s when the excitement dies down, about half an hour after publishing. Googlebot and Yahoo spend a bit more time crawling around for the updated categories and pages, and the rest of the visits are from people who have come to read what’s on offer.
Verifying the IPs that came to visit.
- 72.13.32.7 was Moreover and doesn’t reverse lookup. The netblock belongs to VeriSign Infrastructure & Operations;
- 66.249.70.194, the Googlebot, points back to crawl-66-249-70-194.googlebot.com;
- 216.39.58.18 is Yahoo’s Feed Seeker and resolves to htproxy2.ops.re4.yahoo.net;
- Snap had no reverse lookup on 38.98.19.66 but a whois search shows the netblock is registered to PSI.net;
- 209.85.238.7 doesn’t reverse map but whois shows the IP range is Google’s;
- BlogPulse came from 64.158.138.84 which is floodgate.intelliseek.com and the netblock belongs to Level 3 Communications, Inc;
- 208.66.64.4 is nat-365m.technorati.com and as you can imagine that’s Technorati; and
- The mystery Java-based admirer came from 67.202.34.117 and the lookup on that is ec2-67-202-34-117.compute-1.amazonaws.com.
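The reverse lookups above can be scripted. This sketch wraps Python’s socket.gethostbyaddr and adds a forward check, since a PTR record on its own can claim to be anybody. The function names are mine, and the resolver parameters exist only so the functions can be exercised without touching the network.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the PTR hostname for ip, or None if there is no reverse mapping."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


def confirms_forward(ip, hostname, forward=socket.gethostbyname_ex):
    """Check that hostname resolves back to ip (forward-confirmed reverse DNS)."""
    try:
        _name, _aliases, addresses = forward(hostname)
    except (socket.herror, socket.gaierror):
        return False
    return ip in addresses
```

Calling reverse_lookup on each address in the list gives either a hostname or None. For the None cases, such as Moreover and Snap, whois on the netblock is the fallback.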
Posted: February 25th, 2008 under It's technical.
Comments: 2
Comments
Comment from suesonday
Time: March 12, 2008, 6:48 pm
Hey Jules,
Would love to hear your take on Adobe Air vs. serving apps/widgets through iGoogle or Facebook.
- Sue
Comment from jules
Time: March 12, 2008, 6:53 pm
I think that’s a brilliant idea! Thanks, Sue.