Important issues for WWW forums

**Oren Grad** · 23-Dec-2005, 18:37

Kirk, the new website is definitely up - nice and easy to get around. The "articles" link brings up the menu of articles, but the individual article links don't work. Ditto with the resume link from the biography page.

**Kirk Gittings** · 23-Dec-2005, 18:53

Oren,

Odd, I still get my old site even when I clear the cache and history.

**Frank Petronio** · 23-Dec-2005, 18:56

It's 404ing the articles, Kirk.

**Paddy Quinn** · 23-Dec-2005, 19:02

"But he's a bastard and that is something I will comfortably say to the g@d@mn tratiors at the New York Times."

Well, at least you can't accuse them of being pinko liberal left-wing and anti-administration - after sitting on the story of Presidential law-breaking for a whole year.

**robc** · 23-Dec-2005, 19:53

' You can prevent any website that you control from being "crawled" '

Oren, this is not actually correct. What you can do is add code that asks well-behaved crawlers not to crawl or archive your site. You can also block the IP addresses of known crawlers, but you cannot stop crawlers that are unknown to your code, or badly behaved crawlers that use constantly changing (false) IP addresses, from crawling your site. Many web archive sites are badly behaved. Google happens to be one of the few well-behaved crawlers.

The only effective way to ensure your site is not crawled is to password-protect it. For most web sites this is self-defeating.
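To make the "asking" part concrete, here is a minimal sketch of how a well-behaved crawler might consult a site's robots.txt before fetching a page. It uses Python's standard `urllib.robotparser`. The robots.txt contents, the crawler names, and the example.com URLs are all made up for illustration. Nothing here enforces anything: a badly behaved crawler simply never runs a check like this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: one named crawler is asked to
# stay out entirely, everyone else is asked to skip /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt asks nothing that stops user_agent fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download them
    # from http://<site>/robots.txt first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/private/notes.html"),
    ]:
        print(agent, url, allowed(agent, url))
```

The check runs entirely on the crawler's side, which is why it only works for crawlers that choose to honour it.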

**Ben Diss** · 23-Dec-2005, 20:01

Kirk- I read your exchange on Mark Justice Hinton's site but I can't see where I can buy your book. Oh wait, here it is:

http://www.amazon.com/gp/product/offer-listing/0826312780/ref=dp_olp_2//103-6467018-9327028?condition=all
http://www.amazon.com/gp/product/offer-listing/0826312772/ref=lp_g_1/103-6467018-9327028?%5Fencoding=UTF8

(snicker, snicker) ...and yes, I bought one.

-Ben

**Oren Grad** · 23-Dec-2005, 20:06

"Oren, this is not actually correct. What you can do is put code to effectively ask well behaved crawlers not to crawl or archive your site."

OK. I know that Google and the Internet Archive ("Wayback Machine") are well-behaved in this sense, but I don't know as much about malicious sites. Thanks for the correction. Just for my own education, off the top of your head can you point to any specific archive sites that are ill-behaved in this way? I'm curious as to exactly what they're doing with the information. Harvesting email addresses for spam?

**Kirk Gittings** · 23-Dec-2005, 20:21

Thank you Ben!

Frank, sorry I have no idea what that means.

**Oren Grad** · 23-Dec-2005, 21:08

Kirk, "404" is just the code for the "page cannot be found" screen that you get when a link doesn't lead anywhere.
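For anyone who wants to check their own links, here is a rough sketch of reading that code directly, using only Python's standard library. The function names `link_status` and `describe` are made up for this example, and the URL in the usage section is a placeholder to be replaced with a real page address.

```python
from http import HTTPStatus
from urllib.error import HTTPError
from urllib.request import urlopen


def link_status(url):
    """Fetch url and return the numeric HTTP status code the server sent."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx replies; the code is on the error.
        return err.code


def describe(code):
    """Turn a numeric status code into its standard phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


if __name__ == "__main__":
    # Placeholder URL: substitute the address of the link being checked.
    print(describe(link_status("http://example.com/")))
```

A link that "404s" is one where `link_status` returns 404, which `describe` renders as "404 Not Found".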

**robc** · 23-Dec-2005, 22:20

Off the top of my head? No. Last time I looked, which was quite a while ago, I found stuff I didn't expect to find, as I had used noarchive on some pages. I just decided it wasn't worth worrying about. The big boys seem to be quite well behaved and the rest are quite insignificant.

For more info you can look at:

http://www.robotstxt.org/wc/robots.html
http://searchenginewatch.com/

A quick look at my stats shows that approximately 200 different robots have visited my site over the last year. What they are all doing with the info they extract, I have no idea. Some may be feeding search engines, others analysis of some kind, others archives. None are harvesting email for spam, because my email address doesn't exist anywhere on my web site.
