  1. #1
    Regular Coder
    Join Date
    Oct 2003
    Posts
    603
    Thanks
    2
    Thanked 1 Time in 1 Post

    spider detection?

    Is there a way to detect whether a visitor is a search engine spider (Googlebot, Inktomi, etc.)? I want to make certain pages available for the bots to index that I don't want guests or non-logged-in premium members to access.

  2. #2
    Regular Coder
    Join Date
    Aug 2004
    Location
    The US of A
    Posts
    767
    Thanks
    1
    Thanked 0 Times in 0 Posts
    That makes no sense. You want spiders to be allowed to crawl it, but only premium members to see the actual content?

  3. #3
    Senior Coder
    Join Date
    Jun 2002
    Location
    UK
    Posts
    1,137
    Thanks
    0
    Thanked 0 Times in 0 Posts
    By doing that, you enable anyone with a noggin (a simple flesh contraption in the skull, aka a brain) to Google the URL or a similar one and view the cached copy of the protected page.

    scroots
    Spammers next time you spam me consider the implications:
    (1) that you will be persuaded by me (in a legitimate manner)
    (2)It is worthless to you, when i have finished

  4. #4
    Regular Coder
    Join Date
    Oct 2003
    Posts
    603
    Thanks
    2
    Thanked 1 Time in 1 Post
    nvm...
    Last edited by boeing747fp; 12-23-2004 at 11:38 PM.

  5. #5
    Regular Coder
    Join Date
    Nov 2004
    Location
    Somewhere over the rainbow
    Posts
    208
    Thanks
    0
    Thanked 0 Times in 0 Posts
    so how do all those sites do it?

    there are plenty of news sites where you can google the content but require registration to view...
    Questions are what binds the universe,
    Questions lead to answers,
    Answers lead to knowledge,
    Knowledge leads to wisdom,
    Wisdom brings more questions -- Horus Kol, 2004 :)

  6. #6
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    17,025
    Thanks
    4
    Thanked 2,668 Times in 2,637 Posts
    Yeah, I'm not certain how Google really gets all of its information. Presumably it's through meta tags, which would be a simple task in PHP then. You can just tell it what you want and send it on its way.

    As for Google, don't search engine spiders show up differently in the user agent? I know I've seen it before; it shows up as something like google.bot or google.spider.
    So, in regard to your actual question, as opposed to asking why, I would assume the answer is yes.

  7. #7
    Regular Coder
    Join Date
    Nov 2004
    Location
    Somewhere over the rainbow
    Posts
    208
    Thanks
    0
    Thanked 0 Times in 0 Posts
    according to Google, meta tags are looked at for keywords and stuff, but they rely heavily on the actual content and structure (<h> tags and such) to "be more precise" in their results...

  8. #8
    Senior Coder
    Join Date
    Aug 2003
    Location
    One step ahead of you.
    Posts
    2,815
    Thanks
    0
    Thanked 3 Times in 3 Posts
    You just check the IP. If the IP belongs to Google's spider, you let it through; otherwise you don't.
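    One way to do the IP check without keeping a hard-coded list of addresses is a forward-confirmed reverse DNS lookup. The Python sketch below is only an illustration under stated assumptions: the function name `is_verified_googlebot`, the injectable lookup parameters, and the accepted hostname suffixes (`.googlebot.com`, `.google.com`) are my own choices, so check the crawler operator's current guidance before relying on them.

```python
import socket

# Hostname suffixes assumed to identify Google's crawler hosts.
# This tuple is an assumption for illustration; verify it before relying on it.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a visitor's IP address.

    The lookup functions are injectable so the logic can be exercised
    without touching the network; by default they use the socket module.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        # Step 1: the IP must resolve to a hostname in an accepted domain.
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: that hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) mean "not verified".
        return False
```

    The second step matters because anyone who controls the reverse DNS for their own address range can make it claim any hostname; only the forward lookup is controlled by the owner of the domain.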

  9. #9
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    17,025
    Thanks
    4
    Thanked 2,668 Times in 2,637 Posts
    Quote Originally Posted by Horus Kol
    according to Google, meta tags are looked at for keywords and stuff, but they rely heavily on the actual content and structure (<h> tags and such) to "be more precise" in their results...
    With this one, then: does the spider only go through pages with meta tags attached to them, or are the tags just there to help it out? Normally PHP could set up an easy block for spiders when you don't want them to retrieve something, but if just leaving the meta tags out will do, that would be more useful. Anyone know if this is the case or not?

  10. #10
    Senior Coder
    Join Date
    Jun 2002
    Location
    near Oswestry
    Posts
    4,508
    Thanks
    0
    Thanked 0 Times in 0 Posts
    webmasterworld.com does this trick - you search Google, and its threads come up in the search results, but when you go to the link it wants you to log in.

    It's actually incredibly simple - all they do is sniff for the Googlebot's UA string. So all you have to do is set your browser to identify as Google, and presto - you can see the whole site. Here's the UA string:
    Code:
    Googlebot/2.1 (+http://www.googlebot.com/bot.html)
    Ironically, if you Google for Googlebot user-agent string, the top answer is from ... webmasterworld!

    I don't know of any other sites that pull this trick ... but I regard it as a highly dubious practice, so I have no compunction about revealing how it works. Anyone else got some links? Be interesting to know how widely this is done.
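    The user-agent sniffing described above can be sketched roughly as follows. This is a minimal Python sketch, not what webmasterworld actually runs: the function name `looks_like_spider` and the token list (Googlebot, msnbot, Slurp for Inktomi/Yahoo, ia_archiver for Alexa) are illustrative assumptions. Since the header is trivially spoofed, as noted above, it should not be the only thing guarding protected content.

```python
# Substrings that commonly appear in crawler User-Agent headers.
# The list is an illustrative assumption, not an exhaustive registry.
KNOWN_BOT_TOKENS = ("googlebot", "msnbot", "slurp", "ia_archiver")


def looks_like_spider(user_agent):
    """Return True if the User-Agent header contains a known crawler token."""
    # A missing header is treated as an ordinary (non-bot) visitor.
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)


if __name__ == "__main__":
    # The header value would normally come from the incoming HTTP request.
    print(looks_like_spider("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
    print(looks_like_spider("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```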
    "Why bother with accessibility? ... Because deep down you know that the web is attractive to people who aren't exactly like you." - Joe Clark

  11. #11
    Regular Coder
    Join Date
    Oct 2003
    Posts
    603
    Thanks
    2
    Thanked 1 Time in 1 Post
    What about the others like MSN Bot, Inktomi (Yahoo), Alexa, etc.? Do they have specific HTTP_USER_AGENT strings?

  12. #12
    Regular Coder
    Join Date
    Aug 2002
    Location
    Oregon, United States of America
    Posts
    882
    Thanks
    1
    Thanked 9 Times in 9 Posts
    Here is the answer to all of your questions... Enjoy.

    http://www.wsabstract.com/howto/robots.shtml

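    Create a file named 'robots.txt' and write this:
    Code:
    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /Page1.html
    Allow: /Page2.html
    Allow: /Page3.html
    Disallow: /
    That will stop all well-behaved bots except Google's from crawling any pages, and lets Googlebot crawl only the pages you list. The closing Disallow in the Googlebot group is needed because a bot follows only the group that matches it most specifically, so without it Googlebot would be allowed everywhere. Note that robots.txt is only a request to crawlers; it does not hide pages from human visitors.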
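    To see how a parser reads rules like these, here is a small Python sketch using the standard library's `urllib.robotparser`. It assumes the Googlebot group ends with `Disallow: /` so that unlisted pages stay blocked; the helper name `build_parser` and the path `/Secret.html` are made up for illustration. Real crawlers may resolve Allow/Disallow conflicts differently from this parser, which simply takes the first matching rule in a group.

```python
from urllib.robotparser import RobotFileParser

# Rules under test: everyone is blocked, Googlebot may fetch three pages only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /Page1.html
Allow: /Page2.html
Allow: /Page3.html
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    checks = [
        ("Googlebot", "/Page1.html"),
        ("Googlebot", "/Secret.html"),
        ("SomeOtherBot", "/Page1.html"),
    ]
    for agent, path in checks:
        print(agent, path, rules.can_fetch(agent, path))
```

    Keeping the Allow lines ahead of the Disallow line means first-match parsers like this one and longest-match parsers reach the same answer.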

    You can also put lines in your meta tags, telling it to not index your page, or to index it, but not follow the links.
    Code:
    <meta name="robots" content="index,nofollow">
    Here is a list of bot names for the robots.txt
    http://www.jafsoft.com/searchengines...ots_and_others
    Last edited by Ultragames; 12-25-2004 at 12:59 AM.
    If I'm postin here, I NEED YOUR HELP!!


 
