Have you ever wanted to know how Google's web crawlers see your web pages? Curious about what happens when Google requests a web page? If so, let's learn about the crawling process of the Google search engine.
Since Google is a search engine for many different media types, it also has different crawlers for different purposes.
For general web search, the crawler is Googlebot, and you can allow it on your website knowing that it will honor the directives you place in your robots.txt file. For example:
User-agent: Googlebot
Disallow:
The above instructs Googlebot that it can crawl your entire website (think of it as saying "I disallow you nothing"). But what if you want to tell Google that certain parts of your website shouldn't be crawled? Then you would simply disallow the file or folder path like this:
User-agent: Googlebot
Disallow: /foldernametonotcrawl/
Disallow: /thank-you-page.html
Names of Google's Crawlers
Crawler | User Agent Token | Full user agent string (as seen in website log files)
---|---|---
Googlebot (Google Web search) | Googlebot | Mozilla/5.0 (compatible; Googlebot/2.1;) or (rarely used): Googlebot/2.1
Googlebot Images | Googlebot-Image | Googlebot-Image/1.0
Googlebot Video | Googlebot-Video | Googlebot-Video/1.0
Googlebot News | Googlebot-News | Googlebot-News
Google Mobile (feature phone) | Googlebot-Mobile | [various mobile device types] (compatible; Googlebot-Mobile/2.1;)
Google Smartphone | Googlebot | [various smartphone device types] (compatible; Googlebot/2.1;)
Google AdSense | Mediapartners-Google | Mediapartners-Google
Google Mobile AdSense | Mediapartners-Google | [various mobile device types] (compatible; Mediapartners-Google/2.1;)
Google AdsBot landing page quality check | AdsBot-Google | AdsBot-Google
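These user agent tokens can be combined with the robots.txt directives shown earlier. As a rough sketch (the folder name /private-images/ is just a placeholder), the following would let the main Googlebot crawl everything while keeping Googlebot-Image out of one folder:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /private-images/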
Understanding the Difference Between Google Crawling and Indexing Web Pages
You can use robots.txt directives to disallow Google from accessing certain parts of your website. However, if Google can still discover those URLs (perhaps through your internal linking structure, or through external backlinks), it may still index them even though you disallowed them with robots.txt directives.
If this has already occurred for some of your web pages, first remove the robots.txt directives for those URLs, because robots.txt only controls crawling and NOT indexing; Google needs to be able to crawl a page in order to see a noindex instruction on it.
Knowing that, if you want to stop Google from indexing certain web pages on your site, use a noindex meta tag like this:
<head>
<meta name="googlebot" content="noindex">
</head>
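Note: the name="googlebot" attribute in the example above targets Google's crawler specifically; if you want the same instruction to apply to all search engine crawlers that support it, the generic form is <meta name="robots" content="noindex">.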
For the WordPress CMS, you can use this format for controlling indexing of certain pages:
<head>
<?php if ( is_page('PageName') ) : ?>
<meta name="googlebot" content="noindex">
<?php endif; ?>
</head>
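In a typical WordPress theme, this conditional would go in the header.php template inside the existing head section; 'PageName' is simply a placeholder for the title, slug or ID of the page you want kept out of Google's index (is_page() accepts any of these).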
IMPORTANT: use the noindex directive only on web pages that you do NOT want Google to index. For example, if the web page I don't want Google to index is named samplewebpage.html, then I would place the above code only on that page and not on others. If you get this wrong by setting noindex on all your web pages, your entire website can be de-indexed by Google.
Here's a Video Lesson That Explains Google's Crawling Process
At the end of the day, whether your website has a small number of pages or is a medium to large sized site, using robots.txt directives coupled with XML sitemaps and meta tags for indexation control will give you a better optimized website.
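Since XML sitemaps are mentioned above but not shown, here is a minimal sketch of what one could look like (the URL is just a placeholder for one of your own pages that you do want indexed):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/an-important-page.html</loc>
  </url>
</urlset>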