SEO: Manage crawling, indexing with robots exclusion protocols


Indexing is the precursor to ranking in organic search. But there are pages you do not want search engines to index and rank. That is where the robots exclusion protocol (REP) comes into play.

REP can exclude and include search engine crawlers. Thus it is a way to block bots or welcome them – or both. REP encompasses technical tools such as the robots.txt file, XML sitemaps, and metadata and header directives.

However, keep in mind that crawlers' compliance with REP is voluntary. Good bots, such as those from the major search engines, comply.

Unfortunately, bad bots do not. Examples are scrapers that collect information for republication on other websites. Your developer should block bad bots at the server level.

The robots exclusion protocol was created in 1994 by Martijn Koster, founder of three early search engines, who was frustrated by the strain crawlers placed on his site. In 2019, Google proposed REP as an official internet standard.

Each REP method has features, strengths and weaknesses. You can use them individually or in combination to achieve crawl goals.

robots.txt

Walmart.com's robots.txt file disallows bots from accessing many pages on its site.

The robots.txt file is the first page good bots visit on a website. It resides in the same place and has the same name ("robots.txt") on every site, as in site.com/robots.txt.

Use the robots.txt file to ask bots to avoid specific sections or pages of your site. When good bots encounter these requests, they typically comply.

For example, you can specify pages that bots should ignore, such as shopping cart pages, thank-you pages, and user profiles. But you can also request that bots crawl specific pages within an otherwise blocked section.

In its simplest form, a robots.txt file contains only two elements: a user agent and a directive. Most sites want to be indexed. So the most common robots.txt file contains:

User-agent: *
Disallow:

The asterisk is a wildcard meaning "all," indicating in this example that the directive applies to all bots. The empty Disallow directive means that nothing is disallowed.

You can limit the user agent to specific bots. For example, the following file would block Googlebot from crawling the entire site, resulting in an inability to rank in organic search.

User-agent: googlebot
Disallow: /

You can add as many Disallow and Allow lines as needed. The following example robots.txt file requests that Bingbot not crawl any pages in the /user-account directory except the user's log-in page.

User-agent: bingbot
Disallow: /user-account*
Allow: /user-account/log-in.htm

You can also use robots.txt files to request crawl delays when bots hit pages on your site too quickly and affect server performance.
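For example, a request to wait 10 seconds between fetches would look like the sketch below. Support varies by engine – Bing honors the Crawl-delay directive, while Google ignores it – so treat this as a hint, not a guarantee.

User-agent: bingbot
Crawl-delay: 10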

Each site protocol (HTTPS, HTTP), domain (site.com, mysite.com) and subdomain (www, store, no subdomain) requires its own robots.txt file – even if the content is the same. For example, the robots.txt file on https://shop.site.com does not work for content hosted on http://www.site.com.

When changing the robots.txt file, always test it with the robots.txt testing tool in Google Search Console before pushing it live. The robots.txt syntax is confusing, and mistakes can be disastrous for your organic search performance.

More information about the syntax is available at Robotstxt.org.

XML Sitemaps

Apple.com's XML Sitemap contains references to the pages Apple wants bots to crawl.

Use an XML sitemap to notify search engine crawlers of your most important pages. Once they have checked the robots.txt file, the crawlers' second stop is your XML sitemap. A sitemap can have any name, but it usually resides at the root of the site, such as site.com/sitemap.xml.

In addition to a version declaration and an opening and closing urlset tag, XML sitemaps should contain <url> and <loc> tags that identify each URL bots should crawl, as shown in the image above. Other tags can identify the page's last modification date, change frequency, and priority.
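A minimal sitemap entry – with a hypothetical URL and an optional lastmod tag – looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.site.com/widgets.htm</loc>
    <lastmod>2021-05-01</lastmod>
  </url>
</urlset>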

XML Sitemaps are simple. But remember three critical things.

  • Only link to canonical URLs – the ones you want to rank, as opposed to duplicate-content URLs.
  • Update your Sitemap files as often as you can, preferably with an automated process.
  • Keep the file size below 50MB and the URL counts below 50,000.

XML Sitemaps are easy to forget. It is common for Sitemaps to contain legacy URLs or duplicate content. Check their accuracy at least quarterly.

Many e-commerce sites have more than 50,000 URLs. In these cases, create multiple XML sitemap files and link to all of them in a Sitemap index. The index itself can link to 50,000 Sitemaps each with a maximum size of 50 MB. You can also use gzip compression to reduce the size of each sitemap and index.
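A sitemap index uses the same XML structure, swapping urlset for sitemapindex. A minimal sketch with hypothetical file names:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.site.com/sitemap-products-1.xml.gz</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.site.com/sitemap-products-2.xml.gz</loc>
  </sitemap>
</sitemapindex>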

XML sitemaps can also reference images and videos to optimize image search and video search.
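For example, Google's image sitemap extension attaches images to a URL entry through an additional namespace – the file names here are illustrative:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.site.com/widgets.htm</loc>
    <image:image>
      <image:loc>https://www.site.com/images/widget.jpg</image:loc>
    </image:image>
  </url>
</urlset>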

Bots do not know what you have named your XML sitemap. So include the sitemap URL in your robots.txt file, and also submit it to Google Search Console and Bing Webmaster Tools.
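The robots.txt reference is a single line listing the sitemap's full URL – site.com here is a placeholder:

Sitemap: https://www.site.com/sitemap.xml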

For more information on XML Sitemaps and their similarities to HTML Sitemaps, see "SEO: HTML, XML Sitemaps Explained."

For more information about XML sitemap syntax and requirements, see Sitemaps.org.

Metadata and header directives

Robots.txt files and XML sitemaps typically exclude or include many pages at once. REP metadata works at the page level, either in a meta tag in the head of the HTML code or as part of the HTTP response the server sends with an individual page.

Lululemon's shopping cart page uses a meta robots tag to direct search engine crawlers not to index the page or pass link authority through its links.

The most common REP attributes include:

  • Noindex. Do not index the page on which the directive is located.
  • Nofollow. Do not send link authority from the links on the page.
  • Follow. Pass link authority from the links on the page, even if the page is not indexed.

When used in a meta robots tag, the syntax looks like this:
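<meta name="robots" content="noindex, nofollow">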

Although applied at the page level – affecting one page at a time – the meta robots tag can be scaled via a template, which then places the tag on each page.

The nofollow attribute in an anchor tag stops the flow of link authority, as in this example (the link destination is illustrative):

<a href="/shopping-bag" rel="nofollow">Shopping Bag</a>

The meta robots tag resides in a page's source code. However, its directives can apply to non-HTML file types, such as PDFs, by using them in the HTTP response. This method sends the robots directive as part of the server's response when the file is requested.

When used in the server's HTTP header, the command looks like this:

X-Robots-Tag: noindex, nofollow

Like meta robots tags, the X-Robots-Tag directive applies to individual files. But it can apply to multiple files – for example, all PDFs or all files in a single directory – via your site's root .htaccess or httpd.conf file on Apache, or the .conf file on Nginx.
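On Apache, for instance, a sketch of the directory-wide approach – assuming the mod_headers module is enabled – could look like this in the .htaccess file:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>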

For a complete list of robots attributes and sample code snippets, see Google's developer site.

A crawler must access a file to detect a robots directive. Consequently, while the indexing-related attributes can be effective at limiting indexing, they do nothing to preserve your site's crawl budget.

If you have many pages with noindex directives, a robots.txt disallow would do a better job of blocking the crawl to preserve your crawl budget. However, search engines are slow to deindex content via a robots.txt disallow if the content is already indexed.

If you need to deindex the content and also restrict bots from crawling it, start with a noindex attribute (to deindex) and then add a disallow to the robots.txt file to prevent crawlers from accessing it going forward.
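As a sketch, using a hypothetical /member-profiles/ directory, the two steps would look like this.

Step 1, on each page (to deindex):

<meta name="robots" content="noindex">

Step 2, in robots.txt (once the pages have dropped out of the index):

User-agent: *
Disallow: /member-profiles/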


