eBanners.Ru Forum: Internet advertising and website promotion - a forum on internet advertising
A common robots.txt standard agreed by Google, Yahoo and Live Search (MSN) - Search engine optimization and website promotion in search engines - internet advertising
Tuesday, June 03, 2008 at 9:00 AM
Written by Prashanth Koppula, Product Manager
Web publishers often ask us how they can maximize their visibility on the web. Much of this has to do with search engine optimization -- making sure a publisher's content shows up on all the search engines.
However, there are some cases in which publishers need to communicate more information to search engines -- like the fact that they don't want certain content to appear in search results. And for that they use something called the Robots Exclusion Protocol (REP), which lets publishers control how search engines access their site: whether it's controlling the visibility of their content across their site (via robots.txt) or down to a much more granular level for individual pages (via META tags).
Since it was introduced in the early '90s, REP has become the de facto standard by which web publishers specify which parts of their site they want public and which parts they want to keep private. Today, millions of publishers use REP as an easy and efficient way to communicate with search engines. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and in the way it works for any publisher, no matter how large or small.
While REP is observed by virtually all search engines, we've never come together to detail how we each interpret different tags. Over the last couple of years, we have worked with Microsoft and Yahoo! to bring forward standards such as Sitemaps and offer additional tools for webmasters. Since the original announcement, we have delivered, and will continue to deliver, further improvements based on what we are hearing from the community.
Today, in that same spirit of making the lives of webmasters simpler, we're releasing detailed documentation about how we implement REP. This will provide a common implementation for webmasters and make it easier for any publisher to know how their REP directives will be handled by three major search providers -- making REP more intuitive and friendly to even more publishers on the web.
So, without further ado...
Common REP Directives
The following list covers all the major REP features currently implemented by Google, Microsoft, and Yahoo!. With each feature, you'll see what it does and how you should communicate it.
Each of these directives can be specified to be applicable for all crawlers or for specific crawlers by targeting them to specific user-agents, which is how any crawler identifies itself. Apart from the identification by user-agent, each of our crawlers also supports Reverse DNS based authentication to allow you to verify the identity of the crawler.
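The reverse DNS check described above can be sketched as follows. This is a minimal illustration, not any engine's official procedure: the function name `verify_crawler`, the injectable lookup parameters, and the hostname suffix `.crawler.example.com` are assumptions made for the example. The idea is to resolve the visiting IP address to a hostname, confirm the hostname belongs to the crawler's domain, and then resolve that hostname forward again to make sure it maps back to the same IP.

```python
import socket


def verify_crawler(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to an allowed host that resolves back to `ip`.

    `allowed_suffixes` are hypothetical domain suffixes with a leading dot,
    e.g. [".crawler.example.com"]; substitute the ones each engine documents.
    """
    # The lookups are injectable so the logic can be exercised without network access.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse DNS record at all

    # The leading dot keeps "crawler.example.com.attacker.net" from passing.
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False

    try:
        # The forward lookup must return the original address.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A user-agent string alone is easy to forge, which is why the forward lookup is repeated: an impostor can set an arbitrary reverse DNS name for its own IP, but cannot make the genuine crawler domain resolve to that IP.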
1. Robots.txt Directives
DIRECTIVE | IMPACT | USE CASES
Disallow | Tells a crawler not to crawl your site or parts of it -- your site's robots.txt file still needs to be crawled to find this directive, but disallowed pages will not be crawled | 'No Crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.
Allow | Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow | This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it
$ Wildcard Support | Tells a crawler to match the end of a URL, so one rule can cover a large number of directories without specifying specific pages | 'No Crawl' files with specific patterns, for example, files of a certain type that always have a certain extension, say pdf
* Wildcard Support | Tells a crawler to match a sequence of characters | 'No Crawl' URLs with certain patterns, for example, disallow URLs with session ids or other extraneous parameters
Sitemaps Location | Tells a crawler where it can find your Sitemaps | Point to other locations where feeds exist to help crawlers find URLs on a site
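To make the two wildcards concrete, here is a small sketch of how a rule containing `*` and `$` could be matched against a URL path. It is a simplified reading of the descriptions above, not the matcher any of the three engines actually runs; the helper names `rule_to_regex` and `rule_matches` are invented for the example, and details such as percent-encoding are ignored.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule using * and $ into a compiled regex."""
    anchored = rule.endswith("$")          # a trailing $ pins the rule to the end of the URL
    body = rule[:-1] if anchored else rule
    # Escape each literal piece, then let every * stand for any run of characters.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def rule_matches(rule, path):
    # re.match anchors at the start of the path, so a rule without $ acts as a prefix.
    return rule_to_regex(rule).match(path) is not None
```

Under these assumptions, a rule such as `/*.pdf$` covers every path ending in `.pdf`, while `/*sessionid=` covers any path that contains a `sessionid=` parameter.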
2. HTML META Directives
DIRECTIVE | IMPACT | USE CASES
NOINDEX META Tag | Tells a crawler not to index a given page | Don't index the page. This allows pages that are crawled to be kept out of the index.
NOFOLLOW META Tag | Tells a crawler not to follow a link to other content on a given page | Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW you let the robot know that you are discounting all outgoing links from this page.
NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page | Present no snippet for the page on Search Results
NOARCHIVE META Tag | Tells a search engine not to show a "cached" link for a given page | Do not make available to users a copy of the page from the Search Engine cache
NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page | Do not use the ODP (Open Directory Project) title and snippet for this page
These directives are applicable for all forms of content. They can be placed either in the HTML of a page or, for non-HTML content such as PDF or video, in the HTTP header using an X-Robots-Tag. You can read more about it in the X-Robots-Tag post or in our series of posts about using robots and META tags.
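As an illustration of the two placements, the sketch below builds the same set of directives once as an HTML META tag and once as an HTTP header pair. The function names are made up for the example, and the comma-separated value format is the commonly used one rather than something spelled out in this post.

```python
def meta_tag(directives):
    """HTML form: a tag to place inside the <head> of an HTML page."""
    return '<meta name="robots" content="{}">'.format(", ".join(directives))


def x_robots_header(directives):
    """HTTP form: a (name, value) pair to send with a response, e.g. for a PDF."""
    return ("X-Robots-Tag", ", ".join(directives))


print(meta_tag(["noindex", "nofollow"]))
print(x_robots_header(["noarchive"]))
```

The header form is the only option for files that have no HTML head to carry a META tag.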
Other REP Directives
The directives listed above are used by Microsoft, Google and Yahoo!, but may not be implemented by all other search engines. In addition, the following directives are supported by Google but, unlike those above, not by all three:
UNAVAILABLE_AFTER Meta Tag - Tells a crawler when a page should "expire", i.e., after which date it should not show up in search results.
NOIMAGEINDEX Meta Tag - Tells a crawler not to index images for a given page in search results.
NOTRANSLATE Meta Tag - Tells a crawler not to translate the content on a page into different languages for search results.
One Standard Fits All: Robots Exclusion Protocol for Yahoo!, Google and Microsoft
Over the last couple of years, we've been collaborating with Google and Microsoft to make webmasters' efforts more effective across the major search engines. With standards such as Sitemaps and improvements like auto-discovery and cross-host submission, webmasters can simplify their account management across the different search engines.
The Robots Exclusion Protocol (REP) lets content publishers specify which parts of their site they want public and which parts they want to keep private from robots, whether it's controlling the visibility of their content across their site (via robots.txt) or at the level of individual pages (via META tags). REP was introduced in the early 1990s and is the de facto standard. Its strengths lie in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and in the way it works for any publisher, no matter how large or small. We've heard that there is some confusion around the specific implementation of REP supported by each engine. Since we've never detailed the specifics of implementing the protocol, today we're releasing detailed documentation on how REP directives will be handled by the three major search providers.
Common REP Directives
The following are all the major REP features currently implemented by Google, Microsoft and Yahoo!. Each of these directives can be specified to be applicable for all crawlers or for specific crawlers by targeting them to specific user-agents, which is how any crawler identifies itself. Each of us also supports Reverse DNS based authentication of our crawler, and you can use this to validate the identity of any crawler claiming a particular user-agent.
1. Robots.txt Directives
DIRECTIVE | IMPACT | USE CASE(s)
Disallow | Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt still needs to be crawled to find this directive, but the disallowed pages will not be crawled. | 'No crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.
Allow | Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule - the longest rule - applies. | This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed, except a small section within it.
$ Wildcard Support | Tells a crawler to match the end of a URL, so one rule can cover a large number of directories without specifying specific pages. | 'No Crawl' files with specific patterns, for example, files of a certain type that always have a certain extension, say pdf.
Sitemap Location | Tells a crawler where it can find your sitemaps. | Point to other locations where feeds exist to point the crawlers to the site's content.
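The "longest rule applies" behaviour described for Allow and Disallow can be sketched like this. It is a minimal sketch that handles plain path prefixes only (no wildcards); the function name `is_allowed` and the `(kind, prefix)` rule format are invented for the example. The post does not say what happens when an Allow and a Disallow rule have the same length, so resolving such ties in favour of Allow is an assumption.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    rules: list of (kind, prefix) pairs, where kind is "allow" or "disallow".
    """
    matching = [
        (len(prefix), kind)
        for kind, prefix in rules
        if prefix and path.startswith(prefix)
    ]
    if not matching:
        return True  # no rule applies, so crawling is permitted

    # The longest matching prefix is the most specific rule.
    # Assumption: on equal length, "allow" wins the tie.
    _, kind = max(matching, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"


rules = [("disallow", "/archive/"), ("allow", "/archive/public/")]
print(is_allowed("/archive/public/index.html", rules))
print(is_allowed("/archive/old.html", rules))
```

This is the "large section disallowed except a small section within it" use case: `/archive/` is blocked as a whole, while the longer `/archive/public/` rule reopens one subdirectory.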
2. HTML META Directives
These directives can be placed either in the HTML of a page or, for non-HTML content like PDF or video, in the HTTP header using an X-Robots-Tag. The X-Robots-Tag mechanism allows these directives to be available for all types of documents -- HTML or otherwise. If both forms of the tag, HTML META and X-Robots-Tag in the header, are present, the most restrictive one applies.
DIRECTIVE | IMPACT | USE CASE(s)
NOINDEX META Tag | Tells a crawler not to index a given page. | Don't index the page. This allows pages that are crawled to be kept out of the index.
NOFOLLOW META Tag | Tells a crawler not to follow a link to other content on a given page. | Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.
NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page. | Present no abstract for the page on search results.
NOARCHIVE META Tag | Tells a search engine not to show a "cached" link for a given page. | Do not make a copy of the page available to users from the search engine cache.
NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page. | Do not use the ODP (Open Directory Project) title and abstract for this page in Search.
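One way to read the "most restrictive one applies" rule for a page that carries both a META tag and an X-Robots-Tag header is to honour every restriction found in either place. The sketch below takes that reading; it is an interpretation rather than documented engine behaviour, it only knows the five directives from the table above, and the names `RESTRICTIVE`, `parse_directives` and `effective_directives` are invented for the example.

```python
# The five common directives from the table; anything else is ignored here.
RESTRICTIVE = {"noindex", "nofollow", "nosnippet", "noarchive", "noodp"}


def parse_directives(value):
    """Split a comma-separated directive string into a lowercase set."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


def effective_directives(meta_content, header_value):
    """Combine both sources so that a restriction in either one is kept."""
    combined = parse_directives(meta_content) | parse_directives(header_value)
    return combined & RESTRICTIVE


print(sorted(effective_directives("noindex, follow", "NOARCHIVE")))
```

With this reading, a permissive value in one place (such as `follow`) never cancels a restriction stated in the other.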
Other REP Directives
Yahoo!-specific REP directives that are not supported by Microsoft and Google include:
Crawl-Delay: Allows a site to reduce the frequency with which a crawler checks for new content.
NOYDIR META Tag: This is similar to the NOODP META Tag above but applies to the Yahoo! Directory instead of the Open Directory Project.
Robots-nocontent Tag: Allows you to mark out the non-content parts of your page, identifying the main content so that our crawler targets the right pages on your site for specific search queries. We won't use the sections tagged as such for indexing the page or for the abstract in the search results.
Apart from these tools in the REP, Yahoo! Site Explorer also provides further ways to tell Yahoo! to Delete URLs, or Rewrite Dynamic URLs to remove spurious parameters. You can learn more about our crawler at the Slurp Help page.
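To show the effect of the robots-nocontent marking described above, here is a rough sketch that extracts only the text a crawler would still consider. It assumes the marker is applied as a `class="robots-nocontent"` attribute value and that the HTML is well-formed (every non-void element is closed); the class and function names are invented, and this says nothing about how Yahoo!'s crawler is actually implemented.

```python
from html.parser import HTMLParser

# Elements that never have an end tag and so must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect page text, skipping elements marked class="robots-nocontent"."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a marked element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a page whose navigation menu and advertising block sit inside a marked `div`, only the article text outside that `div` is returned.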
Tuesday, June 03, 2008 9:00 AM by Nathan_Buggia
Robots Exclusion Protocol: Joining Together to Provide Better Documentation
As a member of the Live Search Webmaster Team, I'm often asked by web publishers how they can control the way search engines access and display their content. The de facto standard for managing this is the Robots Exclusion Protocol (REP), introduced back in the early 1990s. Over the years, the REP has evolved to support more than "exclusion" directives; it now supports directives controlling what content gets included, how the content is displayed, and how frequently the content is crawled. The REP offers an easy and efficient way to communicate with search engines, and is currently used by millions of publishers worldwide. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and the way it works for any publisher, no matter how large or small.
In the spirit of making the lives of webmasters simpler, Microsoft, Yahoo and Google are coming forward with detailed documentation about how we implement the Robots Exclusion Protocol (REP). This will provide a common implementation for webmasters and make it easier for any publisher to know how their REP directives will be handled by three major search providers, making REP more intuitive and friendly to even more publishers on the web.
Common REP Directives and Use Cases
The following list includes all the major REP features currently implemented by Google, Microsoft, and Yahoo. We are documenting the features and the use cases they enable for site owners. With each feature, you'll see what it does and how you should communicate it.
Each of these directives can be specified to be applicable for all crawlers or for specific crawlers by targeting them to specific user-agents, which is how any crawler identifies itself. Apart from the identification by user-agent, each of our crawlers also supports Reverse DNS based authentication to allow you to verify the identity of the crawler.
1. Robots.txt Directives
Directive | Impact | Use Cases
Disallow | Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt still needs to be crawled to find this directive, but the disallowed pages will not be crawled | 'No crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled
Allow | Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule - the longest rule - applies. | This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed, except a small section within it.
$ Wildcard Support | Tells a crawler to match the end of a URL, so one rule can cover a large number of directories without specifying specific pages (available by end of June) | 'No Crawl' files with specific patterns, for example, files of a certain type that always have a certain extension, say '.pdf'.
* Wildcard Support | Tells a crawler to match a sequence of characters (available by end of June) | 'No Crawl' URLs with certain patterns, for example, disallow URLs with session ids or other extraneous parameters.
Sitemaps Location | Tells a crawler where it can find your sitemaps. | Point to other locations where feeds exist to point the crawlers to the site's content
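From the crawler's side, the Sitemaps location can be read straight out of robots.txt. The sketch below uses Python's standard `urllib.robotparser` module (its `site_maps()` method needs Python 3.8 or later); the robots.txt content and the `www.example.com` URL are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; the Sitemap line is independent of any User-agent group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None when no Sitemap line is present.
sitemaps = parser.site_maps()
print(sitemaps)
```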
2. HTML META Directives
The tags below can be present as META tags in the page HTML or as X-Robots-Tag entries in the HTTP header. This allows non-HTML resources to also implement identical functionality. If both forms of tags are present for a page, the most restrictive version applies.
Directive | Impact | Use Case(s)
NOINDEX META Tag | Tells a crawler not to index a given page | Don't index the page. This allows pages that are crawled to be kept out of the index.
NOFOLLOW META Tag | Tells a crawler not to follow a link to other content on a given page | Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.
NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page | Present no abstract for the page on Search Results.
NOARCHIVE / NOCACHE META Tag | Tells a search engine not to show a "cached" link for a given page | Do not make a copy of the page available to users from the Search Engine cache.
NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page | Do not use the ODP (Open Directory Project) title and abstract for this page in Search.
Other REP Directives
The directives listed above are used by Microsoft, Google and Yahoo, but may not be implemented by all other search engines. Additionally, Live Search and Yahoo support the Crawl-Delay directive, which is not supported by Google at this time.
Crawl-Delay - Allows a site to reduce the frequency with which a crawler checks for new content (supported by Live Search and Yahoo).
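A crawler that honours Crawl-Delay has to look the value up per user-agent. Here is a minimal sketch using Python's standard `urllib.robotparser` (its `crawl_delay()` method needs Python 3.6 or later). The robots.txt content is invented for the example, and the value 5 is arbitrary; "Slurp" is used only as a sample user-agent name.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with a delay set for one crawler only.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
Disallow: /tmp/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

slurp_delay = parser.crawl_delay("Slurp")         # value from the Slurp group
other_delay = parser.crawl_delay("SomeOtherBot")  # None: the * group sets no delay
print(slurp_delay, other_delay)
```

Because the directive sits inside a User-agent group, a site can slow down one crawler without affecting the others.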