What is robots.txt
robots.txt, also known as the robots protocol, is a widely accepted ethical standard in the international internet community. robots.txt is a plain-text file located in the root directory of your website that tells search engines which pages may be crawled and which may not. It can block larger files on the website, such as images, music, and videos, to save server bandwidth; it can block dead links on the site, making it easier for search engines to crawl the site's content; and it can declare sitemap links to guide spiders as they crawl your pages.

How to create a robots.txt file
You only need a text editor, such as Notepad, to create a text file named robots.txt, then upload this file to the website root directory to complete the creation. You can also use a robots generation tool to generate it online.

How to write robots.txt rules
Simply creating a robots.txt file is not enough; the essence lies in writing rules that suit your own website. robots.txt supports the following rules:

User-agent: *             the * here is a wildcard matching every kind of search engine crawler
Disallow: /admin/         forbids crawling anything under the admin directory
Disallow: /require/       forbids crawling anything under the require directory
Disallow: /ABC/           forbids crawling anything under the ABC directory
Disallow: /cgi-bin/*.htm  forbids access to any URL under /cgi-bin/ (including subdirectories) ending in ".htm"
Disallow: /*?*            forbids access to any URL on the site containing a question mark (?)
Disallow: /.jpg$          forbids crawling any .jpg image on the site
Disallow: /ab/adc.html    forbids crawling the adc.html file inside the ab folder
Allow: /cgi-bin/          allows crawling everything under the cgi-bin directory
Allow: /tmp               allows crawling the entire tmp directory
Allow: .htm$              allows access only to URLs ending in ".htm"
Allow: .gif$              allows crawling pages and gif-format images
Sitemap: sitemap URL      tells crawlers where the sitemap for the site is

It is recommended to use the webmaster tool's robots generation tool to write rules, as it is simpler and clearer. robots generation tool
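Before deploying a rule set like the one above, you can sanity-check it locally. The sketch below (not part of the original article; the bot name and URLs are made up for illustration) uses Python's standard-library parser. Note that urllib.robotparser implements the original robots.txt standard and does not understand the "*" and "$" wildcard extensions, so only plain path-prefix rules are exercised here.

```python
# Sanity-check robots.txt rules with Python's standard-library parser.
# Caveat: urllib.robotparser does NOT support the "*" / "$" wildcard
# extensions, so only plain path-prefix rules are tested below.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /require/
Allow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Any crawler name falls under the "User-agent: *" group.
print(parser.can_fetch("TestBot", "https://example.com/admin/login.php"))  # False
print(parser.can_fetch("TestBot", "https://example.com/cgi-bin/info"))     # True
print(parser.can_fetch("TestBot", "https://example.com/index.html"))       # True
```

Paths not matched by any rule are allowed by default, which is why the last check returns True.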
Naiba's Tip: Note that if a Disallow: line has nothing after it (an empty Disallow:), it means crawling of the entire site is allowed.
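This tip can be verified with the same standard-library parser (an illustrative check, not from the article): an empty Disallow: permits everything, while Disallow: / blocks everything.

```python
# Contrast an empty "Disallow:" (allow all) with "Disallow: /" (block all).
from urllib.robotparser import RobotFileParser

permissive = RobotFileParser()
permissive.parse("""\
User-agent: *
Disallow:
""".splitlines())

restrictive = RobotFileParser()
restrictive.parse("""\
User-agent: *
Disallow: /
""".splitlines())

# Empty Disallow: -> every URL is crawlable.
print(permissive.can_fetch("TestBot", "https://example.com/any/page.html"))   # True
# Disallow: / -> nothing is crawlable.
print(restrictive.can_fetch("TestBot", "https://example.com/any/page.html"))  # False
```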
Recommended robots.txt rules for WordPress
After WordPress is installed, it creates a virtual robots.txt rule file by default (meaning you cannot see it in the website directory, but you can access it at "your-domain/robots.txt"). The default rules are as follows:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

This rule forbids all search engines from crawling the contents of the wp-admin folder, but allows them to crawl the /wp-admin/admin-ajax.php file. However, for website SEO and security considerations, Naiba suggests improving the rules further. Below are the current robots.txt rules of Naibabiji:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /?s=*
Allow: /wp-admin/admin-ajax.php

User-agent: YandexBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: BLEXBot
Disallow: /

User-agent: YaK
Disallow: /

Sitemap: https://blog.naibabiji.com/sitemap_index.xml

The above rules add the following two lines to the default rules:
Disallow: /wp-content/plugins/
Disallow: /?s=*

These disallow crawling of the /wp-content/plugins/ folder and of URLs matching /?s=*. /wp-content/plugins/ is the WordPress plugin directory; keeping it out of crawlers' reach avoids privacy risks (for example, some plugins have privacy-leaking bugs whose output could be indexed by search engines). Disallowing the search result pages prevents others from abusing them to boost rankings: /?s=* matches the default search results page of a WordPress website, and this is also a loophole Naiba recently discovered being exploited by gray-hat SEO projects, as shown in the figure below:
Basically, the vast majority of WordPress theme search page titles take the form "keyword + website title". But this creates a problem: Baidu has a chance to crawl and index such pages. For example, Naiba has a site that was unfortunately exploited this way.
The remaining rules disallow crawling by specific search engine bots and add a sitemap address link; see Several Methods for WordPress to Generate Sitemaps_Recommended Sitemap Plugins.
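The per-bot blocks can also be checked locally. This sketch (illustrative, not from the article) verifies that YandexBot is refused everywhere while other crawlers follow the general rules. Two caveats: Python's parser applies the first matching rule (unlike Google, which prefers the most specific match), so Allow is listed before Disallow here; and the wildcard line /?s=* is omitted because this parser does not support wildcards.

```python
# Check per-user-agent blocking in a WordPress-style robots.txt.
# Allow is placed before Disallow because urllib.robotparser uses
# first-match semantics; the /?s=* wildcard rule is left out since
# this parser does not understand wildcards.
from urllib.robotparser import RobotFileParser

wp_rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /wp-content/plugins/

User-agent: YandexBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(wp_rules.splitlines())

# Ordinary crawlers may fetch admin-ajax.php but nothing else in /wp-admin/.
print(rp.can_fetch("Googlebot", "https://blog.naibabiji.com/wp-admin/admin-ajax.php"))  # True
print(rp.can_fetch("Googlebot", "https://blog.naibabiji.com/wp-admin/options.php"))     # False
# YandexBot is blocked from the entire site.
print(rp.can_fetch("YandexBot", "https://blog.naibabiji.com/"))                         # False
```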