Sitemap.xml Basics in Secure Website Development

What is sitemap.xml?

By Prempal Singh, October 3, 2017

What is a sitemap?

An XML Sitemap is a sitemap created for search engines. It lists all the URLs on your site that you want search engines such as Google to crawl and index. The Sitemap also provides information on when pages were last updated and how important they are relative to each other. Search engines do not guarantee that they will fully abide by the sitemap, but they do use XML Sitemaps for assistance in crawling the web.

Sitemaps are XML files that list every URL on your website, along with important metadata for each URL: when it was last updated, how important it is relative to other pages within your website structure, and how often it changes.

This document describes the XML schema for the Sitemap protocol.

The Sitemap protocol format consists of XML tags. All data values in a Sitemap must be entity-escaped. The file itself must be UTF-8 encoded.
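
As a small illustration of the entity-escaping rule, the sketch below escapes a URL with Python's standard library before it is placed in a Sitemap. The function name escape_loc and the sample URL are assumptions made for this example.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are added explicitly here.
_QUOTES = {"'": "&apos;", '"': "&quot;"}


def escape_loc(url: str) -> str:
    """Return url with the five XML special characters entity-escaped."""
    return escape(url, _QUOTES)


if __name__ == "__main__":
    print(escape_loc("http://www.example.com/view?item=12&desc=vacation"))
```

Most XML libraries apply this escaping automatically when they serialize a document, so a helper like this is mainly useful when a Sitemap is assembled from plain strings.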

The Sitemap must:

  • Begin with an opening <urlset> tag and end with a closing </urlset> tag.
  • Specify the namespace (protocol standard) within the <urlset> tag.
  • Include a <url> entry for each URL, as a parent XML tag.
  • Include a <loc> child entry for each <url> parent tag.

All other tags are optional. Support for these optional tags may vary among search engines. Refer to each search engine’s documentation for details.

Also, all URLs in a Sitemap must be from a single host, such as www.example.com or store.example.com.
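
The requirements above can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.etree.ElementTree; the function name build_sitemap is an assumption of this example, not part of the protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a minimal Sitemap (urlset > url > loc) and return it as UTF-8 bytes."""
    # All URLs in one Sitemap must come from a single host.
    hosts = {urlsplit(u).netloc for u in urls}
    if len(hosts) > 1:
        raise ValueError(f"URLs span more than one host: {sorted(hosts)}")

    # The namespace is declared on the opening <urlset> tag.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for u in urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree entity-escapes &, < and > when it serializes the text.
        ET.SubElement(url, "loc").text = u
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_sitemap(["http://www.example.com/"]).decode("utf-8"))
```

The sketch emits only the required tags; optional tags such as lastmod could be added as further child elements of each url entry.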

Why Do Sitemaps Matter?

Sitemaps are a core part of a website and critical to search engine optimization: XML sitemaps allow search engines to easily crawl a website and index each page so that it shows up in search engine results. HTML sitemaps are also important, and are geared more towards human users: they help your website visitors more easily find the content they’re looking for on your website.

According to Google, there are a few specific reasons a client would benefit from a sitemap:

  • Their website is new with very few backlinks
  • Their website is very large
  • Their website content isn’t well-linked internally, making it difficult to navigate
  • Their website uses a lot of rich-media content

Sitemap Comparison of Text, HTML, ROR, RSS and XML Sitemaps

This guide compares text, HTML, RSS, ROR and XML sitemaps and explains the differences between them.

HTML Sitemaps
– help humans navigate your website

HTML sitemaps can be:

  • Viewed by all browsers including Firefox, IE and Opera.
  • Crawled by all search engines including Google, Yahoo, Bing and ASK.

Some HTML sitemap tips and tricks:

  • HTML documents can be generated by PHP, ASP etc. It is the output format that matters.
  • Limit yourself to a few hundred links per page for best results. This makes it easier to find your important pages.
  • You can read our article about creating HTML sitemaps for more detailed information.

Code example of HTML:

<html lang="en">
<head><title>This is a site map</title></head>
<body>
<h1>header of HTML site map</h1>
<p>site map paragraph with links
</body>
</html>
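
Since an HTML sitemap is usually generated rather than written by hand, here is a minimal sketch of how such a page might be produced in Python. The function name build_html_sitemap and the (url, label) input format are assumptions of this example.

```python
from html import escape


def build_html_sitemap(title, links):
    """Return an HTML sitemap page for a list of (url, label) pairs."""
    # Escape URLs and labels so characters such as & stay valid in HTML.
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for url, label in links
    )
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"<h1>{escape(title)}</h1>\n"
        "<ul>\n"
        f"{items}\n"
        "</ul>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(build_html_sitemap("Site map", [("http://www.example.com/", "Home")]))
```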

XHTML Sitemaps
– HTML sitemaps as XML

XHTML is the HTML specification moved into the XML standard.

Sitemap file with XHTML and HTML differences highlighted:

<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head><title>This is a site map</title></head>
 <body>
 <h1>header of XHTML site map</h1>
 <p>site map paragraph with links</p>
 </body>
 </html>

Text Sitemaps
– simple sitemap

Text sitemaps contain one website URL per line. Many search engines including Google and Yahoo can scan text sitemaps.

Improve compatibility between text sitemaps and search engines:

  • For Yahoo, name the primary text sitemap file urllist.txt.
  • Save text file sitemaps as UTF-8 documents, especially if you have website URLs with non-English characters.
  • Each text sitemap file should contain no more than 50,000 URLs.

Example of text sitemap file:

http://www.example.com/
http://www.example.com/some-directory/

Be sure to check our text sitemap tutorial, so you can easily generate URL list text files for all your websites.
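
As a rough sketch of how a URL list could be turned into text sitemaps that respect the per-file limit, consider the following Python. The helper names and the urllist-2.txt naming scheme for additional files are assumptions of this example.

```python
MAX_URLS_PER_FILE = 50_000  # per-file limit for text sitemaps


def text_sitemap_chunks(urls, max_urls=MAX_URLS_PER_FILE):
    """Yield the contents of each text sitemap: one URL per line."""
    for start in range(0, len(urls), max_urls):
        yield "\n".join(urls[start:start + max_urls]) + "\n"


def write_text_sitemaps(urls, basename="urllist"):
    """Write each chunk as a UTF-8 file: urllist.txt, urllist-2.txt, ..."""
    names = []
    for n, body in enumerate(text_sitemap_chunks(urls), start=1):
        name = f"{basename}.txt" if n == 1 else f"{basename}-{n}.txt"
        with open(name, "w", encoding="utf-8", newline="\n") as fh:
            fh.write(body)
        names.append(name)
    return names
```

Calling write_text_sitemaps(["http://www.example.com/", "http://www.example.com/some-directory/"]) would write a single urllist.txt in the current directory.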

RSS Feeds as Sitemaps
 - RSS 0.9, RSS 1.0 and RSS 2.0

The RSS protocol is often used in feed files for blogs, forums etc. The RSS file format uses XML and has evolved over multiple versions and names, all fairly compatible with each other:

  • Really Simple Syndication (RSS 2.0)
  • RDF Site Summary (RSS 1.0 and RSS 0.90)
  • Rich Site Summary (RSS 0.91)

After Google and Yahoo adopted RSS feeds as a kind of website sitemap, more search engines followed.

Note: There is no official standard for splitting RSS feed sitemaps into multiple files. However, if your RSS sitemap feed is too large, you may wish to create one RSS feed file per website category instead of splitting the sitemap arbitrarily. (If you use a sitemap generator tool, try its include/exclude filters.)

Example of a RSS feed sitemap file:

<?xml version="1.0" encoding="UTF-8"?>
 <rss version="2.0">
 <channel>
 <title>Website title</title>
 <link>http://www.example.com</link>
 <generator>A1 Sitemap Generator</generator>
 <lastBuildDate>Tue, 13 Mar 2007 22:28:20 GMT</lastBuildDate>
 <item>
 <title>Page 1</title>
 <link>http://www.example.com/page1.html</link>
 </item>
 <item>
 <title>Page 2</title>
 <link>http://www.example.com/page2.html</link>
 </item>
 </channel>
 </rss>
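
To illustrate how a crawler might treat such a feed as a sitemap, the sketch below pulls the page URLs out of an RSS 2.0 document with Python's standard library. The function name rss_item_links is an assumption of this example.

```python
import xml.etree.ElementTree as ET


def rss_item_links(rss_bytes):
    """Return the <link> of every <item> in an RSS 2.0 feed, in document order."""
    root = ET.fromstring(rss_bytes)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links
```

Only links inside item elements are collected, so the channel-level link to the site itself is not returned.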

ROR Sitemaps
– extends RSS sitemaps

ROR expands on the RSS protocol with its own extensions. The standard file extension for ROR files is .ror. All search engines that understand RSS sitemap files continue to understand the RSS parts of ROR files. However, no major search engine, if any at all, currently supports the ROR sitemap extensions. If you know of any major search engine that states support for ROR sitemaps, please write to me. Currently Google Webmaster Tools has no mention of ROR sitemaps support.

ROR sitemap file with the ROR namespace extensions of RSS highlighted:

<?xml version="1.0" encoding="UTF-8"?>
 <rss version="2.0" xmlns:ror="http://rorweb.com/0.1/">
 <channel>
 <title>Website title</title>
 <link>http://www.example.com</link>
 <generator>A1 Sitemap Generator</generator>
 <lastBuildDate>Tue, 13 Mar 2007 22:28:20 GMT</lastBuildDate>
 <item>
 <title>Page 1</title>
 <link>http://www.example.com/page1.html</link>
 <ror:keywords>page1-keyword1, page1-keyword2, page1-keyword3</ror:keywords>
 <ror:updatePeriod>day</ror:updatePeriod>
 </item>
 <item>
 <title>Page 2</title>
 <link>http://www.example.com/page2.html</link>
 <ror:keywords>page2-keyword1, page2-keyword2, page2-keyword3</ror:keywords>
 <ror:updatePeriod>day</ror:updatePeriod>
 </item>
 </channel>
 </rss>

XML Sitemaps Protocol
– also called Google Sitemaps

In 2005 Google introduced its own sitemaps protocol based on XML, called Google Sitemaps. Google later convinced more search engines to follow, and the standard was renamed the XML sitemaps protocol. Currently Google, Bing, Ask, IBM and possibly others support XML sitemaps, and it is likely that more search engines will implement support.

The protocol of XML sitemaps also defines autodiscovery, i.e. how search engines can automatically discover a website's XML sitemaps. The answer is to link to the XML sitemap, e.g. sitemap.xml, from robots.txt.

User-agent: *
Sitemap: http://www.example.com/sitemap.xml

Instead of just pointing to one XML sitemap file for auto discovery, you can list multiple sitemaps:

Sitemap: http://www.example.com/sitemap-1.xml
Sitemap: http://www.example.com/sitemap-2.xml

Or point to an XML sitemap index file:

Sitemap: http://www.example.com/sitemap-index.xml
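
From the crawler's side, autodiscovery amounts to reading the Sitemap lines out of robots.txt. A minimal sketch in Python, where the function name sitemaps_from_robots is an assumption of this example:

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs named in 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Every Sitemap line is returned, so a robots.txt that lists several sitemaps or a sitemap index file is handled the same way.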

Information about XML sitemaps protocol:

  • Each XML sitemap file can contain at most 50,000 URLs and be at most 50MB in size (uncompressed).
  • A sitemap index file can link up to 50,000 XML sitemaps.
  • You can read our article about page priorities in XML sitemaps.
  • XML sitemap files and sitemap index files have to be stored as UTF-8 documents.
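
When a site has more URLs than fit in one file, the list can be split and the pieces linked from a sitemap index. The Python below is a sketch of that idea; the function name plan_sitemaps and the sitemap-N.xml file naming are assumptions of this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def plan_sitemaps(urls, base_url, max_urls=50_000):
    """Split urls into per-file groups and build the index that links them.

    Returns (index_bytes, files), where files maps each sitemap URL
    to the list of page URLs that belong in it.
    """
    files = {}
    for start in range(0, len(urls), max_urls):
        n = start // max_urls + 1
        files[f"{base_url}/sitemap-{n}.xml"] = urls[start:start + max_urls]

    # One <sitemap><loc>...</loc></sitemap> entry per generated file.
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in files:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True), files
```

The returned bytes are already UTF-8 encoded and can be written to a file such as sitemap-index.xml as they are.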

Example of XML sitemaps file:

 

<?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
 <loc>http://www.example.com/</loc>
 <priority>1.0</priority>
 <changefreq>weekly</changefreq>
 <lastmod>2007-06-18</lastmod>
 </url>
 <url>
 <loc>http://www.example.com/blogs/</loc>
 <priority>0.8</priority>
 <changefreq>weekly</changefreq>
 <lastmod>2007-06-21</lastmod>
 </url>
 </urlset>
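
A file in this format can be read back with any XML parser, as long as the sitemap namespace is taken into account. The sketch below uses Python's standard library; the function name read_sitemap and the dictionary output format are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Prefix "sm" is arbitrary; it only maps to the sitemap namespace for lookups.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_bytes):
    """Return one dict per <url> entry with loc and any optional fields present."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            value = url.findtext(f"sm:{field}", namespaces=NS)
            if value is not None:
                entry[field] = value.strip()
        entries.append(entry)
    return entries
```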

XML Sitemaps Protocol and Derived Formats

You can find many derived formats of the standard XML sitemaps protocol, most created by Google.

If you are interested in creating XML sitemaps or any of its derived formats, check these tutorials:

  • Standard XML sitemap
  • Image sitemap
  • Video sitemap
  • Mobile sitemap
  • News sitemap

Types of sitemap

Besides the standard XML Sitemap, there is also a Sitemap index and four more specialized sitemaps (the code search sitemap is now basically useless since Google Code Search has been discontinued). If you want to boost traffic to videos, images, your mobile site, or news articles, use specialized Sitemaps (Sitemap extensions).

The 6 Types of Sitemaps:

  • Video
  • Images
  • Mobile
  • News
  • Sitemap Index if you have multiple sitemaps
  • Standard Sitemap

 

Build Sitemap

You can either let a sitemap generator do its thing, or you can tweak its settings to create a Sitemap that shows the search engines exactly how you want your site crawled.

  • Sitemap tags
    • <sitemapindex> (required): Encapsulates information about all of the Sitemaps in the file.
    • <sitemap> (required): Encapsulates information about an individual Sitemap.
    • <loc> (required): Identifies the location of the Sitemap. This location can be a Sitemap, an Atom file, an RSS file or a simple text file.
    • <lastmod> (optional): Identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in W3C Datetime format.

    By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index i.e. a crawler may only retrieve Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.

  • Sitemaps segmentation — divvy up individual Sitemaps by type and by a structure that will best help you diagnose indexation shortcomings. Give them descriptive names as well.
  • Exclude URLs that should NOT be indexed
    • Exclude URLs disallowed in robots.txt (a good time to make sure you’re disallowing the right URLs)
    • Exclude URLs disallowed via meta noindex tags
    • Exclude duplicate URLs
    • Exclude private pages
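
The exclusion rules above could be applied to a candidate URL list roughly as follows. This is a sketch, not a complete tool: it assumes the noindex and private URLs have already been collected elsewhere, and the function name filter_sitemap_urls is made up for this example.

```python
from urllib.robotparser import RobotFileParser


def filter_sitemap_urls(urls, robots_txt, noindex=(), private=()):
    """Drop URLs that should not be listed in a Sitemap, preserving order."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    blocked = set(noindex) | set(private)

    kept, seen = [], set()
    for url in urls:
        if url in seen:                      # duplicate URL
            continue
        seen.add(url)
        if url in blocked:                   # noindex or private page
            continue
        if not robots.can_fetch("*", url):   # disallowed in robots.txt
            continue
        kept.append(url)
    return kept
```

Duplicates are detected by exact string match only; treating variants such as trailing-slash differences as duplicates would need an extra normalization step.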

Uploading Sitemap:
Once you have generated the sitemap, publish it to your website, ideally in the root directory, like so: www.example.com/sitemap.xml. Technically, you don’t need to place it in the root directory, but any other location comes with some limitations.

Limitations of Sitemap

You can provide a simple text file that contains one URL per line. The text file must follow these guidelines:

  • The text file must have one URL per line. The URLs cannot contain embedded new lines.
  • You must fully specify URLs, including the http.
  • Each text file can contain a maximum of 50,000 URLs and must be no larger than 50MB (52,428,800 bytes). If your site includes more than 50,000 URLs, you can separate the list into multiple text files and add each one separately.
  • The text file must use UTF-8 encoding. You can specify this when you save the file (for instance, in Notepad, this is listed in the Encoding menu of the Save As dialog box).
  • The text file should contain no information other than the list of URLs.
  • The text file should contain no header or footer information.
  • If you would like, you may compress your Sitemap text file using gzip to reduce your bandwidth requirement.
  • You can name the text file anything you wish. Please check to make sure that your URLs follow the RFC-3986 standard for URIs and the RFC-3987 standard for IRIs.
  • You should upload the text file to the highest-level directory you want search engines to crawl and make sure that you don’t list URLs in the text file that are located in a higher-level directory.
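
Several of these guidelines are easy to check automatically before uploading. The Python below is one possible sketch; the function name check_text_sitemap and the wording of the messages are assumptions of this example, and it covers only the size, URL count, full-URL and directory rules.

```python
from urllib.parse import urlsplit

MAX_URLS = 50_000
MAX_BYTES = 52_428_800  # 50MB


def check_text_sitemap(body, sitemap_url):
    """Return a list of problems found in a text sitemap (empty if none)."""
    problems = []
    if len(body.encode("utf-8")) > MAX_BYTES:
        problems.append("file is larger than 50MB")
    lines = body.splitlines()
    if len(lines) > MAX_URLS:
        problems.append("more than 50,000 URLs")

    sm = urlsplit(sitemap_url)
    # Directory that holds the sitemap, e.g. "/catalog/" or "/" for the root.
    base_dir = sm.path.rsplit("/", 1)[0] + "/"

    for n, line in enumerate(lines, start=1):
        u = urlsplit(line)
        if u.scheme not in ("http", "https") or not u.netloc:
            problems.append(f"line {n}: not a fully specified URL")
        elif u.netloc != sm.netloc or not (u.path or "/").startswith(base_dir):
            problems.append(f"line {n}: outside the sitemap's directory")
    return problems
```

An empty result does not prove the file is valid; it only means that none of the checks sketched here failed.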