If you've ever made a website, you might know about XML sitemaps. They're used to tell search engines which pages you want indexed (and sometimes other information, like how often they update). There's a neat online service, xml-sitemaps.com, that will do the work of generating the XML file for you.
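
For reference, a minimal sitemap in the sitemaps.org format is just a list of URLs with optional metadata:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>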

The free version, however, limits you to 500 pages. I need a way to generate bigger sitemaps, as well as to automate the process so that the sitemap is regenerated when the content changes. This post is about the first part of the challenge; a post on the second part will follow.

I was just about to start writing my own crawler when I figured I'd better look on GitHub first. After reviewing a few options (mostly Ruby-based), I settled on Sitemap Generator. I already had Node installed, so the setup was as simple as

npm install -S sitemap-generator

which took less than a minute. With the following code I was able to start crawling.

var SG = require('sitemap-generator');
var gen = new SG('http://mysite.com');
gen.start();

The instructions in README.md suggest registering a completion event that logs the sitemap to the console. I knew the output would be bigger than my shell's scrollback buffer, so I skipped that. I did want to see progress, though, so I registered a 'fetch' handler as follows.

gen.on('fetch', function(status, url) {
  console.log(url);
});

When the crawl finished, I knew the generator had the results in it but wasn't sure how to get them out. I should have registered the 'done' event to save them to a file, but I hadn't done it in time, and I wasn't feeling like crawling the whole site all over again. I looked at the source and found the _buildXML function (the underscore marks it as private, so it could change in a future version). Creating the XML sitemap was as easy as

var fs = require('fs');
// build the XML from the crawl results and write it to disk
gen._buildXML(function(sitemap) {
  fs.writeFile('sitemap.xml', sitemap, function(err) {
    if (err) throw err;
  });
});
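
Next time I'll register the 'done' handler up front instead of poking at private functions. Going by the README's description, it would look something like this (the exact callback signature here is my guess, so check the docs for the version you install):

gen.on('done', function(sitemap) {
  // assumption: 'done' hands the handler the finished sitemap string
  fs.writeFile('sitemap.xml', sitemap, function(err) {
    if (err) throw err;
  });
});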