Today’s “Ask An SEO” question comes from Bhaumik from Mumbai, who asks:
“I have a question about automatically generated URLs. My firm had previously used different tools to generate sitemaps. But recently, we started creating them manually by selecting URLs that are necessary and blocking others in robots.txt.
We are facing an issue now with more than 50 auto-generated URLs.
For example, we have a page called “keyword keyword” URL: https://url.com/keyword-keyword/ and we have another page knowledge center URL: https://www.url.com/folder/keyword-keyword.
In coverage issues, we are seeing errors under the 5xx series which created totally new URLs something like https://test.url.com/keyword-keyword/keyword-keyword. We tried many ways but we are not getting the solution for this one.”
It’s an interesting situation you’re finding yourself in.
The good news is that 5XX errors are server-side errors that tend to resolve on their own, so don’t worry about that one.
The cannibalization issue you’re facing is also more common than most people think.
With ecommerce stores, for example, you could have the same product (or the same collection of products) appear in multiple folders.
So, which one is the official one?
The same goes for your situation in the B2B finance space (I removed your URL above and replaced it with “keyword keyword”).
This is why the search engines created canonical links.
Canonical links are a way to tell search engines when a page is a duplicate of another, and which page is the official one.
Let’s pretend you sell pink bunny slippers.
These bunny slippers have their own page, they’re on sale, they appear in footwear, and also in pink, so the same product can be reached at four URLs.
The product’s own page, https://url.com/products/pink-bunny-slippers, is the “official version” of the URL.
That means it should have a canonical link pointing to itself.
The other three pages are duplicate versions of it. So, when you set up your canonical link, it should reference the official page.
In short, you’ll want to make sure all four pages have <link rel="canonical" href="https://url.com/products/pink-bunny-slippers"> in the head, as this will deduplicate them for search engines.
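One way to confirm that a duplicate page really points at the official URL is to read the canonical link out of its HTML. The snippet below is a minimal sketch using only Python’s standard library. The names CanonicalFinder and find_canonical, and the sample markup, are made up for illustration; in practice you would feed in the HTML your own pages actually return.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, so split before checking.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonical(html):
    """Return the one canonical URL declared in the HTML, or None if it is missing or ambiguous."""
    finder = CanonicalFinder()
    finder.feed(html)
    unique = set(finder.canonicals)
    return unique.pop() if len(unique) == 1 else None


OFFICIAL = "https://url.com/products/pink-bunny-slippers"

# Hypothetical markup standing in for one of the duplicate pages.
duplicate_page = (
    f'<html><head><link rel="canonical" href="{OFFICIAL}"></head>'
    "<body></body></html>"
)

print(find_canonical(duplicate_page) == OFFICIAL)
```

A page with no canonical link, or with two conflicting ones, returns None here, which makes both cases easy to flag for review.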
Next, you’ll want to make sure that you remove all duplicate versions from your sitemap.
A sitemap is supposed to feature the most important and indexable pages on your website.
You do not want to include non-official versions of a page, pages disallowed by robots.txt, or non-canonical URLs in your sitemaps.
Search engines do not crawl your entire website every time, and if you send them to unimportant pages, you’re wasting crawl capacity that should go toward the pages you want crawled and indexed.
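If you are hand-picking sitemap URLs, a small script can help catch entries that should not be there. The sketch below assumes you already have a mapping of each URL to the canonical it declares, plus the text of your robots.txt. The function name sitemap_candidates, the duplicate and /private/ URLs, and the robots.txt rules are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def sitemap_candidates(pages, robots_txt, user_agent="*"):
    """Keep only URLs that are self-canonical and allowed by robots.txt.

    `pages` maps each URL to the canonical URL declared on that page.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return [
        url
        for url, canonical in pages.items()
        if url == canonical and robots.can_fetch(user_agent, url)
    ]


OFFICIAL = "https://url.com/products/pink-bunny-slippers"

# Hypothetical inventory: every URL except OFFICIAL is invented for this example.
pages = {
    OFFICIAL: OFFICIAL,
    "https://url.com/example-duplicate/pink-bunny-slippers": OFFICIAL,
    "https://url.com/private/report": "https://url.com/private/report",
}

# Hypothetical robots.txt content.
robots_txt = "User-agent: *\nDisallow: /private/\n"

print(sitemap_candidates(pages, robots_txt))
```

Only the official product page survives the filter: the duplicate is dropped because its canonical points elsewhere, and the /private/ page is dropped because robots.txt disallows it.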
There is another situation that can occur here.
If you have site search enabled, it can also create URLs that are duplicates.
If I type “pink bunny slippers” into your site’s search box, I’m likely going to get a URL with the same keyword phrase in the URL – and also with parameters on it.
This would compound your problem, so your IT team will need to programmatically set canonical links on the search results pages, along with a meta robots tag set to noindex, follow.
One other thing to look for: if I click through to the pink bunny slippers page from the search results, these parameters may stick.
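As a rough illustration of what “programmatically” could look like, here is a small sketch that builds the two head tags for a search results page. The function name search_page_head_tags and the search URL are assumptions for the example, and which URL each search page should name as its canonical is a decision for your own team.

```python
from html import escape


def search_page_head_tags(canonical_url):
    """Return the <head> tags for an internal search results page."""
    # Escape the URL so characters such as & are safe inside an HTML attribute.
    href = escape(canonical_url, quote=True)
    return [
        f'<link rel="canonical" href="{href}">',
        '<meta name="robots" content="noindex, follow">',
    ]


# Hypothetical canonical target for a site search results page.
for tag in search_page_head_tags("https://url.com/search?q=pink+bunny+slippers"):
    print(tag)
```

Your template or CMS would call something like this once per search page, so nobody has to add the tags by hand.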
If they do, take the same steps mentioned above.
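For sticky parameters, the canonical target is usually the same URL with the query string removed. The sketch below shows that idea with Python’s standard library. It assumes every parameter on the page is disposable; if some parameters change what the page shows, you would need to keep those. The function name strip_parameters and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_parameters(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical URL reached by clicking through from a site search result.
clicked = "https://url.com/products/pink-bunny-slippers?q=pink+bunny+slippers&src=search"

print(strip_parameters(clicked))
# prints https://url.com/products/pink-bunny-slippers
```

The cleaned URL is what the page’s canonical link should reference, no matter how the visitor arrived.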
Using proper canonical links and ensuring your sitemap doesn’t have non-official pages will help solve the duplicate page problem and help ensure you don’t waste a spider’s visit by having it crawl the wrong pages on your site.
I hope this helps!
Featured Image: Leremy/Shutterstock
Editor’s note: Ask an SEO is a weekly SEO advice column written by some of the industry’s top SEO experts, who have been hand-picked by Search Engine Journal. Got a question about SEO? Fill out our form. You might see your answer in the next #AskanSEO post!