How to Find All Current and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Gather all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools for building your URL list, then show how to deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
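Alternatively, you can skip the UI and the scraping plugin entirely by querying the Wayback Machine's CDX API from a script. Here's a minimal Python sketch; the parameters below reflect the public CDX API as I understand it, so verify them against Archive.org's documentation before relying on this:

    import requests

    # Ask the Wayback Machine CDX API for archived URLs under a domain.
    # matchType=domain includes subdomains; collapse=urlkey returns one
    # row per unique URL instead of one per snapshot.
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": "example.com",      # placeholder domain
            "matchType": "domain",
            "fl": "original",          # return only the original URL field
            "collapse": "urlkey",
            "output": "text",
            "limit": 50000,
        },
        timeout=60,
    )
    resp.raise_for_status()

    urls = resp.text.splitlines()
    print(f"Retrieved {len(urls)} archived URLs")

Because the API returns plain text with one URL per line, the output drops straight into a spreadsheet or notebook for the combining step at the end of this post.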

Moz Pro
While you would typically use a backlink index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
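If you go the API route, the sketch below shows roughly what a request to Moz's Links API looks like. The endpoint, request fields, and response shape are my assumptions based on Moz's v2 Links API; check Moz's current documentation before using it:

    import requests

    # Placeholder credentials -- Moz issues an access ID and secret key.
    ACCESS_ID = "your-access-id"
    SECRET_KEY = "your-secret-key"

    # Endpoint and body fields are assumptions from the v2 Links API docs.
    resp = requests.post(
        "https://lsapi.seomoz.com/v2/links",
        auth=(ACCESS_ID, SECRET_KEY),
        json={
            "target": "example.com/",
            "target_scope": "root_domain",  # links pointing anywhere on the domain
            "limit": 50,
        },
        timeout=60,
    )
    resp.raise_for_status()

    # Inspect the raw response to find the field holding the linked-to URL
    # on your site; the exact shape may differ from what you expect.
    print(resp.json())

From there, paginate through the results and collect the target URLs into your master list.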

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
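As a rough illustration, here's how paging through the Search Analytics endpoint looks with the google-api-python-client library. The property URL, dates, and client_secret.json file are placeholders, and the OAuth flow shown is one of several ways to authenticate:

    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient.discovery import build

    # One-time OAuth flow; client_secret.json is a placeholder for your
    # downloaded OAuth client credentials.
    SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
    flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", SCOPES)
    creds = flow.run_local_server(port=0)

    service = build("searchconsole", "v1", credentials=creds)
    site = "https://example.com/"
    pages, start_row = set(), 0

    # The API returns up to 25,000 rows per request; page with startRow.
    while True:
        body = {
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,
            "startRow": start_row,
        }
        resp = service.searchanalytics().query(siteUrl=site, body=body).execute()
        rows = resp.get("rows", [])
        if not rows:
            break
        pages.update(row["keys"][0] for row in rows)
        start_row += len(rows)

    print(f"{len(pages)} pages with impressions")

Note this only surfaces pages that earned impressions in the chosen date range, so treat it as one input among several rather than a complete inventory.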

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
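If the UI filters get tedious, the GA4 Data API can pull page paths directly. Here's a minimal sketch using Google's official google-analytics-data client library; the property ID is a placeholder, and authentication assumes Application Default Credentials are already configured:

    from google.analytics.data_v1beta import BetaAnalyticsDataClient
    from google.analytics.data_v1beta.types import (
        DateRange, Dimension, Metric, RunReportRequest,
    )

    # "123456789" is a placeholder GA4 property ID. The client picks up
    # credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
    client = BetaAnalyticsDataClient()
    request = RunReportRequest(
        property="properties/123456789",
        dimensions=[Dimension(name="pagePath")],
        metrics=[Metric(name="screenPageViews")],
        date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
        limit=100000,
    )
    response = client.run_report(request)

    paths = [row.dimension_values[0].value for row in response.rows]
    print(f"{len(paths)} page paths")

Keep in mind these are paths, not full URLs, so you'll want to prepend your hostname before merging them with the other sources.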

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be enormous, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process; a basic parsing sketch follows below.
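If you'd rather roll your own, here is a minimal Python sketch that pulls unique request paths out of an access log. The file name access.log and the common/combined log format are assumptions; adjust the regex to whatever your server or CDN actually emits:

    import re

    # Matches the request line of a common/combined log format entry, e.g.:
    # 66.249.66.1 - - [10/Jan/2025:06:25:14 +0000] "GET /blog/post HTTP/1.1" 200 ...
    LINE_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')

    paths = set()
    with open("access.log", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = LINE_RE.search(line)
            if match:
                # Strip query strings so /page?a=1 and /page?a=2 dedupe together.
                paths.add(match.group(1).split("?")[0])

    print(f"{len(paths)} unique request paths")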
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
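In a notebook, normalization and deduplication can be as simple as the sketch below. The normalization rules here (lowercasing the scheme and host, dropping fragments and trailing slashes) are one reasonable choice, not the only one; adapt them to your site's conventions, and note that url_lists.txt is a placeholder for your concatenated exports:

    from urllib.parse import urlsplit, urlunsplit

    def normalize(url):
        """Lowercase scheme and host; drop fragments and trailing slashes."""
        parts = urlsplit(url.strip())
        path = parts.path.rstrip("/") or "/"
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           path, parts.query, ""))

    # url_lists.txt is a placeholder: one URL per line, concatenated from
    # every source above (prepend your hostname to bare paths first).
    with open("url_lists.txt", encoding="utf-8") as fh:
        raw = [line.strip() for line in fh if line.strip().startswith("http")]

    deduped = sorted({normalize(u) for u in raw})
    print(f"{len(deduped)} unique URLs from {len(raw)} raw entries")

Whether trailing slashes, query strings, or uppercase paths count as "the same URL" depends on how your site is configured, so decide those rules before you deduplicate, not after.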

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
