Improve your internal links using Python string-matching
Andreas says: “Use Python’s string-matching functions to increase the relevance of your internal links on your website.”
Can you give a brief explanation of the value of using Python for SEO?
“What I love about Python is that it can scale SEO really well. A lot of SEOs will be working in spreadsheets and there are obviously restrictions or limitations in terms of what a spreadsheet can do. They are limited in the scale of the data they can handle, like the number of rows, but also in the complexity of the functions and calculations that they can perform with that data.
For example, if you're optimizing a high-traffic website with tons of pages, like Amazon, then you're going to find scalable SEO analysis in Excel or Google Sheets pretty limiting.
Instead, you can use a Jupyter notebook (formerly known as an IPython notebook), which lets you run Python code. If you import string-matching functions, you can take a target keyword and compare that to the title tags of your site pages to try and find the best page to send internal links to.”
Are you using this to determine whether a page or a piece of content is sufficiently optimized or just to find the most appropriate internal page to link to?
“You could also use it for measuring how optimized your content is, which is a different use case for Python. Python has many use cases for scalable and data-driven SEO.
In this case, though, we're trying to find content like blog posts where you can place internal links that will help reshape the importance of your target content for Google and other search engines.”
What content elements are you looking for?
“The great thing about doing this is that there are so many different ways to approach it. On a basic level, you could take your target keyword and the title tags of all of your content, and then simply use a string-matching function to calculate the similarity between them. Based on that similarity metric, you could use a quick rule of thumb to say that anything that's 60% or above would be considered suitable pages to place internal links on, for example.
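As a minimal sketch of that basic approach, the example below scores each title tag against a target keyword and keeps the pages at or above the 60% rule of thumb. It uses SequenceMatcher from Python's standard difflib module, which is only one of several string-matching options; the interview does not name a specific library. The URLs, titles, and the find_link_sources helper are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_link_sources(target_keyword, titles_by_url, threshold=0.6):
    """Return (url, score) pairs whose title similarity meets the threshold.

    The 0.6 default mirrors the 60% rule of thumb; it is a starting point,
    not a recommended value.
    """
    scored = [
        (url, similarity(target_keyword, title))
        for url, title in titles_by_url.items()
    ]
    matches = [(url, score) for url, score in scored if score >= threshold]
    # Most similar titles first.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical crawl export: URL -> title tag.
pages = {
    "/blog/python-seo-guide": "Python SEO guide",
    "/blog/office-party-ideas": "Office party ideas",
}

candidates = find_link_sources("python seo", pages)
for url, score in candidates:
    print(f"{url}: {score:.2f}")
```

In practice the titles would come from a crawl export rather than a hand-written dictionary, and the threshold would be tuned to the site.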
You could do it at the body content level but that's a bit more complex because you need to ingest that content into a table (what Python's pandas library calls a DataFrame) to do that kind of calculation. That’s possible thanks to Python.
If you don’t know what a good rule of thumb is, you can go even deeper. You can say, ‘I want to model the median’ or ‘I want to model the 95th percentile of what's considered relevant.’ You can determine your rule of thumb on a statistical basis rather than on something that you pulled out of thin air.”
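One way to derive the threshold statistically, sketched here under assumptions, is to compute the similarity score for every title and read the cutoff off the distribution. The example uses median and quantiles from Python's standard statistics module; the titles and the relevance_cutoffs helper are hypothetical, and difflib again stands in for whichever string-matching function you prefer.

```python
from difflib import SequenceMatcher
from statistics import median, quantiles


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def relevance_cutoffs(scores):
    """Return the median and 95th percentile of a list of similarity scores."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    # method="inclusive" treats the scores as the full population of pages.
    percentiles = quantiles(scores, n=100, method="inclusive")
    return {"median": median(scores), "p95": percentiles[94]}


# Hypothetical title tags from a crawl.
titles = [
    "Python SEO guide",
    "Python for SEO beginners",
    "Technical SEO checklist",
    "Office party ideas",
    "Company news roundup",
]

scores = [similarity("python seo", title) for title in titles]
cutoffs = relevance_cutoffs(scores)
print(cutoffs)
```

Either value can then replace the fixed 60% figure, depending on how strict you want the definition of "relevant" to be.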
Would you be able to incorporate intent into what you're looking for?
“You absolutely can. If you have a target keyword for each piece of site content, then you could add a separate column recording whether those two keywords share the same search intent or not.”
What data sources are required for this?
“If you wanted to do this at a basic level, you could rely on crawl data alone. If you want to get search intent involved, then you'll need SERP data so that you can compare the search intent of your target keyword with that of each content page's focus keyword. If you wanted to check whether Google is actually crawling a page, you would use server logs.”
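The interview does not say how the SERP comparison is made. One common approach, assumed here purely for illustration, is to treat two keywords as sharing intent when enough of the same URLs rank for both. The sketch below measures that with the Jaccard overlap of two ranking lists; the serp_overlap and same_intent helpers, the 0.4 cutoff, and the example URLs are all hypothetical.

```python
def serp_overlap(urls_a, urls_b):
    """Jaccard overlap (0-1) between the ranking URLs of two keywords."""
    set_a, set_b = set(urls_a), set(urls_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def same_intent(urls_a, urls_b, min_overlap=0.4):
    """Flag two keywords as sharing intent if their SERPs overlap enough.

    min_overlap is an illustrative assumption, not a recommended value.
    """
    return serp_overlap(urls_a, urls_b) >= min_overlap


# Hypothetical top results for two keywords.
serp_target = ["a.com/1", "b.com/2", "c.com/3", "d.com/4"]
serp_page = ["b.com/2", "c.com/3", "d.com/4", "e.com/5"]

print(serp_overlap(serp_target, serp_page))  # 3 shared URLs out of 5 distinct
print(same_intent(serp_target, serp_page))
```

The boolean result is the kind of value that could populate a separate "same intent" column alongside the similarity score.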
How do you clean URLs that you wouldn't want to link to?
“That’s a slightly separate issue, but let's get into it. One of the things that I do is model the page rank or link equity of a website using crawl data and external backlink data, so that I get both the internal and external page rank. Then, I amalgamate those two data sources together to get what I would call the ‘effective page rank’, which combines both the internal and the external.
Using that, you can transform or pivot your existing site structure away from the typical catalogue/product group structure (which might make sense from a librarian’s perspective) and move it more towards the type of content structure that the internet is more interested in.”
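The interview does not describe how the internal and external scores are combined, so the following is only a rough sketch under stated assumptions: internal page rank is estimated with a simple power iteration over the crawled link graph, external authority is approximated by each page's share of backlinks, and the two are blended with an equal weighting. The link graph, backlink counts, the 0.5 weight, and both function names are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page rank by power iteration over {page: [linked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages with no outgoing links spread their rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def effective_page_rank(internal, backlinks, weight_internal=0.5):
    """Blend internal rank with each page's share of external backlinks.

    The equal weighting is an assumption made for illustration only.
    """
    total = sum(backlinks.values()) or 1
    return {
        page: weight_internal * score
        + (1 - weight_internal) * backlinks.get(page, 0) / total
        for page, score in internal.items()
    }


# Hypothetical crawl data (internal links) and backlink counts.
links = {
    "/": ["/category", "/blog"],
    "/category": ["/product"],
    "/blog": ["/product"],
    "/product": ["/"],
}
backlinks = {"/": 10, "/blog": 30}

internal = pagerank(links)
effective = effective_page_rank(internal, backlinks)
for page, score in sorted(effective.items(), key=lambda p: p[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

Here /blog and /category receive the same internal score, but /blog ends up higher once its external backlinks are taken into account.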
Should all SEOs be doing this or is it primarily for technical SEOs?
“To me, any SEO should have a holistic view, and all SEOs should understand it. If you call yourself an SEO generalist or an SEO consultant, then you should have a level of competency, if not experience or understanding, in the holistic elements of SEO.
You should be competent in your technical, your content, and your backlinks/off-page SEO. Technical SEOs should know how to do this themselves, but SEO content strategists might not need to.”
How can you use statistical distributions to model relevance and highlight under-served target content?
“If you look at the median number of internal links to a product category on an e-commerce site, for example, those will be very different from the median number of internal links to a product item.
I don’t want to create a hard-and-fast rule. I don’t want to say that any pages that have fewer than 10 internal links need more links, or that you should add a certain number of links to those pages. If you use statistical distributions, you're taking a smarter, more tailored approach. You're taking a segmented approach, and you're accounting for the fact that not all content is equal.
You would expect your product categories to have more internal links, so the threshold will be high. Your product items may have fewer internal links, or it might be the other way around. The point is to take a segmented approach. By using distributions, you're moving away from hard-and-fast rules.”
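The segmented approach can be sketched as follows: group pages by content type, compute a threshold per segment from that segment's own distribution, and flag pages that fall below it. The median is used as the threshold here only as an example; a different percentile may suit a given site better. The page data and the underlinked_pages helper are hypothetical.

```python
from collections import defaultdict
from statistics import median


def underlinked_pages(pages):
    """Return URLs whose internal link count is below their segment's median."""
    counts_by_segment = defaultdict(list)
    for page in pages:
        counts_by_segment[page["segment"]].append(page["inlinks"])

    # One threshold per segment instead of a single site-wide rule.
    thresholds = {
        segment: median(counts) for segment, counts in counts_by_segment.items()
    }
    return [
        page["url"]
        for page in pages
        if page["inlinks"] < thresholds[page["segment"]]
    ]


# Hypothetical crawl data: internal link counts per page.
pages = [
    {"url": "/category/a", "segment": "category", "inlinks": 120},
    {"url": "/category/b", "segment": "category", "inlinks": 80},
    {"url": "/category/c", "segment": "category", "inlinks": 100},
    {"url": "/product/x", "segment": "product", "inlinks": 12},
    {"url": "/product/y", "segment": "product", "inlinks": 3},
    {"url": "/product/z", "segment": "product", "inlinks": 9},
]

print(underlinked_pages(pages))
```

Note that /category/b is flagged with 80 internal links while /product/x is not flagged with 12, which is exactly the distinction a single site-wide rule would miss.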
Is this just for internal links or can this approach be used to determine the optimum landing page for external links as well?
“You can apply it to absolutely everything. That's the whole premise of being data-driven.”
How do you measure the ROI of improved internal linking?
“You would benchmark the ROI beforehand, make the change following the model’s recommendations, and then see what the ROI is afterwards; it's almost like a split test. However, if you're going to make this change site-wide, then you would want to do a split A/A test because you're comparing the result of the internal linking on the same URL against itself, before and after.
If you wanted to make it truly scientific, then you would conduct a split A/B test. In that case, you would only make that change on a collection of unlinked URLs, measure the revenue before and after, then compare it to the control group.”
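A simple way to read the result of such a test, assuming revenue per URL is available for both periods, is to compare the relative change in the test group against the relative change in the control group. The sketch below does only that arithmetic; it does not test for statistical significance, and the revenue figures and function names are made up for illustration.

```python
def relative_change(before, after):
    """Relative change between two totals, e.g. 0.2 for a 20% increase."""
    return (after - before) / before


def ab_uplift(test_before, test_after, control_before, control_after):
    """Test-group change minus control-group change.

    Subtracting the control change removes movement (seasonality, for
    example) that affected both groups regardless of the new links.
    """
    test_change = relative_change(sum(test_before), sum(test_after))
    control_change = relative_change(sum(control_before), sum(control_after))
    return test_change - control_change


# Hypothetical revenue per URL, before and after the change.
test_before, test_after = [100, 200], [130, 230]
control_before, control_after = [100, 100], [105, 105]

uplift = ab_uplift(test_before, test_after, control_before, control_after)
print(f"Uplift attributable to the change: {uplift:.1%}")
```

With these numbers the test group grows by 20% and the control group by 5%, leaving roughly 15 percentage points that can be attributed to the internal linking change.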
Does providing better and more relevant internal links also enhance usability?
“In theory (and, in many cases, in practice), SEO and user experience are aligned. By optimizing your content for the search engines, you should also be optimizing it for the user. If the user knows what they're getting before they click on the link, and the link is more relevant for their needs, then that should improve their experience.”
If an SEO is struggling for time, what should they stop doing right now so they can spend more time doing what you suggest in 2024?
“Stop getting better at Excel and retrain in Python.
Personally, I rarely use Excel. I use Google Sheets but only for putting together nice graphs because the ones produced by Python are a bit too sciencey for a business audience.
A more diplomatic and practical approach would be to say, ‘Limit your use of Excel and retrain in Python’. You’ll start noticing that you can invest ten minutes or one hour working out how to solve a dilemma in Python rather than Excel and, eventually, it will get to the point where you can do so much more in Python that you will drop Excel like a hot potato.
Python is also well future-proofed. That’s not to say there won't be a language in 10, 15, or 20 years that will supersede Python. However, the great thing is that, once you learn a computing language, those skills are transferable to almost any other computing language. I started out using R, which is a statistical computing language. Once I saw that more of the SEO industry was favouring Python, it was really easy for me to switch. A lot of the function names are identical.”
Andreas Voniatis is Founder at Artios, and you can find him over at Artios.io.