Can the Use of Co-Occurrence and TF-IDF Improve Rankings? Absolutely!
Co-Occurrence and TF-IDF (Term Frequency x Inverse Document Frequency) are hot topics in advanced SEO, and with very good reason. More and more pages with little or no page authority are showing up on page 1. The new mantra is “Content is King”, and I couldn’t agree more. The days of ranking mediocre content by bombing it with links are over. The big SEO questions for 2016 and beyond are:
- How does Google determine what high quality content is?
- Can content be engineered for higher rankings?
This article is my feeble attempt to answer these 2 questions.
Before we jump in, I need to say this up front: engineering lousy content will not work. Google’s use of click data is their ultimate BS detector. If you get to page 1 and your visitors bounce, your rankings will eventually drop like an anvil in a Road Runner cartoon.
The Foundations of Co-Occurrence and TF-IDF
The 2015 SearchMetrics Ranking Factors Study found a very strong correlation between the co-occurrence of Relevant Terms and Proof Terms and page 1 rankings. Google has acknowledged using TF-IDF within their algorithm. But before we get started, let’s define some of the lingo:
- Relevant Terms: In a document about baseball, relevant terms could be home run, base, pitcher, diamond, etc.
- Co-Occurrence is simply defined as a group of Web Pages having the same relevant terms and phrases. Using the example above, co-occurrence would be a group of web pages about baseball all including Home run, base, and pitcher.
- TF-IDF (Term Frequency x Inverse Document Frequency) is a fairly simple statistical calculation that measures the importance (term weight) of relevant terms within a group of documents. Sometimes TF is replaced by “WDF” (Within Document Frequency)
- Proof Terms (Proof Keywords): The most heavily weighted (relevant) terms in a body of documents as calculated using TF-IDF.
- Topic Modeling: a machine learning process that attempts to connect words and documents to determine their meaning.
- Latent Dirichlet Allocation (LDA): a specific form of natural language processing and a subset of topic modeling
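To make the definitions above concrete, here is a toy co-occurrence check in Python. The documents and the relevant-term list are invented for the example; real analysis would run against fetched web pages.

```python
# Toy illustration of co-occurrence: count how many documents in a
# small "corpus" contain each relevant term. All data here is made up.
docs = [
    "the pitcher threw from the mound and gave up a home run",
    "he rounded first base after the home run cleared the fence",
    "the diamond was wet, so the pitcher warmed up indoors",
]
relevant_terms = ["home run", "base", "pitcher", "diamond"]

# For each term, count the documents that contain it.
doc_counts = {t: sum(t in d for d in docs) for t in relevant_terms}
print(doc_counts)  # {'home run': 2, 'base': 1, 'pitcher': 2, 'diamond': 1}
```

Terms shared by several documents on the same topic (here, “home run” and “pitcher”) are the kind of co-occurring relevant terms the rest of this article is about.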
Let’s make the following assumptions for this article:
- Google is obviously very secretive about their algorithms, and much of this article is speculative.
- We would need advanced degrees in Computer Science and Mathematics to really dig deeply into TF-IDF, LDA, etc. It is believed that Google may also be using the far more complex Latent Dirichlet Allocation (LDA) for topic modeling and semantic relevance. We will be examining TF-IDF from a very high level.
- Google claims to use over 200 signals to score and rank documents. We will only be talking about a handful in this article, and we will only be discussing on-page signals.
- We will be exploring TF-IDF in simplistic terms. There is likely a lot more going on behind the scenes.
TF-IDF and Co-Occurrence Correlation Studies
SearchMetrics’s annual correlation studies are the gold standard in SEO research. SearchMetrics pays special attention to the occurrence of “Proof Terms” and “Relevant Terms”. In short, there is a strong correlation between a page’s inclusion of relevant terms and high rankings.
Images courtesy of SearchMetrics
Early Google Algorithms
In the early days of search, Google was looking at simple on-page and off-page elements to score and rank a web page for a given search term:
- Did the keyword(s) appear individually in the document? If so, how many times (with a spam trigger)?
- If the search term had multiple words, did the phrase appear in the document?
- Did the search term appear in the first sentence of the document?
- Did the term appear in the H1, Meta-title, and URL?
- What was the Pagerank of the document, and was the term used as anchor text?
- Other conventional SEO signals.
Google would then score these documents using their algorithms to create the SERP. This method worked incredibly well for a while, until the SEO community figured out how to game the algorithm using various spam techniques. From 2009 to 2011, the quality of Google’s search results eroded as more and more low-quality sites spammed their way to page 1.
Google’s original algorithm can also break down when a page that mentions a search term only in passing still scores highly. If a page like this attracts a lot of backlinks, it can rank well without delivering a satisfactory result for the end user.
To improve results, Google needed a better way to measure content quality and semantic relevance, which led to adding some form of TF-IDF and/or LDA to the algorithm, as well as the creation of the Knowledge Graph. I would speculate that TF-IDF was rolled out either as a part of, or in conjunction with, Panda, and may also be part of Hummingbird.
This is perfectly logical. If a web page is on the topic of social media, you would expect to see the words Facebook, Twitter, tweet, Reddit, etc. in the content.
What is TF-IDF?
The short explanation is that TF-IDF finds common related terms for a group of web pages, and rates them by importance. Think about “Word Clouds” for groups of documents.
Math Alert! There’s some math ahead. If you want to cut to the chase, you can skip to the next section.
TF-IDF has been around a long time, and Google has been using it for years to determine document relevance.
TF-IDF is used to determine the weight and importance of a related term in a body (corpus) of documents. In this case, the corpus is the set of documents in Google’s index related to a particular primary search term.
TF (Term Frequency) is calculated the same way as keyword density: number of term occurrences/total number of words in a document. If a term appears 5 times in a 100 word document, TF=5/100=.05.
IDF (Inverse Document Frequency) is calculated by taking the logarithm of: (the number of documents in the corpus)/(the number of documents containing the term). Using a base-10 log, if there are 100,000 documents in the corpus and 1,000 contain the term, IDF=log(100,000/1,000)=log(100)=2.
Using the examples above, TF-IDF=.05 x 2=.1=Term Weight.
TF-IDF effectively de-emphasizes stop words. For example, for the term “the”, which would theoretically appear in every document, IDF=log(100,000/100,000)=log(1)=0.
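The arithmetic above can be sketched in a few lines of Python. The numbers mirror the worked examples in this section, and the base-10 log matches the calculation shown.

```python
import math

def tf_idf(term_count, doc_length, corpus_size, docs_with_term):
    """Term weight = TF x IDF, using a base-10 log as in the examples."""
    tf = term_count / doc_length
    idf = math.log10(corpus_size / docs_with_term)
    return tf * idf

# Term appears 5 times in a 100-word document;
# 1,000 of 100,000 corpus documents contain the term.
print(tf_idf(5, 100, 100_000, 1_000))  # 0.1

# A stop word like "the" appears in every document,
# so IDF, and with it the term weight, collapses to 0.
print(tf_idf(30, 100, 100_000, 100_000))  # 0.0
```

Note how the stop-word case goes to zero purely through the IDF factor, no matter how frequent the word is within the page.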
The SEO Magic of TF-IDF
We can use the same formulas to analyze a group of Web pages in aggregate. For example, we could run TF-IDF for a given search term against all of the SERP results on Page 1!
In short, TF-IDF identifies the most important terms that are common in a group of documents. A high TF-IDF score indicates a heavily weighted term. A low TF-IDF indicates an irrelevant term.
Conclusion: Reverse-engineering the SERP results with TF-IDF can indicate to SEOs what Google deems to be very important relevant terms for a keyword search. Running a TF-IDF analysis on the top ranking pages for a keyword should provide a list of related keywords. The related keywords can be incorporated into a web page to increase its quality and relevance scores.
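As a rough sketch of that reverse-engineering idea, here is a minimal pure-Python TF-IDF run over a handful of pages. The page texts are made-up stand-ins; in practice you would fetch and clean the actual page 1 results for your keyword.

```python
import math
from collections import Counter

# Stand-ins for the text of top-ranking pages for a keyword.
page_texts = [
    "red cotton dress with matching red shoes",
    "buy a red cotton dress online",
    "cotton dresses and summer skirts for women",
]

tokenized = [text.split() for text in page_texts]
n_docs = len(tokenized)

# Document frequency: how many pages contain each word.
doc_freq = Counter()
for words in tokenized:
    doc_freq.update(set(words))

# Average each word's TF-IDF weight across the corpus.
scores = {}
for word, df in doc_freq.items():
    idf = math.log10(n_docs / df)
    tf_sum = sum(words.count(word) / len(words) for words in tokenized)
    scores[word] = (tf_sum / n_docs) * idf

# Heaviest terms first: candidates to work into your own copy.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for word, weight in ranked[:5]:
    print(word, round(weight, 4))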
Google’s Use of Click Data to Measure SERP Quality
Google employees have never gone on record admitting it, but a number of tests have revealed that Google almost definitely tracks user clicks to determine the quality of their search results. Many SEOs speculate that Google rewards “winners” in the SERPs. A winner for an individual query is the web page where the end-user stops hitting the back button to return to the results page. Winners get promoted in the SERPs, and losers get demoted.
This is Google’s method for keeping the spam off page 1.
Google’s use of TF-IDF with Click Data
In order for TF-IDF to work for Google, there needs to be an initial baseline set of high quality web pages in order to generate related keywords. I would speculate that once Google establishes high quality SERPs (measured by click data) for a search term, Google uses those documents as a baseline for determining the related keywords.
My Experience Using TF-IDF
I’ve used some form of Co-Occurrence and TF-IDF for over 5 years on several thousand Web pages. While I have not kept track of the results to provide a case study, I can state categorically that when I have rewritten content using TF-IDF, the rankings increase significantly at least 90% of the time. I will, however, offer the following caveats:
- There is definitely a very strong correlation between the quantity of relevant terms on the page and higher rankings. Don’t be fooled into thinking that only including the top 10 relevant terms will move the needle. For competitive keywords, I try to go at least to the top 50.
- The location of the relevant terms on the page is also important. I try to include the top relevant terms in the first paragraph as well as H2s.
- In most cases, our replacement content had a higher word count than the original. This will have some positive impact on rankings.
- Google seems to have a bias towards incumbents.
- I’ve generally been targeting keywords where my sites’ Domain Authority (Moz) is within striking distance of the other sites on page 1.
- Content freshness may also play a part in the ranking increase.
Here are my unscientific observations:
- Not including Proof Terms is the death sentence for rankings. When discussing a topic, Google expects certain related terms to be in the document.
- For competitive keyword terms where the existing page ranks from 25 to 50, I’ve seen on average, rankings increases of 10-40 positions. The higher the starting position of a page, the less it will jump.
- For competitive keyword terms where the existing page ranks from 10-25, I generally see a position jump from 5-15 places.
- For competitive keywords where the existing page is already in the top ten, I generally see a rankings improvement of 1-3 positions, but sometimes, the rank does not improve at all.
- For medium to low competition keywords, TF-IDF will almost always get the content to page 1.
- If the incumbent content on page 1 is well written, comprehensive, and strong from a TF-IDF perspective, they will be difficult to crack without using off-page SEO.
- I have ranked many pages where my sites’ PA/DA is far lower than the others on Page 1.
- Adding relevant terms to an existing page without changing the content has less impact.
- About 10% of the time, rankings stay static. My general observation is that the pages that rank higher have a heavy dose of related terms.
My Conclusion: Google has gotten really good at rewarding quality content and punishing poor content. For highly competitive keywords, TF-IDF is the price of admission to the top 10 of the SERPs. Without resorting to blackhat methods, it is nearly impossible to rank without heavy usage of relevant terms.
TF-IDF and Co-Occurrence Tools
There are 2 tools that I use for optimizing content:
OnPage.org: Outside of enterprise class SEO tools, this is the only tool I’ve found that does TF-IDF analysis. It’s pricey at over $100 per month, but it’s a must-have. OnPage.org crawls the top 15 ranked websites for a given keyword term, and generates TF-IDF scores for many related terms. There’s also a step where you can enter your content, and the tool will make suggestions as to what relevant terms to add and how frequently. This tool does have several shortcomings, however.
- The user is locked into only going 15 deep in the SERPs, which, for me, is sometimes not deep enough.
- I don’t believe that the tool returns enough relevant terms to take down a highly competitive keyword term.
- It only returns relevant terms consisting of 1 or 2 words.
Shameless Affiliate Plug: Get OnPage.org here.
Ultimate Keyword Hunter (UKH): This is a free download, and is the poor man’s version of Onpage.org. It generally returns very similar results to OnPage.org, but requires more work to filter and rank the importance of relevant terms.
- UKH runs through the same crawling process as OnPage.org, but does not calculate TF-IDF. Figuring out the most important terms is more work.
- It does return aggregate keyword densities as well as the term frequency and how many sites use the term. Pretty handy.
- UKH will go as deep as you want into the SERPs, and gives the opportunity to exclude sites like Youtube videos from the calculations.
- UKH also has a content evaluation tool.
- UKH also provides results for multi-word relevant terms.
How to Optimize Content Using TF-IDF and Co-Occurrence
I prefer to use both tools. I use UKH to expand the results I get from OnPage.org.
- Create a list of closely related target keyword phrases that you want to rank for. Ex: Women’s Clothing, Ladies Apparel, Women’s Apparel, etc. The keywords need to be virtual synonyms.
- Run reports in one or both tools for each of your target keywords. Make sure to filter out sites like Youtube, as well as your own page.
- Start with multi-word terms and work downward. If a 3-word relevant term is “red cotton dress”, you can eliminate “red”, “cotton”, “dress”, “red cotton”, and “cotton dress” from the related terms list.
- You can generally eliminate words that don’t fit into the theme. If you are analyzing eCommerce pages, you’ll see words and phrases like “add to cart”.
- Combine your lists and rank them by importance.
- For highly competitive keywords, I often try to incorporate 50-100 relevant terms.
- Try and include the top 10 relevant terms in the first 2 paragraphs and H2s.
- When the content is finished, put it back into one of the tools for evaluation and minor tweaks.
- Give the keyword list to your writer with the following instructions:
- Write for end-users, not Google. Text should be very natural and readable.
- Try and incorporate as many of the top relevant terms into your first 2 paragraphs as possible.
- No Keyword Stuffing!
- Try and incorporate as many of your top terms as possible into subheadings.
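The multi-word de-duplication step in the process above, dropping any term that is wholly contained in a longer term already on the list, can be sketched like this. The term list is made up for illustration:

```python
# Drop any relevant term that appears, as a whole phrase, inside a
# longer term we are already keeping. Example terms are invented.
terms = ["red cotton dress", "cotton dress", "red cotton", "red",
         "cotton", "dress", "summer skirt"]

# Longest terms first; keep a term only if no kept term contains it.
# Padding with spaces keeps the match on word boundaries.
kept = []
for term in sorted(terms, key=len, reverse=True):
    if not any(f" {term} " in f" {k} " for k in kept):
        kept.append(term)

print(kept)  # ['red cotton dress', 'summer skirt']
```

This leaves only the longest distinct phrases, which is the list you would hand to a writer.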
My writers generally don’t complain about these instructions since I am giving them relevant terms to include. Try writing an article about SEO without using words like Google, rankings, keywords, pages, links, etc.!
Good Luck! I’d love to hear about your results!