7 Steps to find Google Analytics keywords


    Are you still searching for Google Analytics keywords?

    Back in 2011 it was just a small problem for SEOs.

    But now 99%+ of Google Analytics keywords are not visible.

    This used to be a treasure trove of useful information about visitor behavior and intent.

    However, since 2011, when Google released its secure search update, almost all organic search keywords have gradually been hidden as not provided.

    In July 2013, all search traffic moved to HTTPS.

    And the ratio of Google Analytics keywords not provided surged from about 41% to over 80%.

    Now it is over 99%.

    Is this an attempt by Google to persuade more users to pay for their services?

    It could be.

    But in 7 steps you can find all of your not provided keywords.

    • Step 1: Pull together nine different data sources
    • Step 2: Perform Google Analytics data analysis
    • Step 3: Analyze organic traffic for keyword fluctuation in Google SERP
    • Step 4: Train the keyword classifications
    • Step 5: Match sessions with keywords using an algorithm
    • Step 6: Start computation to unlock keywords not provided
    • Step 7: Session vs. string-based matching

    You can see in the image below that not provided accounts for almost all of the organic search traffic.

    [Image: organic traffic report in which almost all keywords are not provided]

    Let’s take a look at Google Analytics keywords and how you can see them again in GA.

    But first, let’s start with how not provided became an issue in the first place.

    A brief history of Google Analytics keywords

    To get a complete picture of why this matters so much, we need to jump into our time machine and take a trip back to 2011.

    Back then, Google Analytics gave website owners access to a vast well of information.

    In the organic search report, you could see which keywords brought visitors to your site.

    And how user behavior changed with different keywords.

    This made it easy to see which keywords were the most valuable.

    Let’s take an example.

    Say an ecommerce retailer observed that people who arrived on his site after a long tail search like ‘Adidas high top trainers’ had a high conversion rate.

    He could optimize that page to match user intent more closely.

    But in 2011, that all changed when Google started encrypting search data.

    According to Google, this move was made to make search more secure.

    As co-founders Larry Page and Sergey Brin disclosed, the increased privacy measures were taken partly due to the introduction of personalized search results.

    Nowadays, most of us don’t even consider the fact that our search results take personal information into account.

    For example, when you search for ‘Italian restaurants’, you expect Google to show you restaurants close to your current location, even if you didn’t search for ‘Italian restaurants in New York’.

    But this wasn’t always the case.

    As Google started incorporating personal information into search results, they had to protect users’ personal information.

    They achieved this by encrypting keywords from logged-in Google users.

    So, how does that process work?

    Before this change, a normal Google search ran over HTTP, and the search query was passed along in the referrer to the domain users clicked on.

    With the change, searches by logged-in users ran over HTTPS instead.

    This might seem like a small change, but it had a huge impact on our ability to access keywords.

    That’s because searches over a secure connection no longer pass the search query on to the destination site.

    Google Analytics keywords locked by privacy update

    From a privacy perspective, this change was fantastic.

    And I’m glad for it.

    The new encryption process added an extra layer of security to Google search.

    However, it became a huge obstacle for SEO and the online marketing world.

    When this change came into effect, SEO professionals suddenly lost a large chunk of information.

    The quest to unlock not provided keywords has been the dragon in the cave for SEO.

    There is good news, though: there is a way to reveal all organic keywords, including all session metrics.

    Keyword Hero can retrieve between 85% and 95% of all keywords in seven steps.

    7 Steps to get not provided keywords back

    Step 1: Pull together nine different data sources

    Once he gets the green light, Keyword Hero pulls together nine different data sources, including Google Analytics keywords and Search Console data via Google-certified API access.

    Next, massively parallel, cloud-based machine learning algorithms statistically match search phrases to sessions and cluster them.

    Then Keyword Hero uploads this data back to a new Google Analytics property.

    This allows you to analyze your new data set in a familiar setting without interfering with your original data.

    Step 2: Perform Google Analytics data analysis

    Once a user signs up, the Hero analyzes Google Analytics account history, looking for:

    • Patterns, e.g. site structure; URLs that pull traffic; clusters among organic traffic; device categories; time / date; locations
    • Type and content, e.g. number of direct sessions attributed to organic; news or ecommerce; one-pager or many product pages; semantic topics

    We then generate a large set of possible keywords per URL using:

    • 3rd party data sources e.g. rank monitoring services, browser extensions data, Bing search API
    • 1st party user data including Google Search Console and the remaining keywords in GA

    The result is one big bucket with all possible keywords per URL.
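    As a rough illustration of such a bucket, the sketch below merges candidate keywords from several sources into one de-duplicated set per URL. The source names, URLs, and keywords are made-up placeholders, not Keyword Hero's actual inputs or code.

```python
from collections import defaultdict

# Hypothetical candidate keywords per source, keyed by landing page URL.
sources = {
    "search_console": {"/shoes": ["shoes", "buy shoes"], "/boots": ["winter boots"]},
    "rank_monitoring": {"/shoes": ["Shoes", "running shoes"]},
    "ga_visible_keywords": {"/boots": ["winter boots", "leather boots"]},
}


def build_keyword_bucket(sources):
    """Merge all sources into one de-duplicated keyword set per URL."""
    bucket = defaultdict(set)
    for per_url in sources.values():
        for url, keywords in per_url.items():
            # Normalize case and whitespace so duplicates collapse.
            bucket[url].update(k.strip().lower() for k in keywords)
    return dict(bucket)


bucket = build_keyword_bucket(sources)
for url, keywords in sorted(bucket.items()):
    print(url, sorted(keywords))
```

    Using sets keeps the bucket free of duplicates when the same keyword shows up in more than one source.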

    [Image: Google Analytics keywords]

    None of this would be particularly useful unless we clustered and classified these meta keywords for you.

    Using external APIs we look for abnormal behavior in a keyword’s history.

    Were there changes in traffic in the last 12 months (per country / device)?

    Is this a new keyword? Did spikes occur? Is there a reference article on Wikipedia and, if so, how does it behave?
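    One simple way to answer "did spikes occur?" is a z-score check on a keyword's history. The sketch below is only an assumption about how such a check could look; the weekly figures and the threshold of 2.0 are invented for illustration.

```python
from statistics import mean, stdev


def find_spikes(weekly_volume, z_threshold=2.0):
    """Return indexes of weeks whose volume is unusually high.

    A week counts as a spike if it lies more than z_threshold standard
    deviations above the mean of all *other* weeks.
    """
    spikes = []
    for i, value in enumerate(weekly_volume):
        others = weekly_volume[:i] + weekly_volume[i + 1:]
        if len(others) < 2:
            continue  # not enough history to judge
        spread = stdev(others)
        if spread == 0:
            # All other weeks are identical: any higher value stands out.
            if value > others[0]:
                spikes.append(i)
            continue
        if (value - mean(others)) / spread > z_threshold:
            spikes.append(i)
    return spikes


# Hypothetical weekly search volume for one keyword.
history = [100, 110, 95, 105, 400, 102, 98]
print(find_spikes(history))
```

    Comparing each week against the other weeks, rather than against the full series, keeps a single large spike from hiding itself by inflating the mean and standard deviation.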

    After the Hero creates a keyword set and checks for potential spikes and abnormalities, he tries to understand the keyword not provided.

    Is this a brand keyword or a non-brand keyword?

    Is it a person / brand name? (important for n-gram analysis).

    Is the keyword transactional / navigational or

    [Image: keywords not provided]

    The result is one big data frame where certain parameters are attributed to the keywords.
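    To make the idea of attributing parameters to keywords concrete, here is a minimal sketch that builds such a table as a list of rows. The word lists and the "other" fallback label are hypothetical; a real classifier would be trained rather than hard-coded.

```python
# Hypothetical word lists, for illustration only.
BRAND_TERMS = {"adidas"}
TRANSACTIONAL_TERMS = {"buy", "cheap", "order", "price"}
NAVIGATIONAL_TERMS = {"login", "contact", "homepage"}


def classify_keyword(keyword):
    """Attach simple classification parameters to one keyword."""
    words = keyword.lower().split()
    tokens = set(words)
    if tokens & TRANSACTIONAL_TERMS:
        intent = "transactional"
    elif tokens & NAVIGATIONAL_TERMS:
        intent = "navigational"
    else:
        intent = "other"  # fallback label, an assumption of this sketch
    return {
        "keyword": keyword,
        "is_brand": bool(tokens & BRAND_TERMS),
        "intent": intent,
        "n_words": len(words),
    }


keywords = ["buy adidas high top trainers", "adidas login", "shoes"]
frame = [classify_keyword(k) for k in keywords]
for row in frame:
    print(row)
```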

    Step 3: Analyze organic traffic for keyword fluctuation in Google SERP

    Let’s suppose we have a site that ranked on positions 13 – 15 in calendar week X for the term “shoes”.

    At this point we have the following metrics for this landing page:

    [Image: keyword bounce rate]

    In calendar week X + 1 the site now ranks in position 5 for “shoes” and the metrics changed to:

    [Image: keyword conversions]

    In this simple example we assume ceteris paribus, i.e. that nothing else has changed.

    Almost all new sessions can be attributed to “shoes”.

    Why is this keyword analysis important?

    With just a few computations we gain some valuable insights.

    1. “Shoes” in position 5 brings in 1000 more daily sessions compared to position 13 – 15
    2. The bounce rate of the keyword is identical to other keywords that rank for this site
    3. The conversion rate of this keyword seems to be significantly lower than the conversion rate of the others

    It has brought only 5 more conversions with 1000 sessions, compared to the 10 conversions with the 500 sessions we had before.
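    The arithmetic behind that third insight is short enough to check by hand. The snippet below uses only the figures quoted in this example (500 sessions with 10 conversions before, 1000 extra sessions with 5 extra conversions after); the variable names are ours.

```python
# Figures from the example above.
sessions_before, conversions_before = 500, 10
extra_sessions, extra_conversions = 1000, 5

rate_before = conversions_before / sessions_before  # the other keywords
rate_shoes = extra_conversions / extra_sessions     # attributed to "shoes"

print(f"Conversion rate before: {rate_before:.1%}")
print(f'Conversion rate of "shoes": {rate_shoes:.1%}')
print(f"Ratio: {rate_shoes / rate_before:.2f}")
```

    In other words, sessions attributed to “shoes” convert at 0.5%, a quarter of the 2% rate seen before the ranking change.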

    When analyzing this, we take a couple of factors into consideration, most importantly:

    • Seasonal fluctuations
    • Overall site performance during this time frame
    • Keyword performance → is the spike due to a trend?

    Luckily, the search engine results page changes quite a lot. Otherwise, this would not be possible.

    Step 4: Train the keyword classifications

    Now please take a sip of coffee.

    Because this is going to get a bit complicated.

    At this point, we need to start matching keywords and sessions.

    Using the previously generated data, the Hero tries to calculate the probability of a certain keyword matching a cluster of sessions.

    Sessions that were captured through extensions and those where the keyword is still visible in Google Analytics (= “hard data”) can be matched with 100% certainty, so these will be left out of the probability calculation.

    After the first probability is calculated, the data will again be compared against the “hard data”.

    This happens on the cluster level to adjust the classifications.
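    The Hero's actual model is not public, so the following is only a toy sketch of the two ideas in this step: estimate a probability per keyword for a session cluster, then compare the prediction against the “hard data”. It assumes click counts per keyword are available for the cluster; all names and numbers are hypothetical.

```python
def keyword_probabilities(clicks_by_keyword):
    """Turn click counts for one session cluster into probabilities."""
    total = sum(clicks_by_keyword.values())
    if total == 0:
        return {}
    return {kw: clicks / total for kw, clicks in clicks_by_keyword.items()}


def hit_rate(predicted_keyword, hard_data_keywords):
    """Share of sessions with a known keyword that the prediction got right."""
    if not hard_data_keywords:
        return 0.0
    hits = sum(1 for kw in hard_data_keywords if kw == predicted_keyword)
    return hits / len(hard_data_keywords)


# Hypothetical cluster: organic sessions on /shoes from mobile devices.
clicks = {"shoes": 60, "buy shoes": 30, "running shoes": 10}
probs = keyword_probabilities(clicks)
best = max(probs, key=probs.get)

# Hypothetical "hard data": sessions in the same cluster with a known keyword.
known = ["shoes", "shoes", "buy shoes", "shoes"]
print(best, probs[best], hit_rate(best, known))
```

    A low hit rate on the hard data would be the signal to adjust the classifications for that cluster.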

    One more sip of coffee, please.

    After the first iteration, the Hero constructs strings from the GA data.

    These strings contain between 5 and 25 dimensions.

    It’s interesting how few sessions are left if you query a big string.

    You can check this for yourself using the Core Reporting API, where you get up to 7 dimensions in one string.

    Now, the keywords have been clustered sufficiently and you get only a handful of sessions back from one string.

    This leaves very few potential keywords per cluster of sessions (the cluster has gotten smaller, too).
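    The shrinking effect is easy to reproduce on toy data. The sketch below joins dimension values into one string per session and counts how many sessions share each string; the sessions and dimension names are invented and only loosely modeled on GA dimensions.

```python
from collections import Counter

# Hypothetical sessions; each dict holds a few GA-style dimensions.
sessions = [
    {"landingPage": "/shoes", "device": "mobile", "country": "US", "hour": "09"},
    {"landingPage": "/shoes", "device": "mobile", "country": "US", "hour": "21"},
    {"landingPage": "/shoes", "device": "desktop", "country": "US", "hour": "09"},
    {"landingPage": "/shoes", "device": "mobile", "country": "DE", "hour": "09"},
    {"landingPage": "/boots", "device": "mobile", "country": "US", "hour": "09"},
]


def cluster_sizes(sessions, dimensions):
    """Count sessions per dimension string (one string per value combination)."""
    return Counter("|".join(s[d] for d in dimensions) for s in sessions)


dimension_sets = (
    ["landingPage"],
    ["landingPage", "device"],
    ["landingPage", "device", "country", "hour"],
)
for dims in dimension_sets:
    sizes = cluster_sizes(sessions, dims)
    print(len(dims), "dimensions -> largest cluster:", max(sizes.values()))
```

    With one dimension the largest cluster holds four sessions; with four dimensions every string matches a single session.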

    Still with me?

    An unfair visual representation

    [Image: keyword cluster, showing Keyword A]

    Keyword A might now be matched with one or more of those sessions.

    This is a session cluster, containing four sessions.

    This results from the Hero requesting all sessions with a certain string.

    Step 5: Match sessions with keywords using an algorithm

    [Image: keyword algorithm]

    You know the Hero is top of his class at this sort of thing, so you can trust his superpowers.

    At this stage, he sets about creating an adjusted, unique algorithm for your account, that tries to match sessions with keywords.

    He then applies this algorithm to the entire history of the GA account.

    He then matches the results against the “hard data”, i.e. the remaining visible keywords and the keywords captured by browser extensions.

    After this last step, the algorithm is ready to go for each page.

    In most cases, the historic data is not enough, because some data sources are not available retrospectively. This is why the Hero works far better after a month.

    Step 6: Start computation to unlock keywords not provided

    After all that learning, the Hero can finally start.

    The computational process is essentially the same as the training process:

    1. Compute possible keywords
    2. Check results against “hard data”
    3. Adjust classifications based on results

    Now we have the final output to discover keywords not provided.

    The matching does not happen through CIDs (client IDs).

    Looking at that mirrored account, you’ll see that some dimensions are missing.

    This is because the matching happens based on dimension strings and not sessions.

    Are you still reading?

    Good, because this is where it heats up.

    Step 7: Session vs. string-based matching

    If the data was session-based, you’d get everything back like this:


    [Image: keyword CID]

    But what it looks like instead is this:


    [Image: Keyword GA CID]


    But why only 83% certainty?

    You may have seen this number on our website.

    This is just the average threshold we use because it’s easier to communicate.

    In reality, the certainty threshold for a keyword to match with a session varies between 80% and 85%.

    The certainty level is based on the assumption that 95% of all possible keywords not provided have been found in the first bucket (see the big red keyword bucket above).
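    To show how a certainty threshold works in principle, here is a minimal sketch. It uses the 83% average quoted above as the default; the function name, the candidate probabilities, and the choice to fall back to "(not provided)" below the threshold are assumptions of this example, not a description of the Hero's internals.

```python
CERTAINTY_THRESHOLD = 0.83  # average threshold quoted in the text


def assign_keyword(candidates, threshold=CERTAINTY_THRESHOLD):
    """Return the most likely keyword, or "(not provided)" below the threshold.

    candidates maps each candidate keyword to its estimated probability.
    """
    if not candidates:
        return "(not provided)"
    keyword, certainty = max(candidates.items(), key=lambda item: item[1])
    return keyword if certainty >= threshold else "(not provided)"


print(assign_keyword({"shoes": 0.91, "buy shoes": 0.09}))
print(assign_keyword({"shoes": 0.55, "buy shoes": 0.45}))
```

    The first call returns the keyword because 0.91 clears the threshold; the second stays at "(not provided)" because no candidate is certain enough.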

    And there you have it.

    The world’s first keyword superhero has solved the problem of not provided organic search.

    We are certain about the validity of the mathematical concepts involved here.

    But we have to remain high-level, as the Hero needs to mask his secret powers.

    However, if you have any further questions from a technical perspective, we would be happy to chat with you.

    Test it for three months for free (cancel at any time).

    Daniel Schmeh
    For eight years Daniel Schmeh has been involved in online marketing, in the first years mainly with SEO, later more and more with web analytics and marketing automation.