SGE and Google: To Block or Not to Block

Google is joining the generative AI movement with the Search Generative Experience (SGE), an experiment in Google Search Labs that places generative AI responses to queries directly on the search results page. This potential change to the layout of the Search Engine Results Page (SERP) has prompted many questions from marketers and SEOs alike. (Read our Introduction to SGE blog here for more.)

One major question is whether an organization should block its site in some way to keep its content out of SGE answers.

Answer: To put it bluntly, there is no easy way to do this without harming your site. To remove your content from SGE, you need to block Googlebot itself (not just Google-Extended), which would result in your site no longer ranking and losing all of your organic traffic from Google.

If there is content that you don’t want in SGE, and you don’t want it to rank either, we recommend blocking those individual pages. If you have several pages you wish to block, create a subfolder containing all of them; this folder can easily be blocked in robots.txt without impacting the entire site. 
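As a rough illustration of the subfolder approach, the sketch below checks a robots.txt rule with Python’s standard-library parser. The folder name /private/ and the example.com URLs are assumptions made up for this example, and Python’s parser is not Google’s own, so treat it as a sanity check rather than a guarantee of how Googlebot will behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "/private/" is an assumed folder name, not a requirement.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""


def googlebot_can_fetch(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text lets Googlebot crawl the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


if __name__ == "__main__":
    # Placeholder URLs: one inside the blocked folder, one outside it.
    for url in (
        "https://www.example.com/private/pricing-notes.html",
        "https://www.example.com/blog/what-is-sge",
    ):
        print(url, "crawlable:", googlebot_can_fetch(url))
```

Only URLs under the blocked folder are affected; the rest of the site stays crawlable.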

To clarify, let’s look at where Google gets the information it uses to generate SGE answers.

How does Google’s SGE (Search Generative Experience) get information?

SGE doesn’t use a special user agent to crawl and fetch data; it relies on Googlebot. When you consider that Google already has all of this information thanks to Googlebot’s continual crawling of sites, it makes sense. 

Additionally, relying on Googlebot limits the control site owners have over how SGE gathers information. If site owners could easily block Google from using their content, SGE would quickly be hamstrung and could not work as well as Google would like. Because blocking Googlebot entirely costs a site its organic traffic and rankings, very few sites will do it, so Google can gather as much information as possible.

This ultimately means that SGE can surface any information about your site that Googlebot can normally access. The only way to hide information is to block Googlebot from accessing those pages via a robots.txt rule.

What if we have proprietary information on our site that we don't want available to the general public?

Proprietary information would not typically be available on a company’s public website. SGE is powered by the information that Googlebot crawls. The only information it factors in is what you put on your website and allow to be crawled (plus anything others write about your brand, but we’re not addressing that here). 

Again, if you have content on specific pages you don’t want to be seen by Google, you can block Googlebot from accessing those specific pages through robots.txt rules that prevent crawling. 

Be very careful with this, though. Double-check to make sure those pages aren’t indexed. If they are, and they are ranking, you need to put a noindex tag on them first and wait for Google to remove them. The order matters because Google can only see a noindex tag on pages it is still allowed to crawl. Once the pages are out of the index, you’re safe to put in the robots.txt block. Remember, this will only apply to those specific pages, and you don’t want to do this to your whole site or your important pages.
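Before adding the robots.txt block, it can help to confirm that the noindex tag is really present in the page’s HTML. The sketch below is one minimal way to do that with Python’s standard library. It only looks at robots meta tags in an HTML string you supply, the function name has_noindex is made up for this example, and it does not check HTTP headers or whether Google has already dropped the page from its index.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML carries a noindex directive in a robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print("noindex present:", has_noindex(page))
```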

What about Google-Extended?

Google-Extended is a user agent token that you can reference in robots.txt to control whether your content is used for some of Google’s AI products, such as Gemini (formerly known as Bard) and Vertex AI. However, Google-Extended is not used for SGE, and blocking it will not prevent SGE from using your content. (Read more from Search Engine Land and SEO Roundtable on this topic.)
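To make the distinction concrete, here is a small sketch of a robots.txt that disallows only the Google-Extended token, checked with Python’s standard-library parser. The URL is a placeholder and the helper name is_allowed is invented for this example. Python’s parser only approximates how Google reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of Google-Extended only.
ROBOTS_TXT_AI = """\
User-agent: Google-Extended
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT_AI) -> bool:
    """Return True if the robots.txt text lets the given user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/blog/what-is-sge"  # placeholder URL
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, "allowed:", is_allowed(agent, url))
```

The rule matches Google-Extended but leaves Googlebot untouched, which is why this kind of block has no effect on SGE.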

What if we have gated information on our site and depend on paid subscriptions for our business model?

Gated information that requires a login and does not use paywall schema is unavailable to Googlebot anyway since Googlebot cannot log in.

This also applies to any paid subscription requiring a login to access the content. 

There are two ways to handle how SGE utilizes your paywalled content. Currently, if your page has structured paywall data, Google will know it is paid content and still surface an overview, along with links to the piece. 

If you do not want it to appear in SGE in Google Labs, you need to update your pages’ HTML to add the data-nosnippet attribute to the elements that contain the content (read more from Google here). Google supports this attribute on span, div, and section elements. It will prevent SGE from using the marked content in a generated answer. Keep in mind that there’s no need to remove the structured paywall data; we would advise that you keep it.
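The sketch below shows what data-nosnippet markup can look like and, as a rough approximation, which text would remain eligible for use. The sample page and the extractor are assumptions written for illustration. Google’s actual processing is not public, so this only models the documented rule that content inside a span, div, or section carrying the attribute is excluded.

```python
from html.parser import HTMLParser

# Elements on which Google documents support for data-nosnippet.
NOSNIPPET_TAGS = {"span", "div", "section"}


class SnippetTextExtractor(HTMLParser):
    """Collects text that sits outside any data-nosnippet element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a data-nosnippet element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in NOSNIPPET_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested span/div/section inside a skipped region
        elif any(name == "data-nosnippet" for name, _ in attrs):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in NOSNIPPET_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def snippet_eligible_text(html: str) -> str:
    """Return the page text that is not wrapped in a data-nosnippet element."""
    parser = SnippetTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the div marks the content to keep out of generated answers.
PAGE = """\
<p>Our annual report is available to subscribers.</p>
<div data-nosnippet>
  <p>Revenue grew <span>12%</span> year over year.</p>
</div>
<p>Subscribe to read more.</p>
"""

if __name__ == "__main__":
    print(snippet_eligible_text(PAGE))
```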

‘SGE While Browsing’ works the opposite way

If you have implemented the paywalled content schema, ‘SGE While Browsing’ will not use your paywalled content. If you haven’t implemented this schema yet, it’s worth considering, as it allows Google access to paywalled content for ranking purposes. This helps users discover your content if it’s relevant to their query without giving away your content for free. 
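For reference, paywalled content schema is usually added as JSON-LD in the page. The sketch below builds a minimal example in Python. The NewsArticle type, the headline, and the .paywall CSS selector are placeholders chosen for illustration, so check Google’s paywalled content documentation for the properties your pages actually need.

```python
import json


def paywall_json_ld(headline: str, css_selector: str) -> str:
    """Build a minimal JSON-LD string marking part of an article as paywalled."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",  # assumed type; use whichever type fits the page
        "headline": headline,
        "isAccessibleForFree": False,
        "hasPart": {
            "@type": "WebPageElement",
            "isAccessibleForFree": False,
            # Selector for the element that wraps the paywalled section (assumed class).
            "cssSelector": css_selector,
        },
    }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    # The output would go inside a <script type="application/ld+json"> element.
    print(paywall_json_ld("Example headline", ".paywall"))
```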

How can I test SGE for myself?

To sign up to test SGE, you need to:

  1. Log in to a personal Gmail account (a Gmail account not tied to Google Workspace).
  2. Be located in one of the countries where SGE is available. If you’re in Europe, you can get around this by using a VPN. Here are the instructions for getting access to Google SGE.