Google Patent: Website Representation Vectors Used to Classify a Website's Expertise, Authority, and Trust.
Bill Slawski from Go Fish Digital discovered a new patent relating to how Google uses Representation Vectors to classify websites. Google filed the Patent in August 2018, and it is called “Website Representation Vector to Generate Search Results and Classify Website.”
As the title indicates, this Patent may be related to the classification of websites that require Expertise, Authority, and Trust (EAT). Topics that typically need EAT are Your Money or Your Life (YMYL) topics, such as health and finance.
We know that EAT is important as Google discusses it at length in the Google Quality Raters Guidelines.
It is perhaps useful to run through the salient points of the guidelines first, so you can see why this Patent matters and how it relates, in layman’s terms, to Google Search.
What the Quality Raters Guidelines tell us about YMYL and EAT
Google defines YMYL pages and topics on page 10 of the guidelines, as those that could “potentially impact a person’s future happiness, health, financial stability, or safety.”
Examples of YMYL given in the guide are quite wide-ranging, which can be seen in the screenshot below:
The guide continues on page 22 to talk about EAT and how it is integral to high-quality pages. This is what the guide says:
High quality pages and websites need enough expertise to be authoritative and trustworthy on their topic. Remember that there are “expert” websites of all types, even gossip websites, fashion websites, humor websites, forum and Q&A pages, etc. In fact, some types of information are found almost exclusively on forums and discussions, where a community of experts can provide valuable perspectives on specific topics.
The guide says that the highest quality pages should have the following:
- Very high level of Expertise, Authoritativeness, and Trustworthiness (E-A-T).
- A very satisfying amount of high or highest quality MC.
- Very positive website reputation for a website that is responsible for the MC on the page. Very positive reputation of the creator of the MC, if different from that of the website.
It is worth reading the entirety of the guide, as it goes into detail with examples of what Google expects from high-quality pages.
The vital point here is that Google is assessing whether your website is written by someone with the necessary expertise. Is your site or content authoritative and trustworthy?
How they do this remains largely a mystery, although Google’s Gary Illyes has previously said it is mostly about links and mentions on authoritative sites:
I asked Gary about E-A-T. He said it's largely based on links and mentions on authoritative sites. i.e. if the Washington post mentions you, that's good. He recommended reading the sections in the QRG on E-A-T as it outlines things well. @methode #Pubcon
Marie Haynes (@Marie_Haynes), February 21, 2018
Will the Patent shed any further light on this? I’ll cover this next.
What does the Patent say about the classification of websites?
The Patent describes a process where Google could use Website Representation Vectors to classify websites in particular knowledge domains.
Let’s take a look at the wording in the Patent summary (I’ve added some line breaks for readability):
In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of, for each website of a plurality of websites determined to be in a particular knowledge domain, wherein the particular knowledge domain is one of a plurality of knowledge domains that are each different from the other knowledge domains:
receiving representations of the website and a quality score representing a quality measure of the website relative to other websites;
classifying as first websites each of the plurality of websites having a quality score below a first threshold, at least one of the plurality of websites having a quality score below the first threshold;
classifying as second websites each of the plurality of websites having a quality score above a second threshold that is greater than the first threshold, at least one of the plurality of websites having a quality score greater than the first threshold;
generating a first composite-representation of the websites classified as the first websites; generating a second composite-representation of the websites classified as the second websites; receiving a representation of another website;
determining a first measure of difference between the first composite-representation and the representation; determining a second measure of difference between the second composite-representation and the representation; and based on the first measure of difference and the second measure of difference, classifying the other website as one the first websites, the second websites, or as third websites that are not classified as either the first websites or second websites
Okay, so let’s break this down:
- Google groups websites into knowledge domains. There are many knowledge domains, each containing similar sites.
- Within a knowledge domain, each website has a representation and a quality score that measures it relative to other sites. Sites scoring below a first threshold form one group, and sites scoring above a second, higher threshold form another.
- Google builds a composite representation for each group, then classifies a new website by how closely its own representation matches each composite. A site that fits neither group falls into a third group.
Are you still with me?
Essentially, Google will assign each website covering a topic to a specific group, containing sites of similar quality.
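To make the steps in the summary more concrete, here is a minimal sketch of how they could fit together. The Patent text quoted above does not say how the composite is built or how the "measure of difference" is calculated, so the averaging, the Euclidean distance, and the `max_distance` cut-off for the third group are all my assumptions, as are the function names.

```python
from math import sqrt

def mean_vector(vectors):
    """Element-wise average of equal-length vectors (an assumed 'composite')."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def distance(a, b):
    """Euclidean distance, used here as an assumed 'measure of difference'."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_composites(sites, first_threshold, second_threshold):
    """sites: (vector, quality_score) pairs from a single knowledge domain.

    Assumes at least one site falls below the first threshold and at least
    one falls above the second, as the Patent summary requires.
    """
    first = [vec for vec, score in sites if score < first_threshold]
    second = [vec for vec, score in sites if score > second_threshold]
    return mean_vector(first), mean_vector(second)

def classify(vector, first_composite, second_composite, max_distance):
    """Label a new site 'first', 'second', or 'third' (close to neither)."""
    d_first = distance(vector, first_composite)
    d_second = distance(vector, second_composite)
    if min(d_first, d_second) > max_distance:  # hypothetical cut-off
        return "third"
    return "first" if d_first < d_second else "second"

# Invented two-dimensional vectors and scores, purely for illustration.
sites = [
    ([0.9, 0.1], 0.2), ([0.8, 0.2], 0.3),   # low quality scores
    ([0.1, 0.9], 0.9), ([0.2, 0.8], 0.8),   # high quality scores
    ([0.5, 0.5], 0.5),                      # between the two thresholds
]
low, high = build_composites(sites, first_threshold=0.4, second_threshold=0.7)
label = classify([0.2, 0.9], low, high, max_distance=0.3)
```

With this invented data, a new site whose vector sits near the high-scoring sites is labelled "second", while one sitting midway between the two composites ends up as "third".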
What criteria are used to classify websites?
The Patent goes further into the methods used to generate the representations of quality starting at paragraph .
Here are the salient points (with paragraph labels so you can read the detail directly in the Patent should you wish):
Website Content. The website content may include the text from the website, the images on the site, other website content, e.g., links, or a combination of two or more of these.
Classification System. The website classification system uses the website content to generate a representation for the website. This can use a mapping system that maps content to a vector space that identifies a representation for the website. The classification system may use a neural network.
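If "mapping content to a vector space" sounds abstract, the toy sketch below may help. It is not the neural network the Patent mentions. It is a much simpler stand-in (hashing words into a fixed number of buckets) that I'm using only to show what a fixed-length representation of text can look like. The function name `embed_text` and the `dims` parameter are hypothetical.

```python
import re
import zlib

def embed_text(text, dims=8):
    """Map text to a fixed-length vector by hashing each word into a bucket.

    A toy stand-in for a learned mapping: every word increments one of
    `dims` buckets, and the counts are normalised so they sum to 1.
    """
    vector = [0.0] * dims
    for word in re.findall(r"[a-z]+", text.lower()):
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        bucket = zlib.crc32(word.encode("utf-8")) % dims
        vector[bucket] += 1.0
    total = sum(vector)
    return [value / total for value in vector] if total else vector

representation = embed_text("Symptoms and treatment of seasonal flu")
```

Whatever text goes in, the output always has the same length, which is what lets two websites be compared with a single distance calculation.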
Labels may be based on anything. [0032, 0033] During training, the classification system may use labels for the websites to determine classifications for each of the websites. The labels may be:
- Alphanumeric, numerical, or alphabetical characters, symbols, or a combination of two or more of these.
- The type of entity that published the corresponding website, e.g., a non-profit or a for-profit business.
- The industry described on the corresponding website, e.g., artificial intelligence or education.
- The type of person who authored the corresponding website, e.g., a doctor, a medical student, or a layperson.
Labels may be the assigned representation of quality scores.
The scores may be specific to a particular knowledge domain. The website classification system can determine multiple queries for a particular knowledge domain.
- Examples of knowledge domains include artificial intelligence, education, astronomy, and health.
- A single website might have different scores for different knowledge domains.
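One simple way to picture per-domain scores is a lookup keyed first by website and then by knowledge domain. This is only a sketch of the idea. The site names, numbers, and the `score_for` helper are invented, and the Patent does not describe its storage in these terms.

```python
# Hypothetical per-domain quality scores; site names and numbers are invented.
SCORES = {
    "clinic.example": {"health": 0.9, "finance": 0.3},
    "broker.example": {"finance": 0.8},
}

def score_for(site, knowledge_domain):
    """Return the site's score in one knowledge domain, or None if unscored."""
    return SCORES.get(site, {}).get(knowledge_domain)

health_score = score_for("clinic.example", "health")
finance_score = score_for("clinic.example", "finance")
```

Here the same site scores well for health and poorly for finance, so its standing depends on which knowledge domain is being considered.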
The Patent continues with more detailed information about more complex situations. For example, it covers mappings for subdomains or portions of the content included in a website.
I don’t think it will serve much purpose to go into too much detail here, but if you want to know more, I suggest reading the Patent from paragraph 0030.
How does Google use the website classifications to display the search results?
For any given search query, Google may select one of the classifications upon determining what knowledge domain applies to the query.
In other words, depending on the query and its context, Google may prioritize websites with higher knowledge domain classifications.
For example, the search engine may select the second websites for searching because those websites generally have data responsive to queries for the knowledge domain while the first websites generally include data that is less responsive to queries for the knowledge domain.
You can read more about this aspect in the Patent at paragraph .
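As a rough illustration of that selection step, the sketch below filters sites by their classification in the query's knowledge domain. It assumes the knowledge domain for the query has already been worked out, which is a separate problem. The data, the `candidate_sites` function, and the choice of "second" as the preferred group are hypothetical.

```python
# Hypothetical classifications per knowledge domain; all names are invented.
CLASSIFICATIONS = {
    "health": {
        "clinic.example": "second",
        "rumours.example": "first",
        "forum.example": "third",
    },
    "finance": {
        "bank.example": "second",
        "clinic.example": "third",
    },
}

def candidate_sites(query_domain, preferred="second"):
    """Return sites classified as `preferred` in the query's knowledge domain."""
    by_site = CLASSIFICATIONS.get(query_domain, {})
    return sorted(site for site, label in by_site.items() if label == preferred)

health_candidates = candidate_sites("health")
```

Note that the same site can be a candidate for a health query and not for a finance query, because the classification is made per knowledge domain.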
The Raters Guidelines classify content by YMYL topic along with the requisite levels of EAT. The guidelines say that high-quality content requires a high level of EAT for specific topics.
The Patent classifies content by topic (knowledge domain) and the requisite levels of quality (knowledge domain classifications). The knowledge domain classifications may represent the expertise level of the content.
The level of similarity between the two concepts is hard to ignore, especially when the Patent uses YMYL topics as examples, such as Health and Finance.
If you have anything to add, please let me know in the comments.