Topic Extraction
Since Lytics collects and stores every event without any aggregation, automatic topic extraction becomes a possibility. For every URL seen, Lytics fetches the web page at that URL, analyzes the content, the metadata, and even the images. The analysis boils the web page down to a set of topics. Where manual topic tagging may result in four or five topics for an article or product, Lytics topic extraction often results in 10 or more.
The Significance of Topic Extraction
Having a set of topics in addition to the volume of content for each topic greatly increases the potential for personalized content-marketing. In the absence of topics, the primitives for content-marketing become URLs and keywords. These are both flawed in different ways.
Marketing using URLs alone means working directly with the large corpus of content. At best, when content is well organized, partial URLs can be used to represent higher-level abstract ideas. Unfortunately, carefully architecting URLs is rarely enough to manage all of it. Marketing using keywords is an improvement over URLs but misses the meaningful higher-level connections that humans think in.
Topics are the better content-marketing primitive, especially when bolstered by a networked taxonomy. See how topics are useful in audiences as well as content recommendations.
How Topics are Different than Keywords
Lytics makes sure to distinguish between topics (what Lytics uses) and keywords (what platforms like Google AdWords use). Here is a passage of text to use as an example:
The Seahawks blew a chance to make Super Bowl history with another improbable comeback because of an inexplicable decision to pass instead of handing the ball to Marshawn Lynch.
In this passage, the keywords have been marked in bold. They are extracted verbatim from the text. This is a tunnel-vision approach and is easily gamed by clever copywriters.
Topic extraction, however, identifies inferences that keywords miss. In this same passage, topic extraction would pick up on topics that weren't in the text, such as:
- The NFL
- American Football
- Sports
- Sports Organizations
Since topics are extracted using a more sophisticated Natural Language Processing approach, they are effective content-marketing primitives.
Topic Extraction Logistics
When Lytics fetches and analyzes new content, it does so with a bot, creatively called lyticsbot. When lyticsbot scrapes your content, you can identify it with some HTTP headers that will be present on every request, namely:
- User-Agent: lyticsbot
- Lytics-Id: <YOUR_ACCOUNT_ID>
These headers let you identify the requests that Lytics makes to scrape your content and enhance your topic graph. If your content is behind a paywall or other authentication, you can use these headers to permit lyticsbot access to your content. If, for some reason, these means of authentication don't work for your content, we also support basic authentication, or you can send custom HTTP headers. Your Lytics account representative can help you with either of these options. See below for more information on crawling.
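As a rough illustration of the header check described above, here is a minimal Python sketch of how a server might recognize a lyticsbot request. The function name `is_lyticsbot_request` is hypothetical, and the sketch assumes the User-Agent value contains the string lyticsbot and that header names are matched case-insensitively. Because request headers are supplied by the client, a check like this identifies the bot but is not strong authentication on its own.

```python
from typing import Mapping


def is_lyticsbot_request(headers: Mapping[str, str], account_id: str) -> bool:
    """Return True if the headers look like a lyticsbot request for this account.

    Assumptions (not confirmed by Lytics documentation): the User-Agent value
    contains "lyticsbot", and Lytics-Id carries exactly your account ID.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").lower()
    lytics_id = normalized.get("lytics-id", "")
    # Require a non-empty account ID so a missing header never matches.
    return bool(account_id) and "lyticsbot" in user_agent and lytics_id == account_id
```

A paywall handler could call this function with the incoming request headers and serve the full article when it returns True.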
Providing Custom Topics
Lytics will automatically extract topics from the main content at a URL, but sometimes you also want to track domain-specific topics. In this case, Lytics supports a special meta tag for annotating custom topics.
Provide a comma-separated list of topics in a lytics:topics meta element in your HTML source.
Here is an example from a Lytics blog post:
<html>
<head>
<title>Omeda and Lytics Team Up To Offer All-In-One Audience Engagement Platform</title>
<!-- ... -->
<meta name="lytics:topics" content="Customer Data Platform, Lytics News"/>
<!-- ... -->
</head>
</html>
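To check which custom topics a page declares before lyticsbot visits it, a small script can read the meta element back out of the HTML. The sketch below uses Python's standard `html.parser` module. The names `TopicMetaParser` and `declared_topics` are hypothetical, and the splitting and trimming rules are an assumption about how a comma-separated list is usually read, not a description of how Lytics parses it.

```python
from html.parser import HTMLParser


class TopicMetaParser(HTMLParser):
    """Collects topics from <meta name="lytics:topics" content="..."> elements."""

    def __init__(self):
        super().__init__()
        self.topics = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") != "lytics:topics":
            return
        content = attributes.get("content") or ""
        # Assumption: topics are separated by commas and surrounding
        # whitespace is not significant.
        self.topics.extend(t.strip() for t in content.split(",") if t.strip())


def declared_topics(html: str) -> list:
    """Return the custom topics declared in the given HTML source."""
    parser = TopicMetaParser()
    parser.feed(html)
    return parser.topics
```

For the blog post example above, `declared_topics` would return the two topics "Customer Data Platform" and "Lytics News".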
Additionally, your Lytics account can be configured to also scrape other meta tags to feed into your topic graph by setting the account's content_customprops setting to the names of the meta tags you'd also like to include.
For example, if you wanted your Lytics topic graph to include topics from your article:tag meta tags, you could update your account settings with the following API request.
curl -XPUT "https://api.lytics.io/api/account/$ACCOUNTID" \
-H 'Content-type: application/json' \
-H "Authorization: $LIOKEY" \
-d '{
"settings" : {
"content_customprops": ["article:tag"]
}
}'
Now, after adding article:tag to the setting, any values from article:tag meta tags will also appear in the topic graph. This means they'll be eligible for content affinities, targeting, and personalization, and will inform content recommendations.
Note: Lytics will track these custom topics in addition to the automatically extracted topics, so there is no need to specify generic topics.
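The same account settings request can be built from a script. The following Python sketch mirrors the curl example using only the standard library. The URL, headers, and JSON body are taken from that example, while the function name `build_customprops_request` and its parameters are hypothetical. The function only constructs the request, so you can inspect it before sending.

```python
import json
import urllib.request


def build_customprops_request(account_id, api_key, meta_names):
    """Build (but do not send) the PUT request that sets content_customprops."""
    body = json.dumps({"settings": {"content_customprops": list(meta_names)}})
    return urllib.request.Request(
        url=f"https://api.lytics.io/api/account/{account_id}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Content-type": "application/json",
            "Authorization": api_key,
        },
    )


# To send the request, pass it to urllib.request.urlopen, for example:
#   request = build_customprops_request(account_id, api_key, ["article:tag"])
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```

Keeping construction separate from sending makes it easy to review the exact payload, which matters here because the request changes account-level settings.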
Crawling
For some websites it is desirable to allow lyticsbot to crawl everything as fast as possible. However, some web administrators would like more flexibility and control over how fast the bot crawls and where it attempts to pull content from. The bot follows the directives in the robots.txt file located at the root of the website, for instance https://www.lytics.com/robots.txt. Below you can see three common robots.txt configurations.
NOTE: You must specify the lyticsbot user agent. A wild card will not work in this case.
The first example disallows lyticsbot from attempting to crawl any links that reside in the /admin directory.
User-agent: lyticsbot
Disallow: /admin
The second shows a "crawl delay" being used to set the number of seconds between crawl attempts.
User-agent: lyticsbot
Crawl-delay: 10
Finally, the two settings can be combined if needed. This example blocks the /admin and /private directories and sets the delay to 10 seconds, which effectively limits the bot to 8,640 pages a day (86,400 seconds in a day divided by 10).
User-agent: lyticsbot
Disallow: /admin
Disallow: /private
Crawl-delay: 10
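One way to sanity-check a robots.txt file before deploying it is to run it through Python's standard `urllib.robotparser` module. The sketch below parses the combined configuration and derives the daily page limit from the crawl delay. The example.com URLs are placeholders, and it is an assumption that lyticsbot interprets these directives the same way Python's parser does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: lyticsbot
Disallow: /admin
Disallow: /private
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check a blocked path and an allowed path for the lyticsbot user agent.
admin_allowed = parser.can_fetch("lyticsbot", "https://www.example.com/admin/settings")
blog_allowed = parser.can_fetch("lyticsbot", "https://www.example.com/blog/post")

# Crawl-delay is read as a whole number of seconds between requests.
delay_seconds = parser.crawl_delay("lyticsbot")

# With one request every delay_seconds, the daily ceiling is 86,400 / delay.
SECONDS_PER_DAY = 24 * 60 * 60
max_pages_per_day = SECONDS_PER_DAY // delay_seconds

print(admin_allowed, blog_allowed, delay_seconds, max_pages_per_day)
```

With this configuration the /admin URL is reported as blocked, the blog URL as allowed, and the 10-second delay works out to at most 8,640 pages per day.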
Natural Language Processing
The following Natural Language Processing (NLP) services are available in Lytics for content enrichment. Each link takes you to the Language support page for that service, if applicable.
The Setting column gives the account setting that enables each service. Lytics Support must make this change for you.
Service | Setting | Notes |
---|---|---|
Google NLP | google_nlp | The default enricher turned on for all new accounts. |
Google NLP (entity) | google_nlp_entity | If used, only entities such as "Barack Obama" and "Frank Sinatra" are brought in as topics, instead of general topics like "Politics" and "Music". |
Google Vision | google | Analyzes images to predict topics. |
Diffbot | diffbot for topics, diffbot_meta for metadata | Predicts both content topics and content type. It was set as the default in most accounts created prior to 2020. Its associations between topics and content are looser than those of Google NLP. By turning this on you’ll bring in more topics, but they may not feel intuitive. |
TextRazor | textrazor | Predicts topics. It is very verbose and may also bring in topics that do not feel intuitive. |