Topic Extraction

Since Lytics collects and stores every event without any aggregation, automatic topic extraction becomes a possibility. For every URL seen, Lytics fetches the web page at that URL and analyzes the content, the metadata, and even the images. The analysis boils the web page down to a set of topics. Where manual topic tagging may result in four or five topics for an article or product, Lytics topic extraction often results in 10 or more.

A list of articles and their corresponding topics

The Significance of Topic Extraction

Having a set of topics in addition to the volume of content for each topic greatly increases the potential for personalized content-marketing.

In the absence of topics, the primitives for content-marketing become URLs and keywords. These are both flawed in different ways.

Marketing using URLs alone means working directly with the large corpus of content. At best, when content is well organized, partial URLs can be used to represent higher-level abstract ideas. Unfortunately, the reality is that content is a web, and careful URL architecture isn't enough to manage it all.

Marketing using keywords is an improvement over URLs but misses the meaningful higher-level connections that humans think in.

Topics are the answer to the content-marketing primitive, especially when bolstered by a networked taxonomy.

See how topics are useful in audiences as well as content recommendation.

How Topics are Different than Keywords

Lytics makes sure to distinguish between topics (what Lytics uses) and keywords (what platforms like Google AdWords use). Here is a passage of text to use as an example:

The Seahawks blew a chance to make Super Bowl history with another improbable comeback because of an inexplicable decision to pass instead of handing the ball to Marshawn Lynch.

In this passage, the keywords have been marked in bold. They are extracted verbatim from the text. This is a tunnel-vision approach and easily gamed by clever copywriters.

Topic extraction, however, identifies inferences that keywords miss. In this same passage, topic extraction would pick up on topics that weren't in the text, such as:

  1. The NFL
  2. American Football
  3. Sports
  4. Sports Organizations

Since topics are extracted using a more sophisticated Natural Language Processing approach, they are effective content-marketing primitives.

Topic Extraction Logistics

When Lytics fetches and analyzes new content, it does so with a bot, creatively called lyticsbot. When lyticsbot scrapes your content, you can identify it by two HTTP headers that are present on every request:

  • User-Agent: lyticsbot
  • Lytics-Id: <YOUR_ACCOUNT_ID>

These headers allow you to identify requests that Lytics makes to scrape your content and enhance your topic graph. If your content is behind a paywall or other authentication, you can use these headers to permit lyticsbot access to your content. If, for some reason, these means of authentication don't work for your content, we also support basic authentication, or you can send custom HTTP headers. Your Lytics account representative can help you with either of these options. See below for more information on crawling.
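As a rough illustration of the header check described above, here is a minimal sketch of a server-side helper. The function name `is_lyticsbot_request` and the plain-dictionary representation of headers are assumptions for this example, not part of any Lytics library. Request headers can be set by any client, so treat a match as identification rather than strong authentication.

```python
from typing import Mapping


def is_lyticsbot_request(headers: Mapping[str, str], account_id: str) -> bool:
    """Return True if the headers carry both lyticsbot identifiers.

    `headers` is a hypothetical mapping of request header names to values;
    adapt it to whatever your web framework provides.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    lytics_id = normalized.get("lytics-id", "")
    # Assumption: the User-Agent value contains the token "lyticsbot".
    return "lyticsbot" in user_agent.lower() and lytics_id == account_id
```

A paywall middleware could call this helper with your own account ID and let matching requests through to the full article.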

Providing Custom Topics

Lytics will automatically extract topics from the main content at a URL, but sometimes you also want to track domain-specific topics. In this case, Lytics supports a special meta tag for annotating custom topics.

Provide a comma-separated list of topics in a lytics:topics meta element in your HTML source.

Here is an example from a Lytics blog post:

<html>
  <head>
    <title>Omeda and Lytics Team Up To Offer All-In-One Audience Engagement Platform</title>
    <!-- ... -->
    <meta name="lytics:topics" content="Customer Data Platform, Lytics News"/>
    <!-- ... -->
  </head>
</html>
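To check that a page exposes the topics you expect before lyticsbot visits it, you could parse the meta element yourself. The sketch below uses only the Python standard library. The names `TopicMetaParser` and `extract_custom_topics` are made up for this example, and the way it splits and trims the comma-separated list is an assumption about how the values are read, not a description of Lytics internals.

```python
from html.parser import HTMLParser


class TopicMetaParser(HTMLParser):
    """Collect topics from <meta name="lytics:topics" content="..."> elements."""

    def __init__(self):
        super().__init__()
        self.topics = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") != "lytics:topics":
            return
        content = attributes.get("content") or ""
        # Assumption: topics are separated by commas and surrounding
        # whitespace is not significant.
        self.topics.extend(t.strip() for t in content.split(",") if t.strip())


def extract_custom_topics(html: str) -> list:
    """Return the custom topics declared in an HTML document."""
    parser = TopicMetaParser()
    parser.feed(html)
    return parser.topics
```

Run against the blog post markup above, this returns the two topics "Customer Data Platform" and "Lytics News".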

Additionally, your Lytics account can be configured to also scrape other meta tags to feed into your topic graph by setting the account's content_customprops setting to the names of the meta tags you'd also like to include.

For example, if you wanted your Lytics topic graph to include topics from your article:tag meta tags, you could update your account settings with the following API request.

curl -XPUT "https://api.lytics.io/api/account/$ACCOUNTID" \
   -H 'Content-type: application/json' \
   -H "Authorization: $LIOKEY" \
   -d '{
    "settings" : {
        "content_customprops": ["article:tag"]
    }
   }'

Now, after adding article:tag to the setting, any values from article:tag meta tags will also appear in the topic graph, which means they'll be eligible for content affinities, targeting, and personalization, and will inform content recommendations.

Note: Lytics will track these custom topics in addition to the automatically extracted topics. There is no need to specify generic topics.
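If you prefer to script the settings update rather than use curl, the same request can be built with the Python standard library. This is a minimal sketch that mirrors the curl example; the helper name `build_customprops_request` is hypothetical, and the URL, headers, and body are copied from the request shown above rather than from any further API reference.

```python
import json
import urllib.request


def build_customprops_request(account_id, api_key, meta_names):
    """Build (but do not send) the PUT request that sets content_customprops."""
    body = json.dumps(
        {"settings": {"content_customprops": list(meta_names)}}
    ).encode("utf-8")
    return urllib.request.Request(
        f"https://api.lytics.io/api/account/{account_id}",
        data=body,
        headers={
            "Content-type": "application/json",
            "Authorization": api_key,
        },
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building and sending are kept separate here so the request can be inspected first.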

Crawling

For some websites it is desirable to allow lyticsbot to crawl everything as fast as possible. However, some web administrators would like more flexibility and control over how fast the bot crawls and where it attempts to pull content from. The bot follows the directives in the robots.txt file located at the root of the website, for instance https://www.lytics.com/robots.txt. Below you can see three common robots.txt configurations. Note that you must specify the lyticsbot user agent explicitly; a wildcard will not work in this case.

The first example disallows lyticsbot from attempting to crawl any links that reside in the /admin directory.

User-agent: lyticsbot
Disallow: /admin

The second uses the Crawl-delay directive to set the amount of time, in seconds, between crawl attempts.

User-agent: lyticsbot
Crawl-delay: 10

Finally, the two settings can be combined if needed. A delay of 10 seconds, as used in these examples, effectively limits the bot to 8,640 pages a day (86,400 seconds in a day divided by 10).

User-agent: lyticsbot
Disallow: /admin
Disallow: /private
Crawl-delay: 10
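Before deploying a robots.txt file, it can help to confirm how a standards-following parser reads it. The sketch below feeds the combined configuration above to Python's `urllib.robotparser` and derives the daily page budget from the crawl delay. The `www.example.com` URLs are placeholders, and Python's parser is only a stand-in: it is not lyticsbot's own implementation and, unlike lyticsbot, it does honor wildcard user agents.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: lyticsbot
Disallow: /admin
Disallow: /private
Crawl-delay: 10
"""

SECONDS_PER_DAY = 24 * 60 * 60

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs: one outside and one inside the disallowed directories.
blog_allowed = parser.can_fetch("lyticsbot", "https://www.example.com/blog/post")
admin_allowed = parser.can_fetch("lyticsbot", "https://www.example.com/admin/users")

# Crawl-delay is reported in seconds for the matching user agent.
delay = parser.crawl_delay("lyticsbot")
pages_per_day = SECONDS_PER_DAY // delay

print(blog_allowed, admin_allowed, delay, pages_per_day)
```

With this configuration the blog URL is allowed, the admin URL is not, and the 10 second delay works out to 8,640 pages a day.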