Multi-purpose Recommender Platform using Perceiver IO



Web services usually require many different types of recommender systems built on large amounts of user log and content data. Stockmark is no exception! For example, news article, category, and keyword recommenders are some of the recommendation services that run in our news distribution and analytics platforms, Anews and Astrategy. There is also demand for other kinds of recommendation tools, such as user recommendation.

On the other hand, it is challenging to design a separate model for each recommender task. In this blog post, we explain the architecture of an efficient recommender platform that serves various tasks. Different recommendation tasks may use the same historical user behavior data, such as clicked articles, for exactly the same target users but with different target item sets. In such cases, a unified architecture helps share or transfer models between recommenders.

Together with historical behavior data, using content data improves the quality of recommendations. Textual content data is particularly important for news recommender services, because the articles to be recommended are always new and therefore lack sufficient user interactions. Our architecture is based on Perceiver IO, so we can efficiently feed in different types of large content data to build highly accurate recommenders.

In this blog post, we focus on implementing a news article recommender with our platform. For more information and use cases, please refer to our paper, published in the IEEE ICDM 2022 Workshop proceedings. We have also open-sourced the platform code with sample implementations.

Perceiver IO

Our platform is based on Perceiver IO, so let us briefly explain the Perceiver IO model. Perceiver IO is a transformer-based model that maps an input feature matrix of arbitrary size to arbitrary outputs. Therefore, it can be trained for almost any type of machine learning task without task-specific engineering. It is reported to achieve strong results on tasks spanning natural language and visual understanding, as well as multi-task and multi-modal reasoning.

The transformer is a revolutionary architecture in AI, but it scales poorly as the number of input features grows, because self-attention is computed over all pairs of inputs. Perceiver IO reduces the cost of this expensive self-attention by applying it to lower-dimensional latent matrices instead of the high-dimensional inputs. As a result, the computational complexity of Perceiver IO is linear in the input and output size.
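To illustrate the idea, here is a minimal numpy sketch (not the actual Perceiver IO implementation; the query/key/value projections and multi-head structure are omitted) of a small latent array cross-attending to a large input array, so that the cost is linear in the number of input tokens:

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head attention: each query attends over all keys_values rows."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)           # (Q, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax per query
    return weights @ keys_values                            # (Q, d)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(10000, 64))    # large input array: N tokens
latents = rng.normal(size=(128, 64))     # small learned latent array

# Encode step: latents cross-attend to inputs, costing O(N * num_latents).
encoded = cross_attention(latents, inputs)
# Self-attention is then run only among the 128 latents, not the 10000 inputs.
print(encoded.shape)  # (128, 64)
```

Because the quadratic self-attention runs only on the fixed-size latent array, the overall cost no longer explodes with the input length.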

The figure below, taken from the Perceiver IO paper, shows its architecture.

Figure 1: Perceiver IO model architecture.

Our Approach

When building a recommender model with our architecture, which uses a central Perceiver IO model, there are two main components to design: the input features and an appropriate loss function. Figure 2 depicts how to design a recommender as a learning-to-rank problem using contrastive learning.

Figure 2: Learning-to-rank-based recommendation with Perceiver IO.

The goal of the article recommender is to recommend the most interesting news articles to users; more precisely, the articles most likely to be clicked by users. To calculate the relations between users and articles, we encode user and article features into the same vector space by feeding both to the same Perceiver IO model. We use a triplet loss to learn similarities and dissimilarities, selecting positive and negative samples from the articles impressed to users: a positive sample for a user is an article clicked by the user, and a negative sample is an article impressed to the user but not clicked. It is also possible to use other loss functions, such as binary classification over the dot product of user and item encodings, but we found the triplet loss to perform better in our experiments.
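The triplet setup can be sketched in a few lines of numpy. This is an illustrative stand-in, not the platform code: the `encode` function below is a single linear map playing the role of the shared Perceiver IO encoder, and the margin value is an arbitrary choice:

```python
import numpy as np

def encode(features, W):
    """Stand-in for the shared encoder: in the platform this is the
    central Perceiver IO model applied to both users and articles."""
    v = features @ W
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def triplet_loss(user, positive, negative, margin=0.2):
    """Pull the clicked article toward the user encoding and push the
    impressed-but-not-clicked article away, up to a margin."""
    d_pos = np.sum((user - positive) ** 2, axis=-1)
    d_neg = np.sum((user - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))
users = encode(rng.normal(size=(4, 32)), W)   # batch of 4 user encodings
pos = encode(rng.normal(size=(4, 32)), W)     # clicked articles (positives)
neg = encode(rng.normal(size=(4, 32)), W)     # impressed, not clicked (negatives)
loss = triplet_loss(users, pos, neg)
print(loss)
```

Minimizing this loss places users and their clicked articles close together in the shared vector space, which is exactly what ranking by similarity requires.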

It is also possible to model recommendation tasks in ways other than learning to rank. For instance, if the number of items to be recommended is small, the recommendation problem can be designed as a multi-class classification problem. See BERT4Rec for such an architecture. One may use our platform to implement this kind of recommender, e.g., a news category recommender where there is a small number of fixed categories.
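For the classification variant, the encoder output is simply followed by a softmax over the fixed classes. A minimal sketch, where the user encoding, the class weight matrix, and the category count are all illustrative placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def category_probs(user_encoding, class_weights):
    """Multi-class head: one logit per fixed category, then softmax."""
    return softmax(user_encoding @ class_weights.T)

rng = np.random.default_rng(0)
num_categories = 18                       # illustrative: a small, fixed set
user_enc = rng.normal(size=(2, 64))       # encoder output for 2 users
W_cls = rng.normal(size=(num_categories, 64))
probs = category_probs(user_enc, W_cls)
print(probs.shape)  # (2, 18)
```

Training then uses a standard cross-entropy loss instead of the triplet loss, with everything else in the platform unchanged.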

Designing Input Features

Figure 3: Sample input feature design for a) articles, b) users.

The most important part of designing a recommender with our model is designing the input features. Input features for both candidate articles and users need to be provided to the model.

Figure 3.a describes sample 2D input feature embeddings for an article Ai. Suppose each article has a title, body, category, and media name as its features. The title and body are arrays of words; therefore, they can be represented as sequences of word embeddings, which can optionally be initialized from pretrained word embedding models. In Figure 3.a, wti represents the ith title word, wbi represents the ith body word, c represents the category, and m represents the media name. The embeddings for the category and media name are appended to the sequence of title and body word embeddings; we call all of these token embeddings. Besides token embeddings, positional embeddings, such as word position and feature type, can be appended using Fourier feature encoding.

Figure 3.b describes user feature embeddings. A user might have clicked a list of articles (Aci), marked a list of articles (Ami), and might have user-specific features (ui), such as age, gender, or working industry. 2D feature matrices can be built for the article-type features, Aci and Ami, as explained above, optionally with user-specific positional embeddings appended.

The inputs become very high-dimensional, easily reaching matrices of tens of thousands of rows by hundreds of columns. Traditional transformer networks, such as BERT, cannot handle such large inputs, but Perceiver IO can.
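The article feature matrix can be sketched as follows. This is a simplified illustration with made-up example data: the embedding table, the plain integer position and feature-type channels (the platform uses Fourier feature encodings), and the feature-type ids are all assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100  # word embedding dimension (e.g. 100d vectors)
table = {}  # token -> embedding; in practice initialized from a pretrained model

def embed_tokens(tokens):
    """Look up each token's embedding, randomly initializing unseen tokens."""
    return np.stack([table.setdefault(t, rng.normal(size=dim)) for t in tokens])

# Hypothetical article features
title = ["stock", "market", "rallies"]
body = ["the", "market", "closed", "higher"]
category = ["finance"]
media = ["example-news"]

# Token embeddings: title words, body words, then category and media name
tokens = title + body + category + media
token_emb = embed_tokens(tokens)                          # (num_tokens, dim)

# Append word-position and feature-type channels (0=title, 1=body,
# 2=category, 3=media) as extra columns of the 2D feature matrix
position = np.arange(len(tokens))[:, None]
feature_type = np.array([0] * len(title) + [1] * len(body) + [2] + [3])[:, None]
article_matrix = np.concatenate([token_emb, position, feature_type], axis=1)
print(article_matrix.shape)  # (9, 102)
```

A user feature matrix is assembled the same way, by stacking the matrices of clicked and marked articles together with the user-specific feature embeddings.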


We use the large version of the MIND dataset in our experiments, which is a popular dataset for news recommendation. We use the title, body, category, and subcategory entities for training. We find that appending word position and feature type embeddings to the token embeddings improves accuracy. We take the first 30 words of each title and the first 128 words of each article body. Word embeddings are initialized with GloVe.
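The preprocessing amounts to truncation plus a GloVe lookup with a fallback for out-of-vocabulary words. A tiny sketch, with a one-entry stand-in for the loaded GloVe table:

```python
import numpy as np

rng = np.random.default_rng(0)
glove = {"market": rng.normal(size=100)}  # stand-in for the loaded GloVe vectors

def article_word_embeddings(title, body, max_title=30, max_body=128):
    """Keep the first 30 title words and 128 body words (as in the
    experiment setup), then look up GloVe vectors, randomly
    initializing embeddings for out-of-vocabulary words."""
    words = title.lower().split()[:max_title] + body.lower().split()[:max_body]
    return np.stack([glove.get(w, rng.normal(size=100)) for w in words])

emb = article_word_embeddings("Market rallies", "The market closed higher today")
print(emb.shape)  # (7, 100)
```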

In the table below, we compare metrics on the validation set against the NRMS algorithm. NRMS is not the latest study on news recommendation, but it is one of the most popular algorithms, and it is based on additive attention, which is similar to our approach. MRR and NDCG yield similar results, but Perceiver IO achieves an AUC score of 0.726 on the validation set.

Table 1: News recommender experimental result.


We introduced a general-purpose, scalable recommender framework based on Perceiver IO. To build a recommender, one only needs to prepare the input features and outputs.

One advantage of using a unified framework is that models can be reused or transferred between different tasks. For instance, in an internal evaluation not reported here, we observed that the model trained for article recommendation can be used for keyword recommendation as is, with some post-processing of the results.

With the learning-to-rank approach, users and candidates are encoded in the same vector space. The encodings learned by the recommender model can therefore be used for other tasks that require retrieval of similar items (articles, in the case of news recommendation) or users.
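Because everything lives in one vector space, retrieval reduces to a nearest-neighbor search over the stored encodings. A minimal cosine-similarity sketch with random placeholder encodings (a real deployment would use an approximate nearest-neighbor index for large candidate sets):

```python
import numpy as np

def top_k_similar(query, candidates, k=3):
    """Return the indices of the k candidates closest to the query
    by cosine similarity."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per candidate
    return np.argsort(-scores)[:k]     # highest-scoring first

rng = np.random.default_rng(0)
user_encoding = rng.normal(size=64)             # from the trained recommender
article_encodings = rng.normal(size=(1000, 64)) # precomputed candidate pool
print(top_k_similar(user_encoding, article_encodings))
```

The same routine works for user-to-user or article-to-article retrieval, since all encodings share the space.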