Recommendations by Concise User Profiles from Review Text: Experimental Results

:::info
This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Ghazaleh H. Torbati, Max Planck Institute for Informatics, Saarbrücken, Germany & ghazaleh@mpi-inf.mpg.de;

(2) Andrew Yates, University of Amsterdam, Amsterdam, Netherlands & a.c.yates@uva.nl;

(3) Anna Tigunova, Max Planck Institute for Informatics, Saarbrücken, Germany & tigunova@mpi-inf.mpg.de;

(4) Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany & weikum@mpi-inf.mpg.de.

:::

Table of Links

Abstract and Introduction
Related Work
Methodology
Experimental Design
Experimental Results
Conclusion
Ethics Statement and References

IV. EXPERIMENTAL DESIGN

A. Rationale

As a difficult and less explored application area for recommender systems, we investigate the case of book recommendations in online communities. These come with a long-tailed distribution of user activities, highly diverse user interests, and demanding textual cues from user reviews and book descriptions.


Unlike in many prior works’ experiments, often on movies, restaurants or mainstream products, the data in our experiments is much sparser regarding user-item interactions. We design the evaluation as a stress-test experiment, with focus on text-rich but otherwise data-poor users: users who liked relatively few items but wrote informative, yet noisy reviews.


We further increase the difficulty of predictions for items that are highly related within a group by enforcing that the authors appearing in a user's training set and evaluation set are disjoint. Thus, we rule out the near-trivial case of predicting that a user likes a certain book because another book by the same author was used for training.
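As an illustration, such an author-disjoint split of a user's positively rated books could be enforced roughly as follows. This is a minimal sketch with a hypothetical per-book `author` field, not the exact procedure used in the paper:

```python
import random
from collections import defaultdict

def author_disjoint_split(user_books, test_ratio=0.2, seed=0):
    """Split one user's positively rated books so that no author
    appears in both the training and the evaluation portion."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for book in user_books:               # each book is a dict with an 'author' field (assumed)
        by_author[book["author"]].append(book)

    authors = list(by_author)
    rng.shuffle(authors)

    target_test = int(round(test_ratio * len(user_books)))
    test, train = [], []
    for author in authors:
        # assign whole author groups, so train/test author sets stay disjoint
        if len(test) < target_test:
            test.extend(by_author[author])
        else:
            train.extend(by_author[author])
    return train, test
```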

B. Datasets

We use two book datasets, both from the UCSD recommender systems repository [40]:


• GR [41]: a Goodreads sample with item-user interactions for 1.5M books and 280K users, incl. titles, genre tags, item descriptions, ratings and textual reviews.


• AM [42]: an Amazon crawl, filtered for the books domain with 2.3M books and 3.1M users, incl. category tags, ratings and reviews.


While the GR data hardly appears in the literature, the AM books data has been used for experiments in prior works (e.g., [43]). However, the data has been pre-processed in numerous variants, with differences in data size, definitions of positive and negative points, inclusion of features (mostly disregarding reviews), and elimination of long-tail users and items. The most widely used derivative is the 10-core variant, where all users with fewer than 10 items and all items with fewer than 10 users are eliminated. This pre-processing clearly focuses on interaction-based predictions, whereas our intention is to study the underexplored case of sparse interactions with informative user reviews.


We view all book-user interactions with a rating of 4 or higher as positive, and disregard the lower ratings as they are rare anyway. We further pre-process the dataset by removing all users with fewer than 3 books, as we cannot split their interactions into the three sets (train, validation, test).
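For illustration, this filtering step could look roughly as follows with pandas. The sketch assumes an interaction table with hypothetical `user_id`, `item_id`, and `rating` columns; it is not the authors' exact pipeline:

```python
import pandas as pd

def preprocess(interactions: pd.DataFrame) -> pd.DataFrame:
    """Keep only positive interactions (rating >= 4) and users with at
    least 3 positively rated books, so train/validation/test splits exist."""
    positives = interactions[interactions["rating"] >= 4]
    books_per_user = positives.groupby("user_id")["item_id"].nunique()
    eligible_users = books_per_user[books_per_user >= 3].index
    return positives[positives["user_id"].isin(eligible_users)]
```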


Table I gives statistics for the pre-processed GR and AM data. Our data pre-processing is designed to evaluate text-based recommender performance with low interaction density and text-rich users. We select 1K users from each of the two datasets, in descending order of average review length per book, such that each user has reviewed at least 3 books. In Table I, rows GR-1K-rich and AM-1K-rich show the characteristics of these data slices. Both GR-1K-rich and AM-1K-rich are extremely sparse in terms of users that share the same items; so the emphasis is on leveraging text.
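A sketch of how such a text-rich user slice could be selected, assuming a hypothetical `review_text` column; the exact length measure used in the paper may differ:

```python
import pandas as pd

def select_text_rich_users(positives: pd.DataFrame, k=1000) -> list:
    """Rank users by average review length per book (character length as a
    simple proxy) and keep the top k, requiring at least 3 reviewed books."""
    stats = (positives.assign(review_len=positives["review_text"].str.len())
                      .groupby("user_id")
                      .agg(avg_len=("review_len", "mean"),
                           n_books=("item_id", "nunique")))
    eligible = stats[stats["n_books"] >= 3]
    return eligible.sort_values("avg_len", ascending=False).head(k).index.tolist()
```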


To investigate the influence of interaction density, we construct two further slices from AM and GR, both covering 10K users in two variants: sparse (but not as sparse as the 1K-rich slices) and dense. These are used only in Section V-C for sensitivity studies, and thus explained there.

C. Baselines

We compare our approach to several state-of-the-art baselines, which cover different recommendation paradigms, ranging from traditional collaborative filtering to text-centric neural models:


• CF: collaborative filtering operating on the user-item interaction matrix, pre-computing per-user and per-item vectors via matrix factorization [36] (with 200 latent dimensions).


• RecWalk [15] makes predictions by random walks over the interaction graph, incorporating item-item similarities to encourage the exploration of long-tail items. We pre-compute the similarities by factorizing the full-data matrix (i.e., over ca. 10M and 25M interactions in GR and AM, resp.).


• DeepCoNN [18] is a salient representative of using convolutional neural networks (CNN) over text inputs.


• LLM-Rec: following [31], we use ChatGPT to rank the test items, given the user’s reading history. The history is given as the sequence of titles of the user’s 50 most recent books, prefixed by the prompt “I’ve read the following books in the past in order:”. This prompt is completed with a list of titles of the test-time candidate items, asking the LLM to rank them.


• P5-text [29]: prompting the T5 language model [44] to provide a recommended item for a user, given their ids. Following [29], we train P5 using the prompts for direct recommendation (e.g., “Shall we recommend item item_id to user_id”) to generate a “yes” or “no” answer. Pilot experiments showed that the original method does not work well on sparse data. Therefore, we extend P5 to leverage review texts and item descriptions. Instead of ids, the prompts include item descriptions and selected sentences from reviews with the highest idf scores (i.e., one of our own techniques; see the sketch after this list).


• BENEFICT [34] uses BERT to create representations for each user review, which are averaged and concatenated to the item vectors. Predictions are made by a feed-forward network on top. Following the original paper, each review is truncated to its first 256 tokens.


• BENEFICT-text: our own variant of BENEFICT where the averaging over all reviews of a user is replaced by our idf-based selection of most informative sentences, with the total length limited to 128 tokens (for comparability to the CUP methods).


• BERT5-text: using vanilla BERT (out of the box, without any fine-tuning) for encoding user text and item descriptions, followed by a feed-forward network. The text selection comprises 5 chunks of 128 tokens of the highest-idf sentences, with max-pooling for aggregation.
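Several of the text-centric variants above (P5-text, BENEFICT-text, BERT5-text), as well as the CUPidf configuration in Subsection IV-E, rely on selecting review sentences with high idf scores under a token budget. A minimal sketch of one plausible way to implement this, with idf computed over the corpus of all reviews and word counts as a rough proxy for the token budget; the authors' exact scoring and tokenization may differ:

```python
import math
import re
from collections import Counter

def build_idf(all_reviews):
    """Compute idf over a corpus of review texts (each review = one document)."""
    df = Counter()
    for text in all_reviews:
        df.update(set(re.findall(r"\w+", text.lower())))
    n_docs = len(all_reviews)
    return {w: math.log(n_docs / (1 + c)) for w, c in df.items()}

def select_sentences(user_reviews, idf, token_budget=128):
    """Score each sentence by the average idf of its words and greedily
    keep the highest-scoring ones until the budget is exhausted."""
    sentences = []
    for review in user_reviews:
        for sent in re.split(r"(?<=[.!?])\s+", review):
            words = re.findall(r"\w+", sent.lower())
            if words:
                score = sum(idf.get(w, 0.0) for w in words) / len(words)
                sentences.append((score, sent, len(words)))
    sentences.sort(reverse=True)

    selected, used = [], 0
    for score, sent, length in sentences:
        if used + length > token_budget:
            continue
        selected.append(sent)
        used += length
    return " ".join(selected)
```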


All methods were run on an NVIDIA Quadro RTX 8000 GPU with 48 GB memory, and we implemented the models with PyTorch.

D. Performance Metrics

At test time, we present the trained system with each user’s withheld positive items (20% of the user’s books, with authors disjoint from those of the user’s training items), along with negative items, sampled from all non-positive items, such that the ratio of positive to negative test points is 1:100. The learned system scores and ranks these data points. We evaluate all methods in two different modes of operation:


• Standard prediction: sampling the 100 negative test points uniformly at random from all unlabeled data, and ranking the 100+1 test instances by the methods under test.


• Search-based: given the positive test item, searching for the top-100 approximate matches to the item’s description, using the BM25 scoring model; then ranking the 100 + 1 candidates by our methods.
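For illustration, the two candidate-generation modes could be implemented roughly as follows. The sketch uses the rank_bm25 package for the search-based mode, which is an assumption; the paper does not specify a BM25 implementation:

```python
import random
from rank_bm25 import BM25Okapi

def standard_negatives(all_items, user_positives, n=100, seed=0):
    """Sample n negative candidates uniformly from items the user did not rate positively."""
    rng = random.Random(seed)
    pool = [i for i in all_items if i not in user_positives]
    return rng.sample(pool, n)

def search_based_negatives(positive_item, item_descriptions, n=100):
    """Retrieve the n items whose descriptions best match the positive item's
    description under BM25 (item_descriptions: id -> description text)."""
    item_ids = list(item_descriptions)
    corpus = [item_descriptions[i].lower().split() for i in item_ids]
    bm25 = BM25Okapi(corpus)
    query = item_descriptions[positive_item].lower().split()
    scores = bm25.get_scores(query)
    ranked = sorted(zip(scores, item_ids), reverse=True)
    # skip the positive item itself, keep the top-n approximate matches
    return [item for _, item in ranked if item != positive_item][:n]
```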


Following the literature, our evaluation metrics are NDCG@5 (Normalized Discounted Cumulative Gain) with binary 0-or-1 gain and P@1 (precision at rank 1). We compute these by micro-averaging over all test items of all users. We also experimented with macro-averaging over users; as the results were not significantly different, we report only micro-average numbers in the paper.


NDCG@5 reflects the observation that users care only about a short list of top-N recommendations; P@1 is suitable for recommendations on mobile devices (with limited UI). We also measured other metrics, such as NDCG@k for higher k, MRR, and AUC. None of these provides any additional insight, so they are not reported here.
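For reference, a minimal sketch of NDCG@5 with binary gains and P@1 on a single ranked candidate list, with micro-averaging as the mean over all test cases:

```python
import math

def ndcg_at_k(ranked_labels, k=5):
    """NDCG@k with binary 0/1 gains; ranked_labels is the relevance list
    in ranking order (here exactly one label is 1 per test case)."""
    dcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked_labels[:k]))
    ideal = sorted(ranked_labels, reverse=True)
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def p_at_1(ranked_labels):
    """Precision at rank 1: is the top-ranked candidate the positive item?"""
    return float(ranked_labels[0])

def micro_average(all_ranked_labels, metric):
    """Micro-averaging: mean of the metric over all (user, positive-item) test cases."""
    return sum(metric(labels) for labels in all_ranked_labels) / len(all_ranked_labels)
```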

E. Configurations

The CUP framework supports a variety of methods via specific configurations. All variants use an input budget of 128 tokens, both to construct a stress test and to emphasize that the computational and environmental footprint is a major concern as well. In the experiments, we focus on the following text-centric options (see Subsection III-D):


• CUPidf: sentences from reviews selected by idf scores.


• CUPsbert: sentences from reviews using Sentence-BERT similarity against an item description.


• CUP1gram: single words selected by tf-idf scores.


• CUP3gram: word-level 3-grams selected by tf-idf scores (see the sketch after this list).


• CUPkeywords: a user profile consisting of keywords generated by a fine-tuned T5 model [3].


• CUPGPT: a concise set of keyphrases generated by ChatGPT from all reviews of a user.
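A sketch of how the n-gram-based profiles (CUP1gram, CUP3gram) could be built with scikit-learn's TfidfVectorizer under the 128-token budget; the authors' exact tf-idf computation and phrase selection may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_ngrams_profile(user_text, corpus, n=3, token_budget=128):
    """Fit tf-idf over the review corpus and keep the user's highest-scoring
    word n-grams until the token budget (n tokens per phrase) is reached."""
    vectorizer = TfidfVectorizer(ngram_range=(n, n), lowercase=True)
    vectorizer.fit(corpus)                      # corpus: all users' concatenated reviews
    scores = vectorizer.transform([user_text]).toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(scores, vocab), reverse=True)

    phrases, used = [], 0
    for score, phrase in ranked:
        if score <= 0 or used + n > token_budget:
            break
        phrases.append(phrase)
        used += n
    return ", ".join(phrases)
```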


When comparing against prior baselines, we employ CUPidf as the default configuration. For comparison, we also configure more restricted variants that use only category/genre tags for the user text, and title, category/genre, and optionally the description for the item text. These are denoted as CUPbasic and CUPexpanded, respectively.


To obtain refined insights into the performance for specific kinds of users and items, we split the 1000 users and their items into the following groups, reporting NDCG@5 for each group separately. Note that this refinement drills down on the test outputs; the training is completely unaffected.


• Items are split into unseen (u) and seen (s) items. The former consists of all test-time items that were not seen at training time. The latter are those items that appear as positive samples at test time and are also among the positive training items (for a different user).


• Users are split into three groups based on the #books-per-user distribution (over train/dev/test points):


• Sporadic (s) users are the lowest 50%, with the fewest books. For GR-1K-rich, this threshold is 13 books per user; for AM-1K-rich it is 5 (with means 6 and 3, resp.).


• Regular (r) users are those between the 50th and 90th percentiles, which corresponds to between 13 and 71 books per user for GR-1K-rich, and between 5 and 20 for AM-1K-rich (with means 31 and 9, resp.).


• Bibliophilic (b) users are the highest 10%: above 75 books per user for GR-1K-rich and above 20 for AM-1K-rich (with means 156 and 43, resp.).


We abbreviate these six groups as u-s, u-r, u-b, s-s, s-r and s-b.
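A sketch of the grouping logic, using percentiles of the #books-per-user distribution and a set of positive training items (hypothetical inputs; the resulting thresholds correspond to the values reported above):

```python
import numpy as np

def user_groups(books_per_user: dict) -> dict:
    """Assign each user to sporadic (s), regular (r), or bibliophilic (b)
    based on the 50th and 90th percentiles of their book counts."""
    counts = np.array(list(books_per_user.values()))
    p50, p90 = np.percentile(counts, [50, 90])
    return {u: ("s" if c <= p50 else "r" if c <= p90 else "b")
            for u, c in books_per_user.items()}

def item_group(item_id, positive_training_items) -> str:
    """'s' if the test item also occurs as a positive training item
    (for some user), otherwise 'u' for unseen."""
    return "s" if item_id in positive_training_items else "u"
```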


[3] https://huggingface.co/ml6team/keyphrase-generation-t5-small-inspec
