Communities Recommendations

Communities on 𝕏 were created to give people a dedicated place to connect, share, and get closer to the discussions they care about most.

How does 𝕏 find communities to follow 

Communities are started and managed by people on 𝕏 — admins and moderators who enforce Community rules and keep conversations informative, relevant, and fun. People who accept invites to join a Community become members. Posts in Communities can be seen by anyone on 𝕏, but only others within the Community itself can engage and participate in the discussion

How does 𝕏 decide which communities to recommend to you

How we recommend communities is an important part of our user’s experience on 𝕏, whether that is directly suggesting communities to users that may be interested in them, suggesting that members of a community consider adding other accounts to that community, or indirectly surfacing communities related to a particular Topic, List, or Account. You may see suggestions based on criterion such as:

  • We recommend communities based on the number of followers in the community;

  • We recommend communities based on the simcluster-similarity between communities and users. The simcluster-similarity is the cosine-similarity between the simcluster-embeddings of the user and the community.

How you can influence the accounts you see

𝕏 has also designed tools that help you control all content that you see across the platform and to protect you from content you consider harmful. (Learn more, here.) 

More information

For a more detailed view of how our Account Recommendation system works, please see:

  • An overview from our engineering team below;

  • Our Communities help center article, here

  • Our research paper on SimClusters, here; or

  • Our source code, which is available, here.

System Overview

SimClusters is a general-purpose representation layer based on overlapping communities into which users as well as heterogeneous content can be captured as sparse, interpretable vectors to support a multitude of recommendation tasks.

We build our user and post SimClusters embeddings based on the inferred communities, and the representations power our personalized post recommendation via our online serving service SimClusters ANN.

For more details, please read our paper that was published in KDD'2020 Applied Data Science Track, here

Follow relationships as a bipartite graph

Follow relationships on 𝕏 are perhaps most naturally thought of as directed graph, where each node is a user and each edge represents a Follow. Edges are directed in that User 1 can follow User 2, User 2 can follow User 1 or both User 1 and User 2 can follow each other.

This directed graph can be also viewed as a bipartite graph, where nodes are grouped into two sets, Producers and Consumers. In this bipartite graph, Producers are the users who are Followed and Consumers are the Followees. Below is a toy example of a follow graph for four users:

 

Figure 1 - Left panel: A directed follow graph; Right panel: A bipartite graph representation of the directed graph

Community Detection - Known For

The bipartite follow graph can be used to identify groups of Producers who have similar followers, or who are "Known For" a topic. Specifically, the bipartite follow graph can also be represented as an m x n matrix (A), where consumers are presented as u and producers are represented as v.

Producer-producer similarity is computed as the cosine similarity between users who follow each producer. The resulting cosine similarity values can be used to construct a producer-producer similarity graph, where the nodes are producers and edges are weighted by the corresponding cosine similarity value. Noise removal is performed, such that edges with weights below a specified threshold are deleted from the graph.

After noise removal has been completed, Metropolis-Hastings sampling-based community detection is then run on the Producer-Producer similarity graph to identify a community affiliation for each producer. This algorithm takes in a parameter k for the number of communities to be detected.

Figure 2 - Left panel: Matrix representation of the follow graph depicted in Figure 1; Middle panel: Producer-Producer similarity is estimated by calculating the cosine similarity between the users who follow each producer; Right panel: Cosine similarity scores are used to create the Producer-Producer similarity graph. A clustering algorithm is run on the graph to identify groups of Producers with similar followers.

Community affiliation scores are then used to construct an n x k "Known For" matrix (V). This matrix is maximally sparse, and each Producer is affiliated with at most one community. In production, the Known For dataset covers the top 20M producers and k ~= 145000. In other words, we discover around 145k communities based on 𝕏’s user follow graph.

Figure 3 - The clustering algorithm returns community affiliation scores for each producer. These scores are represented in matrix V.

In the example above, Producer 1 is "Known For" community 2, Producer 2 is "Known For" community 1, and so forth.

Consumer Embeddings - User InterestedIn

An Interested In matrix (U) can be computed by multiplying the matrix representation of the follow graph (A) by the Known For matrix (V):

In this toy example, consumer 1 is interested in community 1 only, whereas consumer 3 is interested in all three communities. There is also a noise removal step applied to the Interested In matrix.

We use the InterestedIn embeddings to capture consumer's long-term interest. The InterestedIn embeddings is one of our major source for consumer-based post recommendations.

Producer Embeddings

When computing the Known For matrix, each producer can only be Known For a single community. Although this maximally sparse matrix is useful from a computational perspective, we know that our users post about many different topics and may be "Known" in many different communities. Producer embeddings ( Ṽ ) are used to capture this richer structure of the graph.

To calculate producer embeddings, the cosine similarity is calculated between each Producer’s follow graph and the Interested In vector for each community.

Producer embeddings are used for producer-based posts recommendations. For example, we can recommend similar posts based on an account you just followed.

Entity Embeddings

SimClusters can also be used to generate embeddings for different kind of contents, such as

  • Posts (used for post recommendations)

  • Topics (used for TopicFollow)

Post embeddings

When a post is created, its post embedding is initialized as an empty vector. Post embeddings are updated each time the post is favorited. Specifically, the InterestedIn vector of each user who Fav-ed the post is added to the post vector. Since post embeddings are updated each time a post is favorited, they change over time.

Post embeddings are critical for our post recommendation tasks. We can calculate post similarity and recommend similar posts to users based on their post engagement history.

We have an online Heron job that updates the post embeddings in realtime, check out here for more.

Topic embeddings

Topic embeddings (R) are determined by taking the cosine similarity between consumers who are interested in a community and the number of aggregated favorites each consumer has taken on a post that has a topic annotation (with some time decay).

Project Directory Overview

The whole SimClusters project can be understood as 2 main components

  • SimClusters Offline Jobs (Scalding / GCP)

  • SimClusters Real-time Streaming Jobs

SimClusters Offline Jobs

Jobs

Code

Description

KnownFor

simclusters_v2/scalding/update_known_for/UpdateKnownFor20M145K2020.scala

The job outputs the KnownFor dataset which stores the relationships between clusterId and producerUserId. KnownFor dataset covers the top 20M followed producers. We use this KnownFor dataset (or so-called clusters) to build all other entity embeddings.

InterestedIn Embeddings

simclusters_v2/scalding/InterestedInFromKnownFor.scala

This code implements the job for computing users' interestedIn embedding from the KnownFor dataset. We use this dataset for consumer-based post recommendations.

Producer Embeddings

simclusters_v2/scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala

The code implements the job for computer producer embeddings, which represents the content user produces. We use this dataset for producer-based post recommendations.

Semantic Core Entity Embeddings

simclusters_v2/scalding/embedding/EntityToSimClustersEmbeddingsJob.scala

The job computes the semantic core entity embeddings. It outputs datasets that stores the "SemanticCore entityId -> List(clusterId)" and "clusterId -> List(SemanticCore entityId))" relationships.

Topic Embeddings

simclusters_v2/scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala

Jobs to generate Fav-based Topic-Follow-Graph (TFG) topic embeddings A topic's fav-based TFG embedding is the sum of its followers' fav-based InterestedIn. We use this embedding for topic related recommendations.

 

SimClusters GCP Jobs

We have a GCP pipeline where we build our SimClusters ANN index via BigQuery. This allows us to do fast iterations and build new embeddings more efficiently compared to Scalding.

All SimClusters related GCP jobs are under src/scala/com/twitter/simclusters_v2/scio/bq_generation.

Jobs

Code

Description

PushOpenBased SimClusters ANN Index

EngagementEventBasedClusterToTweetIndexGenerationJob.scala

The job builds a clusterId -> TopTweet index based on user-open engagement history. This SANN source is used for candidate generation for Notifications.

VideoViewBased SimClusters Index

EngagementEventBasedClusterToTweetIndexGenerationJob.scala

The job builds a clusterId -> TopTweet index based on the user's video view history. This SANN source is used for video recommendation on Home.

 

SimClusters Real-Time Streaming Posts Jobs

Jobs

Code

Description

Post Embedding Job

simclusters_v2/summingbird/storm/TweetJob.scala

Generate the post embedding and index of posts for the SimClusters

Persistent Post Embedding Job

simclusters_v2/summingbird/storm/PersistentTweetJob.scala

Persistent the post embeddings from MemCache into Manhattan.