Project Inspiration: Showing the Levels of Consensus in Text

How context-sensitive embeddings can improve our information diet

Timo Kats
4 min read · Aug 21, 2023

In our current information age, it’s common to see multiple versions of the same story. This can manifest itself in different ways. For example, news sources can report on the same topic and draw different conclusions, Wikipedia articles can constantly get changed, and large sets of comments can get too overwhelming to read.

As an information consumer, this can be difficult to navigate. As a result, different solutions aimed at addressing this problem have been developed over the years. For example, in news/media there are aggregator sites that try to display political bias (like Ground News). On websites like Wikipedia, there are "administrators" that clean up content. And on Reddit, there are "moderators" that filter comments.

These methods, although they have varying degrees of success, share an important trait: they often rely on manual work. As a result, the possibilities are somewhat limited. And human moderators in particular are often seen as potentially biased.

In effect, these methods predominantly "manage" large sets of information instead of "leveraging" them. As a result, an important property, especially for information consumers, remains unused: the fact that different news sources, comment sections, and so on often contain consensuses and disagreements. This is an important omission, because the level of consensus is an integral part of the information that people consume. In fact, it's a very democratic form of validating online information.

Innovation and potential

Given these limitations, there's a real use case for automation. Such an approach has two potential advantages. First, it scales better (i.e. it can process more information). Second, it can do more than simply delete content. In fact, given the recent innovations in "context-sensitive embeddings", we can actually find the consensuses/disagreements and display them to the user.

Obviously, automation can also be biased, so the transparency of any method developed for this purpose is paramount. Therefore, this article aims to explain one such method and share its source code.

Finding meaningful consensus

A hurdle often encountered in natural language processing (NLP) use cases is the set of contextual language properties that computers typically don't recognize. For example, the usage of homonyms (the same word having different meanings in different contexts) and synonyms (different words having the same meaning) can make it hard to find out whether people are actually agreeing with each other or not.

Thankfully, recent innovations in NLP have made it possible to capture a form of semantic meaning behind text. For instance, context-sensitive embeddings (typically based on transformers/neural networks) have been found to perform well at semantic-similarity tasks, even with these context-related hurdles.

An example of such a method is S-BERT, which is what this approach uses to create the embeddings. Next, these (context-sensitive) embeddings are compared using cosine similarity to find the level of consensus. Cosine similarity is a metric based on the angle between two vectors (here, the S-BERT embeddings), so differences in text length are largely nullified when computing the level of similarity. Hence it typically outperforms other vector-similarity methods (like Jaccard similarity or Euclidean distance) for this task.

Image from LearnDataSci
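To make this concrete, here is a minimal sketch of the embedding and similarity step. It assumes the sentence-transformers library, and the model name below is an illustrative choice on my part, not necessarily the one the actual demo uses.

```python
# Minimal sketch of the embedding + cosine-similarity step, assuming
# the sentence-transformers library. The model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The new policy will significantly reduce emissions.",
    "This policy is a big step towards lower emissions.",
    "The policy will have no measurable effect on emissions.",
]

# Encode each text into a context-sensitive embedding.
embeddings = model.encode(texts, convert_to_tensor=True)

# Pairwise cosine similarity: values near 1 suggest the texts say the
# same thing; lower values suggest disagreement or unrelated content.
similarity_matrix = util.cos_sim(embeddings, embeddings)
print(similarity_matrix)
```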

Together, these two techniques allow the method to capture the level of (dis)agreement between different texts. Moreover, due to the nature of context-sensitive embeddings, it captures this level of (dis)agreement based on semantics rather than surface wording.
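One simple way to turn the pairwise scores into a per-text level of consensus is to average each text's similarity to all the others. This aggregation is my own assumption for illustration; the actual project may aggregate the scores differently.

```python
import torch

def consensus_scores(similarity_matrix: torch.Tensor) -> torch.Tensor:
    """Average each text's similarity to every other text (self excluded)."""
    n = similarity_matrix.shape[0]
    off_diagonal_sum = similarity_matrix.sum(dim=1) - similarity_matrix.diagonal()
    return off_diagonal_sum / (n - 1)

# Using the similarity matrix from the previous snippet:
print(consensus_scores(similarity_matrix))
```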

Displaying consensus to the user

However, the goal of creating software should always be user value, and simply showing a log file with the cosine-similarity scores of thousands of news articles, comments, or posts is not very user friendly. Therefore, the next challenge is to present this method's output in a human-readable manner.

For this, our demo (see the final paragraph) uses a textual heatmap. Here, green text shows strong consensus, whereas red text shows strong disagreement. As a result, a user can see the level of consensus (which is a form of validating information) without any additional effort.

Example of showing different levels of (dis)agreement in text through a heatmap
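As a sketch of how such a heatmap could be rendered (the demo's actual implementation may well differ), one can interpolate each text's consensus score between red and green and wrap the text in a colored HTML span:

```python
def score_to_color(score: float) -> str:
    """Map a consensus score in [0, 1] to a hex color from red to green."""
    s = max(0.0, min(1.0, score))
    red = int(255 * (1 - s))
    green = int(255 * s)
    return f"#{red:02x}{green:02x}00"

def render_heatmap(texts: list[str], scores: list[float]) -> str:
    """Wrap each text in an HTML span whose background reflects its score."""
    spans = [
        f'<span style="background-color: {score_to_color(s)}">{t}</span>'
        for t, s in zip(texts, scores)
    ]
    return " ".join(spans)

# Example: high score renders green, low score renders red.
print(render_heatmap(["Strong consensus.", "Strong disagreement."], [0.9, 0.1]))
```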

In conclusion

There are many potential innovations and use cases for this method. In fact, this article is only meant to show a proof of concept of something that could be beneficial for this information age. Especially in an era where validating online information leans increasingly on "blue checkmarks" and identity-linked certificates (like C2PA), this method can offer an alternative that preserves the online equality and anonymity of users.

Live demo and source code

If you're interested in trying this new method, you can visit the proof of concept through this link. There, you can add basic articles and comment on them to see the text heatmap develop. The video embedded below gives a quick demo of this web app (in less than 4 minutes).

If you have any questions or comments, feel free to leave them here or on the video below. Also, if any developers are interested in picking up an AI-related side project, you're more than welcome to have a look at the source code on my GitHub. Thank you for reading this article.
