here for a related post in Japanese.
In Stockmark, we collect tens of thousands of news articles from thousands of different media sources every day. Articles with similar content may be crawled from different news media. It is vital to detect and group similar articles in order to provide better user experience by improving our news analytics services, such as Anews recommendations, Astrategy business activity comparisons, etc.
It is relatively easier to collect articles with the exact same content, but we also need to consider articles telling the same story with slightly different content. Vast majority of the similar articles are in the form of the same news stories written by different authors, rather than duplicating the same text.
We utilize search engine to find text similarities and graph algorithms for grouping similar articles. Let us move on to the details.
We first build an article similarity graph before grouping articles, where each node represents an article and there is an edge between two nodes if they are decided to be similar. Depending on the metric used to define similarity, the graph can be directed or undirected. In our case, it will be a directed graph.
Article similarities can be retrieved using vector distance calculations if articles are embedded to fixed dimensional vectors, using approaches similar to doc2vec or S-BERT. In this project, we take a different approach and utilize the classical vector-space model through text-search engine implementation for similar document retrieval. We find this approach is more accurate and flexible than the document embedding approach.
In Stockmark, news articles are indexed in Elasticsearch. Elasticsearch is a full-text search engine built over open-source Apache Lucene. You can see details of Lucene’s similarity scoring here.
For each indexed article, More Like This (MLT) query is executed to retrieve their similar articles. A threshold is set for similarity decisions. Lucene’s scores vary depending on the length of the articles, so instead of setting the threshold over the absolute scores, we scale scores between 0 and 1. The first match of the MLT query is the same article or an article with exact same content, so its score is set to 1 and the rest is scaled accordingly.
We use title and body fields of the article documents for similarity computations. We do some filtering in order to decrease the amount of the noise, such as ignoring news from the same media and ignoring news released more than 5 days apart - it is rare that an article of some news is reproduced after 5 days.
Once similar articles are extracted, we can build the similarity graph. If article B is in the similarity list of article A, then we draw an edge from A to B. Note that MLT scoring is not symmetric; so B being similar to A does not mean A is similar to B. In the end, the resulting graph will be directed.
Next, we need to find components on the similarity graph to retrieve similar articles. We utilize two different algorithms in this step. First, we find strongly connected components, then apply a community detection algorithm and take their intersecting components as final components.
A directed graph is called strongly connected if there is at least one path between each pair of vertices. Strongly connected components of a graph are maximal strongly connected subgraphs of that graph. For our sample dataset, you can see strongly connected components of the similarity graph colored in the Figure 1 below. There are a few very large components and many small components. The largest component accounts for more than 4% of the articles, mainly because it contains articles for pre-formatted market reports.
In Figure 2, we have a strongly connected component and 3 communities intersecting with that component. Graph community detection algorithms group densely connected vertices in a graph, which for our case help us remove noisy edges that make strongly connected components grow larger than they need to be. We used the SCD algorithm for community detection. SCD is a greedy algorithm where it clusters highly connected nodes first, then makes refinements on initial clusters. For details, you can refer to the paper and an implementation.
It is theoretically possible to ignore strongly connected component detection and apply only community detection. Although community detection is very powerful, we observed that using only the community detection algorithm introduces false matching of articles, mainly because the SCD algorithm treats the directed graph as undirected. Using both strongly connected component and community detection algorithms yields better results than using only one of them.
It is computationally infeasible to rebuild the similarity graph for all articles after each crawling batch. Instead, we employ an incremental approach. We only update article nodes that are reachable from the newly added articles. We ignore articles added more than 5 days ago during edge updates.
In this post, we explained how to utilize full-text search engine to detect near-duplicate news articles. In Figure 3, You can see a sample search screen from our tool Astrategy, which lists a news article and other articles that are detected as similar.
In our database, we detected similar articles for 40% of the articles and the average size of similar article groups is 3.
This was collaborative work between our R&D and engineering teams. As a member of R&D team, I mainly explained the algorithm part, however there was also work to set up and scale Elasticsearch servers and queries.