💻 Improving News Recommendations With Cloudflare Workers & Knowledge Graphs
Presented by: William Lyon
Originally aired on October 3, 2022 @ 1:00 PM - 1:30 PM EDT
Cloudflare's Full Stack Week Developer Speaker Series
Surfacing relevant content to users can be a particular challenge for news sites. In this talk we’ll explore how to build a location-aware news recommendation endpoint using Cloudflare Workers and the Neo4j graph database.
Visit the Full Stack Week Hub for every exciting announcement and CFTV episode — and check back all week for more!
Transcript (Beta)
Hi, everyone. Welcome to Cloudflare's Full Stack Week. In this session, we're going to be talking about improving news recommendations using knowledge graphs and Cloudflare Workers.
The slides are available at dev.neo4j.com/news-graph.
So, my name is Will. I work for a company called Neo4j. Neo4j is a graph database, which we'll talk about in a little bit.
I work on the developer relations team.
So, my job is basically to work with users and customers building applications with Neo4j.
And oftentimes, that involves looking at integrations with new technologies and figuring out how Neo4j fits in to that ecosystem.
And so, one of those that we've been seeing more recently is called Cloudflare Workers.
So, we'll be talking about using Neo4j with Cloudflare Workers.
The best way to get ahold of me online is probably on Twitter.
I have my handle linked there, lyonwj, or my website, lyonwj.com, where I have a blog and a newsletter as well.
So, if you're interested in following up, ping me.
One of those two places is probably the best. I also co-host a podcast called Graph Stuff FM, where we dig into a lot of graph database technology, graph technology, talk to folks that are building interesting things.
So, if you like podcasts and that sounds interesting, check out graphstuff.fm as well.
So, if you're not familiar with Neo4j, okay, I said it's a graph database.
So, what that means is you can think of Neo4j as similar to other databases, except that the data model, instead of tables or documents, is a graph.
That's maybe the quickest way to think about Neo4j.
And when we say graph, we're talking about nodes, which are the entities, and relationships that connect them.
Now, of course, there are lots of important implications of the data model being a graph, since Neo4j allows us to model, query, and store our data as a graph.
Things like performance characteristics that are different between graph databases and other databases.
So, graph databases are optimized for traversing the graph, which you can think of as similar to like a join in a relational database.
So, a lot of people ask, well, what's a good use case for a graph database?
Why would I use a graph database?
And one answer to that is if you're writing a lot of joins in a relational database and you either have problems with performance or problems expressing some of those queries, a graph database might be an interesting thing to look at.
There are also all kinds of interesting graph analytics, data visualization use cases as well.
The other important aspect of graph databases that is different from other databases is the query language.
So, instead of SQL, we use a query language called Cypher, and we'll see some examples of Cypher today.
There's an example here in kind of the upper right there that shows how we work with graph patterns in Neo4j and Cypher.
So, Cypher is very declarative and we draw these sort of ASCII art-like graph patterns where we sort of are drawing nodes and relationships to express the pattern that we want to work with in the query.
So, Neo4j as a database kind of sits at the core of a lot of infrastructure.
And I mentioned lots of different use cases from analytics to building transactional applications.
So, there's lots of different tooling and use cases around those to think about.
But what we want to focus on today is talking about personalization and recommendations.
And specifically in the context of news applications.
So, we see recommendations and personalization all over the web, all over mobile apps.
This is a very common thing. So, in e-commerce, we may see products that we might be interested in, suggested purchases, this kind of thing.
But in the news world, I think recommendations and personalization are especially relevant because there's so much content over just this diverse range of topics.
So, finding content that's relevant for the user becomes very difficult, I think, simply because of the vast number of pieces of content and topics that a given user may be interested in.
And it's also, I think, important to focus on personalization for news applications because of notifications.
So, when there's some breaking news, it may or may not be relevant for certain audiences.
So, news orgs, they want to be able to push out that notification to say, hey, here's some breaking event that happens maybe in the world of climate change.
And we know that you're interested in climate change.
This is relevant. But it's a fine line to walk, right?
Because you can't push out lots of notifications that are not relevant for the user.
They'll delete your app, right? But at the same time, we want to be able to drive engagement to bring users to our site to surface interesting content so that they read.
So, there are lots of challenges there, making sure that personalization and recommendations are fast, performant, relevant for our users.
Now, there's lots of different ways to approach generating personalized recommendations.
We're going to focus on using knowledge graphs for personalization.
But what is a knowledge graph?
You may have heard that term before. So, let's back up a little bit and start with, well, what is a graph?
So, a graph is fundamentally a data structure that's composed of nodes.
These are the entities, the objects in our data, and relationships that connect nodes.
Graph databases like Neo4j use a property graph data model.
So, the addition here from just the conceptual graph data structure is the addition of properties.
So, arbitrary key value pair properties that we can store on either nodes or relationships.
So, these are the attributes of our data.
And then we also have labels for nodes. Labels are a way to group nodes, basically assigning something like the type of the node.
You can think of labels as kind of similar to maybe tables from the relational world where I'm grouping rows.
Think of labels as a way to group nodes. And then relationships have a direction.
So, relationships are going from one node to another, and they have a single type.
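To make the property graph model concrete, here is a minimal in-memory sketch. It is not how Neo4j stores data; the label names, the relationship types, and the sample values are assumptions chosen only to illustrate nodes with labels and properties, and directed relationships with a single type.

```javascript
// Hypothetical in-memory picture of a property graph. Labels, relationship
// types, and property values are made up for illustration.
const sampleNodes = [
  { id: 1, labels: ["Article"], properties: { title: "Example headline", url: "https://example.com/article" } },
  { id: 2, labels: ["Topic"], properties: { name: "Politics and Government" } },
  { id: 3, labels: ["Geo"], properties: { name: "China" } },
];

// Relationships are directed (from -> to) and carry exactly one type.
const sampleRelationships = [
  { type: "HAS_TOPIC", from: 1, to: 2, properties: {} },
  { type: "ABOUT_GEO", from: 1, to: 3, properties: {} },
];

// Labels group nodes, much like a table groups rows.
function nodesWithLabel(nodes, label) {
  return nodes.filter((node) => node.labels.includes(label));
}

// Follow outgoing relationships of one type from a starting node.
function followOutgoing(nodes, relationships, fromId, type) {
  const targetIds = relationships
    .filter((rel) => rel.from === fromId && rel.type === type)
    .map((rel) => rel.to);
  return nodes.filter((node) => targetIds.includes(node.id));
}

console.log(nodesWithLabel(sampleNodes, "Article").map((n) => n.properties.title));
console.log(followOutgoing(sampleNodes, sampleRelationships, 1, "HAS_TOPIC").map((n) => n.properties.name));
```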
So, then a knowledge graph is basically an implementation of a property graph that is, I like to say, putting things in context.
Well, what does that actually mean?
Google, when they introduced the Google Knowledge Graph API, put out a blog post talking about things, not strings, so that when you're searching Google, or you're searching the Google Knowledge Graph, and you're maybe searching for, let's say, United States, you're not just looking for that search term.
You're actually recognizing that United States refers to a country, and we know sort of the maybe states that compose that country.
Maybe we know the political system of that country, and so on.
We have information about that thing. We know what kind of entity it is, and we have context about it.
So, we're talking about news. So, here's an example of a news knowledge graph.
So, here we have an article. This is from the New York Times about Biden signing the infrastructure bill.
So, we know that's an article.
It has a URL that refers to it, and then we know the topics of the article.
So, this is about the United States economy. It's about the United States politics and government, and then we have other articles that are connected to these same topics.
So, here's one about Cold War in the U.S. and China, maybe.
We know that's about China. We know where China is. We have the latitude and longitude of that.
We know it's a geo region. We know it's a country. This article is also mentioning Xi Jinping, who we know is the president of China, and so on.
So, taking all this information, telling us what are the important parts of it, how are those important parts connected, and then allowing us to query that to find insights, I think, is the other important aspect of a knowledge graph.
That example, that came from the New York Times API. So, I've built a demo dataset for us using the New York Times API, pulling some data into Neo4j.
So, this is the basic model that we have.
We have articles that are modeled as nodes. Articles have topics.
They have an author. They have photos. They're about a geo region, one or more, perhaps, if we're talking about a certain area, a certain country, a certain region.
And then, we also extract out any organizations. So, is this mentioning a specific company?
Is this an article about Boeing factory shutting down, something like that?
Is this an article about a person, as well? And all the data for this is on GitHub.
You can see the import scripts, and then, the code for a GraphQL API, and then, also, the workers API that we'll talk about.
So, this is a screenshot from Neo4j browser, which we'll take a look at in a minute.
Neo4j browser is kind of like a query workbench for Neo4j. So, as I'm developing, it's kind of the main way that I'm querying the database and visualizing the results.
And here's a Cypher query. So, we're searching for a specific article by URL.
This is an article about Ukraine and its role in cryptocurrency. And we're traversing out in Cypher to find the topics of the article.
So, you can see here how we're drawing the sort of ASCII art representation of this pattern.
Nodes are within parentheses.
And here we have colon article inside parentheses. So, this is saying, find an article node, a node with the label article.
And then, the A before the colon, that's like a variable that I'm binding to that part of the pattern.
So, I can refer to A later on. Here, I'm filtering where A.URL is the specific URL for the article.
So, that's how I'm finding the article I'm interested in.
And then, I'm drawing these sort of outgoing arrows. This has topic arrow is a relationship that says, find the article, then traverse out to the topics, traverse out to the geo regions that it's about, and return that graph.
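The query on the slide is not reproduced in the transcript, so the following is only a sketch of what a query of that shape might look like, wrapped in JavaScript so it can be sent from a worker later. The labels (Article, Topic, Geo), the relationship types (HAS_TOPIC, ABOUT_GEO), and the url property name are assumptions; check them against the actual dataset.

```javascript
// Sketch of the pattern described above. Labels, relationship types, and the
// `url` property are assumed names, not confirmed by the transcript.
const ARTICLE_CONTEXT_QUERY = `
MATCH (a:Article)
WHERE a.url = $url
MATCH (t:Topic)<-[:HAS_TOPIC]-(a)-[:ABOUT_GEO]->(g:Geo)
RETURN a, t, g
`;

// Pair the statement with its parameters instead of concatenating the URL
// into the query text.
function buildArticleContextStatement(url) {
  return { statement: ARTICLE_CONTEXT_QUERY, parameters: { url } };
}

const contextStatement = buildArticleContextStatement("https://example.com/some-article");
console.log(contextStatement.parameters);
```

Passing the URL as the $url parameter keeps the query text constant and avoids quoting problems with user-supplied values.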
Now, I can add more complex patterns in Cypher to find insights in my knowledge graph.
So, here I've added on line 3, another traversal, another pattern that says, well, once you've found all the topics of this article, show me other articles that have the same topic.
That might be a relevant recommendation if I'm reading this article about crypto in Ukraine.
And you can see what we end up with is we end up with other articles about Bitcoin and virtual currencies.
There's some corruption aspect in this article, apparently.
So, I end up with articles about that and with more articles about Ukraine and so on.
So, you can see how we can traverse this knowledge graph to find relevant articles, in this case, if we're interested in this article about crypto and Ukraine.
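As a rough sketch of that extra traversal, the query below walks from one article to its topics and back out to other articles, then counts the shared topics. The label, relationship type, and property names are the same assumptions as before, and the ranking by shared-topic count is an illustrative choice rather than the query shown in the talk.

```javascript
// Sketch: other articles that share topics with a given article.
// Article, Topic, HAS_TOPIC, url, and title are assumed names.
const SAME_TOPIC_QUERY = `
MATCH (a:Article {url: $url})-[:HAS_TOPIC]->(t:Topic)<-[:HAS_TOPIC]-(rec:Article)
WHERE rec <> a
RETURN rec.title AS title, rec.url AS url, count(t) AS sharedTopics
ORDER BY sharedTopics DESC
LIMIT $limit
`;

function buildSameTopicStatement(url, limit = 10) {
  // LIMIT needs an integer, so guard against fractional input.
  return { statement: SAME_TOPIC_QUERY, parameters: { url, limit: Math.trunc(limit) } };
}

console.log(buildSameTopicStatement("https://example.com/some-article", 5).parameters);
```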
Now, talking about recommendations with graphs in general, I like to think of two different approaches.
One is collaborative filtering, and the other is content-based filtering.
With collaborative filtering, we're using either the ratings, the preferences of some users, or some action of users that the users have taken with an item.
And we're using that to find recommendations.
So, in this case, we have this user, Misty Williams, who has rated a bunch of movies.
And then we have another user, Guy Davis, who has also rated the same movies that Misty has.
And Guy has also rated some other movies.
So, those movies that Guy has rated might be relevant recommendations for Misty.
Compare that to content-based filtering, where we are using a knowledge graph, so information about the items, to find recommendations.
Here we have two movies, Casino and Goodfellas, and we're seeing where the actors, the director, and the genres overlap.
So, we can do that traversal to find relevant recommendations.
If you're interested in Casino, you may also be interested in this movie, Goodfellas.
So, at a high level, what we're trying to do is find similar users in the network.
There are lots of different ways to do that, with things like similarity algorithms, graph clustering algorithms, where we're trying to find sort of my peer group in the graph, essentially.
And then we look at, okay, here are users that are similar to me in the graph.
What actions have those users taken that I haven't taken yet?
Those might be relevant recommendations.
In the context of movies, it's what movies did those users that are most similar to me watch that I haven't seen?
In the world of news, it's what news articles have those users viewed or shared that I haven't, that I might be interested in.
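The idea can be sketched without a database at all. The snippet below uses a hypothetical view history (user names and article ids are invented), picks the peers with the most articles in common, and suggests what they have read that the target user has not. Counting shared items is a deliberately crude stand-in for a real similarity measure.

```javascript
// Hypothetical view history: user id -> set of article ids they have read.
const viewHistory = new Map([
  ["misty", new Set(["a1", "a2", "a3"])],
  ["guy", new Set(["a1", "a2", "a3", "a4", "a5"])],
  ["sam", new Set(["a9"])],
]);

function sharedItemCount(itemsA, itemsB) {
  let count = 0;
  for (const item of itemsA) if (itemsB.has(item)) count += 1;
  return count;
}

// Recommend items seen by the most-overlapping peers but not by the user.
function recommendFromPeers(history, userId, peerCount = 1) {
  const mine = history.get(userId) ?? new Set();
  const peers = [...history.entries()]
    .filter(([id]) => id !== userId)
    .map(([id, items]) => ({ id, items, shared: sharedItemCount(mine, items) }))
    .filter((peer) => peer.shared > 0)
    .sort((p, q) => q.shared - p.shared)
    .slice(0, peerCount);

  const recommendations = new Set();
  for (const peer of peers) {
    for (const item of peer.items) if (!mine.has(item)) recommendations.add(item);
  }
  return [...recommendations];
}

console.log(recommendFromPeers(viewHistory, "misty")); // [ 'a4', 'a5' ]
console.log(recommendFromPeers(viewHistory, "new-user")); // []
```

A user with no history has no overlapping peers, so nothing comes back for them.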
There's two examples here, using similarity metrics and movies data to find similar users, using Cypher and using the graph data science library for Neo4j, which allows us to run graph algorithms.
One is cosine similarity. So, with cosine similarity, we're looking at ratings, in this case, ratings of movies, and calculating some similarity metric from one user to all other users in the graph, based on how similar their tastes in movies are.
And then the other example here is using a different similarity metric called Pearson.
Pearson is nice because it takes into account that all users have sort of a different baseline for their ratings.
So, I may frequently give four or five-star ratings, but someone else may consistently give two-star ratings.
So, their sort of three is the equivalent of my four and a half rating, this kind of thing.
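To show the difference between the two metrics, here is a plain JavaScript sketch rather than the Cypher or Graph Data Science calls from the slides. The ratings are invented, and both functions are computed over the movies two users have in common, which is one of several possible conventions.

```javascript
// Ratings are plain objects: movie title -> rating. Values are made up.
const generousRater = { Casino: 5, Goodfellas: 4, Heat: 5 };
const harshRater = { Casino: 3, Goodfellas: 2, Heat: 3 };

function commonMovies(a, b) {
  return Object.keys(a).filter((movie) => movie in b);
}

// Cosine similarity over the movies both users rated.
function cosineSimilarity(a, b) {
  const movies = commonMovies(a, b);
  let dot = 0, normA = 0, normB = 0;
  for (const m of movies) {
    dot += a[m] * b[m];
    normA += a[m] * a[m];
    normB += b[m] * b[m];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Pearson similarity: subtract each user's mean first, so a habitually
// generous rater and a habitually harsh one can still match.
function pearsonSimilarity(a, b) {
  const movies = commonMovies(a, b);
  if (movies.length < 2) return 0;
  const meanOf = (ratings) => movies.reduce((sum, m) => sum + ratings[m], 0) / movies.length;
  const meanA = meanOf(a);
  const meanB = meanOf(b);
  let covariance = 0, varianceA = 0, varianceB = 0;
  for (const m of movies) {
    const da = a[m] - meanA;
    const db = b[m] - meanB;
    covariance += da * db;
    varianceA += da * da;
    varianceB += db * db;
  }
  const denominator = Math.sqrt(varianceA * varianceB);
  return denominator === 0 ? 0 : covariance / denominator;
}

console.log(cosineSimilarity(generousRater, harshRater));
console.log(pearsonSimilarity(generousRater, harshRater));
```

With these sample ratings the two users rank the movies identically but on different baselines, so the Pearson score comes out at essentially 1 while the cosine score is slightly below 1.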
Anyway, just wanted to include those examples so you can see how we can accomplish this using Cypher and graph traversals.
There's an inherent problem, though, with collaborative filtering approaches, and this is known as the cold start problem.
So, what this means is that if I don't have any information about a user, if I haven't viewed their actions in my application, if I don't know what news articles they've read, well, then I can't sort of infer what their interests are, and I don't really have anything that I can personalize for them.
So, in this case, this is a screenshot from one of the news apps that I use.
I wasn't signed in, so when I went to the feed, it said, hey, we don't have anything for you because we don't know anything about you.
So, this is an inherent problem with collaborative filtering.
Content-based recommendations don't have this same problem because we're not depending on information about the user.
Rather, we're relying on this knowledge graph, this information that we have about our items to generate recommendations.
So, here's a screenshot. This comes from the New York Times app. This was an article, I think, from yesterday talking about a railroad in Bulgaria.
And as you scroll into this article, there are recommendations for other articles you may be interested in based on this article that you're reading.
And if we go to our news knowledge graph, we can find this article in the graph and see how we might generate those type of content-based recommendations.
So, here we find the article, we traverse out to the geo regions, in this case, Bulgaria.
So, this red node is Bulgaria.
What are other articles about Bulgaria? Those might be relevant.
And we also combine this with topics. So, you may be interested in other articles about travel or trains and train stations.
So, we've talked about personalization and recommendations using graphs and Neo4j.
Let's talk about how we actually build and deploy these in the real world.
And in this case, how we can use Cloudflare Workers with Neo4j to accomplish that.
So, Cloudflare Workers, if you're not familiar with them, I like to think of Cloudflare Workers as kind of like the next evolution of serverless.
So, serverless kind of started with functions as a service.
Workers has really extended that idea and addressed a lot of the problems that we had with functions as a service, basically bringing it to the edge, to this global CDN, addressing what was a really common problem with functions as a service, which was this cold start problem where I had to wait for my sort of function to spin up.
We don't really have that with Cloudflare Workers.
And a lot of the features of Workers make them really, really good for personalization and recommendation features.
So, for example, every worker is location aware.
So, I know not only what region my user is making the request from, but with Cloudflare's localization service, I have their latitude and longitude, what city they're in, and so on.
There's also a lot of really important performance implications with Cloudflare Workers that are really relevant for personalization.
So, what I want to build is something kind of like this, if you excuse my simple architecture diagram there.
But basically, we have some news app, and I want to serve a feed of articles for the user that they may be interested in.
And I'm going to do that with Cloudflare Workers.
So, my app is going to hit a worker endpoint. That worker is then going to query Neo4j, which is also running somewhere in the cloud, traverse this knowledge graph, and return a feed of relevant news articles.
So, to do this, I got started with Wrangler.
Wrangler is the CLI for Cloudflare Workers.
So, we install that. And then I started from a template. So, there's a worker template that shows how to use a router.
So, you can add different routes in the same worker, which is really nice.
So, I started with that template. And then I also use Wrangler to set some secrets, some connection credentials for Neo4j, the URI for Neo4j, and my authorization for the database.
Let's jump over to VS Code.
Actually, first, let's take a look at Neo4j. So, this is Neo4j browser.
So, I have... Let's zoom in a little bit. So, I have Neo4j instance running in the cloud at this endpoint.
Here's that Bulgaria query, so we can see what this looks like.
So, here's my Cypher query. I run that, it traverses the graph, and I can then visually see the results of that query.
But I can also explore the graph.
So, I can just kind of double click here to sort of traverse out and explore the graph visually, which can be useful during development.
Okay. So, here's the code for my worker.
First of all, we are importing this itty-router, so this allows us to add routes in the worker.
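The worker in the talk uses a router library for this. The snippet below is not that library's API; it is a dependency-free stand-in, with hypothetical names, that only illustrates the idea of mapping paths such as / and /recommended/:id to handlers inside one worker.

```javascript
// Minimal stand-in for a router: maps GET paths with ":param" segments
// to handlers. Hypothetical helper, not the library used in the talk.
function createRouter() {
  const routes = [];
  return {
    get(path, handler) {
      // Turn "/recommended/:id" into a regular expression with a named group.
      const pattern = new RegExp("^" + path.replace(/:(\w+)/g, "(?<$1>[^/]+)") + "$");
      routes.push({ pattern, handler });
      return this;
    },
    match(method, pathname) {
      if (method !== "GET") return null;
      for (const { pattern, handler } of routes) {
        const found = pattern.exec(pathname);
        if (found) return { handler, params: { ...(found.groups ?? {}) } };
      }
      return null;
    },
  };
}

const demoRouter = createRouter()
  .get("/", () => "feed for this location")
  .get("/recommended/:id", ({ id }) => `recommendations for article ${id}`);

const hit = demoRouter.match("GET", "/recommended/11490");
console.log(hit.handler(hit.params)); // recommendations for article 11490
```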
And then I have a function that I wrote just to handle some database connection things.
One thing when you're working with databases and database connections with Cloudflare Workers: if you can connect over HTTP, that is going to vastly simplify the way that you interact with databases, since it's much easier to work with HTTP requests coming out from a worker than other methods of connecting to a database, like TCP or WebSockets.
So, Neo4j has an HTTP API, so we can send Cypher queries over HTTP to the database.
That's just a function to handle that. And here is the code for our index route.
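The helper itself is not shown in the transcript. A minimal sketch of such a function, assuming the database exposes Neo4j's transactional HTTP endpoint (/db/<database>/tx/commit) with basic authentication, might look like this. The config field names and the example host are hypothetical; in a worker the credentials would come from the secrets set with Wrangler.

```javascript
// Build the arguments for fetch(). The config shape (baseUrl, user,
// password, database) is an assumption made for this sketch.
function buildNeo4jRequest(config, statement, parameters = {}) {
  const database = config.database ?? "neo4j";
  const token = btoa(config.user + ":" + config.password);
  return {
    url: `${config.baseUrl}/db/${database}/tx/commit`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Accept: "application/json",
        Authorization: `Basic ${token}`,
      },
      body: JSON.stringify({ statements: [{ statement, parameters }] }),
    },
  };
}

// Each result carries `columns` plus `data`, where every entry has one `row`.
function rowsToObjects(result) {
  return result.data.map(({ row }) =>
    Object.fromEntries(result.columns.map((column, i) => [column, row[i]]))
  );
}

// Send one Cypher statement and return the rows as plain objects.
async function runCypher(config, statement, parameters) {
  const { url, init } = buildNeo4jRequest(config, statement, parameters);
  const response = await fetch(url, init);
  const payload = await response.json();
  if (payload.errors && payload.errors.length > 0) {
    throw new Error(payload.errors[0].message);
  }
  return rowsToObjects(payload.results[0]);
}
```

Keeping the request building separate from the network call makes the helper easy to check without a running database.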
So, when I first sort of hit that endpoint, I want to show the user a personalized feed of articles.
And in this case, we're going to use the location of the user.
So, what I want to do is get the location of the user, the latitude and longitude, and I want to find any geo regions in my knowledge graph that are closest to the user.
In workers, in the request object that's passed to my worker, I have this CF object, which has a lot of things on it, but one of the things that it has is latitude and longitude for my user.
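A small sketch of reading those values follows. Workers expose latitude and longitude on request.cf as strings, and the object may be missing in some local setups, so this hypothetical helper parses the values and falls back to a default location (the fallback coordinates are arbitrary).

```javascript
// Read the caller's coordinates from request.cf, with a fallback when the
// values are absent. The fallback is an arbitrary placeholder.
function userLocation(request, fallback = { latitude: 0, longitude: 0 }) {
  const latitude = Number.parseFloat(request.cf?.latitude);
  const longitude = Number.parseFloat(request.cf?.longitude);
  if (Number.isNaN(latitude) || Number.isNaN(longitude)) return fallback;
  return { latitude, longitude };
}

// A plain object stands in for the real Request here.
console.log(userLocation({ cf: { latitude: "46.87", longitude: "-113.99" } }));
console.log(userLocation({}));
```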
Then I have a Cypher statement here.
This is basically just searching for the geo node, so, some geo region mentioned in some of these articles that is closest to the user.
And then I'm traversing out to find articles that are about that geo region.
And then I also grab some information, like what people are mentioned, what topics is the article about, and so on.
And then I pass the latitude and longitude that I grabbed here from the request object to the Cypher query, send it to Neo4j, and then I just pull out the results and send that back as JSON.
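The actual statement is not in the transcript. One way such a query could be written is sketched below, assuming Geo nodes store a point in a location property and using Cypher's point.distance function (older Neo4j releases call it distance()). The labels, relationship types, and property names are assumptions, and runQuery is a hypothetical function that sends a statement and resolves to an array of rows.

```javascript
// Sketch: find the closest geo region, then articles about it.
// Geo.location, ABOUT_GEO, HAS_TOPIC, and the returned fields are assumed.
const NEARBY_ARTICLES_QUERY = `
MATCH (g:Geo)
WHERE g.location IS NOT NULL
WITH g, point.distance(g.location, point({latitude: $latitude, longitude: $longitude})) AS meters
ORDER BY meters ASC
LIMIT 1
MATCH (a:Article)-[:ABOUT_GEO]->(g)
OPTIONAL MATCH (a)-[:HAS_TOPIC]->(t:Topic)
RETURN a.title AS title, a.url AS url, g.name AS region, collect(DISTINCT t.name) AS topics
LIMIT 10
`;

function nearbyArticlesParameters(location) {
  return { latitude: Number(location.latitude), longitude: Number(location.longitude) };
}

// `runQuery` is injected so the handler does not depend on one client.
async function nearbyFeed(location, runQuery) {
  const rows = await runQuery(NEARBY_ARTICLES_QUERY, nearbyArticlesParameters(location));
  return new Response(JSON.stringify(rows), {
    headers: { "Content-Type": "application/json" },
  });
}
```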
So, let's do a wrangler dev to start this locally. And this will run on localhost:8787.
And here I can see some results. So, I live in Montana. So, it's giving me some articles about Montana.
This one, of course, the first one is about anti-vax conspiracy theories.
We'll skip that one. CDC food waste. Let's skip that one.
Here's one. This looks good. So, Yellowstone National Park is in Montana. Here's an article about grizzly bears in Yellowstone.
Great. So, in my app, I like this article.
I'm going to read it. But then when I'm done reading the article, I may want to see similar articles that I may be interested in.
So, we added another route in our worker.
So, /recommended/ and then the ID of the article. And now this Cypher query is going to find that article in our knowledge graph in Neo4j.
We're then going to find what topics, what georegions, what people are mentioned in that article, and then find other articles that are about the same topics, the same georegion, and basically score them by the number of overlapping nodes that they have using Jaccard similarity.
So, we looked at those similarity metrics earlier, cosine, Pearson similarity.
Jaccard similarity is another type of similarity algorithm.
So, finding the articles most similar to the one I'm interested in and then returning that.
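Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it in plain JavaScript to invented articles, where each article is reduced to the set of topics, geo regions, and people it is connected to. The talk does this scoring inside the Cypher query, so this only illustrates the arithmetic.

```javascript
// Jaccard similarity: |A intersect B| / |A union B|.
function jaccard(setA, setB) {
  let intersection = 0;
  for (const item of setA) if (setB.has(item)) intersection += 1;
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Hypothetical articles, each described by the nodes it connects to.
const articleTags = new Map([
  ["grizzlies", new Set(["Yellowstone", "Bears", "Montana"])],
  ["wolves", new Set(["Yellowstone", "Wolves", "Montana"])],
  ["glacier", new Set(["Montana", "National Parks"])],
  ["crypto", new Set(["Bitcoin", "Ukraine"])],
]);

// Score every other article against the target and keep those that overlap.
function rankBySimilarity(targetId, tagsByArticle) {
  const target = tagsByArticle.get(targetId) ?? new Set();
  return [...tagsByArticle.entries()]
    .filter(([id]) => id !== targetId)
    .map(([id, tags]) => ({ id, score: jaccard(target, tags) }))
    .filter((entry) => entry.score > 0)
    .sort((x, y) => y.score - x.score);
}

console.log(rankBySimilarity("grizzlies", articleTags));
// [ { id: 'wolves', score: 0.5 }, { id: 'glacier', score: 0.25 } ]
```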
So, in this case, we're going to grab the article ID.
And we go to /recommended/11490. And we can see, great, here's other articles about Yellowstone, other articles about other national parks.
Cool. So, looks like that works. I'm running that locally with Wrangler dev.
Let's do a wrangler publish. And this is now going to publish my worker, in this case, to news.graphstuff.workers.dev.
So, this is live now. So, if you're anywhere in the world, you should be able just to open this in a browser, news.graphstuff.workers.dev.
And you should see initially a feed of news articles from the New York Times that are basically about georegions that are close to you.
So, basically localized news recommendations in Cloudflare Workers powered by Neo4j.
Cool.
So, I think that is all of the code that we want to look at. Let's jump back to our slides here.
So, just some screenshots in the slides talking about the demo that we built.
So, we'll skip that. Like I said, all of this is on GitHub in this news graph repo.
So, if you'd like to check out the code or import this data yourself and work with it, all you need is a New York Times API key.
I'll mention just a couple of resources that you may be interested in.
The Neo4j Sandbox is a great resource for trying out Neo4j in the cloud, but also with interesting datasets.
There's one called Recommendations that dives into a lot of the different ways of generating personalizations and recommendations with Neo4j.
So, that's a good place to start. Arrows.app is a diagramming tool, which is useful as you're orienting yourself in terms of graphs and the property graph model.
Now, because we're talking about news, I also want to mention that Neo4j has a data journalism accelerator program that I help manage.
So, if you are a data journalist and you have an interesting dataset that you think Neo4j might be able to help you use in your investigation, definitely reach out for that as well.
Great. So, thanks for joining today and we'll see you next time.
Cheers.