
Explore the Power of Neo4j: Building a Recommendation System Powered by Graph Data Science

5.15 Technologies

Introduction

Neo4j is a graph NoSQL database that excels at visualizing and analyzing complex data relationships. Instead of organizing data in rows and columns like traditional relational databases, Neo4j represents information as interconnected nodes with properties and relationships, forming a graph structure. This approach makes it particularly effective for identifying patterns and extracting insights.


Graph data aligns closely with an unstructured approach to data science, especially within the field of Graph Data Science (GDS). In a graph, information can be inferred from existing relationships in multiple ways. By analyzing how nodes interact, deeper connections and meaningful insights can be uncovered—even from unstructured data. GDS consists of graph-based algorithms designed to extract hidden insights beyond direct relationships. Using a sample database provided by Neo4j, this blog will demonstrate how graph data science techniques can be used to build a movie recommendation system.


By leveraging vector space and embeddings, this recommendation system goes beyond simple relationship-based queries to reveal deeper, more complex connections between nodes. While explicitly defined relationships can generate obvious recommendations, this approach can be limiting, missing less apparent but still highly relevant suggestions. In contrast, embeddings capture semantic similarities and latent patterns, allowing recommendations to be made based on contextual relevance while still utilizing direct relationships. This results in more flexible and dynamic recommendations.


The sample database contains movies, reviewer ratings, and their connections to directors, actors, and genres. By analyzing these relationships and user ratings, the system can infer movies a user is likely to enjoy, essentially forming a movie recommendation engine.


This example takes inspiration from Neo4j’s original blog post on a similar recommendation system. If you're new to Neo4j, consider starting with a simpler example that uses direct relationships to generate recommendations. Once you grasp the fundamentals of graph data operations, revisiting this discussion will provide a deeper understanding of advanced concepts and a richer exploration of Neo4j’s capabilities.



All example code will be available in a Git repository.




Environment Setup

Figure 1: Architectural Overview
  • Create a graph projection to focus embedding generation on relevant connections.

  • Store embedding properties in a Python list of dictionaries.

  • Apply stored embeddings back to the original dataset for analysis.

  • After analysis, export recommendation data.

  • Display data in a readable data frame.


Requirements

  • Neo4j Desktop version 1.5.9 installed and running the Neo4j Graph Example Movies Recommendation database, with Neo4j’s Graph Data Science library version 2.5.5 installed.

  • IDE of your choice. This example was created in a Jupyter notebook, but an IDE like PyCharm or VSCode will do just fine.

  • Imported libraries: random, pandas, neo4j.


Configuring the local database

If you have not done so already, you'll want to download Neo4j Desktop.



  • Locate the Projects tab on the left and click on the “New” button.

  • From the dropdown, select: “import sample project.”

  • Next, locate the sample project titled Neo4j Graph Example Movies Recommendation and click install. A popup should appear asking which version to install; however, since a dump will be loaded manually, hit the close button.

  • Under the file area, there will be a data directory; click on the dropdown to see multiple data dumps. Hover over the dump named “recommendations-50.dump” and click on the 3 dots next to the open button.

  • Select “Create new DBMS from dump.”


Figure 2: Example Neo4j desktop display when instantiating recommendations database
  • Input a password and keep track of the password, as it will be needed later.

  • From here, click on the name of the database that was just created to open a sidebar on the right.

  • Click on “Plugins” and then click the dropdown next to “Graph Data Science Library” and click install.

  • Finally, to run the database, simply click: “Start.”


Figure 3: Graph Data Science Library Installation

Setting up your IDE

During this step, the Jupyter notebook environment will be set up and the required libraries will be imported. The libraries used in this example are “random” for random number generation, “pandas” for displaying tabular results, and “neo4j” for interacting with Neo4j Desktop.

 

This example was created in a Jupyter notebook for ease of use but, as mentioned before, any IDE will work.

To install Jupyter, open the terminal and run this command:

$pip install jupyter

After installation has finished, starting a Jupyter notebook is as easy as running this command:

$jupyter notebook

Importing required dependencies

Below are the required dependencies and the initialization of Neo4j driver variables. If you're not using Jupyter, run the following commands to install the necessary libraries before importing them:

$pip install pandas
$pip install neo4j

Import Libraries

import random
import pandas as pd
from neo4j import GraphDatabase

Define Data Access Variables

NEO4J_URI='bolt://localhost:7687'
NEO4J_USER='neo4j'
#Input the database password you created earlier
NEO4J_PASS='neo4jPASS'

Now, let's dive into the coding!


Step 1: Projecting a Graph

In this first step, a graph will be projected for Neo4j to analyze. A graph projection is an in-memory subset of the database on which algorithms are run. This reduces computation time by focusing only on data pertinent to the analysis. The snippet below is the query that will project a subset of the graph:

query = """
       CALL gds.graph.project(
          'movieGraph',
          {
            Movie: {
            },
            Actor: {
            },
            Genre: {
            },
            Director: {
            }
          },{
            IN_GENRE: {
                orientation: 'UNDIRECTED'
            },
            ACTED_IN: {
                orientation: 'UNDIRECTED'
            },
            DIRECTED: {
                orientation: 'UNDIRECTED'
            }
          }
        )
       """
with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS)) as driver:
    query_result, _, _ = driver.execute_query(query)

The query calls Neo4j’s Graph Data Science library and projects a graph named “movieGraph”. After naming the projected graph, the labels of the nodes Neo4j will project must be defined; in this example, the Movie, Actor, Genre, and Director nodes are collected. After defining the node labels, the relationship types to project are defined: “IN_GENRE”, “ACTED_IN”, and “DIRECTED”. In short, the projection contains all the movies, who acted in them, who directed them, and the different movie genres.
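As a quick sanity check (not part of the original query), the in-memory projection can be inspected with `gds.graph.list`, which reports how many nodes and relationships were captured. The field names below follow the GDS documentation; the string would be run with driver.execute_query just like the other queries in this post:

```python
# Inspect the in-memory projection: gds.graph.list yields, among other
# fields, the node and relationship counts of the named graph.
count_query = """
CALL gds.graph.list('movieGraph')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
"""
```

If the counts look wrong, the projection can be dropped with `gds.graph.drop('movieGraph')` and rebuilt.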


Step 2: Estimating computational requirements

Even though only a portion of the graph was projected, thousands of nodes and relationships remain available for analysis. With counts that large, the memory resources Neo4j will require to analyze the data need to be taken into consideration. The command below will return the required memory:

query = """
CALL gds.fastRP.stream.estimate('movieGraph', {embeddingDimension:128})
YIELD requiredMemory
RETURN requiredMemory
"""

with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS)) as driver:
    query_result, _, _ = driver.execute_query(query)

memory = [dict(record) for record in query_result]

print(f"Memory required to create embeddings: {memory[0]['requiredMemory']}")
Memory required to create embeddings: 43 MiB

Here, a query is written to estimate how much memory is needed to run the algorithm. First, an algorithm is chosen; “FastRP” is the one used in this example. Then the graph to analyze is named. Finally, the level of detail is defined: the number of embedding dimensions represents the amount of detail, and for this example it is set to 128, but feel free to raise or lower that number as your machine allows.


What are embedding dimensions?


Embedding dimension refers to the number of values used to represent and encode data in a lower-dimensional space through mathematical transformations. In various machine learning and data processing tasks, the embedding dimension determines the level of detail and complexity in capturing relationships between different elements or features within the data.


A higher number of embedding dimensions means more detailed data and a clearer picture, at the cost of increased memory for computation. A lower number of embedding dimensions means the data will be more compressed and the computation less memory-intensive, but the resulting representation may lack nuance and precision.
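For a rough sense of the trade-off, consider just the storage cost of the finished embeddings: one float per dimension per node. The sketch below is a back-of-envelope estimate (the 30,000 node count is illustrative, and GDS’s memory use during computation is higher than this lower bound):

```python
# Back-of-envelope storage cost of embeddings alone: one float64
# (8 bytes) per dimension per node. GDS's working memory is larger.
def embedding_storage_mib(num_nodes, dim, bytes_per_value=8):
    return num_nodes * dim * bytes_per_value / (1024 ** 2)

# Illustrative node count; doubling the dimension doubles the cost.
for dim in (64, 128, 256):
    print(dim, round(embedding_storage_mib(30_000, dim), 1))
```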

 

Step 3: Creating and storing embeddings

In this next step, the movies will be analyzed: an “embedding” property will be created for each movie in the projected graph, pulled out of the projection, and set on the corresponding movie in the main graph.


As mentioned before, the “FastRP” (Fast Random Projection) algorithm will be utilized for its efficiency. This sub-graph contains thousands of nodes, and “FastRP” is an efficient, scalable algorithm that can produce accurate embeddings at a relatively low computational cost. Other options to consider are “Node2Vec”, which utilizes random walks, and “GraphSAGE”, which utilizes random neighborhood sampling. Essentially, the FastRP algorithm assigns each node a random vector (a series of numbers) and, based on that node’s connections with other nodes, makes each node’s numbers more similar or more different.


A good example of “FastRP” in action: if two movies are connected to the same genre, actor, or director node, their vectors will be made more similar. If two movies don’t share any genres, actors, or directors, the algorithm adjusts the vectors so that they move further apart in the vector space. This doesn’t mean the numbers diverge toward infinity or some theoretical limit; rather, the algorithm optimizes the vectors to be more dissimilar based on their lack of connections within the network. More info on “FastRP” can be found in Neo4j’s documentation.

random_seed = random.randint(0, 1000000)
query = f"""
        CALL gds.fastRP.stream('movieGraph', {{
          relationshipTypes: ['DIRECTED', 'ACTED_IN', 'IN_GENRE'],
          embeddingDimension: 128,
          randomSeed: {random_seed}
        }})
        YIELD nodeId, embedding
        MATCH (m:Movie)
        WHERE id(m) = nodeId
        RETURN nodeId, embedding
        """

with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS)) as driver:
    query_result, _, _ = driver.execute_query(query)

embeddings = [dict(record) for record in query_result]

query = """
        UNWIND $embeddings AS row
        MATCH (n)
        WHERE id(n) = row.nodeId
        SET n.embedding = row.embedding
        RETURN count(n) AS analyzed_movies
        """

with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS)) as driver:
    query_result, _, _ = driver.execute_query(query, embeddings=embeddings)

analyzed_movies = [dict(record) for record in query_result]
print(analyzed_movies)
[{'analyzed_movies': 9125}]

In the code above, a random seed is defined for the initial random vectors (this number could also be fixed for reproducibility). A query is then defined that builds the embedding for each movie. Each movie’s embedding and ID are returned for matching purposes and converted into a Python list of dictionaries. A separate query then unwinds that list, matches each embedding to its respective movie, and sets the property in the main database for future use. Finally, the print statement verifies the number of movies analyzed.
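Before trusting the write-back, it can be worth a quick check that every collected vector has the expected length. The helper below is a hedged sketch run on dummy rows shaped like the `embeddings` list above (the helper name is illustrative, not part of the original code):

```python
# Return the nodeIds whose embedding does not have the expected
# dimension; an empty list means every vector checks out.
def misshapen_rows(rows, dim=128):
    return [row["nodeId"] for row in rows if len(row["embedding"]) != dim]

# Dummy rows shaped like the real `embeddings` list.
dummy = [
    {"nodeId": 0, "embedding": [0.0] * 128},
    {"nodeId": 1, "embedding": [0.0] * 64},  # deliberately wrong size
]
print(misshapen_rows(dummy))  # [1]
```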


Step 4: Creating user recommendations

In this last step, a random user will be selected from the database and based on the movies they rated highly, other similar movies will be recommended to them.

query = """
            MATCH (u:User)
            RETURN u.name as Name
            ORDER BY RAND()
            LIMIT 1
            """

with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS)) as driver:
    query_result, _, _ = driver.execute_query(query)

results = [dict(record) for record in query_result]

name = results[0]["Name"]

query = """
MATCH (user:User {name: $name})-[watched:RATED]->(rated_movie:Movie)
    WHERE watched.rating >= 4

WITH user, collect(DISTINCT rated_movie.embedding) AS embeddings

WITH user,
    REDUCE(s = [], e IN embeddings |
       [i IN RANGE(0, SIZE(e)-1) | COALESCE(s[i], 0.0) + e[i]]) AS sum,
    SIZE(embeddings) AS count

WITH user, [x IN sum | x / count] AS averageEmbedding

MATCH (other_movie:Movie)
WHERE NOT (user)-[:RATED]->(other_movie)
    AND other_movie.embedding IS NOT NULL

WITH other_movie.title AS recommendation,
    gds.similarity.cosine(averageEmbedding, other_movie.embedding) AS similarity
ORDER BY similarity DESC
LIMIT 10
RETURN recommendation, similarity
"""

with GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASS)) as driver:
    # Pass the user's name as a query parameter instead of interpolating it
    query_result, _, _ = driver.execute_query(query, name=name)

movie_predictions = [dict(record) for record in query_result]

movie_df = pd.DataFrame(movie_predictions)
print(name, movie_df)

The first query defined above matches all User nodes, returns their names, puts them in a random order, and takes the first one. This random name is then used in the next query.


In the second query, the same user found in the first is matched and used to grab all the movies that they rated 4 out of 5 stars or better. The embeddings of each of the user’s highly rated movies are then collected into a list named “embeddings.” The goal after collecting the embeddings is to represent all the user’s highly rated movies at once: averaging their embedding properties produces a single embedding that approximates the user’s movie preferences.
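The averaging performed by the REDUCE expression can be mirrored in plain Python, shown here on toy 4-dimensional vectors (the real embeddings have 128 entries):

```python
# Element-wise average of several equal-length vectors, matching the
# Cypher REDUCE/RANGE expression in the query above.
def average_embedding(vectors):
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]

# Two toy "highly rated movie" embeddings.
liked = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0, 2.0],
]
print(average_embedding(liked))  # [2.0, 0.0, 1.0, 1.0]
```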


From there, all movies the user has not rated are matched and their embedding properties are compared one by one to the user’s ideal movie embedding, ordered by similarity score, and converted into a data frame for visualization. The similarity score is generated using Neo4j’s cosine similarity function.


How Does Neo4j’s Cosine Similarity Algorithm Work?


Cosine similarity is a metric used to measure how similar two nodes are based on their embeddings in a graph. It calculates the cosine of the angle between two vectors (representing node embeddings), producing a similarity score between -1 and 1:


  1. A score of 1 implies that the two vectors (node embeddings) are identical.

  2. A score of -1 indicates complete dissimilarity between the vectors.

  3. A score close to 0 signifies that the vectors are orthogonal or have no similarity.


In the context of a recommendation system, a higher similarity score means a movie is more closely aligned with a user's preferences.
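To make those score ranges concrete, here is what `gds.similarity.cosine` computes, reproduced in a few lines of plain Python:

```python
import math

# Cosine similarity: dot(a, b) / (|a| * |b|), mirroring
# gds.similarity.cosine for two equal-length vectors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (identical direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```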

 

Altogether, the result should look similar to the output below.

Alexis Steele MD                    recommendation  similarity
0                  Grumpy Old Men    0.858187
1                   Jurassic Park    0.808978
2              Odd Couple II, The    0.804159
3             Fortune Cookie, The    0.797590
4                 Front Page, The    0.786504
5  Lost World: Jurassic Park, The    0.786379
6            Bourne Identity, The    0.778431
7                            I.Q.    0.777085
8                      Out to Sea    0.767656
9                           Chaos    0.766575

In summary

A movie recommendation system was built leveraging Neo4j and graph data science. Unlike traditional methods that rely solely on direct relationships to generate obvious recommendations, the approach used here utilized graph embeddings and vector space to uncover the more complex connections that lie beneath those direct relationships, providing more accurate, dynamic, and personalized recommendations. While this example focused on movies, the same techniques can be adapted to any graph-based dataset.

 

This was accomplished by:


  • Projecting a subset of the database for analysis.

  • Analyzing each node’s position in relation to other nodes to produce an embedding property for each movie utilizing “FastRP.”

  • Taking those embedding properties out of the projected graph and setting them in the actual database.

  • Using those embedding properties to calculate an embedding property that represents the user’s ideal movie.

  • Finally, comparing the user’s ideal movie embedding property to the embeddings of other movies the user has not seen to produce a similarity score.

 

Further Exploration

What else can we do with this approach or the data? Several options to consider include:


  • Community Detection: Leverage Neo4j’s community detection algorithms to identify strongly connected clusters of nodes within the dataset. Learn more about how these clusters reveal meaningful relationships.

  • Centrality Analysis: Use Neo4j’s centrality algorithms to determine the most influential and well-connected nodes. For example, do major directors exhibit higher centrality because they are linked to more movies and actors? Learn more about centrality here.

  • Data Augmentation: Enhance the dataset using unsupervised machine learning or graph data science techniques to improve inference capabilities.

    • Incorporate additional reviewer data to uncover patterns across different generations or demographics.

  • Data Analysis: Identify trends and patterns—do the suggested movies share common characteristics? If not, what factors influence the recommendations?

    • Can data collection and analysis methods be improved? Are there any biases or limitations in how the data was originally generated?

  • Challenges with Embeddings: Calculating the ideal embeddings for recommendations can be tricky. Averaging a user’s highly rated movie embeddings works well when their preferences are consistent. However, for users with diverse tastes, averaging embeddings may reduce accuracy.


How Can This Data Be Used?


  • Link Prediction: A “MAY_LIKE” relationship could be created between users and movies with a similarity score above a certain threshold, predicting relationships that didn’t previously exist based on data.

  • Machine Learning Feedback Loop: These recommendations could be fed to users for feedback, which could then be integrated into a machine learning pipeline to train a link prediction model. More on that here.
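The “MAY_LIKE” idea above could be sketched with a MERGE query. The 0.75 threshold and the parameter names below are illustrative assumptions, not part of the sample database; the parameters would be supplied per recommendation via driver.execute_query:

```python
# Illustrative Cypher: materialize a recommendation as a MAY_LIKE
# relationship when it clears a chosen similarity threshold. $name,
# $title, and $similarity are query parameters.
MAY_LIKE_QUERY = """
MATCH (user:User {name: $name})
MATCH (movie:Movie {title: $title})
WHERE $similarity >= 0.75
MERGE (user)-[r:MAY_LIKE]->(movie)
SET r.similarity = $similarity
"""
```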


Need Help Exploring Your Data?

At 5.15, we specialize in data science and machine learning solutions, helping organizations extract valuable insights from their data. Whether you’re exploring graph databases, predictive analytics, or AI-driven recommendations, our team can assist you in designing scalable, intelligent solutions.


If you're looking to leverage your data more effectively, contact us to see how we can help!




©2025 by 5.15 Technologies, LLC
