This blog is intended to provide you with a way to visualize your organization and gather knowledge about its dynamics. If you'd like to "follow along," you can access the code for this blog post in our GitLab repository below.
We will discuss the steps necessary to:
Connect
Collect
Process
Load
Visualize
Extract Knowledge
Before we discuss the process and code required to achieve the objective, it is important to understand the infrastructure and use case.
Use Case
Our use case is simple but leverages cool technology to achieve the goal. We want to collect data from our Active Directory Domain and build an “Org Chart” in the form of a graph.
Infrastructure
The image and supporting dialog below provide insight into the core components of the infrastructure and processes associated with this project.
Active Directory (“AD”) is our data source for collecting information about the organization. This will allow us to collect the information we need to build the graph and extract knowledge.
A Python function will be used to collect data from AD via LDAP.
We will use Python to process the data into the formats that we need for the loading process. The data processing addresses formatting and any data augmentation required.
CQL, or Cypher Query Language, will be used to load data into the Neo4j database. The graph(s) used in this blog will be sourced from this database.
Neo4j Desktop will be used to store any data collected. We will also use this database for knowledge extraction. Knowledge extraction is the digital form of asking the data a question and expecting an answer.
After we load the data into Neo4j, we will visualize the data in the Neo4j Browser. This will give us an effortless way to verify that the Nodes, Relationships, and Properties are set properly.
In the Jupyter Notebook we will build a hierarchical visualization of the organizational chart.
Let us discuss each step below. If you came here to hijack code and want to skip all the dialog, jump ahead to the Code Section and Screen Shots below.
Connect
Since we are collecting data from Active Directory/LDAP, we will need to build a connection and authenticate to the domain.
Collect
After we have connected and successfully authenticated, we will need to return and collect useful data that aligns with our use case. For this use case we will collect data about all “User” objects in the domain. We will limit this data to the following AD/LDAP attributes:
cn: Common Name or user id
sn: Surname or last name
givenName: First name of the user
mail: Email address of the user
displayName: The display name, typically in the format “Last Name, First Name”
userPrincipalName: This is the fully qualified username, typically the same as email in the format of username@domain.local or firstname.lastname@domain.com
mobile: The user’s mobile phone number
telephoneNumber: The user’s desk phone number
manager: The user’s manager
directReports: Users who report to this user
description: A brief description of the user
If you want to learn more about other AD attributes, this link will provide you with some background information.
Process
After we collect the data, it will need to be processed to remove data we do not want and to clean up the formatting of the data we do keep, such as removing service accounts, which are typically not actual users. For this example, we will use the manager property of the LDAP object and create a new field containing the “manager_name.” The process for obtaining the manager’s name will leverage lambda functions on a “Pandas DataFrame.”
Load
Loading the data is the process of mapping the collected data to Node Labels, Relationships, and the properties associated with those Nodes and Relationships.
Visualize
Visualizing the data is how we rapidly validate the database schema and prepare for developing a simple, consumable user interface.
Prerequisites
To follow along with this project, you will need the following installed:
Active Directory Domain Controller and a read-only user account.
Python Interpreter and the following Python Libraries:
Pandas
LDAP3
Jupyter
Python IDE (PyCharm/VS Code)
Neo4j Desktop or Server
Neo4j Browser
Y_Files Extension for Jupyter Notebooks.
Web Browser
Code Section and Screen Shots
The images below are from the Jupyter Notebook. The steps and code required to build your visual organization chart (“OrgChart”) are listed below:
Python Imports
As with any Python project we start with importing libraries. For this project we will use LDAP3 and Pandas. This will be the basis of our data collector.
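A minimal sketch of the import cell is below. The try/except guard is only there so the snippet runs even before the libraries are installed; in the notebook you would import them directly.

```python
# Core third-party libraries for the collector: ldap3 (the LDAP client)
# and pandas (tabular data processing). Install with: pip install ldap3 pandas
try:
    import pandas as pd
    from ldap3 import Server, Connection, SUBTREE
    HAVE_DEPS = True
except ImportError:
    # Guard so this sketch still runs before the libraries are installed.
    HAVE_DEPS = False
```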
Variables
First, we define the Python variables required to support the connection and authentication to Active Directory (LDAP). We start with a read-only user and the required password. Next, we define the server for which we will connect to LDAP. Then we define the LDAP attributes that we want to return with our query. The “searchParameters” will be defined to determine what we are searching for in LDAP coupled with the “basedn” and “ad_attributes” variables.
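A sketch of these variables is below. Every value shown (the account DN, password, server name, base DN, and search filter) is a placeholder assumption; replace each with your own domain's details.

```python
# Read-only bind account and password (placeholders, use your own).
ldap_user = "CN=svc_ldap_read,OU=Service Accounts,DC=domain,DC=local"
ldap_password = "CHANGE_ME"

# The domain controller we will connect to, and the search base.
ldap_server = "ldaps://dc01.domain.local"
basedn = "DC=domain,DC=local"

# The AD/LDAP attributes we want returned with our query.
ad_attributes = [
    "cn", "sn", "givenName", "mail", "displayName",
    "userPrincipalName", "mobile", "telephoneNumber",
    "manager", "directReports", "description",
]

# What we are searching for, coupled with the basedn and ad_attributes
# variables. The filter restricts results to user objects.
searchParameters = {
    "search_base": basedn,
    "search_filter": "(&(objectClass=user)(objectCategory=person))",
    "attributes": ad_attributes,
}
```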
Data Collector
The script below is used to connect to the domain and collect the data we want to save in a “.CSV” file. The function itself is simple. We create a connection using the “server” variable defined earlier and request a “paged” search. "Paged" means that any values returned in excess of 999 are returned on another page. The results are returned as Python dictionaries. The function below adds each dictionary to a list, creating a list of dictionaries, which is then converted to a Pandas DataFrame. Since some values in the dictionary are valuable to building our graph, we will need to explode those values from a single cell to multiple cells.
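A hedged sketch of such a collector is below. The imports are deferred into the function body only so the snippet loads without the libraries installed; the paged search uses ldap3's standard extended operation.

```python
def collect_ad_users(server_uri, user, password, search_parameters):
    """Run a paged LDAP search against AD and return a Pandas DataFrame.

    'Paged' means results beyond the page size (999 here) arrive on
    additional pages; ldap3's generator handles the paging for us.
    """
    # Imported here so this sketch loads even without the libraries installed.
    import pandas as pd
    from ldap3 import Server, Connection

    server = Server(server_uri)
    conn = Connection(server, user=user, password=password, auto_bind=True)

    entries = conn.extend.standard.paged_search(
        paged_size=999,
        generator=True,
        **search_parameters,   # search_base, search_filter, attributes
    )

    # Each result is a dictionary; build a list of dictionaries,
    # skipping entries (such as search references) with no attributes.
    rows = [dict(e["attributes"]) for e in entries if e.get("attributes")]

    # Convert the list of dictionaries to a DataFrame. Multi-valued
    # attributes such as directReports come back as lists, so explode
    # them from a single cell into multiple rows.
    df = pd.DataFrame(rows)
    if "directReports" in df.columns:
        df = df.explode("directReports")
    return df
```

A typical call would then be `users_df = collect_ad_users(ldap_server, ldap_user, ldap_password, searchParameters)`, followed by `users_df.to_csv("fake_users_from_ldap.csv", index=False)` to store the results and `print(len(users_df))` to see how many objects the query returned.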
With the function defined, we will create a variable to hold the results of the function. The results are then stored in a .CSV file for processing. The image below shows some sparse but representative data from the results. Your data will look more like the values used in your domain.
As you can see from the image below, we added a print function. This is to give you an idea of how many objects were returned from the LDAP query.
Working with Sample Data
To help support the development of your code based on this blog, I have provided sample data. You can use this data to evaluate your data loading functions into Neo4j. To work with the sample data, you will need to read the .CSV file into a Pandas DataFrame. I highly recommend filling any “NaN” or missing values. Use the Pandas “fillna” method to achieve this objective.
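A small sketch of that read-and-fill step is below, using an inline stand-in for the sample file so it runs anywhere; the column names mirror the attributes collected earlier.

```python
import io
import pandas as pd

# Inline stand-in for fake_users_from_ldap.csv; with the real file you
# would call pd.read_csv("fake_users_from_ldap.csv") instead.
sample_csv = io.StringIO(
    "cn,displayName,manager\n"
    'jdoe,"Doe, Jane",bsmith\n'
    'bsmith,"Smith, Bob",\n'
)

users_df = pd.read_csv(sample_csv)

# Fill any NaN / missing values so later string operations do not fail.
users_df = users_df.fillna("")
```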
Data Processing
The data in my development domain is staged, so we will perform only basic processing. For this example, we will use Python to find the manager’s display name by cross-referencing the “Manager” value, which is a username. To achieve this objective, we will create a Python lookup dictionary from a Pandas Series, with “CN” (the username) as the key and “DisplayName” as the value. We can then use the dictionary to look up the corresponding display name for each username in the DataFrame, mapping a manager’s username to an actual person’s name.
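One way to build that lookup dictionary is sketched below; the two sample rows are made up for illustration.

```python
import pandas as pd

# Two made-up rows standing in for the collected data.
users_df = pd.DataFrame({
    "CN": ["jdoe", "bsmith"],
    "DisplayName": ["Doe, Jane", "Smith, Bob"],
})

# A Pandas Series with CN as the index and DisplayName as the values,
# converted to a plain Python dictionary for fast lookups.
lookup = pd.Series(
    users_df["DisplayName"].values, index=users_df["CN"]
).to_dict()

print(lookup["bsmith"])  # -> Smith, Bob
```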
Once the lookup dictionary is created, we can create a simple Python function to compare the value of “x”, which equates to the value of a cell in the Pandas DataFrame. The function takes the value of “x”, in our case the username, and returns the “DisplayName” value upon a match.
With the function completed, we apply it to the entire DataFrame using a “Lambda” function. This will create a new column called “manager_name”. The function will receive the value of “x” from each cell of the “manager” column in the DataFrame.
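Putting the lookup function and the lambda application together might look like the following sketch; the sample rows are made up, and the column names follow the attributes used earlier.

```python
import pandas as pd

users_df = pd.DataFrame({
    "CN": ["jdoe", "bsmith"],
    "DisplayName": ["Doe, Jane", "Smith, Bob"],
    "manager": ["bsmith", ""],
})

# Lookup dictionary built from a Pandas Series, keyed by username.
lookup = pd.Series(
    users_df["DisplayName"].values, index=users_df["CN"]
).to_dict()

def manager_display_name(x):
    """Return the display name matching username x, or '' if no match."""
    return lookup.get(x, "")

# Apply the function across the DataFrame via a lambda, creating the
# new "manager_name" column from each row's "manager" value.
users_df["manager_name"] = users_df["manager"].apply(
    lambda x: manager_display_name(x)
)
```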
After our data processing is complete, we will build a simple “Data Loader.”
File Copy
The data load script will be in the format of “CQL” for Cypher Query Language. This script can be copied directly into your Neo4j Browser or run from a Python script. I will provide screen shots below for the operation using the Neo4j Browser. With your database selected, click “Open/Open Folder/Import”, and paste the “fake_users_from_ldap.csv” file.
Data Loader
Once the data is copied to the “Import” directory, you are ready to load the data. The commands in the Jupyter Notebook below show how to perform this operation using Python and the “py2neo” library. First, we load the library and instantiate an object of class “Graph”. The basic arguments are listed in the example below. You need to provide the “bolt” URL and authentication, assuming you have not disabled authentication. The next step is to create constraints in the database to ensure that certain values are unique. This will allow you to create and merge data without creating duplicate nodes. For this dataset we need constraints for the following: “id”, “PersonCN”, “PersonMail”, and “CompanyName”.
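A sketch of that setup is below. The bolt URL and credentials are placeholders, and the pairing of each unique value with a node label is an assumption based on the property names used later in this post. py2neo is imported inside the function only so the snippet loads without it installed.

```python
def connect_and_constrain(bolt_url="bolt://localhost:7687",
                          auth=("neo4j", "password")):
    """Instantiate a py2neo Graph and create uniqueness constraints."""
    # Imported here so this sketch loads even without py2neo installed.
    from py2neo import Graph

    graph = Graph(bolt_url, auth=auth)

    # One uniqueness constraint per value that must not be duplicated.
    # The label/property pairing is an assumption; adapt it to your schema.
    for label, prop in [("Person", "id"),
                        ("Person", "PersonCN"),
                        ("Person", "PersonMail"),
                        ("Company", "CompanyName")]:
        graph.run(
            f"CREATE CONSTRAINT IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        )
    return graph
```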
You can verify that the constraints are added via the Neo4j Browser using this command: “show constraints”. The image below represents the expected, abbreviated output.
With the data copied to the import directory and the constraints created, we are ready to load the “.CSV” file into the graph database. First, we create our CQL query. The image below represents this task. It is also in the Jupyter Notebook if you want to copy/paste. In the query we are completing the following:
Load the “CSV” file and create a reference to each row as the object “row.” You can call this whatever you want, so long as you use that object name as your point of reference for each column in each row. With the “CSV” loaded and the object instantiated, we then create a “Person” node label for each row. This will only occur for rows in the file that have the “CN” column populated. Then we iterate over each row, create the node, and set properties on each node based on the values in the referenced column. After the query is created, we run the CQL query with the “run” method.
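A hedged sketch of that CQL is below. The node property names come from later steps in this post; the constant company value is an assumption (set it to your own organization), and the query runs against the py2neo Graph object created earlier.

```python
# LOAD CSV query: each row of the file is referenced as the object "row".
load_query = """
LOAD CSV WITH HEADERS FROM 'file:///fake_users_from_ldap.csv' AS row
WITH row WHERE row.cn IS NOT NULL
MERGE (p:Person {PersonCN: row.cn})
SET p.PersonDisplayName  = row.displayName,
    p.PersonMail         = row.mail,
    p.PersonManager_Name = row.manager_name,
    p.PersonCompany      = 'Example Corp'
"""
# graph.run(load_query)  # "graph" is the py2neo Graph object
```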
Load Validation
Validating that the data is loaded is simple. Using the same “run” method we can run a basic CQL query that returns all nodes and properties. The image below depicts what you should expect to see with this command.
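The validation query itself is a one-liner:

```python
# Return every node; the Browser / py2neo result includes each node's properties.
validate_query = "MATCH (n) RETURN n"
# results = graph.run(validate_query).data()  # "graph" is the py2neo Graph object
```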
You can also validate in the Neo4j Browser running the same command as above. When you select one of the returned nodes, you should see the properties as depicted below.
Create “Relationships” based on Node Properties
With our data loaded and the nodes created, we can now add additional data and create relationships. The first thing we need to do is create a “Company” node. We will connect all employees to this node.
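Creating the “Company” node is a single statement; the company name shown is a placeholder assumption.

```python
# MERGE rather than CREATE, so re-running the cell will not duplicate the node.
company_query = """
MERGE (c:Company {CompanyName: 'Example Corp'})
"""
# graph.run(company_query)  # "graph" is the py2neo Graph object
```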
With the “Company” node created, we can create relationships for employee to company and employee to manager. The “MANAGES” relationship is created using the code below. We are creating a “Cartesian” lookup with the “MATCH” command.
With the syntax below we ask to create a “MANAGES” directed relationship between the “p” and “m” nodes where the “DisplayName” property is not null and the “PersonManager_Name” is not null.
The relationship will be created based on the previously described condition “AND” where the “PersonDisplayName” for the “p” node is equal to the “PersonManager_Name” of the “m” node. The “graph.run()” method is used to execute the query and merge the relationships into the Neo4j database.
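Based on the description above, the “MANAGES” query might be sketched as follows: a Cartesian MATCH over the Person nodes, filtered by the two null checks and the display-name match.

```python
manages_query = """
MATCH (p:Person), (m:Person)
WHERE p.PersonDisplayName IS NOT NULL
  AND m.PersonManager_Name IS NOT NULL
  AND p.PersonDisplayName = m.PersonManager_Name
MERGE (p)-[:MANAGES]->(m)
"""
# graph.run(manages_query)  # "graph" is the py2neo Graph object
```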
Next, we will create the relationships between employees and the company. This is similar to the syntax for the “MANAGES” relationship. The primary difference is that the match is based on the “PersonCompany” property of each node.
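A sketch of that query is below. The relationship name “WORKS_FOR” is an assumption, since the post only specifies that the match is based on each node's “PersonCompany” property.

```python
works_for_query = """
MATCH (p:Person), (c:Company)
WHERE p.PersonCompany = c.CompanyName
MERGE (p)-[:WORKS_FOR]->(c)
"""
# graph.run(works_for_query)  # "graph" is the py2neo Graph object
```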
At this point we should now have a graph that looks like the image below:
Visualization Steps
Once our data is loaded and our relationships are created, we can quickly visualize the data from the Neo4j Browser directly in the Jupyter Notebook. The code in the image below provides the steps. The simple approach is:
Import Libraries
Instantiate a Class of the GraphDatabase
Establish a session
Instantiate a Class of the GraphWidget
Set the “Layout” property via the included method.
Show the Graph returned from Neo4j using the YFiles Widget.
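The steps above can be sketched like this. The bolt URL and credentials are placeholders, and the imports are deferred into the function so the snippet loads without the neo4j driver or the yFiles widget installed.

```python
def show_org_chart(bolt_url="bolt://localhost:7687", auth=("neo4j", "password")):
    """Query Neo4j and render the result hierarchically with the yFiles widget."""
    # Imported here so this sketch loads even without the libraries installed.
    from neo4j import GraphDatabase
    from yfiles_jupyter_graphs import GraphWidget

    driver = GraphDatabase.driver(bolt_url, auth=auth)
    with driver.session() as session:
        # .graph() returns the result as a graph object the widget understands.
        result = session.run("MATCH (p:Person)-[r]->(n) RETURN p, r, n").graph()
        widget = GraphWidget(graph=result)
        widget.hierarchic_layout()  # set the layout via the included method
        return widget  # returned as the cell's last expression, it displays inline
```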
The image below shows what the hierarchical layout should look like with the data we added in the steps above.
Summary
While this is not a production-ready system, we hope that this helps you start your journey into visualizing graph data with Neo4j. At 5.15 Technologies LLC, we build all kinds of solutions, including “Knowledge Graph Applications”.
For assistance with your integration, development, or design needs for your next knowledge graph-based solution, feel free to reach out to us through our website or social media channels. We're here to help!