Converting PayStub to SQL with Azure Form Recognizer
published: 02-07-2023
- Step 1: Initial Functionality
- Step 2: Convert To Solution and an Azure Function
- Step 3: Setup Azure environment and deploy Azure Function
Why
Like many people I set a series of goals in the beginning of the year. For 2023, one goal was to more accuratly measure aspects of my life, in this case my finances. I have tried using numerous financial tracking apps and I have never been pleased, if anyone has a recommendation please let me know. I decided to take a different approach and try to build my own solution; to start, I am putting the input (income from job) into a sql database so I can being to analyze it. The code outlined here is the first step in that process, you can find the complete codebase on my GitHub.
Begin by provisioning Azure resources
The scripts below prevision the following Azure resources:
NOTE: I choose to use Azure Key Vault to store my Azure Form Recognizer key and endpoint, but you can use any key management method you choose.
Create Azure Form Recognizer with Azure Key Vault
I began with provisioning the Azure Form Recognizer, the code for which is below.
Param(
[Parameter(Mandatory = $true)]
[String]
$ResourceGroupName,
[Parameter(Mandatory = $true)]
[String]
$FormRecognizerName,
[Parameter(Mandatory = $true)]
[String]
$KeyVaultName,
[Parameter(Mandatory = $true)]
[String]
$Location
)
# ------------------------------------------------------------------------------
# Variables
# ------------------------------------------------------------------------------
$RESOURCE_GROUP_NAME = $ResourceGroupName
$FORM_RECOGNIZER_ACCOUNT = $FormRecognizerName
$KEY_VAULT_NAME = $KeyVaultName
# ------------------------------------------------------------------------------
# Provision Resource Group
# ------------------------------------------------------------------------------
az group create `
--name $RESOURCE_GROUP_NAME `
--location $LOCATION
# ------------------------------------------------------------------------------
# Provision Azure Key Vault
# ------------------------------------------------------------------------------
az keyvault create `
--name $KEY_VAULT_NAME `
--resource-group $RESOURCE_GROUP_NAME `
--location $LOCATION
# ------------------------------------------------------------------------------
# Provision Azure Form Recognizer
# ------------------------------------------------------------------------------
$FORM_RECOGNIZER_ACCOUNT_ENDPOINT = az cognitiveservices account create `
--kind "FormRecognizer" `
--name $FORM_RECOGNIZER_ACCOUNT `
--resource-group $RESOURCE_GROUP_NAME `
--location $LOCATION `
--sku "S0" `
--assign-identity `
--yes `
--query "properties.endpoint" `
--output tsv
$FORM_RECOGNIZER_ACCOUNT_KEY = az cognitiveservices account keys list `
--name $FORM_RECOGNIZER_ACCOUNT `
--resource-group $RESOURCE_GROUP_NAME `
--query "key1" `
--output tsv
# ------------------------------------------------------------------------------
# Store Azure Form Recognizer Keys in Vault
# ------------------------------------------------------------------------------
az keyvault secret set `
--vault-name $KEY_VAULT_NAME `
--name "FormRecognizerEndpoint" `
--value $FORM_RECOGNIZER_ACCOUNT_ENDPOINT
az keyvault secret set `
--vault-name $KEY_VAULT_NAME `
--name "FormRecognizerKey" `
--value $FORM_RECOGNIZER_ACCOUNT_KEY
Create Azure SQL Server and Database
Next step is to provision the SQL Database, the script for which is below. If you choose to copy this script and run it in a pipeline, GitHub Actions as an example, you will need to update the external admin on the SQL Server. Similarly the Client IP Range should not be added when run in a pipeline, but rather when you are running the script locally. Feel free to modify the script to suit your needs.
Param(
[Parameter(Mandatory = $true)]
[String]
$ResourceGroupName,
[Parameter(Mandatory = $true)]
[String]
$SqlServerName,
[Parameter(Mandatory = $true)]
[String]
$SqlDatabaseName,
[Parameter(Mandatory = $true)]
[String]
$ClientIPStart,
[Parameter(Mandatory = $true)]
[String]
$ClientIPEnd,
[Parameter(Mandatory = $true)]
[String]
$Location
)
# ------------------------------------------------------------------------------
# Variables
# ------------------------------------------------------------------------------
$RESOURCE_GROUP_NAME = $ResourceGroupName
$LOCATION = $Location
$SQL_SERVER_NAME = $SqlServerName
$SQL_DATABASE_NAME = $SqlDatabaseName
$START_IP = $ClientIPStart
$END_IP = $ClientIPEnd
# ------------------------------------------------------------------------------
# Provision Resource Group
# ------------------------------------------------------------------------------
az group create `
--name $RESOURCE_GROUP_NAME `
--location $LOCATION
# ------------------------------------------------------------------------------
# Provision Server (for current signed-in user)
# ------------------------------------------------------------------------------
$SQL_ADMIN_NAME = az ad signed-in-user show `
--query displayName `
--output tsv
$SQL_ADMIN_USER_OBJECT_ID = az ad signed-in-user show `
--query id `
--output tsv
az sql server create `
--name $SQL_SERVER_NAME `
--resource-group $RESOURCE_GROUP_NAME `
--location $LOCATION `
--enable-ad-only-auth `
--external-admin-principal-type User `
--external-admin-name $SQL_ADMIN_NAME `
--external-admin-sid $SQL_ADMIN_USER_OBJECT_ID
# ------------------------------------------------------------------------------
# Configure a server-based firewall rule
# ------------------------------------------------------------------------------
az sql server firewall-rule create `
--resource-group $RESOURCE_GROUP_NAME `
--server $SQL_SERVER_NAME `
--name AllowMyIp `
--start-ip-address $START_IP `
--end-ip-address $END_IP
# ------------------------------------------------------------------------------
# Create a database
# ------------------------------------------------------------------------------
az sql db create `
--resource-group $RESOURCE_GROUP_NAME `
--server $SQL_SERVER_NAME `
--name $SQL_DATABASE_NAME `
--edition GeneralPurpose `
--compute-model Serverless `
--family Gen5 `
--capacity 2
Now that resources are provisioned we can begin to build the application.
Building the application.
I chose to build a console app because it will be easy to integrate into my home automation, but in the future I would like to move it to an Azure Function.
lets review the goal again: to pass in a PDF (because thats what I have) and have the application extract the data and insert it into the SQL database.
Setup the project
I created a new console app using the .NET CLI, but you can use Visual Studio or Visual Studio Code to create a new console app.
dotnet new console --name "<Name of your project>"
NOTE: I suggest also adding a .gitignore for dotnet at the repository root as this command will not create one for you.
Add NuGet packages
dotnet add package Azure.Identity
dotnet add package Azure.AI.FormRecognizer
dotnet add package Azure.Security.KeyVault.Secrets
dotnet add package Microsoft.Data.SqlClient
dotnet add package AzFormRecognizer.Table.ToSQL
Before we begin to write code lets talk through the packages.
-
Azure.Identity: Used to authenticate to Azure services, in the code below I use the DefaultAzureCredential which require you to be logged into Azure through several methods like the Azure CLI.
-
Azure.AI.FormRecognizer: Used to interact with the Form Recognizer service.
-
Azure.Security.KeyVault.Secrets: Used to interact with the Key Vault service.
-
Microsoft.Data.SqlClient: Used to interact with the Azure SQL Server instance.
-
AzFormRecognizer.Table.ToSQL: This is a custom package I created to help with the conversion of the Form Recognizer output to SQL. The logic is currently minimal but I plan to extend the functionality with time. Feel free to explore the package and contribute on GitHub.
The Code
I break out the code into 5 core sections to make it easier to follow. To keep things simple, I will simply be placing these sections one after another in the Program.cs file. Feel free to break these sections into functions or different files as you see fit, afterall in its current form unit tests would be painful.
You can remove the Console.WriteLine("Hello, World!");
line as it is not needed.
Section 1: Ensure the PDF file path is passed in as an argument
Add the the following using statement to the top of the file.
using AzFormRecognizer.Table;
This will allow us to use the DocumentDetails
class in this section of code.
if (args.Length == 0) {
throw new Exception("Please pass in the PDF file path.");
}
string filePath = args[0];
if (!File.Exists(filePath)) {
throw new Exception("File does not exist.");
}
var details = new DocumentDetails() // This is used later as keys for database tables
{
Title = Path.GetFileName(filePath),
Id = Guid.NewGuid().ToString()
};
var bytes = await File.ReadAllBytesAsync(filePath);
var memoryStream = new MemoryStream(bytes);
The above code will ensure that a file path is passed in as an argument and that the file exists. It will then read the file into a MemoryStream
which will be used later to send the PDF to the Form Recognizer service.
Section 2: Pull Form Recognizer keys from Azure Key Vault
Add the the following using statement to the top of the file.
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;
This will allow us to use the DefaultAzureCredential
and SecretClient
classes in this section of code.
string? keyVaultName = Environment.GetEnvironmentVariable("KEY_VAULT_NAME");
if (keyVaultName == null || keyVaultName == "") {
throw new Exception("Please set the KEY_VAULT_NAME environment variable.");
}
var keyVaultUri = new Uri($"https://{keyVaultName}.vault.azure.net");
var secretClient = new SecretClient(keyVaultUri, new DefaultAzureCredential());
var formRecognizerEndpoint = await secretClient.GetSecretAsync("FormRecognizerEndpoint");
var formRecognizerKey = await secretClient.GetSecretAsync("FormRecognizerKey");
The above code will pull the Form Recognizer endpoint and key from Azure Key Vault. The DefaultAzureCredential
will use the Azure CLI to authenticate to Azure. If you are not logged into Azure through the CLI you will need to use a different method to authenticate.
Section 3: STEP 3: Send PDF to Azure Form Recognizer
Add the the following using statement to the top of the file.
using Azure;
using Azure.AI.FormRecognizer;
This will allow us to use the AzureKeyCredential
, DocumentAnalysisClient
, AnalyzeResult
and AnalyzeDocumentOperation
classes in this section of code.
string endpoint = formRecognizerEndpoint.Value.Value;
string apiKey = formRecognizerKey.Value.Value;
string modelName = "prebuilt-layout";
AzureKeyCredential credential = new AzureKeyCredential(apiKey);
var documentClient = new DocumentAnalysisClient(new Uri(endpoint), credential);
AnalyzeDocumentOperation operation = await documentClient.AnalyzeDocumentAsync(WaitUntil.Completed, modelName, memoryStream);
AnalyzeResult result = operation.Value;
The above code will send the PDF to the Form Recognizer service and wait for the results to be returned. The modelName
variable is the name of the model that will be used to analyze the PDF. In this case we are using the prebuilt-layout
model which will return various structural elements of the PDF including extracted tables. You can read more about it in the documentation.
Section 4: Parse the Form Recognizer results into a list of sql commands
Add the the following using statement to the top of the file.
using AzFormRecognizer.Table.ToSQL;
This will allow you to use the IEnumerable<DocumentTable>
extension ToSQL
which will output the SQL commands.
void AddMissingInfoAndRelations(List<Table> tables, DocumentDetails details)
{
if (tables.Count == 0)
{
throw new Exception("No tables found in the document.");
}
if (tables[0].Title == null && tables[0].Headers != null)
{
tables[0].Title = "Payslip";
var payslipTablePrimaryKey = new ColumnHeader()
{
Name= "DocumentId",
TableKey = new TableKey() { Type = TableKeyType.Primary },
DataType = ColumnDataTypes.VARCHAR
};
var primaryKeyColumnIndex = tables[0].Headers.Last().Key + 1;
tables[0].Headers.Add(primaryKeyColumnIndex, payslipTablePrimaryKey);
}
var payslipTableForignKey = new ColumnHeader()
{
Name= "DocumentId",
TableKey = new TableKey() { Type = TableKeyType.Foreign, Reference = "Payslip(DocumentId)" },
DataType = ColumnDataTypes.VARCHAR
};
if (tables[1].Title == null)
{
tables[1].Title = "Summary";
}
if (tables[8].Title == null)
{
tables[8].Title = "Allowances";
}
foreach (var table in tables)
{
if (table.Headers.All(header => header.Value.TableKey == null))
{
var primaryKey = new ColumnHeader() { Name= "Id", TableKey = new TableKey() { Type = TableKeyType.Primary }, DataType = ColumnDataTypes.INT };
table.Headers.Add(table.Headers.Last().Key + 1, primaryKey);
table.Headers.Add(table.Headers.Last().Key + 1, payslipTableForignKey);
}
foreach (var row in table.Rows)
{
row.Add("DocumentId", details.Id);
}
}
}
var listOfSQLCommands = result.Tables.ToSQL(details, AddMissingInfoAndRelations);
The above code will parse the Form Recognizer results into a list of SQL commands. The AddMissingInfoAndRelations
method will add the missing table names and primary keys to the tables. While Form Recognizer will return the table headers it will not return the table name or the relation between the tables, which is why the ToSQL
method accepts a delegate to add this information. I have this configured to add in missing segments of my own workday generated playstub in PDF format, but you can modify this to suit the needs of the PDF you are using.
Section 5: Execute the SQL commands
Add the the following using statement to the top of the file.
using Microsoft.Data.SqlClient;
This will allow you to use the SqlConnection
and SqlCommand
classes in this section of code.
string? serverName = Environment.GetEnvironmentVariable("SQL_SERVER_NAME");
if (serverName == null || serverName == "") {
Console.WriteLine("Please set the SQL_SERVER_NAME environment variable.");
return;
}
string? databaseName = Environment.GetEnvironmentVariable("SQL_DATABASE_NAME");
if (databaseName == null || databaseName == "") {
Console.WriteLine("Please set the SQL_DATABASE_NAME environment variable.");
return;
}
string ConnectionString = $"Server={serverName}.database.windows.net; Authentication=Active Directory Default; Encrypt=True; Database={databaseName}";
using (SqlConnection connection = new SqlConnection(ConnectionString)) {
await connection.OpenAsync();
// First create all tables and relations
// This will cause an error if run after tables exist, this is currently a limitation of the AzFormRecognizer.Table.ToSQL library
// which will be fixed in a future release, in the meantime, you can comment out the following line and run the program again for new PDFs
var createTableCmds = listOfSQLCommands.Where(cmd => cmd.Contains("CREATE TABLE")).ToList();
foreach (var sqlTableStr in createTableCmds) {
using (var command = new SqlCommand(sqlTableStr, connection)) {
int rowsAffected = command.ExecuteNonQuery();
Console.WriteLine(rowsAffected + " = rows affected.");
}
}
// Insert values into tables
var inserValuesCmds = listOfSQLCommands.Where(cmd => !cmd.Contains("CREATE TABLE")).ToList();
foreach (var sqlTableStr in inserValuesCmds) {
using (var command = new SqlCommand(sqlTableStr, connection)) {
int rowsAffected = command.ExecuteNonQuery();
Console.WriteLine(rowsAffected + " = rows affected.");
}
}
await connection.CloseAsync();
}
The above code will create the tables and insert the values into the tables. The ConnectionString
variable is the connection string to the Azure SQL Database. You can find the connection string in the Azure Portal under the SQL Database resource. The listOfSQLCommands
variable is the list of SQL commands that were generated in the previous section.
Section 6: Run the program
To run the program, you can either run it from Visual Studio or from the command line. To run it from the command line, navigate to the directory where the project is located and run the following command.
dotnet build -o output -c release
This will build the project and output the executable to the output
directory. To run the executable, run the following command with your own environment variables, project name and pdf.
$Env:SQL_SERVER_NAME = <Enter your SQL Server Name>
$Env:SQL_DATABASE_NAME = <Enter your SQL Database Name>
$Env:KEY_VAULT_NAME = <Enter your Key Vault Name>
./<Name of project>/output/<Name of project> ./Path to PDF.pdf
Limitations
As noted in the comments of the Section 5 code block the SQL commands will fail to create the tables if they already exist. This is a limitation of the AzFormRecognizer.Table.ToSQL
library. This will be fixed in a future release. In the meantime, you can comment out the following line and run the program again for new PDFs.
Conclusion
The above functionality is a work in progress as I develop my own personal financial analysis tool. I hope this article has been helpful in showing you how to use Form Recognizer to extract data from PDFs and how to use the data to create SQL commands. If you have any questions or comments, please feel free to reach out.