Project Horus

HackMIT 2019 · Boston, MA · September 14-15, 2019

Project Horus is an application that uses OCR to read paper and digital receipts and help customers make better-informed purchasing decisions. Using a combination of machine learning algorithms, Project Horus scans, parses, augments, and analyzes purchases from receipts. Currently, we run analytics on grocery receipts to show customers how healthy or unhealthy their food selections are, based on statistics from a nutritional facts database. In addition to serving as a personal finance tool, Project Horus donates a percentage of the total value of each month's healthy purchases to a hunger-related charity. In the future, we plan to expand our technology to analyze receipts from other industries, including clothing, technology, wellness, education, and business applications.


Project Walkthrough

The flow chart below summarizes the main algorithms and languages used in developing Project Horus.

[Figure: Flow Chart]

Sample Input

Receipt 1
[Image: Sample Receipt]

Receipt 2
[Image: Sample Receipt]

Step 1: Optical Character Recognition
Optical character recognition (OCR) lets our pipeline read text directly from images despite blur, rotation, smudges, wrinkles, and similar defects. We experimented with two OCR platforms: Amazon Web Services (AWS) Textract and Instabase OCR. Both had similar accuracy, and we used a combination of the two. The input to this step is an image of the receipt, and the output is a JSON file.
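
As a rough illustration of what the next step consumes, the sketch below pulls the text lines out of a Textract-style response, which lists detected elements in a "Blocks" array where each entry has a "BlockType" and, for text, a "Text" field. The function name extract_lines and the sample response are hypothetical and heavily trimmed; the Instabase output format is not covered here.

```python
def extract_lines(ocr_response):
    """Return the text of every LINE block, in the order the OCR reported them."""
    return [
        block["Text"]
        for block in ocr_response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hypothetical, trimmed response for illustration only (not real OCR output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "WHL MILK 3.49"},
        {"BlockType": "WORD", "Text": "WHL"},
        {"BlockType": "LINE", "Text": "TOTAL 3.49"},
    ]
}

receipt_lines = extract_lines(sample_response)
print(receipt_lines)
```

Working at the LINE level rather than the WORD level keeps each item on the same string as its price, which simplifies the parsing step.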


Step 2: Text Parsing + Augmentation
Using Python 3 libraries (NumPy, Pandas, PyEnchant), we parse pairs of items and prices from the receipt JSON files, along with the final total of each transaction. Because receipts have limited space, many words are truncated or abbreviated, so we augment the data with an English dictionary spellcheck tool. This step takes the JSON file as input and produces a 2D NumPy array of (item, price) pairs.
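
The parsing idea can be sketched with the standard library alone. This is a minimal example, not the project's actual parser: it assumes each purchase appears on one line as item text followed by a price, and the regular expression, the keyword sets, and the function name parse_receipt are all illustrative choices. The spellcheck-based expansion of abbreviations is left out.

```python
import re

# Assumed line shape: item text followed by a price such as 3.49 or $3.49.
PRICE_LINE = re.compile(r"^(?P<item>.+?)\s+\$?(?P<price>\d+\.\d{2})$")

# Hypothetical keyword sets; real receipts use many more variants.
TOTAL_WORDS = {"TOTAL", "BALANCE DUE"}
IGNORED_WORDS = {"SUBTOTAL", "TAX"}


def parse_receipt(lines):
    """Split OCR lines into (item, price) pairs and the receipt total."""
    items, total = [], None
    for line in lines:
        match = PRICE_LINE.match(line.strip())
        if not match:
            continue  # skip headers, addresses, and other lines without a price
        name = match.group("item").strip()
        price = float(match.group("price"))
        if name.upper() in TOTAL_WORDS:
            total = price
        elif name.upper() not in IGNORED_WORDS:
            items.append((name, price))
    return items, total


# Made-up receipt lines for illustration.
lines = ["FRESH MART", "WHL MILK 3.49", "BANANAS $1.20", "TAX 0.00", "TOTAL 4.69"]
items, total = parse_receipt(lines)
print(items, total)
```

The resulting list of tuples can be passed to numpy.array to obtain the 2D array of (item, price) pairs described above.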


Step 3: Text Classification
Using additional Python 3 libraries (NumPy, Pandas, difflib), we cluster the parsed words into buckets defined by an online nutritional database. Each word is then associated with nutritional information such as calories, sugar, and carbohydrates, among other fields.
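
One plausible way to do this bucketing is fuzzy string matching with difflib.get_close_matches from the Python standard library. The sketch below is an assumption about the approach, not the project's code: the NUTRITION_DB table, its values, the 0.6 cutoff, and the function name classify_item are all invented for illustration.

```python
import difflib

# Hypothetical nutrition table; the entries and values are invented.
NUTRITION_DB = {
    "whole milk": {"calories": 150, "sugar_g": 12},
    "banana": {"calories": 105, "sugar_g": 14},
    "cheeseburger": {"calories": 300, "sugar_g": 7},
}


def classify_item(name, database=NUTRITION_DB, cutoff=0.6):
    """Return the database key closest to the receipt text, or None if nothing is close."""
    matches = difflib.get_close_matches(
        name.lower(), list(database), n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(classify_item("BANANAS"))
print(classify_item("WHOLE MLK"))
print(classify_item("BATTERIES"))
```

Raising the cutoff reduces false matches for non-food items at the cost of missing heavily abbreviated names.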

Step 4: Analytics on Nutrition Dataset

Using the nutrition dataset and the parsed text segmented into these clusters, we calculate the 'healthiness' of each item on the receipt, measured using a combination of calories, sugar, fat, saturates, and sodium.
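
The exact combination is not specified here, so the following is only one possible scoring scheme: each nutrient is compared with a reference amount, capped at 1, and the equally weighted average is subtracted from 1 so that higher scores mean healthier items. The reference amounts, the equal weights, the field names, and the sample items are all hypothetical.

```python
# Hypothetical reference amounts; tune these to the dataset in use.
REFERENCE = {
    "calories": 400.0,
    "sugar_g": 25.0,
    "fat_g": 20.0,
    "saturates_g": 5.0,
    "sodium_mg": 600.0,
}

# Assumption: every nutrient contributes equally to the score.
WEIGHTS = {key: 1.0 / len(REFERENCE) for key in REFERENCE}


def healthiness(nutrients):
    """Score an item from 0.0 (at or above every reference amount) to 1.0 (none present)."""
    penalty = 0.0
    for key, limit in REFERENCE.items():
        ratio = min(nutrients.get(key, 0.0) / limit, 1.0)  # cap each nutrient's penalty
        penalty += WEIGHTS[key] * ratio
    return 1.0 - penalty


# Illustrative values only, not taken from a real database.
fruit = {"calories": 90, "sugar_g": 12, "fat_g": 0.3, "saturates_g": 0.1, "sodium_mg": 1}
burger = {"calories": 300, "sugar_g": 7, "fat_g": 14, "saturates_g": 6, "sodium_mg": 700}

print(healthiness(fruit), healthiness(burger))
```

Capping each ratio keeps a single extreme value, such as very high sodium, from dominating the score.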

Step 5: Receipt Health Score

We compute the Receipt Health Score as a weighted average of the individual item 'healthiness' scores, with each item weighted by its price. A share of all advertising profits, proportional to the healthiness of the receipt, is donated to a food bank or non-profit. In this way, we encourage our customers to eat healthily and contribute to their communities.
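
Read as a price-weighted average, the score can be sketched as follows. This reading is an assumption, and the function name receipt_health_score and the sample numbers are illustrative.

```python
def receipt_health_score(scored_items):
    """Price-weighted average of item scores; scored_items holds (healthiness, price) pairs."""
    pairs = list(scored_items)
    total_spend = sum(price for _, price in pairs)
    if total_spend <= 0:
        return None  # no priced items, so no meaningful score
    return sum(score * price for score, price in pairs) / total_spend


# Made-up example: a healthy 3.00 item and a less healthy 1.00 item.
score = receipt_health_score([(0.8, 3.00), (0.2, 1.00)])
print(score)
```

Weighting by price means that expensive items influence the receipt score more than cheap ones, so the score reflects where the money went rather than how many items were bought.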

Sample Output for Fast Food Meal
[Image: Flow Chart]

Business Overview

Project Horus aims to use machine learning and natural language processing to bring value to millions of consumers who struggle daily to monitor their finances and draw meaningful insights from their spending patterns. We target anyone who struggles with managing day-to-day finances. In addition, to promote community and address pressing societal concerns, we pledge to donate a portion of our profits to charity. To motivate users to improve their spending habits, they can pledge to support a non-profit and "donate" the ad revenue earned through good spending habits and a healthy lifestyle.

People

Project Horus was developed by a team of three undergraduates at the 2019 HackMIT event. Abhijit Gupta and Adit Gupta are freshmen at Yale University, and Mason Mitchell is a freshman at WPI.