Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minions document OCR Parser with Gemma3 #26

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

AlienKevin
Copy link

@AlienKevin AlienKevin commented Mar 19, 2025

This PR implements an OCR parser for Minions with Gemma3. Currently, granite3.2-vision is used for OCR. Inspired by strong vision capability of the recent Gemma3 model and a streamlit demo showcasing the use of Gemma3 for OCR, I added a similar OCR parser to Minions. I copied app.py into minions_ocr.py and updated the extract_text_from_image function to use Gemma3 with a prompt instructing it to parse the image into structured markdown to be included in the Minions context.

I tested the new OCR parser locally on my macBook by selecting gemma3:4b as the local model and using the Minion protocol to summarize all items purchased on a restaurant receipt image. Gemma3 successfully summarized all the food items in a restaurant receipt.

Here's the receipt I uploaded:
receipt

And here's the output trace:

{
  "task": "summarize all items purchased",
  "context": "\n--------------------\n### Content from receipt.jpg:\n\nThe image depicts a receipt from \"The Tack Room,\" located at 145 Lincoln Road, Lincoln, MA 01773. The receipt is dated April 8, 2024, and the time of the transaction was 7:13 PM.\n\nHere's a detailed breakdown of the items ordered along with their respective prices:\n\n1. **BBQ Potato Chips** - $7.00\n2. **Diet Coke** - $3.00\n3. **Trillium Fort Point** - $10.00\n4. **Fried Chicken Sandwich** - $34.00\n5. **Famous Duck Grilled Cheese** - $25.00\n6. **Mac & Cheese** - $17.00\n7. **Burger of the moment** - $18.00\n\nThe subtotal for these items is calculated as follows:\n- BBQ Potato Chips: $7.00\n- Diet Coke: $3.00\n- Trillium Fort Point: $10.00\n- Fried Chicken Sandwich: $34.00\n- Famous Duck Grilled Cheese: $25.00\n- Mac & Cheese: $17.00\n- Burger of the moment: $18.00\n\nAdding these amounts together gives a subtotal of $114.00.\n\nNext, there is an administrative fee of $3.42 and a tax amount of $7.11, which brings the total to $124.53.\n\nThe receipt also includes a note at the bottom that reads: \"Add us on Instagram: Tackroomlincoln.\" This suggests that The Tack Room encourages customers to follow them on Instagram for updates or promotions.\n\n### Analysis and Description:\n\nThis receipt is from a dining establishment named \"The Tack Room,\" located in Lincoln, MA. The customer ordered several items including various types of sandwiches, chips, and cheese dishes. The total cost for these items amounts to $124.53, which includes the subtotal before taxes and fees.\n\nThe receipt also indicates that there is an administrative fee of $3.42 and a tax amount of $7.11, resulting in a final total of $124.53. This suggests that the customer paid for their meal including both the food items and additional charges such as fees and taxes.\n\nThe note at the bottom encourages customers to follow The Tack Room on Instagram, which is a common practice among businesses to engage with customers digitally and promote their brand through social media platforms.\n\n### Conclusion:\n\nThis receipt provides a clear record of the items ordered by a customer from \"The Tack Room\" along with the total cost. The inclusion of an administrative fee and tax amount indicates that this is a standard practice for businesses to calculate and display their final charges to customers. The note at the bottom promotes social media engagement, which is a common marketing strategy in today's digital age.",
  "conversation": [
    {
      "user": "remote",
      "prompt": "We need to perform the following task.\n\n### Task\nsummarize all items purchased\n\n### Instructions\nYou will not have direct access to the context, but can chat with a small language model which has read the entire thing.\n\nFeel free to think step-by-step, but eventually you must provide an output in the format below:\n\n```json\n{\n    \"message\": \"<your message to the small language model. If you are asking model to do a task, make sure it is a single task!>\"\n}\n```\n",
      "output": "\n\n\n{\n    \"message\": \"Please list all the items that were purchased.\"\n}"
    },
    {
      "user": "local",
      "prompt": "Please list all the items that were purchased.",
      "output": "Here are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00"
    },
    {
      "user": "remote",
      "prompt": "\nHere is the response from the small language model:\n\n### Response\nHere are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00\n\n\n### Instructions\nAnalyze the response and think-step-by-step to determine if you have enough information to answer the question.\n\nThink about:\n1. What information we have gathered\n2. Whether it is sufficient to answer the question\n3. If not sufficient, what specific information is missing\n4. If sufficient, how we would calculate or derive the answer\n\n",
      "output": "To determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```"
    },
    {
      "user": "remote",
      "prompt": "Here is the response after step-by-step thinking.\n\n### Response\nTo determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```\n\n### Instructions\nIf you have enough information or if the task is complete, write a final answer to fullfills the task. \n\n```json\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"<your answer>\"\n}\n```\n\nOtherwise, if the task is not complete, request the small language model to do additional work, by outputting the following:\n\n```json\n{\n    \"decision\": \"request_additional_info\",\n    \"message\": \"<your message to the small language model>\"\n}\n```\n\n",
      "output": "\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}"
    }
  ],
  "generated_final_answer": "The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00)."
}

For reference, here's the current behavior with Granite, which also correctly completed the task:

{
  "task": "summarize all items purchased",
  "context": "\n--------------------\n### Content from receipt.jpg:\n\nThe image depicts a receipt from \"The Tack Room,\" located at 145 Lincoln Road, Lincoln, MA 01773. The receipt is dated April 8, 2024, and the time of the transaction was 7:13 PM.\n\nHere's a detailed breakdown of the items ordered along with their respective prices:\n\n1. **BBQ Potato Chips** - $7.00\n2. **Diet Coke** - $3.00\n3. **Trillium Fort Point** - $10.00\n4. **Fried Chicken Sandwich** - $34.00\n5. **Famous Duck Grilled Cheese** - $25.00\n6. **Mac & Cheese** - $17.00\n7. **Burger of the moment** - $18.00\n\nThe subtotal for these items is calculated as follows:\n- BBQ Potato Chips: $7.00\n- Diet Coke: $3.00\n- Trillium Fort Point: $10.00\n- Fried Chicken Sandwich: $34.00\n- Famous Duck Grilled Cheese: $25.00\n- Mac & Cheese: $17.00\n- Burger of the moment: $18.00\n\nAdding these amounts together gives a subtotal of $114.00.\n\nNext, there is an administrative fee of $3.42 and a tax amount of $7.11, which brings the total to $124.53.\n\nThe receipt also includes a note at the bottom that reads: \"Add us on Instagram: Tackroomlincoln.\" This suggests that The Tack Room encourages customers to follow them on Instagram for updates or promotions.\n\n### Analysis and Description:\n\nThis receipt is from a dining establishment named \"The Tack Room,\" located in Lincoln, MA. The customer ordered several items including various types of sandwiches, chips, and cheese dishes. The total cost for these items amounts to $124.53, which includes the subtotal before taxes and fees.\n\nThe receipt also indicates that there is an administrative fee of $3.42 and a tax amount of $7.11, resulting in a final total of $124.53. This suggests that the customer paid for their meal including both the food items and additional charges such as fees and taxes.\n\nThe note at the bottom encourages customers to follow The Tack Room on Instagram, which is a common practice among businesses to engage with customers digitally and promote their brand through social media platforms.\n\n### Conclusion:\n\nThis receipt provides a clear record of the items ordered by a customer from \"The Tack Room\" along with the total cost. The inclusion of an administrative fee and tax amount indicates that this is a standard practice for businesses to calculate and display their final charges to customers. The note at the bottom promotes social media engagement, which is a common marketing strategy in today's digital age.",
  "conversation": [
    {
      "user": "remote",
      "prompt": "We need to perform the following task.\n\n### Task\nsummarize all items purchased\n\n### Instructions\nYou will not have direct access to the context, but can chat with a small language model which has read the entire thing.\n\nFeel free to think step-by-step, but eventually you must provide an output in the format below:\n\n```json\n{\n    \"message\": \"<your message to the small language model. If you are asking model to do a task, make sure it is a single task!>\"\n}\n```\n",
      "output": "\n\n\n{\n    \"message\": \"Please list all the items that were purchased.\"\n}"
    },
    {
      "user": "local",
      "prompt": "Please list all the items that were purchased.",
      "output": "Here are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00"
    },
    {
      "user": "remote",
      "prompt": "\nHere is the response from the small language model:\n\n### Response\nHere are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00\n\n\n### Instructions\nAnalyze the response and think-step-by-step to determine if you have enough information to answer the question.\n\nThink about:\n1. What information we have gathered\n2. Whether it is sufficient to answer the question\n3. If not sufficient, what specific information is missing\n4. If sufficient, how we would calculate or derive the answer\n\n",
      "output": "To determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```"
    },
    {
      "user": "remote",
      "prompt": "Here is the response after step-by-step thinking.\n\n### Response\nTo determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```\n\n### Instructions\nIf you have enough information or if the task is complete, write a final answer to fullfills the task. \n\n```json\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"<your answer>\"\n}\n```\n\nOtherwise, if the task is not complete, request the small language model to do additional work, by outputting the following:\n\n```json\n{\n    \"decision\": \"request_additional_info\",\n    \"message\": \"<your message to the small language model>\"\n}\n```\n\n",
      "output": "\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}"
    }
  ],
  "generated_final_answer": "The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00)."
}

Copy link
Collaborator

@danbider danbider left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you incorporate a more minimal change in the existing app.py instead of having a new app file?

thanks for the work here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants