Minions document OCR Parser with Gemma3 #26

AlienKevin · 2025-03-19T06:25:46Z

This PR implements an OCR parser for Minions with Gemma3. Currently, granite3.2-vision is used for OCR. Inspired by strong vision capability of the recent Gemma3 model and a streamlit demo showcasing the use of Gemma3 for OCR, I added a similar OCR parser to Minions. I copied app.py into minions_ocr.py and updated the extract_text_from_image function to use Gemma3 with a prompt instructing it to parse the image into structured markdown to be included in the Minions context.

I tested the new OCR parser locally on my macBook by selecting gemma3:4b as the local model and using the Minion protocol to summarize all items purchased on a restaurant receipt image. Gemma3 successfully summarized all the food items in a restaurant receipt.

Here's the receipt I uploaded:

And here's the output trace:

{
  "task": "summarize all items purchased",
  "context": "\n--------------------\n### Content from receipt.jpg:\n\nThe image depicts a receipt from \"The Tack Room,\" located at 145 Lincoln Road, Lincoln, MA 01773. The receipt is dated April 8, 2024, and the time of the transaction was 7:13 PM.\n\nHere's a detailed breakdown of the items ordered along with their respective prices:\n\n1. **BBQ Potato Chips** - $7.00\n2. **Diet Coke** - $3.00\n3. **Trillium Fort Point** - $10.00\n4. **Fried Chicken Sandwich** - $34.00\n5. **Famous Duck Grilled Cheese** - $25.00\n6. **Mac & Cheese** - $17.00\n7. **Burger of the moment** - $18.00\n\nThe subtotal for these items is calculated as follows:\n- BBQ Potato Chips: $7.00\n- Diet Coke: $3.00\n- Trillium Fort Point: $10.00\n- Fried Chicken Sandwich: $34.00\n- Famous Duck Grilled Cheese: $25.00\n- Mac & Cheese: $17.00\n- Burger of the moment: $18.00\n\nAdding these amounts together gives a subtotal of $114.00.\n\nNext, there is an administrative fee of $3.42 and a tax amount of $7.11, which brings the total to $124.53.\n\nThe receipt also includes a note at the bottom that reads: \"Add us on Instagram: Tackroomlincoln.\" This suggests that The Tack Room encourages customers to follow them on Instagram for updates or promotions.\n\n### Analysis and Description:\n\nThis receipt is from a dining establishment named \"The Tack Room,\" located in Lincoln, MA. The customer ordered several items including various types of sandwiches, chips, and cheese dishes. The total cost for these items amounts to $124.53, which includes the subtotal before taxes and fees.\n\nThe receipt also indicates that there is an administrative fee of $3.42 and a tax amount of $7.11, resulting in a final total of $124.53. This suggests that the customer paid for their meal including both the food items and additional charges such as fees and taxes.\n\nThe note at the bottom encourages customers to follow The Tack Room on Instagram, which is a common practice among businesses to engage with customers digitally and promote their brand through social media platforms.\n\n### Conclusion:\n\nThis receipt provides a clear record of the items ordered by a customer from \"The Tack Room\" along with the total cost. The inclusion of an administrative fee and tax amount indicates that this is a standard practice for businesses to calculate and display their final charges to customers. The note at the bottom promotes social media engagement, which is a common marketing strategy in today's digital age.",
  "conversation": [
    {
      "user": "remote",
      "prompt": "We need to perform the following task.\n\n### Task\nsummarize all items purchased\n\n### Instructions\nYou will not have direct access to the context, but can chat with a small language model which has read the entire thing.\n\nFeel free to think step-by-step, but eventually you must provide an output in the format below:\n\n```json\n{\n    \"message\": \"<your message to the small language model. If you are asking model to do a task, make sure it is a single task!>\"\n}\n```\n",
      "output": "\n\n\n{\n    \"message\": \"Please list all the items that were purchased.\"\n}"
    },
    {
      "user": "local",
      "prompt": "Please list all the items that were purchased.",
      "output": "Here are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00"
    },
    {
      "user": "remote",
      "prompt": "\nHere is the response from the small language model:\n\n### Response\nHere are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00\n\n\n### Instructions\nAnalyze the response and think-step-by-step to determine if you have enough information to answer the question.\n\nThink about:\n1. What information we have gathered\n2. Whether it is sufficient to answer the question\n3. If not sufficient, what specific information is missing\n4. If sufficient, how we would calculate or derive the answer\n\n",
      "output": "To determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```"
    },
    {
      "user": "remote",
      "prompt": "Here is the response after step-by-step thinking.\n\n### Response\nTo determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```\n\n### Instructions\nIf you have enough information or if the task is complete, write a final answer to fullfills the task. \n\n```json\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"<your answer>\"\n}\n```\n\nOtherwise, if the task is not complete, request the small language model to do additional work, by outputting the following:\n\n```json\n{\n    \"decision\": \"request_additional_info\",\n    \"message\": \"<your message to the small language model>\"\n}\n```\n\n",
      "output": "\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}"
    }
  ],
  "generated_final_answer": "The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00)."
}

For reference, here's the current behavior with Granite, which also correctly completed the task:

{
  "task": "summarize all items purchased",
  "context": "\n--------------------\n### Content from receipt.jpg:\n\nThe image depicts a receipt from \"The Tack Room,\" located at 145 Lincoln Road, Lincoln, MA 01773. The receipt is dated April 8, 2024, and the time of the transaction was 7:13 PM.\n\nHere's a detailed breakdown of the items ordered along with their respective prices:\n\n1. **BBQ Potato Chips** - $7.00\n2. **Diet Coke** - $3.00\n3. **Trillium Fort Point** - $10.00\n4. **Fried Chicken Sandwich** - $34.00\n5. **Famous Duck Grilled Cheese** - $25.00\n6. **Mac & Cheese** - $17.00\n7. **Burger of the moment** - $18.00\n\nThe subtotal for these items is calculated as follows:\n- BBQ Potato Chips: $7.00\n- Diet Coke: $3.00\n- Trillium Fort Point: $10.00\n- Fried Chicken Sandwich: $34.00\n- Famous Duck Grilled Cheese: $25.00\n- Mac & Cheese: $17.00\n- Burger of the moment: $18.00\n\nAdding these amounts together gives a subtotal of $114.00.\n\nNext, there is an administrative fee of $3.42 and a tax amount of $7.11, which brings the total to $124.53.\n\nThe receipt also includes a note at the bottom that reads: \"Add us on Instagram: Tackroomlincoln.\" This suggests that The Tack Room encourages customers to follow them on Instagram for updates or promotions.\n\n### Analysis and Description:\n\nThis receipt is from a dining establishment named \"The Tack Room,\" located in Lincoln, MA. The customer ordered several items including various types of sandwiches, chips, and cheese dishes. The total cost for these items amounts to $124.53, which includes the subtotal before taxes and fees.\n\nThe receipt also indicates that there is an administrative fee of $3.42 and a tax amount of $7.11, resulting in a final total of $124.53. This suggests that the customer paid for their meal including both the food items and additional charges such as fees and taxes.\n\nThe note at the bottom encourages customers to follow The Tack Room on Instagram, which is a common practice among businesses to engage with customers digitally and promote their brand through social media platforms.\n\n### Conclusion:\n\nThis receipt provides a clear record of the items ordered by a customer from \"The Tack Room\" along with the total cost. The inclusion of an administrative fee and tax amount indicates that this is a standard practice for businesses to calculate and display their final charges to customers. The note at the bottom promotes social media engagement, which is a common marketing strategy in today's digital age.",
  "conversation": [
    {
      "user": "remote",
      "prompt": "We need to perform the following task.\n\n### Task\nsummarize all items purchased\n\n### Instructions\nYou will not have direct access to the context, but can chat with a small language model which has read the entire thing.\n\nFeel free to think step-by-step, but eventually you must provide an output in the format below:\n\n```json\n{\n    \"message\": \"<your message to the small language model. If you are asking model to do a task, make sure it is a single task!>\"\n}\n```\n",
      "output": "\n\n\n{\n    \"message\": \"Please list all the items that were purchased.\"\n}"
    },
    {
      "user": "local",
      "prompt": "Please list all the items that were purchased.",
      "output": "Here are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00"
    },
    {
      "user": "remote",
      "prompt": "\nHere is the response from the small language model:\n\n### Response\nHere are the items listed from the receipt:\n\n1. BBQ Potato Chips - $7.00\n2. Diet Coke - $3.00\n3. Trillium Fort Point - $10.00\n4. Fried Chicken Sandwich - $34.00\n5. Famous Duck Grilled Cheese - $25.00\n6. Mac & Cheese - $17.00\n7. Burger of the moment - $18.00\n\n\n### Instructions\nAnalyze the response and think-step-by-step to determine if you have enough information to answer the question.\n\nThink about:\n1. What information we have gathered\n2. Whether it is sufficient to answer the question\n3. If not sufficient, what specific information is missing\n4. If sufficient, how we would calculate or derive the answer\n\n",
      "output": "To determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```"
    },
    {
      "user": "remote",
      "prompt": "Here is the response after step-by-step thinking.\n\n### Response\nTo determine if we have enough information to summarize all items purchased, let's analyze the response step-by-step:\n\n1. **What information we have gathered:**\n   - We have a list of items purchased along with their prices:\n     1. BBQ Potato Chips - $7.00\n     2. Diet Coke - $3.00\n     3. Trillium Fort Point - $10.00\n     4. Fried Chicken Sandwich - $34.00\n     5. Famous Duck Grilled Cheese - $25.00\n     6. Mac & Cheese - $17.00\n     7. Burger of the moment - $18.00\n\n2. **Whether it is sufficient to answer the question:**\n   - The task is to summarize all items purchased. The list provided includes all the items and their respective prices, which is sufficient to create a summary.\n\n3. **If not sufficient, what specific information is missing:**\n   - In this case, the information provided is sufficient. We have all the items and their prices, which allows us to summarize the purchases.\n\n4. **If sufficient, how we would calculate or derive the answer:**\n   - To summarize the items purchased, we can simply list them along with their prices. Additionally, we could calculate the total cost of all items if needed.\n\nSince we have all the necessary information, we can proceed to summarize the items purchased.\n\n```json\n{\n    \"message\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}\n```\n\n### Instructions\nIf you have enough information or if the task is complete, write a final answer to fullfills the task. \n\n```json\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"<your answer>\"\n}\n```\n\nOtherwise, if the task is not complete, request the small language model to do additional work, by outputting the following:\n\n```json\n{\n    \"decision\": \"request_additional_info\",\n    \"message\": \"<your message to the small language model>\"\n}\n```\n\n",
      "output": "\n{\n    \"decision\": \"provide_final_answer\", \n    \"answer\": \"The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00).\"\n}"
    }
  ],
  "generated_final_answer": "The items purchased are: BBQ Potato Chips ($7.00), Diet Coke ($3.00), Trillium Fort Point ($10.00), Fried Chicken Sandwich ($34.00), Famous Duck Grilled Cheese ($25.00), Mac & Cheese ($17.00), and Burger of the moment ($18.00)."
}

…ge; Refined prompt to ask for output in structured markdown

danbider

can you incorporate a more minimal change in the existing app.py instead of having a new app file?

thanks for the work here.

AlienKevin added 2 commits March 18, 2025 23:02

Copied app.py to minions_ocr.py

62deb2c

Switched from granite3.2-vision to gemma3:4b in extract_text_from_ima…

85d2195

…ge; Refined prompt to ask for output in structured markdown

danbider reviewed Mar 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minions document OCR Parser with Gemma3 #26

Minions document OCR Parser with Gemma3 #26

AlienKevin commented Mar 19, 2025 •

edited

Loading

danbider left a comment •

edited

Loading

Minions document OCR Parser with Gemma3 #26

Are you sure you want to change the base?

Minions document OCR Parser with Gemma3 #26

Conversation

AlienKevin commented Mar 19, 2025 • edited Loading

danbider left a comment • edited Loading

Choose a reason for hiding this comment

AlienKevin commented Mar 19, 2025 •

edited

Loading

danbider left a comment •

edited

Loading