February 3, 2026
12 min read
Computer Vision in Production: What We Learned Building for Careit
Putting computer vision into a real product taught us things the tutorials don't cover. Here's what we learned about food recognition, weight estimation, and shipping AI that real people rely on.
Careit is a food donation and rescue platform. Restaurants, grocery stores, and caterers use it to connect surplus food with local nonprofits instead of throwing it away. The core workflow is simple: a donor has food to give away, they post it, a nonprofit claims it, someone picks it up. Speed matters because food has a shelf life.
The bottleneck was data entry. A restaurant manager finishing a dinner rush doesn't want to type out "14 bread rolls, approximately 2kg, baked today, best before tomorrow." They want to take a photo and move on. So we built computer vision features that do three things: identify the food items, generate a text description, and estimate the weight. Here's what we learned.
Choosing the right approach
We evaluated three options: training a custom model from scratch, fine-tuning an existing image classification model, or using a multimodal LLM (like Claude or Gemini) with vision capabilities. For a project this size, training from scratch was out - you need tens of thousands of labeled images to get reasonable accuracy on food categories, and we didn't have that dataset.
We went with multimodal LLMs. The reasoning: food item identification is a task where general knowledge matters more than narrow specialization. A model that understands "these are dinner rolls in a hotel pan" because it's seen millions of food images during pretraining is going to outperform a small custom model trained on a few thousand photos from three restaurants. The tradeoff is cost per inference and latency, but for a single photo per donation, both are acceptable.
The three features
Item recognition
The donor takes a photo. We send it to the vision API with a structured prompt that asks for: item name, category (bakery, produce, prepared meals, dairy, etc.), estimated quantity, and condition. The prompt is specific about output format so we can parse the response reliably. We ask for JSON output and validate the schema before showing anything to the user.
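For context, the validation step is what keeps a malformed response from ever reaching the form. Here's a minimal sketch of the pattern, assuming zod for schema validation - the field names and categories are illustrative, not our exact production schema:

```typescript
import { z } from "zod";

// Illustrative schema -- the real category list is longer.
const ItemSchema = z.object({
  name: z.string(),
  category: z.enum(["bakery", "produce", "prepared_meals", "dairy", "other"]),
  quantity: z.string(), // e.g. "14 rolls" -- the model describes, it doesn't count precisely
  condition: z.string(),
});

const AnalysisSchema = z.object({
  items: z.array(ItemSchema).min(1),
});

// Parse the model's raw text and validate before anything touches the UI.
// If the model returned something off-schema, this throws and the caller
// falls back to manual entry.
function parseAnalysis(raw: string) {
  return AnalysisSchema.parse(JSON.parse(raw));
}
```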
Accuracy on clean, well-lit photos of single item types is very high - above 95% on the categories we tested. Mixed items (a box with rolls, fruit, and wrapped sandwiches) are harder, maybe 80-85%. We handle that by having the model list every item it sees and letting the donor confirm or edit.
Description generation
Once we know what the items are, we generate a natural-language description that the nonprofit will see. "14 dinner rolls, freshly baked, in a full-size hotel pan. Good condition." This replaces what the donor would have typed manually. We keep it factual and short - nonprofits need to know what they're picking up and how much space it takes, not a food review.
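As a rough sketch, the generation step is just a second, tightly constrained prompt over the structured items - the wording and the Item shape below are illustrative, not our production prompt:

```typescript
// Hypothetical prompt builder -- the production wording differs, but the
// constraint is the same: factual, short, logistics-oriented.
type Item = { name: string; quantity: string; container?: string; condition: string };

function descriptionPrompt(items: Item[]): string {
  return [
    "Write a one- to two-sentence factual description of this food",
    "donation for the nonprofit picking it up. Cover item, quantity,",
    "container, and condition. No marketing language.",
    "",
    `Items: ${JSON.stringify(items)}`,
  ].join("\n");
}
```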
Weight estimation
This was the hardest feature. Photos don't have depth information, so estimating weight from a 2D image requires indirect signals: the type of item (bread rolls weigh roughly X each), the visible quantity, the container size if recognizable (hotel pans, catering trays, and produce boxes have standard dimensions), and visual density.
We prompt the model with reference weights for common food items and ask it to estimate based on what it sees. The result is a rough estimate - no substitute for a scale, but accurate enough for a nonprofit to know whether to send a car or a van. We show it as "approximately 2-3 kg" rather than a false-precision "2.47 kg."
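To make that concrete, here's a sketch of how the reference weights and the range output fit together. The weights below are rough illustrations, not authoritative data:

```typescript
// Illustrative reference weights -- rough figures for the prompt, not a database.
const REFERENCE_WEIGHTS = [
  "dinner roll: 50-80 g each",
  "full-size hotel pan, filled: 4-6 kg",
  "standard produce box: 10-15 kg",
].join("\n");

const weightPrompt = `Estimate the total weight of the food in this photo.
Use these reference weights where they apply:
${REFERENCE_WEIGHTS}
Respond as JSON: {"min_kg": number, "max_kg": number}.
If you cannot estimate, use null for both fields.`;

// Always display a range -- a single figure like "2.47 kg" reads as
// precision the model doesn't have.
function formatWeight(minKg: number, maxKg: number): string {
  return `approximately ${minKg}-${maxKg} kg`;
}
```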
The edge cases that took the most time
- Low lighting in commercial kitchen walk-in coolers and storage areas
- Photos taken through plastic wrap or in sealed containers where the food is partially obscured
- Mixed donations in a single photo (half a shelf of different items)
- Prepared meals where the contents aren't obvious from the outside
- Blurry photos from phones with dirty lenses
We handle most of these by asking the model to flag low confidence, then falling back to asking the donor to confirm or retake the photo. The worst outcome isn't a wrong guess - it's a confidently wrong guess that the donor doesn't notice.
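In practice the fallback is a small piece of logic on top of a confidence field in the model's JSON output. A minimal sketch, assuming three confidence levels - the levels and thresholds are ours to tune, not a rule:

```typescript
// Sketch of the confidence fallback. The field name and levels are
// illustrative -- the point is that low confidence changes the UI flow
// instead of silently prefilling the form.
type Confidence = "high" | "medium" | "low";

function nextStep(confidence: Confidence): "prefill" | "confirm" | "retake" {
  switch (confidence) {
    case "high":   return "prefill"; // fill the form, donor can still edit
    case "medium": return "confirm"; // show results, require explicit confirmation
    case "low":    return "retake";  // ask for a better photo
  }
}
```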
Cost and latency
Each image analysis costs a few cents in API calls. For Careit's volume, that's not a meaningful expense. Latency is 2-4 seconds per photo, which is fine because the donor can review other fields while the AI processes. We show a loading indicator with "Analyzing photo..." and fill in the fields when the response arrives.
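The UI side is ordinary async React. A minimal sketch - analyzePhoto, DonationFields, and Spinner are hypothetical names standing in for the real components:

```typescript
import { useEffect, useState } from "react";

type Fields = { description: string; weight: string };

// Hypothetical stand-ins for the real API call and components.
declare function analyzePhoto(photo: File): Promise<Fields>;
declare function DonationFields(props: { values: Fields }): JSX.Element;
declare function Spinner(props: { label: string }): JSX.Element;

function PhotoAnalysis({ photo }: { photo: File }) {
  const [fields, setFields] = useState<Fields | null>(null);

  useEffect(() => {
    let cancelled = false;
    analyzePhoto(photo).then((result) => {
      if (!cancelled) setFields(result); // fill the fields when the response lands
    });
    return () => { cancelled = true; }; // ignore stale responses if the photo changes
  }, [photo]);

  return fields
    ? <DonationFields values={fields} />
    : <Spinner label="Analyzing photo..." />;
}
```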
What we'd do differently next time
- Build a feedback mechanism from day one. Let users correct the AI, and log those corrections (a sketch of the record follows this list). After a few hundred corrections, you have a dataset for evaluation and potential fine-tuning
- Test with real photos from real kitchens early, not stock photos. The difference in quality and lighting is significant
- Start with fewer categories and expand. Getting 10 food types right is better than getting 50 half-right
- Set expectations with the UI. "AI estimate - tap to edit" signals that the values are suggestions, not facts
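Here's what a correction record can look like - the shape and endpoint are hypothetical, but the idea is to store the model's output next to the donor's edit:

```typescript
// Illustrative correction record. Keeping both values turns everyday
// edits into an evaluation set (and, later, fine-tuning data).
type Correction = {
  photoId: string;
  field: "items" | "description" | "weight";
  modelValue: string; // what the AI suggested
  userValue: string;  // what the donor changed it to
  createdAt: string;  // ISO timestamp
};

async function logCorrection(c: Correction): Promise<void> {
  // Hypothetical endpoint -- any append-only store works.
  await fetch("/api/corrections", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(c),
  });
}
```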
The result
The feature turned a 2-minute data entry task into a 5-second photo. Donation volume went up because the friction of posting went down. Nonprofits get better item descriptions and more reliable weight estimates for pickup planning. The AI isn't perfect - but it's good enough that donors prefer it over typing, and that's the bar that matters for adoption.

Ben Arledge
CEO & CTO, CloudOwl
