February 3, 2026
12 min read
Computer Vision in Production: What We Learned Building for Careit
Putting computer vision into a real product taught us things the tutorials don't cover. Here's what we learned about food recognition, weight estimation, and shipping AI that real people rely on.
Careit is a food donation and rescue platform. Restaurants, grocery stores, and caterers use it to connect surplus food with local nonprofits instead of throwing it away. The core workflow is simple: a donor has food to give away, they post it, a nonprofit claims it, someone picks it up. Speed matters because food has a shelf life.
The bottleneck was data entry. A restaurant manager finishing a dinner rush doesn't want to type out "14 bread rolls, approximately 2kg, baked today, best before tomorrow." They want to take a photo and move on. So we built computer vision features that do three things: identify the food items, generate a text description, and estimate the weight. Here's what we learned.
Choosing the right approach
We evaluated three options: training a custom model from scratch, fine-tuning an existing image classification model, or using a multimodal LLM (like Claude or Gemini) with vision capabilities. For a project this size, training from scratch was out - you need tens of thousands of labeled images to get reasonable accuracy on food categories, and we didn't have that dataset.
We went with multimodal LLMs. The reasoning: food item identification is a task where general knowledge matters more than narrow specialization. A model that understands "these are dinner rolls in a hotel pan" because it's seen millions of food images during pretraining is going to outperform a small custom model trained on a few thousand photos from three restaurants. The tradeoff is cost per inference and latency, but for a single photo per donation, both are acceptable.
The three features
Item recognition
The donor takes a photo. We send it to the vision API with a structured prompt that asks for: item name, category (bakery, produce, prepared meals, dairy, etc.), estimated quantity, and condition. The prompt is specific about output format so we can parse the response reliably. We ask for JSON output and validate the schema before showing anything to the user.
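For context, the validation step is what keeps a malformed response from ever reaching the form. Here's a minimal sketch of the pattern, assuming zod for schema validation - the field names and categories are illustrative, not our exact production schema:

```typescript
import { z } from "zod";

// Illustrative schema -- the real category list is longer.
const ItemSchema = z.object({
  name: z.string(),
  category: z.enum(["bakery", "produce", "prepared_meals", "dairy", "other"]),
  quantity: z.string(), // e.g. "14 rolls" -- the model describes, it doesn't count precisely
  condition: z.string(),
});

const AnalysisSchema = z.object({
  items: z.array(ItemSchema).min(1),
});

// Parse the model's raw text and validate before anything touches the UI.
// If the model returned something off-schema, this throws and the caller
// falls back to manual entry.
function parseAnalysis(raw: string) {
  return AnalysisSchema.parse(JSON.parse(raw));
}
```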
Accuracy on clean, well-lit photos of single item types is very high - above 95% on the categories we tested. Mixed items (a box with rolls, fruit, and wrapped sandwiches) are harder, maybe 80-85%. We handle that by having the model list every item it sees and letting the donor confirm or edit.
Description generation
Once we know what the items are, we generate a natural-language description that the nonprofit will see. "14 dinner rolls, freshly baked, in a full-size hotel pan. Good condition." This replaces what the donor would have typed manually. We keep it factual and short - nonprofits need to know what they're picking up and how much space it takes, not a food review.
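As a rough sketch, the generation step is just a second, tightly constrained prompt over the structured items - the wording and the Item shape below are illustrative, not our production prompt:

```typescript
// Hypothetical prompt builder -- the production wording differs, but the
// constraint is the same: factual, short, logistics-oriented.
type Item = { name: string; quantity: string; container?: string; condition: string };

function descriptionPrompt(items: Item[]): string {
  return [
    "Write a one- to two-sentence factual description of this food",
    "donation for the nonprofit picking it up. Cover item, quantity,",
    "container, and condition. No marketing language.",
    "",
    `Items: ${JSON.stringify(items)}`,
  ].join("\n");
}
```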
Weight estimation
This was the hardest feature. Photos don't have depth information, so estimating weight from a 2D image requires indirect signals: the type of item (bread rolls weigh roughly X each), the visible quantity, the container size if recognizable (hotel pans, catering trays, and produce boxes have standard dimensions), and visual density.
We prompt the model with reference weights for common food items and ask it to estimate based on what it sees. The result is a rough estimate - no substitute for a scale, but accurate enough for a nonprofit to know whether to send a car or a van. We show it as "approximately 2-3 kg" rather than a false-precision "2.47 kg."
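To make that concrete, here's a sketch of how the reference weights and the range output fit together. The weights below are rough illustrations, not authoritative data:

```typescript
// Illustrative reference weights -- rough figures for the prompt, not a database.
const REFERENCE_WEIGHTS = [
  "dinner roll: 50-80 g each",
  "full-size hotel pan, filled: 4-6 kg",
  "standard produce box: 10-15 kg",
].join("\n");

const weightPrompt = `Estimate the total weight of the food in this photo.
Use these reference weights where they apply:
${REFERENCE_WEIGHTS}
Respond as JSON: {"min_kg": number, "max_kg": number}.
If you cannot estimate, use null for both fields.`;

// Always display a range -- a single figure like "2.47 kg" reads as
// precision the model doesn't have.
function formatWeight(minKg: number, maxKg: number): string {
  return `approximately ${minKg}-${maxKg} kg`;
}
```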
The edge cases that took the most time
- Low lighting in commercial kitchen walk-in coolers and storage areas
- Photos taken through plastic wrap or in sealed containers where the food is partially obscured
- Mixed donations in a single photo (half a shelf of different items)
- Prepared meals where the contents aren't obvious from the outside
- Blurry photos from phones with dirty lenses
We handle most of these by asking the model to flag low confidence, then falling back to asking the donor to confirm or retake the photo. The worst outcome isn't a wrong guess - it's a confidently wrong guess that the donor doesn't notice.
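In practice the fallback is a small piece of logic on top of a confidence field in the model's JSON output. A minimal sketch, assuming three confidence levels - the levels and thresholds are ours to tune, not a rule:

```typescript
// Sketch of the confidence fallback. The field name and levels are
// illustrative -- the point is that low confidence changes the UI flow
// instead of silently prefilling the form.
type Confidence = "high" | "medium" | "low";

function nextStep(confidence: Confidence): "prefill" | "confirm" | "retake" {
  switch (confidence) {
    case "high":   return "prefill"; // fill the form, donor can still edit
    case "medium": return "confirm"; // show results, require explicit confirmation
    case "low":    return "retake";  // ask for a better photo
  }
}
```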
Cost and latency
Each image analysis costs a few cents in API calls. For Careit's volume, that's not a meaningful expense. Latency is 2-4 seconds per photo, which is fine because the donor can review other fields while the AI processes. We show a loading indicator with "Analyzing photo..." and fill in the fields when the response arrives.
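The UI side is ordinary async React. A minimal sketch - analyzePhoto, DonationFields, and Spinner are hypothetical names standing in for the real components:

```typescript
import { useEffect, useState } from "react";

type Fields = { description: string; weight: string };

// Hypothetical stand-ins for the real API call and components.
declare function analyzePhoto(photo: File): Promise<Fields>;
declare function DonationFields(props: { values: Fields }): JSX.Element;
declare function Spinner(props: { label: string }): JSX.Element;

function PhotoAnalysis({ photo }: { photo: File }) {
  const [fields, setFields] = useState<Fields | null>(null);

  useEffect(() => {
    let cancelled = false;
    analyzePhoto(photo).then((result) => {
      if (!cancelled) setFields(result); // fill the fields when the response lands
    });
    return () => { cancelled = true; }; // ignore stale responses if the photo changes
  }, [photo]);

  return fields
    ? <DonationFields values={fields} />
    : <Spinner label="Analyzing photo..." />;
}
```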
What we'd do differently next time
- Build a feedback mechanism from day one. Let users correct the AI, and log those corrections (a sketch of the record follows this list). After a few hundred corrections, you have a dataset for evaluation and potential fine-tuning
- Test with real photos from real kitchens early, not stock photos. The difference in quality and lighting is significant
- Start with fewer categories and expand. Getting 10 food types right is better than getting 50 half-right
- Set expectations with the UI. "AI estimate - tap to edit" signals that the values are suggestions, not facts
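Here's what a correction record can look like - the shape and endpoint are hypothetical, but the idea is to store the model's output next to the donor's edit:

```typescript
// Illustrative correction record. Keeping both values turns everyday
// edits into an evaluation set (and, later, fine-tuning data).
type Correction = {
  photoId: string;
  field: "items" | "description" | "weight";
  modelValue: string; // what the AI suggested
  userValue: string;  // what the donor changed it to
  createdAt: string;  // ISO timestamp
};

async function logCorrection(c: Correction): Promise<void> {
  // Hypothetical endpoint -- any append-only store works.
  await fetch("/api/corrections", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(c),
  });
}
```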
The result
The feature turned a 2-minute data entry task into a 5-second photo. Donation volume went up because the friction of posting went down. Nonprofits get better item descriptions and more reliable weight estimates for pickup planning. The AI isn't perfect - but it's good enough that donors prefer it over typing, and that's the bar that matters for adoption.

Ben Arledge
CEO & CTO, CloudOwl
