The digital wave has transformed how we shop, bringing convenience to our fingertips. Yet, for something as essential and frequent as grocery shopping, the experience on most apps often still feels… cumbersome. Scrolling through endless lists, typing out items, filtering options – it’s functional, but is it truly intuitive?

At our startup, which works in the quick-commerce space and empowers Kirana stores, we looked at the heart of how people shop for groceries in India: often through conversations with their trusted local Kirana store owner, or simply by listing out needs from memory. This natural, conversational approach is efficient and deeply ingrained. We asked: what if we could bring that same level of effortless interaction to the digital world?

This is why Voice-to-Cart isn’t just a feature at our startup; it’s the very core of our user experience and our unique selling proposition (USP). It’s not an alternative way to search; it’s the primary way you interact with the app to build your basket.

Why Voice is More Than Just a Feature for Our Startup

Think about your typical grocery run (digital or physical). You often have a list in your head or scribbled down. You know you need milk, eggs, bread, maybe some onions and tomatoes. On a traditional app, you open it, find the search bar, type “milk,” add it, go back, type “eggs,” add them, and so on. It’s a series of discrete, deliberate actions.

Now, imagine saying: “Hey AI, add 2 liters milk, one dozen eggs, a loaf of bread, half a kilo onions, and 1kg tomatoes.”

In one go, you’ve added five different items with specific quantities. This isn’t just faster; it mirrors how you’d naturally communicate your needs. For our startup, which connects users to their local Kirana stores, this conversational interface feels like talking to your familiar shopkeeper, who understands your needs even with simple, natural language.
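
To make the idea concrete, here is a minimal, rule-based sketch of how a single spoken sentence could be split into separate cart lines. It is an illustration only, not our production parser: the names `CartLine` and `parse_utterance`, the tiny vocabularies, and the "Hey AI, add" prefix handling are all assumptions made for this example. A real system would rely on trained models rather than hand-written rules.

```python
import re
from dataclasses import dataclass

# Illustrative vocabularies (assumptions); a real system would learn these from data.
NUMBER_WORDS = {"one": 1.0, "two": 2.0, "half": 0.5}
UNITS = {
    "liter": "l", "liters": "l", "litre": "l", "litres": "l",
    "kilo": "kg", "kilos": "kg", "kg": "kg",
    "dozen": "dozen", "loaf": "loaf", "loaves": "loaf",
}
FILLERS = {"a", "an", "of", "and"}


@dataclass
class CartLine:
    item: str
    quantity: float
    unit: str


def parse_utterance(text: str) -> list[CartLine]:
    """Split one spoken sentence into cart lines using simple rules."""
    # Drop the wake phrase and the verb, keeping only the item list.
    body = re.sub(r"^\s*hey ai,?\s*add\s+", "", text.strip().lower()).rstrip(".")
    lines = []
    # Items are separated by commas and/or the word "and".
    for chunk in re.split(r",\s*(?:and\s+)?|\s+and\s+", body):
        # Split fused tokens such as "1kg" into "1 kg".
        tokens = re.sub(r"(\d)([a-z])", r"\1 \2", chunk).split()
        quantity, unit, name = 1.0, "unit", []
        for tok in tokens:
            if re.fullmatch(r"\d+(\.\d+)?", tok):
                quantity = float(tok)
            elif tok in UNITS:
                unit = UNITS[tok]
            elif tok in NUMBER_WORDS and not name:
                quantity = NUMBER_WORDS[tok]
            elif tok not in FILLERS:
                name.append(tok)
        if name:
            lines.append(CartLine(" ".join(name), quantity, unit))
    return lines


sentence = ("Hey AI, add 2 liters milk, one dozen eggs, a loaf of bread, "
            "half a kilo onions, and 1kg tomatoes.")
for line in parse_utterance(sentence):
    print(line)
```

Run on the sentence above, the sketch yields five cart lines, one per item, each with its own quantity and unit.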

This shift from a visual, tap-based interface to a voice-first interaction for cart building is fundamental because:

  • It’s Natural: Voice is the most intuitive form of human communication. We think and speak our needs more readily than we type them.
  • It’s Faster for Lists: Adding multiple items in one go is significantly quicker than searching and adding each item individually via text input.
  • It’s Hands-Free: Useful when you’re cooking, have your hands full, or have accessibility needs.
  • It Handles Complexity Organically: Describing ingredients for a meal (“ingredients for chicken biryani for 4”) or a general need (“regular veggies for a week”) is easier and more natural than breaking it down into individual search queries.
  • It Bridges the Digital Divide: For users less comfortable with typing or navigating complex menus, voice offers a more accessible entry point into online grocery shopping.

By making voice the primary mode of interaction for cart building, we fundamentally change the user experience from a “browse and tap” model to a “speak and add” model. This is not merely a convenience; it’s the core engine driving efficiency and ease of use on our platform.

The AI Engine: Powering Understanding with Open Source

Building a voice interface that truly understands the nuances of Indian users requires sophisticated technology. Basic speech-to-text is not enough; the system needs to grasp the meaning and intent behind the words, especially in a country with diverse languages and unique linguistic habits.

This is where our AI engine, powered by open-source Large Language Models (LLMs) and Automatic Speech Recognition (ASR) models, comes into play.

The process begins when a user speaks:

  1. Automatic Speech Recognition (ASR): The user’s spoken words are converted into text. This is the first critical step. Generic ASR models trained on standard datasets often struggle with Indian accents, local colloquialisms, and code-switching (mixing English words with a local language).
  2. Natural Language Processing (NLP) / Natural Language Understanding (NLU): The text output from the ASR is then processed to understand the user’s intent (e.g., “add item,” “ask for recommendations,” “track order”), identify key entities (e.g., product names like “sugar,” “milk,” quantities like “1 kg,” “2 liters,” actions like “add,” “remove”), and understand the relationships between these entities.
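
The two stages above can be pictured as two functions chained together. The sketch below is a simplified illustration under stated assumptions: `transcribe` is a stand-in that treats its input as UTF-8 text instead of calling a real speech model, and `understand` uses a small hypothetical phrase table (`INTENT_PHRASES`) where a fine-tuned language model would sit in practice.

```python
import re
from dataclasses import dataclass

# Hypothetical trigger phrases standing in for a trained intent classifier.
INTENT_PHRASES = {
    "add_item": ("add", "daal do", "ghala", "venum"),
    "remove_item": ("remove", "delete"),
    "track_order": ("track", "where is my order"),
}


@dataclass
class Understanding:
    intent: str
    transcript: str


def transcribe(audio: bytes) -> str:
    """Stage 1 (ASR) placeholder: a real system would run a speech model here."""
    # Assumption for this sketch: the "audio" is already UTF-8 encoded text.
    return audio.decode("utf-8")


def understand(transcript: str) -> Understanding:
    """Stage 2 (NLU) placeholder: whole-word phrase matching for the intent."""
    lowered = transcript.lower()
    for intent, phrases in INTENT_PHRASES.items():
        for phrase in phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                return Understanding(intent, transcript)
    return Understanding("unknown", transcript)


def handle(audio: bytes) -> Understanding:
    """Chain the two stages: speech in, structured understanding out."""
    return understand(transcribe(audio))


print(handle("ek kilo chawal daal do".encode("utf-8")))
```

Keeping the stages separate means either one can be retrained or swapped without touching the other, which matters when ASR errors and NLU errors need to be diagnosed independently.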

Why Open Source?

We deliberately chose to build upon open-source LLMs and ASR models for several key reasons:

  • Flexibility and Customization: Open-source models provide the underlying architecture that we can fine-tune and adapt precisely to the Indian context. This is crucial for handling the linguistic diversity and specific grocery-related terminology.
  • Handling Local Nuances: This is paramount. Standard models often don’t understand terms like:
    • Quantities: “adha” (आधा - half), “pav” or “pavu” (पाव - quarter), “paun” (पाऊण - three-quarters), “muthi” (मुठी - handful), “dabba” (डब्बा - container, implying a standard pack size), “sher” (शेर - a historical unit, roughly 1kg, still used conversationally).
    • Product Variations: “regular tea powder,” “local atta,” “unpackaged sugar,” “nattu kozhi muttai” (நாட்டு கோழி முட்டை - country chicken egg in Tamil), “jada avalakki” (ಜಾಡ ಅವಲಕ್ಕಿ - thick poha in Kannada).
    • Mixed Languages: “Add some atta and 2 packets of biscuits please,” “Butter venum oru packet” (Butter is needed one packet - Tamil/English mix).
    By training open-source models on large, curated datasets containing these specific terms and language patterns from different regions of India, we can achieve a level of understanding that off-the-shelf solutions cannot match.
  • Cost-Effectiveness: Leveraging open-source technology allows us to build a powerful, customized solution without the prohibitive costs associated with proprietary, general-purpose AI systems.
  • Innovation Control: We have direct control over the AI’s development roadmap, allowing us to continuously improve its understanding based on user interactions and evolving language trends.
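
As a rough illustration of what handling local quantity terms involves, the sketch below converts a few colloquial expressions into grams. The lookup tables are assumptions made for this example and follow the glosses used in this article (for instance, “sher” treated as roughly 1 kg and a bare “paavu” read as a quarter kilo); `to_grams` is a hypothetical helper, and real usage varies by region and product.

```python
from typing import Optional

# Assumed lookup tables for this sketch; values follow the glosses in this article.
NUMBER_WORDS = {"ek": 1.0, "oka": 1.0, "one": 1.0, "two": 2.0}
FRACTION_WORDS = {"adha": 0.5, "pav": 0.25, "pavu": 0.25, "paavu": 0.25}
UNIT_GRAMS = {"kilo": 1000, "kg": 1000, "gram": 1, "grams": 1}
COMPOUND_GRAMS = {"pavsher": 250, "sher": 1000}


def to_grams(phrase: str) -> Optional[int]:
    """Convert a quantity phrase to grams, or return None if it is too vague."""
    multiplier = 1.0
    saw_fraction = False
    for tok in phrase.lower().split():
        if tok in COMPOUND_GRAMS:
            return round(multiplier * COMPOUND_GRAMS[tok])
        if tok in UNIT_GRAMS:
            return round(multiplier * UNIT_GRAMS[tok])
        if tok in FRACTION_WORDS:
            multiplier *= FRACTION_WORDS[tok]
            saw_fraction = True
        elif tok in NUMBER_WORDS:
            multiplier *= NUMBER_WORDS[tok]
        elif tok.replace(".", "", 1).isdigit():
            multiplier *= float(tok)
    # A fraction word with no explicit unit is assumed to mean a fraction of a kilo.
    return round(multiplier * 1000) if saw_fraction else None


for phrase in ("adha kilo", "pavsher", "oka paavu", "2 kg", "konjam"):
    print(phrase, "->", to_grams(phrase))
```

Returning `None` for a vague term such as “konjam” (some) is a deliberate choice in this sketch: it gives the rest of the system a clear signal to ask the user a follow-up question instead of guessing.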

Our multi-faceted approach goes beyond just training models. It involves:

  • Data Collection & Annotation: Gathering vast amounts of voice data from diverse Indian speakers and carefully annotating it with product names, quantities, intents, and local terms.
  • Grocery-Specific Ontologies & Knowledge Graphs: Building a deep understanding of grocery items, their categories, common variations, relationships (e.g., ‘biryani masala’ is related to ‘chicken’ and ‘rice’), and common groupings for meals or purposes (e.g., “cleaning supplies,” “basic breakfast items”).
  • Contextual Understanding: Enabling the AI to remember previous turns in the conversation and use that context to interpret subsequent requests accurately.
  • Feedback Loops: Continuously using user interactions (successful orders, clarifications needed, errors) to retrain and improve the models.
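
To show what contextual understanding means in its simplest form, here is a small sketch of a session object that remembers the most recently mentioned item, so a follow-up such as “make it 2 kg” can be applied to the right line. `CartSession` and its methods are hypothetical names for this example; a real conversational system would track far richer state.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CartSession:
    """Keeps the cart plus just enough context to resolve a follow-up like "make it 2 kg"."""
    cart: dict = field(default_factory=dict)  # item name -> quantity in grams
    last_item: Optional[str] = None

    def add(self, item: str, grams: int) -> None:
        """Add an item and remember it as the most recent one mentioned."""
        self.cart[item] = self.cart.get(item, 0) + grams
        self.last_item = item

    def update_last(self, grams: int) -> bool:
        """Set a new quantity for the most recent item; False if there is none."""
        if self.last_item is None:
            return False  # nothing to refer back to, so the app should ask
        self.cart[self.last_item] = grams
        return True


session = CartSession()
session.add("rice", 1000)
session.add("sugar", 500)
session.update_last(2000)  # "make it 2 kg" refers to sugar, the latest item
print(session.cart)
```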

Examples of Voice-to-Cart in Action: Understanding Every Indian Voice

Let’s see how our startup’s AI handles different voice commands, showcasing its understanding across languages and its ability to interpret both specific lists and more open-ended needs.

Example 1: Simple Addition (Hindi)

  • User: “हे एआई, एक किलो चावल डाल दो” (Hey AI, ek kilo chawal daal do - Hey AI, add one kilo rice.)
  • System Process: ASR -> “ek kilo chawal daal do”. NLP identifies intent (add), item (chawal/rice), quantity (ek kilo/one kilo).
  • AI Response (Voice): “ठीक है, एक किलो चावल आपके कार्ट में डाल दिया है।” (Theek hai, ek kilo chawal aapke cart mein daal diya hai - Okay, added one kilo rice to your cart.)

Example 2: Local Quantity (Marathi)

  • User: “आधा किलो साखर आणि पावशेर डाळ घाला।” (Adha kilo sakhar ani pavsher daal ghala - Add half a kilo sugar and a quarter kilo dal.)
  • System Process: ASR -> “adha kilo sakhar ani pavsher daal ghala”. NLP identifies intent (add), items (sakhar/sugar, daal/dal), quantities (adha kilo/half kilo, pavsher/quarter kilo - understanding ‘pavsher’ as a common local unit).
  • AI Response (Voice): “जी, आधा किलो साखर आणि पावशेर डाळ कार्टमध्ये ॲड केली आहे।” (Ji, adha kilo sakhar ani pavsher daal cart madhye add keli aahe - Yes, half a kilo sugar and a quarter kilo dal have been added to the cart.)

Example 3: Recipe/Meal Request (English + Item Listing)

  • User: “I want to make sambar today. Add the main ingredients for sambar.”
  • System Process: ASR -> “I want to make sambar today. Add the main ingredients for sambar.” NLP identifies intent (add ingredients for recipe), recipe (sambar), scope (main ingredients). System accesses knowledge base for sambar ingredients.
  • AI Response (Voice): “Okay, adding toor dal, tamarind paste, onions, tomatoes, and sambar powder to your cart. Do you need any vegetables like drumsticks or okra?”
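
The knowledge-base lookup in this example can be sketched as a simple dictionary query. The `RECIPES` table and the `expand_recipe` helper below are hypothetical and hold only the sambar entry from the exchange above; an actual knowledge graph would be much larger and would link items to catalog products.

```python
# Hypothetical knowledge base: each recipe lists main and optional ingredients.
RECIPES = {
    "sambar": {
        "main": ["toor dal", "tamarind paste", "onions", "tomatoes", "sambar powder"],
        "optional": ["drumsticks", "okra"],
    },
}


def expand_recipe(name: str) -> tuple[list[str], str]:
    """Return the items to add and a follow-up question for a recipe request."""
    recipe = RECIPES.get(name.lower())
    if recipe is None:
        return [], f"Sorry, I don't know the ingredients for {name} yet."
    follow_up = ("Do you need any vegetables like "
                 + " or ".join(recipe["optional"]) + "?")
    return list(recipe["main"]), follow_up


items, question = expand_recipe("Sambar")
print(items)
print(question)
```

Splitting ingredients into main and optional groups is what lets the assistant add the essentials right away while still asking about items that depend on personal taste.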

Example 4: Clarification Needed (Tamil)

  • User: “கொஞ்சம் எண்ணெய் சேருங்கள்।” (Konjam ennai serungal - Add some oil.)
  • System Process: ASR -> “konjam ennai serungal”. NLP identifies intent (add), item (ennai/oil), quantity (konjam/some - vague). Recognizes ambiguity (cooking oil, gingelly oil, etc.).
  • AI Response (Voice): “எந்த வகையான எண்ணெய்? சமையல் எண்ணெயா, அல்லது நல்லெண்ணையா?” (Endha vagaiyaana ennai? Samayal ennai-ah, alladhu nallenna-ah? - Which type of oil? Cooking oil, or gingelly oil?)
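
The ambiguity check in this example boils down to counting how many catalog variants match the spoken term. In the sketch below, `VARIANTS` and `resolve_item` are hypothetical, and the two oil variants simply mirror the options offered in the response above.

```python
# Hypothetical mapping from a generic spoken term to specific catalog variants.
VARIANTS = {
    "ennai": ["samayal ennai", "nallennai"],  # cooking oil, gingelly oil
    "bellam": ["bellam"],                     # jaggery: a single match
}


def resolve_item(term: str) -> dict:
    """Decide whether to add directly, ask a clarifying question, or report no match."""
    candidates = VARIANTS.get(term.lower(), [])
    if len(candidates) == 1:
        return {"action": "add", "item": candidates[0]}
    if len(candidates) > 1:
        return {"action": "clarify", "options": candidates}
    return {"action": "not_found", "term": term}


print(resolve_item("ennai"))   # two variants, so the assistant should ask
print(resolve_item("bellam"))  # one variant, so it can be added directly
```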

Example 5: Specific Item with Local Term (Telugu)

  • User: “ఒక పావు బెల్లం కావాలి।” (oka paavu bellam kaavali - I need one paavu jaggery.)
  • System Process: ASR -> “oka paavu bellam kaavali”. NLP identifies intent (need/add), item (bellam/jaggery), quantity (oka paavu/one paavu - understanding ‘paavu’ as 250g in this context).
  • AI Response (Voice): “సరే, పావు కిలో బెల్లం మీ కార్ట్‌లో చేర్చబడింది।” (Sarē, pāvu kilō bellam mī kārṭlō cērcabaḍindi - Okay, quarter kilo jaggery has been added to your cart.)

Example 6: Open-Ended Scenario - Meal (Kannada)

  • User: “ನಾನು ನಾಷ್ಟಾಕ್ಕೆ ಏನಾದ್ರೂ ಮಾಡಬೇಕು. ಎರಡರಿಂದ ಮೂರು ದಿನಕ್ಕೆ ಆಗುವಷ್ಟು ಸಾಮಾನು ಸೇರಿಸಿ।” (Naanu nashtaakke enaadru maadabeku. Eradinda mooru dinakke aaguvashtu saamaanu serisi - I need to make something for breakfast. Add items sufficient for two to three days.)
  • System Process: ASR -> “Naanu nashtaakke enaadru maadabeku. Eradinda mooru dinakke aaguvashtu saamaanu serisi.” NLP identifies intent (add items for meal), meal (nashta/breakfast), duration (eradu mooru dinakke/two three days), quantity qualifier (aaguvashtu/sufficient). System accesses knowledge base for common breakfast items and estimates quantities for 2-3 days for a standard household.
  • AI Response (Voice): “ಖಂಡಿತ, ಎರಡರಿಂದ ಮೂರು ದಿನಗಳಿಗೆ ಬೇಕಾಗುವ ಬ್ರೆಡ್, ಮೊಟ್ಟೆ, ಹಾಲು, ಮತ್ತು ಇಡ್ಲಿ ರವೆ ಸೇರಿಸುತ್ತಿದ್ದೇನೆ. ನಿಮಗೆ ಇನ್ನೇನಾದರೂ ಬೇಕಾ?” (Khanḍita, eradinda mooru dinagaḷige bēkāguva breḍ, moṭṭe, hālu, mattu iḍli rave sērisuttiddēne. Nimagge innēnādarū bēkā? - Certainly, adding bread, eggs, milk, and idli rava needed for two to three days. Do you need anything else?)

Example 7: Open-Ended Scenario - Purpose (Malayalam)

  • User: “എനിക്ക് അടുക്കള വൃത്തിയാക്കണം. അതിനു വേണ്ട സാധനങ്ങൾ കാർട്ടിൽ ചേർക്കൂ।” (Enikku adukkala vruthiyakkanaṁ. Atinu vēṇṭa sādhanannaḷ kārṭil cērkkū - I need to clean the kitchen. Add the things needed for that to the cart.)
  • System Process: ASR -> “Enikku adukkala vruthiyakkanaṁ. Atinu vēṇṭa sādhanannaḷ kārṭil cērkkū.” NLP identifies intent (add items for purpose), purpose (adukkala vruthiyakkanaṁ/cleaning the kitchen), category (sādhanannaḷ/things needed). System accesses knowledge base for common kitchen cleaning supplies.
  • AI Response (Voice): “ശരി, അടുക്കള വൃത്തിയാക്കാൻ ആവശ്യമായ ഡിറ്റർജന്റ്, സ്ക്രബ് പാഡ്, ക്ലീനിംഗ് സ്പ്രേ എന്നിവ കാർട്ടിൽ ചേർക്കുന്നു. മറ്റെന്തെങ്കിലും വേണോ?” (Śari, aṭukkaḷa vr̥tiyākkān āvaśyamāya ḍiṟṟarjenṟ, skrabb pāḍ, klīninng spre enniva kārṭil cērkkunnu. Maṟṟentenkiluṁ vēṇēā? - Okay, adding detergent, scrub pad, and cleaning spray needed for kitchen cleaning to the cart. Do you need anything else?)

These diverse examples highlight not just the multilingual capabilities but also the AI’s capacity to understand intent beyond simple item listing. It can infer needs from context such as recipes, meal plans, or even activities like cleaning. This is possible because our NLP, built on open-source models, is trained on large datasets of grocery items, categories, and common associations, allowing it to make intelligent suggestions and build carts from more abstract requests.

Why a Quick-Commerce Startup Should Use Voice, Not Just Tap Navigation

The grocery app market is crowded with platforms offering similar tap-based navigation and search. While effective for browsing or discovering new products, this model can be slow and tedious when you know exactly what you need or have a general idea of the items required for a task.

Our startup’s voice-to-cart approach differentiates us by prioritizing speed, convenience, and a natural user experience for the most common grocery shopping task: building your list of known items or gathering supplies for a purpose. It’s like walking into your local Kirana and just telling the owner what you need – they understand, they gather the items, and you’re done. We bring that same efficiency to your phone.

While we still provide visual interfaces for browsing, discovery, and cart review (multimodal is key!), the primary input method for building your cart quickly and naturally is voice. This focus makes us uniquely positioned to serve users who value speed and simplicity, and empowers local Kiranas by making it easier for customers to order from them.

The Future is Conversational

In the age of AI, the most powerful interfaces feel less like using a computer and more like interacting with another person. By making voice-to-cart our core functionality and investing in training open-source AI models on the rich linguistic diversity and unique shopping habits of India, our startup is building a self-sustaining solution that is not just technologically advanced but also deeply aligned with user behavior and local contexts.

We believe the future of grocery shopping is conversational, and with our voice-first approach, we’re leading the way, one spoken grocery list at a time.