By Himanshi Lohchab
As AI model intelligence peaks, its reliance on complex, human-curated data is only deepening.
They started with microtasks such as transcribing audio files, marking tick boxes, translating language and labelling objects in images. Now, data annotators are correcting software code, checking financial statements and analysing diagnostic reports, as the training needs of artificial intelligence models become more complex.
Data annotation, or simply data labelling, is the most crucial and foundational step for building high-quality datasets to train AI models, enhance accuracy, curtail hallucinations and build safety guardrails against inappropriate or harmful content. And India is fast emerging as a hub for data annotation services, with flexible workers, mid-tier business analysts and even skilled data engineers, auditors, radiologists and lawyers contributing to building high-quality datasets.
“Honestly, I think we need to retire the term ‘data labelling’,” says Jonathan Siddharth, founder of Palo Alto-based talent and AI tools company Turing. “It’s like calling a smartphone a ‘portable telephone’.”
“What we’re doing now is fundamentally different. We’re not tagging cats and dogs; we’re orchestrating teams of Olympiad-level talent to solve highly complex problems across industries,” he said. AI models have got so smart that sometimes you need a physicist, a software engineer and a data scientist working together just to generate data that challenges them, he explained.
Harshul Arora, founder and CEO of early-stage startup Macgence, said his company is focussing on curating custom datasets for AI/ML models and agents. “Businesses now have custom data sourcing needs which capture linguistic and cultural nuances. These datasets are not available on open libraries like Hugging Face,” he said.
Riding the growth wave
The global market for data annotation is likely to expand from about $6.5 billion in 2025 to nearly $20 billion by 2030, growing at about 25–30% each year, according to staffing firm TeamLease Digital. In India, the market was worth $80 million in 2023 and is expected to reach nearly $500 million by 2030, growing at almost 30% each year, it said. This is reflected in the growth of the segment's workforce from 20,000 in 2022 to 70,000 currently. These include annotators, quality controllers and project managers, who work in startups, IT services and crowdsourcing platforms.
“Data annotation has grown more complex with the rise of LLMs, leading to the emergence of specialised, higher-paying roles for domain-specific tasks,” said Kapil Joshi, CEO of Quess IT Staffing, adding that some of its clients have grown 50% year-on-year. With this growth, the sector will soon witness a talent scarcity, said TeamLease Digital CEO Neeti Sharma. “By 2026, the industry could face a shortage of 40–50% in skilled professionals.”
“As models evolve, data demands will shift: certain types of data may require lower volumes but others will rapidly expand,” said Ryan Kolln, CEO of Appen, a Washington-based company which has delivered over 15,000 AI data projects, including LLM fine-tuning, evaluation, red teaming and multimodal annotation. “A good example of this is LLM work, where elementary math question data is reducing, but data is still growing in demand for more complex STEM (science, technology, engineering and mathematics) problems,” he said.
The sector’s importance is underscored by Meta’s recent $14.3 billion deal to acquire a 49% stake in Scale AI, valuing the data company at $29 billion. This has opened a multi-million-dollar opportunity for global companies like Turing and Appen, as tech giants OpenAI, Google and Microsoft have reportedly terminated their contracts with Scale. Turing’s Siddharth said the deal validates that “data is as strategic as compute in the race to AGI (artificial general intelligence), and signals that the scale of investment here will rival or even exceed billions annually across frontier labs”. In recent weeks, Turing has added potential contracts worth $50 million, Time reported.
The India advantage
Data companies have long depended on India’s talent and scale for servicing global projects. “The depth of technical expertise, from IIT grads to domain-specific PhDs in math, physics and engineering, is extraordinary. And it’s evolving in sync with what AI needs: not just coding talent, but frontier minds who can help push the limits of reasoning, multimodality and agentic workflows,” said Siddharth of Turing, 40% of whose workforce is based in India. He added that data labs need the best minds to compete, “not just recycle the same talent pool in Silicon Valley. When a physicist in Bengaluru helps train a model that might cure diseases, or an engineer in Pune improves an AI that could revolutionise education, that’s the democratisation of both intelligence and opportunity”.
Appen’s Kolln pointed out that logical thinking and problem-solving skills are strong in the Indian education system, given its emphasis on mathematics and science. The company has a pool of 50,000 contributors from India.
Hardik, founder and CEO of Indika AI, said: “Over the past three years, we’ve seen strong global demand for multilingual, domain-specific data infrastructure which translated into 5X top line growth for us.” The company’s freelance platform, Flexibench, has 70,000 registered contributors, 5%-10% of whom are working actively at any given time, he added.