How does AI data labeling work, and why do we need human experts to teach AI?
What does it take to create a well-operating AI model? Data is the key, but there is something more that lies behind high-quality data. According to Manu Sharma, human experts are the answer. While we can create purely synthetic data in some cases, teaching a model entirely new concepts requires humans. What’s more, ensuring that AI is aligned with human intent is crucial, and data labeling is at the heart of that alignment.
In this episode of the AI at Scale podcast, Gosia Gorska welcomes Manu Sharma, the Founder and CEO of Labelbox, the leading data factory for building next-generation AI. Manu starts with explaining what data labeling is. He highlights the critical role of data labeling in AI development, the evolution of data annotation techniques, and the future of AI technologies. He shares his extensive experience in aerospace engineering, AI research, and entrepreneurship, highlighting how his company supports businesses in aligning and training their AI models through high-quality data labeling.
Producing high-quality data at scale involves multiple steps and meticulous processes. To address this challenge, Labelbox manages one of the largest networks of AI tutors globally, alignerr.com, which includes university professors, domain experts, language experts from various regions, and more. They label data through part-time jobs and help produce high-quality data at scale.

Listen to the AI at Scale podcast
Listen to the Manu Sharma episode: How AI models learn with data labeling.
Subscribe to the show on your preferred streaming platform (Apple Podcasts, Spotify).
Transcript
Gosia Gorska:
Today it’s my great pleasure to welcome Manu Sharma. Manu is the Founder and CEO of Labelbox, the leading data factory for building next-gen AI. Welcome, Manu.
Manu Sharma:
Thank you for having me.
AI and Aviation Background
Gosia:
Yeah, it’s a pleasure, really. Manu, I’ve been reading about you, and I was so excited to discover that you developed products at Planet Labs and DroneDeploy and co-founded renewable energy and space exploration companies. It started with you studying aerospace engineering along with a wide range of robotics, electrical engineering, and industrial design subjects. You hold master’s degrees from Stanford University and Embry-Riddle Aeronautical University. Tell me, are you still doing something in this field?
Manu:
First of all, thank you again for having me. And yeah, you did quite some research. I am a pilot. I fly planes, and I think that’s the closest I get to the aviation and aeronautics world. However, I am deeply optimistic and passionate about space exploration, and I think, in my lifetime, we’re going to go to Mars and explore a lot of parts of our solar system. That part of me is always curious and excited about developments in this sector. But my day-to-day work is in artificial intelligence, and AI is a very exciting industry at the moment. When I was growing up and in university, I was very fascinated by AI. In fact, while pursuing my aeronautical engineering, I spent most of my time doing research in AI. I developed algorithms with neural networks to control flight characteristics. Essentially, flight controls are typically built with classical systems, and I was very curious about whether we could train neural networks to control flight. That was about 15 years ago. My interest in AI has always been there. Fast forward to 2018, when everything was coming together—the hardware technologies were getting really good, and deep learning algorithms were becoming mainstream. That’s when I thought it was the right time to build a company like Labelbox.
AI’s Role in Aviation
Gosia:
Okay, I see. But I’m curious. You need to give me the answer. So can large language models actually control flights or not?
Manu:
They are going to be able to do so, and the applications of large language models are particularly well-suited to scenarios where you don’t know in advance what to do. What would be an example of that? For instance, humanoid robots and the robotics industry are growing rapidly because many people are realizing they now have the missing piece: an intelligent system that can perceive the surroundings and navigate to accomplish tasks. This integration of robotics, large language models, and vision models is a winning recipe for making these systems work in the physical world. This is also a great architecture for space exploration. If we want to send hundreds of robots to Mars to explore the surroundings, it will be very effective to have large language models operating at the edge to make decisions. Currently, decisions are made every eight minutes or so because of the lag in signal travel between Earth and Mars. With intelligent systems, decisions can be made on-site, prioritizing important data to send back. Companies like Planet Labs started developing data prioritization techniques about seven or eight years ago. We were capturing the entire Earth with 300 or so satellites every day and needed the fleet to prioritize which images to focus on. I’m sure many companies now operate with that kind of architecture.
The Importance of Data Labeling in AI
Gosia:
Yeah, that’s really fascinating. You mentioned image, and I think this is the area where we will direct our conversation right now because I guess this is related to data labeling. One of the kinds of data labeling is exactly labeling the images needed for AI systems to cooperate with humans and bring some intelligence to us as well. But the first question before we go into the details of this: how did you decide you wanted to launch a startup? Was Labelbox the first one? Was there any particular reason why you chose exactly the data labeling field?
Manu:
Yeah, so this is perhaps my third company. I had two small businesses that I operated during my university time. While pursuing my bachelor’s, I was really interested in residential wind turbines. We attempted to transition a research project into a small business. Later, I developed a small business around building CubeSats, small satellites—experiment modules that would go to the International Space Station, allowing researchers to conduct experiments remotely by logging onto a website to control them. These were small businesses in very niche industries. The most important elements of those experiences were the fun of solving problems and interacting with the market to make things happen. In university, while taking classes, I was building research labs for space station experiments flying on SpaceX and Antares rockets. It was a fun adventure. Joining DroneDeploy and Planet Labs taught me professional software development and how to produce software at a large scale, something I didn’t know since I was mostly in non-software industries. This led me to understand how to build technology and software companies: decomposing ideas into smaller tasks and working with engineering to develop them iteratively and quickly. In 2017-2018, deep learning was a hot topic in Silicon Valley, with self-driving cars and startups emerging. At Planet, we started building computer vision technologies. Our business was selling images, but customers wanted insights in their areas of interest. We developed vision models and realized that data was a critical component. Along with talent and compute, data was essential for developing AI systems. Humans will always want AI systems to be aligned with their needs, using data labeling as the primary means of alignment. We built Labelbox around the idea that the world will have many AI models, each needing unique alignment through data labeling.
Today, vision models are aligned with data labeling techniques that are now heavily AI-accelerated, reducing the need for extensive human supervision. Large language models follow a similar path, with multimodal models integrating vision and language capabilities. This commonality underpins the importance of data labeling across AI applications.
Path to Founding Labelbox
Gosia:
Yeah, that’s all data labeling, exactly. For people who are already reading about and perhaps studying AI, this whole concept is quite clear. But could you give a simple definition of data labeling and the role it plays in AI? You’ve already described a bit of how it changed over the years, but coming to today, how are businesses using data labeling, and why is it needed? I think the general awareness is more focused on how large language models or vision recognition models work, and we still forget about the input side, as you said, and the fact that data has to be of high quality and labeled with high accuracy. How do you ensure this at Labelbox? Could you start with maybe a simple definition?
Manu:
Yeah. Any business developing a unique, differentiated AI model—only a smaller fraction of companies will build AI systems from the ground up, because they have unique data produced by their existence. Companies like Google or Meta have social networks, search engines, and other software applications generating vast amounts of unique data. Industrial companies like John Deere have tractors with cameras in agricultural contexts, producing and capturing data at scale. These companies have unique advantages. By using their data to build new AI-driven product capabilities, they can automate or enhance valuable decisions in their industries. For John Deere, identifying specific weeds in farms to optimize herbicide use can reduce application by 90%, benefiting the world. Similarly, industrial companies like Schneider Electric have use cases in manufacturing and defect analysis unique to their operations. To build AI models that streamline, automate, or enhance their operations, businesses must produce high-quality labeled data. That’s the essence of data labeling—to train customized, unique AI models. Producing high-quality data at scale is complex and challenging, not achievable by every company. At Labelbox, we’ve developed strategies and techniques to produce data at scale. Depending on the task, expert humans are required. For large language models and generative AI applications, data labeling must be done by experts with PhDs and backgrounds in linguistics, advanced language skills, and software engineering. Labelbox has one of the largest networks of these expert AI tutors globally. Our network, alignerr.com, includes university professors, domain experts from companies like Google and Amazon, language experts from various regions, and more. They label data through part-time jobs and get paid.
Our platform manages all tasks, orchestrating data production like an assembly line—breaking down tasks, ensuring quality through QA and QC processes, automating where possible, and measuring data quality at every step to enable continuous improvement. Producing high-quality data at scale involves multiple steps and meticulous processes, all managed through our software and AI capabilities.
Earning with Flexible Data Labeling Jobs
Gosia:
Yeah. It’s really great, Manu, that you mentioned the opportunity to participate as a part-time job in these projects. I think it’s a beautiful alternative to what some of us find irritating on web pages where we’re asked to select pictures—like a car or a tree or a bus—knowing we’re doing data labeling for free as an exercise. You’re offering an alternative where people can actually be part-time contributors by labeling data according to their expertise. It’s a really great idea.
Manu:
Yeah, absolutely. Tens of thousands of people are actively making money doing these tasks on our network. People are earning on the side, giving them more options and flexibility in their lives. The wages are very compelling—some of the best aligners are earning $100 to $150 an hour doing these tasks. It’s a compelling part-time job that people do on their own time.
Labelbox’s Approach to Quality Data Labeling
Gosia:
Yeah, that’s a great idea. Talking about specific examples, you already started to share some, like in agriculture or industry. Could you give us some examples where the data that was labeled has improved machine learning model performance in concrete applications?
Manu:
Yeah, there are many examples. Let’s start with computer vision applications, where the primary task is to identify objects with high precision and accuracy. Improving these models often involves building a data engine, similar to how Tesla Autopilot was designed. The term “data engine” was coined by Tesla’s team, popularizing advanced vision models at scale. For example, in John Deere’s case, they have cameras on tractors identifying weeds under varying conditions—morning, evening, rain. The sensor data varies significantly, challenging vision models to perform consistently across all scenarios. To improve model performance, the machine learning team prioritizes identifying where the model struggles, such as poor lighting conditions, and labels that data with humans. This enables the AI model to learn from these edge cases and improve weed identification in those scenarios. Over the last few years, our customers have evolved from R&D projects to production-deployed products generating large-scale revenue through constant iterations—rolling out products, capturing more data, prioritizing data labeling, retraining models, and so forth.
In large language model applications, similar scenarios exist. Our company supports many large language models and AI assistants used globally. The data for these models is very interesting, focusing on reasoning and advanced problem-solving. This data is produced by experts with PhDs, linguists, and software engineers, solving complex problems with step-by-step instructions. This high-quality reasoning data isn’t available on the Internet, which is why training models solely on Internet data limits their reasoning capabilities. Labelbox provides this specialized data to frontier labs, enabling advanced reasoning skills in models like ChatGPT, Gemini, and Llama. In essence, teaching a model new capabilities requires new data, produced in various ways depending on the problem. Sometimes, synthetic data suffices when existing models can solve a problem, allowing training of smaller, specialized models through techniques like model distillation. However, teaching a model entirely new capabilities without existing knowledge requires human-produced data, often with AI assistance. This is a common thread across all AI model training.
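The iterate-capture-label-retrain loop Manu describes (the "data engine") can be sketched roughly as follows. This is an illustrative toy, not Labelbox's or Tesla's actual pipeline; every function name, data structure, and threshold here is hypothetical:

```python
# Illustrative sketch of one "data engine" iteration: find samples where the
# model is uncertain, route them to human experts for labeling, and collect
# the results for retraining. All names and thresholds are hypothetical.

def human_label(sample):
    # Placeholder for expert annotation (e.g., weed vs. crop in farm imagery).
    return "weed" if "weed" in sample else "crop"

def data_engine_iteration(model, unlabeled_pool, confidence_threshold=0.6):
    """One pass: select low-confidence samples (the model's edge cases),
    label them with humans, and return the new training data."""
    hard_cases = [
        sample for sample in unlabeled_pool
        if model(sample)["confidence"] < confidence_threshold
    ]
    # In production these would be dispatched to expert labelers with QA/QC.
    return [{"sample": s, "label": human_label(s)} for s in hard_cases]

# Toy usage: a "model" that is unsure about dim (poorly lit) images.
def toy_model(sample):
    return {"confidence": 0.3 if "dim" in sample else 0.9}

pool = ["dim weed photo", "bright crop photo", "dim crop photo"]
print(data_engine_iteration(toy_model, pool))
```

The point of the loop is that labeling effort concentrates where the deployed model is weakest (poor lighting, rare weeds), so each retraining round targets real failure modes rather than random data.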
Future of AI: Enhancing Multilingual Models
Gosia:
I see. I’m glad that we are moving away from training models only on Internet data because, as you mentioned, we need high-quality expertise from humans. If we relied solely on Internet data, we’d be feeding the models “garbage in, garbage out,” limiting their reasoning capabilities. You’ve described the current status of technology; what’s your take on the future? You say that Labelbox is the data factory for the next-gen AI. So what’s next in next-gen AI, and how does the future look?
Manu:
I think there are three dimensions. First, frontier models will continue to improve. Currently, these models are very good in English but less so in other languages like Hindi or Polish, often making more errors. We can expect these models to quickly become much better in other languages. Second, these models will advance in reasoning capabilities, solving more complex problems in math, science, and beyond. This advancement hints at the potential for models to perform agentic tasks in enterprise contexts—solving complex problems that involve interacting with multiple software systems and manipulating data reliably. Advanced reasoning will enable models to handle such tasks effectively.
Third, multimodality will become prominent. Architectures are converging so that solving vision or language problems won’t require different model architectures. Instead, a single model will handle multiple tasks simultaneously, allowing interactions through images, videos, text, and documents within the same model. All these capabilities will be integrated into a single model, enhancing versatility and functionality.
Gosia:
I see. That’s an interesting vision of the future. We’re looking forward to it. It was a pleasure to talk with you, Manu. Thank you so much for your time with us.
Like what you hear?
- Visit our AI at Scale website to discover how we transform energy management and industrial automation with artificial intelligence technologies to create more sustainable solutions.
- Listen to the other episodes of the AI at Scale podcast.
AI at Scale Schneider Electric podcast series continues!
The first Schneider Electric podcast dedicated only to artificial intelligence is available on all streaming platforms. The AI at Scale podcast invites AI practitioners and AI experts to share their experiences, insights, and AI success stories. Through casual conversations, the show provides answers to questions such as: How do I implement AI successfully and sustainably? How do I make a real impact with AI? The AI at Scale podcast features real AI solutions and innovations and offers a sneak peek into the future.
