I’ve been involved in many machine learning (ML) projects over the past ten years, and without question the number one thing that delays success is a lack of adequate training data. Projects are delayed by the time it takes to get the right data, or blocked completely because the collection process is prohibitively time-consuming and expensive.
If there’s no data available for your specific use case, many people are now turning to general-purpose LLMs for text classification. The assumption is that the models are so large and general that there’s no need for fine-tuning: you can just call an LLM API and get ‘good enough’ results. So it’s no surprise that we’re now starting to get questions from clients about whether they should train their own ML models or use large language model (LLM) APIs.
However, this decision has real-world implications for project timelines, costs, and the precision of the results. On one hand, creating custom ML models offers unparalleled control and precision but involves an arduous and costly data collection process. On the other hand, using LLM APIs simplifies development but introduces dependencies on LLM providers that can be both costly and restrictive.
In this blog, I will explore how to harness the best of both worlds by generating synthetic data with LLMs for training compact, efficient ML models. To illustrate this, I ran an experiment using GPT-4o to generate synthetic data for three text classification tasks: spam detection, product classification, and sentiment analysis. I’ve also set out some recommendations for anyone looking to leverage synthetic data generated by LLMs for training efficient ML models.
ML Workflows
Before diving into the experiment, let’s look at ML and LLM workflows to illustrate the pros and cons of each approach.
The classic ML workflow involves three primary steps:
- Data collection: This step is labour-intensive and requires domain expertise to label the data accurately. The time and cost associated with this process can be significant, especially for large datasets.
- Model training: Training the ML model and validating its performance necessitates ML expertise. This step involves choosing the right algorithms, tuning hyperparameters, and ensuring the model generalises well to new data.
- Deployment: Deploying the trained model demands MLOps proficiency. This includes setting up infrastructure, monitoring model performance in production, and handling updates and maintenance.
Each of these steps is resource-intensive, contributing to delays and increased costs. Additionally, the need for specialised expertise at each stage can be a barrier for many organisations.
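To make the training step concrete, here’s a minimal sketch of a compact text classifier built with scikit-learn. The CSV file and its column names are illustrative assumptions, not from any specific project; the point is how little code a small, low-latency model needs once labelled data exists.

```python
# A minimal sketch of the classic ML training step, using scikit-learn.
# The CSV file and its "text"/"label" columns are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labelled_messages.csv")  # hypothetical hand-labelled dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF features plus logistic regression: compact, cheap, low-latency
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(f"Validation accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```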
With the advent of LLM APIs, a new approach has emerged:
- Problem definition: Define the problem clearly and prepare relevant prompts.
- API calls: Use LLM APIs to generate input-output pairs for validation.
While this method is faster and simpler, it incurs high costs and suffers from latency issues. Moreover, it introduces a dependency on LLM providers, which limits control over the model and its environment. The convenience comes at the expense of increased operational costs and potential limitations in customisation.
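For comparison, here’s roughly what the LLM API route looks like, sketched with the OpenAI Python SDK. The prompt wording and the spam/not-spam label set are my own illustrative choices, not a definitive recipe; note that every classification now requires a network round trip to the provider.

```python
# A sketch of the LLM API approach, assuming the OpenAI Python SDK.
# The prompt wording and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Classify the message as 'spam' or 'not spam'. "
                        "Reply with the label only."},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep the labels as deterministic as possible
    )
    return response.choices[0].message.content.strip()

print(classify("Free entry in 2 a wkly comp to win FA Cup final tkts"))
```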
The case for synthetic data
As mentioned above, we’re going to use synthetic data in our experiment. Synthetic data provides numerous benefits in machine learning, especially when real data is scarce, sensitive, or costly.
It enhances datasets through augmentation, increasing their size and diversity, which improves model robustness and generalisation. In fields like medical imaging and autonomous driving, synthetic data addresses data scarcity, aiding model training. It preserves privacy by mimicking real datasets without exposing sensitive information.
Synthetic data helps address biases, leading to fairer models, and provides controlled environments for testing and validation. It facilitates quick prototyping and experimentation, allows simulation of rare events, supports continuous learning systems, and improves model accuracy by providing diverse training data.
LLMs have recently become a popular choice for generating synthetic data: they can produce high-quality, diverse data that mimics real-world scenarios and can be used to train ML models. This means it’s now possible to combine the convenience of LLMs with the control and efficiency of customised ML models.
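As a rough illustration of what this looks like in practice, the sketch below asks an LLM to produce labelled examples for a spam classifier. The prompt, label set, and JSON output format are assumptions rather than a tested recipe; production code would also validate the model’s response before trusting it.

```python
# A sketch of generating synthetic labelled examples with an LLM,
# assuming the OpenAI Python SDK. The prompt and output format are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def generate_examples(label: str, n: int = 20) -> list[dict]:
    """Ask the model for n synthetic messages belonging to the given class."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n} short, realistic SMS messages that a human "
                f"would label as '{label}'. Return a JSON array of strings only."
            ),
        }],
        temperature=1.0,  # higher temperature encourages diverse samples
    )
    # A real pipeline should validate this parse; models sometimes add extra text.
    texts = json.loads(response.choices[0].message.content)
    return [{"text": t, "label": label} for t in texts]

dataset = generate_examples("spam") + generate_examples("not spam")
```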
A hybrid approach: What’s the latest research?
To illustrate the efficacy of combining both approaches, consider a case study conducted by the Hugging Face team. They aimed to classify investor sentiment in a large corpus of news articles.
Case study: Hugging Face sentiment analysis
- Task: Classify investor sentiment in news articles.
- Traditional ML approach: Cost $2.70 for sentiment analysis on a large news corpus. Inference took 0.13 seconds per request, with CO2 emissions around 0.12 kg.
- LLM API approach: Cost $3,061 due to pay-per-use pricing. Inference took several seconds per request, with CO2 emissions of roughly 735 to 1,100 kg.
- Findings: Traditional ML methods offer good results at lower costs and with a significantly smaller environmental footprint.
The following table summarises the results; refer to the original reference for full details.

| Approach | Classic ML approach | LLM API |
| --- | --- | --- |
| Model | RoBERTa | GPT-4 |
| Cost | $2.70 | $3,061 |
| Latency | 0.13 seconds | Multiple seconds |
| CO2 emissions | 0.12 kg CO2 | 735-1,100 kg CO2 |

Source: https://huggingface.co/blog/synthetic-data-save-costs#34-pros-and-cons-of-different-approaches
These findings illustrate the substantial benefits of combining synthetic data generation with traditional ML training, emphasising cost efficiency, speed, and environmental sustainability. Traditional ML methods not only offer good results at a lower cost but also have a significantly smaller environmental footprint. This perspective is important for organisations conscious of sustainability.
Combining the best of both worlds
The optimal solution lies in using LLMs to generate synthetic training data and then employing this data to train traditional ML models. This hybrid approach leverages the strengths of both methods, reducing data collection and labelling time while maintaining high model performance and low latency. Let’s see what my experiment reveals.
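Putting the earlier sketches together, the hybrid pipeline is only a few lines: the LLM generates the labelled data, and a compact model learns from it. This reuses the illustrative generate_examples helper from the sketch above, so the same caveats apply.

```python
# The hybrid pipeline, reusing the illustrative generate_examples helper
# sketched earlier: an LLM produces the training data, a small model learns it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

synthetic = generate_examples("spam") + generate_examples("not spam")
texts = [row["text"] for row in synthetic]
labels = [row["label"] for row in synthetic]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)  # from here on, inference is fast, cheap, and local
```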
Experimental Results
I conducted experiments using GPT-4o and data from Kaggle to generate synthetic data for three text classification tasks: spam detection, product classification, and sentiment analysis. Accuracy was measured on GPT-4o’s outputs against labelled validation samples from the Kaggle datasets. The performance metrics below show zero-shot, one-shot, and two-shot prompting performance.
Datasets used:
- Spam detection: Classifies messages as spam or not spam. Example:
- Not spam: “Nah I don’t think he goes to usf, he lives around here though.”
- Spam: “Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 …”
- Product classification: Categorises products into household, books, electronics, or clothing and accessories. Example:
- Household: “Look In Furniture Clara Sofa Set, 3+1+1(Grey) Look In Furniture brings to you unique modern sofa set for your living room …”
- Books: “Postgraduate Urology Entrance Review (PGIMER): Fully Solved Question Papers 2015-2001 …”
- Electronics: “Aquapac Neoprene Armband Aquapac Armband Sea Kayaking,Cycle Commuting Or A Day At The Beach,This Armband Leaves You Worry-Free …”
- Clothing and Accessories: “Amour Butterfly Design Sunglasses For Girls 6+ Years { SKU16 } Amour Butterfly Design Sunglasses to give full protection …”
- Sentiment analysis: Classifies sentiment into five categories: extremely negative, negative, neutral, positive, and extremely positive. The dataset was collected from Twitter during the COVID-19 period, which makes it challenging due to informal language and special symbols. I wanted to include this to see how the experiment would work with hard-to-classify data.
Performance metrics:
| Dataset | Zero-shot accuracy | One-shot accuracy | Two-shot accuracy | Number of validation samples | Number of classes |
| --- | --- | --- | --- | --- | --- |
| Spam detection | 95% | N/A | N/A | 100 | 2 |
| E-commerce product classification | 85% | 85% | 85% | 40 | 4 |
| Sentiment analysis | 27% | 38% | 41% | 100 | 5 |
- Spam detection: The model achieved an impressive 95% accuracy with zero-shot learning. This high performance can be attributed to the simplicity of the binary classification task, which the model handled with notable robustness.
- Product classification: The model maintained around 85% accuracy with both zero-shot and few-shot prompting. However, it struggled with overlapping categories, such as electronics and clothing. Items like electronic accessories often fell into a grey area between categories, revealing the complexities involved in product classification tasks.
- Sentiment analysis: The model’s performance improved from 27% accuracy with zero-shot prompting to 41% with two-shot prompting. While adding more examples generally enhances model performance, the complexity of sentiment analysis tasks requires a careful evaluation of the costs associated with generating additional synthetic data.
These results illustrate that while few-shot prompting can improve model performance, it also increases costs. The performance gains must be weighed against the additional expenses incurred by generating more synthetic data. This balance is needed to optimise the effectiveness and efficiency of ML models in text classification tasks.
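To give a sense of what the prompting levels mean in practice, here’s an illustrative sketch of how a zero-shot versus two-shot prompt for the sentiment task can be assembled. The worked examples and wording are hypothetical, not drawn from the actual dataset or the prompts used in the experiment.

```python
# A sketch of zero-shot vs. few-shot prompt construction for the sentiment task.
# The labelled examples and wording are illustrative assumptions.
LABELS = "extremely negative, negative, neutral, positive, extremely positive"

FEW_SHOT_EXAMPLES = [
    ("Supermarket shelves empty again, this is getting scary #covid19",
     "extremely negative"),
    ("Grateful to the delivery drivers keeping us stocked up!", "positive"),
]

def build_prompt(tweet: str, n_shots: int = 0) -> str:
    """Prepend n_shots worked examples before the tweet to classify."""
    lines = [f"Classify the tweet's sentiment as one of: {LABELS}."]
    for text, label in FEW_SHOT_EXAMPLES[:n_shots]:
        lines.append(f"Tweet: {text}\nSentiment: {label}")
    lines.append(f"Tweet: {tweet}\nSentiment:")
    return "\n\n".join(lines)

print(build_prompt("Queues around the block at the pharmacy today", n_shots=2))
```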
Practical recommendations
When considering how to leverage synthetic data generated by LLMs for training efficient ML models, it’s important to balance the costs and benefits. Here are some recommendations based on our experience and the insights gathered from recent research and experiments.
- Balancing costs and benefits: For straightforward tasks like spam detection and product classification, LLM-generated data can quickly yield high-performing models with minimal costs. These tasks benefit from the simplicity of binary or well-defined categorical classifications.
- Handling complex tasks: For more complex tasks such as sentiment analysis, traditional methods might still be necessary to achieve the desired accuracy. The complexity of sentiment analysis requires extensive and well-balanced training data.
- Addressing imbalanced datasets: Synthetic data can balance classes, leading to better model performance, especially for datasets with underrepresented classes.
- Utilising active learning approaches: Active learning can improve model robustness by generating labels for cases where the model is uncertain, refining accuracy without significantly increasing costs. A sketch of this idea follows after this list.
- Synthetic data augmentation: Enriching existing datasets with synthetic data can be beneficial, particularly for imbalanced datasets, enhancing model performance and ensuring more accurate predictions.
- Environmental and cost considerations: Consider the environmental impact and cost-efficiency of using synthetic data. Combining LLM-generated synthetic data with traditional ML training can offer a balanced solution that leverages the strengths of both methods.
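Here is a minimal sketch of the active learning idea from the list above: select the unlabelled examples where a trained classifier is least confident and route only those to an LLM (or a human) for labelling. The confidence threshold and the llm_label helper are hypothetical.

```python
# A sketch of uncertainty-based active learning with a scikit-learn classifier.
# The 0.6 threshold and the llm_label helper are hypothetical assumptions.
def select_uncertain(model, texts, threshold=0.6):
    """Return the texts whose top predicted class probability is below threshold."""
    probs = model.predict_proba(texts)  # shape: (n_samples, n_classes)
    confidence = probs.max(axis=1)      # confidence of the most likely class
    return [t for t, c in zip(texts, confidence) if c < threshold]

# uncertain = select_uncertain(model, unlabelled_pool)
# new_labels = [llm_label(t) for t in uncertain]  # hypothetical LLM labelling call
# then retrain the model on the original data plus the newly labelled examples
```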
Final thoughts on integrating LLMs and traditional ML
Combining LLMs with traditional ML approaches offers a pragmatic solution to the challenges of text classification. By generating synthetic data with LLMs, we can expedite the training process and reduce costs while maintaining high performance and control. This hybrid strategy allows us to enjoy the best of both worlds, making it a compelling choice for modern ML projects.