Driving Innovation in Agriculture with Synthetic Data

By Ron Baruchi, CEO of Agmatix

Data is the cornerstone of agricultural product innovation. It’s necessary to develop products, meet regulatory requirements, discover new mechanisms, and educate the market on how to optimize the use of ag inputs and practices.

But generating quality, real-world data is an expensive, time-consuming process. According to a 2018 Phillips McDougall crop protection industry report, it typically takes more than 150 studies to register a new active ingredient. From 2010-14, it cost around $286 million to discover and develop a new crop protection product, with $47 million (approximately 16%) going toward field trials. It also takes just over 11 years from the first synthesis to the first sale of a crop protection product.

The advent of synthetic data may soon change that.

Synthetic Data

Synthetic data is generated by a model built on real-world data, so that it keeps the same statistical properties and the same relationships between parameters as the real-world datasets. Datasets can be fully synthetic, or partially synthetic, where synthetic data helps fill in any gaps in real-world data.
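To make the idea concrete, here is a minimal sketch in Python of one simple way to generate fully synthetic data: estimate the means and covariances of a real dataset, then sample new rows from a matching multivariate normal distribution. This is an illustration under stated assumptions, not a description of any particular vendor's method. The "real" trial data (rainfall, nitrogen, yield), the function names, and the Gaussian assumption are all hypothetical; production-grade generators typically use richer models.

```python
import numpy as np

def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real data."""
    return real.mean(axis=0), np.cov(real, rowvar=False)

def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw synthetic rows from a multivariate normal with the fitted moments."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Hypothetical stand-in for real trial data (values invented for illustration).
# Columns: rainfall (mm), nitrogen applied (kg/ha), yield (t/ha).
rng = np.random.default_rng(42)
rainfall = rng.normal(500, 80, size=200)
nitrogen = rng.normal(120, 25, size=200)
yield_t = 2.0 + 0.006 * rainfall + 0.02 * nitrogen + rng.normal(0, 0.3, size=200)
real = np.column_stack([rainfall, nitrogen, yield_t])

# Fit on the real rows, then generate as many synthetic rows as needed.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# The correlation structure of the synthetic rows should track the real one.
real_corr = np.corrcoef(real, rowvar=False)
synth_corr = np.corrcoef(synthetic, rowvar=False)
print(np.round(real_corr, 2))
print(np.round(synth_corr, 2))
```

None of the synthetic rows corresponds to an actual plot or grower, yet the correlations between rainfall, nitrogen, and yield carry over. A Gaussian model only captures linear relationships, which is why this should be read as a sketch rather than a recipe.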

Synthetic data is not a replacement for original data but a secondary source, one that can significantly reduce the time, cost, and effort of obtaining original data. That offers great potential for reducing the time and investment needed to bring new agricultural products to market.

Training AI Models

Fully synthetic data is often used to validate AI models. Instead of conducting real-world experiments to train AI, we can use synthetic data to look for early signs of correlations and model validity before investing in large-scale, real-world data collection. Once the AI is performing as expected, we can move forward with validating it in real-world trials.

As Fortune reported last year, John Deere was training its AI on synthetic images of weed species under different conditions so its tractors spray the right plants and not a farmer’s crops. The farm equipment manufacturer said it would eventually test how well the synthetic data training compared to AI that was trained on real data.

Using Digital Twins for Virtual Trials

Synthetic data can also be used for R&D purposes. Scientists can create a “digital twin,” in which a computer takes real-world data and generates synthetic data that maintains its statistical correlations, creating a system that emulates real life. According to Forbes, digital twins have already been created for transportation infrastructure and sports stadiums.

In agriculture, you could create a digital twin of a field trial to test which variables, such as soil types and weather conditions, are necessary for a successful real-world field trial. This can have huge implications for agricultural input suppliers like crop protection companies, who are required to manage large field trials to receive regulatory approval, or seed companies that rely heavily on experimentation to improve their seed genetics. 

Digital twins can also be used to fill in data gaps in real-world datasets. If equipment malfunctions or a remote sensor fails and data is missing, you can generate synthetic data based on statistical models to fill in those holes and provide a complete picture of the study. Or if data is missing for certain geographic locations due to a lack of research facilities, synthetic data can help fill in those gaps.
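As a rough illustration of gap filling, the Python sketch below fits a least-squares model on the rows where a sensor reading exists and uses it to predict the missing readings. The scenario (soil moisture predicted from air temperature), the invented values, and the function name fill_gaps are assumptions made for the example; a real study would choose its statistical model to suit the data.

```python
import numpy as np

def fill_gaps(X, y):
    """Fill NaN entries in y with predictions from a least-squares fit on observed rows."""
    observed = ~np.isnan(y)
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A[observed], y[observed], rcond=None)
    filled = y.copy()
    filled[~observed] = A[~observed] @ coef  # predict only the missing rows
    return filled

# Hypothetical sensor log (values invented for illustration).
rng = np.random.default_rng(0)
air_temp = rng.uniform(10, 30, size=50)                      # degrees C
soil_moisture = 40 - 0.8 * air_temp + rng.normal(0, 1.0, size=50)
true_values = soil_moisture.copy()

# Simulate a sensor failure on three readings.
missing = [5, 17, 33]
soil_moisture[missing] = np.nan

completed = fill_gaps(air_temp.reshape(-1, 1), soil_moisture)
print(np.round(completed[missing], 2))
print(np.round(true_values[missing], 2))
```

Observed readings are left untouched; only the holes are filled with model-based estimates. How far those estimates can be trusted depends on how well the model describes the real system.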

Preserving Privacy When Collaborating

Privacy and security have always been major barriers to obtaining real-world data, especially on-farm data. And with more agribusinesses collaborating with one another, that concern has only grown. 

But synthetic data allows companies to strip the personal and confidential information from a dataset, while keeping the data correlations and relations of the original real-world data. This opens the door for greater collaboration and confidence in data sharing.

How Can We Trust Synthetic Data?

While the potential for synthetic data is great, naturally there’s concern around its efficacy. How can we trust synthetic data and be sure it’s an accurate representation of real-world data?

Perhaps the strongest evidence is in the pharmaceutical industry, where we’re seeing increasing regulatory trust in synthetic data.

In 2020, the FDA approved a synthetic control arm for use in a Medidata cancer trial. As Jennifer Goldsack explains in an article for STAT, synthetic control arms use what’s called real-world data—data collected from external sources such as electronic health records, historical clinical trial data, and even consumer fitness trackers—instead of gathering data from patients recruited for a trial. 

The Medidata synthetic control arm was built from an archive of more than 22,000 clinical studies, and Goldsack points out that synthetic control arms have been used in other submissions that received FDA and EU approval, setting a regulatory precedent.

Now Is the Time to Start

It’s not inconceivable that someday regulatory bodies like the EPA could accept synthetic data in their approval process. In fact, Gartner predicts that by 2030 synthetic data will completely overshadow real data in AI models. That means now is the time for agribusinesses to learn how to work with synthetic data.

While ag companies can create synthetic data themselves, it’s a significant investment, given the time and resources it takes to develop a scientifically sound model and to gather the original data needed to create the synthetic data. It’s much more efficient for an R&D department to source synthetic data externally.

As a data company, Agmatix has more than 670 million data points and 53 million values of professional observation that we can use to generate synthetic data. But we can also help companies pull the data they already have and standardize it in one central repository, where it can be used either as real-world data or as the basis for generating synthetic data. Once that data is generated, we develop models to convert the data into actionable insights.

The possibilities with data are unlimited. But to tap into that potential, agribusinesses need to build their data, connect it, standardize it, create and validate a model for interpreting it, and then use synthetic data to supplement it.