Apple recently postponed the launch of its enhanced, more personalized Siri features. Now, Bloomberg reports that Apple is making a significant change to how it trains its AI models in an effort to improve future Apple Intelligence updates.
The report points to a post on Apple’s Machine Learning Research blog that explains how the company typically uses synthetic data to train its AI models. That approach has drawbacks, however: synthetic data often fails to capture real-world trends for complex tasks such as summarization, or for features that work with longer sentences or entire email threads.
To address these limitations, Apple has introduced a technique that compares synthetic data with a small sample of recent user emails while protecting user privacy. Apple explains:
To enhance our models, we must generate numerous emails that encompass the most common topics found in messages. We begin by crafting a diverse array of synthetic messages on various subjects. For instance, we could generate a synthetic message that states, “Would you like to play tennis tomorrow at 11:30AM?”
This process is conducted without any insights into individual user emails. We then derive a representation called an embedding for each synthetic message that encapsulates key aspects such as language, topic, and length. These embeddings are sent to a select group of user devices that have opted into Device Analytics.
These devices then choose a small sample of recent user emails to compute their embeddings. Each device then identifies which of the synthetic embeddings is most similar to these samples. By applying differential privacy, Apple can ascertain the most commonly selected synthetic embeddings across all devices, without accessing the specific synthetic embedding chosen on any particular device.
The most frequently chosen synthetic embeddings can subsequently be utilized to create either training or testing data, or additional curation steps can be performed to refine the dataset further. For example, if the message concerning tennis ranks among the top embeddings, a similar message could be generated by replacing “tennis” with “soccer” or another sport to be included in the next curation step (see Figure 1). This iterative process enables us to enhance the topics and language of our synthetic emails, aiding in the training of our models for better text outputs in features like email summaries, all while maintaining user privacy.
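The on-device matching and private aggregation steps described above can be illustrated with a minimal Python sketch. Apple's post, as excerpted here, does not specify the similarity measure or the differential privacy mechanism, so this sketch makes two assumptions: cosine similarity for matching, and k-ary randomized response as a stand-in local differential privacy mechanism. All function names and the `epsilon` parameter are hypothetical, and this is not Apple's implementation.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_synthetic(email_embedding, synthetic_embeddings):
    """On device: index of the synthetic embedding closest to a local email."""
    return max(
        range(len(synthetic_embeddings)),
        key=lambda i: cosine(email_embedding, synthetic_embeddings[i]),
    )


def randomized_response(true_index, k, epsilon, rng):
    """On device: k-ary randomized response (assumed mechanism).

    Reports the true index with probability e^eps / (e^eps + k - 1),
    otherwise reports one of the other k - 1 indices uniformly at random.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_truth:
        return true_index
    other = rng.randrange(k - 1)
    return other if other < true_index else other + 1


def estimate_counts(reports, k, epsilon):
    """On server: unbiased estimate of how often each index was truly chosen."""
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)
    observed = Counter(reports)
    return [(observed[i] - n * q) / (p - q) for i in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2-D embeddings standing in for three synthetic messages.
    synthetic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # One toy email embedding per participating device.
    device_emails = [[0.9, 0.1]] * 600 + [[0.1, 0.9]] * 300 + [[1.0, 1.1]] * 100
    reports = [
        randomized_response(nearest_synthetic(e, synthetic), len(synthetic), 1.0, rng)
        for e in device_emails
    ]
    print(estimate_counts(reports, len(synthetic), 1.0))
```

In this sketch the server only ever sees noisy reports, so it cannot tell which synthetic message any single device truly picked, yet the aggregate estimates still reveal which synthetic messages are most popular overall. A smaller `epsilon` adds more noise and gives stronger privacy at the cost of less accurate counts.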
Apple states that these methods allow it to “grasp overall trends without acquiring information about any individual.” Bloomberg reports that the new system is expected to arrive in an upcoming beta of iOS 18.5 and macOS 15.5.
For more details, read Apple’s full blog post.