In WWDC25 Metal 4 released quite excited new features for machine learning optimization, but as we all know the pytorch based on metal shader performance (mps) is the one of most important tools for Mac machine learning area.but on mps introduced website we cannot see any support information for metal4.
General
RSS for tagExplore the power of machine learning within apps. Discuss integrating machine learning features, share best practices, and explore the possibilities for your app.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
Hey Devs,
I'm trying to create my own Real Time Text detection like this Apple project. https://developer.apple.com/documentation/vision/extracting-phone-numbers-from-text-in-images
I want to use the new iOS18 RecognizeTextRequest instead of the old VNRecognizeTextRequest in my SwiftUI project.
This is my delegate code with the camera setup. I removed region of interest for debugging but I'm trying to scan English words in books. The idea is to get one word in the ROI in the future. But I can't even get proper words so testing without ROI incase my math is wrong.
@Observable
class CameraManager: NSObject, AVCapturePhotoCaptureDelegate
...
override init() {
super.init()
setUpVisionRequest()
}
private func setUpVisionRequest() {
textRequest = RecognizeTextRequest(.revision3)
}
...
func setup() -> Bool {
captureSession.beginConfiguration()
guard
let captureDevice = AVCaptureDevice.default(
.builtInWideAngleCamera, for: .video, position: .back)
else {
return false
}
self.captureDevice = captureDevice
guard let deviceInput = try? AVCaptureDeviceInput(device: captureDevice)
else {
return false
}
/// Check whether the session can add input.
guard captureSession.canAddInput(deviceInput) else {
print("Unable to add device input to the capture session.")
return false
}
/// Add the input and output to session
captureSession.addInput(deviceInput)
/// Configure the video data output
videoDataOutput.setSampleBufferDelegate(
self, queue: videoDataOutputQueue)
if captureSession.canAddOutput(videoDataOutput) {
captureSession.addOutput(videoDataOutput)
videoDataOutput.connection(with: .video)?
.preferredVideoStabilizationMode = .off
} else {
return false
}
// Set zoom and autofocus to help focus on very small text
do {
try captureDevice.lockForConfiguration()
captureDevice.videoZoomFactor = 2
captureDevice.autoFocusRangeRestriction = .near
captureDevice.unlockForConfiguration()
} catch {
print("Could not set zoom level due to error: \(error)")
return false
}
captureSession.commitConfiguration()
// potential issue with background vs dispatchqueue ??
Task(priority: .background) {
captureSession.startRunning()
}
return true
}
}
// Issue here ???
extension CameraManager: AVCaptureVideoDataOutputSampleBufferDelegate {
func captureOutput(
_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
from connection: AVCaptureConnection
) {
guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
Task {
textRequest.recognitionLevel = .fast
textRequest.recognitionLanguages = [Locale.Language(identifier: "en-US")]
do {
let observations = try await textRequest.perform(on: pixelBuffer)
for observation in observations {
let recognizedText = observation.topCandidates(1).first
print("recognized text \(recognizedText)")
}
} catch {
print("Recognition error: \(error.localizedDescription)")
}
}
}
}
The results I get look like this ( full page of English from a any book)
recognized text Optional(RecognizedText(string: e bnUI W4, confidence: 0.5))
recognized text Optional(RecognizedText(string: ?'U, confidence: 0.3))
recognized text Optional(RecognizedText(string: traQt4, confidence: 0.3))
recognized text Optional(RecognizedText(string: li, confidence: 0.3))
recognized text Optional(RecognizedText(string: 15,1,#, confidence: 0.3))
recognized text Optional(RecognizedText(string: jllÈ, confidence: 0.3))
recognized text Optional(RecognizedText(string: vtrll, confidence: 0.3))
recognized text Optional(RecognizedText(string: 5,1,: 11, confidence: 0.5))
recognized text Optional(RecognizedText(string: 1141, confidence: 0.3))
recognized text Optional(RecognizedText(string: jllll ljiiilij41, confidence: 0.3))
recognized text Optional(RecognizedText(string: 2f4, confidence: 0.3))
recognized text Optional(RecognizedText(string: ktril, confidence: 0.3))
recognized text Optional(RecognizedText(string: ¥LLI, confidence: 0.3))
recognized text Optional(RecognizedText(string: 11[Itl,, confidence: 0.3))
recognized text Optional(RecognizedText(string: 'rtlÈ131, confidence: 0.3))
Even with ROI set to a specific rectangle Normalized to Vision, I get the same results with single characters returning gibberish.
Any help would be amazing thank you.
Am I using the buffer right ?
Am I using the new perform(on: CVPixelBuffer) right ?
Maybe I didn't set up my camera properly? I can provide code
Hi, recently i tried to fine-tune Gemma-2-2b mlx model on my macbook (24 GB UMA). The code started running, after few seconds i saw swap size reaching 50GB and ram around 23 GB and then it stopped. I ran the Gemma-2-2b (cuda) on colab, it ran and occupied 27 GB on A100 gpu and worked fine. Here i didn't experienced swap issue.
Now my question is if my UMA was more than 27 GB, i also would not have experienced swap disk issue.
Thanks.
Topic:
Machine Learning & AI
SubTopic:
General
I have used mlx_lm.lora to fine tune a mistral-7b-v0.3-4bit model with my data. I fused the mistral model with my adapters and upload the fused model to my directory on huggingface. I was able to use mlx_lm.generate to use the fused model in Terminal. However, I don't know how to load the model in Swift. I've used
Imports
import SwiftUI
import MLX
import MLXLMCommon
import MLXLLM
let modelFactory = LLMModelFactory.shared
let configuration = ModelConfiguration(
id: "pharmpk/pk-mistral-7b-v0.3-4bit"
)
// Load the model off the main actor, then assign on the main actor
let loaded = try await modelFactory.loadContainer(configuration: configuration)
{ progress in
print("Downloading progress: \(progress.fractionCompleted * 100)%")
}
await MainActor.run {
self.model = loaded
}
I'm getting an error
runModel error: downloadError("A server with the specified hostname could not be found.")
Any suggestions?
Thanks, David
PS, I can load the model from the app bundle
// directory: Bundle.main.resourceURL!
but it's too big to upload for Testflight
Topic:
Machine Learning & AI
SubTopic:
General
Hi
We're on tensorflow 2.20 that has support now for python 3.13 (finally!). tensorflow-metal is still only supporting 2.18 which is over a year old.
When can we expect to see support in tensorflow-metal for tf 2.20 (or later!) ?
I bought a mac thinking I would be able to get great performance from the M processors but here I am using my CPU for my ML projects.
If it's taking so long to release it, why not open source it so the community can keep it more up to date?
cheers
Matt
Hello,
I am interested in using jax-metal to train ML models using Apple Silicon. I understand this is experimental.
After installing jax-metal according to https://developer.apple.com/metal/jax/, my python code fails with the following error
JaxRuntimeError: UNKNOWN: -:0:0: error: unknown attribute code: 22
-:0:0: note: in bytecode version 6 produced by: StableHLO_v1.12.1
My issue is identical to the one reported here https://github.com/jax-ml/jax/issues/26968#issuecomment-2733120325, and is fixed by pinning to jax-metal 0.1.1., jax 0.5.0 and jaxlib 0.5.0.
Thank you!
Hi team,
I’m exploring the Model Context Protocol (MCP), which is used to connect LLMs/AI agents to external tools in a structured way. It's becoming a common standard for automation and agent workflows.
Before I go deeper, I want to confirm:
Does Apple currently provide any official MCP server, API surface, or SDK on iOS/macOS?
From what I see, only third-party MCP servers exist for iOS simulators/devices, and Apple’s own frameworks (Foundation Models, Apple Intelligence) don’t expose MCP endpoints.
Is there any chance Apple might introduce MCP support—or publish recommended patterns for safely integrating MCP inside apps or developer tools?
I would like to see if I can share my app's data to the MCP server to enable other third-party apps/services to integrate easily
Topic:
Machine Learning & AI
SubTopic:
General
Hi everyone,
I'm experiencing an inconsistent behavior with the Translation framework on iOS 18. The LanguageAvailability.status() API reports language models as .installed, but translation fails with Code 16.
Setup:
Using translationTask modifier with TranslationSession
Batch translation with explicit source/target languages
Languages: Portuguese→English, German→English
Issue:
let status = await LanguageAvailability().status(from: sourceLang, to: targetLang) // Returns: .installed
// But translation fails:
let responses = try await session.translations(from: requests)
// Error: TranslationErrorDomain Code=16 "Offline models not available"
Logs:
Language model installed: pt -> en
Language model installed: de -> en
Starting translation: de -> en
Error Domain=TranslationErrorDomain Code=16 "Translation failed"NSLocalizedFailureReason=Offline models not available for language pair
What I've tried:
Re-downloading languages in Settings
Using source: nil for auto-detection
Fresh TranslationSession.Configuration each time
Questions:
Is there a way to force model re-validation/re-download programmatically?
Should translationTask show download popup when Code 16 occurs?
Has anyone found a reliable workaround?
I've seen similar reports in threads 791357 and 777113. Any guidance appreciated!
Thanks!
Topic:
Machine Learning & AI
SubTopic:
General
Hi everyone,
I’m exploring ideas around on-device analysis of user typing behavior on iPhone, and I’d love input from others who’ve worked in this area or thought about similar problems.
Conceptually, I’m interested in things like:
High-level sentiment or tone inferred from what a user types over time using ML-models
Identifying a user’s most important or frequent topics over a recent window (e.g., “last week”)
Aggregated insights rather than raw text (privacy-preserving summaries: e.g., your typo-rate by hour to infer highly efficient time slots or "take-a-break" warning typing errors increase)
I understand the significant privacy restrictions around keyboard input on iOS, especially for third-party keyboards and system text fields. I’m not trying to bypass those constraints—rather, I’m curious about what’s realistically possible within Apple’s frameworks and policies. (For instance, Grammarly as a correction tool includes some information about tone)
Questions I’m thinking through:
Are there any recommended approaches for on-device text analysis that don’t rely on capturing raw keystrokes?
Has anyone used NLP / Core ML / Natural Language successfully for similar summarization or sentiment tasks, scoped only to user-explicit input?
For custom keyboards, what kinds of derived or transient signals (if any) are acceptable to process and summarize locally?
Any design patterns that balance usefulness with Apple’s privacy expectations?
If you’ve built something adjacent—journaling, writing analytics, well-being apps, etc.—I’d appreciate hearing what worked, what didn’t, and what Apple reviewers were comfortable with.
Thanks in advance for any ideas or references 🙏
Topic:
Machine Learning & AI
SubTopic:
General
After watching the What's new in App Intents session I'm attempting to create an intent conforming to URLRepresentableIntent. The video states that so long as my AppEntity conforms to URLRepresentableEntity I should not have to provide a perform method . My application will be launched automatically and passed the appropriate URL.
This seems to work in that my application is launched and is passed a URL, but the URL is in the form: FeatureEntity/{id}.
Am I missing something, or is there a trick that enables it to pass along the URL specified in the AppEntity itself?
struct MyExampleIntent: OpenIntent, URLRepresentableIntent {
static let title: LocalizedStringResource = "Open Feature"
static var parameterSummary: some ParameterSummary {
Summary("Open \(\.$target)")
}
@Parameter(title: "My feature", description: "The feature to open.")
var target: FeatureEntity
}
struct FeatureEntity: AppEntity {
// ...
}
extension FeatureEntity: URLRepresentableEntity {
static var urlRepresentation: URLRepresentation {
"https://myurl.com/\(.id)"
}
}
Hi,
One can configure the languages of a (VN)RecognizeTextRequest with either:
.automatic: language to be detected
a specific language, say Spanish
If the request is configured with .automatic and successfully detects Spanish, will the results be exactly equivalent compared to a request made with Spanish set as language?
I could not find any information about this, and this is very important for the core architecture of my app.
Thanks!
Hi everyone! 👋
I'm working on a C++ project using TensorFlow Lite and was wondering if anyone has a prebuilt TensorFlow Lite C++ library (libtensorflowlite) for macOS (Apple Silicon M1/M2) that they’d be willing to share.
I’m looking specifically for the TensorFlow Lite C++ API — something that lets me use tflite::Interpreter, tflite::FlatBufferModel, etc. Building it from source using Bazel on macOS has been quite challenging and time-consuming, so a ready-to-use .dylib or .a build along with the required headers would be incredibly helpful.
TensorFlow Lite version: v2.18.0 preferred
Target: macOS arm64 (Apple Silicon)
What I need:
libtensorflowlite.dylib or .a
Corresponding headers (ideally organized in a clean include/ folder)
If you have one available or know where I can find a reliable prebuilt version, I’d be super grateful. Thanks in advance! 🙏
The WWDC25: Explore large language models on Apple silicon with MLX video talks about using your own data to fine-tune a large language model. But the video doesn't explain what kind of data can be used. The video just shows the command to use and how to point to the data folder. Can I use PDFs, Word documents, Markdown files to train the model? Are there any code examples on GitHub that demonstrate how to do this?
WWDC25: Combine Metal 4 machine learning and graphics
Demonstrated a way to combine neural network in the graphics pipeline directly through the shaders, using an example of Texture Compression. However there is no mention of using which ML technique texture is compressed.
Can anyone point me to some well known model/s for this particular use case shown in WWDC25.
I have a question. In China, long pressing a picture in the album can segment the target. Is this model a local model? Is there any information? Can developers use it?
Environment
MacOC 26
Xcode Version 26.0 beta 7 (17A5305k)
simulator: iPhone 16 pro
iOS: iOS 26
Problem
NLContextualEmbedding.load() fails with the following error
In simulator
Failed to load embedding from MIL representation: filesystem error: in create_directories: Permission denied ["/var/db/com.apple.naturallanguaged/com.apple.e5rt.e5bundlecache"]
filesystem error: in create_directories: Permission denied ["/var/db/com.apple.naturallanguaged/com.apple.e5rt.e5bundlecache"]
Failed to load embedding model 'mul_Latn' - '5C45D94E-BAB4-4927-94B6-8B5745C46289'
assetRequestFailed(Optional(Error Domain=NLNaturalLanguageErrorDomain Code=7 "Embedding model requires compilation" UserInfo={NSLocalizedDescription=Embedding model requires compilation}))
in #Playground
I'm new to this embedding model. Not sure if it's caused by my code or environment.
Code snippet
import Foundation
import NaturalLanguage
import Playgrounds
#Playground {
// Prefer initializing by script for broader coverage; returns NLContextualEmbedding?
guard let embeddingModel = NLContextualEmbedding(script: .latin) else {
print("Failed to create NLContextualEmbedding")
return
}
print(embeddingModel.hasAvailableAssets)
do {
try embeddingModel.load()
print("Model loaded")
} catch {
print("Failed to load model: \(error)")
}
}
We are developing Apple AI for overseas markets and adapting it for iPhone 17 and later models. When the system language and Siri language do not match—such as the system being in English while Siri is in Chinese—it may result in Apple AI being unusable. So, I would like to ask, how can this issue be resolved, and are there other reasons that might cause it to be unusable within the app?
It is vital for Apple to refine its OCR models to correctly distinguish between Khmer and Thai scripts. Incorrectly labeling Khmer text as Thai is more than a technical bug; it is a culturally insensitive error that impacts national identity, especially given the current geopolitical climate between Cambodia and Thailand. Implementing a more robust language-detection threshold would prevent these harmful misidentifications.
There is a significant logic flaw in the VNRecognizeTextRequest language detection when processing Khmer script. When the property automaticallyDetectsLanguage is set to true, the Vision framework frequently misidentifies Khmer characters as Thai.
While both scripts share historical roots, they are distinct languages with different alphabets. Currently, the model’s confidence threshold for distinguishing between these two scripts is too low, leading to incorrect OCR output in both developer-facing APIs and Apple’s native ecosystem (Preview, Live Text, and Photos).
import SwiftUI
import Vision
class TextExtractor {
func extractText(from data: Data, completion: @escaping (String) -> Void) {
let request = VNRecognizeTextRequest { (request, error) in
guard let observations = request.results as? [VNRecognizedTextObservation] else {
completion("No text found.")
return
}
let recognizedStrings = observations.compactMap { observation in
let str = observation.topCandidates(1).first?.string
return "{text: \(str!), confidence: \(observation.confidence)}"
}
completion(recognizedStrings.joined(separator: "\n"))
}
request.automaticallyDetectsLanguage = true // <-- This is the issue.
request.recognitionLevel = .accurate
let handler = VNImageRequestHandler(data: data, options: [:])
DispatchQueue.global(qos: .background).async {
do {
try handler.perform([request])
} catch {
completion("Failed to perform OCR: \(error.localizedDescription)")
}
}
}
}
Recognizing Khmer
Confidence Score is low for Khmer text. (The output is in Thai language with low confidence score)
Recognizing English
Confidence Score is high expected.
Recognizing Thai
Confidence Score is high as expected
Issues on Preview, Photos
Khmer text
Copied text
Kouk Pring Chroum Temple [19121 รอาสายสุกตีนานยารรีสใหิสรราภูชิตีนนสุฐตีย์ [รุก
เผือชิษาธอยกัตธ์ตายตราพาษชาณา ถวเชยาใบสราเบรถทีมูสินตราพาษชาณา ทีมูโษา เช็ก
อาษเชิษฐอารายสุกบดตพรธุรฯ ตากร"สุก"ผาตากรธกรธุกเยากสเผาพศฐตาสาย รัอรณาษ"ตีพย"
สเผาพกรกฐาภูชิสาเครๆผู:สุกรตีพาสเผาพสรอสายใผิตรรารตีพสๆ เดียอลายสุกตีน
ธาราชรติ ธิพรหณาะพูชุบละเาหLunet De Lajonquiere ผารูกรสาราพารผรผาสิตภพ ตารสิทูก ธิพิ
คุณที่นสายเระพบพเคเผาหนารเกะทรนภาษเราภุพเสารเราษทีเลิกสญาเราหรุฬารชสเกาก เรากุม
สงสอบานตรเราะากกต่ายภากายระตารุกเตียน
Recommended Solutions
1. Set a Threshold
Filter out the detected result where the threshold is less than or equal to 0.5, so that it would not output low quality text which can lead to the issue.
For example,
let recognizedStrings = observations.compactMap { observation in
if observation.confidence <= 0.5 {
return nil
}
let str = observation.topCandidates(1).first?.string
return "{text: \(str!), confidence: \(observation.confidence)}"
}
2. Add Khmer Language Support
This issue would never happen if the model has the capability to detect and recognize image with Khmer language.
Doc2Text GitHub: https://github.com/seanghay/Doc2Text-Swift
Hello All,
I’m working on a computer-vision–heavy iOS application that uses the camera, LiDAR depth maps, and semantic segmentation to reason about the environment (object identification, localization and measurement - not just visualization).
Current architecture
I initially built the image pipeline around CIImage as a unifying abstraction. It seemed like a good idea because:
CIImage integrates cleanly with Vision, ARKit, AVFoundation, Metal, Core Graphics, etc.
It provides a rich set of out-of-the-box transforms and filters.
It is immutable and thread-safe, which significantly simplified concurrency in a multi-queue pipeline.
The LiDAR depth maps, semantic segmentation masks, etc. were treated as CIImages, with conversion to CVPixelBuffer or MTLTexture only at the edges when required.
Problem
I’ve run into cases where Core Image transformations do not preserve numeric fidelity for non-visual data.
Example:
Rendering a CIImage-backed segmentation mask into a larger CVPixelBuffer can cause label values to change in predictable but incorrect ways.
This occurs even when:
using nearest-neighbor sampling
disabling color management (workingColorSpace / outputColorSpace = NSNull)
applying identity or simple affine transforms
I’ve confirmed via controlled tests that:
Metal → CVPixelBuffer paths preserve values correctly
CIImage → CVPixelBuffer paths can introduce value changes when resampling or expanding the render target
This makes CIImage unsafe as a source of numeric truth for segmentation masks and depth-based logic, even though it works well for visualization, and I should have realized this much sooner.
Direction I’m considering
I’m now considering refactoring toward more intent-based abstractions instead of a single image type, for example:
Visual images: CIImage (camera frames, overlays, debugging, UI)
Scalar fields: depth / confidence maps backed by CVPixelBuffer + Metal
Label maps: segmentation masks backed by integer-preserving buffers (no interpolation, no transforms)
In this model, CIImage would still be used extensively — but primarily for visualization and perceptual processing, not as the container for numerically sensitive data.
Thread safety concern
One of the original advantages of CIImage was that it is thread-safe by design, and that was my biggest incentive.
For CVPixelBuffer / MTLTexture–backed data, I’m considering enforcing thread safety explicitly via:
Swift Concurrency (actor-owned data, explicit ownership)
Questions
For those may have experience with CV / AR / imaging-heavy iOS apps, I was hoping to know the following:
Is this separation of image intent (visual vs numeric vs categorical) a reasonable architectural direction?
Do you generally keep CIImage at the heart of your pipeline, or push it to the edges (visualization only)?
How do you manage thread safety and ownership when working heavily with CVPixelBuffer and Metal? Using actor-based abstractions, GCD, or adhoc?
Are there any best practices or gotchas around using Core Image with depth maps or segmentation masks that I should be aware of?
I’d really appreciate any guidance or experience-based advice. I suspect I’ve hit a boundary of Core Image’s design, and I’m trying to refactor in a way that doesn't involve too much immediate tech debt, remains robust and maintainable long-term.
Thank you in advance!
I'm playing with the new Vision API for iOS18, specifically with the new CalculateImageAestheticsScoresRequest API.
When I try to perform the image observation request I get this error:
internalError("Error Domain=NSOSStatusErrorDomain Code=-1 \"Failed to create espresso context.\" UserInfo={NSLocalizedDescription=Failed to create espresso context.}")
The code is pretty straightforward:
if let image = image {
let request = CalculateImageAestheticsScoresRequest()
Task {
do {
let cgImg = image.cgImage!
let observations = try await request.perform(on: cgImg)
let description = observations.description
let score = observations.overallScore
print(description)
print(score)
} catch {
print(error)
}
}
}
I'm running it on a M2 using the simulator.
Is it a bug? What's wrong?