Unable to load a quantized Qwen 1.7B model on an iPhone SE 3

I am trying to benchmark whether the Qwen3 1.7B model can run on an iPhone SE 3 (4 GB RAM).

My core problem: even with weight quantization, the SE 3 is not able to load the model into memory.

What I've tried:

I am converting a PyTorch model to the Core ML format using coremltools. I have tried the following combinations of quantization and context length:

  • 8 bit + 1024
  • 8 bit + 2048
  • 4 bit + 1024
  • 4 bit + 2048

All of the above conversions use a dynamic input shape with a default of [1, 1], in the hope that memory for the full context length does not get allocated up front.

  • The 4-bit model is approximately 865 MB on disk
  • The 8-bit model is approximately 1.7 GB on disk
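As a sanity check on those file sizes, here is a rough back-of-the-envelope sketch (the 1.7 B parameter count is taken from the model name; the leftover overhead is my assumption):

```python
# Rough weight-only size estimate for a 1.7B-parameter model.
PARAMS = 1.7e9

def weight_bytes(bits_per_weight: float, params: float = PARAMS) -> float:
    """Bytes needed to store all weights at the given precision."""
    return params * bits_per_weight / 8

int4_gb = weight_bytes(4) / 1e9   # ~0.85 GB
int8_gb = weight_bytes(8) / 1e9   # ~1.70 GB

print(f"int4 weights: ~{int4_gb:.2f} GB")
print(f"int8 weights: ~{int8_gb:.2f} GB")
```

The observed 865 MB and 1.7 GB line up with these estimates (the small extra in the 4-bit case is plausibly quantization scales and layers left unquantized), so the on-disk sizes themselves look correct.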

During load:

  • With int4 quantization, memory spikes sharply during the initial load. Could this be because many operations are converted to int8 or fp16, since Core ML does not perform operations natively on int4?
  • With int8, the profiler shows memory staying well under 2 GB (only around 900 MB), but the model still fails to load, with the error below. 2 GB is the limit at which jetsam kills the app on the iPhone SE 3.
E5RT: Error(s) occurred compiling MIL to BNNS graph:
[CreateBnnsGraphProgramFromMIL]: BNNS Graph Compile: 
failed to preallocate file with error: No space left on device 
for path: /var/mobile/Containers/Data/Application/
5B8BB7D2-06A6-4BAE-A042-407B6D805E7C/Library/Caches
/com.tss.qwen3-coreml/
com.apple.e5rt.e5bundlecache/
23A341/<long key>.tmp.12586_4362093968.bundle/
H14.bundle/main/main_bnns/bnns_program.bnnsir
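One way to quantify the suspicion in the first bullet (a hypothetical scenario, not confirmed behavior: I am assuming here that Core ML materializes fp16 copies of the int4 weights while compiling/loading):

```python
# If int4 weights get expanded to fp16 (16-bit) during load/compile,
# the in-memory footprint inflates 4x relative to the file on disk.
int4_disk_gb = 0.865                       # observed 4-bit size on disk
fp16_expanded_gb = int4_disk_gb * (16 / 4)  # hypothetical expansion

jetsam_limit_gb = 2.0                      # approx. per-app limit on the SE 3

print(f"expanded fp16 footprint: ~{fp16_expanded_gb:.2f} GB")
print(f"exceeds jetsam limit: {fp16_expanded_gb > jetsam_limit_gb}")
```

A transient ~3.5 GB allocation would comfortably blow past the 2 GB jetsam limit, which would be consistent with a spike at load time even though the steady-state footprint afterwards is small.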

Some online sources have suggested activation quantization, but I am unsure whether that will have any impact on loading (since the spike happens during load, not inference).
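For scale, here is a rough fp16 KV-cache estimate. The architecture numbers (28 layers, 8 KV heads, head dim 128) are my assumption for Qwen3-1.7B, so treat the result as an order-of-magnitude sketch:

```python
# fp16 KV-cache size: 2 tensors (K and V) per layer, grown per token.
def kv_cache_bytes(seq_len: int,
                   layers: int = 28,       # assumed for Qwen3-1.7B
                   kv_heads: int = 8,      # assumed (GQA)
                   head_dim: int = 128,    # assumed
                   bytes_per_elem: int = 2 # fp16
                   ) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (1024, 2048):
    print(f"context {ctx}: ~{kv_cache_bytes(ctx) / 1e6:.0f} MB")
```

Even at 2048 tokens this is only a couple of hundred MB, which supports the intuition that activation/KV quantization is unlikely to fix a spike that happens before any tokens are processed.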

The model spec also suggests that no dequantization is happening (e.g., from 4-bit to fp16).

So I have a couple of questions:

  • Has anyone faced similar issues?
  • What could be the reasons for the temporary memory spike during LOAD?
  • What are approaches that can be adopted to deal with this issue?

Any help would be greatly appreciated. Thank you.
