Unable to load a quantized Qwen 1.7B model on an iPhone SE 3

I am trying to benchmark whether the Qwen3 1.7B model can run on an iPhone SE 3 (4 GB RAM).

My core problem: even with weight quantization, the SE 3 is not able to load the model into memory.

What I've tried:

I am converting a PyTorch model to the Core ML format using coremltools. I have tried the following combinations of quantization and context length:

  • 8 bit + 1024
  • 8 bit + 2048
  • 4 bit + 1024
  • 4 bit + 2048

All of the above conversions use a dynamic input shape with a default of [1, 1], in the hope that memory for the full context length does not get allocated up front.

  • The 4-bit model is approximately 865 MB on disk
  • The 8-bit model is approximately 1.7 GB on disk
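As a sanity check on those file sizes, here is a rough back-of-the-envelope sketch (the 1.7 B parameter count is taken from the model name; the leftover overhead is my assumption):

```python
# Rough weight-only size estimate for a 1.7B-parameter model.
PARAMS = 1.7e9

def weight_bytes(bits_per_weight: float, params: float = PARAMS) -> float:
    """Bytes needed to store all weights at the given precision."""
    return params * bits_per_weight / 8

int4_gb = weight_bytes(4) / 1e9   # ~0.85 GB
int8_gb = weight_bytes(8) / 1e9   # ~1.70 GB

print(f"int4 weights: ~{int4_gb:.2f} GB")
print(f"int8 weights: ~{int8_gb:.2f} GB")
```

The observed 865 MB and 1.7 GB line up with these estimates (the small extra in the 4-bit case is plausibly quantization scales and layers left unquantized), so the on-disk sizes themselves look correct.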

During load:

  • With int4 quantization, memory spikes sharply during the initial load. Could this be because many operations are converted to int8 or fp16, since Core ML does not perform operations natively on int4?
  • With int8, the profiler shows memory staying well under 2 GB (only around 900 MB), but the model still fails to load, with the error below. 2 GB is the limit at which jetsam kills the app on the iPhone SE 3.
E5RT: Error(s) occurred compiling MIL to BNNS graph:
[CreateBnnsGraphProgramFromMIL]: BNNS Graph Compile: 
failed to preallocate file with error: No space left on device 
for path: /var/mobile/Containers/Data/Application/
5B8BB7D2-06A6-4BAE-A042-407B6D805E7C/Library/Caches
/com.tss.qwen3-coreml/
com.apple.e5rt.e5bundlecache/
23A341/<long key>.tmp.12586_4362093968.bundle/
H14.bundle/main/main_bnns/bnns_program.bnnsir
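One way to quantify the suspicion in the first bullet (a hypothetical scenario, not confirmed behavior: I am assuming here that Core ML materializes fp16 copies of the int4 weights while compiling/loading):

```python
# If int4 weights get expanded to fp16 (16-bit) during load/compile,
# the in-memory footprint inflates 4x relative to the file on disk.
int4_disk_gb = 0.865                       # observed 4-bit size on disk
fp16_expanded_gb = int4_disk_gb * (16 / 4)  # hypothetical expansion

jetsam_limit_gb = 2.0                      # approx. per-app limit on the SE 3

print(f"expanded fp16 footprint: ~{fp16_expanded_gb:.2f} GB")
print(f"exceeds jetsam limit: {fp16_expanded_gb > jetsam_limit_gb}")
```

A transient ~3.5 GB allocation would comfortably blow past the 2 GB jetsam limit, which would be consistent with a spike at load time even though the steady-state footprint afterwards is small.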

Some online sources have suggested activation quantization, but I am unsure whether that will have any impact on loading (since the spike happens during load, not inference).
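For scale, here is a rough fp16 KV-cache estimate. The architecture numbers (28 layers, 8 KV heads, head dim 128) are my assumption for Qwen3-1.7B, so treat the result as an order-of-magnitude sketch:

```python
# fp16 KV-cache size: 2 tensors (K and V) per layer, grown per token.
def kv_cache_bytes(seq_len: int,
                   layers: int = 28,       # assumed for Qwen3-1.7B
                   kv_heads: int = 8,      # assumed (GQA)
                   head_dim: int = 128,    # assumed
                   bytes_per_elem: int = 2 # fp16
                   ) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (1024, 2048):
    print(f"context {ctx}: ~{kv_cache_bytes(ctx) / 1e6:.0f} MB")
```

Even at 2048 tokens this is only a couple of hundred MB, which supports the intuition that activation/KV quantization is unlikely to fix a spike that happens before any tokens are processed.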

The model spec also suggests that no dequantization is happening (e.g., from 4-bit to fp16).

So I have a couple of questions:

  • Has anyone faced similar issues?
  • What could be the reasons for the temporary memory spike during LOAD?
  • What are approaches that can be adopted to deal with this issue?

Any help would be greatly appreciated. Thank you.
