RAM Disk to speed up inference


Using a RAM disk to speed up inference.

Does not work.

LLM inference is compute-bound: token generation is limited by CPU/GPU throughput, not disk I/O, so a RAM disk only shortens the initial model load.
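
For reference, a minimal sketch of the setup (assuming Linux, where /dev/shm is a tmpfs RAM disk; the file paths are hypothetical). It times a raw read of the model file from disk versus from RAM, which is the only step a RAM disk accelerates:

```python
# Minimal sketch: time a raw read of the model file from disk vs. a
# tmpfs RAM disk. Assumes Linux (/dev/shm is tmpfs); paths are hypothetical.
import shutil
import time

DISK_PATH = "/models/model-14b-q8_0.gguf"  # hypothetical on-disk location
RAM_PATH = "/dev/shm/model-14b-q8_0.gguf"  # hypothetical RAM-disk location

shutil.copy(DISK_PATH, RAM_PATH)  # stage the weights in RAM

for path in (DISK_PATH, RAM_PATH):
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(64 * 1024 * 1024):  # stream in 64 MiB chunks
            pass
    print(f"{path}: {time.perf_counter() - start:.1f}s")
# Note: the OS page cache can mask the disk/RAM gap on a warm re-read,
# and either way token generation speed is unchanged.
```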

Generation runs at about 1 token per second, and longer responses take proportionally longer. A single inference comes back in about 1 or 2 minutes: I say "Hello," and the LLM returns the finished response a minute later. Too slow.

Tested with a 14B model at 8-bit quantization (Q8).
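
To put a number on the generation rate, a sketch like the following works, assuming llama-cpp-python as the runtime (the note does not name one) and a hypothetical GGUF path:

```python
# Hedged sketch: measure tokens/second with llama-cpp-python (assumed
# runtime). The model path is hypothetical; adjust for your setup.
import time

from llama_cpp import Llama

llm = Llama(model_path="/dev/shm/model-14b-q8_0.gguf")  # load from RAM disk

start = time.perf_counter()
out = llm("Hello", max_tokens=128)  # returns an OpenAI-style completion dict
elapsed = time.perf_counter() - start

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.2f} tokens/s")
```

Whether the weights sit on disk or in /dev/shm, the tokens/s figure comes out the same once the model is loaded; the bottleneck is compute.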
