Switching back and forth between caches? #1102

pikaro · 2024-01-18T17:41:11Z

pikaro
Jan 18, 2024

I'd like to inference on the same conversation twice - using the same model, but using different histories, parameters and system prompts. Each of these would build in order, so the model has two "personalities" that evaluate the prompt.

I already got the basic concept working in a variety of ways, but not by re-using the same model twice. I tried it like this:

class Model:
  def __init__(self):
    self.caches = {'A': LlamaRAMCache(), 'B': LlamaRAMCache()}
    self.model = Llama(...)
  def infer(self, person, message):
    self.model.set_cache(self.caches[person])
    self.model.create_chat_completion(...)

This works for a little bit, but it soon crashes as the context builds:

GGML_ASSERT: /private/var/folders/0n/sytk40_13jv299c8l6dtwm_40000gn/T/pip-install-2i8k1t79/llama-cpp-python_35ccd77964ed4a8d8e896e9d86d4a6e4/vendor/llama.cpp/llama.cpp:10430:
  ctx->logits.capacity() == logits_cap

That sounds like a cache size issue, but at that point the cache is only 100 MB and the default is 1024 MB.

Lastly, if I only use one of the personalities, it is much slower than leaving the cache unspecified - around half as fast. So I think this is using RAM, and with offload_kqv, it's normally using VRAM? There appears to be no way to specifically create a VRAM cache, though.

Unfortunately, I have very little idea what I'm doing here, so I'm probably missing something substantial - but it's not exactly easy to get a grip on all this... can someone maybe shed a little light on what's happening, and if my approach is fundamentally bad or impossible? Or maybe there are some obvious issues with the model config visible in the dump below? Thanks!

Platform: MacOS Sonoma, but Linux compatibility would be very important as well.

Log output (including from my application for a little context):

2024-01-14 19:34:17 - log_setup        - INFO     - Setting up logging at level info
2024-01-14 19:34:17 - Processor        - INFO     - [init] Initializing with language german in cache/nltk
[nltk_data] Downloading package punkt to cache/nltk...
[nltk_data]   Package punkt is already up-to-date!
2024-01-14 19:34:18 - Master           - INFO     - [init] Initialized master
2024-01-14 19:34:18 - Master           - INFO     - Starting LLM
2024-01-14 19:34:18 - Master           - INFO     - [set_llm] Adding model
2024-01-14 19:34:18 - Llm              - INFO     - [init] Loading model: nous-hermes-2-solar-10.7b.Q5_K_M.gguf
llama_model_loader: loaded meta data with 22 key-value pairs and 435 tensors from nous-hermes-2-solar-10.7b.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 48
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q5_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 34B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 10.73 B
llm_load_print_meta: model size       = 7.08 GiB (5.66 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.17 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  7245.97 MiB, ( 7246.03 / 21845.34)
llm_load_tensors: system memory used  = 7245.42 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/user/src/misc/diallm/venv/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   768.00 MiB, ( 8015.59 / 21845.34)
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 8015.61 / 21845.34)
llama_build_graph: non-view tensors processed: 1012/1012
llama_new_context_with_model: compute buffer total size = 291.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   288.02 MiB, ( 8303.61 / 21845.34)
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
2024-01-14 19:34:18 - Master           - INFO     - [set_llm] Added model
2024-01-14 19:34:18 - Llm              - INFO     - [init] IPC to text processor established
2024-01-14 19:34:18 - Llm              - INFO     - [init] Using seed: 2664199386
2024-01-14 19:34:18 - main             - INFO     - Starting iteration
2024-01-14 19:34:18 - Processor        - INFO     - [iterate] Iterating message of length 4976
2024-01-14 19:34:18 - Processor        - INFO     - [tokenize] Tokenizing message of length 4976
2024-01-14 19:34:18 - Llm              - INFO     - [init] Model loaded
2024-01-14 19:34:18 - Llm              - INFO     - [listen] Listening for messages
2024-01-14 19:34:18 - Llm              - INFO     - [listen] Waiting for text
2024-01-14 19:34:18 - Processor        - INFO     - [tokenize] Tokenized message into 49 parts
2024-01-14 19:34:18 - Processor        - INFO     - [iterate] Iterating part 1 of 49
Llama._create_completion: cache miss
2024-01-14 19:34:19 - Llm              - NOTICE   - [infer] [B] [iter0] Time for first token: 1.3395109176635742

llama_print_timings:        load time =    1337.28 ms
llama_print_timings:      sample time =       0.25 ms /     3 runs   (    0.08 ms per token, 12048.19 tokens per second)
llama_print_timings: prompt eval time =    1336.66 ms /   373 tokens (    3.58 ms per token,   279.05 tokens per second)
llama_print_timings:        eval time =      66.95 ms /     2 runs   (   33.47 ms per token,    29.87 tokens per second)
llama_print_timings:       total time =    1408.26 ms
Llama._create_completion: cache save
Llama.save_state: saving llama state
Llama.save_state: got state size: 853118932
Llama.save_state: allocated state
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    70.31 MiB, ( 8373.92 / 21845.34)
Llama.save_state: copied llama state: 121591224
Llama.save_state: saving 121591224 bytes of llama state
Llama._create_completion: cache saved
2024-01-14 19:34:20 - Llm              - INFO     - [infer] [B] [iter0] Finished inference loop
2024-01-14 19:34:20 - Llm              - INFO     - [listen] Waiting for text
2024-01-14 19:34:20 - Processor        - INFO     - [receive] [B] [iter0] Received message in 1.628 seconds
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
2024-01-14 19:34:21 - Llm              - NOTICE   - [infer] [A] [iter0] Time for first token: 0.9518940448760986
2024-01-14 19:34:21 - Llm              - INFO     - [infer] [A] [iter0] Finished inference loop
2024-01-14 19:34:21 - Llm              - INFO     - [listen] Waiting for text
2024-01-14 19:34:21 - Processor        - INFO     - [receive] [A] [iter0] Received message in 1.088 seconds
2024-01-14 19:34:21 - Processor        - INFO     - [iterate] Iterating part 2 of 49
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    70.31 MiB, ( 8373.92 / 21845.34)
Llama._create_completion: cache hit
Llama.generate: prefix-match hit
2024-01-14 19:34:21 - Llm              - NOTICE   - [infer] [B] [iter1] Time for first token: 0.44945502281188965

llama_print_timings:        load time =    1337.28 ms
llama_print_timings:      sample time =       0.25 ms /     3 runs   (    0.08 ms per token, 12096.77 tokens per second)
llama_print_timings: prompt eval time =     340.92 ms /    94 tokens (    3.63 ms per token,   275.73 tokens per second)
llama_print_timings:        eval time =      67.06 ms /     2 runs   (   33.53 ms per token,    29.82 tokens per second)
llama_print_timings:       total time =     517.67 ms
Llama._create_completion: cache save
Llama.save_state: saving llama state
Llama.save_state: got state size: 853118932
Llama.save_state: allocated state
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    88.31 MiB, ( 8391.92 / 21845.34)
Llama.save_state: copied llama state: 140465976
Llama.save_state: saving 140465976 bytes of llama state
Llama._create_completion: cache saved
2024-01-14 19:34:21 - Llm              - INFO     - [infer] [B] [iter1] Finished inference loop
2024-01-14 19:34:21 - Llm              - INFO     - [listen] Waiting for text
2024-01-14 19:34:21 - Processor        - INFO     - [receive] [B] [iter1] Received message in 0.698 seconds
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
2024-01-14 19:34:23 - Llm              - NOTICE   - [infer] [A] [iter1] Time for first token: 1.1416089534759521
2024-01-14 19:34:23 - Llm              - INFO     - [infer] [A] [iter1] Finished inference loop
2024-01-14 19:34:23 - Llm              - INFO     - [listen] Waiting for text
2024-01-14 19:34:23 - Processor        - INFO     - [receive] [A] [iter1] Received message in 1.279 seconds
2024-01-14 19:34:23 - Processor        - INFO     - [iterate] Iterating part 3 of 49
GGML_ASSERT: /private/var/folders/0n/sytk40_13jv299c8l6dtwm_40000gn/T/pip-install-2i8k1t79/llama-cpp-python_35ccd77964ed4a8d8e896e9d86d4a6e4/vendor/llama.cpp/llama.cpp:10430: ctx->logits.capacity() == logits_cap
Traceback (most recent call last):
  File "/Users/user/src/misc/diallm/./main.py", line 40, in <module>
    master.processor.iterate(message)
  File "/Users/user/src/misc/diallm/processor.py", line 82, in iterate
    resp = list(self.receive(nonce))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/src/misc/diallm/processor.py", line 61, in receive
    while (response := self.master.llm_pipe.recv())[1] is not None:
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
    raise EOFError
EOFError
2024-01-14 19:34:23 - Master           - INFO     - [destroy] Destroying master
2024-01-14 19:34:23 - Master           - INFO     - [destroy] Closing pipes
Exception ignored in atexit callback: <function master_cleanup at 0x123c98900>
Traceback (most recent call last):
  File "/Users/user/src/misc/diallm/./main.py", line 35, in master_cleanup
    master.destroy()
  File "/Users/user/src/misc/diallm/master.py", line 36, in destroy
    self.llm_pipe.send('exit')
  File "/opt/homebrew/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/opt/homebrew/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 427, in _send_bytes
    self._send(header + buf)
  File "/opt/homebrew/Cellar/[email protected]/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 384, in _send
    n = write(self._handle, buf)
        ^^^^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe

stduhpf · 2024-01-18T18:30:35Z

stduhpf
Jan 18, 2024

Looks like you are encontering something related to this issue: #997, which was marked as closed today.
Maybe you should try upgrading to the lastest release (pip install llama-cpp-python --upgrade) and see if it resolves your problem? (it woked for me)

1 reply

pikaro Jan 18, 2024
Author

Amazing, no idea why that didn't show up in my searches - it even has the same error message! Thanks a lot for pointing it out, that issue is solved. Unfortunately, it doesn't work as expected, doesn't seem to be caching at all once I have two responders (edit: with different system prompts) active... but now I can at least start trying to figure out if I have a bug somewhere.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switching back and forth between caches? #1102

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Switching back and forth between caches? #1102

pikaro Jan 18, 2024

Replies: 1 comment · 1 reply

stduhpf Jan 18, 2024

pikaro Jan 18, 2024 Author

pikaro
Jan 18, 2024

Replies: 1 comment 1 reply

stduhpf
Jan 18, 2024

pikaro Jan 18, 2024
Author