The problem is that you want a high-level answer out of a low-parameter model. It doesn't matter how often you swap models if you stick to small ones; for that kind of quality you need the big models, like the ones they run in their data centers.
I've used 13B models with somewhat good results. I only tried Mixtral 8x7B once, and the responses it gave were amazing.
But that was with llama.cpp, offloading some layers to the GPU, and just the base model, no training.
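In case it helps, here's roughly what that offloading setup looks like with llama-cpp-python (the Python bindings for llama.cpp). The model filename and layer count below are just placeholders for whatever GGUF file and VRAM you actually have:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path is hypothetical; point it at your own GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=20,   # how many layers to offload to the GPU; tune to your VRAM
    n_ctx=2048,        # context window size
)

out = llm("Summarize the following note: ...", max_tokens=256)
print(out["choices"][0]["text"])
```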
Also, how did you connect the LLM to your notes? Did you train a LoRA? Use embeddings? Or were your notes just fed in via the context?
IIRC the last two end up in the same place, since embedding-based retrieval still has to paste the retrieved text into the prompt, so both are limited by the context size your model accepts, usually 2048 tokens. That might be enough for one chat about a single note, but not for large amounts of notes.
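If it's the embeddings route, the usual trick is to embed each note once, then at question time pick only the most similar notes and put just those into the prompt, so you stay under the context limit. A minimal sketch, assuming sentence-transformers; the model name, notes, and question are placeholders, not anything from your setup:

```python
# Rough sketch of embedding-based note retrieval
# (pip install sentence-transformers numpy).
import numpy as np
from sentence_transformers import SentenceTransformer

notes = ["note one text ...", "note two text ...", "note three text ..."]

model = SentenceTransformer("all-MiniLM-L6-v2")
note_vecs = model.encode(notes, normalize_embeddings=True)

def top_notes(question: str, k: int = 2) -> list[str]:
    """Return the k notes most similar to the question (cosine similarity)."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = note_vecs @ q                    # dot product of normalized vectors = cosine
    best = np.argsort(scores)[::-1][:k]
    return [notes[i] for i in best]

# Only the top matches go into the prompt, so the rest of your notes
# never have to fit in the 2048-token window.
question = "What did I write about X?"
context = "\n\n".join(top_notes(question))
prompt = f"Notes:\n{context}\n\nQuestion: {question}\nAnswer:"
```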