Run LLM from Linux Command Line
The .gguf format is typically used with the llama.cpp project and its Python bindings, not with the Hugging Face Transformers library. To use a .gguf file, you need a different library, such as llama-cpp-python.

Install the dependencies:

{pre}
python3 -m pip install --upgrade pip
pip install transformers
pip install torch
pip install llama-cpp-python
{/pre}

Python script (save it as llm_script.py):

{pre}
from llama_cpp import Llama

model_path = "/home/x/Downloads/Lexi-Llama-3-8B-Uncensored_Q8_0.gguf"

# Load the model
llm = Llama(model_path=model_path)

# Initialize conversation history
conversation = []

print("Welcome! Type 'exit' to end the conversation.")

while True:
    # Get user input
    user_input = input("You: ").strip()

    # Check if the user wants to exit
    if user_input.lower() == 'exit':
        print("Goodbye!")
        break

    # Add user input to the conversation history
    conversation.append(f"Human: {user_input}")

    # Construct the prompt from the conversation history
    prompt = "\n".join(conversation) + "\nAI:"

    # Generate a response
    response = llm(prompt, max_tokens=200, stop=["Human:", "\n"])

    # Extract and print the response
    ai_response = response['choices'][0]['text'].strip()
    print("AI:", ai_response)

    # Add the AI response to the conversation history
    conversation.append(f"AI: {ai_response}")
{/pre}

Run it:

{pre}
python llm_script.py
{/pre}

To load a Hugging Face Transformers model directly instead:

{pre}
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "internlm/internlm2-chat-7b"  # Hugging Face model ID

# InternLM2 ships custom modelling code, so trust_remote_code is required
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Initialize conversation history
conversation = []

print("Welcome! Type 'exit' to end the conversation.")

while True:
    # Get user input
    user_input = input("You: ").strip()

    # Check if the user wants to exit
    if user_input.lower() == 'exit':
        print("Goodbye!")
        break

    # Add user input to the conversation history
    conversation.append(f"Human: {user_input}")

    # Construct the prompt from the conversation history
    prompt = "\n".join(conversation) + "\nAI:"

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate a response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens (more reliable than slicing
    # the decoded string by the prompt length)
    generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    ai_response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
    print("AI:", ai_response)

    # Add the AI response to the conversation history
    conversation.append(f"AI: {ai_response}")
{/pre}

Possible improvements:
* Memory management: as the conversation grows, implement a sliding window or summarization technique to keep the context within the model's token limit (see the sketch after this list).
* Error handling: add try-except blocks to handle potential errors, especially for long-running sessions.
* Saving conversations: save the conversation to a file for later review.
* Model parameters: experiment with different values for temperature, top_p, and max_new_tokens to find the best balance of coherence and creativity.
* Prompt engineering: refine the prompt structure to potentially improve the model's responses. For example, include a system message at the beginning of each prompt to set the AI's behavior.
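A minimal sketch of the memory-management and conversation-saving ideas above. The helper names (trim_conversation, save_conversation), the MAX_TURNS value, and the log file path are illustrative assumptions, not part of either library; adjust them to your model's context size.

{pre}
# Illustrative sketch only: keep the prompt inside the model's context
# window by sending just the most recent turns, and log the full history.
MAX_TURNS = 20  # assumed limit; one turn = one "Human:" or "AI:" line

def trim_conversation(conversation, max_turns=MAX_TURNS):
    """Return only the last max_turns entries of the history."""
    return conversation[-max_turns:]

def save_conversation(conversation, path="conversation_log.txt"):
    """Append the current history to a plain-text log for later review."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n".join(conversation) + "\n---\n")

# Inside either chat loop, build the prompt from the trimmed history:
#     prompt = "\n".join(trim_conversation(conversation)) + "\nAI:"
# and call save_conversation(conversation) after each exchange.
{/pre}

Trimming by whole turns is crude; counting tokens with the model's tokenizer, or summarizing older turns, keeps more useful context within the same limit.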