Key Takeaways

  1. Privacy First: Confidential code and private diaries stay on your own machine.

  2. Mac Studio (Unified Memory): The most practical way to break through the VRAM wall. With 192GB of memory, Mixtral 8x22B and a quantized Llama 3 70B run comfortably.

  3. Jetson Orin AGX: The pinnacle of embedded AI. At around 60W, it is the pick if you want an always-on agent (your own JARVIS).

  4. Ollama: Start an LLM with a single command. Its OpenAI-compatible API means existing LangChain apps work as-is.

  5. RAG (Retrieval-Augmented Generation): Search your entire Notion or Obsidian vault and generate answers. Your data never leaves the machine.
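The RAG idea above can be sketched in a few lines: embed your notes, retrieve the most similar one, and stuff it into the prompt, all locally. This is a minimal sketch using a toy bag-of-words similarity; a real setup would use a proper embedding model (for example one served by Ollama), and the sample notes here are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. Real pipelines use a local embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, notes: list[str]) -> str:
    """Return the note most similar to the query; nothing leaves the machine."""
    return max(notes, key=lambda n: cosine(embed(query), embed(n)))


# Hypothetical sample vault:
notes = [
    "meeting notes about mounting the Jetson in the rack",
    "recipe for sourdough bread",
]
context = retrieve("how did we mount the Jetson?", notes)
prompt = f"Answer using this context:\n{context}\n\nQuestion: how did we mount the Jetson?"
```

The assembled `prompt` is then fed to the local model; the retrieval step is what keeps answers grounded in your own notes.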

Introduction: Limits of Cloud AI

ChatGPT is convenient, but you cannot feed it confidential company code or a private diary. There is also censorship: not just obviously dangerous requests like "how to make a bomb", but even edgy jokes get refused.

If you want true freedom (uncensored models), the only option is to buy your own hardware.

1. The VRAM King: Mac Studio (M2 Ultra)

Nvidia's consumer GPUs are powerful, but even the RTX 4090 has only 24GB of VRAM, which is not enough to run 70B-class models. Apple Silicon's Unified Memory architecture removes this bottleneck.

Apple Mac Studio (M2 Ultra)

Up to 192GB of unified memory, giving LLM inference capability comparable to two A100 80GB cards (hardware costing millions of yen). We now live in an era where this monster sits on your desk with near-inaudible fan noise.

Apple MLX Framework

Using the MLX framework, which is optimized for Apple Silicon, instead of going through PyTorch speeds up inference further. Because it drives Metal (the GPU) directly from Python, overhead is minimal.
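As a rough sketch, generating text with the `mlx-lm` package looks like this (assumes `pip install mlx-lm` on an Apple Silicon Mac; the model id is one example from the mlx-community hub, so check what is currently published there):

```python
# Example model repo on the mlx-community Hugging Face hub (an assumption; verify it exists):
MODEL = "mlx-community/Meta-Llama-3-70B-Instruct-4bit"


def generate_local(prompt: str, max_tokens: int = 256) -> str:
    """One generation pass on the local GPU via Metal, no PyTorch involved.

    The import is lazy because mlx_lm only installs on Apple Silicon.
    """
    from mlx_lm import load, generate

    model, tokenizer = load(MODEL)  # downloads weights on first call
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

The first call downloads the quantized weights; after that, everything runs offline.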

2. The Edge AI King: Nvidia Jetson Orin AGX

If you want something running 24/7, the Mac Studio's power draw becomes a concern. The Jetson Orin was developed as a robot brain, so it boasts overwhelming performance per watt.

NVIDIA Jetson AGX Orin Developer Kit

275 TOPS of AI compute in a palm-sized box. It runs Ubuntu, supports CUDA natively, and draws at most 60W. It slots into a rack as the resident AI of your home server (homelab).

3. Operations: Ollama & Open WebUI

No need to do inference in a bare terminal. Run Ollama as the backend, put Open WebUI (formerly Ollama WebUI) in front, and the experience looks exactly like ChatGPT.

# Start Llama 3
ollama run llama3:70b

This alone starts a local API server (localhost:11434). Point the endpoint of Windsurf or Cursor at it, and you can code fully offline.
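Talking to that local server from Python needs nothing beyond the standard library. A minimal sketch against Ollama's `/api/generate` endpoint (assumes `ollama run llama3:70b` is already running):

```python
import json

# Ollama's default local endpoint:
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3:70b") -> bytes:
    """Build the JSON body that Ollama's /api/generate endpoint expects."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()


def ask(prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    from urllib.request import Request, urlopen

    req = Request(
        OLLAMA_URL,
        data=build_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `stream` set to `False`, the server returns one JSON object whose `response` field holds the full completion; set it to `True` for token-by-token streaming.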

Deep Dive: Model Size and VRAM Calculation Formula

The VRAM (Unified Memory) required to run a 70B class model can be calculated with the following simplified formula:

Memory Consumption (GB) ≈ (Parameters [billions] × Quantization Bits / 8) × 1.2 (overhead)

Example: Running Llama 3 70B at 4-bit (Q4_K_M)
(70 * 4 / 8) * 1.2 = 42GB

This is why you want 128GB or more of memory in a Mac Studio (M2/M3 Ultra). Expanding the context length (to 128k tokens, etc.) needs an additional 10GB to 20GB for the KV cache, so leave headroom in your design.
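The formula above is easy to turn into a one-line helper for sizing a machine before you buy:

```python
def vram_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate memory (GB) for model weights: params * bits/8, plus ~20% overhead.

    KV cache for long contexts is NOT included; budget an extra 10-20GB for that.
    """
    return params_billions * bits / 8 * overhead


# Llama 3 70B at 4-bit quantization, matching the worked example above:
print(round(vram_gb(70, 4)))  # → 42
```

Running the same helper for a 4-bit 8B model gives under 5GB, which is why small models fit even on a laptop.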

Conclusion: Don’t Rent Brains

Cloud AI is a rental: you can use it as long as you pay the rent (the subscription), but you cannot remodel it. A local LLM is a house you own: repaint the walls, build extensions, and raise your own strongest assistant.

The initial investment is high, but it is cheap if you think of it as the price of freedom.