Accelerate Hugging Face CPU Inference with Optimum and ONNX
Hugging Face's Optimum library can accelerate CPU inference for transformers by converting models to ONNX format, enabling them to run on the ONNX Runti.
50 articles
Hugging Face Cross-Encoders can rerank your RAG results by treating the query and each retrieved document as a single input pair, allowing for a much mo.
Hugging Face's Trainer is surprisingly flexible when it comes to how it batches data, often making you think it's magic.
The Hugging Face datasets library can stream data larger than your RAM, but it doesn't actually stream data in the way you might expect; it streams pointers.
The Hugging Face datasets library is often thought of as just a way to download and use pre-made datasets, but its real power lies in its ability to eff.
Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are not just different flavors of fine-tuning; they represent a f.
The most surprising thing about semantic search is that it doesn't actually "understand" your query; it just finds text that is statistically similar in.
The evaluate library’s primary superpower is its ability to standardize and simplify model evaluation, letting you swap out metrics as easily as changin.
FlashAttention 2 doesn't just make attention faster; it fundamentally changes how attention is computed by fusing operations and optimizing memory acces.
GenerationConfig and sampling parameters are how you tell a Hugging Face transformers model how to generate text, not what text to generate.
The most surprising thing about loading GGUF quantized models with Hugging Face Transformers is that you're not actually using Hugging Face Transformers.
Gradient checkpointing lets you trade compute for memory by recomputing activations during the backward pass instead of storing them all during the forw.
You can access private and gated models on the Hugging Face Hub by generating an access token and using it to authenticate your requests.
Vision Transformers (ViTs) can learn to classify images with surprising effectiveness, even when trained on datasets much smaller than those typically use.
Hugging Face Inference Endpoints actually makes deploying models to production easier than running them locally in many cases.
The most surprising truth about fine-tuning LLMs for instruction following is that the model often doesn't "understand" instructions in the way humans d.
Fine-tuning Llama and Mistral models with Hugging Face TRL is surprisingly less about "teaching" the model and more about "guiding" its existing knowled.
Merge Hugging Face Models with MergeKit for Combined Capabilities: a practical guide covering Hugging Face setup, configuration, and troubleshooting with ...
Model cards are your chance to make your Hugging Face Model Hub submission shine, but getting them to pass review can feel like a black box.
Hugging Face Hub models aren't just static files; they're dynamic entities you can push to and pull from, effectively acting as a versioned, collaborati.
Hugging Face's accelerate library is your best friend here, and it's not just for distributed training; it's for inference too, and it does the heavy li.
LLaVA doesn't just understand images; it can actually reason about them in natural language. Let's see LLaVA in action, pulling it all together with Hug.
PEFT and LoRA allow you to fine-tune massive language models on consumer-grade hardware by only training a tiny fraction of the model's parameters.
The Hugging Face pipeline API is a black box that actually lets you run models locally without needing to understand PyTorch or TensorFlow.
Deploying Hugging Face models in an air-gapped environment is surprisingly straightforward once you understand the core constraint: no internet access.
QLoRA lets you fine-tune massive language models on consumer-grade GPUs by quantizing the frozen base model's weights to 4 bits and training small LoRA adapters on top.
The surprising truth about fine-tuning large language models is that you're not teaching it a new language, but rather how to perform a specific task wi.
Training a reward model from human preferences is a surprisingly effective way to align large language models with desired behaviors, even when those be.
The most surprising thing about safetensors is that it's not just about security; it's fundamentally a faster, more efficient way to serialize and deser.
Seq2Seq models don't actually "understand" language; they're just incredibly sophisticated pattern matchers that learn to map input sequences to output .
Gradio Spaces can host your Hugging Face models, turning them into interactive web demos with zero infrastructure management.
Speculative decoding in Hugging Face isn't just a performance trick; it's a fundamental shift in how we generate text, allowing models to "guess" ahead .
Stable Diffusion can run inference on a single consumer-grade GPU that costs under $500, making high-quality image generation accessible to anyone.
The most surprising thing about generating synthetic data with Hugging Face models is that you're not just creating more data, you're actively shaping t.
The Text Embeddings Inference (TEI) server can embed more text per second than you'd think, but not if you're doing it wrong. Let's see TEI in action.
Deploying large language models (LLMs) at scale often involves a surprisingly simple underlying principle: treat the model itself as a stateful service th.
Fine-tuning a Named Entity Recognition (NER) model isn't about teaching it new words; it's about teaching it a new context for recognizing existing entiti.
The Hugging Face tokenizers library is a high-performance tokenization library written in Rust with Python bindings, designed to be fast and flexible for modern NLP tasks.
The Hugging Face Trainer API is a surprisingly opinionated, yet incredibly flexible, tool for training PyTorch models, abstracting away v.
Fine-tuning a transformer on your own data is less about teaching it a new language and more about teaching it a new accent.
The most surprising thing about fine-tuning Whisper is that you don't actually need to fine-tune it at all to get it to understand your language better.
Zero-shot classification, in its current popular form, doesn't actually classify text; it finds the most relevant description of text from a predefined .
Training large models across multiple GPUs is a common bottleneck, and Hugging Face Accelerate is the go-to library for making this seamless.
The Hugging Face Hub isn't just a model repository; it's a dynamic registry where model architectures and their weights are versioned and linked, allowi.
Batch inference on Hugging Face models can be surprisingly tricky to optimize, and the most impactful gains often come from understanding how the model'.
BERT and Sentence Transformers can generate text embeddings, but the most surprising thing is that they don't actually "understand" text in the way huma.
Hugging Face models often boast impressive performance, but their sheer size can be a major hurdle for deployment, especially on resource-constrained ha.
Hugging Face's Trainer class can log directly to Weights & Biases, but it's not just a simple wandb.init call; the Trainer needs to be explicitly told .
Fine-tuning a Hugging Face model for production isn't about fitting more data into a pre-trained network; it's about strategically teaching a model to s.
Self-hosting Hugging Face models can dramatically slash LLM inference costs, but the real magic isn't just saving money; it's gaining control over your .