AI Inference Pricing

Basic Tier

  • Usage: Up to 60 requests per minute for free users, and up to 600 requests per minute for users who deposit a minimum of $10 into their account. If you require a higher rate limit, contact [email protected].
    • Important note: Each source IP address is limited to 100 requests per minute to mitigate DDoS attacks.
  • Features: Access to text-to-text, text-to-speech, text-to-image, and text-to-video models, as well as fine-tuning services.
  • Cost:
    • $10 credit for free trial
    • Text-to-text:
      • Instruct models:
        • Llama 3.1 8B (BF16): $0.1 per 1M tokens
        • Llama 3.1 70B (BF16): $0.4 per 1M tokens
        • Llama 3 70B (BF16): $0.4 per 1M tokens
        • Hermes 3 70B (BF16): $0.4 per 1M tokens
        • Llama 3.1 405B (BF16): $4 per 1M tokens
        • DeepSeek V2.5 (BF16): $2 per 1M tokens
        • Qwen 2.5 72B (BF16): $0.4 per 1M tokens
        • Llama 3.2 3B (BF16): $0.1 per 1M tokens
      • Base models:
        • Llama 3.1 405B Base (FP8): $2 per 1M tokens
        • Llama 3.1 405B Base (BF16): $4 per 1M tokens
        • Llama 3.2 90B Vision Base (BF16): $2 per 1M tokens
    • Text-to-image:
      • Pricing Formula: Base Rate * (width / 1024) * (height / 1024) * (steps / 25), in dollars per image (a worked Python example follows this list)
      • How it works:
        • Base Rate for the SD family and Flux Dev: $0.01, the cost of generating a standard 1024x1024 image with 25 steps
        • Base Rate for Flux Pro: $0.05, the cost of generating a standard 1024x1024 image with 50 steps
        • Image Size Adjustment: The price scales with the width and height of the image. For example:
          • 512x512 pixels: Base Rate * (512/1024) * (512/1024) * (steps/25), a 0.25x size multiplier
          • 2048x2048 pixels: Base Rate * (2048/1024) * (2048/1024) * (steps/25), a 4x size multiplier
        • Step Adjustment: The price also scales with the number of steps used to generate the image:
          • 25 steps: multiplier of 1 (steps/25 = 25/25 = 1)
          • 50 steps: multiplier of 2 (steps/25 = 50/25 = 2)
    • VLMs (vision-language models):
      • Pixtral 12B (BF16): $0.1 per 1M tokens
      • Qwen2-VL-7B-Instruct (BF16): $0.1 per 1M tokens
      • Qwen2-VL-72B-Instruct (BF16): $0.4 per 1M tokens
      • Llama-3.2-90B-Vision-Instruct (BF16): $0.4 per 1M tokens
    • Text-to-speech:
      • $5.00 per 1M characters
    • Text-to-video:
      • Price TBD
  • Purpose: Caters to startups and small to medium-sized enterprises that need higher throughput and advanced features.
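
To make the image-pricing formula concrete, here is a minimal sketch of the math in Python. The base rates and size/step adjustments come from the list above; treating Flux Pro's 50-step standard image as its own step baseline (instead of the generic 25-step divisor) is our reading of the base-rate definitions, flagged in a comment, so confirm it against your actual invoices.

```python
# Sketch of the Basic Tier text-to-image pricing:
#   price = base_rate * (width / 1024) * (height / 1024) * (steps / baseline_steps)
# ASSUMPTION: Flux Pro's "standard image" is defined at 50 steps above, so we
# divide its step count by 50; the SD family and Flux Dev use the 25-step baseline.

BASE_RATES = {
    "sd-family": (0.01, 25),  # $0.01 per 1024x1024 image at 25 steps
    "flux-dev": (0.01, 25),   # same baseline as the SD family
    "flux-pro": (0.05, 50),   # $0.05 per 1024x1024 image at 50 steps
}

def image_price(model: str, width: int, height: int, steps: int) -> float:
    """Dollar price of one generated image under the formula above."""
    base_rate, baseline_steps = BASE_RATES[model]
    return base_rate * (width / 1024) * (height / 1024) * (steps / baseline_steps)

print(image_price("sd-family", 1024, 1024, 25))  # 0.01   (the standard image)
print(image_price("sd-family", 512, 512, 25))    # 0.0025 (0.25x size multiplier)
print(image_price("sd-family", 2048, 2048, 50))  # 0.08   (4x size, 2x steps)
```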

Enterprise Tier

  • Usage: Unlimited requests
  • Features: Full suite of AI models, dedicated support, custom SLAs, and dedicated instances.
  • Dedicated Instances:
    • Hourly Hosting Fee:
      • H100 SXM: $3.20
      • H100 PCIe: $3.00
      • A100 SXM: $1.80
      • A100 PCIe: $1.60
      • RTX 3090: $0.30
      • RTX 4090: $0.50
  • Custom Model Hosting: Host and optimize your custom AI models for higher throughput and lower latency.
  • Purpose: Serve large enterprises with substantial and specific requirements, offering them scalability and dedicated resources.

Notes

Understanding FP8 vs BF16

When choosing between FP8 and BF16 for model inference, it’s about balancing speed, precision, and cost.

BF16 (16-bit Brain Floating Point):

BF16 is the precision at which these models were originally trained, so it retains the most accuracy. That makes it the right choice for tasks where precision is critical, such as medical diagnostics or scientific research. With BF16 you get reliable results without compromising speed, though at a somewhat higher cost.

FP8 (8-bit Floating Point):

FP8 is all about efficiency. Because values are stored in half as many bits as BF16, FP8 inference is faster and cheaper to serve, at the cost of some precision. That makes it ideal for high-throughput applications where speed matters more than exactness, letting you scale at a lower cost.

In Summary:

FP8 is your go-to for cost-effective scaling with speed. BF16 is for when precision can’t be compromised, even if it means a bit more on the price tag. It’s all about what your application demands.
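
To make the precision gap tangible, the snippet below rounds the same float32 value to BF16 and FP8 using the open-source ml_dtypes package (an assumption of convenience: any library exposing these dtypes would do). It illustrates the numerics only, not how the inference service itself quantizes models.

```python
# BF16 keeps 8 exponent bits and 7 mantissa bits; FP8 (the e4m3 variant,
# common for inference) keeps 4 exponent bits and only 3 mantissa bits.
import numpy as np
import ml_dtypes  # pip install ml-dtypes

x = np.array([3.14159265], dtype=np.float32)

bf16 = x.astype(ml_dtypes.bfloat16)
fp8 = x.astype(ml_dtypes.float8_e4m3fn)

print(float(bf16[0]))  # 3.140625 -- small rounding error
print(float(fp8[0]))   # 3.25     -- noticeably coarser
```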

Understanding Base vs. Instruct Models

When deciding between base models and instruct models, it’s about understanding the level of guidance and the complexity of the tasks you need to solve.

Base Models:

Base models are the foundation. They are pretrained on a broad range of data to predict the next token, so they can continue almost any text, but they have no instruction-following behavior baked in. Think of them as powerful raw material that can be shaped, via prompting or fine-tuning, into whatever you need.

Instruct Models:

Instruct models take things a step further. They are base models that have been fine-tuned on instruction-response pairs, so when you give an instruct model a command, it responds to the command directly rather than merely continuing the text. It's like having a model that already understands how to handle specific tasks right out of the box.

In Summary:

Base models offer flexibility and can be adapted to many uses. Instruct models are pre-trained to follow instructions directly, making them ideal for tasks where you want immediate, accurate responses.
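
In practice, the difference shows up in how you call each model. The sketch below assumes an OpenAI-compatible endpoint and uses a placeholder base URL and placeholder model names (none of these identifiers are confirmed parts of this service): the instruct model takes structured chat messages, while the base model takes raw text to continue.

```python
# Sketch only: the base URL, API key, and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

# Instruct model: send a chat message; the model follows the instruction.
chat = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # placeholder identifier
    messages=[{"role": "user", "content": "Summarize BF16 vs FP8 in one line."}],
)
print(chat.choices[0].message.content)

# Base model: send raw text; the model simply continues it.
completion = client.completions.create(
    model="llama-3.1-405b-base",  # placeholder identifier
    prompt="The main difference between BF16 and FP8 is",
    max_tokens=40,
)
print(completion.choices[0].text)
```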

Invoicing

If you require an invoice for the compute credits you've purchased, please email [email protected] with the necessary invoicing details.