Developers using the Perplexity API often receive unexpected charges because they do not understand how tokens are counted. The Perplexity API charges based on the total number of tokens sent in a request and received in a response. Token counting includes the prompt, the system instructions, any context, and the generated answer. This article explains exactly which elements count toward your token usage, how to estimate costs before making a request, and how to reduce token consumption to control your bill.
Key Takeaways: Perplexity API Token Counting Rules
- Total tokens = prompt tokens + completion tokens: All text in the prompt and in the response counts toward your token total.
- System instructions and context are included: The system message, user message history, and any provided context all consume tokens.
- Model selection changes the token price: Different Perplexity models have different per-token rates, and some models use more tokens for the same text.
How Perplexity Counts Tokens in API Requests and Responses
The Perplexity API uses a tokenization system similar to other large language model APIs. A token is a unit of text that the model processes. One token is roughly four characters of English text. The API counts tokens in two parts: prompt tokens and completion tokens.
Prompt tokens include every character in the messages array you send. This includes the system role, user role, any assistant role messages you include for context, and the current user query. The API also counts any special tokens the model adds internally, such as message delimiters; you never write these yourself, but they are reflected in the reported prompt token count.
Completion tokens are the tokens the model generates in its response. The entire text of the answer, including any citations or formatting, is counted as completion tokens. The API returns the token counts in the response object under usage.prompt_tokens and usage.completion_tokens.
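A minimal sketch of reading those fields, assuming the OpenAI-compatible chat completions endpoint that Perplexity documents (the model name here is a placeholder; substitute one from the pricing page):

```python
import os
import requests

response = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
    json={
        "model": "sonar",  # placeholder; use a model from the pricing page
        "messages": [
            {"role": "system", "content": "Answer concisely."},
            {"role": "user", "content": "What is a token?"},
        ],
        "max_tokens": 200,
    },
    timeout=30,
)

# Both counts come back in the usage object described above.
usage = response.json()["usage"]
print("prompt tokens:    ", usage["prompt_tokens"])
print("completion tokens:", usage["completion_tokens"])
print("total tokens:     ", usage["prompt_tokens"] + usage["completion_tokens"])
```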
What Exactly Counts as a Token
Every character in the prompt is tokenized. Spaces, punctuation, and line breaks all count. A short, common word like “token” is typically a single token, while a longer word like “tokenization” may split into two or more tokens. The exact count depends on the model’s tokenizer, but all Perplexity models use tokenizers that give similar counts for English text.
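Perplexity does not publish a client-side tokenizer, so any local count is an estimate. Below is a sketch of the four-characters rule, with OpenAI’s tiktoken library as a rough cross-check, on the assumption (not a guarantee) that modern tokenizers give similar counts for English text:

```python
import tiktoken  # pip install tiktoken

def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters of English per token."""
    return max(1, len(text) // 4)

def cross_check_tokens(text: str) -> int:
    """Closer approximation with a general-purpose BPE tokenizer. This is
    not Perplexity's own tokenizer, so treat the result as an estimate,
    not the billed count."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

text = "Tokenization counts spaces, punctuation, and line breaks."
print(estimate_tokens(text), cross_check_tokens(text))
```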
What Does Not Count as a Token
The API does not count HTTP request headers, authentication tokens, or network overhead. Only the actual text content of the request and response is billed. Also, if you set a max_tokens parameter, the API will not generate beyond that limit, which caps the completion token count.
Steps to Calculate Token Usage Before Sending a Request
- Count the characters in your prompt: Use a character counter tool or your code editor, then divide the total character count by 4 to get an approximate token count. For example, 1000 characters is roughly 250 tokens.
- Add the system message and context: If you include a system message or previous conversation turns, count those characters as well. Sum the character counts across the entire messages array.
- Estimate the response length: Set a max_tokens value to cap the response. If you expect a short answer, set a low value like 200 tokens; for long answers, set 1000 or more.
- Multiply by the model's per-token price: Check the Perplexity pricing page for the specific model you use. Price prompt tokens and completion tokens separately, because the two may be billed at different rates, then add the results.
- Use the API's usage response to verify: After making a test request, read the usage field from the API response and compare the actual token counts to your estimates to improve future cost predictions. The sketch below puts these steps together.
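A minimal sketch of this calculation, with placeholder per-token prices (check the Perplexity pricing page for the real rates of your model):

```python
# Placeholder rates: substitute the real per-token prices for your model.
PROMPT_PRICE_PER_TOKEN = 1e-6      # hypothetical: $1 per million tokens
COMPLETION_PRICE_PER_TOKEN = 1e-6  # hypothetical: $1 per million tokens

def estimate_cost(messages: list[dict], max_tokens: int) -> tuple[int, int, float]:
    # Steps 1-2: every message in the array counts toward prompt tokens,
    # estimated at roughly 4 characters per token.
    prompt_chars = sum(len(m["content"]) for m in messages)
    prompt_tokens = prompt_chars // 4
    # Step 3: max_tokens caps the completion, so treat it as the worst case.
    completion_tokens = max_tokens
    # Step 4: price the two counts separately, then add.
    cost = (prompt_tokens * PROMPT_PRICE_PER_TOKEN
            + completion_tokens * COMPLETION_PRICE_PER_TOKEN)
    return prompt_tokens, completion_tokens, cost

messages = [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "Explain how API tokens are counted."},
]
print(estimate_cost(messages, max_tokens=200))
```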
Common Token Counting Mistakes and How to Avoid Them
“My bill is higher than expected even though I send short queries”
The most common cause is including large context in the system message or conversation history. If you send the entire conversation history with every request, the prompt token count grows rapidly; use a sliding window of recent messages instead of the full history, as in the sketch below. Keep the system message short as well, because the API counts it on every single request.
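One simple way to bound that growth, sketched here with an arbitrary cutoff you would tune per application:

```python
def sliding_window(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system message plus only the most recent turns, so the
    prompt token count stays roughly flat instead of growing with every
    request. keep_last is an arbitrary cutoff; tune it per application."""
    system = [m for m in messages if m["role"] == "system"][:1]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-keep_last:]
```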
“The API returns more completion tokens than my max_tokens setting”
The max_tokens parameter sets the maximum number of tokens the model can generate. However, the model may stop earlier due to a stop sequence or end-of-text token. If you see completion tokens exceeding max_tokens, you may have set the parameter incorrectly. Verify that you are passing max_tokens in the request body, not the headers.
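For reference, a sketch of where the parameter belongs in an OpenAI-style request payload (the model name is a placeholder):

```python
# Correct: max_tokens is a field of the JSON request body.
payload = {
    "model": "sonar",  # placeholder; use a model from the pricing page
    "messages": [{"role": "user", "content": "Summarize token counting."}],
    "max_tokens": 200,  # caps the completion at 200 tokens
}

# Incorrect: headers carry authentication and content type; a sampling
# parameter placed here will not be read as max_tokens.
# headers = {"max_tokens": "200"}
```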
“My token counts vary between requests with the same prompt”
The token count for the same prompt should be identical. If it varies, check whether you are including a timestamp, random user ID, or other dynamic content in the request. Dynamic content changes the token count. Also, ensure you are using the same model version across requests, as different model versions may tokenize text slightly differently.
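One debugging aid (a local sketch, not part of the API) is to fingerprint the serialized messages array; two requests with the same fingerprint should report the same prompt token count:

```python
import hashlib
import json

def prompt_fingerprint(messages: list[dict]) -> str:
    """Hash the exact request content. If two requests share a fingerprint
    but report different prompt_tokens, look at the model version rather
    than the prompt."""
    blob = json.dumps(messages, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]
```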
How Perplexity Token Counting Compares to OpenAI
| Item | Perplexity API | OpenAI API |
|---|---|---|
| Token counting method | Prompt tokens + completion tokens | Prompt tokens + completion tokens |
| System message included | Yes | Yes |
| Context history included | Yes | Yes |
| Per-token pricing | Varies by model; typically lower than OpenAI | Varies by model; typically higher than Perplexity |
| Token usage in response | usage.prompt_tokens and usage.completion_tokens | usage.prompt_tokens and usage.completion_tokens |
Understanding how Perplexity counts tokens helps you control API costs. Always check the usage field in the API response after each request. Reduce prompt size by trimming conversation history and removing unnecessary system instructions. Use the max_tokens parameter to limit response length. For applications with many users, implement a token budget per request to prevent runaway costs. Test with a small sample before scaling to production to confirm your cost estimates match actual usage.
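A minimal sketch of such a per-request budget, reusing the rough four-characters-per-token rule from this article (the 2000-token threshold is an arbitrary example):

```python
class TokenBudget:
    """Refuse to send requests whose estimated total would exceed a fixed
    per-request budget. The estimate uses the rough 4-characters-per-token
    rule, so treat the threshold as approximate."""

    def __init__(self, max_total_tokens: int = 2000):
        self.max_total_tokens = max_total_tokens

    def check(self, messages: list[dict], max_tokens: int) -> None:
        prompt_tokens = sum(len(m["content"]) for m in messages) // 4
        estimated_total = prompt_tokens + max_tokens
        if estimated_total > self.max_total_tokens:
            raise ValueError(
                f"estimated {estimated_total} tokens exceeds the "
                f"{self.max_total_tokens}-token budget"
            )

budget = TokenBudget(max_total_tokens=2000)
# Passes silently; raises ValueError if the estimate blows the budget.
budget.check([{"role": "user", "content": "Short question."}], max_tokens=200)
```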