How to Use strtok and strtok_r in C

char *strtok(char *restrict str, const char *restrict delim);
char *strtok_r(char *restrict str, const char *restrict delim, char **restrict saveptr);

strtok and strtok_r are string tokenization functions in C's <string.h> library. Given a pointer to some string str and some delimiter delim, strtok will attempt to divide the string that str points to into a sequence of tokens delimited by delim. On each call, strtok will find the next delimiter in str, and return a null-terminated token preceding that delimiter. In the event that the string cannot be further tokenized, strtok will return NULL.

As a demonstrative example, let's consider a simple program that uses strtok to tokenize a comma-separated string:

#include <string.h>
#include <stdio.h>

int main() {
char s[16] = "A,B,C,D";
char* tok = strtok(s, ",");
while (tok != NULL) {
printf("%s\n", tok);
tok = strtok(NULL, ",");
}
return 0;
}
A
B
C
D

For those who immediately want to get started using strtok and strtok_r, you should keep these things in mind:

So how does strtok actually work?

strtok holds a static pointer in between calls that points to the first character of the un-tokenized portion of its original input. The actual behavior of the function depends on whether or not its first parameter is a pointer to an actual string or a NULL pointer:

As an example, the behavior of strtok on our previous example would be as follows:

im

The maintainence of this internal pointer has several important implications for the actual behavior of the function:

  1. strtok is not thread-safe. The pointer that is used is static, which creates a section of code that can be modified at the same time by multiple threads, causing undefined behavior in multithreaded code.
  2. Tokens extracted by strtok are adjacent in memory.
  3. Tokens extracted by strtok cannot be tokenized themselves until whole original string has been tokenized. strtok_r avoids this, and we will see why later.

What's the difference between strtok and strtok_r?

strtok_r is defined as the reentrant version of strtok, meaning that the function can be safely interrupted and then called again before the interrupted process can finish (for more details on this see here). The fact that strtok_r is explicitly reentrant is not of particular importance to us here, so we won't dive any further.

The main difference between strtok and strtok_r is that strtok_r contains an additional parameter called saveptr that behaves exactly the same as the internal static pointer of strtok. Like the static pointer in strtok, saveptr allows the function to preserve the context of what has and hasn't been tokenize, but parameterizing this part of the function allows for multiple contexts at the same time. Practically speaking, this is useful when we have data where we want to break tokens into their own sub-tokens.

Example: Tokenizing IPV4 addresses

Let's say we want to build a simple (and somewhat silly) program in C where divide a list of IPV4 addresses into a list of numbers that make up those addresses. We use strtok_r to make the following program:

#include <string.h>
#include <stdio.h>

int main() {
char ips[] = "127.0.0.1, 192.0.2.1, 172.16.254.1";
char* saveptr1; // Initialize our save pointers to save the context of tokens
char* saveptr2;

char* tok = strtok_r(ips, ", ", &saveptr1);
while (tok != NULL) {
printf("Token: %s\n", tok);
char* subtok = strtok_r(tok, ".", &saveptr2);
while (subtok != NULL) { // Once we have an IP address, break it down into each dotted number
printf("Subtoken: %s\n", subtok);
subtok = strtok_r(NULL, ".", &saveptr2);
}
tok = strtok_r(NULL, ", ", &saveptr1);
}
return 0;
}
Token: 127.0.0.1
Subtoken: 127
Subtoken: 0
Subtoken: 0
Subtoken: 1
Token: 192.0.2.1
Subtoken: 192
Subtoken: 0
Subtoken: 2
Subtoken: 1
Token: 172.16.254.1
Subtoken: 172
Subtoken: 16
Subtoken: 254
Subtoken: 1

Here, we're leveraging the fact that strtok_r allows multiple context by creating a pointer that handles each IPV4 address in the list and one that handles each number per IPV4 address. If we used just strtok, our pointer to our original list would reset after every call to strtok on ips!

Common pitfalls

Using strtok vs. using strtok_r

Because strtok can only hold one internal pointer at a time, it is impossible to use strtok to tokenize two strings in parallel. To prove this, let's try to replicate the example that we used earlier, only using strtok instead of strtok_r:


#include <string.h>
#include <stdio.h>

int main() {
char ips[] = "127.0.0.1, 192.0.2.1, 172.16.254.1";

char* tok = strtok_r(ips, ", ");
while (tok != NULL) {
printf("Token: %s\n", tok);
char* subtok = strtok(tok, ".");
while (subtok != NULL) { // Once we have an IP address, break it down into each dotted number
printf("Subtoken: %s\n", subtok);
subtok = strtok(NULL, ".");
}
tok = strtok(NULL, ", ");
}
return 0;
Token: 127.0.0.1
Subtoken: 127
Subtoken: 0
Subtoken: 0
Subtoken: 1

Our implementation using strtok halts early! The first call to strtok on ips produces the token 127.0.0.1 and places a null byte at the comma delimiter. We then break that token into subtokens as normal. However, when we call strtok after processing our subtokens, it will return NULL because its internal reference is to the first token we extracted, which has already been processed, causing our program to finish earlier than it should. So, for situations where a string is composed of a hierarchy of tokens, we should use strtok_r instead of strtok.

Using strtok on constant or read-only strings

Recall that strtok will change anything matching delim to a null byte. So, if we try to use strtok on a string that exists in read-only memory, our program will likely segfault! Here's a brief example:

char* s = "This is in read-only memory!";
char* tok = strtok(s, " ");
Segmentation fault

The moment that strtok reaches the first space in s, it will immediately cause a segfault because it will attempt to write a null byte in the fourth (zero-indexed) position instead.

Further Reading

From the Systems Encyclopedia:

Outside readings: