C Tip: Safe String Manipulation

C-style strings are considered evil by hardcore C++ programmers. They love to talk down to those of us who still use such unsafe and archaic techniques. I can’t count the number of times I’ve seen someone ask for help with a C problem at a game development forum only to be flamed and lectured for not using C++. And woe unto those who do use C++ but haven’t seen fit to use std::string.

Sure, C-style strings are the source of many security holes in countless libraries and applications in the wild, but a cautious programmer can usually avoid such costly mistakes as buffer overruns. Everyone makes mistakes now and again, otherwise these bugs would never find their way into production code. But one way to alleviate such bugs with C strings is to wrap string operations in functions which use safe techniques. Perhaps the most common source of string security breaches is the family of string copying functions. This tip doesn’t look specifically at the string copying functions declared in string.h (like strcpy and its kin), but instead looks at some general issues with copying data from one string to another manually during string manipulation. Custom ToLower and ToUpper functions are implemented to demonstrate the problems.

The C standard library defines two functions in ctype.h to convert the case of a character: tolower() and toupper(). However, there are no functions to convert entire strings. To do that, you have to roll your own. A generic implementation is easy when converting strings in place:

Mini-Tip: When manipulating string function parameters via pointer dereference syntax (rather than array syntax), it’s a good habit to declare a new local variable to point to the parameter rather than manipulating the parameter itself. This allows the original parameter to point to the start of the string so that, later, if you decide to use the original parameter after manipulating it, you can. I never manipulate string params directly, even with small functions like this. Destination strings in a copy-on-write function are an exception to this rule, since they should be considered “empty” strings anyway.

Note: All example loops in this tip use C99 syntax.


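#include <ctype.h>

// tolower() and toupper() expect a value representable as unsigned char
// (or EOF), so cast first to keep negative char values well defined.
void String_ToLower(char *str)
{
	for(char *p = str; *p; ++p)
		*p = (char)tolower((unsigned char)*p);
}

void String_ToUpper(char *str)
{
	for(char *p = str; *p; ++p)
		*p = (char)toupper((unsigned char)*p);
}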

The above two functions are perfectly harmless (assuming, of course, that the given string is null-terminated, which we will). Trouble arises, though, when you start using copy-on-write semantics, i.e. rather than converting the characters in place, leave the original string untouched and place the converted characters in a new string.

To accomplish copy-on-write (COW), it’s tempting to write a function like this:



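#include <stdlib.h>
#include <string.h>

char *String_ToLowerCOW(char *str)
{
	// dst keeps pointing at the start of the block so it can be returned
	char *dst = (char*)malloc(strlen(str) + 1);
	if(!dst)
		return NULL;

	char *d = dst;
	for(char *p = str; *p; ++p, ++d)
		*d = (char)tolower((unsigned char)*p);

	// always null terminate the new string
	*d = 0;
	return dst;
}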

This is a safe implementation because the function is in control of the destination memory. The destination memory block will always be large enough to hold the required number of characters. The function also always null terminates the new string. You can’t get much safer than that. Unfortunately, this is not a very efficient implementation. One problem is that you are putting the onus on the caller to release the memory. With string manipulation functions, this is usually a bad idea. Most people would never expect that and, even if you explicitly documented it, would be apt to forget. Hello, memory leaks. Another problem is that if you call this function frequently, you are going to fragment your memory in no time.
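To make the ownership issue concrete, here is a minimal sketch of what every call site has to do when a string function allocates its own result. `LowerDup` is a hypothetical stand-in for the allocating version above, and `DemoCaller` is likewise an illustrative name; neither is part of the original code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an allocating ToLower: returns a new heap
 * string that the CALLER must free, or NULL if malloc fails. */
char *LowerDup(const char *src)
{
	size_t len = strlen(src);
	char *dst = (char*)malloc(len + 1);
	if(!dst)
		return NULL;

	for(size_t i = 0; i < len; ++i)
		dst[i] = (char)tolower((unsigned char)src[i]);
	dst[len] = 0;
	return dst;
}

/* What every call site ends up looking like. */
void DemoCaller(void)
{
	char *lower = LowerDup("My StRiNg");
	if(!lower)
		return;                /* allocation failed, nothing to free */

	assert(strcmp(lower, "my string") == 0);
	free(lower);               /* forget this line and the block leaks */
}
```

Every caller needs both the NULL check and the matching free, which is exactly the burden that is easy to forget.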

Memory fragmentation occurs from frequently allocating and deallocating memory and, ultimately, affects the performance of an application. This is exacerbated when the memory blocks are of varying sizes. Managing memory is relatively easy with structures of fixed sizes. You can create allocation pools, make sure your structs are aligned on certain boundaries, and more. This is because all structs of the same type are going to be the same size. You might have many different types of structs in one application, but by monitoring your usage you can develop a strategy that will reduce memory fragmentation. Strings are much more difficult to manage because they are notoriously unpredictable. String data is often pulled from external resources or user input, so you never know how large or small a string will be. It’s possible to write an allocator to specifically, and efficiently, allocate strings, but it’s extra work and not necessary for the general case. So while the function above is quite safe, it is only suitable for special case usage.
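For comparison, this is roughly what an allocation pool for fixed-size structs can look like. It is only a sketch under assumptions: the `Vec3` type, the `Pool_*` names and the tiny capacity are made up for illustration. Because every slot is the same size, a freed slot can always be reused for the next allocation, which is what keeps fragmentation in check.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 4          /* illustrative capacity */

typedef struct Vec3 { float x, y, z; } Vec3;   /* hypothetical payload */

/* A slot holds either a live Vec3 or, while free, a link to the next
 * free slot. */
typedef union PoolSlot
{
	union PoolSlot *next;
	Vec3 value;
} PoolSlot;

typedef struct Pool
{
	PoolSlot slots[POOL_BLOCKS];
	PoolSlot *freeList;
} Pool;

void Pool_Init(Pool *pool)
{
	/* chain every slot into the free list */
	for(size_t i = 0; i + 1 < POOL_BLOCKS; ++i)
		pool->slots[i].next = &pool->slots[i + 1];
	pool->slots[POOL_BLOCKS - 1].next = NULL;
	pool->freeList = &pool->slots[0];
}

Vec3 *Pool_Alloc(Pool *pool)
{
	PoolSlot *slot = pool->freeList;
	if(!slot)
		return NULL;           /* pool exhausted */
	pool->freeList = slot->next;
	return &slot->value;
}

void Pool_Free(Pool *pool, Vec3 *v)
{
	/* v must have come from Pool_Alloc on this same pool */
	assert(v != NULL);
	PoolSlot *slot = (PoolSlot*)v;
	slot->next = pool->freeList;
	pool->freeList = slot;
}
```

Nothing like this works directly for strings, since there is no single slot size that fits them all.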

In order to write an efficient COW version of any string manipulation function, you need to let the caller provide the destination memory block. If you look at function declarations in string.h, you can find examples of this technique. The problem is, once you put the responsibility into the caller’s hands, you open the door to buffer overruns and other nasty security bugs.

Let’s look at a first pass of a new version of String_ToLowerCOW:



void String_ToLowerCOW(char *dst, char *src)
{
	char *p = src;
	while(*p)
	{
		*dst = (char)tolower((unsigned char)*p);
		++p;
		++dst;
	}

	// null terminate the destination
	*dst = 0;
}

This implementation does the trick, but it is horribly unsafe. If the size of the destination buffer is smaller than the source, you will have yourself a perfect buffer overrun. Just imagine if you read some text from a file and pass it to this function with a destination buffer that is too small. That’s a script-kiddie’s wet dream come true. To correct this gaping black hole of an error, you might reimplement the function like this:



void String_ToLowerCOW(char *dst, size_t maxlen, char *src)
{
	// using array syntax since a counter variable is used anyway
	for(size_t i = 0; i < maxlen; ++i)
		dst[i] = tolower(src[i]);
}

Now this looks good, right? Not really. There are three potential problems. Two are pretty severe, but can be corrected. The other is something that can only be overcome with discipline. First, the severe problems.

The variable ‘maxlen’ indicates the maximum number of characters that can be copied from the source to the destination. You may want to copy the complete string or a portion of the string. Regardless, maxlen should always indicate the size of the destination buffer. But what happens if the source string is shorter than the destination buffer? Look out, you’re going to be reading beyond the source string into who knows what memory. Because you can’t always control where the source string comes from, this is a potential security risk.

The other severe problem is that ‘maxlen’ might be shorter than the source string. That may seem harmless on the surface, but what it really means is that when the function returns, the destination string will not be null-terminated. You could rely on the caller to handle that, but it’s best to put it in the function itself. The following function corrects both problems:



void String_ToLowerCOW(char *dst, size_t maxlen, char *src)
{
	// nothing can be stored, not even the terminator
	if(maxlen == 0)
		return;

	// determine a safe maximum, reserving one byte of dst for the terminator
	size_t len = strlen(src);
	size_t max = (len < maxlen) ? len : maxlen - 1;

	// convert and copy
	for(size_t i = 0; i < max; ++i)
		dst[i] = (char)tolower((unsigned char)src[i]);

	// null terminate the destination
	dst[max] = 0;
}

This version does the best it can do to prevent buffer overruns and always null terminates the destination string. It’s still not 100% safe, but it never can be. Function implementation is only half of string safety. The other half is function usage. You have to rely on the caller to provide a proper value for ‘maxlen’ and there’s no way around it. If you want to expose this function outside of your application via a plugin or scripting interface, you should wrap it with something that takes additional precautions. But for internal use, ‘maxlen’ should always be the length of the destination buffer, in bytes. If you are allocating the destination dynamically, it should be the length of the source string plus one:



char *src = "foobar";
char dst1[32];
String_ToLowerCOW(dst1, sizeof(dst1), src);

size_t dst2Size = strlen(src) + 1;
char *dst2 = (char*)malloc(dst2Size);
String_ToLowerCOW(dst2, dst2Size, src);
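One caveat about using sizeof for the buffer length: it only yields the buffer size in the scope where the destination is declared as an array. Once the array has decayed to a pointer, for example inside a function that received it as a parameter, sizeof gives the size of the pointer instead. A small sketch of the difference (the function names are assumptions for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* dst is a pointer here, so this returns sizeof(char*), which says
 * nothing about how large the caller's buffer is. */
size_t SizeInsideCallee(char *dst)
{
	return sizeof(dst);
}

void DemoSizeof(void)
{
	char dst[32];
	dst[0] = 0;

	/* where the array is declared, sizeof is the byte count */
	assert(sizeof(dst) == 32);

	/* after decaying to a pointer, the buffer size is lost */
	assert(SizeInsideCallee(dst) == sizeof(char*));
}
```

This is why the size has to travel alongside the pointer as its own parameter, and why a dynamically allocated buffer needs its size stored in a variable at allocation time.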

String safety in C is definitely something to be concerned about. String_ToLowerCOW still contains the potential for bad errors, but it puts the responsibility on the caller. Java and C# hide such problems from you, and rightly so, but it’s no reason to be afraid of C strings. Sometimes, you may have no choice but to use them on some platforms. When you do have a choice, it’s certainly better to use something that eases the burden, such as std::string in C++. But when you do need to use C strings, a little discipline and vigilance go a long way.

Complete String_ToLower and String_ToUpper functions:

// cow.c -- to compile with gcc (including MingW or Cygwin on Windows):
//      gcc cow.c -std=c99 -o{outputfile name}
#include <ctype.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

void String_ToLowerCOW(char *dst, size_t maxlen, char *src)
{
	// some people would use assert here to validate dst and src, but since the
	// parameter values can potentially come from external sources at runtime,
	// I prefer to use an if block
	if(!dst || !maxlen || !src)
	{
		// Log an error here. Maybe set the function up to return
		// a boolean and return false here.
		return;
	}

	// determine a safe maximum, reserving one byte of dst for the terminator
	// (maxlen is known to be nonzero here)
	size_t len = strlen(src);
	size_t max = (len < maxlen) ? len : maxlen - 1;

	// convert and copy
	for(len = 0; len < max; ++len)
		dst[len] = (char)tolower((unsigned char)src[len]);

	// null terminate the destination
	dst[max] = 0;
}

void String_ToUpperCOW(char *dst, size_t maxlen, char *src)
{
	// some people would use assert here to validate dst and src, but since the
	// parameter values can potentially come from external sources at runtime,
	// I prefer to use an if block
	if(!dst || !maxlen || !src)
	{
		// Log an error here. Maybe set the function up to return
		// a boolean and return false here.
		return;
	}

	// determine a safe maximum, reserving one byte of dst for the terminator
	// (maxlen is known to be nonzero here)
	size_t len = strlen(src);
	size_t max = (len < maxlen) ? len : maxlen - 1;

	// convert and copy
	for(len = 0; len < max; ++len)
		dst[len] = (char)toupper((unsigned char)src[len]);

	// null terminate the destination
	dst[max] = 0;
}

int main(int argc, char **argv)
{
	char *str = "My StRiNg";
	// strlen returns size_t, which C99 prints with %zu
	printf("%s %zu\n", str, strlen(str));

	char lower[32];
	String_ToLowerCOW(lower, sizeof(lower), str);
	printf("%s %zu\n", lower, strlen(lower));

	size_t len = strlen(str) + 1;
	char *upper = (char*)malloc(len);
	if(!upper)
		return 1;
	String_ToUpperCOW(upper, len, str);
	printf("%s %zu\n", upper, strlen(upper));
	free(upper);

	return 0;
}
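The comments in the listing suggest returning a boolean instead of void. One possible shape for that, sketched under assumptions: `String_ToLowerChecked` and `DemoChecked` are hypothetical names, and treating truncation as a failure is just one design choice among several.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variant: returns true only if all of src fit in dst.
 * dstSize is the full size of dst in bytes, terminator included. */
bool String_ToLowerChecked(char *dst, size_t dstSize, const char *src)
{
	if(!dst || !dstSize || !src)
		return false;

	size_t len = strlen(src);
	size_t max = (len < dstSize) ? len : dstSize - 1;

	for(size_t i = 0; i < max; ++i)
		dst[i] = (char)tolower((unsigned char)src[i]);
	dst[max] = 0;

	return max == len;         /* false means the result was truncated */
}

void DemoChecked(void)
{
	char small[4];
	bool fits = String_ToLowerChecked(small, sizeof(small), "FooBar");
	assert(!fits);                          /* only 3 chars + terminator fit */
	assert(strcmp(small, "foo") == 0);      /* still null-terminated */

	char big[32];
	fits = String_ToLowerChecked(big, sizeof(big), "FooBar");
	assert(fits);
	assert(strcmp(big, "foobar") == 0);
}
```

The destination is always left null-terminated, so a caller that ignores the return value gets a shortened string rather than an unterminated one.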

Recap: The above functions are meant to be examples of safe string handling in C and not to be the most efficient implementations of ToLower and ToUpper. When implementing string manipulation functions, always keep in mind buffer overruns and null-terminations. When using string manipulation functions, always be sure to pass the proper parameters.

{ 3 } Comments

  1. Programmer16 | September 17, 2006 at 7:09 am | Permalink

    I agree that char-pointer strings are not evil, but in my opinion there is no need to use them unless they’re absolutely required. Part of the reason that I like C++ strings better than C-strings is that C++ strings are OO while C strings require functions for any of the nice features (appending and such.)

    Anyway, I usually don’t take anybody that says “X is evil” or “you shouldn’t do X” seriously, since those terms are used by too many people who have no idea what they’re talking about.

  2. gdmike | September 17, 2006 at 3:57 pm | Permalink

    That’s why this is labeled as a C Tip and not a C++ Tip :) I’ve been a C user for some years now, so I’m just used to C strings. I think your reasoning about using std::string, because it’s “OO”, is flawed.

    The fact that std::string has an object oriented design does not by itself make it inherently better than C’s string API. In fact, there are different pitfalls that you don’t need to worry about with C strings (a side effect of the C++ language, not object orientation in general). The benefits come from the implementation details, not the fact that you can call methods on it with str.method syntax.

    Second, there’s nothing that prevents you from using free functions in an OO design. Object orientation is not about classes - that’s just an implementation detail. Object orientation is about objects, and such a design can be implemented in C easily enough. Languages like C++ just give you the tools to make the implementation easier.

    I suggest you take a look at the D programming language. D’s strings are similar to those in C — arrays of characters manipulated with free functions. The biggest difference is that D strings don’t need to be null terminated, since arrays are actually structs under the hood that contain a data pointer and a length value. Another feature of D is that you can use the first argument to a function with . syntax (so that doSomething(myStr) is the same as myStr.doSomething()). How does that fit into your definition of OO?

    I do agree with you on the general principle that std::string is a better option when using C++, but my reasoning is that it’s safer than using raw C-strings.

  3. Programmer16 | September 18, 2006 at 6:44 am | Permalink

    The OO (sorry, I should actually say class-based design or something) wasn’t the only reason, as I said it was part of the reason. Safety and cleaner code would be a couple other reasons (safety would have been a lot better choice than what I said.)

    I misused the term OO because I was quite tired (as you can see I posted at 7am, I think that puts me at being up for about 36 hours since I didn’t go to bed on time the day before either lol.) I do understand that object orientation isn’t just about classes.

    I actually just switched over to using C++ strings most of the time (unless C-strings are needed) a couple months ago.
