
Preserve quotes when using wordexp

I’m trying to use the wordexp function to do shell-like expansion on some strings. wordexp removes single and double quotes, but I would like to preserve them. My initial thought was to surround every pair of quotation marks in the input string with another pair of escaped quotation marks, which wordexp should leave untouched (or the other way around). Unfortunately, this fails for more complex inputs.

For example, for '""TEST""' I would like to end up with '""TEST""'. I’ve written this snippet to demonstrate what actually happens when I use my approach:

#include <stdio.h>
#include <wordexp.h>

static void expansion_demo(char const *str)
{
    printf("Before expansion: %s\n", str);

    wordexp_t exp;
    wordexp(str, &exp, 0);
    printf("After expansion: %s\n", exp.we_wordv[0]);
    wordfree(&exp);
}

int main(void)
{
    char const *str1 = "\\''\\\"\"\"\\\"TEST1\\\"\"\"\\\"'\\'";
    expansion_demo(str1);

    char const *str2 = "'\\'\"\\\"\\\"\"TEST2\"\\\"\\\"\"\\''";
    expansion_demo(str2);

    return 0;
}

This results in:

Before expansion: \''\"""\"TEST1\"""\"'\'
After expansion: '\"""\"TEST1\"""\"'
Before expansion: '\'"\"\""TEST2"\"\""\''
Segmentation fault (core dumped)

This fails because the double quotes are nested inside single quotes, and naively surrounding every pair of quotes with escaped quotes can’t work in that case (though I’m not sure why the segfault happens).

I also thought about temporarily swapping the quotes for other ASCII characters, but there aren’t any that could not be part of some valid shell command.

Is there a way to adapt this to do what I want? Or maybe some much simpler way?


Answer

Segmentation fault

In your code, the second test string:

char const *str2 = "'\\'\"\\\"\\\"\"TEST2\"\\\"\\\"\"\\''";

yields a shell syntax error when passed to wordexp(). Coping with C or shell escaping rules is moderately hideous on a string like that, but careful analysis shows that there is an unmatched single quote at the end of the string. Converting the C string literal into the string it represents yields:

'\'"\"\""TEST2"\"\""\''

When analyzed, the key characters are marked by the carets:

'\'"\"\""TEST2"\"\""\''
^^^^^ ^ ^^    ^^ ^ ^^ ^
12345 6 78    91 1 11 1
               0 1 23 4
  1. Start single-quoted string
  2. Backslash (no special meaning inside a single-quoted string)
  3. End single-quoted string
  4. Start double-quoted string
  5. First escaped double quote (part of the string)
  6. Second escaped double quote (part of the string)
  7. End double-quoted string
  8. Word TEST2 is plain text outside quotes (part of the string)
  9. Start double-quoted string
  10. First escaped double quote (part of the string)
  11. Second escaped double quote (part of the string)
  12. End double-quoted string
  13. Escaped single quote (part of the string)
  14. Start of single-quoted string

Because there is no end to the final single-quoted string, there is a syntax error, and wordexp() reports it by returning WRDE_SYNTAX. The segmentation fault follows because your code ignores that return value and dereferences exp.we_wordv, which has been set to a null pointer.

This safer version of your code demonstrates this:

/* SO 5246-1162 */
#include <stdio.h>
#include <wordexp.h>

static const char *worderror(int errnum)
{
    switch (errnum)
    {
    case WRDE_BADCHAR:
        return "One of the unquoted characters - <newline>, '|', '&', ';', '<', '>', '(', ')', '{', '}' - appears in an inappropriate context";
    case WRDE_BADVAL:
        return "Reference to undefined shell variable when WRDE_UNDEF was set in flags to wordexp()";
    case WRDE_CMDSUB:
        return "Command substitution requested when WRDE_NOCMD was set in flags to wordexp()";
    case WRDE_NOSPACE:
        return "Attempt to allocate memory in wordexp() failed";
    case WRDE_SYNTAX:
        return "Shell syntax error, such as unbalanced parentheses or unterminated string";
    default:
        return "Unknown error from wordexp() function";
    }
}

static void expansion_demo(char const *str)
{
    printf("Before expansion: [%s]\n", str);
    wordexp_t exp;
    int rc;
    if ((rc = wordexp(str, &exp, 0)) == 0)
    {
        for (size_t i = 0; i < exp.we_wordc; i++)
            printf("After expansion %zu: [%s]\n", i, exp.we_wordv[i]);
        wordfree(&exp);
    }
    else
        printf("Expansion failed (%d: %s)\n", rc, worderror(rc));
}

int main(void)
{
    char const *str1 = "\\''\\\"\"\"\\\"TEST1\\\"\"\"\\\"'\\'";
    expansion_demo(str1);

    char const *str2 = "'\\'\"\\\"\\\"\"TEST2\"\\\"\\\"\"\\''";
    expansion_demo(str2);

    return 0;
}

Output is:

Before expansion: [\''\"""\"TEST1\"""\"'\']
After expansion 0: ['\"""\"TEST1\"""\"']
Before expansion: ['\'"\"\""TEST2"\"\""\'']
Expansion failed (6: Shell syntax error, such as unbalanced parentheses or unterminated string)

What wordexp() does

The wordexp() function is designed to do (more or less) the same expansions that a shell would do if given the string as part of a command line. Here’s a simple program that can illustrate this. It’s an adaptation of an answer to Running ‘wc’ using execvp() recognizes /home/usr/foo.txt but not ~/foo.txt — source file wexp79.c.

#include "stderr.h"
#include <stdio.h>
#include <stdlib.h>
#include <wordexp.h>

static const char *worderror(int errnum)
{
    switch (errnum)
    {
    case WRDE_BADCHAR:
        return "One of the unquoted characters - <newline>, '|', '&', ';', '<', '>', '(', ')', '{', '}' - appears in an inappropriate context";
    case WRDE_BADVAL:
        return "Reference to undefined shell variable when WRDE_UNDEF was set in flags to wordexp()";
    case WRDE_CMDSUB:
        return "Command substitution requested when WRDE_NOCMD was set in flags to wordexp()";
    case WRDE_NOSPACE:
        return "Attempt to allocate memory in wordexp() failed";
    case WRDE_SYNTAX:
        return "Shell syntax error, such as unbalanced parentheses or unterminated string";
    default:
        return "Unknown error from wordexp() function";
    }
}

static void do_wordexp(const char *name)
{
    wordexp_t wx = { 0 };
    int rc;
    if ((rc = wordexp(name, &wx, WRDE_NOCMD | WRDE_SHOWERR | WRDE_UNDEF)) != 0)
        err_remark("Failed to expand word [%s]\n%d: %s\n", name, rc, worderror(rc));
    else
    {
        printf("Expansion of [%s]:\n", name);
        for (size_t i = 0; i < wx.we_wordc; i++)
            printf("%zu: [%s]\n", i+1, wx.we_wordv[i]);
        wordfree(&wx);
    }
}

int main(int argc, char **argv)
{
    err_setarg0(argv[0]);

    if (argc <= 1)
    {
        char *buffer = 0;
        size_t buflen = 0;
        int length;
        while ((length = getline(&buffer, &buflen, stdin)) != -1)
        {
            buffer[length-1] = '\0';
            do_wordexp(buffer);
        }
        free(buffer);
    }
    else
    {
        for (int i = 1; i < argc; i++)
            do_wordexp(argv[i]);
    }
    return 0;
}

(Yes: code duplication — not good.)

This can be run with command line arguments (which means you have to fight the shell — or at least ensure that the shell doesn’t interfere with what you specify), or it will read lines from standard input. Either way, it runs wordexp() on a string and prints the results. Given an input file:

*.c
*[mM]*
*.[ch] *[mM]* ~/.profile $HOME/.profile

it will produce:

Expansion of [*.c]:
1: [esc11.c]
2: [so-5246-1162-a.c]
3: [so-5246-1162-b.c]
4: [wexp19.c]
5: [wexp79.c]
Expansion of [*[mM]*]:
1: [README.md]
2: [esc11.dSYM]
3: [makefile]
4: [so-5246-1162-b.dSYM]
5: [wexp19.dSYM]
6: [wexp79.dSYM]
Expansion of [*.[ch] *[mM]* ~/.profile $HOME/.profile]:
1: [esc11.c]
2: [so-5246-1162-a.c]
3: [so-5246-1162-b.c]
4: [wexp19.c]
5: [wexp79.c]
6: [README.md]
7: [esc11.dSYM]
8: [makefile]
9: [so-5246-1162-b.dSYM]
10: [wexp19.dSYM]
11: [wexp79.dSYM]
12: [/Users/jleffler/.profile]
13: [/Users/jleffler/.profile]

Note how it expanded both tilde-notation and $HOME.

Escaping a string

It appears that what you’re after is code that will preserve a string such as

'""TEST""'

across the expansion by a shell, yielding an output such as:

\''""TEST""'\'

I have a series of functions that can produce a string equivalent to that (though the actual output differs from what I showed; the functions use brute force where the example output above generates a slightly simpler string). This code is available in my SOQ (Stack Overflow Questions) repository on GitHub as files escape.c and escape.h in the src/libsoq sub-directory. Here’s a program using escape_simple(), which escapes any string containing characters outside the portable file name character set ([-A-Za-z0-9_.,/]).

/* SO 5246-1162 */
#include <stdio.h>
#include "escape.h"

int main(void)
{
    static const char *words[] =
    {
        "'\"\"TEST\"\"'",
        "\\''\\\"\"\"\\\"TEST1\\\"\"\"\\\"'\\'",
        "'\\'\"\\\"\\\"\"TEST2\"\\\"\\\"\"\\''",
    };
    enum { NUM_WORDS = sizeof(words) / sizeof(words[0]) };

    for (int i = 0; i < NUM_WORDS; i++)
    {
        printf("Word %d:  [[%s]]\n", i, words[i]);
        char buffer[256];
        if (escape_simple(words[i], buffer, sizeof(buffer)) >= sizeof(buffer))
            fprintf(stderr, "Escape failed - not enough space!\n");
        else
            printf("Escaped: [[%s]]\n", buffer);
    }

    return 0;
}

Note that interpreting the C string is fairly messy. Here’s the output from the program:

Word 0:  [['""TEST""']]
Escaped: [[''\''""TEST""'\''']]
Word 1:  [[\''\"""\"TEST1\"""\"'\']]
Escaped: [['\'\'''\''\"""\"TEST1\"""\"'\''\'\''']]
Word 2:  [['\'"\"\""TEST2"\"\""\'']]
Escaped: [[''\''\'\''"\"\""TEST2"\"\""\'\'''\''']]

As I noted, the escape code uses brute force. It outputs a single quote, then processes the string, replacing each single quote it encounters with '\''. This sequence:

  • Ends the current single-quoted string
  • Adds an escaped single quote (\')
  • Starts (continues) a single-quoted string

Inside single quotes, only single quotes need special treatment. Clearly, a more sophisticated parser would handle (repeated) single quotes at the start or end of the string more cleverly, and would recognize repeated single quotes and encode them more succinctly too.

You can use the escaped output in a printf command (as opposed to function) like this:

$ printf "%s\n" ''\''""TEST""'\''' '\'\'''\''\"""\"TEST1\"""\"'\''\'\''' ''\''\'\''"\"\""TEST2"\"\""\'\'''\'''
'""TEST""'
\''\"""\"TEST1\"""\"'\'
'\'"\"\""TEST2"\"\""\''
$

There’s no way to claim that any of the shell code there is easy to read; it is abominably difficult to read. But copy’n’paste makes life easier.
