
Segfault on std::vector

June 3, 2016. Filed under code, public.

I was writing a syntax highlighter for the Madeup desktop application the other day. I already had a lexer written months ago that generated a token stream. Each token in the stream knows its indices in the source code, so I sat down to write a pretty innocent algorithm:

tokens = lex src
for each token in tokens
  case token.type
    when COMMENT
      color src[token.start..token.end] in mute gray
    when NUMBER
      color src[token.start..token.end] in orange
    when IDENTIFIER
      color src[token.start..token.end] in dark blue
    etc
  end
end
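Mapped onto C++, the loop might look something like the sketch below. The Token struct, the TokenType enum, the colorize stub, and the highlight function are hypothetical stand-ins for Madeup's real lexer and GUI types; only the shape of the loop matters:
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical stand-ins for the real lexer's token representation.
enum TokenType { COMMENT, NUMBER, IDENTIFIER, NEWLINE, END_OF_FILE };

struct Token {
  TokenType type;
  size_t start;  // index of the token's first character in the source
  size_t end;    // index of the token's last character in the source
};

// Stub that just reports the span; the real call pokes the GUI framework.
void colorize(size_t start, size_t end, const char *color) {
  std::cout << "color [" << start << ", " << end << "] " << color << std::endl;
}

void highlight(const std::vector<Token> &tokens) {
  for (const Token &token : tokens) {
    switch (token.type) {
      case COMMENT:
        colorize(token.start, token.end, "mute gray");
        break;
      case NUMBER:
        colorize(token.start, token.end, "orange");
        break;
      case IDENTIFIER:
        colorize(token.start, token.end, "dark blue");
        break;
      default:
        break;
    }
  }
}

int main(int argc, char **argv) {
  // A few made-up tokens standing in for the lexer's output.
  std::vector<Token> tokens = {
    {IDENTIFIER, 0, 4},
    {NUMBER, 6, 7},
    {NEWLINE, 8, 8},
    {END_OF_FILE, 8, 8},
  };
  highlight(tokens);
  return 0;
}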
It worked pretty well, except the GUI framework complained about the indices of the last couple of tokens. Oh, yeah. The last couple of tokens are artificial ones that aren’t really in the original source. The last one is END_OF_FILE, which I emit so the parser knows when to stop processing. The second to last one is NEWLINE, which I insert just in case the programmer didn’t include one. If the parser can assume that all statements end with a NEWLINE, the rules are simpler. (This practice is common. See how POSIX defines line.) Trying to colorize these artificial tokens triggered the warning. So, I changed my loop a bit:
for i = 0; i < tokens.size - 2; ++i
  ...
end
This generated a segfault—sometimes. My routine worked fine on blank source. It worked on longer programs too. But sometimes it crashed. That’s weird, I thought. I’m doing strictly less work than before. How can I be getting a segfault? I wish I could say I found the answer right away. But I didn’t. Here’s a minimal working example that demonstrates the same problem:
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
  std::vector<int> numbers;
  for (int i = 0; i < numbers.size() - 2; ++i) {
    std::cout << "numbers[" << i << "]: " << numbers[i] << std::endl;
  }
  return 0;
}
On my machine, this generates a segfault pretty reliably. It might not on your machine, but it definitely shouldn’t print anything, since the vector is empty. The issue is numbers.size() - 2. The size method returns an unsigned type, size_t in practice. If the size is ever less than 2, the subtraction will wrap around and the upper bound will be really big:
#include <cstddef>
#include <iostream>

int main(int argc, char **argv) {
  size_t size = 0;

  // 0 - 2 wraps around in unsigned arithmetic and prints 18446744073709551614
  std::cout << "size - 2: " << size - 2 << std::endl;

  return 0;
}
So, how was the number of tokens ever 0, given my two artificial tokens? When the lexer encounters an unknown token early in the processing, I raise an exception and no tokens are emitted. I still want the highlighter to colorize as much as it can on such an exception, so I must process whatever tokens I do have. The fix was simple: just favor addition over subtraction:
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
  std::vector<int> numbers;
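  // i + 2 stays non-negative, so converting it to size_t for the comparison is
  // safe. On an empty vector, the condition is false on the very first check,
  // and the body never runs.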
  for (int i = 0; i + 2 < numbers.size(); ++i) {
    std::cout << "numbers[" << i << "]: " << numbers[i] << std::endl;
  }
  return 0;
}
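Another way to dodge the wraparound would be to keep the arithmetic signed, for example by casting the size before subtracting. A minimal sketch of that approach:
#include <cstddef>
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
  std::vector<int> numbers;

  // Casting to a signed type first keeps 0 - 2 at -2 instead of wrapping,
  // so the loop condition is false and the body never runs.
  std::ptrdiff_t last = static_cast<std::ptrdiff_t>(numbers.size()) - 2;
  for (std::ptrdiff_t i = 0; i < last; ++i) {
    std::cout << "numbers[" << i << "]: " << numbers[i] << std::endl;
  }
  return 0;
}
But favoring addition needs no cast and no extra variable.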