Listing all square-free words.

An idea that I find to be fun is square-free words.  These are not words like words in spoken languages. You start with what you call an alphabet, which is a set of symbols. We will use the letters a, b, and c for our alphabet.  Then a word is any sequence of these symbols.  So for our alphabet, a, bb, and aba are all words.  You can also have an empty word, which is represented by the greek letter epsilon,  \epsilon.

Words can have factors. This is basically a word inside of a word. So if the word is abcabc, then one factor of this is bcab. We can factor a word by breaking it into factors. So if we say to let w=xy, then x and y are two factors of w.

A word is square-free if it has no back-to-back repeating patterns. In other words, its not possible to write a word w as xyyz. Examples of square-free words are \epsilon, a, and abacb. Two words which are not square-free is abcabc and aabb.

How might we list all square-free words?

We could list every single word, and then check them to see if they are square-free or not. However, this would be slow. First, the number of factors of a word does not scale nicely as the length of the words grow. And second, its more likely for a word to not be square-free. It would be ideal if there was a way to not have to check all the factors of each word, and not have to check every word.

The thing to consider that you can list all words in a tree like so.

alphatree
Given any word, its children are all the words made by appending each of the letters of the alphabet to itself.  From here we can see a way to not test every word.  Since any word contains all of its parents, if a word is not square-free, none of its children are either. We can do a breadth-first transversal of this tree, but not continue down any branch that we hit a word which isn’t square-free on.

To shorten the way that we test words there is an alternate definition of a square-free word which we need to prove is equivalent to our original one.

Let w = x\alpha be a word. Then w is square-free if and only if

  • x is square-free.
  • every factor which ends at \alpha is not preceded by itself.

Something to keep in mind is that “a factor ending at \alpha” means a factor all the way at the end of the word. It refers to position, not the symbol \alpha itself.

\Longrightarrow Suppose that w=x\alpha is square-free. Every factor of a square-free word is square-free, which means that x is. Also, no factor of a square-free word can be preceded by itself, so every factor which ends at \alpha is not preceded by itself.

\Longleftarrow Let w=x\alpha be a word. Suppose that x is square-free, and that every factor which ends at \alpha is not preceded by itself. There are no repeating factors in x, so if there is a repeating factor it’s going to need to end at \alpha. But no factor which ends at \alpha which has a repeating factor. Then w has no repeating factors and is square-free. \blacksquare

This goes hand in hand with the tree we’ve created. At any word w=x\alpha in the tree, its parent is x, and we know its parent is square-free. So we can use the above theorem to only test factors ending at \alpha, which cuts down on what we have to test significantly.

The following code written in Python implements this by printing out all square-free words up to a certain length.

import sys

'''
    Start with the rightmost letter in a word and go 
    back one position at a time. At each position 
    look at the word in front of it (including it), 
    and the word behind it of equal length. 

    This function tests if any of those pairs of 
    words are the same. For example, if the word 
    is abcde we test the following pairs: 
        d e  (pivot on e)
        bc de (pivot on d)
    If the word is abcdef, 
        e f  (pivot on f)
        cd ef (pivot on e)
        abc def (pivot on d)
'''
def testword(word):
    wordl = len(word)
    pivot = wordl - 1
    # We cut off one earlier if the word is of odd length.
    cutoff = (wordl % 2) + wordl / 2
    while pivot >= cutoff:
        right = word[pivot:]
        left = word[len(right) * (-2):pivot]
        if left == right:
            return False
        pivot -= 1
    return True

'''
    Enumerates all words by doing a breadth first search of the 
    tree where, given a word with length l, its children are
    all words of length l+1 which have it as a suffix.
'''
def enumerate_words(alphabet, stoplen):
    breadth_queue = ['']
    words = [] # Words that are square-free.
    while breadth_queue:
        word = breadth_queue[0]
        breadth_queue = breadth_queue[1:]
        children = []
        for w in alphabet:
            test = word + w
            if testword(test):
                children.append(test)
            if len(word) < stoplen:
                breadth_queue = breadth_queue + children
        words.append(word)
    return words

if __name__ == '__main__':
    if len(sys.argv) < 2:
        size = 5
    else:
        size = int(sys.argv[1])
    for w in enumerate_words( ['a','b','c'],size):
        print w