Second Pass

Main Loop

Recall that our second pass takes each file found during the first pass and relocates it, updating any internal references to file paths.

The main loop for the second pass is:


# Create the new root directory for relocated files
new_root = '../relocate'
os.makedirs(new_root)

# Now actually perform the relocation
for srcdst in files_map.items():
    relocate(srcdst, new_root, files_map)

The function os.makedirs recursively creates a directory path. It raises an exception if the directory path already exists. We will not catch this exception, since we want to ensure the files are being relocated to a clean directory.
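To see this behaviour in isolation, here is a minimal sketch. It works inside a temporary directory (my own choice, so that nothing is left lying around), and the names scratch, first_ok and second_raised are purely illustrative.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch has no lasting effect.
scratch = tempfile.mkdtemp()
new_root = os.path.join(scratch, 'relocate', 'nested')

# First call: both 'relocate' and 'nested' are created in one go.
os.makedirs(new_root)
first_ok = os.path.isdir(new_root)

# Second call: the path now exists, so an exception is raised.
try:
    os.makedirs(new_root)
    second_raised = False
except OSError:
    second_raised = True
```

In the script itself we leave the exception uncaught, so a second run against an existing ../relocate stops before any file is touched.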

Relocating by Copying


import shutil

def relocate(srcdst, dst_root, files_map):
    """Relocate a file, correcting internal file references."""
    dst_file = os.path.join(dst_root, srcdst[1])
    dst_dir = os.path.dirname(dst_file)
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    if isSourceFile(dst_file):
        relocateSource(srcdst, dst_file, files_map)
    else:
        shutil.copyfile(srcdst[0], dst_file)

There aren't many new features to comment on here. The first parameter to the function, srcdst, is a (key, value) item from the files_map dictionary, so srcdst[0] is the path to the original file, and srcdst[1] is the path to the relocated file, relative to the new root directory.
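The shape of srcdst is perhaps easier to see in a small sketch. The files_map contents below are invented for illustration, and the unpacking into src and dst is only there to give the two halves of the pair a name; the script itself indexes with 0 and 1.

```python
import os

# Hypothetical mapping: original path -> new path relative to the new root.
files_map = {
    'UserIF/Wgts/Menu.hpp': 'ui/widgets/menu.hpp',
}
dst_root = os.path.join('..', 'relocate')

for srcdst in files_map.items():
    # Each item is a (key, value) pair.
    src, dst = srcdst            # same values as srcdst[0] and srcdst[1]
    dst_file = os.path.join(dst_root, dst)
    dst_dir = os.path.dirname(dst_file)
```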

We create the destination directory unless it already exists. Then, if our file is a source file, we call relocateSource(); otherwise, we simply copy the file across.

I admit this isn't the most object-oriented of functions. The meaning of the literal 0 and 1 isn't transparent, and switching on type often indicates unfamiliarity with polymorphism. It isn't Python that's to blame here, nor a lack of familiarity with polymorphism: rather a reluctance on my part to add such sophistication to a simple script.

Identifying Source Files


import re

def isSourceFile(file):
    """Return True if the input file is a C/C++ source file.""" 
    src_re = re.compile(r'\.(c|h)(pp)?$',
                        re.IGNORECASE)
    return src_re.search(file) is not None

We identify source files using a regular expression pattern: r'\.(c|h)(pp)?$'. Here, the r stands for raw string, which means that backslashes are not handled in any special way by Python – the string literal is passed directly on to the regular expression module.
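A quick way to convince yourself of what the r prefix does is to compare a raw string with its ordinary, escaped equivalent. This is only a sketch; the variable names are mine.

```python
import re

raw = r'\.(c|h)(pp)?$'
escaped = '\\.(c|h)(pp)?$'    # the same pattern written without the r prefix

# Both literals produce exactly the same string object contents.
same = (raw == escaped)

# A raw backslash-dot is two characters: the backslash and the dot.
raw_length = len(r'\.')

# Either spelling can be handed to the re module.
matches = re.search(raw, 'a.cpp') is not None
```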

Regular expression patterns in Python are as powerful, concise and downright confusing to the uninitiated as they are elsewhere. I would say that subsequent use of regular expression matches is a little more friendly.

In this case, the regex reads:

match a '.' followed by either a 'c' or an 'h' followed by one or no 'pp's, followed by the end of the string

The re.IGNORECASE flag tells the regex compiler to ignore case.

So, we expect:


assert(isSourceFile('a.cpp'))
assert(isSourceFile('a.C'))
assert(not isSourceFile('a.cc'))
assert(not isSourceFile('a.cppp'))
assert(isSourceFile('a.cc.h'))

This time, the assertions hold.

Relocating Source Files

relocateSource


def relocateSource(srcdst, dstfile, files_map):
    """Relocate a source file, correcting included file paths."""
    fin  = open(srcdst[0], 'r')
    fout = open(dstfile, 'w')
    for line in fin:
        fout.write(
            processSourceLine(line,
                              srcdst,
                              files_map)
            )
    fin.close()
    fout.close()

The function relocateSource() simply reads the input file line by line. Each line is converted and written to the output file.
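The same read, convert, write pattern can be sketched on its own, away from the rest of the script. The version below assumes a Python recent enough to support the with statement, which closes both files even if the conversion raises. The function name copy_converted and its convert parameter are hypothetical stand-ins for relocateSource() and processSourceLine().

```python
import os
import tempfile

def copy_converted(src_path, dst_path, convert):
    """Copy src_path to dst_path, passing each line through convert()."""
    # The with statement closes both files, even if convert() raises.
    with open(src_path, 'r') as fin:
        with open(dst_path, 'w') as fout:
            for line in fin:
                fout.write(convert(line))

# Usage: upper-case a two-line file in a scratch directory.
scratch = tempfile.mkdtemp()
src_path = os.path.join(scratch, 'in.txt')
dst_path = os.path.join(scratch, 'out.txt')
with open(src_path, 'w') as f:
    f.write('one\ntwo\n')

copy_converted(src_path, dst_path, str.upper)

with open(dst_path) as f:
    result = f.read()
```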

Processing a Line of a Source File


def processSourceLine(line, srcdst, files_map):
    """Process a line from a source file, correcting included file paths."""
    
    include_re = re.compile(
        r'^\s*#\s*include\s*"'
        r'(?P<inc_file>\S+)'
        '"')
        
    match = include_re.match(line)
    
    if match:
        mapped_inc_file = mapIncludeFile(
            match.group('inc_file'),
            srcdst,
            files_map
            )
        line = line[:match.start('inc_file')] + \
               mapped_inc_file + \
               line[match.end('inc_file'):]
    
    return line

The function processSourceLine has a rather more complicated regex at its core. Essentially we want to spot lines similar to: #include "UserIF/Wgts/Menu.hpp" and extract the double-quoted file path. The complication is that there may be any amount of whitespace at several points on the line – hence the appearances of \s*, which reads zero or more whitespace characters.

The three string literals which comprise the regex (the first two of them raw) are concatenated before the regex is compiled, in the same way that adjacent string literals in C/C++ get joined together in an early phase of compilation. I have split the string in this way to emphasise its meaning.

The bizarre (?P<inc_file>\S+) syntax creates a named group: essentially, it allows us to identify the sub-group of a match object using the group name inc_file.
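Here is a small sketch of a named group in action, using the same pattern and a sample line with some extra whitespace thrown in. The sample line is my own.

```python
import re

include_re = re.compile(
    r'^\s*#\s*include\s*"'
    r'(?P<inc_file>\S+)'
    r'"')

line = '  #  include "UserIF/Wgts/Menu.hpp"\n'
match = include_re.match(line)

# The group can be fetched by name...
inc_file = match.group('inc_file')

# ...and the match object also knows where in the line the group sits.
start, end = match.span('inc_file')
```

The start and end positions are what allow processSourceLine to splice a new path into the line while leaving everything around it untouched.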

So, the function looks for lines of the form: #include "inc_file" then calls mapIncludeFile(inc_file...) to find what should now be included, and returns: #include "mapped_inc_file".

Incidentally, I am assuming here that the angle brackets are reserved for inclusion of standard library files – or at least not the files we are moving. That is, we don't try to alter lines such as: #include <vector>.
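The following sketch shows both cases side by side. To keep it self-contained, I have replaced mapIncludeFile() with a plain dictionary lookup. The function name rewrite and the contents of mapping are hypothetical.

```python
import re

include_re = re.compile(r'^\s*#\s*include\s*"(?P<inc_file>\S+)"')

def rewrite(line, mapping):
    """Replace a double-quoted include path using a plain dict lookup."""
    match = include_re.match(line)
    if match:
        new_path = mapping[match.group('inc_file')]
        # Splice the new path in, keeping the rest of the line intact.
        line = (line[:match.start('inc_file')] + new_path +
                line[match.end('inc_file'):])
    return line

mapping = {'UserIF/Wgts/Menu.hpp': 'ui/menu.hpp'}    # hypothetical

# A double-quoted include is rewritten; the trailing comment survives.
quoted = rewrite('#include "UserIF/Wgts/Menu.hpp" // menus\n', mapping)

# An angle-bracket include does not match, so the line is returned as is.
angled = rewrite('#include <vector>\n', mapping)
```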

Mapping Include Files


import sys

def mapIncludeFile(inc, srcdst, files_map):
    """Determine the remapped include file path."""
    
    # First, obtain a path to the include file relative to
    # the original source root
    if os.path.dirname(inc):
        pass # Assumption 1) - "inc" is our relative path
    else:
        # Assumption 2) The file must be located in the same
        # directory as the source file which includes it.
        inc = os.path.join(
            os.path.dirname(srcdst[0]), inc)
            
    # Look up the new home for the file
    try:
        mapped_inc = files_map[inc]
        if (os.path.dirname(mapped_inc) ==
            os.path.dirname(srcdst[1])):
            mapped_inc = os.path.basename(mapped_inc)
    except KeyError:
        print('Failed to locate [%s] (included by [%s]) '
              'relative to source root.' % (
              inc, srcdst[0]))
        sys.exit(1)
        
    return mapped_inc

The function mapIncludeFile() is actually quite simple, though only because of an assumption I have made about the way include paths are used in this source tree. The assumption is:

All #include directives give a path name relative to the root of the source tree, except when the included file is present in the same directory as the source file – in which case the file can be included directly by its basename. Furthermore, there are no source files at the top-level of the source tree (there are only directories at this level).

For source trees with more complex include paths, and correspondingly more subtle #include directives, this function will need fairly heavy-weight adaptation. (Alternatively, run another script to simplify your include paths first.)

If this assumption holds, we can easily determine the original path to the included files, then use our files_map dictionary to look up the new path. If the assumption doesn't hold, then the dictionary look up will fail, raising a KeyError exception. The exception is caught, a diagnostic printed, then the script exits with status 1.
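A worked example may help here. The sketch below mirrors the lookup logic but lets the KeyError propagate rather than exiting, so that all three outcomes can be observed. It uses posixpath instead of os.path so that the forward-slash paths behave the same on any platform. The name map_include and the contents of files_map are invented for illustration.

```python
import posixpath    # forward-slash paths, whatever the platform

def map_include(inc, srcdst, files_map):
    """Sketch of the lookup; raises KeyError instead of exiting."""
    if not posixpath.dirname(inc):
        # Bare file name: assume it sits beside the including file.
        inc = posixpath.join(posixpath.dirname(srcdst[0]), inc)
    mapped = files_map[inc]
    if posixpath.dirname(mapped) == posixpath.dirname(srcdst[1]):
        # Still neighbours after the move: include by basename.
        mapped = posixpath.basename(mapped)
    return mapped

# Hypothetical mapping from old paths to new ones.
files_map = {
    'UserIF/Wgts/Menu.cpp': 'ui/menu.cpp',
    'UserIF/Wgts/Menu.hpp': 'ui/menu.hpp',
    'Core/Types.hpp': 'core/types.hpp',
}
srcdst = ('UserIF/Wgts/Menu.cpp', 'ui/menu.cpp')

neighbour = map_include('Menu.hpp', srcdst, files_map)
elsewhere = map_include('Core/Types.hpp', srcdst, files_map)

try:
    map_include('Missing/Gone.hpp', srcdst, files_map)
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

Here Menu.hpp ends up in the same directory as the file which includes it, so it is still included by its basename. Core/Types.hpp moves elsewhere and is included by its new path relative to the root.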

We could test whether inc is a key in our dictionary before attempting to use it; and if it were absent, we could simply print an error and continue. However, we choose to view such an absence as exceptional since it undermines the assumptions made by the script. We do not wish to risk moving thousands of files without being sure of what we're doing.
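For contrast, the more forgiving alternative might look something like the sketch below. This is not what the script does. The function name lookup_or_continue and the problems list are hypothetical.

```python
def lookup_or_continue(inc, files_map, problems):
    """Return the new path for inc, or inc unchanged if it is unmapped."""
    if inc in files_map:
        return files_map[inc]
    # Record the problem and carry on, leaving the include path as it was.
    problems.append(inc)
    return inc

files_map = {'Core/Types.hpp': 'core/types.hpp'}    # hypothetical
problems = []

found = lookup_or_continue('Core/Types.hpp', files_map, problems)
missing = lookup_or_continue('Missing/Gone.hpp', files_map, problems)
```

The trouble with this approach is that it happily produces a relocated tree containing stale include paths, which is exactly the outcome the script is trying to avoid.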