DLSuperC is a compare program that differs from most typical compare programs, yet it shares many of the same external display features. When you run DLSuperC, you expect it to detect differences and produce output showing where those differences were found. A deficiency of some other programs is that they sometimes get lost when producing results, even showing differences in areas that have not been changed. Many programs are also limited in their capacity to handle large files, whereas DLSuperC can handle huge files using a unique iterative process that dynamically partitions each file into sections and then combines the individual partial results into one overall set of statistics. Not many compare programs feature filters whereby the input files can be subjected to a set of text definitions that suppress data not of interest in detecting changes at any particular time. Comment lines, predefined lines of unimportant text, and sequence number columns from the input data stream are examples of data that can be excluded when preparing the compare set.
Some older compare programs used sequence numbers to detect where changes had been made. Of course, most newer PC program files do not normally have sequence numbers, so this is an outmoded consideration. Some newer programs scan both files to be compared and use different criteria for finding change intervals. These programs mostly use the consecutive text sequences from the original files and then try to find where they differ by applying an assortment of iterative algorithms to those text strings. This can consume a lot of I/O accesses (because the complete input files cannot be accommodated within real storage), especially if the compared files are large and have long lines. Other programs even try to sort lines and then use some tagging notation to determine matches, inserts, and deletions.
Key distinguishing capabilities featured in DLSuperC are accuracy, speed, robustness (i.e., it doesn't get lost), ease of use, filterable input, configurability through many run-time options, and the ability to produce a variety of short or long output reports with color highlighting of changed text. Coloring can ease the user's task of digesting the important differences. However, the internal techniques DLSuperC uses to detect differences between two files are the main topic of this panel.
The following discussion assumes that the two input files are text files composed of a number of lines. One file is termed the old file while the other is the new file. They might also be termed the original old file and the modified or updated new file. Such simplifications are used only in the explanation; the files could be reversed, and the only change would be that the insert and delete references would also be reversed. In fact, neither file needs to be a text file. DLSuperC works just as efficiently with binary files, although the displayed results would look quite different. There is no requirement that text line delimiters be present. The program looks for CR/LF line delimiters (also LF/CR, CR, or LF) but will operate in their absence, processing each file as long but bounded lines up to the limit set by the program's 2048-character input buffer. An assumption is that the user of DLSuperC wants to find sequence-order line matches, where the order of the lines containing source statements (implying sequential execution flow) is important. That is, if two lines appear in reverse order between the two files, one file would have a line flagged as an insert while the other would have it flagged as a delete. This contrasts with another form of matching that can be termed content matching, whereby order is unimportant and only the existence of an exact line in both files is required. Content matching is discussed later.
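As a rough illustration of the delimiter handling just described, the small C sketch below accepts any of the four delimiter forms and bounds delimiter-free data at the 2048-character buffer limit. The function read_line and its details are assumptions made for illustration, not DLSuperC's actual I/O code.

    #include <stdio.h>

    #define MAX_LINE 2048   /* the input-buffer bound mentioned above */

    /* Reads one line into buf, treating CR/LF, LF/CR, a lone CR, or a
     * lone LF as the delimiter, and bounding delimiter-free data at
     * 2048 characters.  Returns the line length, or -1 at end of file. */
    static int read_line(FILE *f, char buf[MAX_LINE + 1])
    {
        int n = 0, c = EOF;

        while (n < MAX_LINE && (c = fgetc(f)) != EOF) {
            if (c == '\r' || c == '\n') {
                int c2 = fgetc(f);                  /* two-byte delimiter? */
                if (c2 != EOF && !((c == '\r' && c2 == '\n') ||
                                   (c == '\n' && c2 == '\r')))
                    ungetc(c2, f);                  /* lone CR or lone LF  */
                break;
            }
            buf[n++] = (char)c;
        }
        buf[n] = '\0';
        return (n == 0 && c == EOF) ? -1 : n;
    }

    int main(void)
    {
        char buf[MAX_LINE + 1];
        int len;

        while ((len = read_line(stdin, buf)) >= 0)
            printf("%4d | %s\n", len, buf);
        return 0;
    }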
Processing Begins
An overall simplified description of the matching process is that DLSuperC uses an iterative process to find all the valid largest match compare sets (LMCS) between the two files, marking them as matches and labeling the leftover unmatched lines as inserts and deletes. Each file inspection starts with one separate contiguous set of individual lines per file. The initial task is to find the LMCS among the numerous matched consecutive sequences between the files. The key is that the LMCSes (modified and augmented as explained later) divide each file into a series of match areas that connect between the files but do not cross each other. One such small fragment example is shown below. It was extracted from my original disclosure as published in the IBM Technical Disclosure Bulletin, Vol. 22 No. 8A, January 1980. The complete disclosure is available through any technical library.
LMCS Fragment as Example
The LMCS sets, using lines from the files in this explanation, could be as simple as consecutive matching lines from each file, which is the usual standard definition of an LMCS set. DLSuperC further expands the LMCS set by initially preprocessing lines to remove all embedded blank characters. No changes are made to the user input files; only the internal buffers are modified. Blank filtering allows the matching set to include reformatted lines, increasing the size of an existing LMCS set. This maximizes the probability of correctly locating the match sets a user expects, sets that might otherwise be overlooked because the changed lines would shorten the LMCS that blank filtering now expands. It also creates longer match sets, making them more selectable than other matching sets that could distort the final boundary setting.
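A minimal sketch of this kind of blank filtering follows, assuming that only the space character counts as a blank; the exact set of characters DLSuperC treats as blanks is not specified here. The user's line is copied into a work buffer, so the input file itself is untouched.

    #include <stdio.h>

    /* Copies a line into an internal work buffer with every blank
     * removed, so a reformatted line compacts to the same text (and
     * later the same hashsum) as the original.  The user's input line
     * is never modified. */
    static size_t strip_blanks(const char *line, char *out)
    {
        size_t n = 0;
        for (size_t i = 0; line[i] != '\0'; i++)
            if (line[i] != ' ')
                out[n++] = line[i];
        out[n] = '\0';
        return n;
    }

    int main(void)
    {
        char work[2048 + 1];

        /* The same statement, reformatted with extra indentation and
         * spacing, reduces to the same internal text. */
        strip_blanks("x := y + 1;", work);
        printf("%s\n", work);                 /* x:=y+1; */
        strip_blanks("    x :=   y + 1;", work);
        printf("%s\n", work);                 /* x:=y+1; */
        return 0;
    }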
The LMCS set can be further affected by processing due to DLSuperC's optional exclusion capabilities: whole comment lines, partial comment lines, normalization of upper- and lower-case text characters, and column ranges that can exclude or include parts of an input line. There is also a processing technique that allows small discontinuities of 1 to 3 lines to be treated as part of a continuous compare set. A large compare set can be enlarged to include additional lines if the discontinuity between the sets is within this range and the extension increases the compare set length by a minimum amount. The 1-to-3-line discontinuity does not have to be identical in each file; it can differ between them. As with reformatted lines, this absorbing of discontinuous lines into a larger compare set strengthens the identification of valid sets that should be selected when setting the LMCS boundaries.
Using 32-Bit HashSum Values
As each line is processed, it is compressed into a fixed 32-bit hash value that temporarily represents the original input line's data content. Most of the comparison process uses this hash value as a substitute for the input line. The hash value does not ensure that each line has a unique content value; it is a simplification that speeds up the comparison process. Comparing a fixed value is faster than using a variable-length text string to find equal lines. Equal hash values can never be used by themselves to declare lines equal, since the underlying data content might differ. However, unequal hash values do guarantee that the lines are not equal. Processing must ensure that equal hash values with the same data content can be easily selected. This is implemented with a chaining mechanism whereby lines with the same data content are chained together. Of course, the entries on such a chain also have equal hash values.
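The sketch below shows the general idea of compacting a preprocessed line into a fixed 32-bit value. DLSuperC's actual hashsum formula is not given in this discussion, so the well-known FNV-1a hash is used purely as a stand-in.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Compresses a (blank-stripped) line into a fixed 32-bit value.
     * FNV-1a is used here only for illustration. */
    static uint32_t line_hash(const char *s, size_t len)
    {
        uint32_t h = 2166136261u;              /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)s[i];
            h *= 16777619u;                    /* FNV prime        */
        }
        return h;
    }

    int main(void)
    {
        const char *a = "count:=count+1;";
        const char *b = "count:=count-1;";

        /* Unequal hashes guarantee unequal lines; equal hashes must
         * still be confirmed by comparing the line text itself. */
        printf("%08x  %s\n", line_hash(a, strlen(a)), a);
        printf("%08x  %s\n", line_hash(b, strlen(b)), b);
        return 0;
    }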
Blank suppression, line exclusion, column selection, and case normalization are employed prior to the hashsum processing. Their effect must be reflected in the overall results, which can be done after the comparison process is completed but before the results are displayed and statistics are gathered. It should be pointed out, however, that the computed hashsums are the exclusive basis for match determination; no dynamic substitution or content-processing exits are invoked while the matching process is being done.
Preprocessing of Equal HashSums Reduces Overhead
Preprocessing starts by using the hash value representing a line to index into a reasonably sized array. The size of the array is arbitrary, but DLSuperC employs an array that accommodates only 13 bits of the 32-bit hash value. This serves as a first step in narrowing in on the set of lines with equal hashes. Further processing is needed to reduce the selected set of hash values and eliminate selections caused either by the shared 13-bit index or by different text line content that compacts to the same 32-bit hash value: a false match. The initial indexing mostly serves to identify lines that have no matches, reducing the processing needed to ferret out false matches and correctly place true matches onto their own unique chains.
Old-file lines with equal hash values are chained within an old-file data structure using the hash array table reference. Because the old-file lines are processed in reverse input order, each array entry ends up holding the earliest old-file line with that hash value; as each new entry is added, a backward chain develops pointing to any previously entered equal-hash entry. The new-file line structure uses the final state of the array, so its entries point to the first occurring old-file line with the same hash value. The new file's topmost reference is dynamically updated during the matching process as match spans eliminate structure elements that no longer need to be referenced. Nevertheless, the inspection process can use the initial chain structure to determine how to progress through the span-matching process, since the links have already been set up, eliminating time-consuming searches for the next eligible match.
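A simplified sketch of these two structures follows. The field names, the choice of the low 13 bits as the bucket index, and the helper build_chains are all assumptions made for illustration; only the overall shape (a 13-bit-indexed head array, backward chains through the old file, and new-file entries pointing at the first old-file occurrence) reflects the description above.

    #include <stdint.h>
    #include <stdio.h>

    #define HASH_BITS  13
    #define TABLE_SIZE (1u << HASH_BITS)       /* 8192 bucket entries */
    #define NO_LINE    (-1)

    struct old_line { uint32_t hash; int prev;      }; /* previous equal-hash line   */
    struct new_line { uint32_t hash; int first_old; }; /* first old-file match or -1 */

    static int head[TABLE_SIZE];

    static unsigned bucket_of(uint32_t hash32)
    {
        return hash32 & (TABLE_SIZE - 1u);     /* low 13 bits: an assumption */
    }

    static void build_chains(struct old_line *olds, int n_old,
                             struct new_line *news, int n_new)
    {
        for (unsigned b = 0; b < TABLE_SIZE; b++)
            head[b] = NO_LINE;

        for (int i = n_old - 1; i >= 0; i--) {         /* reverse input order */
            unsigned b = bucket_of(olds[i].hash);
            olds[i].prev = head[b];
            head[b] = i;
        }
        for (int j = 0; j < n_new; j++)                /* final array state   */
            news[j].first_old = head[bucket_of(news[j].hash)];
    }

    int main(void)
    {
        struct old_line olds[] = { {10, 0}, {20, 0}, {10, 0} };
        struct new_line news[] = { {10, 0}, {30, 0} };

        build_chains(olds, 3, news, 2);

        /* news[0] points at old line 0 (the earliest line with hash 10),
         * old line 0 chains back to old line 2, and hash 30 has no
         * old-file match. */
        printf("new 0 -> old %d\n", news[0].first_old);   /* 0  */
        printf("old 0 -> old %d\n", olds[0].prev);        /* 2  */
        printf("new 1 -> old %d\n", news[1].first_old);   /* -1 */
        return 0;
    }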
Since equal hash values may represent text lines whose data content is not equal, these false matches (as promised) must be unchained and rechained into separate chains whose entries share both the same hash and the same data content. Now the LMCS process can be started, as all matching lines have been identified and appear on the correct chains. The remaining task is to determine where the best matches are located and where the mismatches among the leftover lines exist, since these are the inserted and deleted lines.
Hash Value Matching Begins
The LMCS process proceeds to establish dividing areas between the two files. The best LMCS area fragments the files so that succeeding scans for newer LMCSes cannot cross any established match set. Each iteration either starts at the top of each file looking for new matches or continues at the end of the last validated LMCS area. An iteration may find that no valid matches exist between the boundaries scanned; those lines from the old file are marked as deletes and those from the new file are marked as inserts. The process ends when the whole file has been digested or the 16 K maximum number of lines has been reached.
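The sketch below captures the dividing idea in miniature: find the longest run of equal hash values between the current boundaries, fix it as a match set, and repeat the search in the regions on either side of it so later matches can never cross it. The brute-force double loop is only a placeholder for DLSuperC's chain-driven search, and, as the comments note, a real pass must still confirm the line text behind equal hashes.

    #include <stdio.h>
    #include <stdint.h>

    enum mark { UNSET, MATCH, DELETE, INSERT };

    /* Hypothetical sketch of the dividing step.  Regions with no match
     * left become deletes (old file) and inserts (new file).  A real
     * pass must also confirm the line text, not just the hash. */
    static void lmcs(const uint32_t *oh, enum mark *om, int olo, int ohi,
                     const uint32_t *nh, enum mark *nm, int nlo, int nhi)
    {
        int best_len = 0, best_o = 0, best_n = 0;

        for (int i = olo; i < ohi; i++)
            for (int j = nlo; j < nhi; j++) {
                int len = 0;
                while (i + len < ohi && j + len < nhi &&
                       oh[i + len] == nh[j + len])
                    len++;
                if (len > best_len) {
                    best_len = len;
                    best_o = i;
                    best_n = j;
                }
            }

        if (best_len == 0) {                    /* nothing left to match */
            for (int i = olo; i < ohi; i++) om[i] = DELETE;
            for (int j = nlo; j < nhi; j++) nm[j] = INSERT;
            return;
        }
        for (int k = 0; k < best_len; k++) {    /* the new dividing LMCS */
            om[best_o + k] = MATCH;
            nm[best_n + k] = MATCH;
        }
        lmcs(oh, om, olo, best_o, nh, nm, nlo, best_n);
        lmcs(oh, om, best_o + best_len, ohi, nh, nm, best_n + best_len, nhi);
    }

    int main(void)
    {
        uint32_t old_h[] = { 1, 2, 3, 4, 5 };   /* hash values per old line */
        uint32_t new_h[] = { 1, 2, 9, 4, 5 };   /* line 3 was changed       */
        enum mark old_m[5] = { UNSET }, new_m[5] = { UNSET };

        lmcs(old_h, old_m, 0, 5, new_h, new_m, 0, 5);
        for (int i = 0; i < 5; i++)
            printf("old %d:%d  new %d:%d\n", i, old_m[i], i, new_m[i]);
        return 0;
    }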
There are many duplicate matching lines and small matching sequences within a modern-day source file. For example, a large number of identical “begin;” and “end;” full statement lines (followed or preceded by one or two generic lines) appear throughout a typical Pascal program. These repetitive statement sequences can be termed “onezs”, “twozs”, and “threezs” matching candidates. Top, middle, and end matches exist independent of their fit in the overall contextual matching of the source input. The best procedure is to initially eliminate as many of them as possible from the candidate matching sets, since many will be eliminated in the final results anyway. The DLSuperC approach is to initially bypass these bothersome small compare sets and establish large LMCS dividing boundaries. Many of these superfluous candidates are thereafter eliminated because they conflict with, and are not contained within, the newly established LMCS boundaries. This is only a temporary measure, as all remaining matching sets must eventually be processed, with many small sets still potentially eligible. Nevertheless, small compare sets consume significant processing overhead while only a small number actually survive.
Other Optimizations Used
There is some optimizing code that advances each new match-span inspection to the next start point. It uses the length of the last matched span to eliminate redundant examination of all the sets included within that span. Using the length as an increment to start the next set of candidates, the process double-checks that the advancement increment has not skipped over any eligible back-match candidates. It is very tricky but seems to work, saving a lot of non-productive processing. Another selection optimization checks for equal-length LMCSes to determine which of a number of equal-length candidates should be selected. Some processing analysis ensues to conditionally determine which candidate is the least destructive in setting its new match boundaries.
It is also possible that some compare sets can be extended across discontinuities of 1 to 3 unmatched lines. This can make such a set a better candidate than others that cannot absorb any discontinuity. The discontinuity need not be the same in each file. A simple example might be a pair of files with one insert in the new file and two deletes in the old file, which occurs when two lines of the original file are replaced by a single line in the updated file. Instead of two match sets being found, the initial processing scan finds a single eligible set that absorbs the change discontinuity. This discontinuity condition can exist in any real case, and it not only reduces the number of eligible compare sets to be processed but also enhances the proper selection of the real-world matches within the changed source.
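The fragment below sketches that absorption test under assumed constants: a following match span is folded into the current compare set when the gaps in the two files are each small (and possibly unequal) and the extension adds at least a minimum number of matched lines. The structure, the helper name, and the constant values are illustrative assumptions, not DLSuperC's actual thresholds.

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_GAP    3    /* the 1-to-3-line discontinuity rule above */
    #define MIN_EXTEND 2    /* stand-in for the "minimum added length"  */

    struct span { int o_start, n_start, len; };

    /* Decide whether a following match span should be absorbed into the
     * current compare set despite a small, possibly unequal gap. */
    static bool absorbs(const struct span *a, const struct span *b)
    {
        int old_gap = b->o_start - (a->o_start + a->len);
        int new_gap = b->n_start - (a->n_start + a->len);

        return old_gap >= 0 && old_gap <= MAX_GAP &&
               new_gap >= 0 && new_gap <= MAX_GAP &&
               b->len >= MIN_EXTEND;
    }

    int main(void)
    {
        /* Two old-file lines replaced by one new-file line: the second
         * span sits 2 lines further on in the old file but only 1 line
         * further on in the new file, and is still absorbed. */
        struct span first  = { 0, 0, 10 };
        struct span second = { 12, 11, 8 };

        printf("absorbed: %s\n", absorbs(&first, &second) ? "yes" : "no");
        return 0;
    }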
Since blanks are always suppressed from lines during initial preprocessing, all match sets must be revalidated to identify matches that are actually reformats. This is done in a final validation pass. An exception when validating reformatted lines is made where only the number of blanks at the end of a line differs, since some editors truncate line-ending blanks whereas others do not. It is unreasonable to expect a user, looking at a report listing, to recognize whether blank differences exist at the end of a line or whether the last visible character is truly the end of the line, since trailing blanks and CR/LF characters are never indicated in a displayed line.
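A small sketch of that exception, treating spaces and tabs as trailing blanks (an assumption about what counts as a blank), might look like this:

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>

    /* Two lines still count as a plain match rather than a reformat
     * when they differ only in blanks at the end of the line, since
     * some editors strip trailing blanks and the difference is
     * invisible in a displayed listing. */
    static bool equal_ignoring_trailing_blanks(const char *a, const char *b)
    {
        size_t la = strlen(a), lb = strlen(b);

        while (la > 0 && (a[la - 1] == ' ' || a[la - 1] == '\t')) la--;
        while (lb > 0 && (b[lb - 1] == ' ' || b[lb - 1] == '\t')) lb--;
        return la == lb && memcmp(a, b, la) == 0;
    }

    int main(void)
    {
        printf("%d\n", equal_ignoring_trailing_blanks("end;", "end;   ")); /* 1 */
        printf("%d\n", equal_ignoring_trailing_blanks("end;", "  end;"));  /* 0 */
        return 0;
    }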
Large Files and Multiple Passes
DLSuperC has the capability to process large files. It does this by initially digesting up to 16 K lines. Matches, inserts, and deletes are determined for each pass. Each additional pass then restarts at a file position following the best LMCS determined in the previous pass. This amounts to heuristically backing up in the file and rescanning lines that may already have been processed in the previous pass; these rescanned lines can become part of an enlarged compare set in the next pass. Up to 16 K lines (rescanned + new lines = 16 K) are processed in the next pass, and the process continues until both files have been entirely traversed. The displayed report and final statistics are a combination of the summed partial results. User experience seems to confirm that the iterative results generally agree with what would have been expected had the entire process been done in one pass.
Other DLSuperC Variations
The basic LMCS algorithm in DLSuperC is also used in DLSuperCX, DLSuperCRV, DLSuperCTW, and DLSuperCBT.
In DLSuperCRV, no text filters are allowed. It is the only variation of DLSuperC that optionally allows a line to be defined with a fixed line size instead of a variable CR/LF line-ending tag.
DLSuperCTW parses the input lines into word/token values that are processed as 32-bit hash values. This allows matches within changed lines and in lines that have been reformatted to span separate lines.
DLSuperCBT uses 8-bit hash values in the normal LMCS parsing algorithm. The fixed 8-bit hash comparison algorithm appears to work even though there is a bias toward values in the alphanumeric range and a byte value is limited to the range 0 to 255.
The LMCS parsing algorithm is common to all versions of the DLSuperC family. It works well with line, word, and byte token elements, since the major processing is done with a fixed hash value instead of the token unit itself. This implies that any definable unit could be used with the DLSuperC algorithm for the comparison phase. DLSuperCBT is an extreme case in that there is a lot of source duplication within the 8-bit hash values. This contrasts with the line-compare hash values, which have much more variability since each one represents up to 2048 characters compacted into a 32-bit hash. This did not seem to present an overriding concern, but special insight was required when presenting the results in DLSuperCTW and DLSuperCBT. Extra lines were required for displaying inserted and deleted words and composed lines in DLSuperCTW, and a memory-formatted dump was needed to show the results for DLSuperCBT.
The iterative processing technology was first utilized in DLSuperC to process large files. A limit could have been set on the size of the files DLSuperC could process. Instead, no upper limit was set, and large numbers of lines were segmented using multiple passes. A backup process and final summing were used for reporting the final results. This allowed the same methodology to serve the more common and larger tokenizing requirements of DLSuperCTW and DLSuperCBT. Multiple-pass requirements for these environments are very common: each word or byte token must be processed much like a line in the normal compare, and after so many words or bytes the process may require an additional pass. Therefore, a reasonable 16 K-line limit in the line compare created a processing architecture suited for large line files and provided a technique for processing word and byte files using the same iterative matching process.
Content Matching Instead of Sequence Order Matching
Content matching can be done with DLSuperC using the Clnc (Compare unordered line-list for matches, exclusions, and dupes) option. The process could also have been done with Fmvl (Flag moved lines), but a more specialized and direct method was thought to be better. Lines from one file are located in the other file. Lines not found in the other file are termed new-file inserts or old-file deletes. Duplicate lines within each file are termed dupes. This processing iteratively uses the calculated hash values much like the order-dependent match process, except that the limits against crossing boundaries are not used. Three major differences exist: a dupe chain is maintained for each file, LMCS determination is not used, and the entire input file set must be processed in a single pass rather than iteratively.
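As a rough, assumption-laden illustration of unordered matching, consider the naive string version below. It only separates matches, deletes, and inserts; the real Clnc option works from hashsum chains and also reports the per-file dupe chains described above.

    #include <stdio.h>
    #include <string.h>

    /* Looks for line anywhere in pool regardless of position; once a
     * pool entry is matched it is consumed so it cannot match twice. */
    static int find_and_consume(const char *line, const char **pool,
                                int n, int *used)
    {
        for (int j = 0; j < n; j++)
            if (!used[j] && strcmp(line, pool[j]) == 0) {
                used[j] = 1;
                return 1;
            }
        return 0;
    }

    int main(void)
    {
        const char *old_lines[] = { "alpha", "beta", "beta", "gamma" };
        const char *new_lines[] = { "beta", "delta", "alpha" };
        int used[3] = { 0 };

        for (int i = 0; i < 4; i++)
            printf("%s: %s\n",
                   find_and_consume(old_lines[i], new_lines, 3, used)
                       ? "match " : "delete",
                   old_lines[i]);

        for (int j = 0; j < 3; j++)
            if (!used[j])
                printf("insert: %s\n", new_lines[j]);
        return 0;
    }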
A user with a sequential database file might have a requirement for the unordered compare option, since additions and deletions to the file are mostly random operations. The file may never be sorted.