DLSuperCTW

DLSuperCTW is a novel word-compare extension to the suite of DLSuperC line comparison programs. The output report from DLSuperCTW makes it possible to observe word changes in text and script files (e.g., *.txt, *.html, *.ini, *.bat, and *.pas) that are often difficult to detect in changed lines of text, even when the lines differ in only a few places.

DLSuperCTW even allows matching words to be shifted to adjacent lines in the default processing mode. This handles, for example, a descriptive text file whose lines have reflowed because words were inserted and deleted against a fixed right margin. In the default mode, the compare program displays completely matched lines (containing no changed words), whole inserted lines, and whole deleted lines subject to a maximum 128-character line limit. Lines containing both matching and embedded changed words are displayed as separately composed, color-coded changed-text lines placed vertically below the corresponding matching line segment. Space gaps appear at the inserted or deleted position in the matching portion so the user can determine how the changes would appear in the original line.

The DLSuperCTW program also features an alternate display and processing mode invoked by specifying the Wcln option. This mode also compares full lines of text as words and text tokens but requires matches to respect the line boundaries of both files. This amounts to comparing for text differences between lines much like DLSuperC, but at the word level instead of the line level. Additionally, the output from the Wcln variation displays all long compared text lines, whether matching, unchanged, or changed, as whole text lines up to a 2,000+ character limit. This contrasts with both DLSuperCTW’s default mode and the interactive operation of DLSuperC, which limit displayed output lines to a maximum of 128 characters. The Wcln report does not generate composed changed lines. Lines are always full source copies of the New and Old input file lines.

Furthermore, Wcln uses different colors for inserted and deleted text integrated within each changed line. This is a highly effective way to focus attention on the words that have changed. This visual representation also reduces the number of report lines per change from the default report’s maximum of three to two, since there is no longer a need to generate a separate line for the reference matching portion. Reports containing long changed lines appear natural because each line is generated laterally as one complete line instead of as separate continuation lines until the composed line has been fully displayed.

The Wcln program mode generates output quite similar to the standard DLSuperC report. There are additions due to the differences in the color appearances of certain changed lines, some new identification tags, some information relocation, new summary statistic entries, some changes to the part comment displayed line, and the increased displayed length for long lines. Nevertheless, the display reporting of whole line matches, whole line deletes, and whole line inserts should appear similar. In essence, the Wcln display is the same as DLSuperC’s line compare display with changed words identified within a line. Since the comparison algorithm is based on word compare instead of line compare, some minor differences may be noted due to differences in the matching algorithm selection of the best word versus line compare sets. More discussion will appear about this topic in a later section.

As a complementary addition to the DLSuperC suite of products, the DLSuperCTW program retains much of the feel, output format, accuracy, and filtering capability of the basic DLSuperC line compare program with many of DLSuperC’s standard features. This includes excluding comment lines, comparing between columns, comparing only part of two files between start and stop limits, case changes of text, and a rich set of report types featuring a summary of file processing statistics for line and word change results.

DLSuperCTW Processing

The DLSuperCTW program parses input text lines into individual word tokens using white space characters as the default word separator. Nine additional user-specified separators (e.g., “<”, “>”, “=”, etc.) can be added. The parsing operation uses a default maximum word length of 24 characters (less if desired) for breaking long contiguous text strings into smaller separate pseudo-word tokens. Differences are detected by processing the word changes between the parsed token words from both files. Processing for the best compare set of words is patterned after DLSuperC’s line compare algorithm, with words, instead of complete lines, being compacted to hash sum values.
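The parsing step described above can be illustrated with a minimal Python sketch. It is not the program's actual code: the function names `tokenize` and `token_hash` are hypothetical, treating each user separator as a token of its own is an assumption, and CRC-32 merely stands in for the undocumented hash sum.

```python
import re
import zlib

MAX_WORD_LEN = 24  # default word length limit stated in the text


def tokenize(line, separators="", max_len=MAX_WORD_LEN):
    """Split one input line into word tokens.

    White space always separates words. Each character in `separators`
    is emitted as a token of its own (an assumption). Any run longer
    than `max_len` is cut into shorter pseudo-word tokens.
    """
    tokens = []
    for chunk in line.split():
        if separators:
            # The capturing group keeps the separators in the split result.
            pattern = "([" + re.escape(separators) + "])"
            pieces = [p for p in re.split(pattern, chunk) if p]
        else:
            pieces = [chunk]
        for piece in pieces:
            for start in range(0, len(piece), max_len):
                tokens.append(piece[start:start + max_len])
    return tokens


def token_hash(token):
    """Compact a token to an integer; CRC-32 is only a stand-in hash."""
    return zlib.crc32(token.encode("utf-8"))
```

With the separators "<", ">", and "=", a tag such as `<a href="x">text</a>` decomposes into ten short tokens instead of two long ones, which is what makes a small change inside a long script string easy to isolate.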

A typical use might be comparing files consisting of source code, control script, or ASCII text. HTML script lines typically extend past the normal 80-character viewing margin. Some other script lines may extend up to 256 characters. Database text lines may even run to 2,000+ characters. In addition, many segments of continuous script text appear without intervening blank separators, making it difficult to identify which portion of a changed line segment is responsible for the line difference. The optional user-specified separators make long non-blank script strings easier to decompose into individual comparable units. Users also manually break up long lines, mainly so that the shorter segments fit a fixed window viewing area. This breakup does not normally change the textual meaning of scripts, but it does influence the normal detection of differences. Line reformatting, text shifting, and line breakup affect most compare results.

The DLSuperC line compare program would see this shifting of text as a line change. Conversely, DLSuperCTW’s default mode ignores the line shifting of words and accepts it by discarding the Old File CR/LF separators when determining word changes. DLSuperCTW respects the New File line-ending CR/LF control sequence in composing displayed lines. The Old File CR/LF sequence is used only as a reference in synchronizing the start of new lines to be compared.

The Wcln option overrides the default action of DLSuperCTW. It is similar to DLSuperC in that it compares only those words that are contained completely within a single line boundary. A word shifted to an adjacent line is considered to be part of that line. Wcln would view the word shifting as a change because it respects all CR/LF control sequences from both files. Nevertheless, DLSuperCTW with the Wcln option adds granularity to a line compare of text, making it easier to determine changes at the word token level. Hence, this option should add to the productivity of most users’ change reviews.
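The difference between the two modes can be seen on a small reflowed paragraph. The Python sketch below is only an illustration: it uses the standard library's `difflib.SequenceMatcher` in place of DLSuperCTW's own matching algorithm, and the line-bound variant simply pairs lines by position, which is a simplification of whatever line pairing Wcln really performs. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def matched_word_count(old_words, new_words):
    """Count words that fall inside matching blocks of the two sequences."""
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def compare_stream(old_lines, new_lines):
    """Default-mode style: ignore line breaks and compare one word stream."""
    old_words = [w for line in old_lines for w in line.split()]
    new_words = [w for line in new_lines for w in line.split()]
    return matched_word_count(old_words, new_words)


def compare_line_bound(old_lines, new_lines):
    """Wcln style: words may only match within a pair of lines.

    Lines are paired by position here, purely to keep the sketch short.
    """
    return sum(
        matched_word_count(old.split(), new.split())
        for old, new in zip(old_lines, new_lines)
    )


old = ["the quick brown fox", "jumps over the lazy dog"]
new = ["the quick brown", "fox jumps over the lazy dog"]
print(compare_stream(old, new))      # all 9 words match across the reflow
print(compare_line_bound(old, new))  # 8 match; "fox" is a delete plus an insert
```

The word "fox" moved from the end of the first line to the start of the second. The stream comparison accepts the shift, while the line-bound comparison reports it as a change, mirroring the behavior described for the default mode and for Wcln.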

DLSuperC, DLSuperCTW, and the Wcln Preference Option

Although word compare uses a slightly different process than the line compare used by the DLSuperC program, the overall results may appear to look quite similar. Both use a presentation that features matches, inserts and deletes. The unit of reference in DLSuperC is a complete line. Blank and reformatted lines, meaning two equal character content lines with a different number of blanks and/or spacing differences, are important to DLSuperC but have little meaning in DLSuperCTW’s word compare since spaces and interspersed blank lines contain no textual content or interruption.

DLSuperCTW composes lines from the words and tokens of the input file lines. For the default option, changed composed lines are, normally, presented on up to three vertically displayed consecutive lines. The inserted and deleted words are displayed under the referenced matching line at a position where those changes have occurred. The reflected composed line may be similar to the original input line but different line spacing is common due to the combined size necessary to accommodate both the vertical deleted and inserted word spacing. Different color highlighting and positioning of the changed line words beneath the matching reference help to highlight where the changes within the line occurred. The entire matched composed line is black whereas the insert line is blue and the delete line is red.

Reformatted lines are not tagged in word compare, as character blanks are unimportant within a compared line. The reference line and its spacing are always the new file line. Likewise, blank lines are not considered as inserted or deleted lines but would appear only as new file not processed lines. In addition, there is no Don’t Process option for filtering out blank lines. Should they exist in the input file, they can be excluded from the output report to decrease their visual cosmetic distraction. Comparing a line between column limits is likewise different, as the non-compared front prefix columns or line ending columns of the processed line are not displayed. This default mode of DLSuperCTW displays only the text between the compared limits in recreating the composed changed lines.

DLSuperCTW, with the Wcln option, restricts matching words to be constrained to the same lines up to each CR/LF ending. Words from a single line can only match words on another single line. This is similar to the DLSuperC line matching requirements. Wcln’s output lines reflect the exact input word positioning from both files (i.e. the lines are not composed).

A user manually inspecting the report may occasionally notice a word match that the matching algorithm missed under Wcln’s constraints. Word matching is initially made without considering line boundaries. This allows for better match selection when considering the entire file as a series of consecutive words, and it is the basis for the default mode of word matching detection. Later reclassification must undo some matches that are no longer valid. As a result, words belonging to other, less optimal competing matches may end up marked as unmatched.

Reclassifying matched words that cross line boundaries to a mismatched status is not difficult. But recovering and searching for other obvious overlooked matches is more challenging. Several procedures are followed to handle many common cases. Other rare cases with marginal expected success are not pursued. Some users may keenly recognize an occasional isolated missed match, but a full reevaluation of the required reclassification would be quite involved and time consuming. Rarely does the reclassification confuse the user or affect the overall value of the final results. There is a summary statistic for the override and recovery in the compare process.
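The easy half of this step, demoting matches that violate the one-line-to-one-line rule, might look like the following Python sketch. The rule used here is an assumption made for illustration: each old line keeps only the matches that land on the new line receiving most of its matched words. The recovery of overlooked matches, which the text calls the harder part, is not attempted, and the name `reclassify` is hypothetical.

```python
from collections import Counter, defaultdict


def reclassify(matches, old_line_of, new_line_of):
    """Split word matches into those kept and those demoted to mismatches.

    matches      list of (old_word_index, new_word_index) pairs found
                 without regard to line boundaries
    old_line_of  line number of each old-file word, indexed by word
    new_line_of  line number of each new-file word, indexed by word
    """
    # Count, for every old line, how many matched words land on each new line.
    votes = defaultdict(Counter)
    for old_idx, new_idx in matches:
        votes[old_line_of[old_idx]][new_line_of[new_idx]] += 1

    # Assumed rule: an old line partners with its most-voted new line.
    partner = {line: counts.most_common(1)[0][0] for line, counts in votes.items()}

    kept, demoted = [], []
    for old_idx, new_idx in matches:
        if partner[old_line_of[old_idx]] == new_line_of[new_idx]:
            kept.append((old_idx, new_idx))
        else:
            demoted.append((old_idx, new_idx))
    return kept, demoted
```

For nine words matched one-to-one where the fourth word has moved from old line 0 to new line 1, only that fourth match is demoted. A demoted word then shows up as a delete on one line and an insert on the other.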

DLSuperCTW expands the length and capability of the Report listbox by generating multi-colored text lines – specifically, black mixed with red words or black mixed with blue words composed as a single line versus a listbox which only allows a single color for a displayed line.

Tagging Codes for Report Lines

The DLSuperCTW difference determination uses the same change methodology as the DLSuperC, DLSuperCX (line compare), and DLSuperCBT (byte compare) programs. However, the comparison is oriented to word tokens, not lines or bytes. The displayed result lines carry common tags for flagging report lines, listed below for each compare mode.

The default compare option tags are:

  • Blank – Matched with both new and old file lines compared as equal.
  • MC- – Matched composed new file line with matched words or pseudo-word segments displayed.
  • MCC – Matched composed continuation matching words or pseudo-word segments displayed from the new file.
  • I – – Inserted line from new file. Absent (sequentially) from the old file.
  • IC- – Inserted composed new file line with inserted words or pseudo-word segments displayed.
  • ICC – Inserted composed continuation of words or pseudo-word segments displayed from the new file.
  • D – – Deleted line from old file. Absent (sequentially) from the new file.
  • DC- – Deleted composed old file line with deleted words or pseudo-word segments displayed.
  • DCC – Deleted composed continuation of words or pseudo-word segments displayed from the old file.

The Wcln preference compare option tags are:

  • Blank – Matched with both new file and old file lines word content compared as equal.
  • M – – Matched new file line with matched words and pseudo-word segments displayed.
  • I – – Inserted whole line from new file. Absent (sequentially) from the old file.
  • MI- – Matched new file line with blue colored inserted words and pseudo-word segments displayed.
  • D – – Deleted line from old file. Absent (sequentially) from the new file.
  • MD- – Matched new file line with red colored deleted words and pseudo-word segments displayed.

Note 1: A DP column heading is generated when a Dp option is selected. A line is tagged DP when a comment line from the new file has been detected; it is listed without being processed or compared to the old file. PDP lines are Part-Dp lines that are filtered from the new input file. Embedded Part-Dp lines will not appear in the report.

Note 2: All lines for the Wcln preference are listed up to the character limits of the browser used. Lines are not composed nor truncated.

Interpreting Results from the default mode for Composed Line Conventions

The Legend button on the Report Display Panel displays the following convention associated with the report lines:

Matched, inserted, deleted, and comment lines are written as whole lines up to the 80 or 128 character truncated limits of the report output line.

Composed lines (matched, inserted and deleted words) are listed with single space separators between each white-space delimited word. User defined word separators, or word-token strings up to 30 characters, are not separated by a space but appear as contiguous strings even though they are compared separately.

The composed text lines (MC, IC, and DC) contain the full input line of text, which word-wraps onto additional continuation lines tagged MCC, ICC, and DCC.

A left column assigned reference number appears for each input text line, which incorporates the continuation block of composed lines.

The text output for composed lines attempts to mirror the new file line CR/LF appearance. The old file lines are only mirrored for the “I –” classified lines.

Input line numbers for both files are listed to the right for each non-composed line, or above the block change in the rightmost columns for composed lines.

Comments and null lines never interrupt the word comparison allowing continued comparison across line boundaries. Old File comments and blank lines never appear in the output display report. Only New File comments and blank lines are displayed.
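The continuation convention for composed lines (an MC, IC, or DC line that wraps onto MCC, ICC, or DCC lines) can be sketched in Python as follows. This is a rough illustration only: it relies on the standard `textwrap` module, assumes a 128-character display width, joins words with single spaces, and does not model the space gaps or the vertical alignment of insert and delete lines under the matching line. The function name `compose` is hypothetical.

```python
import textwrap


def compose(words, first_tag="MC-", cont_tag="MCC", width=128):
    """Join words with single spaces and wrap them into tagged report rows.

    The first row carries `first_tag`; every wrapped row after it
    carries `cont_tag`.
    """
    text = " ".join(words)
    rows = textwrap.wrap(
        text, width=width, break_long_words=False, break_on_hyphens=False
    )
    return [(first_tag if i == 0 else cont_tag, row) for i, row in enumerate(rows)]
```

Calling `compose(words, "IC-", "ICC")` or `compose(words, "DC-", "DCC")` would tag the inserted or deleted portions in the same way.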

Report Customization and Appearance

The user can select the standard DLSuperCTW report type, whereby the output consists of a minimum (Ovsum), medium (Delta or Change), or maximum (Long) listing. This choice affects the size of the report and limits its volume so the user can efficiently inspect it to determine the importance of the text changes. The overall compare statistics always follow the compare data report. This summary covers the volume of lines, words, and changes. It also contains a recitation of the options specified, a summary of their effects on the compare operation, information on certain option selections, and a legend for the change flagging used in the output report.

It should be stressed that word change reports produced in the default compare mode can be larger than comparable line compare reports, due to the added line for the separate base matching data and the composed continuation lines for long lines. Many word compare reports may, in fact, contain too much change tracking information for rapid comprehension. This is sometimes called looking at a lot of noisy data. The changes may be an accurate portrayal of the data changes but can, at times, be overwhelming. More data is not always better: many changes between the two input files can yield results that look confusing and sometimes appear worthless. Yet more meaningful DLSuperCTW comparison reports can have overall value in characterizing the extent of the change activity between two related versions of some source code. The report appears quite similar to the corresponding output from DLSuperC’s line compare program, although the format may appear somewhat different. Many times, DLSuperCTW’s results can be superior to the older standard line compare report, as the newer Wcln and default word compare reports can pinpoint the exact changed area for any simple word change.

The output of the Wcln report can be saved as an HTML, RTF, or black-only text file. The saved HTML or RTF results can later be viewed with any of the user’s own external HTML or RTF viewers. The MS NotePad program is an excellent choice for viewing this output or even printing it as a background job. These colored reports can also be routed directly to either a color or non-color printer. Saving a color-capable file is much faster than in previous versions because a simpler internal HTML and RTF output generator is now used, replacing the more general and more complex components previously used for this operation.
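As an illustration of how little markup a colored report line needs, the Python sketch below renders one Wcln-style line as HTML, with inserted words in blue and deleted words in red. The markup, the status labels, and the function name `render_line` are assumptions for this example and do not reproduce the program's actual HTML output.

```python
from html import escape

# Colors follow the report convention: blue for inserts, red for deletes.
COLORS = {"ins": "blue", "del": "red"}


def render_line(tag, words):
    """Render one report line as an HTML fragment.

    words is a list of (text, status) pairs, where status is one of
    "same", "ins", or "del". Unchanged words stay in the default color.
    """
    parts = []
    for text, status in words:
        safe = escape(text)  # keep "<", ">", and "&" from being read as markup
        color = COLORS.get(status)
        if color:
            parts.append(f'<span style="color:{color}">{safe}</span>')
        else:
            parts.append(safe)
    return escape(tag) + " " + " ".join(parts)
```

Escaping each word matters for this kind of report, because the compared files are often HTML themselves and their tags must be shown as text rather than interpreted by the viewer.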

A dramatic change in output appearance applies to the Wcln option. File line numbers are displayed to the left of the compared text line. This contrasts with their former position at the extreme right of the output data line, which required truncation of some long text lines. Most users may have been unaware of those numbering columns, since they usually fell in the hidden portion of the displayed window area. The Wcln user can suppress the left-hand number columns via a preference option. The most important reason for moving the line numbering information to the left side of the report was the open-ended long line lengths supported by the HTML viewer.

How many ways can text differences be portrayed? The granularity of word compare versus line compare results may make a major difference in easily detecting important changes that might be overlooked at the line level. Both reports should be equally accurate within the constraints of the algorithm used (i.e., line versus word compare); neither is more accurate than the other. Looking at a colored word indicating a change should be more revealing than trying to decide which item or items in a changed line are different.

Some line and word changes are so subtle that most observers would have a hard time recognizing the actual difference. This difficulty prompted color coding word changes within changed lines as an aid, rather than classifying lines as changed and leaving it to the user to scan them for the exact words that changed.

Also, be forewarned that the creation of additional word tokens increases the program’s storage requirements for processing word changes compared to the normal line compare process, which requires fewer resources because it tracks only line changes. There is no specific program limitation within DLSuperCTW on the maximum size of files that can be processed. However, I’m sure practical limits exist, set by the maximum capacity of certain system resources. Huge files may be costly to process on marginal systems that do not have an ample amount of space and storage.

It should also come as no surprise that creating an HTML report, even with its fairly simple document markup requirements, might take longer to display than straight single-colored text lines. A competing RTF format for word changes would have required even more processing than HTML.

Finally, most users are unaware of the maximum file sizes that DOS and Windows can handle. FAT32, as used by Windows 98, limits a single file to just under 4 GB, although many older programs stop at 2 GB. NTFS, as used by Windows NT and 2000, supports files far larger than 4 GB.

DLSuperCTW Recursion Operations

The operation of DLSuperCTW incorporates a recursion process that partitions the input files into blocks of 8,000 lines of text. Lines are then parsed into word tokens, which are the actual units that are compared. The process is heuristic: it backs up to the last best compare set and extends the next set by 8,000 previous or newer lines until the complete file has been processed.

The program’s output report shows the overall process of the composite results. The user may never be aware when additional passes are performed for big files. However, normal compares rarely require more than a single pass due to the initial processing of 8,000 lines in the first pass.
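The block partitioning alone can be sketched in a few lines of Python. The sketch covers only the fixed 8,000-line blocks; the backup to the last best compare set is not modeled, so `passes_needed` should be read as a rough lower bound rather than the program's real pass count. Both function names are hypothetical.

```python
BLOCK_LINES = 8000  # block size stated in the text


def blocks(lines, block_size=BLOCK_LINES):
    """Yield consecutive blocks of at most `block_size` lines."""
    for start in range(0, len(lines), block_size):
        yield lines[start:start + block_size]


def passes_needed(line_count, block_size=BLOCK_LINES):
    """Estimate the number of passes as the block count, with a minimum of one."""
    return max(1, -(-line_count // block_size))  # ceiling division
```

Under this estimate any file of 8,000 lines or fewer is handled in a single pass, which matches the observation that normal compares rarely need more than one.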

The process followed by DLSuperCTW is not that different from the line compare algorithm followed by DLSuperC. However, the extra line decomposition and the more taxing output display formatting are considerably different.