KDiff3 / Discussion / Open Discussion: Better Text Alignment using Indentation Hierarchy, Unique Line flags, Line-to-Line mapping

Hi,

I've also emailed Joachim directly, but I thought it would be good to open this up for input from all interested users.

Alignment between compared files is still a problem, and none of the diff tools I have looked at (a few, free and paid) still don't do a great job at aligning text properly. This is because the 'shortest difference' algorithm is actually not always the best solution, Why? Because it has no concept of word boundaries and correct syntax.

Two very common features found in most text files which are subject to difference checking are:

1 or more lines which are unique within the file - i.e. line number 'x' is not duplicated anywhere else in the file;
Indentation, or bracketing (in the case of most programming languages).

If KDiff were to parse each file before comparing them and identify every unique line with some flag (in a temporary array variable or similar), and also identify the branch hierarchy using either indentation (spaces/tabs) and/or brackets (or both!), it could then give precedence to the branch level before returning a character match. It could also parse each input file and build a temp array of all 'words' found (printable strings separated by whitespace or any other separator the user wants to use), and then give precedence to matching whole words where possible over just matching characters. It should also have an option to either ignore or respect end of line characters/sequences (LF, CR, CRLF, etc.). It would be great if the user could choose what character/hex strings define an 'EndOfLine' for a given file or compare operation, but by default autodetect Unix or DOS EOL sequences.

I confess I am not a programmer, but I am not after kudos, I am after time saving code and results that will reduce how much eyeballing I have to do to compare 2 or more files. So here is some initial Excel VBA code I have been playing with to start the ball rolling (please forgive me for not speaking C[++|#] (a real language ;)) and being able to directly contribute to the KDiff3 source...one day maybe):
(sorry about the formatting, I couldn't get correct syntax highlighting to work with this)

Option Base 1
Option Explicit

Sub fxUniqueLine()
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ' This procedure iterates through every line of text checking  '
    ' for uniqueness by comparing it to every other line. As each  '
    ' line is confirmed unique or duplicate, it is marked as such. '
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

    Dim intLineCount As Integer
    Dim intIndentLevel As Integer
    Dim intIndex As Integer
    Dim strOriginal As String
    Dim strNew As String
    Dim strIgnore As String
    Dim intIgnore As Integer
    Dim c As Integer
    Dim i As Integer
    Dim j As Integer
    Dim k As Integer
    Dim r As Integer
    Dim bolIgnore As Boolean
    Dim arrText() As Variant
    Dim arrText2() As Variant
    Dim arrText2() As Variant
    Dim intIgnoreStrings As Integer
    Dim arrIgnoreStrings() As Variant

    ' create array of strings to ignore (i.e. ignore comments)
    Worksheets("IgnoreStrings").Select
    Cells(Worksheets("IgnoreStrings").Rows.Count, 1).End(xlUp).Select
    intIgnoreStrings = Selection.Row
    arrIgnoreStrings() = Range(Cells(1, 1), Cells(intIgnoreStrings, 2)).Value2
    ' calculate and store the length of each IgnoreString to help the loop
    For i = LBound(arrIgnoreStrings) To UBound(arrIgnoreStrings)
        arrIgnoreStrings(i, 2) = Len(arrIgnoreStrings(i, 1))
    Next

    ' get the range containing the text we want to check
    Worksheets("Text1").Select ' change "Text1" to the name of the tab in which you paste your text to check
    Cells(Worksheets("Text1").Rows.Count, 1).End(xlUp).Select
    intLineCount = Selection.Row
'    create a loop with these code blocks (above/below) to iterate through x files (ranges) to compare
'    Worksheets("Text2").Select
'    Cells(Worksheets("Text2").Rows.Count, 1).End(xlUp).Select
'    intLineCount = Selection.Row

    intIndex = 1 ' our array indexer

    For r = 1 To intLineCount ' loop through cells in rngText; ignore blank lines and IgnoreStrings
        strOriginal = Cells(r, 1).Value2 ' save the original string just in case we need it
        If strOriginal <> "" Then ' if cell is blank ignore (keep non-blank lines)
            ReDim Preserve arrText(8, intIndex) ' ReDim can only expand last dimension of array...
            arrText(1, intIndex) = intIndex ' add a line number to each kept string
            arrText(2, intIndex) = strOriginal ' store the original string
            intIgnore = 0 ' initialise character position to ignore from
            bolIgnore = False ' initialise ignore flag to false (i.e. by default do not ignore any characters from string)
            For c = 1 To (Len(strOriginal)) ' loop through characters in current line (string)
                For k = LBound(arrIgnoreStrings) To UBound(arrIgnoreStrings) ' loop through list of ignore strings for each string position
                    If Mid(strOriginal, c, arrIgnoreStrings(k, 2)) = arrIgnoreStrings(k, 1) Then ' if current IgnoreString is found
                        bolIgnore = True ' then we have to ignore all characters in the string starting from current position
                        Exit For ' exit IgnoreString check loop because we found a matching IgnoreString (only need 1 match)
                    End If
                Next ' else check next IgnoreString
                If bolIgnore Then intIgnore = c ' if line contains IgnoreString then
                If intIgnore > 0 Then ' if the IgnoreString starts after the 1st character then
                    strNew = Left(strOriginal, intIgnore - 1) ' store the part not to be ignored
                    strIgnore = Right(strOriginal, Len(strOriginal) - Len(strNew)) ' and the part to be ignored
                    Exit For ' then exit string check loop because we found an IgnoreString
                Else
                    strNew = strOriginal ' elseif no IgnoreString found then new string = original string (we will check whole line for uniqueness)
                    strIgnore = "" ' and set the ignore part to empty - there is nothing to be ignored in the line
                End If
            Next
            arrText(3, intIndex) = strNew ' store the new match string to be checked for uniqueness
            arrText(4, intIndex) = strIgnore ' store the string to be ignored
            intIndentLevel = 0
            For j = 1 To Len(strOriginal) ' loop through original string again
                If Not Mid(strOriginal, j, 1) Like " " Then ' count leading spaces to use as IndentLevel
                    Exit For ' when no more leading spaces are found exit loop - we know the IndentLevel
                Else
                    intIndentLevel = intIndentLevel + 1 ' otherwise add 1 to the IndentLevel
                End If
            Next ' check next character in string
            arrText(5, intIndex) = intIndentLevel ' store the indentation level
            arrText(6, intIndex) = intIndentLevel & "." & intIndex ' use the line number as the indentation level index
            arrText(7, intIndex) = Len(strNew) ' store the new match string length
            intIndex = intIndex + 1
        End If
    Next

    bolUnique = False
'   Loop through the source array (arrText) and mark unique lines
    For k = LBound(arrText, 2) To UBound(arrText, 2) - 1
        If Not arrText(8, k) = "Duplicate" Then
            bolUnique = True
            For j = k + 1 To UBound(arrText, 2)
                If Not arrText(8, j) = "Duplicate" Then
                    If arrText(3, k) = arrText(3, j) Then
                        arrText(8, j) = "Duplicate"
                        bolUnique = False
                    End If
                End If
            Next
            If bolUnique Then arrText(8, k) = "Unique" Else arrText(8, k) = "Duplicated"
            Debug.Print "arrText(3, " & k & ")[" & arrText(3, k) & "] is " & arrText(8, k)
        End If
    Next
    If IsEmpty(arrText(8, UBound(arrText, 2))) Then arrText(8, UBound(arrText, 2)) = "Unique"

End Sub

Joachim Eibl - 2013-06-02

Hi Stephen,

Thank you for your suggestion and for your effort to give an example implementation.

I also had similar ideas previously, but for a number of reasons didn't follow in the path you suggest. A good diff algorithm is hard to get right. Matching unique lines is also a part of the algorithm in use (as a starting point). But the algorithm must also work well if no lines are unique. KDiff3 also explicitely ignores indentation because especially for code this changes very often, but the lines should still be aligned.

The path I'd like to choose (if I had the time) would be to improve the diff results by also taking the character differences between nearby lines into account.
(But please understand that because it took many versions to get the current implementation working reliable, I am very reluctant/careful to change anything in that part of KDiff3.)

I don't quite understand what would be gained by specifying your own EOL-character. Could you please give an example?

Best regards,
Joachim

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Nuno Furbiz - 2013-06-04
  
  Hi Joachim,
  
  Firstly, thanks so much for taking the time to reply. I really appreciate your response.
  
  Regarding the ability to 'choose an EOL character/character sequence', it's a low priority, but to try and illustrate why I thought this would be a useful function, I think it could be useful when comparing binary files, and the purpose of being able to do this is to tell KDiff where/when to wrap lines. I totally understand your point about the difficulty in coming up with a good diff algo. I guess it would help to point out that I don't really even use KDiff3 for comparing code, but rather for comparing configuration files from network systems (such as router/switch/firewall configs) - in terms of indentation, I can say with a high degree of confidence that the indentation in an automatically generated configuration file from one of these types of system is very regular/reliable/dependable, because the text has not been manually typed. I understand too that it is possible to have a text file where there are no unique 'lines'. But what I was hoping to achieve was the ability as the KDiff user to dictate up front whether or not I want the algorithm to give precedence to or take account of indentation/hierarchy, line uniqueness, EOL characters, whole word or line matching over byte/character matching (i.e. if I can match a whole line or word, then prefer that match over finding just the longest character level match).
  
  When you generate a config file from a network type system, there are syntactic rules which (unless there is a bug) are always followed. So I always know the order of the lines and there is a fixed vocabulary, with only parameter values and variable object names changing (such as MAC/IP addresses, hostnames, FQDNs, policy names, ACL entry comments, etc.). The syntax of the lines in which these variable names are found, however, is always the same (follows a rule).
  
  Also, many network operating systems (switches, routers, firewalls etc.) use indentation in the config file output to indicate the parent-child relationship of each configuration object/parameter.
  
  E.g. 1
  
  filter log 101 create description "Known Routing Protocols" destination memory 100 exit log 102 create description "Known OAM" destination memory 100 exit log 103 create description "NTP and Syslog" destination memory 100 exit log 104 create description "SSH, Radius and SNMP" destination syslog 3 exit log 105 create description "Denied Traffic" destination syslog 3 exit log 106 create description "cpm-filter test - allow all" destination memory 1024 exit exit
  
  E.g. 2
  
  port 1/1/1 description "To-<neighbour>-1/1/1" ethernet mtu 9190 egress-scheduler-policy "Qos-Policy-1" network queue-policy "Queue-Policy-1" accounting-policy 1 collect-stats exit exit no shutdown exit port 1/1/2 description "To-<neighbour>-1/1/2" ethernet mtu 9190 egress-scheduler-policy "Qos-Policy-1" network queue-policy "Queue-Policy-1" accounting-policy 1 collect-stats exit exit no shutdown exit
  
  E.g. 3
  
  enter org UCS set descr "UCS Domain 1" up enter org ORGx enter boot-policy sanboot01 purpose operational enter storage enter san-image primary set vhba vHBA-A enter path primary set lun 0 set wwn 50:0A:09:83:88:C2:6C:23 exit enter path secondary set lun 0 set wwn 50:0A:09:84:88:C2:6C:23 exit exit enter san-image secondary set vhba vHBA-B enter path primary set lun 0 set wwn 50:0A:09:83:98:C2:6C:23 exit enter path secondary set lun 0 set wwn 50:0A:09:84:98:C2:6C:23 exit exit set order 2 exit enter virtual-media read-only set order 1 exit set descr sanboot01 set enforce-vnic-name yes set reboot-on-update no exit
  
  These are just some sanitised snippets of configuration output from a couple of different systems to show you what I mean.
  
  For many of these types of configuration files, there is a common pattern of repetition for one or more of the 'branches' of the tree. I.e. on a router, for example, there will be n-times an IP interface, with all the interface specific config lines/settings, which are shown by way of indentation to be child objects of the interface object in the hierarchy. On a firewall, for example, there are n-times policies/rules, and for each rule there may be objects such as source, destination, service, QoS settings, log settings, etc., all shown 'by-way-of-indentation' as children of the rule object.
  
  This would enable the user to 'give' KDiff some awareness of the hierarchy if and when it exists in a file so that it can better match whole 'branches' or blocks of lines when that is a more useful result. In short, I guess I'm saying that a 'longest match' algorithm is not necessarily a best match algorithm, and this would usually be true for auto-generated text files where there are syntax rules and hierarchy/indentation.
  
  Does that help to explain better what I am getting at?
  
  Best Regards,
  Steve.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Nuno Furbiz - 2013-06-04
    
    BTW, Joachim,
    
    My implementation is far from complete...as I am not a programmer. But I hope it serves well to show how I determine the uniqueness of each line, how I ignore 'comment' style trailing strings, and how I determine indentation level for each line to try and find where in the tree each line sits.
    
    Cheers,
    Steve.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nuno Furbiz - 2013-06-04

"A good diff algorithm is hard to get right" - I definitely agree...but I would change this to read: "A good diff algorithm is one that is built to suit the files being compared".

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nuno Furbiz - 2013-06-04

One more thing I forgot to mention before was to have the ability to tell KDiff which line number to start the comparison from in each source file. Sometimes there are preliminary lines in a config output file that I want to ignore.

Also, what do you think of the idea of having a file which contains meta information for a given comparison operation? It would contain the file names, the 'start-line' for each file, flags telling KDiff whether or not to regard indentation, line uniqueness, match precedence (i.e. match lines, then words, then characters - as mentioned in my previous post) etc. It could also have a comma separated list of matching lines between the files as a guide.

What do you think?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Better Text Alignment using Indentation Hierarchy, Unique Line flags,...

A graphical text difference analyzer

Forums

Help

Better Text Alignment using Indentation Hierarchy, Unique Line flags, Line-to-Line mapping