Menu

Better Text Alignment using Indentation Hierarchy, Unique Line flags, Line-to-Line mapping

2013-05-31
2013-06-04
  • Nuno Furbiz

    Nuno Furbiz - 2013-05-31

    Hi,

    I've also emailed Joachim directly, but I thought it would be good to open this up for input from all interested users.

    Alignment between compared files is still a problem, and none of the diff tools I have looked at (a few, free and paid) still don't do a great job at aligning text properly. This is because the 'shortest difference' algorithm is actually not always the best solution, Why? Because it has no concept of word boundaries and correct syntax.

    Two very common features found in most text files which are subject to difference checking are:

    1. 1 or more lines which are unique within the file - i.e. line number 'x' is not duplicated anywhere else in the file;
    2. Indentation, or bracketing (in the case of most programming languages).

    If KDiff were to parse each file before comparing them and identify every unique line with some flag (in a temporary array variable or similar), and also identify the branch hierarchy using either indentation (spaces/tabs) and/or brackets (or both!), it could then give precedence to the branch level before returning a character match. It could also parse each input file and build a temp array of all 'words' found (printable strings separated by whitespace or any other separator the user wants to use), and then give precedence to matching whole words where possible over just matching characters. It should also have an option to either ignore or respect end of line characters/sequences (LF, CR, CRLF, etc.). It would be great if the user could choose what character/hex strings define an 'EndOfLine' for a given file or compare operation, but by default autodetect Unix or DOS EOL sequences.

    I confess I am not a programmer, but I am not after kudos, I am after time saving code and results that will reduce how much eyeballing I have to do to compare 2 or more files. So here is some initial Excel VBA code I have been playing with to start the ball rolling (please forgive me for not speaking C[++|#] (a real language ;)) and being able to directly contribute to the KDiff3 source...one day maybe):
    (sorry about the formatting, I couldn't get correct syntax highlighting to work with this)

    Option Base 1
    Option Explicit
    
    Sub fxUniqueLine()
        ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
        ' This procedure iterates through every line of text checking  '
        ' for uniqueness by comparing it to every other line. As each  '
        ' line is confirmed unique or duplicate, it is marked as such. '
        ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    
        Dim intLineCount As Integer
        Dim intIndentLevel As Integer
        Dim intIndex As Integer
        Dim strOriginal As String
        Dim strNew As String
        Dim strIgnore As String
        Dim intIgnore As Integer
        Dim c As Integer
        Dim i As Integer
        Dim j As Integer
        Dim k As Integer
        Dim r As Integer
        Dim bolIgnore As Boolean
        Dim arrText() As Variant
        Dim arrText2() As Variant
        Dim arrText2() As Variant
        Dim intIgnoreStrings As Integer
        Dim arrIgnoreStrings() As Variant
    
        ' create array of strings to ignore (i.e. ignore comments)
        Worksheets("IgnoreStrings").Select
        Cells(Worksheets("IgnoreStrings").Rows.Count, 1).End(xlUp).Select
        intIgnoreStrings = Selection.Row
        arrIgnoreStrings() = Range(Cells(1, 1), Cells(intIgnoreStrings, 2)).Value2
        ' calculate and store the length of each IgnoreString to help the loop
        For i = LBound(arrIgnoreStrings) To UBound(arrIgnoreStrings)
            arrIgnoreStrings(i, 2) = Len(arrIgnoreStrings(i, 1))
        Next
    
        ' get the range containing the text we want to check
        Worksheets("Text1").Select ' change "Text1" to the name of the tab in which you paste your text to check
        Cells(Worksheets("Text1").Rows.Count, 1).End(xlUp).Select
        intLineCount = Selection.Row
    '    create a loop with these code blocks (above/below) to iterate through x files (ranges) to compare
    '    Worksheets("Text2").Select
    '    Cells(Worksheets("Text2").Rows.Count, 1).End(xlUp).Select
    '    intLineCount = Selection.Row
    
        intIndex = 1 ' our array indexer
    
        For r = 1 To intLineCount ' loop through cells in rngText; ignore blank lines and IgnoreStrings
            strOriginal = Cells(r, 1).Value2 ' save the original string just in case we need it
            If strOriginal <> "" Then ' if cell is blank ignore (keep non-blank lines)
                ReDim Preserve arrText(8, intIndex) ' ReDim can only expand last dimension of array...
                arrText(1, intIndex) = intIndex ' add a line number to each kept string
                arrText(2, intIndex) = strOriginal ' store the original string
                intIgnore = 0 ' initialise character position to ignore from
                bolIgnore = False ' initialise ignore flag to false (i.e. by default do not ignore any characters from string)
                For c = 1 To (Len(strOriginal)) ' loop through characters in current line (string)
                    For k = LBound(arrIgnoreStrings) To UBound(arrIgnoreStrings) ' loop through list of ignore strings for each string position
                        If Mid(strOriginal, c, arrIgnoreStrings(k, 2)) = arrIgnoreStrings(k, 1) Then ' if current IgnoreString is found
                            bolIgnore = True ' then we have to ignore all characters in the string starting from current position
                            Exit For ' exit IgnoreString check loop because we found a matching IgnoreString (only need 1 match)
                        End If
                    Next ' else check next IgnoreString
                    If bolIgnore Then intIgnore = c ' if line contains IgnoreString then
                    If intIgnore > 0 Then ' if the IgnoreString starts after the 1st character then
                        strNew = Left(strOriginal, intIgnore - 1) ' store the part not to be ignored
                        strIgnore = Right(strOriginal, Len(strOriginal) - Len(strNew)) ' and the part to be ignored
                        Exit For ' then exit string check loop because we found an IgnoreString
                    Else
                        strNew = strOriginal ' elseif no IgnoreString found then new string = original string (we will check whole line for uniqueness)
                        strIgnore = "" ' and set the ignore part to empty - there is nothing to be ignored in the line
                    End If
                Next
                arrText(3, intIndex) = strNew ' store the new match string to be checked for uniqueness
                arrText(4, intIndex) = strIgnore ' store the string to be ignored
                intIndentLevel = 0
                For j = 1 To Len(strOriginal) ' loop through original string again
                    If Not Mid(strOriginal, j, 1) Like " " Then ' count leading spaces to use as IndentLevel
                        Exit For ' when no more leading spaces are found exit loop - we know the IndentLevel
                    Else
                        intIndentLevel = intIndentLevel + 1 ' otherwise add 1 to the IndentLevel
                    End If
                Next ' check next character in string
                arrText(5, intIndex) = intIndentLevel ' store the indentation level
                arrText(6, intIndex) = intIndentLevel & "." & intIndex ' use the line number as the indentation level index
                arrText(7, intIndex) = Len(strNew) ' store the new match string length
                intIndex = intIndex + 1
            End If
        Next
    
        bolUnique = False
    '   Loop through the source array (arrText) and mark unique lines
        For k = LBound(arrText, 2) To UBound(arrText, 2) - 1
            If Not arrText(8, k) = "Duplicate" Then
                bolUnique = True
                For j = k + 1 To UBound(arrText, 2)
                    If Not arrText(8, j) = "Duplicate" Then
                        If arrText(3, k) = arrText(3, j) Then
                            arrText(8, j) = "Duplicate"
                            bolUnique = False
                        End If
                    End If
                Next
                If bolUnique Then arrText(8, k) = "Unique" Else arrText(8, k) = "Duplicated"
                Debug.Print "arrText(3, " & k & ")[" & arrText(3, k) & "] is " & arrText(8, k)
            End If
        Next
        If IsEmpty(arrText(8, UBound(arrText, 2))) Then arrText(8, UBound(arrText, 2)) = "Unique"
    
    End Sub
    
     
  • Joachim Eibl

    Joachim Eibl - 2013-06-02

    Hi Stephen,

    Thank you for your suggestion and for your effort to give an example implementation.

    I also had similar ideas previously, but for a number of reasons didn't follow in the path you suggest. A good diff algorithm is hard to get right. Matching unique lines is also a part of the algorithm in use (as a starting point). But the algorithm must also work well if no lines are unique. KDiff3 also explicitely ignores indentation because especially for code this changes very often, but the lines should still be aligned.

    The path I'd like to choose (if I had the time) would be to improve the diff results by also taking the character differences between nearby lines into account.
    (But please understand that because it took many versions to get the current implementation working reliable, I am very reluctant/careful to change anything in that part of KDiff3.)

    I don't quite understand what would be gained by specifying your own EOL-character. Could you please give an example?

    Best regards,
    Joachim

     
    • Nuno Furbiz

      Nuno Furbiz - 2013-06-04

      Hi Joachim,

      Firstly, thanks so much for taking the time to reply. I really appreciate your response.

      Regarding the ability to 'choose an EOL character/character sequence', it's a low priority, but to try and illustrate why I thought this would be a useful function, I think it could be useful when comparing binary files, and the purpose of being able to do this is to tell KDiff where/when to wrap lines. I totally understand your point about the difficulty in coming up with a good diff algo. I guess it would help to point out that I don't really even use KDiff3 for comparing code, but rather for comparing configuration files from network systems (such as router/switch/firewall configs) - in terms of indentation, I can say with a high degree of confidence that the indentation in an automatically generated configuration file from one of these types of system is very regular/reliable/dependable, because the text has not been manually typed. I understand too that it is possible to have a text file where there are no unique 'lines'. But what I was hoping to achieve was the ability as the KDiff user to dictate up front whether or not I want the algorithm to give precedence to or take account of indentation/hierarchy, line uniqueness, EOL characters, whole word or line matching over byte/character matching (i.e. if I can match a whole line or word, then prefer that match over finding just the longest character level match).

      When you generate a config file from a network type system, there are syntactic rules which (unless there is a bug) are always followed. So I always know the order of the lines and there is a fixed vocabulary, with only parameter values and variable object names changing (such as MAC/IP addresses, hostnames, FQDNs, policy names, ACL entry comments, etc.). The syntax of the lines in which these variable names are found, however, is always the same (follows a rule).

      Also, many network operating systems (switches, routers, firewalls etc.) use indentation in the config file output to indicate the parent-child relationship of each configuration object/parameter.

      E.g. 1

      filter
          log 101 create
              description "Known Routing Protocols"
              destination memory 100
          exit
          log 102 create
              description "Known OAM"
              destination memory 100
          exit
          log 103 create
              description "NTP and Syslog"
              destination memory 100
          exit
          log 104 create
              description "SSH, Radius and SNMP"
              destination syslog 3
          exit
          log 105 create
              description "Denied Traffic"
              destination syslog 3
          exit
          log 106 create
              description "cpm-filter test - allow all"
              destination memory 1024
          exit
      exit
      

      E.g. 2

      port 1/1/1
          description "To-<neighbour>-1/1/1"
          ethernet
              mtu 9190
              egress-scheduler-policy "Qos-Policy-1"
              network
                  queue-policy "Queue-Policy-1"
                  accounting-policy 1
                  collect-stats
              exit
          exit
          no shutdown
      exit
      port 1/1/2
          description "To-<neighbour>-1/1/2"
          ethernet
              mtu 9190
              egress-scheduler-policy "Qos-Policy-1"
              network
                  queue-policy "Queue-Policy-1"
                  accounting-policy 1
                  collect-stats
              exit
          exit
          no shutdown
      exit
      

      E.g. 3

       enter org UCS
           set descr "UCS Domain 1"
       up
       enter org ORGx
           enter boot-policy sanboot01 purpose operational
               enter storage
                   enter san-image primary
                       set vhba vHBA-A
                       enter path primary
                           set lun 0
                           set wwn 50:0A:09:83:88:C2:6C:23
                       exit
                       enter path secondary
                           set lun 0
                           set wwn 50:0A:09:84:88:C2:6C:23
                       exit
                   exit
                   enter san-image secondary
                       set vhba vHBA-B
                       enter path primary
                           set lun 0
                           set wwn 50:0A:09:83:98:C2:6C:23
                       exit
                       enter path secondary
                           set lun 0
                           set wwn 50:0A:09:84:98:C2:6C:23
                       exit
                   exit
      
                   set order 2
               exit
               enter virtual-media read-only
                   set order 1
               exit
               set descr sanboot01
               set enforce-vnic-name yes
               set reboot-on-update no
           exit
      

      These are just some sanitised snippets of configuration output from a couple of different systems to show you what I mean.

      For many of these types of configuration files, there is a common pattern of repetition for one or more of the 'branches' of the tree. I.e. on a router, for example, there will be n-times an IP interface, with all the interface specific config lines/settings, which are shown by way of indentation to be child objects of the interface object in the hierarchy. On a firewall, for example, there are n-times policies/rules, and for each rule there may be objects such as source, destination, service, QoS settings, log settings, etc., all shown 'by-way-of-indentation' as children of the rule object.

      This would enable the user to 'give' KDiff some awareness of the hierarchy if and when it exists in a file so that it can better match whole 'branches' or blocks of lines when that is a more useful result. In short, I guess I'm saying that a 'longest match' algorithm is not necessarily a best match algorithm, and this would usually be true for auto-generated text files where there are syntax rules and hierarchy/indentation.

      Does that help to explain better what I am getting at?

      Best Regards,
      Steve.

       
      • Nuno Furbiz

        Nuno Furbiz - 2013-06-04

        BTW, Joachim,

        My implementation is far from complete...as I am not a programmer. But I hope it serves well to show how I determine the uniqueness of each line, how I ignore 'comment' style trailing strings, and how I determine indentation level for each line to try and find where in the tree each line sits.

        Cheers,
        Steve.

         
  • Nuno Furbiz

    Nuno Furbiz - 2013-06-04

    "A good diff algorithm is hard to get right" - I definitely agree...but I would change this to read: "A good diff algorithm is one that is built to suit the files being compared".

     
  • Nuno Furbiz

    Nuno Furbiz - 2013-06-04

    One more thing I forgot to mention before was to have the ability to tell KDiff which line number to start the comparison from in each source file. Sometimes there are preliminary lines in a config output file that I want to ignore.

    Also, what do you think of the idea of having a file which contains meta information for a given comparison operation? It would contain the file names, the 'start-line' for each file, flags telling KDiff whether or not to regard indentation, line uniqueness, match precedence (i.e. match lines, then words, then characters - as mentioned in my previous post) etc. It could also have a comma separated list of matching lines between the files as a guide.

    What do you think?

     
MongoDB Logo MongoDB