I've also emailed Joachim directly, but I thought it would be good to open this up for input from all interested users.
Alignment between compared files is still a problem, and none of the diff tools I have looked at (a few, free and paid) still don't do a great job at aligning text properly. This is because the 'shortest difference' algorithm is actually not always the best solution, Why? Because it has no concept of word boundaries and correct syntax.
Two very common features found in most text files which are subject to difference checking are:
1 or more lines which are unique within the file - i.e. line number 'x' is not duplicated anywhere else in the file;
Indentation, or bracketing (in the case of most programming languages).
If KDiff were to parse each file before comparing them and identify every unique line with some flag (in a temporary array variable or similar), and also identify the branch hierarchy using either indentation (spaces/tabs) and/or brackets (or both!), it could then give precedence to the branch level before returning a character match. It could also parse each input file and build a temp array of all 'words' found (printable strings separated by whitespace or any other separator the user wants to use), and then give precedence to matching whole words where possible over just matching characters. It should also have an option to either ignore or respect end of line characters/sequences (LF, CR, CRLF, etc.). It would be great if the user could choose what character/hex strings define an 'EndOfLine' for a given file or compare operation, but by default autodetect Unix or DOS EOL sequences.
I confess I am not a programmer, but I am not after kudos, I am after time saving code and results that will reduce how much eyeballing I have to do to compare 2 or more files. So here is some initial Excel VBA code I have been playing with to start the ball rolling (please forgive me for not speaking C[++|#] (a real language ;)) and being able to directly contribute to the KDiff3 source...one day maybe):
(sorry about the formatting, I couldn't get correct syntax highlighting to work with this)
OptionBase1OptionExplicitSubfxUniqueLine()''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' This procedure iterates through every line of text checking '' for uniqueness by comparing it to every other line. As each '' line is confirmed unique or duplicate, it is marked as such. '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''DimintLineCountAsIntegerDimintIndentLevelAsIntegerDimintIndexAsIntegerDimstrOriginalAsStringDimstrNewAsStringDimstrIgnoreAsStringDimintIgnoreAsIntegerDimcAsIntegerDimiAsIntegerDimjAsIntegerDimkAsIntegerDimrAsIntegerDimbolIgnoreAsBooleanDimarrText()AsVariantDimarrText2()AsVariantDimarrText2()AsVariantDimintIgnoreStringsAsIntegerDimarrIgnoreStrings()AsVariant' create array of strings to ignore (i.e. ignore comments) Worksheets("IgnoreStrings").Select Cells(Worksheets("IgnoreStrings").Rows.Count, 1).End(xlUp).Select intIgnoreStrings = Selection.Row arrIgnoreStrings() = Range(Cells(1, 1), Cells(intIgnoreStrings, 2)).Value2' calculate and store the length of each IgnoreString to help the loop For i = LBound(arrIgnoreStrings) To UBound(arrIgnoreStrings) arrIgnoreStrings(i, 2) = Len(arrIgnoreStrings(i, 1)) Next' get the range containing the text we want to checkWorksheets("Text1").Select' change "Text1" to the name of the tab in which you paste your text to check Cells(Worksheets("Text1").Rows.Count, 1).End(xlUp).Select intLineCount = Selection.Row' create a loop with these code blocks (above/below) to iterate through x files (ranges) to compare' Worksheets("Text2").Select' Cells(Worksheets("Text2").Rows.Count, 1).End(xlUp).Select' intLineCount = Selection.RowintIndex=1' our array indexerForr=1TointLineCount' loop through cells in rngText; ignore blank lines and IgnoreStringsstrOriginal=Cells(r,1).Value2' save the original string just in case we need itIfstrOriginal<>""Then' if cell is blank ignore (keep non-blank lines)ReDimPreservearrText(8,intIndex)' ReDim can only expand last dimension of array...arrText(1,intIndex)=intIndex' add a line number to each kept stringarrText(2,intIndex)=strOriginal' store the original stringintIgnore=0' initialise character position to ignore frombolIgnore=False' initialise ignore flag to false (i.e. by default do not ignore any characters from string)Forc=1To(Len(strOriginal))' loop through characters in current line (string)Fork=LBound(arrIgnoreStrings)ToUBound(arrIgnoreStrings)' loop through list of ignore strings for each string positionIfMid(strOriginal,c,arrIgnoreStrings(k,2))=arrIgnoreStrings(k,1)Then' if current IgnoreString is foundbolIgnore=True' then we have to ignore all characters in the string starting from current positionExitFor' exit IgnoreString check loop because we found a matching IgnoreString (only need 1 match) End IfNext' else check next IgnoreStringIfbolIgnoreThenintIgnore=c' if line contains IgnoreString thenIfintIgnore>0Then' if the IgnoreString starts after the 1st character thenstrNew=Left(strOriginal,intIgnore-1)' store the part not to be ignoredstrIgnore=Right(strOriginal,Len(strOriginal)-Len(strNew))' and the part to be ignoredExitFor' then exit string check loop because we found an IgnoreString ElsestrNew=strOriginal' elseif no IgnoreString found then new string = original string (we will check whole line for uniqueness)strIgnore=""' and set the ignore part to empty - there is nothing to be ignored in the line End If NextarrText(3,intIndex)=strNew' store the new match string to be checked for uniquenessarrText(4,intIndex)=strIgnore' store the string to be ignored intIndentLevel = 0Forj=1ToLen(strOriginal)' loop through original string againIfNotMid(strOriginal,j,1)Like" "Then' count leading spaces to use as IndentLevelExitFor' when no more leading spaces are found exit loop - we know the IndentLevel ElseintIndentLevel=intIndentLevel+1' otherwise add 1 to the IndentLevel End IfNext' check next character in stringarrText(5,intIndex)=intIndentLevel' store the indentation levelarrText(6,intIndex)=intIndentLevel&"."&intIndex' use the line number as the indentation level indexarrText(7,intIndex)=Len(strNew)' store the new match string length intIndex = intIndex + 1 End If Next bolUnique = False' Loop through the source array (arrText) and mark unique lines For k = LBound(arrText, 2) To UBound(arrText, 2) - 1 If Not arrText(8, k) = "Duplicate" Then bolUnique = True For j = k + 1 To UBound(arrText, 2) If Not arrText(8, j) = "Duplicate" Then If arrText(3, k) = arrText(3, j) Then arrText(8, j) = "Duplicate" bolUnique = False End If End If Next If bolUnique Then arrText(8, k) = "Unique" Else arrText(8, k) = "Duplicated" Debug.Print "arrText(3, " & k & ")[" & arrText(3, k) & "] is " & arrText(8, k) End If Next If IsEmpty(arrText(8, UBound(arrText, 2))) Then arrText(8, UBound(arrText, 2)) = "Unique"End Sub
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thank you for your suggestion and for your effort to give an example implementation.
I also had similar ideas previously, but for a number of reasons didn't follow in the path you suggest. A good diff algorithm is hard to get right. Matching unique lines is also a part of the algorithm in use (as a starting point). But the algorithm must also work well if no lines are unique. KDiff3 also explicitely ignores indentation because especially for code this changes very often, but the lines should still be aligned.
The path I'd like to choose (if I had the time) would be to improve the diff results by also taking the character differences between nearby lines into account.
(But please understand that because it took many versions to get the current implementation working reliable, I am very reluctant/careful to change anything in that part of KDiff3.)
I don't quite understand what would be gained by specifying your own EOL-character. Could you please give an example?
Best regards,
Joachim
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Firstly, thanks so much for taking the time to reply. I really appreciate your response.
Regarding the ability to 'choose an EOL character/character sequence', it's a low priority, but to try and illustrate why I thought this would be a useful function, I think it could be useful when comparing binary files, and the purpose of being able to do this is to tell KDiff where/when to wrap lines. I totally understand your point about the difficulty in coming up with a good diff algo. I guess it would help to point out that I don't really even use KDiff3 for comparing code, but rather for comparing configuration files from network systems (such as router/switch/firewall configs) - in terms of indentation, I can say with a high degree of confidence that the indentation in an automatically generated configuration file from one of these types of system is very regular/reliable/dependable, because the text has not been manually typed. I understand too that it is possible to have a text file where there are no unique 'lines'. But what I was hoping to achieve was the ability as the KDiff user to dictate up front whether or not I want the algorithm to give precedence to or take account of indentation/hierarchy, line uniqueness, EOL characters, whole word or line matching over byte/character matching (i.e. if I can match a whole line or word, then prefer that match over finding just the longest character level match).
When you generate a config file from a network type system, there are syntactic rules which (unless there is a bug) are always followed. So I always know the order of the lines and there is a fixed vocabulary, with only parameter values and variable object names changing (such as MAC/IP addresses, hostnames, FQDNs, policy names, ACL entry comments, etc.). The syntax of the lines in which these variable names are found, however, is always the same (follows a rule).
Also, many network operating systems (switches, routers, firewalls etc.) use indentation in the config file output to indicate the parent-child relationship of each configuration object/parameter.
E.g. 1
filterlog101createdescription"Known Routing Protocols"destinationmemory100exitlog102createdescription"Known OAM"destinationmemory100exitlog103createdescription"NTP and Syslog"destinationmemory100exitlog104createdescription"SSH, Radius and SNMP"destinationsyslog3exitlog105createdescription"Denied Traffic"destinationsyslog3exitlog106createdescription"cpm-filter test - allow all"destinationmemory1024exitexit
These are just some sanitised snippets of configuration output from a couple of different systems to show you what I mean.
For many of these types of configuration files, there is a common pattern of repetition for one or more of the 'branches' of the tree. I.e. on a router, for example, there will be n-times an IP interface, with all the interface specific config lines/settings, which are shown by way of indentation to be child objects of the interface object in the hierarchy. On a firewall, for example, there are n-times policies/rules, and for each rule there may be objects such as source, destination, service, QoS settings, log settings, etc., all shown 'by-way-of-indentation' as children of the rule object.
This would enable the user to 'give' KDiff some awareness of the hierarchy if and when it exists in a file so that it can better match whole 'branches' or blocks of lines when that is a more useful result. In short, I guess I'm saying that a 'longest match' algorithm is not necessarily a best match algorithm, and this would usually be true for auto-generated text files where there are syntax rules and hierarchy/indentation.
Does that help to explain better what I am getting at?
Best Regards,
Steve.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
My implementation is far from complete...as I am not a programmer. But I hope it serves well to show how I determine the uniqueness of each line, how I ignore 'comment' style trailing strings, and how I determine indentation level for each line to try and find where in the tree each line sits.
Cheers,
Steve.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
"A good diff algorithm is hard to get right" - I definitely agree...but I would change this to read: "A good diff algorithm is one that is built to suit the files being compared".
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
One more thing I forgot to mention before was to have the ability to tell KDiff which line number to start the comparison from in each source file. Sometimes there are preliminary lines in a config output file that I want to ignore.
Also, what do you think of the idea of having a file which contains meta information for a given comparison operation? It would contain the file names, the 'start-line' for each file, flags telling KDiff whether or not to regard indentation, line uniqueness, match precedence (i.e. match lines, then words, then characters - as mentioned in my previous post) etc. It could also have a comma separated list of matching lines between the files as a guide.
What do you think?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hi,
I've also emailed Joachim directly, but I thought it would be good to open this up for input from all interested users.
Alignment between compared files is still a problem, and none of the diff tools I have looked at (a few, free and paid) still don't do a great job at aligning text properly. This is because the 'shortest difference' algorithm is actually not always the best solution, Why? Because it has no concept of word boundaries and correct syntax.
Two very common features found in most text files which are subject to difference checking are:
If KDiff were to parse each file before comparing them and identify every unique line with some flag (in a temporary array variable or similar), and also identify the branch hierarchy using either indentation (spaces/tabs) and/or brackets (or both!), it could then give precedence to the branch level before returning a character match. It could also parse each input file and build a temp array of all 'words' found (printable strings separated by whitespace or any other separator the user wants to use), and then give precedence to matching whole words where possible over just matching characters. It should also have an option to either ignore or respect end of line characters/sequences (LF, CR, CRLF, etc.). It would be great if the user could choose what character/hex strings define an 'EndOfLine' for a given file or compare operation, but by default autodetect Unix or DOS EOL sequences.
I confess I am not a programmer, but I am not after kudos, I am after time saving code and results that will reduce how much eyeballing I have to do to compare 2 or more files. So here is some initial Excel VBA code I have been playing with to start the ball rolling (please forgive me for not speaking C[++|#] (a real language ;)) and being able to directly contribute to the KDiff3 source...one day maybe):
(sorry about the formatting, I couldn't get correct syntax highlighting to work with this)
Hi Stephen,
Thank you for your suggestion and for your effort to give an example implementation.
I also had similar ideas previously, but for a number of reasons didn't follow in the path you suggest. A good diff algorithm is hard to get right. Matching unique lines is also a part of the algorithm in use (as a starting point). But the algorithm must also work well if no lines are unique. KDiff3 also explicitely ignores indentation because especially for code this changes very often, but the lines should still be aligned.
The path I'd like to choose (if I had the time) would be to improve the diff results by also taking the character differences between nearby lines into account.
(But please understand that because it took many versions to get the current implementation working reliable, I am very reluctant/careful to change anything in that part of KDiff3.)
I don't quite understand what would be gained by specifying your own EOL-character. Could you please give an example?
Best regards,
Joachim
Hi Joachim,
Firstly, thanks so much for taking the time to reply. I really appreciate your response.
Regarding the ability to 'choose an EOL character/character sequence', it's a low priority, but to try and illustrate why I thought this would be a useful function, I think it could be useful when comparing binary files, and the purpose of being able to do this is to tell KDiff where/when to wrap lines. I totally understand your point about the difficulty in coming up with a good diff algo. I guess it would help to point out that I don't really even use KDiff3 for comparing code, but rather for comparing configuration files from network systems (such as router/switch/firewall configs) - in terms of indentation, I can say with a high degree of confidence that the indentation in an automatically generated configuration file from one of these types of system is very regular/reliable/dependable, because the text has not been manually typed. I understand too that it is possible to have a text file where there are no unique 'lines'. But what I was hoping to achieve was the ability as the KDiff user to dictate up front whether or not I want the algorithm to give precedence to or take account of indentation/hierarchy, line uniqueness, EOL characters, whole word or line matching over byte/character matching (i.e. if I can match a whole line or word, then prefer that match over finding just the longest character level match).
When you generate a config file from a network type system, there are syntactic rules which (unless there is a bug) are always followed. So I always know the order of the lines and there is a fixed vocabulary, with only parameter values and variable object names changing (such as MAC/IP addresses, hostnames, FQDNs, policy names, ACL entry comments, etc.). The syntax of the lines in which these variable names are found, however, is always the same (follows a rule).
Also, many network operating systems (switches, routers, firewalls etc.) use indentation in the config file output to indicate the parent-child relationship of each configuration object/parameter.
E.g. 1
E.g. 2
E.g. 3
These are just some sanitised snippets of configuration output from a couple of different systems to show you what I mean.
For many of these types of configuration files, there is a common pattern of repetition for one or more of the 'branches' of the tree. I.e. on a router, for example, there will be n-times an IP interface, with all the interface specific config lines/settings, which are shown by way of indentation to be child objects of the interface object in the hierarchy. On a firewall, for example, there are n-times policies/rules, and for each rule there may be objects such as source, destination, service, QoS settings, log settings, etc., all shown 'by-way-of-indentation' as children of the rule object.
This would enable the user to 'give' KDiff some awareness of the hierarchy if and when it exists in a file so that it can better match whole 'branches' or blocks of lines when that is a more useful result. In short, I guess I'm saying that a 'longest match' algorithm is not necessarily a best match algorithm, and this would usually be true for auto-generated text files where there are syntax rules and hierarchy/indentation.
Does that help to explain better what I am getting at?
Best Regards,
Steve.
BTW, Joachim,
My implementation is far from complete...as I am not a programmer. But I hope it serves well to show how I determine the uniqueness of each line, how I ignore 'comment' style trailing strings, and how I determine indentation level for each line to try and find where in the tree each line sits.
Cheers,
Steve.
"A good diff algorithm is hard to get right" - I definitely agree...but I would change this to read: "A good diff algorithm is one that is built to suit the files being compared".
One more thing I forgot to mention before was to have the ability to tell KDiff which line number to start the comparison from in each source file. Sometimes there are preliminary lines in a config output file that I want to ignore.
Also, what do you think of the idea of having a file which contains meta information for a given comparison operation? It would contain the file names, the 'start-line' for each file, flags telling KDiff whether or not to regard indentation, line uniqueness, match precedence (i.e. match lines, then words, then characters - as mentioned in my previous post) etc. It could also have a comma separated list of matching lines between the files as a guide.
What do you think?