<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en" xmlns="http://www.w3.org/2005/Atom"><title>Recent changes to bugs</title><link href="https://sourceforge.net/p/python-ngram/bugs/" rel="alternate"/><link href="https://sourceforge.net/p/python-ngram/bugs/feed.atom" rel="self"/><id>https://sourceforge.net/p/python-ngram/bugs/</id><updated>2009-06-11T14:46:11Z</updated><subtitle>Recent changes to bugs</subtitle><entry><title>Implement Python 3.0 compliance</title><link href="https://sourceforge.net/p/python-ngram/bugs/4/" rel="alternate"/><published>2009-06-11T14:46:11Z</published><updated>2009-06-11T14:46:11Z</updated><author><name>Anonymous</name><uri>https://sourceforge.net/u/userid-None/</uri></author><id>https://sourceforge.netd20dd275e8c3817e8d56c632caea888d61444f5c</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;This module should become Python 3.0 compliant. A patch was submitted to the patches category.&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Character encoding trouble</title><link href="https://sourceforge.net/p/python-ngram/bugs/3/" rel="alternate"/><published>2007-11-21T14:43:05Z</published><updated>2007-11-21T14:43:05Z</updated><author><name>Michel Albert</name><uri>https://sourceforge.net/u/exhuma/</uri></author><id>https://sourceforge.netaf9c4d514cd4ccfeb7dd0f9dc06ce7b0ed6d8972</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;When a multibyte string is supplied as either "haystack" or "needle", the results are unpredictable:&lt;/p&gt;
&lt;p&gt;&amp;gt;&amp;gt;&amp;gt; ngram.compare('dfsédfsdf', 'dfsédfsdf')&lt;br /&gt;
XXd: 'XXd'&lt;br /&gt;
Xdf: 'Xdf'&lt;br /&gt;
dfs: 'dfs'&lt;br /&gt;
fs�: 'fs\xc3'&lt;br /&gt;
sé: 's\xc3\xa9'&lt;br /&gt;
éd: '\xc3\xa9d'&lt;br /&gt;
�df: '\xa9df'&lt;br /&gt;
dfs: 'dfs'&lt;br /&gt;
fsd: 'fsd'&lt;br /&gt;
sdf: 'sdf'&lt;br /&gt;
dfX: 'dfX'&lt;br /&gt;
fXX: 'fXX'&lt;br /&gt;
1.0&lt;/p&gt;
&lt;p&gt;Note that the trigrams in the middle are not trigrams at all, but di-grams, because the multibyte character is recognized as two characters. In this case the end result is still correct, since everything gets treated the same way. However, when the strings are supplied as unicode objects, it all works as expected:&lt;/p&gt;
&lt;p&gt;&amp;gt;&amp;gt;&amp;gt; ngram.compare(u'dfsédfsdf', u'dfsédfsdf')&lt;br /&gt;
XXd: u'XXd'&lt;br /&gt;
Xdf: u'Xdf'&lt;br /&gt;
dfs: u'dfs'&lt;br /&gt;
fsé: u'fs\xe9'&lt;br /&gt;
séd: u's\xe9d'&lt;br /&gt;
édf: u'\xe9df'&lt;br /&gt;
dfs: u'dfs'&lt;br /&gt;
fsd: u'fsd'&lt;br /&gt;
sdf: u'sdf'&lt;br /&gt;
dfX: u'dfX'&lt;br /&gt;
fXX: u'fXX'&lt;br /&gt;
1.0&lt;/p&gt;
&lt;p&gt;For this reason, I will add a type check that only allows unicode objects to be passed down into the module. This may also reveal possible encoding trouble, because the unicode conversion will most likely fail in that case. So the module will not accept data that is obviously wrong.&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Wrong results due to padding with "X"</title><link href="https://sourceforge.net/p/python-ngram/bugs/2/" rel="alternate"/><published>2007-11-21T14:11:16Z</published><updated>2007-11-21T14:11:16Z</updated><author><name>Michel Albert</name><uri>https://sourceforge.net/u/exhuma/</uri></author><id>https://sourceforge.net41447c70cef250ea5c9ce528cf91ff66a4123e9b</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;Internally the algorithm pads the supplied strings with "X" characters. This causes problems when the string itself begins or ends with an X. The similarity score will be smaller than expected as one trigram "disappears".&lt;/p&gt;
&lt;p&gt;A solution is in the works ;)&lt;/p&gt;&lt;/div&gt;</summary></entry><entry><title>Sources not available</title><link href="https://sourceforge.net/p/python-ngram/bugs/1/" rel="alternate"/><published>2006-07-04T20:53:33Z</published><updated>2006-07-04T20:53:33Z</updated><author><name>Anonymous</name><uri>https://sourceforge.net/u/userid-None/</uri></author><id>https://sourceforge.net76d309fe8caedec3ba558132937aa357064fb7e1</id><summary type="html">&lt;div class="markdown_content"&gt;&lt;p&gt;The source package is empty!&lt;/p&gt;&lt;/div&gt;</summary></entry></feed>