
#169 Weigh score by the length of the filename

open
nobody
5
2008-05-12
2008-05-12
No

Hi,

I think the score would be far more meaningful for searches if it were weighted by the length of the filename.

Proposed formula:

QUERY_LENGTH: total length of the query (spaces counted).

FILENAME_LENGTH: total length of the filename (spaces counted).

KEYWORDS_PERCENTAGE: the number of query keywords found in the filename (+tags) divided by the total number of query keywords, as a percentage.

sqrt(): square root, taken because when we add tags, we might not want the length of the filename to become too important - we want to avoid losing well described files, while still stopping keyword hogging.

Score = KEYWORDS_PERCENTAGE * sqrt(QUERY_LENGTH / FILENAME_LENGTH)

If the query is the exact filename, the score is 100 (%); a filename longer than the query always scores lower.

With this, an overly long filename containing all keywords can rank lower than a short one that contains only one keyword.
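The proposed formula could be sketched roughly as follows. This is only an illustration, not code from the client: the function names are made up, keywords are assumed to be separated by whitespace and matched as case-insensitive substrings, tags are ignored, and the score is returned as a fraction (1.0 = 100%).

```python
import math


def keyword_fraction(query: str, filename: str) -> float:
    """Fraction (0..1) of query keywords that occur in the filename.

    Assumption: keywords are whitespace-separated and matched as
    case-insensitive substrings; tags are not considered here.
    """
    keywords = query.lower().split()
    if not keywords:
        return 0.0
    name = filename.lower()
    found = sum(1 for kw in keywords if kw in name)
    return found / len(keywords)


def score(query: str, filename: str) -> float:
    """KEYWORDS_PERCENTAGE * sqrt(QUERY_LENGTH / FILENAME_LENGTH) as a fraction."""
    if not query or not filename:
        return 0.0
    # Lengths include spaces, as in the proposal.
    length_factor = math.sqrt(len(query) / len(filename))
    return keyword_fraction(query, filename) * length_factor


if __name__ == "__main__":
    query = "free jazz live"
    names = (
        "free jazz live",
        "jazz.mp3",
        "free jazz live concert recording 2008.mp3",
    )
    for name in names:
        print(f"{score(query, name) * 100:6.1f}%  {name}")
```

Returning 0.0 for an empty query or filename is a choice made for this sketch to avoid a division by zero; the proposal itself does not say how that case should be handled.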

Corner-cases:

Many short keywords (e.g. 3 letters each):
* 3 keywords (= 11 letters, spaces counted), compared against four files: one 3 letter file (+ 4 letters suffix) containing 1 keyword, a 20 letter file with all 3 keywords (a well described one), a 50 letter file with all 3 keywords (that one goes a bit over the top but is still OK), and a >100 letter file with all 3 keywords (keyword hogging):
* 3+4 (suffix): score: 33% (1 of 3 kwds) * sqrt(11/7) (length) = 41.8%
* 20+4: score: 100% * sqrt(11/24) = 67.7%
* 50+4: score: 100% * sqrt(11/54) = 45.1%
* 100+4: score: 100% * sqrt(11/104) = 32.5%
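The corner-case numbers can be recomputed with a few lines of Python. This is just a checking aid: the helper name and the table layout are made up for illustration, and the >100 letter file is taken as exactly 100 letters plus suffix, as in the calculation above.

```python
import math


def score_from_lengths(found: int, total: int, query_len: int, filename_len: int) -> float:
    """Score as a fraction: (found / total) * sqrt(query_len / filename_len)."""
    return (found / total) * math.sqrt(query_len / filename_len)


QUERY_LEN = 11      # 3 keywords of 3 letters plus 2 spaces
TOTAL_KEYWORDS = 3

# (label, keywords found, filename length including the 4 letter suffix)
CASES = [
    ("3+4", 1, 7),
    ("20+4", 3, 24),
    ("50+4", 3, 54),
    ("100+4", 3, 104),
]


def corner_case_scores():
    """Return (label, score in percent) for every corner case."""
    return [
        (label, 100 * score_from_lengths(found, TOTAL_KEYWORDS, QUERY_LEN, length))
        for label, found, length in CASES
    ]


if __name__ == "__main__":
    for label, percent in corner_case_scores():
        print(f"{label:>6}: {percent:5.1f}%")
```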

Notes:
1) Spaces are counted, because some spammers put keywords far back in the filename (separated by spaces) to get hits on their files.

2) Metadata has to be handled differently. For example, only those metadata fields which contain a query keyword could be counted toward the length of the filename (if there are only a limited number of fields available). We should take care that providing solid metadata gives a higher score than providing only the filename.

3) This favors files with shorter but precise filenames and with fewer but precise tags (as soon as we have metadata).

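4) If sqrt is too expensive, we could get the same ranking by simply squaring the keyword percentage (in range [0,1]) instead, i.e. KEYWORDS_PERCENTAGE^2 * QUERY_LENGTH / FILENAME_LENGTH, which should be a bit cheaper. This is the square of the score above, so the order of the results stays the same and only the displayed values change.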
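One way to read note 4 is to replace K * sqrt(Q / F) by K^2 * Q / F, which is the square of the original score. The sketch below checks, under that reading, that both variants put the corner-case files in the same order. The function names are made up for illustration, and whether sqrt is actually a bottleneck is not measured here.

```python
import math


def score_sqrt(k: float, query_len: int, filename_len: int) -> float:
    """Original proposal: K * sqrt(Q / F), with K in [0, 1]."""
    return k * math.sqrt(query_len / filename_len)


def score_squared(k: float, query_len: int, filename_len: int) -> float:
    """sqrt-free variant: K^2 * Q / F, the square of score_sqrt."""
    return k * k * query_len / filename_len


# (label, keyword fraction K, filename length) taken from the corner cases.
CANDIDATES = [
    ("3+4", 1 / 3, 7),
    ("20+4", 1.0, 24),
    ("50+4", 1.0, 54),
    ("100+4", 1.0, 104),
]


def ranking(score_fn, query_len: int = 11):
    """Labels sorted from best to worst score."""
    ordered = sorted(
        CANDIDATES,
        key=lambda c: score_fn(c[1], query_len, c[2]),
        reverse=True,
    )
    return [label for label, _, _ in ordered]


if __name__ == "__main__":
    print(ranking(score_sqrt))
    print(ranking(score_squared))
```

Since squaring is monotonic for non-negative values, the two orderings agree for any input; the squared values are no longer comparable to the percentages in the corner cases, though.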

5) This is a ranking function for filenames. There are bound to be academic works on ranking filenames for relevance (or work done at Google...), so gathering these might provide interesting insights.

Discussion

  • Arne Babenhauserheide

    • summary: Weigh score by the length of the filename --> Weight score by the length of the filename
     
  • Arne Babenhauserheide

    • summary: Weight score by the length of the filename --> Weigh score by the length of the filename
     