
#310 Identify problem rows on import

Milestone: Backlog
Status: open
Owner: nobody
Labels: CSV import
Priority: 6normal
Updated: 2025-06-25
Created: 2023-01-30
Creator: Steve Keen
Private: No

This data file is useful since it contains data entry errors that stop Ravel from importing (when the "throw exception" choice is made). The reason is, for example, that the firm Raytheon has two entries for 2020--one must obviously be a mistake.

It would be helpful to be able to identify such rows and let the user decide what to do--delete them, choose one over the other, etc.
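
As a rough illustration of the request, the sketch below reports every key that occurs more than once, together with the source line numbers of the clashing rows. It is not Ravel's implementation. The column names `firm`, `year` and `revenue`, the helper `find_duplicate_rows`, and the sample values are all hypothetical and are not taken from the attached file.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_rows(csv_text, key_columns):
    """Map each key that occurs more than once to its (line number, row) pairs."""
    seen = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    # This numbering assumes no quoted field spans several lines.
    for line_no, row in enumerate(reader, start=2):
        key = tuple(row[c] for c in key_columns)
        seen[key].append((line_no, row))
    return {k: v for k, v in seen.items() if len(v) > 1}


# Made-up sample values for illustration only.
SAMPLE = """firm,year,revenue
Raytheon,2019,100
Raytheon,2020,110
Raytheon,2020,95
ITT,2020,40
"""

duplicates = find_duplicate_rows(SAMPLE, ["firm", "year"])
for key, rows in duplicates.items():
    lines = ", ".join(str(n) for n, _ in rows)
    print(f"{key}: lines {lines}")
```

Keeping the line numbers alongside the rows is what would let a user interface point at the exact entries to delete or edit.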

2 Attachments

Discussion

  • High Performance Coder

    That was the idea of the "error report" functionality.

    If I look at the "error report", I don't think it is as simple as saying choose one entry over the other. How would you choose, anyway?

    HP is showing two entries for 2017-2020. I happen to know that HP split into two companies in late 2015 - HP Enterprise and HP Inc. I suspect these two different companies received the same name in this database. But that doesn't explain the other entries. Raytheon is still just the one company, as is ITT in the 50s and 60s, as far as I know (it was split in 1995, and then again in 2011). And what happened to Fairchild in 1989? That last one might be a mistake.

     
  • High Performance Coder

    I loaded the files by taking the average of duplicate values (maybe sum is better, if these truly correspond to corporate breakups), and plotted revenue vs rank.
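
A minimal sketch of what loading with a duplicate-handling policy could look like, comparing the average and the sum mentioned above. This is not how Ravel does it internally. The function `load_with_policy`, the column names and the sample values are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def load_with_policy(csv_text, key_columns, value_column, policy):
    """Collapse rows that share a key into one value using `policy`."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row[c] for c in key_columns)
        groups[key].append(float(row[value_column]))
    # Keys with a single row pass through unchanged under mean or sum.
    return {key: policy(values) for key, values in groups.items()}


# Made-up sample values for illustration only.
SAMPLE = """firm,year,revenue
HP,2017,50
HP,2017,30
ITT,2017,40
"""

by_mean = load_with_policy(SAMPLE, ["firm", "year"], "revenue", mean)
by_sum = load_with_policy(SAMPLE, ["firm", "year"], "revenue", sum)
print(by_mean[("HP", "2017")], by_sum[("HP", "2017")])
```

Because the policy is just a callable that takes a list of numbers, another rule can be swapped in without changing the loader.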

     
  • High Performance Coder

    But only one year works :(

     
  • High Performance Coder

    Need to find out why.

     
  • Steve Keen

    Steve Keen - 2023-02-02

    Yeah, a lot of the time it is just going to be sloppy data entry that causes hassles like these. Since so many such files are created in spreadsheets rather than databases, errors like this aren't picked up on entry--and it's obvious that no-one has attempted to clean this data for decades!

    So an automated process might work (such as taking the Max here, which I would do in this instance on the assumption that the biggest market size is the correct one), but it would be better still to show the offending lines to the user in an edit window, where changes could be made (delete rows, edit names, etc.) and then written back to the source file (or a choice could be made like my assumption, to just load the Max value in those cases, or the average, etc.).

    This is more elegant than giving the user another CSV file to manually scan in Excel, especially in cases where Excel can't load all the file anyway.
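
The "write back to the source" step could be sketched roughly as follows, assuming the user's choice arrives as a set of line numbers to delete. The helper `drop_lines` and the sample values are hypothetical, and the sketch works on in-memory text rather than touching a real file.

```python
import csv
import io


def drop_lines(csv_text, lines_to_drop):
    """Return CSV text without the rows at the given 1-based line numbers."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # The header is line 1; numbering assumes one row per physical line.
    for line_no, row in enumerate(reader, start=1):
        if line_no not in lines_to_drop:
            writer.writerow(row)
    return out.getvalue()


# Made-up sample values for illustration only.
SAMPLE = "firm,year,revenue\nRaytheon,2020,110\nRaytheon,2020,95\n"

# Suppose the user chose to delete the second Raytheon entry (line 3).
cleaned = drop_lines(SAMPLE, {3})
print(cleaned)
```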

     
  • High Performance Coder

    • labels: --> CSV import
     
  • High Performance Coder

    • Milestone: Pascal --> Backlog
     
  • High Performance Coder

    Ticket moved from /p/minsky/ravel/315/

    Can't be converted:

    • _priority: 6normal
     
  • High Performance Coder

    Ticket moved from /p/minsky/tickets/1826/

    Can't be converted:

    • _priority: 6normal
     
