#286 Support detecting change of post-processed data

Group: Invalid
Status: pending
Owner: None
Labels: Feature Request
Priority: 7
Updated: 2025-02-01
Created: 2025-01-19
Private: No

Sometimes WebChangeMonitor's (WCM's) integrated tools aren't sufficient to narrow down a URI's response to the actual desired data. Fortunately, WCM supports post-processing, during which the user can apply every available tool (or write their own) to reduce the response data to contain only the actual desired data.

At this point, the desired behaviour is for WCM to compare the previously obtained post-processing output with the newly obtained post-processing output. Unfortunately, unless I'm not seeing something, WCM always compares the PRE-processed data. In many cases, this results in WCM indicating that the desired data has changed, when in fact, it has not. (WCM is technically correct, as the pre-processed data has, in fact, changed, but this is not the data that matters to the user; what matters to the user is the post-processed data.)

Would it be possible to add a per-item option to allow WCM to compare the post-processed data, instead of the pre-processed data?
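
To make the request concrete, here is a minimal sketch of the two comparison modes. It is not WCM code: `post_process`, `has_changed`, and the two sample responses are hypothetical stand-ins, and SHA-256 is just one convenient way to compare two payloads.

```python
import hashlib
import json


def digest(text: str) -> str:
    """Return a stable fingerprint of a text payload."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def post_process(raw: str) -> str:
    """Hypothetical post-processor: keep only the field the user cares about."""
    data = json.loads(raw)
    return json.dumps({"count": data["count"]}, sort_keys=True)


def has_changed(old_raw: str, new_raw: str, compare_post_processed: bool) -> bool:
    """Compare raw responses, or their post-processed forms when the option is set."""
    if compare_post_processed:
        return digest(post_process(old_raw)) != digest(post_process(new_raw))
    return digest(old_raw) != digest(new_raw)


# Two made-up responses that differ only in a field the user does not care about.
old = '{"count": 42, "generated_at": "2025-01-18T10:00:00Z"}'
new = '{"count": 42, "generated_at": "2025-01-19T10:00:00Z"}'

print(has_changed(old, new, compare_post_processed=False))  # True
print(has_changed(old, new, compare_post_processed=True))   # False
```

With the per-item option off, the timestamp alone triggers a change; with it on, the item stays unchanged because the extracted data is identical.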

Discussion

  • Morten MacFly

    Morten MacFly - 2025-01-26
    • status: open --> pending
    • Group: Future_Release --> Invalid
     
  • Morten MacFly

    Morten MacFly - 2025-01-26

    For clarification: A manual post-processor would operate on the file created by WCM and modify this file. So you are asking to read the file after the post-processor is run and re-evaluate the diff, right?

    So what you envision is the following workflow:

    WCM -> writes temp. content file
    WCM -> starts post-processor "PP"
    PP -> does "magic" and modifies the temp. content file
    WCM -> gets notified (how?)
    WCM -> reads temp. content file again
    WCM -> calculates changes and sets state of item accordingly

    The problem here is the notification part: currently, all post-processing tools are run asynchronously so that WCM does not hang (possibly forever) waiting for a post-processing tool to finish. (Note that a thread is no solution here.) In my view, part of the post-processing should also be to inform the user (e.g. via email) that a change took place.
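
    One possible way to bound the wait, sketched in Python rather than in WCM's own code: treat the exit of the post-processor as the notification, and give up after a timeout. Everything here is an assumption made for illustration, including the names `run_and_compare` and `PP`, the 30-second default, and the use of a file hash to decide whether the item changed. Whether this fits WCM's architecture is not something this thread settles.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def file_digest(path: Path) -> str:
    """Fingerprint the current content of the temp. content file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_and_compare(content_file: Path, command: List[str],
                    previous_digest: Optional[str],
                    timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Run the post-processor, wait at most timeout_s, then re-read the file.

    Returns (changed, new_digest). subprocess.run raises TimeoutExpired if the
    tool does not exit in time, so the caller never waits forever.
    """
    subprocess.run(command + [str(content_file)], check=True, timeout=timeout_s)
    new_digest = file_digest(content_file)
    return new_digest != previous_digest, new_digest


# Hypothetical post-processor "PP": strips whitespace and upper-cases the file.
PP = [sys.executable, "-c",
      "import sys, pathlib; p = pathlib.Path(sys.argv[1]); "
      "p.write_text(p.read_text().strip().upper())"]

with tempfile.TemporaryDirectory() as tmp:
    content = Path(tmp) / "item.txt"
    content.write_text("hello\n")
    changed, first = run_and_compare(content, PP, previous_digest=None)
    # New raw content differs, but the post-processed result is the same.
    content.write_text("Hello  \n")
    changed_again, second = run_and_compare(content, PP, previous_digest=first)

print(changed, changed_again)
```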

    Before going in the direction above, I would rather think about removing the limitation that makes you call a post-processing tool in the first place. In my experience, this is only needed if you want to interpret the content, and WCM meanwhile has quite powerful tools for that. The item's state is then calculated on the interpreted content, which would do what you want.

    Do you have an example of what you cannot do and why you need a post-processor tool (which one) for that?

    PS: I have another idea, by the way, for solving this that involves IPC... but let's start easy.

     
  • Gitoffthelawn

    Gitoffthelawn - 2025-01-27

    Yes, everything you wrote is accurate and correct. I really like how you want to think of a more elegant solution, as that matches my desire as well. So let's start there!

    Approximately 90% of my post-processing routines call jq. If you're not using it, jq is like sed, but for JSON. If you're not familiar with sed, we can't be friends any longer. ;) I use jq to collect the needed data (and only the needed data) from JSON files. If WCM were to somehow include jq functionality, those post-processing routines would no longer be needed.

    Most of my other post-processing routines are to handle line-breaks. I still haven't been able to get WCM to add carriage returns (ASCII 0x0d) or linefeeds (ASCII 0x0a) to its output. After much time invested in trying to get it to work, I wrote post-processing scripts to perform this task. I think this should be easy to implement in WCM, and IIRC I created an issue report for it a while back, but a quick search of issue reports isn't popping it up.

     
  • Morten MacFly

    Morten MacFly - 2025-01-29

    OK, I can guess what you have in mind (and I know sed, btw. ;-) ), but I'm afraid I need more info to be sure what to do.

    I guess it would be best if you provide me with:
    1.) A source JSON file you want to process (which can be anonymized)
    2.) The jq command(s) you run on that file
    3.) The expected output file

    In particular, please point out why and where you want to add (?) line-feeds or similar. Wouldn't you usually want to remove these?

    I am using jsoncons ( https://github.com/danielaparker/jsoncons ) for the JSON stuff and am probably using just 5% of its functionality, so chances are high that it would cover the required routines, too.

     

    Last edit: Morten MacFly 2025-10-26
    • Gitoffthelawn

      Gitoffthelawn - 2025-02-01

      Here's a good example. It uses Mozilla's public API to get the first 50 recommended Firefox extensions. The API returns a massive amount of data, but all I want is the total number of recommended extensions, the number of pages, the number of results on the current page, and the name of each extension returned along with its i18n code.

      The URI:
      https://addons.mozilla.org/api/v5/addons/search/?app=firefox&promoted=recommended&sort=created&type=extension&page=1&page_size=50&lang=en

      The WCM item relies on 2 regex replace rules that process the JSON within WCM before passing the data to jq:

      Regex replace rule #1
      find: "authors":.*?"last_updated":"[^"]*",
      replace: [null] (blank)

      Regex replace rule #2
      find: ,"previews":.*?"_score":.*?}
      replace: }
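
      As a rough illustration of what the two rules do, here is a small Python sketch. It assumes the wildcards in the patterns are the lazy `.*?` and `[^"]*` forms, and the sample JSON is made up (field order and values are not taken from the real API response). WCM's own regex engine may differ from Python's `re` in details.

```python
import json
import re

# Rule #1: drop everything from "authors" through the "last_updated" value.
RULE_1 = re.compile(r'"authors":.*?"last_updated":"[^"]*",')
# Rule #2: drop everything from "previews" through "_score", keeping the closing brace.
RULE_2 = re.compile(r',"previews":.*?"_score":.*?}')


def apply_rules(text: str) -> str:
    """Apply both replace rules in order, as the WCM item is described to do."""
    text = RULE_1.sub("", text)
    return RULE_2.sub("}", text)


# Made-up, single-line sample with one result object.
sample = ('{"count":1,"results":[{"id":1,"authors":[{"id":9,"name":"A"}],"guid":"x",'
          '"last_updated":"2025-01-01T00:00:00Z","name":{"en-US":"Foo"},'
          '"previews":[{"id":3}],"slug":"foo","_score":null}]}')

reduced = apply_rules(sample)
print(reduced)  # {"count":1,"results":[{"id":1,"name":{"en-US":"Foo"}}]}
```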

      The WCM item then calls this post-processing script (this is a Windows batch file; I wrote a similar one for Linux):

      jq.exe -cr "\"\(.count) total extensions on \(.page_count) pages, \(.results ^| length) on this page\",.results[].name" "C:\example\%1" > "C:\example\temp.txt"
      move /y "C:\example\temp.txt" "C:\example\%1"
      

      To keep things simple, I'm just using C:\example for the WCM content folder in this code. You can replace it with whatever you like.
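
      For anyone who does not read jq, the filter above corresponds roughly to the following Python sketch. The function name `summarize` and the sample data are hypothetical; the sketch only assumes the response has `count`, `page_count`, and a `results` list whose entries carry a `name`.

```python
import json


def summarize(response: dict) -> str:
    """Build the same lines the jq filter prints: one summary line, then each name."""
    results = response["results"]
    lines = [f'{response["count"]} total extensions on {response["page_count"]} pages, '
             f'{len(results)} on this page']
    for item in results:
        name = item["name"]
        # jq -r prints plain strings raw; -c prints objects as compact JSON.
        if isinstance(name, str):
            lines.append(name)
        else:
            lines.append(json.dumps(name, separators=(",", ":"), ensure_ascii=False))
    return "\n".join(lines)


# Made-up response with the shape described above.
sample = {
    "count": 2,
    "page_count": 1,
    "results": [
        {"name": {"en-US": "Foo"}},
        {"name": {"de": "Bar", "en-US": "Baz"}},
    ],
}

print(summarize(sample))
```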


      I'm hesitant to mention the following, because I think focusing on the above is the most important. At the same time, providing a bigger picture may help with ideas/implementation:

      Because there are multiple pages for the above results (and similar situations with other APIs), ideally WCM would request each page and concatenate the pages. Because WCM cannot currently do this, I create separate WCM items for each API call. Thus, if results span 10 pages, which requires 10 separate API calls, the WCM job currently requires 10 separate WCM items. It would be great to have a single WCM item to handle all those repetitive API calls, simply incrementing a variable within the URI for each call.
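
      The "increment a variable within the URI" idea can be sketched like this. It is an illustration in Python, not a proposal for WCM's internals: `page_url`, `collect_all`, and `fake_fetch` are hypothetical names, and the sketch assumes every response carries `page_count` and `results` as in the example above. The fetch function is passed in so the logic can be tried without network access.

```python
from typing import Callable, Dict, List
from urllib.parse import urlencode

BASE = "https://addons.mozilla.org/api/v5/addons/search/"


def page_url(page: int, page_size: int = 50) -> str:
    """Build the search URI for one page; only the page number varies per call."""
    params = {"app": "firefox", "promoted": "recommended", "sort": "created",
              "type": "extension", "page": page, "page_size": page_size, "lang": "en"}
    return f"{BASE}?{urlencode(params)}"


def collect_all(fetch_page: Callable[[int], Dict]) -> List[Dict]:
    """Fetch page 1, read page_count from it, then append the remaining pages."""
    first = fetch_page(1)
    results = list(first["results"])
    for page in range(2, first["page_count"] + 1):
        results.extend(fetch_page(page)["results"])
    return results


def fake_fetch(page: int) -> Dict:
    """Stand-in for a real HTTP request: two made-up pages of results."""
    pages = {1: ["a", "b"], 2: ["c"]}
    return {"page_count": 2, "results": [{"slug": s} for s in pages[page]]}


print(page_url(1))
print(collect_all(fake_fetch))  # [{'slug': 'a'}, {'slug': 'b'}, {'slug': 'c'}]
```

      A real fetcher would request `page_url(n)` and parse the JSON body; the concatenated list could then be reduced and compared as a single item.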

      I wrote a complex script that, within WCM, will concatenate a bunch of discrete WCM data files. It works by taking advantage of WCM's post-processing feature and concatenating all data files in a group when any one is changed. Because the script will have no knowledge of what is going on outside of itself, I implemented all sorts of hackery to make it work even when multiple data files are updated by WCM in succession.

       
