[GH-ISSUE #633] Remove tracking params from URL before hoarding it #404

Open
opened 2026-03-02 11:49:34 +03:00 by kerem · 5 comments

Originally created by @raviwarrier on GitHub (Nov 10, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/633

Describe the feature you'd like

Most of the time, when you hoard something from a website or an app, the link comes with tracking params (utm, fb, ms, etc.). It would be nice to be able to save a clean URL.

I had created a URL-cleaning workflow on n8n for myself that cleans a copied URL, re-copies the clean URL, and then sends it to my other devices (laptop to phone, phone to laptop).

Here's the code that I wrote in JS for the n8n workflow, if it helps:

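// Define the tracking parameters as a Set for efficient lookup.
// Note: generic names such as 'id', 'key', 'tag', and 'language' often carry
// functional data, so removing them can break some links.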
const trackingParams = new Set([
    'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
    'utm_id', 'utm_cid', 'utm_reader', 'utm_name', 'utm_social', 'utm_placement',
    'gclid', 'dclid', 'gclsrc', 'wbraid', 'gbraid', '_gl', 'fbclid', 
    'fb_action_ids', 'fb_action_types', 'fb_source', 'fb_ref', 'fb_comment_id',
    'msclkid', 'ocid', 'wt.mc_id', 'cvid', 'WT.mc_id', 'wt.mc_ev', '_hsenc',
    '_hsmi', 'mc_cid', 'mc_eid', '_ga', '_ke', 'ir_campaign_id',
    'ir_ad_id', 'cid', 'eid', '_bta_tid', '_bta_c', 'trk_contact', 'trk_msg',
    'trk_module', 'trk', 'mkt_tok', 'wickedid', 'wickedsource', 'share',
    'share_id', 'share_source', 'cmpid', 'social', 'socid', 'twclid', 'pinid',
    'igshid', 'igref', 'lipi', 'vero_id', 'mc_', 'ref_', 'dm_i', 'epik', 
    'mailid', 'mid', 'spMailingID', 'spReportId', 'spUserID', 'ss_campaign_id',
    'ss_email_id', 'ss_source', 'subscriber', 'tag', 'psc', 'pd_rd_r',
    'pd_rd_w', 'pd_rd_wg', '_encoding', 'linkCode', 'linkId', 'affiliate',
    'affiliate_id', 'affid', 'sourceid', 'source_id', 'source', 'ref', 'referral',
    'referer', 'referrer', 'refid', 'ref_id', 'mpid', 'clickid', 'click_id',
    'adjust_tracker', 'adjust_campaign', 'adjust_adgroup', 'app_id', 'app_name',
    'device_id', 'platform', 'admitad_uid', 'adset', 'adset_name', 'adgroup',
    'ad_id', 'ad_name', 'adposition', 'campaignid', 'placement', 'target',
    'loc_physical_ms', 'loc_interest_ms', 'action', 'campaign_date', 'campaign_id',
    'force_sid', 'geo', 'id', 'key', 'medium', 'notification', 'offer', 'sid',
    'site', 'timestamp', 'tracking', 'user', 'visitor', 'webid', 'wmid', 
    'browser', 'browser_version', 'os', 'os_version', 'viewport', 'resolution',
    'language', 'country', 'region', 'city', 'session_id', 'session', 'user_id', 'h_ad_id',
    'userid', 'visitor_id', 'client_id', 'clientid', 'cust_id', 'custid', 'fbc_id', 'igsh'
]);

/**
 * Function to clean a URL by removing tracking parameters.
 * @param {string} inputUrl - The original URL to be cleaned.
 * @returns {string|null} - The cleaned URL or null if invalid.
 */
function cleanUrl(inputUrl) {
    try {
        // Remove all null characters and control characters from the URL
        const sanitizedUrl = inputUrl.replace(/[\u0000-\u001F\u007F]/g, '').trim();

        // Parse the sanitized URL
        const url = new URL(sanitizedUrl);
        const params = url.searchParams;

        // Collect keys to delete
        const keysToDelete = [];

        for (const key of params.keys()) {
            // Check for exact match
            if (trackingParams.has(key)) {
                keysToDelete.push(key);
            } else {
                // Check for prefix matches (e.g., 'mc_', 'ref_')
                for (const param of trackingParams) {
                    if (param.endsWith('_') && key.startsWith(param)) {
                        keysToDelete.push(key);
                        break; // No need to check other prefixes
                    }
                }
            }
        }

        // Remove the identified tracking parameters
        keysToDelete.forEach(key => params.delete(key));

        // Serialize the cleaned URL. Deleting through searchParams updates the
        // URL object itself, so toString() drops the '?' when no parameters
        // remain and keeps any port, credentials, and hash intact.
        return url.toString();
    } catch (error) {
        // Log the error for debugging purposes
        console.error('Error cleaning URL:', error);
        return null;
    }
}

// Access the incoming data from the webhook
const items = $input.all(); // Get all incoming items

// Process each incoming item
return items.map(item => {
    // Extract the URL from the webhook payload; optional chaining avoids a
    // TypeError when the payload has no body
    const inputUrl = item.json?.body?.body || '';

    if (!inputUrl) {
        // Handle cases where the URL is not found
        return {
            json: {
                error: 'URL not found in the webhook payload.'
            }
        };
    }

    // Clean the input URL (cleanUrl sanitizes it internally)
    const cleanedUrl = cleanUrl(inputUrl);

    if (cleanedUrl) {
        // Return the cleaned URL
        return {
            json: {
                cleanedUrl
            }
        };
    } else {
        // Handle invalid URL scenarios
        return {
            json: {
                error: 'Invalid URL provided.'
            }
        };
    }
});

These were all the tracking params I could find, but I am sure there are more. Maybe, if and when you incorporate this functionality, you could also use AI to check whether the cleaned-up URL still has any tracking params remaining.
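Since the list is bound to grow, here is a minimal sketch of one way to keep it easier to extend: exact names and prefixes live in separate collections, so adding a new prefix does not mean scanning the whole set. The names `EXACT_PARAMS`, `PREFIXES`, `isTrackingParam`, and `stripTracking` are hypothetical, and the lists are shortened for illustration; this is not how Hoarder does or would implement it.

```javascript
// Hypothetical, shortened lists for illustration only.
const EXACT_PARAMS = new Set(['gclid', 'fbclid', 'msclkid', 'igshid', 'mkt_tok']);
const PREFIXES = ['utm_', 'mc_', 'ref_'];

// Returns true when a query key looks like a tracking parameter.
function isTrackingParam(key) {
    return EXACT_PARAMS.has(key) || PREFIXES.some(prefix => key.startsWith(prefix));
}

// Removes tracking parameters and returns the serialized URL,
// or null if the input cannot be parsed as a URL.
function stripTracking(inputUrl) {
    let url;
    try {
        url = new URL(inputUrl.trim());
    } catch {
        return null;
    }
    // Copy the keys first so deletion does not disturb iteration.
    for (const key of [...url.searchParams.keys()]) {
        if (isTrackingParam(key)) url.searchParams.delete(key);
    }
    return url.toString();
}

console.log(stripTracking('https://example.com/post?utm_source=newsletter&page=2#top'));
// prints "https://example.com/post?page=2#top"
```

Keeping prefixes in their own array also makes it harder to add an overly generic exact name by accident when the intent was a prefix.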

Describe the benefits this would bring to existing Hoarder users

It would give users clean URLs that do not start tracking them as soon as they open the links.

Can the goal of this request already be achieved via other means?

Not really. People could manually clean up links or use automation workflows like I do, but that's not possible when you are sharing directly to Hoarder (either via the extension or the mobile app).

Have you searched for an existing open/closed issue?

  • I have searched for existing issues and none cover my fundamental request

Additional context

No response


@raviwarrier commented on GitHub (Nov 10, 2024):

Also, this could be a toggleable setting: "Clean all URLs before saving?" yes/no, and "Ask to clean before saving?" yes/no.

The reason for "ask to clean" is that some links (such as those to web apps) may break without certain tracking params, so a person can choose not to clean a specific URL even when "clean all URLs..." is set to 'yes'.
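To make the interaction between the two toggles concrete, here is a rough sketch. Everything in it is an assumption for illustration: the setting names `cleanAllUrls` and `askBeforeCleaning`, the `chooseUrlToSave` function, and the `clean` and `confirm` callbacks are hypothetical and do not correspond to existing Hoarder options.

```javascript
// Hypothetical settings object; the field names are assumptions.
const defaultSettings = { cleanAllUrls: true, askBeforeCleaning: false };

// Decides which URL to save. `clean` is any function that strips tracking
// parameters, and `confirm` is a callback that asks the user whether to
// accept the cleaned URL (both are supplied by the caller).
function chooseUrlToSave(originalUrl, settings, clean, confirm) {
    if (!settings.cleanAllUrls) return originalUrl;
    const cleaned = clean(originalUrl);
    // Nothing to do if cleaning failed or changed nothing.
    if (!cleaned || cleaned === originalUrl) return originalUrl;
    if (settings.askBeforeCleaning && !confirm(originalUrl, cleaned)) {
        return originalUrl; // the user opted to keep this link untouched
    }
    return cleaned;
}

// Stand-in cleaner for the demo: drops the whole query string.
const dropQuery = (u) => u.split('?')[0];
// Stand-in prompt that always declines, as a user might for a fragile web-app link.
const alwaysDecline = () => false;

console.log(chooseUrlToSave('https://example.com/a?utm_source=x', defaultSettings, dropQuery, alwaysDecline));
// prints "https://example.com/a" because askBeforeCleaning is off
```

With `askBeforeCleaning` set to true and the same declining callback, the original URL would be saved unchanged, which is the per-link escape hatch described above.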


@ALILEX-1 commented on GitHub (Nov 24, 2024):

Remove unnecessary strings from links and keep only the URL. For example, for "abcd https://XXXXXXXX.com abcd", the expected content is "https://XXXXXXXX.com".
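A minimal sketch of that kind of extraction, assuming the shared text contains at most one http(s) link worth keeping. The function name `extractFirstUrl` and the regular expression are illustrative choices, not anything taken from Hoarder.

```javascript
// Returns the first http(s) URL found in free text, or null if none parses.
function extractFirstUrl(text) {
    // Stop at whitespace, quotes, or angle brackets that commonly surround links.
    const match = text.match(/https?:\/\/[^\s"'<>]+/i);
    if (!match) return null;
    try {
        // Parsing validates the candidate and normalizes its serialization.
        return new URL(match[0]).toString();
    } catch {
        return null;
    }
}

console.log(extractFirstUrl('abcd https://example.com/page abcd'));
// prints "https://example.com/page"
```

Note that the URL parser normalizes its output, so a bare origin such as `https://example.com` comes back with a trailing slash.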


@raviwarrier commented on GitHub (Nov 24, 2024):

more like https://example.com?utm_source=newsletter&param=value turning into https://example.com
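For reference, a quick check of this example with the standard `URL` API, assuming only `utm_source` counts as tracking here. Because `param` is not a tracking key, it would survive the cleanup; it disappears only if it is treated as removable too.

```javascript
// Remove just the tracking key and let the URL object re-serialize itself.
const url = new URL('https://example.com?utm_source=newsletter&param=value');
url.searchParams.delete('utm_source');

const cleaned = url.toString();
console.log(cleaned);
// prints "https://example.com/?param=value"
```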


@ALILEX-1 commented on GitHub (Nov 30, 2024):

> more like https://example.com?utm_source=newsletter&param=value turning into https://example.com

On mobile, you can try this app: https://play.google.com/store/apps/details?id=com.catchingnow.share


@raviwarrier commented on GitHub (Nov 30, 2024):

> > more like https://example.com?utm_source=newsletter&param=value turning into https://example.com
>
> On mobile, you can try this app: https://play.google.com/store/apps/details?id=com.catchingnow.share

Thanks. I have a Tasker profile on my phone and an EventGhost trigger on my laptop to do this. It automatically cleans the URLs in the clipboard, but it doesn't help when I am sharing directly to Hoarder, either on the phone or in my browser. That's why a native URL cleaner within Hoarder would be better: it could clean before it saves.
