[PR #275] [CLOSED] Draft: Hypothetical modularizing refactor spec #1105

Closed
opened 2026-03-01 14:48:27 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/275
Author: @pirate
Created: 9/25/2019
Status: Closed

Base: master ← Head: v0.5.0


📝 Commits (4)

- d88db7b wip new version
- 8fbb55a fix init error
- 6e86389 Merge pull request #246 from gisforgirard/patch-1
- 21c2098 Merge branch 'master' into v0.5.0

📊 Changes

2 files changed (+28 additions, -1 deletions)


📝 archivebox/core/models.py (+20 -0)
📝 archivebox/core/settings.py (+8 -1)

📄 Description

WIP modular design:


[core]
TIMEOUT=30
URL_BLACKLIST=...

[cli]
IS_TTY
USE_COLOR
SHOW_PROGRESS

[server]
SECRET_KEY=...
HTTP_PORT=8000
HTTP_HOST=...

[server.theme_dark]
ENABLED=True

[dependencies.wget]
BINARY=/root/bin/wget
USER_AGENT=...
TIMEOUT=...

[extractors.wget]
USER_AGENT=
TIMEOUT=...

[crawlers.pocket]
DISABLED=True

...
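The sectioned key/value format sketched above maps directly onto Python's stdlib configparser, including the dotted per-module section names. A minimal loading sketch (merging everything into one dict is my assumption, not part of this PR):

```python
from configparser import ConfigParser

# Hypothetical loader for the modular [section] config format sketched above.
raw = """
[core]
TIMEOUT=30

[server]
HTTP_PORT=8000

[crawlers.pocket]
DISABLED=True
"""

parser = ConfigParser()
parser.read_string(raw)

# Dotted section names like "crawlers.pocket" come through as plain strings,
# so per-module config is just parser["crawlers.pocket"].
config = {section: dict(parser[section]) for section in parser.sections()}
```

Note that configparser lowercases option names by default and returns all values as strings, so typed options like TIMEOUT still need a cast at the point of use.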

---

setup.py
VERSION

archivebox/
    __main__.py
    __init__.py

    core/
        util.py
        system.py

        config.py
            for module in config.DEFAULT_MODULES:
                INSTALLED_PACKAGES.append(module)

            OUTPUT_DIR: str
            CONFIG_FILE: str

            TIMEOUT: int
            URL_BLACKLIST: List[URL]
            USER_AGENT: str
            SCREEN_RESOLUTION: str
            COOKIES_FILE: str
            CHECK_SSL_VALIDITY: bool

        models.py
            Link
                url: URL
                title: str
                bookmarked: datetime
                indexed: datetime
                updated: datetime
                tags: ManyToMany(Tag)
                inlinks: ManyToMany(Link)
                outlinks: ManyToMany(Link)

            Task
                link: ForeignKey(Link)
                output: str
                module: str
                cmd: str
                cmd_version: str
                pwd: str
                status: str
                start_ts: datetime
                end_ts: datetime

            Tag
                type: url/label/mimetype/date
                value: str
                created: datetime
                updated: datetime

        tasks.py
            from huey.contrib.djhuey import db_task, enqueue

            @db_task(retries=3, retry_delay=1, priority=100)
            def add(url: str, depth=0, source: Link=None):
                # Fsync the link to index with forward and backreferences
                # *first*
                # before doing anything related to archiving!
                # always always always
                # commit the index to disk first or you'll be in a world of pain.
                link, _created = Link.objects.get_or_create(url=url)
                # --mm--a wild WAL appears--mm--
                if source:
                    source.outlinks.add(link)
                    link.inlinks.add(source)

                # Run the extractors and archive the link's content
                #    warc, media, git, screenshot, pdf, etc.
                outputs = {}
                extractors = get_extractors(link, outputs)
                while extractors:
                    for extractor in extractors:
                        outputs[extractor] = run_extractor(extractor, link, outputs)
                    # the list shrinks as extractors only run if there's
                    # no previous archive of that link 
                    extractors = get_extractors(link, outputs)

                # Crawl the content for outlinks and recurse if needed
                if depth <= 0:
                    # you're out of depth coins :'(
                    # you do not go to space today 🚀
                    return link

                # Crawlers will attempt to accept any string input
                # but they bail out fast if parsing is incompatible,
                # so running through all the crawlers each time
                # is still reasonably quick
                for crawler in get_crawlers(link):
                    task = crawler.run(link)
                    for outlink_url in task.output:
                        add(outlink_url, depth=depth-1, source=link)

                return link


            # I'm trying to avoid async, in favor of straightforward stuff like this:
            # def start(url, depth=0):
            #     from multiprocessing import Pool

            #     WORKERS = [
            #         {'queues': ['adder'], 'processes': 1, 'threads': ADDER_WORKERS},
            #         {'queues': ['fetcher'], 'processes': 1, 'threads': FETCHER_WORKERS},
            #         {'queues': ['extractor'], 'processes': 1, 'threads': EXTRACTOR_WORKERS},
            #         {'queues': ['crawler'], 'processes': 1, 'threads': CRAWLER_WORKERS},
            #     ]

            #     pool = Pool(processes=len(WORKERS))
            #     pool.map(lambda worker: run_command('run_huey', **worker), WORKERS)
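The shrinking-extractor loop in tasks.py above can be exercised in isolation with plain functions; get_extractors / run_extractor here are stand-in stubs, not ArchiveBox's real API:

```python
# Stub registry: an extractor is "pending" only while it has no recorded
# output yet, mirroring the shrinking-list loop in tasks.py above.
EXTRACTORS = ["wget", "screenshot", "pdf"]

def get_extractors(link, outputs):
    # Only extractors with no prior archive output are still pending.
    return [name for name in EXTRACTORS if name not in outputs]

def run_extractor(name, link, outputs):
    return f"{name} archived {link}"

def archive(link):
    outputs = {}
    extractors = get_extractors(link, outputs)
    while extractors:
        for extractor in extractors:
            outputs[extractor] = run_extractor(extractor, link, outputs)
        # Recompute: the pending list shrinks toward a fixed point,
        # so the loop terminates once every extractor has an output.
        extractors = get_extractors(link, outputs)
    return outputs
```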


    server/
        config.py
            SECRET_KEY: str
            USER_PERMISSIONS: rwx
            GROUP_PERMISSIONS: rwx
            PUBLIC_PERMISSIONS: rwx

            ENABLE_SITE_ISOLATION_BETA: bool

        urls.py
            /             -> server.views.index   -> /core.server/list.html
            /search       -> server.views.search  -> /core.server/list.html
            /add          -> server.views.add     -> /core.server/add.html
            /link/<slug>  -> server.views.link    -> /core.server/link.html

        views.py
            index
            search
            add
            link

        templates/
            index.html
            add.html
            link.html
            snapshot.html
            500.html
            404.html

        static/
            base.css

    cli/
        config.py
            IS_TTY
            USE_COLOR
            SHOW_PROGRESS
        cli/
            help.py
            init.py
            version.py
            info.py
            shell.py
            manage.py
            list.py
            update.py
            add.py
            remove.py
            config.py

    server.theme_light/
        config.py
        static/
            light.css

    server.theme_dark/
        config.py
        static/
            dark.css

    server.theme_oldschool/
        config.py
        static/
            base.css

    server.theme_modern/
        config.py
        static/
            base.css

    dependencies.python/
        config.py
            OPTIONAL = False
            VERSION = get_python_version()
        dependencies/
            python.py

    dependencies.django/
        config.py
            OPTIONAL = False
            VERSION = get_django_version()

    dependencies.sqlite3/
        config.py
            BINARY = get_sqlite_binary()
            VERSION = get_sqlite_version()
        settings.py
            DATABASES = {
                'default': {
                    'ENGINE': 'django.db.backends.sqlite3',
                    'NAME': DATA_DIR / 'database.sqlite3',
                },
                'tasks': {
                    'ENGINE': 'django.db.backends.sqlite3',
                    'NAME': DATA_DIR / 'tasks.sqlite3',
                },
            }
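With two sqlite databases declared, Django also needs a database router to direct task models at the 'tasks' alias. A minimal sketch, assuming a hypothetical 'tasks' app label (not named in this PR):

```python
class TasksRouter:
    """Route models from an assumed 'tasks' app to the tasks database."""

    def db_for_read(self, model, **hints):
        return "tasks" if model._meta.app_label == "tasks" else None

    def db_for_write(self, model, **hints):
        return "tasks" if model._meta.app_label == "tasks" else None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Task tables migrate only into 'tasks'; everything else into 'default'.
        if app_label == "tasks":
            return db == "tasks"
        return db == "default"

# settings.py would then declare (path is hypothetical):
# DATABASE_ROUTERS = ["core.routers.TasksRouter"]
```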

    dependencies.huey/
        config.py
            BINARY = get_huey_binary()
            VERSION = get_huey_version()
            CACHE_MB = 20
            FSYNC = True
            IMMEDIATE = False

        settings.py
            INSTALLED_APPS += ['huey.contrib.djhuey']
            HUEY = {
                'huey_class': 'huey.SqliteHuey',
                'filename': settings.DATABASES['tasks']['NAME'],
                'cache_mb': config.CACHE_MB,
                'fsync': config.FSYNC,
                'immediate': config.IMMEDIATE,
                'results': True,
                'store_none': False,
                'utc': True,
                'consumer': {
                    'workers': 4,
                    'worker_type': 'thread',
                    'initial_delay': 1,
                    'backoff': 1.15,
                    'max_delay': 120,
                    'scheduler_interval': 1,
                    'periodic': True,
                    'check_worker_health': True,
                    'health_check_interval': 2,
                },
            }


    dependencies.wget/
        config.py
            BINARY
            VERSION
            USER_AGENT
            CHECK_SSL_VALIDITY

    dependencies.youtubedl/
        config.py
            BINARY
            VERSION
            USER_AGENT
            CHECK_SSL_VALIDITY
        api.py

    dependencies.chrome/
        config.py
            BINARY
            VERSION
            USER_AGENT
            CHECK_SSL_VALIDITY
            HEADLESS
            SANDBOX
            USER_DATA_DIR
        api.py

    dependencies.curl/
        config.py
            BINARY
            VERSION
            USER_AGENT
            CHECK_SSL_VALIDITY
            COOKIES_FILE
        api.py

    dependencies.git/
        config.py
        api.py

    dependencies.pywb/
        config.py
        api.py

    dependencies.crontab/
        config.py
        api.py


    crawlers.json_links/
        crawlers/
            json_links.py

    crawlers.rss_links/
        crawlers/
            rss_links.py

    crawlers.txt_links/
        crawlers/
            txt_links.py

    crawlers.html_links/
        crawlers/
            html_links.py

    crawlers.pocket_links/
        crawlers/
            pocket_links.py

    crawlers.medium_links/
        crawlers/
            medium_links.py

    crawlers.pinboard_links/
        crawlers/
            pinboard_links.py

    crawlers.shaarli_links/
        crawlers/
            shaarli_links.py
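The comment in tasks.py above notes that crawlers accept any string input but bail out fast when parsing is incompatible. A txt_links-style parser illustrates the idea; the regex and function name are illustrative, not the PR's code:

```python
import re

# Bare-URL matcher: anything starting http(s):// up to whitespace or quotes.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def parse_txt_links(text: str) -> list[str]:
    """Extract bare URLs from plain text.

    Returns [] immediately on non-matching input, which is what makes
    running every crawler over every input reasonably cheap.
    """
    return URL_RE.findall(text)
```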


    extractors.metadata/
        extractors/
            metadata.py
                asset.content = download_url(asset.uri)
                asset.title = calculate_title(asset.uri, asset.content)
                asset.size = calculate_size(asset.uri, asset.content)
                asset.hash = calculate_hash(asset.uri, asset.content)
                asset.filetype = calculate_filetype(asset.uri, asset.content)
                asset.mimetype = calculate_mimetype(asset.uri, asset.content)
                asset.save()
                return str(asset.id)
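The calculate_* helpers above map onto stdlib hashlib and mimetypes; these simplified signatures (taking the content or URI directly rather than an asset object) are my assumption:

```python
import hashlib
import mimetypes

def calculate_hash(content: bytes) -> str:
    # Content-addressed identity for the asset (sha256 hex digest).
    return hashlib.sha256(content).hexdigest()

def calculate_size(content: bytes) -> int:
    return len(content)

def calculate_mimetype(uri: str) -> str:
    # Guess from the URI's extension; fall back to a generic binary type.
    guessed, _encoding = mimetypes.guess_type(uri)
    return guessed or "application/octet-stream"
```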

    extractors.wget_clone/
        config.py
        extractors/
            warc.py

    extractors.wget_warc/
        config.py
        extractors/
            warc.py

    extractors.youtubedl_media/
        config.py
            TIMEOUT
            SAVE_PLAYLISTS
        extractors/
            media.py
                should_run()
                run()

    extractors.archivedotorg/
        config.py
        extractors/
            archivedotorg.py

    extractors.chrome_dom/
        config.py
        extractors/
            dom.py

    extractors.chrome_screenshot/
        config.py
        extractors/
            screenshot.py

    extractors.chrome_pdf/
        config.py
        extractors/
            pdf.py

    extras.oneshot/
        config.py
        cli/
            oneshot.py

    extras.proxy/
        dependencies/
           dependencies.pywb
        config.py
        cli/
            proxy.py

    extras.webrecorder/
        dependencies/
            dependencies.pywb
        config.py
        cli/
            webrecorder.py

    extras.schedule/
        config.py
            REQUIRES = dependencies.crontab
        cli/
            schedule.py

    extras.federation/
        config.py
        models.py
        views.py
        urls.py
        templates/
            network.html
            node.html

    extras.neo4j/
        config.py
        models.py
        views.py
        urls.py
        templates/
            graph.html
        static/
            node_icon.png
            edge_icon.png


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
