update lists

2025-07-18 22:22:32 +02:00
parent 55bed3b4a1
commit 5916c5c074
3078 changed files with 331679 additions and 357255 deletions


@@ -74,9 +74,11 @@ Murray</a></li>
href="https://netpreserve.org/web-archiving/training-materials/">IIPC
and DPC Training materials: module for beginners (8 sessions)</a></li>
<li><a href="https://github.com/vphill/web-archiving-course">UNT Web
Archiving Course</a></li>
<li><a href="https://cedwarc.github.io/">Continuing Education to Advance
Web Archiving (CEDWARC)</a></li>
<li><a href="https://github.com/commoncrawl/whirlwind-python/">A
Whirlwind Tour of Common Crawl's Datasets using Python</a></li>
</ul></li>
<li>The WARC Standard:
<ul>
@@ -106,8 +108,11 @@ organisations who publish on the web, and who want to make sure their
site can be archived.</p>
<ul>
<li><a
href="https://library.stanford.edu/projects/web-archiving/archivability">Stanford
Libraries Archivability pages</a></li>
href="https://nullhandle.org/web-archivability/index.html">Definition of
Web Archivability</a> - This describes the ease with which web content
can be preserved. (<a
href="https://web.archive.org/web/20230728211501/https://library.stanford.edu/projects/web-archiving/archivability">Archived
version from the Stanford Libraries</a>)</li>
<li>The <a href="http://archiveready.com/">Archive Ready</a> tool, for
estimating how likely a web page will be archived successfully.</li>
</ul>
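<p>As a rough illustration of the archivability signals discussed in
these resources, the sketch below probes a site for reachability, a
robots.txt policy, and a sitemap. The signals chosen and the target URL
are our own illustrative simplification, not a prescribed test.</p>
<pre><code>import requests

SITE = "https://example.com"  # hypothetical site to assess

# Coarse archivability signals: the page must be reachable, robots.txt
# tells crawlers what they may fetch, and a sitemap helps them discover
# URLs. Real assessments (e.g. Archive Ready) check far more than this.
for path in ("/", "/robots.txt", "/sitemap.xml"):
    try:
        resp = requests.get(SITE + path, timeout=10)
        print(path, resp.status_code)
    except requests.RequestException as exc:
        print(path, "unreachable:", exc)
</code></pre>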
@@ -115,11 +120,15 @@ estimating how likely a web page will be archived successfully.</li>
<p>This list of tools and software is intended to briefly describe some
of the most important and widely-used tools related to web archiving.
For more details, we recommend you refer to (and contribute to!) these
excellent resources from other groups:</p>
<ul>
<li><a
href="https://github.com/archivers-space/research/tree/master/web_archiving">Comparison
of web archiving software</a></li>
<li><a
href="https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring">Awesome
Website Change Monitoring</a></li>
</ul>
<h3 id="acquisition">Acquisition</h3>
<ul>
<li><a href="https://github.com/pirate/ArchiveBox">ArchiveBox</a> - A
@@ -131,10 +140,12 @@ links using wget, Chrome headless, and other methods (formerly
href="http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html">Python
library</a> to push web resources into on-demand web archives (a usage
sketch appears later in this section). <em>(Stable)</em></li>
<li><a href="https://archiveweb.page">ArchiveWeb.Page</a> - A plugin for
Chrome and other Chromium based browsers that lets you interactively
archive web pages, replay them, and export them as WARC data. Also
available as an Electron based desktop application.</li>
<li><a
href="https://webrecorder.net/archivewebpage/">ArchiveWeb.Page</a> - A
plugin for Chrome and other Chromium based browsers that lets you
interactively archive web pages, replay them, and export them as WARC
&amp; WACZ files. Also available as an Electron based desktop
application.</li>
<li><a href="https://github.com/bellingcat/auto-archiver">Auto
Archiver</a> - Python script to automatically archive social media
posts, videos, and images from a Google Sheets document. Read the <a
@@ -142,9 +153,9 @@ href="https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-cont
about Auto Archiver on bellingcat.com</a>.</li>
<li><a
href="https://github.com/webrecorder/browsertrix-crawler">Browsertrix
Crawler</a> - A Chromium based high-fidelity crawling system, designed
to run a complex, customizable browser-based crawl in a single Docker
container. <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/brozzler">Brozzler</a> -
A distributed web crawler (爬虫) that uses a real browser (Chrome or
Chromium) to fetch pages and embedded URLs and to extract links.
@@ -154,6 +165,9 @@ and CLI tool for saving webpages. <em>(Stable)</em></li>
<li><a href="https://github.com/CGamesPlay/chronicler">Chronicler</a> -
Web browser with record and replay functionality. <em>(In
Development)</em></li>
<li><a href="https://www.community-archive.org/">Community Archive</a> -
Open Twitter Database and API with tools and resources for building on
archived Twitter data.</li>
<li><a href="https://github.com/turicas/crau">crau</a> - crau is the way
(most) Brazilians pronounce "crawl"; it's the easiest command-line tool
for archiving the Web and replaying archives: you just need a list of
@@ -209,7 +223,7 @@ for witnessing the web. <em>(Stable)</em></li>
href="https://github.com/gildas-lormeau/SingleFile">SingleFile</a> -
Browser extension for Firefox/Chrome and CLI tool to save a faithful
copy of a complete page as a single HTML file. <em>(Stable)</em></li>
<li><a href="http://mementoweb.github.com/SiteStory/">SiteStory</a> - A
<li><a href="http://mementoweb.github.io/SiteStory/">SiteStory</a> - A
transactional archive that selectively captures and stores transactions
that take place between a web client (browser) and a web server.
<em>(Stable)</em></li>
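<p>For one-off captures from code, the archivenow library listed above
exposes a small Python API. A minimal sketch, assuming the package is
installed from PyPI and following its documented push() call; the
target URL is illustrative.</p>
<pre><code>from archivenow import archivenow

# Push a URL to the Internet Archive's Save Page Now (the "ia"
# handler); archivenow supports other on-demand archives via other
# handler IDs. push() returns a list containing the archived URL
# (or an error message).
result = archivenow.push("https://example.com/", "ia")
print(result)
</code></pre>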
@@ -258,9 +272,6 @@ Open-source workflow management for selective web archiving.
<li><a href="https://github.com/WebMemex">WebMemex</a> - Browser
extension for Firefox and Chrome which lets you archive web pages you
visit. <em>(In Development)</em></li>
<li><a href="https://webrecorder.io/">Webrecorder</a> - Create
high-fidelity, interactive recordings of any web site you browse.
<em>(Stable)</em></li>
<li><a href="http://www.gnu.org/software/wget/">Wget</a> - An open
source file retrieval utility that, as of <a
href="http://www.archiveteam.org/index.php?title=Wget_with_WARC_output">version
@@ -280,32 +291,40 @@ href="https://ipfs.io/">IPFS</a>.</li>
open source project aimed to develop Wayback Machine, the key software
used by web archives worldwide to play back archived websites in the
user's browser. <em>(Stable)</em></li>
<li><a href="https://github.com/ikreymer/pywb">PyWb</a> - A Python (2
and 3) implementation of web archival replay tools, sometimes also known
as Wayback Machine. <em>(Stable)</em></li>
<li><a href="https://github.com/webrecorder/pywb">PYWB</a> - A Python 3
implementation of web archival replay tools, sometimes also known as
Wayback Machine (see the WARC-writing sketch after this list).
<em>(Stable)</em></li>
<li><a
href="https://oduwsdl.github.io/Reconstructive/">Reconstructive</a> -
Reconstructive is a ServiceWorker module for client-side reconstruction
of composite mementos by rerouting resource requests to corresponding
archived copies (JavaScript).</li>
<li><a href="https://replayweb.page/">ReplayWeb.Page</a> - A
browser-based, fully client-side replay engine for both local and remote
WARC files.</li>
<li><a href="https://webrecorder.net/replaywebpage/">ReplayWeb.page</a>
- A browser-based, fully client-side replay engine for both local and
remote WARC &amp; WACZ files. Also available as an Electron based
desktop application. <em>(Stable)</em></li>
<li><a href="https://github.com/iipc/warc2html">warc2html</a> - Converts
WARC files to static HTML suitable for browsing offline or
rehosting.</li>
</ul>
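<p>Experimenting with the replay tools above requires a WARC to play
back. A minimal sketch that records live HTTP traffic into one with
warcio (covered under Utilities below); the URL is illustrative, and
the resulting file can then be loaded into PYWB or ReplayWeb.page.</p>
<pre><code>from warcio.capture_http import capture_http
import requests  # must be imported after capture_http for patching

# Write the request and response records into example.warc.gz, then
# replay the file with a tool such as pywb or ReplayWeb.page.
with capture_http("example.warc.gz"):
    requests.get("https://example.com/")
</code></pre>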
<h3 id="search-discovery">Search &amp; Discovery</h3>
<ul>
<li><a href="https://github.com/medialab/hyphe">hyphe</a> - A webcrawler
built for research uses with a graphical user interface in order to
build web corpuses made of lists of web actors and maps of links between
them. <em>(Stable)</em></li>
<li><a href="https://github.com/machawk1/mink">Mink</a> - A <a
href="https://www.google.com/intl/en/chrome/">Google Chrome</a>
extension for querying Memento aggregators while browsing and
integrating live-archived web navigation. <em>(Stable)</em></li>
<li><a href="https://github.com/Guillaume-Levrier/PANDORAE">PANDORÆ</a>
- Desktop research software that plugs into a Solr endpoint to query,
retrieve, normalize and visually explore web archives.
<em>(Stable)</em></li>
<li><a href="https://github.com/wabarc/playback">playback</a> - A
toolkit for searching archived webpages from
<!--lint ignore double-link--><a href="https://web.archive.org">Internet
Archive</a>, <a href="https://archive.today">archive.today</a>, <a
href="http://timetravel.mementoweb.org">Memento</a> and beyond. <em>(In
Development)</em></li>
<li><a href="https://securitytrails.com/">SecurityTrails</a> - Web based
@@ -324,8 +343,7 @@ in Tempas</a>). <em>(Stable)</em></li>
href="https://github.com/ukwa/webarchive-discovery">webarchive-discovery</a>
- WARC and ARC full-text indexing and discovery tools, with a number of
associated tools capable of using the index shown below.
<em>(Stable)</em></li>
<li><a href="https://github.com/ukwa/shine">Shine</a> - A prototype web
archives exploration UI, developed with researchers as part of the <a
href="https://buddah.projects.history.ac.uk/">Big UK Domain Data for the
@@ -352,19 +370,18 @@ system</a>. <em>(In Development)</em></li>
<li>Other possible options for building a front-end are listed in the
<code>webarchive-discovery</code> wiki, <a
href="https://github.com/ukwa/webarchive-discovery/wiki/Front-ends">here</a>.</li>
</ul>
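<p>Several of the tools above build on the Memento protocol (RFC 7089).
As a minimal sketch, the query below asks the Time Travel aggregator's
JSON API for the capture of a URL closest to a given date; the endpoint
follows timetravel.mementoweb.org, but the exact response layout should
be checked against its documentation.</p>
<pre><code>import requests

# Ask the Memento Time Travel aggregator for captures of a URL closest
# to 1 January 2014, across many web archives at once.
api = "http://timetravel.mementoweb.org/api/json/20140101/https://example.com/"
data = requests.get(api, timeout=30).json()
closest = data.get("mementos", {}).get("closest", {})
print(closest.get("datetime"), closest.get("uri"))
</code></pre>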
<h3 id="utilities">Utilities</h3>
<ul>
<li><a href="https://github.com/recrm/ArchiveTools">ArchiveTools</a> -
Collection of tools to extract and interact with WARC files
(Python).</li>
<li><a href="https://pypi.org/project/cdx-toolkit/">cdx-toolkit</a> -
Library and CLI to consult cdx indexes and create WARC extractions of
subsets. Abstracts away Common Crawl's unusual crawl structure.
<em>(Stable)</em></li>
<li><a href="https://github.com/karust/gogetcrawl">Go Get Crawl</a> -
Extract web archive data using <!--lint ignore double-link--><a
href="https://web.archive.org/">Wayback Machine</a> and <a
href="https://commoncrawl.org/">Common Crawl</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/nlnwa/gowarcserver">gowarcserver</a> -
@@ -372,26 +389,29 @@ href="https://commoncrawl.org/">Common Crawl</a>. <em>(Stable)</em></li>
index (CDX) and WARC record server, used to index and serve WARC files
(Go).</li>
<li><a href="https://github.com/webrecorder/har2warc">har2warc</a> -
Convert HTTP Archive (HAR) -&gt; Web Archive (WARC) format (Python).
<!--lint ignore double-link--></li>
<li><a href="https://httpreserve.info">httpreserve.info</a> - Service to
return the status of a web page or save it to the Internet Archive.
HTTPreserve includes disambiguation of well-known short link services.
It returns JSON in the browser, or on the command line via cURL, using GET.
Describes web sites using earliest and latest dates in the Internet
Archive and demonstrates the construction of Robust Links in its output
using that range. (Golang). <em>(Stable)</em></li>
<li><a href="https://github.com/httpreserve/linkstat">HTTPreserve
linkstat</a> - Command line implementation of
<!--lint ignore double-link--><a
href="https://httpreserve.info">httpreserve.info</a> to describe the
status of a web page. Can be easily scripted and provides JSON output to
enable querying through tools like JQ. HTTPreserve Linkstat describes
current status, and earliest and latest links on
<!--lint ignore double-link--><a
href="https://archive.org/">archive.org</a>. (Golang).
<em>(Stable)</em></li>
<li><a href="https://github.com/jjjake/internetarchive">Internet Archive
Library</a> - A command line tool and Python library for interacting
directly with <a href="https://archive.org">archive.org</a>. (Python).
directly with <!--lint ignore double-link--><a
href="https://archive.org">archive.org</a>. (Python).
<em>(Stable)</em> (a usage sketch follows below)</li>
<li><a href="https://github.com/nla/httrack2warc">httrack2warc</a> -
Convert HTTrack archives to WARC format (Java).</li>
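<p>A minimal sketch of the Internet Archive Library mentioned above:
list an item's files and download a subset. The identifier "nasa" is
an illustrative example item.</p>
<pre><code>from internetarchive import get_item

# Fetch metadata for an archive.org item, list its files, and download
# only the gzipped WARC files, if the item has any.
item = get_item("nasa")
for f in item.files:
    print(f["name"], f.get("size"))
item.download(glob_pattern="*.warc.gz")
</code></pre>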
@@ -412,14 +432,9 @@ href="https://github.com/ukwa/ukwa-heritrix/blob/master/src/main/java/uk/bl/wap/
href="https://github.com/unt-libraries/py-wasapi-client">py-wasapi-client</a>
- Command line application to download crawls from WASAPI (Python).
<em>(Stable)</em></li>
<li><a href="https://archivebrowser.c3.cx/">The Archive Browser</a> -
The Archive Browser is a program that lets you browse the contents of
archives, as well as extract them. It will let you open files from
inside archives, and lets you preview them using Quick Look. WARC is
supported (macOS only, Proprietary app).</li>
<li><a href="http://unarchiver.c3.cx/unarchiver">The Unarchiver</a> -
Program to extract the contents of many archive formats, inclusive of
WARC, to a file system. Free variant of The Archive Browser (macOS only,
<li><a href="https://theunarchiver.com/">The Unarchiver</a> - Program to
extract the contents of many archive formats, inclusive of WARC, to a
file system. Free variant of The Archive Browser (macOS only,
Proprietary app).</li>
<li><a
href="https://github.com/httpreserve/tikalinkextract">tikalinkextract</a>
@@ -439,6 +454,8 @@ database. <em>(Stable)</em></li>
<li><a href="https://gitlab.com/taricorp/warcdedupe">warcdedupe</a> -
WARC deduplication tool (and WARC library) written in Rust. <em>(In
Development)</em></li>
<li><a href="https://github.com/natliblux/warc-safe">warc-safe</a> -
Automatic detection of viruses and NSFW content in WARC files.</li>
<li><a
href="https://github.com/helgeho/WarcPartitioner">WarcPartitioner</a> -
Partition (W)ARC Files by MIME Type and Year. <em>(Stable)</em></li>
@@ -462,8 +479,12 @@ Splitable Hadoop InputFormat for Concatenated GZIP Files (and
<code>*.warc.gz</code>). <em>(Stable)</em></li>
<li><a href="https://github.com/iipc/jwarc">jwarc</a> - Read and write
WARC files with a type safe API (Java).</li>
<li><a href="https://sbforge.org/display/JWAT/JWAT">Jwat</a> - Libraries
and tools for reading/writing/validating WARC/ARC/GZIP files (Java).
<li><a href="https://github.com/netarchivesuite/jwat">Jwat</a> -
Libraries for reading/writing/validating WARC/ARC/GZIP files (Java).
<em>(Stable)</em></li>
<li><a
href="https://github.com/netarchivesuite/jwat-tools">Jwat-Tools</a> -
Tools for reading/writing/validating WARC/ARC/GZIP files (Java).
<em>(Stable)</em></li>
<li><a href="https://github.com/N0taN3rd/node-warc">node-warc</a> -
Parse WARC files or create WARC files using either <a
@@ -478,6 +499,9 @@ Command line interface to unzip WARC and WACZ files (Python).</li>
<li><a href="https://github.com/chfoo/warcat">Warcat</a> - Tool and
library for handling Web ARChive (WARC) files (Python).
<em>(Stable)</em></li>
<li><a href="https://github.com/chfoo/warcat-rs">Warcat-rs</a> -
Command-line tool and Rust library for handling Web ARChive (WARC)
files. <em>(In Development)</em></li>
<li><a href="https://github.com/webrecorder/warcio">warcio</a> -
Streaming WARC/ARC library for fast web archive IO (Python).
<em>(Stable)</em> (a short reading sketch follows)</li>
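<p>A minimal reading sketch with warcio, streaming through a WARC and
printing its HTTP response records; the filename is illustrative.</p>
<pre><code>from warcio.archiveiterator import ArchiveIterator

# Iterate a (possibly gzipped) WARC and print the status code and
# target URI of every HTTP response record.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            print(record.http_headers.get_statuscode(), uri)
</code></pre>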
@@ -598,6 +622,8 @@ regularly reviews and summarizes work done in the Digital Preservation
field.</li>
<li><a href="https://blogs.bl.uk/webarchive/">UK Web Archive
Blog</a></li>
<li><a href="https://commoncrawl.org/blog">Common Crawl Foundation
Blog</a> - <a href="http://commoncrawl.org/blog/rss.xml">rss</a></li>
</ul>
<h3 id="mailing-lists">Mailing Lists</h3>
<ul>
@@ -624,6 +650,8 @@ href="https://archivers-slack.herokuapp.com/">Invite yourself</a> to a
multi-disciplinary effort for archiving projects run in affiliation with
<a href="https://envirodatagov.org/archiving/">EDGI</a> and <a
href="http://datatogether.org/">Data Together</a>.</li>
<li><a href="https://ccfpartners.slack.com/">Common Crawl Foundation
Partners</a> (ask greg zat commoncrawl zot org for an invite)</li>
</ul>
<h3 id="twitter">Twitter</h3>
<ul>
@@ -646,10 +674,10 @@ endorsement of these services, and readers should check and evaluate
these options based on their needs.</p>
<h3 id="self-hostable-open-source">Self-hostable, Open Source</h3>
<ul>
<li><a href="https://browsertrix.cloud/">Browsertrix Cloud</a> - From <a
href="https://webrecorder.net/">Webrecorder</a>, source available at <a
href="https://github.com/webrecorder/browsertrix-cloud"
class="uri">https://github.com/webrecorder/browsertrix-cloud</a>.</li>
<li><a href="https://webrecorder.net/browsertrix/">Browsertrix</a> -
From <a href="https://webrecorder.net/">Webrecorder</a>, source
available at <a href="https://github.com/webrecorder/browsertrix"
class="uri">https://github.com/webrecorder/browsertrix</a>.</li>
<li><a href="https://conifer.rhizome.org/">Conifer</a> - From <a
href="https://rhizome.org/">Rhizome</a>, source available at <a
href="https://github.com/Rhizome-Conifer"
@@ -667,3 +695,6 @@ href="https://www.mirrorweb.com/solutions/capabilities/website-archiving">Mirror
<li><a
href="https://www.smarsh.com/platform/compliance-management/web-archive">Smarsh</a></li>
</ul>
<p><a
href="https://github.com/iipc/awesome-web-archiving">webarchiving.md
Github</a></p>