<!--lint ignore awesome-github-->
<h1 id="awesome-web-archiving-awesome">Awesome Web Archiving <a
href="https://awesome.re"><img src="https://awesome.re/badge.svg"
alt="Awesome" /></a></h1>
<p>Web archiving is the process of collecting portions of the World Wide
Web to ensure the information is preserved in an archive for future
researchers, historians, and the public. Web archivists typically employ
Web crawlers for automated capture due to the massive scale of the Web.
Ever-evolving Web standards require continuous evolution of archiving
tools to keep up with the changes in Web technologies to ensure reliable
and meaningful capture and replay of archived web pages.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#trainingdocumentation">Training/Documentation</a></li>
<li><a href="#resources-for-web-publishers">Resources for Web
Publishers</a></li>
<li><a href="#tools--software">Tools &amp; Software</a>
<ul>
<li><a href="#acquisition">Acquisition</a></li>
<li><a href="#replay">Replay</a></li>
<li><a href="#search--discovery">Search &amp; Discovery</a></li>
<li><a href="#utilities">Utilities</a></li>
<li><a href="#warc-io-libraries">WARC I/O Libraries</a></li>
<li><a href="#analysis">Analysis</a></li>
<li><a href="#quality-assurance">Quality Assurance</a></li>
<li><a href="#curation">Curation</a></li>
</ul></li>
<li><a href="#community-resources">Community Resources</a>
<ul>
<li><a href="#other-awesome-lists">Other Awesome Lists</a></li>
<li><a href="#blogs-and-scholarship">Blogs and Scholarship</a></li>
<li><a href="#mailing-lists">Mailing Lists</a></li>
<li><a href="#slack">Slack</a></li>
<li><a href="#twitter">Twitter</a></li>
</ul></li>
<li><a href="#web-archiving-service-providers">Web Archiving Service
Providers</a>
<ul>
<li><a href="#self-hostable-open-source">Self-hostable, Open
Source</a></li>
<li><a href="#hosted-closed-source">Hosted, Closed Source</a></li>
</ul></li>
</ul>
<h2 id="trainingdocumentation">Training/Documentation</h2>
<ul>
<li>Introductions to web archiving concepts:
<ul>
<li><a href="https://youtu.be/ubDHY-ynWi0">What is a web archive?</a> -
A video from <a
href="https://www.youtube.com/channel/UCJukhTSw8VRj-VNTpBcqWkw">the UK
Web Archive YouTube Channel</a></li>
<li><a
href="https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives">Wikipedia's
List of Web Archiving Initiatives</a></li>
<li><a
href="https://support.archive-it.org/hc/en-us/articles/208111686-Glossary-of-Archive-It-and-Web-Archiving-Terms">Glossary
of Archive-It and Web Archiving Terms</a></li>
<li><a
href="https://archive-it.org/blog/post/announcing-the-web-archiving-life-cycle-model/">The
Web Archiving Lifecycle Model</a> - The Web Archiving Lifecycle Model is
an attempt to incorporate the technological and programmatic arms of the
web archiving into a framework that will be relevant to any organization
seeking to archive content from the web. Archive-It, the web archiving
service from the Internet Archive, developed the model based on its work
with memory institutions around the world.</li>
<li><a
href="https://kit.exposingtheinvisible.org/en/web-archive.html/">Retrieving
and Archiving Information from Websites by Wael Eskandar and Brad
Murray</a></li>
</ul></li>
<li>Training materials:
<ul>
<li><a
href="https://netpreserve.org/web-archiving/training-materials/">IIPC
and DPC Training materials: module for beginners (8 sessions)</a></li>
<li><a href="https://github.com/vphill/web-archiving-course">UNT Web
Archiving Course</a></li>
<li><a href="https://cedwarc.github.io/">Continuing Education to Advance
Web Archiving (CEDWARC)</a></li>
<li><a href="https://github.com/commoncrawl/whirlwind-python/">A
Whirlwind Tour of Common Crawl's Datasets using Python</a></li>
</ul></li>
<li>The WARC Standard:
<ul>
<li>The <a
href="https://iipc.github.io/warc-specifications/">warc-specifications</a>
community HTML version of the official specification and hub for new
proposals.</li>
<li>The <a href="http://bibnum.bnf.fr/WARC/">official ISO 28500 WARC
specification homepage</a>.</li>
</ul></li>
<li>For researchers using web archives:
<ul>
<li><a href="https://glam-workbench.github.io/web-archives/">GLAM
Workbench: Web Archives</a> - See also <a
href="https://netpreserveblog.wordpress.com/2020/05/28/asking-questions-with-web-archives/">this
related blog post on Asking questions with web archives</a>.</li>
<li><a href="https://aut.docs.archivesunleashed.org/">Archives Unleashed
Toolkit documentation</a></li>
<li><a
href="https://sobre.arquivo.pt/en/tutorial-for-humanities-researchers-about-how-to-use-arquivo-pt/">Tutorial
for Humanities researchers about how to explore Arquivo.pt</a></li>
</ul></li>
</ul>
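<p>To make the record structure described in the WARC specification
concrete, here is a minimal, hand-rolled sketch (standard-library Python
only, not a replacement for the libraries listed under WARC I/O
Libraries below) that parses the version line and header fields of a
single record:</p>

```python
# A WARC record starts with a version line, then named header fields,
# then a blank line (CRLF CRLF), then the record's content block.
# Field names below follow the published WARC specification.
def parse_warc_headers(record: bytes):
    """Parse the version line and header fields of a single WARC record."""
    head, _, _block = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.1"
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields

record = (b"WARC/1.1\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"Hello, world!")
version, fields = parse_warc_headers(record)
print(version, fields["WARC-Type"])  # WARC/1.1 response
```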
<h2 id="resources-for-web-publishers">Resources for Web Publishers</h2>
<p>These resources can help when working with individuals or
organisations who publish on the web, and who want to make sure their
site can be archived.</p>
<ul>
<li><a
href="https://nullhandle.org/web-archivability/index.html">Definition of
Web Archivability</a> - This describes the ease with which web content
can be preserved. (<a
href="https://web.archive.org/web/20230728211501/https://library.stanford.edu/projects/web-archiving/archivability">Archived
version from the Stanford Libraries</a>)</li>
<li>The <a href="http://archiveready.com/">Archive Ready</a> tool, for
estimating how likely a web page is to be archived successfully.</li>
</ul>
<h2 id="tools-software">Tools &amp; Software</h2>
<p>This list of tools and software is intended to briefly describe some
of the most important and widely-used tools related to web archiving.
For more details, we recommend you refer to (and contribute to!) these
excellent resources from other groups:</p>
<ul>
<li><a
href="https://github.com/archivers-space/research/tree/master/web_archiving">Comparison
of web archiving software</a></li>
<li><a
href="https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring">Awesome
Website Change Monitoring</a></li>
</ul>
<h3 id="acquisition">Acquisition</h3>
<ul>
<li><a href="https://github.com/pirate/ArchiveBox">ArchiveBox</a> - A
tool which maintains an additive archive from RSS feeds, bookmarks, and
links using wget, Chrome headless, and other methods (formerly
<code>Bookmark Archiver</code>). <em>(In Development)</em></li>
<li><a href="https://github.com/oduwsdl/archivenow">archivenow</a> - A
<a
href="http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html">Python
library</a> to push web resources into on-demand web archives.
<em>(Stable)</em></li>
<li><a
href="https://webrecorder.net/archivewebpage/">ArchiveWeb.Page</a> - A
plugin for Chrome and other Chromium based browsers that lets you
interactively archive web pages, replay them, and export them as WARC
&amp; WACZ files. Also available as an Electron based desktop
application.</li>
<li><a href="https://github.com/bellingcat/auto-archiver">Auto
Archiver</a> - Python script to automatically archive social media
posts, videos, and images from a Google Sheets document. Read the <a
href="https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/">article
about Auto Archiver on bellingcat.com</a>.</li>
<li><a
href="https://github.com/webrecorder/browsertrix-crawler">Browsertrix
Crawler</a> - A Chromium based high-fidelity crawling system, designed
to run a complex, customizable browser-based crawl in a single Docker
container. <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/brozzler">Brozzler</a> -
A distributed web crawler (爬虫, Chinese for "crawler") that uses a real
browser (Chrome or Chromium) to fetch pages and embedded URLs and to
extract links. <em>(Stable)</em></li>
<li><a href="https://github.com/wabarc/cairn">Cairn</a> - An npm package
and CLI tool for saving webpages. <em>(Stable)</em></li>
<li><a href="https://github.com/CGamesPlay/chronicler">Chronicler</a> -
Web browser with record and replay functionality. <em>(In
Development)</em></li>
<li><a href="https://www.community-archive.org/">Community Archive</a> -
Open Twitter Database and API with tools and resources for building on
archived Twitter data.</li>
<li><a href="https://github.com/turicas/crau">crau</a> - crau is how
(most) Brazilians pronounce "crawl"; it's the easiest command-line tool
for archiving the Web and playing back archives: you just need a list of
URLs. <em>(Stable)</em></li>
<li><a href="https://git.autistici.org/ale/crawl">Crawl</a> - A simple
web crawler in Golang. <em>(Stable)</em></li>
<li><a href="https://github.com/promyloph/crocoite">crocoite</a> - Crawl
websites using headless Google Chrome/Chromium and save resources,
static DOM snapshot and page screenshots to WARC files. <em>(In
Development)</em></li>
<li><a href="https://github.com/dosyago/DiskerNet">DiskerNet</a> - A
non-WARC-based tool which hooks into the Chrome browser and archives
everything you browse, making it available for offline replay. <em>(In
Development)</em></li>
<li><a href="https://github.com/justinlittman/fbarc">F(b)arc</a> - A
commandline tool and Python library for archiving data from <a
href="https://www.facebook.com/">Facebook</a> using the <a
href="https://developers.facebook.com/docs/graph-api">Graph API</a>.
<em>(Stable)</em></li>
<li><a href="https://github.com/WebMemex/freeze-dry">freeze-dry</a> -
JavaScript library to turn page into static, self-contained HTML
document; useful for browser extensions. <em>(In Development)</em></li>
<li><a href="https://github.com/ArchiveTeam/grab-site">grab-site</a> -
The archivist's web crawler: WARC output, dashboard for all crawls,
dynamic ignore patterns. <em>(Stable)</em></li>
<li><a
href="https://github.com/internetarchive/heritrix3/wiki">Heritrix</a> -
An open source, extensible, web-scale, archival quality web crawler.
<em>(Stable)</em>
<ul>
<li><a
href="https://github.com/internetarchive/heritrix3/discussions/categories/q-a">Heritrix
Q&amp;A</a> - A discussion forum for asking questions and getting
answers about using Heritrix.</li>
<li><a
href="https://github.com/web-archive-group/heritrix-walkthrough">Heritrix
Walkthrough</a> <em>(In Development)</em></li>
</ul></li>
<li><a href="https://github.com/steffenfritz/html2warc">html2warc</a> -
A simple script to convert offline data into a single WARC file.
<em>(Stable)</em></li>
<li><a href="http://www.httrack.com/">HTTrack</a> - An open source
website copying utility. <em>(Stable)</em></li>
<li><a href="https://github.com/Y2Z/monolith">monolith</a> - CLI tool to
save a web page as a single HTML file. <em>(Stable)</em></li>
<li><a href="https://github.com/go-shiori/obelisk">Obelisk</a> - Go
package and CLI tool for saving a web page as a single HTML file.
<em>(Stable)</em></li>
<li><a href="https://github.com/harvard-lil/scoop">Scoop</a> -
High-fidelity, browser-based, single-page web archiving library and CLI
for witnessing the web. <em>(Stable)</em></li>
<li><a
href="https://github.com/gildas-lormeau/SingleFile">SingleFile</a> -
Browser extension for Firefox/Chrome and CLI tool to save a faithful
copy of a complete page as a single HTML file. <em>(Stable)</em></li>
<li><a href="http://mementoweb.github.io/SiteStory/">SiteStory</a> - A
transactional archive that selectively captures and stores transactions
that take place between a web client (browser) and a web server.
<em>(Stable)</em></li>
<li><a href="https://gwu-libraries.github.io/sfm-ui/">Social Feed
Manager</a> - Open source software that enables users to create social
media collections from Twitter, Tumblr, Flickr, and Sina Weibo public
APIs. <em>(Stable)</em></li>
<li><a href="https://github.com/N0taN3rd/Squidwarc">Squidwarc</a> - An
<a
href="http://ws-dl.blogspot.com/2017/07/2017-07-24-replacing-heritrix-with.html">open
source, high-fidelity, page interacting</a> archival crawler that uses
Chrome or Chrome Headless directly. <em>(In Development)</em></li>
<li><a href="http://stormcrawler.net/">StormCrawler</a> - A collection
of resources for building low-latency, scalable web crawlers on Apache
Storm. <em>(Stable)</em></li>
<li><a href="https://github.com/docnow/twarc">twarc</a> - A command line
tool and Python library for archiving Twitter JSON data.
<em>(Stable)</em></li>
<li><a href="https://github.com/machawk1/wail">WAIL</a> - A graphical
user interface (GUI) atop multiple web archiving tools intended to be
used as an easy way for anyone to preserve and replay web pages; <a
href="https://machawk1.github.io/wail/">Python</a>, <a
href="https://github.com/n0tan3rd/wail">Electron</a>.
<em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/warcprox">Warcprox</a> -
WARC-writing MITM HTTP/S proxy. <em>(Stable)</em></li>
<li><a href="http://matkelly.com/warcreate/">WARCreate</a> - A <a
href="https://www.google.com/intl/en/chrome/browser/">Google Chrome</a>
extension for archiving an individual webpage or website to a WARC file.
<em>(Stable)</em></li>
<li><a href="https://github.com/peterk/warcworker">Warcworker</a> - An
open source, dockerized, queued, high fidelity web archiver based on
Squidwarc with a simple web GUI. <em>(Stable)</em></li>
<li><a href="https://github.com/wabarc/wayback">Wayback</a> - A toolkit
for snapshotting webpages to the Internet Archive, archive.today, IPFS,
and beyond. <em>(Stable)</em></li>
<li><a href="https://github.com/akamhy/waybackpy">Waybackpy</a> -
Wayback Machine Save, CDX, and availability API interface in Python, and
a command-line tool. <em>(Stable)</em></li>
<li><a href="https://github.com/helgeho/Web2Warc">Web2Warc</a> - An
easy-to-use and highly customizable crawler that enables anyone to
create their own little Web archives (WARC/CDX). <em>(Stable)</em></li>
<li><a href="https://webcuratortool.org">Web Curator Tool</a> -
Open-source workflow management for selective web archiving.
<em>(Stable)</em></li>
<li><a href="https://github.com/WebMemex">WebMemex</a> - Browser
extension for Firefox and Chrome which lets you archive web pages you
visit. <em>(In Development)</em></li>
<li><a href="http://www.gnu.org/software/wget/">Wget</a> - An open
source file retrieval utility that as of <a
href="http://www.archiveteam.org/index.php?title=Wget_with_WARC_output">version
1.14 supports writing WARCs</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/alard/wget-lua">Wget-lua</a> - Wget with
Lua extension. <em>(Stable)</em></li>
<li><a href="https://github.com/chfoo/wpull">Wpull</a> - A
Wget-compatible (or remake/clone/replacement/alternative) web downloader
and crawler. <em>(Stable)</em></li>
</ul>
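<p>Many of the crawlers above (Heritrix, Browsertrix Crawler, grab-site,
wget with <code>--warc-file</code>) write <code>.warc.gz</code> output
in which each WARC record is compressed as its own gzip member, and the
members are concatenated into one file, so a reader can decompress any
record independently. A standard-library Python sketch of that
convention, with record headers simplified (a real record also carries
fields such as WARC-Record-ID and WARC-Date):</p>

```python
import gzip

def make_record(uri: str, body: bytes) -> bytes:
    """Build a minimal (simplified) WARC/1.1 response record."""
    headers = ("WARC/1.1\r\n"
               "WARC-Type: response\r\n"
               f"WARC-Target-URI: {uri}\r\n"
               f"Content-Length: {len(body)}\r\n"
               "\r\n").encode("utf-8")
    # Records are terminated by two CRLFs.
    return headers + body + b"\r\n\r\n"

records = [make_record("http://example.com/", b"<html>hi</html>"),
           make_record("http://example.com/a", b"<html>there</html>")]

# One gzip member per record, concatenated into a single .warc.gz stream.
warc_gz = b"".join(gzip.compress(r) for r in records)
```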
<h3 id="replay">Replay</h3>
<ul>
<li><a href="https://github.com/oduwsdl/ipwb">InterPlanetary Wayback
(ipwb)</a> - Web Archive (WARC) indexing and replay using <a
href="https://ipfs.io/">IPFS</a>.</li>
<li><a href="https://github.com/iipc/openwayback/">OpenWayback</a> - The
open source project aimed at developing the Wayback Machine, the key
software used by web archives worldwide to play back archived websites
in the user's browser. <em>(Stable)</em></li>
<li><a href="https://github.com/webrecorder/pywb">PYWB</a> - A Python 3
implementation of web archival replay tools, sometimes also known as
Wayback Machine. <em>(Stable)</em></li>
<li><a
href="https://oduwsdl.github.io/Reconstructive/">Reconstructive</a> -
Reconstructive is a ServiceWorker module for client-side reconstruction
of composite mementos by rerouting resource requests to corresponding
archived copies (JavaScript).</li>
<li><a href="https://webrecorder.net/replaywebpage/">ReplayWeb.page</a>
- A browser-based, fully client-side replay engine for both local and
remote WARC &amp; WACZ files. Also available as an Electron based
desktop application. <em>(Stable)</em></li>
<li><a href="https://github.com/iipc/warc2html">warc2html</a> - Converts
WARC files to static HTML suitable for browsing offline or
rehosting.</li>
</ul>
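<p>Wayback-style replay systems such as pywb address a capture as a
replay prefix followed by a 14-digit timestamp and the original URL, and
tools like Reconstructive reroute a page's resource requests to such
addresses. A small illustrative sketch in Python (the replay endpoint
below is hypothetical; the path pattern follows the Wayback-style
convention):</p>

```python
from urllib.parse import urljoin

# Hypothetical replay endpoint, standing in for a real archive's prefix.
REPLAY_PREFIX = "https://archive.example.org/web"

def to_replay_url(original_url: str, timestamp: str) -> str:
    """Map a live URL to its archived copy at a given capture time."""
    return f"{REPLAY_PREFIX}/{timestamp}/{original_url}"

def reroute(page_url: str, resource_href: str, timestamp: str) -> str:
    """Resolve a (possibly relative) resource link, then reroute it."""
    absolute = urljoin(page_url, resource_href)
    return to_replay_url(absolute, timestamp)

print(reroute("http://example.com/index.html", "style.css", "20240101000000"))
# https://archive.example.org/web/20240101000000/http://example.com/style.css
```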
<h3 id="search-discovery">Search &amp; Discovery</h3>
<ul>
<li><a href="https://github.com/medialab/hyphe">hyphe</a> - A web
crawler built for research, with a graphical user interface for building
web corpora made of lists of web actors and maps of the links between
them. <em>(Stable)</em></li>
<li><a href="https://github.com/machawk1/mink">Mink</a> - A <a
href="https://www.google.com/intl/en/chrome/">Google Chrome</a>
extension for querying Memento aggregators while browsing and
integrating live-archived web navigation. <em>(Stable)</em></li>
<li><a href="https://github.com/Guillaume-Levrier/PANDORAE">PANDORÆ</a>
- Desktop research software that plugs into a Solr endpoint to query,
retrieve, normalize, and visually explore web archives.
<em>(Stable)</em></li>
<li><a href="https://github.com/wabarc/playback">playback</a> - A
toolkit for searching archived webpages from
<!--lint ignore double-link--><a href="https://web.archive.org">Internet
Archive</a>, <a href="https://archive.today">archive.today</a>, <a
href="http://timetravel.mementoweb.org">Memento</a> and beyond. <em>(In
Development)</em></li>
<li><a href="https://securitytrails.com/">SecurityTrails</a> - Web based
archive for WHOIS and DNS records. REST API available free of
charge.</li>
<li><a href="http://tempas.L3S.de/v1">Tempas v1</a> - Temporal web
archive search based on <a
href="https://en.wikipedia.org/wiki/Delicious_(website)">Delicious</a>
tags. <em>(Stable)</em></li>
<li><a href="http://tempas.L3S.de/v2">Tempas v2</a> - Temporal web
archive search based on links and anchor texts extracted from the German
web from 1996 to 2013 (results are not limited to German pages, e.g., <a
href="http://tempas.l3s.de/v2/query?q=obama&amp;from=2005&amp;to=2009">Obama@2005-2009
in Tempas</a>). <em>(Stable)</em></li>
<li><a
href="https://github.com/ukwa/webarchive-discovery">webarchive-discovery</a>
- WARC and ARC full-text indexing and discovery tools, with a number of
associated tools capable of using the index shown below.
<em>(Stable)</em></li>
<li><a href="https://github.com/ukwa/shine">Shine</a> - A prototype web
archives exploration UI, developed with researchers as part of the <a
href="https://buddah.projects.history.ac.uk/">Big UK Domain Data for the
Arts and Humanities project</a>. <em>(Stable)</em></li>
<li><a
href="https://github.com/netarchivesuite/solrwayback">SolrWayback</a> -
A Java backend and Vue.js frontend project with free-text search and a
built-in playback engine. Requires WARC files to have been indexed with
the Warc-Indexer. The web application also has a wide range of data
visualization and export tools that can be used on the whole web
archive. The <a
href="https://github.com/netarchivesuite/solrwayback/releases">SolrWayback
4 Bundle release</a> contains all the software and dependencies in an
out-of-the-box solution that is easy to install.</li>
<li><a
href="https://github.com/archivesunleashed/warclight">Warclight</a> - A
Project Blacklight based Rails engine that supports the discovery of web
archives held in the WARC and ARC formats. <em>(In
Development)</em></li>
<li><a href="https://github.com/webis-de/wasp">Wasp</a> - A fully
functional prototype of a personal <a
href="http://ceur-ws.org/Vol-2167/paper6.pdf">web archive and search
system</a>. <em>(In Development)</em></li>
<li>Other possible options for building a front-end are listed in the
<code>webarchive-discovery</code> wiki, <a
href="https://github.com/ukwa/webarchive-discovery/wiki/Front-ends">here</a>.</li>
</ul>
<h3 id="utilities">Utilities</h3>
<ul>
<li><a href="https://github.com/recrm/ArchiveTools">ArchiveTools</a> -
Collection of tools to extract and interact with WARC files
(Python).</li>
<li><a href="https://pypi.org/project/cdx-toolkit/">cdx-toolkit</a> -
Library and CLI to consult CDX indexes and create WARC extractions of
subsets. Abstracts away Common Crawl's unusual crawl structure.
<em>(Stable)</em></li>
<li><a href="https://github.com/karust/gogetcrawl">Go Get Crawl</a> -
Extract web archive data using <!--lint ignore double-link--><a
href="https://web.archive.org/">Wayback Machine</a> and <a
href="https://commoncrawl.org/">Common Crawl</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/nlnwa/gowarcserver">gowarcserver</a> -
<a href="https://github.com/dgraph-io/badger">BadgerDB</a>-based capture
index (CDX) and WARC record server, used to index and serve WARC files
(Go).</li>
<li><a href="https://github.com/webrecorder/har2warc">har2warc</a> -
Convert HTTP Archive (HAR) -&gt; Web Archive (WARC) format (Python).
<!--lint ignore double-link--></li>
<li><a href="https://httpreserve.info">httpreserve.info</a> - Service to
return the status of a web page or save it to the Internet Archive.
HTTPreserve includes disambiguation of well-known short link services.
It returns JSON via the browser, or on the command line via cURL using
GET. Describes websites using the earliest and latest dates in the
Internet Archive and demonstrates the construction of Robust Links in
its output using that range. (Golang). <em>(Stable)</em></li>
<li><a href="https://github.com/httpreserve/linkstat">HTTPreserve
linkstat</a> - Command line implementation of
<!--lint ignore double-link--><a
href="https://httpreserve.info">httpreserve.info</a> to describe the
status of a web page. Can be easily scripted and provides JSON output to
enable querying through tools like JQ. HTTPreserve Linkstat describes
current status, and earliest and latest links on
<!--lint ignore double-link--><a
href="https://archive.org/">archive.org</a>. (Golang).
<em>(Stable)</em></li>
<li><a href="https://github.com/jjjake/internetarchive">Internet Archive
Library</a> - A command line tool and Python library for interacting
directly with <!--lint ignore double-link--><a
href="https://archive.org">archive.org</a>. (Python).
<em>(Stable)</em></li>
<li><a href="https://github.com/nla/httrack2warc">httrack2warc</a> -
Convert HTTrack archives to WARC format (Java).</li>
<li><a href="https://github.com/oduwsdl/MementoMap">MementoMap</a> - A
Tool to Summarize Web Archive Holdings (Python). <em>(In
Development)</em></li>
<li><a href="https://github.com/oduwsdl/MemGator">MemGator</a> - A
Memento Aggregator CLI and Server (Golang). <em>(Stable)</em></li>
<li><a href="https://github.com/N0taN3rd/node-cdxj">node-cdxj</a> - <a
href="https://github.com/oduwsdl/ORS/wiki/CDXJ">CDXJ</a> file parser
(Node.js). <em>(Stable)</em></li>
<li><a href="https://github.com/nla/outbackcdx">OutbackCDX</a> -
RocksDB-based capture index (CDX) server supporting incremental updates
and compression. Can be used as backend for OpenWayback, PyWb and <a
href="https://github.com/ukwa/ukwa-heritrix/blob/master/src/main/java/uk/bl/wap/modules/uriuniqfilters/OutbackCDXRecentlySeenUriUniqFilter.java">Heritrix</a>.
<em>(Stable)</em></li>
<li><a
href="https://github.com/unt-libraries/py-wasapi-client">py-wasapi-client</a>
- Command line application to download crawls from WASAPI (Python).
<em>(Stable)</em></li>
<li><a href="https://theunarchiver.com/">The Unarchiver</a> - Program to
extract the contents of many archive formats, including WARC, to a
file system. Free variant of The Archive Browser (macOS only,
proprietary app).</li>
<li><a
href="https://github.com/httpreserve/tikalinkextract">tikalinkextract</a>
- Extract hyperlinks as a seed for web archiving from folders of
document types that can be parsed by Apache Tika (Golang, Apache Tika
Server). <em>(In Development)</em></li>
<li><a
href="https://github.com/sul-dlss/wasapi-downloader">wasapi-downloader</a>
- Java command line application to download crawls from WASAPI.
<em>(Stable)</em></li>
<li><a href="https://nlnwa.github.io/warchaeology/">Warchaeology</a> - A
collection of tools for inspecting, manipulating, deduplicating, and
validating WARC files. <em>(Stable)</em></li>
<li><a href="https://github.com/florents-Tselai/warcdb">warcdb</a> - A
command line utility (Python) for importing WARC files into a SQLite
database. <em>(Stable)</em></li>
<li><a href="https://gitlab.com/taricorp/warcdedupe">warcdedupe</a> -
WARC deduplication tool (and WARC library) written in Rust. <em>(In
Development)</em></li>
<li><a href="https://github.com/natliblux/warc-safe">warc-safe</a> -
Automatic detection of viruses and NSFW content in WARC files.</li>
<li><a
href="https://github.com/helgeho/WarcPartitioner">WarcPartitioner</a> -
Partition (W)ARC Files by MIME Type and Year. <em>(Stable)</em></li>
<li><a href="https://github.com/arcalex/warcrefs">warcrefs</a> - Web
archive deduplication tools. <em>(Stable)</em></li>
<li><a
href="https://github.com/ikreymer/webarchive-indexing">webarchive-indexing</a>
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file
system.</li>
<li><a href="https://github.com/WikiTeam/wikiteam">wikiteam</a> - Tools
for downloading and preserving wikis. <em>(Stable)</em></li>
</ul>
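<p>Several of the index tools above (OutbackCDX, gowarcserver,
node-cdxj) work with CDX or CDXJ capture-index lines. A CDXJ line pairs
a SURT-form URL key and a 14-digit timestamp with a JSON object of
capture metadata. A minimal standard-library parsing sketch (the sample
line and its field values are illustrative):</p>

```python
import json

def parse_cdxj(line: str):
    """Split a CDXJ line into URL key, timestamp, and metadata dict."""
    urlkey, timestamp, json_part = line.split(" ", 2)
    return urlkey, timestamp, json.loads(json_part)

line = ('com,example)/ 20240101000000 '
        '{"url": "http://example.com/", "mime": "text/html", '
        '"status": "200", "filename": "crawl-00000.warc.gz"}')
urlkey, ts, meta = parse_cdxj(line)
print(urlkey, ts, meta["status"])  # com,example)/ 20240101000000 200
```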
<h3 id="warc-io-libraries">WARC I/O Libraries</h3>
<ul>
<li><a
href="https://github.com/chatnoir-eu/chatnoir-resiliparse">FastWARC</a>
- A high-performance WARC parsing library (Python).</li>
<li><a
href="https://github.com/helgeho/HadoopConcatGz">HadoopConcatGz</a> - A
Splitable Hadoop InputFormat for Concatenated GZIP Files (and
<code>*.warc.gz</code>). <em>(Stable)</em></li>
<li><a href="https://github.com/iipc/jwarc">jwarc</a> - Read and write
WARC files with a type safe API (Java).</li>
<li><a href="https://github.com/netarchivesuite/jwat">Jwat</a> -
Libraries for reading/writing/validating WARC/ARC/GZIP files (Java).
<em>(Stable)</em></li>
<li><a
href="https://github.com/netarchivesuite/jwat-tools">Jwat-Tools</a> -
Tools for reading/writing/validating WARC/ARC/GZIP files (Java).
<em>(Stable)</em></li>
<li><a href="https://github.com/N0taN3rd/node-warc">node-warc</a> -
Parse WARC files or create WARC files using either <a
href="https://electron.atom.io/">Electron</a> or <a
href="https://github.com/cyrus-and/chrome-remote-interface">chrome-remote-interface</a>
(Node.js). <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/Sparkling">Sparkling</a>
- Internet Archive's Sparkling Data Processing Library.
<em>(Stable)</em></li>
<li><a href="https://github.com/emmadickson/unwarcit">Unwarcit</a> -
Command line interface to unzip WARC and WACZ files (Python).</li>
<li><a href="https://github.com/chfoo/warcat">Warcat</a> - Tool and
library for handling Web ARChive (WARC) files (Python).
<em>(Stable)</em></li>
<li><a href="https://github.com/chfoo/warcat-rs">Warcat-rs</a> -
Command-line tool and Rust library for handling Web ARChive (WARC)
files. <em>(In Development)</em></li>
<li><a href="https://github.com/webrecorder/warcio">warcio</a> -
Streaming WARC/ARC library for fast web archive IO (Python).
<em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/warctools">warctools</a>
- Library to work with ARC and WARC files (Python).</li>
<li><a href="https://github.com/richardlehane/webarchive">webarchive</a>
- Golang readers for ARC and WARC webarchive formats (Golang).</li>
</ul>
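<p>Libraries like warcio and HadoopConcatGz can stream and split
<code>.warc.gz</code> files because of the per-record gzip convention:
decompression stops at each gzip member boundary, so records can be read
one at a time without decompressing the whole file. A standard-library
Python sketch of that mechanism:</p>

```python
import gzip
import zlib

def iter_gzip_members(data: bytes):
    """Yield the decompressed payload of each gzip member in turn."""
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip wrapper;
        # the decompressor stops at the member's end, leaving the
        # remaining bytes in unused_data.
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        yield d.decompress(data)
        data = d.unused_data

stream = (gzip.compress(b"WARC/1.1\r\nWARC-Type: response\r\n\r\nrecord one")
          + gzip.compress(b"WARC/1.1\r\nWARC-Type: request\r\n\r\nrecord two"))
members = list(iter_gzip_members(stream))
print(len(members))  # 2
```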
<h3 id="analysis">Analysis</h3>
<ul>
<li><a href="https://github.com/internetarchive/arch">Archives Research
Compute Hub</a> - Web application for distributed compute analysis of
Archive-It web archive collections. <em>(Stable)</em></li>
<li><a href="https://github.com/helgeho/ArchiveSpark">ArchiveSpark</a> -
An Apache Spark framework (not only) for Web Archives that enables easy
data processing, extraction as well as derivation.
<em>(Stable)</em></li>
<li><a href="https://github.com/archivesunleashed/notebooks">Archives
Unleashed Notebooks</a> - Notebooks for working with web archives with
the Archives Unleashed Toolkit, and derivatives generated by the
Archives Unleashed Toolkit. <em>(Stable)</em></li>
<li><a href="https://github.com/archivesunleashed/aut">Archives
Unleashed Toolkit</a> - Archives Unleashed Toolkit (AUT) is an
open-source platform for analyzing web archives with Apache Spark.
<em>(Stable)</em></li>
<li><a href="https://commoncrawl.org/tag/columnar-index/">Common Crawl
Columnar Index</a> - SQL-queryable index, with CDX info plus language
classification. <em>(Stable)</em></li>
<li><a href="https://commoncrawl.org/category/web-graph/">Common Crawl
Web Graph</a> - A host or domain-level graph of the web, with ranking
information. <em>(Stable)</em></li>
<li><a href="https://github.com/commoncrawl/cc-notebooks">Common Crawl
Jupyter notebooks</a> - A collection of notebooks using Common Crawl's
various datasets. <em>(Stable)</em></li>
<li><a href="https://github.com/archivesunleashed/twut">Tweet Archives
Unleashed Toolkit</a> - An open-source toolkit for analyzing
line-oriented JSON Twitter archives with Apache Spark. <em>(In
Development)</em></li>
<li><a href="http://webdatacommons.org/">Web Data Commons</a> -
Structured data extracted from Common Crawl. <em>(Stable)</em></li>
</ul>
<h3 id="quality-assurance">Quality Assurance</h3>
<ul>
<li><a
href="https://chrome.google.com/webstore/detail/check-my-links/ojkcdipcgfaekbeaelaapakgnjflfglf">Chrome
Check My Links</a> - Browser extension: a link checker with more
options.</li>
<li><a
href="https://chrome.google.com/webstore/detail/link-checker/aibjbgmpmnidnmagaefhmcjhadpffaoi">Chrome
link checker</a> - Browser extension: basic link checker.</li>
<li><a
href="https://chrome.google.com/webstore/detail/bpjdkodgnbfalgghnbeggfbfjpcfamkf/publish-accepted?hl=en-US&amp;gl=US">Chrome
link gopher</a> - Browser extension: link harvester on a page.</li>
<li><a
href="https://chrome.google.com/webstore/detail/open-multiple-urls/oifijhaokejakekmnjmphonojcfkpbbh?hl=de">Chrome
Open Multiple URLs</a> - Browser extension: opens multiple URLs and also
extracts URLs from text.</li>
<li><a
href="https://chrome.google.com/webstore/detail/revolver-tabs/dlknooajieciikpedpldejhhijacnbda">Chrome
Revolver</a> - Browser extension: switches between browser tabs.</li>
<li><a href="https://github.com/lupoDharkael/flameshot">FlameShot</a> -
Screen capture and annotation on Ubuntu.</li>
<li><a href="https://www.playonlinux.com/en/">PlayOnLinux</a> - For
running Xenu and Notepad++ on Ubuntu.</li>
<li><a href="https://www.playonmac.com/en/">PlayOnMac</a> - For running
Xenu and Notepad++ on macOS.</li>
<li><a
href="https://support.microsoft.com/en-gb/help/13776/windows-use-snipping-tool-to-capture-screenshots">Windows
Snipping Tool</a> - Windows built-in for partial screen capture and
annotation. On macOS you can use Command + Shift + 4 (keyboard shortcut
for taking partial screen capture).</li>
<li><a href="http://winebottler.kronenberg.org/">WineBottler</a> - For
running Xenu and Notepad++ on macOS.</li>
<li><a href="https://github.com/jordansissel/xdotool">xDoTool</a> -
Click automation on Ubuntu.</li>
<li><a href="http://home.snafu.de/tilman/xenulink.html">Xenu</a> -
Desktop link checker for Windows.</li>
</ul>
<h3 id="curation">Curation</h3>
<ul>
<li><a href="https://robustlinks.mementoweb.org/zotero/">Zotero Robust
Links Extension</a> - A <a href="https://www.zotero.org/">Zotero</a>
extension that submits to and reads from web archives. Source <a
href="https://github.com/lanl/Zotero-Robust-Links-Extension">on
GitHub</a>. Supersedes <a
href="https://github.com/leonkt/zotero-memento">leonkt/zotero-memento</a>.</li>
</ul>
<h2 id="community-resources">Community Resources</h2>
<h3 id="other-awesome-lists">Other Awesome Lists</h3>
<ul>
<li><a
href="https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community">Web
Archiving Community</a></li>
<li><a href="https://github.com/machawk1/awesome-memento">Awesome
Memento</a></li>
<li><a
href="http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem">The
WARC Ecosystem</a></li>
<li><a href="http://coptr.digipres.org/Category:Web_Crawl">The Web Crawl
section of COPTR</a></li>
</ul>
<h3 id="blogs-and-scholarship">Blogs and Scholarship</h3>
<ul>
<li><a href="https://netpreserveblog.wordpress.com/">IIPC Blog</a></li>
<li><a href="https://webarchivingrt.wordpress.com/">Web Archiving
Roundtable</a> - Unofficial blog of the Web Archiving Roundtable of the
<a href="https://www2.archivists.org/">Society of American
Archivists</a>, maintained by its members.</li>
<li><a href="https://www.uclpress.co.uk/products/84010">The Web as
History</a> - An open-access book that provides a conceptual overview of
web archiving research, along with several case studies.</li>
<li><a href="https://ws-dl.blogspot.com/">WS-DL Blog</a> - The Web
Science and Digital Libraries Research Group blogs about Web archiving
topics, scholarly work, and academic trip reports.</li>
<li><a href="https://blog.dshr.org/">DSHR's Blog</a> - David Rosenthal
regularly reviews and summarizes work done in the Digital Preservation
field.</li>
<li><a href="https://blogs.bl.uk/webarchive/">UK Web Archive
Blog</a></li>
<li><a href="https://commoncrawl.org/blog">Common Crawl Foundation
Blog</a> - <a href="http://commoncrawl.org/blog/rss.xml">RSS</a></li>
</ul>
<h3 id="mailing-lists">Mailing Lists</h3>
<ul>
<li><a href="https://groups.google.com/g/common-crawl">Common
Crawl</a></li>
<li><a
href="http://netpreserve.org/about-us/iipc-mailing-list/">IIPC</a></li>
<li><a
href="https://groups.google.com/g/openwayback-dev">OpenWayback</a></li>
<li><a
href="https://groups.google.com/g/wasapi-community">WASAPI</a></li>
</ul>
<h3 id="slack">Slack</h3>
<ul>
<li><a href="https://iipc.slack.com/">IIPC Slack</a> - Ask <a
href="https://twitter.com/NetPreserve?s=20"><span class="citation"
data-cites="netpreserve">@netpreserve</span></a> for access.</li>
<li><a href="https://archivesunleashed.slack.com/">Archives Unleashed
Slack</a> - <a href="http://slack.archivesunleashed.org/">Fill out this
request form</a> for access to a researcher group of people working with
web archives.</li>
<li><a href="https://archivers.slack.com">Archivers Slack</a> - <a
href="https://archivers-slack.herokuapp.com/">Invite yourself</a> to a
multi-disciplinary effort for archiving projects run in affiliation with
<a href="https://envirodatagov.org/archiving/">EDGI</a> and <a
href="http://datatogether.org/">Data Together</a>.</li>
<li><a href="https://ccfpartners.slack.com/">Common Crawl Foundation
Partners</a> (ask greg zat commoncrawl zot org for an invite)</li>
</ul>
<h3 id="twitter">Twitter</h3>
<ul>
<li><a href="https://twitter.com/NetPreserve"><span class="citation"
data-cites="NetPreserve">@NetPreserve</span></a> - Official IIPC
handle.</li>
<li><a href="https://twitter.com/WebSciDL"><span class="citation"
data-cites="WebSciDL">@WebSciDL</span></a> - ODU Web Science and Digital
Libraries Research Group.</li>
<li><a
href="https://twitter.com/search?q=%23webarchiving">#WebArchiving</a></li>
<li><a
href="https://twitter.com/hashtag/webarchivewednesday">#WebArchiveWednesday</a></li>
</ul>
<h2 id="web-archiving-service-providers">Web Archiving Service
Providers</h2>
<p>We aim to list only services that allow web archives to be exported
in standard formats (WARC or WACZ). Inclusion is not an endorsement;
readers should evaluate these options against their own needs.</p>
<h3 id="self-hostable-open-source">Self-hostable, Open Source</h3>
<ul>
<li><a href="https://webrecorder.net/browsertrix/">Browsertrix</a> -
From <a href="https://webrecorder.net/">Webrecorder</a>, source
available at <a href="https://github.com/webrecorder/browsertrix"
class="uri">https://github.com/webrecorder/browsertrix</a>.</li>
<li><a href="https://conifer.rhizome.org/">Conifer</a> - From <a
href="https://rhizome.org/">Rhizome</a>, source available at <a
href="https://github.com/Rhizome-Conifer"
class="uri">https://github.com/Rhizome-Conifer</a>.</li>
</ul>
<h3 id="hosted-closed-source">Hosted, Closed Source</h3>
<ul>
<li><a href="https://archive-it.org/">Archive-It</a> - From the Internet
Archive.</li>
<li><a href="https://arkiwera.se/wp/websites/">Arkiwera</a></li>
<li><a href="https://www.hanzo.co/chronicle">Hanzo</a></li>
<li><a
href="https://www.mirrorweb.com/solutions/capabilities/website-archiving">MirrorWeb</a></li>
<li><a href="https://www.pagefreezer.com/">PageFreezer</a></li>
<li><a
href="https://www.smarsh.com/platform/compliance-management/web-archive">Smarsh</a></li>
</ul>
<p>Source: <a
href="https://github.com/iipc/awesome-web-archiving">iipc/awesome-web-archiving</a>
on GitHub.</p>