<!--lint ignore awesome-github-->
<h1 id="awesome-web-archiving-awesome">Awesome Web Archiving <a href="https://awesome.re"><img src="https://awesome.re/badge.svg" alt="Awesome" /></a></h1>
<p>Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ Web crawlers for automated capture due to the massive scale of the Web. Because Web standards evolve continuously, archiving tools must evolve with them to ensure reliable and meaningful capture and replay of archived web pages.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#trainingdocumentation">Training/Documentation</a></li>
<li><a href="#resources-for-web-publishers">Resources for Web Publishers</a></li>
<li><a href="#tools-software">Tools &amp; Software</a>
<ul>
<li><a href="#acquisition">Acquisition</a></li>
<li><a href="#replay">Replay</a></li>
<li><a href="#search-discovery">Search &amp; Discovery</a></li>
<li><a href="#utilities">Utilities</a></li>
<li><a href="#warc-io-libraries">WARC I/O Libraries</a></li>
<li><a href="#analysis">Analysis</a></li>
<li><a href="#quality-assurance">Quality Assurance</a></li>
<li><a href="#curation">Curation</a></li>
</ul></li>
<li><a href="#community-resources">Community Resources</a>
<ul>
<li><a href="#other-awesome-lists">Other Awesome Lists</a></li>
<li><a href="#blogs-and-scholarship">Blogs and Scholarship</a></li>
<li><a href="#mailing-lists">Mailing Lists</a></li>
<li><a href="#slack">Slack</a></li>
<li><a href="#twitter">Twitter</a></li>
</ul></li>
<li><a href="#web-archiving-service-providers">Web Archiving Service Providers</a>
<ul>
<li><a href="#self-hostable-open-source">Self-hostable, Open Source</a></li>
<li><a href="#hosted-closed-source">Hosted, Closed Source</a></li>
</ul></li>
</ul>
<h2 id="trainingdocumentation">Training/Documentation</h2>
<ul>
<li>Introductions to web archiving concepts:
<ul>
<li><a href="https://youtu.be/ubDHY-ynWi0">What is a web archive?</a> - A video from <a href="https://www.youtube.com/channel/UCJukhTSw8VRj-VNTpBcqWkw">the UK Web Archive YouTube Channel</a></li>
<li><a href="https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives">Wikipedia’s List of Web Archiving Initiatives</a></li>
<li><a href="https://support.archive-it.org/hc/en-us/articles/208111686-Glossary-of-Archive-It-and-Web-Archiving-Terms">Glossary of Archive-It and Web Archiving Terms</a></li>
<li><a href="https://archive-it.org/blog/post/announcing-the-web-archiving-life-cycle-model/">The Web Archiving Lifecycle Model</a> - An attempt to incorporate the technological and programmatic arms of web archiving into a framework relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world.</li>
<li><a href="https://kit.exposingtheinvisible.org/en/web-archive.html/">Retrieving and Archiving Information from Websites by Wael Eskandar and Brad Murray</a></li>
</ul></li>
<li>Training materials:
<ul>
<li><a href="https://netpreserve.org/web-archiving/training-materials/">IIPC and DPC Training materials: module for beginners (8 sessions)</a></li>
<li><a href="https://github.com/vphill/web-archiving-course">UNT Web Archiving Course</a></li>
<li><a href="https://cedwarc.github.io/">Continuing Education to Advance Web Archiving (CEDWARC)</a></li>
<li><a href="https://github.com/commoncrawl/whirlwind-python/">A Whirlwind Tour of Common Crawl’s Datasets using Python</a></li>
</ul></li>
<li>The WARC Standard (a minimal example record is sketched after this list):
<ul>
<li>The <a href="https://iipc.github.io/warc-specifications/">warc-specifications</a> community HTML version of the official specification and hub for new proposals.</li>
<li>The <a href="http://bibnum.bnf.fr/WARC/">official ISO 28500 WARC specification homepage</a>.</li>
</ul></li>
<li>For researchers using web archives:
<ul>
<li><a href="https://glam-workbench.github.io/web-archives/">GLAM Workbench: Web Archives</a> - See also <a href="https://netpreserveblog.wordpress.com/2020/05/28/asking-questions-with-web-archives/">this related blog post on ‘Asking questions with web archives’</a>.</li>
<li><a href="https://aut.docs.archivesunleashed.org/">Archives Unleashed Toolkit documentation</a></li>
<li><a href="https://sobre.arquivo.pt/en/tutorial-for-humanities-researchers-about-how-to-use-arquivo-pt/">Tutorial for Humanities researchers about how to explore Arquivo.pt</a></li>
</ul></li>
</ul>
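<p>To make the standard concrete, here is the shape of a single WARC <code>response</code> record as defined by the specification linked above: a version line, named header fields, a blank line, then the captured HTTP response. All values below (URI, date, record ID, lengths) are illustrative placeholders, not output of a real crawl.</p>
<pre><code>WARC/1.1
WARC-Type: response
WARC-Record-ID: &lt;urn:uuid:3c2f0d33-0000-4f7a-9df1-000000000000&gt;
WARC-Date: 2024-01-01T00:00:00Z
WARC-Target-URI: https://example.com/
Content-Type: application/http;msgtype=response
Content-Length: 91

HTTP/1.1 200 OK
Content-Type: text/html

&lt;html&gt;&lt;body&gt;Hello, archived web!&lt;/body&gt;&lt;/html&gt;</code></pre>
<p>A WARC file is simply a sequence of such records (request, response, metadata, revisit, and others), customarily gzip-compressed record by record.</p>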
<h2 id="resources-for-web-publishers">Resources for Web Publishers</h2>
<p>These resources can help when working with individuals or organisations who publish on the web, and who want to make sure their site can be archived.</p>
<ul>
<li><a href="https://nullhandle.org/web-archivability/index.html">Definition of Web Archivability</a> - Describes the ease with which web content can be preserved. (<a href="https://web.archive.org/web/20230728211501/https://library.stanford.edu/projects/web-archiving/archivability">Archived version from the Stanford Libraries</a>)</li>
<li>The <a href="http://archiveready.com/">Archive Ready</a> tool, for estimating how likely a web page is to be archived successfully.</li>
</ul>
<h2 id="tools-software">Tools &amp; Software</h2>
<p>This list of tools and software is intended to briefly describe some of the most important and widely-used tools related to web archiving. For more details, we recommend you refer to (and contribute to!) these excellent resources from other groups:</p>
<ul>
<li><a href="https://github.com/archivers-space/research/tree/master/web_archiving">Comparison of web archiving software</a></li>
<li><a href="https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring">Awesome Website Change Monitoring</a></li>
</ul>
<h3 id="acquisition">Acquisition</h3>
<ul>
<li><a href="https://github.com/pirate/ArchiveBox">ArchiveBox</a> - A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, headless Chrome, and other methods (formerly <code>Bookmark Archiver</code>). <em>(In Development)</em></li>
<li><a href="https://github.com/oduwsdl/archivenow">archivenow</a> - A <a href="http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html">Python library</a> to push web resources into on-demand web archives. <em>(Stable)</em></li>
<li><a href="https://webrecorder.net/archivewebpage/">ArchiveWeb.Page</a> - A plugin for Chrome and other Chromium-based browsers that lets you interactively archive web pages, replay them, and export them as WARC &amp; WACZ files. Also available as an Electron-based desktop application.</li>
<li><a href="https://github.com/bellingcat/auto-archiver">Auto Archiver</a> - Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Read the <a href="https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/">article about Auto Archiver on bellingcat.com</a>.</li>
<li><a href="https://github.com/webrecorder/browsertrix-crawler">Browsertrix Crawler</a> - A Chromium-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/brozzler">Brozzler</a> - A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. <em>(Stable)</em></li>
<li><a href="https://github.com/wabarc/cairn">Cairn</a> - An npm package and CLI tool for saving webpages. <em>(Stable)</em></li>
<li><a href="https://github.com/CGamesPlay/chronicler">Chronicler</a> - Web browser with record and replay functionality. <em>(In Development)</em></li>
<li><a href="https://www.community-archive.org/">Community Archive</a> - Open Twitter database and API with tools and resources for building on archived Twitter data.</li>
<li><a href="https://github.com/turicas/crau">crau</a> - crau is how (most) Brazilians pronounce “crawl”; it is the easiest command-line tool for archiving the Web and playing back archives: you just need a list of URLs. <em>(Stable)</em></li>
<li><a href="https://git.autistici.org/ale/crawl">Crawl</a> - A simple web crawler in Golang. <em>(Stable)</em></li>
<li><a href="https://github.com/promyloph/crocoite">crocoite</a> - Crawl websites using headless Google Chrome/Chromium and save resources, a static DOM snapshot, and page screenshots to WARC files. <em>(In Development)</em></li>
<li><a href="https://github.com/dosyago/DiskerNet">DiskerNet</a> - A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse, making it available for offline replay. <em>(In Development)</em></li>
<li><a href="https://github.com/justinlittman/fbarc">F(b)arc</a> - A command-line tool and Python library for archiving data from <a href="https://www.facebook.com/">Facebook</a> using the <a href="https://developers.facebook.com/docs/graph-api">Graph API</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/WebMemex/freeze-dry">freeze-dry</a> - JavaScript library to turn a page into a static, self-contained HTML document; useful for browser extensions. <em>(In Development)</em></li>
<li><a href="https://github.com/ArchiveTeam/grab-site">grab-site</a> - The archivist’s web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/heritrix3/wiki">Heritrix</a> - An open source, extensible, web-scale, archival-quality web crawler. <em>(Stable)</em>
<ul>
<li><a href="https://github.com/internetarchive/heritrix3/discussions/categories/q-a">Heritrix Q&amp;A</a> - A discussion forum for asking questions and getting answers about using Heritrix.</li>
<li><a href="https://github.com/web-archive-group/heritrix-walkthrough">Heritrix Walkthrough</a> <em>(In Development)</em></li>
</ul></li>
<li><a href="https://github.com/steffenfritz/html2warc">html2warc</a> - A simple script to convert offline data into a single WARC file. <em>(Stable)</em></li>
<li><a href="http://www.httrack.com/">HTTrack</a> - An open source website copying utility. <em>(Stable)</em></li>
<li><a href="https://github.com/Y2Z/monolith">monolith</a> - CLI tool to save a web page as a single HTML file. <em>(Stable)</em></li>
<li><a href="https://github.com/go-shiori/obelisk">Obelisk</a> - Go package and CLI tool for saving a web page as a single HTML file. <em>(Stable)</em></li>
<li><a href="https://github.com/harvard-lil/scoop">Scoop</a> - High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web. <em>(Stable)</em></li>
<li><a href="https://github.com/gildas-lormeau/SingleFile">SingleFile</a> - Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file. <em>(Stable)</em></li>
<li><a href="http://mementoweb.github.io/SiteStory/">SiteStory</a> - A transactional archive that selectively captures and stores transactions that take place between a web client (browser) and a web server. <em>(Stable)</em></li>
<li><a href="https://gwu-libraries.github.io/sfm-ui/">Social Feed Manager</a> - Open source software that enables users to create social media collections from the Twitter, Tumblr, Flickr, and Sina Weibo public APIs. <em>(Stable)</em></li>
<li><a href="https://github.com/N0taN3rd/Squidwarc">Squidwarc</a> - An <a href="http://ws-dl.blogspot.com/2017/07/2017-07-24-replacing-heritrix-with.html">open source, high-fidelity, page-interacting</a> archival crawler that uses Chrome or Chrome Headless directly. <em>(In Development)</em></li>
<li><a href="http://stormcrawler.net/">StormCrawler</a> - A collection of resources for building low-latency, scalable web crawlers on Apache Storm. <em>(Stable)</em></li>
<li><a href="https://github.com/docnow/twarc">twarc</a> - A command line tool and Python library for archiving Twitter JSON data. <em>(Stable)</em></li>
<li><a href="https://github.com/machawk1/wail">WAIL</a> - A graphical user interface (GUI) atop multiple web archiving tools, intended as an easy way for anyone to preserve and replay web pages; <a href="https://machawk1.github.io/wail/">Python</a>, <a href="https://github.com/n0tan3rd/wail">Electron</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/warcprox">Warcprox</a> - WARC-writing MITM HTTP/S proxy. <em>(Stable)</em></li>
<li><a href="http://matkelly.com/warcreate/">WARCreate</a> - A <a href="https://www.google.com/intl/en/chrome/browser/">Google Chrome</a> extension for archiving an individual webpage or website to a WARC file. <em>(Stable)</em></li>
<li><a href="https://github.com/peterk/warcworker">Warcworker</a> - An open source, dockerized, queued, high-fidelity web archiver based on Squidwarc with a simple web GUI. <em>(Stable)</em></li>
<li><a href="https://github.com/wabarc/wayback">Wayback</a> - A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS, and beyond. <em>(Stable)</em></li>
<li><a href="https://github.com/akamhy/waybackpy">Waybackpy</a> - Wayback Machine Save, CDX, and availability API interface in Python and a command-line tool. <em>(Stable)</em></li>
<li><a href="https://github.com/helgeho/Web2Warc">Web2Warc</a> - An easy-to-use and highly customizable crawler that enables anyone to create their own little Web archives (WARC/CDX). <em>(Stable)</em></li>
<li><a href="https://webcuratortool.org">Web Curator Tool</a> - Open-source workflow management for selective web archiving. <em>(Stable)</em></li>
<li><a href="https://github.com/WebMemex">WebMemex</a> - Browser extension for Firefox and Chrome which lets you archive web pages you visit. <em>(In Development)</em></li>
<li><a href="http://www.gnu.org/software/wget/">Wget</a> - An open source file retrieval utility that, as of <a href="http://www.archiveteam.org/index.php?title=Wget_with_WARC_output">version 1.14, supports writing WARCs</a> (a programmatic capture sketch appears after this list). <em>(Stable)</em></li>
<li><a href="https://github.com/alard/wget-lua">Wget-lua</a> - Wget with Lua extension. <em>(Stable)</em></li>
<li><a href="https://github.com/chfoo/wpull">Wpull</a> - A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler. <em>(Stable)</em></li>
</ul>
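<p>As a minimal programmatic illustration of the capture step that the tools above automate, here is a Python sketch using the <code>warcio</code> library (listed under WARC I/O Libraries below); the file name and URL are placeholders:</p>
<pre><code># Sketch: record a single page fetch into a WARC file with warcio.
# Note: warcio patches http.client, so requests must be imported
# *after* capture_http (a documented warcio requirement).
from warcio.capture_http import capture_http
import requests

with capture_http('example.warc.gz'):
    requests.get('https://example.com/')
</code></pre>
<p>Real crawlers layer link discovery, politeness, deduplication, and browser rendering on top of this basic capture step.</p>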
<h3 id="replay">Replay</h3>
<ul>
<li><a href="https://github.com/oduwsdl/ipwb">InterPlanetary Wayback (ipwb)</a> - Web Archive (WARC) indexing and replay using <a href="https://ipfs.io/">IPFS</a>.</li>
<li><a href="https://github.com/iipc/openwayback/">OpenWayback</a> - An open source project aimed at developing the Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user’s browser. <em>(Stable)</em></li>
<li><a href="https://github.com/webrecorder/pywb">PYWB</a> - A Python 3 implementation of web archival replay tools, sometimes also known as ‘Wayback Machine’ (the record-lookup step at the heart of replay is sketched after this list). <em>(Stable)</em></li>
<li><a href="https://oduwsdl.github.io/Reconstructive/">Reconstructive</a> - A ServiceWorker module for client-side reconstruction of composite mementos by rerouting resource requests to corresponding archived copies (JavaScript).</li>
<li><a href="https://webrecorder.net/replaywebpage/">ReplayWeb.page</a> - A browser-based, fully client-side replay engine for both local and remote WARC &amp; WACZ files. Also available as an Electron-based desktop application. <em>(Stable)</em></li>
<li><a href="https://github.com/iipc/warc2html">warc2html</a> - Converts WARC files to static HTML suitable for browsing offline or rehosting.</li>
</ul>
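<p>Whatever the engine, replay starts by locating an archived response for the requested URL, normally via a CDX index. Below is a toy Python sketch of that lookup step using <code>warcio</code> (a linear scan with placeholder names); real replay systems also canonicalize URLs, resolve timestamps, and rewrite links:</p>
<pre><code># Sketch: find the archived response for one URL by scanning a WARC.
from warcio.archiveiterator import ArchiveIterator

TARGET = 'https://example.com/'  # placeholder URL

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if (record.rec_type == 'response' and
                record.rec_headers.get_header('WARC-Target-URI') == TARGET):
            body = record.content_stream().read()
            print(len(body), 'bytes captured for', TARGET)
            break
</code></pre>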
<h3 id="search-discovery">Search &amp; Discovery</h3>
<ul>
<li><a href="https://github.com/medialab/hyphe">hyphe</a> - A web crawler built for research, with a graphical user interface for building web corpora made of lists of web actors and maps of the links between them. <em>(Stable)</em></li>
<li><a href="https://github.com/machawk1/mink">Mink</a> - A <a href="https://www.google.com/intl/en/chrome/">Google Chrome</a> extension for querying Memento aggregators while browsing and integrating live and archived web navigation. <em>(Stable)</em></li>
<li><a href="https://github.com/Guillaume-Levrier/PANDORAE">PANDORÆ</a> - Desktop research software that plugs into a Solr endpoint to query, retrieve, normalize, and visually explore web archives. <em>(Stable)</em></li>
<li><a href="https://github.com/wabarc/playback">playback</a> - A toolkit for searching archived webpages from <!--lint ignore double-link--><a href="https://web.archive.org">Internet Archive</a>, <a href="https://archive.today">archive.today</a>, <a href="http://timetravel.mementoweb.org">Memento</a>, and beyond (a raw CDX-index lookup is sketched after this list). <em>(In Development)</em></li>
<li><a href="https://securitytrails.com/">SecurityTrails</a> - Web-based archive for WHOIS and DNS records. REST API available free of charge.</li>
<li><a href="http://tempas.L3S.de/v1">Tempas v1</a> - Temporal web archive search based on <a href="https://en.wikipedia.org/wiki/Delicious_(website)">Delicious</a> tags. <em>(Stable)</em></li>
<li><a href="http://tempas.L3S.de/v2">Tempas v2</a> - Temporal web archive search based on links and anchor texts extracted from the German web from 1996 to 2013 (results are not limited to German pages, e.g., <a href="http://tempas.l3s.de/v2/query?q=obama&amp;from=2005&amp;to=2009">Obama@2005-2009 in Tempas</a>). <em>(Stable)</em></li>
<li><a href="https://github.com/ukwa/webarchive-discovery">webarchive-discovery</a> - WARC and ARC full-text indexing and discovery tools; the associated tools listed below can consume the index it produces. <em>(Stable)</em></li>
<li><a href="https://github.com/ukwa/shine">Shine</a> - A prototype web archives exploration UI, developed with researchers as part of the <a href="https://buddah.projects.history.ac.uk/">Big UK Domain Data for the Arts and Humanities project</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/netarchivesuite/solrwayback">SolrWayback</a> - A Java backend and Vue.js frontend with free-text search and a built-in playback engine; requires WARC files to have been indexed with the Warc-Indexer. The web application also offers a wide range of data visualization and export tools that can be used on the whole web archive. The <a href="https://github.com/netarchivesuite/solrwayback/releases">SolrWayback 4 bundle release</a> contains all the software and dependencies in an out-of-the-box solution that is easy to install.</li>
<li><a href="https://github.com/archivesunleashed/warclight">Warclight</a> - A Project Blacklight based Rails engine that supports the discovery of web archives held in the WARC and ARC formats. <em>(In Development)</em></li>
<li><a href="https://github.com/webis-de/wasp">Wasp</a> - A fully functional prototype of a personal <a href="http://ceur-ws.org/Vol-2167/paper6.pdf">web archive and search system</a>. <em>(In Development)</em></li>
<li>Other options for building a front-end are listed in the <code>webarchive-discovery</code> wiki, <a href="https://github.com/ukwa/webarchive-discovery/wiki/Front-ends">here</a>.</li>
</ul>
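<p>Many of these tools sit on top of capture (CDX) indexes. As a concrete example of what such an index lookup returns, the sketch below queries the Internet Archive’s public CDX server directly with its standard parameters:</p>
<pre><code># Sketch: list a few Wayback Machine captures of a URL via the CDX API.
import requests

resp = requests.get(
    'https://web.archive.org/cdx/search/cdx',
    params={'url': 'example.com', 'output': 'json', 'limit': 5},
    timeout=30,
)
resp.raise_for_status()
rows = resp.json()
header, captures = rows[0], rows[1:]  # the first row holds the field names
for row in captures:
    capture = dict(zip(header, row))
    print(capture['timestamp'], capture['statuscode'], capture['original'])
</code></pre>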
<h3 id="utilities">Utilities</h3>
<ul>
<li><a href="https://github.com/recrm/ArchiveTools">ArchiveTools</a> - Collection of tools to extract and interact with WARC files (Python).</li>
<li><a href="https://pypi.org/project/cdx-toolkit/">cdx-toolkit</a> - Library and CLI to consult CDX indexes and create WARC extractions of subsets. Abstracts away Common Crawl’s unusual crawl structure. <em>(Stable)</em></li>
<li><a href="https://github.com/karust/gogetcrawl">Go Get Crawl</a> - Extract web archive data using <!--lint ignore double-link--><a href="https://web.archive.org/">Wayback Machine</a> and <a href="https://commoncrawl.org/">Common Crawl</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/nlnwa/gowarcserver">gowarcserver</a> - <a href="https://github.com/dgraph-io/badger">BadgerDB</a>-based capture index (CDX) and WARC record server, used to index and serve WARC files (Go).</li>
<li><a href="https://github.com/webrecorder/har2warc">har2warc</a> - Convert HTTP Archive (HAR) files to Web Archive (WARC) format (Python).<!--lint ignore double-link--></li>
<li><a href="https://httpreserve.info">httpreserve.info</a> - Service to return the status of a web page or save it to the Internet Archive. HTTPreserve includes disambiguation of well-known short-link services. It returns JSON via the browser, or on the command line via curl using GET. It describes web sites using the earliest and latest dates in the Internet Archive and demonstrates the construction of Robust Links in its output using that range (Golang). <em>(Stable)</em></li>
<li><a href="https://github.com/httpreserve/linkstat">HTTPreserve linkstat</a> - Command line implementation of <!--lint ignore double-link--><a href="https://httpreserve.info">httpreserve.info</a> to describe the status of a web page. Can be easily scripted and provides JSON output to enable querying through tools like jq. Describes current status, and earliest and latest links on <!--lint ignore double-link--><a href="https://archive.org/">archive.org</a> (Golang). <em>(Stable)</em></li>
<li><a href="https://github.com/jjjake/internetarchive">Internet Archive Library</a> - A command line tool and Python library for interacting directly with <!--lint ignore double-link--><a href="https://archive.org">archive.org</a> (Python; see the example after this list). <em>(Stable)</em></li>
<li><a href="https://github.com/nla/httrack2warc">httrack2warc</a> - Convert HTTrack archives to WARC format (Java).</li>
<li><a href="https://github.com/oduwsdl/MementoMap">MementoMap</a> - A tool to summarize web archive holdings (Python). <em>(In Development)</em></li>
<li><a href="https://github.com/oduwsdl/MemGator">MemGator</a> - A Memento aggregator CLI and server (Golang). <em>(Stable)</em></li>
<li><a href="https://github.com/N0taN3rd/node-cdxj">node-cdxj</a> - <a href="https://github.com/oduwsdl/ORS/wiki/CDXJ">CDXJ</a> file parser (Node.js). <em>(Stable)</em></li>
<li><a href="https://github.com/nla/outbackcdx">OutbackCDX</a> - RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as a backend for OpenWayback, pywb, and <a href="https://github.com/ukwa/ukwa-heritrix/blob/master/src/main/java/uk/bl/wap/modules/uriuniqfilters/OutbackCDXRecentlySeenUriUniqFilter.java">Heritrix</a>. <em>(Stable)</em></li>
<li><a href="https://github.com/unt-libraries/py-wasapi-client">py-wasapi-client</a> - Command line application to download crawls from WASAPI (Python). <em>(Stable)</em></li>
<li><a href="https://theunarchiver.com/">The Unarchiver</a> - Program to extract the contents of many archive formats, including WARC, to a file system. Free variant of The Archive Browser (macOS only, proprietary).</li>
<li><a href="https://github.com/httpreserve/tikalinkextract">tikalinkextract</a> - Extract hyperlinks, as seeds for web archiving, from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). <em>(In Development)</em></li>
<li><a href="https://github.com/sul-dlss/wasapi-downloader">wasapi-downloader</a> - Java command line application to download crawls from WASAPI. <em>(Stable)</em></li>
<li><a href="https://nlnwa.github.io/warchaeology/">Warchaeology</a> - A collection of tools for inspecting, manipulating, deduplicating, and validating WARC files. <em>(Stable)</em></li>
<li><a href="https://github.com/florents-Tselai/warcdb">warcdb</a> - A command line utility (Python) for importing WARC files into a SQLite database. <em>(Stable)</em></li>
<li><a href="https://gitlab.com/taricorp/warcdedupe">warcdedupe</a> - WARC deduplication tool (and WARC library) written in Rust. <em>(In Development)</em></li>
<li><a href="https://github.com/natliblux/warc-safe">warc-safe</a> - Automatic detection of viruses and NSFW content in WARC files.</li>
<li><a href="https://github.com/helgeho/WarcPartitioner">WarcPartitioner</a> - Partition (W)ARC files by MIME type and year. <em>(Stable)</em></li>
<li><a href="https://github.com/arcalex/warcrefs">warcrefs</a> - Web archive deduplication tools. <em>(Stable)</em></li>
<li><a href="https://github.com/ikreymer/webarchive-indexing">webarchive-indexing</a> - Tools for bulk indexing of WARC/ARC files on Hadoop, EMR, or a local file system.</li>
<li><a href="https://github.com/WikiTeam/wikiteam">wikiteam</a> - Tools for downloading and preserving wikis. <em>(Stable)</em></li>
</ul>
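<p>As a short usage sketch for one of the utilities above, the <code>internetarchive</code> Python library can read item metadata and download files from archive.org; the identifier below is a placeholder:</p>
<pre><code># Sketch: inspect and download an archive.org item.
from internetarchive import get_item

item = get_item('example-item-identifier')  # placeholder identifier
print(item.metadata.get('title'))
item.download(verbose=True)  # saves files under ./example-item-identifier/
</code></pre>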
<h3 id="warc-io-libraries">WARC I/O Libraries</h3>
<ul>
<li><a href="https://github.com/chatnoir-eu/chatnoir-resiliparse">FastWARC</a> - A high-performance WARC parsing library (Python).</li>
<li><a href="https://github.com/helgeho/HadoopConcatGz">HadoopConcatGz</a> - A splittable Hadoop InputFormat for concatenated GZIP files (and <code>*.warc.gz</code>). <em>(Stable)</em></li>
<li><a href="https://github.com/iipc/jwarc">jwarc</a> - Read and write WARC files with a type-safe API (Java).</li>
<li><a href="https://github.com/netarchivesuite/jwat">Jwat</a> - Libraries for reading/writing/validating WARC/ARC/GZIP files (Java). <em>(Stable)</em></li>
<li><a href="https://github.com/netarchivesuite/jwat-tools">Jwat-Tools</a> - Tools for reading/writing/validating WARC/ARC/GZIP files (Java). <em>(Stable)</em></li>
<li><a href="https://github.com/N0taN3rd/node-warc">node-warc</a> - Parse WARC files or create WARC files using either <a href="https://electron.atom.io/">Electron</a> or <a href="https://github.com/cyrus-and/chrome-remote-interface">chrome-remote-interface</a> (Node.js). <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/Sparkling">Sparkling</a> - Internet Archive’s Sparkling Data Processing Library. <em>(Stable)</em></li>
<li><a href="https://github.com/emmadickson/unwarcit">Unwarcit</a> - Command line interface to unzip WARC and WACZ files (Python).</li>
<li><a href="https://github.com/chfoo/warcat">Warcat</a> - Tool and library for handling Web ARChive (WARC) files (Python). <em>(Stable)</em></li>
<li><a href="https://github.com/chfoo/warcat-rs">Warcat-rs</a> - Command-line tool and Rust library for handling Web ARChive (WARC) files. <em>(In Development)</em></li>
<li><a href="https://github.com/webrecorder/warcio">warcio</a> - Streaming WARC/ARC library for fast web archive I/O (Python; a record-writing example follows this list). <em>(Stable)</em></li>
<li><a href="https://github.com/internetarchive/warctools">warctools</a> - Library to work with ARC and WARC files (Python).</li>
<li><a href="https://github.com/richardlehane/webarchive">webarchive</a> - Readers for ARC and WARC web archive formats (Golang).</li>
</ul>
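<p>To make these libraries concrete, here is a minimal record-writing sketch with <code>warcio</code> (payload and URL are placeholders); the other libraries above expose equivalent APIs in their own languages:</p>
<pre><code># Sketch: write a single WARC response record with warcio.
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders(
        '200 OK', [('Content-Type', 'text/plain')], protocol='HTTP/1.1')
    record = writer.create_warc_record(
        'https://example.com/', 'response',
        payload=BytesIO(b'hello, archived web'),
        http_headers=http_headers)
    writer.write_record(record)
</code></pre>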
<h3 id="analysis">Analysis</h3>
<ul>
<li><a href="https://github.com/internetarchive/arch">Archives Research Compute Hub</a> - Web application for distributed compute analysis of Archive-It web archive collections. <em>(Stable)</em></li>
<li><a href="https://github.com/helgeho/ArchiveSpark">ArchiveSpark</a> - An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction, and derivation. <em>(Stable)</em></li>
<li><a href="https://github.com/archivesunleashed/notebooks">Archives Unleashed Notebooks</a> - Notebooks for working with web archives with the Archives Unleashed Toolkit, and with derivatives generated by the Archives Unleashed Toolkit. <em>(Stable)</em></li>
<li><a href="https://github.com/archivesunleashed/aut">Archives Unleashed Toolkit</a> - The Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark (a toy single-machine analysis is sketched after this list). <em>(Stable)</em></li>
<li><a href="https://commoncrawl.org/tag/columnar-index/">Common Crawl Columnar Index</a> - SQL-queryable index, with CDX info plus language classification. <em>(Stable)</em></li>
<li><a href="https://commoncrawl.org/category/web-graph/">Common Crawl Web Graph</a> - A host- or domain-level graph of the web, with ranking information. <em>(Stable)</em></li>
<li><a href="https://github.com/commoncrawl/cc-notebooks">Common Crawl Jupyter notebooks</a> - A collection of notebooks using Common Crawl’s various datasets. <em>(Stable)</em></li>
<li><a href="https://github.com/archivesunleashed/twut">Tweet Archives Unleashed Toolkit</a> - An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark. <em>(In Development)</em></li>
<li><a href="http://webdatacommons.org/">Web Data Commons</a> - Structured data extracted from Common Crawl. <em>(Stable)</em></li>
</ul>
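<p>The frameworks above scale analysis out with Apache Spark or Hadoop, but the underlying idea is iterating over records and deriving statistics. A toy single-machine equivalent in plain Python using <code>warcio</code> (the file name is a placeholder):</p>
<pre><code># Sketch: derive a MIME-type histogram from one WARC file.
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

mime_counts = Counter()
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            mime = record.http_headers.get_header('Content-Type') or 'unknown'
            mime_counts[mime.split(';')[0]] += 1  # drop charset parameters

for mime, count in mime_counts.most_common(10):
    print(count, mime)
</code></pre>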
<h3 id="quality-assurance">Quality Assurance</h3>
<ul>
<li><a href="https://chrome.google.com/webstore/detail/check-my-links/ojkcdipcgfaekbeaelaapakgnjflfglf">Chrome Check My Links</a> - Browser extension: a link checker with more options.</li>
<li><a href="https://chrome.google.com/webstore/detail/link-checker/aibjbgmpmnidnmagaefhmcjhadpffaoi">Chrome link checker</a> - Browser extension: basic link checker.</li>
<li><a href="https://chrome.google.com/webstore/detail/bpjdkodgnbfalgghnbeggfbfjpcfamkf/publish-accepted?hl=en-US&amp;gl=US">Chrome link gopher</a> - Browser extension: harvests the links on a page.</li>
<li><a href="https://chrome.google.com/webstore/detail/open-multiple-urls/oifijhaokejakekmnjmphonojcfkpbbh?hl=de">Chrome Open Multiple URLs</a> - Browser extension: opens multiple URLs and also extracts URLs from text.</li>
<li><a href="https://chrome.google.com/webstore/detail/revolver-tabs/dlknooajieciikpedpldejhhijacnbda">Chrome Revolver</a> - Browser extension: switches between browser tabs.</li>
<li><a href="https://github.com/lupoDharkael/flameshot">Flameshot</a> - Screen capture and annotation on Ubuntu.</li>
<li><a href="https://www.playonlinux.com/en/">PlayOnLinux</a> - For running Xenu and Notepad++ on Ubuntu.</li>
<li><a href="https://www.playonmac.com/en/">PlayOnMac</a> - For running Xenu and Notepad++ on macOS.</li>
<li><a href="https://support.microsoft.com/en-gb/help/13776/windows-use-snipping-tool-to-capture-screenshots">Windows Snipping Tool</a> - Windows built-in for partial screen capture and annotation. On macOS you can use Command + Shift + 4 (the keyboard shortcut for taking a partial screen capture).</li>
<li><a href="http://winebottler.kronenberg.org/">WineBottler</a> - For running Xenu and Notepad++ on macOS.</li>
<li><a href="https://github.com/jordansissel/xdotool">xdotool</a> - Click automation on Ubuntu.</li>
<li><a href="http://home.snafu.de/tilman/xenulink.html">Xenu</a> - Desktop link checker for Windows.</li>
</ul>
<h3 id="curation">Curation</h3>
<ul>
<li><a href="https://robustlinks.mementoweb.org/zotero/">Zotero Robust Links Extension</a> - A <a href="https://www.zotero.org/">Zotero</a> extension that submits to and reads from web archives. Source <a href="https://github.com/lanl/Zotero-Robust-Links-Extension">on GitHub</a>. Supersedes <a href="https://github.com/leonkt/zotero-memento">leonkt/zotero-memento</a>.</li>
</ul>
<h2 id="community-resources">Community Resources</h2>
<h3 id="other-awesome-lists">Other Awesome Lists</h3>
<ul>
<li><a href="https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community">Web Archiving Community</a></li>
<li><a href="https://github.com/machawk1/awesome-memento">Awesome Memento</a></li>
<li><a href="http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem">The WARC Ecosystem</a></li>
<li><a href="http://coptr.digipres.org/Category:Web_Crawl">The Web Crawl section of COPTR</a></li>
</ul>
<h3 id="blogs-and-scholarship">Blogs and Scholarship</h3>
<ul>
<li><a href="https://netpreserveblog.wordpress.com/">IIPC Blog</a></li>
<li><a href="https://webarchivingrt.wordpress.com/">Web Archiving Roundtable</a> - Unofficial blog of the Web Archiving Roundtable of the <a href="https://www2.archivists.org/">Society of American Archivists</a>, maintained by the members of the Roundtable.</li>
<li><a href="https://www.uclpress.co.uk/products/84010">The Web as History</a> - An open-access book that provides a conceptual overview of web archiving research, along with several case studies.</li>
<li><a href="https://ws-dl.blogspot.com/">WS-DL Blog</a> - The Web Science and Digital Libraries Research Group blogs about various web archiving topics, scholarly work, and academic trip reports.</li>
<li><a href="https://blog.dshr.org/">DSHR’s Blog</a> - David Rosenthal regularly reviews and summarizes work done in the digital preservation field.</li>
<li><a href="https://blogs.bl.uk/webarchive/">UK Web Archive Blog</a></li>
<li><a href="https://commoncrawl.org/blog">Common Crawl Foundation Blog</a> - <a href="http://commoncrawl.org/blog/rss.xml">RSS</a></li>
</ul>
<h3 id="mailing-lists">Mailing Lists</h3>
<ul>
<li><a href="https://groups.google.com/g/common-crawl">Common Crawl</a></li>
<li><a href="http://netpreserve.org/about-us/iipc-mailing-list/">IIPC</a></li>
<li><a href="https://groups.google.com/g/openwayback-dev">OpenWayback</a></li>
<li><a href="https://groups.google.com/g/wasapi-community">WASAPI</a></li>
</ul>
<h3 id="slack">Slack</h3>
<ul>
<li><a href="https://iipc.slack.com/">IIPC Slack</a> - Ask <a href="https://twitter.com/NetPreserve?s=20"><span class="citation" data-cites="netpreserve">@netpreserve</span></a> for access.</li>
<li><a href="https://archivesunleashed.slack.com/">Archives Unleashed Slack</a> - <a href="http://slack.archivesunleashed.org/">Fill out this request form</a> for access to a group of researchers working with web archives.</li>
<li><a href="https://archivers.slack.com">Archivers Slack</a> - <a href="https://archivers-slack.herokuapp.com/">Invite yourself</a> to a multi-disciplinary effort for archiving projects, run in affiliation with <a href="https://envirodatagov.org/archiving/">EDGI</a> and <a href="http://datatogether.org/">Data Together</a>.</li>
<li><a href="https://ccfpartners.slack.com/">Common Crawl Foundation Partners</a> - Ask greg zat commoncrawl zot org for an invite.</li>
</ul>
<h3 id="twitter">Twitter</h3>
<ul>
<li><a href="https://twitter.com/NetPreserve"><span class="citation" data-cites="NetPreserve">@NetPreserve</span></a> - Official IIPC handle.</li>
<li><a href="https://twitter.com/WebSciDL"><span class="citation" data-cites="WebSciDL">@WebSciDL</span></a> - ODU Web Science and Digital Libraries Research Group.</li>
<li><a href="https://twitter.com/search?q=%23webarchiving">#WebArchiving</a></li>
<li><a href="https://twitter.com/hashtag/webarchivewednesday">#WebArchiveWednesday</a></li>
</ul>
<h2 id="web-archiving-service-providers">Web Archiving Service Providers</h2>
<p>We intend to list only services that allow web archives to be exported in standard formats (WARC or WACZ). Listing here is not an endorsement, and readers should evaluate these options against their own needs.</p>
<h3 id="self-hostable-open-source">Self-hostable, Open Source</h3>
<ul>
<li><a href="https://webrecorder.net/browsertrix/">Browsertrix</a> - From <a href="https://webrecorder.net/">Webrecorder</a>; source available at <a href="https://github.com/webrecorder/browsertrix" class="uri">https://github.com/webrecorder/browsertrix</a>.</li>
<li><a href="https://conifer.rhizome.org/">Conifer</a> - From <a href="https://rhizome.org/">Rhizome</a>; source available at <a href="https://github.com/Rhizome-Conifer" class="uri">https://github.com/Rhizome-Conifer</a>.</li>
</ul>
<h3 id="hosted-closed-source">Hosted, Closed Source</h3>
<ul>
<li><a href="https://archive-it.org/">Archive-It</a> - From the Internet Archive.</li>
<li><a href="https://arkiwera.se/wp/websites/">Arkiwera</a></li>
<li><a href="https://www.hanzo.co/chronicle">Hanzo</a></li>
<li><a href="https://www.mirrorweb.com/solutions/capabilities/website-archiving">MirrorWeb</a></li>
<li><a href="https://www.pagefreezer.com/">PageFreezer</a></li>
<li><a href="https://www.smarsh.com/platform/compliance-management/web-archive">Smarsh</a></li>
</ul>
<p>Source: <a href="https://github.com/iipc/awesome-web-archiving">iipc/awesome-web-archiving</a> on GitHub.</p>