update lists
@@ -74,9 +74,11 @@ Murray</a></li>
 href="https://netpreserve.org/web-archiving/training-materials/">IIPC
 and DPC Training materials: module for beginners (8 sessions)</a></li>
 <li><a href="https://github.com/vphill/web-archiving-course">UNT Web
-Archiving Course 2022</a></li>
+Archiving Course</a></li>
 <li><a href="https://cedwarc.github.io/">Continuing Education to Advance
 Web Archiving (CEDWARC)</a></li>
+<li><a href="https://github.com/commoncrawl/whirlwind-python/">A
+Whirlwind Tour of Common Crawl’s Datasets using Python</a></li>
 </ul></li>
 <li>The WARC Standard:
 <ul>
@@ -106,8 +108,11 @@ organisations who publish on the web, and who want to make sure their
 site can be archived.</p>
 <ul>
 <li><a
-href="https://library.stanford.edu/projects/web-archiving/archivability">Stanford
-Libraries’ Archivability pages</a></li>
+href="https://nullhandle.org/web-archivability/index.html">Definition of
+Web Archivability</a> - This describes the ease with which web content
+can be preserved. (<a
+href="https://web.archive.org/web/20230728211501/https://library.stanford.edu/projects/web-archiving/archivability">Archived
+version from the Stanford Libraries</a>)</li>
 <li>The <a href="http://archiveready.com/">Archive Ready</a> tool, for
 estimating how likely a web page will be archived successfully.</li>
 </ul>
@@ -115,11 +120,15 @@ estimating how likely a web page will be archived successfully.</li>
 <p>This list of tools and software is intended to briefly describe some
 of the most important and widely-used tools related to web archiving.
 For more details, we recommend you refer to (and contribute to!) these
-excellent resources from other groups: * <a
+excellent resources from other groups:</p>
+<ul>
+<li><a
 href="https://github.com/archivers-space/research/tree/master/web_archiving">Comparison
-of web archiving software</a> * <a
+of web archiving software</a></li>
+<li><a
 href="https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring">Awesome
-Website Change Monitoring</a></p>
+Website Change Monitoring</a></li>
+</ul>
 <h3 id="acquisition">Acquisition</h3>
 <ul>
 <li><a href="https://github.com/pirate/ArchiveBox">ArchiveBox</a> - A
@@ -131,10 +140,12 @@ links using wget, Chrome headless, and other methods (formerly
 href="http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html">Python
 library</a> to push web resources into on-demand web archives.
 <em>(Stable)</em></li>
-<li><a href="https://archiveweb.page">ArchiveWeb.Page</a> - A plugin for
-Chrome and other Chromium based browsers that lets you interactively
-archive web pages, replay them, and export them as WARC data. Also
-available as an Electron based desktop application.</li>
+<li><a
+href="https://webrecorder.net/archivewebpage/">ArchiveWeb.Page</a> - A
+plugin for Chrome and other Chromium based browsers that lets you
+interactively archive web pages, replay them, and export them as WARC
+& WACZ files. Also available as an Electron based desktop
+application.</li>
 <li><a href="https://github.com/bellingcat/auto-archiver">Auto
 Archiver</a> - Python script to automatically archive social media
 posts, videos, and images from a Google Sheets document. Read the <a
@@ -142,9 +153,9 @@ href="https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-cont
 about Auto Archiver on bellingcat.com</a>.</li>
 <li><a
 href="https://github.com/webrecorder/browsertrix-crawler">Browsertrix
-Crawler</a> - A Chrome based high-fidelity crawling system, designed to
-run a complex, customizable browser-based crawl in a single Docker
-container.</li>
+Crawler</a> - A Chromium based high-fidelity crawling system, designed
+to run a complex, customizable browser-based crawl in a single Docker
+container. <em>(Stable)</em></li>
 <li><a href="https://github.com/internetarchive/brozzler">Brozzler</a> -
 A distributed web crawler (爬虫) that uses a real browser (Chrome or
 Chromium) to fetch pages and embedded urls and to extract links.
@@ -154,6 +165,9 @@ and CLI tool for saving webpages. <em>(Stable)</em></li>
 <li><a href="https://github.com/CGamesPlay/chronicler">Chronicler</a> -
 Web browser with record and replay functionality. <em>(In
 Development)</em></li>
+<li><a href="https://www.community-archive.org/">Community Archive</a> -
+Open Twitter Database and API with tools and resources for building on
+archived Twitter data.</li>
 <li><a href="https://github.com/turicas/crau">crau</a> - crau is the way
 (most) Brazilians pronounce crawl, it’s the easiest command-line tool
 for archiving the Web and playing archives: you just need a list of
@@ -209,7 +223,7 @@ for witnessing the web. <em>(Stable)</em></li>
 href="https://github.com/gildas-lormeau/SingleFile">SingleFile</a> -
 Browser extension for Firefox/Chrome and CLI tool to save a faithful
 copy of a complete page as a single HTML file. <em>(Stable)</em></li>
-<li><a href="http://mementoweb.github.com/SiteStory/">SiteStory</a> - A
+<li><a href="http://mementoweb.github.io/SiteStory/">SiteStory</a> - A
 transactional archive that selectively captures and stores transactions
 that take place between a web client (browser) and a web server.
 <em>(Stable)</em></li>
@@ -258,9 +272,6 @@ Open-source workflow management for selective web archiving.
 <li><a href="https://github.com/WebMemex">WebMemex</a> - Browser
 extension for Firefox and Chrome which lets you archive web pages you
 visit. <em>(In Development)</em></li>
-<li><a href="https://webrecorder.io/">Webrecorder</a> - Create
-high-fidelity, interactive recordings of any web site you browse.
-<em>(Stable)</em></li>
 <li><a href="http://www.gnu.org/software/wget/">Wget</a> - An open
 source file retrieval utility that of <a
 href="http://www.archiveteam.org/index.php?title=Wget_with_WARC_output">version
@@ -280,32 +291,40 @@ href="https://ipfs.io/">IPFS</a>.</li>
 open source project aimed to develop Wayback Machine, the key software
 used by web archives worldwide to play back archived websites in the
 user’s browser. <em>(Stable)</em></li>
-<li><a href="https://github.com/ikreymer/pywb">PyWb</a> - A Python (2
-and 3) implementation of web archival replay tools, sometimes also known
-as ‘Wayback Machine’. <em>(Stable)</em></li>
+<li><a href="https://github.com/webrecorder/pywb">PYWB</a> - A Python 3
+implementation of web archival replay tools, sometimes also known as
+‘Wayback Machine’. <em>(Stable)</em></li>
+<li><a
+href="https://oduwsdl.github.io/Reconstructive/">Reconstructive</a> -
+Reconstructive is a ServiceWorker module for client-side reconstruction
+of composite mementos by rerouting resource requests to corresponding
+archived copies (JavaScript).</li>
-<li><a href="https://replayweb.page/">ReplayWeb.Page</a> - A
-browser-based, fully client-side replay engine for both local and remote
-WARC files.</li>
+<li><a href="https://webrecorder.net/replaywebpage/">ReplayWeb.page</a>
+- A browser-based, fully client-side replay engine for both local and
+remote WARC & WACZ files. Also available as an Electron based
+desktop application. <em>(Stable)</em></li>
+<li><a href="https://github.com/iipc/warc2html">warc2html</a> - Converts
+WARC files to static HTML suitable for browsing offline or
+rehosting.</li>
 </ul>
 <h3 id="search-discovery">Search & Discovery</h3>
 <ul>
 <li><a href="https://github.com/medialab/hyphe">hyphe</a> - A webcrawler
 built for research uses with a graphical user interface in order to
 build web corpuses made of lists of web actors and maps of links between
 them. <em>(Stable)</em></li>
 <li><a href="https://github.com/machawk1/mink">Mink</a> - A <a
 href="https://www.google.com/intl/en/chrome/">Google Chrome</a>
 extension for querying Memento aggregators while browsing and
-integrating live-archived web navigation. <em>(Stable)</em>
-<!--lint ignore double-link--></li>
+integrating live-archived web navigation. <em>(Stable)</em></li>
 <li><a href="https://github.com/Guillaume-Levrier/PANDORAE">PANDORÆ</a>
 - A desktop research software to be plugged on a Solr endpoint to query,
 retrieve, normalize and visually explore web archives.
 <em>(Stable)</em></li>
 <li><a href="https://github.com/wabarc/playback">playback</a> - A
-toolkit for searching archived webpages from <a
-href="https://web.archive.org">Internet Archive</a>, <a
-href="https://archive.today">archive.today</a>, <a
+toolkit for searching archived webpages from
+<!--lint ignore double-link--><a href="https://web.archive.org">Internet
+Archive</a>, <a href="https://archive.today">archive.today</a>, <a
 href="http://timetravel.mementoweb.org">Memento</a> and beyond. <em>(In
 Development)</em></li>
 <li><a href="https://securitytrails.com/">SecurityTrails</a> - Web based
@@ -324,8 +343,7 @@ in Tempas</a>). <em>(Stable)</em></li>
 href="https://github.com/ukwa/webarchive-discovery">webarchive-discovery</a>
 - WARC and ARC full-text indexing and discovery tools, with a number of
 associated tools capable of using the index shown below.
-<em>(Stable)</em>
-<ul>
+<em>(Stable)</em></li>
 <li><a href="https://github.com/ukwa/shine">Shine</a> - A prototype web
 archives exploration UI, developed with researchers as part of the <a
 href="https://buddah.projects.history.ac.uk/">Big UK Domain Data for the
@@ -352,19 +370,18 @@ system</a>. <em>(In Development)</em></li>
 <li>Other possible options for builting a front-end are listed on in the
 <code>webarchive-discovery</code> wiki, <a
 href="https://github.com/ukwa/webarchive-discovery/wiki/Front-ends">here</a>.</li>
-</ul></li>
 </ul>
 <h3 id="utilities">Utilities</h3>
 <ul>
 <li><a href="https://github.com/recrm/ArchiveTools">ArchiveTools</a> -
-Collection of tools to extract and interact with WARC files (Python).
-<!--lint ignore double-link--></li>
+Collection of tools to extract and interact with WARC files
+(Python).</li>
 <li><a href="https://pypi.org/project/cdx-toolkit/">cdx-toolkit</a> -
 Library and CLI to consult cdx indexes and create WARC extractions of
 subsets. Abstracts away Common Crawl’s unusual crawl structure.
 <em>(Stable)</em></li>
 <li><a href="https://github.com/karust/gogetcrawl">Go Get Crawl</a> -
-Extract web archive data using <a
+Extract web archive data using <!--lint ignore double-link--><a
 href="https://web.archive.org/">Wayback Machine</a> and <a
 href="https://commoncrawl.org/">Common Crawl</a>. <em>(Stable)</em></li>
 <li><a href="https://github.com/nlnwa/gowarcserver">gowarcserver</a> -
@@ -372,26 +389,29 @@ href="https://commoncrawl.org/">Common Crawl</a>. <em>(Stable)</em></li>
 index (CDX) and WARC record server, used to index and serve WARC files
 (Go).</li>
 <li><a href="https://github.com/webrecorder/har2warc">har2warc</a> -
-Convert HTTP Archive (HAR) -> Web Archive (WARC) format
-(Python).</li>
-<li><a href="https://httpreserve.info/">httpreserve.info</a> - Service
-to return the status of a web page or save it to the Internet Archive.
+Convert HTTP Archive (HAR) -> Web Archive (WARC) format (Python).
+<!--lint ignore double-link--></li>
+<li><a href="https://httpreserve.info">httpreserve.info</a> - Service to
+return the status of a web page or save it to the Internet Archive.
 HTTPreserve includes disambiguation of well-known short link services.
 It returns JSON via the browser or command line via CURL using GET.
 Describes web sites using earliest and latest dates in the Internet
 Archive and demonstrates the construction of Robust Links in its output
 using that range. (Golang). <em>(Stable)</em></li>
 <li><a href="https://github.com/httpreserve/linkstat">HTTPreserve
-linkstat</a> - Command line implementation of <a
+linkstat</a> - Command line implementation of
+<!--lint ignore double-link--><a
 href="https://httpreserve.info">httpreserve.info</a> to describe the
 status of a web page. Can be easily scripted and provides JSON output to
 enable querying through tools like JQ. HTTPreserve Linkstat describes
-current status, and earliest and latest links on <a
+current status, and earliest and latest links on
+<!--lint ignore double-link--><a
 href="https://archive.org/">archive.org</a>. (Golang).
 <em>(Stable)</em></li>
 <li><a href="https://github.com/jjjake/internetarchive">Internet Archive
 Library</a> - A command line tool and Python library for interacting
-directly with <a href="https://archive.org">archive.org</a>. (Python).
+directly with <!--lint ignore double-link--><a
+href="https://archive.org">archive.org</a>. (Python).
 <em>(Stable)</em></li>
 <li><a href="https://github.com/nla/httrack2warc">httrack2warc</a> -
 Convert HTTrack archives to WARC format (Java).</li>
@@ -412,14 +432,9 @@ href="https://github.com/ukwa/ukwa-heritrix/blob/master/src/main/java/uk/bl/wap/
 href="https://github.com/unt-libraries/py-wasapi-client">py-wasapi-client</a>
 - Command line application to download crawls from WASAPI (Python).
 <em>(Stable)</em></li>
-<li><a href="https://archivebrowser.c3.cx/">The Archive Browser</a> -
-The Archive Browser is a program that lets you browse the contents of
-archives, as well as extract them. It will let you open files from
-inside archives, and lets you preview them using Quick Look. WARC is
-supported (macOS only, Proprietary app).</li>
-<li><a href="http://unarchiver.c3.cx/unarchiver">The Unarchiver</a> -
-Program to extract the contents of many archive formats, inclusive of
-WARC, to a file system. Free variant of The Archive Browser (macOS only,
+<li><a href="https://theunarchiver.com/">The Unarchiver</a> - Program to
+extract the contents of many archive formats, inclusive of WARC, to a
+file system. Free variant of The Archive Browser (macOS only,
 Proprietary app).</li>
 <li><a
 href="https://github.com/httpreserve/tikalinkextract">tikalinkextract</a>
@@ -439,6 +454,8 @@ database. <em>(Stable)</em></li>
 <li><a href="https://gitlab.com/taricorp/warcdedupe">warcdedupe</a> -
 WARC deduplication tool (and WARC library) written in Rust. (In
 Development)</li>
+<li><a href="https://github.com/natliblux/warc-safe">warc-safe</a> -
+Automatic detection of viruses and NSFW content in WARC files.</li>
 <li><a
 href="https://github.com/helgeho/WarcPartitioner">WarcPartitioner</a> -
 Partition (W)ARC Files by MIME Type and Year. <em>(Stable)</em></li>
@@ -462,8 +479,12 @@ Splitable Hadoop InputFormat for Concatenated GZIP Files (and
 <code>*.warc.gz</code>). <em>(Stable)</em></li>
 <li><a href="https://github.com/iipc/jwarc">jwarc</a> - Read and write
 WARC files with a type safe API (Java).</li>
-<li><a href="https://sbforge.org/display/JWAT/JWAT">Jwat</a> - Libraries
-and tools for reading/writing/validating WARC/ARC/GZIP files (Java).
+<li><a href="https://github.com/netarchivesuite/jwat">Jwat</a> -
+Libraries for reading/writing/validating WARC/ARC/GZIP files (Java).
 <em>(Stable)</em></li>
+<li><a
+href="https://github.com/netarchivesuite/jwat-tools">Jwat-Tools</a> -
+Tools for reading/writing/validating WARC/ARC/GZIP files (Java).
+<em>(Stable)</em></li>
 <li><a href="https://github.com/N0taN3rd/node-warc">node-warc</a> -
 Parse WARC files or create WARC files using either <a
@@ -478,6 +499,9 @@ Command line interface to unzip WARC and WACZ files (Python).</li>
 <li><a href="https://github.com/chfoo/warcat">Warcat</a> - Tool and
 library for handling Web ARChive (WARC) files (Python).
 <em>(Stable)</em></li>
+<li><a href="https://github.com/chfoo/warcat-rs">Warcat-rs</a> -
+Command-line tool and Rust library for handling Web ARChive (WARC)
+files. <em>(In Development)</em></li>
 <li><a href="https://github.com/webrecorder/warcio">warcio</a> -
 Streaming WARC/ARC library for fast web archive IO (Python).
 <em>(Stable)</em></li>
@@ -598,6 +622,8 @@ regularly reviews and summarizes work done in the Digital Preservation
 field.</li>
 <li><a href="https://blogs.bl.uk/webarchive/">UK Web Archive
 Blog</a></li>
+<li><a href="https://commoncrawl.org/blog">Common Crawl Foundation
+Blog</a> - <a href="http://commoncrawl.org/blog/rss.xml">rss</a></li>
 </ul>
 <h3 id="mailing-lists">Mailing Lists</h3>
 <ul>
@@ -624,6 +650,8 @@ href="https://archivers-slack.herokuapp.com/">Invite yourself</a> to a
 multi-disciplinary effort for archiving projects run in affiliation with
 <a href="https://envirodatagov.org/archiving/">EDGI</a> and <a
 href="http://datatogether.org/">Data Together</a>.</li>
+<li><a href="https://ccfpartners.slack.com/">Common Crawl Foundation
+Partners</a> (ask greg zat commoncrawl zot org for an invite)</li>
 </ul>
 <h3 id="twitter">Twitter</h3>
 <ul>
@@ -646,10 +674,10 @@ endorsement of these services, and readers should check and evaluate
 these options based on their needs.</p>
 <h3 id="self-hostable-open-source">Self-hostable, Open Source</h3>
 <ul>
-<li><a href="https://browsertrix.cloud/">Browsertrix Cloud</a> - From <a
-href="https://webrecorder.net/">Webrecorder</a>, source available at <a
-href="https://github.com/webrecorder/browsertrix-cloud"
-class="uri">https://github.com/webrecorder/browsertrix-cloud</a>.</li>
+<li><a href="https://webrecorder.net/browsertrix/">Browsertrix</a> -
+From <a href="https://webrecorder.net/">Webrecorder</a>, source
+available at <a href="https://github.com/webrecorder/browsertrix"
+class="uri">https://github.com/webrecorder/browsertrix</a>.</li>
 <li><a href="https://conifer.rhizome.org/">Conifer</a> - From <a
 href="https://rhizome.org/">Rhizome</a>, source available at <a
 href="https://github.com/Rhizome-Conifer"
@@ -667,3 +695,6 @@ href="https://www.mirrorweb.com/solutions/capabilities/website-archiving">Mirror
 <li><a
 href="https://www.smarsh.com/platform/compliance-management/web-archive">Smarsh</a></li>
 </ul>
+<p><a
+href="https://github.com/iipc/awesome-web-archiving">webarchiving.md
+Github</a></p>