From: Amos Jeffries
Date: Mon, 13 Jun 2022 15:55:06 +0000 (+0000)
Subject: Maintenance: Improve CONTRIBUTORS and its updates (#980)
X-Git-Tag: SQUID_6_0_1~167
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=6e541c01f8e69f6781623ded7ce21a546a911805;p=thirdparty%2Fsquid.git

Maintenance: Improve CONTRIBUTORS and its updates (#980)

The Squid Project used a script to collect CONTRIBUTORS entries, but it
was not called from source-maintenance.sh, it did not really understand
the structure of those entries, and its results required significant
manual polishing efforts. The CONTRIBUTORS file kept deteriorating.

This change consists of three major parts detailed further below:

* a major (semi-manual) CONTRIBUTORS update and cleanup;
* scripts/update-contributors.pl: Merges new entries into CONTRIBUTORS;
* collectAuthors() in source-maintenance.sh: Finds new entries.

Part 1: CONTRIBUTORS update: We collected (and then pruned/polished) all
contributors from the following (master and v3+ branches) sources:

* all commit authors;
* all commit co-authors (from Co-authored-by trailer and note entries);
* all CONTRIBUTORS file revisions (the latest one was missing entries).

Part 2: update-contributors.pl understands and enforces a more formal
CONTRIBUTORS structure. After a non-indented preamble text ending with
an empty line, indented CONTRIBUTORS entries now use these formats:

    name <email>
    name
    <email>

The entries are case-insensitively sorted by the two fields (name,
email), with several conflict-resolution rules aimed at achieving
natural and stable order. Non-ASCII entries are still banned (for now)
because properly supporting them requires a serious dedicated effort.

The program merges current CONTRIBUTORS and all well-formed contributor
entries (read from the standard input) except these:

* entries with an already represented email
* entries with an already represented name
* entries containing "squidadm" in name or email

The matching is case-insensitive.
These filtering rules appear to work better in Squid CONTRIBUTORS
context than more accurate/strict rules do.

Part 3: collectAuthors() feeds update-contributors.pl with new
contributors collected from "git log" output. The function only looks at
the current git branch commits made after a "vetted" point. That point
is updated by certain CONTRIBUTORS commits, as detailed in the
collectAuthors() description inside source-maintenance.sh. It can also
be specified manually via the new --update-contributors-since option.

It is not critical (and is probably impossible) for CONTRIBUTORS
automation to properly detect and merge every single new contributor.
When the scripts get it wrong, a human can always update CONTRIBUTORS
manually. Rare mistakes are not a big deal in this context. For example,
if a past contributor now needs to be listed with another email (e.g.
representing a new employer), we manually add a second entry.

This change is a reference point for automated CONTRIBUTORS updates.
---

diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index e9c28fa048..b770ef71ba 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -2,13 +2,17 @@ This file contains a list of Squid contributors: people and organizations that have volunteered their time, effort, code, and ideas to make Squid software. Thank you! + Aaron Costello <56684862+aaron-costello@users.noreply.github.com> Adam Ciarcinski + Adam Majer Adrian Chadd Aecio F. Alan Mizrahi Alan Nastac Aleksa - Aleksa ??u??uli?? + Alex Dowad + Alex Rousskov + Alex Wu Alexander B. Demenshin Alexander Gozman Alexander Gozman
@@ -17,23 +21,21 @@ Thank you!
Alexander Lukyanov Alexandre Chappaz Alexandre SIMON - Alex Dowad Alexey Veselovsky Alexis Robert - Alex Rousskov - Alex Rousskov - Alex Wu Alin Nastac Alter + Ambrose Li + Amish <3330468+amishmm@users.noreply.github.com> + Amish Amit Klein - Amos Jeffries Amos Jeffries Amos Jeffries Amos Jeffries - Amos Anatoli - Andrea Gagliardi Andre Albsmeier + Andrea Gagliardi + Andreas Hasenack Andreas Jaeger Andreas Lamprecht Andreas Weigel @@ -47,20 +49,21 @@ Thank you! Andrey Andrey Shorin Anonymous - Anonymous Pootle User Anonymous + Anonymous Pootle User Ansgar Hockmann Anthony Baxter + antiago Garcia Mantinan Antonino Iannella - Arjan de Vet Arjan de Vet + Arjan de Vet Arkin + Armin Wolfermann Arno Streuli Arthur Arthur Arthur Tumanyan Assar Westerlund - Automatic source maintenance Axel Westerhold Aymeric Vincent Barry Dobyns @@ -73,25 +76,26 @@ Thank you! Bojan Smojver Brad Smith Bratislav + Brian Brian Degenhardt Brian Denehy - Brian Bruce Murphy - Carson Gaspar (carson@lehman.com, carson@cs.columbia.edu) + Carson Gaspar + Carson Gaspar Carsten Grzemba Cephas Chad E. Naugle Chad Naugle Changming Chao + chi-mf <43963496+chi-mf@users.noreply.github.com> Chris Addie Chris Hills Christian Wittmer - Christopher Kerr - Christophe Saout Christoph Lechleitner + Christophe Saout + Christopher Kerr Christos Tsantilas - Christos Tsantilas Chudy Fernandez Cloyce Clytie Siddall @@ -99,13 +103,15 @@ Thank you! Constantin Rack Cord Beermann Craig Gowing + D Kazarov + Dale + Dan Searle Daniel Beschorner Daniel O'Callaghan Daniel Walter - Dan Searle - Dan Searle + Daris A Nevil Dave Dykstra - David Carlier + David CARLIER David Hill David Isaacs David J N Begley @@ -115,18 +121,19 @@ Thank you! Declan White Dennis Felippa Dennis Glatting + desbma-s1n <62935004+desbma-s1n@users.noreply.github.com> Dhaval Varia Diego Woitasen + Dimitry Andric Diogenes S. Jesus - D Kazarov Dmitry Kurochkin Don Hopkins Doug Dixon Doug Urner + Dr. Tilmann Bubeck Dragutin Cirkovic DrDaveD drserge - Dr. 
Tilmann Bubeck Duane Wessels Dustin J. Mitchell Ed Knowles @@ -138,65 +145,74 @@ Thank you! Eldar Akchurin Eliezer Croitoru Elmar Vonlanthen + Emil Hessman <248952+ceh@users.noreply.github.com> Emilio Casbas Emmanuel Fuste Endre Balint Nagy - Eray Aslan + Eneas U de Queiroz Eray Aslan + Eray Aslan Eric Stern Erik Hofman Eugene Gladchenko Evan Jones Evgeni Eygene Ryabinkin + F Wolff Fabian Hugelshofer + Fabrice Fontaine fancyrabbit Felix Meschberger Feshchuk Yuriy Finn Thain Flavio Pescuma Florent + flozilla folkert Francesco Chemolli - Francesco Francesco Salvestrini Francis Daly Francois Cami Frank Balluffi Frank Schmirler - Frederic Bourgeois Fred - F Wolff + Frederic Bourgeois + FX Coudert Fyodor Garri Djavadyan Geoff Keating + George Machitidze George Michaelson Georgy Salnikov Gerard Eviston Gerben Wierda Gergely + ghulands Giancarlo Razzolini Gilles Espinasse gkeeling Glen Gibb - Glenn Chisholm Glen Newton + Glenn Chisholm Glenn Newton Golub Mikhail Gonzalo Arana Graham Keeling Guido Serassio Guido Serassio + Guido Vranken + guijan Gustavo Zacarias Guy Helmer Hank Hampel + Hans-Werner Braun Hasso Tepper + Heinrich Schuchardt helix84 Henrik Nordstrom Henrik Nordstrom Hide Nagaoka HONDA Hirofumi - huaraz Hussam Al-Tayeb Ian Castle Ian Clark @@ -207,29 +223,31 @@ Thank you! isaac Isnard Ivan Larionov - Ivan Mas??r Jakob Bohm Jakub Wilk James Bowe James Brotchie + James DeFelice James R Grinter Jamie Strandboge Jan Klemkow Jan Niehusmann Jan Sievers Javad Kouhi + Javier Pacheco Jean-Francois Micouleau Jean-Gabriel Dick Jean-Philippe Menil Jeff Licquia - Jens-S. V?ckler + Jens-S. Voeckler Jeremy Allison Jerry Murdock + jijiwawa <33614409+jijiwawa@users.noreply.github.com> Jiri Skala Jiri Skala jltallon Joachim Bauch - Joachim Bauch (mail@joachim-bauch.de) + Joachim Bauch Joao Alves Neto Jochen Obalek Jochen Voss @@ -237,20 +255,22 @@ Thank you! 
Joe Ramey Joe Ramey Joerg Lehrke - Johnathan Conley John Dilley - John@MCC.ac.uk John M Cooper - John@Pharmweb.NET John Saunders John Xue - Jonathan Larmour - Jonathan Wolfe + + + Johnathan Conley Jon Kinred Jon Thackray + Jonathan Larmour + Jonathan Newton + Jonathan Wolfe Jorge Ivan Burgos Aguilar Jose Luis Godoy Jose-Marcio Martins da Cruz + josepjones Joshua Rogers Joshua Root Joshua Root @@ -273,42 +293,45 @@ Thank you! Lubos Uhliarik Luigi Gangitano Luis Daniel Lucio Quiroz - Lukas B??gelei + Lukas Bogelein Luke Howard Lutz Donnerhacke + mahdi1001 Manu Garg + Manuel Meitinger + Marc van Selm Marcello Romani Marcin Wisnicki Marco Beck Marcos Mello Marcos Mello Marcus Kool - Marc van Selm Marin Stavrev Marios Makassikis Mark Bergsma Mark Nottingham - Marko Cupac - Marko Mark Treacy + Marko + Marko Cupac Markus Gyger Markus Mayer Markus Moeller - Markus Moeller (markus_moeller at compuserve.com) + Markus Moeller Markus Rietzler Markus Stumpf - Martin Hamilton Martin Hamilton - Martin Huter + Martin Hamilton Martin Huter + Martin Huter Martin Stolle Martin von Gagern Masashi Fujita Massimo Zito Mathias Fischer Matthew Morgan - Matthias Pitzl Matthias "Silamael" + Matthias Pitzl + Matthieu Herrb Max Okumoto Merik Karman @@ -325,7 +348,6 @@ Thank you! Miguel A.L. Paraz Mike Groeneweg Mike Mitchell - Mike Mitchell Mikio Kishi Milen Pankov Ming Fu @@ -333,44 +355,53 @@ Thank you! mkishi Moez Mahfoudh Mohsen Saeedi + Mrcus Kool + mrumph Mukaigawa Shin'ichi Nathan Hoad Neil Murray + new23d nglnx - Rosetta Project Niall Doherty Nick Rogers + Nick Wellnhofer + Nikita <32056979+Roo4L@users.noreply.github.com> Nikolai Gorchilov - 'noloader' + noloader Ole Christensen Oliver Dumschat Oliver Hookins Olivier Montanuy Olivier W. 
+ Opendium OpenSolaris Project Oskar Pearson + Patrick Scott Best Patrick Welche - Paulo Matias Paul Z - Pavel Timofeev + Paulo Matias + Pavel Simerda Pavel Timofeev Pawel Worach Pedro Lineu Orso Pedro Ribeiro Pete Bentley + Peter Eisenhauer Peter Hidas Peter Payne Peter Pramberger + Phil Oester Philip Allison Philippe Lantin - Phil Oester Pierangelo Masarati Pierre LALET Pierre-Louis Brenac - Pierre-Louis BRENAC Poul-Henning Kamp Priyanka Gupta Przemek Czerkas - Rabellino Sergio (rabellino@di.unito.it) + Quentin THEURET + R Phillips + Rabellino Sergio Rafael Martinez Rafael Martinez Torres Rafal Ramocki @@ -379,29 +410,29 @@ Thank you! Ralf Wildenhues Ralph Loader Ramon de Carvalho + Raphael Hertzog Regardt van de Vyver Regents of the University of California (UCSD) Reinhard Posmyk Reinhard Sojka Rene Geile - Ren? Geile Reuben Farrelly - Richard Huveneers Richard Huveneers + Richard Huveneers Richard Sharpe Richard Wall + Rob Cowart Robert Collins Robert Collins - Robert + Robert Dessa Robert Forster Robert Walsh Robin Elfrink Rodrigo Campos - Rodrigo Campos (rodrigo@geekbunker.org) Rodrigo Rubira Branco Rodrigo Rubira Branco Ron Gomes - R Phillips + Rosen Penev Russell Street Russell Vincent Rusty Bird @@ -409,36 +440,42 @@ Thank you! Rybakov Andrey Samba Project Santiago Garcia Mantinan + sborrill <33655983+sborrill@users.noreply.github.com> Scott James Remnant Scott Schram Sean Critica Sebastian Krahmer Sebastien Wenske + Serassio Guido + Sergey Kirpa <44341362+Sergey-Kirpa@users.noreply.github.com> Sergey Merzlikin Sergio Durigan Junior Sergio Rabellino Shigechika Aikawa Silamael Simon Deziel - SquidAdm - squidadm + squidcontrib <56416132+squidcontrib@users.noreply.github.com> Stefan Fritsch Stefan Kruger Stefano Cordibella + Stepan Broz <32738079+brozs@users.noreply.github.com> Stephen Baynes Stephen R. 
van den Berg Stephen Thorne + Stephen Welker Steve Bennett Steve Hill + Steve Snyder Steven Lawrance Steven Wilton - Steve Snyder Stewart Forster Stuart Henderson Stuart Henderson + sujiacong Susant Sahani Sven Eisenberg Svenx + swilton Taavi Talvik Takahiro Kambe Taketo Kabe @@ -447,47 +484,53 @@ Thank you! The Squid Software Foundation Thomas De Schampheleire Thomas Hozza - Thomas-Martin Seck Thomas Ristic Thomas Weber + Thomas Zajic + Thomas-Martin Seck Tianyin Xu Tilmann Bubeck Tim Brown + Tim Starling Timo Teras Timo Tseras - Tim Starling Todd C. Miller Tomas Hozza + tomofumi-yoshida <51390036+tomofumi-yoshida@users.noreply.github.com> Tony Lorimer + trapexit Trever Adams Tsantilas Christos + Unknown Unknown - Debian Project - Unknown FreeBSD Contributor Unknown - NetBSD Project + Unknown FreeBSD Contributor Vadim Aleksandrov - Various + Vadim Salavatov Various Translators Victor Jose Hernandez Gomez Vince Brimhall Vincent Regnard - Vitaliy Matytsyn (main) Vitaliy Matytsyn + Vitaliy Matytsyn (main) Vitaly Lavrov vollkommen Walter Wang DaQing Warren Baker Wesha - William Lima Will Roberts + William Lima Wojciech Zatorski Wojtek Sylwestrzak Wolfgang Breyha Wolfgang Nothdurft + wouldsmina Xavier Redon yabuki Yannick Bergeron Yuhua Wu Yuriy M. 
Kaminskiy + Zachary Lee Andrews Zhanpeng Chen
diff --git a/scripts/Makefile.am b/scripts/Makefile.am
index e40d0542c6..816830e56d 100644
--- a/scripts/Makefile.am
+++ b/scripts/Makefile.am
@@ -11,6 +11,7 @@ EXTRA_DIST = AnnounceCache.pl access-log-matrix.pl cache-compare.pl \
 	find-alive.pl trace-job.pl trace-master.pl \
 	trace-context.pl \
 	icpserver.pl tcp-banger.pl udp-banger.pl upgrade-1.0-store.pl \
+	update-contributors.pl \
 	calc-must-ids.pl calc-must-ids.sh
 
 dist_noinst_SCRIPTS = remove-cfg.sh
diff --git a/scripts/source-maintenance.sh b/scripts/source-maintenance.sh
index 53bb153334..03ad33dff7 100755
--- a/scripts/source-maintenance.sh
+++ b/scripts/source-maintenance.sh
@@ -39,11 +39,18 @@ ASTYLE='astyle'
 # whether to check and, if necessary, update boilerplate copyright years
 CheckAndUpdateCopyright=yes
 
+# How to sync CONTRIBUTORS with the current git branch commits:
+# * never: Do not update CONTRIBUTORS at all.
+# * auto: Check commits added since the last similar update.
+# * SHA1/etc: Check commits added after the specified git commit.
+UpdateContributorsSince=auto
+
 printUsage () {
     echo "Usage: $0 [option...]"
     echo "options:"
     echo " --keep-going|-k"
     echo " --check-and-update-copyright "
+    echo " --update-contributors-since "
     echo " --with-astyle "
 }
 
@@ -65,6 +72,16 @@ while [ $# -ge 1 ]; do
         CheckAndUpdateCopyright=$2
         shift 2
         ;;
+    --update-contributors-since)
+        if test "x$2" = x
+        then
+            printUsage
+            echo "Error: Option $1 expects an argument."
+            exit 1;
+        fi
+        UpdateContributorsSince="$2"
+        shift 2;
+        ;;
     --help|-h)
         printUsage
         exit 0;
@@ -493,6 +510,82 @@ make -C src/http gperf-files
 
 run_ checkMakeNamedErrorDetails || exit 1
 
+# This function updates CONTRIBUTORS based on the recent[1] branch commit log.
+# Fresh contributor entries are filtered using the latest vetted CONTRIBUTORS
+# file on the current branch. The following CONTRIBUTORS commits are
+# considered vetted:
+#
+# * authored (in "git log --author" sense) by squidadm,
+# * matching (in "git log --grep" sense) $vettedCommitPhraseRegex set below.
+#
+# A human authoring an official GitHub pull request containing a new
+# CONTRIBUTORS version (that they want to be used as a new vetting point)
+# should add a phrase matching $vettedCommitPhraseRegex to the PR description.
+#
+# [1] As defined by the --update-contributors-since script parameter.
+collectAuthors ()
+{
+    if test "x$UpdateContributorsSince" = xnever
+    then
+        return 0; # successfully did nothing, as requested
+    fi
+
+    vettedCommitPhraseRegex='[Rr]eference point for automated CONTRIBUTORS updates'
+
+    since="$UpdateContributorsSince"
+    if test "x$UpdateContributorsSince" = xauto
+    then
+        # find the last CONTRIBUTORS commit vetted by a human
+        humanSha=`git log -n1 --format='%H' --grep="$vettedCommitPhraseRegex" CONTRIBUTORS`
+        # find the last CONTRIBUTORS commit attributed to this script
+        botSha=`git log -n1 --format='%H' --author=squidadm CONTRIBUTORS`
+        if test "x$humanSha" = x && test "x$botSha" = x
+        then
+            echo "ERROR: Unable to determine the commit to start contributors extraction from"
+            return 1;
+        fi
+
+        # find the latest commit among the above one or two commits
+        if test "x$humanSha" = x
+        then
+            since=$botSha
+        elif test "x$botSha" = x
+        then
+            since=$humanSha
+        elif git merge-base --is-ancestor $humanSha $botSha
+        then
+            since=$botSha
+        else
+            since=$humanSha
+        fi
+        echo "Collecting contributors since $since"
+    fi
+    range="$since..HEAD"
+
+    # We add four leading spaces below to mimic CONTRIBUTORS entry style.
+    # add commit authors:
+    git log --format='    %an <%ae>' $range > authors.tmp
+    # add commit co-authors:
+    git log $range | \
+        grep -Ei '^[[:space:]]*Co-authored-by:' | \
+        sed -r 's/^\s*Co-authored-by:\s*/    /i' >> authors.tmp
+    # but do not add committers (--format='    %cn <%ce>').
+
+    # add collected new (co-)authors, if any, to CONTRIBUTORS
+    if ./scripts/update-contributors.pl < authors.tmp > CONTRIBUTORS.new
+    then
+        updateIfChanged CONTRIBUTORS CONTRIBUTORS.new \
+            "A human PR description should match: $vettedCommitPhraseRegex"
+    fi
+    result=$?
+
+    rm -f authors.tmp
+    return $result
+}
+
+# Update CONTRIBUTORS content
+run_ collectAuthors || exit 1
+
 # Run formatting
 srcFormat || exit 1
 
diff --git a/scripts/update-contributors.pl b/scripts/update-contributors.pl
new file mode 100755
index 0000000000..84e1a842c5
--- /dev/null
+++ b/scripts/update-contributors.pl
@@ -0,0 +1,361 @@
+#!/usr/bin/perl -w
+#
+## Copyright (C) 1996-2022 The Squid Software Foundation and contributors
+##
+## Squid software is distributed under GPLv2+ license and includes
+## contributions from numerous individuals and organizations.
+## Please see the COPYING and CONTRIBUTORS files for details.
+##
+
+use strict;
+use warnings;
+
+# Reads (presumed to be previously vetted) CONTRIBUTORS file.
+# Reads untrusted CONTRIBUTORS-like new input (without the preamble).
+# Reports and ignores invalid new contributor entries.
+# Reports and ignores valid new entries already covered by CONTRIBUTORS.
+# Prints CONTRIBUTORS preamble, vetted entries, and imported new contributors
+# using CONTRIBUTORS file format.
+
+my $VettedLinesIn = 0;
+my $NewLinesIn = 0;
+my $LinesOut = 0;
+my $SkippedBanned = 0;
+my $SkippedAlreadyVetted = 0;
+my $SkippedNewDuplicates = 0;
+my $SkippedEmptyLines = 0;
+my $SkippedBadLines = 0;
+
+my @VettedContributors = ();
+my @NewContributors = ();
+my %Problems = ();
+
+exit &main();
+
+# whether the new entry is already sufficiently represented by the vetted one
+sub similarToVetted
+{
+    my ($c, $vetted) = @_;
+
+    # It is not critical (and is probably impossible) to get this right for
+    # every single use case. When the script gets it wrong, a human can always
+    # update CONTRIBUTORS manually. Rare mistakes are not a big deal.
+
+    # same email is enough, regardless of name differences
+    if (defined($c->{email}) && defined($vetted->{email})) {
+        my $diff = &caseCmp($c->{email}, $vetted->{email});
+        return 1 if $diff == 0;
+    }
+
+    # same name is enough, regardless of email differences
+    if (defined($c->{name}) && defined($vetted->{name})) {
+        my $diff = &caseCmp($c->{name}, $vetted->{name});
+        return 1 if $diff == 0;
+    }
+
+    return 0;
+}
+
+# ensures final, stable order while guaranteeing no duplicates
+sub cmpContributorsForPrinting
+{
+    my ($l, $r) = @_;
+
+    my $diff = &cmpContributors($l, $r);
+    return $diff if $diff;
+
+    # now case-sensitively
+    $diff = &contributorToString($l) cmp &contributorToString($r);
+    return $diff if $diff;
+    die("duplicates in output");
+}
+
+# case-insensitive comparison
+# for list stability, use cmpContributorsForPrinting() when ordering entries
+sub cmpContributors
+{
+    my ($l, $r) = @_;
+
+    # Compare based on the first field (name or, if nameless, email)
+    # Do not use &contributorToString() on nameless entries because the
+    # leading "<" in such entries will group them all together. We want
+    # nameless entries to use email (without brackets) for this comparison.
+    my $lRep = defined($l->{name}) ? $l->{name} : $l->{email};
+    my $rRep = defined($r->{name}) ? $r->{name} : $r->{email};
+    die() unless defined($lRep) && defined($rRep);
+    my $diff = &caseCmp($lRep, $rRep);
+    return $diff if $diff;
+
+    # nameless entries go after (matching) named entries
+    return -1 if defined($l->{name}) && !defined($r->{name});
+    return +1 if !defined($l->{name}) && defined($r->{name});
+    return 0 if !defined($l->{name}) && !defined($r->{name});
+
+    # we are left with the same-name entries
+    die() unless defined($l->{name}) && defined($r->{name});
+
+    # email-less entries go after (same-name) with-email entries
+    return -1 if defined($l->{email}) && !defined($r->{email});
+    return +1 if !defined($l->{email}) && defined($r->{email});
+    return 0 if !defined($l->{email}) && !defined($r->{email});
+
+    # we are left with same-name entries with emails
+    return &caseCmp($l->{email}, $r->{email});
+}
+
+# whether the given entry is (better) represented by the other one
+sub worseThan
+{
+    my ($l, $r) = @_;
+
+    return 1 if &cmpContributors($l, $r) == 0; # pure duplicate
+
+    return 1 if !defined($l->{name}) && defined($r->{email}) &&
+        $l->{email} eq $r->{email};
+
+    return 1 if !defined($l->{email}) && defined($r->{name}) &&
+        $l->{name} eq $r->{name};
+
+    return 0;
+}
+
+# whether the entry should be excluded based on some out-of-band rules
+sub isManuallyExcluded
+{
+    my ($c) = @_;
+    return lc(contributorToString($c)) =~ /squidadm/; # a known bot
+}
+
+sub contributorToString
+{
+    my ($c) = @_;
+
+    if (defined($c->{name}) && defined($c->{email})) {
+        return sprintf("%s <%s>", $c->{name}, $c->{email});
+    }
+
+    if (defined $c->{name}) {
+        return $c->{name};
+    }
+
+    die() unless defined $c->{email};
+    return sprintf("<%s>", $c->{email});
+}
+
+sub printContributors
+{
+    foreach my $c (sort { &cmpContributorsForPrinting($a, $b) } (@VettedContributors, @NewContributors)) {
+        my $entry = &contributorToString($c);
+        die() unless defined $entry && length $entry;
+        &lineOut("    $entry\n");
+    }
+}
+
+# convert an unvetted/raw input line into a {name, email, ...} object
+sub parseContributor
+{
+    s/^\s+|\s+$//g; # trim
+    my $trimmedRaw = $_;
+
+    s/\s+/ /g; # canonical space characters
+    die() unless length $_;
+
+    return "entry with strange characters" if /[^-,_+'" a-zA-Z0-9@<>().]/;
+
+    my $name = undef();
+    my $email = undef();
+
+    if (s/\s*<(.*)>$//) {
+        $email = $1 if length $1;
+
+        return "multiple emails" if defined($email) && $email =~ /,/;
+        return "suspicious email" if defined($email) && !&isEmail($email);
+    }
+
+    # convert: name@example.com <>
+    # into: <name@example.com>
+    if (!defined($email) && &isEmail($_)) {
+        $email = $_;
+        $_ = '';
+    }
+
+    $name = $_ if length $_;
+
+    if (defined($name)) {
+        return "name that looks like email" if $name =~ /@|<|\sat\s|^unknown$/;
+
+        # strip paired surrounding quotes
+        if ($name =~ /^'\s*(.*)\s*'$/ || $name =~ /^"\s*(.*)\s*"$/) {
+            $name = $1;
+        }
+    }
+
+    return "nameless, email-less entry" if !defined($name) && !defined($email);
+
+    return {
+        name => $name,
+        email => $email,
+        raw => $trimmedRaw,
+    };
+}
+
+# Handle CONTRIBUTORS file, printing preamble and loading vetted entries. The
+# parsing rules here are a lot more relaxed because we know that this vetted
+# content might contain manual entries that violate our automated rules.
+sub loadVettedContributors
+{
+    my ($vettedFilename) = @_;
+    open(IF, "<$vettedFilename") or die("Cannot open $vettedFilename: $!\n");
+    while (<IF>) {
+        my $original = $_;
+        ++$VettedLinesIn;
+
+        if (s/^\S// || s/^\s*$//) {
+            # preamble and its terminator (a more-or-less empty line)
+            &lineOut($original);
+            next;
+        }
+
+        chomp;
+
+        s/^\s+|\s+$//g; # trim
+        my $trimmedRaw = $_;
+
+        my ($name, $email);
+        if (s/\s*<(.+)>$//) {
+            $email = $1;
+        }
+        if (length $_) {
+            $name = $_;
+            die("Malformed vetted entry name: ", $name) if $name =~ /[@<>]/;
+        }
+
+        die("Malformed $vettedFilename entry:", $original) if !defined($name) && !defined($email);
+
+        push @VettedContributors, {
+            name => $name,
+            email => $email,
+            raw => $trimmedRaw,
+        };
+    }
+    close(IF) or die();
+    die() unless @VettedContributors;
+}
+
+# import contributor (name, email) pairs from CONTRIBUTORS-like input
+# skip unwanted entries where the decision can be made w/o knowing all entries
+sub loadCandidates
+{
+    while (<>) {
+        ++$NewLinesIn;
+        my $original = $_;
+        chomp;
+
+        s/^\s+|\s+$//g; # trim
+
+        if (!length $_) {
+            ++$SkippedEmptyLines;
+            next;
+        }
+
+        my $c = &parseContributor();
+        die() unless $c;
+
+        if (!ref($c)) {
+            &noteProblem("Skipping %s: %s", $c, $original);
+            ++$SkippedBadLines;
+            next;
+        }
+        die(ref($c)) unless ref($c) eq 'HASH';
+
+        if (&isManuallyExcluded($c)) {
+            &noteProblem("Skipping banned entry: %s\n", $c->{raw});
+            ++$SkippedBanned;
+            next;
+        }
+
+        if (my ($vettedC) = grep { &similarToVetted($c, $_) } @VettedContributors) {
+            &noteProblem("Skipping already vetted:\n %s\n %s\n", $vettedC->{raw}, $c->{raw})
+                unless &contributorToString($vettedC) eq &contributorToString($c);
+            ++$SkippedAlreadyVetted;
+            next;
+        }
+
+        push @NewContributors, $c;
+    }
+}
+
+sub pruneCandidates
+{
+    my @ngContributors = ();
+
+    while (@NewContributors) {
+        my $c = pop @NewContributors;
+        if (my ($otherC) = grep { &worseThan($c, $_) } (@VettedContributors, @NewContributors, @ngContributors)) {
+            &noteProblem("Skipping very similar:\n %s\n %s\n", $otherC->{raw}, $c->{raw})
+                unless &contributorToString($otherC) eq &contributorToString($c);
+            ++$SkippedNewDuplicates;
+            next;
+        }
+        push @ngContributors, $c;
+    }
+
+    @NewContributors = @ngContributors;
+}
+
+sub lineOut {
+    print(@_);
+    ++$LinesOut;
+}
+
+# report the given problem, once
+sub noteProblem {
+    my $format = shift;
+    my $problem = sprintf($format, @_);
+    return if exists $Problems{$problem};
+    $Problems{$problem} = undef();
+    print(STDERR $problem);
+}
+
+sub isEmail
+{
+    my ($raw) = @_;
+    return $raw =~ /^\S+@\S+[.]\S+$/;
+}
+
+sub caseCmp
+{
+    my ($l, $r) = @_;
+    return (uc $l) cmp (uc $r);
+}
+
+sub main
+{
+    &loadVettedContributors("CONTRIBUTORS");
+    &loadCandidates();
+    &pruneCandidates();
+
+    my $loadedNewContributors = scalar @NewContributors;
+    die("$NewLinesIn != $SkippedEmptyLines + $SkippedBadLines + $SkippedBanned + $SkippedAlreadyVetted + $SkippedNewDuplicates + $loadedNewContributors; stopped")
+        unless $NewLinesIn == $SkippedEmptyLines + $SkippedBadLines + $SkippedBanned + $SkippedAlreadyVetted + $SkippedNewDuplicates + $loadedNewContributors;
+
+    &printContributors();
+
+    # TODO: Disable this debugging-like dump (by default). Or just remove?
+    printf(STDERR "Vetted lines in: %4d\n", $VettedLinesIn);
+    printf(STDERR "Updated lines out: %4d\n", $LinesOut);
+    printf(STDERR "\n");
+    printf(STDERR "New lines in: %4d\n", $NewLinesIn);
+    printf(STDERR "Skipped empty lines: %4d\n", $SkippedEmptyLines);
+    printf(STDERR "Skipped banned: %4d\n", $SkippedBanned);
+    printf(STDERR "Skipped similar: %4d\n", $SkippedAlreadyVetted);
+    printf(STDERR "Skipped duplicates: %4d\n", $SkippedNewDuplicates);
+    printf(STDERR "Skipped bad lines: %4d\n", $SkippedBadLines);
+    printf(STDERR "\n");
+    printf(STDERR "Vetted contributors: %3d\n", scalar @VettedContributors);
+    printf(STDERR "New contributors: %3d\n", scalar @NewContributors);
+    printf(STDERR "Contributors out: %3d\n", @VettedContributors + @NewContributors);
+
+    return 0;
+}
+
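
Note on the co-author step: collectAuthors() in the patch turns Co-authored-by trailers from "git log" output into four-space-indented CONTRIBUTORS-style entries. The following is only a minimal sketch of that one step, run on made-up sample text instead of a real repository. The helper name extract_coauthors and the sample names and addresses are assumptions for illustration and are not part of the patch. The sketch also swaps the patch's GNU-specific "sed -r ... /i" expression for a POSIX character class, relying on grep to have already selected the trailer lines.

```shell
#!/bin/sh
# Sketch only: extract_coauthors is a hypothetical helper, not in the patch.
extract_coauthors() {
    # keep only Co-authored-by trailers (any letter case, any indentation)
    grep -Ei '^[[:space:]]*Co-authored-by:' |
        # replace everything up to and including the first colon, plus any
        # whitespace after it, with the four-space CONTRIBUTORS indentation
        sed -E 's/^[^:]*:[[:space:]]*/    /'
}

# Made-up text resembling "git log" output (names and emails are invented).
sample='commit 0123456
Author: Jane Doe <jane@example.com>

    Fix a typo

    Co-authored-by: John Roe <john@example.org>
    co-authored-by:   Pat Poe <pat@example.net>'

# Prints one indented "name <email>" entry per trailer found in the sample.
printf '%s\n' "$sample" | extract_coauthors
```

For the sample above, only the two trailer lines survive, each rewritten as an indented "name <email>" entry; the Author line is not matched because the patch collects commit authors separately via the git log format string.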