From b0666906f7c3be7f21f0ce4f457ccc63d4fa78d8 Mon Sep 17 00:00:00 2001
From: Jim Hague
Date: Fri, 15 Dec 2017 16:53:59 +0000
Subject: [PATCH] eit: add info on EIT scraper config file format to scraper README (#4795)

Info on the EIT scraper config file contents is a bit scattered, and
not completely up to date. Add a description to the EIT scraper README.

Issue: #4795
---
 data/conf/epggrab/eit/scrape/README | 56 +++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/data/conf/epggrab/eit/scrape/README b/data/conf/epggrab/eit/scrape/README
index 9c1698991..015a4fe24 100644
--- a/data/conf/epggrab/eit/scrape/README
+++ b/data/conf/epggrab/eit/scrape/README
@@ -1,6 +1,62 @@
 The directory contains configuration file for general regular expressions
 to be applied to the EPG.
 
+Configuration file format
+-------------------------
+
+A configuration file is in JSON format. Possible members of the top-level
+object are:
+
+* season_num
+* episode_num
+* airdate
+* is_new
+* scrape_subtitle
+
+Each member's value is a list of regular expressions. Each regular
+expression must contain at least one sub-pattern, i.e. a pattern
+enclosed in (). Input data is matched against the first regex in the
+list. If no match is found, the second regex is tried, and so on until
+a match is found or the list is exhausted.
+
+For each EPG episode, the title, description and summary are matched
+in turn against the season_num, episode_num, airdate and is_new regexes.
+
+- season_num converts the contents of the first sub-pattern to an integer,
+  and if successful sets the EPG season number.
+- episode_num converts the contents of the first sub-pattern to an integer,
+  and if successful sets the EPG episode number.
+- airdate converts the contents of the first sub-pattern to an integer,
+  and if successful sets the EPG copyright year.
+- is_new sets the EPG is_new flag on any match. Remember the regex must
+  have one sub-pattern to make a successful match; in this case the content
+  of the sub-pattern is ignored.
+
+Finally, only the summary is matched against the scrape_subtitle regexes.
+On a match, the EPG subtitle is set to the contents of the first sub-pattern.
+If a second sub-pattern is present in the regex, the EPG summary is set to
+the contents of that sub-pattern. If no second sub-pattern is present, the
+EPG summary is not changed.
+
+Regular expression engine
+-------------------------
+
+If the PCRE or PCRE2 library is found during configuration, that library
+is used for regular expression matching. Otherwise, the default C library
+POSIX regular expression handling is used, and the regular expressions
+are treated as extended POSIX regular expressions.
+
+If a regular expression is intended for universal use, you need to be careful
+to ensure that it works as expected on both PCRE and POSIX engines. A useful
+reference is at http://www.regular-expressions.info/refbasic.html.
+
+Testing
+-------
+
 There is a test harness for these files in the development tree at
 support/eitscrape_test.py with test harness files at
 support/testdata/eitscrape.
+
+WARNING: The test harness uses Python's re regular expression handling.
+Python regular expressions are neither POSIX nor PCRE, though in general
+they are closer to PCRE than POSIX.
-- 
2.47.3
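
Illustrative note (not part of the patch): the matching rules described in
the README text can be sketched in Python roughly as follows. This is a
minimal sketch under stated assumptions, not the tvheadend implementation.
The patterns in EXAMPLE_CONFIG and the helper names first_match and scrape
are hypothetical, it assumes the first input that yields a usable value
wins, and it uses Python's re module, which the README itself warns is
neither POSIX nor PCRE.

```python
import json
import re

# Hypothetical configuration: these patterns are illustrative only and are
# not taken from any shipped scraper file.
EXAMPLE_CONFIG = r"""
{
  "season_num": ["S([0-9]+)", "Season ([0-9]+)"],
  "episode_num": ["Ep ?([0-9]+)", "Episode ([0-9]+)"],
  "airdate": ["\\(([0-9]{4})\\)"],
  "is_new": ["(New:)"],
  "scrape_subtitle": ["^([^:]+): (.*)$"]
}
"""


def first_match(patterns, text):
    """Try each regex in list order; return the first match, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match
    return None


def scrape(config, title, description, summary):
    """Apply the README's rules to one episode (assumed semantics)."""
    result = {"summary": summary}
    texts = [t for t in (title, description, summary) if t]

    # Integer fields: title, description and summary are tried in turn.
    # Assumption: stop at the first input that converts successfully.
    for key in ("season_num", "episode_num", "airdate"):
        for text in texts:
            match = first_match(config.get(key, []), text)
            if match is None:
                continue
            try:
                result[key] = int(match.group(1))
            except (IndexError, TypeError, ValueError):
                continue
            break

    # is_new: any match sets the flag; the sub-pattern content is ignored.
    result["is_new"] = any(
        first_match(config.get("is_new", []), t) for t in texts
    )

    # scrape_subtitle: matched against the summary only.
    match = first_match(config.get("scrape_subtitle", []), summary or "")
    if match:
        result["subtitle"] = match.group(1)
        # A second sub-pattern, if present, replaces the summary.
        if match.re.groups >= 2 and match.group(2) is not None:
            result["summary"] = match.group(2)
    return result


config = json.loads(EXAMPLE_CONFIG)
result = scrape(
    config,
    "New: Example Show",
    "Season 2, Episode 5 (2017)",
    "The Pilot: A stranger arrives.",
)
print(result)
```

Because the engine differs from both PCRE and POSIX, a pattern that behaves
as expected in this sketch still needs checking against the engine tvheadend
was actually built with.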