CREG Journal Search Engine - Development
Version 0v28, build date Mon 31-May-2021 14:14:20, by David Gibson
- Warning: This page
is under development and is likely to contain strange bugs and poor
documentation. Please report anything odd or incorrect. Please
include, in your report, the date and time of your search, your IP
address (which is reportedly 3.146.34.148)
and a description of what happened. Just saying "it didn't work" is not
particularly helpful!
DEVELOPMENT NOTE: If you think something odd is
happening, try forcing a page refresh by typing (probably) CTRL-SHIFT-R - but
check your browser documentation. Reason: Your browser might be relying
on old cached copies of JS or CSS files that have been modified recently,
during development, but which your browser has decided not to download.
Browsers are capricious like that.
- Plain Text Searches
- Wildcard Searches
- Boolean Searches
- Regular Expression Searches
- The Search Algorithms
- Regex Conversions
Plain Text Searches
What you type is what is searched for, but please note the following
exceptions...
- When you submit your search expression, any space characters are
converted to underscores (_) for on-screen clarity. This means that you cannot
search for an underscore using a plain text search.
- Similarly, < and > are converted to underscore. This is to prevent cross-site scripting attacks. This conversion means that you cannot
search for < or >.
- Spaces (and underscores) in your search expression are interpreted as
matching any number of consecutive spaces. This is so that your search will not
be spoiled if the database accidentally happens to contain two spaces between
words. You can change this behaviour using the checkbox in Advanced Settings,
above.
- In a plain text search, you cannot search for non-7-bit ASCII
characters (e.g. accented characters and symbols such as
±½²³). These are stored in the database as HTML
Character Entities and - if you need to find them - you should use a wildcard
search.
- The database contains HTML tags and Character Entities. Your search
will look inside these items, because it is faster not to exclude them,
but this could lead to strange results. You can change this behaviour using the
checkboxes in Advanced Settings for Do not search inside HTML tags and
Do not search inside Character Entities. You can also avoid the display
of strangely-formatted results by selecting the option Do not tag matched
text.
Wildcard Searches
A Wildcard search works like a Plain Text search but,
additionally...
- In a wildcard search the characters ? and * have a
special meaning. ? matches a single character; * matches a string
of any characters, but is prioritised to be as short as possible.
- In other respects this search is the same as a plain text search.
- For the wildcard *, "as short as possible" means that if the
string being searched was, for example, "electric field and magnetic field"
then the search term elec*field would match "electric field" rather than
"electric field and magnetic field".
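The "as short as possible" behaviour can be tried out with a short sketch. This uses Python's re module as a stand-in for the PHP/PCRE engine that the site actually uses (an assumption made only so the example is easy to run); the lazy quantifier .*? behaves the same way in both.

```python
import re

text = "electric field and magnetic field"

# Lazy form, as used for the * wildcard: .*? stops at the first place
# where "field" can follow.
lazy = re.search(r"elec.*?field", text).group()

# Greedy form, shown only for contrast: .* runs on to the last "field".
greedy = re.search(r"elec.*field", text).group()

print(lazy)    # electric field
print(greedy)  # electric field and magnetic field
```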
Boolean Searches
Not yet implemented, but you
may be able to achieve a similar result with an appropriate Regular Expression.
A Boolean search allows you to combine Wildcard search strings
with the logical operators NOT, AND, OR, XOR and
IMP (IMPLIES), and to group them with parentheses, ( and
).
- XOR is the EXCLUSIVE OR operator, which is equivalent
to (aaa AND NOT bbb) or (bbb AND NOT aaa).
- IMP is the IMPLIES operator, where A IMP B is
equivalent to B OR NOT A.
- In this implementation, the search string and the operators must each
be separated by one or more spaces but you can still use spaces inside
your search strings. If you need to use a space at the beginning or end of your
search string you should enter it as an underscore instead. (See note on spaces
in the Plain Text Search notes above).
- You cannot use ( or ) in your search string unless you
select the advanced option Allow ( and ) inside Boolean search
strings. If you select that option you must ensure, if you also use
( and ) to group your search expressions, that you separate these
'group separators' from the search strings and the other operators using
spaces.
- You can use the operator keywords (AND, OR, etc) in your
search string, provided they are not bounded by spaces.
- Unlike some Boolean searches, this one does not execute with a
simple left-to-right evaluation. Instead, the operators have a precedence
ranking which, in high-to-low order, is NOT, AND &
IMP, OR & XOR. As an example, aaa ddd OR bbb AND
ccc would execute as aaa ddd OR ( bbb AND ccc ) rather than the
left-to-right execution of ( aaa ddd OR bbb ) AND ccc.
- For a Boolean search, each search term is matched by a separate parse
of the database, so a complex search with many search strings could be
slow.
- Because of the structure of the database, a Boolean search is
potentially more likely than a Plain Text search to produce strangely-formatted
results. You can avoid this by selecting the option Do not tag matched
text.
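The operator definitions in the notes above can be checked with a small truth-table sketch. It is written in Python purely for illustration (Boolean searches are not yet implemented on the site); the function names xor and imp are made up for this example.

```python
from itertools import product

def xor(a: bool, b: bool) -> bool:
    # XOR as defined above: (aaa AND NOT bbb) OR (bbb AND NOT aaa)
    return (a and not b) or (b and not a)

def imp(a: bool, b: bool) -> bool:
    # A IMP B as defined above: B OR NOT A
    return b or not a

# Print the full truth table for both operators.
for a, b in product([False, True], repeat=2):
    print(a, b, "XOR:", xor(a, b), "IMP:", imp(a, b))

# Precedence matters: with A true and C false the two groupings disagree.
A, B, C = True, False, False
with_precedence = A or (B and C)   # AND binds tighter than OR
left_to_right = (A or B) and C     # simple left-to-right evaluation
print(with_precedence, left_to_right)
```

The last two lines show why the precedence ranking is worth knowing: the same three terms give different answers depending on how they are grouped.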
Regular Expression Searches
Unlike Plain Text, Wildcard and Boolean searches, your search string
is interpreted directly, as a Regular Expression - but see the note on spaces,
below.
- Regular Expression searches use PHP-style expressions (which are
PCRE-based).
- Your search string is delimited using / characters. If you
include a / character in your search string, it will be escaped with
\.
- The mode modifier i will be appended if you have specified a
case-insensitive search.
- When you submit your search expression it is trimmed to remove
leading and trailing spaces so, to search for such a space, you should use the
RegEx syntax \s.
Regular Expression (RegEx) searches are very powerful, but they are only
suitable for experts. In particular, you may need to know something about the
database structure in order to use a regular expression to best advantage. You
can do some advanced searches using RegExs. Examples...
- To locate all non-7-bit printable ASCII characters, which need to
be converted to HTML entities, use the search expression [^\s!-~]
- To locate any & characters that have not been entered as
an HTML entity &amp;, use the search expression
&(?!.{0,6}?(;|=))
- To locate all HTML Character Entities, use the search expression
&#?\w+?;
- < and > are converted to underscore before the search expression is executed. This is to prevent cross-site scripting attacks. This conversion means that you cannot
search for < or >.
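The three example expressions above can be tried on a sample string. This is a sketch using Python's re module rather than the site's PHP/PCRE engine (an assumption for ease of running; these particular expressions use only syntax common to both). The sample text is invented for the example.

```python
import re

# Invented sample: an accented letter, a plus-minus sign, one bare
# ampersand and two properly written entities.
sample = "caf\u00e9 10\u00b15 Fish & Chips &amp; more &#177;"

# Characters that are neither whitespace nor printable 7-bit ASCII.
non_ascii = re.findall(r"[^\s!-~]", sample)

# Ampersands not followed, within a few characters, by ; or =
bare_amps = [m.start() for m in re.finditer(r"&(?!.{0,6}?(;|=))", sample)]

# Anything that looks like an HTML Character Entity.
entities = re.findall(r"&#?\w+?;", sample)

print(non_ascii)
print([sample[i:i + 7] for i in bare_amps])
print(entities)
```

Only the ampersand in "Fish & Chips" is reported as bare; the ampersands that open &amp; and &#177; are skipped because a ; follows within the lookahead window.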
The Search Algorithms
The Titles and Abstracts are searched separately. The search scores as a
hit if...
- The search string was found in the Title AND Search
Titles was selected, OR...
- The search string was found in the Abstract AND Search
Abstracts was selected
Clicking the Negate the Search Result box causes the result of
the above logical test to be inverted. This means that if you elect to search
Titles AND Abstracts then, to be scored a hit, the search term must not
appear in either.
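The hit test and its negation can be written out as a short sketch. The function name is_hit and its parameter names are hypothetical; this is an illustration in Python, not the site's PHP code.

```python
def is_hit(in_title, in_abstract, search_titles, search_abstracts, negate=False):
    """Return True if a record counts as a hit under the rules above."""
    hit = (in_title and search_titles) or (in_abstract and search_abstracts)
    # "Negate the Search Result" inverts the outcome of the whole test.
    return (not hit) if negate else bool(hit)

# Term appears only in the abstract, but only Titles are being searched.
print(is_hit(False, True, search_titles=True, search_abstracts=False))

# Negated search of Titles AND Abstracts: a hit only if the term is in neither.
print(is_hit(False, False, True, True, negate=True))
print(is_hit(False, True, True, True, negate=True))
```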
Entries in the database are in HTML-compatible text. That is, they
include escaped 'Character Entities' and HTML tags (in particular the Anchor
tag). For speed, your search will look inside tags and entities, but
this can lead to strange results. For searches other than RegExs, you have the
option to exclude tags and entities from the search by using the checkboxes in
Advanced Settings for Do not search inside HTML tags and Do not
search inside Character Entities. You can also avoid the display of
strangely-formatted results by selecting the option Do not tag matched
text.
A couple of examples will illustrate this...
- If you do a plain text search for hp your search will include
matches within the string phpBB that appears inside some HTML tags. If
you click on such a result, the hyperlink will not work because the matched
string has been replaced by HTML code to display the match in red.
- If you do a plain text search for cut your search will
include matches within the string acute that appears inside some
Character Entities. The matched string will be replaced by HTML code to display
the match in red, so it will no longer function correctly as a Character
Entity, and will display as (e.g.) &eacute; instead of é.
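Both examples can be reproduced with a sketch that counts matches with and without the two "Do not search inside..." lookaheads described under RegEx Conversions. It uses Python's re module instead of PHP/PCRE, and the sample records and the example.invalid URL are invented; the step that doubles the terminating ; of each entity is my own approximation of the temporary alteration the notes describe.

```python
import re

# Invented record: "hp" occurs twice inside the tag and once in the text.
html = '<a href="http://example.invalid/phpBB3/viewtopic.php">php help</a>'

everywhere = re.findall(r"hp", html)
outside_tags = re.findall(r"hp(?![^<]*?>)", html)
print(len(everywhere), len(outside_tags))

# Invented record: "cut" occurs inside the entity &eacute; and as a word.
record = "caf&eacute; cut"

# Temporarily double the ; that terminates each entity (assumed approach).
doubled = re.sub(r"(&#?\w+?);", r"\1;;", record)

in_entities_too = re.findall(r"cut", doubled)
outside_entities = re.findall(r"cut(?![^&]*?;;)", doubled)
print(len(in_entities_too), len(outside_entities))
```

In each case the lookahead rejects a match when the text that follows it reaches a closing > (or ;;) before any new < (or &), which is what "the search string must not finish inside a tag or entity" amounts to.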
Some further technical details...
- Inside a tag means "inside the < and
> symbols"; not what appears in between a start tag and
its matched end tag. Do not search inside... is interpreted as
meaning a search string must not finish inside a tag (or entity). That
is clearly not exactly what the description implies, and you cannot
(easily) search for a string that encompasses an HTML tag. It would be possible
to strip the tags out before searching - and this might be a future
option.
- In the database, the Titles are followed by the page numbers in
parentheses. A Title Search does not search the page numbers.
- References to the CREG Forum. Title
records can contain a reference to the CREG Forum. Search for cregf to
display these records. Technical Spec. Titles can contain text
like [cregf:viewtopic.php?f=27&t=1203]. The text after
'cregf' is removed from the Title before it is searched or displayed. The text
after ':' is appended to 'http://british-caving.org.uk/phpBB3/' to form the
URL. The phrase must begin '[cregf' and end with ']'.
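The handling of a forum reference can be sketched as follows. This is Python written for illustration, not the site's PHP. The function name split_forum_ref and the sample title are hypothetical, and leaving a bare [cregf] marker in the cleaned title is only one reading of "the text after 'cregf' is removed".

```python
import re

FORUM_BASE = "http://british-caving.org.uk/phpBB3/"

# The phrase must begin "[cregf" and end with "]"; the part after ":" is the link.
CREGF = re.compile(r"\[cregf:([^\]]*)\]")

def split_forum_ref(title):
    """Return (title with the link text removed, forum URL or None)."""
    m = CREGF.search(title)
    if m is None:
        return title, None
    url = FORUM_BASE + m.group(1)
    # Assumption: the marker itself stays, so a search for cregf still matches.
    cleaned = title[:m.start()] + "[cregf]" + title[m.end():]
    return cleaned, url

cleaned, url = split_forum_ref("Cave radio notes [cregf:viewtopic.php?f=27&t=1203]")
print(cleaned)
print(url)
```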
For a search other than a Regular Expression search, your search
string, and any options you specify, are converted into the appropriate regular
expression which is then used for the search. A list of the conversion
operations applied to non-RegEx searches is given in RegEx Conversions,
below. For a regular expression search, you are expected to specify the search
string precisely, including any arcane terms to tailor the search to work a
particular way.
Boolean searches have a more complicated algorithm than the other
types of search, which proceeds as follows.
- Unless you have selected the option to Allow ( and
) inside Boolean search strings your search text is searched and all
instances of ( and ) will have spaces inserted before and after
them.
- The search string is then parsed and separated into 'tokens', using
'space' as a delimiter. Each token thus represents a search string (or part of a
search string) or an operator. If you have selected the option to
Allow ( and ) inside Boolean search strings you must
ensure that all uses of ( and ) outside a search expression have
a space before and after them.
- The tokens are then examined, in turn. If a token matches an operator
exactly then it is treated as an operator, else it is treated as a
string. Two adjacent tokens that are both strings are joined together into a
single string, with a single space between them. The sequence of tokens is
checked for syntax errors.
- The search expression is re-ordered into Reverse Polish Notation, the
form that computer languages use internally to process expressions. Additionally, this
takes into account the rules of operator precedence.
- The search expression is then parsed for a fourth time, converting
each search string into a regular expression as described under Wildcard
searches, above, and RegEx Conversions, below.
- Additionally, the individual search terms are combined into a single
regular expression, $match, using an OR syntax, which is saved for later
use, should there be a match.
- The complete parsed and processed search expression is then displayed
(as a debugging aid) and passed to the search engine.
- The search engine examines and executes each token in turn, placing
the logical result of the operation on a stack.
- If the option Negate the search result has been specified the
logical result of the search is inverted.
- If the result is TRUE then the result is prepared for
outputting to the screen.
- Unless the option Do not tag matched text has been selected,
the $match expression assembled earlier is used to modify the printable
result to highlight all the search terms.
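The main steps above (tokenising, re-ordering into Reverse Polish Notation and evaluating on a stack) can be sketched as follows. Since Boolean searches are not yet implemented, this is only an illustration in Python, not the site's PHP code: the function names are hypothetical, syntax checking is omitted, and each search string is matched as literal text rather than being converted as a wildcard expression.

```python
import re

# Precedence ranking from the notes: NOT, then AND & IMP, then OR & XOR.
OPS = {"NOT": 3, "AND": 2, "IMP": 2, "OR": 1, "XOR": 1}

def tokenize(expr):
    """Space out ( and ), split on spaces, join adjacent strings."""
    raw = expr.replace("(", " ( ").replace(")", " ) ").split()
    tokens, prev_was_string = [], False
    for tok in raw:
        is_string = tok not in OPS and tok not in ("(", ")")
        if is_string and prev_was_string:
            tokens[-1] += " " + tok      # two adjacent strings become one
        else:
            tokens.append(tok)
        prev_was_string = is_string
    return tokens

def to_rpn(tokens):
    """Re-order infix tokens into Reverse Polish Notation (shunting-yard)."""
    out, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()                  # discard the "("
        elif tok in OPS:
            # NOT is a prefix operator, so it never pops earlier operators.
            while (tok != "NOT" and stack and stack[-1] != "("
                   and OPS[stack[-1]] >= OPS[tok]):
                out.append(stack.pop())
            stack.append(tok)
        else:
            out.append(tok)
    while stack:
        out.append(stack.pop())
    return out

def evaluate(rpn, text):
    """Execute each token in turn, keeping logical results on a stack."""
    stack = []
    for tok in rpn:
        if tok == "NOT":
            stack.append(not stack.pop())
        elif tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append({"AND": a and b, "OR": a or b,
                          "XOR": a != b, "IMP": b or not a}[tok])
        else:
            stack.append(re.search(re.escape(tok), text, re.I) is not None)
    return stack.pop()

tokens = tokenize("aaa ddd OR bbb AND ccc")
rpn = to_rpn(tokens)
# Combined OR expression, kept for highlighting (the $match of the notes).
match = "|".join(re.escape(t) for t in rpn if t not in OPS)
print(tokens)
print(rpn)
print(evaluate(rpn, "notes on aaa ddd"))
```

With the sample text, the expression is true because AND is applied before OR; a left-to-right evaluation would have required ccc to be present as well.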
RegEx Conversions
Plain Text searches
Plain-text searches are converted to Regular Expressions before the
search is executed. The sequence of operations is as follows.
- Spaces are converted to underscores before the expression is
submitted
- < and > are converted to underscore.
- All characters that have a special meaning in RegExs are escaped with
\
- The characters & " £ are replaced with
their HTML entities
- _ is replaced with \s+ so that the search matches a
string of spaces. This behaviour can be modified by an Advanced Setting
- Hyphen is replaced by (\-|–) so that the search
matches a hyphen or an en-dash. This behaviour can be modified by an Advanced
Setting
- If you specified Match Whole Words then the RegEx is bounded
by the metacharacter \b for 'word boundary'
- If you specified Do not search inside HTML tags the RegEx
phrase (?![^<]*?>) is appended to your search so that matches
inside an HTML tag are ignored.
- If you specified Do not search inside Character Entities the
RegEx phrase (?![^&]*?;;) is appended to your search so that matches
inside an HTML entity are ignored. For this to work, the database is
temporarily altered to replace the single terminating ; of an Entity, by
;;. This apparent 'botch' is considered the simplest way of performing
the match, because the alternative method of using a RegEx lookbehind
function is tediously long-winded.
- If you did not specify Match Case then the mode modifier i is
appended to the regular expression, for 'case-insensitive matching'
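The sequence of operations above can be followed in a short sketch. It is written in Python, not the PHP used by the site, so several details are assumptions: the function name plain_to_regex and its option names are hypothetical, the SPECIALS set is my own stand-in for whatever the PHP escaping routine covers, and the hyphen is left out of that set so that it can be handled in its own step.

```python
import re

# Assumed set of characters to escape; the hyphen is handled separately.
SPECIALS = set(r".\+*?[^]$(){}=!|:#/")

def plain_to_regex(query, whole_words=False, skip_tags=False,
                   skip_entities=False, match_case=False):
    s = query.strip().replace(" ", "_")          # spaces -> underscores
    s = s.replace("<", "_").replace(">", "_")    # < and > -> underscores
    s = "".join("\\" + c if c in SPECIALS else c for c in s)  # escape specials
    for char, entity in (("&", "&amp;"), ('"', "&quot;"), ("£", "&pound;")):
        s = s.replace(char, entity)              # & " £ -> HTML entities
    s = s.replace("_", r"\s+")                   # _ matches a run of spaces
    s = s.replace("-", "(\\-|–)")                # hyphen or en-dash
    if whole_words:
        s = r"\b" + s + r"\b"                    # Match Whole Words
    if skip_tags:
        s += r"(?![^<]*?>)"                      # not inside an HTML tag
    if skip_entities:
        s += r"(?![^&]*?;;)"                     # not inside an entity
    flags = 0 if match_case else re.IGNORECASE   # the i mode modifier
    return re.compile(s, flags)

print(plain_to_regex("electric field").pattern)
print(plain_to_regex("Fish & Chips").pattern)
print(plain_to_regex("hp", whole_words=True, skip_tags=True).pattern)
```

The skip_entities lookahead only works if the text being searched has had the terminating ; of each entity doubled first, as the note above explains.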
Wildcard and Boolean searches
All the above, plus...
- The wildcard ? is replaced by (.|&#?\w+?;) so that
it matches any single character or a Character Entity
- The wildcard * is replaced by .*? to specify a search
for a string of any characters, but one which is prioritised to be as short as
possible.
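The two wildcard replacements can be sketched in isolation. This is Python for illustration, with the hypothetical function name wildcard_to_regex; every other character is simply passed through re.escape, which stands in for the fuller plain-text conversion listed earlier.

```python
import re

def wildcard_to_regex(query):
    """Replace ? and * with their RegEx equivalents; escape everything else."""
    parts = []
    for ch in query:
        if ch == "?":
            parts.append(r"(.|&#?\w+?;)")   # one character or one entity
        elif ch == "*":
            parts.append(".*?")             # shortest possible run
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

print(wildcard_to_regex("elec*field"))
print(wildcard_to_regex("caf?"))

# The ? wildcard accepts either a literal character or a whole entity.
pattern = wildcard_to_regex("caf?")
print(re.fullmatch(pattern, "caf\u00e9") is not None)
print(re.fullmatch(pattern, "caf&eacute;") is not None)
```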
RegEx searches
- The / character, which has a special meaning in a RegEx is
escaped with \
- If you did not specify Match Case then the mode modifier i is
appended to the Regular Expression, for 'case-insensitive matching'
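For a RegEx search the only processing is the trimming, the / delimiters and the optional i modifier. The sketch below builds the PHP-style pattern string in Python; the function name build_pcre_pattern is hypothetical, and what happens to a / that the user has already escaped is not stated in these notes, so this sketch simply escapes every / it finds.

```python
def build_pcre_pattern(expr, match_case=False):
    """Wrap a user expression in / delimiters, PHP-style."""
    body = expr.strip()                  # leading/trailing spaces are trimmed
    body = body.replace("/", "\\/")      # escape the delimiter character
    return "/" + body + "/" + ("" if match_case else "i")

print(build_pcre_pattern("  viewtopic/php  "))
print(build_pcre_pattern(r"&#?\w+?;", match_case=True))
```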
Some Preset Searches
Some special searches
The following list is intended mostly for 'debugging', but feel free to try
them.
Known Issues / Things to Do
Software Issues
- The Special Searches for double-quotes and &-in-tags don't
populate the search box properly, although the search works. Something to do
with the unbalanced or unescaped quotation marks or entities not being properly
escaped in URLs ... whatever.
Database Issues
- Page numbers: are not given for the earlier journals - the
database needs updating
- 8-bit characters: in the database need replacing with Character
Entities
- Tagging of Authors: From j99, authors' names are tagged with
<SPAN CLASS="author">. This should be extended back through all
issues.
- Some HTML tags could be replaced by entities: Consider replacing
HTML tags for <sup> with entities, or improve the RegEx so that it searches
them as it would an entity. Related: why do I not use <sup> in CKS
database; but use CSS instead?
- Articles containing corrections and updates. Check whether
searching for "corrections" brings up all published corrections. Check Julie's
database for this. Also, check her notes of 'associated articles' to see if it
can be incorporated, and a special search for "updates, corrections and related
articles" introduced, perhaps?
Future Development
- Consider translating 8-bit chars in Search String to HTML entities. Or,
at least, flagging them to user and suggesting he use a wildcard
- Add Booleans. Use a precedence stack to convert to Reverse Polish Notation.
- Finish debugging the new feature that puts tabs between title items, and
extend it to all listings. This feature is not yet advertised to the user.
- Consider adding option to strip HTML tags before searching
- Consider changing code so that the "Using RegEx" text isn't put in the
SPAN "searchReport" innerHTML until the search is complete. This is just to
tidy the HTML output and make it easier to debug
- Change database files - we no longer need to list all the links, now that
covers.php handles this. Still handy to list the links to raw data files I
suppose - or do that via a query string, e.g. mode=raw
Additions, Bugs, Corrections (not necessarily a complete list)
- 12-Nov-2017 Version 0v11: Bug correction: Ampersand not converting
to Entity in Plain Text search. Correction made to search.php; forgot that
& is not converted by preg_quote()
- 12-Nov-2017 Version 0v11: Layout change: Added <SPAN
CLASS="keepTogether"> to keep INPUT items on same line as their text, in
list of search options. This necessitated adding a parentNode clause in their
ONCLICKS. Updated /pub/popup.css
- 12-Nov-2017 Version 0v11: New Feature: Added Do not search
inside Character Entities by using a 'botch' - see ;; above. This
seems the simplest way to do it though, because lookbehind (for matching the
opening & of an Entity) requires a fixed length search term.
- 14-Nov-2017 Version 0v13: Documentation: revised notes for Boolean
operations (although they are still not implemented). Program: various changes
to comments in search.html and format_creg.php. Moved PHP error handling into
separate function ( for which see test mechanism at the end of printdata()).
- 14-Nov-2017 Version 0v15: HTML: added #results to submitted FORM
so that it jumps down to start of results when search is complete. Also added
popup info box (position: fixed) that duplicates the info in "Results of
Search", and which disappears when results are complete. This makes it easier
to inspect the regex (as displayed) during a long or buggy search.
- 15-Nov-2017 Version 0v16: HTML: added notes on boolean searches.
Added feature to limit search to a range of years.
- 16-Nov-2017 Version 0v17: Corrected bug due to accidentally wiped
code in format_creg for handling title and abs checkboxes. Updated 'years'
facility to give default string '(All)' and to move some operations out of the
For loop in printallData
- 20-Nov-2017 Version 0v19: Searches for 'cregf' string now handled
better. This required a change to the database structure. See References
to the CREG Forum, in the Help notes above.
- 21-Nov-2017 Version 0v19 Documentation: revised
- 27-Nov-2017 Version 0v21 __search updated.
Contents.php edited to show links to local copies of PDFs when run under
localhost. format_creg updated for stripping of HTML pre-amble and
adding new pre-amble if it does not exist. Modified .htaccess and
contents.php to give new format for representing links to data. Updated
pub/popup.css.
- 27-Nov-2017 Version 0v21 Documentation: update to
pub/dataformat and to
database.html
- 28-Nov-2017 Version 0v22 Layout changes to search.html. Added
showDOI. Added reset date to tooltip for downloads counter. Added tabs to
separate title items. See further work
- 30-Nov-2017 Version 0v23 Version number is now PHP variable. Added
logging of search requests. Associated changes to docstore log files
- 30-Nov-2017 Version 0v24 Sandbox handling corrected for
searches.
- 03-Dec-2017 Version 0v25 Format updating now also includes
conversion to HTML Entities, but these features are only enabled at Localhost,
because of file permission and character set issues. Further corrections to
data files, for HTML entities - both 8-bit chars and one file with rogue
double-quotes
- 04-Dec-2017 Version 0v26 Corrected contents.php to remove bad type
conversion when testing for 'cregf', which was preventing CKS from finding 000
files. Changed covers.php to display new unique URLs instead of query strings.
Changed name of search Text Box in this file, to deter bots; renamed $v box to
'search' to use as a flag, to avoid needing to update other files. Amended
search.*, log_search* and fetch_logs accordingly.
- 08-Dec-2017 Version 0v27 Preset searches for development changed
to use ONMOUSEOVER to build URL, so search engines cannot follow the links.
Added notes about searching for authors via authors.html
- 08-Dec-2017 Version 0v28 < and > are now converted to underscore before any other processing. This is to prevent cross-site scripting attacks. This conversion means that you cannot
search for < or >. In practice, it would be OK to search for < and > in a plain text search, so I might alter this behaviour again later.
Users please note that, for debugging purposes, all search requests
are logged. The logged data includes the client IP address as reported to our
web server.