Finding a DOI in a document or page





The DOI system places basically no useful limitations on what constitutes a reasonable identifier. However, being able to pull DOIs out of PDFs, web pages, etc. is quite useful for citation information, etc.

Is there a reliable way to identify a DOI in a block of text without assuming the 'doi:' prefix? (any language acceptable, regexes preferred, and avoiding false positives a must)



1:

Ok, I'm currently extracting thousands of DOIs from free-form text (XML) and I realized that my previous approach had a few problems, namely regarding encoded entities and trailing punctuation, so I went on reading the specification and this is the best I could come up with.
The DOI prefix shall be composed of a directory indicator followed by a registrant code. These two components shall be separated by a full stop (period). The directory indicator shall be "10". The directory indicator distinguishes the entire set of character strings (prefix and suffix) as digital object identifiers within the resolution system.
Easy enough, the initial \b prevents us from "matching" a "DOI" that doesn't start with 10.:
$pattern = '\b(10[.]'; 
The second element of the DOI prefix shall be the registrant code. The registrant code is a unique string assigned to a registrant.
Also, all assigned registrant codes are numeric, and at least 4 digits long, so:
$pattern = '\b(10[.][0-9]{4,}'; 
The registrant code may be further divided into sub-elements for administrative convenience if desired. Each sub-element of the registrant code shall be preceded by a full stop.
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*'; 
The DOI syntax shall be made up of a DOI prefix and a DOI suffix separated by a forward slash.
However, this isn't absolutely necessary: section 2.2.3 states that uncommon suffix systems may use other conventions (such as 10.1000.123456 instead of 10.1000/123456), but let's cut ourselves some slack and require the slash.
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/'; 
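As a quick check of the prefix rules so far, here is a minimal sketch in Python (my choice for illustration; the answer itself builds a PHP string). The capture group is closed early so that the partial pattern compiles, and the names PREFIX_RE and find_prefixes are hypothetical.

```python
import re

# Prefix part only: "10", a registrant code of at least 4 digits,
# optional dot-separated numeric sub-elements, then the slash.
PREFIX_RE = re.compile(r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*)/')


def find_prefixes(text):
    """Return every DOI-like prefix that is followed by a slash."""
    return PREFIX_RE.findall(text)


print(find_prefixes("see 10.1000/182 and 10.1016.12.31/nature, not 10.12/x"))
```

The short registrant code in 10.12/x is rejected by the {4,} quantifier, and the leading \b keeps the pattern from starting in the middle of a longer number.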
The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters of Unicode. The DOI suffix shall consist of a character string of any length chosen by the registrant. Each suffix shall be unique to the prefix element that precedes it. The unique suffix can be a sequential number, or it might incorporate an identifier generated from or based on another system.
Now this is where it gets trickier. From all the DOIs I have processed, I saw the following characters (besides [0-9a-zA-Z] of course) in their suffixes: ".-()/:-". So, while it doesn't exist, the DOI 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7 is completely plausible. The logical choice would be to use \S or the [[:graph:]] PCRE POSIX class, so let's do that:
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/\S+';
// or
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/[[:graph:]]+';
Now we have a difficult problem: the [[:graph:]] class is a super-set of the [[:punct:]] class, which includes characters easily found in free text or any markup language: "'&<> among others. Let's just filter the markup ones for now using a negative lookahead:
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+';
// or
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+';
The above should cover encoded entities (&), attribute quotes (["']) and open / close tags ([<>]). Unlike markup languages, free text usually doesn't employ punctuation characters unless they are bounded by at least one space or placed at the end of a sentence, for instance:
This is a long DOI: 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7!!!
The solution here is to close our capture group and assert another word boundary:
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b';
// or
$pattern = '\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+)\b';
And voilà, here is a demo.
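For anyone who wants to try the final pattern outside PHP, here is a minimal Python sketch. It uses the \S variant, since Python's re module does not support the [[:graph:]] POSIX class. The names DOI_RE and find_dois are hypothetical.

```python
import re

# The final pattern from this answer (the \S variant): prefix, slash,
# then any run of non-space characters that are not markup characters,
# ending on a word boundary so trailing punctuation is dropped.
DOI_RE = re.compile(
    r'''\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&'<>])\S)+)\b'''
)


def find_dois(text):
    """Return all DOI candidates found in a block of text or markup."""
    return DOI_RE.findall(text)


sample = "This is a long DOI: 10.1016.12.31/nature.S0735-1097(98)2000/12/31/34:7-7!!!"
print(find_dois(sample))
```

One consequence of the closing \b is worth knowing: a DOI whose last character is not a word character, such as a closing parenthesis, loses that character, so 10.1000/abc(1) comes back as 10.1000/abc(1. Whether that trade-off is acceptable depends on the data.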

2:

@Silas The sanity checking is a good idea. However, the regex doesn't cover all DOIs. The first element must (currently) be 10, and the second element must (currently) be numeric, but the third element is barely restricted at all:
"Legal characters are the legal graphic characters of Unicode. This specifically excludes the control character ranges 0x00-0x1F and 0x80-0x9F..."
and that's where the real problem lies. In practice, I've never seen whitespace used, but the spec specifically allows for it. Basically, there doesn't seem to be a sensible way of detecting the end of a DOI.

3:

I'm sure it's not super-helpful for the OP at this point, but I figured I'd post what I am trying in case anyone else like me stumbles upon this:
(10\.(\d)+/(\S)+)
This matches: "10 dot number slash anything-not-whitespace". But for my use (scraping HTML), this was finding false positives, so I had to match the above, plus get rid of quotes and greater-than/less-than:
(10\.(\d)+/([^\s>"<])+)
I'm still testing these out, but I'm feeling hopeful thus far.

4:

Here is my go at it:
(10[.][0-9]{4,}[^\s"/<>]*/[^\s"<>]+) 
It handles a couple of valid edge cases where the others seem to fail. It also correctly discards any false (X|HT)ML matches like:
  • <geo coords="10.4515260,51.1656910"></geo>
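A quick way to check that behaviour is a small Python sketch like the one below. The pattern is copied from this answer unchanged; the names CANDIDATE_RE and find_candidates, and the sample strings, are my own assumptions.

```python
import re

# Pattern from this answer: "10.", at least 4 digits, optional extra
# prefix characters, a slash, then anything up to whitespace, a double
# quote, or an angle bracket.
CANDIDATE_RE = re.compile(r'(10[.][0-9]{4,}[^\s"/<>]*/[^\s"<>]+)')


def find_candidates(text):
    """Return DOI candidates found in a block of (X|HT)ML or text."""
    return CANDIDATE_RE.findall(text)


# The coordinate pair has no slash before the closing quote, so it is skipped.
print(find_candidates('<geo coords="10.4515260,51.1656910"></geo>'))
print(find_candidates('<a href="http://dx.doi.org/10.1000/182">link</a>'))
```

Note that this pattern has no trailing boundary check, so sentence punctuation directly after a DOI (as in "see 10.1000/182.") stays in the match and would need stripping afterwards.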

5:

The following regex should do the job (Perl regex syntax):
/(10\.\d+\/\d+)/ 
You could do any additional sanity checking by opening the URLs
http://hdl.handle.net/<doi>
and
http://dx.doi.org/<doi>
where <doi> is the candidate DOI, and testing that a) you get a 200 OK HTTP status, and b) the returned page is not the "DOI not found" page for the service.
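A rough Python sketch of that sanity check, using only the standard library, might look like this. The function names are hypothetical, and it only covers step a): recognising the "DOI not found" page depends on each service's markup, which I'm not going to guess at here.

```python
import urllib.parse
import urllib.request

# Resolver base URLs as given in this answer.
RESOLVERS = ("http://hdl.handle.net/", "http://dx.doi.org/")


def resolver_urls(doi):
    """Build one lookup URL per resolver for a candidate DOI."""
    return [base + urllib.parse.quote(doi, safe="/") for base in RESOLVERS]


def looks_resolvable(doi, timeout=10.0):
    """Return True if any resolver ends in a 200 OK after redirects."""
    for url in resolver_urls(doi):
        try:
            # urlopen follows redirects and raises HTTPError on 4xx/5xx.
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers HTTPError, URLError and timeouts: treat as unresolved.
            continue
    return False


print(resolver_urls("10.1000/182"))
```

Treat a False result as "unconfirmed" rather than "invalid": the target site may be down or may refuse automated requests even when the DOI itself is fine.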

6:

This is a really old and answered question, but here's another potential alternative.
\b10\.(\d+\.*)+[\/](([^\s\.])+\.*)+\b 
This assumes that white space is not part of the DOI. I haven't tested this for false positives, but it seems to be able to find all the edge cases mentioned on this page.

