Can I extract complete dates from a file with the grep command?

I need help using grep to extract a zoned date time from a file on a Linux system.

The source file is XML containing the data below:

<item start="20231010073000 +0100" stop="20231010100000 +0100">...</item>

And I need to extract the complete start date, but with grep I can’t get it as a complete result. My code:

for startDate in $(grep -Eo 'start="[0-9]{14} [+|-][0-9]{4}"' "$filepath" ); do
  echo "$startDate"
done

But I get it as two separate results:

start="20231010073000
+0100"

Can I get it as shown below?

start="20231010073000 +0100"

I’ve tried with \s, [[:space:]], and other examples, but with the same result.

It seems to be an error in my code, but I can’t fix it!

I am thankful for any kind of help!!!

Asked By: aris

||

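The problem lies in your loop: the unquoted command substitution $(...) is split on $IFS by default (with the default value of $IFS, that means any sequence of space, tab, or newline characters, with leading and trailing ones discarded). Each match is therefore broken in two at the space before the time zone offset.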
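
To make the splitting visible, here is a minimal sketch. The printf command merely stands in for the grep -Eo call and prints one sample match; the counter variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for one line of grep -Eo output (sample value from the question).
match='start="20231010073000 +0100"'

# Unquoted command substitution: the result is split on $IFS.
split_count=0
for word in $(printf '%s\n' "$match"); do
    split_count=$((split_count + 1))
    printf 'unquoted word %d: %s\n' "$split_count" "$word"
done

# Quoted command substitution: the result stays in one piece.
quoted_count=0
for word in "$(printf '%s\n' "$match")"; do
    quoted_count=$((quoted_count + 1))
    printf 'quoted word %d: %s\n' "$quoted_count" "$word"
done
```

The unquoted loop runs twice and the quoted loop runs once. Quoting alone is not a general fix, though: with several matches, the quoted form would hand all of them to the loop as a single multi-line string.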

There are many ways to fix this, for example:

while IFS= read -r StartDate; do
    echo "$StartDate"
done < <(grep -Eo -- 'start="[0-9]{14} [+-][0-9]{4}"' "$filepath")

(I use the form: loop < <( command generating the input ) instead of the form: command generating the input | loop so that the loop runs in the current shell rather than in a subshell, as would be the case in bash when the lastpipe option is not enabled. This is not always necessary, but it is useful to know, for example, if you want to use a variable set inside the loop after the loop has finished: in a subshell, the value disappears when the subshell exits and can’t be retrieved in the current shell.)
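
The difference can be sketched as follows, assuming bash with its default settings (lastpipe not enabled). The input lines and the variable names last_piped and last_procsub are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical two-line input standing in for the grep output.
input=$'first match\nsecond match'

# Pipeline form: the loop runs in a subshell, so the assignment
# made inside it is lost once the pipeline ends.
last_piped='unset'
printf '%s\n' "$input" | while IFS= read -r line; do
    last_piped=$line
done

# Process substitution form: the loop runs in the current shell,
# so the assignment is still visible afterwards.
last_procsub='unset'
while IFS= read -r line; do
    last_procsub=$line
done < <(printf '%s\n' "$input")

printf 'after pipeline: %s\n' "$last_piped"
printf 'after process substitution: %s\n' "$last_procsub"
```

Under those assumptions, the first variable still holds its initial value after the loop, while the second holds the last line that was read.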

Answered By: Olivier Dulac

Since you are dealing with XML, we should really be using an XML parser to get the attribute’s value.

The following shows how to get the start attribute’s value from any item node in the entire document using xmlstarlet:

$ xmlstarlet select --template --value-of '//item/@start' --nl file
20231010073000 +0100

Or, using abbreviated option names:

$ xmlstarlet sel -t -v '//item/@start' -n file
20231010073000 +0100

If there are several item nodes, and you only want the value from the start attribute of the first one, use //item[1]/@start in the XPath query.

You may then transfer the result into a variable with a standard command substitution:

start=$( xmlstarlet sel -t -v '//item[1]/@start' file )

(I dropped the -n option from the above command as it’s no longer needed. It adds a newline character to the end of the output, but the command substitution would remove it.)

Or, you can read them all into a bash array with readarray:

readarray -t startarray < <(
    xmlstarlet sel -t -v '//item/@start' -n file
)

and then loop over it (for start in "${startarray[@]}"; do ...; done) or loop over the output of xmlstarlet directly:

while IFS= read -r start; do
   # ...
done < <( xmlstarlet ...as above... )
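
The readarray pattern works with any command that prints one value per line. As a rough sketch that runs without xmlstarlet installed, printf stands in for it below and the two values are sample data; readarray itself needs bash 4 or later.

```shell
#!/usr/bin/env bash
# printf is a stand-in for the xmlstarlet command; the values are samples.
readarray -t startarray < <(
    printf '%s\n' '20231010073000 +0100' '20231010100000 +0100'
)

printf 'read %d values\n' "${#startarray[@]}"

# Quoting "${startarray[@]}" keeps each element intact, space included.
for start in "${startarray[@]}"; do
    printf 'start: %s\n' "$start"
done
```

Each array element keeps its embedded space because the -t option only strips the trailing newline and the expansion in the for loop is quoted.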

Answered By: Kusalananda

Don’t use grep or regular expressions to parse HTML/XML. You cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like xidel, xmlstarlet, or xmllint if you need a quick one-off from a command-line shell. Never accept a job if you don’t have access to proper tools.


The most advanced command-line tool around is xidel. Its syntax is much more intuitive and modern than that of xmlstarlet or xmllint, and it supports XPath 3 where the other tools are stuck with the more limited XPath 1. See:

xidel -e '//item/@start' -s file.xml
20231010073000 +0100
  • -e for XPath expression
  • -s for silent (no status information)

The query language is XPath, which is very useful in many cases for parsing XML/HTML.


XPath tutorials:

https://developer.mozilla.org/en-US/docs/Web/XPath
http://www.w3schools.com/xpath/xpath_functions.asp
http://stackoverflow.com/tags/xpath/info
https://topswagcode.com/xpath/ (interactive XPath game, when you have the basics and want to practice interactively)

Answered By: Gilles Quénot

If you cannot install extra dependencies on the system to properly parse the XML, then I would write a script that handles the parsing a bit more elegantly instead of trying to do it all in one line.

Here is an example script that parses those times out of the line you provided.

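#!/usr/bin/env bash

INPUT_FILE="$1"
# A run of digits, one space, a sign, then the zone offset digits.
TIME_FILTER='[0-9]+ [+-][0-9]+'

# Print the value of the start attribute found in the given line.
__getStart(){
  line="$1"
  printf '%s\n' "$line" | grep -Eo "start=\"${TIME_FILTER}\"" | grep -Eo "$TIME_FILTER"
}

# Print the value of the stop attribute found in the given line.
__getStop(){
  line="$1"
  printf '%s\n' "$line" | grep -Eo "stop=\"${TIME_FILTER}\"" | grep -Eo "$TIME_FILTER"
}

# Process the input file one line at a time.
while IFS= read -r line; do
    start_time="$(__getStart "$line")"
    stop_time="$(__getStop "$line")"
    echo "Start Time: ${start_time}"
    echo "Stop Time: ${stop_time}"
done < "$INPUT_FILE"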

You can use the script like this:

[/var/tmp] $ ./get-dates.sh date-extraction.xml 
Start Time: 20231010073000 +0100
Stop Time: 20231010100000 +0100
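
If even the grep pipelines feel heavy, a similar extraction can be sketched with bash's built-in regular expression matching ([[ ... =~ ... ]] and BASH_REMATCH). The helper name extract_attr and the hard-coded sample line are assumptions for illustration. Like the script above, this is pattern matching on text, not real XML parsing.

```shell
#!/usr/bin/env bash
# extract_attr NAME LINE: print the zoned timestamp held by attribute NAME.
# (extract_attr is a hypothetical helper, not part of the original script.)
extract_attr() {
    local name=$1 line=$2
    # 14 digits, one space, a sign, 4 digits, all inside double quotes.
    local re="${name}=\"([0-9]{14} [+-][0-9]{4})\""
    if [[ $line =~ $re ]]; then
        # Capture group 1 is the value without the surrounding quotes.
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}

# Sample line taken from the question.
sample='<item start="20231010073000 +0100" stop="20231010100000 +0100">...</item>'
start_time=$(extract_attr start "$sample")
stop_time=$(extract_attr stop "$sample")
printf 'Start Time: %s\nStop Time: %s\n' "$start_time" "$stop_time"
```

This avoids spawning grep for every line, at the cost of being specific to bash.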

Answered By: bitmvr