Can I extract complete dates from file with grep command?
I need help using grep to extract a zoned date time from a file on a Linux system.
The source file is an XML file with the data below:
<item start="20231010073000 +0100" stop="20231010100000 +0100">...</item>
And I need to extract the complete start date, but with grep I can’t get it as a complete result. My code:
for startDate in $(grep -Eo 'start="[0-9]{14} [+|-][0-9]{4}"' "$filepath" ); do
echo "$startDate"
done
And I get it in two different results:
start="20231010073000
+0100"
Can I get it as below?
start="20231010073000 +0100"
I’ve tried with \s, [[:space:]], and other examples, but with the same result.
It seems to be an error in my code, but I can’t fix it!
I am thankful for any kind of help!!!
The problem lies in your loop: it will, by default, split the output of grep on $IFS (so, with the default value of $IFS: on any sequence of space, tab, or newline characters, and it will also discard the leading and trailing ones).
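A minimal sketch of that splitting, assuming the default $IFS and using a hard-coded sample string (the variable names here are illustrative, not from the question):

```shell
# Sample value, hard-coded for illustration: one match as grep would print it.
match='start="20231010073000 +0100"'

# Unquoted expansion: the shell splits the value on $IFS (here, at the space).
set -- $match
unquoted_count=$#

# Quoted expansion: the value stays in one piece.
set -- "$match"
quoted_count=$#

printf 'unquoted: %s word(s), quoted: %s word(s)\n' "$unquoted_count" "$quoted_count"
```

With the default $IFS this reports 2 words for the unquoted expansion and 1 for the quoted one, which is the same split the for loop performs on the output of grep.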
There are many ways to fix this, for example:
while IFS= read -r StartDate; do
echo "$StartDate"
done < <(grep -Eo -- 'start="[0-9]{14} [+-][0-9]{4}"' "$filepath")
(I use the "loop < <( command generating the input )" form instead of the "command generating the input | loop" form so that the loop runs in the current shell and not in a subshell, as would be the case with the bash shell when the lastpipe option is not enabled. This is not always necessary, but it is useful to know, for example, if you want to see the latest value of $StartDate after the loop: if the loop ran in a subshell, the value disappears when the subshell ends and can’t be retrieved in the current shell.)
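The difference can be sketched as follows, assuming bash with lastpipe explicitly switched off (its default); the sample lines and the names last, after_pipe and after_procsub are made up for the illustration:

```shell
#!/usr/bin/env bash
shopt -u lastpipe   # make the default explicit: last pipeline stage runs in a subshell

# Pipe form: the loop runs in a subshell, so its assignment to "last" is lost.
last=''
printf '%s\n' 'a b' 'c d' | while IFS= read -r line; do last=$line; done
after_pipe=$last

# Process substitution form: the loop runs in the current shell.
last=''
while IFS= read -r line; do last=$line; done < <(printf '%s\n' 'a b' 'c d')
after_procsub=$last

printf 'after pipe: [%s]\nafter process substitution: [%s]\n' "$after_pipe" "$after_procsub"
```

Under those assumptions, after_pipe stays empty while after_procsub holds the last line read.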
Since you are dealing with XML, we should really be using an XML parser to get the attribute’s value.
The following shows how to get the start attribute’s value from any item node in the entire document using xmlstarlet:
$ xmlstarlet select --template --value-of '//item/@start' --nl file
20231010073000 +0100
Or, using abbreviated option names:
$ xmlstarlet sel -t -v '//item/@start' -n file
20231010073000 +0100
If there are several item nodes, and you only want the value from the start attribute of the first one, use //item[1]/@start in the XPath query.
You may then transfer the result into a variable with a standard command substitution:
start=$( xmlstarlet sel -t -v '//item[1]/@start' file )
(I dropped the -n option from the above command as it’s no longer needed. It adds a newline character to the end of the output, but the command substitution would remove it.)
Or, you can read them all into a bash array with readarray:
readarray -t startarray < <(
xmlstarlet sel -t -v '//item/@start' -n file
)
and then loop over it (for start in "${startarray[@]}"; do ...; done) or loop over the output of xmlstarlet directly:
while IFS= read -r start; do
# ...
done < <( xmlstarlet ...as above... )
Don’t use grep or regex to parse HTML/XML
You cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like xidel, xmlstarlet or xmllint if you need a quick shot from a command line shell. Never accept a job if you don’t have access to proper tools.
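One way to see why line-oriented matching is fragile: XML allows single-quoted attribute values and whitespace around the equals sign, and the pattern from the question only copes with one spelling. A small sketch, using made-up sample strings and variable names:

```shell
# The pattern from the accepted fix, which expects start="..." exactly.
pattern='start="[0-9]{14} [+-][0-9]{4}"'

# Three equivalent ways to write the same attribute in XML (samples are invented).
xml_double='<item start="20231010073000 +0100"/>'
xml_single="<item start='20231010073000 +0100'/>"
xml_spaced='<item start = "20231010073000 +0100"/>'

# grep -c prints the number of matching lines; "|| true" hides its
# non-zero exit status when nothing matches.
hits_double=$(printf '%s\n' "$xml_double" | grep -Ec -- "$pattern" || true)
hits_single=$(printf '%s\n' "$xml_single" | grep -Ec -- "$pattern" || true)
hits_spaced=$(printf '%s\n' "$xml_spaced" | grep -Ec -- "$pattern" || true)

printf 'double: %s, single: %s, spaced: %s\n' "$hits_double" "$hits_single" "$hits_spaced"
```

Only the first spelling matches. An XML parser treats all three as the same attribute.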
The most advanced command line tool around is xidel. Its syntax is much more intuitive/modern than that of xmlstarlet or xmllint (and it supports XPath 3, where the other tools are stuck with the limited XPath 1), see:
xidel -e '//item/@start' -s file.xml
20231010073000 +0100
-e for XPath expression
-s for silent (no status information)
The query language is XPath, and it’s very useful in many cases to parse XML/HTML.
XPath tutorials:
https://developer.mozilla.org/en-US/docs/Web/XPath
http://www.w3schools.com/xpath/xpath_functions.asp
http://stackoverflow.com/tags/xpath/info
https://topswagcode.com/xpath/ (interactive XPath game, for when you have the basics and want to practice interactively)
If you cannot install extra dependencies on the system to properly parse the XML, then I would write a script that handles the parsing a bit more elegantly rather than trying to do it all on one line.
Here is an example script that parses those times out of the line you provided.
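#!/usr/bin/env bash
INPUT_FILE="$1"
# Digits, one whitespace character, a sign, then the offset digits
# (\s is a GNU grep extension)
TIME_FILTER='[0-9]*\s(\+|-)[0-9]*'
# Print the value of the start attribute found in the given line
__getStart(){
  line="$1"
  echo "$line" | grep -Eo "start=\"${TIME_FILTER}\"" | grep -Eo "$TIME_FILTER"
}
# Print the value of the stop attribute found in the given line
__getStop(){
  line="$1"
  echo "$line" | grep -Eo "stop=\"${TIME_FILTER}\"" | grep -Eo "$TIME_FILTER"
}
# Read the input file line by line and report both times for each line
while IFS= read -r line; do
  start_time="$(__getStart "$line")"
  stop_time="$(__getStop "$line")"
  echo "Start Time: ${start_time}"
  echo "Stop Time: ${stop_time}"
done < "$INPUT_FILE"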
You can use the script in this fashion:
[/var/tmp] $ ./get-dates.sh date-extraction.xml
Start Time: 20231010073000 +0100
Stop Time: 20231010100000 +0100
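If spawning two grep processes per attribute is a concern, the same extraction can be sketched with bash's built-in regex matching instead. This is a variation on the script above, not part of it; get_attr is a hypothetical helper name and the sample line is copied from the question:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of attribute $1 found in line $2.
# Like the script above, this is pattern matching, not real XML parsing.
get_attr() {
  local name=$1 line=$2
  # Capture "digits, space, sign, digits" between the attribute's double quotes.
  local re="${name}=\"([0-9]+ [+-][0-9]+)\""
  if [[ $line =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
}

line='<item start="20231010073000 +0100" stop="20231010100000 +0100">...</item>'
start_time=$(get_attr start "$line")
stop_time=$(get_attr stop "$line")

echo "Start Time: ${start_time}"
echo "Stop Time: ${stop_time}"
```

The regex is kept in a variable and expanded unquoted on the right-hand side of =~ so that bash treats it as a pattern rather than a literal string.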