Sed to print only first pattern match of the line

I have some data like

<td><a href="data1">abc</a> ... <a href="data2">abc</a> ... <a href="data3">abc</a>

(I'll refer to the above line as data in the code below.)

I need data1 in between the first " and " so I do

echo 'data' | sed 's/.*"\(.*\)".*/\1/'

but it always returns the last string in between " and ", i.e. in this case it returns data3 instead of data1

In order to get data1, I end up doing

echo 'data' | sed 's/.*"\(.*\)".*".*".*".*".*/\1/'

How do I get data1 in sed without this much redundancy?

Asked By: GypsyCosmonaut

||

The .* in the regex pattern is greedy: it matches as long a string as it can, so the quotes that end up matched are the last ones on the line.
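To see the greediness in isolation, here is a small sketch on a made-up input with two quoted strings (the sample text and the variable names are illustrative, not taken from the question). Note that in sed's basic regular expressions the capture group is written \( \) and referenced as \1.

```shell
# Sample input with two quoted strings (illustrative only).
input='... "foo" ... "bar" ...'

# The leading .* is greedy: it swallows everything up to the LAST pair of
# quotes that still lets the rest of the pattern match.
greedy=$(printf '%s\n' "$input" | sed 's/.*"\(.*\)".*/\1/')

printf '%s\n' "$greedy"   # prints: bar
```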

Since the separator is only one character here, we can use a negated bracket expression to match anything but a quote, i.e. [^"], and then repeat it to match a run of characters that aren’t quotes.

$ echo '... "foo" ... "bar" ...' | sed 's/[^"]*"\([^"]*\)".*/\1/'
foo

Another way would be to just remove everything up to the first quote, then remove everything starting from the (new) first quote:

$ echo '... "foo" ... "bar" ...' | sed 's/^[^"]*"//; s/".*$//'
foo
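As a quick check, both approaches can be run against the line from the question. This is only a sketch: the variable names line, first_a and first_b are made up for illustration, and the "..." parts are kept literally as they appear in the question.

```shell
# The line from the question, stored in a shell variable.
line='<td><a href="data1">abc</a> ... <a href="data2">abc</a> ... <a href="data3">abc</a>'

# Approach 1: skip the leading non-quotes, then capture the run of
# non-quote characters between the first two quotes.
first_a=$(printf '%s\n' "$line" | sed 's/[^"]*"\([^"]*\)".*/\1/')

# Approach 2: delete everything through the first quote, then delete
# everything from the next quote onward.
first_b=$(printf '%s\n' "$line" | sed 's/^[^"]*"//; s/".*$//')

printf '%s\n%s\n' "$first_a" "$first_b"   # prints data1 twice
```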

In Perl regexes, the * and + quantifiers can be made non-greedy by appending a question mark, so .*? would match anything, but as few characters/bytes as possible.

Answered By: ilkkachu

I won’t bore you with the classic warning against using simple regular expressions to parse HTML. Suffice it to say that you should use a dedicated parser instead. That said, the issue here is that sed uses greedy matching, so each .* matches the longest possible string. This means that your leading .* swallows everything up to the last pair of quotes, leaving only data3 for the capture group.

You could do this in sed (see below), but using a tool that allows non-greedy matches would be simpler:

$ perl -pe 's/.*?"(.*?)".*/$1/' file
data1

Since sed doesn’t support non-greedy matches, you need some other trickery. The simplest would be to use the “not quotes” approach in ilkkachu’s answer. Here’s an alternative:

$ rev file | sed 's/.*"\(.*\)".*/\1/' | rev
data1

This just reverses the file (rev), uses your original approach which now works since the 1st occurrence is now the last, and then reverses the file back again.

Answered By: terdon

You can also do a non-greedy search using the lookahead and lookbehind of Perl regular expressions:

cat data | grep -Po '(?<=href=").*?(?=")' | head -n1
Answered By: Ravexina

Here are a couple of ways you could pull out data1 from your input:

grep -oP '^[^"]*"\K[^"]*'

sed -ne '
   /\n/!{y/"/\n/;D;}
   P
'

perl -lne '/"([^"]*)"/ and print($1),last'
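The sed variant is terse, so here is one reading of it with comments, run against the line from the question. This is a sketch of how the script appears to work, not the answerer's own explanation. It assumes a sed that accepts \n in the y command (as POSIX specifies) and comments inside the script; the variable names are illustrative.

```shell
line='<td><a href="data1">abc</a> ... <a href="data2">abc</a> ... <a href="data3">abc</a>'

first=$(printf '%s\n' "$line" | sed -n '
  # First pass over a line: the pattern space holds no newline yet.
  /\n/!{
    # Turn every double quote into a newline ...
    y/"/\n/
    # ... then delete through the first newline and restart the cycle
    # on what is left, without reading a new input line.
    D
  }
  # Second pass: print up to the first newline, which is the text that
  # stood between the first two quotes.
  P
')

printf '%s\n' "$first"   # prints: data1
```

A line with fewer than two quotes prints nothing, because the second pass never finds a newline in the pattern space.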
Answered By: user218374

Although the question is not tagged with awk, why not use it? It’s as simple as this:

awk -F\" '{print $2}' infile.txt
Answered By: αғsнιη

If you want to position sed at the first of multiple matches in a line, first mark that match so it can be told apart from the others:

echo abcmatchdefmatchghimatchjkl |
sed -e "s/match/m#1#atch/" \
 -e "s/^.*m#1#atch/match/"
gives the result
matchdefmatchghimatchjkl

This can be modified for e.g. the second match:

echo abcmatchdefmatchghimatchjkl | sed \
 -e "s/match/m#1#atch/" \
 -e "s/match/m#2#atch/" \
 -e "s/^.*m#2#atch/match/"

gives the result

matchghimatchjkl
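One caveat with this trick is that the marker text must not already occur in the input. A minimal sketch that wraps the first-match case in a function and uses a control character as the marker follows; the function name from_first_match is hypothetical, and it assumes the byte \001 never appears in the data and that the search word contains no regex metacharacters or slashes.

```shell
# Hypothetical helper: print each input line from its first occurrence
# of the word given as $1 onward.
from_first_match() {
  # Assumption: the byte \001 does not occur anywhere in the input.
  marker=$(printf '\001')
  # Replace the first match with the marker, then cut everything up to
  # and including the marker and put the word back.
  sed -e "s/$1/$marker/" -e "s/^.*$marker/$1/"
}

echo abcmatchdefmatchghimatchjkl | from_first_match match
# prints: matchdefmatchghimatchjkl
```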
Answered By: franz muell