Changing non-standard date timestamp format in CSV using awk/sed

I have a csv with a few hundred thousand lines and I’m trying to change the date format in the second field. I should also add the second field is sometimes not populated at all.
The deplorable input format is DayofWeek MonthofYear DayofMonth Hour:Minute:Second Timezone Year

Example:

Mon Jul 03 14:48:54 EDT 2023

My desired output format is YYYY-MM-DD HH:MM:SS
Example:

2023-07-03 14:48:54

I am familiar with sed, so I got this sed regex replace line to get it in almost the right format, but the month not being a number is an issue.

sed -E "s/[A-Za-z]{3}\s([A-Za-z]{3})\s([0-9]{2})\s([0-9]{2}:[0-9]{2}:[0-9]{2})\s[A-Z]+\s([0-9]{4})/\4-\1-\2 \3/"

I don’t think it’s possible to run the date command inside the sed replace section using capture group 1 (but please correct me if I’m wrong).

I don’t know how to go about referencing the month and parsing it with the date command once the sed command finishes, and I think it would be better to do the processing without piping the entire output to another command.
This command is just one in a long line of piped commands for formatting the rest of the data.

It seems that maybe awk can do the entire formatting all at once, but I don’t really know how to use awk that well.

What’s the most efficient way to get the timestamp into the correct format?

Just to address some of the comments with more background info:

This data is generated by an app that outputs csv log data to a file.
It is not my app and there is no configuration control over how the app logs.
The CSV is unquoted (even if data in the field contains spaces) and empty fields contain nothing.

I am loading the csv data directly into a mysql database. While timezone would be a good idea generally, this data is always timestamped with the local time and when visualizing the data (grafana), I have no need to store it in UTC then convert to EDT just for viewing (why convert the time to UTC just to convert it back to EDT). Plus, each csv line contains longitude and latitude (so if I wanted to go back and change the timestamp to UTC, it wouldn’t be impossible to figure out what local time was).

The additional formatting I am doing is not much, and probably could be done with awk (again, I am not too familiar with the syntax there).
It doesn’t help that the original data needs an ID column added and quotes put around some fields, and that there are two date-time fields in TWO different formats.
So my long and terrible pipe line generally looks like this:

cat file | add ID column | format timestamp in second csv field | format timestamp in third csv field | quote any field with spaces | replace empty fields with \N > output file

I had some trouble with mysql and empty fields, so I added the explicit \N null marker.
There are definitely better ways to do this; once I get the whole process working I’ll go back through and simplify.

I very much appreciate everyone’s responses.

Asked By: ehammer

||

If efficiency is a concern, it is better to use a scripting language, as there will be no excessive calling of external commands.

Here is an example Python script, for reference only:

from datetime import datetime
import re
import csv


def convert_datetime(dt):
    # as `EDT` isn't in zoneinfo, the timezone abbreviation needs to be removed
    date_string = re.sub(r"(\w+ \w+ \d+ \d+:\d+:\d+) \w+ (\w+)", r"\1 \2", dt)
    date_obj = datetime.strptime(date_string, "%a %b %d %H:%M:%S %Y")
    return date_obj.strftime("%Y-%m-%d %H:%M:%S")


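# newline="" is what the csv module documentation recommends for files passed to reader/writer
with open("original.csv", "r", newline="") as infile, open("processed.csv", "w", newline="") as outfile: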
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader, None)
    if header:
        writer.writerow(header)
    for row in reader:
        # convert datetime in the second field
        try:
            row[1] = convert_datetime(row[1])
        except ValueError:
            pass
        writer.writerow(row)
Answered By: memchr

You could do something like:

LC_ALL=C sed '
  s/$/;Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/
  s/[A-Z][a-z][a-z] \([A-Z][a-z][a-z]\) \([0-9][0-9]\) \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [A-Z]\{3,\} \([0-9]\{4\}\)\(.*;.*\1\([01][0-9]\)[^;]*\)$/\4-\6-\2 \3\5/
  s/;[^;]*$//'

Where we first append a month-name-to-number translation table at the end of the line (separated with a ;), then look up the number for a given month name by using back references (for which we need BREs, not EREs¹) using a ...\([A-Z][a-z][a-z]\)...;.*\1\([01][0-9]\)... regexp, so the \1 back references the month name captured in the text, and the two digits following that end up in \6.

Then we remove our translation table.
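To make the lookup easier to follow, here is a small sketch that stops after the second substitution so the appended table is still visible, then strips it in a separate step. The sample line (an ID, the timestamp, and a trailing `foo` field) is made up for illustration; the real log's other columns are not shown in the question.

```shell
# Hypothetical sample line, not taken from the real log.
line='1,Mon Jul 03 14:48:54 EDT 2023,foo'

# Steps 1 and 2 only: append the table, then rewrite the timestamp via the \1 lookup.
intermediate=$(printf '%s\n' "$line" | LC_ALL=C sed '
  s/$/;Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/
  s/[A-Z][a-z][a-z] \([A-Z][a-z][a-z]\) \([0-9][0-9]\) \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [A-Z]\{3,\} \([0-9]\{4\}\)\(.*;.*\1\([01][0-9]\)[^;]*\)$/\4-\6-\2 \3\5/')

# Step 3: drop everything from the last ";" onwards, i.e. the table.
final=$(printf '%s\n' "$intermediate" | sed 's/;[^;]*$//')

printf '%s\n' "$intermediate"
# 1,2023-07-03 14:48:54,foo;Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12
printf '%s\n' "$final"
# 1,2023-07-03 14:48:54,foo
```

A line whose second field is empty simply fails to match the second substitution, so it only gains and then loses the table and comes out unchanged.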

If there may be more than one timestamp to be converted per line, change it to:

LC_ALL=C sed '
  s/$/;Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/
  :1
    s/[A-Z][a-z][a-z] \([A-Z][a-z][a-z]\) \([0-9][0-9]\) \([0-2][0-9]:[0-5][0-9]:[0-5][0-9]\) [A-Z]\{3,\} \([0-9]\{4\}\)\(.*;.*\1\([01][0-9]\)[^;]*\)$/\4-\6-\2 \3\5/
  t1
  s/;[^;]*$//'

where t1 branches to the :1 label only if there has been a successful substitution, which is a common way to implement loops in sed.

For arbitrary header-less CSVs, only reformatting the first field:

mlr --csv -N put '$1 = strftime(strptime($1, "%a %b %d %H:%M:%S %Z %Y"), "%F %T")'

(adapted from @Kusalananda’s answer to How can I transform dates expressed with month names to numerical month designations?).

Miller’s strptime() complains if the timestamp cannot be decoded but apparently not in the case where the field is empty.

%Z is not among the directives recognised by standard strptime(), but the GNU implementation at least recognises and ignores it (and consumes \s*\S* from the input; there’s nothing it could do with it given those EDT and co have different meaning over time and for different people).


¹ though some sed implementations, including GNU sed which you may be using as you’re using the \s GNUism, support them for EREs as well as an extension over the standard.

Answered By: Stéphane Chazelas

You did mention:

I’m trying to change the date format in the second field. I should also add the second field is sometimes not populated at all.

The following awk script will satisfy the requirement. Save it as date.awk (thanks to @EdMorton for the nitpicks) :

BEGIN {
  FS = OFS = ","
  months = "JanFebMarAprMayJunJulAugSepOctNovDec" 
}

$2 != "" {
  split($2, date, / /)
  month = sprintf("%02d", (index(months, date[2]) + 2) / 3)
  $2 = sprintf("%04d-%02d-%02d %s", date[6], month, date[3], date[4])
}

1

then execute awk using the script:

awk -f date.awk input.csv
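The month lookup is the least obvious part of the script, so here is a minimal sketch of just that arithmetic. The helper name `month_number` is made up for this illustration. Each abbreviation occupies three characters of the `months` string, so `index()` returns 1 for Jan, 4 for Feb, and so on up to 34 for Dec; adding 2 and dividing by 3 maps those positions to 1 through 12.

```shell
# month_number is a hypothetical helper, used only to show the index() arithmetic.
month_number() {
  awk -v name="$1" 'BEGIN {
    months = "JanFebMarAprMayJunJulAugSepOctNovDec"
    # index() is 1-based: Jan -> 1, Feb -> 4, ..., Dec -> 34
    printf "%02d\n", (index(months, name) + 2) / 3
  }'
}

month_number Jan   # 01
month_number Jul   # 07
month_number Dec   # 12
```

An abbreviation that is not in the string makes `index()` return 0, which this sketch prints as 00, so an unexpected month name would show up in the output rather than abort the run.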

Original Answer

You can easily change date formats using date command. For instance:

$ date -d "Mon Jul 03 14:48:54 EDT 2023" +"%Y-%m-%d %H:%M:%S"
2023-07-03 14:48:54

Then, you can use awk to convert only a specific column (in this example, $1):

awk 'BEGIN {FS=OFS=","} {cmd = "date -d \"" $1 "\" +\"%Y-%m-%d %H:%M:%S\""; cmd | getline res; close(cmd); $1=res; print}' file.csv

The result will be in your local time, so if you want to convert the timezone, then simply add TZ=EDT (or any timezone) before date.

However, as @StéphaneChazelas mentioned in the comments, this is vulnerable to command injection if a field in some line contains a malicious command, and it will run slowly because it needs to execute sh and date for each line.
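To see where the injection risk comes from, the sketch below only prints the command string that awk would hand to sh; it does not run it. The helper name `build_cmd` and the sample field value are made up for illustration.

```shell
# build_cmd is a hypothetical helper: it prints the string awk would pass to sh,
# built the same way as in the one-liner above, without executing anything.
build_cmd() {
  awk -v field="$1" 'BEGIN { print "date -d \"" field "\" +\"%F %T\"" }'
}

# A field that contains a double quote closes the quoted date argument early.
build_cmd 'x"; echo INJECTED; "'
# date -d "x"; echo INJECTED; "" +"%F %T"
```

The shell would read that output as three separate commands, the middle one being whatever the field's author chose, which is why data from an untrusted source should not be spliced into a command line this way.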

Answered By: annahri

With GNU sed, you can use the s///e modifier to execute the resultant string:

s/.*/date -d "&" +"%F %T"/e

Better than that, though, is to use GNU date’s -f flag, which itself can process lines of input, rather than spawning a new process for each line:

$ TZ=UTC0 date -f /dev/stdin +'%F %T' <<<$'Mon Jul 03 14:48:54 EDT 2023\nTue, 04 Jul 2023 11:30:45 +0100'
2023-07-03 18:48:54
2023-07-04 10:30:45

This is also much safer if you’re not in a position to trust the input.

Answered By: Toby Speight