Fetch data from a large file based on time and date recursively
Fetch data based on date(column 1) and time (column 2). Each date has time in column 2. Based on each date in column1 two files are to be generated which will have all the fields based on time 6:00 to 21:50:08 (day) and 22:00:00 to 5:50:00 (night). Trying to get two files for each date as date_day and date_night based on the times specified.
Input file:
Date Time R1 R2 R3
03/10/2023 19:00:08 19.06 39.870 5.12
03/10/2023 19:10:08 18.87 39.970 4.98
03/10/2023 19:20:08 18.68 39.940 4.80
03/10/2023 19:30:08 18.84 40.110 5.01
03/10/2023 19:40:08 18.89 38.960 4.64
03/10/2023 19:50:08 18.60 39.100 4.43
03/10/2023 23:30:08 18.03 34.200 2.03
03/10/2023 23:40:08 17.94 33.930 1.84
03/10/2023 23:50:08 17.87 33.840 1.74
03/11/2023 00:00:08 17.75 33.790 1.61
03/11/2023 00:10:08 17.96 34.060 1.91
03/11/2023 00:20:08 18.13 33.690 1.91
03/11/2023 00:30:08 17.91 33.620 1.68
Output files:
03/10/2023_day
Date Time R1 R2 R3
03/10/2023 19:00:08 19.06 39.870 5.12
03/10/2023 19:10:08 18.87 39.970 4.98
03/10/2023 19:20:08 18.68 39.940 4.80
03/10/2023 19:30:08 18.84 40.110 5.01
03/10/2023 19:40:08 18.89 38.960 4.64
03/10/2023_night
Date Time R1 R2 R3
03/10/2023 19:50:08 18.60 39.100 4.43
03/10/2023 23:30:08 18.03 34.200 2.03
03/10/2023 23:40:08 17.94 33.930 1.84
03/10/2023 23:50:08 17.87 33.840 1.74
03/11/2023_night
:
Date Time R1 R2 R3
03/11/2023 00:00:08 17.75 33.790 1.61
03/11/2023 00:10:08 17.96 34.060 1.91
03/11/2023 00:20:08 18.13 33.690 1.91
03/11/2023 00:30:08 17.91 33.620 1.68
I tried the following to fetch day and night files. This I will have to do for each date again and again. I removed the ‘:’ from the time column in my code to get a number without :. Can anyone please help to put it in a loop and the separate day and night files for each date?
awk '$1 ~ /03/10/2023/ && $2 >= 060000 && $2 <= 215000' data |sed 's/t/,/g' > 03_10_23_day.csv
awk '$1 ~ /03/10/2023/ && $2 > 215000' data |sed 's/t/,/g' > 03_10_23_night.csv
You want string comparisons rather than number comparisons in awk
and generate all output files in one awk
invocation. Using a loop wouldn’t make sense here:
awk -v OFS=, '
{$1 = $1} # force reformatting with comma delimiters
NR == 1 {header = $0; next}
{
split($1, f, "/")
outfile = f[1] "_" f[2] "_" substr(f[3], 3) "_"
($2 >= "06:00:00" && $2 < "22:00:00" ? "day" : "night")
".csv"
if (!seen[outfile]++) print header > outfile
print > outfile
}' < data
($2 >= "06" && $2 < "22"
would also work here)
I’d suggest naming your files 2023-10-03-night.csv
rather than 03_10_23_night.csv
(assuming it’s meant to be the third of October rather than the tenth of March) which would mean ls
for instance would show them in chronological order and that 2023-10-03
is an international standard unambiguous format.
Using Raku (formerly known as Perl_6)
#OUTPUT A SPECIFIED 'TIME-OF-DAY' RANGE FOR ALL DATES IN FILE:
~$ raku -e 'my $hdr = get; my @a = lines.map: *.split(" ");
my @b = do for @a { .[0..1].join("T").subst(/ (d**2) / (d**2) / (d**4) /, {"$2-$0-$1"} ).DateTime, .[2..*] };
put $hdr; for @b {
my $start = .[0].truncated-to("day") + Duration.new(21600);
my $stop = .[0].truncated-to("day") + Duration.new(71408);
put $_ if $_.[0] ~~ $start ..^ $stop };' file
Above (first answer) is an approach using Raku, a member of the Perl-family of programming languages. An advantage of using Raku is that ISO-8601
DateTimes are built-in. Above filters lines of input, giving output within a defined $start .. $stop
range. The ..^
range operator (with caret) excludes the RHS timepoint from output.
#OUTPUT A 'TIME-OF-DAY' RANGE FOR A SPECIFIED DATE IN FILE:
~$ raku -e 'my $target_date = DateTime.new("2023-03-10");
say $target_date; my $hdr = get;
my @a = lines.map: *.split(" "); my @b = do for @a { .[0..1].join("T").subst(/ (d**2) / (d**2) / (d**4) /, {"$2-$0-$1"} ).DateTime, .[2..*] };
put $hdr; for @b {
my $start = $target_date + Duration.new(21600);
my $stop = $target_date + Duration.new(71408);
put $_ if $_.[0] ~~ $start ..^ $stop };' file
Above (second answer) somewhat more specifically you can define a $target_date
and only retain a ‘time-of-day’ range for that Date in output.
Sample Input (OP’s example plus two extra lines added at end):
Date Time R1 R2 R3
03/10/2023 19:00:08 19.06 39.870 5.12
03/10/2023 19:10:08 18.87 39.970 4.98
03/10/2023 19:20:08 18.68 39.940 4.80
03/10/2023 19:30:08 18.84 40.110 5.01
03/10/2023 19:40:08 18.89 38.960 4.64
03/10/2023 19:50:08 18.60 39.100 4.43
03/10/2023 23:30:08 18.03 34.200 2.03
03/10/2023 23:40:08 17.94 33.930 1.84
03/10/2023 23:50:08 17.87 33.840 1.74
03/11/2023 00:00:08 17.75 33.790 1.61
03/11/2023 00:10:08 17.96 34.060 1.91
03/11/2023 00:20:08 18.13 33.690 1.91
03/11/2023 00:30:08 17.91 33.620 1.68
03/12/2023 19:00:08 19.06 39.870 5.12
03/12/2023 19:10:08 18.87 39.970 4.98
Sample Output (1):
Date Time R1 R2 R3
2023-03-10T19:00:08Z 19.06 39.870 5.12
2023-03-10T19:10:08Z 18.87 39.970 4.98
2023-03-10T19:20:08Z 18.68 39.940 4.80
2023-03-10T19:30:08Z 18.84 40.110 5.01
2023-03-10T19:40:08Z 18.89 38.960 4.64
2023-03-12T19:00:08Z 19.06 39.870 5.12
2023-03-12T19:10:08Z 18.87 39.970 4.98
Sample Output (2):
2023-03-10T00:00:00Z
Date Time R1 R2 R3
2023-03-10T19:00:08Z 19.06 39.870 5.12
2023-03-10T19:10:08Z 18.87 39.970 4.98
2023-03-10T19:20:08Z 18.68 39.940 4.80
2023-03-10T19:30:08Z 18.84 40.110 5.01
2023-03-10T19:40:08Z 18.89 38.960 4.64
https://docs.raku.org/language/temporal
https://docs.raku.org/type/DateTime
https://raku.org
Assumptions/Understandings:
- output filenames are to convert dates from using
/
to dates using_
(eg,03/10/2023
becomes03_10_2023
) - output filenames are to be of the format
DD_MM_YYYY_day
orDD_MM_YYY_night
– per expected outputs (ie, for this answer we’ll ignore the.csv
extentions as shown in OP’s sample code) - input/output field delimiters are white space – as per sample inputs/outputs (ie, for this answer we’ll ignore the implied tab/comma delimiters from OP’s
sed s/t/,/g
) - input data is already sorted by date (column #1) and time (column #2)
- time ranges are defined as follows (removing gaps introduced by OP’s definitions):
day
=06:00:00
to21:59:59
(vs OP:06:00:00
to21:50:08
???)night
=22:00:00
to05:59:59
(vs OP:22:00:00
to05:50:00
???)- the
03/10/2023 19:50:08
entry should reside in the resulting_day
file and not the_night
file (as OP has shown in the expected output) - OP wants to place
0[0-5]
(early morning) entries in the date’s_night
file (as opposed to either a) placing in previous date’s_night
file or b) placing in a new_morning
file) - NOTE: if any of these assumptions/understandings are incorrect then OP will need to update the question to provide more details and clarity
Adding a couple more lines to sample input file:
$ cat data
Date Time R1 R2 R3
03/10/2023 03:10:08 19.06 39.870 5.12 # new: to be placed in '_night' file
03/10/2023 05:30:08 18.87 39.970 4.98 # new: to be placed in '_night' file
03/10/2023 19:00:08 19.06 39.870 5.12
03/10/2023 19:10:08 18.87 39.970 4.98
03/10/2023 19:20:08 18.68 39.940 4.80
03/10/2023 19:30:08 18.84 40.110 5.01
03/10/2023 19:40:08 18.89 38.960 4.64
03/10/2023 19:50:08 18.60 39.100 4.43
03/10/2023 23:30:08 18.03 34.200 2.03
03/10/2023 23:40:08 17.94 33.930 1.84
03/10/2023 23:50:08 17.87 33.840 1.74
03/11/2023 00:00:08 17.75 33.790 1.61
03/11/2023 00:10:08 17.96 34.060 1.91
03/11/2023 00:20:08 18.13 33.690 1.91
03/11/2023 00:30:08 17.91 33.620 1.68
NOTE: file does not contain comments
One awk
idea:
awk '
NR==1 { hdr = $0; next } # save header
$1 != prev_dt { close(out_day) # if new date then close output files
close(out_night)
prev_dt = out_dt = $1 # make note of new date
gsub(///,"_",out_dt) # replace "/" with "_"
out_day = out_dt "_day" # define new output file names
out_night = out_dt "_night"
hdr_flag_day = hdr_flag_night = 1 # reset "print header?" flag
}
$2 >= "06:00:00" && # "day"
$2 <= "21:59:59" { if ( hdr_flag_day ) {
print hdr > out_day
hdr_flag_day = 0
}
print $0 > out_day
next
}
{ if ( hdr_flag_night ) { # "night"
print hdr > out_night
hdr_flag_night = 0
}
print $0 > out_night
}
' data
This generates:
$ head 03*20??_[dn]*
==> 03_10_2023_day <==
Date Time R1 R2 R3
03/10/2023 19:00:08 19.06 39.870 5.12
03/10/2023 19:10:08 18.87 39.970 4.98
03/10/2023 19:20:08 18.68 39.940 4.80
03/10/2023 19:30:08 18.84 40.110 5.01
03/10/2023 19:40:08 18.89 38.960 4.64
03/10/2023 19:50:08 18.60 39.100 4.43
==> 03_10_2023_night <==
Date Time R1 R2 R3
03/10/2023 03:10:08 19.06 39.870 5.12
03/10/2023 05:30:08 18.87 39.970 4.98
03/10/2023 23:30:08 18.03 34.200 2.03
03/10/2023 23:40:08 17.94 33.930 1.84
03/10/2023 23:50:08 17.87 33.840 1.74
==> 03_11_2023_night <==
Date Time R1 R2 R3
03/11/2023 00:00:08 17.75 33.790 1.61
03/11/2023 00:10:08 17.96 34.060 1.91
03/11/2023 00:20:08 18.13 33.690 1.91
03/11/2023 00:30:08 17.91 33.620 1.68