KeyError in a get_group in python pandas

I am trying to create a count of patients that are present in a particular ward on the same time (eg 10:00am each day) of each day for a period of time. I’m trying to also include whether that patient was an emergency or elective patient. The data set contains the patients date and time of transfer to each ward and the date and time of transfer out of each ward, and the emergency/elective status of that patient’s admission.

I have approached this by using a nested for loop – creating a dataframe for each combination of ward, emergency/elective status, and saving each of those dataframes to a dictionary to allow querying/extraction later if need be.

The complicating this is that not every ward will have a patient in it at the said time and date, and so when I try and iterate (with the FOR loop) through the combination of wards and emergency status of each patient, python throws a KeyError . for example, there might be no patients in ward 8 at 2021-11-01 10:00, meaning that the when the for loop looks at Ward 8, and gets to 10:00 on 1 Nov are no records in the sub-data frame that I create, which seems to throw a “KeyError”… I’m not sure how to get around this…any suggestions appreciated…

In the specific output below – you’ll see it got to ward 6ANRW, and, as there were no patients in that particular ward from with an elective status at that time, it threw a KeyError.

Example data:

|Adm no|Emergency Status|TransferWard|TransferDateTime|nextWardTferDateTime|
|A1|EMERGENCY ADMISSION|EmergencyDept|1/12/2021 5:30|17/12/2021 9:28|
|A1|EMERGENCY ADMISSION|Ward 9|17/12/2021 9:28|17/12/2021 12:30|
|B1|ELECTIVE ADMISSION|Outpatients|22/08/2021 10:30|22/08/2021 12:45|
|C1|ELECTIVE ADMISSION|Outpatients|23/08/2021 13:10|23/08/2021 16:30|
|C2|ELECTIVE ADMISSION|Outpatients|26/08/2021 9:45|26/08/2021 13:40|
|C3|ELECTIVE ADMISSION|Day Surgery|3/09/2021 6:30|3/09/2021 10:37|
|C3|ELECTIVE ADMISSION|Ward 6|3/09/2021 10:37|6/09/2021 15:30|
|D1|ELECTIVE ADMISSION|Imaging|11/08/2021 11:36|11/08/2021 14:45|
|D2|ELECTIVE ADMISSION|Imaging|18/08/2021 13:39|18/08/2021 15:20|
|D3|EMERGENCY ADMISSION|EmergencyDept|6/11/2021 4:40|6/11/2021 9:46|
|D3|EMERGENCY ADMISSION|Ward 8|6/11/2021 9:46|8/11/2021 19:29|
|E1|ELECTIVE ADMISSION|Day Surgery|25/01/2022 10:00|25/01/2022 15:07|
|F1|ELECTIVE ADMISSION|Day Surgery|1/02/2022 8:40|1/02/2022 9:59|
|F1|ELECTIVE ADMISSION|Day Surgery|1/02/2022 9:59|1/02/2022 14:45|
|F2|EMERGENCY ADMISSION|EmergencyDept|6/02/2022 10:34|6/02/2022 14:57|
|F2|EMERGENCY ADMISSION|Ward 7|6/02/2022 14:57|10/02/2022 10:45|
|F3|ELECTIVE ADMISSION|Ward 6|22/03/2022 8:30|22/03/2022 11:12|

Code that I’ve used:



import pandas as pd
import numpy as np

# this is the data source for ward occupancy 
Data = pd.read_excel('WardMovts_Jul21_Mar22.xlsx')

#set the date range for the timeseries we're wanting to create
import datetime
start = datetime.datetime(2021, 9, 1,10,0)
end = datetime.datetime(2022, 11, 30, 10,0)
daterange = pd.date_range(start,end, freq = "d")



#create a timeseries DataFrame

#first get a list of all the transfer wards
transferWards = Data['TransferWard'].unique().tolist()
transferWards.sort()
wardCount = len(transferWards)


#get a list of all the AdmitStatus values
AdmitStatuses = Data['AdmStatusDesc'].unique().tolist()
AdmitStatuses.sort()
AdmStatusCount = len(AdmitStatuses)


#create an empty dataframe
tseries = pd.DataFrame(columns = ['DateHour', 'NumPts', 'TferWard', 'StreamPredictModel', 'AdmitStatus'])

for ward in transferWards:
    for admStatus in AdmitStatuses:
        tseries2 = pd.DataFrame(columns = ['DateHour', 'NumPts', 'TferWard', 'StreamPredictModel', 'AdmitStatus'])
        tseries2['DateHour'] = daterange
        tseries2['NumPts']=0
        tseries2['TferWard'] = ward
        tseries2['AdmitStatus'] = admStatus
       #tseries2['StreamPredictModel']= StreamPredictModel
        tseries = pd.concat([tseries, tseries2])




#drop the indexes to allow comparison)
#tseries.reset_index(drop = True)
#Data.reset_index(drop = True)
totalAdmits = len(Data)
totalDays = len(tseries)

#exclude the admissions that are outside of the period you're interested in
#exclude if admit is after last day
#exclude if discharge is before first day
Data['AdmDate'] = pd.to_datetime(Data['AdmDate'])

DataSubset = Data[~(Data['AdmDate'] > '2021-12-01')]
DataSubset = DataSubset[~(Data['DischargeDate'] < '2021-09-01')]

#create a temporary subset of the main data set by ward to allow faster looping

groupedData = Data.groupby(['TransferWard','AdmStatusDesc'])
groupedTimeSeries = tseries.groupby(['TferWard','AdmitStatus'])

#create a dictionary that will hold all the timeseries data by ward
WardTSeriesDict = {}



#define a function so can display a progress bar for the data extraction
def progress_bar(current, total, bar_length = 20):
    percent = float(current) * 100 / total
    arrow   = '-' * int(percent/100 * bar_length - 1) + '>'
    spaces=" " * (bar_length - len(arrow))

    print('rProgress: [%s%s] %d %%' % (arrow, spaces, percent), end='r')


#now create the datasets for each ward

wardcounter = 1
for thisWard1 in transferWards:
    for thisStatus in AdmitStatuses:
        progress = 1 #re-set the progress counter for this loop)
        subsetTseries = groupedTimeSeries.get_group((thisWard1, thisStatus))
        subsetData = groupedData.get_group((thisWard1, thisStatus))
        df_name = thisWard1 + '_' + thisStatus + '_Tseries'
        print('rExtracting ward', thisWard1, thisStatus,'(Ward ',wardcounter,'of ', wardCount, ')')
        #progress_bar(progress,len(subsetData)) 
        for index, row in subsetTseries.iterrows():
            timestep = row['DateHour']
            currentWard = row['TferWard']
            idx = subsetTseries.index[index]
            progress_bar(progress,len(subsetTseries)) 
            countPts = 0
            for index, row in subsetData.iterrows():
                 admDate = row['TransferDateTime']
                 dischDate = row['nextWardTferDateTime']
                 thisWard = row['TransferWard']
                 AdmitStatus1 = row['AdmStatusDesc']
                 #
                 if (admDate <= timestep) and (dischDate >= timestep) and (thisWard == currentWard) and (AdmitStatus1 == thisStatus):
                     countPts = countPts + 1
                 else:
                     countPts = countPts
            subsetTseries.loc[idx,'NumPts'] = countPts
            progress = progress + 1
        WardTSeriesDict[df_name]=subsetTseries
    wardcounter = wardcounter +1

Console output:


Extracting ward 23HRS ELECTIVE ADMISSION (Ward  1 of  98 )
Progress: [>                   ] 0 %C:ProgramDataAnaconda3libsite-packagespandascoreindexing.py:1765: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
Extracting ward 23HRS EMERGENCY ADMISSION (Ward  1 of  98 )
Extracting ward 23HRS NOT ASSIGNED (Ward  1 of  98 )
Extracting ward 4A ELECTIVE ADMISSION (Ward  2 of  98 )
Extracting ward 4A EMERGENCY ADMISSION (Ward  2 of  98 )
Extracting ward 4A NOT ASSIGNED (Ward  2 of  98 )
Extracting ward 4C ELECTIVE ADMISSION (Ward  3 of  98 )
Extracting ward 4C EMERGENCY ADMISSION (Ward  3 of  98 )
Extracting ward 4C NOT ASSIGNED (Ward  3 of  98 )
Extracting ward 5C ELECTIVE ADMISSION (Ward  4 of  98 )
Extracting ward 5C EMERGENCY ADMISSION (Ward  4 of  98 )
Extracting ward 5C NOT ASSIGNED (Ward  4 of  98 )
Extracting ward 5C RAD ELECTIVE ADMISSION (Ward  5 of  98 )
Extracting ward 5C RAD EMERGENCY ADMISSION (Ward  5 of  98 )
Extracting ward 5C RAD NOT ASSIGNED (Ward  5 of  98 )
Extracting ward 6ANRW ELECTIVE ADMISSION (Ward  6 of  98 )
Extracting ward 6ANRW EMERGENCY ADMISSION (Ward  6 of  98 )
Extracting ward 6ANRW NOT ASSIGNED (Ward  6 of  98 )
Traceback (most recent call last):00 %

  File "<ipython-input-8-ce2db9fdc9d0>", line 6, in <module>
    subsetData = groupedData.get_group((thisWard1, thisStatus))

  File "C:ProgramDataAnaconda3libsite-packagespandascoregroupbygroupby.py", line 810, in get_group
    raise KeyError(name)

KeyError: ('6ANURS', 'ELECTIVE ADMISSION')

How do I get around this issue…?

Leave a Comment