Python function to read Excel/CSV files from a given directory and its subdirectories
Code
The code for the function is as shown below:
import os
import pandas as pd
import numpy as np
from datetime import datetime
def extract_data(directory_path, columns_to_extract, date_columns=[], rows_to_skip=0, output_filename='output.csv'):
    """
    Extracts data from all Excel and CSV files in the specified directory and its subdirectories that contain all the
    specified columns.
    :param directory_path: The path to the directory to search for Excel and CSV files.
    :param columns_to_extract: A list of column names to extract from each file.
    :param date_columns: A list of column names to parse as dates using pd.to_datetime().
    :param rows_to_skip: The number of rows to skip at the top of each file before the header row.
    :param output_filename: The path and filename to save the extracted data in a csv.
    :return: A DataFrame containing the extracted data from all Excel and CSV files that contain the specified columns,
             or None if no files contain the specified columns.
    """
    start_time = datetime.now()
    extracted_data = pd.DataFrame()
    files_with_no_columns = []
    sheet_read = ''
    # Columns written to the output: the requested columns plus file metadata
    extracted_columns = columns_to_extract.copy()
    extracted_columns.extend(['File Name', 'SubDir', 'CreatedDate', 'LastModifiedDate'])
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            print('Started reading ' + file + ' at: {}'.format(datetime.now()) + ' ...')
            if file.endswith('.xlsx') or file.endswith('.xls') or file.endswith('.csv'):
                file_path = os.path.join(root, file)
                columns_found_outer = False
                if file.endswith('.csv'):
                    df = pd.read_csv(file_path, skiprows=rows_to_skip)
                    # The file qualifies only if every requested column is present
                    for col in columns_to_extract:
                        if col in df.columns:
                            columns_found_outer = True
                        else:
                            columns_found_outer = False
                            break
                else:
                    # Read every sheet of the workbook and use the first one that has all the columns
                    dfs = pd.read_excel(file_path, sheet_name=None, skiprows=rows_to_skip)
                    for key in dfs.keys():
                        columns_found = False
                        for col in columns_to_extract:
                            if col in dfs[key].columns:
                                columns_found = True
                            else:
                                columns_found = False
                                break
                        if columns_found:
                            columns_found_outer = True
                            df = dfs[key]
                            sheet_read = key
                            break
                if columns_found_outer:
                    if len(date_columns) > 0:
                        for col in date_columns:
                            if col in df.columns:
                                df[col] = pd.to_datetime(df[col])
                    # Add file metadata columns
                    df['File Name'] = file
                    df['SubDir'] = os.path.basename(root)
                    xcreate = datetime.fromtimestamp(os.path.getctime(file_path))
                    df['CreatedDate'] = xcreate
                    xmod = datetime.fromtimestamp(os.path.getmtime(file_path))
                    df['LastModifiedDate'] = xmod
                    extracted_data = pd.concat([extracted_data, df[extracted_columns]], ignore_index=True)
                    if file.endswith('.csv'):
                        print(file + ' has been read')
                    else:
                        print(file + ' has been read and it was last modified on ' + xmod.strftime('%Y-%m-%d') + '. The name of the sheet that was read is: ' + sheet_read)
                else:
                    files_with_no_columns.append(file_path)
    if len(files_with_no_columns) > 0:
        print("The following files do not contain the specified columns:")
        for file_path in files_with_no_columns:
            print(file_path)
    if not extracted_data.empty:
        # Upper-case all string values, fill missing values and write the combined data to CSV
        extracted_data = extracted_data.applymap(lambda s: s.upper() if type(s) == str else s).fillna('')
        extracted_data.to_csv(output_filename)
        print('Started at: {}'.format(start_time) + '. \nEnded at: {}'.format(datetime.now()) + '. \nTime elapsed (hh:mm:ss.ms) {}'.format(datetime.now() - start_time))
        return extracted_data
    else:
        print("Specified columns do not exist in any file in the provided directory.")
        print('Started at: {}'.format(start_time) + '. \nEnded at: {}'.format(datetime.now()) + '. \nTime elapsed (hh:mm:ss.ms) {}'.format(datetime.now() - start_time))
        return None
Explanation
- The function `extract_data` takes in the following parameters:
  - `directory_path`: the path to the directory to search for Excel and CSV files
  - `columns_to_extract`: a list of column names to extract from each file
  - `date_columns`: a list of column names to parse as dates using `pd.to_datetime()`
  - `rows_to_skip`: the number of rows to skip when reading each file
  - `output_filename`: the path and filename to save the extracted data in a CSV file
- The function starts by creating an empty DataFrame called `extracted_data`.
- It also initializes the variables `files_with_no_columns` and `sheet_read` to track files that do not contain the specified columns and to keep track of the current sheet being read when extracting data from Excel files.
- The list `extracted_columns` is created by copying the input `columns_to_extract` list and adding additional columns for the file name, subdirectory name, file creation date, and last modified date.
- The function then loops through all the files in the specified directory and its subdirectories using `os.walk()`.
- For each file, it checks whether it has a ".xlsx", ".xls" or ".csv" extension.
- If the file is a CSV file, the function reads it into a DataFrame using `pd.read_csv()`, skipping the number of rows specified by `rows_to_skip`.
- It then checks whether all the columns in `columns_to_extract` are present in the DataFrame. If so, it sets `columns_found_outer` to `True` and proceeds to the next step. If not, it sets `columns_found_outer` to `False` and moves on to the next file. (A compact version of this membership check is sketched just after this list.)
- If the file is an Excel file, the function reads all sheets in the file into a dictionary of DataFrames using `pd.read_excel()` with `sheet_name=None`. It then loops through the sheets, checking whether every column in `columns_to_extract` is present in each sheet's DataFrame. The first sheet that contains all the specified columns is used: `columns_found` and `columns_found_outer` are set to `True`, that sheet's DataFrame is kept and the loop stops. If no sheet qualifies, the function moves on to the next file.
- If `columns_found_outer` is `True`, the function parses any requested `date_columns`, then adds the file name, subdirectory name, file creation date, and last modified date as new columns using `os.path.basename()`, `os.path.getctime()`, and `os.path.getmtime()`, and appends the selected columns to the `extracted_data` DataFrame with `pd.concat()`. (The timestamp conversion is sketched after this list.)
- If `columns_found_outer` is `False`, the file path is appended to the `files_with_no_columns` list; these files are reported together at the end.
- Finally, the function checks whether any file contained the specified columns. If none did, it returns `None`. Otherwise it upper-cases all string values, fills missing values with empty strings, saves `extracted_data` to a CSV file using `to_csv()`, prints the files that did not contain the specified columns along with the start, end and elapsed times, and returns the DataFrame.
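For reference, the file-qualification logic described above can be written more compactly with set operations. The sketch below is illustrative only and is not part of extract_data; the helper name find_matching_frame is made up for this example:

import pandas as pd

def find_matching_frame(file_path, columns_to_extract, rows_to_skip=0):
    """Illustrative helper: return the first DataFrame (CSV file or Excel sheet)
    that contains every requested column, or None if nothing matches."""
    required = set(columns_to_extract)
    if file_path.endswith('.csv'):
        df = pd.read_csv(file_path, skiprows=rows_to_skip)
        # issubset() replaces the column-by-column loop used in extract_data()
        return df if required.issubset(df.columns) else None
    # sheet_name=None returns a dict mapping sheet name -> DataFrame
    sheets = pd.read_excel(file_path, sheet_name=None, skiprows=rows_to_skip)
    for sheet_name, sheet_df in sheets.items():
        if required.issubset(sheet_df.columns):
            return sheet_df  # first matching sheet wins, as in extract_data()
    return None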
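The CreatedDate and LastModifiedDate values come from the epoch timestamps returned by os.path; a minimal standalone sketch of that conversion is shown below (the file path is a placeholder):

import os
from datetime import datetime

file_path = 'Work/January/sales.xlsx'  # placeholder path for illustration
# Note: on Windows getctime() is the creation time; on Unix it is the metadata change time
created = datetime.fromtimestamp(os.path.getctime(file_path))
modified = datetime.fromtimestamp(os.path.getmtime(file_path))   # last modification time
print(created.strftime('%Y-%m-%d'), modified.strftime('%Y-%m-%d'))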
Sample Usage
The function can be called in Python as shown below:
# set directory path
directory_path = './Work'  # directory where all the files containing data for extraction are to be searched
# set the columns you want to extract
columns_to_extract = ['Region', 'Country', 'Product Number', 'Quantity', 'Date of Sale']
date_col = ['Date of Sale']
# set the rows to skip
skiprows = 0  # number of rows at the beginning of each file that must be skipped to reach the header row of the data
output_filename = 'Regional Sales Data.csv'
# call the function
df = extract_data(directory_path, columns_to_extract, date_col, skiprows, output_filename)
Now, assuming there were 12 separate files for the past 12 months inside the folder, then as long as all those files, irrespective of whether they are CSV or Excel, have the columns `Region`, `Country`, `Product Number`, `Quantity` and `Date of Sale`, the function will read the files, extract the data and return it in the DataFrame `df`.
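A quick sanity check on the returned DataFrame could look like the snippet below; the expected extra columns follow from the function itself, while the count of 12 files is just the assumption made above:

# Quick sanity checks on the combined data (assumes the call above succeeded)
if df is not None:
    print(df.columns.tolist())        # requested columns plus File Name, SubDir, CreatedDate, LastModifiedDate
    print(df['File Name'].nunique())  # should be 12 if one file per month was found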
GUI Implementation
A very basic GUI implementation of the above function using PySimpleGUI, with all the code, is available here
Usage
The script can be copied directly into a Jupyter cell or run from the terminal. The following command should ensure all dependencies are installed:
py -m pip install pandas numpy pysimplegui
Some things the GUI takes care of are:
- Allows selection of the columns to be extracted from a sample `.csv` file
- Allows the user to specify which of the selected columns should be parsed as dates
- Gives a date-based filename to the output
- Shows a colour-coded log: files that were read are shown with a green background, and files that were ignored with a red background.
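The complete GUI lives in the linked code; purely as an illustration, a minimal PySimpleGUI window for picking the directory and a sample file might look like the sketch below (the layout and widget keys are placeholders, not the author's implementation):

import PySimpleGUI as sg

# Minimal illustrative layout: a folder picker, a sample-file picker and a Run button
layout = [
    [sg.Text('Directory to search'), sg.Input(key='-DIR-'), sg.FolderBrowse()],
    [sg.Text('Sample .csv file'), sg.Input(key='-SAMPLE-'), sg.FileBrowse(file_types=(('CSV Files', '*.csv'),))],
    [sg.Button('Run'), sg.Button('Exit')]
]

window = sg.Window('Data Extractor', layout)
while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, 'Exit'):
        break
    if event == 'Run':
        # Here the real app would read the sample file, let the user pick columns,
        # and then call extract_data() with the chosen settings.
        print(values['-DIR-'], values['-SAMPLE-'])
window.close()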
Screenshots
Some screenshots of the resulting app are as shown below: