name: inverse
layout: true
class: center, middle, inverse
---
# Python for Data Analysis
## Lecture 3
---
layout: false
## Reading Files

Before you begin, import pandas:

```python
import pandas as pd
```

To read a file, use the `pd.read_csv()` function:

```python
data = pd.read_csv('data.csv')
```

You can change the delimiter with the `sep` argument:

```python
data = pd.read_csv('data.tsv', sep='\t')
```

---

## Writing Files

To write a file, use the `to_csv()` method. By default the row index is written as the first column; pass `index=False` to omit it.

```python
data.to_csv('data.csv')
```

Again, you can change the delimiter with the `sep` argument:

```python
data.to_csv('data.tsv', sep='\t')
```

Other formats are supported for both reading and writing, such as Excel, JSON, and SQL databases [(and many more!)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

---

## Basic Information

Display a concise summary of the DataFrame, including data types and memory usage.

```python
data.info()
```

Return a tuple representing the dimensionality of the DataFrame (rows, columns).

```python
data.shape
```

Return the column labels of the DataFrame.

```python
data.columns
```

Return the index labels of the DataFrame.
```python
data.index
```

---

## Viewing Data

To view the first `n` rows of the DataFrame, use the `head()` method:

```python
data.head(3)
```

To view the last `n` rows of the DataFrame, use the `tail()` method:

```python
data.tail(3)
```

To view a random sample of `n` rows from the DataFrame, use the `sample()` method:

```python
data.sample(3)
```

---

## Summary Statistics (part1):

To generate descriptive statistics that summarize the central tendency, dispersion, and shape of the numerical data, use the `describe()` method:

```python
data.describe()
```

To calculate the mean of all numerical columns, use the `mean()` method. Pass `numeric_only=True` so that non-numeric columns are skipped instead of raising an error:

```python
data.mean(numeric_only=True)
```

To calculate the median of all numerical columns, use the `median()` method:

```python
data.median(numeric_only=True)
```

To find the minimum value of each numerical column, use the `min()` method:

```python
data.min(numeric_only=True)
```

To find the maximum value of each numerical column, use the `max()` method:

```python
data.max(numeric_only=True)
```

---

## Summary Statistics (part2):

To calculate the standard deviation of all numerical columns, use the `std()` method:

```python
data.std(numeric_only=True)
```

To count non-null values for each column, use the `count()` method:

```python
data.count()
```

To count the number of unique values for each column, use the `nunique()` method:

```python
data.nunique()
```

To compute the variance of all numerical columns, use the `var()` method:

```python
data.var(numeric_only=True)
```

To calculate the pairwise correlation between numerical columns, use the `corr()` method:

```python
data.corr(numeric_only=True)
```

---

## Checking for Missing Values:

To get a DataFrame of boolean values that marks each missing value as `True`, use the `isna()` method:

```python
data.isna()
```

The `isnull()` method is an alias for `isna()`:

```python
data.isnull()
```

To get a DataFrame of boolean values that marks each non-missing value as `True`, use the `notna()` method:

```python
data.notna()
```

The `notnull()` method is an alias for `notna()`:

```python
data.notnull()
```

---

## Reshaping Data:

Reshaping data means changing the layout of your DataFrame (which values sit in rows and which in columns) to better suit your analysis or presentation needs. Pandas provides several methods for reshaping data, including melting, pivoting, and stacking/unstacking.
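
As a rough illustration of the three reshaping families named above, the sketch below runs a small round trip: wide to long with `pd.melt()`, back to wide with `pivot()`, and then `stack()`/`unstack()`. The table and its column names (`student`, `math`, `physics`, `subject`, `score`) are made up for this example and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical wide-format table: one row per student, one column per subject.
wide = pd.DataFrame({
    "student": ["Ana", "Ben"],
    "math": [90, 75],
    "physics": [85, 80],
})

# Wide -> long: every (student, subject) pair becomes its own row.
long_df = pd.melt(
    wide,
    id_vars=["student"],
    value_vars=["math", "physics"],
    var_name="subject",
    value_name="score",
)

# Long -> wide: each distinct subject becomes a column again.
wide_again = long_df.pivot(index="student", columns="subject", values="score")

# stack() moves the column labels into an inner level of the row index;
# unstack() reverses that step.
stacked = wide_again.stack()
unstacked = stacked.unstack()

print(long_df)
print(wide_again)
print(stacked)
```

Note that `pivot()` returns `student` as the index rather than as a regular column; calling `reset_index()` on the result would turn it back into a column.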
---

## Reshaping Data:
### Melting:

To unpivot a DataFrame from wide format to long format, use the `pd.melt()` function:

```python
# Melting DataFrame ('id_vars' and 'value_vars' are placeholder column names)
melted_df = pd.melt(df, id_vars=['id_vars'], value_vars=['value_vars'])
```

---

## Reshaping Data:
### Pivoting:

To pivot a DataFrame from long format to wide format, use the `pivot()` method or the `pivot_table()` function:

```python
# Pivoting DataFrame (each index/column pair must be unique)
pivoted_df = df.pivot(index='index_column', columns='column_to_pivot', values='values_to_fill')

# Or pivot_table, which aggregates duplicate index/column pairs
# ('agg_function' is a placeholder, e.g. 'mean' or 'sum')
pivot_table_df = pd.pivot_table(df, values='values_to_fill', index='index_column', columns='column_to_pivot', aggfunc='agg_function')
```

---

## Combining Datasets:
### Concatenation:

To concatenate two DataFrames along rows or columns, use the `pd.concat()` function:

```python
# Concatenate along rows
df_concatenated = pd.concat([df1, df2])

# Concatenate along columns
df_concatenated = pd.concat([df1, df2], axis=1)
```

---

## Combining Datasets:
### Merging:

To merge two DataFrames based on one or more keys, use the `pd.merge()` function:

```python
# Merge based on a single key
merged_df = pd.merge(df1, df2, on='key_column')

# Merge based on multiple keys
merged_df = pd.merge(df1, df2, on=['key_column1', 'key_column2'])

# Different types of joins: 'inner', 'left', 'right', 'outer'
merged_df = pd.merge(df1, df2, on='key_column', how='inner')
```

The `how` argument controls which keys are kept: `'inner'` (the default) keeps only keys present in both DataFrames, `'outer'` keeps all keys, and `'left'` or `'right'` keeps all keys from that side.

---

## Stacking/Unstacking:

To move column labels into the row index (wide to long), use the `stack()` method; to move the innermost index level back into columns (long to wide), use the `unstack()` method:

```python
# Stacking DataFrame
stacked_df = df.stack()

# Unstacking DataFrame
unstacked_df = df.unstack()
```

---

## Subsetting Rows:

**Drop Duplicates:** Remove duplicate rows from the DataFrame.

```python
df.drop_duplicates()
```

**Sample:** Return a random sample of rows from the DataFrame.
```python
df.sample(n=5)  # Return 5 random rows
```

**nlargest:** Return the `n` rows with the largest values in a specific column.

```python
df.nlargest(n=3, columns='column_name')  # Return the 3 rows with the largest values
```

**nsmallest:** Return the `n` rows with the smallest values in a specific column.

```python
df.nsmallest(n=3, columns='column_name')  # Return the 3 rows with the smallest values
```

---

## Subsetting Columns:

To select specific columns by their labels, you can use the indexing operator `[]` or the `loc[]` accessor.

```python
# Using indexing operator
df[['column1', 'column2']]

# Using loc accessor
df.loc[:, ['column1', 'column2']]
```

To select columns by their integer position, you can use the `iloc[]` accessor.

```python
df.iloc[:, [0, 1]]  # Select first two columns
```

To drop specific columns from the DataFrame, you can use the `drop()` method.

```python
df.drop(columns=['column1', 'column2'])
```

To select columns based on their data type, you can use the `select_dtypes()` method.

```python
df.select_dtypes(include='number')  # Select numeric columns
```

To rename columns, you can use the `rename()` method.

```python
df.rename(columns={'old_name': 'new_name'})
```

---

## Sorting:

To sort the DataFrame by one or more columns, use the `sort_values()` method.

```python
df.sort_values(by='column_name')  # Sort by a single column
df.sort_values(by=['column1', 'column2'])  # Sort by multiple columns
```

To sort the DataFrame by the index, use the `sort_index()` method.

```python
df.sort_index()  # Sort by index labels
```

---

## Indexing:

To select rows by their index labels, use the `loc[]` accessor.

```python
df.loc['index_label']  # Select a single row
df.loc['start_index':'end_index']  # Select a range of rows (both endpoints included)
```

---

## Grouping and Aggregating:

To group the DataFrame by one or more columns, use the `groupby()` method. The result is a grouped object to which you then apply an aggregation function.
```python
df.groupby('column_name')  # Group by a single column
df.groupby(['column1', 'column2'])  # Group by multiple columns
```

After grouping, you can apply aggregation functions to the grouped data using methods like `sum()`, `mean()`, `count()`, etc.

```python
df.groupby('column_name').sum()  # Sum of grouped data
df.groupby('column_name').mean(numeric_only=True)  # Mean of the numeric columns
df.groupby('column_name').count()  # Count of non-null values per group
```

To aggregate specific columns of the grouped data, use the `agg()` method.

```python
# 'agg_function' is a placeholder, e.g. 'mean' or 'sum'
df.groupby('column_name').agg({'column_to_aggregate': 'agg_function'})
```

To apply multiple aggregation functions, use a list of functions.

```python
df.groupby('column_name').agg({'column_to_aggregate': ['agg_function1', 'agg_function2']})
```

---

## Plotting (part1)

To create a line plot of the DataFrame, use the `plot()` method with `kind='line'`.

```python
df.plot(kind='line')
```

To create a bar plot, use `kind='bar'`.

```python
df.plot(kind='bar')
```

To create a histogram, use `kind='hist'`.

```python
df.plot(kind='hist')
```

To create a box plot, use `kind='box'`.

```python
df.plot(kind='box')
```

---

## Plotting (part2)

To create a scatter plot, use `kind='scatter'` and name the columns for the x and y axes.

```python
df.plot(kind='scatter', x='column1', y='column2')
```

To create a kernel density estimation plot, use `kind='kde'`.

```python
df.plot(kind='kde')  # Requires SciPy
```

To create an area plot, use `kind='area'`.

```python
df.plot(kind='area')
```

---
name: last-page
template: inverse
## That's all folks (for now)!