Pandas Introduction (Lecture 3)

name: inverse
layout: true
class: center, middle, inverse

---
# Python for Data Analysis
## Lecture 3

---
layout: false
## Reading Files

Before you begin, import the necessary libraries:

```python
import pandas as pd
```

To read a file, use the `pd.read_csv()` function:

```python
data = pd.read_csv('data.csv')
```

You can change the delimiter with the `sep` argument:

```python
data = pd.read_csv('data.tsv', sep='\t')
```
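
To try both calls without files on disk, here is a minimal sketch that parses in-memory text with `io.StringIO`. The column names and values are made up for illustration:

```python
import io
import pandas as pd

# In-memory stand-ins for data.csv and data.tsv (illustrative contents)
csv_text = "name,score\nAda,90\nLin,85\n"
tsv_text = "name\tscore\nAda\t90\nLin\t85\n"

csv_data = pd.read_csv(io.StringIO(csv_text))
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls parse to the same 2x2 table
print(csv_data.equals(tsv_data))  # True
```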

---

## Writing Files

To write a file, use the `to_csv()` method:

```python
data.to_csv('data.csv')
```

Again, you can change the delimiter with the `sep` argument:

```python
data.to_csv('data.tsv', sep='\t')
```
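
By default `to_csv()` also writes the row index as the first column. The sketch below, which uses an illustrative DataFrame and an in-memory buffer instead of a real file, passes `index=False` to leave it out and then reads the text back:

```python
import io
import pandas as pd

# Illustrative data; any DataFrame works the same way
data = pd.DataFrame({"name": ["Ada", "Lin"], "score": [90, 85]})

buffer = io.StringIO()
data.to_csv(buffer, index=False)  # index=False omits the row labels
csv_text = buffer.getvalue()

# Round trip: the text parses back to the same columns and values
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```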

Other formats are supported for both reading and writing, such as Excel, JSON, and SQL databases [(and many
more!)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

---

## Basic Information

Display a concise summary of the DataFrame including data types and memory usage.

```python
data.info()
```

Return a tuple representing the dimensionality of the DataFrame (rows, columns).

```python
data.shape
```

Return the column labels of the DataFrame.

```python
data.columns
```

Return the index labels of the DataFrame.

```python
data.index
```
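
As a quick illustration, here are the attributes above on a small made-up DataFrame (the city data is only an example):

```python
import pandas as pd

data = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "population_m": [0.7, 10.1, 7.2]}
)

print(data.shape)          # (3, 2)
print(list(data.columns))  # ['city', 'population_m']
print(list(data.index))    # [0, 1, 2]
```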

---

## Viewing Data

To view the first `n` rows of the DataFrame, use the `head()` method:

```python
data.head(3)
```

To view the last `n` rows of the DataFrame, use the `tail()` method:

```python
data.tail(3)
```

To view a random sample of `n` rows from the DataFrame, use the `sample()` method:

```python
data.sample(3)
```
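
Two details worth knowing, sketched on an illustrative ten-row DataFrame: `head()` and `tail()` return 5 rows when called without an argument, and `sample()` accepts `random_state` to make the draw reproducible.

```python
import pandas as pd

data = pd.DataFrame({"value": range(10)})

first = data.head()       # no argument: first 5 rows
last_two = data.tail(2)   # last 2 rows

# The same random_state gives the same sample every time
sample_a = data.sample(3, random_state=0)
sample_b = data.sample(3, random_state=0)
print(sample_a.equals(sample_b))  # True
```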

---

## Summary Statistics (part1):

To generate descriptive statistics that summarize the central tendency, dispersion, and shape of the numerical data, use
the `describe()` method:

```python
data.describe()
```

To calculate the mean of all numerical columns, use the `mean()` method:

```python
data.mean(numeric_only=True)  # skip non-numeric columns
```

To calculate the median of all numerical columns, use the `median()` method:

```python
data.median(numeric_only=True)  # skip non-numeric columns
```

To find the minimum value of each numerical column, use the `min()` method:

```python
data.min()
```

To find the maximum value of each numerical column, use the `max()` method:

```python
data.max()
```

---

## Summary Statistics (part2):

To calculate the standard deviation of all numerical columns, use the `std()` method:

```python
data.std(numeric_only=True)  # skip non-numeric columns
```

To count non-null values for each column, use the `count()` method:

```python
data.count()
```

To count the number of unique values for each column, use the `nunique()` method:

```python
data.nunique()
```

To compute the variance of all numerical columns, use the `var()` method:

```python
data.var(numeric_only=True)  # skip non-numeric columns
```

To calculate the correlation between numerical columns, use the `corr()` method:

```python
data.corr(numeric_only=True)  # skip non-numeric columns
```
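
A small sketch of these statistics on a made-up DataFrame that mixes a text column with numeric ones. Passing `numeric_only=True` restricts each calculation to the numeric columns; recent pandas versions raise an error on text columns otherwise.

```python
import pandas as pd

# Illustrative data: one text column, two numeric columns
data = pd.DataFrame({
    "name": ["Ada", "Lin", "Sam", "Kai"],
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [2.0, 4.0, 6.0, 8.0],
})

means = data.mean(numeric_only=True)
stds = data.std(numeric_only=True)    # sample standard deviation (ddof=1)
corr = data.corr(numeric_only=True)   # pairwise Pearson correlation

print(means["height"])  # 2.5
print(corr)
```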

---

## Checking for Missing Values:

To check for missing values in the DataFrame and return a DataFrame of boolean values indicating missing values, use the
`isna()` method:

```python
data.isna()
```

To check for missing values in the DataFrame (alias for `isna()`), use the `isnull()` method:

```python
data.isnull()
```

To check for non-missing values in the DataFrame and return a DataFrame of boolean values indicating non-missing values,
use the `notna()` method:

```python
data.notna()
```

To check for non-missing values in the DataFrame (alias for `notna()`), use the `notnull()` method:

```python
data.notnull()
```
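
The boolean result is most useful when summarized. A common pattern, sketched here on illustrative data, is to chain `sum()` to count the missing values per column:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = data.isna().sum()
print(missing_per_column)
```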

---
## Reshaping Data:

Reshaping data involves transforming the layout of your DataFrame to better suit your analysis or presentation needs.
Pandas provides several methods for reshaping data, including melting, pivoting, and stacking/unstacking.

---

## Reshaping Data:

### Melting:

To unpivot a DataFrame from wide format to long format, use the `pd.melt()` function:

```python
# Melting DataFrame
melted_df = pd.melt(df, id_vars=['id_vars'], value_vars=['value_vars'])
```
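
A worked sketch with made-up exam scores. The `var_name` and `value_name` arguments name the two new columns; the column names here are assumptions for the example.

```python
import pandas as pd

# Wide format: one column per subject
wide = pd.DataFrame({
    "student": ["Ada", "Lin"],
    "math": [90, 85],
    "physics": [80, 95],
})

# Long format: one row per (student, subject) pair
long = pd.melt(
    wide,
    id_vars=["student"],
    value_vars=["math", "physics"],
    var_name="subject",
    value_name="score",
)
print(long)
```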

---

## Reshaping Data:

### Pivoting:

To pivot a DataFrame from long format to wide format, use the `pivot()` or `pivot_table()` methods:

```python
# Pivoting DataFrame
pivoted_df = df.pivot(index='index_column',
                      columns='column_to_pivot',
                      values='values_to_fill')

# Or pivot_table for more advanced operations
pivot_table_df = pd.pivot_table(df, values='values_to_fill',
                                index='index_column',
                                columns='column_to_pivot',
                                aggfunc='agg_function')
```
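
The practical difference, sketched on made-up exam scores: `pivot()` raises a `ValueError` when an index/column pair occurs more than once, while `pivot_table()` combines the duplicates with `aggfunc`.

```python
import pandas as pd

long = pd.DataFrame({
    "student": ["Ada", "Ada", "Lin", "Lin"],
    "subject": ["math", "physics", "math", "physics"],
    "score": [90, 80, 85, 95],
})

# Each (student, subject) pair is unique, so pivot() works
wide = long.pivot(index="student", columns="subject", values="score")

# Add a second math score for Ada; pivot() would now fail
extra = pd.DataFrame({"student": ["Ada"], "subject": ["math"], "score": [100]})
repeated = pd.concat([long, extra], ignore_index=True)

# pivot_table() averages the duplicate entries instead
averaged = pd.pivot_table(repeated, values="score", index="student",
                          columns="subject", aggfunc="mean")
print(averaged.loc["Ada", "math"])  # 95.0
```
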
---

## Combining Datasets:

### Concatenation:

To concatenate two DataFrames along rows or columns, use the `pd.concat()` function:

```python
# Concatenate along rows
df_concatenated = pd.concat([df1, df2])

# Concatenate along columns
df_concatenated = pd.concat([df1, df2], axis=1)
```
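
When concatenating along rows, the original index labels are kept, which can leave duplicates. A minimal sketch with two illustrative DataFrames shows how `ignore_index=True` renumbers the result:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

stacked = pd.concat([df1, df2])
renumbered = pd.concat([df1, df2], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```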

---

## Combining Datasets:

### Merging:

To merge two DataFrames based on one or more keys, use the `pd.merge()` function:

```python
# Merge based on a single key
merged_df = pd.merge(df1, df2, on='key_column')

# Merge based on multiple keys
merged_df = pd.merge(df1, df2, on=['key_column1', 'key_column2'])

# Different types of joins: 'inner', 'left', 'right', 'outer'
merged_df = pd.merge(df1, df2, on='key_column', how='inner')
```
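
To see how the join type changes the result, here is a sketch with two made-up tables whose keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = pd.merge(left, right, on="key", how="inner")     # keys in both tables
outer = pd.merge(left, right, on="key", how="outer")     # keys in either table
left_join = pd.merge(left, right, on="key", how="left")  # all keys from left

print(inner["key"].tolist())      # [2, 3]
print(left_join["key"].tolist())  # [1, 2, 3]
```

Rows without a match on the other side get missing values in that side's columns.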

Use `pd.concat()` to stack DataFrames that share the same columns or the same index, and `pd.merge()` to join
DataFrames on the values of one or more key columns.

---

## Stacking/Unstacking:

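To move data between column labels and index levels, use the `stack()` and `unstack()` methods. `stack()` moves the
column labels into the innermost index level (wide to long); `unstack()` moves the innermost index level back into
the columns (long to wide):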

```python
# Stacking DataFrame
stacked_df = df.stack()

# Unstacking DataFrame
unstacked_df = df.unstack()
```
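
A minimal sketch on an illustrative table of scores. Stacking a DataFrame with a single level of columns gives a Series with a two-level index, and unstacking reverses it:

```python
import pandas as pd

df = pd.DataFrame(
    {"math": [90, 85], "physics": [80, 95]},
    index=["Ada", "Lin"],
)

# Column labels become the inner index level
stacked = df.stack()
print(stacked.loc[("Ada", "math")])  # 90

# The inner index level becomes the columns again
unstacked = stacked.unstack()
print(list(unstacked.columns))  # ['math', 'physics']
```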

---
## Subsetting Rows:

**Drop Duplicates:** Remove duplicate rows from the DataFrame.

```python
df.drop_duplicates()
```

**Sample:** Return a random sample of rows from the DataFrame.

```python
df.sample(n=5) # Return 5 random rows
```

**nlargest:** Return the `n` rows with the largest values in a specific column.

```python
df.nlargest(n=3, columns='column_name') # Return the 3 rows with the largest values
```

**nsmallest:** Return the `n` rows with the smallest values in a specific column.

```python
df.nsmallest(n=3, columns='column_name') # Return the 3 rows with the smallest values
```
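
A short sketch on made-up scores. Note that `nlargest()` and `nsmallest()` return whole rows, not just the values from the chosen column:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Lin", "Lin", "Sam"],
    "score": [90, 85, 85, 70],
})

unique_rows = df.drop_duplicates()            # the repeated Lin row is dropped
top_two = df.nlargest(n=2, columns="score")   # rows for Ada and Lin
lowest = df.nsmallest(n=1, columns="score")   # row for Sam

print(len(unique_rows))           # 3
print(top_two["name"].tolist())   # ['Ada', 'Lin']
```
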
---

## Subsetting Columns:

To select specific columns by their labels, you can use the indexing operator `[]` or the `loc[]` accessor.

```python
# Using indexing operator
df[['column1', 'column2']]
# Using loc accessor
df.loc[:, ['column1', 'column2']]
```

To select columns by their integer position, you can use the `iloc[]` accessor.

```python
df.iloc[:, [0, 1]] # Select first two columns
```

To drop specific columns from the DataFrame, you can use the `drop()` method.

```python
df.drop(columns=['column1', 'column2'])
```

To select columns based on their data type, you can use the `select_dtypes()` method.

```python
df.select_dtypes(include='number') # Select numeric columns
```

To rename columns, you can use the `rename()` method.

```python
df.rename(columns={'old_name': 'new_name'})
```
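
These methods return a new DataFrame and leave the original unchanged, so assign the result if you want to keep it. A sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada"], "age": [36], "height": [1.7]})

numeric = df.select_dtypes(include="number")
renamed = df.rename(columns={"age": "age_years"})
dropped = df.drop(columns=["height"])

print(list(numeric.columns))  # ['age', 'height']
print(list(renamed.columns))  # ['name', 'age_years', 'height']
print(list(df.columns))       # ['name', 'age', 'height'] (unchanged)
```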

---

## Sorting:

To sort the DataFrame by one or more columns, use the `sort_values()` method.

```python
df.sort_values(by='column_name') # Sort by a single column
df.sort_values(by=['column1', 'column2']) # Sort by multiple columns
```

To sort the DataFrame by the index, use the `sort_index()` method.

```python
df.sort_index() # Sort by index labels
```
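
Sorting is ascending by default. The sketch below, on made-up data, uses the `ascending` argument to reverse the order, including per column when sorting by several:

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["b", "a", "b", "a"], "score": [1, 4, 3, 2]},
    index=[3, 1, 2, 0],
)

by_score_desc = df.sort_values(by="score", ascending=False)
# Team A to Z, then the highest score first within each team
by_team_then_score = df.sort_values(by=["team", "score"],
                                    ascending=[True, False])
by_index = df.sort_index()

print(by_score_desc["score"].tolist())       # [4, 3, 2, 1]
print(by_team_then_score["score"].tolist())  # [4, 2, 3, 1]
```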

---
## Indexing:

To select rows by their index labels, use the `loc[]` accessor.

```python
df.loc['index_label'] # Select a single row
df.loc['start_index':'end_index'] # Select a range of rows
```
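
Unlike ordinary Python slices, a label slice with `loc[]` includes both endpoints. A sketch on an illustrative DataFrame, compared with the position-based `iloc[]`:

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 85, 70, 60]}, index=["a", "b", "c", "d"])

row = df.loc["b"]            # a single row, returned as a Series
rows = df.loc["b":"c"]       # label slice: "c" is included
by_position = df.iloc[1:3]   # position slice: position 3 is excluded

print(list(rows.index))          # ['b', 'c']
print(rows.equals(by_position))  # True
```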

---

## Grouping and Aggregating:

To group the DataFrame by one or more columns, use the `groupby()` method. It returns a GroupBy object; nothing is computed until you apply an aggregation to it.

```python
df.groupby('column_name') # Group by a single column
df.groupby(['column1', 'column2']) # Group by multiple columns
```

After grouping, you can apply aggregation functions to the grouped data using methods like `sum()`, `mean()`, `count()`, etc.

```python
df.groupby('column_name').sum() # Sum of grouped data
df.groupby('column_name').mean(numeric_only=True) # Mean of grouped numeric data
df.groupby('column_name').count() # Count of grouped data
```

To choose the aggregation function for each column, pass a dictionary to the `agg()` method.

```python
df.groupby('column_name').agg({'column_to_aggregate': 'agg_function'})
```

To apply multiple aggregation functions, use a list of functions.

```python
df.groupby('column_name').agg({'column_to_aggregate': ['agg_function1', 'agg_function2']})
```
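
A worked sketch with made-up team statistics. When a column gets a list of functions, the result has two-level column labels of the form `(column, function)`:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b"],
    "points": [1, 3, 5],
    "minutes": [10, 20, 30],
})

totals = df.groupby("team").sum()
summary = df.groupby("team").agg({"points": ["sum", "mean"], "minutes": "max"})

print(totals.loc["a", "points"])             # 4
print(summary.loc["a", ("points", "mean")])  # 2.0
```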

---
## Plotting (part1)

To create a line plot of the DataFrame, use the `plot()` method with `kind='line'`.

```python
df.plot(kind='line')
```

To create a bar plot use `kind='bar'`.

```python
df.plot(kind='bar')
```

To create a histogram use `kind='hist'`.

```python
df.plot(kind='hist')
```

To create a box plot use `kind='box'`.

```python
df.plot(kind='box')
```

---
## Plotting (part2)

To create a scatter plot use `kind='scatter'`.

```python
df.plot(kind='scatter', x='column1', y='column2')
```

To create a kernel density estimation plot use `kind='kde'`.

```python
df.plot(kind='kde')
```

To create an area plot use `kind='area'`.

```python
df.plot(kind='area')
```

---
name: last-page
template: inverse
## That's all folks (for now)!