Scaling and Normalizing Arrays: A Practical Guide for Data Preprocessing

In machine learning, data analysis, and scientific computing, raw data often comes in different scales. One feature might range from 0 to 1, while another spans 1,000 to 100,000. Feeding such mismatched data into algorithms can lead to poor performance, slow convergence, or biased results. This is where scaling and normalizing arrays come in—essential preprocessing steps that bring features to a common scale.

In this post, we’ll explore what scaling and normalization mean, the key techniques, when to use each, and how to implement them in Python using NumPy and scikit-learn.


Why Scale or Normalize?

Most distance-based algorithms (like K-Means, KNN, SVM) and gradient-based methods (like neural networks, logistic regression) are sensitive to the magnitude of features. Without scaling:

  • Features with larger ranges dominate the model.
  • Training becomes unstable or slow.
  • Interpretability of coefficients (e.g., in linear models) is compromised.

Goal: Bring features onto a common scale without distorting the relative differences between values within each feature.
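
A quick way to see the first point: with made-up (age, income) feature vectors, the Euclidean distances that K-Means or KNN rely on are driven almost entirely by the income column, so even a 35-year age gap barely registers.

import numpy as np

# Hypothetical (age in years, income in dollars) feature vectors
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])   # very different age, similar income
c = np.array([26.0, 80_000.0])   # similar age, different income

# Unscaled Euclidean distances: income dominates completely
print(np.linalg.norm(a - b))  # ~1000.6  -- b looks "close" to a
print(np.linalg.norm(a - c))  # ~30000.0 -- c looks "far" from a

After standardizing both columns, the age difference would carry comparable weight in the distance.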


Key Techniques

1. Min-Max Scaling (Normalization to [0,1])

Transforms data to a fixed range, usually [0, 1].

$$ X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} $$
  • When to use: When you need bounded values (e.g., neural networks with sigmoid/tanh activations, image pixel normalization).
  • Pros: Preserves the original distribution shape; interpretable.
  • Cons: Sensitive to outliers; a single extreme value compresses all other values into a narrow band (demonstrated after the NumPy example below).

Python Example (NumPy)

import numpy as np

def min_max_scale(arr):
    return (arr - arr.min()) / (arr.max() - arr.min())

X = np.array([1, 5, 10, 15, 20])
X_scaled = min_max_scale(X)
print(X_scaled)
# Output (rounded): [0.   0.21 0.47 0.74 1.  ]
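
To see the outlier sensitivity mentioned above, reuse the same min_max_scale helper on data with one (made-up) extreme value: the ordinary values get squeezed into a tiny band near zero.

# One outlier compresses everything else toward 0
X_outlier = np.array([1, 5, 10, 15, 1000])
print(min_max_scale(X_outlier))
# Output (rounded): [0.    0.004 0.009 0.014 1.   ]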

With scikit-learn

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = np.array([[1], [5], [10], [15], [20]])
X_scaled = scaler.fit_transform(X)
print(X_scaled.flatten())

2. Standardization (Z-score Normalization)

Centers data around mean 0 with standard deviation 1.

$$ X_{\text{standardized}} = \frac{X - \mu}{\sigma} $$
  • When to use: Most common choice. Ideal for algorithms assuming Gaussian distribution (e.g., linear regression, logistic regression, PCA).
  • Pros: Less distorted by outliers than min-max scaling; works well with gradient descent.
  • Cons: Doesn’t bound values (can be negative or >1).

Python Example

def standardize(arr):
    return (arr - arr.mean()) / arr.std()

X = np.array([1, 5, 10, 15, 20])
X_std = standardize(X)
print(X_std)
# Output (rounded): [-1.35 -0.77 -0.03  0.71  1.44]

With scikit-learn

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.reshape(-1, 1))
print(X_scaled.flatten())

3. Robust Scaling

Uses median and interquartile range (IQR) instead of mean/std.

$$ X_{\text{robust}} = \frac{X - \text{median}(X)}{\text{IQR}(X)} $$
  • When to use: Data with outliers.
  • Pros: Outlier-resistant.
  • Cons: Less intuitive scaling.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_robust = scaler.fit_transform(X.reshape(-1, 1))
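
For intuition, here is a minimal NumPy sketch that mirrors RobustScaler's defaults (center on the median, scale by the 25th-75th percentile range); X_out is made-up data with one outlier.

import numpy as np

def robust_scale(arr):
    # Median centering and IQR scaling, matching the formula above
    q1, q3 = np.percentile(arr, [25, 75])
    return (arr - np.median(arr)) / (q3 - q1)

X_out = np.array([1, 5, 10, 15, 1000])
print(robust_scale(X_out))
# [-0.9 -0.5  0.   0.5 99. ] -- the ordinary values keep a sensible scale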

4. Max Absolute Scaling

Scales by dividing by the maximum absolute value.

$$ X_{\text{scaled}} = \frac{X}{|X|_{\max}} $$
  • Range: [-1, 1]
  • Great for sparse data (e.g., text TF-IDF).
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X.reshape(-1, 1))
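
Because MaxAbsScaler only divides each feature by a constant and never shifts values, it preserves sparsity, which is why it suits TF-IDF-style matrices. A small sketch with a toy SciPy sparse matrix (made-up values, not real TF-IDF output):

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

X_sparse = csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [4.0, 0.0, 1.0],
                                [0.0, 0.0, 3.0]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)  # result stays sparse; zeros untouched
print(X_scaled.toarray())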

Comparison Table

| Method          | Range            | Outlier Robust? | Use Case                    |
|-----------------|------------------|-----------------|-----------------------------|
| Min-Max         | [0, 1] or custom | No              | Neural nets, bounded inputs |
| Standardization | ~[-3, 3]         | Moderate        | Most ML algorithms          |
| Robust Scaling  | Varies           | Yes             | Outlier-heavy data          |
| MaxAbs Scaling  | [-1, 1]          | No              | Sparse data                 |

Best Practices

  1. Fit on Training Data Only
    Compute scaling statistics from the training data only; using test-set statistics leaks information into training.

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # Only transform!
    
  2. Apply Same Transformation to All Splits
    Consistency is key.

  3. Inverse Transform When Needed
    Convert predictions back to original scale.

    X_original = scaler.inverse_transform(X_scaled)
    
  4. Handle 1D vs 2D Arrays
    scikit-learn expects 2D input (n_samples × n_features).
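
    A minimal sketch of the reshape round-trip for a single feature stored as a 1D array (reusing np and StandardScaler from earlier):

    x = np.array([1.0, 5.0, 10.0, 15.0, 20.0])        # shape (5,)  -- 1D
    X_2d = x.reshape(-1, 1)                           # shape (5, 1) -- 2D column
    x_scaled = StandardScaler().fit_transform(X_2d)   # what scikit-learn expects
    print(x_scaled.ravel())                           # flatten back to 1D if needed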


Real-World Example: Scaling Before PCA

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.randn(100, 5) * [1, 10, 100, 1000, 0.1] + [0, 50, 500, 5000, 0]

# Without scaling: high-variance features dominate
pca_raw = PCA(n_components=2)
X_pca_raw = pca_raw.fit_transform(X)
print("Explained variance (raw):", pca_raw.explained_variance_ratio_)

# With scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance (scaled):", pca.explained_variance_ratio_)

Result: Scaling ensures all features contribute fairly to principal components.


Conclusion

Scaling ≠ Normalization — though often used interchangeably:

  • Normalization typically refers to rescaling to a norm (e.g., unit norm or [0,1]).
  • Scaling is a broader term including standardization.
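
The post has used scalers throughout; for completeness, norm-based normalization looks like this with scikit-learn's normalize helper, which rescales each row (sample) to unit L2 norm:

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 2.0]])

X_unit = normalize(X, norm="l2")        # each row divided by its Euclidean norm
print(X_unit)                           # first row becomes [0.6, 0.8]
print(np.linalg.norm(X_unit, axis=1))   # [1. 1.]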

Rule of Thumb:

  • Use StandardScaler by default.
  • Use MinMaxScaler when you need bounded values.
  • Use RobustScaler if outliers are a concern.

Proper scaling is a small step that leads to giant leaps in model performance.


Your Turn: What’s your go-to scaler? Drop a comment or tweet @ferdous!

Happy preprocessing!

