- Outputs 0 for negative values, and grows linearly for positive ones.
- Encourages sparse activations.
- Simple, efficient, and widely used.
- Downside: can cause “dead neurons” (always output 0).
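A minimal NumPy sketch of ReLU and its gradient (the function names here are mine, not from the note); it shows why inputs stuck in the negative regime get a zero gradient and stop updating:

```python
import numpy as np

def relu(x):
    # 0 for negatives, identity for positives
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 0 wherever x <= 0; a neuron stuck there stops updating ("dead neuron")
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # zeros for the non-positive entries, then 1.5 and 3.0
print(relu_grad(x))  # 0 for the first three entries, 1 for the rest
```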
### **2. Sigmoid**
$$f(x) = \frac{1}{1 + e^{-x}}$$
- Squashes input into the (0, 1) range.
- Useful for binary classification.
- Can cause vanishing gradients in deep networks due to saturation at the extremes.
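A small sketch (function names are illustrative) of the sigmoid and its derivative $\sigma(x)(1 - \sigma(x))$, showing how the gradient collapses at the extremes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x = 0

# Saturation: the gradient shrinks toward 0 for large |x|, which is what
# drives vanishing gradients when many such layers are stacked.
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # roughly 4.5e-05
```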
### **3. Tanh (Hyperbolic Tangent)**
$$f(x) = \tanh(x)$$
- Outputs range from -1 to 1 (zero-centered).
- Still suffers from the vanishing gradient problem.
- Often used in older architectures like RNNs.
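For comparison, a quick NumPy check (printed values are approximate) that tanh outputs are zero-centered but its gradient $1 - \tanh^2(x)$ still saturates:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
print(np.tanh(x))             # zero-centered outputs between -1 and 1
print(1.0 - np.tanh(3.0)**2)  # gradient at x = 3: roughly 0.01, already tiny
```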
### **4. Leaky ReLU**
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{otherwise} \end{cases}$$
- Fixes ReLU’s "dying neuron" issue by allowing a small gradient for negative inputs.
- Maintains sparsity but avoids total shutdown of some neurons.
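A sketch of Leaky ReLU with the 0.01 slope from the formula above (helper names are mine):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # identity for positives, a small slope alpha for negatives
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # never exactly 0, so neurons in the negative regime can still recover
    return np.where(x > 0, 1.0, alpha)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.02, 3.0]
```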
### **5. Softplus**
$$f(x) = \log(1 + e^x)$$
- A **smooth approximation of ReLU**.
- Always has a positive gradient, so no dead neurons.
- More computationally expensive than ReLU.
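A sketch of Softplus next to ReLU (using `np.logaddexp` for numerical stability is my choice, not from the note):

```python
import numpy as np

def softplus(x):
    # log(1 + e^x); logaddexp(0, x) computes this without overflowing for large x
    return np.logaddexp(0.0, x)

x = np.array([-5.0, 0.0, 5.0, 50.0])
print(softplus(x))         # smooth and always positive; its derivative is sigmoid(x)
print(np.maximum(0.0, x))  # ReLU for comparison; the two nearly coincide away from 0
```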
### **6. [[Softmax Function]]**
- See the linked note.
## What's the point of activation functions?
Take the simple, two-class [[Logistic Regression]] formula, but without the sigmoid (so just $w \cdot x + b$). No matter how many ways you combine such neurons, you'll always end up with a linear function (i.e. a straight line). *This is known as linearity*, and it means the stack can express nothing beyond what a single-layer network already can.
But when you combine them with an activation function (a non-linear function such as sigmoid), you introduce non-linearity. Non-linear functions are far more powerful than a single dividing line on a plane; a sketch of the difference follows below.
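Here's a minimal NumPy sketch of that point (all shapes and values are arbitrary): stacking two layers without an activation collapses to a single linear layer, while inserting a sigmoid between them does not:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # a small batch of inputs

# Two stacked layers with NO activation: y = (x W1 + b1) W2 + b2
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)
two_linear = (x @ W1 + b1) @ W2 + b2

# ...which is exactly one linear layer with W = W1 W2 and b = b1 W2 + b2
one_linear = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_linear, one_linear))  # True: stacking bought nothing

# Insert a non-linearity between the layers and the collapse no longer happens
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
nonlinear = sigmoid(x @ W1 + b1) @ W2 + b2
```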
> ***Universal Approximation Theorem:*** Any two-layer network (one input layer, one hidden layer) with a non-linear activation function and a sufficient number of hidden units can ***approximate any continuous function***.
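To make the theorem concrete, here's a toy sketch (architecture, hyperparameters, and the sin target are all my arbitrary choices): a single hidden tanh layer fit to a continuous function with plain gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)                              # the continuous target to approximate

H = 32                                     # number of hidden units
W1, b1 = rng.normal(size=(1, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, 1)) * 0.1, np.zeros(1)
lr = 0.05

for _ in range(5000):
    h = np.tanh(x @ W1 + b1)               # hidden layer: non-linear
    pred = h @ W2 + b2                     # output layer: linear
    err = pred - y
    # Backpropagation for a squared-error loss, averaged over the batch
    dW2 = h.T @ err / len(x)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)       # gradient through tanh
    dW1 = x.T @ dh / len(x)
    db1 = dh.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(np.mean((pred - y) ** 2))            # should end up far below the initial error
```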