Support for categorical variables in regression #38

Open
agisga opened this issue Jun 13, 2015 · 2 comments

agisga commented Jun 13, 2015

Categorical (as opposed to numeric) variables are ubiquitous in data analysis and linear regression, but they seem not to be supported by Statsample::Regression.
Here is an example of what I mean:

In R, I can do:

> head(fake.salaries)
      salary years ethnicity
1  5.0823594     9     black
2 -0.4459633     3     black
3 16.0734587     2     white
4 10.5554305     7     other
5  9.9438798     8     other
6  9.6776724     6    latino
> mod <- lm(salary ~ years + ethnicity, fake.salaries)
> summary(mod)

Call:
lm(formula = salary ~ years + ethnicity, data = fake.salaries)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5068 -1.1283 -0.3713  1.1227  3.3027 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.5421     0.9851   1.565    0.131    
years              0.1729     0.1561   1.108    0.279    
ethnicitylatino    6.7300     0.9984   6.741 5.67e-07 ***
ethnicitymexican   5.4826     0.8755   6.262 1.79e-06 ***
ethnicityother     6.6404     0.9034   7.351 1.37e-07 ***
ethnicitywhite    11.5310     0.9309  12.387 6.46e-12 ***

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.66 on 24 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8503 
F-statistic: 33.95 on 5 and 24 DF,  p-value: 3.942e-10

We see that lm treats the variable "ethnicity" as categorical and fits the model accordingly. The output shows that it takes ethnicity "black" as the base level, and that every other ethnicity has a statistically significant effect on "salary" (with p-values below 2e-6) when compared to the base level.
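
To illustrate what such categorical support involves, here is a minimal sketch in plain Ruby of treatment (dummy) coding, the default contrast scheme lm applies to unordered factors: the categorical column is replaced by one 0/1 indicator column per non-base level. The helper name `dummy_code` and its `base:` option are hypothetical and not part of Statsample; the input is just the six "ethnicity" values from the head() output above, so the "mexican" level does not appear.

```ruby
# Hypothetical helper (not a Statsample method): expands a categorical
# column into 0/1 indicator columns, one per level except the base level.
def dummy_code(values, name, base: nil)
  levels = values.uniq.sort
  base ||= levels.first # assumption: first level in sorted order is the base
  raise ArgumentError, "unknown base level #{base}" unless levels.include?(base)

  (levels - [base]).each_with_object({}) do |level, columns|
    # Column names mimic R's coefficient labels, e.g. "ethnicitywhite".
    columns["#{name}#{level}"] = values.map { |v| v == level ? 1 : 0 }
  end
end

# The six "ethnicity" values shown in the head() output above.
ethnicity = %w[black black white other other latino]

dummy_code(ethnicity, "ethnicity").each do |column, indicators|
  puts "#{column}: #{indicators.inspect}"
end
```

The resulting indicator columns are purely numeric, so they could in principle be passed to a regression routine in place of the original string column.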

When I try to analyse the same data in Statsample:

pry(main)> df = Statsample::CSV.read("/home/alexej/Desktop/fake_salaries.csv")
=> #<Statsample::Dataset:69956503513460 @name=Dataset 1 @fields=[salary,years,ethnicity] cases=30
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
NoMethodError: NoMethodError
from /home/alexej/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/statsample-1.5.0/lib/statsample/vector.rb:186:in `_check_type'

So the result is a "NoMethodError". When I delete "ethnicity", the model can be fit:

pry(main)> df.delete_vector("ethnicity")
=> ["ethnicity"]
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
=> #<Statsample::Regression::Multiple::RubyEngine:0x007f4008733620
> puts mod.summary
= Multiple reggresion of years on salary
  Engine: Statsample::Regression::Multiple::RubyEngine
  Cases(listwise)=30(30)
  R=0.061
  R^2=0.004
  R^2 Adj=-0.032
  Std.Error R=4.358
  Equation=7.046 + 0.125years
  == ANOVA
    ANOVA Table
+------------+---------+----+--------+-------+-------+
|   source   |   ss    | df |   ms   |   f   |   p   |
+------------+---------+----+--------+-------+-------+
| Regression | 1.979   | 1  | 1.979  | 0.104 | 0.749 |
| Error      | 531.824 | 28 | 18.994 |       |       |
| Total      | 533.804 | 29 | 20.973 |       |       |
+------------+---------+----+--------+-------+-------+

  Beta coefficients
+----------+-------+-------+-------+-------+
|  coeff   |   b   | beta  |  se   |   t   |
+----------+-------+-------+-------+-------+
| Constant | 7.046 | -     | 2.233 | 3.155 |
| years    | 0.125 | 0.061 | 0.386 | 0.323 |
+----------+-------+-------+-------+-------+

This issue possibly allows for a common solution with SciRuby/statsample-glm#11 and SciRuby/daru#9.

@dansbits

+1 Has there been any progress on this?

Member

v0dro commented Aug 13, 2016

Yes, @lokeshh is working on it as part of his GSoC project.
