本科生（问卷类）毕业论文数据分析基本流程

Jan 1, 0001 10 min read

1.基本概念的描述

1.1为什么做数据分析

了解我们要研究的对象的基本情况，探讨变量间的相互关系
抽样分析。如果能够获得总体的数据，那么“数据即是理论”，而不需要进行推断统计
由“科学”到“广义科学”：从寻求确定的结果到寻求稳定区间内的结果

1.2总体与样本

总体(population)：全体CSNU的学生
样本(sample)：CSNU心理学专业学生
总体参数：CSNU平均学习时间
样本统计量：CSNU心理学专业学生的平均学习时间
样本统计量 ≠ 总体参数
因此，从样本推论总体情况的时候，总是存在（不准确的情况）误差
置信区间(confidence interval)：有多大的概率我们估计的数据会落入这一区间内

1.3数据类型

定性数据
- 称名数据：性别，专业等
- 顺序数据：学历
定量数据
- 离散数据：学生人数（只能计为个数，不存在小数）
- 连续数据：体重，年龄等

1.4显著性水平与统计功效

显著性水平:犯第一类错误的最大概率的大小α。
p值：当H0是对的时，然后给定某个数，跟这个数一样极端或者比它还极端的概率就是P-value。
统计功效：统计功效指的是在假设检验中,H1(alternative hypothesis)为真时,正确地拒绝H0(null hypothesis)的概率,或者1-β。
一类错误与二类错误：

1.4显著性水平与统计功效

举例：在未来，女性已经统治了地球，他们觉的男人太过讨厌，于是想了一个办法来清除男性，他们商讨一番，决定使用的新鲜武器：自动判别，如果小于A罩杯，则杀无赦；如果等于或大于A罩杯，则放过。这个武器本意是区分男性和女性，杀死所有男性，放过所有女性。硝烟过后，大家可以想象得到结果，有些可怜的mm因为胸太小被误杀，这就是武器的判别程序犯的一类错误。本属于女性这个群体，却被错误的判断为不属于。有些胸肌发达的gg因为胸很大而活下来，这就是武器的判别程序犯的二类错误，本不属于女性这个群体，却被误判为属于。而所有被杀害的男性，则是该判别程序的效力（power，i.e. 1-β）。

1.5信度与效度

信度：问卷或测量工具的稳定性指标，说明可重复性的可能
效度：针对某一变量的问卷或测量工具有效性程度。
效度高，信度一定高；信度高，效度不一定高。

2.描述性统计

2.1 被试群体的基本信息

举例：大学生社会流动信念是如何影响学习投入的?
被试性别分布

被试性别与年级分布

2.2研究工具的信、效度

信度分析

## 
## 载入程辑包：'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## [1] 0.8586419

## [1] 0.8925558

## [1] 0.9360799

## [1] 0.9561618

效度分析:以社会阶层流动信念问卷为例

## This is lavaan 0.6-3

## lavaan is BETA software! Please report any bugs.

## 
## 载入程辑包：'lavaan'

## The following object is masked from 'package:psych':
## 
##     cor2cov

## lavaan 0.6-3 ended normally after 22 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         12
## 
##   Number of observations                           895
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     287.879
##   Degrees of freedom                                 9
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic             2454.614
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.886
##   Tucker-Lewis Index (TLI)                       0.809
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -7682.579
##   Loglikelihood unrestricted model (H1)      -7538.639
## 
##   Number of free parameters                         12
##   Akaike (AIC)                               15389.157
##   Bayesian (BIC)                             15446.719
##   Sample-size adjusted Bayesian (BIC)        15408.609
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.186
##   90 Percent Confidence Interval          0.168  0.205
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.068
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   G =~                                                                  
##     a1                1.000                               0.836    0.691
##     a2                1.116    0.056   19.897    0.000    0.933    0.746
##     a3                0.957    0.061   15.663    0.000    0.800    0.575
##     a4                1.174    0.054   21.799    0.000    0.982    0.834
##     a5                0.855    0.050   17.177    0.000    0.715    0.634
##     a6                1.210    0.058   20.805    0.000    1.012    0.786
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .a1                0.764    0.041   18.424    0.000    0.764    0.522
##    .a2                0.694    0.040   17.379    0.000    0.694    0.444
##    .a3                1.298    0.066   19.698    0.000    1.298    0.670
##    .a4                0.421    0.030   14.231    0.000    0.421    0.304
##    .a5                0.760    0.040   19.157    0.000    0.760    0.598
##    .a6                0.634    0.039   16.258    0.000    0.634    0.382
##     G                 0.699    0.062   11.201    0.000    1.000    1.000
## 
## R-Square:
##                    Estimate
##     a1                0.478
##     a2                0.556
##     a3                0.330
##     a4                0.696
##     a5                0.402
##     a6                0.618

效度分析:以学习投入量表为例

## lavaan 0.6-3 ended normally after 51 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         37
## 
##   Number of observations                           895
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                    1731.394
##   Degrees of freedom                               116
##   P-value (Chi-square)                           0.000
## 
## Model test baseline model:
## 
##   Minimum Function Test Statistic            12656.568
##   Degrees of freedom                               136
##   P-value                                        0.000
## 
## User model versus baseline model:
## 
##   Comparative Fit Index (CFI)                    0.871
##   Tucker-Lewis Index (TLI)                       0.849
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -20293.204
##   Loglikelihood unrestricted model (H1)     -19427.507
## 
##   Number of free parameters                         37
##   Akaike (AIC)                               40660.409
##   Bayesian (BIC)                             40837.891
##   Sample-size adjusted Bayesian (BIC)        40720.386
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.125
##   90 Percent Confidence Interval          0.120  0.130
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.061
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 =~                                                                 
##     d1                1.000                               0.882    0.603
##     d2                1.018    0.059   17.229    0.000    0.899    0.681
##     d3                1.133    0.062   18.401    0.000    1.000    0.746
##     d5                0.974    0.053   18.372    0.000    0.859    0.745
##     d7                1.297    0.062   20.757    0.000    1.144    0.895
##     d9                1.216    0.063   19.290    0.000    1.073    0.799
##   f2 =~                                                                 
##     d4                1.000                               0.958    0.759
##     d8                1.141    0.039   29.462    0.000    1.094    0.884
##     d10               1.180    0.041   29.023    0.000    1.131    0.873
##     d12               1.034    0.044   23.439    0.000    0.991    0.731
##     d15               0.976    0.041   24.060    0.000    0.935    0.748
##     d17               0.970    0.056   17.177    0.000    0.930    0.554
##   f3 =~                                                                 
##     d6                1.000                               0.954    0.781
##     d11               1.190    0.039   30.652    0.000    1.135    0.873
##     d13               1.021    0.042   24.506    0.000    0.974    0.735
##     d14               0.725    0.038   19.315    0.000    0.691    0.602
##     d16               0.952    0.048   19.782    0.000    0.908    0.615
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   f1 ~~                                                                 
##     f2                0.854    0.061   14.033    0.000    1.011    1.011
##   f2 ~~                                                                 
##     f3                0.948    0.058   16.446    0.000    1.038    1.038
##   f1 ~~                                                                 
##     f3                0.847    0.060   14.181    0.000    1.007    1.007
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .d1                1.366    0.066   20.717    0.000    1.366    0.637
##    .d2                0.933    0.046   20.475    0.000    0.933    0.536
##    .d3                0.795    0.039   20.140    0.000    0.795    0.443
##    .d5                0.593    0.029   20.152    0.000    0.593    0.445
##    .d7                0.324    0.019   17.207    0.000    0.324    0.198
##    .d9                0.651    0.033   19.675    0.000    0.651    0.361
##    .d4                0.677    0.033   20.655    0.000    0.677    0.425
##    .d8                0.335    0.017   19.236    0.000    0.335    0.219
##    .d10               0.398    0.020   19.509    0.000    0.398    0.237
##    .d12               0.855    0.041   20.750    0.000    0.855    0.465
##    .d15               0.690    0.033   20.696    0.000    0.690    0.441
##    .d17               1.951    0.093   21.023    0.000    1.951    0.693
##    .d6                0.581    0.028   20.460    0.000    0.581    0.390
##    .d11               0.404    0.022   18.656    0.000    0.404    0.239
##    .d13               0.807    0.039   20.753    0.000    0.807    0.460
##    .d14               0.840    0.040   21.057    0.000    0.840    0.638
##    .d16               1.358    0.065   21.043    0.000    1.358    0.622
##     f1                0.779    0.079    9.821    0.000    1.000    1.000
##     f2                0.918    0.069   13.372    0.000    1.000    1.000
##     f3                0.910    0.065   13.930    0.000    1.000    1.000
## 
## R-Square:
##                    Estimate
##     d1                0.363
##     d2                0.464
##     d3                0.557
##     d5                0.555
##     d7                0.802
##     d9                0.639
##     d4                0.575
##     d8                0.781
##     d10               0.763
##     d12               0.535
##     d15               0.559
##     d17               0.307
##     d6                0.610
##     d11               0.761
##     d13               0.540
##     d14               0.362
##     d16               0.378

2.3 描述统计

平均数M:用于描述群体基本状况的数据
标准差SD：用于描述群体数据离散程度的数据，标准差越大，群体数据越分散。

##        vars   n  mean    sd median trimmed   mad min max range  skew
## SocMob    1 895 24.42  5.69     25   24.63  4.45   6  36    30 -0.50
## LeaMov    2 895 56.81  9.45     57   56.95  8.90  18  80    62 -0.47
## PsyCap    3 895 77.57 13.60     77   77.84 11.86  15 105    90 -0.70
## LeaEng    4 895 83.76 17.15     84   83.87 16.31  17 119   102 -0.30
##        kurtosis   se
## SocMob     0.86 0.19
## LeaMov     1.92 0.32
## PsyCap     2.64 0.45
## LeaEng     0.90 0.57

相关系数r: 用于描述两个变量之间的关系

2.4推断统计

t检验:用于描述变量在两个水平上差异的检验方法。
社会流动信念t检验

## 载入需要的程辑包：boot

## 
## 载入程辑包：'boot'

## The following object is masked from 'package:psych':
## 
##     logit

## 载入需要的程辑包：magrittr

## DABEST (Data Analysis with Bootstrap Estimation) v0.2.2
## =======================================================
## 
## Variable: SocMob 
## 
## Unpaired mean difference of female (n=514) minus male (n=381)
##  -0.0917 [95CI  -0.838; 0.602]
## 
## 
## 5000 bootstrap resamples.
## All confidence intervals are bias-corrected and accelerated.

学习动机t检验

## DABEST (Data Analysis with Bootstrap Estimation) v0.2.2
## =======================================================
## 
## Variable: LeaMov 
## 
## Unpaired mean difference of female (n=514) minus male (n=381)
##  -1.35 [95CI  -2.55; -0.142]
## 
## 
## 5000 bootstrap resamples.
## All confidence intervals are bias-corrected and accelerated.

ANOVA检验:在描述性统计部分，主要了解单因素方差分析

3.中介作用

3.1概念模型

## 
## 载入程辑包：'processR'

## The following object is masked from 'package:psych':
## 
##     corPlot

3.2统计模型

3.3中介模型的实现

## LeaMov~a*SocMob
## LeaEng~c*SocMob+b*LeaMov
## indirect :=(a)*(b)
## direct :=c
## total := direct + indirect
## prop.mediated := indirect / total

## lavaan 0.6-3 ended normally after 19 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##   Number of observations                           895
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
##   Minimum Function Value               0.0000000000000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   LeaMov ~                                            
##     SocMob     (a)    0.752    0.049   15.199    0.000
##   LeaEng ~                                            
##     SocMob     (c)    0.950    0.092   10.379    0.000
##     LeaMov     (b)    0.673    0.055   12.213    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .LeaMov           70.906    3.352   21.154    0.000
##    .LeaEng          192.941    9.121   21.154    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     indirect          0.506    0.053    9.520    0.000
##     direct            0.950    0.092   10.379    0.000
##     total             1.456    0.088   16.523    0.000
##     prop.mediated     0.348    0.038    9.251    0.000

4.调节作用

4.1概念模型

4.2统计模型

4.3调节模型的实现

## LeaEng~c1*SocMob+c2*PsyCap+c3*SocMob:PsyCap
## PsyCap ~ PsyCap.mean*1
## PsyCap ~~ PsyCap.var*PsyCap
## direct :=c1+c3*PsyCap.mean
## direct.below:=c1+c3*(PsyCap.mean-sqrt(PsyCap.var))
## direct.above:=c1+c3*(PsyCap.mean+sqrt(PsyCap.var))

## lavaan 0.6-3 ended normally after 52 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         12
## 
##   Number of observations                           895
## 
##   Estimator                                        GLS
##   Model Fit Test Statistic                     412.550
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate    Std.Err  z-value  P(>|z|)
##   LeaEng ~                                              
##     SocMob    (c1)      0.181    1.131    0.160    0.873
##     PsyCap    (c2)      0.655    0.389    1.684    0.092
##     ScMb:PsyC (c3)      0.004    0.016    0.283    0.777
## 
## Covariances:
##                    Estimate    Std.Err  z-value  P(>|z|)
##   SocMob ~~                                             
##     SocMob:PsyCap    1739.963   92.476   18.815    0.000
## 
## Intercepts:
##                    Estimate    Std.Err  z-value  P(>|z|)
##     PsyCap  (PsC.)     77.568    0.455  170.565    0.000
##    .LeaEng             19.919   32.871    0.606    0.545
##     SocMob             24.420    0.190  128.252    0.000
##     ScMb:PC          1934.194   22.091   87.557    0.000
## 
## Variances:
##                    Estimate    Std.Err  z-value  P(>|z|)
##     PsyCap  (PsC.)     14.249    2.428    5.869    0.000
##    .LeaEng            148.595    7.028   21.142    0.000
##     SocMob             23.077    1.129   20.443    0.000
##     ScMb:PC        139721.037 8427.093   16.580    0.000
## 
## Defined Parameters:
##                    Estimate    Std.Err  z-value  P(>|z|)
##     direct              0.525    0.122    4.319    0.000
##     direct.below        0.509    0.090    5.632    0.000
##     direct.above        0.542    0.169    3.216    0.001

5.有调节的中介

5.1概念模型

5.2统计模型

5.3有调节的中介作用模型实现

## LeaMov~a1*SocMob+a2*PsyCap+a3*SocMob:PsyCap
## LeaEng~c1*SocMob+c2*PsyCap+c3*SocMob:PsyCap+b1*LeaMov+b2*LeaMov:PsyCap
## PsyCap ~ PsyCap.mean*1
## PsyCap ~~ PsyCap.var*PsyCap
## CE.XonM :=a1+a3*PsyCap.mean
## CE.MonY :=b1+b2*PsyCap.mean
## indirect :=(a1+a3*PsyCap.mean)*(b1+b2*PsyCap.mean)
## direct :=c1+c3*PsyCap.mean
## total := direct + indirect
## prop.mediated := indirect / total
## CE.XonM.below :=a1+a3*(PsyCap.mean-sqrt(PsyCap.var))
## CE.MonY.below :=b1+b2*(PsyCap.mean-sqrt(PsyCap.var))
## indirect.below :=(a1+a3*(PsyCap.mean-sqrt(PsyCap.var)))*(b1+b2*(PsyCap.mean-sqrt(PsyCap.var)))
## CE.XonM.above :=a1+a3*(PsyCap.mean+sqrt(PsyCap.var))
## CE.MonY.above :=b1+b2*(PsyCap.mean+sqrt(PsyCap.var))
## indirect.above :=(a1+a3*(PsyCap.mean+sqrt(PsyCap.var)))*(b1+b2*(PsyCap.mean+sqrt(PsyCap.var)))
## direct.below:=c1+c3*(PsyCap.mean-sqrt(PsyCap.var))
## direct.above:=c1+c3*(PsyCap.mean+sqrt(PsyCap.var))
## total.below := direct.below + indirect.below
## total.above := direct.above + indirect.above
## prop.mediated.below := indirect.below / total.below
## prop.mediated.above := indirect.above / total.above

## Warning in lav_data_full(data = data, group = group, cluster = cluster, :
## lavaan WARNING: some observed variances are (at least) a factor 1000 times
## larger than others; use varTable(fit) to investigate

## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan WARNING: some observed variances are larger than 1000000
##   lavaan NOTE: use varTable(fit) to investigate

## Warning in lav_model_vcov(lavmodel = lavmodel, lavsamplestats = lavsamplestats, : lavaan WARNING:
##     Could not compute standard errors! The information matrix could
##     not be inverted. This may be a symptom that the model is not
##     identified.

## Warning in lav_test_yuan_bentler(lavobject = NULL, lavsamplestats = lavsamplestats, : lavaan WARNING: could not invert information matrix

## lavaan 0.6-3 ended normally after 65 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         23
## 
##   Number of observations                           895
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                    4566.675
##   Degrees of freedom                                 4
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Observed
##   Observed information based on                Hessian
##   Standard Errors                   Robust.huber.white
## 
## Regressions:
##                    Estimate     Std.Err  z-value  P(>|z|)
##   LeaMov ~                                               
##     SocMob    (a1)       0.852       NA                  
##     PsyCap    (a2)       0.355       NA                  
##     ScMb:PsyC (a3)      -0.005       NA                  
##   LeaEng ~                                               
##     SocMob    (c1)       1.047       NA                  
##     PsyCap    (c2)       0.322       NA                  
##     ScMb:PsyC (c3)      -0.008       NA                  
##     LeaMov    (b1)      -0.410       NA                  
##     LMv:PsyCp (b2)       0.010       NA                  
## 
## Covariances:
##                    Estimate     Std.Err  z-value  P(>|z|)
##   SocMob ~~                                              
##     SocMob:PsyCap     3399.918       NA                  
##     LeaMov:PsyCap     3922.738       NA                  
##   SocMob:PsyCap ~~                                       
##     LeaMov:PsyCap   637676.532       NA                  
## 
## Intercepts:
##                    Estimate     Std.Err  z-value  P(>|z|)
##     PsyCap  (PsC.)      77.568       NA                  
##    .LeaMov              18.366       NA                  
##    .LeaEng              28.295       NA                  
##     SocMob              24.420       NA                  
##     ScMb:PC           1934.194       NA                  
##     LMv:PsC           4469.566       NA                  
## 
## Variances:
##                    Estimate     Std.Err  z-value  P(>|z|)
##     PsyCap  (PsC.)     184.686       NA                  
##    .LeaMov              62.424       NA                  
##    .LeaEng             137.790       NA                  
##     SocMob              32.375       NA                  
##     ScMb:PC         435780.850       NA                  
##     LMv:PsC        1586916.422       NA                  
## 
## Defined Parameters:
##                    Estimate     Std.Err  z-value  P(>|z|)
##     CE.XonM              0.454                           
##     CE.MonY              0.364                           
##     indirect             0.165                           
##     direct               0.390                           
##     total                0.555                           
##     prop.mediated        0.298                           
##     CE.XonM.below        0.524                           
##     CE.MonY.below        0.229                           
##     indirect.below       0.120                           
##     CE.XonM.above        0.384                           
##     CE.MonY.above        0.500                           
##     indirect.above       0.192                           
##     direct.below         0.505                           
##     direct.above         0.275                           
##     total.below          0.625                           
##     total.above          0.467                           
##     prop.medtd.blw       0.192                           
##     prop.meditd.bv       0.411

彭顺 (Shun Peng)

Doctoral Candidate in Psychology

matter.