Create basic ddi parser (#13)
* added test data files to package.

* added EzXML to IPUMS.jl file.

* created ddi_parser.jl file, and added include and export to IPUMS.jl file.

* added test cases for ddi parser.

* added EzXML to Project.toml and removed extraneous print statement from ddi_parser.jl

* moved structs to structs.jl, and added docstrings.

* moved xpath constants to constants.jl file and added documentation.

* updated Project.toml with my email address. Changing file type test as per request.

* added throw to the ArgumentError.

* added documentation to the loops. added a test case to check whether a file's existence causes an error.

* got preliminary regex stuff working. Need to test.

* added a test case for the data summary. added more documentation.

* added backticks to DDIInfo in docs.

* escape the double dashes.

* fixed some documentation formatting, and added ns testcase.

* indicated in parse_ddi() function that the user should look at docs on DDIInfo for more info.

* re-added small check functions.

* adjusted struct documentation example to reflect updated API.

* Fixed typos and reviewed documentation

---------

Co-authored-by: TheCedarPrince <[email protected]>
00krishna and TheCedarPrince authored Apr 16, 2024
1 parent 476f223 commit 67c8286
Showing 8 changed files with 1,107 additions and 3 deletions.
3 changes: 2 additions & 1 deletion Project.toml
@@ -1,11 +1,12 @@
name = "IPUMS"
uuid = "51d1f77e-d457-4c14-a89d-9ed71839f38d"
-authors = ["TheCedarPrince <[email protected]> and Krishna Bhogaonker"]
+authors = ["TheCedarPrince <[email protected]> and Krishna Bhogaonker <[email protected]>"]
version = "0.0.1"

[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
EzXML = "8f5d6c58-4d21-5cfd-889c-e3ad7ee6a615"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
OpenAPI = "d5e62ea6-ddf3-4d43-8e4c-ad5e6c8bfd7d"
6 changes: 4 additions & 2 deletions src/IPUMS.jl
@@ -13,7 +13,8 @@ module IPUMS
Client
using TimeZones
using URIs

using EzXML

include("structs.jl")
include("constants.jl")
include("helpers.jl")
@@ -30,6 +31,7 @@ module IPUMS
include("modelincludes.jl")
include("apis/api_DefaultApi.jl")
include("piracy.jl")
include("parsers/ddi_parser.jl")

#=
export from_json
@@ -57,5 +59,5 @@ module IPUMS

export Client
export DefaultApi

export parse_ddi
end
56 changes: 56 additions & 0 deletions src/constants.jl
@@ -88,3 +88,59 @@ const global ipums_sources = [
collection_type = "microdata"
)
]


#=
XPaths for parsing DDI XML files

This is a list of XPaths for locating file-level and variable-level metadata
in a DDI XML file. The constants are stored here, but used in the `ddi_parser.jl`
file.

Extract-level XPaths:
    EXTRACT_CONDITIONS - IPUMS conditions for fair and legal use of public data.
    EXTRACT_CITATION - Citation information for referencing IPUMS data.
    EXTRACT_IPUMS_PROJECT - Name of the IPUMS project from which the data is taken, such as CPS or DHS.
    EXTRACT_NOTES - User-provided notes or additional miscellaneous information about the extract.
    EXTRACT_DATE - The date that the extract was generated.

Variable-level XPaths:
    VAR_NODE_LOCATION - The base nodes that correspond to each variable in the dataset.
    VAR_NAME_XPATH - The name of the IPUMS variable.
    VAR_STARTPOS_XPATH - The start position (in text columns) of the variable in a fixed-width file specification.
    VAR_ENDPOS_XPATH - The end position (in text columns) of the variable in a fixed-width file specification.
    VAR_WIDTH_XPATH - The width (in text columns) of the variable in a fixed-width file specification.
    VAR_LABL_XPATH - A short description of the data contained in a variable.
    VAR_TXT_XPATH - A longer and more complete description of the data contained in a variable.
    VAR_DCML_XPATH - The number of decimal places contained in a variable.
    VAR_TYPE_XPATH - An indicator of whether the variable is a string or numeric data type.
    VAR_INTERVAL_XPATH - An indicator of whether a numeric variable is continuous or discrete.
    VAR_CATEGORY_XPATH - A description of the category levels and corresponding numerical values for a categorical variable,
        such as "Women => 0, Men => 1".
=#

const EXTRACT_CONDITIONS = "/x:codeBook/x:stdyDscr/x:dataAccs/x:useStmt/x:conditions"
const EXTRACT_CITATION = "/x:codeBook/x:stdyDscr/x:dataAccs/x:useStmt/x:citReq"
const EXTRACT_IPUMS_PROJECT = "/x:codeBook/x:stdyDscr/x:citation/x:serStmt/x:serName"
const EXTRACT_NOTES = "/x:codeBook/x:stdyDscr/x:notes"
const EXTRACT_DATE = "/x:codeBook/x:stdyDscr/x:citation/x:prodStmt/x:prodDate/@date"

const VAR_NODE_LOCATION = "/x:codeBook/x:dataDscr/x:var"
const VAR_NAME_XPATH = "/x:codeBook/x:dataDscr/x:var/@name"
const VAR_STARTPOS_XPATH = "/x:codeBook/x:dataDscr/x:var/x:location/@StartPos"
const VAR_ENDPOS_XPATH = "/x:codeBook/x:dataDscr/x:var/x:location/@EndPos"
const VAR_WIDTH_XPATH = "/x:codeBook/x:dataDscr/x:var/x:location/@width"
const VAR_LABL_XPATH = "/x:codeBook/x:dataDscr/x:var/x:labl"
const VAR_TXT_XPATH = "/x:codeBook/x:dataDscr/x:var/x:txt"
const VAR_DCML_XPATH = "/x:codeBook/x:dataDscr/x:var/@dcml"
const VAR_TYPE_XPATH = "/x:codeBook/x:dataDscr/x:var/x:varFormat/@type"
const VAR_INTERVAL_XPATH = "/x:codeBook/x:dataDscr/x:var/@intrvl"
const VAR_CATEGORY_XPATH = "/x:codeBook/x:dataDscr/x:var/x:catgry"
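
Every step in the XPath constants above carries an `x:` prefix, so a query only matches once that prefix is bound to the document's default namespace. The following is a minimal sketch of how such paths can be evaluated with EzXML; it is not taken from `ddi_parser.jl`. The sample XML, its namespace URI, and the variable names `SAMPLE_DDI`, `project`, and `var_names` are assumptions made for illustration, and a real IPUMS DDI file is far larger.

```julia
using EzXML

# Hypothetical, heavily trimmed DDI-style document (illustration only).
# The namespace URI is a placeholder, not necessarily the one IPUMS uses.
const SAMPLE_DDI = """
<codeBook xmlns="http://example.org/ddi">
  <stdyDscr>
    <citation>
      <serStmt><serName>IPUMS CPS</serName></serStmt>
    </citation>
  </stdyDscr>
  <dataDscr>
    <var name="YEAR" intrvl="discrete" dcml="0">
      <location StartPos="1" EndPos="4" width="4"/>
      <labl>Survey year</labl>
    </var>
    <var name="SEX" intrvl="discrete" dcml="0">
      <location StartPos="5" EndPos="5" width="1"/>
      <labl>Sex</labl>
    </var>
  </dataDscr>
</codeBook>
"""

# Two of the XPath strings from constants.jl, copied verbatim.
const EXTRACT_IPUMS_PROJECT = "/x:codeBook/x:stdyDscr/x:citation/x:serStmt/x:serName"
const VAR_NAME_XPATH = "/x:codeBook/x:dataDscr/x:var/@name"

doc = parsexml(SAMPLE_DDI)
rootnode = root(doc)

# Bind the "x" prefix used in the XPaths to the root element's namespace URI.
ns = ["x" => namespace(rootnode)]

# findfirst returns a single node; nodecontent extracts its text.
project = nodecontent(findfirst(EXTRACT_IPUMS_PROJECT, rootnode, ns))

# findall returns every match; here, one attribute node per <var> element.
var_names = nodecontent.(findall(VAR_NAME_XPATH, rootnode, ns))

println(project)
println(var_names)
```

Without the `ns` argument, the same queries would find nothing, because unprefixed or unbound names in XPath do not match elements in a default namespace.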

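
The start position, end position, width, and decimal metadata documented for the variable-level XPaths describe how a field is laid out in a fixed-width data file. As a rough sketch of how those values could be applied, the example below slices one record by column range. It assumes positions are 1-based and inclusive, that the data is ASCII, and that `dcml` counts implied decimal places; the `VarLayout` type and the `field_width` and `read_field` helpers are hypothetical names, not part of the package.

```julia
# Hypothetical container for the per-variable layout metadata (illustration only).
struct VarLayout
    name::String
    startpos::Int   # first column of the field, assumed 1-based
    endpos::Int     # last column of the field, assumed inclusive
    dcml::Int       # assumed to be the number of implied decimal places
end

# Width follows from the inclusive column range.
field_width(v::VarLayout) = v.endpos - v.startpos + 1

# Slice one field out of a fixed-width record and scale it by the implied decimals.
# Byte indexing is used, so this sketch assumes ASCII input.
function read_field(line::AbstractString, v::VarLayout)
    raw = strip(line[v.startpos:v.endpos])
    value = parse(Int, raw)
    return v.dcml == 0 ? value : value / 10^v.dcml
end

layout = [
    VarLayout("YEAR", 1, 4, 0),
    VarLayout("WEIGHT", 5, 10, 2),
]

line = "2020123456"
record = Dict(v.name => read_field(line, v) for v in layout)

println(record)
```

Keeping the width as a derived quantity avoids storing three values that can disagree; a parser that reads all three from the DDI file could instead check that they are consistent.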


