download.url
url
getURL
getForm
postForm
getURL
download.url
getURL("http://www.omegahat.org/RCurl/index.html")
getURL("https://sourceforge.net")
download.url
getURL
getURL
getCurlOptionsConstants
names(getCurlOptionsConstants())
sort(names(getCurlOptionsConstants()))
  [1] "autoreferer"             "buffersize"
  [3] "cainfo"                  "capath"
  [5] "closepolicy"             "connecttimeout"
  [7] "cookie"                  "cookiefile"
  [9] "cookiejar"               "cookiesession"
 [11] "crlf"                    "customrequest"
 [13] "debugdata"               "debugfunction"
 [15] "dns.cache.timeout"       "dns.use.global.cache"
 [17] "egdsocket"               "encoding"
 [19] "errorbuffer"             "failonerror"
 [21] "file"                    "filetime"
 [23] "followlocation"          "forbid.reuse"
 [25] "fresh.connect"           "ftp.create.missing.dirs"
 [27] "ftp.response.timeout"    "ftp.ssl"
 [29] "ftp.use.eprt"            "ftp.use.epsv"
 [31] "ftpappend"               "ftplistonly"
 [33] "ftpport"                 "header"
 [35] "headerfunction"          "http.version"
 [37] "http200aliases"          "httpauth"
 [39] "httpget"                 "httpheader"
 [41] "httppost"                "httpproxytunnel"
 [43] "infile"                  "infilesize"
 [45] "infilesize.large"        "interface"
 [47] "ipresolve"               "krb4level"
 [49] "low.speed.limit"         "low.speed.time"
 [51] "maxconnects"             "maxfilesize"
 [53] "maxfilesize.large"       "maxredirs"
 [55] "netrc"                   "netrc.file"
 [57] "nobody"                  "noprogress"
 [59] "nosignal"                "port"
 [61] "post"                    "postfields"
 [63] "postfieldsize"           "postfieldsize.large"
 [65] "postquote"               "prequote"
 [67] "private"                 "progressdata"
 [69] "progressfunction"        "proxy"
 [71] "proxyauth"               "proxyport"
 [73] "proxytype"               "proxyuserpwd"
 [75] "put"                     "quote"
 [77] "random.file"             "range"
 [79] "readfunction"            "referer"
 [81] "resume.from"             "resume.from.large"
 [83] "share"                   "ssl.cipher.list"
 [85] "ssl.ctx.data"            "ssl.ctx.function"
 [87] "ssl.verifyhost"          "ssl.verifypeer"
 [89] "sslcert"                 "sslcertpasswd"
 [91] "sslcerttype"             "sslengine"
 [93] "sslengine.default"       "sslkey"
 [95] "sslkeypasswd"            "sslkeytype"
 [97] "sslversion"              "stderr"
 [99] "tcp.nodelay"             "telnetoptions"
[101] "timecondition"           "timeout"
[103] "timevalue"               "transfertext"
[105] "unrestricted.auth"       "upload"
[107] "url"                     "useragent"
[109] "userpwd"                 "verbose"
[111] "writefunction"           "writeheader"
[113] "writeinfo"

Each of these and what it controls is described in the libcurl man(ual) page for curl_easy_setopt and that is the authoritative documentation.
Anything we provide here is merely repetition or additional explanation.

The names of the options require a slight explanation. These correspond to symbolic names in the C code of libcurl. For example, the option url in R corresponds to CURLOPT_URL in C. Firstly, uppercase letters are annoying to type and read, so we have mapped them to lower case letters in R. We have also removed the prefix "CURLOPT_" since we know the context in which the option names are being used. And lastly, any option names that have a _ (after we have removed the CURLOPT_ prefix) are changed to replace the '_' with a '.' so we can type them in R without having to quote them. For example, combining these three rules, "CURLOPT_URL" becomes url and CURLOPT_NETRC_FILE becomes netrc.file. That is the mapping scheme.

The code that handles options in RCurl automatically maps the user's inputs to lower case. This means that you can use any mixture of upper and lower case that makes your code more readable to you and others. For example, we might write
writeFunction = basicTextGatherer()
or
HTTPHeader = c(Accept="text/html")
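The three naming rules described above can be sketched in a few lines of base R. This is only an illustration of the mapping scheme: curlopt_to_r and normalize_name are hypothetical helper names, not functions exported by RCurl.

```r
# Hypothetical helper (not part of RCurl): turn a libcurl C-level option
# name into the form used on the R side.
curlopt_to_r <- function(x) {
  x <- sub("^CURLOPT_", "", x)           # rule 2: drop the CURLOPT_ prefix
  x <- gsub("_", ".", x, fixed = TRUE)   # rule 3: '_' becomes '.'
  tolower(x)                             # rule 1: lower case
}

# Hypothetical helper: user-supplied names in mixed case are folded to
# lower case before they are looked up.
normalize_name <- function(x) tolower(x)

mapped <- curlopt_to_r(c("CURLOPT_URL", "CURLOPT_NETRC_FILE"))
folded <- normalize_name(c("writeFunction", "HTTPHeader"))
print(mapped)
print(folded)
```

With these rules, writeFunction and HTTPHeader resolve to the same options as writefunction and httpheader.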
We specify one or more options by using the names. To make
interactive use easier, we perform partial matching on the names
relative to the set of known names. So, for example, we could specify
getURL("http://www.omegahat.org/RCurl/testPassword", verbose = TRUE)
getURL("http://www.omegahat.org/RCurl/testPassword", v = TRUE)
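Partial matching of this kind can be sketched with base R's pmatch. This is an assumption about the general technique, not a copy of RCurl's own matching code; resolve_option and the short list of known names are made up for the illustration.

```r
# A small, hypothetical subset of the known option names.
known <- c("url", "useragent", "userpwd", "verbose")

# Hypothetical helper: resolve a possibly abbreviated, mixed-case name
# to the unique known option it abbreviates, or fail.
resolve_option <- function(name, known) {
  i <- pmatch(tolower(name), known)   # NA when there is no unique match
  if (is.na(i))
    stop("no unique match for option '", name, "'")
  known[i]
}

v_match     <- resolve_option("v", known)       # only "verbose" starts with v
userp_match <- resolve_option("userP", known)   # unique prefix of "userpwd"
# "u" is a prefix of three names, so it is ambiguous and rejected.
u_match <- tryCatch(resolve_option("u", known),
                    error = function(e) NA_character_)
```

An abbreviation is only useful while it stays unique, so short forms such as v are best kept to interactive sessions.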
 [1] "autoreferer"             "buffersize"
 [3] "closepolicy"             "connecttimeout"
 [5] "cookiesession"           "crlf"
 [7] "dns.cache.timeout"       "dns.use.global.cache"
 [9] "failonerror"             "followlocation"
[11] "forbid.reuse"            "fresh.connect"
[13] "ftp.create.missing.dirs" "ftp.response.timeout"
[15] "ftp.ssl"                 "ftp.use.eprt"
[17] "ftp.use.epsv"            "ftpappend"
[19] "ftplistonly"             "header"
[21] "http.version"            "httpauth"
[23] "httpget"                 "httpproxytunnel"
[25] "infilesize"              "ipresolve"
[27] "low.speed.limit"         "low.speed.time"
[29] "maxconnects"             "maxfilesize"
[31] "maxredirs"               "netrc"
[33] "nobody"                  "noprogress"
[35] "nosignal"                "port"
[37] "post"                    "postfieldsize"
[39] "proxyauth"               "proxyport"
[41] "proxytype"               "put"
[43] "resume.from"             "ssl.verifyhost"
[45] "ssl.verifypeer"          "sslengine.default"
[47] "sslversion"              "tcp.nodelay"
[49] "timecondition"           "timeout"
[51] "timevalue"               "transfertext"
[53] "unrestricted.auth"       "upload"
[55] "verbose"

The connecttimeout gives the maximum number of seconds the connection should take before raising an error, so this is a number. The header option, on the other hand, is merely a flag to indicate whether header information from the response should be included. So this can be a logical value (or a number that is 0 to say FALSE or non-zero for TRUE). At present, all numbers passed from R are converted to long when used in libcurl.

Many options are specified as strings. For example, we can specify the user password for a URI as
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", userpwd = "bob:duncantl", verbose = TRUE)
getURL("http://www.omegahat.org/RCurl/index.html", useragent="RCurl", referer="http://www.omegahat.org")
getURL("http://www.omegahat.org/RCurl", httpheader = c(Accept="text/html", 'Made-up-field' = "bob"))
> getURL("http://www.omegahat.org", httpheader = c(Accept="text/html", 'Made-up-field' = "bob"), verbose = TRUE)
* About to connect() to www.omegahat.org port 80
* Connected to www.omegahat.org (169.237.46.32) port 80
> GET / HTTP/1.1
Host: www.omegahat.org
Pragma: no-cache
Accept: text/html
Made-up-field: bob

(Note that not all servers will tolerate setting header fields arbitrarily and may return an error.)

The key thing to note is that headers are specified as name-value pairs in a character vector. R takes these and pastes the name and value together and passes the resulting character vector to libcurl. So while it is convenient to express the headers as
c(name = "value", name = "value")
c("name: value", "name: value")
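The pasting step can be reproduced in base R. The sketch below is an illustration of the idea rather than RCurl's internal code; it reuses the header values from the example above.

```r
# Headers written the convenient way: a named character vector.
headers <- c(Accept = "text/html", "Made-up-field" = "bob")

# Paste each name to its value to get the "name: value" lines that are
# handed on to libcurl.
header_lines <- paste(names(headers), headers, sep = ": ")
print(header_lines)
```

Because the names are just character strings, the same field name may appear more than once in the vector.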
getNativeSymbolInfo
basicTextGatherer
getURL
getURL
basicTextGatherer
h = basicTextGatherer()
txt = getURL("http://www.omegahat.org/RCurl", header = TRUE, headerfunction = h$update)
getURL
h$value()
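The update and value pair used above relies on an R closure that accumulates the text it is given. A minimal stand-in, assuming nothing about how basicTextGatherer is actually implemented, might look like this; simple_gatherer is a hypothetical name, and returning the byte count from update is an assumption that mirrors the convention of libcurl write callbacks.

```r
# Hypothetical stand-in for a text gatherer: a closure with shared state.
simple_gatherer <- function() {
  chunks <- character()
  update <- function(str) {
    chunks <<- c(chunks, str)      # remember each chunk as it arrives
    nchar(str, type = "bytes")     # assumption: report the bytes consumed
  }
  value <- function() paste(chunks, collapse = "")
  reset <- function() chunks <<- character()
  list(update = update, value = value, reset = reset)
}

g <- simple_gatherer()
g$update("HTTP/1.1 200 OK\r\n")
g$update("Content-Type: text/html\r\n")
gathered <- g$value()
```

Each call to simple_gatherer() creates independent state, so separate requests can use separate gatherers.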
debugGatherer
TRUE
, libcurl will provide a lot of information about
its actions. By default, these will be written on the console
(e.g. stderr). In some cases, we would not want these to be on the
screen but instead, for example, displayed in a GUI or stored in a
variable for closer examination. We can do this by providing a
callback function for the debugging output via the debugfunction
option for libcurl.
The debugGatherer
d = debugGatherer()
x = getURL("http://www.omegahat.org/RCurl", debugfunction=d$update, verbose = TRUE)
names(d$value())
[1] "text"      "headerIn"  "headerOut" "dataIn"    "dataOut"

The headerIn and headerOut fields report the text of the header for the response from the Web server and for our request respectively. Similarly, the dataIn and dataOut fields give the body of the response and request. And the text field is just messages from libcurl.

We should note that not all options are (currently) meaningful in R. For example, it is not currently possible to redirect standard error for libcurl to a different FILE* via the "stderr" option. (In the future, we may be able to specify an R function for writing errors from libcurl, but we have not put that in yet.)
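To make the five fields concrete, here is a minimal sketch of a debug callback that sorts messages by kind. It is not the debugGatherer implementation: simple_debug_gatherer is a hypothetical name, and the two-argument update(msg, type) signature is an assumption. The type codes 0 to 4 follow libcurl's curl_infotype values for text, header in, header out, data in and data out.

```r
# Hypothetical stand-in for a debug gatherer.
simple_debug_gatherer <- function() {
  fields <- c("text", "headerIn", "headerOut", "dataIn", "dataOut")
  store <- structure(rep("", length(fields)), names = fields)
  update <- function(msg, type) {
    # Assumption: 'type' is the curl_infotype code, 0 (text) to 4 (data out).
    i <- type + 1
    store[i] <<- paste0(store[i], msg)
    0
  }
  list(update = update, value = function() store)
}

d2 <- simple_debug_gatherer()
d2$update("* Connected to www.omegahat.org\n", 0)   # informational text
d2$update("GET / HTTP/1.1\r\n", 2)                  # header we sent
d2$update("HTTP/1.1 200 OK\r\n", 1)                 # header we received
debug_fields <- names(d2$value())
```

Keeping the kinds apart is what lets us look at, say, only the request header after the call has finished.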
http://www.omegahat.org/cgi-bin/form.pl?a=1&b=2
getForm
postForm
getForm("http://www.google.com/search", hl="en", lr="", ie="ISO-8859-1", q="RCurl", btnG="Search")
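A GET form submission amounts to appending escaped name=value pairs to the URI, as in the form.pl?a=1&b=2 example. The following base R sketch shows that construction; build_query_uri is a hypothetical helper, and it uses URLencode from the utils package for the escaping rather than anything from RCurl.

```r
# Hypothetical helper: build the URI for a GET form submission from
# name = value arguments.
build_query_uri <- function(uri, ...) {
  args <- list(...)
  vals <- vapply(args,
                 function(v) utils::URLencode(as.character(v), reserved = TRUE),
                 character(1))
  paste0(uri, "?", paste(names(args), vals, sep = "=", collapse = "&"))
}

simple_uri  <- build_query_uri("http://www.omegahat.org/cgi-bin/form.pl", a = 1, b = 2)
# A value with a space (made up for the illustration) must be escaped.
escaped_uri <- build_query_uri("http://www.google.com/search", hl = "en", q = "RCurl rocks")
```

Doing the escaping in one place means callers can pass ordinary R strings without worrying about reserved characters.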
htmlTreeParse
getForm
curlEscape
curlUnescape
postForm
postForm("http://www.speakeasy.org/~cgires/perl_form.cgi", "some_text" = "Duncan", "choice" = "Ho", "radbut" = "eep", "box" = "box1, box2" )
getForm
postForm
getURL
getCurlHandle
handle = getCurlHandle()
a = getURL("http://www.omegahat.org/RCurl", curl = handle)
b = getURL("http://www.omegahat.org/", curl = handle)
header=TRUE
option in the first call
above, it would remain set for the second call. This can sometimes be
inconvenient. In such cases, either use separate libcurl handles or
reset the options.
The function
dupCurlHandle
curlPerform
getURL
getCurlInfo
h = getCurlHandle()
getURL("http://www.omegahat.org", curl = h)
names(getCurlInfo(h))
 [1] "effective.url"           "response.code"
 [3] "total.time"              "namelookup.time"
 [5] "connect.time"            "pretransfer.time"
 [7] "size.upload"             "size.download"
 [9] "speed.download"          "speed.upload"
[11] "header.size"             "request.size"
[13] "ssl.verifyresult"        "filetime"
[15] "content.length.download" "content.length.upload"
[17] "starttransfer.time"      "content.type"
[19] "redirect.time"           "redirect.count"
[21] "private"                 "http.connectcode"
[23] "httpauth.avail"          "proxyauth.avail"

These give us the actual name of the URI downloaded after any redirections, information about the transfer speed, and so on. See the man page for curl_easy_getinfo.
curlVersion
curlVersion
$age
[1] 2

$version
[1] "7.12.0"

$vesion_num
[1] 461824

$host
[1] "powerpc-apple-darwin7.4.0"

$features
     ipv6       ssl      libz      ntlm largefile
        1         4         8        16       512

$ssl_version
[1] " OpenSSL/0.9.7b"

$ssl_version_num
[1] 9465903

$libz_version
[1] "1.2.1"

$protocols
[1] "ftp"    "gopher" "telnet" "dict"   "ldap"   "http"   "file"   "https"
[9] "ftps"

$ares
[1] ""

$ares_num
[1] 0

$libidn
[1] ""

The help page for the R function explains the fields which are hopefully clear from the names. The only ones that might be obscure are ares and libidn. ares refers to asynchronous domain name server (DNS) lookup for resolving the IP address (e.g. 128.41.12.2) corresponding to a machine name (e.g. www.omegahat.org). "GNU Libidn is an implementation of the Stringprep, Punycode and IDNA specifications defined by the IETF Internationalized Domain Names (IDN)" (taken from http://www.gnu.org/software/libidn/).
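The numeric version field packs the three parts of the version string into one integer, one byte each, which is how libcurl encodes its version number (0xXXYYZZ). A small sketch, using a hypothetical helper named decode_version_num, recovers 7.12.0 from the value 461824 shown above.

```r
# Hypothetical helper: unpack a libcurl version number of the form
# 0xXXYYZZ into "major.minor.patch".
decode_version_num <- function(n) {
  major <- (n %/% 65536) %% 256   # bits 16-23
  minor <- (n %/% 256) %% 256     # bits 8-15
  patch <- n %% 256               # bits 0-7
  paste(major, minor, patch, sep = ".")
}

decoded <- decode_version_num(461824)
print(decoded)
```

Comparing the integer form directly is the easiest way to test for a minimum libcurl version.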
curlGlobalInit
curlGlobalInit
 none   ssl win32   all
    0     1     2     3
attr(,"class")
[1] "CurlGlobalBits" "BitIndicator"

We would call
curlGlobalInit
curlGlobalInit(c("ssl", "win32"))
curlGlobalInit(c("ssl"))
setBitIndicators
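The flag values printed above are bit fields, so several of them are combined with a bitwise OR; that is why ssl (1) together with win32 (2) gives the same value as all (3). The sketch below shows the combination step in base R. combine_flags is a hypothetical helper and is not claimed to be how setBitIndicators works internally.

```r
# Flag values as shown in the output above.
global_flags <- c(none = 0L, ssl = 1L, win32 = 2L, all = 3L)

# Hypothetical helper: OR together the values of the named flags.
combine_flags <- function(wanted, flags = global_flags) {
  unknown <- setdiff(wanted, names(flags))
  if (length(unknown))
    stop("unknown flag(s): ", paste(unknown, collapse = ", "))
  Reduce(bitwOr, unname(flags[wanted]), 0L)
}

both     <- combine_flags(c("ssl", "win32"))
ssl_only <- combine_flags("ssl")
```

Referring to flags by name keeps calls readable and avoids hard-coding the underlying integers.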
curlOptions
mapCurlOptNames
opts = curlOptions(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", verbose = TRUE, .opts = opts)
h = getCurlHandle(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
getURL("http://www.omegahat.org/RCurl/testPassword/index.html", verbose = TRUE, curl = h)
getURL
curlSetOpt
POST /hibye.cgi HTTP/1.1
Connection: close
Accept: text/xml
Accept: multipart/*
Host: services.soaplite.com
User-Agent: SOAP::Lite/Perl/0.55
Content-Length: 450
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://www.soaplite.com/Demo#hi"
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/1999/XMLSchema"
xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance">
<SOAP-ENV:Body>
<namesp1:hi xmlns:namesp1="http://www.soaplite.com/Demo"/>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
Accept: text/xml
Accept: multipart/*
SOAPAction: "http://www.soaplite.com/Demo#hi"
Content-Type: text/xml; charset=utf-8
body = '<?xml version="1.0" encoding="UTF-8"?>\
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" \
                   xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" \
                   xmlns:xsd="http://www.w3.org/1999/XMLSchema" \
                   xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" \
                   xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance">\
  <SOAP-ENV:Body>\
    <namesp1:hi xmlns:namesp1="http://www.soaplite.com/Demo"/>\
  </SOAP-ENV:Body>\
</SOAP-ENV:Envelope>\n'

curlPerform(url="http://services.soaplite.com/hibye.cgi",
            httpheader=c(Accept="text/xml", Accept="multipart/*",
                         SOAPAction='"http://www.soaplite.com/Demo#hi"',
                         'Content-Type' = "text/xml; charset=utf-8"),
            postfields=body,
            verbose = TRUE
           )
getURL
curlPerform
curlPerform
getURL
curlPerform(url="http://services.soaplite.com/hibye.cgi", httpheader=c(Accept="text/xml", Accept="multipart/*", SOAPAction='"http://www.soaplite.com/Demo#hi"', 'Content-Type' = "text/xml; charset=utf-8"), postfields=body, verbose = TRUE )
getURL
getURL
htmlTreeParse
htmlTreeParse
getURL
Note: This is not a very compelling example anymore!
Using libcurl is by no means the only approach to getting HTTP access in R. Firstly, we have HTTP access in R via the facilities incorporated from libxml (nanohttp and nanoftp). These are, as the names suggest, basic implementations of the protocols and do not provide all the bells and whistles we might need generally. Also, they are not customizable from within R. Specifically, we cannot add header fields, handle binary data, set the body of the request, etc.
We can use R's socket connections and implement the details of HTTP ourselves. There is a great deal of work in this as we have discussed before. Also, we currently don't have secure sockets (i.e. using SSL) in R[1]. I initially started using this approach so that I could discover the nuances of HTTP. It quickly gets overwhelming to handle all the details. It is more tedious than technically challenging, especially when others have done it already in C libraries and done it well. The code that I have is in an unreleased package named httpClient. If anyone is interested, please contact me. R's sockets are also used in the httpRequest package on CRAN. This allows submitting forms and retrieving URIs. It is useful and, as the authors state, a "basic HTTP request" implementation. It doesn't escape characters, handle chunked responses, do redirects, support SSL, etc. It is flexible but leaves a lot for the user to do to set up the request and process the response. RCurl inherits many, many good features for "free" from libcurl.
libcurl is not the only C-level library that we could have used. Alternative libraries include libwww from the W3 group. We may find that that is more suitable, but libcurl will definitely suffice for the present.
libcurl can use ares for asynchronous DNS resolution.
[1] The RHTMLForms package: creating S functions from HTML forms. RHTMLForms
[2] Programming Web Services with SOAP. O'Reilly
[1] I have a local version (not with SSL) but they are not connections since the connection data structure is not exposed in the R API, yet!