stringi: Character String Processing Facilities
A package for string manipulation built on the ICU library.
- CRAN: http://cran.r-project.org/web/packages/stringi/index.html
- GitHub: https://github.com/Rexamine/stringi
> library(stringi)
Version: 1.1.2 (several examples below use the %>% pipe, which additionally requires library(magrittr))
Function | Summary |
---|---|
%s+% | Concatenate Two Character Vectors |
%s<% | Compare Strings with or without Collation |
stri_compare | Compare Strings with or without Collation |
stri_count | Count the Number of Pattern Matches |
stri_count_boundaries | Count the Number of Text Boundaries |
stri_datetime_add | [DRAFT API] Date and Time Arithmetic |
stri_datetime_create | [DRAFT API] Create a Date-Time Object |
stri_datetime_fields | [DRAFT API] Get Values for Date and Time Fields |
stri_datetime_format | [DRAFT API] Date and Time Formatting and Parsing |
stri_datetime_fstr | [DRAFT API] Convert 'strptime'-style Format Strings |
stri_datetime_now | [DRAFT API] Get Current Date and Time |
stri_datetime_symbols | [DRAFT API] List Localizable Date-Time Formatting Data |
stri_detect | Detect a Pattern Match |
stri_dup | Duplicate Strings |
stri_duplicated | Determine Duplicated Elements |
stri_enc_detect | [DRAFT API] Detect Character Set and Language |
stri_enc_detect2 | [DRAFT API] Detect Locale-Sensitive Character Encoding |
stri_enc_fromutf32 | Convert From UTF-32 |
stri_enc_info | Query a Character Encoding |
stri_enc_isascii | Check If a Data Stream Is Possibly in ASCII |
stri_enc_isutf16be | Check If a Data Stream Is Possibly in UTF16 or UTF32 |
stri_enc_isutf8 | Check If a Data Stream Is Possibly in UTF-8 |
stri_enc_list | List Known Character Encodings |
stri_enc_mark | Get Declared Encodings of Each String |
stri_enc_set | Set or Get Default Character Encoding in 'stringi' |
stri_enc_toascii | Convert To ASCII |
stri_enc_tonative | Convert Strings To Native Encoding |
stri_enc_toutf32 | Convert Strings To UTF-32 |
stri_enc_toutf8 | Convert Strings To UTF-8 |
stri_encode | Convert Strings Between Given Encodings |
stri_escape_unicode | Escape Unicode Code Points |
stri_extract_all | Extract Occurrences of a Pattern |
stri_extract_all_boundaries | Extract Text Between Text Boundaries |
stri_flatten | Flatten a String |
stri_info | Query Default Settings for 'stringi' |
stri_install_check | Installation-Related Utilities [DEPRECATED] |
stri_isempty | Determine if a String is of Length Zero |
stri_join | Concatenate Character Vectors |
stri_length | Count the Number of Code Points |
stri_list2matrix | Convert a List to a Character Matrix |
stri_locale_info | Query Given Locale |
stri_locale_list | List Available Locales |
stri_locale_set | Set or Get Default Locale in 'stringi' |
stri_locate_all | Locate Occurrences of a Pattern |
stri_locate_all_boundaries | Locate Specific Text Boundaries |
stri_match_all | Extract Regex Pattern Matches, Together with Capture Groups |
stri_numbytes | Count the Number of Bytes |
stri_opts_brkiter | Generate a List with BreakIterator Settings |
stri_opts_collator | Generate a List with Collator Settings |
stri_opts_fixed | Generate a List with Fixed Pattern Search Engine's Settings |
stri_opts_regex | Generate a List with Regex Matcher Settings |
stri_order | Ordering Permutation and Sorting |
stri_pad_both | Pad (Center/Left/Right Align) a String |
stri_rand_lipsum | A Lorem Ipsum Generator |
stri_rand_shuffle | Randomly Shuffle Code Points in Each String |
stri_rand_strings | Generate Random Strings |
stri_read_lines | [DRAFT API] Read Text Lines from a Text File |
stri_read_raw | [DRAFT API] Read Whole Text File as Raw |
stri_replace_all | Replace Occurrences of a Pattern |
stri_replace_na | Replace Missing Values in a Character Vector |
stri_reverse | Reverse Each String |
stri_split | Split a String By Pattern Matches |
stri_split_boundaries | Split a String at Specific Text Boundaries |
stri_split_lines | Split a String Into Text Lines |
stri_startswith | Determine if the Start or End of a String Matches a Pattern |
stri_stats_general | General Statistics for a Character Vector |
stri_stats_latex | Statistics for a Character Vector Containing LaTeX Commands |
stri_sub | Extract a Substring From or Replace a Substring In a Character Vector |
stri_subset | Select Elements that Match a Given Pattern |
stri_timezone_get | [DRAFT API] Set or Get Default Time Zone in 'stringi' |
stri_timezone_info | [DRAFT API] Query a Given Time Zone |
stri_timezone_list | [DRAFT API] List Available Time Zone Identifiers |
stri_trans_char | Translate Characters |
stri_trans_general | General Text Transforms, Including Transliteration |
stri_trans_list | List Available Text Transforms and Transliterators |
stri_trans_nfc | Perform or Check For Unicode Normalization |
stri_trans_tolower | Transform String with Case Mapping |
stri_trim_both | Trim Characters from the Left and/or Right Side of a String |
stri_unescape_unicode | Unescape All Escape Sequences |
stri_unique | Extract Unique Elements |
stri_width | Determine the Width of Code Points |
stri_wrap | Word Wrap Text to Format Paragraphs |
stri_write_lines | [DRAFT API] Write Text Lines to a Text File |
stringi-arguments | Passing Arguments to Functions in 'stringi' |
stringi-encoding | Character Encodings and 'stringi' |
stringi-locale | Locales and 'stringi' |
stringi-package | THE String Processing Package |
stringi-search | String Searching |
stringi-search-boundaries | Text Boundary Analysis in 'stringi' |
stringi-search-charclass | Character Classes in 'stringi' |
stringi-search-coll | Locale-Sensitive Text Searching in 'stringi' |
stringi-search-fixed | Locale-Insensitive Fixed Pattern Matching in 'stringi' |
stringi-search-regex | Regular Expressions in 'stringi' |
stri_count_boundaries / stri_count_words
Arguments
- str
- ...
- opts_brkiter
- locale
> text <- "こんにちは 今日は 天気が 良い です ね。明日の天気はどうでしょうか"
> stri_count_boundaries(text, type = "word")
[1] 22
> stri_count_boundaries(text, type = "sentence")
[1] 2
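The two functions named in the heading differ in what they count. The sketch below uses an arbitrary English sentence (not from the original) to show this: `stri_count_words` counts only word-like segments, while `stri_count_boundaries(type = "word")` counts every segment between word boundaries, including whitespace.

```r
library(stringi)

sentence <- "The quick brown fox"  # illustrative input, chosen for this sketch

# Counts only word-like segments: "The", "quick", "brown", "fox"
n_words <- stri_count_words(sentence)
n_words
# [1] 4

# Counts every segment between word boundaries, including the spaces
n_segments <- stri_count_boundaries(sentence, type = "word")
n_segments
```

This is probably why the Japanese example above reports 22 for a sentence with far fewer words: spaces and punctuation are counted as segments too.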
stri_datetime_format
Formats dates and times for display.
> stri_datetime_format(stri_datetime_now(), "datetime_relative_medium")
[1] "今日 2:22:31"
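Besides predefined styles such as `"datetime_relative_medium"`, the format argument also accepts an ICU date-time pattern. The following is a minimal sketch that assumes a fixed timestamp built with `stri_datetime_create`; the variable names and the chosen date are illustrative only.

```r
library(stringi)

# Build a fixed point in time so the result does not depend on the clock
when <- stri_datetime_create(2016, 3, 15, 10, 30, 0, tz = "UTC")

# ICU pattern letters: uuuu = year, MM = month, dd = day, HH = hour (0-23)
formatted <- stri_datetime_format(when, "uuuu-MM-dd HH:mm:ss", tz = "UTC")
formatted
# [1] "2016-03-15 10:30:00"
```

Passing `tz` explicitly in both calls keeps the result independent of the session's default time zone.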
stri_detect / stri_detect_fixed / stri_detect_charclass / stri_detect_coll / stri_detect_regex
Returns a logical vector.
Arguments
- str
- ...
> stri_detect_fixed(c("stringi R", "REXAMINE", "123"), c('i', 'R', '0'))
[1] TRUE TRUE FALSE
> stri_detect_fixed(c("stringi R", "REXAMINE", "123"), 'R')
[1] TRUE TRUE FALSE
> stri_detect_charclass(c("stRRRingi","REXAMINE", "123"), c("\\p{Ll}", "\\p{Lu}", "\\p{Zs}"))
[1] TRUE TRUE FALSE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), 'R.')
[1] FALSE TRUE FALSE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), '[[:alpha:]]*?')
[1] TRUE TRUE TRUE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), '[a-zC1]')
[1] TRUE FALSE TRUE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), '( R|RE)')
[1] TRUE TRUE FALSE
> stri_detect_regex("stringi", "STRING.", case_insensitive = TRUE)
[1] TRUE
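The engine matters mainly when the pattern contains regex metacharacters. As a small sketch with made-up inputs, the same pattern `"."` is a literal full stop for the fixed engine and a wildcard for the regex engine:

```r
library(stringi)

x <- c("a.c", "abc")  # illustrative inputs

# Fixed search: "." is a literal full stop
fixed_hits <- stri_detect_fixed(x, ".")
fixed_hits
# [1]  TRUE FALSE

# Regex search: "." matches any character
regex_hits <- stri_detect_regex(x, ".")
regex_hits
# [1] TRUE TRUE

# The generic stri_detect() selects the engine from the argument name
generic_hits <- stri_detect(x, fixed = ".")
```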
stri_enc_list
Lists the character encodings supported by ICU.
> stri_enc_list(simplify = TRUE) %>% length()
[1] 1201
stri_locate_all / stri_locate_first / stri_locate_last
> stri_locate_all('XaaaaX',
+ regex=c('\\p{Ll}', '\\p{Ll}+', '\\p{Ll}{2,3}', '\\p{Ll}{2,3}?'))
[[1]]
start end
[1,] 2 2
[2,] 3 3
[3,] 4 4
[4,] 5 5
[[2]]
start end
[1,] 2 5
[[3]]
start end
[1,] 2 4
[[4]]
start end
[1,] 2 3
[2,] 4 5
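The `_first` and `_last` variants return a single match per input string as a two-column matrix. A brief sketch, reusing the `'XaaaaX'` string from above with a fixed pattern, and feeding the positions into `stri_sub`:

```r
library(stringi)

s <- "XaaaaX"

first_a <- stri_locate_first_fixed(s, "a")  # 1 x 2 matrix: start = 2, end = 2
last_a  <- stri_locate_last_fixed(s, "a")   # 1 x 2 matrix: start = 5, end = 5

# Extract everything between the first and last match, inclusive
between <- stri_sub(s, from = first_a[1, "start"], to = last_a[1, "end"])
between
# [1] "aaaa"
```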
stri_locale_list
Lists the available locales.
> stri_locale_list() %>% {
+ length(.) %>% print()
+ head(.)
+ }
[1] 683
[1] "af" "af_NA" "af_ZA" "agq" "agq_CM" "ak"
stri_locale_info
> stri_locale_info(locale = "Pl_pL")
$Language
[1] "pl"
$Country
[1] "PL"
$Variant
[1] ""
$Name
[1] "pl_PL"
stri_locale_set / stri_locale_get
Prints the current locale and changes the locale.
> stri_locale_get()
[1] "ja_JP"
> stri_locale_set("ja_JP"); stri_locale_get()
You are now working with stringi_1.1.2 (ja_JP.UTF-8; ICU4C 55.1 [bundle]; Unicode 7.0)
[1] "ja_JP"
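Changing the default locale is not always necessary, because many locale-sensitive functions accept a `locale` argument directly. The sketch below illustrates this with case mapping; the choice of `stri_trans_toupper` and the Turkish locale is an illustrative assumption, not something taken from the original text.

```r
library(stringi)

# Same input, different case-mapping rules depending on the locale
upper_en <- stri_trans_toupper("i", locale = "en_US")
upper_tr <- stri_trans_toupper("i", locale = "tr_TR")

upper_en
# [1] "I"

# In Turkish, "i" upper-cases to U+0130 (capital I with dot above)
upper_tr == "\u0130"
# [1] TRUE
```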
stri_rand_lipsum
> stri_rand_lipsum(nparagraphs = 1, start_lipsum = TRUE)
[1] "Lorem ipsum dolor sit amet, velit est imperdiet ut. Donec dapibus aliquam convallis at neque nulla sit, dis aliquam risus sed faucibus malesuada. Blandit aliquam per auctor pellentesque, nisl nec bibendum magnis felis. Ipsum hac a nisi! Ac sem et nec, nulla massa. Scelerisque nec molestie aenean. Finibus non egestas phasellus tortor ligula vitae in a sollicitudin mattis vulputate, nec eu, sociis. Mi, quam nec massa nunc a commodo nulla mattis et euismod enim."
stri_rand_strings
> stri_rand_strings(n = 10, length = 5)
[1] "KCZtW" "afWV1" "2vlCe" "eOoXs" "YuaQw" "hP1ZU" "soqMs" "9Hfya"
[9] "W3y75" "sVKt1"
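The character set can be restricted with the `pattern` argument, which takes a character class. A small sketch follows; the hexadecimal class and the seed are arbitrary choices, and the `set.seed()` call assumes that stringi draws on R's random number generator.

```r
library(stringi)

set.seed(42)  # arbitrary seed, only to make repeated runs comparable

# Three 8-character strings made of lowercase hexadecimal digits
ids <- stri_rand_strings(n = 3, length = 8, pattern = "[0-9a-f]")

# Every generated string should consist solely of characters from the class
all_hex <- all(stri_detect_regex(ids, "^[0-9a-f]{8}$"))
all_hex
# [1] TRUE
```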
stri_split
Splits strings.
Arguments
- str
- ...
- pattern, regex, fixed, coll, charclass
- n
- omit_empty
- tokens_only
- simplify
- opts_collator, opts_fixed, opts_regex
> stri_split_fixed("a_b_c_d", "_")
[[1]]
[1] "a" "b" "c" "d"
> stri_split_charclass("Lorem ipsum dolor sit amet", "\\p{WHITE_SPACE}")
[[1]]
[1] "Lorem" "ipsum" "dolor" "sit" "amet"
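The `n`, `omit_empty`, and `simplify` arguments listed above change the shape of the result. The following sketch shows one plausible use of each, with made-up input strings:

```r
library(stringi)

# n caps the number of pieces; the last piece keeps the remainder
limited <- stri_split_fixed("a_b_c_d", "_", n = 2)
limited
# [[1]]
# [1] "a"     "b_c_d"

# omit_empty drops the empty string between the two commas
no_empty <- stri_split_fixed("a,b,,c", ",", omit_empty = TRUE)
no_empty
# [[1]]
# [1] "a" "b" "c"

# simplify = TRUE returns a character matrix, one row per input string
as_matrix <- stri_split_fixed(c("a_b", "c_d_e"), "_", simplify = TRUE)
dim(as_matrix)
# [1] 2 3
```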
stri_timezone_get
Gets the default time zone.
> stri_timezone_get()
[1] "Asia/Tokyo"
> # stri_timezone_set("Europe/Warsaw")
stri_timezone_info
Returns information about a time zone.
Arguments
- tz
- locale
- display_type
> stri_timezone_info()
$ID
[1] "Asia/Tokyo"
$Name
[1] "日本標準時"
$Name.Daylight
[1] NA
$Name.Windows
[1] "Tokyo Standard Time"
$RawOffset
[1] 9
$UsesDaylightTime
[1] FALSE
> sapply(c("short", "long", "generic_short", "generic_long",
+ "gmt_short", "gmt_long", "common", "generic_location"),
+ function(e) stri_timezone_info("Europe/London", display_type=e))
short long
ID "Europe/London" "Europe/London"
Name "GMT" "グリニッジ標準時"
Name.Daylight "GMT+1" "英国夏時間"
Name.Windows "GMT Standard Time" "GMT Standard Time"
RawOffset 0 0
UsesDaylightTime TRUE TRUE
generic_short generic_long
ID "Europe/London" "Europe/London"
Name "イギリス時間" "イギリス時間"
Name.Daylight "イギリス時間" "イギリス時間"
Name.Windows "GMT Standard Time" "GMT Standard Time"
RawOffset 0 0
UsesDaylightTime TRUE TRUE
gmt_short gmt_long
ID "Europe/London" "Europe/London"
Name "+0000" "GMT"
Name.Daylight "+0100" "GMT+01:00"
Name.Windows "GMT Standard Time" "GMT Standard Time"
RawOffset 0 0
UsesDaylightTime TRUE TRUE
common generic_location
ID "Europe/London" "Europe/London"
Name "GMT" "イギリス時間"
Name.Daylight "GMT+1" "イギリス時間"
Name.Windows "GMT Standard Time" "GMT Standard Time"
RawOffset 0 0
UsesDaylightTime TRUE TRUE
stri_timezone_list
Lists the available time zones.
Arguments
- region
- offset
> stri_timezone_list() %>% {
+ length(.) %>% print()
+ grep("Asia", ., value = TRUE) %>% head()
+ }
[1] 621
[1] "Asia/Aden" "Asia/Almaty" "Asia/Amman" "Asia/Anadyr" "Asia/Aqtau"
[6] "Asia/Aqtobe"
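The `region` and `offset` arguments narrow the list. As a hedged sketch, assuming the ICU time zone data in use associates `"Asia/Tokyo"` with region `"JP"` and a raw offset of +9 hours (consistent with the `stri_timezone_info()` output shown earlier):

```r
library(stringi)

# Time zone identifiers associated with Japan (region code "JP")
jp_zones <- stri_timezone_list(region = "JP")
in_region <- "Asia/Tokyo" %in% jp_zones

# Identifiers whose raw offset from GMT is +9 hours
plus9_zones <- stri_timezone_list(offset = 9)
in_offset <- "Asia/Tokyo" %in% plus9_zones
```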
stri_trans_char
Translates (replaces) individual characters in a character vector.
Arguments
- str
- pattern
- replacement
> stri_trans_char("id.123", ".", "_")
[1] "id_123"
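`pattern` and `replacement` may each hold several characters, which are paired up by position. A minimal sketch with an illustrative input:

```r
library(stringi)

# Each character in `pattern` maps to the character at the same position
# in `replacement`: "a" -> "0", "b" -> "1"
bits <- stri_trans_char("babaab", "ab", "01")
bits
# [1] "101001"
```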
stri_trans_general
General text transforms (half-width/full-width conversion, case conversion, hexadecimal character codes).
Arguments
- str
- id
> stri_trans_general("gro\u00df", "latin-ascii")
[1] "gross"
> stri_trans_general("tato nie wraca ranki wieczory", "pl-pl_FONIPA")
[1] "tatɔ ɲɛ vrat͡sa ranki vʲɛt͡ʂɔrɨ"
> stri_trans_general("キャンパス", "Katakana-Latin")
[1] "kyanpasu"
> stri_trans_general("东京", "Any-ch_FONIPA")
[1] "dōng jīng"
> stri_trans_general("东京", "Simplified-Traditional")
[1] "東京"
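The three kinds of transform named in the description can be sketched as follows. The transform identifiers used here (`"Fullwidth-Halfwidth"`, `"Any-Upper"`, `"Any-Hex/Unicode"`) are standard ICU IDs as far as I know, but they are not taken from the original text; `stri_trans_list()` shows what a given installation actually provides.

```r
library(stringi)

# Full-width "ABC" (U+FF21..U+FF23) to ordinary half-width letters
half <- stri_trans_general("\uff21\uff22\uff23", "Fullwidth-Halfwidth")
half
# [1] "ABC"

# Case conversion expressed as a transform
upper <- stri_trans_general("stringi", "Any-Upper")
upper
# [1] "STRINGI"

# Hexadecimal code point notation
hex <- stri_trans_general("a", "Any-Hex/Unicode")
hex
# [1] "U+0061"
```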
stri_trans_list
Lists the available text transforms and transliterators.
> stri_trans_list() %>% {
+ head(.) %>% print()
+ length(.)
+ }
[1] "ASCII-Latin" "Accents-Any" "Amharic-Latin/BGN"
[4] "Any-Accents" "Any-Publishing" "Arabic-Latin"
[1] 286
stri_unique
Returns the unique strings, with duplicate elements removed.
Arguments
- str
- ...
- opts_collator
> c("gro\u00df", "GROSS", "Gro\u00df", "Gross") %>% stri_unique(strength = 1)
[1] "groß"
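The collator `strength` controls which differences count as "the same string". A short sketch with made-up words, contrasting the default comparison with `strength = 2`, which ignores case differences:

```r
library(stringi)

words <- c("hello", "Hello", "HELLO", "world")  # illustrative input

# Default comparison: case differences make the strings distinct
exact <- stri_unique(words)
length(exact)
# [1] 4

# strength = 2 ignores case differences, so the first three collapse to one
loose <- stri_unique(words, strength = 2)
loose
# [1] "hello" "world"
```

In the example above, `strength = 1` goes one step further and also treats "ß" and "ss" as equal, which is why only one element survives.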
stri_width
> stri_width(LETTERS[1:5])
[1] 1 1 1 1 1
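Width and length diverge for East Asian wide characters, which typically occupy two display columns each. A minimal sketch, using Unicode escapes for three CJK characters so that the code does not depend on the source file encoding:

```r
library(stringi)

kanji <- "\u65e5\u672c\u8a9e"  # three CJK characters

n_points  <- stri_length(kanji)  # number of code points
n_columns <- stri_width(kanji)   # estimated display columns

n_points
# [1] 3
n_columns
# [1] 6
```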