stringi: Character String Processing Facilities

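A package for string manipulation built on the ICU library.

> library(stringi)
> library(magrittr)  # provides the %>% pipe used in some examples below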

Version: 1.1.2


Function name and summary:
%s+% Concatenate Two Character Vectors
%s<% Compare Strings with or without Collation
stri_compare Compare Strings with or without Collation
stri_count Count the Number of Pattern Matches
stri_count_boundaries Count the Number of Text Boundaries
stri_datetime_add [DRAFT API] Date and Time Arithmetic
stri_datetime_create [DRAFT API] Create a Date-Time Object
stri_datetime_fields [DRAFT API] Get Values for Date and Time Fields
stri_datetime_format [DRAFT API] Date and Time Formatting and Parsing
stri_datetime_fstr [DRAFT API] Convert 'strptime'-style Format Strings
stri_datetime_now [DRAFT API] Get Current Date and Time
stri_datetime_symbols [DRAFT API] List Localizable Date-Time Formatting Data
stri_detect Detect a Pattern Match
stri_dup Duplicate Strings
stri_duplicated Determine Duplicated Elements
stri_enc_detect [DRAFT API] Detect Character Set and Language
stri_enc_detect2 [DRAFT API] Detect Locale-Sensitive Character Encoding
stri_enc_fromutf32 Convert From UTF-32
stri_enc_info Query a Character Encoding
stri_enc_isascii Check If a Data Stream Is Possibly in ASCII
stri_enc_isutf16be Check If a Data Stream Is Possibly in UTF16 or UTF32
stri_enc_isutf8 Check If a Data Stream Is Possibly in UTF-8
stri_enc_list List Known Character Encodings
stri_enc_mark Get Declared Encodings of Each String
stri_enc_set Set or Get Default Character Encoding in 'stringi'
stri_enc_toascii Convert To ASCII
stri_enc_tonative Convert Strings To Native Encoding
stri_enc_toutf32 Convert Strings To UTF-32
stri_enc_toutf8 Convert Strings To UTF-8
stri_encode Convert Strings Between Given Encodings
stri_escape_unicode Escape Unicode Code Points
stri_extract_all Extract Occurrences of a Pattern
stri_extract_all_boundaries Extract Text Between Text Boundaries
stri_flatten Flatten a String
stri_info Query Default Settings for 'stringi'
stri_install_check Installation-Related Utilities [DEPRECATED]
stri_isempty Determine if a String is of Length Zero
stri_join Concatenate Character Vectors
stri_length Count the Number of Code Points
stri_list2matrix Convert a List to a Character Matrix
stri_locale_info Query Given Locale
stri_locale_list List Available Locales
stri_locale_set Set or Get Default Locale in 'stringi'
stri_locate_all Locate Occurrences of a Pattern
stri_locate_all_boundaries Locate Specific Text Boundaries
stri_match_all Extract Regex Pattern Matches, Together with Capture Groups
stri_numbytes Count the Number of Bytes
stri_opts_brkiter Generate a List with BreakIterator Settings
stri_opts_collator Generate a List with Collator Settings
stri_opts_fixed Generate a List with Fixed Pattern Search Engine's Settings
stri_opts_regex Generate a List with Regex Matcher Settings
stri_order Ordering Permutation and Sorting
stri_pad_both Pad (Center/Left/Right Align) a String
stri_rand_lipsum A Lorem Ipsum Generator
stri_rand_shuffle Randomly Shuffle Code Points in Each String
stri_rand_strings Generate Random Strings
stri_read_lines [DRAFT API] Read Text Lines from a Text File
stri_read_raw [DRAFT API] Read Whole Text File as Raw
stri_replace_all Replace Occurrences of a Pattern
stri_replace_na Replace Missing Values in a Character Vector
stri_reverse Reverse Each String
stri_split Split a String By Pattern Matches
stri_split_boundaries Split a String at Specific Text Boundaries
stri_split_lines Split a String Into Text Lines
stri_startswith Determine if the Start or End of a String Matches a Pattern
stri_stats_general General Statistics for a Character Vector
stri_stats_latex Statistics for a Character Vector Containing LaTeX Commands
stri_sub Extract a Substring From or Replace a Substring In a Character Vector
stri_subset Select Elements that Match a Given Pattern
stri_timezone_get [DRAFT API] Set or Get Default Time Zone in 'stringi'
stri_timezone_info [DRAFT API] Query a Given Time Zone
stri_timezone_list [DRAFT API] List Available Time Zone Identifiers
stri_trans_char Translate Characters
stri_trans_general General Text Transforms, Including Transliteration
stri_trans_list List Available Text Transforms and Transliterators
stri_trans_nfc Perform or Check For Unicode Normalization
stri_trans_tolower Transform String with Case Mapping
stri_trim_both Trim Characters from the Left and/or Right Side of a String
stri_unescape_unicode Unescape All Escape Sequences
stri_unique Extract Unique Elements
stri_width Determine the Width of Code Points
stri_wrap Word Wrap Text to Format Paragraphs
stri_write_lines [DRAFT API] Write Text Lines to a Text File
stringi-arguments Passing Arguments to Functions in 'stringi'
stringi-encoding Character Encodings and 'stringi'
stringi-locale Locales and 'stringi'
stringi-package THE String Processing Package
stringi-search String Searching
stringi-search-boundaries Text Boundary Analysis in 'stringi'
stringi-search-charclass Character Classes in 'stringi'
stringi-search-coll Locale-Sensitive Text Searching in 'stringi'
stringi-search-fixed Locale-Insensitive Fixed Pattern Matching in 'stringi'
stringi-search-regex Regular Expressions in 'stringi'

stri_count_boundaries / stri_count_words

Arguments

  • str
  • ...
  • opts_brkiter
  • locale

> text <- "こんにちは 今日は 天気が 良い です ね。明日の天気はどうでしょうか"
> stri_count_boundaries(text, type = "word")
[1] 22
> stri_count_boundaries(text, type = "sentence")
[1] 2
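
As a rough sketch of how the word-level count relates to stri_count_words: the stringi documentation describes stri_count_words(str) as equivalent to stri_count_boundaries(str, type = "word", skip_word_none = TRUE), so segments made only of whitespace or punctuation are not counted. The English sentence below is an illustrative input, not taken from the notes above.

```r
library(stringi)

# Illustrative input (an assumption, not taken from the notes above)
x <- "The quick brown fox."

# Counts only the segments that look like words
n_words <- stri_count_words(x)

# The same count expressed through the general boundary counter;
# skip_word_none = TRUE is forwarded to stri_opts_brkiter()
n_bound <- stri_count_boundaries(x, type = "word", skip_word_none = TRUE)

print(c(words = n_words, boundaries = n_bound))
```

Without skip_word_none = TRUE the spaces and the final period are counted as segments too, which is why the plain type = "word" count above is larger than the number of words.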

stri_datetime_format

Formats a date and time for display.

> stri_datetime_format(stri_datetime_now(), "datetime_relative_medium")
[1] "今日 2:22:31"
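
The relative style above depends on the current time and the default locale, so its output is not reproducible. A minimal sketch with a fixed timestamp and an explicit ICU pattern follows. The date values are chosen for illustration only, and these functions are marked DRAFT API, so their signatures may differ in other stringi versions.

```r
library(stringi)

# A fixed point in time (values chosen for illustration)
tm <- stri_datetime_create(2016, 1, 31, 14, 5, 9, tz = "UTC")

# ICU pattern letters: uuuu = year, MM = month, dd = day,
# HH = hour (0-23), mm = minute, ss = second
out <- stri_datetime_format(tm, "uuuu-MM-dd HH:mm:ss", tz = "UTC")
print(out)
```

Passing tz = "UTC" to both calls keeps the result independent of the machine's default time zone.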

stri_detect / stri_detect_fixed / stri_detect_charclass / stri_detect_coll / stri_detect_regex

The return value is a logical vector.

Arguments

  • str
  • ...

> stri_detect_fixed(c("stringi R", "REXAMINE", "123"), c('i', 'R', '0'))
[1]  TRUE  TRUE FALSE
> stri_detect_fixed(c("stringi R", "REXAMINE", "123"), 'R')
[1]  TRUE  TRUE FALSE
> stri_detect_charclass(c("stRRRingi","REXAMINE", "123"), c("\\p{Ll}", "\\p{Lu}", "\\p{Zs}"))
[1]  TRUE  TRUE FALSE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), 'R.')
[1] FALSE  TRUE FALSE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), '[[:alpha:]]*?')
[1] TRUE TRUE TRUE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), '[a-zC1]')
[1]  TRUE FALSE  TRUE
> stri_detect_regex(c("stringi R", "REXAMINE", "123"), '( R|RE)')
[1]  TRUE  TRUE FALSE
> stri_detect_regex("stringi", "STRING.", case_insensitive = TRUE)
[1] TRUE
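
Because the result is a logical vector, it can be used directly as an index. The sketch below filters the same sample vector. The pattern "^\\p{Lu}" (starts with an upper-case letter) is an illustrative choice, and stri_subset_regex is shown as what should be an equivalent shortcut.

```r
library(stringi)

x <- c("stringi R", "REXAMINE", "123")

# Logical index: TRUE where the string starts with an upper-case letter
hit <- stri_detect_regex(x, "^\\p{Lu}")

# Two ways to keep only the matching elements
kept_by_index  <- x[hit]
kept_by_subset <- stri_subset_regex(x, "^\\p{Lu}")

print(kept_by_index)
```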

stri_enc_list

Lists the character encodings supported by ICU.

> stri_enc_list(simplify = TRUE) %>% length()
[1] 1201
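
The count above is likely to vary with the ICU build, so a membership test is usually more robust than relying on the length. A small sketch, assuming only that "UTF-8" appears among the returned names:

```r
library(stringi)

# Flat character vector of all known encoding names and aliases
encs <- stri_enc_list(simplify = TRUE)

# Check for one specific encoding instead of relying on the total count
has_utf8 <- "UTF-8" %in% encs
print(has_utf8)
```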

stri_locate_all / stri_locate_first / stri_locate_last

> stri_locate_all('XaaaaX',
+    regex=c('\\p{Ll}', '\\p{Ll}+', '\\p{Ll}{2,3}', '\\p{Ll}{2,3}?'))
[[1]]
     start end
[1,]     2   2
[2,]     3   3
[3,]     4   4
[4,]     5   5

[[2]]
     start end
[1,]     2   5

[[3]]
     start end
[1,]     2   4

[[4]]
     start end
[1,]     2   3
[2,]     4   5
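
stri_locate_first and stri_locate_last return a single (start, end) row per input string rather than a list of matrices. The sketch below reuses the string from the example above and passes the resulting matrix to stri_sub to pull out the matched text. The variable names are illustrative.

```r
library(stringi)

s <- "XaaaaX"

# Greedy {2,3}: the first match covers positions 2 to 4
first <- stri_locate_first_regex(s, "\\p{Ll}{2,3}")

# The last single lower-case letter sits at position 5
last <- stri_locate_last_regex(s, "\\p{Ll}")

# stri_sub() accepts a two-column (start, end) matrix as `from`
piece <- stri_sub(s, first)
print(piece)
```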

stri_locale_list

Lists the available locales.

> stri_locale_list() %>% {
+   length(.) %>% print()
+   head(.)
+ }
[1] 683
[1] "af"     "af_NA"  "af_ZA"  "agq"    "agq_CM" "ak"

stri_locale_info

> stri_locale_info(locale = "Pl_pL")
$Language
[1] "pl"

$Country
[1] "PL"

$Variant
[1] ""

$Name
[1] "pl_PL"

stri_locale_set / stri_locale_get

Gets the current default locale and changes it.

> stri_locale_get()
[1] "ja_JP"
> stri_locale_set("ja_JP"); stri_locale_get()
You are now working with stringi_1.1.2 (ja_JP.UTF-8; ICU4C 55.1 [bundle]; Unicode 7.0)
[1] "ja_JP"
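
Changing the default affects every later locale-sensitive call. A narrower alternative, sketched below, is to pass locale = to a single call and leave the default alone. The Polish and Slovak words are borrowed from the stringi documentation for stri_sort.

```r
library(stringi)

words <- c("hladny", "chladny")

# Polish: "c" sorts before "h"
pl <- stri_sort(words, locale = "pl_PL")

# Slovak: "ch" is treated as a letter that sorts after "h"
sk <- stri_sort(words, locale = "sk_SK")

print(pl)
print(sk)
```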

stri_rand_lipsum

> stri_rand_lipsum(nparagraphs = 1, start_lipsum = TRUE)
[1] "Lorem ipsum dolor sit amet, velit est imperdiet ut. Donec dapibus aliquam convallis at neque nulla sit, dis aliquam risus sed faucibus malesuada. Blandit aliquam per auctor pellentesque, nisl nec bibendum magnis felis. Ipsum hac a nisi! Ac sem et nec, nulla massa. Scelerisque nec molestie aenean. Finibus non egestas phasellus tortor ligula vitae in a sollicitudin mattis vulputate, nec eu, sociis. Mi, quam nec massa nunc a commodo nulla mattis et euismod enim."

stri_rand_strings

> stri_rand_strings(n = 10, length = 5)
 [1] "KCZtW" "afWV1" "2vlCe" "eOoXs" "YuaQw" "hP1ZU" "soqMs" "9Hfya"
 [9] "W3y75" "sVKt1"
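
The character set can be restricted with the pattern argument, which takes a character class. A minimal sketch, assuming lower-case ASCII identifiers are wanted (the name ids is illustrative):

```r
library(stringi)

# Five random strings of length 8 drawn from lower-case ASCII letters
ids <- stri_rand_strings(n = 5, length = 8, pattern = "[a-z]")
print(ids)
```

The output differs on every run, so no expected values are shown.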

stri_split

Splits strings.

Arguments

  • str
  • ...
  • pattern, regex, fixed, coll, charclass
  • n
  • omit_empty
  • tokens_only
  • simplify
  • opts_collator, opts_fixed, opts_regex

> stri_split_fixed("a_b_c_d", "_")
[[1]]
[1] "a" "b" "c" "d"
> stri_split_charclass("Lorem ipsum dolor sit amet", "\\p{WHITE_SPACE}")
[[1]]
[1] "Lorem" "ipsum" "dolor" "sit"   "amet"
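
The n, omit_empty, and simplify arguments listed above change the shape of the result. The following sketch uses an illustrative input with a doubled separator to show each of them in turn.

```r
library(stringi)

x <- "a_b_c__d"

# n limits the number of pieces; the last piece holds the remainder
two <- stri_split_fixed(x, "_", n = 2)[[1]]

# omit_empty drops the empty string produced by the doubled separator
no_empty <- stri_split_fixed(x, "_", omit_empty = TRUE)[[1]]

# simplify = TRUE returns a character matrix instead of a list
m <- stri_split_fixed(c("a_b", "c_d_e"), "_", simplify = TRUE)

print(two)
print(no_empty)
print(m)
```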

stri_timezone_get

Gets the default time zone.

> stri_timezone_get()
[1] "Asia/Tokyo"
> # stri_timezone_set("Europe/Warsaw")

stri_timezone_info

Information about a time zone.

Arguments

  • tz
  • locale
  • display_type

> stri_timezone_info()
$ID
[1] "Asia/Tokyo"

$Name
[1] "日本標準時"

$Name.Daylight
[1] NA

$Name.Windows
[1] "Tokyo Standard Time"

$RawOffset
[1] 9

$UsesDaylightTime
[1] FALSE
> sapply(c("short", "long", "generic_short", "generic_long",
+          "gmt_short", "gmt_long", "common", "generic_location"),
+   function(e) stri_timezone_info("Europe/London", display_type=e))
                 short               long               
ID               "Europe/London"     "Europe/London"    
Name             "GMT"               "グリニッジ標準時" 
Name.Daylight    "GMT+1"             "英国夏時間"       
Name.Windows     "GMT Standard Time" "GMT Standard Time"
RawOffset        0                   0                  
UsesDaylightTime TRUE                TRUE               
                 generic_short       generic_long       
ID               "Europe/London"     "Europe/London"    
Name             "イギリス時間"      "イギリス時間"     
Name.Daylight    "イギリス時間"      "イギリス時間"     
Name.Windows     "GMT Standard Time" "GMT Standard Time"
RawOffset        0                   0                  
UsesDaylightTime TRUE                TRUE               
                 gmt_short           gmt_long           
ID               "Europe/London"     "Europe/London"    
Name             "+0000"             "GMT"              
Name.Daylight    "+0100"             "GMT+01:00"        
Name.Windows     "GMT Standard Time" "GMT Standard Time"
RawOffset        0                   0                  
UsesDaylightTime TRUE                TRUE               
                 common              generic_location   
ID               "Europe/London"     "Europe/London"    
Name             "GMT"               "イギリス時間"     
Name.Daylight    "GMT+1"             "イギリス時間"     
Name.Windows     "GMT Standard Time" "GMT Standard Time"
RawOffset        0                   0                  
UsesDaylightTime TRUE                TRUE

stri_timezone_list

Lists the time zones.

Arguments

  • region
  • offset

> stri_timezone_list() %>% {
+   length(.) %>% print()
+   grep("Asia", ., value = TRUE) %>% head()
+ }
[1] 621
[1] "Asia/Aden"   "Asia/Almaty" "Asia/Amman"  "Asia/Anadyr" "Asia/Aqtau" 
[6] "Asia/Aqtobe"
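
The region and offset arguments narrow the list without a separate grep step. A sketch, assuming region takes a two-letter country code and offset is a raw offset from GMT in hours, as the stringi documentation describes. This is DRAFT API and the exact contents depend on the ICU time zone data.

```r
library(stringi)

# Time zones associated with Japan
jp <- stri_timezone_list(region = "JP")

# Time zones whose raw offset from GMT is +9 hours
plus9 <- stri_timezone_list(offset = 9)

print(jp)
print("Asia/Tokyo" %in% plus9)
```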

stri_trans_char

Replaces characters in a character vector, one for one.

Arguments

  • str
  • pattern
  • replacement

> stri_trans_char("id.123", ".", "_")
[1] "id_123"
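
pattern and replacement may hold several characters, in which case the mapping is positional. The example below follows one given in the stringi documentation:

```r
library(stringi)

# Each character of `pattern` maps to the character at the same
# position in `replacement`: a -> 0, b -> 1
bits <- stri_trans_char("babaab", "ab", "01")
print(bits)
```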

stri_trans_general

General text transforms (half-width/full-width conversion, upper/lower case conversion, hexadecimal character codes).

Arguments

  • str
  • id

> stri_trans_general("gro\u00df", "latin-ascii")
[1] "gross"
> stri_trans_general("tato nie wraca ranki wieczory", "pl-pl_FONIPA")
[1] "tatɔ ɲɛ vrat͡sa ranki vʲɛt͡ʂɔrɨ"
> stri_trans_general("キャンパス", "Katakana-Latin")
[1] "kyanpasu"
> stri_trans_general("东京", "Any-ch_FONIPA")
[1] "dōng jīng"
> stri_trans_general("东京", "Simplified-Traditional")
[1] "東京"
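
The three kinds of transform named in the description (width, case, hexadecimal codes) can be sketched as follows. The transform IDs used here are standard ICU identifiers as far as I know, but it is worth confirming them against stri_trans_list() on your installation. The inputs are illustrative.

```r
library(stringi)

# Full-width Latin letters and digits to their half-width forms
half <- stri_trans_general("\uff21\uff22\uff23\uff11\uff12\uff13",
                           "Fullwidth-Halfwidth")

# Case mapping through the transliterator interface
up <- stri_trans_general("stringi", "upper")

# Code points rendered as hexadecimal escapes
hex <- stri_trans_general("a", "Any-Hex")

print(c(half, up, hex))
```

Several IDs can also be chained with a semicolon, for example "nfd; lower" in the stringi documentation.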

stri_trans_list

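Lists the text transforms available for normalization and transliteration (the IDs accepted by stri_trans_general).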

> stri_trans_list() %>% {
+   head(.) %>% print()
+   length(.)
+ }
[1] "ASCII-Latin"       "Accents-Any"       "Amharic-Latin/BGN"
[4] "Any-Accents"       "Any-Publishing"    "Arabic-Latin"
[1] 286

stri_unique

Returns the unique strings, with duplicate elements removed.

Arguments

  • str
  • ...
  • opts_collator

> c("gro\u00df", "GROSS", "Gro\u00df", "Gross") %>% stri_unique(strength = 1)
[1] "groß"
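
What counts as a duplicate depends on the collator strength. The sketch below contrasts the default with strength = 2, which ignores case differences. The three-element input is illustrative, and stri_duplicated is used to show which elements are dropped.

```r
library(stringi)

x <- c("a", "A", "a")

# Default collator strength: case differences matter
u_default <- stri_unique(x)

# strength = 2 ignores case differences, so "a" and "A" are duplicates
u_loose <- stri_unique(x, strength = 2)
dup     <- stri_duplicated(x, strength = 2)

print(u_default)
print(u_loose)
print(dup)
```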

stri_width

> stri_width(LETTERS[1:5])
[1] 1 1 1 1 1
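
Width and length differ for East Asian wide characters, which typically occupy two display columns each. A short sketch comparing the two on an illustrative three-character CJK string:

```r
library(stringi)

s <- "\u65e5\u672c\u8a9e"   # three CJK ideographs

n_points <- stri_length(s)  # number of code points
n_cols   <- stri_width(s)   # estimated number of display columns

print(c(code_points = n_points, width = n_cols))
```

This matters when padding or wrapping mixed-script text to a fixed column count.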